Completed
Push — master ( 0e9a84...49fafc )
by Dafne van
06:16
created

fetch_data()   B

Complexity
  Conditions: 4

Size
  Total Lines: 28

Duplication
  Lines: 0
  Ratio: 0 %

Code Coverage
  Tests: 1
  CRAP Score: 17.0071

Importance
  Changes: 3
  Bugs: 0
  Features: 0

Metric Value
cc 4
c 3
b 0
f 0
dl 0
loc 28
ccs 1
cts 15
cp 0.0667
crap 17.0071
rs 8.5806
"""
 Summary:
 Function fetch_and_preprocess from tutorial_pamap2.py helps to fetch and
 preprocess the data.
 Example function calls in 'Tutorial mcfly on PAMAP2.ipynb'
"""
import numpy as np
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt
from os import listdir
import os.path
import zipfile
import keras
from keras.utils.np_utils import to_categorical
import sys
import six.moves.urllib as urllib


def split_activities(labels, X, borders=10*100):
    """
    Splits up the data per activity and excludes activity=0.
    Also removes `borders` samples from the start and end of each activity.
    Returns two lists: the data segments and their activity labels.
    """
    tot_len = len(labels)
    # A segment starts at index 0 and wherever the label changes
    startpoints = np.where([1] + [labels[i] != labels[i-1]
                                  for i in range(1, tot_len)])[0]
    endpoints = np.append(startpoints[1:]-1, tot_len-1)
    acts = [labels[s] for s in startpoints]
    # Also split up the data, and only keep the non-zero activities.
    # Segments that are too short to survive the border removal are
    # skipped explicitly: without this check a negative stop index would
    # wrap around and select samples from other activities.
    xysplit = [(X[s+borders:e-borders+1, :], a)
               for s, e, a in zip(startpoints, endpoints, acts)
               if a != 0 and e-borders+1 > s+borders]
    Xlist = [xseg for xseg, _ in xysplit]
    ylist = [a for _, a in xysplit]
    return Xlist, ylist
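
The boundary logic of split_activities can be checked on a toy label sequence. The following is a minimal sketch in plain Python, not the module's implementation: `find_segments` and `trim_segments` are hypothetical helper names, and plain lists stand in for the NumPy arrays.

```python
def find_segments(labels):
    """Return (start, end, label) for each run of equal labels (end inclusive)."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the data or on a label change
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i - 1, labels[start]))
            start = i
    return segments


def trim_segments(segments, border):
    """Drop label 0 and cut `border` samples off both ends of each run.

    Returns (start, stop, label) with an exclusive stop; runs that are
    too short to survive the trimming are left out.
    """
    trimmed = []
    for start, end, label in segments:
        new_start, new_stop = start + border, end - border + 1
        if label != 0 and new_stop > new_start:
            trimmed.append((new_start, new_stop, label))
    return trimmed


labels = [0, 0, 1, 1, 1, 0, 2, 2]
segments = find_segments(labels)
print(segments)
print(trim_segments(segments, border=1))
```

With a border of 1, the three-sample run of label 1 keeps only its middle sample, and the two-sample run of label 2 is dropped entirely.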

def sliding_window(frame_length, step, Xsamples,
                   ysamples, Xsampleslist, ysampleslist):
    """
    Splits the time series in Xsampleslist into segments by applying
    a sliding overlapping window of size frame_length with steps of
    size step. Every segment gets the matching label from ysampleslist.
    It does this for all the samples and appends all the output to
    Xsamples and ysamples, so the participant distinction is not kept.
    """
    for X, ybinary in zip(Xsampleslist, ysampleslist):
        # Note: the stop value of range is exclusive, so a series of
        # exactly frame_length time steps yields no window
        for i in range(0, X.shape[0]-frame_length, step):
            Xsamples.append(X[i:i+frame_length, :])
            ysamples.append(ybinary)
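
To see how many frames the loop in sliding_window produces, it helps to list the start indices on their own. This is a small sketch under the assumption that only the `range` bounds matter; `window_starts` is a hypothetical helper, and the values 512 and 100 are the frame length and step that slidingwindow_store uses for the 100 Hz data.

```python
def window_starts(n_samples, frame_length, step):
    """Start indices of the frames, using the same range() bounds as sliding_window."""
    return list(range(0, n_samples - frame_length, step))


# 10 s of 100 Hz data, 512-sample frames, 1 s between frame starts
starts = window_starts(1000, 512, 100)
print(starts)                        # [0, 100, 200, 300, 400]
print(window_starts(512, 512, 100))  # [] because the stop value is exclusive
```

Consecutive frames overlap by 412 samples, and a series that is exactly one frame long contributes no frame at all.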

def transform_y(y, mapclasses, nr_classes):
    """
    Transforms y, a list with one sequence of A timesteps
    and B unique classes into a binary Numpy matrix of
    shape (A, B)
    """
    ymapped = np.array([mapclasses[c] for c in y], dtype='int')
    ybinary = to_categorical(ymapped, nr_classes)
    return ybinary
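
The encoding that transform_y produces can be illustrated without Keras. The sketch below is a plain-Python stand-in, not the `to_categorical` implementation; `one_hot` is a hypothetical helper, and the activity IDs and mapping are made up for illustration.

```python
def one_hot(labels, mapclasses, nr_classes):
    """Map each label to its class index and return one binary row per label."""
    rows = []
    for label in labels:
        row = [0] * nr_classes
        row[mapclasses[label]] = 1
        rows.append(row)
    return rows


# Hypothetical activity IDs and mapping, chosen for illustration only
mapclasses = {1: 0, 5: 1, 24: 2}
encoded = one_hot([5, 1, 24, 5], mapclasses, nr_classes=3)
print(encoded)
```

Each row has a single 1 in the column given by the mapping, so four labels and three classes yield a 4 x 3 matrix.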

def addheader(datasets):
    """
    The columns of the pandas data frames are numbers;
    this function adds the column labels.
    """
    axes = ['x', 'y', 'z']
    IMUsensor_columns = ['temperature'] + \
                    ['acc_16g_' + i for i in axes] + \
                    ['acc_6g_' + i for i in axes] + \
                    ['gyroscope_' + i for i in axes] + \
                    ['magnometer_' + i for i in axes] + \
                    ['orientation_' + str(i) for i in range(4)]
    header = ["timestamp", "activityID", "heartrate"] + \
        ["hand_"+s for s in IMUsensor_columns] + \
        ["chest_"+s for s in IMUsensor_columns] + \
        ["ankle_"+s for s in IMUsensor_columns]
    for dataset in datasets:
        dataset.columns = header
    return datasets

def numpify_and_store(X, y, xname, yname, outdatapath, shuffle=False):
    """
    Converts python lists X (3D: samples, time steps, variables) and
    y (2D: samples, classes) into numpy arrays and stores them
    in directory outdatapath.
    shuffle is optional and shuffles the samples
    """
    X = np.array(X)
    y = np.array(y)
    # Shuffle the set with a fixed seed, so the order is reproducible
    if shuffle is True:
        np.random.seed(123)
        neworder = np.random.permutation(X.shape[0])
        X = X[neworder, :, :]
        y = y[neworder, :]
    # Save binary file
    np.save(outdatapath+ xname, X)
    np.save(outdatapath+ yname, y)
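
The shuffle step in numpify_and_store applies one permutation to both arrays, so every window stays paired with its label. A minimal sketch of that idea, assuming the standard library `random` module in place of `np.random` (the resulting order therefore differs from the one NumPy gives); `shuffle_together` is a hypothetical helper.

```python
import random


def shuffle_together(x_items, y_items, seed=123):
    """Reorder both lists with one shared, seeded permutation."""
    order = list(range(len(x_items)))
    random.Random(seed).shuffle(order)
    return [x_items[i] for i in order], [y_items[i] for i in order]


x_items = ['w0', 'w1', 'w2', 'w3']
y_items = [0, 1, 2, 3]
x_shuffled, y_shuffled = shuffle_together(x_items, y_items)
# Each window is still paired with its own label after shuffling
print(list(zip(x_shuffled, y_shuffled)))
```

Because the seed is fixed, calling the function again with the same inputs gives the same order.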


def fetch_data(directory_to_extract_to):
    """
    Fetch the data and extract the contents of the zip file
    to the directory_to_extract_to.
    First check whether this was done before; if yes, then skip.
    Returns the directory that holds the extracted data.
    """
    targetdir = directory_to_extract_to + '/PAMAP2'
    if os.path.exists(targetdir):
        print('Data previously downloaded and stored in ' + targetdir)
    else:
        os.makedirs(targetdir) # create target directory
        # Download the PAMAP2 data, this is 688 MB
        path_to_zip_file = directory_to_extract_to + '/PAMAP2_Dataset.zip'
        if not os.path.isfile(path_to_zip_file):
            url = ('https://archive.ics.uci.edu/ml/' +
                   'machine-learning-databases/00231/PAMAP2_Dataset.zip')
            # retrieve data from url
            urllib.request.urlretrieve(url, filename=path_to_zip_file)
            print('Download complete and stored in: ' + path_to_zip_file)
        else:
            print('The data was previously downloaded and stored in ' +
                  path_to_zip_file)
        # unzip
        with zipfile.ZipFile(path_to_zip_file, "r") as zip_ref:
            zip_ref.extractall(targetdir)
    return targetdir


def slidingwindow_store(y_list, x_list, X_name, y_name, outdatapath, shuffle):
    # Take sliding-window frames. Target is label of last time step
    # Data is 100 Hz, so a frame covers 5.12 s (512 samples) and
    # consecutive frames start 1 s (100 samples) apart
    frame_length = int(5.12 * 100)
    step = 1 * 100
    x_set = []
    y_set = []
    sliding_window(frame_length, step, x_set, y_set, x_list, y_list)
    numpify_and_store(x_set, y_set, X_name, y_name,
                      outdatapath, shuffle)

def map_clas(datasets_filled):
    """
    Collects the activity labels in the datasets (except 0) and maps
    them to consecutive class indices.
    """
    ysetall = [set(np.array(data.activityID)) - set([0])
               for data in datasets_filled]
    # sorted() makes the label-to-index mapping deterministic
    classlabels = sorted(set.union(*ysetall))
    nr_classes = len(classlabels)
    mapclasses = {classlabels[i] : i for i in range(len(classlabels))}
    return classlabels, nr_classes, mapclasses

def split_data(Xlists, ybinarylists, indices):
    """ Function takes subset from list given indices
    Arguments:
    - Xlists: tuple (samples) of lists (windows)
            of numpy-arrays (time, variable)
    - ybinarylists: list (samples) of numpy-arrays (window, class)
    - indices: indices of the slice of data (samples) to be taken
    Value (output):
    - x_setlist: list (windows across samples) of numpy-arrays (time, variable)
    - y_setlist: list (windows across samples) of numpy-arrays (class, )
    """
    # isinstance works the same way in python2 and python3
    if isinstance(indices, slice):
        # A slice selects several samples: flatten their windows
        x_setlist = [X for Xlist in Xlists[indices] for X in Xlist]
        y_setlist = [y for ylist in ybinarylists[indices] for y in ylist]
    else:
        # A single index selects the windows of one sample
        x_setlist = list(Xlists[indices])
        y_setlist = list(ybinarylists[indices])
    return x_setlist, y_setlist
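
The distinction split_data makes between a slice and a single index can be shown on small lists. This is a minimal sketch, assuming plain strings in place of the NumPy arrays; `take_samples` is a hypothetical helper that uses `isinstance` as one way to detect a slice.

```python
def take_samples(per_sample_windows, indices):
    """Select samples by slice (flattened) or by a single integer index."""
    if isinstance(indices, slice):
        return [w for windows in per_sample_windows[indices] for w in windows]
    return list(per_sample_windows[indices])


# Hypothetical data: three samples with two, one and two windows
per_sample = (['a0', 'a1'], ['b0'], ['c0', 'c1'])
print(take_samples(per_sample, slice(0, 2)))  # ['a0', 'a1', 'b0']
print(take_samples(per_sample, 2))            # ['c0', 'c1']
```

In both cases the result is a flat list of windows, which is the shape the sliding-window step expects.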

def preprocess(targetdir, outdatapath, columns_to_use):
    """ Function to preprocess the PAMAP2 data after it is fetched
    Arguments:
    - targetdir: subdirectory of directory_to_extract_to, targetdir
        is defined by function fetch_data
    - outdatapath: a subdirectory of directory_to_extract_to, outdatapath
        is the directory where the Numpy output will be stored.
    - columns_to_use: the column names to use as input variables
    Value (output):
    - None
    """
    datadir = targetdir + '/PAMAP2_Dataset/Protocol'
    # Sort the file names so that the train/val/test split below does
    # not depend on the order in which listdir returns them
    filenames = sorted(listdir(datadir))
    print('Start pre-processing all ' + str(len(filenames)) + ' files...')
    # load the files and put them in a list of pandas dataframes:
    datasets = [pd.read_csv(datadir+'/'+fn, header=None, sep=' ')
                for fn in filenames]
    datasets = addheader(datasets) # add headers to the datasets
    # Interpolate dataset to get same sample rate between channels
    datasets_filled = [d.interpolate() for d in datasets]
    # Create mapping for class labels
    classlabels, nr_classes, mapclasses = map_clas(datasets_filled)
    # Create input (x) and output (y) sets
    xall = [np.array(data[columns_to_use]) for data in datasets_filled]
    yall = [np.array(data.activityID) for data in datasets_filled]
    xylists = [split_activities(y, x) for x, y in zip(xall, yall)]
    Xlists, ylists = zip(*xylists)
    ybinarylists = [transform_y(y, mapclasses, nr_classes) for y in ylists]
    # Split in train, test and val
    x_vallist, y_vallist = split_data(Xlists, ybinarylists, indices=6)
    test_range = slice(7, len(datasets_filled))
    x_testlist, y_testlist = split_data(Xlists, ybinarylists, test_range)
    x_trainlist, y_trainlist = split_data(Xlists, ybinarylists,
                                          indices=slice(0, 6))
    # Take sliding-window frames, target is label of last time step,
    # and store as numpy file
    slidingwindow_store(y_list=y_trainlist, x_list=x_trainlist,
                        X_name='X_train', y_name='y_train',
                        outdatapath=outdatapath, shuffle=True)
    slidingwindow_store(y_list=y_vallist, x_list=x_vallist,
                        X_name='X_val', y_name='y_val',
                        outdatapath=outdatapath, shuffle=False)
    slidingwindow_store(y_list=y_testlist, x_list=x_testlist,
                        X_name='X_test', y_name='y_test',
                        outdatapath=outdatapath, shuffle=False)
    print('Processed data successfully stored in ' + outdatapath)
    return None

def fetch_and_preprocess(directory_to_extract_to, columns_to_use=None):
    """
    High level function to fetch and preprocess the PAMAP2 dataset
    Arguments:
    - directory_to_extract_to: the directory where the data will be stored
    - columns_to_use: the columns to use
    Values (output):
    - outdatapath: The directory in which the numpy files are stored
    """
    if columns_to_use is None:
        columns_to_use = ['hand_acc_16g_x', 'hand_acc_16g_y', 'hand_acc_16g_z',
                     'ankle_acc_16g_x', 'ankle_acc_16g_y', 'ankle_acc_16g_z',
                     'chest_acc_16g_x', 'chest_acc_16g_y', 'chest_acc_16g_z']
    targetdir = fetch_data(directory_to_extract_to)
    outdatapath = targetdir + '/PAMAP2_Dataset/slidingwindow512cleaned/'
    if not os.path.exists(outdatapath):
        os.makedirs(outdatapath)
    # preprocess stores the training data as 'X_train.npy' (capital X),
    # so check for that exact file name
    if os.path.isfile(outdatapath+'X_train.npy'):
        print('Data previously pre-processed and np-files saved to ' +
            outdatapath)
    else:
        preprocess(targetdir, outdatapath, columns_to_use)
    return outdatapath

def load_data(outputpath):
    """ Function to load the numpy data as stored in directory
    outputpath.
    """
    ext = '.npy'
    x_train = np.load(outputpath+'X_train'+ext)
    y_train_binary = np.load(outputpath+'y_train'+ext)
    x_val = np.load(outputpath+'X_val'+ext)
    y_val_binary = np.load(outputpath+'y_val'+ext)
    x_test = np.load(outputpath+'X_test'+ext)
    y_test_binary = np.load(outputpath+'y_test'+ext)
    return x_train, y_train_binary, x_val, y_val_binary, x_test, y_test_binary