Completed
Push to master (9041e5...e26168) by Dafne van, created 08:06

load_data()   B

Complexity
    Conditions      1
Size
    Total Lines     26
Duplication
    Lines           0
    Ratio           0 %
Code Coverage
    Tests           1
    CRAP Score      1.7023
Importance
    Changes         4
    Bugs            0
    Features        0

Metric  Value
cc      1
c       4
b       0
f       0
dl      0
loc     26
ccs     1
cts     9
cp      0.1111
crap    1.7023
rs      8.8571
"""
 Summary:
 Function fetch_and_preprocess from tutorial_pamap2.py helps to fetch and
 preprocess the data.
 Example function calls can be found in 'Tutorial mcfly on PAMAP2.ipynb'.
"""
import numpy as np
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt
from os import listdir
import os.path
import zipfile
import keras
from keras.utils.np_utils import to_categorical
import sys
import six.moves.urllib as urllib


def split_activities(labels, X, borders=10 * 100):
    """
    Splits up the data per activity and excludes activity=0.
    Also removes the borders of each activity.
    Returns lists with subdatasets.

    Parameters
    ----------
    labels : numpy array
        Activity labels
    X : numpy array
        Data points
    borders : int
        Nr of timesteps to remove from the borders of an activity

    Returns
    -------
    x_list : list
        One array of data points per retained activity segment
    y_list : list
        The activity label of each segment in x_list
    """
    tot_len = len(labels)
    # A segment starts at index 0 and wherever the label changes
    startpoints = np.where([1] + [labels[i] != labels[i - 1]
                                  for i in range(1, tot_len)])[0]
    endpoints = np.append(startpoints[1:] - 1, tot_len - 1)
    acts = [labels[s] for s in startpoints]
    # Also split up the data, and only keep the non-zero activities.
    # Segments of at most 2 * borders samples are skipped up front, so that
    # a negative stop index can never wrap around to the end of the array.
    xysplit = [(X[s + borders:e - borders + 1, :], a)
               for s, e, a in zip(startpoints, endpoints, acts)
               if a != 0 and e - s + 1 > 2 * borders]
    xysplit = [(x, y) for x, y in xysplit if len(x) > 0]
    x_list = [x for x, _ in xysplit]
    y_list = [y for _, y in xysplit]
    return x_list, y_list
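
The index arithmetic in split_activities can be traced on a toy label sequence. The following is a plain-Python sketch for illustration only, not part of the module: the helper name `activity_runs`, the toy labels and the small `borders=1` value are all assumptions chosen to keep the example readable.

```python
def activity_runs(labels, borders=1):
    """Return (start, stop, label) for each non-zero run after trimming.

    Hypothetical helper that mirrors the start/end-point arithmetic of
    split_activities on a plain list; stop is exclusive, as in a slice.
    """
    n = len(labels)
    # A run starts at index 0 and wherever the label changes.
    starts = [0] + [i for i in range(1, n) if labels[i] != labels[i - 1]]
    # Each run ends just before the next one starts; the last ends at n - 1.
    ends = [s - 1 for s in starts[1:]] + [n - 1]
    runs = []
    for s, e in zip(starts, ends):
        label = labels[s]
        lo, hi = s + borders, e - borders + 1  # bounds of the trimmed slice
        if label != 0 and hi > lo:
            runs.append((lo, hi, label))
    return runs


toy_labels = [0, 0, 1, 1, 1, 1, 0, 2, 2, 2, 3]
runs = activity_runs(toy_labels, borders=1)
print(runs)
```

In this toy input the run of label 3 has a single sample, so nothing is left after trimming one sample from each side and it is dropped, just like the runs of label 0.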


def sliding_window(frame_length, step, Xsamples,
                   ysamples, Xsampleslist, ysampleslist):
    """
    Splits the time series in ysampleslist and Xsampleslist
    into segments by applying a sliding overlapping window
    of size equal to frame_length with steps equal to step.
    It does this for all the samples and appends all the output together,
    so the participant distinction is not kept.
    The windows are appended in place to Xsamples and ysamples;
    nothing is returned.

    Parameters
    ----------
    frame_length : int
        Length of sliding window
    step : int
        Stepsize between windows
    Xsamples : list
        Existing list of window fragments; new windows are appended to it
    ysamples : list
        Existing list of window labels; new labels are appended to it
    Xsampleslist : list
        Samples to take sliding windows from
    ysampleslist : list
        Labels of the samples in Xsampleslist, one per sample

    """
    for X, ybinary in zip(Xsampleslist, ysampleslist):
        for i in range(0, X.shape[0] - frame_length, step):
            xsub = X[i:i + frame_length, :]
            ysub = ybinary
            Xsamples.append(xsub)
            ysamples.append(ysub)
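
How many windows the loop in sliding_window produces follows directly from its `range` call. The sketch below reproduces only that call; the helper name `window_starts` and the segment lengths are assumptions for illustration, while 512 and 100 are the values that slidingwindow_store passes in.

```python
def window_starts(n_samples, frame_length, step):
    """Start indices visited by the inner loop of sliding_window."""
    return list(range(0, n_samples - frame_length, step))


frame_length, step = 512, 100  # values used by slidingwindow_store

starts_1000 = window_starts(1000, frame_length, step)
starts_612 = window_starts(612, frame_length, step)
starts_512 = window_starts(512, frame_length, step)
print(starts_1000, starts_612, starts_512)
```

Because the stop value of `range` is exclusive, a segment of exactly `frame_length` samples yields no window, and a window starting at `n_samples - frame_length` is never emitted even though it would fit. Whether that is intended is not stated in the source.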


def transform_y(y, mapclasses, nr_classes):
    """
    Transforms y, a list with one sequence of A timesteps
    and B unique classes into a binary Numpy matrix of
    shape (A, B)

    Parameters
    ----------
    y : list or array
        List of classes
    mapclasses : dict
        dictionary that maps the classes to numbers
    nr_classes : int
        total number of classes
    """
    ymapped = np.array([mapclasses[c] for c in y], dtype='int')
    ybinary = to_categorical(ymapped, nr_classes)
    return ybinary
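
To make the shape (A, B) described in transform_y concrete, here is a plain-Python stand-in for the two steps (class lookup, then one-hot encoding). It does not call Keras; the helper name `one_hot` and the activity IDs in `mapclasses` are hypothetical.

```python
def one_hot(y, mapclasses, nr_classes):
    """Plain-Python stand-in for the lookup + to_categorical steps."""
    rows = []
    for label in y:
        row = [0.0] * nr_classes
        row[mapclasses[label]] = 1.0  # set the column of the mapped class
        rows.append(row)
    return rows


mapclasses = {1: 0, 4: 1, 17: 2}  # hypothetical activity IDs -> class index
encoded = one_hot([4, 1, 17, 4], mapclasses, nr_classes=3)
for row in encoded:
    print(row)
```

Four input labels and three classes give four rows of three columns, with exactly one 1.0 per row.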


def addheader(datasets):
    """
    The columns of the pandas data frame are numbers;
    this function adds the column labels.

    Parameters
    ----------
    datasets : list
        List of pandas dataframes

    Returns
    -------
    datasets : list
        The same list; the column labels are set in place
    """
    axes = ['x', 'y', 'z']
    imu_sensor_columns = ['temperature'] + \
        ['acc_16g_' + i for i in axes] + \
        ['acc_6g_' + i for i in axes] + \
        ['gyroscope_' + i for i in axes] + \
        ['magnometer_' + i for i in axes] + \
        ['orientation_' + str(i) for i in range(4)]
    header = ["timestamp", "activityID", "heartrate"] \
        + ["hand_" + s for s in imu_sensor_columns] \
        + ["chest_" + s for s in imu_sensor_columns] \
        + ["ankle_" + s for s in imu_sensor_columns]
    for dataset in datasets:
        dataset.columns = header
    return datasets
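
The header built in addheader has a fixed length, which is worth checking because assigning it to a DataFrame with a different number of columns fails. The sketch below rebuilds the same names with plain lists; the variable names `imu_columns` and `location` are assumptions, the column names are copied from the function.

```python
axes = ['x', 'y', 'z']
imu_columns = (['temperature']
               + ['acc_16g_' + a for a in axes]
               + ['acc_6g_' + a for a in axes]
               + ['gyroscope_' + a for a in axes]
               + ['magnometer_' + a for a in axes]
               + ['orientation_' + str(i) for i in range(4)])

header = ['timestamp', 'activityID', 'heartrate']
for location in ['hand', 'chest', 'ankle']:
    # Same order as in addheader: hand, then chest, then ankle
    header += [location + '_' + c for c in imu_columns]

print(len(imu_columns), len(header))
```

Each IMU contributes 17 columns, so the header holds 3 + 3 * 17 = 54 names, and every file read in preprocess has to provide exactly that many columns.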


def numpify_and_store(X, y, xname, yname, outdatapath, shuffle=False):
    """
    Converts the python lists X and y into numpy arrays
    (X becomes 3D, y becomes 2D) and stores the numpy arrays
    in directory outdatapath.
    shuffle is optional and shuffles the samples.

    Parameters
    ----------
    X : list
        list with data
    y : list
        list with data
    xname : str
        name to store the x arrays
    yname : str
        name to store the y arrays
    outdatapath : str
        path to the directory to store the data; the file names are
        appended to it directly, so it must end with a path separator
    shuffle : bool
        whether to shuffle the data before storing
    """
    X = np.array(X)
    y = np.array(y)
    # Shuffle the samples, with a fixed seed for reproducibility
    if shuffle:
        np.random.seed(123)
        neworder = np.random.permutation(X.shape[0])
        X = X[neworder, :, :]
        y = y[neworder, :]
    # Save binary files (np.save appends the '.npy' extension)
    np.save(outdatapath + xname, X)
    np.save(outdatapath + yname, y)
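
numpify_and_store builds its file paths by string concatenation, so the result depends on whether outdatapath ends with a separator. The sketch below contrasts that with `os.path.join`, which is named here as an alternative and is not what the module uses; the directory names are hypothetical.

```python
import os.path

with_sep = 'data/out/'     # hypothetical directory, trailing separator
without_sep = 'data/out'   # same directory, no trailing separator

concat_ok = with_sep + 'X_train'
concat_bad = without_sep + 'X_train'   # file name is glued to the directory
joined = os.path.join(without_sep, 'X_train')

print(concat_ok)
print(concat_bad)
print(joined)
```

fetch_and_preprocess always passes a path ending in '/', so the concatenation works there; a caller that passes a bare directory name would silently write to a sibling path instead.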


def fetch_data(directory_to_extract_to):
    """
    Fetch the data and extract the contents of the zip file
    to the directory_to_extract_to.
    First check whether this was done before; if yes, then skip.

    Parameters
    ----------
    directory_to_extract_to : str
        directory to create subfolder 'PAMAP2'

    Returns
    -------
    targetdir : str
        directory where the data is extracted
    """
    targetdir = directory_to_extract_to + '/PAMAP2'
    if os.path.exists(targetdir):
        print('Data previously downloaded and stored in ' + targetdir)
    else:
        os.makedirs(targetdir)  # create target directory
        # Download the PAMAP2 data, this is 688 Mb
        path_to_zip_file = directory_to_extract_to + '/PAMAP2_Dataset.zip'
        test_file_exist = os.path.isfile(path_to_zip_file)
        if test_file_exist is False:
            url = str('https://archive.ics.uci.edu/ml/' +
                      'machine-learning-databases/00231/PAMAP2_Dataset.zip')
            # retrieve data from url
            urllib.request.urlretrieve(url, filename=path_to_zip_file)
            print('Download complete and stored in: ' + path_to_zip_file)
        else:
            print('The data was previously downloaded and stored in ' +
                  path_to_zip_file)
        # unzip
        with zipfile.ZipFile(path_to_zip_file, "r") as zip_ref:
            zip_ref.extractall(targetdir)
    return targetdir


def slidingwindow_store(y_list, x_list, X_name, y_name, outdatapath, shuffle):
    """
    Take sliding-window frames. Target is label of last time step
    Data is 100 Hz

    Parameters
    ----------
    y_list : list
        list of arrays with classes
    x_list : list
        list of numpy arrays with data
    X_name : str
        Name for X file
    y_name : str
        Name for y file
    outdatapath : str
        directory to store the data
    shuffle : bool
        whether to shuffle the data
    """
    frame_length = int(5.12 * 100)  # 5.12 s at 100 Hz = 512 samples
    step = 1 * 100  # 1 s at 100 Hz = 100 samples
    x_set = []
    y_set = []
    sliding_window(frame_length, step, x_set, y_set, x_list, y_list)
    numpify_and_store(x_set, y_set, X_name, y_name,
                      outdatapath, shuffle)


def map_class(datasets_filled):
    """
    Maps the non-zero activity IDs found in datasets_filled to
    consecutive class indices.
    Returns classlabels, nr_classes and mapclasses (dict ID -> index).
    """
    ysetall = [set(np.array(data.activityID)) - set([0])
               for data in datasets_filled]
    classlabels = list(set.union(*[set(y) for y in ysetall]))
    nr_classes = len(classlabels)
    mapclasses = {classlabels[i]: i for i in range(len(classlabels))}
    return classlabels, nr_classes, mapclasses
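
The mapping built by map_class can be illustrated without pandas. The sketch below works on plain sets of labels; the helper name `build_class_map` and the activity IDs are hypothetical. Unlike the function above, it sorts the labels so that the class indices come out in a predictable order; that sorting is an assumption of this sketch, not the module's behaviour.

```python
def build_class_map(label_sets):
    """Map every non-zero label found in label_sets to a class index."""
    labels = set().union(*label_sets) - {0}   # activity 0 is excluded
    classlabels = sorted(labels)              # sorted for a stable order
    mapclasses = {label: i for i, label in enumerate(classlabels)}
    return classlabels, len(classlabels), mapclasses


# Hypothetical activity IDs seen in two recordings
per_file_labels = [{0, 1, 4}, {0, 4, 17}]
classlabels, nr_classes, mapclasses = build_class_map(per_file_labels)
print(classlabels, nr_classes, mapclasses)
```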


def split_data(Xlists, ybinarylists, indices):
    """ Function takes subset from list given indices

    Parameters
    ----------
    Xlists : tuple
        tuple (samples) of lists (windows) of numpy-arrays (time, variable)
    ybinarylists : list
        list (samples) of numpy-arrays (window, class)
    indices : slice or int
        indices of the slice of data (samples) to be taken

    Returns
    -------
    x_setlist : list
        list (windows across samples) of numpy-arrays (time, variable)
    y_setlist : list
        list (windows across samples) of numpy-arrays (class, )
    """
    # isinstance works the same way on python2 and python3
    if isinstance(indices, slice):
        x_setlist = [X for Xlist in Xlists[indices] for X in Xlist]
        y_setlist = [y for ylist in ybinarylists[indices] for y in ylist]
    else:
        x_setlist = list(Xlists[indices])
        y_setlist = list(ybinarylists[indices])
    return x_setlist, y_setlist
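
The two branches of split_data behave differently: a slice selects several samples and flattens their windows into one list, while a single integer selects the windows of one sample. A minimal sketch with strings standing in for the numpy arrays (the helper name `take` and the toy data are assumptions):

```python
def take(xlists, indices):
    """Select windows by slice (flattened across samples) or by one index."""
    if isinstance(indices, slice):
        return [x for xlist in xlists[indices] for x in xlist]
    return list(xlists[indices])


# Three samples with two, one and two windows respectively
samples = (['s0_w0', 's0_w1'], ['s1_w0'], ['s2_w0', 's2_w1'])

train = take(samples, slice(0, 2))
val = take(samples, 2)
print(train)
print(val)
```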


def preprocess(targetdir, outdatapath, columns_to_use):
    """ Function to preprocess the PAMAP2 data after it is fetched

    Parameters
    ----------
    targetdir : str
        subdirectory of directory_to_extract_to, targetdir
        is defined by function fetch_data
    outdatapath : str
        a subdirectory of directory_to_extract_to, outdatapath
        is the directory where the Numpy output will be stored.
    columns_to_use : list
        list of column names to use

    Returns
    -------
    None
    """
    datadir = targetdir + '/PAMAP2_Dataset/Protocol'
    # Sort the file names: listdir returns them in arbitrary order, and the
    # train/val/test split below depends on the position of each file
    filenames = sorted(listdir(datadir))
    print('Start pre-processing all ' + str(len(filenames)) + ' files...')
    # load the files and put them in a list of pandas dataframes:
    datasets = [pd.read_csv(datadir + '/' + fn, header=None, sep=' ')
                for fn in filenames]
    datasets = addheader(datasets)  # add headers to the datasets
    # Interpolate dataset to get same sample rate between channels
    datasets_filled = [d.interpolate() for d in datasets]
    # Create mapping for class labels
    classlabels, nr_classes, mapclasses = map_class(datasets_filled)
    # Create input (x) and output (y) sets
    xall = [np.array(data[columns_to_use]) for data in datasets_filled]
    yall = [np.array(data.activityID) for data in datasets_filled]
    xylists = [split_activities(y, x) for x, y in zip(xall, yall)]
    Xlists, ylists = zip(*xylists)
    ybinarylists = [transform_y(y, mapclasses, nr_classes) for y in ylists]
    # Split in train, test and val
    x_vallist, y_vallist = split_data(Xlists, ybinarylists, indices=6)
    test_range = slice(7, len(datasets_filled))
    x_testlist, y_testlist = split_data(Xlists, ybinarylists, test_range)
    x_trainlist, y_trainlist = split_data(Xlists, ybinarylists,
                                          indices=slice(0, 6))
    # Take sliding-window frames, target is label of last time step,
    # and store as numpy file
    slidingwindow_store(y_list=y_trainlist, x_list=x_trainlist,
                        X_name='X_train', y_name='y_train',
                        outdatapath=outdatapath, shuffle=True)
    slidingwindow_store(y_list=y_vallist, x_list=x_vallist,
                        X_name='X_val', y_name='y_val',
                        outdatapath=outdatapath, shuffle=False)
    slidingwindow_store(y_list=y_testlist, x_list=x_testlist,
                        X_name='X_test', y_name='y_test',
                        outdatapath=outdatapath, shuffle=False)
    print('Processed data successfully stored in ' + outdatapath)
    return None
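
The train, validation and test sets in preprocess are chosen purely by the position of each file in the list. The sketch below shows which positions end up where, assuming nine recordings; that file count is an assumption of this example and is not read from the source.

```python
n_files = 9  # assumption: number of recordings in the Protocol folder
file_positions = list(range(n_files))

train_positions = file_positions[slice(0, 6)]      # indices=slice(0, 6)
val_positions = [file_positions[6]]                # indices=6
test_positions = file_positions[slice(7, n_files)]

print(train_positions, val_positions, test_positions)
```

Since the split is positional, the order in which the files are listed decides which recording lands in which set.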


def fetch_and_preprocess(directory_to_extract_to, columns_to_use=None):
    """
    High level function to fetch_and_preprocess the PAMAP2 dataset

    Parameters
    ----------
    directory_to_extract_to : str
        the directory where the data will be stored
    columns_to_use : list
        the columns to use

    Returns
    -------
    outdatapath : str
        The directory in which the numpy files are stored
    """
    if columns_to_use is None:
        columns_to_use = [
            'hand_acc_16g_x', 'hand_acc_16g_y', 'hand_acc_16g_z',
            'ankle_acc_16g_x', 'ankle_acc_16g_y', 'ankle_acc_16g_z',
            'chest_acc_16g_x', 'chest_acc_16g_y', 'chest_acc_16g_z']
    targetdir = fetch_data(directory_to_extract_to)
    outdatapath = targetdir + '/PAMAP2_Dataset/slidingwindow512cleaned/'
    if not os.path.exists(outdatapath):
        os.makedirs(outdatapath)
    # preprocess stores the training data as 'X_train.npy' (capital X), so
    # check for that exact name; on a case-sensitive file system a check
    # for 'x_train.npy' would never find the file
    if os.path.isfile(outdatapath + 'X_train.npy'):
        print('Data previously pre-processed and np-files saved to ' +
              outdatapath)
    else:
        preprocess(targetdir, outdatapath, columns_to_use)
    return outdatapath


def load_data(outputpath):
    """ Function to load the numpy data as stored in directory
    outputpath.

    Parameters
    ----------
    outputpath : str
        directory where the numpy files are stored; the file names are
        appended to it directly, so it must end with a path separator

    Returns
    -------
    x_train
    y_train_binary
    x_val
    y_val_binary
    x_test
    y_test_binary
    """
    ext = '.npy'
    x_train = np.load(outputpath + 'X_train' + ext)
    y_train_binary = np.load(outputpath + 'y_train' + ext)
    x_val = np.load(outputpath + 'X_val' + ext)
    y_val_binary = np.load(outputpath + 'y_val' + ext)
    x_test = np.load(outputpath + 'X_test' + ext)
    y_test_binary = np.load(outputpath + 'y_test' + ext)
    return x_train, y_train_binary, x_val, y_val_binary, x_test, y_test_binary