Completed
Push — master ( 81cbcf...c2b817 )
by Christiaan
06:36
created

load_data()   A

Complexity

Conditions 1

Size

Total Lines 17

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 1
CRAP Score 1.7023

Importance

Changes 3
Bugs 0 Features 0
Metric Value
cc 1
c 3
b 0
f 0
dl 0
loc 17
ccs 1
cts 9
cp 0.1111
crap 1.7023
rs 9.4285
"""
 Summary:
 Function fetch_and_preprocess from tutorial_pamap2.py helps to fetch and
 preprocess the data.
 Example function calls in 'Tutorial mcfly on PAMAP2.ipynb'
"""
import numpy as np
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt
from os import listdir
import os.path
import zipfile
import keras
from keras.utils.np_utils import to_categorical
import sys
import six.moves.urllib as urllib


def split_activities(labels, X, borders=10 * 100):
Coding Style (Naming): The name X does not conform to the argument naming conventions ([a-z_][a-z0-9_]{1,30}$).

This check looks for invalid names for a range of different identifiers. You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements. If your project includes a Pylint configuration file, the settings contained in that file take precedence.
    """
    Splits up the data per activity and excludes activity=0.
    Also removes the borders of each activity.
    Returns lists with sub-datasets.

    Parameters
    ----------
    labels : numpy array
        Activity labels
    X : numpy array
        Data points
    borders : int
        Nr of timesteps to remove from the borders of an activity

    Returns
    -------
    X_list, y_list
    """
    tot_len = len(labels)
    startpoints = np.where([1] + [labels[i] != labels[i - 1]
                                  for i in range(1, tot_len)])[0]
    endpoints = np.append(startpoints[1:] - 1, tot_len - 1)
    acts = [labels[s] for s, e in zip(startpoints, endpoints)]
    # Also split up the data, and only keep the non-zero activities
    xysplit = [(X[s + borders:e - borders + 1, :], a)
               for s, e, a in zip(startpoints, endpoints, acts) if a != 0]
    xysplit = [(X, y) for X, y in xysplit if len(X) > 0]
    Xlist = [X for X, y in xysplit]
Coding Style (Naming): The name Xlist does not conform to the variable naming conventions ([a-z_][a-z0-9_]{1,30}$).
    ylist = [y for X, y in xysplit]
    return Xlist, ylist
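
To make the segmentation in split_activities concrete, here is a minimal pure-Python sketch that works on plain lists instead of numpy arrays. The helper name split_runs and the toy data are hypothetical and only illustrate the idea: find runs of constant label, drop label 0, trim `borders` items from both ends of each run, and discard runs that end up empty. Unlike the listing, the sketch clamps the slice stop so it can never become negative; that clamp is an assumption added here.

```python
from itertools import groupby


def split_runs(labels, rows, borders=0):
    """Hypothetical helper: split `rows` into runs of constant label.

    Runs with label 0 are dropped, `borders` items are trimmed from both
    ends of every run, and runs that become empty are discarded.
    """
    segments, segment_labels = [], []
    start = 0
    for label, run in groupby(labels):
        end = start + len(list(run))  # exclusive end of this run
        # Clamp the stop so it never drops below the run start
        # (a negative stop would count from the end of the list).
        stop = max(end - borders, start)
        segment = rows[start + borders:stop]
        if label != 0 and len(segment) > 0:
            segments.append(segment)
            segment_labels.append(label)
        start = end
    return segments, segment_labels


toy_labels = [0, 0, 1, 1, 1, 1, 2, 2, 0, 3, 3, 3]
toy_rows = list(range(12))
toy_segments, toy_segment_labels = split_runs(toy_labels, toy_rows, borders=1)
```

With `borders=1`, the run of label 2 is only two items long and disappears after trimming, which mirrors the `len(X) > 0` filter in the listing.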


def sliding_window(frame_length, step, Xsamples,
Coding Style (Naming): The name Xsamples does not conform to the argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
Coding Style (Naming): The name Xsampleslist does not conform to the argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
                   ysamples, Xsampleslist, ysampleslist):
    """
    Splits the time series in Xsampleslist and ysampleslist
    into segments by applying a sliding, overlapping window
    of size frame_length with a step size of step.
    This is done for all samples and the output is appended to
    Xsamples and ysamples, so the participant distinction is not kept.

    Parameters
    ----------
    frame_length : int
        Length of sliding window
    step : int
        Stepsize between windows
    Xsamples : list
        Existing list of window fragments (extended in place)
    ysamples : list
        Existing list of window labels (extended in place)
    Xsampleslist : list
        Samples to take sliding windows from
    ysampleslist : list
        Labels of the samples in Xsampleslist

    """
    for j in range(len(Xsampleslist)):
        X = Xsampleslist[j]
Coding Style (Naming): The name X does not conform to the variable naming conventions ([a-z_][a-z0-9_]{1,30}$).
        ybinary = ysampleslist[j]
        for i in range(0, X.shape[0] - frame_length, step):
            xsub = X[i:i + frame_length, :]
            ysub = ybinary
            Xsamples.append(xsub)
            ysamples.append(ysub)
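
The window positions produced by the loop above can be checked without numpy. The sketch below is a minimal illustration; the helper names window_starts and sliding_windows are hypothetical and operate on plain lists. It reuses the same range expression as sliding_window.

```python
def window_starts(n_samples, frame_length, step):
    """Hypothetical helper: start indices produced by the loop above."""
    return list(range(0, n_samples - frame_length, step))


def sliding_windows(series, frame_length, step):
    """Cut `series` (any sliceable sequence) into overlapping windows."""
    return [series[i:i + frame_length]
            for i in window_starts(len(series), frame_length, step)]


# 1000 samples, 512-sample windows, step of 100 samples
starts = window_starts(1000, 512, 100)
small_windows = sliding_windows(list(range(10)), 4, 3)
```

Because the range stops before `n_samples - frame_length`, a series of exactly `frame_length` samples yields no window at all under this expression, which is worth keeping in mind for short activity segments.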


def transform_y(y, mapclasses, nr_classes):
Coding Style (Naming): The name y does not conform to the argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
    """
    Transforms y, a list with one sequence of A timesteps
    and B unique classes into a binary Numpy matrix of
    shape (A, B)

    Parameters
    ----------
    y : list or array
        List of classes
    mapclasses : dict
        dictionary that maps the classes to numbers
    nr_classes : int
        total number of classes
    """
    ymapped = np.array([mapclasses[c] for c in y], dtype='int')
    ybinary = to_categorical(ymapped, nr_classes)
    return ybinary
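
As an illustration of the shape (A, B) mentioned in the docstring, the following pure-Python sketch builds the same kind of binary matrix with nested lists. The helper name one_hot is hypothetical; the listing itself relies on the Keras helper, which returns a numpy array rather than nested lists.

```python
def one_hot(labels, mapclasses, nr_classes):
    """Hypothetical helper: one row per label, one column per class."""
    rows = []
    for label in labels:
        row = [0] * nr_classes
        row[mapclasses[label]] = 1  # mark the column of this label's class
        rows.append(row)
    return rows


# Two classes (activity IDs 1 and 12) mapped to columns 0 and 1
encoded = one_hot([12, 1, 12], {1: 0, 12: 1}, 2)
```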


def addheader(datasets):
    """
    The columns of the pandas data frames are numbered;
    this function adds the column labels.

    Parameters
    ----------
    datasets : list
        List of pandas dataframes
    """
    axes = ['x', 'y', 'z']
    IMUsensor_columns = ['temperature'] + \
Coding Style (Naming): The name IMUsensor_columns does not conform to the variable naming conventions ([a-z_][a-z0-9_]{1,30}$).
        ['acc_16g_' + i for i in axes] + \
        ['acc_6g_' + i for i in axes] + \
        ['gyroscope_' + i for i in axes] + \
        ['magnometer_' + i for i in axes] + \
        ['orientation_' + str(i) for i in range(4)]
    header = ["timestamp", "activityID", "heartrate"] + ["hand_" + s
                                                         for s in IMUsensor_columns] \
Coding Style: This line is too long as per the coding-style (86/79).

This check looks for lines that are too long. You can specify the maximum line length.
        + ["chest_" + s for s in IMUsensor_columns] + ["ankle_" + s
                                                       for s in IMUsensor_columns]
Coding Style: This line is too long as per the coding-style (82/79).
    for i in range(0, len(datasets)):
        datasets[i].columns = header
    return datasets
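
The header that addheader assigns can be rebuilt without pandas to see how many columns each data frame is expected to have. This is a minimal sketch that repeats the list construction from the listing; the lower-case name imu_columns is a hypothetical stand-in for IMUsensor_columns.

```python
axes = ['x', 'y', 'z']

# Labels for one IMU: temperature, four 3-axis sensors, 4 orientation values
imu_columns = (['temperature']
               + ['acc_16g_' + a for a in axes]
               + ['acc_6g_' + a for a in axes]
               + ['gyroscope_' + a for a in axes]
               + ['magnometer_' + a for a in axes]
               + ['orientation_' + str(i) for i in range(4)])

# Three general columns followed by the hand, chest and ankle IMUs
header = (['timestamp', 'activityID', 'heartrate']
          + ['hand_' + s for s in imu_columns]
          + ['chest_' + s for s in imu_columns]
          + ['ankle_' + s for s in imu_columns])

print(len(imu_columns), len(header))
```

Each IMU contributes 17 labels, so the header has 3 + 3 * 17 = 54 entries. A data frame with a different number of columns would presumably fail at the `columns` assignment.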


def numpify_and_store(X, y, xname, yname, outdatapath, shuffle=False):
Coding Style (Naming): The name X does not conform to the argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
Coding Style (Naming): The name y does not conform to the argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
    """
    Converts the python lists X and y into numpy arrays
    (X becomes 3D, y becomes 2D) and stores them in directory
    outdatapath. Shuffling the samples is optional.

    Parameters
    ----------
    X : list
        list with data
    y : list
        list with data
    xname : str
        name to store the x arrays
    yname : str
        name to store the y arrays
    outdatapath : str
        path to the directory to store the data
    shuffle : bool
        whether to shuffle the data before storing
    """
    X = np.array(X)
    y = np.array(y)
    # Shuffle the set (fixed seed, same order for X and y)
    if shuffle is True:
        np.random.seed(123)
        neworder = np.random.permutation(X.shape[0])
        X = X[neworder, :, :]
        y = y[neworder, :]
    # Save binary file
    np.save(outdatapath + xname, X)
    np.save(outdatapath + yname, y)


def fetch_data(directory_to_extract_to):
    """
    Fetch the data and extract the contents of the zip file
    to the directory_to_extract_to.
    First check whether this was done before; if so, skip.

    Parameters
    ----------
    directory_to_extract_to : str
        directory to create subfolder 'PAMAP2'

    Returns
    -------
    targetdir: str
        directory where the data is extracted
    """
    targetdir = directory_to_extract_to + '/PAMAP2'
    if os.path.exists(targetdir):
        print('Data previously downloaded and stored in ' + targetdir)
    else:
        os.makedirs(targetdir)  # create target directory
        # Download the PAMAP2 data, this is 688 MB
        path_to_zip_file = directory_to_extract_to + '/PAMAP2_Dataset.zip'
        test_file_exist = os.path.isfile(path_to_zip_file)
        if test_file_exist is False:
            url = str('https://archive.ics.uci.edu/ml/' +
                      'machine-learning-databases/00231/PAMAP2_Dataset.zip')
            # retrieve data from url
            local_fn, headers = urllib.request.urlretrieve(url,
                                                           filename=path_to_zip_file)
Coding Style: This line is too long as per the coding-style (85/79).
            print('Download complete and stored in: ' + path_to_zip_file)
        else:
            print('The data was previously downloaded and stored in ' +
                  path_to_zip_file)
        # unzip
        with zipfile.ZipFile(path_to_zip_file, "r") as zip_ref:
            zip_ref.extractall(targetdir)
    return targetdir


def slidingwindow_store(y_list, x_list, X_name, y_name, outdatapath, shuffle):
Coding Style (Naming): The name X_name does not conform to the argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
    """
    Take sliding-window frames. Target is label of last time step.
    Data is 100 Hz, so a frame covers 5.12 s (512 time steps) and
    consecutive frames start 1 s (100 time steps) apart.

    Parameters
    ----------
    y_list : list
        list of arrays with classes
    x_list : list
        list of numpy arrays with data
    X_name : str
        Name for X file
    y_name : str
        Name for y file
    outdatapath : str
        directory to store the data
    shuffle : bool
        whether to shuffle the data
    """
    frame_length = int(5.12 * 100)
    step = 1 * 100
    x_set = []
    y_set = []
    sliding_window(frame_length, step, x_set, y_set, x_list, y_list)
    numpify_and_store(x_set, y_set, X_name, y_name,
                      outdatapath, shuffle)


def map_class(datasets_filled):
    """
    Collect all activity IDs except 0 and map them to class indices.
    """
    ysetall = [set(np.array(data.activityID)) - set([0])
               for data in datasets_filled]
    # sorted so that the class-to-index mapping is deterministic
    classlabels = sorted(set.union(*[set(y) for y in ysetall]))
    nr_classes = len(classlabels)
    mapclasses = {classlabels[i]: i for i in range(len(classlabels))}
    return classlabels, nr_classes, mapclasses
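
A small pure-Python sketch of the mapping that map_class builds, using plain lists of activity IDs instead of data frames. The helper name build_class_map and the toy IDs are hypothetical, and sorting the labels is a choice made in this sketch so that the mapping is deterministic.

```python
def build_class_map(label_sequences):
    """Hypothetical helper: map every non-zero label to a class index."""
    labels = set()
    for sequence in label_sequences:
        labels.update(sequence)
    labels.discard(0)  # activity 0 is excluded from the classes
    classlabels = sorted(labels)
    mapclasses = {label: index for index, label in enumerate(classlabels)}
    return classlabels, len(classlabels), mapclasses


# Two toy recordings with activity IDs 0, 1, 12 and 24
class_info = build_class_map([[0, 1, 1, 12], [0, 12, 24]])
```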


def split_data(Xlists, ybinarylists, indices):
Coding Style (Naming): The name Xlists does not conform to the argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
    """ Function takes subset from list given indices

    Parameters
    ----------
    Xlists : tuple
        tuple (samples) of lists (windows) of numpy-arrays (time, variable)
    ybinarylists : list
        list (samples) of numpy-arrays (window, class)
    indices : int or slice
        indices of the slice of data (samples) to be taken

    Returns
    -------
    x_setlist :
        list (windows across samples) of numpy-arrays (time, variable)
    y_setlist :
        list (windows across samples) of numpy-arrays (class, )
    """
    # isinstance works for slices in both python2 and python3
    if isinstance(indices, slice):
        x_setlist = [X for Xlist in Xlists[indices] for X in Xlist]
        y_setlist = [y for ylist in ybinarylists[indices] for y in ylist]
    else:
        x_setlist = [X for X in Xlists[indices]]
        y_setlist = [y for y in ybinarylists[indices]]
    return x_setlist, y_setlist
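
The two branches of split_data differ in whether `indices` selects one sample or a range of samples. The sketch below shows that behaviour with plain lists of strings instead of numpy arrays; the helper name take_subset and the toy data are hypothetical.

```python
def take_subset(sample_lists, indices):
    """Hypothetical helper: flatten the windows of the selected samples."""
    if isinstance(indices, slice):
        # several samples: concatenate their windows into one flat list
        return [window for sample in sample_lists[indices]
                for window in sample]
    # a single sample: return its windows as a list
    return list(sample_lists[indices])


# Three toy samples with 2, 1 and 3 windows
samples = (['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3'])
single = take_subset(samples, 1)
first_two = take_subset(samples, slice(0, 2))
rest = take_subset(samples, slice(2, len(samples)))
```

preprocess uses the same pattern: index 6 for validation, `slice(0, 6)` for training and `slice(7, len(datasets_filled))` for testing.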


def preprocess(targetdir, outdatapath, columns_to_use):
    """ Function to preprocess the PAMAP2 data after it is fetched

    Parameters
    ----------
    targetdir : str
        subdirectory of directory_to_extract_to, targetdir
        is defined by function fetch_data
    outdatapath : str
        a subdirectory of directory_to_extract_to, outdatapath
        is the directory where the Numpy output will be stored.
    columns_to_use : list
        list of column names to use

    Returns
    -------
    None
    """
    datadir = targetdir + '/PAMAP2_Dataset/Protocol'
    filenames = listdir(datadir)
    print('Start pre-processing all ' + str(len(filenames)) + ' files...')
    # load the files and put them in a list of pandas dataframes:
    datasets = [pd.read_csv(datadir + '/' + fn, header=None, sep=' ')
                for fn in filenames]
    datasets = addheader(datasets)  # add headers to the datasets
    # Interpolate dataset to get same sample rate between channels
    datasets_filled = [d.interpolate() for d in datasets]
    # Create mapping for class labels
    classlabels, nr_classes, mapclasses = map_class(datasets_filled)
    # Create input (x) and output (y) sets
    xall = [np.array(data[columns_to_use]) for data in datasets_filled]
    yall = [np.array(data.activityID) for data in datasets_filled]
    xylists = [split_activities(y, x) for x, y in zip(xall, yall)]
    Xlists, ylists = zip(*xylists)
Coding Style (Naming): The name Xlists does not conform to the variable naming conventions ([a-z_][a-z0-9_]{1,30}$).
    ybinarylists = [transform_y(y, mapclasses, nr_classes) for y in ylists]
    # Split in train, test and val
    x_vallist, y_vallist = split_data(Xlists, ybinarylists, indices=6)
    test_range = slice(7, len(datasets_filled))
    x_testlist, y_testlist = split_data(Xlists, ybinarylists, test_range)
    x_trainlist, y_trainlist = split_data(Xlists, ybinarylists,
                                          indices=slice(0, 6))
    # Take sliding-window frames, target is label of last time step,
    # and store as numpy file
    slidingwindow_store(y_list=y_trainlist, x_list=x_trainlist,
                        X_name='X_train', y_name='y_train',
                        outdatapath=outdatapath, shuffle=True)
    slidingwindow_store(y_list=y_vallist, x_list=x_vallist,
                        X_name='X_val', y_name='y_val',
                        outdatapath=outdatapath, shuffle=False)
    slidingwindow_store(y_list=y_testlist, x_list=x_testlist,
                        X_name='X_test', y_name='y_test',
                        outdatapath=outdatapath, shuffle=False)
    print('Processed data successfully stored in ' + outdatapath)
    return None


def fetch_and_preprocess(directory_to_extract_to, columns_to_use=None):
    """
    High level function to fetch and preprocess the PAMAP2 dataset

    Parameters
    ----------
    directory_to_extract_to : str
        the directory where the data will be stored
    columns_to_use : list
        the columns to use

    Returns
    -------
    outdatapath: The directory in which the numpy files are stored
    """
    if columns_to_use is None:
        columns_to_use = ['hand_acc_16g_x', 'hand_acc_16g_y', 'hand_acc_16g_z',
                          'ankle_acc_16g_x', 'ankle_acc_16g_y', 'ankle_acc_16g_z',
Coding Style: This line is too long as per the coding-style (82/79).
                          'chest_acc_16g_x', 'chest_acc_16g_y', 'chest_acc_16g_z']
Coding Style: This line is too long as per the coding-style (82/79).
    targetdir = fetch_data(directory_to_extract_to)
    outdatapath = targetdir + '/PAMAP2_Dataset/slidingwindow512cleaned/'
    if not os.path.exists(outdatapath):
        os.makedirs(outdatapath)
    # the file name must match the name used when storing ('X_train')
    if os.path.isfile(outdatapath + 'X_train.npy'):
        print('Data previously pre-processed and np-files saved to ' +
              outdatapath)
    else:
        preprocess(targetdir, outdatapath, columns_to_use)
    return outdatapath


def load_data(outputpath):
    """ Function to load the numpy data as stored in directory
    outputpath.

    Parameters
    ----------
    outputpath : str
        directory where the numpy files are stored
    """
    ext = '.npy'
    x_train = np.load(outputpath + 'X_train' + ext)
    y_train_binary = np.load(outputpath + 'y_train' + ext)
    x_val = np.load(outputpath + 'X_val' + ext)
    y_val_binary = np.load(outputpath + 'y_val' + ext)
    x_test = np.load(outputpath + 'X_test' + ext)
    y_test_binary = np.load(outputpath + 'y_test' + ext)
    return x_train, y_train_binary, x_val, y_val_binary, x_test, y_test_binary