Push to master (126da3...7b8a0a) by unknown; created 07:48; status: Completed

split_data()   C

Complexity

Conditions 8

Size

Total Lines 18

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 5
CRAP Score 9.4924

Importance

Changes 2
Bugs 0
Features 0

Metric   Value
cc       8
c        2
b        0
f        0
dl       0
loc      18
ccs      5
cts      7
cp       0.7143
crap     9.4924
rs       6.6666

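The CRAP score in the metrics for split_data() combines cyclomatic complexity (cc) with statement coverage (cp = ccs / cts). The page does not state the formula; the sketch below assumes the commonly used definition cc^2 * (1 - coverage)^3 + cc, and the function name crap_score is hypothetical.

```python
def crap_score(complexity, coverage):
    """Change Risk Anti-Patterns score.

    Assumes the usual definition (not stated on the page itself):
    complexity**2 * (1 - coverage)**3 + complexity, with coverage in [0, 1].
    """
    return complexity ** 2 * (1.0 - coverage) ** 3 + complexity


# Values taken from the metric table: cc = 8, cp = 0.7143 (ccs / cts = 5 / 7)
score = crap_score(8, 0.7143)
print(round(score, 3))
```

Under this assumption the result agrees with the reported value of 9.4924 to within rounding. With full coverage the score would drop to the complexity itself, here 8.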
"""
 Summary:
 Function fetch_and_preprocess from tutorial_pamap2.py helps to fetch and
 preprocess the data.
 Example function calls in 'Tutorial mcfly on PAMAP2.ipynb'
"""
import numpy as np
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt
from os import listdir
import os.path
import zipfile
import keras
from keras.utils.np_utils import to_categorical
import sys
if sys.version_info < (3,): #python2
    import urllib
else: #python3
    import urllib.request


# Pylint (Coding Style, Naming): the names X and Xlist do not conform to the
# argument/variable naming conventions ([a-z_][a-z0-9_]{1,30}$).
# This check looks for invalid names for a range of different identifiers.
# The regular expressions can be changed if the defaults do not match your
# requirements; settings in a project's Pylint configuration file take
# precedence.
def split_activities(labels, X, borders=10*100):
    """
    Splits up the data per activity and excludes activity=0.
    Also removes borders for each activity.
    Returns lists with subdatasets
    """
    tot_len = len(labels)
    # An activity starts at index 0 and wherever the label changes
    startpoints = np.where([1] + [labels[i] != labels[i-1]
                                  for i in range(1, tot_len)])[0]
    endpoints = np.append(startpoints[1:]-1, tot_len-1)
    acts = [labels[s] for s, e in zip(startpoints, endpoints)]
    #Also split up the data, and only keep the non-zero activities.
    #The end index is clamped at 0 so that a short activity near the start
    #cannot turn into a negative (wrap-around) index
    xysplit = [(X[s+borders:max(e-borders+1, 0), :], a)
               for s, e, a in zip(startpoints, endpoints, acts) if a != 0]
    # Drop activities that are empty after removing the borders
    xysplit = [(x_act, y_act) for x_act, y_act in xysplit if len(x_act) > 0]
    Xlist = [x_act for x_act, y_act in xysplit]
    ylist = [y_act for x_act, y_act in xysplit]
    return Xlist, ylist

# Pylint (Coding Style, Naming): the names Xsamples, Xsampleslist and X do not
# conform to the argument/variable naming conventions ([a-z_][a-z0-9_]{1,30}$).
def sliding_window(frame_length, step, Xsamples,
                   ysamples, Xsampleslist, ysampleslist):
    """
    Splits time series in ysampleslist and Xsampleslist
    into segments by applying a sliding overlapping window
    of size equal to frame_length with steps equal to step.
    It does this for all the samples and appends all the output
    to Xsamples and ysamples, so the participant distinction is not kept.
    """
    for j in range(len(Xsampleslist)):
        X = Xsampleslist[j]
        ybinary = ysampleslist[j]
        # Window starts run up to, but exclude, X.shape[0]-frame_length
        for i in range(0, X.shape[0]-frame_length, step):
            xsub = X[i:i+frame_length, :]
            ysub = ybinary
            Xsamples.append(xsub)
            ysamples.append(ysub)

# Pylint (Coding Style, Naming): the name y does not conform to the argument
# naming conventions ([a-z_][a-z0-9_]{1,30}$).
def transform_y(y, mapclasses, nr_classes):
    """
    Transforms y, a list with one sequence of A timesteps
    and B unique classes into a binary Numpy matrix of
    shape (A, B)
    """
    ymapped = np.array([mapclasses[c] for c in y], dtype='int')
    ybinary = to_categorical(ymapped, nr_classes)
    return ybinary

# Pylint (Coding Style, Naming): the name IMUsensor_columns does not conform to
# the variable naming conventions ([a-z_][a-z0-9_]{1,30}$).
def addheader(datasets):
    """
    The columns of the pandas data frame are numbers;
    this function adds the column labels
    """
    axes = ['x', 'y', 'z']
    # 17 columns per IMU (hand, chest, ankle)
    IMUsensor_columns = ['temperature'] + \
                    ['acc_16g_' + i for i in axes] + \
                    ['acc_6g_' + i for i in axes] + \
                    ['gyroscope_' + i for i in axes] + \
                    ['magnometer_' + i for i in axes] + \
                    ['orientation_' + str(i) for i in range(4)]
    header = ["timestamp", "activityID", "heartrate"] + \
        ["hand_"+s for s in IMUsensor_columns] + \
        ["chest_"+s for s in IMUsensor_columns] + \
        ["ankle_"+s for s in IMUsensor_columns]
    for i in range(0, len(datasets)):
        datasets[i].columns = header
    return datasets

# Pylint (Coding Style, Naming): the names X and y do not conform to the
# argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
def numpify_and_store(X, y, xname, yname, outdatapath, shuffle=False):
    """
    Converts python lists X (list of 2D arrays, giving a 3D array) and
    y (list of 1D binary vectors, giving a 2D array) into numpy arrays
    and stores the numpy arrays in directory outdatapath.
    shuffle is optional and shuffles the samples
    """
    X = np.array(X)
    y = np.array(y)
    #Shuffle around the train set
    if shuffle is True:
        np.random.seed(123)
        neworder = np.random.permutation(X.shape[0])
        X = X[neworder, :, :]
        y = y[neworder, :]
    # Save binary file (np.save appends the .npy extension)
    np.save(outdatapath + xname, X)
    np.save(outdatapath + yname, y)


def fetch_data(directory_to_extract_to):
    """
    Fetch the data and extract the contents of the zip file
    to the directory_to_extract_to.
    First check whether this was done before, if yes, then skip
    """
    targetdir = directory_to_extract_to + '/PAMAP2'
    if os.path.exists(targetdir):
        print('Data previously downloaded and stored in ' + targetdir)
    else:
        os.makedirs(targetdir) # create target directory
        #download the PAMAP2 data, this is 688 Mb
        path_to_zip_file = directory_to_extract_to + '/PAMAP2_Dataset.zip'
        if not os.path.isfile(path_to_zip_file):
            url = ('https://archive.ics.uci.edu/ml/' +
                   'machine-learning-databases/00231/PAMAP2_Dataset.zip')
            #retrieve data from url
            if sys.version_info < (3,): #python2
                urllib.urlretrieve(url, filename=path_to_zip_file)
            else: #python3
                urllib.request.urlretrieve(url, filename=path_to_zip_file)
            print('Download complete and stored in: ' + path_to_zip_file)
        else:
            print('The data was previously downloaded and stored in ' +
                  path_to_zip_file)
        # unzip
        with zipfile.ZipFile(path_to_zip_file, "r") as zip_ref:
            zip_ref.extractall(targetdir)
    return targetdir


# Pylint (Coding Style, Naming): the name X_name does not conform to the
# argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
def slidingwindow_store(y_list, x_list, X_name, y_name, outdatapath, shuffle):
    """
    Takes sliding-window frames from x_list and y_list and stores them
    as numpy files named X_name and y_name in outdatapath.
    """
    # Take sliding-window frames. Target is label of last time step
    # Data is 100 Hz
    frame_length = int(5.12 * 100) # 5.12 seconds, 512 time steps
    step = 1 * 100 # 1 second
    x_set = []
    y_set = []
    sliding_window(frame_length, step, x_set, y_set, x_list, y_list)
    numpify_and_store(x_set, y_set, X_name, y_name,
                      outdatapath, shuffle)

def map_clas(datasets_filled):
    """
    Collects the non-zero activity IDs over all datasets and maps
    each of them to a class index in range(nr_classes)
    """
    ysetall = [set(np.array(data.activityID)) - set([0])
               for data in datasets_filled]
    classlabels = list(set.union(*[set(y) for y in ysetall]))
    nr_classes = len(classlabels)
    mapclasses = {classlabels[i] : i for i in range(len(classlabels))}
    return classlabels, nr_classes, mapclasses

# Pylint (Coding Style, Naming): the name Xlists does not conform to the
# argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
def split_data(Xlists, ybinarylists, indices):
    """ Function takes subset from list given indices
    Arguments:
    - Xlists: tuple (samples) of lists (windows)
            of numpy-arrays (time, variable)
    - ybinarylists: list (samples) of numpy-arrays (window, class)
    - indices: indices of the slice of data (samples) to be taken;
            either a slice (several samples) or an int (one sample)
    Value (output):
    - x_setlist: list (windows across samples) of numpy-arrays (time, variable)
    - y_setlist: list (windows across samples) of numpy-arrays (class, )
    """
    # isinstance works on both Python 2 and 3; comparing str(type(indices))
    # with "<class 'slice'>" only matches on Python 3
    if isinstance(indices, slice):
        # several samples: flatten the windows of all selected samples
        x_setlist = [X for Xlist in Xlists[indices] for X in Xlist]
        y_setlist = [y for ylist in ybinarylists[indices] for y in ylist]
    else:
        # a single sample: take its windows as they are
        x_setlist = [X for X in Xlists[indices]]
        y_setlist = [y for y in ybinarylists[indices]]
    return x_setlist, y_setlist

# Pylint (Coding Style, Naming): the name Xlists does not conform to the
# variable naming conventions ([a-z_][a-z0-9_]{1,30}$).
def preprocess(targetdir, outdatapath, columns_to_use):
    """ Function to preprocess the PAMAP2 data after it is fetched
    Arguments:
    - targetdir: subdirectory of directory_to_extract_to, targetdir
        is defined by function fetch_data
    - outdatapath: a subdirectory of directory_to_extract_to, outdatapath
        is the directory where the Numpy output will be stored.
    - columns_to_use: the columns (header names) to use as input variables
    Value (output):
    - None
    """
    datadir = targetdir + '/PAMAP2_Dataset/Protocol'
    filenames = listdir(datadir)
    print('Start pre-processing all ' + str(len(filenames)) + ' files...')
    # load the files and put them in a list of pandas dataframes:
    datasets = [pd.read_csv(datadir+'/'+fn, header=None, sep=' ')
                for fn in filenames]
    datasets = addheader(datasets) # add headers to the datasets
    #Interpolate dataset to get same sample rate between channels
    datasets_filled = [d.interpolate() for d in datasets]
    # Create mapping for class labels
    classlabels, nr_classes, mapclasses = map_clas(datasets_filled)
    #Create input (x) and output (y) sets
    xall = [np.array(data[columns_to_use]) for data in datasets_filled]
    yall = [np.array(data.activityID) for data in datasets_filled]
    xylists = [split_activities(y, x) for x, y in zip(xall, yall)]
    Xlists, ylists = zip(*xylists)
    ybinarylists = [transform_y(y, mapclasses, nr_classes) for y in ylists]
    # Split in train (files 0-5), val (file 6) and test (files 7 and up)
    x_vallist, y_vallist = split_data(Xlists, ybinarylists, indices=6)
    test_range = slice(7, len(datasets_filled))
    x_testlist, y_testlist = split_data(Xlists, ybinarylists, test_range)
    x_trainlist, y_trainlist = split_data(Xlists, ybinarylists,
                                          indices=slice(0, 6))
    # Take sliding-window frames, target is label of last time step,
    # and store as numpy file
    slidingwindow_store(y_list=y_trainlist, x_list=x_trainlist,
                        X_name='X_train', y_name='y_train',
                        outdatapath=outdatapath, shuffle=True)
    slidingwindow_store(y_list=y_vallist, x_list=x_vallist,
                        X_name='X_val', y_name='y_val',
                        outdatapath=outdatapath, shuffle=False)
    slidingwindow_store(y_list=y_testlist, x_list=x_testlist,
                        X_name='X_test', y_name='y_test',
                        outdatapath=outdatapath, shuffle=False)
    print('Processed data successfully stored in ' + outdatapath)
    return None

def fetch_and_preprocess(directory_to_extract_to, columns_to_use=None):
    """
    High level function to fetch_and_preprocess the PAMAP2 dataset
    Arguments:
    - directory_to_extract_to: the directory where the data will be stored
    - columns_to_use: the columns to use
    Values (output):
    - outdatapath: The directory in which the numpy files are stored
    """
    if columns_to_use is None:
        columns_to_use = ['hand_acc_16g_x', 'hand_acc_16g_y', 'hand_acc_16g_z',
                     'ankle_acc_16g_x', 'ankle_acc_16g_y', 'ankle_acc_16g_z',
                     'chest_acc_16g_x', 'chest_acc_16g_y', 'chest_acc_16g_z']
    targetdir = fetch_data(directory_to_extract_to)
    outdatapath = targetdir + '/PAMAP2_Dataset/slidingwindow512cleaned/'
    if not os.path.exists(outdatapath):
        os.makedirs(outdatapath)
    # The train set is stored as 'X_train.npy' (capital X), so check that name
    if os.path.isfile(outdatapath+'X_train.npy'):
        print('Data previously pre-processed and np-files saved to ' +
            outdatapath)
    else:
        preprocess(targetdir, outdatapath, columns_to_use)
    return outdatapath

def load_data(outputpath):
    """ Function to load the numpy data as stored in directory
    outputpath.
    Returns the train, validation and test sets, each as a pair
    of input data (x) and binary labels (y).
    """
    ext = '.npy'
    x_train = np.load(outputpath+'X_train'+ext)
    y_train_binary = np.load(outputpath+'y_train'+ext)
    x_val = np.load(outputpath+'X_val'+ext)
    y_val_binary = np.load(outputpath+'y_val'+ext)
    x_test = np.load(outputpath+'X_test'+ext)
    y_test_binary = np.load(outputpath+'y_test'+ext)
    return x_train, y_train_binary, x_val, y_val_binary, x_test, y_test_binary
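
The boundary detection used by split_activities can be tried on a toy label sequence. The sketch below is a vectorised variant written for illustration, not the module's own code: the helper name run_boundaries, the labels and the small borders value are all made up.

```python
import numpy as np


def run_boundaries(labels):
    """Return inclusive (start, end) index arrays for each run of equal labels."""
    labels = np.asarray(labels)
    # A run starts at index 0 and wherever the label differs from its predecessor
    change = np.concatenate(([True], labels[1:] != labels[:-1]))
    starts = np.where(change)[0]
    # A run ends one step before the next run starts; the last run ends at the end
    ends = np.append(starts[1:] - 1, len(labels) - 1)
    return starts, ends


# Made-up labels: activity 0 is the "no activity" class that gets dropped
labels = [0, 0, 1, 1, 1, 1, 0, 2, 2, 2]
starts, ends = run_boundaries(labels)

# Trim `borders` time steps from both sides of every non-zero run and
# keep only the runs that are still non-empty afterwards
borders = 1
kept = [(int(s) + borders, int(e) - borders)
        for s, e in zip(starts, ends)
        if labels[s] != 0 and e - borders >= s + borders]
print(kept)
```

Each pair in kept is the inclusive index range that survives after trimming. The module itself uses borders=10*100, that is 10 seconds at 100 Hz.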
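
The framing done by sliding_window and slidingwindow_store can be checked on synthetic data without downloading PAMAP2. This is a minimal sketch that mirrors the loop bounds of sliding_window; the helper name sliding_frames and the zero-filled input are assumptions for illustration only.

```python
import numpy as np


def sliding_frames(x, frame_length, step):
    """Cut a (time, variable) array into overlapping frames of frame_length rows."""
    # Same bounds as the loop in sliding_window: start indices run up to,
    # but exclude, x.shape[0] - frame_length
    return [x[i:i + frame_length, :]
            for i in range(0, x.shape[0] - frame_length, step)]


# Synthetic segment: 10 seconds at 100 Hz with 3 channels (made-up data)
x = np.zeros((1000, 3))
# 512 time steps per frame (5.12 s at 100 Hz), one new frame every 100 steps
frames = sliding_frames(x, frame_length=512, step=100)
print(len(frames), frames[0].shape)

# A segment that is exactly one frame long yields no frames with these bounds
print(len(sliding_frames(np.zeros((512, 3)), frame_length=512, step=100)))
```

Because the range excludes its upper bound, a frame starting exactly at the last possible position is skipped. The sketch reproduces that behaviour rather than changing it.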