Completed
Push — master ( 9b04b8...290c30 )
by
unknown
05:21
created

numpify_and_store()   A

Complexity

Conditions 2

Size

Total Lines 17

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 0
CRAP Score 6

Importance

Changes 3
Bugs 0 Features 0
Metric Value
cc 2
c 3
b 0
f 0
dl 0
loc 17
ccs 0
cts 10
cp 0
crap 6
rs 9.4285
"""
 Summary:
 Function fetch_and_preprocess from tutorial_pamap2.py helps to fetch and
 preprocess the data.
 Example function calls in 'Tutorial mcfly on PAMAP2.ipynb'
"""
import numpy as np
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt
from os import listdir
import os.path
import urllib.request
import zipfile
import keras
from keras.utils.np_utils import to_categorical

def split_activities(labels, X, borders=10*100):
    # Pylint (Coding Style, Naming): The name X does not conform to the
    # argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
    # This check looks for invalid names for a range of different identifiers.
    # You can set regular expressions to which the identifiers must conform
    # if the defaults do not match your requirements. If your project includes
    # a Pylint configuration file, the settings contained in that file take
    # precedence.
    """
    Splits up the data per activity and excludes activity=0.
    Also removes borders for each activity.
    Returns lists with subdatasets
    """
    tot_len = len(labels)
    # A segment starts at index 0 and wherever the label changes
    startpoints = np.where([1] + [labels[i] != labels[i-1]
                                  for i in range(1, tot_len)])[0]
    endpoints = np.append(startpoints[1:]-1, tot_len-1)
    acts = [labels[s] for s, e in zip(startpoints, endpoints)]
    # Also split up the data, and only keep the non-zero activities.
    # max(..., 0) stops a negative stop index from wrapping around when a
    # segment ends within the first `borders` samples.
    xysplit = [(X[s+borders:max(e-borders+1, 0), :], a)
               for s, e, a in zip(startpoints, endpoints, acts) if a != 0]
    xysplit = [(X, y) for X, y in xysplit if len(X) > 0]
    # Pylint (Coding Style, Naming): The name Xlist does not conform to the
    # variable naming conventions ([a-z_][a-z0-9_]{1,30}$).
    Xlist = [X for X, y in xysplit]
    ylist = [y for X, y in xysplit]
    return Xlist, ylist
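
The run detection and border trimming done by split_activities can be illustrated on a toy label sequence. What follows is a plain-Python sketch of the same idea, not the module's code: split_runs is a hypothetical helper, borders is set to 1 instead of the default 10*100 samples, and integers stand in for sensor rows.

```python
from itertools import groupby


def split_runs(labels, samples, borders=1):
    """Split samples into runs of equal label, drop label 0, and trim
    `borders` samples from both ends of every remaining run."""
    segments, segment_labels = [], []
    start = 0
    for label, run in groupby(labels):
        end = start + len(list(run))  # exclusive end index of this run
        # Clamp the stop index at 0 so it can never wrap around
        trimmed = samples[start + borders:max(end - borders, 0)]
        if label != 0 and len(trimmed) > 0:
            segments.append(trimmed)
            segment_labels.append(label)
        start = end
    return segments, segment_labels


labels = [0, 0, 1, 1, 1, 1, 0, 2, 2, 2, 3]
samples = list(range(len(labels)))  # stand-in for sensor rows
segments, segment_labels = split_runs(labels, samples, borders=1)
print(segments, segment_labels)  # [[3, 4], [8]] [1, 2]
```

The single-sample run with label 3 disappears entirely, because trimming one sample from each side leaves nothing. With the default border of 1000 samples, the same happens to any activity shorter than about 20 seconds at 100 Hz.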


# Pylint (Coding Style, Naming): The names Xsamples and Xsampleslist do not
# conform to the argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
def sliding_window(frame_length, step, Xsamples,
                   ysamples, ysampleslist, Xsampleslist):
    """
    Splits the time series in ysampleslist and Xsampleslist
    into segments by applying a sliding overlapping window
    of size equal to frame_length with steps equal to step.
    It does this for all the samples and appends all the output together,
    so the participant distinction is not kept.
    """
    for j in range(len(Xsampleslist)):
        # Pylint (Coding Style, Naming): The name X does not conform to the
        # variable naming conventions ([a-z_][a-z0-9_]{1,30}$).
        X = Xsampleslist[j]
        ybinary = ysampleslist[j]
        for i in range(0, X.shape[0]-frame_length, step):
            xsub = X[i:i+frame_length, :]
            ysub = ybinary
            Xsamples.append(xsub)
            ysamples.append(ysub)
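
The windowing loop above can be tried out on a short list. This is a plain-Python sketch that mirrors the loop bound range(0, n - frame_length, step); window_starts and sliding_windows are hypothetical names, and the 1000-sample segment is an assumed example length.

```python
def window_starts(n_samples, frame_length, step):
    """Start indices visited by range(0, n_samples - frame_length, step)."""
    return list(range(0, n_samples - frame_length, step))


def sliding_windows(sequence, frame_length, step):
    """Cut one sequence into overlapping windows of frame_length items."""
    return [sequence[i:i + frame_length]
            for i in window_starts(len(sequence), frame_length, step)]


signal = list(range(10))  # toy stand-in for one activity segment
windows = sliding_windows(signal, frame_length=4, step=2)
print(windows)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7]]

# With 512-sample frames and a step of 100 samples, an assumed segment
# of 1000 samples yields five windows.
print(window_starts(1000, 512, 100))  # [0, 100, 200, 300, 400]
```

Note that this loop bound skips a window that would end exactly on the last sample: in the toy example the window starting at index 6 is not produced.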


# Pylint (Coding Style, Naming): The name y does not conform to the
# argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
def transform_y(y, mapclasses, nr_classes):
    """
    Transforms y, a tuple with sequences of class per time segment per sample,
    into a binary matrix per sample
    """
    ymapped = np.array([mapclasses[c] for c in y], dtype='int')
    ybinary = to_categorical(ymapped, nr_classes)
    return ybinary
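
The two steps in transform_y, mapping activity IDs to consecutive class indices and then one-hot encoding them, can be sketched without Keras. The helper one_hot and the activity IDs below are assumptions made for illustration; to_categorical returns a NumPy array rather than nested lists.

```python
def one_hot(labels, mapclasses):
    """Map each label to its class index and return one binary row per label."""
    nr_classes = len(mapclasses)
    rows = []
    for label in labels:
        row = [0] * nr_classes
        row[mapclasses[label]] = 1  # set the column of this label's class
        rows.append(row)
    return rows


classlabels = [1, 4, 12]  # hypothetical activity IDs
mapclasses = {label: index for index, label in enumerate(classlabels)}
encoded = one_hot([4, 1, 12, 4], mapclasses)
print(encoded)  # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```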


def addheader(datasets):
    """
    The columns of the pandas data frame are numbers;
    this function adds the column labels
    """
    axes = ['x', 'y', 'z']
    # Pylint (Coding Style, Naming): The name IMUsensor_columns does not
    # conform to the variable naming conventions ([a-z_][a-z0-9_]{1,30}$).
    IMUsensor_columns = ['temperature'] + \
                    ['acc_16g_' + i for i in axes] + \
                    ['acc_6g_' + i for i in axes] + \
                    ['gyroscope_'+ i for i in axes] + \
                    ['magnometer_'+ i for i in axes] + \
                    ['orientation_' + str(i) for i in range(4)]
    header = ["timestamp", "activityID", "heartrate"] + \
        ["hand_"+s for s in IMUsensor_columns] + \
        ["chest_"+s for s in IMUsensor_columns] + \
        ["ankle_"+s for s in IMUsensor_columns]
    for i in range(0, len(datasets)):
        datasets[i].columns = header
    return datasets


# Pylint (Coding Style, Naming): The names X and y do not conform to the
# argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
def numpify_and_store(X, y, xname, yname, outdatapath, shuffle=False):
    """
    Converts python lists X and y into numpy arrays
    and stores the numpy arrays in directory outdatapath.
    shuffle is optional and shuffles the samples
    """
    X = np.array(X)
    y = np.array(y)
    # Shuffle around the train set
    if shuffle is True:
        np.random.seed(123)
        neworder = np.random.permutation(X.shape[0])
        X = X[neworder, :, :]
        y = y[neworder, :]
    # Save binary file
    np.save(outdatapath+ xname, X)
    np.save(outdatapath+ yname, y)
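
The shuffle step keeps samples and labels aligned by drawing one permutation and applying it to both arrays. A minimal sketch of that idea with the standard library follows; shuffle_together is a hypothetical helper, and the order it produces differs from the one NumPy generates for the same seed.

```python
import random


def shuffle_together(x_items, y_items, seed=123):
    """Apply one seeded permutation to both lists so pairs stay aligned."""
    order = list(range(len(x_items)))
    random.Random(seed).shuffle(order)  # seeded, so the order is repeatable
    return [x_items[i] for i in order], [y_items[i] for i in order]


x_items = ['window_%d' % i for i in range(5)]
y_items = [i % 2 for i in range(5)]
x_shuffled, y_shuffled = shuffle_together(x_items, y_items)
print(list(zip(x_shuffled, y_shuffled)))
```

Fixing the seed makes the stored training set reproducible across runs, which is why the function above calls np.random.seed(123) before drawing the permutation.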


def fetch_data(directory_to_extract_to):
    """
    Fetch the data and extract the contents of the zip file
    to the directory_to_extract_to.
    First check whether this was done before; if yes, then skip
    """
    targetdir = directory_to_extract_to + '/PAMAP2'
    if os.path.exists(targetdir):
        print('Data previously downloaded and stored in ' + targetdir)
    else:
        os.makedirs(targetdir) # create target directory
        # download the PAMAP2 data, this is 688 Mb
        path_to_zip_file = directory_to_extract_to + '/PAMAP2_Dataset.zip'
        test_file_exist = os.path.isfile(path_to_zip_file)
        if test_file_exist is False:
            url = str('https://archive.ics.uci.edu/ml/' +
                'machine-learning-databases/00231/PAMAP2_Dataset.zip')
            # retrieve data from url
            local_fn, headers = urllib.request.urlretrieve(
                url, filename=path_to_zip_file)
            print('Download complete and stored in: ' + path_to_zip_file)
        else:
            print('The data was previously downloaded and stored in ' +
                path_to_zip_file)
        # unzip
        with zipfile.ZipFile(path_to_zip_file, "r") as zip_ref:
            zip_ref.extractall(targetdir)
    return targetdir


def fetch_and_preprocess(directory_to_extract_to, columns_to_use=None):
    """
    High level function to fetch_and_preprocess the PAMAP2 dataset
    directory_to_extract_to: the directory where the data will be stored
    columns_to_use: the columns to use
    """
    if columns_to_use is None:
        columns_to_use = ['hand_acc_16g_x', 'hand_acc_16g_y', 'hand_acc_16g_z',
                     'ankle_acc_16g_x', 'ankle_acc_16g_y', 'ankle_acc_16g_z',
                     'chest_acc_16g_x', 'chest_acc_16g_y', 'chest_acc_16g_z']
    targetdir = fetch_data(directory_to_extract_to)
    outdatapath = targetdir + '/PAMAP2_Dataset' + '/slidingwindow512cleaned/'
    if not os.path.exists(outdatapath):
        os.makedirs(outdatapath)
    # numpify_and_store saves the training data as 'X_train.npy'
    if os.path.isfile(outdatapath+'X_train.npy'):
        print('Data previously pre-processed and np-files saved to ' +
            outdatapath)
    else:
        datadir = targetdir + '/PAMAP2_Dataset/Protocol'
        # Sort the file names: os.listdir returns them in arbitrary order,
        # and the train/val/test split below depends on their positions
        filenames = sorted(listdir(datadir))
        print('Start pre-processing all ' + str(len(filenames)) + ' files...')
        # load the files and put them in a list of pandas dataframes:
        datasets = [pd.read_csv(datadir+'/'+fn, header=None, sep=' ')
                    for fn in filenames]
        datasets = addheader(datasets) # add headers to the datasets
        #print(len(datasets))
        print(datasets[0].shape)

        # Interpolate dataset to get same sample rate between channels
        datasets_filled = [d.interpolate() for d in datasets]
        # Create mapping for class labels
        ysetall = [set(np.array(data.activityID)) - set([0])
                   for data in datasets_filled]
        classlabels = list(set.union(*[set(y) for y in ysetall]))
        nr_classes = len(classlabels)
        mapclasses = {classlabels[i] : i for i in range(len(classlabels))}
        # Create input (x) and output (y) sets
        xall = [np.array(data[columns_to_use]) for data in datasets_filled]
        yall = [np.array(data.activityID) for data in datasets_filled]
        xylists = [split_activities(y, x) for x, y in zip(xall, yall)]
        # Pylint (Coding Style, Naming): The name Xlists does not conform to
        # the variable naming conventions ([a-z_][a-z0-9_]{1,30}$).
        Xlists, ylists = zip(*xylists)
        ybinarylists = [transform_y(y, mapclasses, nr_classes) for y in ylists]
        # Split in train, test and val
        train_range = slice(0, 6)
        val_range = 6
        test_range = slice(7, len(datasets_filled))
        x_trainlist = [X for Xlist in Xlists[train_range] for X in Xlist]
        x_vallist = [X for X in Xlists[val_range]]
        x_testlist = [X for Xlist in Xlists[test_range] for X in Xlist]
        y_trainlist = [y for ylist in ybinarylists[train_range] for y in ylist]
        y_vallist = [y for y in ybinarylists[val_range]]
        y_testlist = [y for ylist in ybinarylists[test_range] for y in ylist]

        # Take sliding-window frames. Target is label of last time step
        # Data is 100 Hz
        frame_length = int(5.12 * 100)
        step = 1 * 100
        x_train = []
        y_train = []
        x_val = []
        y_val = []
        x_test = []
        y_test = []
        sliding_window(frame_length, step, x_train, y_train, y_trainlist,
                       x_trainlist)
        sliding_window(frame_length, step, x_val, y_val, y_vallist, x_vallist)
        sliding_window(frame_length, step, x_test, y_test,
                       y_testlist, x_testlist)
        numpify_and_store(x_train, y_train, 'X_train', 'y_train',
                          outdatapath, shuffle=True)
        numpify_and_store(x_val, y_val, 'X_val', 'y_val', outdatapath,
                          shuffle=False)
        numpify_and_store(x_test, y_test, 'X_test', 'y_test', outdatapath,
                          shuffle=False)
        print('Processed data successfully stored in ' + outdatapath)
    return outdatapath
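
The train, validation and test sets in fetch_and_preprocess are formed per input file: the first six files go to training, the seventh to validation, and the rest to testing. A small sketch of that split, assuming nine input files with two segments each; split_by_file and the segment names are hypothetical.

```python
def split_by_file(per_file_items, val_index=6):
    """Flatten per-file lists into train, val and test lists by file position."""
    train = [item for items in per_file_items[:val_index] for item in items]
    val = list(per_file_items[val_index])
    test = [item for items in per_file_items[val_index + 1:] for item in items]
    return train, val, test


# Nine assumed files, each contributing two named segments
per_file = [['file%d_seg%d' % (f, s) for s in range(2)] for f in range(9)]
train, val, test = split_by_file(per_file)
print(len(train), len(val), len(test))  # 12 2 4
```

Splitting by file rather than by window keeps all windows from one recording in the same set, so overlapping windows never end up on both sides of the split.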


def load_data(outputpath):
    """
    Loads the train, validation and test arrays that
    fetch_and_preprocess stored in outputpath
    """
    ext = '.npy'
    x_train = np.load(outputpath+'X_train'+ext)
    y_train_binary = np.load(outputpath+'y_train'+ext)
    x_val = np.load(outputpath+'X_val'+ext)
    y_val_binary = np.load(outputpath+'y_val'+ext)
    x_test = np.load(outputpath+'X_test'+ext)
    y_test_binary = np.load(outputpath+'y_test'+ext)
    return x_train, y_train_binary, x_val, y_val_binary, x_test, y_test_binary