Completed
Push — master ( 20a5b0...54f26e )
by
unknown
185:02 queued 183:45
created

fetch_and_preprocess()   D

Complexity

Conditions 10

Size

Total Lines 53

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 1
CRAP Score 100.3389

Importance

Changes 7
Bugs 0 Features 0
Metric Value
cc 10
c 7
b 0
f 0
dl 0
loc 53
ccs 1
cts 30
cp 0.0333
crap 100.3389
rs 4.8

How to fix   Long Method    Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
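As a minimal sketch of this idea (the function names and data below are hypothetical and do not come from tutorial_pamap2.py), the comments inside a longer function can be turned into the names of small extracted functions:

```python
# Hypothetical example: names and data are illustrative only.

def summarize_before(values):
    # drop missing readings
    cleaned = [v for v in values if v is not None]
    # compute the mean
    return sum(cleaned) / len(cleaned)


def drop_missing_readings(values):
    """Return the readings that are not None."""
    return [v for v in values if v is not None]


def mean(values):
    """Return the arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def summarize_after(values):
    # Each former comment is now the name of a small function.
    return mean(drop_missing_readings(values))
```

Both versions return the same result; the second one simply reads as a sequence of named steps.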

Commonly applied refactorings include:

Complexity

Complex classes like fetch_and_preprocess() often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
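A rough sketch of what such an extraction could look like here, assuming that the variables sharing the train/val/test suffixes are the cohesive component. The class name DataSplit and its methods are hypothetical and not part of the reviewed file:

```python
from typing import List, NamedTuple


class DataSplit(NamedTuple):
    """Hypothetical container for one split (train, val or test)."""
    name: str
    x_list: List
    y_list: List

    def x_filename(self) -> str:
        # File name stem for the input array of this split
        return 'X_' + self.name

    def y_filename(self) -> str:
        # File name stem for the label array of this split
        return 'y_' + self.name


# Illustrative data only
train = DataSplit('train', x_list=[[0.1, 0.2]], y_list=[1])
print(train.x_filename(), train.y_filename())
```

Grouping the values this way would let a caller pass one object per split instead of several parallel variables.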

"""
 Summary:
 Function fetch_and_preprocess from tutorial_pamap2.py helps to fetch and
 preprocess the data.
 Example function calls in 'Tutorial mcfly on PAMAP2.ipynb'
"""
import numpy as np
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt
from os import listdir
import os.path
import zipfile
import keras
from keras.utils.np_utils import to_categorical
import sys
if sys.version_info <= (3,): #python2
    import urllib
else: #python3
    import urllib.request

def split_activities(labels, X, borders=10*100):
Coding Style (Naming): The name X does not conform to the argument naming conventions ([a-z_][a-z0-9_]{1,30}$).

This check looks for invalid names for a range of different identifiers.

You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements.

If your project includes a Pylint configuration file, the settings contained in that file take precedence.

To find out more about Pylint, please refer to their site.

    """
    Splits up the data per activity and exclude activity=0.
    Also remove borders for each activity.
    Returns lists with subdatasets
    """
    tot_len = len(labels)
    startpoints = np.where([1] + [labels[i] != labels[i-1] \
        for i in range(1, tot_len)])[0]
    endpoints = np.append(startpoints[1:]-1, tot_len-1)
    acts = [labels[s] for s, e in zip(startpoints, endpoints)]
    #Also split up the data, and only keep the non-zero activities
    xysplit = [(X[s+borders:e-borders+1, :], a) \
        for s, e, a in zip(startpoints, endpoints, acts) if a != 0]
    xysplit = [(X, y) for X, y in xysplit if len(X) > 0]
    Xlist = [X for X, y in xysplit]
Coding Style (Naming): The name Xlist does not conform to the variable naming conventions ([a-z_][a-z0-9_]{1,30}$).
    ylist = [y for X, y in xysplit]
    return Xlist, ylist

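To illustrate how split_activities finds the activity segments, here is a small sketch that uses plain Python lists instead of NumPy arrays. The helper names activity_segments and trimmed_bounds are hypothetical; they only mirror the start point, end point and border logic shown above.

```python
def activity_segments(labels):
    """Return (start, end, label) per run of equal labels; end is inclusive."""
    if not labels:
        return []
    starts = [0] + [i for i in range(1, len(labels))
                    if labels[i] != labels[i - 1]]
    ends = [s - 1 for s in starts[1:]] + [len(labels) - 1]
    return [(s, e, labels[s]) for s, e in zip(starts, ends)]


def trimmed_bounds(segments, borders):
    """Drop label 0 and cut `borders` samples from both ends of each segment."""
    kept = []
    for start, end, label in segments:
        lo, hi = start + borders, end - borders + 1  # hi is exclusive
        if label != 0 and hi > lo:
            kept.append((lo, hi, label))
    return kept


labels = [0, 0, 1, 1, 1, 1, 1, 0, 2, 2, 2]  # illustrative labels
segments = activity_segments(labels)
print(segments)
print(trimmed_bounds(segments, borders=1))
```

In the reviewed function the default border is 10*100 samples, which at 100 Hz corresponds to 10 seconds at each end of an activity.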
def sliding_window(frame_length, step, Xsamples,\
    ysamples, Xsampleslist, ysampleslist):
Coding Style (Naming): The name Xsamples does not conform to the argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
Coding Style (Naming): The name Xsampleslist does not conform to the argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
    """
    Splits the time series in ysampleslist and Xsampleslist
    into segments by applying a sliding overlapping window
    of size equal to frame_length with steps equal to step.
    It does this for all the samples and appends all the output together,
    so the participant distinction is not kept.
    """
    for j in range(len(Xsampleslist)):
        X = Xsampleslist[j]
Coding Style (Naming): The name X does not conform to the variable naming conventions ([a-z_][a-z0-9_]{1,30}$).
        ybinary = ysampleslist[j]
        for i in range(0, X.shape[0]-frame_length, step):
            xsub = X[i:i+frame_length, :]
            ysub = ybinary
            Xsamples.append(xsub)
            ysamples.append(ysub)

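The window positions produced by the loop above can be checked with a small sketch on a plain list. The helper names window_starts and windows are hypothetical; the range expression is the same as in sliding_window.

```python
def window_starts(n_samples, frame_length, step):
    """Start indices produced by range(0, n_samples - frame_length, step)."""
    return list(range(0, n_samples - frame_length, step))


def windows(sequence, frame_length, step):
    """Cut a list into overlapping windows of frame_length items."""
    return [sequence[i:i + frame_length]
            for i in window_starts(len(sequence), frame_length, step)]


samples = list(range(10))  # illustrative data
print(windows(samples, frame_length=4, step=2))
```

With these numbers a window starting at index 6 would also fit completely, but the exclusive upper bound of range leaves it out.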
def transform_y(y, mapclasses, nr_classes):
Coding Style (Naming): The name y does not conform to the argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
    """
    Transforms y, a list with one sequence of A timesteps
    and B unique classes into a binary Numpy matrix of
    shape (A, B)
    """
    ymapped = np.array([mapclasses[c] for c in y], dtype='int')
    ybinary = to_categorical(ymapped, nr_classes)
    return ybinary

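The shape of the result can be illustrated without Keras. The sketch below is a plain-Python stand-in for the mapping followed by to_categorical; the function name one_hot and the activity IDs are hypothetical, and it returns nested lists of integers rather than a NumPy array.

```python
def one_hot(labels, mapclasses, nr_classes):
    """Map each label to its class index and build one binary row per label."""
    rows = []
    for label in labels:
        row = [0] * nr_classes
        row[mapclasses[label]] = 1
        rows.append(row)
    return rows


mapclasses = {12: 0, 16: 1, 24: 2}  # hypothetical activity IDs
encoded = one_hot([16, 12, 24], mapclasses, nr_classes=3)
print(encoded)
```

Each of the A input labels becomes one row with B entries, of which exactly one is set.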
def addheader(datasets):
    """
    The columns of the pandas data frame are numbers;
    this function adds the column labels.
    """
    axes = ['x', 'y', 'z']
    IMUsensor_columns = ['temperature'] + \
                    ['acc_16g_' + i for i in axes] + \
                    ['acc_6g_' + i for i in axes] + \
                    ['gyroscope_'+ i for i in axes] + \
                    ['magnometer_'+ i for i in axes] + \
                    ['orientation_' + str(i) for i in range(4)]
Coding Style (Naming): The name IMUsensor_columns does not conform to the variable naming conventions ([a-z_][a-z0-9_]{1,30}$).
    header = ["timestamp", "activityID", "heartrate"] + ["hand_"+s \
        for s in IMUsensor_columns] \
        + ["chest_"+s for s in IMUsensor_columns]+ ["ankle_"+s \
            for s in IMUsensor_columns]
    for i in range(0, len(datasets)):
            datasets[i].columns = header
Coding Style: The indentation here looks off. 8 spaces were expected, but 12 were found.
    return datasets

def numpify_and_store(X, y, xname, yname, outdatapath, shuffle=False):
Coding Style (Naming): The name X does not conform to the argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
Coding Style (Naming): The name y does not conform to the argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
    """
    Converts the python lists X (3D) and y into numpy arrays
    and stores the numpy arrays in directory outdatapath.
    shuffle is optional and shuffles the samples
    """
    X = np.array(X)
    y = np.array(y)
    #Shuffle around the train set
    if shuffle is True:
        np.random.seed(123)
        neworder = np.random.permutation(X.shape[0])
        X = X[neworder, :, :]
        y = y[neworder, :]
    # Save binary file
    np.save(outdatapath+ xname, X)
    np.save(outdatapath+ yname, y)


def fetch_data(directory_to_extract_to):
    """
    Fetch the data and extract the contents of the zip file
    to the directory_to_extract_to.
    First check whether this was done before, if yes, then skip
    """
    targetdir = directory_to_extract_to + '/PAMAP2'
    if os.path.exists(targetdir):
        print('Data previously downloaded and stored in ' + targetdir)
    else:
        os.makedirs(targetdir) # create target directory
        #download the PAMAP2 data, this is 688 Mb
        path_to_zip_file = directory_to_extract_to + '/PAMAP2_Dataset.zip'
        test_file_exist = os.path.isfile(path_to_zip_file)
        if test_file_exist is False:
            url = str('https://archive.ics.uci.edu/ml/' +
                'machine-learning-databases/00231/PAMAP2_Dataset.zip')
            #retrieve data from url
            if sys.version_info <= (3,): #python2
                local_fn, headers = urllib.urlretrieve(url,\
                    filename=path_to_zip_file)
            else: #python3
                local_fn, headers = urllib.request.urlretrieve(url,\
                    filename=path_to_zip_file)
            print('Download complete and stored in: ' + path_to_zip_file)
        else:
            print('The data was previously downloaded and stored in ' +
                path_to_zip_file)
        # unzip
        with zipfile.ZipFile(path_to_zip_file ,"r") as zip_ref:
Coding Style: No space allowed before comma.
Coding Style: Exactly one space required after comma.
            zip_ref.extractall(targetdir)
    return targetdir


def slidingwindow_store(y_list, x_list,X_name, y_name, outdatapath, shuffle):
Coding Style: Exactly one space required after comma.
Coding Style (Naming): The name X_name does not conform to the argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
    # Take sliding-window frames. Target is label of last time step
    # Data is 100 Hz
    frame_length = int(5.12 * 100)
    step = 1 * 100
    x_set = []
    y_set = []
    sliding_window(frame_length, step, x_set, y_set, x_list, y_list)
    numpify_and_store(x_set, y_set, X_name, y_name, \
        outdatapath, shuffle)

def map_clas(datasets_filled):
    ysetall = [set(np.array(data.activityID)) - set([0]) \
        for data in datasets_filled]
    classlabels = list(set.union(*[set(y) for y in ysetall]))
    nr_classes = len(classlabels)
    mapclasses = {classlabels[i] : i for i in range(len(classlabels))}
    return classlabels, nr_classes, mapclasses

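A small sketch of the same label-mapping idea on plain lists follows. The function name build_class_map is hypothetical, and sorting the labels is an assumption made here to get a reproducible order; the reviewed code uses list(set(...)), whose order is not guaranteed.

```python
def build_class_map(label_sequences, ignore=0):
    """Collect all labels except `ignore` and number them consecutively."""
    labels = set()
    for sequence in label_sequences:
        labels.update(sequence)
    labels.discard(ignore)
    # Assumption: sort for a deterministic class order
    classlabels = sorted(labels)
    mapclasses = {label: index for index, label in enumerate(classlabels)}
    return classlabels, len(classlabels), mapclasses


# Illustrative label sequences for two recordings
print(build_class_map([[0, 1, 3], [3, 5, 0]]))
```

The returned dictionary plays the role of mapclasses in transform_y: it translates an activity ID into a column index.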
def split_data(Xlist,ylist,indices):
Coding Style: Exactly one space required after comma (reported twice for this line).
Coding Style (Naming): The name Xlist does not conform to the argument naming conventions ([a-z_][a-z0-9_]{1,30}$).
    """ Function takes subset from list given indices"""
    if isinstance(indices, slice):
        # A slice selects several entries: flatten their segments
        x_setlist = [X for xsub in Xlist[indices] for X in xsub]
        y_setlist = [y for ysub in ylist[indices] for y in ysub]
    else:
        # A single index selects one entry
        x_setlist = [X for X in Xlist[indices]]
        y_setlist = [y for y in ylist[indices]]
    # Return for both branches, using the names assigned above
    return x_setlist, y_setlist

def fetch_and_preprocess(directory_to_extract_to, columns_to_use=None):
    """
    High level function to fetch_and_preprocess the PAMAP2 dataset
    directory_to_extract_to: the directory where the data will be stored
    columns_to_use: the columns to use
    """
    if columns_to_use is None:
        columns_to_use = ['hand_acc_16g_x', 'hand_acc_16g_y', 'hand_acc_16g_z',
                     'ankle_acc_16g_x', 'ankle_acc_16g_y', 'ankle_acc_16g_z',
                     'chest_acc_16g_x', 'chest_acc_16g_y', 'chest_acc_16g_z']
    targetdir = fetch_data(directory_to_extract_to)
    outdatapath = targetdir + '/PAMAP2_Dataset/slidingwindow512cleaned/'
    if not os.path.exists(outdatapath):
        os.makedirs(outdatapath)
    # The arrays are stored as 'X_train.npy' (capital X), so check that name
    if os.path.isfile(outdatapath+'X_train.npy'):
        print('Data previously pre-processed and np-files saved to ' +
            outdatapath)
    else:
        datadir = targetdir + '/PAMAP2_Dataset/Protocol'
        filenames = listdir(datadir)
        print('Start pre-processing all ' + str(len(filenames)) + ' files...')
        # load the files and put them in a list of pandas dataframes:
        datasets = [pd.read_csv(datadir+'/'+fn, header=None, sep=' ') \
            for fn in filenames]
        datasets = addheader(datasets) # add headers to the datasets
        #Interpolate dataset to get same sample rate between channels
        datasets_filled = [d.interpolate() for d in datasets]
        # Create mapping for class labels
        classlabels, nr_classes, mapclasses = map_clas(datasets_filled)
        #Create input (x) and output (y) sets
        xall = [np.array(data[columns_to_use]) for data in datasets_filled]
        yall = [np.array(data.activityID) for data in datasets_filled]
        xylists = [split_activities(y, x) for x, y in zip(xall, yall)]
        Xlists, ylists = zip(*xylists)
Coding Style (Naming): The name Xlists does not conform to the variable naming conventions ([a-z_][a-z0-9_]{1,30}$).
        ybinarylists = [transform_y(y, mapclasses, nr_classes) for y in ylists]
        # Split in train, test and val
        # (pass the lists defined above; Xlist and ylist do not exist here)
        x_trainlist, y_trainlist = split_data(Xlists,ybinarylists,slice(0, 6))
Coding Style: Exactly one space required after comma (reported twice for this line).
        x_vallist, y_vallist = split_data(Xlists,ybinarylists,indices=6)
Coding Style: Exactly one space required after comma (reported twice for this line).
        test_range = slice(7, len(datasets_filled))
        x_testlist, y_testlist = split_data(Xlists,ybinarylists,test_range)
Coding Style: Exactly one space required after comma (reported twice for this line).
        # Take sliding-window frames, target is label of last time step,
        # and store as numpy file
        slidingwindow_store(y_list=y_trainlist, x_list=x_trainlist, \
                    X_name='X_train', y_name='y_train', \
                    outdatapath=outdatapath,shuffle=True)
Coding Style: Exactly one space required after comma.
        slidingwindow_store(y_list=y_vallist, x_list=x_vallist, \
            X_name='X_val', y_name='y_val', \
            outdatapath=outdatapath,shuffle=False)
Coding Style: Exactly one space required after comma.
        slidingwindow_store(y_list=y_testlist, x_list=x_testlist, \
                X_name='X_test', y_name='y_test', \
                outdatapath=outdatapath,shuffle=False)
Coding Style: Exactly one space required after comma.
        print('Processed data successfully stored in ' + outdatapath)
    return outdatapath

def load_data(outputpath):
    ext = '.npy'
    x_train = np.load(outputpath+'X_train'+ext)
    y_train_binary = np.load(outputpath+'y_train'+ext)
    x_val = np.load(outputpath+'X_val'+ext)
    y_val_binary = np.load(outputpath+'y_val'+ext)
    x_test = np.load(outputpath+'X_test'+ext)
    y_test_binary = np.load(outputpath+'y_test'+ext)
    return x_train, y_train_binary, x_val, y_val_binary, x_test, y_test_binary