kNN_accuracy() - Code Metrics - Inspection of "Added docstrings and autopep8-ed" - NLeSC/mcfly - Measure and Improve Code Quality continuously with Scrutinizer

Completed

Push — master ( ed5c34...292ca9 )

by Dafne van

created 2016-08-04 12:57 UTC

kNN_accuracy() B


Complexity

Conditions

Size

Total Lines

Duplication

Lines	0
Ratio	0 %

Code Coverage

Tests	7
CRAP Score	1

Importance

Changes	1
Bugs	0	Features	0

Metric	Value
cc	1
c	1
b	0
f	0
dl	0
loc	35
ccs	7
cts	7
cp	1
crap	1
rs	8.8571
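
The CRAP score of 1 in the table above is consistent with the commonly cited CRAP (Change Risk Anti-Patterns) formula, comp^2 * (1 - cov)^3 + comp. The sketch below assumes that formula and assumes that `cc` is the cyclomatic complexity and that `ccs` and `cts` are the covered and total statement counts. Neither assumption is confirmed by the report itself, and `crap_score` is a hypothetical helper name.

```python
def crap_score(complexity, coverage):
    """CRAP = comp^2 * (1 - cov)^3 + comp, with coverage as a fraction in [0, 1]."""
    return complexity ** 2 * (1.0 - coverage) ** 3 + complexity


# Values taken from the metric table; their interpretation is an assumption.
complexity = 1          # cc
covered, total = 7, 7   # ccs, cts

score = crap_score(complexity, covered / total)
print(score)
```

With full coverage the second factor vanishes, so the score reduces to the complexity itself; with no coverage it grows to comp^2 + comp.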

'''
 Summary:
 Function generate_models from modelgen.py generates and compiles models
 Function train_models_on_samples trains those models
 Function plotTrainingProcess plots the training process
 Function find_best_architecture is a wrapper function that combines
 these steps
 Example function calls in 'EvaluateDifferentModels.ipynb'
'''
import numpy as np
from matplotlib import pyplot as plt
from . import modelgen
from sklearn import neighbors, metrics
import warnings


def train_models_on_samples(X_train, y_train, X_val, y_val, models,
                            nr_epochs=5, subset_size=100, verbose=True):
    """
    Given a list of compiled models, this function trains
    them all on a subset of the train data. If the given size of the subset is
    larger than the size of the data, the complete data set is used.

    Parameters
    ----------
    X_train : numpy array of shape (num_samples, num_timesteps, num_channels)
        The input dataset for training
    y_train : numpy array of shape (num_samples, num_classes)
        The output classes for the train data, in binary format
    X_val : numpy array of shape (num_samples_val, num_timesteps, num_channels)
        The input dataset for validation
    y_val : numpy array of shape (num_samples_val, num_classes)
        The output classes for the validation data, in binary format
    models : list of (model, params, model_type) tuples
        List of keras models to train
    nr_epochs : int, optional
        number of epochs to use for training one model
    subset_size : int, optional
        number of samples to use from the training set for training these
        models
    verbose : bool, optional
        flag for displaying verbose output

    Returns
    -------
    histories : list of Keras History objects
        train histories for all models
    val_accuracies : list of floats
        validation accuracies of the models
    val_losses : list of floats
        validation losses of the models
    """
    # Slicing is safe even if subset_size exceeds the number of samples:
    # numpy then simply returns all available samples.
    X_train_sub = X_train[:subset_size, :, :]
    y_train_sub = y_train[:subset_size, :]

    histories = []
    val_accuracies = []
    val_losses = []
    for model, params, model_types in models:
        # nb_epoch is the Keras 1.x name of this argument (`epochs` in Keras 2)
        history = model.fit(X_train_sub, y_train_sub,
                            nb_epoch=nr_epochs, batch_size=20,
                            validation_data=(X_val, y_val),
                            verbose=verbose)
        histories.append(history)
        val_accuracies.append(history.history['val_acc'][-1])
        val_losses.append(history.history['val_loss'][-1])

    return histories, val_accuracies, val_losses


def plotTrainingProcess(history, name='Model', ax=None):
    """
    This function plots the loss and accuracy on the train and validation set,
    for each epoch in the history of one model.

    Parameters
    ----------
    history : keras History object for one model
        The history object of the training process corresponding to one model
    name : str, optional
        Title of the plot
    ax : matplotlib Axes, optional
        Axes to plot on; if None, a new figure and axes are created
    """
    if ax is None:
        fig, ax = plt.subplots()
    ax2 = ax.twinx()
    LN = len(history.history['val_loss'])

    val_loss, = ax.plot(range(LN), history.history['val_loss'], 'g--',
                        label='validation loss')
    train_loss, = ax.plot(range(LN), history.history['loss'], 'g-',
                          label='train loss')
    val_acc, = ax2.plot(range(LN), history.history['val_acc'], 'b--',
                        label='validation accuracy')
    train_acc, = ax2.plot(range(LN), history.history['acc'], 'b-',
                          label='train accuracy')
    ax.set_xlabel('epoch')
    ax.set_ylabel('loss', color='g')
    ax2.set_ylabel('accuracy', color='b')
    plt.legend(handles=[val_loss, train_loss, val_acc, train_acc],
               loc=2, bbox_to_anchor=(1.1, 1))
    plt.title(name)


def find_best_architecture(X_train, y_train, X_val, y_val, verbose=True,
                           number_of_models=5, nr_epochs=5, subset_size=100,
                           **kwargs):
    """
    Tries out a number of models on a subsample of the data,
    and outputs the best found architecture and hyperparameters.

    Parameters
    ----------
    X_train : numpy array of shape (num_samples, num_timesteps, num_channels)
        The input dataset for training
    y_train : numpy array of shape (num_samples, num_classes)
        The output classes for the train data, in binary format
    X_val : numpy array of shape (num_samples_val, num_timesteps, num_channels)
        The input dataset for validation
    y_val : numpy array of shape (num_samples_val, num_classes)
        The output classes for the validation data, in binary format
    verbose : bool, optional
        flag for displaying verbose output
    number_of_models : int
        The number of models to generate and test
    nr_epochs : int
        The number of epochs that each model is trained
    subset_size : int
        The size of the subset of the data that is used for finding the
        optimal architecture
    **kwargs : key-value parameters
        parameters for generating the models
        (see docstring for modelgen.generate_models)

    Returns
    -------
    best_model : Keras model
        Best performing model, already trained on a small sample data set.
    best_params : dict
        Dictionary containing the hyperparameters for the best model
    best_model_type : str
        Type of the best model
    knn_acc : float
        accuracy for kNN prediction on validation set
    """
    models = modelgen.generate_models(X_train.shape, y_train.shape[1],
                                      number_of_models=number_of_models,
                                      **kwargs)
    histories, val_accuracies, val_losses = train_models_on_samples(
        X_train, y_train, X_val, y_val, models,
        nr_epochs=nr_epochs, subset_size=subset_size, verbose=verbose)

    best_model_index = np.argmax(val_accuracies)
    best_model, best_params, best_model_type = models[best_model_index]
    knn_acc = kNN_accuracy(
        X_train[:subset_size, :, :], y_train[:subset_size, :], X_val, y_val)
    if verbose:
        # now one plot per model; ultimately we may want all models in one
        # plot to allow for direct comparison
        for i in range(len(models)):
            name = str(models[i][1])
            plotTrainingProcess(histories[i], name)
        print('Best model: model ', best_model_index)
        print('Model type: ', best_model_type)
        print('Hyperparameters: ', best_params)
        print('Accuracy on validation set: ', val_accuracies[best_model_index])
        print('Accuracy of kNN on validation set', knn_acc)

    if val_accuracies[best_model_index] < knn_acc:
        warnings.warn('Best model not better than kNN: ' +
                      str(val_accuracies[best_model_index]) + ' vs  ' +
                      str(knn_acc)
                      )
    return best_model, best_params, best_model_type, knn_acc


def kNN_accuracy(X_train, y_train, X_val, y_val, k=1):
    """
    Performs k-Nearest Neighbors and returns the accuracy score.

    Parameters
    ----------
    X_train : numpy array
        Train set of shape (num_samples, num_timesteps, num_channels)
    y_train : numpy array
        Class labels for train set
    X_val : numpy array
        Validation set of shape (num_samples, num_timesteps, num_channels)
    y_val : numpy array
        Class labels for validation set
    k : int
        number of neighbors to use for classifying

    Returns
    -------
    accuracy: float
        accuracy score on the validation set
    """
    # Flatten each (num_timesteps, num_channels) sample into one feature vector
    num_samples, num_timesteps, num_channels = X_train.shape
    clf = neighbors.KNeighborsClassifier(k)
    clf.fit(X_train.reshape(num_samples, num_timesteps * num_channels),
            y_train)
    num_samples, num_timesteps, num_channels = X_val.shape
    val_predict = clf.predict(
        X_val.reshape(num_samples, num_timesteps * num_channels))
    # accuracy_score expects (y_true, y_pred)
    return metrics.accuracy_score(y_val, val_predict)
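

To make the baseline in kNN_accuracy() concrete, here is a dependency-free sketch of what it computes for the default k=1: every (num_timesteps, num_channels) sample is flattened into one vector, each validation sample takes the label of its nearest training sample, and the fraction of exact label matches is returned. It assumes Euclidean distance (scikit-learn's default metric); the helper names `flatten` and `one_nn_accuracy` and the toy data are hypothetical and not part of mcfly.

```python
import math


def flatten(sample):
    """Flatten one (num_timesteps, num_channels) sample into a flat list."""
    return [value for timestep in sample for value in timestep]


def one_nn_accuracy(x_train, y_train, x_val, y_val):
    """Accuracy of a 1-nearest-neighbour classifier on flattened samples."""
    train_flat = [flatten(sample) for sample in x_train]
    correct = 0
    for sample, label in zip(x_val, y_val):
        query = flatten(sample)
        # Index of the closest training sample by Euclidean distance.
        nearest = min(range(len(train_flat)),
                      key=lambda i: math.dist(query, train_flat[i]))
        # A prediction counts only if the whole one-hot label matches.
        correct += int(y_train[nearest] == label)
    return correct / len(y_val)


# Toy data: 2 training samples, each with 2 timesteps and 2 channels.
x_train = [[[0.0, 0.0], [0.0, 0.0]],
           [[1.0, 1.0], [1.0, 1.0]]]
y_train = [[1, 0], [0, 1]]            # one-hot ("binary format") labels
x_val = [[[0.1, 0.0], [0.0, 0.2]],
         [[0.9, 1.0], [1.0, 0.8]],
         [[0.6, 0.6], [0.6, 0.6]]]
y_val = [[1, 0], [0, 1], [1, 0]]      # the third sample lies nearer class 2

accuracy = one_nn_accuracy(x_train, y_train, x_val, y_val)
print(accuracy)
```

In this toy set the first two validation samples are classified correctly and the third is not, so the baseline accuracy is two thirds. find_best_architecture() warns when the best neural network does not beat this kind of score.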


Issues

Line	Category	Introduced (UTC)	Message
17	Coding Style, Naming	2016-07-07 14:47	The names `X_train` and `X_val` do not conform to the argument naming conventions (`[a-z_][a-z0-9_]{1,30}$`).
40	Coding Style	2016-07-07 14:47	This line is too long as per the coding-style (80/79).
54	Coding Style, Naming	2016-07-07 14:47	The name `X_train_sub` does not conform to the variable naming conventions (`[a-z_][a-z0-9_]{1,30}$`).
73	Coding Style, Naming	2016-07-07 14:47	The name `plotTrainingProcess` does not conform to the function naming conventions (`[a-z_][a-z0-9_]{2,30}$`).
87	Coding Style, Naming	2016-07-07 14:47	The name `LN` does not conform to the variable naming conventions (`[a-z_][a-z0-9_]{1,30}$`).
104	Coding Style, Naming	2016-07-07 14:47	The names `X_train` and `X_val` do not conform to the argument naming conventions (`[a-z_][a-z0-9_]{1,30}$`).
129	Coding Style	2016-08-04 13:04	This line is too long as per the coding-style (92/79).
131	Coding Style	2016-08-04 12:29	This line is too long as per the coding-style (89/79).
153	Coding Style	2016-08-04 09:16	This line is too long as per the coding-style (92/79).
154	Coding Style	2016-07-07 14:47	This line is too long as per the coding-style (84/79).
160	Coding Style	2016-08-04 13:04	This line is too long as per the coding-style (80/79).
178	Coding Style, Naming	2016-07-07 14:47	The name `kNN_accuracy` does not conform to the function naming conventions (`[a-z_][a-z0-9_]{2,30}$`); the names `X_train` and `X_val` do not conform to the argument naming conventions (`[a-z_][a-z0-9_]{1,30}$`).

Line numbers refer to the inspected source file. The naming check looks for invalid names for a range of different identifiers; the regular expressions can be changed if the defaults do not match the project's requirements. If the project includes a Pylint configuration file, the settings in that file take precedence. The line-length check uses a configurable maximum line length.
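
The naming issues reported for this file quote two regular expressions. The sketch below checks the flagged names against them with Python's `re` module, assuming each pattern is applied from the start of the name (as `re.match` does). The `conforms` helper and the lowercase alternatives are hypothetical illustrations, not names used by mcfly.

```python
import re

# Patterns quoted in the issue messages.
ARGUMENT_RE = re.compile(r"[a-z_][a-z0-9_]{1,30}$")   # arguments and variables
FUNCTION_RE = re.compile(r"[a-z_][a-z0-9_]{2,30}$")   # functions


def conforms(pattern, name):
    """Return True if the name matches the pattern from its first character."""
    return pattern.match(name) is not None


# Names flagged in the report: all contain uppercase characters.
flagged_arguments = ["X_train", "X_val", "X_train_sub", "LN"]
flagged_functions = ["plotTrainingProcess", "kNN_accuracy"]

# Hypothetical lowercase alternatives that would satisfy the patterns.
alternative_arguments = ["x_train", "x_val", "x_train_sub", "num_epochs"]
alternative_functions = ["plot_training_process", "knn_accuracy"]

for name in flagged_arguments + alternative_arguments:
    print(name, conforms(ARGUMENT_RE, name))
for name in flagged_functions + alternative_functions:
    print(name, conforms(FUNCTION_RE, name))
```

Since `X_train` and `X_val` follow a widespread machine-learning convention, a project may prefer to relax the patterns in its Pylint configuration rather than rename the arguments.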
