Completed
Push — master ( 77cbe9...b56edc )
by Chad
10:40
created

diff_classifier.pca.partial_corr()   A

Complexity

Conditions 3

Size

Total Lines 62
Code Lines 18

Duplication

Lines 62
Ratio 100 %

Importance

Changes 0
Metric  Value
eloc    18
dl      62
loc     62
rs      9.5
c       0
b       0
f       0
cc      3
nop     1

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
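
As a sketch of that advice applied to this file, the two regressions inside `partial_corr` could move into a small helper. The helper name `_residuals` and the argument name `data` are hypothetical, and `np.linalg.lstsq` and `np.corrcoef` stand in for the `scipy.linalg.lstsq` and `scipy.stats.pearsonr` calls of the reviewed code so that the example needs only NumPy.

```python
import numpy as np


def _residuals(target, predictors):
    """Return what is left of `target` after a least-squares fit on `predictors`."""
    beta = np.linalg.lstsq(predictors, target, rcond=None)[0]
    return target - predictors.dot(beta)


def partial_corr(data):
    """Partial correlation of each column pair, controlling for the other columns."""
    data = np.asarray(data, dtype=float)
    n_vars = data.shape[1]
    result = np.eye(n_vars)  # diagonal entries are 1 by definition
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            # Boolean mask selecting every column except i and j.
            others = np.ones(n_vars, dtype=bool)
            others[[i, j]] = False
            res_i = _residuals(data[:, i], data[:, others])
            res_j = _residuals(data[:, j], data[:, others])
            result[i, j] = result[j, i] = np.corrcoef(res_i, res_j)[0, 1]
    return result
```

With the regression step named, the loop body reads as the five steps listed in the docstring, and the helper can be tested on its own.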

Commonly applied refactorings include:

1
Coding Style: This module should have a docstring.

The coding style of this project requires that you add a docstring to this code element. Below is an example for a method:

class SomeClass:
    def some_method(self):
        """Do x and return foo."""

To learn more about docstrings, we recommend reading PEP-257: Docstring Conventions.
2
import pandas as pd
3
import numpy as np
4
from sklearn.decomposition import PCA as pca
5
from sklearn.preprocessing import StandardScaler as stscale
6
from sklearn.preprocessing import Imputer
7
import scipy.stats as stats
8
from scipy import stats, linalg
Unused Code: The import stats was already done on line 7. You should be able to remove this line.
9
import numpy as np
Unused Code: The import numpy was already done on line 3. You should be able to remove this line.
Imports from package numpy are not grouped
10
import matplotlib.pyplot as plt
11
from matplotlib.pyplot import cm
12
13
14
def partial_corr(C):
Duplication: This code seems to be duplicated in your project.
Coding Style Naming: The name C does not conform to the argument naming conventions ((([a-z][a-z0-9_]{2,30})|(_[a-z0-9_]*))$).

This check looks for invalid names for a range of different identifiers.

You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements.

If your project includes a Pylint configuration file, the settings contained in that file take precedence.

To find out more about Pylint, please refer to their site.
15
    """
16
    Returns the sample linear partial correlation coefficients between pairs of variables in C, controlling 
Coding Style: Trailing whitespace
Coding Style: This line is too long as per the coding-style (107/100).

This check looks for lines that are too long. You can specify the maximum line length.
17
    for the remaining variables in C.
18
19
    Partial Correlation in Python (clone of Matlab's partialcorr)
20
21
    This uses the linear regression approach to compute the partial 
Coding Style: Trailing whitespace
22
    correlation (might be slow for a huge number of variables). The 
Coding Style: Trailing whitespace
23
    algorithm is detailed here:
24
25
        http://en.wikipedia.org/wiki/Partial_correlation#Using_linear_regression
26
27
    Taking X and Y two variables of interest and Z the matrix with all the variable minus {X, Y},
28
    the algorithm can be summarized as
29
30
        1) perform a normal linear least-squares regression with X as the target and Z as the predictor
Coding Style: This line is too long as per the coding-style (103/100).
31
        2) calculate the residuals in Step #1
32
        3) perform a normal linear least-squares regression with Y as the target and Z as the predictor
Coding Style: This line is too long as per the coding-style (103/100).
33
        4) calculate the residuals in Step #3
34
        5) calculate the correlation coefficient between the residuals from Steps #2 and #4; 
Coding Style: Trailing whitespace
35
36
        The result is the partial correlation between X and Y while controlling for the effect of Z
37
38
39
    Date: Nov 2014
40
    Author: Fabian Pedregosa-Izquierdo, [email protected]
41
    Testing: Valentina Borghesani, [email protected]
42
43
    Parameters
44
    ----------
45
    C : array-like, shape (n, p)
46
        Array with the different variables. Each column of C is taken as a variable
47
48
49
    Returns
50
    -------
51
    P : array-like, shape (p, p)
52
        P[i, j] contains the partial correlation of C[:, i] and C[:, j] controlling
53
        for the remaining variables in C.
54
    """
55
    
Coding Style: Trailing whitespace
56
    C = np.asarray(C)
57
    p = C.shape[1]
Coding Style Naming: The name p does not conform to the variable naming conventions ((([a-z][a-z0-9_]{2,30})|(_[a-z0-9_]*))$).
58
    P_corr = np.zeros((p, p), dtype=np.float)
Coding Style Naming: The name P_corr does not conform to the variable naming conventions ((([a-z][a-z0-9_]{2,30})|(_[a-z0-9_]*))$).
59
    for i in range(p):
60
        P_corr[i, i] = 1
61
        for j in range(i+1, p):
62
            idx = np.ones(p, dtype=np.bool)
63
            idx[i] = False
64
            idx[j] = False
65
            beta_i = linalg.lstsq(C[:, idx], C[:, j])[0]
66
            beta_j = linalg.lstsq(C[:, idx], C[:, i])[0]
67
68
            res_j = C[:, j] - C[:, idx].dot( beta_i)
Coding Style: No space allowed after bracket
69
            res_i = C[:, i] - C[:, idx].dot(beta_j)
70
            
Coding Style: Trailing whitespace
71
            corr = stats.pearsonr(res_i, res_j)[0]
72
            P_corr[i, j] = corr
73
            P_corr[j, i] = corr
74
        
Coding Style: Trailing whitespace
75
    return P_corr
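
A further point not raised by the checks: `np.float` and `np.bool`, used on lines 58 and 62, were deprecated aliases for the Python builtins and were removed in NumPy 1.24, so `dtype=np.float` raises `AttributeError` on current releases (NumPy 2.0 later brought the name `np.bool` back for its boolean scalar type, but `np.float` stays removed). A minimal sketch of the two allocations with the builtin types, which work on every NumPy version; `n_vars` is a stand-in for `p`:

```python
import numpy as np

n_vars = 4  # stand-in for p = C.shape[1]

# Builtin float/bool select float64 and the NumPy boolean dtype respectively.
p_corr = np.zeros((n_vars, n_vars), dtype=float)
idx = np.ones(n_vars, dtype=bool)
idx[0] = False
idx[1] = False
```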
76
77
78
def kmo(dataset):
Duplication: This code seems to be duplicated in your project.
79
    """
80
    Calculates the Kaiser-Meyer-Olkin measure on an input dataset.
81
    
Coding Style: Trailing whitespace
82
    Based on calculations shown here:
83
    
Coding Style: Trailing whitespace
84
    http://www.statisticshowto.com/kaiser-meyer-olkin/
85
    
Coding Style: Trailing whitespace
86
        -- 0.00-0.49  unacceptable
87
        -- 0.50-0.59  miserable
88
        -- 0.60-0.69  mediocre
89
        -- 0.70-0.79  middling
90
        -- 0.80-0.89  meritorious
91
        -- 0.90-1.00  marvelous
92
    
Coding Style: Trailing whitespace
93
    Parameters
94
    ----------
95
    dataset : array-like, shape (n, p)
96
              Array containing n samples and p features. Must have no NaNs.
97
              Ideally scaled before performing test.
98
    
Coding Style: Trailing whitespace
99
    Returns
100
    -------
101
    mo : KMO test value
102
    
Coding Style: Trailing whitespace
103
    """
104
    
Coding Style: Trailing whitespace
105
    #Correlation matrix and the partial correlation matrix.
106
    corrmatrix = np.corrcoef(dataset.transpose())
107
    pcorr = partial_corr(dataset)
108
109
    #Calculation of the KMO statistic
110
    matrix = corrmatrix*corrmatrix
111
    rows = matrix.shape[0]
112
    cols = matrix.shape[1]
113
    rij = 0
114
    uij = 0
115
    for row in range(0, rows):
116
        for col in range(0, cols):
117
            if not row == col:
118
                rij = rij + matrix[row, col]
119
                uij = uij + pcorr[row, col]
120
121
    mo = rij/(rij+uij)
Coding Style Naming: The name mo does not conform to the variable naming conventions ((([a-z][a-z0-9_]{2,30})|(_[a-z0-9_]*))$).
122
    print(mo)
123
    return mo
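
A note beyond the style checks: the KMO statistic is usually defined with both sums squared, sum(r_ij^2) / (sum(r_ij^2) + sum(u_ij^2)) over i != j, whereas the loop above adds the partial correlations `pcorr[row, col]` unsquared. The sketch below follows the squared definition. It is not the author's method: the name `kmo_sketch` is hypothetical, and the partial correlations are read from the inverse of the correlation matrix instead of calling `partial_corr`, which assumes that matrix is invertible.

```python
import numpy as np


def kmo_sketch(data):
    """Kaiser-Meyer-Olkin measure for an (n, p) array without NaNs."""
    data = np.asarray(data, dtype=float)
    corr = np.corrcoef(data, rowvar=False)

    # Partial correlation given all other variables, from the precision matrix:
    # u_ij = -P_ij / sqrt(P_ii * P_jj)
    precision = np.linalg.inv(corr)
    diag = np.diag(precision)
    pcorr = -precision / np.sqrt(np.outer(diag, diag))

    # Sum the squared off-diagonal entries of both matrices.
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diag] ** 2)
    u2 = np.sum(pcorr[off_diag] ** 2)
    return r2 / (r2 + u2)
```

Because both sums are non-negative, the result always falls between 0 and 1, which matches the interpretation table in the docstring.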
124
125
126
def pca_analysis(dataset, dropcols=[], imputenans=True, scale=True, n_components=5):
Duplication: This code seems to be duplicated in your project.
Bug Best Practice: The default value [] might cause unintended side-effects.

Objects as default values are only created once in Python and not on each invocation of the function. If the default object is modified, this modification is carried over to the next invocation of the function.

# Bad:
# If array_param is modified inside the function, the next invocation will
# receive the modified object.
def some_function(array_param=[]):
    ...

# Better: Create a new list on each invocation
def some_function(array_param=None):
    if array_param is None:
        array_param = []
    # ...
Comprehensibility: This function exceeds the maximum number of variables (26/15).
127
    """
128
    Performs a principal component analysis on an input dataset
129
    
Coding Style: Trailing whitespace
130
    Parameters
131
    ----------
132
    dataset : pandas dataframe of shape (n, p)
133
        Input dataset with n samples and p features
134
    dropcols : list
135
        Columns to exclude from pca analysis. At a minimum, user must exclude
136
        non-numeric columns.
137
    imputenans : boolean
138
        If True, impute NaN values as column means.
139
    scale : boolean
140
        If True, columns will be scaled to a mean of zero and a standard deviation of 1.
141
    n_components : integer
142
        Desired number of components in principal component analysis.
143
    
Coding Style: Trailing whitespace
144
    Returns
145
    -------
146
    dataset_scaled : numpy array of shape (n, p)
147
        Scaled dataset with n samples and p features
148
    dataset_pca : Pandas dataframe of shape (n, n_components)
149
        Output array of n_component features of each original sample
150
    dataset_final : Pandas dataframe of shape (n, p+n_components)
151
        Output array with principal components appended to original array.
152
    prcs : Pandas dataframe of shape (5, n_components)
153
        Output array displaying the top 5 features contributing to each
154
        principal component.
155
    prim_vals : Dictionary of lists
156
        Output dictionary of the pca scores for the top 5 features
157
        contributing to each principal component.
158
    components : Pandas dataframe of shape (p, n_components)
159
        Raw pca scores.
160
        
Coding Style: Trailing whitespace
161
    Examples
162
    --------
163
    
Coding Style: Trailing whitespace
164
    """
165
    
Coding Style: Trailing whitespace
166
    dataset_num = dataset.drop(dropcols, axis=1)
Unused Code: The variable dataset_num seems to be unused.
167
    dataset_raw = dataset.as_matrix()
168
    
Coding Style: Trailing whitespace
169
    if imputenans:
170
        imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
171
        imp.fit(dataset_raw)
172
        dataset_clean = imp.transform(dataset_raw)
173
    else:
174
        dataset_clean = dataset_raw
175
        
Coding Style: Trailing whitespace
176
    if scale:
177
        scaler = stscale()
178
        scaler.fit(dataset_clean)
179
        dataset_scaled = scaler.transform(dataset_clean)
180
    else:
181
        dataset_scaled = dataset_clean
182
    
Coding Style: Trailing whitespace
183
    pca1 = pca(n_components=n_components)
184
    pca1.fit(dataset_scaled)
185
    
Coding Style: Trailing whitespace
186
    #Cumulative explained variance ratio
187
    x = 0
Coding Style Naming: The name x does not conform to the variable naming conventions ((([a-z][a-z0-9_]{2,30})|(_[a-z0-9_]*))$).
188
    explained_v = pca1.explained_variance_ratio_
189
    print('Cumulative explained variance:')
190
    for i in range(0, n_components):
191
        x = x + explained_v[i]
Coding Style Naming: The name x does not conform to the variable naming conventions ((([a-z][a-z0-9_]{2,30})|(_[a-z0-9_]*))$).
192
        print('{} component: {}'.format(i, x))
193
    
Coding Style: Trailing whitespace
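
The running total on lines 187 to 192 can also be computed in one step with `np.cumsum`, which removes the short variable name `x` flagged above. A small sketch; the ratios are illustrative values standing in for `pca1.explained_variance_ratio_`:

```python
import numpy as np

explained_v = np.array([0.5, 0.25, 0.125])  # illustrative values only
cumulative = np.cumsum(explained_v)

print('Cumulative explained variance:')
for i, total in enumerate(cumulative):
    print('{} component: {}'.format(i, total))
```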
194
    prim_comps = {}
195
    prim_vals = {}
196
    comps = pca1.components_
197
    components = pd.DataFrame(comps.transpose())
198
199
    for num in range(0, n_components):
200
        highest = np.abs(components[num]).as_matrix().argsort()[-5:][::-1]
201
        pels = []
202
        prim_vals[num] = components[num].as_matrix()[highest]
203
        for col in highest:
204
            pels.append(dataset.columns[col])
205
        prim_comps[num] = pels
206
    
Coding Style: Trailing whitespace
207
    #Main contributors to each principal component
208
    prcs = pd.DataFrame.from_dict(prim_comps)
209
    
Coding Style: Trailing whitespace
210
    dataset_pca = pd.DataFrame(pca1.transform(dataset_scaled))
211
    dataset_final = pd.concat([dataset, dataset_pca], axis=1)
212
    
Coding Style: Trailing whitespace
213
    return dataset_scaled, dataset_pca, dataset_final, prcs, prim_vals, components
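
Two calls in this function no longer exist in current library releases: `DataFrame.as_matrix()` was removed in pandas 1.0, and `sklearn.preprocessing.Imputer` was removed in scikit-learn 0.22 in favour of `sklearn.impute.SimpleImputer`. The sketch below reproduces only the impute-and-scale steps with plain NumPy, as a stand-in and not as the author's implementation; `impute_and_scale` is a hypothetical helper name.

```python
import numpy as np


def impute_and_scale(values, imputenans=True, scale=True):
    """Fill NaNs with column means, then standardize each column."""
    values = np.array(values, dtype=float)  # copy, so the caller's data is untouched
    if imputenans:
        col_means = np.nanmean(values, axis=0)
        rows, cols = np.where(np.isnan(values))
        values[rows, cols] = col_means[cols]
    if scale:
        std = values.std(axis=0)  # population standard deviation (ddof=0)
        std[std == 0] = 1.0  # constant columns become zeros instead of dividing by 0
        values = (values - values.mean(axis=0)) / std
    return values
```

Passing `dataset.drop(dropcols, axis=1).to_numpy()` into such a helper would also put the `dropcols` argument to use, which the listing above computes into `dataset_num` but never reads.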
214
215
216
def plot_pca(datasets, figsize=(8, 8), lwidth=8.0,
Duplication: This code seems to be duplicated in your project.
Bug Best Practice: The default value [] might cause unintended side-effects.
best-practice: Too many arguments (6/5)
Comprehensibility: This function exceeds the maximum number of variables (17/15).
217
             labels = ['Sample1', 'Sample2'], savefig=True, filename='test.png'):
Coding Style: No space allowed around keyword argument assignment
218
    
Coding Style: Trailing whitespace
219
    """
220
    Plots the average output features from a PCA analysis in polar coordinates.
221
    
Coding Style: Trailing whitespace
222
    Parameters
223
    ----------
224
    datasets : dictionary (keys = n) of numpy arrays of shape p
225
        Dictionary with n samples and p features to plot.
226
    figsize : tuple
227
        Dimensions of output figure e.g. (8, 8)
228
    lwidth : float
229
        Width of plotted lines in figure
230
    labels : list of string
231
        Labels to display in legend.
232
    savefig : boolean
233
        If True, saves figure
234
    filename : string
235
        Desired output filename
236
        
Coding Style: Trailing whitespace
237
    Returns
238
    -------
239
    
Coding Style: Trailing whitespace
240
    """
241
242
    fig = plt.figure(figsize=figsize)
Unused Code: The variable fig seems to be unused.
243
    for key in datasets:
244
        N = datasets[key].shape[0]
Coding Style Naming: The name N does not conform to the variable naming conventions ((([a-z][a-z0-9_]{2,30})|(_[a-z0-9_]*))$).
245
    width = (2*np.pi) / N
The variable N does not seem to be defined in case the for loop on line 243 is not entered. Are you sure this can never be the case?
Unused Code: The variable width seems to be unused.
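
One way to address the undefined-`N` note, sketched under the assumption that every array in `datasets` is meant to have the same length: read the feature count once and fail early on an empty dictionary. `feature_count` is a hypothetical helper name, not part of the reviewed module.

```python
import numpy as np


def feature_count(datasets):
    """Return the shared length of the arrays in `datasets`."""
    if not datasets:
        raise ValueError('datasets must contain at least one array')
    # Collect the distinct lengths; more than one means the inputs disagree.
    sizes = {np.asarray(values).shape[0] for values in datasets.values()}
    if len(sizes) != 1:
        raise ValueError('all arrays in datasets must have the same length')
    return sizes.pop()
```

The loop on lines 243 and 244 would then reduce to a single call, and the unused `width` assignment on line 245 could simply be deleted.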
246
    color=iter(cm.viridis(np.linspace(0,1,N)))
Coding Style: Exactly one space required around assignment
Coding Style: Exactly one space required after comma
Bug: The Module matplotlib.cm does not seem to have a member named viridis.

This check looks for calls to members that are non-existent. These calls will fail.

The member could have been renamed or removed.
247
    
Coding Style: Trailing whitespace
248
    theta = np.linspace(0.0, 2 * np.pi, N+1, endpoint=True)
249
    radii = {}
250
    bars = {}
251
    
Coding Style: Trailing whitespace
252
    ax = plt.subplot(111, polar=True)
Coding Style Naming: The name ax does not conform to the variable naming conventions ((([a-z][a-z0-9_]{2,30})|(_[a-z0-9_]*))$).
253
    counter = 0
254
    for key in datasets:
255
        c=next(color)
Coding Style: Exactly one space required around assignment
Coding Style Naming: The name c does not conform to the variable naming conventions ((([a-z][a-z0-9_]{2,30})|(_[a-z0-9_]*))$).
256
        radii[key] = np.append(datasets[key], datasets[key][0]) 
Coding Style: Trailing whitespace
257
        bars[key] = ax.plot(theta, radii[key], linewidth=lwidth, color=c, label=labels[counter])
258
        counter = counter + 1
259
    plt.legend(bbox_to_anchor=(0.90, 1), loc=2, borderaxespad=0., frameon=False, fontsize=20)
260
261
    # # Use custom colors and opacity
262
    # for r, bar in zip(radii, bars):
263
    #     bar.set_facecolor(plt.cm.jet(np.abs(r / 2.5)))
264
    #     bar.set_alpha(0.8)
265
    ax.set_xticks(np.pi/180. * np.linspace(0, 360, N, endpoint=False))
266
    ax.set_xticklabels(list(range(0, N)))
267
    
Coding Style: Trailing whitespace
268
    if savefig:
269
        plt.savefig(filename, bbox_inches='tight')
270
    
Coding Style: Trailing whitespace
271
    plt.show()
Coding Style: Final newline missing