| Metric | Value |
| --- | --- |
| Conditions | 6 |
| Total Lines | 88 |
| Code Lines | 36 |
| Lines | 88 |
| Ratio | 100 % |
| Changes | 0 |
Small methods make your code easier to understand, particularly when combined with a good name. Moreover, if your method is small, finding a good name is usually much easier.
For example, if you find yourself adding comments to a method's body, that is usually a sign that you should extract the commented part into a new method and use the comment as a starting point for naming it, as sketched below.
Commonly applied refactorings include Extract Method. If many parameters or temporary variables are present, refactorings such as Replace Temp with Query, Introduce Parameter Object, or Preserve Whole Object can also help.
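For instance, the short sketch below (with made-up names such as `summarize_orders` and `total_with_tax`, not taken from this project) shows how the comment on a block can become the name of an extracted method:

```python
# Before: one long method whose body needs a comment to stay readable.
def summarize_orders(orders):
    # compute the total price including tax
    total = 0.0
    for order in orders:
        total += order["price"] * (1 + order["tax_rate"])
    print("Total:", total)


# After: the commented block is extracted, and the comment supplies the name.
def total_with_tax(orders):
    return sum(order["price"] * (1 + order["tax_rate"]) for order in orders)


def summarize_orders_refactored(orders):
    print("Total:", total_with_tax(orders))
```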
| 1 | |||
```python
# NOTE: the import aliases below are inferred from how the names are used in
# this function; the deprecated sklearn Imputer is replaced by SimpleImputer.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA as pca
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler as stscale


def pca_analysis(dataset, dropcols=[], imputenans=True, scale=True, n_components=5):
    """
    Performs a principal component analysis on an input dataset.

    Parameters
    ----------
    dataset : pandas DataFrame of shape (n, p)
        Input dataset with n samples and p features.
    dropcols : list
        Columns to exclude from the PCA. At a minimum, the user must exclude
        non-numeric columns.
    imputenans : boolean
        If True, impute NaN values as column means.
    scale : boolean
        If True, columns are scaled to a mean of zero and a standard deviation of one.
    n_components : integer
        Desired number of components in the principal component analysis.

    Returns
    -------
    dataset_scaled : numpy array of shape (n, p)
        Scaled dataset with n samples and p features.
    dataset_pca : pandas DataFrame of shape (n, n_components)
        Output array of n_components features for each original sample.
    dataset_final : pandas DataFrame of shape (n, p + n_components)
        Output array with the principal components appended to the original array.
    prcs : pandas DataFrame of shape (5, n_components)
        Output array listing the top 5 features contributing to each
        principal component.
    prim_vals : dictionary of lists
        Output dictionary of the PCA scores for the top 5 features
        contributing to each principal component.
    components : pandas DataFrame of shape (p, n_components)
        Raw PCA scores.

    Examples
    --------

    """
    # Drop the excluded columns and work with the remaining (numeric) data.
    dataset_num = dataset.drop(dropcols, axis=1)
    dataset_raw = dataset_num.values

    # Replace NaNs with column means if requested.
    if imputenans:
        imp = SimpleImputer(missing_values=np.nan, strategy='mean')
        imp.fit(dataset_raw)
        dataset_clean = imp.transform(dataset_raw)
    else:
        dataset_clean = dataset_raw

    # Standardize columns to zero mean and unit variance if requested.
    if scale:
        scaler = stscale()
        scaler.fit(dataset_clean)
        dataset_scaled = scaler.transform(dataset_clean)
    else:
        dataset_scaled = dataset_clean

    pca1 = pca(n_components=n_components)
    pca1.fit(dataset_scaled)

    # Cumulative explained variance ratio
    x = 0
    explained_v = pca1.explained_variance_ratio_
    print('Cumulative explained variance:')
    for i in range(0, n_components):
        x = x + explained_v[i]
        print('{} component: {}'.format(i, x))

    prim_comps = {}
    prim_vals = {}
    comps = pca1.components_
    components = pd.DataFrame(comps.transpose())

    # For each component, find the five features with the largest absolute loadings.
    for num in range(0, n_components):
        highest = np.abs(components[num]).values.argsort()[-5:][::-1]
        pels = []
        prim_vals[num] = components[num].values[highest]
        for col in highest:
            pels.append(dataset_num.columns[col])
        prim_comps[num] = pels

    # Main contributors to each principal component
    prcs = pd.DataFrame.from_dict(prim_comps)

    dataset_pca = pd.DataFrame(pca1.transform(dataset_scaled))
    dataset_final = pd.concat([dataset, dataset_pca], axis=1)

    return dataset_scaled, dataset_pca, dataset_final, prcs, prim_vals, components
```
The coding style of this project requires that you add a docstring to this code element. Below you will find an example for methods.
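A minimal sketch of such a docstring (the method `rectangle_area` and its parameters are hypothetical; the project's actual template may differ):

```python
def rectangle_area(width, height):
    """Return the area of a rectangle.

    Parameters
    ----------
    width : float
        Width of the rectangle.
    height : float
        Height of the rectangle.

    Returns
    -------
    float
        The computed area.
    """
    return width * height
```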
If you would like to know more about docstrings, we recommend reading PEP-257: Docstring Conventions.