Conditions | 6 |
Total Lines | 88 |
Code Lines | 36 |
Lines | 88 |
Ratio | 100 % |
Changes | 0 |
Small methods make your code easier to understand, particularly when combined with a good name. Besides, if your method is small, finding a good name is usually much easier.
For example, if you find yourself adding comments to a method's body, this is usually a good sign that you should extract the commented part into a new method, using the comment as a starting point for the new method's name.
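As a rough sketch of this advice, the block commented "Cumulative explained variance ratio" in the listing further down could be pulled out into its own functions, with the comment supplying the name. The function names here are hypothetical and not part of the project; the input is assumed to be a plain sequence of ratios such as `explained_variance_ratio_`.

```python
from itertools import accumulate


def cumulative_explained_variance(explained_variance_ratio):
    """Return the running total of the explained variance ratios."""
    return list(accumulate(explained_variance_ratio))


def report_cumulative_explained_variance(explained_variance_ratio):
    """Print the cumulative explained variance after each component."""
    print('Cumulative explained variance:')
    totals = cumulative_explained_variance(explained_variance_ratio)
    for i, total in enumerate(totals):
        print('{} component: {}'.format(i, total))
```

Splitting the calculation from the printing keeps each function short and makes the calculation easy to test on its own.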
Commonly applied refactorings include:
If many parameters/temporary variables are present:
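One refactoring that may fit this case is grouping related options into a single parameter object. The sketch below is only an illustration: `PcaOptions` and `describe_run` are hypothetical names, and the fields simply mirror the keyword arguments of `pca_analysis` in the listing below.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class PcaOptions:
    """Hypothetical parameter object grouping the options of a PCA run."""
    dropcols: List[str] = field(default_factory=list)
    imputenans: bool = True
    scale: bool = True
    n_components: int = 5


def describe_run(options):
    """Summarize the enabled steps; stands in for code that uses the options."""
    steps = []
    if options.imputenans:
        steps.append('impute')
    if options.scale:
        steps.append('scale')
    steps.append('pca({})'.format(options.n_components))
    return ' -> '.join(steps)
```

A function that accepts one `PcaOptions` value instead of four separate keyword arguments has a shorter signature, and the options can be passed along to helper functions as a unit.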
# Imports assumed from the names used below; the file's own import block is not shown.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA as pca
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler as stscale


def pca_analysis(dataset, dropcols=[], imputenans=True, scale=True, n_components=5):
    """
    Performs a principal component analysis on an input dataset

    Parameters
    ----------
    dataset : pandas dataframe of shape (n, p)
        Input dataset with n samples and p features
    dropcols : list
        Columns to exclude from pca analysis. At a minimum, user must exclude
        non-numeric columns.
    imputenans : boolean
        If True, impute NaN values as column means.
    scale : boolean
        If True, columns will be scaled to a mean of zero and a standard deviation of 1.
    n_components : integer
        Desired number of components in principal component analysis.

    Returns
    -------
    dataset_scaled : numpy array of shape (n, p)
        Scaled dataset with n samples and p features
    dataset_pca : Pandas dataframe of shape (n, n_components)
        Output array of n_component features of each original sample
    dataset_final : Pandas dataframe of shape (n, p+n_components)
        Output array with principal components appended to original array.
    prcs : Pandas dataframe of shape (5, n_components)
        Output array displaying the top 5 features contributing to each
        principal component.
    prim_vals : Dictionary of lists
        Output dictionary of the pca scores for the top 5 features
        contributing to each principal component.
    components : Pandas dataframe of shape (p, n_components)
        Raw pca scores.

    Examples
    --------

    """

    dataset_num = dataset.drop(dropcols, axis=1)
    # Use the filtered frame: the original read from `dataset`, which ignored dropcols.
    # `.values` replaces `as_matrix()`, which was removed in pandas 1.0.
    dataset_raw = dataset_num.values

    if imputenans:
        # SimpleImputer replaces the Imputer class removed in scikit-learn 0.22.
        imp = SimpleImputer(missing_values=np.nan, strategy='mean')
        imp.fit(dataset_raw)
        dataset_clean = imp.transform(dataset_raw)
    else:
        dataset_clean = dataset_raw

    if scale:
        scaler = stscale()
        scaler.fit(dataset_clean)
        dataset_scaled = scaler.transform(dataset_clean)
    else:
        dataset_scaled = dataset_clean

    pca1 = pca(n_components=n_components)
    pca1.fit(dataset_scaled)

    # Cumulative explained variance ratio
    x = 0
    explained_v = pca1.explained_variance_ratio_
    print('Cumulative explained variance:')
    for i in range(0, n_components):
        x = x + explained_v[i]
        print('{} component: {}'.format(i, x))

    prim_comps = {}
    prim_vals = {}
    comps = pca1.components_
    components = pd.DataFrame(comps.transpose())

    for num in range(0, n_components):
        # Indices of the 5 largest absolute loadings, largest first
        highest = np.abs(components[num]).values.argsort()[-5:][::-1]
        pels = []
        prim_vals[num] = components[num].values[highest]
        for col in highest:
            # Look names up in the filtered frame so indices match the PCA input
            pels.append(dataset_num.columns[col])
        prim_comps[num] = pels

    # Main contributors to each principal component
    prcs = pd.DataFrame.from_dict(prim_comps)

    dataset_pca = pd.DataFrame(pca1.transform(dataset_scaled))
    dataset_final = pd.concat([dataset, dataset_pca], axis=1)

    return dataset_scaled, dataset_pca, dataset_final, prcs, prim_vals, components

# (lines 215-270 of the file are not shown)
plt.show()
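The loop in `pca_analysis` that collects the main contributors to each component is another candidate for extraction. The following is a minimal sketch in plain Python rather than the NumPy `argsort` call used in the listing; `top_contributors` is a hypothetical helper name, and ties may be ordered differently than in the original code.

```python
def top_contributors(loadings, feature_names, k=5):
    """Return the k feature names with the largest absolute loading, largest first."""
    # Rank feature indices by the magnitude of their loading on this component.
    order = sorted(range(len(loadings)),
                   key=lambda i: abs(loadings[i]),
                   reverse=True)
    return [feature_names[i] for i in order[:k]]
```

With a helper like this, the body of the loop would shrink to one call per component, and the ranking logic could be checked independently of the PCA fit.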
The coding style of this project requires that you add a docstring to this code element. Below, you find an example for methods:
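A minimal illustration of docstrings on methods, following the one-line summary style described in PEP 257. The `Dataset` class is a hypothetical example and not part of this project.

```python
class Dataset:
    """Hold a list of numeric samples (hypothetical example class)."""

    def __init__(self, samples):
        """Store a copy of the given samples."""
        self.samples = list(samples)

    def mean(self):
        """Return the arithmetic mean of the samples."""
        return sum(self.samples) / len(self.samples)
```

Each docstring is the first statement in its method, is written as a short imperative sentence, and ends with a period.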
If you would like to know more about docstrings, we recommend reading PEP 257: Docstring Conventions.