Conditions | 9 |
Total Lines | 67 |
Code Lines | 25 |
Lines | 0 |
Ratio | 0 % |
Changes | 0 |
Small methods make your code easier to understand, particularly when combined with a good name. A small method is also usually much easier to name well.
For example, if you find yourself adding comments inside a method's body, this is usually a sign that the commented part should be extracted into a new method, with the comment serving as a starting point for that method's name.
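As a minimal sketch of this idea, the Python example below shows a function whose body is split up by comments, followed by a version in which each commented part has become its own small function. All names and the order data layout are hypothetical and chosen only for illustration.

```python
def summarize_orders_before(orders):
    """Original version: two commented steps inside one body."""
    # keep orders with items
    valid = []
    for order in orders:
        if order["items"]:
            valid.append(order)
    # sum up item prices
    total = 0.0
    for order in valid:
        for price in order["items"]:
            total += price
    return total


def keep_orders_with_items(orders):
    """Extracted from the block under '# keep orders with items'."""
    return [order for order in orders if order["items"]]


def sum_item_prices(orders):
    """Extracted from the block under '# sum up item prices'."""
    return sum(price for order in orders for price in order["items"])


def summarize_orders(orders):
    """Refactored version: the comments have turned into function names."""
    return sum_item_prices(keep_orders_with_items(orders))
```

Both versions return the same result; the refactored one simply reads as a sequence of named steps.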
Commonly applied refactorings include:
If many parameters/temporary variables are present:
def mv_col_handler(data, target=None, mv_threshold=0.1, corr_thresh_features=0.6, corr_thresh_target=0.3):
    '''
    Converts columns with a high ratio of missing values into binary features and eventually drops them based on \
    their correlation with other features and the target variable. This function follows a three step process:
    - 1) Identify features with a high ratio of missing values
    - 2) Identify high correlations of these features among themselves and with other features in the dataset.
    - 3) Features with high ratio of missing values and high correlation among each other are dropped unless \
         they correlate reasonably well with the target variable.

    Parameters
    ----------
    data: 2D dataset that can be coerced into Pandas DataFrame.

    target: string, list, np.array or pd.Series, default None
        Specify target for correlation. E.g. label column to generate only the correlations between each feature \
        and the label.

    mv_threshold: float, default 0.1
        Value between 0 <= threshold <= 1. Features with a missing-value-ratio larger than mv_threshold are candidates \
        for dropping and undergo further analysis.

    corr_thresh_features: float, default 0.6
        Value between 0 <= threshold <= 1. Maximum correlation a previously identified feature with a high mv-ratio is \
        allowed to have with another feature. If this threshold is overstepped, the feature undergoes further analysis.

    corr_thresh_target: float, default 0.3
        Value between 0 <= threshold <= 1. Minimum required correlation of a remaining feature (i.e. feature with a \
        high mv-ratio and high correlation to another existing feature) with the target. If this threshold is not met \
        the feature is ultimately dropped.

    Returns
    -------
    data: Updated Pandas DataFrame
    drop_cols: List of dropped columns
    '''

    # Validate Inputs
    _validate_input_range(mv_threshold, 'mv_threshold', 0, 1)
    _validate_input_range(corr_thresh_features, 'corr_thresh_features', 0, 1)
    _validate_input_range(corr_thresh_target, 'corr_thresh_target', 0, 1)

    data = pd.DataFrame(data).copy()
    mv_ratios = _missing_vals(data)['mv_cols_ratio']
    cols_mv = mv_ratios[mv_ratios > mv_threshold].index.tolist()
    data_mv_binary = data[cols_mv].applymap(lambda x: 1 if not pd.isnull(x) else x).fillna(0)

    for col in cols_mv:
        data[col] = data_mv_binary[col]

    high_corr_features = []
    data_temp = data.copy()
    for col in cols_mv:
        corrmat = corr_mat(data_temp, colored=False)
        if abs(corrmat[col]).nlargest(2)[1] > corr_thresh_features:
            high_corr_features.append(col)
            data_temp = data_temp.drop(columns=[col])

    drop_cols = []
    if target is None:
        data = data_temp
    else:
        for col in high_corr_features:
            if pd.DataFrame(data_mv_binary[col]).corrwith(target)[0] < corr_thresh_target:
                drop_cols.append(col)
                data = data.drop(columns=[col])

    return data, drop_cols
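The function in the listing takes three threshold parameters and validates each of them under a "Validate Inputs" comment. One refactoring often used when many parameters are present is to bundle related parameters into a single object that validates itself. The sketch below is only an illustration of that idea: the class name `MvThresholds` is hypothetical and not part of the original code, and only the parameter names and defaults are taken from the listing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MvThresholds:
    """Hypothetical parameter object bundling the three thresholds."""

    mv_threshold: float = 0.1
    corr_thresh_features: float = 0.6
    corr_thresh_target: float = 0.3

    def __post_init__(self):
        # Validate every threshold once, when the object is created,
        # instead of repeating the range check inside the long function.
        for name in ("mv_threshold", "corr_thresh_features", "corr_thresh_target"):
            value = getattr(self, name)
            if not 0 <= value <= 1:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")


# Usage: defaults match the listing; invalid values fail early.
thresholds = MvThresholds(mv_threshold=0.2)
```

A function that accepted such an object would need one parameter instead of three, and the validation step would no longer occupy lines in its body.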