Conditions | 3 |
Total Lines | 76 |
Code Lines | 19 |
Lines | 0 |
Ratio | 0 % |
Changes | 0 |
Small methods make your code easier to understand, particularly when they are combined with a good name. A small method is also usually much easier to name.
For example, if you find yourself adding comments to a method's body, this is usually a good sign that the commented part should be extracted into a new method. The comment itself is a useful starting point when coming up with a name for that method.
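The comment-to-method step described above can be sketched as follows. This is only an illustration: the functions `report_before`, `report_after` and `paid_total`, as well as the order data, are hypothetical and not taken from the analyzed code.

```python
def report_before(orders):
    # compute total of paid orders
    total = 0.0
    for order in orders:
        if order["paid"]:
            total += order["amount"]
    return f"Paid total: {total:.2f}"


def paid_total(orders):
    """Sum the amounts of all paid orders (name derived from the old comment)."""
    return sum(order["amount"] for order in orders if order["paid"])


def report_after(orders):
    # The commented block is now a named method, so the comment is no longer needed.
    return f"Paid total: {paid_total(orders):.2f}"
```

Both versions return the same string; the second one moves the explanation from a comment into a method name that can be reused and tested on its own.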
Commonly applied refactorings include:
If many parameters/temporary variables are present:
Methods with many parameters are not only hard to understand; their parameters also often become inconsistent when you need more or different data.
There are several approaches to avoid long parameter lists:
def data_cleaning(data, drop_threshold_cols=0.95, drop_threshold_rows=0.95, drop_duplicates=True,
                  convert_dtypes=True, category=True, cat_threshold=0.03, cat_exclude=None, show='changes'):
    '''
    Perform initial data cleaning tasks on a dataset, such as dropping single valued and empty rows, empty \
    columns as well as optimizing the datatypes.

    Parameters
    ----------
    data: 2D dataset that can be coerced into Pandas DataFrame.

    drop_threshold_cols: float, default 0.95
        Drop columns with NA-ratio above the specified threshold.

    drop_threshold_rows: float, default 0.95
        Drop rows with NA-ratio above the specified threshold.

    drop_duplicates: bool, default True
        Drops duplicate rows, keeping the first occurrence. This step comes after the dropping of missing values.

    convert_dtypes: bool, default True
        Convert dtypes using pd.DataFrame.convert_dtypes().

    category: bool, default True
        Change dtypes of columns to "category". Set threshold using cat_threshold. Requires convert_dtypes=True

    cat_threshold: float, default 0.03
        Ratio of unique values below which categories are inferred and column dtype is changed to categorical.

    cat_exclude: list, default None
        List of columns to exclude from categorical conversion.

    show: {'all', 'changes', None} default 'changes'
        Specify verbosity of the output.
        * 'all': Print information about the data before and after cleaning as well as information about changes.
        * 'changes': Print out differences in the data before and after cleaning.
        * None: No information about the data and the data cleaning is printed.

    Returns
    -------
    data_cleaned: Pandas DataFrame

    See Also
    --------
    convert_datatypes: Converts columns to best possible dtypes.
    drop_missing : Flexibly drops columns and rows.
    _memory_usage: Gives the total memory usage in kilobytes.
    _missing_vals: Metrics about missing values in the dataset.

    Notes
    -----
    The category dtype is not grouped in the summary, unless it contains exactly the same categories.

    '''

    # Validate Inputs
    _validate_input_range(drop_threshold_cols, 'drop_threshold_cols', 0, 1)
    _validate_input_range(drop_threshold_rows, 'drop_threshold_rows', 0, 1)
    _validate_input_bool(drop_duplicates, 'drop_duplicates')
    _validate_input_bool(convert_dtypes, 'convert_dtypes')
    _validate_input_bool(category, 'category')
    _validate_input_range(cat_threshold, 'cat_threshold', 0, 1)

    # Work on a copy so the caller's data is left untouched
    data = pd.DataFrame(data).copy()
    data_cleaned = drop_missing(data, drop_threshold_cols, drop_threshold_rows)
    # Columns holding a single value (NaN counted as a value) carry no information
    single_val_cols = data_cleaned.columns[data_cleaned.nunique(dropna=False) == 1].tolist()
    data_cleaned = data_cleaned.drop(columns=single_val_cols)

    # Defined up front so the report call below does not fail when drop_duplicates is False
    dupl_rows = None

    if drop_duplicates:
        data_cleaned, dupl_rows = _drop_duplicates(data_cleaned)
    if convert_dtypes:
        data_cleaned = convert_datatypes(data_cleaned, category=category, cat_threshold=cat_threshold,
                                         cat_exclude=cat_exclude)

    _diff_report(data, data_cleaned, dupl_rows=dupl_rows, single_val_cols=single_val_cols, show=show)

    return data_cleaned
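One way to shorten a parameter list like the one above is to group parameters that belong together into a single parameter object. The sketch below does this for the three category-related settings (`category`, `cat_threshold`, `cat_exclude`). The class `CategoryOptions` and the helper `columns_to_convert` are hypothetical names introduced for illustration; they are not part of the analyzed code, and the helper works on a plain mapping of column name to unique-value ratio rather than on a DataFrame.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CategoryOptions:
    """Hypothetical parameter object bundling the category-related settings."""
    enabled: bool = True
    threshold: float = 0.03
    exclude: Optional[Sequence[str]] = None

    def __post_init__(self):
        # Validate once, where the values are defined, instead of in every caller.
        if not 0 <= self.threshold <= 1:
            raise ValueError("threshold must be between 0 and 1")


def columns_to_convert(unique_ratios, options=CategoryOptions()):
    """Return the columns whose unique-value ratio is below the threshold.

    unique_ratios: mapping of column name -> ratio of unique values
    (an illustrative stand-in for values computed from a DataFrame).
    """
    if not options.enabled:
        return []
    excluded = set(options.exclude or [])
    return [name for name, ratio in unique_ratios.items()
            if ratio < options.threshold and name not in excluded]
```

A caller then passes one argument instead of three, for example `columns_to_convert(ratios, CategoryOptions(exclude=["zip"]))`. Because the settings travel together, adding a further category option later would not change the function signature.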