Passed: Push to master ( 99c19c...2e344c ) by Andreas, created 01:12

klib.clean.data_cleaning() (rating: A)

Complexity: Conditions 3
Size: Total Lines 76, Code Lines 19
Duplication: Lines 0, Ratio 0 %
Importance: Changes 0
Metric   Value
cc       3       (cyclomatic complexity; cf. Conditions above)
eloc     19      (code lines)
nop      9       (number of parameters)
dl       0       (duplication lines)
loc      76      (total lines)
rs       9.45
c        0
b        0
f        0

How to fix: Long Method, Many Parameters

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include Extract Method, i.e. moving a cohesive part of the body into a new, well-named method as described above.
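
As a rough, hypothetical illustration of Extract Method (the helper names below are made up and not part of klib), a long cleaning routine can be split into small, well-named steps:

import pandas as pd


def _drop_empty(df):
    # Remove rows and columns that contain only missing values.
    return df.dropna(axis=0, how='all').dropna(axis=1, how='all')


def _to_category(df, threshold=0.05):
    # Convert low-cardinality object columns to the 'category' dtype.
    for col in df.select_dtypes(include='object').columns:
        if df[col].nunique(dropna=False) / max(len(df), 1) < threshold:
            df[col] = df[col].astype('category')
    return df


def clean(df):
    # After extraction, the top-level method reads like a summary of its steps,
    # and each helper name replaces what would otherwise be an inline comment.
    return _to_category(_drop_empty(df.copy()))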

Many Parameters

Methods with many parameters are not only hard to understand; their parameter lists also tend to become inconsistent whenever more, or different, data is needed.

There are several approaches to avoid long parameter lists, such as grouping related arguments into a parameter object or passing a whole object instead of individual values; a sketch of the parameter-object approach follows below.
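
A minimal sketch of the parameter-object approach, assuming a hypothetical CleaningOptions class (this is not part of klib's API); the nine keyword arguments of data_cleaning() shown further down could be grouped into a single object:

from dataclasses import dataclass
from typing import List, Optional

from klib.clean import data_cleaning


@dataclass
class CleaningOptions:
    # Hypothetical container that groups the cleaning settings in one place.
    drop_threshold_cols: float = 0.95
    drop_threshold_rows: float = 0.95
    drop_duplicates: bool = True
    convert_dtypes: bool = True
    category: bool = True
    cat_threshold: float = 0.03
    cat_exclude: Optional[List[str]] = None
    show: Optional[str] = 'changes'


def clean_with_options(data, options=None):
    # One coherent object travels through the call chain instead of nine loose arguments.
    opts = options if options is not None else CleaningOptions()
    return data_cleaning(data,
                         drop_threshold_cols=opts.drop_threshold_cols,
                         drop_threshold_rows=opts.drop_threshold_rows,
                         drop_duplicates=opts.drop_duplicates,
                         convert_dtypes=opts.convert_dtypes,
                         category=opts.category,
                         cat_threshold=opts.cat_threshold,
                         cat_exclude=opts.cat_exclude,
                         show=opts.show)
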

'''
Functions for data cleaning.

:author: Andreas Kanz

'''

# Imports
import pandas as pd

from .utils import _diff_report
from .utils import _drop_duplicates
from .utils import _missing_vals
from .utils import _validate_input_range
from .utils import _validate_input_bool

def convert_datatypes(data, category=True, cat_threshold=0.05, cat_exclude=None):
    '''
    Converts columns to best possible dtypes using dtypes supporting pd.NA.

    Parameters
    ----------
    data: 2D dataset that can be coerced into Pandas DataFrame.

    category: bool, default True
        Change dtypes of columns with dtype "object" to "category". Set threshold using cat_threshold or exclude \
        columns using cat_exclude.

    cat_threshold: float, default 0.05
        Ratio of unique values below which categories are inferred and column dtype is changed to categorical.

    cat_exclude: list, default None
        List of columns to exclude from categorical conversion.

    Returns
    -------
    data: Pandas DataFrame

    '''

    # Validate Inputs
    _validate_input_bool(category, 'Category')
    _validate_input_range(cat_threshold, 'cat_threshold', 0, 1)

    cat_exclude = [] if cat_exclude is None else cat_exclude.copy()

    data = pd.DataFrame(data).copy()
    for col in data.columns:
        unique_vals_ratio = data[col].nunique(dropna=False) / data.shape[0]
        if (category and
            unique_vals_ratio < cat_threshold and
            col not in cat_exclude and
                data[col].dtype == 'object'):
            data[col] = data[col].astype('category')
        data[col] = data[col].convert_dtypes()

    return data
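
A small usage sketch for convert_datatypes; the example frame is made up and the function is assumed to be importable from klib.clean:

import pandas as pd
from klib.clean import convert_datatypes

df = pd.DataFrame({'city': ['Berlin', 'Berlin', 'Hamburg', 'Berlin'],
                   'visits': [1, 2, None, 4]})

# With a unique-value ratio of 0.5, 'city' falls below the threshold of 0.8 and becomes 'category';
# 'visits' is converted to the nullable Int64 dtype, so the missing value is kept as pd.NA.
converted = convert_datatypes(df, category=True, cat_threshold=0.8)
print(converted.dtypes)
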
def drop_missing(data, drop_threshold_cols=1, drop_threshold_rows=1):
    '''
    Drops completely empty columns and rows by default and optionally provides flexibility to loosen restrictions to \
    drop additional columns and rows based on the fraction of remaining NA-values.

    Parameters
    ----------
    data: 2D dataset that can be coerced into Pandas DataFrame.

    drop_threshold_cols: float, default 1
        Drop columns with NA-ratio above the specified threshold.

    drop_threshold_rows: float, default 1
        Drop rows with NA-ratio above the specified threshold.

    Returns
    -------
    data_cleaned: Pandas DataFrame

    Notes
    -----
    Columns are dropped first. Rows are dropped based on the remaining data.

    '''

    # Validate Inputs
    _validate_input_range(drop_threshold_cols, 'drop_threshold_cols', 0, 1)
    _validate_input_range(drop_threshold_rows, 'drop_threshold_rows', 0, 1)

    data = pd.DataFrame(data).copy()
    data = data.dropna(axis=0, how='all').dropna(axis=1, how='all')
    data = data.drop(columns=data.loc[:, _missing_vals(data)['mv_cols_ratio'] > drop_threshold_cols].columns)
    data_cleaned = data.drop(index=data.loc[_missing_vals(data)['mv_rows_ratio'] > drop_threshold_rows, :].index)

    return data_cleaned
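
A usage sketch for drop_missing with made-up data (again assuming the import from klib.clean):

import numpy as np
import pandas as pd
from klib.clean import drop_missing

df = pd.DataFrame({'a': [1, 2, np.nan, np.nan],
                   'b': [np.nan, np.nan, np.nan, np.nan],   # completely empty column
                   'c': ['x', 'y', np.nan, 'z']})

# The defaults (threshold 1) only remove the completely empty row and column.
cleaned = drop_missing(df)

# A lower row threshold additionally drops rows whose NA-ratio exceeds it,
# here the last row, where half of the remaining values are missing.
cleaned_strict = drop_missing(df, drop_threshold_rows=0.4)
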
def data_cleaning(data, drop_threshold_cols=0.95, drop_threshold_rows=0.95, drop_duplicates=True,
                  convert_dtypes=True, category=True, cat_threshold=0.03, cat_exclude=None, show='changes'):
    '''
    Perform initial data cleaning tasks on a dataset, such as dropping single-valued and empty rows, empty \
        columns as well as optimizing the datatypes.

    Parameters
    ----------
    data: 2D dataset that can be coerced into Pandas DataFrame.

    drop_threshold_cols: float, default 0.95
        Drop columns with NA-ratio above the specified threshold.

    drop_threshold_rows: float, default 0.95
        Drop rows with NA-ratio above the specified threshold.

    drop_duplicates: bool, default True
        Drops duplicate rows, keeping the first occurrence. This step comes after the dropping of missing values.

    convert_dtypes: bool, default True
        Convert dtypes using pd.convert_dtypes().

    category: bool, default True
        Change dtypes of columns to "category". Set threshold using cat_threshold. Requires convert_dtypes=True.

    cat_threshold: float, default 0.03
        Ratio of unique values below which categories are inferred and column dtype is changed to categorical.

    cat_exclude: list, default None
        List of columns to exclude from categorical conversion.

    show: {'all', 'changes', None}, default 'changes'
        Specify verbosity of the output.
        * 'all': Print information about the data before and after cleaning as well as information about changes.
        * 'changes': Print out differences in the data before and after cleaning.
        * None: No information about the data and the data cleaning is printed.

    Returns
    -------
    data_cleaned: Pandas DataFrame

    See Also
    --------
    convert_datatypes: Converts columns to best possible dtypes.
    drop_missing: Flexibly drops columns and rows.
    _memory_usage: Gives the total memory usage in kilobytes.
    _missing_vals: Metrics about missing values in the dataset.

    Notes
    -----
    The category dtype is not grouped in the summary, unless it contains exactly the same categories.

    '''

    # Validate Inputs
    _validate_input_range(drop_threshold_cols, 'drop_threshold_cols', 0, 1)
    _validate_input_range(drop_threshold_rows, 'drop_threshold_rows', 0, 1)
    _validate_input_bool(drop_duplicates, 'drop_duplicates')
    _validate_input_bool(convert_dtypes, 'convert_datatypes')
    _validate_input_bool(category, 'category')
    _validate_input_range(cat_threshold, 'cat_threshold', 0, 1)

    data = pd.DataFrame(data).copy()
    data_cleaned = drop_missing(data, drop_threshold_cols, drop_threshold_rows)
    single_val_cols = data_cleaned.columns[data_cleaned.nunique(dropna=False) == 1].tolist()
    data_cleaned = data_cleaned.drop(columns=single_val_cols)

    # Initialize dupl_rows so the report below does not raise a NameError when drop_duplicates=False
    dupl_rows = None
    if drop_duplicates:
        data_cleaned, dupl_rows = _drop_duplicates(data_cleaned)
    if convert_dtypes:
        data_cleaned = convert_datatypes(data_cleaned, category=category, cat_threshold=cat_threshold,
                                         cat_exclude=cat_exclude)

    _diff_report(data, data_cleaned, dupl_rows=dupl_rows, single_val_cols=single_val_cols, show=show)

    return data_cleaned
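
Finally, a usage sketch for data_cleaning itself with made-up data (assuming klib is installed and importable as shown in the report title):

import numpy as np
import pandas as pd
from klib.clean import data_cleaning

df = pd.DataFrame({'id': [1, 1, 2, 3],
                   'constant': ['same'] * 4,    # single-valued column, gets dropped
                   'empty': [np.nan] * 4,       # completely empty column, gets dropped
                   'value': [10.0, 10.0, np.nan, 12.5]})

# Drops the empty and single-valued columns, removes the duplicated row (keeping the first occurrence),
# converts the remaining columns to nullable dtypes and prints a summary of the changes.
df_cleaned = data_cleaning(df, show='changes')
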