GitHub Access Token became invalid

The GitHub access token used to retrieve details about this repository from GitHub appears to have become invalid. This may prevent certain types of inspections from being run (in particular, everything related to pull requests).
Please ask an admin of your repository to renew the access token on this website.
Passed
Push — main (a91adb...84a88d) by Andreas, created 01:59

klib.clean   B

Complexity

Total Complexity 46

Size/Duplication

Total Lines 770
Duplicated Lines 0 %

Test Coverage

Coverage 67.04%

Importance

Changes 0
Metric Value
eloc 338
dl 0
loc 770
ccs 120
cts 179
cp 0.6704
rs 8.72
c 0
b 0
f 0
wmc 46

9 Methods

Rating   Name   Duplication   Size   Complexity  
A MVColHandler.__init__() 0 13 1
A DataCleaner.__init__() 0 23 1
A DataCleaner.transform() 0 15 1
A MVColHandler.fit() 0 2 1
A MVColHandler.transform() 0 14 1
A DataCleaner.fit() 0 2 1
A SubsetPooler.transform() 0 13 1
A SubsetPooler.fit() 0 2 1
A SubsetPooler.__init__() 0 11 1

8 Functions

Rating   Name   Duplication   Size   Complexity  
A optimize_ints() 0 5 1
A drop_missing() 0 57 2
A optimize_floats() 0 5 1
B convert_datatypes() 0 57 7
B data_cleaning() 0 114 4
B mv_col_handling() 0 90 7
C clean_column_names() 0 80 8
B pool_duplicate_subsets() 0 110 7

How to fix

Complexity

Complex classes like klib.clean often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common way to find such a component is to look for fields/methods that share the same prefixes or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring (a minimal sketch follows below). If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often the faster option.
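The sketch below is purely illustrative (the class names are hypothetical and not part of klib): a cohesive group of column-name helpers is extracted from a large class into its own component, and the original class delegates to it.

class ColumnNameHelper:
    """Cohesive component extracted from a larger class (illustrative only)."""

    def __init__(self, max_length: int = 25):
        self.max_length = max_length

    def normalize(self, name: str) -> str:
        # Lower-case the name and replace whitespace runs with underscores.
        return "_".join(name.strip().lower().split())

    def is_too_long(self, name: str) -> bool:
        return len(name) > self.max_length


class ReportCleaner:
    """The original class keeps its public API but delegates the extracted concern."""

    def __init__(self):
        self.names = ColumnNameHelper()

    def clean(self, columns):
        return [self.names.normalize(col) for col in columns]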

"""
Functions for data cleaning.

:author: Andreas Kanz
"""

# Imports
import itertools
import numpy as np
import pandas as pd
import re
from sklearn.base import BaseEstimator, TransformerMixin
from typing import List, Optional, Union

from klib.describe import corr_mat
from klib.utils import (
    _diff_report,
    _drop_duplicates,
    _missing_vals,
    _validate_input_bool,
    _validate_input_range,
)

__all__ = [
    "clean_column_names",
    "convert_datatypes",
    "data_cleaning",
    "drop_missing",
    "mv_col_handling",
]

def optimize_ints(data: Union[pd.Series, pd.DataFrame]) -> pd.DataFrame:
    """Downcast int64 columns to the smallest integer dtype that fits the data."""
    data = pd.DataFrame(data).copy()
    ints = data.select_dtypes(include=["int64"]).columns.tolist()
    data[ints] = data[ints].apply(pd.to_numeric, downcast="integer")
    return data


def optimize_floats(data: Union[pd.Series, pd.DataFrame]) -> pd.DataFrame:
    """Downcast float64 columns to the smallest float dtype that fits the data."""
    data = pd.DataFrame(data).copy()
    floats = data.select_dtypes(include=["float64"]).columns.tolist()
    data[floats] = data[floats].apply(pd.to_numeric, downcast="float")
    return data

def clean_column_names(data: pd.DataFrame, hints: bool = True) -> pd.DataFrame:
    """ Cleans the column names of the provided Pandas DataFrame and optionally \
        provides hints on duplicate and long column names.

    Parameters
    ----------
    data : pd.DataFrame
        Original DataFrame with columns to be cleaned
    hints : bool, optional
        Print out hints on column name duplication and column name length, by default \
        True

    Returns
    -------
    pd.DataFrame
        Pandas DataFrame with cleaned column names
    """

    _validate_input_bool(hints, "hints")

    # Handle CamelCase
    for i, col in enumerate(data.columns):
        matches = re.findall(re.compile("[a-z][A-Z]"), col)
        column = col
        for match in matches:
            column = column.replace(match, match[0] + "_" + match[1])
            data.rename(columns={data.columns[i]: column}, inplace=True)

    data.columns = (
        data.columns.str.replace("\n", "_", regex=False)
        .str.replace("(", "_", regex=False)
        .str.replace(")", "_", regex=False)
        .str.replace("'", "_", regex=False)
        .str.replace('"', "_", regex=False)
        .str.replace(".", "_", regex=False)
        .str.replace("-", "_", regex=False)
        .str.replace(r"[!?:;/]", "_", regex=True)
        .str.replace("+", "_plus_", regex=False)
        .str.replace("*", "_times_", regex=False)
        .str.replace("<", "_smaller_", regex=False)
        .str.replace(">", "_larger_", regex=False)
        .str.replace("=", "_equal_", regex=False)
        .str.replace("ä", "ae", regex=False)
        .str.replace("ö", "oe", regex=False)
        .str.replace("ü", "ue", regex=False)
        .str.replace("ß", "ss", regex=False)
        .str.replace("%", "_percent_", regex=False)
        .str.replace("$", "_dollar_", regex=False)
        .str.replace("€", "_euro_", regex=False)
        .str.replace("@", "_at_", regex=False)
        .str.replace("#", "_hash_", regex=False)
        .str.replace("&", "_and_", regex=False)
        .str.replace(r"\s+", "_", regex=True)
        .str.replace(r"_+", "_", regex=True)
        .str.strip("_")
        .str.lower()
    )

    dupl_idx = [i for i, x in enumerate(data.columns.duplicated()) if x]
    if len(dupl_idx) > 0:
        dupl_before = data.columns[dupl_idx].tolist()
        data.columns = [
            col if col not in data.columns[:i] else col + "_" + str(i)
            for i, col in enumerate(data.columns)
        ]
        if hints:
            print(
                f"Duplicate column names detected! Columns with index {dupl_idx} and "
                f"names {dupl_before} have been renamed to "
                f"{data.columns[dupl_idx].tolist()}."
            )

    long_col_names = [x for x in data.columns if len(x) > 25]
    if len(long_col_names) > 0 and hints:
        print(
            "Long column names detected (>25 characters). Consider renaming the "
            f"following columns {long_col_names}."
        )

    return data

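# Illustrative usage of clean_column_names() (hypothetical data, not part of the
# library source):
# >>> df = pd.DataFrame(columns=["Size (cm)", "maxSpeed", "Weight in kg"])
# >>> clean_column_names(df, hints=False).columns.tolist()
# ['size_cm', 'max_speed', 'weight_in_kg']
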
def convert_datatypes(
    data: pd.DataFrame,
    category: bool = True,
    cat_threshold: float = 0.05,
    cat_exclude: Optional[List[Union[str, int]]] = None,
) -> pd.DataFrame:
    """ Converts columns to best possible dtypes using dtypes supporting pd.NA.
    Temporarily not converting to integers due to an issue in pandas. This is expected \
        to be fixed in pandas 1.1. See https://github.com/pandas-dev/pandas/issues/33803

    Parameters
    ----------
    data : pd.DataFrame
        2D dataset that can be coerced into Pandas DataFrame
    category : bool, optional
        Change dtypes of columns with dtype "object" to "category". Set threshold \
        using cat_threshold or exclude columns using cat_exclude, by default True
    cat_threshold : float, optional
        Ratio of unique values below which categories are inferred and column dtype is \
        changed to categorical, by default 0.05
    cat_exclude : Optional[List[Union[str, int]]], optional
        List of columns to exclude from categorical conversion, by default None

    Returns
    -------
    pd.DataFrame
        Pandas DataFrame with converted datatypes
    """

    # Validate Inputs
    _validate_input_bool(category, "Category")
    _validate_input_range(cat_threshold, "cat_threshold", 0, 1)

    cat_exclude = [] if cat_exclude is None else cat_exclude.copy()

    data = pd.DataFrame(data).copy()
    for col in data.columns:
        unique_vals_ratio = data[col].nunique(dropna=False) / data.shape[0]
        if (
            category
            and unique_vals_ratio < cat_threshold
            and col not in cat_exclude
            and data[col].dtype == "object"
        ):
            data[col] = data[col].astype("category")

        data[col] = data[col].convert_dtypes(
            infer_objects=True,
            convert_string=True,
            convert_integer=False,
            convert_boolean=True,
        )

    data = optimize_ints(data)
    data = optimize_floats(data)

    return data

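# Illustrative behaviour of convert_datatypes() (hypothetical data, not part of the
# library source): an "object" column whose unique-value ratio is below cat_threshold
# is converted to "category".
# >>> df = pd.DataFrame({"grade": ["a", "b"] * 50, "score": [1.5, 2.5] * 50})
# >>> convert_datatypes(df)["grade"].dtype.name
# 'category'
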
def drop_missing(
    data: pd.DataFrame,
    drop_threshold_cols: float = 1,
    drop_threshold_rows: float = 1,
    col_exclude: Optional[List[str]] = None,
) -> pd.DataFrame:
    """ Drops completely empty columns and rows by default and optionally provides \
        flexibility to loosen restrictions to drop additional non-empty columns and \
        rows based on the fraction of NA-values.

    Parameters
    ----------
    data : pd.DataFrame
        2D dataset that can be coerced into Pandas DataFrame
    drop_threshold_cols : float, optional
        Drop columns with NA-ratio equal to or above the specified threshold, by \
        default 1
    drop_threshold_rows : float, optional
        Drop rows with NA-ratio equal to or above the specified threshold, by default 1
    col_exclude : Optional[List[str]], optional
        Specify a list of columns to exclude from dropping. The excluded columns do \
        not affect the drop thresholds, by default None

    Returns
    -------
    pd.DataFrame
        Pandas DataFrame without any empty columns or rows

    Notes
    -----
    Columns are dropped first
    """

    # Validate Inputs
    _validate_input_range(drop_threshold_cols, "drop_threshold_cols", 0, 1)
    _validate_input_range(drop_threshold_rows, "drop_threshold_rows", 0, 1)

    col_exclude = [] if col_exclude is None else col_exclude.copy()
    data_exclude = data[col_exclude]

    data = pd.DataFrame(data).copy()

    data_dropped = data.drop(columns=col_exclude, errors="ignore")
    data_dropped = data_dropped.drop(
        columns=data_dropped.loc[
            :, _missing_vals(data)["mv_cols_ratio"] > drop_threshold_cols
        ].columns
    ).dropna(axis=1, how="all")

    data = pd.concat([data_dropped, data_exclude], axis=1)

    data_cleaned = data.drop(
        index=data.loc[
            _missing_vals(data)["mv_rows_ratio"] > drop_threshold_rows, :
        ].index
    ).dropna(axis=0, how="all")
    return data_cleaned

def data_cleaning(
248
    data: pd.DataFrame,
249
    drop_threshold_cols: float = 0.9,
250
    drop_threshold_rows: float = 0.9,
251
    drop_duplicates: bool = True,
252
    convert_dtypes: bool = True,
253
    col_exclude: Optional[List[str]] = None,
254
    category: bool = True,
255
    cat_threshold: float = 0.03,
256
    cat_exclude: Optional[List[Union[str, int]]] = None,
257
    clean_col_names: bool = True,
258
    show: str = "changes",
259
) -> pd.DataFrame:
260
    """ Perform initial data cleaning tasks on a dataset, such as dropping single \
261
        valued and empty rows, empty columns as well as optimizing the datatypes.
262
263
    Parameters
264
    ----------
265
    data : pd.DataFrame
266
        2D dataset that can be coerced into Pandas DataFrame
267
    drop_threshold_cols : float, optional
268
        Drop columns with NA-ratio equal to or above the specified threshold, by \
269
        default 0.9
270
    drop_threshold_rows : float, optional
271
        Drop rows with NA-ratio equal to or above the specified threshold, by \
272
        default 0.9
273
    drop_duplicates : bool, optional
274
        Drop duplicate rows, keeping the first occurence. This step comes after the \
275
        dropping of missing values, by default True
276
    convert_dtypes : bool, optional
277
        Convert dtypes using pd.convert_dtypes(), by default True
278
    col_exclude : Optional[List[str]], optional
279
        Specify a list of columns to exclude from dropping, by default None
280
    category : bool, optional
281
        Enable changing dtypes of "object" columns to "category". Set threshold using \
282
        cat_threshold. Requires convert_dtypes=True, by default True
283
    cat_threshold : float, optional
284
        Ratio of unique values below which categories are inferred and column dtype is \
285
        changed to categorical, by default 0.03
286
    cat_exclude : Optional[List[str]], optional
287
        List of columns to exclude from categorical conversion, by default None
288
    clean_column_names: bool, optional
289
        Cleans the column names and provides hints on duplicate and long names, by \
290
        default True
291
    show : str, optional
292
        {"all", "changes", None}, by default "changes"
293
        Specify verbosity of the output:
294
295
            * "all": Print information about the data before and after cleaning as \
296
            well as information about  changes and memory usage (deep). Please be \
297
            aware, that this can slow down the function by quite a bit.
298
            * "changes": Print out differences in the data before and after cleaning.
299
            * None: No information about the data and the data cleaning is printed.
300
301
    Returns
302
    -------
303
    pd.DataFrame
304
        Cleaned Pandas DataFrame
305
306
    See also
307
    --------
308
    convert_datatypes: Convert columns to best possible dtypes.
309
    drop_missing : Flexibly drop columns and rows.
310
    _memory_usage: Gives the total memory usage in megabytes.
311
    _missing_vals: Metrics about missing values in the dataset.
312
313
    Notes
314
    -----
315
    The category dtype is not grouped in the summary, unless it contains exactly the \
316
    same categories.
317
    """
318
319
    # Validate Inputs
320 1
    _validate_input_range(drop_threshold_cols, "drop_threshold_cols", 0, 1)
321 1
    _validate_input_range(drop_threshold_rows, "drop_threshold_rows", 0, 1)
322 1
    _validate_input_bool(drop_duplicates, "drop_duplicates")
323 1
    _validate_input_bool(convert_dtypes, "convert_datatypes")
324 1
    _validate_input_bool(category, "category")
325 1
    _validate_input_range(cat_threshold, "cat_threshold", 0, 1)
326
327 1
    data = pd.DataFrame(data).copy()
328 1
    data_cleaned = drop_missing(
329
        data, drop_threshold_cols, drop_threshold_rows, col_exclude=col_exclude
330
    )
331
332 1
    if clean_col_names:
333 1
        data_cleaned = clean_column_names(data_cleaned)
334
335 1
    single_val_cols = data_cleaned.columns[
336
        data_cleaned.nunique(dropna=False) == 1
337
    ].tolist()
338 1
    data_cleaned = data_cleaned.drop(columns=single_val_cols)
339
340 1
    dupl_rows = None
341
342 1
    if drop_duplicates:
343 1
        data_cleaned, dupl_rows = _drop_duplicates(data_cleaned)
344 1
    if convert_dtypes:
345 1
        data_cleaned = convert_datatypes(
346
            data_cleaned,
347
            category=category,
348
            cat_threshold=cat_threshold,
349
            cat_exclude=cat_exclude,
350
        )
351
352 1
    _diff_report(
353
        data,
354
        data_cleaned,
355
        dupl_rows=dupl_rows,
356
        single_val_cols=single_val_cols,
357
        show=show,
358
    )
359
360 1
    return data_cleaned
361
362
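# Illustrative call of data_cleaning() (hypothetical DataFrame "df_raw", not part of
# the library source):
# >>> df_clean = data_cleaning(df_raw, drop_threshold_cols=0.85, show=None)
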
class DataCleaner(BaseEstimator, TransformerMixin):
    """ Wrapper for data_cleaning(). Allows data_cleaning() to be put into a pipeline \
    with similar functions (e.g. using MVColHandler() or SubsetPooler()).

    Parameters
    ----------
    drop_threshold_cols: float, default 0.9
        Drop columns with NA-ratio equal to or above the specified threshold.
    drop_threshold_rows: float, default 0.9
        Drop rows with NA-ratio equal to or above the specified threshold.
    drop_duplicates: bool, default True
        Drop duplicate rows, keeping the first occurrence. This step comes after the \
        dropping of missing values.
    convert_dtypes: bool, default True
        Convert dtypes using pd.convert_dtypes().
    col_exclude: list, default None
        Specify a list of columns to exclude from dropping.
    category: bool, default True
        Change dtypes of columns to "category". Set threshold using cat_threshold. \
        Requires convert_dtypes=True
    cat_threshold: float, default 0.03
        Ratio of unique values below which categories are inferred and column dtype is \
        changed to categorical.
    cat_exclude: list, default None
        List of columns to exclude from categorical conversion.
    clean_col_names: bool, default True
        Cleans the column names and provides hints on duplicate and long names.
    show: str, optional
        {"all", "changes", None}, by default "changes"
        Specify verbosity of the output:
            * "all": Print information about the data before and after cleaning as \
            well as information about changes and memory usage (deep). Please be \
            aware that this can slow down the function by quite a bit.
            * "changes": Print out differences in the data before and after cleaning.
            * None: No information about the data and the data cleaning is printed.

    Returns
    -------
    data_cleaned: Pandas DataFrame
    """

    def __init__(
        self,
        drop_threshold_cols: float = 0.9,
        drop_threshold_rows: float = 0.9,
        drop_duplicates: bool = True,
        convert_dtypes: bool = True,
        col_exclude: Optional[List[str]] = None,
        category: bool = True,
        cat_threshold: float = 0.03,
        cat_exclude: Optional[List[Union[str, int]]] = None,
        clean_col_names: bool = True,
        show: str = "changes",
    ):
        self.drop_threshold_cols = drop_threshold_cols
        self.drop_threshold_rows = drop_threshold_rows
        self.drop_duplicates = drop_duplicates
        self.convert_dtypes = convert_dtypes
        self.col_exclude = col_exclude
        self.category = category
        self.cat_threshold = cat_threshold
        self.cat_exclude = cat_exclude
        self.clean_col_names = clean_col_names
        self.show = show

    def fit(self, data, target=None):
        return self

    def transform(self, data, target=None):
        data_cleaned = data_cleaning(
            data,
            drop_threshold_cols=self.drop_threshold_cols,
            drop_threshold_rows=self.drop_threshold_rows,
            drop_duplicates=self.drop_duplicates,
            convert_dtypes=self.convert_dtypes,
            col_exclude=self.col_exclude,
            category=self.category,
            cat_threshold=self.cat_threshold,
            cat_exclude=self.cat_exclude,
            clean_col_names=self.clean_col_names,
            show=self.show,
        )
        return data_cleaned

def mv_col_handling(
    data: pd.DataFrame,
    target: Optional[Union[str, pd.Series, List]] = None,
    mv_threshold: float = 0.1,
    corr_thresh_features: float = 0.5,
    corr_thresh_target: float = 0.3,
    return_details: bool = False,
) -> pd.DataFrame:
    """ Converts columns with a high ratio of missing values into binary features and \
    eventually drops them based on their correlation with other features and the \
    target variable. This function follows a three-step process:
    - 1) Identify features with a high ratio of missing values (above 'mv_threshold').
    - 2) Identify high correlations of these features among themselves and with \
        other features in the dataset (above 'corr_thresh_features').
    - 3) Features with high ratio of missing values and high correlation among each \
        other are dropped unless they correlate reasonably well with the target \
        variable (above 'corr_thresh_target').

    Note: If no target is provided, the process exits after step two and drops columns \
    identified up to this point.

    Parameters
    ----------
    data : pd.DataFrame
        2D dataset that can be coerced into Pandas DataFrame
    target : Optional[Union[str, pd.Series, List]], optional
        Specify target for correlation. E.g. label column to generate only the \
        correlations between each feature and the label, by default None
    mv_threshold : float, optional
        Value between 0 <= threshold <= 1. Features with a missing-value-ratio larger \
        than mv_threshold are candidates for dropping and undergo further analysis, by \
        default 0.1
    corr_thresh_features : float, optional
        Value between 0 <= threshold <= 1. Maximum correlation a previously identified \
        feature (with a high mv-ratio) is allowed to have with another feature. If \
        this threshold is overstepped, the feature undergoes further analysis, by \
        default 0.5
    corr_thresh_target : float, optional
        Value between 0 <= threshold <= 1. Minimum required correlation of a remaining \
        feature (i.e. feature with a high mv-ratio and high correlation to another \
        existing feature) with the target. If this threshold is not met the feature is \
        ultimately dropped, by default 0.3
    return_details : bool, optional
        Provides flexibility to return intermediary results, by default False

    Returns
    -------
    pd.DataFrame
        Updated Pandas DataFrame

    optional:
    cols_mv: Columns with missing values included in the analysis
    drop_cols: List of dropped columns
    """

    # Validate Inputs
    _validate_input_range(mv_threshold, "mv_threshold", 0, 1)
    _validate_input_range(corr_thresh_features, "corr_thresh_features", 0, 1)
    _validate_input_range(corr_thresh_target, "corr_thresh_target", 0, 1)

    data = pd.DataFrame(data).copy()
    data_local = data.copy()
    mv_ratios = _missing_vals(data_local)["mv_cols_ratio"]
    cols_mv = mv_ratios[mv_ratios > mv_threshold].index.tolist()
    data_local[cols_mv] = (
        data_local[cols_mv].applymap(lambda x: 1 if not pd.isnull(x) else x).fillna(0)
    )

    high_corr_features = []
    data_temp = data_local.copy()
    for col in cols_mv:
        corrmat = corr_mat(data_temp, colored=False)
        if abs(corrmat[col]).nlargest(2)[1] > corr_thresh_features:
            high_corr_features.append(col)
            data_temp = data_temp.drop(columns=[col])

    drop_cols = []
    if target is None:
        data = data.drop(columns=high_corr_features)
    else:
        corrs = corr_mat(data_local, target=target, colored=False).loc[
            high_corr_features
        ]
        drop_cols = corrs.loc[abs(corrs.iloc[:, 0]) < corr_thresh_target].index.tolist()
        data = data.drop(columns=drop_cols)

    if return_details:
        return data, cols_mv, drop_cols

    return data

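# Illustrative call of mv_col_handling() (hypothetical DataFrame "df_raw" with a
# "label" column, not part of the library source): with a target, sparse features
# are only dropped if they correlate weakly with it.
# >>> df_reduced, cols_mv, dropped_cols = mv_col_handling(
# ...     df_raw, target="label", mv_threshold=0.2, return_details=True
# ... )
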
class MVColHandler(BaseEstimator, TransformerMixin):
    """ Wrapper for mv_col_handling(). Allows mv_col_handling() to be put into a \
        pipeline with similar functions (e.g. using DataCleaner() or SubsetPooler()).

    Parameters
    ----------
    target: string, list, np.array or pd.Series, default None
        Specify target for correlation. E.g. label column to generate only the \
        correlations between each feature and the label.
    mv_threshold: float, default 0.1
        Value between 0 <= threshold <= 1. Features with a missing-value-ratio larger \
        than mv_threshold are candidates for dropping and undergo further analysis.
    corr_thresh_features: float, default 0.6
        Value between 0 <= threshold <= 1. Maximum correlation a previously identified \
        feature with a high mv-ratio is allowed to have with another feature. If this \
        threshold is overstepped, the feature undergoes further analysis.
    corr_thresh_target: float, default 0.3
        Value between 0 <= threshold <= 1. Minimum required correlation of a remaining \
        feature (i.e. feature with a high mv-ratio and high correlation to another \
        existing feature) with the target. If this threshold is not met the feature is \
        ultimately dropped.
    return_details: bool, default True
        Provides flexibility to return intermediary results.

    Returns
    -------
    data: Updated Pandas DataFrame
    """

    def __init__(
        self,
        target: Optional[Union[str, pd.Series, List]] = None,
        mv_threshold: float = 0.1,
        corr_thresh_features: float = 0.6,
        corr_thresh_target: float = 0.3,
        return_details: bool = True,
    ):
        self.target = target
        self.mv_threshold = mv_threshold
        self.corr_thresh_features = corr_thresh_features
        self.corr_thresh_target = corr_thresh_target
        self.return_details = return_details

    def fit(self, data, target=None):
        return self

    def transform(self, data, target=None):
        data, cols_mv, dropped_cols = mv_col_handling(
            data,
            target=self.target,
            mv_threshold=self.mv_threshold,
            corr_thresh_features=self.corr_thresh_features,
            corr_thresh_target=self.corr_thresh_target,
            return_details=self.return_details,
        )

        print(f"\nFeatures with MV-ratio > {self.mv_threshold}: {len(cols_mv)}")
        print("Features dropped:", len(dropped_cols), dropped_cols)

        return data

def pool_duplicate_subsets(
    data: pd.DataFrame,
    col_dupl_thresh: float = 0.2,
    subset_thresh: float = 0.2,
    min_col_pool: int = 3,
    exclude: Optional[List[str]] = None,
    return_details=False,
) -> pd.DataFrame:
    """ Checks for duplicates in subsets of columns and pools them. This can reduce \
        the number of columns in the data without losing much information. Suitable \
        columns are combined into subsets and tested for duplicates. In case sufficient \
        duplicates can be found, the respective columns are aggregated into a \
        "pooled_vars" column. Identical numbers in the "pooled_vars" column indicate \
        identical information in the respective rows.

        Note: It is advised to exclude features that provide sufficient informational \
        content by themselves as well as the target column by using the "exclude" \
        setting.

    Parameters
    ----------
    data : pd.DataFrame
        2D dataset that can be coerced into Pandas DataFrame
    col_dupl_thresh : float, optional
        Columns with a ratio of duplicates higher than "col_dupl_thresh" are \
        considered in the further analysis. Columns with a lower ratio are not \
        considered for pooling, by default 0.2
    subset_thresh : float, optional
        The first subset with a duplicate threshold higher than "subset_thresh" is \
        chosen and aggregated. If no subset reaches the threshold, the algorithm \
        continues with continuously smaller subsets until "min_col_pool" is reached, \
        by default 0.2
    min_col_pool : int, optional
        Minimum number of columns to pool. The algorithm attempts to combine as many \
        columns as possible to suitable subsets and stops when "min_col_pool" is \
        reached, by default 3
    exclude : Optional[List[str]], optional
        List of column names to be excluded from the analysis. These columns are \
        passed through without modification, by default None
    return_details : bool, optional
        Provides flexibility to return intermediary results, by default False

    Returns
    -------
    pd.DataFrame
        DataFrame with low-cardinality columns pooled

    optional:
    subset_cols: List of columns used as subset
    """

    # Input validation
    _validate_input_range(col_dupl_thresh, "col_dupl_thresh", 0, 1)
    _validate_input_range(subset_thresh, "subset_thresh", 0, 1)
    _validate_input_range(min_col_pool, "min_col_pool", 0, data.shape[1])

    excluded_cols = []
    if exclude is not None:
        excluded_cols = data[exclude]
        data = data.drop(columns=exclude)

    subset_cols = []
    for i in range(data.shape[1] + 1 - min_col_pool):
        # Consider only columns with lots of duplicates
        check_list = [
            col
            for col in data.columns
            if data.duplicated(subset=col).mean() > col_dupl_thresh
        ]

        # Identify all possible combinations for the current iteration
        if check_list:
            combinations = itertools.combinations(check_list, len(check_list) - i)
        else:
            continue

        # Check subsets for all possible combinations
        ratios = [
            *map(lambda comb: data.duplicated(subset=list(comb)).mean(), combinations)
        ]
        max_idx = np.argmax(ratios)

        if max(ratios) > subset_thresh:
            # Get the best possible iterator and process the data
            best_subset = itertools.islice(
                itertools.combinations(check_list, len(check_list) - i),
                max_idx,
                max_idx + 1,
            )

            best_subset = data[list(list(best_subset)[0])]
            subset_cols = best_subset.columns.tolist()

            unique_subset = (
                best_subset.drop_duplicates()
                .reset_index()
                .rename(columns={"index": "pooled_vars"})
            )
            data = data.merge(unique_subset, how="left", on=subset_cols).drop(
                columns=subset_cols
            )
            data.index = pd.RangeIndex(len(data))
            break

    data = pd.concat([data, pd.DataFrame(excluded_cols)], axis=1)

    if return_details:
        return data, subset_cols

    return data

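# Illustrative call of pool_duplicate_subsets() (hypothetical DataFrame "df_raw" with
# a "label" column, not part of the library source): exclude the target so it is never
# pooled and inspect which columns were combined into "pooled_vars".
# >>> df_pooled, subset_cols = pool_duplicate_subsets(
# ...     df_raw, exclude=["label"], return_details=True
# ... )
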
class SubsetPooler(BaseEstimator, TransformerMixin):
    """ Wrapper for pool_duplicate_subsets(). Allows pool_duplicate_subsets() to be \
        put into a pipeline with similar functions (e.g. using DataCleaner() or \
        MVColHandler()).

    Parameters
    ----------
    col_dupl_thresh: float, default 0.2
        Columns with a ratio of duplicates higher than "col_dupl_thresh" are considered \
        in the further analysis. Columns with a lower ratio are not considered for \
        pooling.
    subset_thresh: float, default 0.2
        The first subset with a duplicate threshold higher than "subset_thresh" is \
        chosen and aggregated. If no subset reaches the threshold, the algorithm \
        continues with continuously smaller subsets until "min_col_pool" is reached.
    min_col_pool: integer, default 3
        Minimum number of columns to pool. The algorithm attempts to combine as many \
        columns as possible to suitable subsets and stops when "min_col_pool" is \
        reached.
    return_details: bool, default True
        Provides flexibility to return intermediary results.

    Returns
    -------
    data: pd.DataFrame
    """

    def __init__(
        self,
        col_dupl_thresh=0.2,
        subset_thresh=0.2,
        min_col_pool=3,
        return_details=True,
    ):
        self.col_dupl_thresh = col_dupl_thresh
        self.subset_thresh = subset_thresh
        self.min_col_pool = min_col_pool
        self.return_details = return_details

    def fit(self, data, target=None):
        return self

    def transform(self, data, target=None):
        # Pass the configured thresholds through to pool_duplicate_subsets(). Details
        # are always requested internally so the pooled columns can be reported below.
        data, subset_cols = pool_duplicate_subsets(
            data,
            col_dupl_thresh=self.col_dupl_thresh,
            subset_thresh=self.subset_thresh,
            min_col_pool=self.min_col_pool,
            return_details=True,
        )

        print("Combined columns:", len(subset_cols), subset_cols)

        return data
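The three wrapper classes are designed to be chained with scikit-learn. A minimal sketch, assuming scikit-learn is installed and using a placeholder DataFrame df_raw with a "label" column (both hypothetical, not part of this report):

from sklearn.pipeline import Pipeline

from klib.clean import DataCleaner, MVColHandler, SubsetPooler

# Each step implements fit()/transform(), so the cleaners compose into one pipeline.
pipeline = Pipeline(
    [
        ("cleaning", DataCleaner(drop_threshold_cols=0.9, drop_threshold_rows=0.9)),
        ("mv_handling", MVColHandler(target="label")),
        ("pooling", SubsetPooler(min_col_pool=3)),
    ]
)

df_processed = pipeline.fit_transform(df_raw)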