klib.preprocess.train_dev_test_split() - Code Metrics - Inspection of "elminiate unnecessary for-loop in mch()" - akanz1/klib - Measure and Improve Code Quality continuously with Scrutinizer

Passed

Push — master ( 473386...41158c )

by Andreas

created 2020-04-27 07:56 UTC

klib.preprocess.train_dev_test_split() B

↳ Parent: klib.preprocess

Size

Total Lines	61
Code Lines	22

Duplication

Lines	0
Ratio	0 %

Changes

Metric	Value
cc	5
eloc	22
nop	6
dl	0
loc	61
rs	8.8853
c	0
b	0
f	0

'''
Functions for data preprocessing.

:author: Andreas Kanz

'''

# Imports
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from .describe import corr_mat
from .utils import _missing_vals
from .utils import _validate_input_int
from .utils import _validate_input_range


def mv_col_handler(data, target=None, mv_threshold=0.1, corr_thresh_features=0.6, corr_thresh_target=0.3):
    '''
    Converts columns with a high ratio of missing values into binary features and eventually drops them based on \
    their correlation with other features and the target variable. This function follows a three-step process:
    - 1) Identify features with a high ratio of missing values
    - 2) Identify high correlations of these features among themselves and with other features in the dataset.
    - 3) Features with high ratio of missing values and high correlation among each other are dropped unless \
         they correlate reasonably well with the target variable.

    Note: If no target is provided, the process exits after step two and drops columns identified up to this point.

    Parameters
    ----------
    data: 2D dataset that can be coerced into Pandas DataFrame.

    target: string, list, np.array or pd.Series, default None
        Specify target for correlation. E.g. label column to generate only the correlations between each feature \
        and the label.

    mv_threshold: float, default 0.1
        Value between 0 <= threshold <= 1. Features with a missing-value-ratio larger than mv_threshold are candidates \
        for dropping and undergo further analysis.

    corr_thresh_features: float, default 0.6
        Value between 0 <= threshold <= 1. Maximum correlation a previously identified feature with a high mv-ratio is \
        allowed to have with another feature. If this threshold is exceeded, the feature undergoes further analysis.

    corr_thresh_target: float, default 0.3
        Value between 0 <= threshold <= 1. Minimum required correlation of a remaining feature (i.e. feature with a \
        high mv-ratio and high correlation to another existing feature) with the target. If this threshold is not met \
        the feature is ultimately dropped.

    Returns
    -------
    data: Updated Pandas DataFrame
    cols_mv: Columns with missing values included in the analysis
    drop_cols: List of dropped columns
    '''

    # Validate Inputs
    _validate_input_range(mv_threshold, 'mv_threshold', 0, 1)
    _validate_input_range(corr_thresh_features, 'corr_thresh_features', 0, 1)
    _validate_input_range(corr_thresh_target, 'corr_thresh_target', 0, 1)

    data = pd.DataFrame(data).copy()
    data_local = data.copy()
    mv_ratios = _missing_vals(data_local)['mv_cols_ratio']
    cols_mv = mv_ratios[mv_ratios > mv_threshold].index.tolist()
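    # Turn candidate columns into binary indicators: 1 if a value is present, 0 if it is missing
    data_local[cols_mv] = data_local[cols_mv].notna().astype(int)

    # Collect candidates that correlate strongly with another feature. The largest absolute correlation is the
    # feature with itself, so the second largest value is compared against the threshold.
    high_corr_features = []
    data_temp = data_local.copy()
    for col in cols_mv:
        corrmat = corr_mat(data_temp, colored=False)
        if abs(corrmat[col]).nlargest(2).iloc[1] > corr_thresh_features:
            high_corr_features.append(col)
            data_temp = data_temp.drop(columns=[col])

    drop_cols = []
    if target is None:
        # Without a target, all highly correlated candidates are dropped and reported
        drop_cols = high_corr_features
        data = data.drop(columns=drop_cols)
    else:
        # Accept a column name as well as array-like targets
        target_data = data_local[target] if isinstance(target, str) else pd.Series(target)
        for col in high_corr_features:
            if data_local[col].corr(target_data) < corr_thresh_target:
                drop_cols.append(col)
                data = data.drop(columns=[col])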

    return data, cols_mv, drop_cols


def train_dev_test_split(data, target, dev_size=0.1, test_size=0.1, stratify=None, random_state=1234):
    '''
    Split a dataset and a label column into train, dev and test sets.

    Parameters
    ----------
    data: 2D dataset that can be coerced into Pandas DataFrame.

    target: string, list, np.array or pd.Series
        Label column. If a string is provided, the column of that name is taken from data as the label and removed \
        from the features. Otherwise the provided values are used as the labels.

    dev_size: float, default 0.1
        Should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the dev split.

    test_size: float, default 0.1
        Should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.

    stratify: array-like, default None
        If not None, data is split in a stratified fashion, using the input as the class labels.

    random_state: integer, default 1234
        Seed used by the random number generator.

    Returns
    -------
    tuple: Tuple containing the train-dev-test split of the inputs, i.e. \
    (X_train, X_dev, X_test, y_train, y_dev, y_test). If dev_size or test_size is 0, only one split is made and the \
    tuple is (X_train, X_dev_test, y_train, y_dev_test).
    '''

    # Validate Inputs
    _validate_input_range(dev_size, 'dev_size', 0, 1)
    _validate_input_range(test_size, 'test_size', 0, 1)
    _validate_input_int(random_state, 'random_state')

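    data = pd.DataFrame(data)
    if isinstance(target, str):
        # The label is a column of data: separate it from the features
        target_data = data[target]
        data = data.drop(target, axis=1)

    elif isinstance(target, (list, pd.Series, np.ndarray)):
        # Lists and arrays have no .name attribute, so the labels are only wrapped in a Series
        target_data = pd.Series(target)

    else:
        raise TypeError('target must be a string, list, np.ndarray or pd.Series.')

    # First split: hold out the dev and test portions together
    X_train, X_dev_test, y_train, y_dev_test = train_test_split(data, target_data,
                                                                test_size=dev_size+test_size,
                                                                random_state=random_state,
                                                                stratify=stratify)

    if (dev_size == 0) or (test_size == 0):
        return X_train, X_dev_test, y_train, y_dev_test

    else:
        # Second split: test_size is rescaled because it now refers to the held-out portion only.
        # Stratify only if requested; the held-out labels stand in for the original stratify input.
        X_dev, X_test, y_dev, y_test = train_test_split(X_dev_test, y_dev_test,
                                                        test_size=test_size/(dev_size+test_size),
                                                        random_state=random_state,
                                                        stratify=y_dev_test if stratify is not None else None)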
        return X_train, X_dev, X_test, y_train, y_dev, y_test
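
The arithmetic behind the two-stage split in train_dev_test_split() can be checked with a small sketch. It calls scikit-learn's train_test_split directly instead of klib, and the helper name two_stage_split as well as the toy data are hypothetical. The point it illustrates is that the second split must use test_size / (dev_size + test_size), because that fraction is relative to the held-out portion and not to the full dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def two_stage_split(X, y, dev_size=0.1, test_size=0.1, random_state=1234):
    """Hypothetical helper mirroring the two-stage split described above."""
    # Stage 1: hold out the dev and test portions together.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=dev_size + test_size, random_state=random_state)
    # Stage 2: divide the held-out rows; the fraction is relative to X_rest.
    X_dev, X_test, y_dev, y_test = train_test_split(
        X_rest, y_rest, test_size=test_size / (dev_size + test_size),
        random_state=random_state)
    return X_train, X_dev, X_test, y_train, y_dev, y_test


# Toy data (assumption): 100 rows, two feature columns, a binary label.
X = pd.DataFrame({"a": np.arange(100), "b": np.arange(100) * 2})
y = pd.Series(np.arange(100) % 2, name="label")

X_train, X_dev, X_test, y_train, y_dev, y_test = two_stage_split(X, y)
sizes = [len(X_train), len(X_dev), len(X_test)]
print(sizes)
```

With the default dev_size and test_size of 0.1 each, the held-out portion is 20 % of the rows and the second split divides it in half, so every row ends up in exactly one of the three sets.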

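
Step 1 of mv_col_handler(), selecting columns whose missing-value ratio exceeds mv_threshold and turning them into binary presence indicators, can be illustrated with plain pandas. This is a minimal sketch under stated assumptions: it does not use klib's _missing_vals helper, whose implementation is not shown on this page, and computes the ratio with isna().mean() instead. The function name to_missing_indicators and the example frame are hypothetical.

```python
import numpy as np
import pandas as pd


def to_missing_indicators(df, mv_threshold=0.1):
    """Hypothetical helper: binarise columns with many missing values."""
    out = df.copy()
    # Share of missing values per column.
    mv_ratios = out.isna().mean()
    cols_mv = mv_ratios[mv_ratios > mv_threshold].index.tolist()
    # 1 where a value is present, 0 where it is missing.
    out[cols_mv] = out[cols_mv].notna().astype(int)
    return out, cols_mv


# Example frame (assumption): one complete column, one half-empty column.
df = pd.DataFrame({
    "full": [1.0, 2.0, 3.0, 4.0],
    "sparse": [np.nan, 5.0, np.nan, 7.0],
})
converted, cols_mv = to_missing_indicators(df, mv_threshold=0.1)
print(cols_mv)
print(converted["sparse"].tolist())
```

Only the column above the threshold is converted; the complete column keeps its original values, which is what allows the later correlation steps to compare "missingness" against the remaining features.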

