Passed
Push — master ( 75a80a...1afdc9 )
by Andreas
02:08
created

klib.describe.dist_plot()   D

Complexity

Conditions 10

Size

Total Lines 142
Code Lines 81

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric Value
cc 10
eloc 81
nop 11
dl 0
loc 142
rs 4.8218
c 0
b 0
f 0

How to fix: Long Method, Complexity, Many Parameters

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
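As a minimal sketch of this idea, the block that dist_plot() marks with the comment "Handle dictionary defaults" could be pulled out into a helper whose name comes from that comment. The helper name `resolve_kws` and the module-level constants are assumptions made for illustration; they are not part of klib. The default values are copied from the listing.

```python
from typing import Any, Dict, Optional

# Defaults copied from the "Handle dictionary defaults" block of dist_plot().
KDE_DEFAULTS: Dict[str, Any] = {"alpha": 0.7, "linewidth": 1.5}
FILL_DEFAULTS: Dict[str, Any] = {"color": "brown", "alpha": 0.1}


def resolve_kws(user_kws: Optional[Dict[str, Any]], defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: copy the caller's dict, or fall back to a copy of the defaults."""
    source = defaults if user_kws is None else user_kws
    return dict(source)  # copy so neither the defaults nor the caller's dict is mutated later


# Inside dist_plot(), the near-identical conditional lines would shrink to calls like these:
kde_kws = resolve_kws(None, KDE_DEFAULTS)
fill_kws = resolve_kws({"color": "teal"}, FILL_DEFAULTS)
```

The helper keeps the behavior of the original lines (a user-supplied dictionary replaces the defaults rather than being merged with them), so extracting it would not change what the function draws.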

A commonly applied refactoring for long methods is Extract Method.

Complexity

Complex units like klib.describe.dist_plot() often do a lot of different things. To break such a unit down, we need to identify a cohesive component within it. A common approach to finding such a component is to look for fields, methods, or (for a function like this one) parameters and local variables that share the same prefixes or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
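One cohesive component in dist_plot() might be the summary statistics it computes per column (mean, median, standard deviation, count) and then writes onto the plot. The sketch below extracts them into a small class. The class name `ColumnStats` and its methods are hypothetical, and the standard library `statistics` module stands in for the NumPy/SciPy calls used in the listing, so skew and kurtosis are left out.

```python
import statistics
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical value object holding the per-column statistics shown on the plot."""

    count: int
    mean: float
    median: float
    std: float

    @classmethod
    def from_values(cls, values: Sequence[float]) -> "ColumnStats":
        # statistics.stdev is the sample standard deviation (n - 1 in the denominator),
        # which corresponds to scipy.stats.tstd as used in the listing.
        return cls(
            count=len(values),
            mean=statistics.mean(values),
            median=statistics.median(values),
            std=statistics.stdev(values),
        )

    def annotation_lines(self) -> List[str]:
        # Text snippets in the spirit of the ax.text() calls in dist_plot().
        return [
            f"Mean: {self.mean:.2f}",
            f"Std. dev: {self.std:.2f}",
            f"Count: {self.count}",
        ]


stats = ColumnStats.from_values([1.0, 2.0, 3.0, 4.0])
print(stats.annotation_lines())
```

With the statistics in one place, the plotting code would only need to iterate over `annotation_lines()` instead of formatting each value inline.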

Many Parameters

Methods with many parameters are not only hard to understand; their parameters also often become inconsistent when you need more or different data.

There are several approaches to avoid long parameter lists, such as Introduce Parameter Object and Preserve Whole Object.
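One such approach, sketched here under assumptions, is to bundle the styling arguments of dist_plot() into a single parameter object. `DistPlotStyle` and `dist_plot_sketch` are hypothetical names, not part of klib. The default values mirror those in the listing, and the function only resolves the style instead of drawing anything.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class DistPlotStyle:
    """Hypothetical parameter object; defaults mirror those set inside dist_plot()."""

    mean_color: str = "orange"
    fill_range: Tuple[float, float] = (0.025, 0.975)
    # default_factory gives every instance its own dict (no shared mutable default).
    kde_kws: Dict[str, Any] = field(default_factory=lambda: {"alpha": 0.7, "linewidth": 1.5})
    rug_kws: Dict[str, Any] = field(
        default_factory=lambda: {"color": "brown", "alpha": 0.5, "linewidth": 2, "height": 0.04}
    )
    fill_kws: Dict[str, Any] = field(default_factory=lambda: {"color": "brown", "alpha": 0.1})
    font_kws: Dict[str, Any] = field(
        default_factory=lambda: {"color": "#111111", "weight": "normal", "size": 11}
    )


def dist_plot_sketch(
    data: Any, style: Optional[DistPlotStyle] = None, hist: bool = False, bins: int = 10
) -> DistPlotStyle:
    """Signature sketch only: resolves the style object and returns it instead of plotting."""
    return DistPlotStyle() if style is None else style


default_style = dist_plot_sketch([1, 2, 3])
custom_style = dist_plot_sketch([1, 2, 3], style=DistPlotStyle(mean_color="red"))
```

In this sketch the signature drops from eleven parameters to four, and the per-dictionary `None` checks are no longer needed because the dataclass supplies the defaults.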

"""
Functions for descriptive analytics.

:author: Andreas Kanz

"""

# Imports
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
import pandas as pd
import scipy.stats  # import the submodule explicitly; scipy.stats is used in dist_plot()
import seaborn as sns

from typing import Any, Dict, Optional, Tuple, Union
from .utils import (
    _corr_selector,
    _missing_vals,
    _validate_input_bool,
    _validate_input_int,
    _validate_input_smaller,
    _validate_input_range,
)


__all__ = ["cat_plot", "corr_mat", "corr_plot", "dist_plot", "missingval_plot"]


# Functions

# Categorical Plot
def cat_plot(
    data: pd.DataFrame,
    figsize: Tuple = (16, 16),
    top: int = 3,
    bottom: int = 3,
    bar_color_top: str = "#5ab4ac",
    bar_color_bottom: str = "#d8b365",
    cmap: str = "BrBG",
):
    """ Two-dimensional visualization of the number and frequency of categorical features.

    Parameters
    ----------
    data : pd.DataFrame
        2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column \
    information is used to label the plots
    figsize : Tuple, optional
        Use to control the figure size, by default (16, 16)
    top : int, optional
        Show the "top" most frequent values in a column, by default 3
    bottom : int, optional
        Show the "bottom" least frequent values in a column, by default 3
    bar_color_top : str, optional
        Use to control the color of the bars indicating the most common values, by default "#5ab4ac"
    bar_color_bottom : str, optional
        Use to control the color of the bars indicating the least common values, by default "#d8b365"
    cmap : str, optional
        The mapping from data values to color space, by default "BrBG"

    Returns
    -------
    GridSpec
        gs: Figure with array of Axes objects
    """

    # Validate Inputs
    _validate_input_int(top, "top")
    _validate_input_int(bottom, "bottom")
    _validate_input_range(top, "top", 0, data.shape[1])
    _validate_input_range(bottom, "bottom", 0, data.shape[1])

    data = pd.DataFrame(data).copy()
    cols = data.select_dtypes(exclude=["number"]).columns.tolist()
    data = data[cols]

    if len(cols) == 0:
        print("No columns with categorical data were detected.")

    fig = plt.figure(figsize=figsize)
    gs = fig.add_gridspec(nrows=6, ncols=len(cols), wspace=0.2)

    for count, col in enumerate(cols):

        n_unique = data[col].nunique(dropna=False)
        value_counts = data[col].value_counts()
        lim_top, lim_bot = top, bottom

        if n_unique < top + bottom:
            lim_top = lim_bot = int(n_unique // 2)

        value_counts_top = value_counts[0:lim_top]
        value_counts_idx_top = value_counts_top.index.tolist()
        value_counts_bot = value_counts[-lim_bot:]
        value_counts_idx_bot = value_counts_bot.index.tolist()

        if top == 0:
            value_counts_top: Any = None
            value_counts_idx_top = None

        elif bottom == 0:
            value_counts_bot: Any = None
            value_counts_idx_bot = None

        # Encode the most common values as 2, the least common as -2 and everything else as 0
        data.loc[data[col].isin(value_counts_idx_top), col] = 2
        data.loc[data[col].isin(value_counts_idx_bot), col] = -2
        data.loc[~((data[col] == 2) | (data[col] == -2)), col] = 0

        # Barcharts
        ax_top = fig.add_subplot(gs[:1, count : count + 1])
        ax_top.bar(value_counts_idx_top, value_counts_top, color=bar_color_top, width=0.85)
        ax_top.bar(value_counts_idx_bot, value_counts_bot, color=bar_color_bottom, width=0.85)
        ax_top.set(frame_on=False)
        ax_top.tick_params(axis="x", labelrotation=90)

        # Summary stats
        ax_bottom = fig.add_subplot(gs[1:2, count : count + 1])
        ax_bottom.get_yaxis().set_visible(False)
        ax_bottom.get_xaxis().set_visible(False)
        ax_bottom.set(frame_on=False)
        ax_bottom.text(
            0,
            0,
            f"Unique values: {n_unique}\n\n"
            f"Top {top} vals: {sum(value_counts_top)} ({sum(value_counts_top)/data.shape[0]*100:.1f}%)\n"
            f"Bot {bottom} vals: {sum(value_counts_bot)} " + f"({sum(value_counts_bot)/data.shape[0]*100:.1f}%)",
            transform=ax_bottom.transAxes,
            color="#111111",
            fontsize=11,
        )

    # Heatmap
    data = data.astype("int")
    ax_hm = fig.add_subplot(gs[2:, :])
    sns.heatmap(data, cmap=cmap, cbar=False, vmin=-4.25, vmax=4.25, ax=ax_hm)
    ax_hm.set_yticks(np.round(ax_hm.get_yticks()[0::5], -1))
    ax_hm.set_yticklabels(ax_hm.get_yticks())
    ax_hm.set_xticklabels(ax_hm.get_xticklabels(), horizontalalignment="center", fontweight="light", fontsize="medium")
    ax_hm.tick_params(length=1, colors="#111111")

    gs.figure.suptitle("Categorical data plot", x=0.47, y=0.925, fontsize=18, color="#111111")

    return gs


# Correlation Matrix
def corr_mat(
    data: pd.DataFrame,
    split: Optional[str] = None,
    threshold: float = 0,
    target: Optional[Union[pd.DataFrame, str]] = None,
    method: str = "pearson",
    colored: bool = True,
) -> Union[pd.DataFrame, Any]:
    """ Returns a color-encoded correlation matrix.

    Parameters
    ----------
    data : pd.DataFrame
        2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column \
    information is used to label the plots
    split : Optional[str], optional
        Type of split to be performed, by default None
        {None, 'pos', 'neg', 'above', 'below'}
    threshold : float, optional
        Value between 0 <= threshold <= 1, by default 0
    target : Optional[Union[pd.DataFrame, str]], optional
        Specify target for correlation. E.g. label column to generate only the correlations between each feature \
        and the label, by default None
    method : str, optional
        method: {'pearson', 'spearman', 'kendall'}, by default "pearson"
        * pearson: measures linear relationships and requires normally distributed and homoscedastic data.
        * spearman: ranked/ordinal correlation, measures monotonic relationships.
        * kendall: ranked/ordinal correlation, measures monotonic relationships. Computationally more expensive but \
                    more robust in smaller datasets than 'spearman'
    colored : bool, optional
        If True the negative values in the correlation matrix are colored in red, by default True

    Returns
    -------
    Union[pd.DataFrame, pd.Styler]
        If colored = True - corr: Pandas Styler object
        If colored = False - corr: Pandas DataFrame
    """

    # Validate Inputs
    _validate_input_range(threshold, "threshold", -1, 1)
    _validate_input_bool(colored, "colored")

    def color_negative_red(val):
        # Return an empty style for non-negative values instead of the invalid CSS "color: None"
        return "color: #FF3344" if val < 0 else ""

    data = pd.DataFrame(data)

    if isinstance(target, (str, list, pd.Series, np.ndarray)):
        target_data = []
        if isinstance(target, str):
            target_data = data[target]
            data = data.drop(target, axis=1)

        elif isinstance(target, (list, pd.Series, np.ndarray)):
            target_data = pd.Series(target)
            target = target_data.name

        # Pass the selected method on; without it corrwith() always uses its default, "pearson"
        corr = pd.DataFrame(data.corrwith(target_data, method=method))
        corr = corr.sort_values(corr.columns[0], ascending=False)
        corr.columns = [target]

    else:
        corr = data.corr(method=method)

    corr = _corr_selector(corr, split=split, threshold=threshold)

    if colored:
        return corr.style.applymap(color_negative_red).format("{:.2f}", na_rep="-")
    else:
        return corr


# Correlation matrix / heatmap
def corr_plot(
    data: pd.DataFrame,
    split: Optional[str] = None,
    threshold: float = 0,
    target: Optional[Union[pd.Series, str]] = None,
    method: str = "pearson",
    cmap: str = "BrBG",
    figsize: Tuple = (12, 10),
    annot: bool = True,
    dev: bool = False,
    **kwargs,
):
    """ Two-dimensional visualization of the correlation between feature-columns, excluding NA values.

    Parameters
    ----------
    data : pd.DataFrame
        2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column \
        information is used to label the plots
    split : Optional[str], optional
        Type of split to be performed, by default None
        {None, 'pos', 'neg', 'above', 'below'}
            * None: visualize all correlations between the feature-columns
            * pos: visualize all positive correlations between the feature-columns above the threshold
            * neg: visualize all negative correlations between the feature-columns below the threshold
            * above: visualize all correlations between the feature-columns for which abs(corr) > threshold is True
            * below: visualize all correlations between the feature-columns for which abs(corr) < threshold is True
    threshold : float, optional
        Value between 0 <= threshold <= 1, by default 0
    target : Optional[Union[pd.Series, str]], optional
        Specify target for correlation. E.g. label column to generate only the correlations between each feature \
        and the label, by default None
    method : str, optional
        method: {'pearson', 'spearman', 'kendall'}, by default "pearson"
            * pearson: measures linear relationships and requires normally distributed and homoscedastic data.
            * spearman: ranked/ordinal correlation, measures monotonic relationships.
            * kendall: ranked/ordinal correlation, measures monotonic relationships. Computationally more expensive \
            but more robust in smaller datasets than 'spearman'.
    cmap : str, optional
        The mapping from data values to color space, matplotlib colormap name or object, or list of colors, by default \
        "BrBG"
    figsize : Tuple, optional
        Use to control the figure size, by default (12, 10)
    annot : bool, optional
        Use to show or hide annotations, by default True
    dev : bool, optional
        Display figure settings in the plot by setting dev = True. If False, the settings are not displayed, by \
        default False

    **kwargs: optional
        Additional elements to control the visualization of the plot, e.g.:

        * mask: bool, default True
        If set to False the entire correlation matrix, including the upper triangle is shown. Set dev = False in this \
        case to avoid overlap.
        * vmax: float, default is calculated from the given correlation coefficients.
        Value between -1 or vmin <= vmax <= 1, limits the range of the colorbar.
        * vmin: float, default is calculated from the given correlation coefficients.
        Value between -1 <= vmin <= 1 or vmax, limits the range of the colorbar.
        * linewidths: float, default 0.5
        Controls the line-width in between the squares.
        * annot_kws: dict, default {'size' : 10}
        Controls the font size of the annotations. Only available when annot = True.
        * cbar_kws: dict, default {'shrink': .95, 'aspect': 30}
        Controls the size of the colorbar.
        * Many more kwargs are available, e.g. 'alpha' to control blending, or options to adjust labels, ticks ...

        Kwargs can be supplied through a dictionary of key-value pairs (see above).

    Returns
    -------
    ax: matplotlib Axes
        Returns the Axes object with the plot for further tweaking.
    """

    # Validate Inputs
    _validate_input_range(threshold, "threshold", -1, 1)
    _validate_input_bool(annot, "annot")
    _validate_input_bool(dev, "dev")

    data = pd.DataFrame(data)

    corr = corr_mat(data, split=split, threshold=threshold, target=target, method=method, colored=False)

    # Use the builtin bool: the np.bool alias was deprecated in NumPy 1.20 and later removed
    mask = np.zeros_like(corr, dtype=bool)

    if target is None:
        mask = np.triu(np.ones_like(corr, dtype=bool))

    vmax = np.round(np.nanmax(corr.where(~mask)) - 0.05, 2)
    vmin = np.round(np.nanmin(corr.where(~mask)) + 0.05, 2)

    fig, ax = plt.subplots(figsize=figsize)

    # Specify kwargs for the heatmap
    kwargs = {
        "mask": mask,
        "cmap": cmap,
        "annot": annot,
        "vmax": vmax,
        "vmin": vmin,
        "linewidths": 0.5,
        "annot_kws": {"size": 10},
        "cbar_kws": {"shrink": 0.95, "aspect": 30},
        **kwargs,
    }

    # Draw heatmap with mask and default settings
    sns.heatmap(corr, center=0, fmt=".2f", **kwargs)

    ax.set_title(f"Feature-correlation ({method})", fontdict={"fontsize": 18})

    # Settings
    if dev:
        fig.suptitle(
            f"\
            Settings (dev-mode): \n\
            - split-mode: {split} \n\
            - threshold: {threshold} \n\
            - method: {method} \n\
            - annotations: {annot} \n\
            - cbar: \n\
                - vmax: {vmax} \n\
                - vmin: {vmin} \n\
            - linewidths: {kwargs['linewidths']} \n\
            - annot_kws: {kwargs['annot_kws']} \n\
            - cbar_kws: {kwargs['cbar_kws']}",
            fontsize=12,
            color="gray",
            x=0.35,
            y=0.85,
            ha="left",
        )

    return ax


# Distribution plot
def dist_plot(
    data: pd.DataFrame,
    mean_color: str = "orange",
    figsize: Tuple = (14, 2),
    fill_range: Tuple = (0.025, 0.975),
    hist: bool = False,
    bins: int = 10,
    showall: bool = False,
    kde_kws: Optional[Dict[str, Any]] = None,
    rug_kws: Optional[Dict[str, Any]] = None,
    fill_kws: Optional[Dict[str, Any]] = None,
    font_kws: Optional[Dict[str, Any]] = None,
):
    """ Two-dimensional visualization of the distribution of numerical features.

    Parameters
    ----------
    data : pd.DataFrame
        2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column \
    information is used to label the plots
    mean_color : str, optional
        Color of the vertical line indicating the mean of the data, by default "orange"
    figsize : Tuple, optional
        Controls the figure size, by default (14, 2)
    fill_range : Tuple, optional
        Set the quantiles for shading. Default spans 95% of the data, which is about two std. deviations \
        above and below the mean, by default (0.025, 0.975)
    hist : bool, optional
        Set to True to display histogram bars in the plot, by default False
    bins : int, optional
        Specification of the number of hist bins. Requires hist = True, by default 10
    showall : bool, optional
        Set to True to remove the output limit of 20 plots, by default False
    kde_kws : Dict[str, Any], optional
        Keyword arguments for kdeplot(), by default {'alpha': 0.7, 'linewidth': 1.5}
    rug_kws : Dict[str, Any], optional
        Keyword arguments for rugplot(), by default {'color': 'brown', 'alpha': 0.5, 'linewidth': 2, 'height': 0.04}
    fill_kws : Dict[str, Any], optional
        Keyword arguments to control the fill, by default {'color': 'brown', 'alpha': 0.1}
    font_kws : Dict[str, Any], optional
        Keyword arguments to control the font, by default {'color':  '#111111', 'weight': 'normal', 'size': 11}

    Returns
    -------
    ax: matplotlib Axes
        Returns the Axes object of the last plotted column, or None if no numeric columns were found.
    """

    # Validate Inputs
    _validate_input_range(fill_range[0], "fill_range_lower", 0, 1)
    _validate_input_range(fill_range[1], "fill_range_upper", 0, 1)
    _validate_input_smaller(fill_range[0], fill_range[1], "fill_range")
    _validate_input_bool(hist, "hist")
    _validate_input_int(bins, "bins")
    _validate_input_range(bins, "bins", 0, data.shape[0])
    _validate_input_bool(showall, "showall")

    # Handle dictionary defaults
    kde_kws = {"alpha": 0.7, "linewidth": 1.5} if kde_kws is None else kde_kws.copy()
    rug_kws = {"color": "brown", "alpha": 0.5, "linewidth": 2, "height": 0.04} if rug_kws is None else rug_kws.copy()
    fill_kws = {"color": "brown", "alpha": 0.1} if fill_kws is None else fill_kws.copy()
    font_kws = {"color": "#111111", "weight": "normal", "size": 11} if font_kws is None else font_kws.copy()

    data = pd.DataFrame(data.copy()).dropna(axis=1, how="all")
    cols = list(data.select_dtypes(include=["number"]).columns)
    data = data[cols]

    if len(cols) == 0:
        print("No columns with numeric data were detected.")

    elif len(cols) > 20 and showall is False:
        print(
            f"Note: The number of numerical features is very large ({len(cols)}), please consider splitting the data. "
            "Showing plots for the first 20 numerical features. Override this by setting showall=True."
        )
        cols = cols[:20]

    # Stays None if there is nothing to plot, so the final return cannot hit an unbound name
    ax = None

    for col in cols:
        dropped_values = data[col].isna().sum()
        if dropped_values > 0:
            col_data = data[col].dropna(axis=0)
            print(f"Dropped {dropped_values} missing values from column {col}.")

        else:
            col_data = data[col]

        _, ax = plt.subplots(figsize=figsize)
        # Note: sns.distplot() is deprecated as of seaborn 0.11
        ax = sns.distplot(
            col_data,
            bins=bins,
            hist=hist,
            rug=True,
            kde_kws=kde_kws,
            rug_kws=rug_kws,
            hist_kws={"alpha": 0.5, "histtype": "step"},
        )

        # Vertical lines and fill
        x, y = ax.lines[0].get_xydata().T
        ax.fill_between(
            x,
            y,
            where=((x >= np.quantile(col_data, fill_range[0])) & (x <= np.quantile(col_data, fill_range[1]))),
            label=f"{fill_range[0]*100:.1f}% - {fill_range[1]*100:.1f}%",
            **fill_kws,
        )

        mean = np.mean(col_data)
        std = scipy.stats.tstd(col_data)
        ax.vlines(x=mean, ymin=0, ymax=np.interp(mean, x, y), ls="dotted", color=mean_color, lw=2, label="mean")
        ax.vlines(
            x=np.median(col_data), ymin=0, ymax=np.interp(np.median(col_data), x, y), ls=":", color=".3", label="median"
        )
        ax.vlines(
            x=[mean - std, mean + std],
            ymin=0,
            ymax=[np.interp(mean - std, x, y), np.interp(mean + std, x, y)],
            ls=":",
            color=".5",
            label="\u03BC \u00B1 \u03C3",
        )

        ax.set_ylim(0,)
        ax.set_xlim(ax.get_xlim()[0] * 1.15, ax.get_xlim()[1] * 1.15)

        # Annotations and legend
        ax.text(0.01, 0.85, f"Mean: {np.round(mean,2)}", fontdict=font_kws, transform=ax.transAxes)
        ax.text(0.01, 0.7, f"Std. dev: {np.round(std,2)}", fontdict=font_kws, transform=ax.transAxes)
        ax.text(
            0.01, 0.55, f"Skew: {np.round(scipy.stats.skew(col_data),2)}", fontdict=font_kws, transform=ax.transAxes
        )
        ax.text(
            0.01,
            0.4,
            f"Kurtosis: {np.round(scipy.stats.kurtosis(col_data),2)}",  # Excess Kurtosis
            fontdict=font_kws,
            transform=ax.transAxes,
        )
        ax.text(0.01, 0.25, f"Count: {np.round(len(col_data))}", fontdict=font_kws, transform=ax.transAxes)
        ax.legend(loc="upper right")

    return ax


# Missing value plot
def missingval_plot(
    data: pd.DataFrame,
    cmap: str = "PuBuGn",
    figsize: Tuple = (20, 20),
    sort: bool = False,
    spine_color: str = "#EEEEEE",
):
    """ Two-dimensional visualization of the missing values in a dataset.

    Parameters
    ----------
    data : pd.DataFrame
        2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column \
    information is used to label the plots
    cmap : str, optional
        Any valid colormap can be used. E.g. 'Greys', 'RdPu'. More information can be found in the matplotlib \
        documentation, by default "PuBuGn"
    figsize : Tuple, optional
        Use to control the figure size, by default (20, 20)
    sort : bool, optional
        Sort columns based on missing values in descending order and drop columns without any missing values, \
        by default False
    spine_color : str, optional
        Set to 'None' to hide the spines on all plots or use any valid matplotlib color argument, by default "#EEEEEE"

    Returns
    -------
    GridSpec
        gs: Figure with array of Axes objects
    """

    # Validate Inputs
    _validate_input_bool(sort, "sort")

    data = pd.DataFrame(data)

    if sort:
        mv_cols_sorted = data.isna().sum(axis=0).sort_values(ascending=False)
        final_cols = mv_cols_sorted.drop(mv_cols_sorted[mv_cols_sorted.values == 0].keys().tolist()).keys().tolist()
        data = data[final_cols]
        print("Displaying only columns with missing values.")

    # Identify missing values
    mv_total, mv_rows, mv_cols, _, mv_cols_ratio = _missing_vals(data).values()
    total_datapoints = data.shape[0] * data.shape[1]

    if mv_total == 0:
        print("No missing values found in the dataset.")
    else:
        # Create figure and axes
        fig = plt.figure(figsize=figsize)
        gs = fig.add_gridspec(nrows=6, ncols=6, left=0.1, wspace=0.05)
        ax1 = fig.add_subplot(gs[:1, :5])
        ax2 = fig.add_subplot(gs[1:, :5])
        ax3 = fig.add_subplot(gs[:1, 5:])
        ax4 = fig.add_subplot(gs[1:, 5:])

        # ax1 - Barplot
        colors = plt.get_cmap(cmap)(mv_cols / np.max(mv_cols))  # color bars by height
        ax1.bar(range(len(mv_cols)), np.round((mv_cols_ratio) * 100, 2), color=colors)
        ax1.get_xaxis().set_visible(False)
        ax1.set(frame_on=False, xlim=(-0.5, len(mv_cols) - 0.5))
        ax1.set_ylim(0, np.max(mv_cols_ratio) * 100)
        ax1.grid(linestyle=":", linewidth=1)
        ax1.yaxis.set_major_formatter(ticker.PercentFormatter(decimals=0))
        ax1.tick_params(axis="y", colors="#111111", length=1)

        # annotate values on top of the bars
        for rect, label in zip(ax1.patches, mv_cols):
            height = rect.get_height()
            ax1.text(
                0.1 + rect.get_x() + rect.get_width() / 2,
                height + 0.5,
                label,
                ha="center",
                va="bottom",
                rotation=90,  # numeric values instead of the strings "90" and "11"
                alpha=0.5,
                fontsize=11,
            )

        ax1.set_frame_on(True)
        for _, spine in ax1.spines.items():
            spine.set_visible(True)
            spine.set_color(spine_color)
        ax1.spines["top"].set_color(None)

        # ax2 - Heatmap
        sns.heatmap(data.isna(), cbar=False, cmap="binary", ax=ax2)
        ax2.set_yticks(np.round(ax2.get_yticks()[0::5], -1))
        ax2.set_yticklabels(ax2.get_yticks())
        ax2.set_xticklabels(ax2.get_xticklabels(), horizontalalignment="center", fontweight="light", fontsize=12)
        ax2.tick_params(length=1, colors="#111111")
        for _, spine in ax2.spines.items():
            spine.set_visible(True)
            spine.set_color(spine_color)

        # ax3 - Summary
        fontax3 = {
            "color": "#111111",
            "weight": "normal",
            "size": 14,
        }
        ax3.get_xaxis().set_visible(False)
        ax3.get_yaxis().set_visible(False)
        ax3.set(frame_on=False)

        ax3.text(
            0.025, 0.875, f"Total: {np.round(total_datapoints/1000,1)}K", transform=ax3.transAxes, fontdict=fontax3
        )
        ax3.text(0.025, 0.675, f"Missing: {np.round(mv_total/1000,1)}K", transform=ax3.transAxes, fontdict=fontax3)
        ax3.text(
            0.025,
            0.475,
            f"Relative: {np.round(mv_total/total_datapoints*100,1)}%",
            transform=ax3.transAxes,
            fontdict=fontax3,
        )
        ax3.text(
            0.025,
            0.275,
            f"Max-col: {np.round(mv_cols.max()/data.shape[0]*100)}%",
            transform=ax3.transAxes,
            fontdict=fontax3,
        )
        ax3.text(
            0.025,
            0.075,
            f"Max-row: {np.round(mv_rows.max()/data.shape[1]*100)}%",
            transform=ax3.transAxes,
            fontdict=fontax3,
        )

        # ax4 - Scatter plot
        ax4.get_yaxis().set_visible(False)
        for _, spine in ax4.spines.items():
            spine.set_color(spine_color)
        ax4.tick_params(axis="x", colors="#111111", length=1)

        ax4.scatter(mv_rows, range(len(mv_rows)), s=mv_rows, c=mv_rows, cmap=cmap, marker=".", vmin=1)
        ax4.set_ylim((0, len(mv_rows))[::-1])  # limit and invert y-axis
        ax4.set_xlim(0, max(mv_rows) + 0.5)
        ax4.grid(linestyle=":", linewidth=1)

        gs.figure.suptitle("Missing value plot", x=0.45, y=0.94, fontsize=18, color="#111111")

        return gs