Passed
Push — master ( 2625ff...cc4c68 )
by Andreas
01:13
created

klib.describe.dist_plot()   C

Complexity

Conditions 10

Size

Total Lines 127
Code Lines 64

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric Value
cc 10
eloc 64
nop 11
dl 0
loc 127
rs 5.3781
c 0
b 0
f 0

How to fix: Long Method, Complexity, Many Parameters

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:
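One refactoring commonly applied to long methods is Extract Method. The sketch below is illustrative only: it uses the standard library instead of NumPy/SciPy, and the function names `summarize`, `format_annotations` and `describe_column` are hypothetical, not part of klib. It shows how a commented block such as "Annotations and legend" could become a small, named function.

```python
import statistics


def summarize(values):
    """Compute the summary statistics that the plot annotates."""
    return {
        'mean': statistics.fmean(values),
        'median': statistics.median(values),
        'std': statistics.stdev(values),  # sample standard deviation
        'count': len(values),
    }


def format_annotations(stats):
    """Build the annotation strings; the old comment became the name."""
    return [
        f"Mean: {stats['mean']:.2f}",
        f"Std. dev: {stats['std']:.2f}",
        f"Count: {stats['count']}",
    ]


def describe_column(values):
    """The former long method now only coordinates the named steps."""
    return format_annotations(summarize(values))
```

Each extracted function can be tested without creating a figure, which is usually the main practical gain.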

Complexity

Complex functions like klib.describe.dist_plot() often do a lot of different things. To break such a function down, we need to identify a cohesive component within it. A common approach to find such a component is to look for parameters or variables that share the same prefixes or suffixes.

Once you have determined the values that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
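In dist_plot(), the arguments kde_kws, rug_kws, fill_kws and font_kws share the `_kws` suffix, which makes them a candidate for Extract Class. A minimal sketch, assuming a hypothetical class name `DistPlotStyle` that is not part of klib; the default values are copied from the "Handle dictionary defaults" block in the source listing:

```python
from dataclasses import dataclass, field


@dataclass
class DistPlotStyle:
    """Hypothetical class grouping the four *_kws arguments of dist_plot()."""
    # default_factory gives every instance its own dict, so no mutable
    # default is shared between calls.
    kde_kws: dict = field(
        default_factory=lambda: {'alpha': 0.7, 'linewidth': 1.5})
    rug_kws: dict = field(
        default_factory=lambda: {'color': 'brown', 'alpha': 0.5,
                                 'linewidth': 2, 'height': 0.04})
    fill_kws: dict = field(
        default_factory=lambda: {'color': 'brown', 'alpha': 0.1})
    font_kws: dict = field(
        default_factory=lambda: {'color': '#111111', 'weight': 'normal',
                                 'size': 11})


# Override one group of settings and keep the defaults for the rest.
style = DistPlotStyle(fill_kws={'color': 'teal', 'alpha': 0.2})
```

With such a class, the four `... if x is None else x.copy()` conditionals would no longer be needed in the function body, which may lower the reported condition count.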

Many Parameters

Methods with many parameters are not only hard to understand, but their parameters also often become inconsistent when you need more or different data.

There are several approaches to avoid long parameter lists:
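One such approach is to introduce a parameter object. The sketch below assumes a hypothetical `DistPlotOptions` class and `columns_to_plot` helper (neither exists in klib) and bundles the scalar options of dist_plot(). The strict `lower < upper` check is an assumption about what `_validate_input_smaller` enforces.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class DistPlotOptions:
    """Hypothetical parameter object; fields mirror dist_plot()'s arguments."""
    mean_color: str = 'orange'
    figsize: Tuple[float, float] = (14, 2)
    fill_range: Tuple[float, float] = (0.025, 0.975)
    hist: bool = False
    bins: Optional[int] = None
    showall: bool = False

    def __post_init__(self):
        # Validate related values in one place instead of in every caller.
        lower, upper = self.fill_range
        if not 0 <= lower < upper <= 1:
            raise ValueError('fill_range must satisfy 0 <= lower < upper <= 1')


def columns_to_plot(cols: Sequence[str],
                    options: Optional[DistPlotOptions] = None) -> list:
    """Apply the 20-plot output limit unless showall is set."""
    options = options or DistPlotOptions()
    return list(cols) if options.showall else list(cols)[:20]
```

A caller would then pass a single `options` argument instead of six separate keyword arguments, and new options could be added without changing the function signature.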

'''
Functions for descriptive analytics.

:author: Andreas Kanz

'''

# Imports
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
import pandas as pd
import scipy.stats  # import the submodule explicitly; a bare "import scipy" does not guarantee scipy.stats is loaded
import seaborn as sns

from .clean import drop_missing
from .utils import _corr_selector
from .utils import _missing_vals
from .utils import _validate_input_bool
from .utils import _validate_input_int
from .utils import _validate_input_range
from .utils import _validate_input_smaller


# Functions

# Categorical Plot
def cat_plot(data, figsize=(10, 14), top=3, bottom=3, bar_color_top='#5ab4ac', bar_color_bottom='#d8b365', cmap='BrBG'):
    '''
    Two-dimensional visualization of the number and frequency of categorical features.

    Parameters
    ----------

    data: 2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column \
    information is used to label the plots.

    figsize: tuple, default (10, 14)
        Use to control the figure size.

    top: int, default 3
        Show the "top" most frequent values in a column.

    bottom: int, default 3
        Show the "bottom" least frequent values in a column.

    bar_color_top: color, default '#5ab4ac'
        Use to control the color of the bars indicating the most common values.

    bar_color_bottom: color, default '#d8b365'
        Use to control the color of the bars indicating the least common values.

    cmap: matplotlib colormap name or object, or list of colors, default 'BrBG'
        The mapping from data values to color space.

    Returns
    -------
    gs: Figure with array of Axes objects.

    '''

    # Validate Inputs
    _validate_input_int(top, 'top')
    _validate_input_int(bottom, 'bottom')
    _validate_input_range(top, 'top', 0, data.shape[1])
    _validate_input_range(bottom, 'bottom', 0, data.shape[1])

    data = pd.DataFrame(data).copy()
    cols = list(data.select_dtypes(exclude=['number']).columns)  # categorical cols
    data = data[cols].applymap(str)

    if len(cols) == 0:
        print('No columns with categorical data were detected.')
        return None  # nothing to plot; a grid with zero columns cannot be created

    fig = plt.figure(figsize=figsize)
    gs = fig.add_gridspec(nrows=6, ncols=len(cols), wspace=0.2)

    for count, col in enumerate(cols):

        n_unique = data[col].nunique(dropna=False)
        value_counts = data[col].value_counts()
        lim_top, lim_bot = top, bottom

        if n_unique < top+bottom:
            lim_top = lim_bot = int(n_unique//2)

        value_counts_top = value_counts[0:lim_top]
        value_counts_idx_top = list(map(str, value_counts_top.index.tolist()))
        value_counts_bot = value_counts[-lim_bot:]
        value_counts_idx_bot = list(map(str, value_counts_bot.index.tolist()))

        if top == 0:
            value_counts_top = value_counts_idx_top = []

        # Separate check (not elif) so that top == 0 and bottom == 0 can both be handled
        if bottom == 0:
            value_counts_bot = value_counts_idx_bot = []

        # Encode top values as 2, bottom values as -2 and everything else as 0.
        # .loc avoids chained assignment, which may silently write to a copy.
        data.loc[data[col].isin(value_counts_idx_top), col] = 2
        data.loc[data[col].isin(value_counts_idx_bot), col] = -2
        data.loc[~((data[col] == 2) | (data[col] == -2)), col] = 0

        # Barcharts
        ax_top = fig.add_subplot(gs[:1, count:count+1])
        ax_top.bar(value_counts_idx_top, value_counts_top, color=bar_color_top, width=0.85)
        ax_top.bar(value_counts_idx_bot, value_counts_bot, color=bar_color_bottom, width=0.85)
        ax_top.set(frame_on=False)
        ax_top.tick_params(axis='x', labelrotation=90)

        # Summary stats
        ax_bottom = fig.add_subplot(gs[1:2, count:count+1])
        ax_bottom.get_yaxis().set_visible(False)
        ax_bottom.get_xaxis().set_visible(False)
        ax_bottom.set(frame_on=False)
        ax_bottom.text(0, 0, f'Unique values: {n_unique}\n\n'
                       f'Top {top} vals: {sum(value_counts_top)} ({sum(value_counts_top)/data.shape[0]*100:.1f}%)\n'
                       f'Bottom {bottom} vals: {sum(value_counts_bot)} ' +
                       f'({sum(value_counts_bot)/data.shape[0]*100:.1f}%)',
                       transform=ax_bottom.transAxes, color='#111111', fontsize=11)

    # Heatmap
    data = data.astype('int')
    ax_hm = fig.add_subplot(gs[2:, :])
    sns.heatmap(data, cmap=cmap, cbar=False, vmin=-4.25, vmax=4.25, ax=ax_hm)
    ax_hm.set_yticks(np.round(ax_hm.get_yticks()[0::5], -1))
    ax_hm.set_yticklabels(ax_hm.get_yticks())
    ax_hm.set_xticklabels(ax_hm.get_xticklabels(),
                          horizontalalignment='center',
                          fontweight='light',
                          fontsize='medium')
    ax_hm.tick_params(length=1, colors='#111111')

    gs.figure.suptitle('Categorical data plot', x=0.47, y=0.925, fontsize=18, color='#111111')

    return gs


# Correlation Matrix
def corr_mat(data, split=None, threshold=0, method='pearson'):
    '''
    Returns a color-encoded correlation matrix.

    Parameters
    ----------

    data: 2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column \
    information is used to label the plots.

    split: {None, 'pos', 'neg', 'high', 'low'}, default None
        Type of split to be performed.

    threshold: float, default 0
        Value between 0 <= threshold <= 1

    method: {'pearson', 'spearman', 'kendall'}, default 'pearson'
        * pearson: measures linear relationships and requires normally distributed and homoscedastic data.
        * spearman: ranked/ordinal correlation, measures monotonic relationships.
        * kendall: ranked/ordinal correlation, measures monotonic relationships. Computationally more expensive but
                    more robust in smaller datasets than 'spearman'.

    Returns
    -------
    Pandas Styler object

    '''

    # Validate Inputs
    _validate_input_range(threshold, 'threshold', -1, 1)

    def color_negative_red(val):
        color = '#FF3344' if val < 0 else None
        return 'color: %s' % color

    data = pd.DataFrame(data)
    corr = data.corr(method=method)

    corr = _corr_selector(corr, split=split, threshold=threshold)

    return corr.style.applymap(color_negative_red).format("{:.2f}", na_rep='-')


# Correlation matrix / heatmap
def corr_plot(data, split=None, threshold=0, target=None, method='pearson', cmap='BrBG', figsize=(12, 10), annot=True,
              dev=False, **kwargs):
    '''
    Two-dimensional visualization of the correlation between feature-columns, excluding NA values.

    Parameters
    ----------
    data: 2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column \
    information is used to label the plots.

    split: {None, 'pos', 'neg', 'high', 'low'}, default None
        Type of split to be performed.

        * None: visualize all correlations between the feature-columns.
        * pos: visualize all positive correlations between the feature-columns above the threshold.
        * neg: visualize all negative correlations between the feature-columns below the threshold.
        * high: visualize all correlations between the feature-columns for which abs(corr) > threshold is True.
        * low: visualize all correlations between the feature-columns for which abs(corr) < threshold is True.

    threshold: float, default 0
        Value between 0 <= threshold <= 1

    target: string, list, np.array or pd.Series, default None
        Specify target for correlation. E.g. label column to generate only the correlations between each feature \
        and the label.

    method: {'pearson', 'spearman', 'kendall'}, default 'pearson'
        * pearson: measures linear relationships and requires normally distributed and homoscedastic data.
        * spearman: ranked/ordinal correlation, measures monotonic relationships.
        * kendall: ranked/ordinal correlation, measures monotonic relationships. Computationally more expensive but
                   more robust in smaller datasets than 'spearman'.

    cmap: matplotlib colormap name or object, or list of colors, default 'BrBG'
        The mapping from data values to color space.

    figsize: tuple, default (12, 10)
        Use to control the figure size.

    annot: bool, default True
        Use to show or hide annotations.

    dev: bool, default False
        Display figure settings in the plot by setting dev = True. If False, the settings are not displayed.

    **kwargs: optional
        Additional elements to control the visualization of the plot, e.g.:

        * mask: bool, default True
        If set to False the entire correlation matrix, including the upper triangle is shown. Set dev = False in this \
        case to avoid overlap.
        * vmax: float, default is calculated from the given correlation coefficients.
        Value between -1 or vmin <= vmax <= 1, limits the range of the colorbar.
        * vmin: float, default is calculated from the given correlation coefficients.
        Value between -1 <= vmin <= 1 or vmax, limits the range of the colorbar.
        * linewidths: float, default 0.5
        Controls the line-width in between the squares.
        * annot_kws: dict, default {'size' : 10}
        Controls the font size of the annotations. Only available when annot = True.
        * cbar_kws: dict, default {'shrink': .95, 'aspect': 30}
        Controls the size of the colorbar.
        * Many more kwargs are available, e.g. 'alpha' to control blending, or options to adjust labels, ticks ...

        Kwargs can be supplied through a dictionary of key-value pairs (see above).

    Returns
    -------
    ax: matplotlib Axes
        Returns the Axes object with the plot for further tweaking.

    '''

    # Validate Inputs
    _validate_input_range(threshold, 'threshold', -1, 1)
    _validate_input_bool(annot, 'annot')
    _validate_input_bool(dev, 'dev')

    data = pd.DataFrame(data)

    # Obtain correlations
    if isinstance(target, (str, list, pd.Series, np.ndarray)):
        target_data = []
        if isinstance(target, str):
            target_data = data[target]
            data = data.drop(target, axis=1)

        elif isinstance(target, (list, pd.Series, np.ndarray)):
            target_data = pd.Series(target)

        corr = pd.DataFrame(data.corrwith(target_data)).rename_axis(target, axis=1)
        corr = _corr_selector(corr, split=split, threshold=threshold)
        corr = corr.sort_values(corr.columns[0], ascending=False)
        vmax = np.round(np.nanmax(corr)-0.05, 2)
        vmin = np.round(np.nanmin(corr)+0.05, 2)
        mask = False
        square = False

    else:
        corr = corr_mat(data, split=split, threshold=threshold, method=method).data

        # Generate mask for the upper triangle (builtin bool; the np.bool alias was removed from NumPy)
        mask = np.triu(np.ones_like(corr, dtype=bool))
        square = True
        vmax = np.round(np.nanmax(corr.where(~mask))-0.05, 2)
        vmin = np.round(np.nanmin(corr.where(~mask))+0.05, 2)

    fig, ax = plt.subplots(figsize=figsize)

    # Specify kwargs for the heatmap
    kwargs = {'mask': mask,
              'cmap': cmap,
              'annot': annot,
              'vmax': vmax,
              'vmin': vmin,
              'linewidths': .5,
              'annot_kws': {'size': 10},
              'cbar_kws': {'shrink': .95, 'aspect': 30},
              **kwargs}

    # Draw heatmap with mask and some default settings
    sns.heatmap(corr,
                center=0,
                square=square,
                fmt='.2f',
                **kwargs
                )

    ax.set_title(f'Feature-correlation ({method})', fontdict={'fontsize': 18})

    # Display settings
    if dev:
        fig.suptitle(f"\
            Settings (dev-mode): \n\
            - split-mode: {split} \n\
            - threshold: {threshold} \n\
            - method: {method} \n\
            - annotations: {annot} \n\
            - cbar: \n\
                - vmax: {vmax} \n\
                - vmin: {vmin} \n\
            - linewidths: {kwargs['linewidths']} \n\
            - annot_kws: {kwargs['annot_kws']} \n\
            - cbar_kws: {kwargs['cbar_kws']}",
                     fontsize=12,
                     color='gray',
                     x=0.35,
                     y=0.85,
                     ha='left')

    return ax


# Distribution plot
def dist_plot(data, mean_color='orange', figsize=(14, 2), fill_range=(0.025, 0.975), hist=False, bins=None,
              showall=False, kde_kws=None, rug_kws=None, fill_kws=None, font_kws=None):
    '''
    Two-dimensional visualization of the distribution of numerical features.

    Parameters
    ----------
    data: 2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column \
    information is used to label the plots.

    mean_color: color, default 'orange'
        Color of the vertical line indicating the mean of the data.

    figsize: tuple, default (14, 2)
        Use to control the figure size.

    fill_range: tuple, default (0.025, 0.975)
        Use to set the quantiles for shading. Default spans 95% of the data, which is about two std. deviations\
        above and below the mean.

    hist: bool, default False
        Set to True to display histogram bars in the plot.

    bins: integer, default None
        Specification of the number of hist bins. Requires hist = True

    showall: bool, default False
        Set to True to remove the output limit of 20 plots.

    kde_kws: dict, default {'alpha': 0.7, 'linewidth': 1.5}
        Keyword arguments for kdeplot().

    rug_kws: dict, default {'color': 'brown', 'alpha': 0.5, 'linewidth': 2, 'height': 0.04}
        Keyword arguments for rugplot().

    fill_kws: dict, default {'color': 'brown', 'alpha': 0.1}
        Keyword arguments to control the fill.

    font_kws: dict, default {'color':  '#111111', 'weight': 'normal', 'size': 11}
        Keyword arguments to control the font.

    Returns
    -------
    ax: matplotlib Axes
        Returns the Axes object with the plot for further tweaking.

    '''

    # Validate Inputs
    _validate_input_bool(hist, 'hist')
    _validate_input_bool(showall, 'showall')
    _validate_input_range(fill_range[0], 'fill_range_lower', 0, 1)
    _validate_input_range(fill_range[1], 'fill_range_upper', 0, 1)
    _validate_input_smaller(fill_range[0], fill_range[1], 'fill_range')

    # Handle dictionary defaults
    kde_kws = {'alpha': 0.7, 'linewidth': 1.5} if kde_kws is None else kde_kws.copy()
    rug_kws = {'color': 'brown', 'alpha': 0.5, 'linewidth': 2, 'height': 0.04} if rug_kws is None else rug_kws.copy()
    fill_kws = {'color': 'brown', 'alpha': 0.1} if fill_kws is None else fill_kws.copy()
    font_kws = {'color':  '#111111', 'weight': 'normal', 'size': 11} if font_kws is None else font_kws.copy()

    data = drop_missing(pd.DataFrame(data).copy())  # remove empty columns / rows
    cols = list(data.select_dtypes(include=['number']).columns)
    data = data[cols]

    ax = None  # returned as-is if there are no numeric columns to plot

    if len(cols) == 0:
        print('No columns with numeric data were detected.')

    elif len(cols) >= 20 and showall is False:
        print(
            f'Note: The number of numerical features is very large ({len(cols)}), please consider splitting the data. '
            'Showing plots for the first 20 numerical features. Override this by setting showall=True.')
        cols = cols[:20]

    for col in cols:
        dropped_values = data[col].isna().sum()
        if dropped_values > 0:
            print(f'Dropped {dropped_values} missing values from column {col}.')
            col_data = data[col].dropna(axis=0)
        else:
            col_data = data[col]

        _, ax = plt.subplots(figsize=figsize)
        ax = sns.distplot(col_data, bins=bins, hist=hist, rug=True, kde_kws=kde_kws,
                          rug_kws=rug_kws, hist_kws={'alpha': 0.5, 'histtype': 'step'})

        # Vertical lines and fill
        x, y = ax.lines[0].get_xydata().T
        ax.fill_between(x, y,
                        where=(
                            (x >= np.quantile(col_data, fill_range[0])) &
                            (x <= np.quantile(col_data, fill_range[1]))),
                        label=f'{fill_range[0]*100:.1f}% - {fill_range[1]*100:.1f}%',
                        **fill_kws)

        mean = np.mean(col_data)
        std = scipy.stats.tstd(col_data)
        ax.vlines(x=mean,
                  ymin=0,
                  ymax=np.interp(mean, x, y),
                  ls='dotted', color=mean_color, lw=2, label='mean')
        ax.vlines(x=np.median(col_data),
                  ymin=0,
                  ymax=np.interp(np.median(col_data), x, y),
                  ls=':', color='.3', label='median')
        ax.vlines(x=[mean-std, mean+std],
                  ymin=0,
                  ymax=[np.interp(mean-std, x, y), np.interp(mean+std, x, y)], ls=':', color='.5',
                  label='\u03BC \u00B1 \u03C3')

        ax.set_ylim(0,)
        ax.set_xlim(ax.get_xlim()[0]*1.15, ax.get_xlim()[1]*1.15)

        # Annotations and legend
        ax.text(0.01, 0.85, f'Mean: {np.round(mean,2)}',
                fontdict=font_kws, transform=ax.transAxes)
        ax.text(0.01, 0.7, f'Std. dev: {np.round(std,2)}',
                fontdict=font_kws, transform=ax.transAxes)
        ax.text(0.01, 0.55, f'Skew: {np.round(scipy.stats.skew(col_data),2)}',
                fontdict=font_kws, transform=ax.transAxes)
        ax.text(0.01, 0.4, f'Kurtosis: {np.round(scipy.stats.kurtosis(col_data),2)}',  # Excess Kurtosis
                fontdict=font_kws, transform=ax.transAxes)
        ax.text(0.01, 0.25, f'Count: {np.round(len(col_data))}',
                fontdict=font_kws, transform=ax.transAxes)
        ax.legend(loc='upper right')

    return ax


# Missing value plot
def missingval_plot(data, cmap='PuBuGn', figsize=(12, 12), sort=False, spine_color='#EEEEEE'):
    '''
    Two-dimensional visualization of the missing values in a dataset.

    Parameters
    ----------
    data: 2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column \
    information is used to label the plots.

    cmap: colormap, default 'PuBuGn'
        Any valid colormap can be used. E.g. 'Greys', 'RdPu'. More information can be found in the matplotlib \
        documentation.

    figsize: tuple, default (12, 12)
        Use to control the figure size.

    sort: bool, default False
        Sort columns based on missing values in descending order and drop columns without any missing values

    spine_color: color, default '#EEEEEE'
        Set to 'None' to hide the spines on all plots or use any valid matplotlib color argument.

    Returns
    -------
    gs: Figure with array of Axes objects.

    '''

    # Validate Inputs
    _validate_input_bool(sort, 'sort')

    data = pd.DataFrame(data)

    if sort:
        mv_cols_sorted = data.isna().sum(axis=0).sort_values(ascending=False)
        final_cols = mv_cols_sorted.drop(mv_cols_sorted[mv_cols_sorted.values == 0].keys().tolist()).keys().tolist()
        data = data[final_cols]
        print('Displaying only columns with missing values.')

    # Identify missing values
    mv_total, mv_rows, mv_cols, _, mv_cols_ratio = _missing_vals(data).values()
    total_datapoints = data.shape[0]*data.shape[1]

    if mv_total == 0:
        print('No missing values found in the dataset.')
    else:
        # Create figure and axes
        fig = plt.figure(figsize=figsize)
        gs = fig.add_gridspec(nrows=6, ncols=6, left=0.05, wspace=0.05)
        ax1 = fig.add_subplot(gs[:1, :5])
        ax2 = fig.add_subplot(gs[1:, :5])
        ax3 = fig.add_subplot(gs[:1, 5:])
        ax4 = fig.add_subplot(gs[1:, 5:])

        # ax1 - Barplot
        colors = plt.get_cmap(cmap)(mv_cols / np.max(mv_cols))  # color bars by height
        ax1.bar(range(len(mv_cols)), np.round((mv_cols_ratio)*100, 2), color=colors)
        ax1.get_xaxis().set_visible(False)
        ax1.set(frame_on=False, xlim=(-.5, len(mv_cols)-0.5))
        ax1.set_ylim(0, np.max(mv_cols_ratio)*100)
        ax1.grid(linestyle=':', linewidth=1)
        ax1.yaxis.set_major_formatter(ticker.PercentFormatter(decimals=0))
        ax1.tick_params(axis='y', colors='#111111', length=1)

        # annotate values on top of the bars
        for rect, label in zip(ax1.patches, mv_cols):
            height = rect.get_height()
            ax1.text(.1 + rect.get_x() + rect.get_width() / 2, height+0.5, label,
                     ha='center',
                     va='bottom',
                     rotation=90,  # numeric angle in degrees rather than the string '90'
                     alpha=0.5,
                     fontsize='small')

        ax1.set_frame_on(True)
        for _, spine in ax1.spines.items():
            spine.set_visible(True)
            spine.set_color(spine_color)
        ax1.spines['top'].set_color(None)

        # ax2 - Heatmap
        sns.heatmap(data.isna(), cbar=False, cmap='binary', ax=ax2)
        ax2.set_yticks(np.round(ax2.get_yticks()[0::5], -1))
        ax2.set_yticklabels(ax2.get_yticks())
        ax2.set_xticklabels(
            ax2.get_xticklabels(),
            horizontalalignment='center',
            fontweight='light',
            fontsize='medium')
        ax2.tick_params(length=1, colors='#111111')
        for _, spine in ax2.spines.items():
            spine.set_visible(True)
            spine.set_color(spine_color)

        # ax3 - Summary
        fontax3 = {'color':  '#111111',
                   'weight': 'normal',
                   'size': 12,
                   }
        ax3.get_xaxis().set_visible(False)
        ax3.get_yaxis().set_visible(False)
        ax3.set(frame_on=False)

        ax3.text(0.1, 0.9, f"Total: {np.round(total_datapoints/1000,1)}K",
                 transform=ax3.transAxes,
                 fontdict=fontax3)
        ax3.text(0.1, 0.7, f"Missing: {np.round(mv_total/1000,1)}K",
                 transform=ax3.transAxes,
                 fontdict=fontax3)
        ax3.text(0.1, 0.5, f"Relative: {np.round(mv_total/total_datapoints*100,1)}%",
                 transform=ax3.transAxes,
                 fontdict=fontax3)
        ax3.text(0.1, 0.3, f"Max-col: {np.round(mv_cols.max()/data.shape[0]*100)}%",
                 transform=ax3.transAxes,
                 fontdict=fontax3)
        ax3.text(0.1, 0.1, f"Max-row: {np.round(mv_rows.max()/data.shape[1]*100)}%",
                 transform=ax3.transAxes,
                 fontdict=fontax3)

        # ax4 - Scatter plot
        ax4.get_yaxis().set_visible(False)
        for _, spine in ax4.spines.items():
            spine.set_color(spine_color)
        ax4.tick_params(axis='x', colors='#111111', length=1)

        ax4.scatter(mv_rows, range(len(mv_rows)), s=mv_rows, c=mv_rows, cmap=cmap, marker=".", vmin=1)
        ax4.set_ylim((0, len(mv_rows))[::-1])  # limit and invert y-axis
        ax4.set_xlim(0, max(mv_rows)+0.5)
        ax4.grid(linestyle=':', linewidth=1)

        gs.figure.suptitle('Missing value plot', x=0.45, y=0.94, fontsize=18, color='#111111')

        return gs