Completed
Push — master ( 4d243d...e4a84f )
by Rich
01:28
created

Tox21Converter.__init__()   B

Complexity

Conditions 2

Size

Total Lines 37

Duplication

Lines 0
Ratio 0 %

Importance

Changes 1
Bugs 0 Features 1
Metric Value
c 1
b 0
f 1
dl 0
loc 37
rs 8.8571
cc 2
#! /usr/bin/env python
#
# Copyright (C) 2016 Rich Lewis <[email protected]>
# License: 3-clause BSD

"""
## skchem.data.transformers.tox21

Module defining transformation techniques for tox21.
"""

import zipfile
import os
import logging
logger = logging.getLogger(__name__)

Coding Style Naming introduced by
The name logger does not conform to the constant naming conventions ((([A-Z_][A-Z0-9_]*)|(__.*__))$).

This check looks for invalid names for a range of different identifiers.

You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements.

If your project includes a Pylint configuration file, the settings contained in that file take precedence.

To find out more about Pylint, please refer to their site.
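
To see why `logger` is flagged, the pattern quoted in the issue can be tried directly. The following is a minimal sketch, assuming the pattern is applied from the start of the name with `re.match`; the helper `is_valid_constant_name` is a hypothetical name used only for illustration and is not part of Pylint.

```python
import re

# Pattern quoted in the issue above for module-level constants.
CONST_NAME = re.compile(r'(([A-Z_][A-Z0-9_]*)|(__.*__))$')


def is_valid_constant_name(name):
    """Return True if `name` satisfies the constant-naming pattern."""
    return CONST_NAME.match(name) is not None


# Try a few candidate module-level names against the pattern.
for candidate in ('logger', 'LOGGER', '__all__'):
    print(candidate, is_valid_constant_name(candidate))
```

Under this pattern, renaming the module-level logger to `LOGGER` would satisfy the check. Relaxing the pattern in the project's Pylint configuration is the other option the issue text mentions.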

import numpy as np

Configuration introduced by
The import numpy could not be resolved.

This can be caused by one of the following:

1. Missing Dependencies

This error could indicate a configuration issue of Pylint. Make sure that your libraries are available by adding the necessary commands.

# .scrutinizer.yml
before_commands:
    - sudo pip install abc # Python2
    - sudo pip3 install abc # Python3
Tip: We are currently not using virtualenv to run Pylint. When installing your modules, make sure to use the command for the correct Python version.

2. Missing __init__.py files

This error could also result from missing __init__.py files in your module folders. Make sure that you place one file in each sub-folder.

import pandas as pd

Configuration introduced by
The import pandas could not be resolved.

from .base import Converter

from ... import filters

Unused Code introduced by
The import filters seems to be unused.

from ... import io
from ... import core
from ... import standardizers

Unused Code introduced by
The import standardizers seems to be unused.

class Tox21Converter(Converter):

    """ Class to build tox21 dataset.

    """
    def __init__(self, directory, output_directory, output_filename='tox21.h5'):

Bug introduced by
The __init__ method of the super-class Converter is not called.

It is generally advisable to initialize the super-class by calling its __init__ method:

class SomeParent:
    def __init__(self):
        self.x = 1

class SomeChild(SomeParent):
    def __init__(self):
        # Initialize the super class
        SomeParent.__init__(self)

        output_path = os.path.join(output_directory, output_filename)

        # extract data
        train, valid, test = self.extract(directory)

        # read data
        train = self.read_train(train)
        valid = self.read_valid(valid)
        test = self.read_test(test, os.path.join(directory, 'test.txt'))

        # combine into full dataset
        data = pd.concat([train, valid, test], keys=['train', 'valid', 'test']).sort_index()
        data.index.names = 'ds', 'id'

        data = self.standardize(data)
        data = self.filter(data)

        # generate splits
        data = data.reset_index(0)
        split_arr = data['ds'].values

        splits = {}
        for split in 'train', 'valid', 'test':
            idx, = np.nonzero(split_arr == split)
            splits[split] = (min(idx), max(idx))

        data = data.drop('ds', axis=1)

        # get ms and targets together
        ms, y = data.structure, data.drop('structure', axis=1)

Coding Style Naming introduced by
The name ms does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).

Coding Style Naming introduced by
The name y does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).

        y.columns.name = 'tasks'

        # call the Converter to make the final dataset
        self.run(ms, y, output_path, splits=splits, contiguous=True)

    @staticmethod
    def fix_id(s):

Coding Style Naming introduced by
The name s does not conform to the argument naming conventions ([a-z_][a-z0-9_]{2,30}$).

Coding Style introduced by
This method should have a docstring.

The coding style of this project requires that you add a docstring to this code element. Below, you find an example for methods:

class SomeClass:
    def some_method(self):
        """Do x and return foo."""

If you would like to know more about docstrings, we recommend to read PEP-257: Docstring Conventions.
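
Applied to the two static helpers in this listing, the docstring suggestion might look like the sketch below. The function bodies are taken from the reviewed code; the docstring wording and the parameter names `sample_id` and `name` are assumptions added for illustration (the longer names would also satisfy the argument-naming check).

```python
def fix_id(sample_id):
    """Return the part of a sample ID that precedes the first hyphen."""
    return sample_id.split('-')[0]


def fix_assay_name(name):
    """Return the assay name with hyphens replaced by underscores."""
    return name.replace('-', '_')


# The sample ID below appears in the reviewed code; 'NR-AR' is only an
# illustrative assay-style name.
print(fix_id('NCGC00357062-01'))
print(fix_assay_name('NR-AR'))
```

In the reviewed class both functions are declared with `@staticmethod`; they are shown here as plain functions so the sketch runs on its own.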

        return s.split('-')[0]

    @staticmethod
    def fix_assay_name(s):

Coding Style Naming introduced by
The name s does not conform to the argument naming conventions ([a-z_][a-z0-9_]{2,30}$).

Coding Style introduced by
This method should have a docstring.

        return s.replace('-', '_')

    @staticmethod
    def patch_test(test):

Coding Style introduced by
This method should have a docstring.

        test_1 = pd.Series({
            'structure': core.Mol.from_smiles('FC(F)(F)c1[nH]c(c(C#N)c1Br)C1=CC=C(Cl)C=C1', name='NCGC00357062'),

Coding Style introduced by
This line is too long as per the coding-style (113/100).

This check looks for lines that are too long. You can specify the maximum line length.

Bug introduced by
The Class Mol does not seem to have a member named from_smiles.

This check looks for calls to members that are non-existent. These calls will fail.

The member could have been renamed or removed.

            'stochiometry': 0,
            'Compound ID': 'NCGC00357062',
            'Sample ID': 'NCGC00357062-01'}, name='NCGC00357062')
        test['NCGC00357062'] = test_1
        return test

    def read_train(self, train):

Coding Style introduced by
This method should have a docstring.

        train = io.read_sdf(train)
        train.columns = train.columns.to_series().apply(self.fix_assay_name)
        train.index = train.index.to_series().apply(self.fix_id)
        self.assays = train.columns[-12:]
        self.keep_cols = ['structure'] + self.assays.tolist()
        train[self.assays] = train[self.assays].astype(float)
        train = train[self.keep_cols]
        train = train.sort_index()
        ms = train.structure[~train.index.duplicated()]

Coding Style Naming introduced by
The name ms does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).

        train = train[self.assays].groupby(train.index).max()
        train = ms.to_frame().join(train)
        return train

    def read_valid(self, valid):

Coding Style introduced by
This method should have a docstring.

        valid = io.read_sdf(valid)
        valid.columns = valid.columns.to_series().apply(self.fix_assay_name)
        valid = valid[self.keep_cols]
        valid[self.assays] = valid[self.assays].astype(float)
        return valid

    def read_test(self, test, test_data):

Coding Style introduced by
This method should have a docstring.

        test = io.read_sdf(test)
        test = self.patch_test(test)
        test_data = pd.read_table(test_data)
        test_data['Sample ID'] = test_data['Sample ID'].apply(self.fix_id)
        test = test.join(test_data.set_index('Sample ID'))

        test.columns = test.columns.to_series().apply(self.fix_assay_name)
        test = test[self.keep_cols]
        test[test == 'x'] = np.nan
        test[self.assays] = test[self.assays].astype(float)
        return test

    def extract(self, directory):

Coding Style introduced by
This method should have a docstring.

Coding Style introduced by
This method could be written as a function/class method.

If a method does not access any attributes of the class, it could also be implemented as a function or static method. This can help improve readability. For example

class Foo:
    def some_method(self, x, y):
        return x + y;

could be written as

class Foo:
    @classmethod
    def some_method(cls, x, y):
        return x + y;
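
For the `extract` method flagged here, the static-method variant could be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: `ArchiveReader` and `extract_member` are hypothetical names, and passing `path=directory` is a deliberate change from the reviewed code, which extracts into the current working directory.

```python
import os
import zipfile


class ArchiveReader(object):
    """Hypothetical stand-in for a converter class; not part of skchem."""

    @staticmethod
    def extract_member(directory, archive_name, member_name):
        """Extract one member of a zip archive and return its path."""
        archive_path = os.path.join(directory, archive_name)
        with zipfile.ZipFile(archive_path) as archive:
            # ZipFile.extract returns the path of the extracted file.
            # Extracting next to the archive is an assumption made here.
            return archive.extract(member_name, path=directory)
```

Because the method uses neither `self` nor `cls`, it can be called as `ArchiveReader.extract_member(directory, 'train.sdf.zip', 'tox21_10k_data_all.sdf')` without creating an instance.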

        with zipfile.ZipFile(os.path.join(directory, 'train.sdf.zip')) as f:

Coding Style Naming introduced by
The name f does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).

            train = f.extract('tox21_10k_data_all.sdf')

        with zipfile.ZipFile(os.path.join(directory, 'valid.sdf.zip')) as f:

Coding Style Naming introduced by
The name f does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).

            valid = f.extract('tox21_10k_challenge_test.sdf')

        with zipfile.ZipFile(os.path.join(directory, 'test.sdf.zip')) as f:

Coding Style Naming introduced by
The name f does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).

            test = f.extract('tox21_10k_challenge_score.sdf')

        return train, valid, test

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    Tox21Converter.convert()