Completed
Push — master ( 4d243d...e4a84f )
by Rich
01:28
created

Converter.process_splits()   B

Complexity: Conditions 6
Size: Total Lines 21
Duplication: Lines 0, Ratio 0 %
Importance: Changes 1, Bugs 0, Features 1

Metric Value
c 1
b 0
f 1
dl 0
loc 21
rs 7.8867
cc 6
1
#! /usr/bin/env python
Coding Style introduced by
This module should have a docstring.

The coding style of this project requires that you add a docstring to this code element. Below, you find an example for methods:

class SomeClass:
    def some_method(self):
        """Do x and return foo."""

If you would like to know more about docstrings, we recommend reading PEP 257: Docstring Conventions.

2
#
3
# Copyright (C) 2016 Rich Lewis <[email protected]>
4
# License: 3-clause BSD
5
6
import warnings
7
import logging
8
import os
9
import functools
10
from collections import namedtuple
11
12
import numpy as np
Configuration introduced by
The import numpy could not be resolved.

This can be caused by one of the following:

1. Missing Dependencies

This error could indicate a configuration issue of Pylint. Make sure that your libraries are available by adding the necessary commands.

# .scrutinizer.yml
before_commands:
    - sudo pip install abc # Python2
    - sudo pip3 install abc # Python3
Tip: We do not currently use virtualenv to run Pylint. When installing your modules, make sure to use the command for the correct Python version.

2. Missing __init__.py files

This error could also result from missing __init__.py files in your module folders. Make sure that you place one file in each sub-folder.

13
import pandas as pd
Configuration introduced by
The import pandas could not be resolved.

14
15
import h5py
Configuration introduced by
The import h5py could not be resolved.

16
from fuel.datasets import H5PYDataset
Configuration introduced by
The import fuel.datasets could not be resolved.

17
18
from ... import filters
19
from ... import descriptors
20
from ... import cross_validation
21
from ... import standardizers
22
23
logger = logging.getLogger(__name__)
Coding Style Naming introduced by
The name logger does not conform to the constant naming conventions ((([A-Z_][A-Z0-9_]*)|(__.*__))$).

This check looks for invalid names for a range of different identifiers.

You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements.

If your project includes a Pylint configuration file, the settings contained in that file take precedence.

To find out more about Pylint, please refer to their site.
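
As a rough illustration of how these patterns behave, the sketch below tests a few names from this file against the two regular expressions quoted in the naming messages on this page. The helper `conforms` is hypothetical, and matching from the start of the name is an assumption about how the checker applies its patterns.

```python
import re

# Patterns copied from the naming messages quoted on this page.
CONST_RGX = re.compile(r"(([A-Z_][A-Z0-9_]*)|(__.*__))$")
ARG_RGX = re.compile(r"[a-z_][a-z0-9_]{2,30}$")


def conforms(pattern, name):
    """Return True if the whole name matches the pattern (hypothetical helper)."""
    # Assumption: the pattern is matched from the start of the name.
    return pattern.match(name) is not None


# Module-level names are checked against the constant pattern.
CONSTANT_RESULTS = {name: conforms(CONST_RGX, name)
                    for name in ("logger", "LOGGER", "DEFAULT_FEATURES")}

# Arguments and local variables need at least three characters.
ARGUMENT_RESULTS = {name: conforms(ARG_RGX, name)
                    for name in ("ms", "y", "cv", "mols", "output_path")}
```

Under these patterns, short domain names such as `ms`, `y` and `cv` fail only because of the minimum length, which is why a project-level Pylint configuration is often used to relax the expression instead of renaming.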

24
25
26
Feature = namedtuple('Feature', ['fper', 'key', 'axis_names'])
27
28
DEFAULT_FEATURES = (
29
    Feature(fper=descriptors.MorganFingerprinter(),
30
            key='X_morg',
31
            axis_names=['batch', 'features']),
32
    Feature(fper=descriptors.PhysicochemicalFingerprinter(),
33
            key='X_pc',
34
            axis_names=['batch', 'features']),
35
    Feature(fper=descriptors.AtomFeatureCalculator(),
36
            key='A',
37
            axis_names=['batch', 'atom_idx', 'features']),
38
    Feature(fper=descriptors.GraphDistanceCalculator(),
39
            key='G',
40
            axis_names=['batch', 'atom_idx', 'atom_idx']))
41
42
Filter = namedtuple('Filter', ['filter', 'kwargs'])
43
44
DEFAULT_FILTERS = (
45
    Filter(filters.is_organic, {}),
46
    Filter(filters.n_atoms, {'above': 5, 'below': 75}),
47
    Filter(filters.mass, {'below': 1000})
48
)
49
50
DEFAULT_STANDARDIZER = standardizers.ChemAxonStandardizer(keep_failed=True)
51
52
class Converter(object):
53
    """ Create a fuel dataset from molecules and targets.
54
55
    Args:
56
        ms (pd.Series):
57
            The molecules of the dataset.
58
        ys (pd.Series or pd.DataFrame):
59
            The target labels of the dataset.
60
        output_path (str):
61
            The path to which the dataset should be saved.
62
        features (list[Feature]):
63
            The features to calculate. Defaults are provided.
64
        splits (dict):
65
            A dictionary of different splits provided.
66
            The keys should be the split name, and values an array of indices.
67
            Alternatively, if `contiguous` is `True`, the keys should be
68
            the split name, and the values a tuple of start and stop.
69
            If `None`, use `skchem.cross_validation.SimThresholdSplit`
70
    """
71
72
73
    def __init__(self, directory, output_directory, output_filename='default.h5'):
Unused Code introduced by
The argument directory seems to be unused.
Unused Code introduced by
The argument output_filename seems to be unused.
Unused Code introduced by
The argument output_directory seems to be unused.
74
        raise NotImplemented
Best Practice introduced by
NotImplemented raised - should raise NotImplementedError
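
A minimal sketch of why this matters: `NotImplemented` is a constant intended as a return value for binary special methods, not an exception class, so raising it produces a `TypeError` rather than the intended signal. The function names below are hypothetical and only illustrate the difference.

```python
def broken_init():
    """Mimic the flagged line: raise the NotImplemented constant."""
    raise NotImplemented  # not an exception, so Python raises TypeError instead


def fixed_init():
    """Raise the exception class that the message recommends."""
    raise NotImplementedError("subclasses are expected to provide this")


def raised_type(func):
    """Call func and return the type of the exception it raises, if any."""
    try:
        func()
    except Exception as exc:  # broad on purpose: we only inspect the type
        return type(exc)
    return None


BROKEN_TYPE = raised_type(broken_init)
FIXED_TYPE = raised_type(fixed_init)
```

A caller that catches `NotImplementedError` to detect an abstract converter would therefore miss the error raised by the flagged line.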
75
76
    def run(self, ms, y, output_path,
Coding Style Naming introduced by
The name ms does not conform to the argument naming conventions ([a-z_][a-z0-9_]{2,30}$).

Coding Style Naming introduced by
The name y does not conform to the argument naming conventions ([a-z_][a-z0-9_]{2,30}$).

Coding Style introduced by
This method should have a docstring.

best-practice introduced by
Too many arguments (7/5)
77
                features=DEFAULT_FEATURES, splits=None, contiguous=False):
Coding Style introduced by
Wrong continued indentation.
features=DEFAULT_FEATURES, splits=None, contiguous=False):
| ^
78
79
        self.contiguous = contiguous
Coding Style introduced by
The attribute contiguous was defined outside __init__.

It is generally a good practice to initialize all attributes to default values in the __init__ method:

class Foo:
    def __init__(self, x=None):
        self.x = x
80
        self.output_path = output_path
Coding Style introduced by
The attribute output_path was defined outside __init__.

81
        self.features = features
Coding Style introduced by
The attribute features was defined outside __init__.

82
        self.feature_names = [feat.key for feat in self.features] + ['y']
Coding Style introduced by
The attribute feature_names was defined outside __init__.

83
84
        self.create_file(output_path)
85
86
        if not splits:
87
            splits, idx = self.create_splits(ms)
88
            ms, y = ms.ix[idx], y.ix[idx]
89
90
        split_dict = self.process_splits(splits)
91
92
        self.save_splits(split_dict)
93
        self.save_molecules(ms)
94
        self.save_targets(y)
95
        self.save_features(ms)
96
97
    def create_file(self, path):
Coding Style introduced by
This method should have a docstring.

98
        logger.info('Creating h5 file at %s...', self.output_path)
99
        self.data_file = h5py.File(path, 'w')
Coding Style introduced by
The attribute data_file was defined outside __init__.

100
        return self.data_file
101
102
    def filter(self, data, filters=DEFAULT_FILTERS):
Comprehensibility Bug introduced by
filters is re-defining a name which is already available in the outer-scope (previously defined on line 18).

It is generally a bad practice to shadow variables from the outer-scope. In most cases, this is done unintentionally and might lead to unexpected behavior:

param = 5

class Foo:
    def __init__(self, param):   # "param" would be flagged here
        self.param = param
Coding Style introduced by
This method could be written as a function/class method.

If a method does not access any attributes of the class, it could also be implemented as a function or static method. This can help improve readability. For example:

class Foo:
    def some_method(self, x, y):
        return x + y;

could be written as

class Foo:
    @classmethod
    def some_method(cls, x, y):
        return x + y;
103
104
        """ Filter the compounds according to the usual filters. """
105
        logger.info('Filtering %s compounds', len(data))
106
        if isinstance(data, pd.DataFrame):
107
            ms = data.structure
Coding Style Naming introduced by
The name ms does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).

108
        else:
109
            ms = data
Coding Style Naming introduced by
The name ms does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).

110
        filt = functools.reduce(lambda a, b: a & b, (ms.apply(filt.filter, **filt.kwargs) for filt in filters))
Coding Style introduced by
This line is too long as per the coding-style (111/100).

This check looks for lines that are too long. You can specify the maximum line length.
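
The flagged statement packs two steps into one line: applying each filter with its keyword arguments, then AND-ing the resulting masks with `functools.reduce`. The sketch below separates those steps using plain Python lists instead of pandas objects. The `in_range` filter, its exclusive bounds, and the sample sizes are assumptions made for illustration and are not the project's `filters` module.

```python
import functools
from collections import namedtuple

# Same shape as the Filter namedtuple in the listing.
Filter = namedtuple('Filter', ['filter', 'kwargs'])


def in_range(value, above=None, below=None):
    """Hypothetical stand-in filter: keep values strictly inside the bounds."""
    if above is not None and value <= above:
        return False
    if below is not None and value >= below:
        return False
    return True


def combined_mask(values, filters):
    """Apply every filter to every value and AND the per-filter masks."""
    # One boolean list per filter, built lazily.
    masks = ([filt.filter(value, **filt.kwargs) for value in values]
             for filt in filters)
    # Element-wise AND of all masks, the list equivalent of `a & b`.
    return functools.reduce(
        lambda left, right: [a and b for a, b in zip(left, right)], masks)


SIZES = [3, 10, 80, 50]
FILTERS = (Filter(in_range, {'above': 5, 'below': 75}),
           Filter(in_range, {'below': 60}))
MASK = combined_mask(SIZES, FILTERS)
```

Naming the intermediate `masks` generator is one way to bring such a statement under the line-length limit without changing what it computes.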

Coding Style introduced by
Usage of * or ** arguments should usually be done with care.

Generally, there is nothing wrong with using * or ** arguments. For the readability of the code base, however, we suggest not overusing these language constructs.

For more information, we recommend this blog post by Ned Batchelder, including its comments, which also touches on this aspect.

111
        logger.info('Filtered out %s compounds', (~filt).sum())
112
113
        return data[filt]
114
115
116
    def standardize(self, data, standardizer=DEFAULT_STANDARDIZER):
Coding Style introduced by
This method could be written as a function/class method.

117
118
        """ Standardize the compounds. """
119
        logger.info('Standardizing %s compounds', len(data))
120
        return standardizer.transform(data)
121
122
123
    def save_molecules(self, mols):
124
125
        """ Save the molecules to the data file. """
126
127
        logger.info('Writing molecules to file...')
128
        logger.debug('Writing %s molecules to %s', len(mols), self.data_file.filename)
129
        with warnings.catch_warnings():
130
            warnings.simplefilter('ignore')
131
            mols.to_hdf(self.data_file.filename, 'structure')
132
            mols.apply(lambda m: m.to_smiles().encode('utf-8')).to_hdf(self.data_file.filename, 'smiles')
Coding Style introduced by
This line is too long as per the coding-style (105/100).

133
134
    def save_targets(self, y):
Coding Style Naming introduced by
The name y does not conform to the argument naming conventions ([a-z_][a-z0-9_]{2,30}$).

135
136
        """ Save the targets to the data file. """
137
        y_name = getattr(y, 'name', None)
138
        if not y_name:
139
            y_name = getattr(y.columns, 'name', None)
140
        if not y_name:
141
            y_name = 'targets'
142
143
        logger.info('Writing %s', y_name)
144
        logger.debug('Writing targets of shape %s to %s', y.shape, self.data_file.filename)
145
146
        with warnings.catch_warnings():
147
            warnings.simplefilter('ignore')
148
            y.to_hdf(self.data_file.filename, '/targets/' + y_name)
149
150
        if isinstance(y, pd.Series):
151
            self.data_file['y'] = h5py.SoftLink('/targets/{}/values'.format(y_name))
152
            self.data_file['y'].dims[0].label = 'batch'
153
154
        elif isinstance(y, pd.DataFrame):
155
            self.data_file['y'] = h5py.SoftLink('/targets/{}/block0_values'.format(y_name))
156
            self.data_file['y'].dims[0].label = 'batch'
157
            self.data_file['y'].dims[1].label = 'task'
158
159
    def save_features(self, ms):
Coding Style Naming introduced by
The name ms does not conform to the argument naming conventions ([a-z_][a-z0-9_]{2,30}$).

160
161
        """ Save all features for the dataset. """
162
        logger.debug('Saving features')
163
        for feat in self.features:
164
            self._save_feature(ms, feat)
165
166
    def _save_feature(self, ms, feat):
Coding Style Naming introduced by
The name ms does not conform to the argument naming conventions ([a-z_][a-z0-9_]{2,30}$).

167
168
        """ Calculate and save a feature to the data file. """
169
        logger.info('Calculating %s', feat.key)
170
171
        fps = feat.fper.transform(ms)
172
        if len(feat.axis_names) > 2:
173
            fps = fps.transpose(2, 1, 0) # panel serialize backwards for some reason...
174
        logger.debug('Writing features with shape %s to %s', fps.shape, self.data_file.filename)
175
        with warnings.catch_warnings():
176
            warnings.simplefilter('ignore')
177
            fps.to_hdf(self.data_file.filename, 'features/{}'.format(feat.key))
178
        self.data_file[feat.key] = h5py.SoftLink('/features/{}/block0_values'.format(feat.key))
179
        self.data_file[feat.key].dims[0].label = feat.axis_names[0]
180
        self.data_file[feat.key].dims[1].label = feat.axis_names[1]
181
        if len(feat.axis_names) > 2:
182
            self.data_file[feat.key].dims[2].label = feat.axis_names[2]
183
184
    def create_splits(self, ms, contiguous=True):
Coding Style Naming introduced by
The name ms does not conform to the argument naming conventions ([a-z_][a-z0-9_]{2,30}$).

Unused Code introduced by
The argument contiguous seems to be unused.
185
186
        """ Create a split dict for fuel from mols, using SimThresholdSplit.
187
188
        Args:
189
            ms (pd.Series):
190
                The molecules to use to design the splits.
191
            contiguous (bool):
192
                Whether the split should be contiguous.  This allows for more
193
                efficient loading times.  This is usually appropriate if
194
                there are no other splits for the dataset, and will reorder
195
                the dataset.
196
        Returns:
197
            (dict, idx)
198
                The split dict, and the index to align the data with.
199
        """
200
201
        logger.info('Creating Similarity Threshold splits...')
202
        cv = cross_validation.SimThresholdSplit(ms, memory_optimized=True)
Coding Style Naming introduced by
The name cv does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).

203
        train, valid, test = cv.split((70, 15, 15))
204
205
        def bool_to_index(ser):
Coding Style introduced by
This function should have a docstring.

206
            return np.nonzero(ser.values)[0]
207
208
        if self.contiguous:
209
            dset = pd.Series(0, ms.index)
210
            dset[train] = 0
211
            dset[valid] = 1
212
            dset[test] = 2
213
            dset = dset.sort_values()
214
            idx = dset.index
215
            train_split = bool_to_index(dset == 0)
216
            valid_split = bool_to_index(dset == 1)
217
            test_split = bool_to_index(dset == 2)
218
            print('train', train_split)
219
            print('valid', valid_split)
220
            print('test', test_split)
221
            def min_max(split):
Coding Style introduced by
This function should have a docstring.

222
                return min(split), max(split)
223
224
            splits = {
225
                'train': min_max(train_split),
226
                'valid': min_max(valid_split),
227
                'test': min_max(test_split)
228
            }
229
230
        else:
231
232
            idx = ms.index
233
234
            splits = {
235
                'train': bool_to_index(train),
236
                'valid': bool_to_index(valid),
237
                'test': bool_to_index(test)
238
            }
239
240
        return splits, idx
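
The contiguous branch of `create_splits` can be hard to follow in the interleaved listing, so here is a small sketch of the same idea without pandas: label each row with a split number, reorder the rows so equal labels are adjacent, then record the first and last position of each label. The function name `contiguous_ranges` and the sample labels are hypothetical. The second value of each range is the last position of the split, mirroring `min_max` in the listing; how a consumer interprets that bound is not stated here.

```python
def contiguous_ranges(labels):
    """Return a reordering of row positions and the (first, last) range per label."""
    # Stable sort of row positions by split label (0=train, 1=valid, 2=test).
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    ranges = {}
    for position, original in enumerate(order):
        label = labels[original]
        # Keep the first position seen for this label, update the last one.
        first = ranges.get(label, (position, position))[0]
        ranges[label] = (first, position)
    return order, ranges


# Six rows assigned to train (0), valid (1) and test (2) in mixed order.
ORDER, RANGES = contiguous_ranges([0, 2, 0, 1, 0, 2])
```

`ORDER` plays the role of the returned `idx`: the molecules and targets have to be reindexed with it so that the ranges refer to the reordered data.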
241
242
    def process_splits(self, splits, contiguous=False):
Unused Code introduced by
The argument contiguous seems to be unused.
243
244
        """ Create a split dict for fuel from provided indexes. """
245
246
        logger.info('Creating split array.')
247
248
        split_dict = {}
249
250
        if self.contiguous:
251
            logger.debug('Contiguous splits.')
252
            for split_name, (start, stop) in splits.items():
253
                split_dict[split_name] = {feat: (start, stop, h5py.Reference()) for feat in self.feature_names}
Coding Style introduced by
This line is too long as per the coding-style (111/100).

254
        else:
255
            for split_name, split in splits.items():
256
                split_indices_name = '{}_indices'.format(split_name).encode('utf-8')
257
                logger.debug('Saving %s to %s', split_indices_name, self.data_file.filename)
258
                self.data_file[split_indices_name] = split
259
                split_ref = self.data_file[split_indices_name].ref
260
                split_dict[split_name] = {feat: (-1, -1, split_ref) for feat in self.feature_names}
261
262
        return split_dict
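
To make the two branches of `process_splits` easier to compare, the sketch below builds the same nested dictionary shape in plain Python. It is only an approximation: `None` stands in for the empty `h5py.Reference()`, a tuple of indices stands in for the reference to the stored index array, and the function name `sketch_split_dict` is hypothetical.

```python
def sketch_split_dict(splits, feature_names, contiguous):
    """Build {split_name: {feature_name: (start, stop, indices)}}."""
    split_dict = {}
    for split_name, value in splits.items():
        if contiguous:
            # Contiguous splits carry a (start, stop) pair and no index array.
            start, stop = value
            entry = (start, stop, None)
        else:
            # Index-based splits use -1 placeholders and point at the indices.
            entry = (-1, -1, tuple(value))
        # Every data source (features plus 'y') shares the same entry.
        split_dict[split_name] = {feat: entry for feat in feature_names}
    return split_dict


CONTIGUOUS = sketch_split_dict({'train': (0, 70)}, ['X_morg', 'y'], True)
INDEXED = sketch_split_dict({'test': [4, 9]}, ['X_morg', 'y'], False)
```

In both cases every feature name receives an identical entry, which is why the listing builds the inner dictionary with a comprehension over `self.feature_names`.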
263
264
    def save_splits(self, split_dict):
265
266
        """ Save the splits to the data file. """
267
268
        logger.info('Producing dataset splits...')
269
        split = H5PYDataset.create_split_array(split_dict)
270
        logger.debug('split: %s', split)
271
        logger.info('Saving splits...')
272
        with warnings.catch_warnings():
273
            warnings.simplefilter('ignore')
274
            self.data_file.attrs['split'] = split
275
276
    @classmethod
277
    def convert(cls, **kwargs):
Coding Style introduced by
This method should have a docstring.

278
        kwargs.setdefault('directory', os.getcwd())
279
        kwargs.setdefault('output_directory', os.getcwd())
280
281
        return cls(**kwargs).output_path,
282
283
    @classmethod
284
    def fill_subparser(cls, subparser):
Coding Style introduced by
This method should have a docstring.

Unused Code introduced by
The argument subparser seems to be unused.
285
        return cls.convert
286