Completed
Push — master ( 6e14ce...512eb0 )
by Rich
02:12
created

#! /usr/bin/env python
#
# Copyright (C) 2016 Rich Lewis <[email protected]>
# License: 3-clause BSD

"""
# skchem.base

Base classes for scikit-chem objects.
"""
import subprocess
from abc import ABCMeta, abstractmethod
from tempfile import NamedTemporaryFile
import time
import logging

import pandas as pd

from .utils import NamedProgressBar
from . import core
from .utils import iterable_to_series, optional_second_method, nanarray, squeeze
from . import io

LOGGER = logging.getLogger(__name__)


class BaseTransformer(object):

    """ Transformer Base Class.

    Specific Base Transformer classes inherit from this class and implement
    `transform` and `axes_names`.
    """

    __metaclass__ = ABCMeta

    # To share some functionality between Transformer and AtomTransformer

    def __init__(self, verbose=True):
        self.verbose = verbose

    def optional_bar(self, **kwargs):
        """ Return a progress bar if verbose, else an identity function. """
        if self.verbose:
            bar = NamedProgressBar(name=self.__class__.__name__, **kwargs)
        else:
            def bar(iterable):
                """ Return the iterable unchanged. """
                return iterable
        return bar

    @property
    @abstractmethod
    def axes_names(self):
        """ tuple: The names of the axes. """
        pass

    @abstractmethod
    def transform(self, mols):
        """ Transform objects according to the objects transform protocol.

        Args:
            mols (skchem.Mol or pd.Series or iterable):
                The mol objects to transform.

        Returns:
            pd.Series or pd.DataFrame
        """
        pass
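
The base-class pattern above (abstract `transform` and `axes_names`, plus a progress wrapper that falls back to the identity) can be sketched without any scikit-chem dependency. This is a minimal illustration, not the library's implementation: `SketchBase`, `Lengths` and the `seen` list are hypothetical names, strings stand in for molecules, and the class uses the Python 3 `abc.ABC` spelling because a `__metaclass__` attribute is ignored by Python 3.

```python
from abc import ABC, abstractmethod


class SketchBase(ABC):
    """Hypothetical stand-in for BaseTransformer (Python 3 spelling)."""

    def __init__(self, verbose=True):
        self.verbose = verbose
        self.seen = []  # records items that passed through the wrapper

    def optional_bar(self):
        """Return a callable wrapping an iterable, or the identity."""
        if self.verbose:
            def bar(iterable):
                for item in iterable:
                    self.seen.append(item)  # stand-in for drawing progress
                    yield item
            return bar
        return lambda iterable: iterable

    @property
    @abstractmethod
    def axes_names(self):
        """tuple: The names of the axes."""

    @abstractmethod
    def transform(self, mols):
        """Transform the given objects."""


class Lengths(SketchBase):
    """Hypothetical concrete class: maps each string to its length."""

    @property
    def axes_names(self):
        return ('batch',)

    def transform(self, mols):
        bar = self.optional_bar()
        return [len(mol) for mol in bar(mols)]
```

Because both branches of `optional_bar` return something that can be called on an iterable, `transform` does not need to check `verbose` itself.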


class Transformer(BaseTransformer):

    """ Molecular based Transformer Base class.

    Concrete Transformers inherit from this class and must implement
    `_transform_mol` and `columns`.

    See Also:
         AtomTransformer.
    """

    @property
    @abstractmethod
    def columns(self):
        """ pd.Index: The column index to use. """
        return pd.Index(None)

    @abstractmethod
    def _transform_mol(self, mol):
        """ Transform a molecule. """
        pass

    def _transform_series(self, ser):
        """ Transform a series of molecules to a list of per-molecule results. """
        bar = self.optional_bar()

        return [self._transform_mol(mol) for mol in bar(ser)]

    @optional_second_method
    def transform(self, mols, **kwargs):
        """ Transform objects according to the objects transform protocol.

        Args:
            mols (skchem.Mol or pd.Series or iterable):
                The mol objects to transform.
            **kwargs:
                Currently unused.

        Returns:
            pd.Series or pd.DataFrame
        """
        if isinstance(mols, core.Mol):
            # just squeeze works on series
            return pd.Series(self._transform_mol(mols),
                             index=self.columns,
                             name=self.__class__.__name__).squeeze()

        elif not isinstance(mols, pd.Series):
            mols = iterable_to_series(mols)

        res = pd.DataFrame(self._transform_series(mols),
                           index=mols.index,
                           columns=self.columns)

        return squeeze(res, axis=1)

    @property
    def axes_names(self):
        """ tuple: The names of the axes. """
        return 'batch', self.columns.name
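
The dispatch in `Transformer.transform` (one molecule gives one labelled record, many molecules give a table) can be illustrated with plain Python containers. This is a rough sketch under loud assumptions: `SketchTransformer` is a hypothetical class, `str` plays the role of `core.Mol`, and dicts and lists of dicts stand in for `pd.Series` and `pd.DataFrame`.

```python
class SketchTransformer:
    """Hypothetical stand-in: str plays the role of core.Mol."""

    # Plays the role of the `columns` property.
    columns = ('n_chars', 'n_upper')

    def _transform_mol(self, mol):
        """Return one value per column for a single 'molecule'."""
        return [len(mol), sum(ch.isupper() for ch in mol)]

    def transform(self, mols):
        """Dispatch on the input type, as Transformer.transform does."""
        if isinstance(mols, str):
            # Single input: one record labelled by the columns.
            return dict(zip(self.columns, self._transform_mol(mols)))
        # Iterable input: one record per element, in input order.
        rows = [self._transform_mol(mol) for mol in mols]
        return [dict(zip(self.columns, row)) for row in rows]
```

A concrete subclass therefore only supplies the per-molecule calculation and the column labels; the shape of the result follows from the shape of the input.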


class BatchTransformer(BaseTransformer):
    """ Transformer Mixin in which transforms on multiple molecules save overhead.

    Implement `_transform_series` with the transformation rather than
    `_transform_mol`. Must occur before `Transformer` or `AtomTransformer` in
    method resolution order.

    See Also:
         Transformer, AtomTransformer.
    """

    def _transform_mol(self, mol):
        """ Transform a molecule. """

        # silence the progress bar for the single-molecule batch, then restore it
        verbose = self.verbose
        self.verbose = False
        try:
            res = self.transform([mol]).iloc[0]
        finally:
            self.verbose = verbose
        return res

    @abstractmethod
    def _transform_series(self, ser):
        """ Transform a series of molecules to an np.ndarray. """
        pass
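
The requirement that `BatchTransformer` come before `Transformer` in the method resolution order can be demonstrated with a small stand-alone sketch. The class names `PerMol`, `Batch`, `RightOrder` and `WrongOrder` are hypothetical, strings stand in for molecules, and the sketch uses Python 3's `abc.ABC` so that abstract methods are actually enforced.

```python
from abc import ABC, abstractmethod


class PerMol(ABC):
    """Plays the role of Transformer: the series is built per molecule."""

    @abstractmethod
    def _transform_mol(self, mol):
        """Transform one molecule."""

    def _transform_series(self, ser):
        return [self._transform_mol(mol) for mol in ser]

    def transform(self, mols):
        return self._transform_series(list(mols))


class Batch(ABC):
    """Plays the role of BatchTransformer: one molecule is a batch of one."""

    def _transform_mol(self, mol):
        return self.transform([mol])[0]

    @abstractmethod
    def _transform_series(self, ser):
        """Transform many molecules at once."""


class RightOrder(Batch, PerMol):
    """Mixin first: Batch._transform_mol shadows the abstract one."""

    def _transform_series(self, ser):
        return [len(mol) for mol in ser]


class WrongOrder(PerMol, Batch):
    """Mixin last: the abstract PerMol._transform_mol is found first."""

    def _transform_series(self, ser):
        return [len(mol) for mol in ser]
```

`RightOrder()` can be instantiated because the concrete `Batch._transform_mol` is found first. `WrongOrder()` raises `TypeError` because attribute lookup reaches the abstract `PerMol._transform_mol` before the mixin's implementation.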


class AtomTransformer(BaseTransformer):
    """ Transformer that will produce a Panel.

    Concrete classes inheriting from this should implement `_transform_atom`,
    `_transform_mol` and `minor_axis`.

    See Also:
        Transformer
    """

    def __init__(self, max_atoms=100, **kwargs):
        self.max_atoms = max_atoms
        self.major_axis = pd.RangeIndex(self.max_atoms, name='atom_idx')
        super(AtomTransformer, self).__init__(**kwargs)

    @property
    @abstractmethod
    def minor_axis(self):
        """ pd.Index: Minor axis of transformed values.  """
        return pd.Index(None)  # expects a length

    @property
    def axes_names(self):
        """ tuple: The names of the axes. """
        return 'batch', 'atom_idx', self.minor_axis.name

    @optional_second_method
    def transform(self, mols):
        """ Transform objects according to the objects transform protocol.

        Args:
            mols (skchem.Atom or skchem.Mol or pd.Series or iterable):
                The atom or mol objects to transform.

        Returns:
            pd.Series or pd.DataFrame or pd.Panel
        """
        if isinstance(mols, core.Atom):
            # just squeeze works on series
            return pd.Series(self._transform_atom(mols),
                             index=self.minor_axis).squeeze()

        elif isinstance(mols, core.Mol):
            res = pd.DataFrame(self._transform_mol(mols),
                               index=self.major_axis[:len(mols.atoms)],
                               columns=self.minor_axis)
            return squeeze(res, axis=1)

        elif not isinstance(mols, pd.Series):
            mols = iterable_to_series(mols)

        # NOTE: pd.Panel was removed in pandas 0.25, so this needs an older pandas.
        res = pd.Panel(self._transform_series(mols),
                       items=mols.index,
                       major_axis=self.major_axis,
                       minor_axis=self.minor_axis)

        return squeeze(res, axis=(1, 2))

    @abstractmethod
    def _transform_atom(self, atom):
        """ Transform an atom to a 1D array of length `len(self.minor_axis)`. """

        pass

    def _transform_mol(self, mol):
        """ Transform a Mol to a 2D array. """

        res = nanarray((len(mol.atoms), len(self.minor_axis)))
        for i, atom in enumerate(mol.atoms):
            res[i] = self._transform_atom(atom)
        return res

    def _transform_series(self, ser):
        """ Transform a Series<Mol> to a 3D array. """

        # progress bar if verbose, identity otherwise
        bar = self.optional_bar()

        res = nanarray((len(ser), self.max_atoms, len(self.minor_axis)))
        for i, mol in enumerate(bar(ser)):
            res[i, :len(mol.atoms), :len(self.minor_axis)] = self._transform_mol(mol)
        return res
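
The padding done in `AtomTransformer._transform_series` (every molecule is written into a block of `max_atoms` rows, and unused rows stay NaN) can be sketched with nested lists. `pad_features` is a hypothetical helper, the nested lists stand in for the NumPy array returned by `nanarray`, and each molecule is assumed to be given as a list of per-atom feature lists.

```python
import math

NAN = float('nan')


def pad_features(mols, max_atoms, n_features):
    """Return a len(mols) x max_atoms x n_features nested list, NaN-padded.

    Each element of `mols` is a list of per-atom feature lists. This sketch
    raises IndexError for a molecule with more than `max_atoms` atoms.
    """
    res = [[[NAN] * n_features for _ in range(max_atoms)] for _ in mols]
    for i, mol in enumerate(mols):
        for j, atom_features in enumerate(mol):
            res[i][j][:] = atom_features  # overwrite the NaN placeholders
    return res


# Usage: a one-atom and a two-atom "molecule", padded to three atoms each.
padded = pad_features([[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]], 3, 2)
assert padded[0][0] == [1.0, 2.0]
assert all(math.isnan(value) for value in padded[0][1])
```

The fixed second dimension is what lets molecules of different sizes share one rectangular container.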


class External(object):
    """ Mixin for wrappers of external CLI tools.

    Concrete classes must implement `validate_install`.
    """

    __metaclass__ = ABCMeta

    install_hint = ""  # give an explanation of how to install external tool here.

    def __init__(self, **kwargs):
        assert self.validated, 'External tool not installed. ' + self.install_hint
        super(External, self).__init__(**kwargs)

    @property
    def validated(self):
        """ bool: whether the external tool is installed and active. """
        if not hasattr(self.__class__, '_validated'):
            self.__class__._validated = self.validate_install()
        return self.__class__._validated

    @staticmethod
    @abstractmethod
    def validate_install():
        """ Determine if the external tool is available. """
        pass
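
The `validated` property caches the result of `validate_install` on the class, so the check runs once rather than once per instance. A minimal stand-alone sketch of that idea follows. `ExternalSketch`, `PythonTool`, `MissingTool`, the `calls` counter and the tool name are all hypothetical, and the sketch raises `RuntimeError` instead of using `assert`, since assertions are skipped when Python runs with `-O`.

```python
import shutil
import sys


class ExternalSketch:
    """Hypothetical stand-in for the External mixin."""

    install_hint = ''
    calls = 0  # counts how often validate_install actually ran

    def __init__(self):
        if not self.validated:
            raise RuntimeError('External tool not installed. ' + self.install_hint)

    @property
    def validated(self):
        """bool: cached on the class after the first check."""
        cls = type(self)
        if not hasattr(cls, '_validated'):
            cls._validated = cls.validate_install()
        return cls._validated

    @staticmethod
    def validate_install():
        raise NotImplementedError


class PythonTool(ExternalSketch):
    """Treats the running interpreter as the 'external tool'."""

    @staticmethod
    def validate_install():
        PythonTool.calls += 1
        return shutil.which(sys.executable) is not None


class MissingTool(ExternalSketch):
    """A tool name that is assumed not to exist on PATH."""

    install_hint = 'Install the hypothetical tool first.'

    @staticmethod
    def validate_install():
        return shutil.which('hypothetical-tool-that-is-not-installed') is not None
```

Creating `PythonTool()` twice runs the check once; creating `MissingTool()` raises with the install hint appended to the message.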


class CLIWrapper(External, BaseTransformer):
    """ CLI wrapper.

    Concrete classes inheriting from this must implement `_cli_args`,
    `monitor_progress`, `_parse_outfile` and `_parse_errors`.
    """

    def __init__(self, error_on_fail=False, warn_on_fail=True, **kwargs):
        super(CLIWrapper, self).__init__(**kwargs)
        self.error_on_fail = error_on_fail
        self.warn_on_fail = warn_on_fail

    def _transform_series(self, ser):
        """ Transform a series. """
        with NamedTemporaryFile(suffix='.sdf') as infile, NamedTemporaryFile() as outfile:
            io.write_sdf(ser, infile.name)
            args = self._cli_args(infile.name, outfile.name)
            proc = subprocess.Popen(args, stderr=subprocess.PIPE)

            if self.verbose:
                bar = self.optional_bar(max_value=len(ser))
                # poll the tool twice a second until it exits
                while proc.poll() is None:
                    time.sleep(0.5)
                    bar.update(self.monitor_progress(outfile.name))
                bar.finish()

            proc.wait()
            res = self._parse_outfile(outfile.name)

        errs = proc.stderr.read().decode()
        errs = self._parse_errors(errs)
        # set the index of results to that of the input, with the failed indices removed
        if isinstance(res, (pd.Series, pd.DataFrame)):
            res.index = ser.index.delete(errs)
        elif isinstance(res, pd.Panel):
            res.items = ser.index.delete(errs)
        else:
            raise ValueError('Parsed datatype ({}) not supported.'.format(type(res)))

        # go through the errors and put them back in (transform doesn't lose instances)
        for err in errs:
            err = ser.index[err]
            if self.error_on_fail:
                raise ValueError('Failed to transform {}.'.format(err))
            if self.warn_on_fail:
                LOGGER.warning('Failed to transform %s', err)
            res.loc[err] = None

        return res.loc[ser.index].values

    @abstractmethod
    def _cli_args(self, infile, outfile):
        """ list: The cli arguments. """
        return []

    @abstractmethod
    def monitor_progress(self, filename):
        """ Report the progress. """
        pass

    @abstractmethod
    def _parse_outfile(self, outfile):
        """ Parse the file written and return a pandas object. """
        pass

    @abstractmethod
    def _parse_errors(self, errs):
        """ Parse stderr and return error indices. """
        pass
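
Two steps of `CLIWrapper._transform_series` lend themselves to a small stand-alone sketch: polling a child process until it exits, and putting placeholders back at the positions that failed. The helper names `run_and_poll` and `reinsert_failures` are hypothetical, the child process is simply the current Python interpreter, and plain lists stand in for the pandas objects.

```python
import subprocess
import sys
import time


def run_and_poll(args, on_tick, interval=0.05):
    """Run `args`, calling `on_tick(n)` between polls; return (code, stderr)."""
    proc = subprocess.Popen(args, stderr=subprocess.PIPE)
    ticks = 0
    while proc.poll() is None:  # None means the child is still running
        time.sleep(interval)
        ticks += 1
        on_tick(ticks)
    errs = proc.stderr.read().decode()
    proc.stderr.close()
    return proc.returncode, errs


def reinsert_failures(results, n_inputs, failed):
    """Rebuild a list of length `n_inputs` with None at the failed positions."""
    failed = set(failed)
    remaining = iter(results)
    return [None if i in failed else next(remaining) for i in range(n_inputs)]


# Usage: a child that works briefly and reports a failure on stderr.
child = [sys.executable, '-c',
         "import sys, time; time.sleep(0.2); sys.stderr.write('failed: 1\\n')"]
seen_ticks = []
code, stderr_text = run_and_poll(child, seen_ticks.append)
aligned = reinsert_failures(['a', 'c'], 3, [1])
```

The second helper mirrors the comment in the source: the output keeps one slot per input, so a failed molecule shows up as a missing value rather than shifting later results.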


class Featurizer(object):

    """ Base class for mol -> data transforms, such as fingerprinting.

    Concrete subclasses should implement `name`, returning a string uniquely
    identifying the featurizer.
    """

    __metaclass__ = ABCMeta