Passed
Push — main ( cdf0f7...3de8e8 )
by Douglas
01:40
created

mandos.entries.searcher.SearcherUtils.dl()   C

Complexity

Conditions 10

Size

Total Lines 57
Code Lines 43

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric Value
cc 10
eloc 43
nop 6
dl 0
loc 57
rs 5.9999
c 0
b 0
f 0

How to fix: Long Method, Complexity

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
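As one illustration of the advice above, the two near-identical lookup loops in dl() (one for PubChem, one for ChEMBL) could be pulled into a single small helper. This is only a sketch under stated assumptions: `lookup_ids`, `fake_fetch`, and the stand-in `CompoundNotFoundError` are hypothetical names introduced here, not part of mandos, and logging is left out to keep the example self-contained.

```python
from typing import Callable, Dict, Sequence


class CompoundNotFoundError(LookupError):
    """Stand-in for mandos.model.CompoundNotFoundError (assumption)."""


def lookup_ids(
    inchikeys: Sequence[str],
    fetch_id: Callable[[str], str],
) -> Dict[str, str]:
    """Map each InChIKey to an ID, skipping compounds that are not found."""
    found: Dict[str, str] = {}
    for inchikey in inchikeys:
        try:
            found[inchikey] = str(fetch_id(inchikey))
        except CompoundNotFoundError:
            # dl() logs "NOT FOUND" here; omitted in this sketch.
            continue
    return found


def fake_fetch(inchikey: str) -> str:
    """Hypothetical fetcher standing in for a real PubChem or ChEMBL call."""
    known = {"AAA": "101", "BBB": "202"}
    if inchikey not in known:
        raise CompoundNotFoundError(inchikey)
    return known[inchikey]


result = lookup_ids(["AAA", "ZZZ", "BBB"], fake_fetch)
```

With a helper of this shape, dl() would call it once per database and pass a different fetch function each time, which removes one of the two loops and several local variables.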

Commonly applied refactorings include Extract Method, the technique described above.

Complexity

Complex methods like mandos.entries.searcher.SearcherUtils.dl() often do a lot of different things, which usually means the enclosing class does too. To break such a class down, we need to identify a cohesive component within it. A common approach to finding such a component is to look for fields or methods that share the same prefixes or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
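A possible Extract Class candidate in dl() is the cache-path logic (building a key from the InChIKeys and turning it into a .feather path). The sketch below is an assumption-laden illustration, not the project's code: `MatchCache`, `key_of`, and `path_of` are hypothetical names. It also swaps the built-in `hash()` used in dl() for `hashlib.sha256`, because Python randomizes string hashes per process by default, so a `hash()`-based file name may not match on the next run.

```python
import hashlib
from pathlib import Path
from typing import Sequence


class MatchCache:
    """Hypothetical helper that owns the cache-path logic from dl()."""

    def __init__(self, cache_dir: Path) -> None:
        self.cache_dir = cache_dir

    @staticmethod
    def key_of(inchikeys: Sequence[str]) -> str:
        # sha256 gives the same digest in every interpreter run,
        # unlike the built-in hash() of a str (swapped in; an assumption)
        joined = ",".join(inchikeys).encode("utf-8")
        return hashlib.sha256(joined).hexdigest()

    def path_of(self, inchikeys: Sequence[str]) -> Path:
        # Same shape as dl(): <cache dir>/<key>.feather
        return (self.cache_dir / self.key_of(inchikeys)).with_suffix(".feather")


cache = MatchCache(Path("match-cache"))
path = cache.path_of(["AAA", "BBB"])
```

Note that the key still depends on the order of the InChIKeys, as it does in dl(); sorting them first would be a further design choice.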

"""
Run searches and write files.
"""

from __future__ import annotations

import gzip
Issue (Unused Code): The import gzip seems to be unused.
from pathlib import Path
from typing import Sequence, Optional, Dict

import pandas as pd
Issue: Unable to import 'pandas'
from pocketutils.core.dot_dict import NestedDotDict
Issue: Unable to import 'pocketutils.core.dot_dict'
from pocketutils.tools.common_tools import CommonTools
Issue: Unable to import 'pocketutils.tools.common_tools'
from pocketutils.tools.path_tools import PathTools
Issue: Unable to import 'pocketutils.tools.path_tools'
Issue (Unused Code): Unused PathTools imported from pocketutils.tools.path_tools
from typeddfs import TypedDfs, UntypedDf
Issue: Unable to import 'typeddfs'

from mandos import logger
from mandos.entries.paths import EntryPaths
from mandos.model import CompoundNotFoundError
from mandos.model.chembl_support.chembl_utils import ChemblUtils
from mandos.model.searches import Search
from mandos.model.settings import MANDOS_SETTINGS
from mandos.search.chembl import ChemblSearch
from mandos.search.pubchem import PubchemSearch
from mandos.entries.api_singletons import Apis

InputFrame = (TypedDfs.typed("InputFrame").require("inchikey")).build()

IdMatchFrame = (
    TypedDfs.typed("IdMatchFrame")
    .require("inchikey")
    .require("chembl_id")
    .require("pubchem_id")
    .strict()
).build()


class SearcherUtils:
Issue: Missing class docstring
    @classmethod
    def dl(
Issue: Missing function or method docstring
Issue (Unused Code): Either all return statements in a function should return an expression, or none of them should.
Issue (Coding Style, Naming): Method name "dl" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)

This check looks for invalid names for a range of different identifiers.

You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements.

If your project includes a Pylint configuration file, the settings contained in that file take precedence.

To find out more about Pylint, please refer to their site.

Issue (best-practice): Too many arguments (6/5)
Issue (Comprehensibility): This function exceeds the maximum number of variables (16/15).
        cls,
Issue (Coding Style): Wrong hanging indentation before block (add 4 spaces).
        inchikeys: Sequence[str],
Issue (Coding Style): Wrong hanging indentation before block (add 4 spaces).
        pubchem: bool = True,
Issue (Coding Style): Wrong hanging indentation before block (add 4 spaces).
        chembl: bool = True,
Issue (Coding Style): Wrong hanging indentation before block (add 4 spaces).
        hmdb: bool = True,
Issue (Unused Code): The argument hmdb seems to be unused.
Issue (Coding Style): Wrong hanging indentation before block (add 4 spaces).
        quiet: bool = False,
Issue (Coding Style): Wrong hanging indentation before block (add 4 spaces).
    ) -> IdMatchFrame:
        # we actually cache the results, even though the underlying APIs cache
        # the reasons for this are a little obscure --
        # when running a Searcher, we want to run before the FIRST search
        # for the typer commands to be replicas of the ``Entry.run`` methods, Searcher fetches before running a search
Issue (Coding Style): This line is too long as per the coding-style (118/100).

This check looks for lines that are too long. You can specify the maximum line length.

        # but if we have multiple searches (as in ``mandos search --config``), we only want that at the beginning
Issue (Coding Style): This line is too long as per the coding-style (113/100).
        # the alternative was having ``mandos search`` dynamically subclass each ``Entry`` -- which was really hard
Issue (Coding Style): This line is too long as per the coding-style (115/100).
        # this is much cleaner, even though it's redundant
        # if the cached results under /pubchem and /chembl are deleted, we unfortunately won't cache the results
Issue (Coding Style): This line is too long as per the coding-style (112/100).
        # when running this command
        # to fix that, we need to delete the cached /match dataframes
        # now that I'm writing this down, I realize this is pretty bad
        # TODO
Issue (Coding Style): TODO and FIXME comments should generally be avoided.
        # noinspection PyPep8Naming
        Chembl, Pubchem = Apis.Chembl, Apis.Pubchem
Issue (Coding Style, Naming): Variable name "Pubchem" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
Issue (Coding Style, Naming): Variable name "Chembl" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
        logger.info(f"Using {Chembl}, {Pubchem}")
        key = hash(",".join(inchikeys))
        cached_path = (MANDOS_SETTINGS.match_cache_path / str(key)).with_suffix(".feather")
        if cached_path.exists():
            logger.info(f"Found ID matching results at {cached_path}")
            return IdMatchFrame.read_feather(cached_path)
        found_chembl: Dict[str, str] = {}
        found_pubchem: Dict[str, str] = {}
        if pubchem:
            for inchikey in inchikeys:
                try:
                    cid = Pubchem.fetch_data(inchikey).cid
                    found_pubchem[inchikey] = str(cid)
                    if not quiet:
                        logger.info(f"Found:      PubChem {inchikey} ({cid})")
                except CompoundNotFoundError:
                    logger.info(f"NOT FOUND: PubChem {inchikey}")
                    logger.trace(f"Did not find PubChem {inchikey}", exc_info=True)
        if chembl:
            for inchikey in inchikeys:
                try:
                    chid = ChemblUtils(Chembl).get_compound(inchikey).chid
                    found_chembl[inchikey] = chid
                    if not quiet:
                        logger.info(f"Found:      ChEMBL {inchikey} ({chid})")
                except CompoundNotFoundError:
                    logger.info(f"NOT FOUND: ChEMBL {inchikey}")
                    logger.trace(f"Did not find ChEMBL {inchikey}", exc_info=True)
        df = pd.DataFrame([pd.Series(dict(inchikey=c)) for c in inchikeys])
Issue (Coding Style, Naming): Variable name "df" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
        df["chembl_id"] = df["inchikey"].map(found_chembl.get)
        df["pubchem_id"] = df["inchikey"].map(found_pubchem.get)
        df = IdMatchFrame(df)
Issue (Coding Style, Naming): Variable name "df" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
        df.to_feather(cached_path)
        logger.info(f"Wrote {cached_path}")

    @classmethod
    def read(cls, input_path: Path) -> InputFrame:
Issue: Missing function or method docstring
        df: UntypedDf = TypedDfs.untyped("Input").read_file(input_path, header=None, comment="#")
Issue (Coding Style, Naming): Variable name "df" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
        if "inchikey" in df.column_names():
            df = InputFrame.convert(df)
Issue (Coding Style, Naming): Variable name "df" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
        elif ".lines" in input_path.name or ".txt" in input_path.name:
            df.columns = ["inchikey"]
            df = InputFrame.convert(df)
Issue (Coding Style, Naming): Variable name "df" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
        else:
            raise ValueError(f"Could not parse {input_path}; no column 'inchikey'")
        # find duplicates
        # in hindsight, this wasn't worth the amount of code
        n0 = len(df)
Issue (Coding Style, Naming): Variable name "n0" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
        # noinspection PyTypeChecker
        df: UntypedDf = df.drop_duplicates()
Issue (Coding Style, Naming): Variable name "df" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
        n1 = len(df)
Issue (Coding Style, Naming): Variable name "n1" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
        logger.info(f"Read {n1} input compounds")
        if n0 == n1:
            logger.info(f"There were no duplicate rows")
Issue: Using an f-string that does not have any interpolated variables
        else:
            logger.info(f"Dropped {n0 - n1} duplicated rows")
        duplicated = df[df.duplicated("inchikey", keep=False)]
        duplicated_inchikeys = set(duplicated["inchikey"])
        # noinspection PyTypeChecker
        df = df.drop_duplicates(subset=["inchikey"], keep="first")
Issue (Coding Style, Naming): Variable name "df" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
        n2 = len(df)
Issue (Coding Style, Naming): Variable name "n2" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
        if len(duplicated) > 1:
            logger.error(
                f"{len(duplicated)} rows contain the same inchikey but have differences in other columns"
Issue (Coding Style): This line is too long as per the coding-style (105/100).
            )
            logger.error(f"Dropped {n1 - n2} rows with duplicate inchikeys")
            logger.error(f"The offending inchikeys are {duplicated_inchikeys}")
        return df


class Searcher:
    """
    Executes one or more searches and saves the results to CSV files.
    Create and use once.
    """

    def __init__(self, searches: Sequence[Search], to: Sequence[Path], input_path: Path):
        """
        Constructor.

        Args:
            searches:
            input_path: Path to the input file of one of the formats:
                - .txt containing one InChI Key per line
                - .csv, .tsv, .tab, csv.gz, .tsv.gz, .tab.gz, or .feather containing a column called inchikey
Issue (Coding Style): This line is too long as per the coding-style (109/100).
        """
        self.what = searches
        self.input_path: Optional[Path] = input_path
        self.input_df: InputFrame = None
        self.output_paths = {
            what.key: EntryPaths.output_path_of(what, input_path, path)
            for what, path in CommonTools.zip_list(searches, to)
        }

    def search(self) -> Searcher:
        """
        Performs the search, and writes data.
        """
        if self.input_df is not None:
            raise ValueError(f"Already ran a search")
Issue: Using an f-string that does not have any interpolated variables
        self.input_df = SearcherUtils.read(self.input_path)
        inchikeys = self.input_df["inchikey"].unique()
        has_pubchem = any((isinstance(what, PubchemSearch) for what in self.what))
Issue (Comprehensibility, Best Practice): The variable what does not seem to be defined.
        has_chembl = any((isinstance(what, ChemblSearch) for what in self.what))
        # find the compounds first so the user knows what's missing before proceeding
        SearcherUtils.dl(inchikeys, pubchem=has_pubchem, chembl=has_chembl, quiet=True)
        for what in self.what:
            output_path = self.output_paths[what.key]
            metadata_path = output_path.with_suffix(".metadata.json")
            df = what.find_to_df(inchikeys)
Issue (Coding Style, Naming): Variable name "df" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
            # TODO keep any other columns in input_df
Issue (Coding Style): TODO and FIXME comments should generally be avoided.
            df.to_csv(output_path)
            params = {k: str(v) for k, v in what.get_params().items() if k not in {"key", "api"}}
            metadata = NestedDotDict(dict(key=what.key, search=what.search_class, params=params))
            metadata.write_json(metadata_path)
            logger.info(f"Wrote {what.key} to {output_path}")
        return self


__all__ = ["Searcher", "IdMatchFrame", "SearcherUtils"]
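
The "Either all return statements in a function should return an expression" issue on dl() points at a real gap: on a cache hit the method returns the frame it read, but on a cache miss it writes the frame and falls off the end, returning None. A minimal load-or-compute sketch that returns on both paths might look like the following. It uses JSON and the standard library instead of feather files so that it runs anywhere; `load_or_compute` and the sample data are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict


def load_or_compute(
    cached_path: Path,
    compute: Callable[[], Dict[str, str]],
) -> Dict[str, str]:
    """Return cached data if present; otherwise compute, write, and return it."""
    if cached_path.exists():
        return json.loads(cached_path.read_text(encoding="utf-8"))
    data = compute()
    cached_path.write_text(json.dumps(data), encoding="utf-8")
    # The cache-miss path returns too, unlike dl() as listed above.
    return data


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "match.json"
    # First call computes and writes the cache file.
    first = load_or_compute(path, lambda: {"AAA": "101"})
    # Second call reads the file, so the new compute function is never used.
    second = load_or_compute(path, lambda: {"never": "called"})
```

In dl() itself, the equivalent change would presumably be a final `return df` after the cache file is written, which is what the `-> IdMatchFrame` annotation already promises.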