Passed
Push — main ( bfa577...eb6882 )
by Douglas
04:37
created

mandos.model.pubchem_api   F

Complexity

Total Complexity 63

Size/Duplication

Total Lines 403
Duplicated Lines 0 %

Importance

Changes 0
Metric Value
eloc 292
dl 0
loc 403
rs 3.36
c 0
b 0
f 0
wmc 63

32 Methods

Rating   Name   Duplication   Size   Complexity  
A QueryingPubchemApi._tables_to_use() 0 24 2
A PubchemApi.fetch_data_from_cid() 0 4 1
A QueryingPubchemApi._external_table_url() 0 11 1
A CachingPubchemApi.similarity_path() 0 3 2
B QueryingPubchemApi._strip_by_key_in_place() 0 10 7
A PubchemApi.fetch_data() 0 2 1
A QueryingPubchemApi._fetch_hierarchies() 0 12 3
A QueryingPubchemApi._query_json() 0 6 2
A CachingPubchemApi._read_json() 0 7 2
A CachingPubchemApi.find_similar_compounds() 0 19 4
A QueryingPubchemApi._fetch_external_linksets() 0 4 1
A QueryingPubchemApi._fetch_structure_data() 0 7 2
A CachingPubchemApi.fetch_data() 0 16 3
A QueryingPubchemApi._get_parent() 0 9 3
A QueryingPubchemApi.__init__() 0 13 2
A QueryingPubchemApi._linksets_to_use() 0 6 1
A QueryingPubchemApi._get_metadata() 0 6 1
A QueryingPubchemApi.fetch_data() 0 22 2
A CachingPubchemApi.data_path() 0 3 2
A QueryingPubchemApi._fetch_external_table() 0 5 1
A QueryingPubchemApi.find_similar_compounds() 0 14 3
A CachingPubchemApi._write_json() 0 5 2
A QueryingPubchemApi._hierarchies_to_use() 0 28 3
A QueryingPubchemApi._fetch_data() 0 9 1
A QueryingPubchemApi._fetch_display_data() 0 3 1
A CachingPubchemApi.__init__() 0 6 1
A QueryingPubchemApi._fetch_core_data() 0 7 1
A QueryingPubchemApi._fetch_hierarchy() 0 8 2
A QueryingPubchemApi._fetch_external_tables() 0 4 1
A QueryingPubchemApi._fetch_external_linkset() 0 4 1
A PubchemApi.find_similar_compounds() 0 2 1
A QueryingPubchemApi._scrape_cid() 0 37 3

How to fix: Complexity

Complex modules like mandos.model.pubchem_api often do a lot of different things. To break such a module or class down, identify a cohesive component within it. A common way to find one is to look for fields and methods that share a prefix or suffix; here, for example, the _fetch_* methods and the *_to_use properties.

Once you have determined the fields and methods that belong together, you can apply the Extract Class refactoring. If the component makes sense as a subclass, Extract Subclass is also a candidate, and is often faster.
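
As a minimal sketch of Extract Class, the classifier selection logic could move into a small class of its own. The class name HierarchySelection and its to_use method are hypothetical and do not exist in mandos; the hierarchy names and IDs are copied from _hierarchies_to_use in the listing below.

```python
from dataclasses import dataclass
from typing import Dict, Mapping

# Hierarchy names and IDs copied from _hierarchies_to_use in the listing.
_CORE: Mapping[str, int] = {
    "MeSH Tree": 1,
    "ChEBI Ontology": 2,
    "WHO ATC Classification System": 79,
    "Guide to PHARMACOLOGY Target Classification": 92,
    "ChEMBL Target Tree": 87,
}
_EXTRA: Mapping[str, int] = {
    "KEGG: Phytochemical Compounds": 5,
    "KEGG: Drug": 14,
    "KEGG: USP": 15,
    "KEGG: Major components of natural products": 69,
    "KEGG: Target-based Classification of Drugs": 22,
    "KEGG: OTC drugs": 25,
    "KEGG: Drug Classes": 96,
    "CAMEO Chemicals": 86,
    "EPA CPDat Classification": 99,
    "FDA Pharm Classes": 78,
    "ChemIDplus": 84,
}


@dataclass(frozen=True)
class HierarchySelection:
    """Hypothetical extracted class: decides which classifier hierarchies to query."""

    classifiers: bool = False
    extra_classifiers: bool = False

    def to_use(self) -> Dict[str, int]:
        # Same rule as the original property: extras only apply if classifiers are on.
        if not self.classifiers:
            return {}
        chosen = dict(_CORE)
        if self.extra_classifiers:
            chosen.update(_EXTRA)
        return chosen


selection = HierarchySelection(classifiers=True, extra_classifiers=True)
print(sorted(selection.to_use()))
```

With this split, QueryingPubchemApi would hold one HierarchySelection instead of two boolean flags, which would also bring its constructor under the argument limit the analyzer reports.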

"""
PubChem querying API.
"""
from __future__ import annotations

import abc
import logging
import re
import time
from urllib.error import HTTPError
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional, Sequence, Union, FrozenSet, Mapping

import io
import gzip
import orjson  # Issue: unable to import 'orjson'
import pandas as pd  # Issue: unable to import 'pandas'
from pocketutils.core.dot_dict import NestedDotDict  # Issue: unable to import
from pocketutils.core.query_utils import QueryExecutor  # Issue: unable to import

from mandos.model.pubchem_support.pubchem_data import PubchemData

logger = logging.getLogger("mandos")


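
The analyzer's recurring "Use lazy % formatting in logging functions" finding is about when the message gets built. The sketch below illustrates the difference; the logger name and the CountingKey helper are hypothetical and only exist to make the formatting calls observable.

```python
import logging

logger = logging.getLogger("mandos.example")  # hypothetical logger name
logger.setLevel(logging.INFO)  # DEBUG records are discarded


class CountingKey:
    """Hypothetical helper that counts how often it is converted to a string."""

    def __init__(self) -> None:
        self.calls = 0

    def __str__(self) -> str:
        self.calls += 1
        return "GJSURZIOUXUGAL-UHFFFAOYSA-N"


key = CountingKey()

# Eager: the f-string is built before logger.debug is even called,
# although the record is then thrown away.
logger.debug(f"Downloading PubChem data for {key}")
eager_calls = key.calls

# Lazy: the arguments are only formatted if the record is actually emitted.
logger.debug("Downloading PubChem data for %s", key)
lazy_calls = key.calls - eager_calls

print(eager_calls, lazy_calls)
```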
class PubchemCompoundLookupError(LookupError):
    """Raised when a compound cannot be found on PubChem or in the cache."""


class PubchemApi(metaclass=abc.ABCMeta):
    """Abstract interface for fetching PubChem data."""

    def fetch_data_from_cid(self, cid: int) -> Optional[PubchemData]:
        """Fetches data by CID; delegates to ``fetch_data``."""
        # separated from fetch_data to make it completely clear what an int value means
        # noinspection PyTypeChecker
        return self.fetch_data(cid)

    def fetch_data(self, inchikey: str) -> Optional[PubchemData]:
        """Fetches the data for a compound, looked up by InChI key."""
        raise NotImplementedError()

    def find_similar_compounds(self, inchi: Union[int, str], min_tc: float) -> FrozenSet[int]:
        """Returns the CIDs of compounds similar to ``inchi`` (threshold ``min_tc``)."""
        raise NotImplementedError()


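
PubchemApi sets abc.ABCMeta as its metaclass, but its methods raise NotImplementedError instead of being marked with @abc.abstractmethod, so an incomplete subclass only fails once a missing method is called. The following sketch shows the stricter alternative. All class names are hypothetical, and the return type is simplified to a string so the example does not depend on PubchemData.

```python
import abc
from typing import Mapping, Optional


class StrictApi(metaclass=abc.ABCMeta):
    """Hypothetical variant of the base class with an enforced abstract method."""

    @abc.abstractmethod
    def fetch_data(self, inchikey: str) -> Optional[str]:
        """Subclasses must override this."""


class IncompleteApi(StrictApi):
    """Forgets to implement fetch_data."""


class DictApi(StrictApi):
    """Toy implementation backed by an in-memory mapping."""

    def __init__(self, known: Mapping[str, str]) -> None:
        self._known = dict(known)

    def fetch_data(self, inchikey: str) -> Optional[str]:
        return self._known.get(inchikey)


try:
    IncompleteApi()
    instantiated = True
except TypeError:
    # abc refuses to instantiate a class that still has abstract methods
    instantiated = False

api = DictApi({"GJSURZIOUXUGAL-UHFFFAOYSA-N": "record"})
print(instantiated, api.fetch_data("GJSURZIOUXUGAL-UHFFFAOYSA-N"))
```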
class QueryingPubchemApi(PubchemApi):
    """Fetches data by querying PubChem directly, without caching."""

    # Issue (best practice): too many arguments (6/5)
    # Issue (coding style): wrong hanging indentation before block (add 4 spaces),
    # reported on each of the six parameter lines
    def __init__(
        self,
        chem_data: bool = False,
        extra_tables: bool = False,
        classifiers: bool = False,
        extra_classifiers: bool = False,
        query: Optional[QueryExecutor] = None,
    ):
        self._use_chem_data = chem_data
        self._use_extra_tables = extra_tables
        self._use_classifiers = classifiers
        self._use_extra_classifiers = extra_classifiers
        self._query = QueryExecutor(0.22, 0.25) if query is None else query

    _pug = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
    _pug_view = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view"
    _sdg = "https://pubchem.ncbi.nlm.nih.gov/sdq/sdqagent.cgi"
    _classifications = "https://pubchem.ncbi.nlm.nih.gov/classification/cgi/classifications.fcgi"
    _link_db = "https://pubchem.ncbi.nlm.nih.gov/link_db/link_db_server.cgi"

    def fetch_data(self, inchikey: str) -> Optional[PubchemData]:
        # Dear God this is terrible
        # Here are the steps:
        # 1. Download HTML for the InChI key and scrape the CID
        # 2. Download the "display" JSON data from the CID
        # 3. Look for a Parent-type related compound. If it exists, download its display data
        # 4. Download the structural data and append it
        # 5. Download the external table CSVs and append them
        # 6. Download the link sets and append them
        # 7. Download the classifiers (hierarchies) and append them
        # 8. Attach metadata about how we found this.
        # 9. Return the stupid, stupid result as a massive JSON struct.
        logger.info("Downloading PubChem data for %s", inchikey)
        cid = self._scrape_cid(inchikey)
        try:
            data = self._fetch_data(cid, inchikey)
        except HTTPError:
            raise PubchemCompoundLookupError(
                f"Failed finding pubchem compound (JSON) from cid {cid}, inchikey {inchikey}"
            )
        data = self._get_parent(cid, inchikey, data)
        return data

    def find_similar_compounds(self, inchi: Union[int, str], min_tc: float) -> FrozenSet[int]:
        req = self._query(
            f"{self._pug}/compound/similarity/inchikey/{inchi}/JSON?Threshold={min_tc}",
            method="post",
        )
        key = orjson.loads(req)["Waiting"]["ListKey"]
        # Issue (naming): "t0" doesn't conform to snake_case naming style
        t0 = time.monotonic()
        while time.monotonic() - t0 < 5:
            # it'll wait as needed here
            resp = self._query(f"{self._pug}/compound/listkey/{key}/cids/JSON")
            resp = NestedDotDict(orjson.loads(resp))
            if resp.get("IdentifierList.CID") is not None:
                return frozenset(resp.req_list_as("IdentifierList.CID", int))
        raise TimeoutError(f"Search for {inchi} using key {key} timed out")

    def _scrape_cid(self, inchikey: str) -> int:
        # This is awful
        # Every attempt to get the actual, correct, unique CID corresponding to the inchikey
        # failed with every proper PubChem API
        # We can't use <pug_view>/data/compound/<inchikey> -- we can only use a CID there
        # I found it with a PUG API
        # https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/CID/GJSURZIOUXUGAL-UHFFFAOYSA-N/record/JSON
        # But that returns multiple results!!
        # There's no apparent way to find out which one is real
        # I tried then querying each found CID, getting the display data,
        # and looking at their parents
        # Unfortunately, we end up with multiple contradictory parents
        # Plus, that's insanely slow -- we have to get the full JSON data for each parent
        # Even worse -- the PubChem API docs LIE!!
        # Using ?cids_type=parent DOES NOT GIVE THE PARENT compound
        # Ex: https://pubchem.ncbi.nlm.nih.gov/compound/656832
        # This is cocaine HCl, which has cocaine (446220) as a parent
        # https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/656832/JSON
        # gives 656832 back again
        # same thing when querying by inchikey
        # Ultimately, I found that I can get HTML containing the CID from an inchikey
        # From there, we'll just have to download its "display" data and get the parent,
        # then download that data
        url = f"https://pubchem.ncbi.nlm.nih.gov/compound/{inchikey}"
        pat = re.compile(
            r'<meta property="og:url" '
            r'content="https://pubchem\.ncbi\.nlm\.nih\.gov/compound/(\d+)">'
        )
        try:
            html = self._query(url)
        except HTTPError:
            raise PubchemCompoundLookupError(
                f"Failed finding pubchem compound (HTML) from inchikey {inchikey} [url: {url}]"
            )
        match = pat.search(html)
        if match is None:
            raise PubchemCompoundLookupError(
                f"Something is wrong with the HTML from {url}; og:url not found"
            )
        return int(match.group(1))

    def _get_parent(self, cid: int, inchikey: str, data: PubchemData) -> PubchemData:
        # guard with is not None: we're not caching, so don't do it twice
        if data.parent_or_none is None:
            return data
        try:
            return self._fetch_data(data.parent_or_none, inchikey)
        except HTTPError:
            raise PubchemCompoundLookupError(
                "Failed finding pubchem parent compound (JSON) "
                f"for cid {data.parent_or_none}, child cid {cid}, inchikey {inchikey}"
            )

    def _fetch_data(self, cid: int, inchikey: str) -> PubchemData:
        when_started = datetime.now(timezone.utc).astimezone()
        # Issue (naming): "t0" and "t1" don't conform to snake_case naming style
        t0 = time.monotonic_ns()
        data = self._fetch_core_data(cid)
        t1 = time.monotonic_ns()
        when_finished = datetime.now(timezone.utc).astimezone()
        data["meta"] = self._get_metadata(inchikey, when_started, when_finished, t0, t1)
        self._strip_by_key_in_place(data, "DisplayControls")
        return PubchemData(NestedDotDict(data))

    def _fetch_core_data(self, cid: int) -> dict:
        return dict(
            record=self._fetch_display_data(cid),
            structure=self._fetch_structure_data(cid),
            external_tables=self._fetch_external_tables(cid),
            link_sets=self._fetch_external_linksets(cid),
            classifications=self._fetch_hierarchies(cid),
        )

    # Issue (best practice): too many arguments (6/5)
    # Issue (naming): arguments "t0" and "t1" don't conform to snake_case naming style
    # Issue (coding style): this method does not access any attributes of the class,
    # so it could be written as a function or a static/class method
    def _get_metadata(self, inchikey: str, started: datetime, finished: datetime, t0: int, t1: int):
        return dict(
            timestamp_fetch_started=started.isoformat(),
            timestamp_fetch_finished=finished.isoformat(),
            from_lookup=inchikey,
            fetch_nanos_taken=str(t1 - t0),
        )

    def _fetch_display_data(self, cid: int) -> Optional[NestedDotDict]:
        url = f"{self._pug_view}/data/compound/{cid}/JSON/?response_type=display"
        return self._query_json(url)["Record"]

    def _fetch_structure_data(self, cid: int) -> NestedDotDict:
        if not self._use_chem_data:
            return NestedDotDict({})
        url = f"{self._pug}/compound/cid/{cid}/JSON"
        data = self._query_json(url)["PC_Compounds"][0]
        del data["structure"]["props"]  # redundant with props section in record
        return data

    def _fetch_external_tables(self, cid: int) -> Mapping[str, str]:
        return {
            ext_table: self._fetch_external_table(cid, ext_table)
            for ext_table in self._tables_to_use.values()
        }

    def _fetch_external_linksets(self, cid: int) -> Mapping[str, str]:
        return {
            table: self._fetch_external_linkset(cid, table)
            for table in self._linksets_to_use.values()
        }

    def _fetch_hierarchies(self, cid: int) -> NestedDotDict:
        build_up = {}
        for hname, hid in self._hierarchies_to_use.items():
            try:
                build_up[hname] = self._fetch_hierarchy(cid, hid)
            # Issue (naming): "e" doesn't conform to snake_case naming style
            except (HTTPError, KeyError, LookupError) as e:
                logger.debug("No data for classifier %s, compound %s: %s", hid, cid, e)
        # These list all of the child nodes for each node
        # Some of them are > 1000 items -- they're HUGE
        # We don't expect to need to navigate to children
        self._strip_by_key_in_place(build_up, "ChildID")
        return NestedDotDict(build_up)

    def _fetch_external_table(self, cid: int, table: str) -> Sequence[dict]:
        url = self._external_table_url(cid, table)
        data = self._query(url)
        # Issue (naming): "df" doesn't conform to snake_case naming style
        df: pd.DataFrame = pd.read_csv(io.StringIO(data))
        return list(df.T.to_dict().values())

    def _fetch_external_linkset(self, cid: int, table: str) -> NestedDotDict:
        url = f"{self._link_db}?format=JSON&type={table}&operation=GetAllLinks&id_1={cid}"
        data = self._query(url)
        return NestedDotDict(orjson.loads(data))

    def _fetch_hierarchy(self, cid: int, hid: int) -> Sequence[dict]:
        url = (
            f"{self._classifications}?format=json&hid={hid}"
            f"&search_uid_type=cid&search_uid={cid}&search_type=list&response_type=display"
        )
        data: Sequence[dict] = orjson.loads(self._query(url))["Hierarchies"]
        # underneath Hierarchies is a list of Hierarchy
        logger.debug("Found data for classifier %s, compound %s", hid, cid)
        if len(data) == 0:
            raise LookupError(f"Failed getting hierarchy {hid}")
        return data

    @property
    def _tables_to_use(self) -> Mapping[str, str]:
        dct = {
            "drug:clinicaltrials.gov:clinical_trials": "clinicaltrials",
            "pharm:pubchem:reactions": "pathwayreaction",
            "uses:cpdat:uses": "cpdat",
            "tox:chemidplus:acute_effects": "chemidplus",
            "dis:ctd:associated_disorders_and_diseases": "ctd_chemical_disease",
            "lit:pubchem:depositor_provided_pubmed_citations": "pubmed",
            "bio:dgidb:drug_gene_interactions": "dgidb",
            "bio:ctd:chemical_gene_interactions": "ctdchemicalgene",
            "bio:drugbank:drugbank_interactions": "drugbank",
            "bio:drugbank:drug_drug_interactions": "drugbankddi",
            "bio:pubchem:bioassay_results": "bioactivity",
        }
        if self._use_extra_tables:
            dct.update(
                {
                    "patent:depositor_provided_patent_identifiers": "patent",
                    "bio:rcsb_pdb:protein_bound_3d_structures": "pdb",
                    "related:pubchem:related_compounds_with_annotation": "compound",
                }
            )
        return dct

    @property
    def _linksets_to_use(self) -> Mapping[str, str]:
        return {
            "lit:pubchem:chemical_cooccurrences_in_literature": "ChemicalNeighbor",
            "lit:pubchem:gene_cooccurrences_in_literature": "ChemicalGeneSymbolNeighbor",
            "lit:pubchem:disease_cooccurrences_in_literature": "ChemicalDiseaseNeighbor",
        }

    @property
    def _hierarchies_to_use(self) -> Mapping[str, int]:
        if not self._use_classifiers:
            return {}
        dct = {
            "MeSH Tree": 1,
            "ChEBI Ontology": 2,
            "WHO ATC Classification System": 79,
            "Guide to PHARMACOLOGY Target Classification": 92,
            "ChEMBL Target Tree": 87,
        }
        if self._use_extra_classifiers:
            dct.update(
                {
                    "KEGG: Phytochemical Compounds": 5,
                    "KEGG: Drug": 14,
                    "KEGG: USP": 15,
                    "KEGG: Major components of natural products": 69,
                    "KEGG: Target-based Classification of Drugs": 22,
                    "KEGG: OTC drugs": 25,
                    "KEGG: Drug Classes": 96,
                    "CAMEO Chemicals": 86,
                    "EPA CPDat Classification": 99,
                    "FDA Pharm Classes": 78,
                    "ChemIDplus": 84,
                }
            )
        return dct

    def _external_table_url(self, cid: int, collection: str) -> str:
        return (
            self._sdg
            + "?infmt=json"
            + "&outfmt=csv"
            + "&query={ download : * , collection : "
            + collection
            + " , where :{ ands :[{ cid : "
            + str(cid)
            + " }]}}"
        ).replace(" ", "%22")

    def _query_json(self, url: str) -> NestedDotDict:
        data = self._query(url)
        data = NestedDotDict(orjson.loads(data))
        if "Fault" in data:
            raise ValueError(f"Request failed ({data.get('Code')}) on {url}: {data.get('Message')}")
        return data

    def _strip_by_key_in_place(self, data: Union[dict, list], bad_key: str) -> None:
        # Issue (naming): "x" and "v" don't conform to snake_case naming style
        if isinstance(data, list):
            for x in data:
                self._strip_by_key_in_place(x, bad_key)
        elif isinstance(data, dict):
            for k, v in list(data.items()):
                if k == bad_key:
                    del data[k]
                elif isinstance(v, (list, dict)):
                    self._strip_by_key_in_place(v, bad_key)


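
The final replace(" ", "%22") in _external_table_url is easy to misread: the spaces in the template are placeholders for double quotes, so the result is a percent-encoded JSON query. The sketch below reproduces the construction as a free function and decodes the result to show that. The function names and the CID 2244 are illustrative assumptions; the base URL and the "pubmed" collection come from the listing.

```python
import json
from urllib.parse import unquote


def external_table_url(base: str, cid: int, collection: str) -> str:
    """Mirrors _external_table_url: each space becomes %22, an encoded double quote."""
    return (
        base
        + "?infmt=json"
        + "&outfmt=csv"
        + "&query={ download : * , collection : "
        + collection
        + " , where :{ ands :[{ cid : "
        + str(cid)
        + " }]}}"
    ).replace(" ", "%22")


def decoded_query(url: str) -> dict:
    """Percent-decodes the query parameter and parses it as JSON."""
    raw = url.split("&query=", 1)[1]
    return json.loads(unquote(raw))


url = external_table_url("https://pubchem.ncbi.nlm.nih.gov/sdq/sdqagent.cgi", 2244, "pubmed")
query = decoded_query(url)
print(query)
```

One consequence worth knowing: because every space turns into a quote, the CID is sent as a JSON string rather than a number, and a collection name containing a space would break the query.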
class CachingPubchemApi(PubchemApi):
    """Reads PubChem data from a cache directory, querying only on a cache miss."""

    # Issue (coding style): wrong hanging indentation before block (add 4 spaces)
    def __init__(
        self, cache_dir: Path, querier: Optional[QueryingPubchemApi], compress: bool = True
    ):
        self._cache_dir = cache_dir
        self._querier = querier
        self._compress = compress

    def fetch_data(self, inchikey: str) -> Optional[PubchemData]:
        path = self.data_path(inchikey)
        if path.exists():
            logger.info("Found cached PubChem data at %s", path.absolute())
        elif self._querier is None:
            raise PubchemCompoundLookupError(f"Key {inchikey} not found in cache")
        else:
            logger.info("Downloading PubChem data for %s ...", inchikey)
            data = self._querier.fetch_data(inchikey)
            path.parent.mkdir(parents=True, exist_ok=True)
            encoded = data.to_json()
            self._write_json(encoded, path)
            logger.info("Wrote PubChem data to %s", path.absolute())
            return data
        read = self._read_json(path)
        return PubchemData(read)

    def _write_json(self, encoded: str, path: Path) -> None:
        if self._compress:
            path.write_bytes(gzip.compress(encoded.encode(encoding="utf8")))
        else:
            path.write_text(encoded, encoding="utf8")

    def _read_json(self, path: Path) -> NestedDotDict:
        if self._compress:
            deflated = gzip.decompress(path.read_bytes())
            read = orjson.loads(deflated)
        else:
            read = orjson.loads(path.read_text(encoding="utf8"))
        return NestedDotDict(read)

    def find_similar_compounds(self, inchi: Union[int, str], min_tc: float) -> FrozenSet[int]:
        path = self.similarity_path(inchi)
        # Issue (naming): "df" doesn't conform to snake_case naming style
        if not path.exists():
            df = None
            existing = set()
        else:
            df = pd.read_csv(path, sep="\t")
            df = df[df["min_tc"] < min_tc]
            existing = set(df["cid"].values)
        if len(existing) == 0:
            found = self._querier.find_similar_compounds(inchi, min_tc)
            path.parent.mkdir(parents=True, exist_ok=True)
            new_df = pd.DataFrame([pd.Series(dict(cid=cid, min_tc=min_tc)) for cid in found])
            if df is not None:
                new_df = pd.concat([df, new_df])
            new_df.to_csv(path, sep="\t")
            return frozenset(existing.union(found))
        return frozenset(existing)

    def data_path(self, inchikey: str):
        """Returns the cache path of the JSON data for an InChI key."""
        ext = ".json.gz" if self._compress else ".json"
        return self._cache_dir / "data" / f"{inchikey}{ext}"

    def similarity_path(self, inchikey: str):
        """Returns the cache path of the similarity table for an InChI key."""
        ext = ".tab.gz" if self._compress else ".tab"
        return self._cache_dir / "similarity" / f"{inchikey}{ext}"


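
The cache layout used by CachingPubchemApi (data_path, _write_json, _read_json) can be tried out in isolation. This is a sketch under stated assumptions: the free functions below only mirror the methods above, the standard-library json module stands in for orjson, and the record content is made up for illustration.

```python
import gzip
import json
import tempfile
from pathlib import Path


def data_path(cache_dir: Path, inchikey: str, compress: bool) -> Path:
    # Same naming scheme as CachingPubchemApi.data_path
    ext = ".json.gz" if compress else ".json"
    return cache_dir / "data" / f"{inchikey}{ext}"


def write_json(encoded: str, path: Path, compress: bool) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    if compress:
        path.write_bytes(gzip.compress(encoded.encode("utf8")))
    else:
        path.write_text(encoded, encoding="utf8")


def read_json(path: Path, compress: bool) -> dict:
    if compress:
        # json.loads accepts the decompressed bytes directly
        return json.loads(gzip.decompress(path.read_bytes()))
    return json.loads(path.read_text(encoding="utf8"))


inchikey = "GJSURZIOUXUGAL-UHFFFAOYSA-N"
record = {"meta": {"from_lookup": inchikey}}  # made-up record for illustration
results = {}
with tempfile.TemporaryDirectory() as tmp:
    for compress in (True, False):
        path = data_path(Path(tmp), inchikey, compress)
        write_json(json.dumps(record), path, compress)
        results[path.name] = read_json(path, compress)

print(sorted(results))
```

Note that the compress flag decides both the file name and the decoding, so a cache written with compress=True is invisible to an instance created with compress=False, and the other way around.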
__all__ = [
    "PubchemApi",
    "CachingPubchemApi",
    "QueryingPubchemApi",
    "PubchemCompoundLookupError",
]