Passed
Push — main ( 4074a3...6b8d16 )
by Douglas
01:56
created

mandos.model.apis.querying_pubchem_api   B

Complexity

Total Complexity 51

Size/Duplication

Total Lines 388
Duplicated Lines 0 %

Importance

Changes 0
Metric Value
eloc 286
dl 0
loc 388
rs 7.92
c 0
b 0
f 0
wmc 51

25 Methods

Rating   Name   Duplication   Size   Complexity  
A QueryingPubchemApi._fetch_external_table() 0 6 1
A QueryingPubchemApi._fetch_core_data() 0 9 1
A QueryingPubchemApi._fetch_hierarchy() 0 9 2
A QueryingPubchemApi.find_inchikey() 0 4 1
A QueryingPubchemApi.fetch_data() 0 27 2
A QueryingPubchemApi._fetch_external_linkset() 0 5 1
A QueryingPubchemApi._get_parent() 0 15 3
A QueryingPubchemApi._external_table_url() 0 11 1
A QueryingPubchemApi._fetch_hierarchies() 0 13 3
A QueryingPubchemApi.fetch_properties() 0 19 4
A QueryingPubchemApi._fetch_structure_data() 0 8 2
A QueryingPubchemApi._get_linked_records() 0 10 1
A QueryingPubchemApi.find_id() 0 9 2
B QueryingPubchemApi._scrape_cid() 0 43 5
A QueryingPubchemApi._get_metadata() 0 6 1
A QueryingPubchemApi._query_json() 0 9 2
A QueryingPubchemApi._fetch_external_linksets() 0 7 1
A QueryingPubchemApi._fetch_data() 0 17 2
A QueryingPubchemApi.__init__() 0 13 1
A QueryingPubchemApi._tables_to_use() 0 24 2
A QueryingPubchemApi._hierarchies_to_use() 0 28 3
A QueryingPubchemApi._fetch_display_data() 0 5 1
A QueryingPubchemApi._linksets_to_use() 0 6 1
B QueryingPubchemApi._strip_by_key_in_place() 0 10 7
A QueryingPubchemApi._fetch_external_tables() 0 7 1

How to fix   Complexity   

Complexity

Complex classes like mandos.model.apis.querying_pubchem_api often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common way to find such a component is to look for fields or methods that share the same prefix or suffix.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
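As a rough illustration of Extract Class for this module: several methods in the listing share the `_fetch_external_` prefix (`_fetch_external_table`, `_fetch_external_tables`, `_external_table_url`), which suggests they could move into a small collaborator. The sketch below is hypothetical. The class name `ExternalTableFetcher`, its constructor, the simplified URL, and the stub executor are assumptions for illustration and are not part of mandos.

```python
import csv
import io
from typing import Callable, Dict, List, Mapping


class ExternalTableFetcher:
    """Hypothetical extracted class that owns only the external-table downloads."""

    def __init__(self, executor: Callable[[str], str], base_url: str):
        self._executor = executor  # any callable mapping a URL to response text
        self._base_url = base_url

    def url(self, cid: int, collection: str) -> str:
        # Simplified placeholder URL; the real query string is built differently.
        return f"{self._base_url}?collection={collection}&cid={cid}"

    def fetch_one(self, cid: int, collection: str) -> List[Dict[str, str]]:
        text = self._executor(self.url(cid, collection))
        return [dict(row) for row in csv.DictReader(io.StringIO(text))]

    def fetch_all(
        self, cid: int, collections: Mapping[str, str]
    ) -> Dict[str, List[Dict[str, str]]]:
        # Keyed by collection name, as in the listing's _fetch_external_tables.
        return {name: self.fetch_one(cid, name) for name in collections.values()}


def fake_executor(url: str) -> str:
    # Stand-in for a real HTTP call so the sketch runs offline.
    return "cid,value\n123,abc\n"


fetcher = ExternalTableFetcher(fake_executor, "https://example.org/sdq")
tables = fetcher.fetch_all(123, {"lit:pubchem:citations": "pubmed"})
```

The main class would then hold one `ExternalTableFetcher` instance and delegate to it, which removes three methods from its method count without changing behavior.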

1
"""
2
PubChem querying API.
3
"""
4
from __future__ import annotations
5
6
import io
7
import time
8
from datetime import datetime, timezone
9
from typing import Any, List, Mapping, Optional, Sequence, Tuple, Union
10
from urllib.error import HTTPError
11
12
import orjson
Issue: Unable to import 'orjson'
13
import pandas as pd
Issue: Unable to import 'pandas'
14
import regex
Issue: Unable to import 'regex'
15
from pocketutils.core.dot_dict import NestedDotDict
Issue: Unable to import 'pocketutils.core.dot_dict'
16
from pocketutils.core.exceptions import (
Issue: Unable to import 'pocketutils.core.exceptions'
17
    DataIntegrityError,
18
    DownloadError,
19
    LookupFailedError,
20
)
21
from pocketutils.core.query_utils import QueryExecutor
Issue: Unable to import 'pocketutils.core.query_utils'
22
23
from mandos.model.apis.pubchem_api import PubchemApi, PubchemCompoundLookupError
24
from mandos.model.apis.pubchem_support.pubchem_data import PubchemData
25
from mandos.model.settings import QUERY_EXECUTORS, SETTINGS
26
from mandos.model.utils.setup import logger
27
28
29
class QueryingPubchemApi(PubchemApi):
Issue: Missing class docstring
30
    def __init__(
Issue (best-practice): Too many arguments (6/5)
31
        self,
Issue (Coding Style): Wrong hanging indentation before block (add 4 spaces).
32
        chem_data: bool = True,
Issue (Coding Style): Wrong hanging indentation before block (add 4 spaces).
33
        extra_tables: bool = False,
Issue (Coding Style): Wrong hanging indentation before block (add 4 spaces).
34
        classifiers: bool = False,
Issue (Coding Style): Wrong hanging indentation before block (add 4 spaces).
35
        extra_classifiers: bool = False,
Issue (Coding Style): Wrong hanging indentation before block (add 4 spaces).
36
        executor: QueryExecutor = QUERY_EXECUTORS.pubchem,
Issue (Coding Style): Wrong hanging indentation before block (add 4 spaces).
37
    ):
38
        self._use_chem_data = chem_data
39
        self._use_extra_tables = extra_tables
40
        self._use_classifiers = classifiers
41
        self._use_extra_classifiers = extra_classifiers
42
        self._executor = executor
43
44
    _pug = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
45
    _pug_view = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view"
46
    _sdg = "https://pubchem.ncbi.nlm.nih.gov/sdq/sdqagent.cgi"
47
    _classifications = "https://pubchem.ncbi.nlm.nih.gov/classification/cgi/classifications.fcgi"
48
    _link_db = "https://pubchem.ncbi.nlm.nih.gov/link_db/link_db_server.cgi"
49
50
    def find_inchikey(self, cid: int) -> str:
Issue: Missing function or method docstring
51
        # return self.fetch_data(cid).names_and_identifiers.inchikey
52
        props = self.fetch_properties(cid)
53
        return props["InChIKey"]
54
55
    def find_id(self, inchikey: str) -> Optional[int]:
Issue: Missing function or method docstring
56
        # we have to scrape to get the parent anyway,
57
        # so just download it
58
        # TODO: there's a faster way
Issue (Coding Style): TODO and FIXME comments should generally be avoided.
59
        try:
60
            return self.fetch_data(inchikey).cid
61
        except PubchemCompoundLookupError:
62
            logger.debug(f"Could not find pubchem ID for {inchikey}", exc_info=True)
63
            return None
64
65
    def fetch_properties(self, cid: int) -> Mapping[str, Any]:
Issue: Missing function or method docstring
66
        url = f"{self._pug}/compound/cid/{cid}/JSON"
67
        #
68
        try:
69
            matches: NestedDotDict = self._query_json(url)
70
        except HTTPError:
71
            raise PubchemCompoundLookupError(f"Failed finding pubchem compound {cid}")
72
        props = matches["PC_Compounds"][0]["props"]
73
        props = {NestedDotDict(p).get("urn.label"): p.get("value") for p in props}
74
75
        def _get_val(v):
Issue (Coding Style, Naming): Argument name "v" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)

This check looks for invalid names for a range of different identifiers.

You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements.

If your project includes a Pylint configuration file, the settings contained in that file take precedence.

To find out more about Pylint, please refer to their site.

Issue (Unused Code): Either all return statements in a function should return an expression, or none of them should.
76
            v = NestedDotDict(v)
77
            for t in ["ival", "fval", "sval"]:
Issue (Coding Style, Naming): Variable name "t" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
78
                if t in v.keys():
79
                    return v[t]
80
81
        props = {k: _get_val(v) for k, v in props.items() if k is not None and v is not None}
82
        logger.debug(f"DLed properties for {cid}")
83
        return props
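The "inconsistent return" annotation on `_get_val` can be addressed by returning `None` explicitly when none of the typed keys is present. The following is a minimal sketch under stated assumptions: it uses plain dicts instead of `NestedDotDict`, the helper names `get_val` and `flatten_props` are hypothetical, and the sample records only imitate the `urn.label` / `value` shape that the method reads.

```python
from typing import Any, Mapping, Optional, Sequence


def get_val(value: Mapping[str, Any]) -> Optional[Any]:
    """Return the first typed value present, or None explicitly."""
    for key in ("ival", "fval", "sval"):
        if key in value:
            return value[key]
    return None  # explicit, so every return statement returns an expression


def flatten_props(props: Sequence[Mapping[str, Any]]) -> dict:
    """Map each property label to its typed value, skipping unlabeled entries."""
    flat = {}
    for prop in props:
        label = prop.get("urn", {}).get("label")
        value = prop.get("value")
        if label is not None and value is not None:
            flat[label] = get_val(value)
    return flat


# Illustrative records only; real responses carry many more fields.
sample = [
    {"urn": {"label": "InChIKey"}, "value": {"sval": "GJSURZIOUXUGAL-UHFFFAOYSA-N"}},
    {"urn": {"label": "Count"}, "value": {"ival": 3}},
    {"urn": {}, "value": {"sval": "ignored"}},
]
flat = flatten_props(sample)
```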
84
85
    def fetch_data(self, inchikey: Union[str, int]) -> PubchemData:
86
        # Dear God this is terrible
87
        # Here are the steps:
88
        # 1. Download HTML for the InChI key and scrape the CID
89
        # 2. Download the "display" JSON data from the CID
90
        # 3. Look for a Parent-type related compound. If it exists, download its display data
91
        # 4. Download the structural data and append it
92
        # 5. Download the external table CSVs and append them
93
        # 6. Download the link sets and append them
94
        # 7. Download the classifiers (hierarchies) and append them
95
        # 8. Attach metadata about how we found this.
96
        # 9. Return the stupid, stupid result as a massive JSON struct.
97
        logger.info(f"Downloading PubChem data for {inchikey}")
98
        if isinstance(inchikey, int):
99
            cid = inchikey
100
            # note: this might not be the parent
101
            # that's ok -- we're about to fix that
102
            inchikey = self.find_inchikey(cid)
103
            logger.debug(f"Matched CID {cid} to {inchikey}")
104
        else:
105
            cid = self._scrape_cid(inchikey)
106
            logger.debug(f"Matched inchikey {inchikey} to CID {cid} (scraped)")
107
        stack = []
108
        data = self._fetch_data(cid, inchikey, stack)
109
        logger.debug(f"Downloaded raw data for {cid}/{inchikey}")
110
        data = self._get_parent(cid, inchikey, data, stack)
111
        return data
112
113
    def _scrape_cid(self, inchikey: str) -> int:
114
        # This is awful
115
        # Every attempt to get the actual, correct, unique CID corresponding to the inchikey
116
        # failed with every proper PubChem API
117
        # We can't use <pug_view>/data/compound/<inchikey> -- we can only use a CID there
118
        # I found it with a PUG API
119
        # https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/CID/GJSURZIOUXUGAL-UHFFFAOYSA-N/record/JSON
120
        # But that returns multiple results!!
121
        # There's no apparent way to find out which one is real
122
        # I tried then querying each found CID, getting the display data, and looking at their parents
Issue (Coding Style): This line is too long as per the coding-style (102/100).

This check looks for lines that are too long. You can specify the maximum line length.
123
        # Unfortunately, we end up with multiple contradictory parents
124
        # Plus, that's insanely slow -- we have to get the full JSON data for each parent
125
        # Even worse -- the PubChem API docs LIE!!
126
        # Using ?cids_type=parent DOES NOT GIVE THE PARENT compound
127
        # Ex: https://pubchem.ncbi.nlm.nih.gov/compound/656832
128
        # This is cocaine HCl, which has cocaine (446220) as a parent
129
        # https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/656832/JSON
130
        # gives 656832 back again
131
        # same thing when querying by inchikey
132
        # Ultimately, I found that I can get HTML containing the CID from an inchikey
133
        # From there, we'll just have to download its "display" data and get the parent, then download that data
Issue (Coding Style): This line is too long as per the coding-style (112/100).
134
        url = f"https://pubchem.ncbi.nlm.nih.gov/compound/{inchikey}"
135
        pat = regex.compile(
136
            r'<meta property="og:url" content="https://pubchem\.ncbi\.nlm\.nih\.gov/compound/(\d+)">',
Issue (Coding Style): This line is too long as per the coding-style (102/100).
137
            flags=regex.V1,
138
        )
139
        try:
140
            for i in range(SETTINGS.pubchem_n_tries):
Issue (Unused Code): The variable i seems to be unused.
141
                try:
142
                    html = self._executor(url)
143
                except ConnectionAbortedError:
144
                    logger.warning(f"Connection aborted for {inchikey} [url: {url}]", exc_info=True)
145
                    continue
146
        except HTTPError:
147
            raise PubchemCompoundLookupError(
148
                f"Failed finding pubchem compound (HTML) from {inchikey} [url: {url}]"
149
            )
150
        match = pat.search(html)
Issue: The variable html does not seem to be defined in case the for loop on line 140 is not entered. Are you sure this can never be the case?
151
        if match is None:
152
            raise DataIntegrityError(
153
                f"Something is wrong with the HTML from {url}; og:url not found"
154
            )
155
        return int(match.group(1))
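Two annotations on `_scrape_cid` point at the retry loop: the loop variable is unused, and `html` may be undefined if no attempt succeeds. As written, the loop also keeps requesting after a successful download because nothing exits it early. Below is a hedged sketch of one way to restructure it. The names `fetch_with_retries`, `extract_cid`, and `flaky_fetch` are hypothetical, the standard-library `re` module stands in for `regex`, and the exception types raised here are placeholders for the project's own.

```python
import re
from typing import Callable, List

OG_URL = re.compile(
    r'<meta property="og:url" content="https://pubchem\.ncbi\.nlm\.nih\.gov/compound/(\d+)">'
)


def fetch_with_retries(fetch: Callable[[str], str], url: str, n_tries: int) -> str:
    """Return the first successful response, retrying on aborted connections."""
    for _ in range(n_tries):
        try:
            html = fetch(url)
        except ConnectionAbortedError:
            continue  # transient failure: try again
        return html  # stop at the first success
    # Reached only if every attempt failed or n_tries was 0.
    raise LookupError(f"No response from {url} after {n_tries} tries")


def extract_cid(html: str) -> int:
    """Pull the CID out of the og:url meta tag."""
    match = OG_URL.search(html)
    if match is None:
        raise ValueError("og:url not found")
    return int(match.group(1))


calls: List[str] = []


def flaky_fetch(url: str) -> str:
    # Stand-in for the executor: fails once, then returns a minimal page.
    calls.append(url)
    if len(calls) == 1:
        raise ConnectionAbortedError()
    return '<meta property="og:url" content="https://pubchem.ncbi.nlm.nih.gov/compound/446220">'


cid = extract_cid(fetch_with_retries(flaky_fetch, "https://example.org/compound/KEY", 3))
```

Returning from inside the loop means `html` is always bound when it is used, and the explicit error after the loop covers the case the annotation warns about.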
156
157
    def _get_parent(
158
        self, cid: int, inchikey: str, data: PubchemData, stack: List[Tuple[int, str]]
Issue (Coding Style): Wrong hanging indentation before block (add 4 spaces).
159
    ) -> PubchemData:
160
        # guard with is not None: we're not caching, so don't do it twice
161
        p = data.parent_or_none
Issue (Coding Style, Naming): Variable name "p" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
162
        if p is None:
163
            logger.info(f"{cid}/{inchikey} is its own parent")
164
            return data
165
        try:
166
            logger.info(f"{cid}/{inchikey} has parent {p}")
167
            del data
168
            return self._fetch_data(p, inchikey, stack)
169
        except HTTPError:
170
            raise PubchemCompoundLookupError(
171
                f"Failed finding pubchem parent compound (JSON) "
172
                f"for cid {p}, child cid {cid}, inchikey {inchikey}"
173
            )
174
175
    def _fetch_data(self, cid: int, inchikey: str, stack: List[Tuple[int, str]]) -> PubchemData:
176
        when_started = datetime.now(timezone.utc).astimezone()
177
        t0 = time.monotonic_ns()
Issue (Coding Style, Naming): Variable name "t0" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
178
        try:
179
            data = self._fetch_core_data(cid, stack)
180
        except HTTPError:
181
            raise PubchemCompoundLookupError(
182
                f"Failed finding pubchem compound (JSON) from cid {cid}, inchikey {inchikey}"
183
            )
184
        t1 = time.monotonic_ns()
Issue (Coding Style, Naming): Variable name "t1" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
185
        when_finished = datetime.now(timezone.utc).astimezone()
186
        logger.trace(f"Downloaded {cid} in {t1-t0} ns")
187
        data["meta"] = self._get_metadata(inchikey, when_started, when_finished, t0, t1)
188
        self._strip_by_key_in_place(data, "DisplayControls")
189
        stack.append((cid, inchikey))
190
        logger.trace(f"Stack: {stack}")
191
        return PubchemData(NestedDotDict(data))
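The timing in `_fetch_data` comes from `time.monotonic_ns()`, so the difference between the two readings is in nanoseconds. A small sketch of converting it to seconds for a log message follows; the helper name `format_elapsed` is hypothetical.

```python
import time


def format_elapsed(t0_ns: int, t1_ns: int) -> str:
    """Format a monotonic_ns() difference as seconds with millisecond precision."""
    return f"{(t1_ns - t0_ns) / 1_000_000_000:.3f} s"


start = time.monotonic_ns()
sum(range(10_000))  # stand-in for the download
stop = time.monotonic_ns()
message = format_elapsed(start, stop)
```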
192
193
    def _fetch_core_data(self, cid: int, stack: List[Tuple[int, str]]) -> dict:
194
        return dict(
195
            record=self._fetch_display_data(cid),
196
            linked_records=self._get_linked_records(cid, stack),
197
            structure=self._fetch_structure_data(cid),
198
            external_tables=self._fetch_external_tables(cid),
199
            link_sets=self._fetch_external_linksets(cid),
200
            classifications=self._fetch_hierarchies(cid),
201
            properties=NestedDotDict(self.fetch_properties(cid)),
202
        )
203
204
    def _get_metadata(self, inchikey: str, started: datetime, finished: datetime, t0: int, t1: int):
Issue (Coding Style, Naming): Argument name "t1" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
Issue (best-practice): Too many arguments (6/5)
Issue (Coding Style): This method could be written as a function/class method.

If a method does not access any attributes of the class, it could also be implemented as a function or static method. This can help improve readability. For example

class Foo:
    def some_method(self, x, y):
        return x + y

could be written as

class Foo:
    @classmethod
    def some_method(cls, x, y):
        return x + y

Issue (Coding Style, Naming): Argument name "t0" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
205
        return dict(
206
            timestamp_fetch_started=started.isoformat(),
207
            timestamp_fetch_finished=finished.isoformat(),
208
            from_lookup=inchikey,
209
            fetch_nanos_taken=str(t1 - t0),
210
        )
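Since `_get_metadata` never touches `self`, one option the annotation hints at is a plain module-level function. The sketch below assumes that approach. The name `build_metadata` and the longer parameter names are hypothetical, while the dictionary keys are taken from the listing.

```python
from datetime import datetime, timezone


def build_metadata(
    inchikey: str, started: datetime, finished: datetime, start_ns: int, end_ns: int
) -> dict:
    """Describe how and when a record was fetched; needs no instance state."""
    return dict(
        timestamp_fetch_started=started.isoformat(),
        timestamp_fetch_finished=finished.isoformat(),
        from_lookup=inchikey,
        fetch_nanos_taken=str(end_ns - start_ns),
    )


meta = build_metadata(
    "GJSURZIOUXUGAL-UHFFFAOYSA-N",
    datetime(2021, 1, 1, tzinfo=timezone.utc),
    datetime(2021, 1, 1, 0, 0, 2, tzinfo=timezone.utc),
    0,
    2_000_000_000,
)
```

A `@staticmethod` on the class would work equally well if keeping the helper next to its caller matters more than reducing the method count.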
211
212
    def _get_linked_records(self, cid: int, stack: List[Tuple[int, str]]) -> NestedDotDict:
213
        url = f"{self._pug}/compound/cid/{cid}/cids/JSON?cids_type=same_parent_stereo"
214
        data = self._query_json(url).sub("IdentifierList")
215
        logger.debug(f"DLed {len(data.get('CID', []))} linked records for {cid}")
216
        results = {
217
            "CID": [*data.get("CID", []), *[s for s, _ in stack]],
218
            "inchikey": [i for _, i in stack],
219
        }
220
        logger.debug(f"Linked records are: {results}")
221
        return NestedDotDict(results)
222
223
    def _fetch_display_data(self, cid: int) -> Optional[NestedDotDict]:
224
        url = f"{self._pug_view}/data/compound/{cid}/JSON/?response_type=display"
225
        data = self._query_json(url)["Record"]
226
        logger.debug(f"DLed display data for {cid}")
227
        return data
228
229
    def _fetch_structure_data(self, cid: int) -> NestedDotDict:
230
        if not self._use_chem_data:
231
            return NestedDotDict({})
232
        url = f"{self._pug}/compound/cid/{cid}/JSON"
233
        data = self._query_json(url)["PC_Compounds"][0]
234
        del data["props"]  # redundant with props section in record
235
        logger.debug(f"DLed structure for {cid}")
236
        return data
237
238
    def _fetch_external_tables(self, cid: int) -> Mapping[str, str]:
239
        x = {
Issue (Coding Style, Naming): Variable name "x" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
240
            ext_table: self._fetch_external_table(cid, ext_table)
241
            for ext_table in self._tables_to_use.values()
242
        }
243
        logger.debug(f"DLed {len(self._tables_to_use)} external tables for {cid}")
244
        return x
245
246
    def _fetch_external_linksets(self, cid: int) -> Mapping[str, str]:
247
        x = {
Issue (Coding Style, Naming): Variable name "x" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
248
            table: self._fetch_external_linkset(cid, table)
249
            for table in self._linksets_to_use.values()
250
        }
251
        logger.debug(f"DLed {len(self._linksets_to_use)} external linksets for {cid}")
252
        return x
253
254
    def _fetch_hierarchies(self, cid: int) -> NestedDotDict:
255
        build_up = {}
256
        for hname, hid in self._hierarchies_to_use.items():
257
            try:
258
                build_up[hname] = self._fetch_hierarchy(cid, hname, hid)
259
            except (HTTPError, KeyError, LookupError) as e:
Issue (Coding Style, Naming): Variable name "e" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
260
                logger.debug(f"No data for classifier {hid}, compound {cid}: {e}")
261
        # These list all of the child nodes for each node
262
        # Some of them are > 1000 items -- they're HUGE
263
        # We don't expect to need to navigate to children
264
        self._strip_by_key_in_place(build_up, "ChildID")
265
        logger.debug(f"DLed {len(self._hierarchies_to_use)} hierarchies for {cid}")
266
        return NestedDotDict(build_up)
267
268
    def _fetch_external_table(self, cid: int, table: str) -> Sequence[dict]:
269
        url = self._external_table_url(cid, table)
270
        data = self._executor(url)
271
        df: pd.DataFrame = pd.read_csv(io.StringIO(data)).reset_index()
Issue (Coding Style, Naming): Variable name "df" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
272
        logger.debug(f"Downloaded table {table} with {len(df)} rows for {cid}")
273
        return list(df.to_dict(orient="records"))
274
275
    def _fetch_external_linkset(self, cid: int, table: str) -> NestedDotDict:
276
        url = f"{self._link_db}?format=JSON&type={table}&operation=GetAllLinks&id_1={cid}"
277
        data = self._executor(url)
278
        logger.debug(f"Downloaded linkset {table} rows for {cid}")
279
        return NestedDotDict(orjson.loads(data))
280
281
    def _fetch_hierarchy(self, cid: int, hname: str, hid: int) -> Sequence[dict]:
282
        url = f"{self._classifications}?format=json&hid={hid}&search_uid_type=cid&search_uid={cid}&search_type=list&response_type=display"
Issue (Coding Style): This line is too long as per the coding-style (138/100).
283
        data: Sequence[dict] = orjson.loads(self._executor(url))["Hierarchies"]
284
        # underneath Hierarchies is a list of Hierarchy
285
        logger.debug(f"Found data for classifier {hid}, compound {cid}")
286
        if len(data) == 0:
287
            raise LookupFailedError(f"Failed getting hierarchy {hid}")
288
        logger.debug(f"Downloaded hierarchy {hname} ({hid}) for {cid}")
289
        return data
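The 138-character URL line in `_fetch_hierarchy` could be shortened by building the query string from a mapping. This is a sketch under the assumption that parameter order and encoding need only match the literal string in the listing; the function name `hierarchy_url` is hypothetical.

```python
from urllib.parse import urlencode

CLASSIFICATIONS = "https://pubchem.ncbi.nlm.nih.gov/classification/cgi/classifications.fcgi"


def hierarchy_url(cid: int, hid: int) -> str:
    """Build the classification query URL with one parameter per line."""
    params = {
        "format": "json",
        "hid": hid,
        "search_uid_type": "cid",
        "search_uid": cid,
        "search_type": "list",
        "response_type": "display",
    }
    # urlencode keeps insertion order and converts the integers to strings.
    return f"{CLASSIFICATIONS}?{urlencode(params)}"


url = hierarchy_url(cid=446220, hid=1)
```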
290
291
    @property
292
    def _tables_to_use(self) -> Mapping[str, str]:
293
        dct = {
294
            "drug:clinicaltrials.gov:clinical_trials": "clinicaltrials",
295
            "pharm:pubchem:reactions": "pathwayreaction",
296
            "uses:cpdat:uses": "cpdat",
297
            "tox:chemidplus:acute_effects": "chemidplus",
298
            "dis:ctd:associated_disorders_and_diseases": "ctd_chemical_disease",
299
            "lit:pubchem:depositor_provided_pubmed_citations": "pubmed",
300
            "bio:dgidb:drug_gene_interactions": "dgidb",
301
            "bio:ctd:chemical_gene_interactions": "ctdchemicalgene",
302
            "bio:drugbank:drugbank_interactions": "drugbank",
303
            "bio:drugbank:drug_drug_interactions": "drugbankddi",
304
            "bio:pubchem:bioassay_results": "bioactivity",
305
        }
306
        if self._use_extra_tables:
307
            dct.update(
308
                {
309
                    "patent:depositor_provided_patent_identifiers": "patent",
310
                    "bio:rcsb_pdb:protein_bound_3d_structures": "pdb",
311
                    "related:pubchem:related_compounds_with_annotation": "compound",
312
                }
313
            )
314
        return dct
315
316
    @property
317
    def _linksets_to_use(self) -> Mapping[str, str]:
318
        return {
319
            "lit:pubchem:chemical_cooccurrences_in_literature": "ChemicalNeighbor",
320
            "lit:pubchem:gene_cooccurrences_in_literature": "ChemicalGeneSymbolNeighbor",
321
            "lit:pubchem:disease_cooccurrences_in_literature": "ChemicalDiseaseNeighbor",
322
        }
323
324
    @property
325
    def _hierarchies_to_use(self) -> Mapping[str, int]:
326
        if not self._use_classifiers:
327
            return {}
328
        dct = {
329
            "MeSH Tree": 1,
330
            "ChEBI Ontology": 2,
331
            "WHO ATC Classification System": 79,
332
            "Guide to PHARMACOLOGY Target Classification": 92,
333
            "ChEMBL Target Tree": 87,
334
        }
335
        if self._use_extra_classifiers:
336
            dct.update(
337
                {
338
                    "KEGG: Phytochemical Compounds": 5,
339
                    "KEGG: Drug": 14,
340
                    "KEGG: USP": 15,
341
                    "KEGG: Major components of natural products": 69,
342
                    "KEGG: Target-based Classification of Drugs": 22,
343
                    "KEGG: OTC drugs": 25,
344
                    "KEGG: Drug Classes": 96,
345
                    "CAMEO Chemicals": 86,
346
                    "EPA CPDat Classification": 99,
347
                    "FDA Pharm Classes": 78,
348
                    "ChemIDplus": 84,
349
                }
350
            )
351
        return dct
352
353
    def _external_table_url(self, cid: int, collection: str) -> str:
354
        return (
355
            self._sdg
356
            + "?infmt=json"
357
            + "&outfmt=csv"
358
            + "&query={ download : * , collection : "
359
            + collection
360
            + " , where :{ ands :[{ cid : "
361
            + str(cid)
362
            + " }]}}"
363
        ).replace(" ", "%22")
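The `.replace(" ", "%22")` trick in `_external_table_url` uses spaces as stand-ins for double quotes, so the query is effectively compact JSON with percent-encoded quotes. A hedged alternative is to serialize the query with `json.dumps` and encode it with `urllib.parse.quote`. The sketch only reproduces the string that the listing's concatenation yields; whether the endpoint accepts other encodings is not verified here, and both function names are hypothetical.

```python
import json
from urllib.parse import quote

SDQ = "https://pubchem.ncbi.nlm.nih.gov/sdq/sdqagent.cgi"


def external_table_url(cid: int, collection: str) -> str:
    """Build the download URL from a real JSON object."""
    query = {
        "download": "*",
        "collection": collection,
        "where": {"ands": [{"cid": str(cid)}]},
    }
    compact = json.dumps(query, separators=(",", ":"))
    # Keep JSON punctuation literal so that only the double quotes become %22.
    encoded = quote(compact, safe="{}[]:,*")
    return f"{SDQ}?infmt=json&outfmt=csv&query={encoded}"


def original_style_url(cid: int, collection: str) -> str:
    # Mirrors the string-concatenation approach in the listing, for comparison.
    return (
        SDQ
        + "?infmt=json"
        + "&outfmt=csv"
        + "&query={ download : * , collection : "
        + collection
        + " , where :{ ands :[{ cid : "
        + str(cid)
        + " }]}}"
    ).replace(" ", "%22")


new_url = external_table_url(446220, "pubmed")
old_url = original_style_url(446220, "pubmed")
```

For the collection names used in the listing, which contain only letters, digits, and underscores, both functions produce the same URL.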
364
365
    def _query_json(self, url: str) -> NestedDotDict:
366
        data = self._executor(url)
367
        data = NestedDotDict(orjson.loads(data))
368
        if "Fault" in data:
369
            raise DownloadError(
370
                f"Request failed ({data.get('Code')}) on {url}: {data.get('Message')}"
371
            )
372
        logger.trace(f"Queried {url}")
373
        return data
374
375
    def _strip_by_key_in_place(self, data: Union[dict, list], bad_key: str) -> None:
376
        if isinstance(data, list):
377
            for x in data:
Issue (Coding Style, Naming): Variable name "x" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
378
                self._strip_by_key_in_place(x, bad_key)
379
        elif isinstance(data, dict):
380
            for k, v in list(data.items()):
Issue (Coding Style, Naming): Variable name "v" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
381
                if k == bad_key:
382
                    del data[k]
383
                elif isinstance(v, (list, dict)):
384
                    self._strip_by_key_in_place(v, bad_key)
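`_strip_by_key_in_place` is one of the two B-rated methods (complexity 7). Because it uses no instance state, it could be a standalone function with fewer branches. A minimal sketch follows, assuming the input contains only dicts, lists, and scalar values; the sample record is invented for illustration.

```python
from typing import Union


def strip_by_key_in_place(data: Union[dict, list], bad_key: str) -> None:
    """Recursively delete every occurrence of bad_key from nested dicts and lists."""
    if isinstance(data, list):
        for item in data:
            strip_by_key_in_place(item, bad_key)
    elif isinstance(data, dict):
        data.pop(bad_key, None)  # remove the key at this level, if present
        for value in data.values():
            strip_by_key_in_place(value, bad_key)
    # Scalars (str, int, None, ...) fall through untouched.


record = {
    "DisplayControls": 1,
    "Section": [{"DisplayControls": {"x": 1}, "Name": "a"}, "text"],
    "Meta": {"k": {"DisplayControls": 2}},
}
strip_by_key_in_place(record, "DisplayControls")
```

Popping the key before iterating avoids copying the items list, and letting non-container values fall through removes the inner type check.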
385
386
387
__all__ = ["QueryingPubchemApi"]
388