Passed
Push — main ( 4e4203...cdf0f7 )
by Douglas
01:39
created

mandos.model.taxonomy   F

Complexity

Total Complexity 63

Size/Duplication

Total Lines 378
Duplicated Lines 0 %

Importance

Changes 0
Metric Value
eloc 201
dl 0
loc 378
rs 3.36
c 0
b 0
f 0
wmc 63

45 Methods

Rating   Name   Duplication   Size   Complexity  
A _Taxon.__str__() 0 2 1
A Taxon._ancestors() 0 3 1
A Taxon.children() 0 8 1
A Taxon.__lt__() 0 2 1
A Taxon.__str__() 0 2 1
A Taxon.__repr__() 0 2 1
A Taxon.__eq__() 0 2 1
A Taxon.__hash__() 0 2 1
A _Taxon.__lt__() 0 2 1
A Taxon._descendents() 0 4 2
A _Taxon.__eq__() 0 2 1
A _Taxon.__repr__() 0 2 1
A Taxon.descendents() 0 10 1
A Taxon.ancestors() 0 10 1
A _Taxon.__hash__() 0 2 1
A Taxon.id() 0 8 1
A _Taxon.add_child() 0 2 1
A _Taxon.set_parent() 0 2 1
A Taxon.parent() 0 8 1
A _Taxon.set_name() 0 2 1
A Taxon.name() 0 8 1
A Taxonomy.__len__() 0 2 1
A Taxonomy.n_taxa() 0 2 1
A Taxonomy.subtree() 0 6 1
A Taxonomy.__getitem__() 0 17 2
A Taxonomy.roots() 0 8 1
A Taxonomy.from_list() 0 8 1
A Taxonomy.get_by_name() 0 7 2
A Taxonomy.req_only_by_name() 0 12 3
A Taxonomy.to_df() 0 5 1
A Taxonomy.subtrees_by_name() 0 10 2
A Taxonomy.from_path() 0 4 1
A Taxonomy.contains() 0 2 1
A Taxonomy.taxa() 0 8 1
A Taxonomy.__contains__() 0 2 1
A Taxonomy.req_one_by_name() 0 10 2
A Taxonomy.req() 0 8 2
B Taxonomy.from_df() 0 31 5
A Taxonomy.leaves() 0 3 1
A Taxonomy.get_one_by_name() 0 13 3
A Taxonomy._build_by_name() 0 6 2
A Taxonomy.get() 0 16 3
A Taxonomy.__init__() 0 12 2
A Taxonomy.__str__() 0 2 1
A Taxonomy.__repr__() 0 3 1

How to fix   Complexity   

Complexity

Complex modules and classes like mandos.model.taxonomy often do a lot of different things. To break such a class down, we need to identify a cohesive component within it. A common approach is to look for fields or methods that share the same prefix or suffix. Here, for example, get_by_name, get_one_by_name, req_one_by_name, req_only_by_name, and subtrees_by_name all end in by_name.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
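
As a rough illustration of Extract Class applied to this module, the name-based lookups could move into a small class of their own. The sketch below is not the project's code: `NameIndex`, its method names, and the `Item` stand-in for `Taxon` are all hypothetical, chosen so that the block runs on its own.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Optional


@dataclass(frozen=True, order=True)
class Item:
    """Hypothetical stand-in for Taxon; compares and hashes by (id, name)."""

    id: int
    name: str


class NameIndex:
    """Hypothetical extracted class that owns every by-name lookup."""

    def __init__(self, items: Iterable[Item]):
        grouped = defaultdict(set)
        for item in items:
            grouped[item.name].add(item)
        self._by_name: Dict[str, FrozenSet[Item]] = {
            name: frozenset(group) for name, group in grouped.items()
        }

    def get_all(self, name: str) -> FrozenSet[Item]:
        """All items with this name (possibly empty)."""
        return self._by_name.get(name, frozenset())

    def get_one(self, name: str) -> Optional[Item]:
        """The match with the lowest ID, or None if there is no match."""
        matches = self.get_all(name)
        return min(matches) if matches else None

    def require_only(self, name: str) -> Item:
        """The single match; raises if there are zero or several."""
        matches = self.get_all(name)
        if len(matches) == 0:
            raise LookupError(f"No items for {name}")
        if len(matches) > 1:
            raise ValueError(f"Got multiple results for {name}")
        return next(iter(matches))


index = NameIndex([Item(2, "Felis"), Item(1, "Felis"), Item(3, "Canis")])
print(index.get_one("Felis"))
print(index.require_only("Canis"))
```

With the lookups isolated like this, the tree class would keep only a reference to the index and delegate to it, which is the usual shape of the refactoring.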

from __future__ import annotations
Issue: Missing module docstring

from collections import defaultdict
from dataclasses import dataclass
from functools import total_ordering
from pathlib import Path
from typing import List, Mapping, Optional, Sequence, Set, Union, FrozenSet, Iterable

import pandas as pd
Issue: Unable to import 'pandas'
from typeddfs import TypedDfs
Issue: Unable to import 'typeddfs'

from mandos import logger

TaxonomyDf = (
    TypedDfs.typed("TaxonomyDf").require("taxon").require("parent").require("scientific_name")
).build()


@total_ordering
Issue (Documentation): Empty class docstring
@dataclass()
class Taxon:
    """"""

    # we can't use frozen=True because parents and children reference each other,
    # so the links have to be set after construction;
    # instead, just expose read-only properties
    __id: int
    __name: str
    __parent: Optional[Taxon]
    __children: Set[Taxon]

    @property
    def id(self) -> int:
Issue (Coding Style, Naming): Attribute name "id" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)

This check looks for invalid names for a range of different identifiers. You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements. If your project includes a Pylint configuration file, the settings contained in that file take precedence.
        """
        Returns:
            The unique integer ID of this taxon
        """
        return self.__id

    @property
    def name(self) -> str:
        """
        Returns:
            The scientific name of this taxon
        """
        return self.__name

    @property
    def parent(self) -> Optional[Taxon]:
        """
        Returns:
            The parent taxon, or None if this taxon has no parent
        """
        return self.__parent

    @property
    def children(self) -> Set[Taxon]:
        """
        Returns:
            A copy of the set of direct children
        """
        return set(self.__children)

    @property
    def ancestors(self) -> Sequence[Taxon]:
        """
        Returns:
            All ancestors of this taxon, ordered from its parent up to the root
        """
        lst = []
        self._ancestors(lst)
        return lst

    @property
    def descendents(self) -> Sequence[Taxon]:
        """
        Returns:
            All descendents of this taxon, in no guaranteed order
        """
        lst = []
        self._descendents(lst)
        return lst

    def _ancestors(self, values: List[Taxon]) -> None:
        # stop at a taxon without a parent; otherwise None would be appended
        # and the recursive call would fail with an AttributeError
        if self.parent is None:
            return
        values.append(self.parent)
        self.parent._ancestors(values)
Issue (Coding Style, Best Practice): It seems like _ancestors was declared protected and should not be accessed from this context.

Prefixing a member variable _ is usually regarded as the equivalent of declaring it with protected visibility that exists in other languages. Consequently, such a member should only be accessed from the same class or a child class:

class MyParent:
    def __init__(self):
        self._x = 1
        self.y = 2

class MyChild(MyParent):
    def some_method(self):
        return self._x    # Ok, since accessed from a child class

class AnotherClass:
    def some_method(self, instance_of_my_child):
        return instance_of_my_child._x   # Would be flagged as AnotherClass is not
                                         # a child class of MyParent

    def _descendents(self, values: List[Taxon]) -> None:
        values.extend(self.children)
        for child in self.children:
            child._descendents(values)
Issue (Coding Style, Best Practice): It seems like _descendents was declared protected and should not be accessed from this context.

    def __str__(self):
        return repr(self)

    def __repr__(self):
        return f"{self.__class__.__name__}({self.id}: {self.name} (parent={self.parent.id if self.parent else 'none'}))"
Issue (Coding Style): This line is too long as per the coding-style (120/100).

This check looks for lines that are too long. You can specify the maximum line length.

    def __hash__(self):
        return hash(self.id)

    def __eq__(self, other):
        return self.id == other.id

    def __lt__(self, other):
        return self.id < other.id
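
The recursive helpers `_ancestors` and `_descendents` can also be written as loops, which makes the stopping condition at the root explicit and avoids Python's recursion limit on very deep trees. The following is only a minimal sketch: the `Node` class and the two free functions are hypothetical stand-ins, not part of `mandos`.

```python
from typing import List, Optional


class Node:
    """Hypothetical stand-in for Taxon with a parent link and a child list."""

    def __init__(self, id: int, parent: Optional["Node"] = None):
        self.id = id
        self.parent = parent
        self.children: List["Node"] = []
        if parent is not None:
            parent.children.append(self)


def ancestors(node: Node) -> List[Node]:
    """Walks parent links upward; the result runs from the parent to the root."""
    found = []
    current = node.parent
    while current is not None:
        found.append(current)
        current = current.parent
    return found


def descendants(node: Node) -> List[Node]:
    """Collects every node below `node` using an explicit stack."""
    found = []
    stack = list(node.children)
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(current.children)
    return found


root = Node(1)
middle = Node(2, root)
leaf_a = Node(3, middle)
leaf_b = Node(4, middle)

print([n.id for n in ancestors(leaf_a)])
print(sorted(n.id for n in descendants(root)))
```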


@dataclass()
class _Taxon(Taxon):
    """
    An internal, modifiable taxon for building the tree.
    """

    # note: inside _Taxon, self.__name is mangled to self._Taxon__name, the same
    # attribute that Taxon defines, because name mangling strips the leading
    # underscore from the class name

    def set_name(self, name: str):
Issue: Missing function or method docstring
        self.__name = name

    def set_parent(self, parent: _Taxon):
Issue: Missing function or method docstring
        self.__parent = parent

    def add_child(self, child: _Taxon):
Issue: Missing function or method docstring
        self.__children.add(child)

    # __repr__, __eq__, and __hash__ are required again: @dataclass() on this
    # subclass would otherwise generate its own __repr__ and __eq__ and set
    # __hash__ to None

    def __str__(self):
        return repr(self)

    def __repr__(self):
        return f"{self.__class__.__name__}({self.id}: {self.name} (parent={self.parent.id if self.parent else 'none'}))"
Issue (Coding Style): This line is too long as per the coding-style (120/100).

    def __hash__(self):
        return hash(self.id)

    def __eq__(self, other):
        return self.id == other.id

    def __lt__(self, other):
        return self.id < other.id
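
Two Python details explain why `_Taxon` works and why it repeats the dunder methods. The sketch below shows both on hypothetical classes (`Base`, `_Base`, and `Child` are made-up names): private-name mangling drops leading underscores from the class name, and `@dataclass()` with the default `eq=True` sets `__hash__` to `None` on any decorated class that does not define `__hash__` in its own body.

```python
from dataclasses import dataclass


@dataclass()
class Base:
    __value: int  # the field is stored under the mangled name _Base__value

    def __hash__(self):
        return hash(self.__value)

    def __eq__(self, other):
        return self.__value == other.__value


@dataclass()
class _Base(Base):
    # The leading underscore of "_Base" is stripped before mangling,
    # so __value here is also _Base__value: the attribute Base reads.
    def set_value(self, value: int) -> None:
        self.__value = value


@dataclass()
class Child(Base):
    # Here __value mangles to _Child__value, a new and unrelated attribute.
    # Child also defines no __hash__, so @dataclass sets Child.__hash__ = None.
    def set_value(self, value: int) -> None:
        self.__value = value


item = _Base(1)
item.set_value(5)
print(vars(item))  # {'_Base__value': 5}

other = Child(1)
other.set_value(5)
print(vars(other))  # {'_Base__value': 1, '_Child__value': 5}

print(Child.__hash__ is None)  # True: Child instances are unhashable
```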


class Taxonomy:
    """
    A taxonomic tree of organisms from UniProt.
    Elements in the tree can be looked up by ID using ``__getitem__`` and ``get``,
    and by scientific name using ``get_by_name`` and the related ``*_by_name`` methods.
    """

    def __init__(self, by_id: Mapping[int, Taxon], by_name: Mapping[str, FrozenSet[Taxon]]):
        """

        Args:
            by_id: Maps each taxon ID to its taxon
            by_name: Maps each scientific name to the set of taxa with that name
        """
        # constructor provided for consistency with the members
        self._by_id = dict(by_id)
        self._by_name = dict(by_name)
        # this probably isn't actually possible
        if len(self) == 0:
            logger.warning(f"{self} contains 0 taxa")

    @classmethod
    def from_list(cls, taxa: Sequence[Taxon]) -> Taxonomy:
Issue: Missing function or method docstring
        by_id = {x.id: x for x in taxa}
        by_name = cls._build_by_name(by_id.values())
        tax = Taxonomy(by_id, by_name)
        # catch duplicate values
        assert len(tax._by_id) == len(taxa), f"{len(tax._by_id)} != {len(taxa)}"
Issue (Coding Style, Best Practice): It seems like _by_id was declared protected and should not be accessed from this context.
        return tax

    @classmethod
    def from_path(cls, path: Path) -> Taxonomy:
Issue: Missing function or method docstring
        df = pd.read_csv(path, sep="\t", header=0)
Issue (Coding Style, Naming): Variable name "df" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
        return cls.from_df(df)

    @classmethod
    def from_df(cls, df: TaxonomyDf) -> Taxonomy:
Issue (Coding Style, Naming): Argument name "df" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
        """
        Builds the tree from a DataFrame read from a file provided by a UniProt download.
        Converts the "taxon" and "parent" columns of ``df`` to int in place.

        Args:
            df: A dataframe with columns (at least) "taxon", "scientific_name", and "parent"

        Returns:
            The corresponding taxonomic tree

        Raises:
            ValueError: If any taxon has a missing or empty-string scientific name
        """
        df["taxon"] = df["taxon"].astype(int)
        # TODO fillna(0) should not be needed
Issue (Coding Style): TODO and FIXME comments should generally be avoided.
        df["parent"] = df["parent"].fillna(0).astype(int)
        # just build up a tree, sticking the elements in by_id
        tax = {}
        for row in df.itertuples():
            child = tax.setdefault(row.taxon, _Taxon(row.taxon, row.scientific_name, None, set()))
            child.set_name(row.scientific_name)
            if row.parent != 0:
                parent = tax.setdefault(row.parent, _Taxon(row.parent, "", None, set()))
                child.set_parent(parent)
                parent.add_child(child)
        bad = [t for t in tax.values() if t.name.strip() == ""]
        if len(bad) > 0:
            raise ValueError(f"There are taxa with missing or empty names: {bad}.")
        # demote the modifiable _Taxon instances to read-only Taxon instances
        for v in tax.values():
Issue (Coding Style, Naming): Variable name "v" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
            v.__class__ = Taxon
        by_name = cls._build_by_name(tax.values())
        return Taxonomy(tax, by_name)
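
The tree-building loop in `from_df` can be tried without pandas. The sketch below follows the same idea on plain tuples: each node is created the first time its ID is seen, whether it appears first as a child or as a parent, and nodes that were only ever seen as parents are reported because they never received a name. The `Node` class, `build_tree`, and the sample rows are made up for illustration.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


class Node:
    """Hypothetical mutable node; hashed by identity."""

    def __init__(self, id: int, name: str = ""):
        self.id = id
        self.name = name
        self.parent: Optional["Node"] = None
        self.children: Set["Node"] = set()


def build_tree(rows: Iterable[Tuple[int, int, str]]) -> Dict[int, Node]:
    """Rows are (taxon, parent, scientific_name); a parent of 0 means no parent."""
    nodes: Dict[int, Node] = {}
    for taxon, parent_id, name in rows:
        # the node may already exist as a placeholder created for a child
        child = nodes.setdefault(taxon, Node(taxon))
        child.name = name
        if parent_id != 0:
            parent = nodes.setdefault(parent_id, Node(parent_id))
            child.parent = parent
            parent.children.add(child)
    unnamed = [node.id for node in nodes.values() if node.name.strip() == ""]
    if unnamed:
        raise ValueError(f"There are taxa with missing or empty names: {unnamed}")
    return nodes


# the child row comes before its parent row on purpose
tree = build_tree([(3, 2, "species x"), (2, 0, "genus x")])
print(tree[3].parent.name)
```

Because placeholders are filled in later, the rows can arrive in any order; only a parent ID that never gets its own row triggers the error.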

    def to_df(self) -> TaxonomyDf:
Issue: Missing function or method docstring
        rows = []
        for taxon in self.taxa:
            # a root has no parent; emit None, which from_df reads as "no parent"
            parent = taxon.parent.id if taxon.parent is not None else None
            rows.append(pd.Series(dict(taxon=taxon.id, scientific_name=taxon.name, parent=parent)))
        return TaxonomyDf(rows)

    @property
    def taxa(self) -> Sequence[Taxon]:
        """

        Returns:
            All taxa in this taxonomy
        """
        return list(self._by_id.values())

    @property
    def roots(self) -> Sequence[Taxon]:
        """

        Returns:
            The taxa that have no parent, or whose parent is not in this taxonomy
        """
        return [k for k in self.taxa if k.parent is None or k.parent not in self]

    @property
    def leaves(self) -> Sequence[Taxon]:
Issue: Missing function or method docstring
        return [k for k in self.taxa if len(k.children) == 0]

    def subtree(self, item: int) -> Taxonomy:
Issue: Missing function or method docstring
        item = self[item]
        descendents = {item, *item.descendents}
        by_id = {d.id: d for d in descendents}
        by_name = self.__class__._build_by_name(by_id.values())
Issue (Coding Style, Best Practice): It seems like _build_by_name was declared protected and should not be accessed from this context.
        return Taxonomy(by_id, by_name)

    def subtrees_by_name(self, item: str) -> Taxonomy:
        """
        Returns the taxonomy made up of the subtrees rooted at each of the taxa
        with the specified scientific name.
        """
        descendents: Set[Taxon] = set()
        for taxon in self._by_name.get(item, []):
            descendents.update({taxon, *taxon.descendents})
        by_id = {d.id: d for d in descendents}
        by_name = self.__class__._build_by_name(by_id.values())
Issue (Coding Style, Best Practice): It seems like _build_by_name was declared protected and should not be accessed from this context.
        return Taxonomy(by_id, by_name)

    def req_one_by_name(self, item: str) -> Taxon:
        """
        Gets a single taxon by its name.
        If there are multiple, returns the first (lowest ID).
        Raises an error if there are no matches.
        """
        one = self.get_one_by_name(item)
        if one is None:
            raise LookupError(f"No taxa for {item}")
        return one

    def req_only_by_name(self, item: str) -> Taxon:
        """
        Gets a single taxon by its name.
        Raises an error if there are multiple matches for the name, or if there are no matches.
        """
        taxa = self.get_by_name(item)
        ids = ",".join([str(t.id) for t in taxa])
        if len(taxa) > 1:
Issue (Unused Code): Unnecessary "elif" after "raise"
            raise ValueError(f"Got multiple results for {item}: {ids}")
        elif len(taxa) == 0:
            raise LookupError(f"No taxa for {item}")
        return next(iter(taxa))

    def get_one_by_name(self, item: str) -> Optional[Taxon]:
        """
        Gets a single taxon by its name.
        If there are multiple, returns the first (lowest ID).
        If there are none, returns ``None``.
        """
        taxa = self.get_by_name(item)
        ids = ",".join([str(t.id) for t in taxa])
        if len(taxa) > 1:
            logger.warning(f"Got multiple results for {item}: {ids}")
        elif len(taxa) == 0:
            return None
        # a frozenset has no defined order, so pick the lowest ID explicitly
        # (Taxon instances are ordered by ID)
        return min(taxa)

    def get_by_name(self, item: str) -> FrozenSet[Taxon]:
        """
        Gets all taxa that match a scientific name.
        If a ``Taxon`` is passed, its name is used.
        """
        if isinstance(item, Taxon):
            item = item.name
        return self._by_name.get(item, frozenset())

    def req(self, item: int) -> Taxon:
        """
        Gets a single taxon by its ID.
        Raises an error if it is not found.
        """
        if isinstance(item, Taxon):
            item = item.id
        return self[item]

    def get(self, item: int) -> Optional[Taxon]:
        """
        Corresponds to ``dict.get``.

        Args:
            item: The UniProt ID, or a ``Taxon`` (its ID is used)

        Returns:
            The taxon, or None if it was not found

        Raises:
            TypeError: If ``item`` is neither an int nor a ``Taxon``
        """
        if isinstance(item, Taxon):
            item = item.id
        if isinstance(item, int):
Issue (Unused Code): Unnecessary "else" after "return"
            return self._by_id.get(item)
        else:
            raise TypeError(f"Type {type(item)} of {item} not applicable")

    def __getitem__(self, item: int) -> Taxon:
        """
        Corresponds to ``dict[_]``.

        Args:
            item: The UniProt ID

        Returns:
            The taxon

        Raises:
            KeyError: If the taxon was not found
        """
        got = self.get(item)
        if got is None:
            raise KeyError(f"{item} not found in {self}")
        return got
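
Taken together, `get`, `__getitem__`, `__contains__`, and `__len__` mirror the read-only `dict` interface. One alternative design, sketched here under the assumption that only ID lookups are needed, is to subclass `collections.abc.Mapping`, which derives `get`, `__contains__`, `keys`, `values`, and `items` from three methods. `IdIndex` is a hypothetical name and plain strings stand in for taxa; this is not how `Taxonomy` is written.

```python
from collections.abc import Mapping
from typing import Dict, Iterator


class IdIndex(Mapping):
    """Hypothetical read-only index from integer ID to a value."""

    def __init__(self, by_id: Dict[int, str]):
        self._by_id = dict(by_id)

    def __getitem__(self, key: int) -> str:
        try:
            return self._by_id[key]
        except KeyError:
            raise KeyError(f"{key} not found in {type(self).__name__}") from None

    def __iter__(self) -> Iterator[int]:
        return iter(self._by_id)

    def __len__(self) -> int:
        return len(self._by_id)


index = IdIndex({1: "root", 2: "child"})
print(index.get(3))   # None, provided by Mapping.get
print(2 in index)     # True, provided by Mapping.__contains__
print(len(index))     # 2
```

One difference to keep in mind: the mixin `get` returns the default for any key that is not present, whereas `Taxonomy.get` raises a `TypeError` for keys of the wrong type.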

    def contains(self, item: Union[Taxon, int, str]):
Issue: Missing function or method docstring
        return self.get(item) is not None

    def n_taxa(self) -> int:
Issue: Missing function or method docstring
        return len(self._by_id)

    def __contains__(self, item: Union[Taxon, int, str]):
        return self.get(item) is not None

    def __len__(self) -> int:
        return len(self._by_id)

    def __str__(self) -> str:
        return repr(self)

    def __repr__(self) -> str:
        roots = ", ".join(r.name for r in self.roots)
        return f"{self.__class__.__name__}(n={len(self._by_id)} (roots={roots}) @ {hex(id(self))})"

    @classmethod
    def _build_by_name(cls, tax: Iterable[Taxon]) -> Mapping[str, FrozenSet[Taxon]]:
        by_name = defaultdict(set)
        for t in tax:
Issue (Coding Style, Naming): Variable name "t" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
            by_name[t.name].add(t)
        return {k: frozenset(v) for k, v in by_name.items()}


__all__ = ["Taxon", "Taxonomy", "TaxonomyDf"]