"""
A taxonomic tree of organisms (module ``mandos.model.taxonomy``).
"""
from __future__ import annotations

import logging
from dataclasses import dataclass
from functools import total_ordering
from pathlib import Path
from typing import List, Mapping, Optional, Sequence, Set

import pandas as pd
from typeddfs import TypedDfs

logger = logging.getLogger(__package__)

TaxonomyDf = (
    TypedDfs.typed("TaxonomyDf").require("taxon").require("parent").require("scientific_name")
).build()


@total_ordering
@dataclass()
class Taxon:
    """
    A node in a taxonomic tree, with an ID, a scientific name, an optional parent,
    and a set of children.
    """

    # we can't use frozen=True because we have both parents and children
    # instead, just use properties
    __id: int
    __name: str
    __parent: Optional[Taxon]
    __children: Set[Taxon]

    @property
    def id(self) -> int:
        """
        Returns:
            The ID of this taxon
        """
        return self.__id

    @property
    def name(self) -> str:
        """
        Returns:
            The scientific name of this taxon
        """
        return self.__name

    @property
    def parent(self) -> Optional[Taxon]:
        """
        Returns:
            The parent taxon, or None if this taxon has no parent
        """
        return self.__parent

    @property
    def children(self) -> Set[Taxon]:
        """
        Returns:
            A copy of the set of direct children of this taxon
        """
        return set(self.__children)

    @property
    def ancestors(self) -> Sequence[Taxon]:
        """
        Returns:
            The ancestors of this taxon, from its parent up to the root
        """
        lst = []
        self._ancestors(lst)
        return lst

    @property
    def descendents(self) -> Sequence[Taxon]:
        """
        Returns:
            All taxa below this one: its children, their children, and so on
        """
        lst = []
        self._descendents(lst)
        return lst

    def _ancestors(self, values: List[Taxon]) -> None:
        # stop at a taxon with no parent instead of recursing into None
        if self.parent is not None:
            values.append(self.parent)
            self.parent._ancestors(values)

    def _descendents(self, values: List[Taxon]) -> None:
        values.extend(self.children)
        for child in self.children:
            child._descendents(values)

    def __str__(self):
        return repr(self)

    def __repr__(self):
        parent = self.parent.id if self.parent else "none"
        return f"{self.__class__.__name__}({self.id}: {self.name} (parent={parent}))"

    def __hash__(self):
        return hash(self.id)

    def __eq__(self, other):
        return self.id == other.id

    def __lt__(self, other):
        return self.id < other.id


@dataclass()
class _Taxon(Taxon):
    """
    An internal, modifiable taxon for building the tree.
    Name mangling maps ``__name`` to ``_Taxon__name`` in both ``Taxon`` and ``_Taxon``,
    so these setters write the same attributes that the ``Taxon`` properties read.
    """

    def set_name(self, name: str):
        """Sets the scientific name."""
        self.__name = name

    def set_parent(self, parent: _Taxon):
        """Sets the parent taxon."""
        self.__parent = parent

    def add_child(self, child: _Taxon):
        """Adds a child taxon."""
        self.__children.add(child)

    # @dataclass generates __repr__ and __eq__ (and sets __hash__ to None) for each class
    # it decorates unless the class body defines them, so they are repeated here

    def __str__(self):
        return repr(self)

    def __repr__(self):
        parent = self.parent.id if self.parent else "none"
        return f"{self.__class__.__name__}({self.id}: {self.name} (parent={parent}))"

    def __hash__(self):
        return hash(self.id)

    def __eq__(self, other):
        return self.id == other.id

    def __lt__(self, other):
        return self.id < other.id


class Taxonomy:
    """
    A taxonomic tree of organisms from UniProt.
    Elements in the tree can be looked up by ID using ``__getitem__`` and ``get``.
    """

    def __init__(self, by_id: Mapping[int, Taxon]):
        """
        Args:
            by_id: A mapping from taxon ID to taxon
        """
        # constructor provided for consistency with the members
        self._by_id = dict(by_id)

    @classmethod
    def from_list(cls, taxa: Sequence[Taxon]) -> Taxonomy:
        """Builds a taxonomy from a sequence of taxa, which must have unique IDs."""
        tax = cls({x.id: x for x in taxa})
        # catch duplicate values
        assert len(tax._by_id) == len(taxa), f"{len(tax._by_id)} != {len(taxa)}"
        return tax

    @classmethod
    def from_path(cls, path: Path) -> Taxonomy:
        """Reads a tab-separated file with a header row; see ``from_df``."""
        df = pd.read_csv(path, sep="\t", header=0)
        return cls.from_df(df)

    @classmethod
    def from_df(cls, df: TaxonomyDf) -> Taxonomy:
        """
        Builds a taxonomy from a DataFrame read from a CSV file provided by a UniProt download.

        Args:
            df: A dataframe with columns (at least) "taxon", "scientific_name", and "parent"

        Returns:
            The corresponding taxonomic tree

        Raises:
            ValueError: If any taxon has a missing or empty-string scientific name
        """
        df["taxon"] = df["taxon"].astype(int)
        # TODO fillna(0) should not be needed
        df["parent"] = df["parent"].fillna(0).astype(int)
        # just build up a tree, sticking the elements in by_id
        tax = {}
        for row in df.itertuples():
            child = tax.setdefault(row.taxon, _Taxon(row.taxon, row.scientific_name, None, set()))
            child.set_name(row.scientific_name)
            if row.parent != 0:
                parent = tax.setdefault(row.parent, _Taxon(row.parent, "", None, set()))
                child.set_parent(parent)
                parent.add_child(child)
        bad = [t for t in tax.values() if t.name.strip() == ""]
        if len(bad) > 0:
            raise ValueError(f"There are taxa with missing or empty names: {bad}.")
        # the tree is complete, so swap each node to the read-only Taxon class
        for taxon in tax.values():
            taxon.__class__ = Taxon
        return cls(tax)

    @property
    def taxa(self) -> Sequence[Taxon]:
        """
        Returns:
            All taxa in this taxonomy
        """
        return list(self._by_id.values())

    @property
    def roots(self) -> Sequence[Taxon]:
        """
        Returns:
            The taxa that have no parent, or whose parent is not in this taxonomy
        """
        return [k for k in self.taxa if k.parent is None or k.parent not in self]

    @property
    def leaves(self) -> Sequence[Taxon]:
        """
        Returns:
            The taxa that have no children
        """
        return [k for k in self.taxa if len(k.children) == 0]

    def subtree(self, item: int) -> Taxonomy:
        """
        Args:
            item: The ID of the taxon at the top of the subtree

        Returns:
            A new taxonomy containing that taxon and all of its descendents
        """
        item = self[item]
        descendents = {item, *item.descendents}
        return Taxonomy({d.id: d for d in descendents})

    def req(self, item: int) -> Taxon:
        """Like ``__getitem__``: returns the taxon, or raises a KeyError if not found."""
        if isinstance(item, Taxon):
            item = item.id
        return self[item]

    def get(self, item: int) -> Optional[Taxon]:
        """
        Corresponds to ``dict.get``.

        Args:
            item: The UniProt ID, or a ``Taxon`` whose ID is used

        Returns:
            The taxon, or None if it was not found

        Raises:
            TypeError: If ``item`` is neither an int nor a ``Taxon``
        """
        if isinstance(item, Taxon):
            item = item.id
        if not isinstance(item, int):
            raise TypeError(f"Type {type(item)} of {item} not applicable")
        return self._by_id.get(item)

    def __getitem__(self, item: int) -> Taxon:
        """
        Corresponds to ``dict[_]``.

        Args:
            item: The UniProt ID

        Returns:
            The taxon

        Raises:
            KeyError: If the taxon was not found
        """
        got = self.get(item)
        if got is None:
            raise KeyError(f"{item} not found in {self}")
        return got

    def __contains__(self, item):
        """
        Args:
            item: The UniProt ID, or a ``Taxon`` whose ID is used

        Returns:
            Whether a taxon with that ID is in this taxonomy
        """
        return self.get(item) is not None

    def __len__(self) -> int:
        """
        Returns:
            The number of taxa in this taxonomy
        """
        return len(self._by_id)

    def __str__(self) -> str:
        return repr(self)

    def __repr__(self) -> str:
        roots = ", ".join(r.name for r in self.roots)
        return f"{self.__class__.__name__}(n={len(self._by_id)} (roots={roots}) @ {hex(id(self))})"


__all__ = ["Taxon", "Taxonomy"]