Completed
Pull Request — master (#135)
by Chris
11:32
created

abydos.distance._seqalign.sim_matrix()   F

Complexity

Conditions 14

Size

Total Lines 56
Code Lines 25

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 18
CRAP Score 14

Importance

Changes 0
Metric Value
eloc 25
dl 0
loc 56
ccs 18
cts 18
cp 1
rs 3.6
c 0
b 0
f 0
cc 14
nop 7
crap 14

How to fix   Long Method    Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:

Complexity

Complex classes like abydos.distance._seqalign.sim_matrix() often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

1
# -*- coding: utf-8 -*-
2
3
# Copyright 2014-2018 by Christopher C. Little.
4
# This file is part of Abydos.
5
#
6
# Abydos is free software: you can redistribute it and/or modify
7
# it under the terms of the GNU General Public License as published by
8
# the Free Software Foundation, either version 3 of the License, or
9
# (at your option) any later version.
10
#
11
# Abydos is distributed in the hope that it will be useful,
12
# but WITHOUT ANY WARRANTY; without even the implied warranty of
13
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
# GNU General Public License for more details.
15
#
16
# You should have received a copy of the GNU General Public License
17
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.
18
19 1
"""abydos.distance.seqalign.
20
21
The distance.seqalign module implements string edit distance functions
22
used in sequence alignment:
23
24
    - Matrix similarity
25
    - Needleman-Wunsch score
26
    - Smith-Waterman score
27
    - Gotoh score
28
"""
29
30 1
from __future__ import unicode_literals
31
32 1
from numpy import float32 as np_float32
33 1
from numpy import zeros as np_zeros
34
35 1
from six.moves import range
36
37 1
from ._basic import sim_ident
38 1
from ._distance import Distance
39
40 1
__all__ = [
41
    'Gotoh',
42
    'NeedlemanWunsch',
43
    'SmithWaterman',
44
    'gotoh',
45
    'needleman_wunsch',
46
    'smith_waterman',
47
]
48
49
50 1
class NeedlemanWunsch(Distance):
0 ignored issues
show
Unused Code introduced by
The variable __class__ seems to be unused.
Loading history...
51
    """Needleman-Wunsch score.
52
53
    The Needleman-Wunsch score :cite:`Needleman:1970` is a standard edit
54
    distance measure.
55
    """
56
57 1
    @staticmethod
58 1
    def sim_matrix(
0 ignored issues
show
best-practice introduced by
Too many arguments (7/5)
Loading history...
59
        src,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
60
        tar,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
61
        mat=None,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
62
        mismatch_cost=0,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
63
        match_cost=1,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
64
        symmetric=True,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
65
        alphabet=None,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
66
    ):
67
        """Return the matrix similarity of two strings.
68
69
        With the default parameters, this is identical to sim_ident.
70
        It is possible for sim_matrix to return values outside of the range
71
        :math:`[0, 1]`, if values outside that range are present in mat,
72
        mismatch_cost, or match_cost.
73
74
        :param str src: source string for comparison
75
        :param str tar: target string for comparison
76
        :param dict mat: a dict mapping tuples to costs; the tuples are
77
            (src, tar) pairs of symbols from the alphabet parameter
78
        :param float mismatch_cost: the value returned if (src, tar) is absent
79
            from mat when src does not equal tar
80
        :param float match_cost: the value returned if (src, tar) is absent
81
            from mat when src equals tar
82
        :param bool symmetric: True if the cost of src not matching tar is
83
            identical to the cost of tar not matching src; in this case, the
84
            values in mat need only contain (src, tar) or (tar, src), not both
85
        :param str alphabet: a collection of tokens from which src and tar are
86
            drawn; if this is defined a ValueError is raised if either tar or
87
            src is not found in alphabet
88
        :returns: matrix similarity
89
        :rtype: float
90
91
        >>> NeedlemanWunsch.sim_matrix('cat', 'hat')
92
        0
93
        >>> NeedlemanWunsch.sim_matrix('hat', 'hat')
94
        1
95
        """
96 1
        if alphabet:
97 1
            alphabet = tuple(alphabet)
98 1
            for i in src:
99 1
                if i not in alphabet:
100 1
                    raise ValueError('src value not in alphabet')
101 1
            for i in tar:
102 1
                if i not in alphabet:
103 1
                    raise ValueError('tar value not in alphabet')
104
105 1
        if src == tar:
106 1
            if mat and (src, src) in mat:
107 1
                return mat[(src, src)]
108 1
            return match_cost
109 1
        if mat and (src, tar) in mat:
110 1
            return mat[(src, tar)]
111 1
        elif symmetric and mat and (tar, src) in mat:
112 1
            return mat[(tar, src)]
113 1
        return mismatch_cost
114
115 1 View Code Duplication
    def dist_abs(self, src, tar, gap_cost=1, sim_func=sim_ident):
0 ignored issues
show
Bug introduced by
Parameters differ from overridden 'dist_abs' method
Loading history...
Duplication introduced by
This code seems to be duplicated in your project.
Loading history...
116
        """Return the Needleman-Wunsch score of two strings.
117
118
        :param str src: source string for comparison
119
        :param str tar: target string for comparison
120
        :param float gap_cost: the cost of an alignment gap (1 by default)
121
        :param function sim_func: a function that returns the similarity of two
122
            characters (identity similarity by default)
123
        :returns: Needleman-Wunsch score
124
        :rtype: float
125
126
        >>> cmp = NeedlemanWunsch()
127
        >>> cmp.dist_abs('cat', 'hat')
128
        2.0
129
        >>> cmp.dist_abs('Niall', 'Neil')
130
        1.0
131
        >>> cmp.dist_abs('aluminum', 'Catalan')
132
        -1.0
133
        >>> cmp.dist_abs('ATCG', 'TAGC')
134
        0.0
135
        """
136 1
        d_mat = np_zeros((len(src) + 1, len(tar) + 1), dtype=np_float32)
137
138 1
        for i in range(len(src) + 1):
139 1
            d_mat[i, 0] = -(i * gap_cost)
140 1
        for j in range(len(tar) + 1):
141 1
            d_mat[0, j] = -(j * gap_cost)
142 1
        for i in range(1, len(src) + 1):
143 1
            for j in range(1, len(tar) + 1):
144 1
                match = d_mat[i - 1, j - 1] + sim_func(src[i - 1], tar[j - 1])
145 1
                delete = d_mat[i - 1, j] - gap_cost
146 1
                insert = d_mat[i, j - 1] - gap_cost
147 1
                d_mat[i, j] = max(match, delete, insert)
148 1
        return d_mat[d_mat.shape[0] - 1, d_mat.shape[1] - 1]
149
150
151 1
def needleman_wunsch(src, tar, gap_cost=1, sim_func=sim_ident):
152
    """Return the Needleman-Wunsch score of two strings.
153
154
    This is a wrapper for :py:meth:`NeedlemanWunsch.dist_abs`.
155
156
    :param str src: source string for comparison
157
    :param str tar: target string for comparison
158
    :param float gap_cost: the cost of an alignment gap (1 by default)
159
    :param function sim_func: a function that returns the similarity of two
160
        characters (identity similarity by default)
161
    :returns: Needleman-Wunsch score
162
    :rtype: float
163
164
    >>> needleman_wunsch('cat', 'hat')
165
    2.0
166
    >>> needleman_wunsch('Niall', 'Neil')
167
    1.0
168
    >>> needleman_wunsch('aluminum', 'Catalan')
169
    -1.0
170
    >>> needleman_wunsch('ATCG', 'TAGC')
171
    0.0
172
    """
173 1
    return NeedlemanWunsch().dist_abs(src, tar, gap_cost, sim_func)
174
175
176 1
class SmithWaterman(NeedlemanWunsch):
0 ignored issues
show
Unused Code introduced by
The variable __class__ seems to be unused.
Loading history...
177
    """Smith-Waterman score.
178
179
    The Smith-Waterman score :cite:`Smith:1981` is a standard edit distance
180
    measure, differing from Needleman-Wunsch in that it focuses on local
181
    alignment and disallows negative scores.
182
    """
183
184 1 View Code Duplication
    def dist_abs(self, src, tar, gap_cost=1, sim_func=sim_ident):
0 ignored issues
show
Duplication introduced by
This code seems to be duplicated in your project.
Loading history...
185
        """Return the Smith-Waterman score of two strings.
186
187
        :param str src: source string for comparison
188
        :param str tar: target string for comparison
189
        :param float gap_cost: the cost of an alignment gap (1 by default)
190
        :param function sim_func: a function that returns the similarity of two
191
            characters (identity similarity by default)
192
        :returns: Smith-Waterman score
193
        :rtype: float
194
195
        >>> cmp = SmithWaterman()
196
        >>> cmp.dist_abs('cat', 'hat')
197
        2.0
198
        >>> cmp.dist_abs('Niall', 'Neil')
199
        1.0
200
        >>> cmp.dist_abs('aluminum', 'Catalan')
201
        0.0
202
        >>> cmp.dist_abs('ATCG', 'TAGC')
203
        1.0
204
        """
205 1
        d_mat = np_zeros((len(src) + 1, len(tar) + 1), dtype=np_float32)
206
207 1
        for i in range(len(src) + 1):
208 1
            d_mat[i, 0] = 0
209 1
        for j in range(len(tar) + 1):
210 1
            d_mat[0, j] = 0
211 1
        for i in range(1, len(src) + 1):
212 1
            for j in range(1, len(tar) + 1):
213 1
                match = d_mat[i - 1, j - 1] + sim_func(src[i - 1], tar[j - 1])
214 1
                delete = d_mat[i - 1, j] - gap_cost
215 1
                insert = d_mat[i, j - 1] - gap_cost
216 1
                d_mat[i, j] = max(0, match, delete, insert)
217 1
        return d_mat[d_mat.shape[0] - 1, d_mat.shape[1] - 1]
218
219
220 1
def smith_waterman(src, tar, gap_cost=1, sim_func=sim_ident):
221
    """Return the Smith-Waterman score of two strings.
222
223
    This is a wrapper for :py:meth:`SmithWaterman.dist_abs`.
224
225
    :param str src: source string for comparison
226
    :param str tar: target string for comparison
227
    :param float gap_cost: the cost of an alignment gap (1 by default)
228
    :param function sim_func: a function that returns the similarity of two
229
        characters (identity similarity by default)
230
    :returns: Smith-Waterman score
231
    :rtype: float
232
233
    >>> smith_waterman('cat', 'hat')
234
    2.0
235
    >>> smith_waterman('Niall', 'Neil')
236
    1.0
237
    >>> smith_waterman('aluminum', 'Catalan')
238
    0.0
239
    >>> smith_waterman('ATCG', 'TAGC')
240
    1.0
241
    """
242 1
    return SmithWaterman().dist_abs(src, tar, gap_cost, sim_func)
243
244
245 1
class Gotoh(NeedlemanWunsch):
0 ignored issues
show
Unused Code introduced by
The variable __class__ seems to be unused.
Loading history...
246
    """Gotoh score.
247
248
    The Gotoh score :cite:`Gotoh:1982` is essentially Needleman-Wunsch with
249
    affine gap penalties.
250
    """
251
252 1
    def dist_abs(self, src, tar, gap_open=1, gap_ext=0.4, sim_func=sim_ident):
0 ignored issues
show
best-practice introduced by
Too many arguments (6/5)
Loading history...
Bug introduced by
Parameters differ from overridden 'dist_abs' method
Loading history...
253
        """Return the Gotoh score of two strings.
254
255
        :param str src: source string for comparison
256
        :param str tar: target string for comparison
257
        :param float gap_open: the cost of an open alignment gap (1 by default)
258
        :param float gap_ext: the cost of an alignment gap extension (0.4 by
259
            default)
260
        :param function sim_func: a function that returns the similarity of two
261
            characters (identity similarity by default)
262
        :returns: Gotoh score
263
        :rtype: float
264
265
        >>> cmp = Gotoh()
266
        >>> cmp.dist_abs('cat', 'hat')
267
        2.0
268
        >>> cmp.dist_abs('Niall', 'Neil')
269
        1.0
270
        >>> round(cmp.dist_abs('aluminum', 'Catalan'), 12)
271
        -0.4
272
        >>> cmp.dist_abs('cat', 'hat')
273
        2.0
274
        """
275 1
        d_mat = np_zeros((len(src) + 1, len(tar) + 1), dtype=np_float32)
276 1
        p_mat = np_zeros((len(src) + 1, len(tar) + 1), dtype=np_float32)
277 1
        q_mat = np_zeros((len(src) + 1, len(tar) + 1), dtype=np_float32)
278
279 1
        d_mat[0, 0] = 0
280 1
        p_mat[0, 0] = float('-inf')
281 1
        q_mat[0, 0] = float('-inf')
282 1
        for i in range(1, len(src) + 1):
283 1
            d_mat[i, 0] = float('-inf')
284 1
            p_mat[i, 0] = -gap_open - gap_ext * (i - 1)
285 1
            q_mat[i, 0] = float('-inf')
286 1
            q_mat[i, 1] = -gap_open
287 1
        for j in range(1, len(tar) + 1):
288 1
            d_mat[0, j] = float('-inf')
289 1
            p_mat[0, j] = float('-inf')
290 1
            p_mat[1, j] = -gap_open
291 1
            q_mat[0, j] = -gap_open - gap_ext * (j - 1)
292
293 1
        for i in range(1, len(src) + 1):
294 1
            for j in range(1, len(tar) + 1):
295 1
                sim_val = sim_func(src[i - 1], tar[j - 1])
296 1
                d_mat[i, j] = max(
297
                    d_mat[i - 1, j - 1] + sim_val,
298
                    p_mat[i - 1, j - 1] + sim_val,
299
                    q_mat[i - 1, j - 1] + sim_val,
300
                )
301
302 1
                p_mat[i, j] = max(
303
                    d_mat[i - 1, j] - gap_open, p_mat[i - 1, j] - gap_ext
304
                )
305
306 1
                q_mat[i, j] = max(
307
                    d_mat[i, j - 1] - gap_open, q_mat[i, j - 1] - gap_ext
308
                )
309
310 1
        i, j = (n - 1 for n in d_mat.shape)
0 ignored issues
show
Comprehensibility Best Practice introduced by
The variable n does not seem to be defined.
Loading history...
311 1
        return max(d_mat[i, j], p_mat[i, j], q_mat[i, j])
312
313
314 1
def gotoh(src, tar, gap_open=1, gap_ext=0.4, sim_func=sim_ident):
315
    """Return the Gotoh score of two strings.
316
317
    This is a wrapper for :py:meth:`Gotoh.dist_abs`.
318
319
    :param str src: source string for comparison
320
    :param str tar: target string for comparison
321
    :param float gap_open: the cost of an open alignment gap (1 by default)
322
    :param float gap_ext: the cost of an alignment gap extension (0.4 by
323
        default)
324
    :param function sim_func: a function that returns the similarity of two
325
        characters (identity similarity by default)
326
    :returns: Gotoh score
327
    :rtype: float
328
329
    >>> gotoh('cat', 'hat')
330
    2.0
331
    >>> gotoh('Niall', 'Neil')
332
    1.0
333
    >>> round(gotoh('aluminum', 'Catalan'), 12)
334
    -0.4
335
    >>> gotoh('cat', 'hat')
336
    2.0
337
    """
338 1
    return Gotoh().dist_abs(src, tar, gap_open, gap_ext, sim_func)
339
340
341
if __name__ == '__main__':
342
    import doctest
343
344
    doctest.testmod()
345