ReesLevenshtein.dist_abs()   F
last analyzed

Complexity

Conditions 23

Size

Total Lines 156
Code Lines 80

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 49
CRAP Score 23

Importance

Changes 0
Metric Value
eloc 80
dl 0
loc 156
ccs 49
cts 49
cp 1
rs 0
c 0
b 0
f 0
cc 23
nop 3
crap 23

How to fix   Long Method    Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:

Complexity

Complex classes like abydos.distance._rees_levenshtein.ReesLevenshtein.dist_abs() often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

1
# Copyright 2019-2020 by Christopher C. Little.
2
# This file is part of Abydos.
3
#
4
# Abydos is free software: you can redistribute it and/or modify
5
# it under the terms of the GNU General Public License as published by
6
# the Free Software Foundation, either version 3 of the License, or
7
# (at your option) any later version.
8
#
9
# Abydos is distributed in the hope that it will be useful,
10
# but WITHOUT ANY WARRANTY; without even the implied warranty of
11
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
12
# GNU General Public License for more details.
13
#
14
# You should have received a copy of the GNU General Public License
15
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.
16
17
"""abydos.distance._rees_levenshtein.
18
19 1
Rees-Levenshtein distance
20
"""
21
22
from typing import Any, Callable, List, cast
23
24 1
from numpy import int_ as np_int
25
from numpy import zeros as np_zeros
26
27
from ._distance import _Distance
28
29
__all__ = ['ReesLevenshtein']
30
31 1
32 1
class ReesLevenshtein(_Distance):
33
    r"""Rees-Levenshtein distance.
34 1
35
    Rees-Levenshtein distance :cite:`Rees:2014,Rees:2013` is the "Modified
36 1
    Damerau-Levenshtein Distance Algorithm, created by Tony Rees as part of
37
    Taxamatch.
38
39 1
    .. versionadded:: 0.4.0
40
    """
41
42
    def __init__(
43
        self,
44
        block_limit: int = 2,
45
        normalizer: Callable[[List[float]], float] = max,
46
        **kwargs: Any
47
    ) -> None:
48
        """Initialize ReesLevenshtein instance.
49 1
50
        Parameters
51
        ----------
52
        block_limit : int
53
            The block length limit
54
        normalizer : function
55
            A function that takes an list and computes a normalization term
56
            by which the edit distance is divided (max by default). Another
57
            good option is the sum function.
58
        **kwargs
59
            Arbitrary keyword arguments
60
61
62
        .. versionadded:: 0.4.0
63
64
        """
65
        super(ReesLevenshtein, self).__init__(**kwargs)
66
        self._normalizer = normalizer
67 1
        self._block_limit = block_limit
68 1
69 1
    def dist_abs(self, src: str, tar: str) -> float:
70
        """Return the Rees-Levenshtein distance of two strings.
71 1
72
        This is a straightforward port of the PL/SQL implementation at
73
        https://confluence.csiro.au/public/taxamatch/the-mdld-modified-damerau-levenshtein-distance-algorithm
74
75
        Parameters
76
        ----------
77
        src : str
78
            Source string for comparison
79
        tar : str
80
            Target string for comparison
81
82
        Returns
83
        -------
84
        float
85
            Rees-Levenshtein distance
86
87
        Examples
88
        --------
89
        >>> cmp = ReesLevenshtein()
90
        >>> cmp.dist_abs('cat', 'hat')
91
        1
92
        >>> cmp.dist_abs('Niall', 'Neil')
93
        3
94
        >>> cmp.dist_abs('aluminum', 'Catalan')
95
        7
96
        >>> cmp.dist_abs('ATCG', 'TAGC')
97
        2
98
99
100
        .. versionadded:: 0.4.0
101
102
        """
103
        v_str1_length = len(src)
104
        v_str2_length = len(tar)
105 1
106 1
        if tar == src:
107
            return 0
108 1
        if not src:
109 1
            return len(tar)
110 1
        if not tar:
111 1
            return len(src)
112 1
        if v_str1_length == 1 and v_str2_length == 1:
113 1
            return 1
114 1
115 1
        def _substr(string: str, start: int, length: int) -> str:
116
            if start > 0:
117 1
                start -= 1
118 1
            else:
119 1
                start += len(string) - 1
120
121 1
            end = start + length
122
123 1
            return string[start:end]
124
125 1
        v_temp_str1 = str(src)
126
        v_temp_str2 = str(tar)
127 1
128 1
        # first trim common leading characters
129
        while v_temp_str1[:1] == v_temp_str2[:1]:
130
            v_temp_str1 = v_temp_str1[1:]
131 1
            v_temp_str2 = v_temp_str2[1:]
132 1
133 1
        # then trim common trailing characters
134
        while v_temp_str1[-1:] == v_temp_str2[-1:]:
135
            v_temp_str1 = v_temp_str1[:-1]
136 1
            v_temp_str2 = v_temp_str2[:-1]
137 1
138 1
        v_str1_length = len(v_temp_str1)
139
        v_str2_length = len(v_temp_str2)
140 1
141 1
        # then calculate standard Levenshtein Distance
142
        if v_str1_length == 0 or v_str2_length == 0:
143
            return max(v_str2_length, v_str1_length)
144 1
        if v_str1_length == 1 and v_str2_length == 1:
145 1
            return 1
146 1
147 1
        # create table (NB: this is transposed relative to the PL/SQL version)
148
        d_mat = np_zeros((v_str1_length + 1, v_str2_length + 1), dtype=np_int)
149
150 1
        # enter values in first (leftmost) column
151
        for i in range(1, v_str1_length + 1):
152
            d_mat[i, 0] = i
153 1
        # populate remaining columns
154 1
        for j in range(1, v_str2_length + 1):
155
            d_mat[0, j] = j
156 1
157 1
            for i in range(1, v_str1_length + 1):
158
                if v_temp_str1[i - 1] == v_temp_str2[j - 1]:
159 1
                    v_this_cost = 0
160 1
                else:
161 1
                    v_this_cost = 1
162
163 1
                # extension to cover multiple single, double, triple, etc.
164
                # character transpositions
165
                # that includes calculation of original Levenshtein distance
166
                # when no transposition found
167
                v_temp_block_length = int(
168
                    min(
169 1
                        v_str1_length / 2, v_str2_length / 2, self._block_limit
170
                    )
171
                )
172
173
                while v_temp_block_length >= 1:
174
                    if (
175 1
                        (i >= v_temp_block_length * 2)
176 1
                        and (j >= v_temp_block_length * 2)
177
                        and (
178
                            _substr(
179
                                v_temp_str1,
180
                                i - v_temp_block_length * 2 - 1,
181
                                v_temp_block_length,
182
                            )
183
                            == _substr(
184
                                v_temp_str2,
185
                                j - v_temp_block_length - 1,
186
                                v_temp_block_length,
187
                            )
188
                        )
189
                        and (
190
                            _substr(
191
                                v_temp_str1,
192
                                i - v_temp_block_length - 1,
193
                                v_temp_block_length,
194
                            )
195
                            == _substr(
196
                                v_temp_str2,
197
                                j - v_temp_block_length * 2 - 1,
198
                                v_temp_block_length,
199
                            )
200
                        )
201
                    ):
202
                        # transposition found
203
                        d_mat[i, j] = min(
204
                            d_mat[i, j - 1] + 1,
205 1
                            d_mat[i - 1, j] + 1,
206
                            d_mat[
207
                                i - v_temp_block_length * 2,
208
                                j - v_temp_block_length * 2,
209
                            ]
210
                            + v_this_cost
211
                            + v_temp_block_length
212
                            - 1,
213
                        )
214
                        v_temp_block_length = 0
215
                    elif v_temp_block_length == 1:
216 1
                        # no transposition
217 1
                        d_mat[i, j] = min(
218
                            d_mat[i, j - 1] + 1,
219 1
                            d_mat[i - 1, j] + 1,
220
                            d_mat[i - 1, j - 1] + v_this_cost,
221
                        )
222
                    v_temp_block_length -= 1
223
224 1
        return cast(float, d_mat[v_str1_length, v_str2_length])
225
226 1
    def dist(self, src: str, tar: str) -> float:
227
        """Return the normalized Rees-Levenshtein distance of two strings.
228 1
229
        Parameters
230
        ----------
231
        src : str
232
            Source string for comparison
233
        tar : str
234
            Target string for comparison
235
236
        Returns
237
        -------
238
        float
239
            Normalized Rees-Levenshtein distance
240
241
        Examples
242
        --------
243
        >>> cmp = ReesLevenshtein()
244
        >>> cmp.dist('cat', 'hat')
245
        0.3333333333333333
246
        >>> cmp.dist('Niall', 'Neil')
247
        0.6
248
        >>> cmp.dist('aluminum', 'Catalan')
249
        0.875
250
        >>> cmp.dist('ATCG', 'TAGC')
251
        0.5
252
253
254
        .. versionadded:: 0.4.0
255
256
        """
257
        if src == tar:
258
            return 0.0
259 1
        return self.dist_abs(src, tar) / (
260 1
            self._normalizer([len(src), len(tar)])
261 1
        )
262
263
264
if __name__ == '__main__':
265
    import doctest
266
267
    doctest.testmod()
268