Sift4Extended.dist_abs()   F

Complexity

Conditions 25

Size

Total Lines 127
Code Lines 75

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 61
CRAP Score 25

Importance

Changes 0
Metric Value
eloc 75
dl 0
loc 127
ccs 61
cts 61
cp 1
rs 0
c 0
b 0
f 0
cc 25
nop 3
crap 25
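The CRAP score combines cyclomatic complexity and test coverage. Assuming this report uses the standard CRAP definition, the value above follows directly from the other metrics, since coverage is complete (cp = 1):

    CRAP(m) = cc(m)^2 * (1 - cov(m))^3 + cc(m)
            = 25^2 * (1 - 1)^3 + 25
            = 25

With full coverage the penalty term vanishes and the CRAP score reduces to the cyclomatic complexity of 25, so lowering it requires reducing the method's complexity, which is what the advice below addresses.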

How to fix

Long Method

Small methods make your code easier to understand, particularly when combined with a good name. Moreover, when a method is small, finding a good name for it is usually much easier.

For example, if you find yourself adding comments to a method's body, that is usually a sign that the commented part should be extracted into a new method, with the comment serving as a starting point for the new method's name.

Commonly applied refactorings include the Extract Method refactoring; a minimal sketch follows.
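A minimal, self-contained sketch of Extract Method. The class and method names here are hypothetical and not taken from Abydos; the point is that the commented block in the long method becomes a helper whose name is derived from the comment.

class Before:
    def report_total(self, values):
        total = 0
        # sum only the positive values  <-- comment marks an extractable block
        for v in values:
            if v > 0:
                total += v
        return total


class After:
    def report_total(self, values):
        return self._sum_positive_values(values)

    def _sum_positive_values(self, values):
        # The old comment became the method name.
        return sum(v for v in values if v > 0)

Both versions return the same result; the second reads as a sequence of named steps, which is what keeps each method short.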

Complexity

Complex methods like abydos.distance._sift4_extended.Sift4Extended.dist_abs() often do a lot of different things. To break such a method down, we need to identify a cohesive component within its class. A common approach to finding such a component is to look for fields and methods that share the same prefixes or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a subclass, Extract Subclass is also a candidate, and is often faster.
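In this class, the callback fields sharing the `_matcher` / `_evaluator` suffixes are natural candidates. Below is a hedged sketch of what an Extract Class step could look like; `Sift4Evaluators` is a hypothetical name, not part of the Abydos API, and the defaults simply mirror the lambdas assigned in `__init__` in the listing that follows.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Sift4Evaluators:
    # Hypothetical value object bundling the callback fields that
    # Sift4Extended.__init__ currently wires up one by one.
    token_matcher: Callable[[str, str], bool] = lambda t1, t2: t1 == t2
    matching_evaluator: Callable[[str, str], float] = lambda t1, t2: 1
    local_length_evaluator: Callable[[float], float] = lambda local_cs: local_cs
    transposition_cost_evaluator: Callable[[int, int], float] = lambda c1, c2: 1
    transpositions_evaluator: Callable[[float, float], float] = (
        lambda lcss, trans: lcss - trans
    )

# Sift4Extended could then hold a single `self._evaluators` collaborator and
# call, e.g., self._evaluators.token_matcher(a, b) inside dist_abs(), which
# shrinks both the constructor and the method's parameter handling.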

# Copyright 2019-2020 by Christopher C. Little.
# This file is part of Abydos.
#
# Abydos is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Abydos is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.

"""abydos.distance._sift4_extended.

Sift4 Extended approximate string distance
"""

from typing import Any, Callable, Dict, List, Optional, Union

from ._distance import _Distance
from ._sift4 import Sift4
from ..tokenizer import CharacterTokenizer, _Tokenizer

__all__ = ['Sift4Extended']


class Sift4Extended(_Distance):
    r"""Sift4 Extended version.

    This is an approximation of edit distance, described in
    :cite:`Zackwehdex:2014`.

    .. versionadded:: 0.4.0
    """

    _sift4 = Sift4()

    def __init__(
        self,
        max_offset: int = 5,
        max_distance: int = 0,
        tokenizer: Optional[_Tokenizer] = None,
        token_matcher: Optional[Callable[[str, str], bool]] = None,
        matching_evaluator: Optional[Callable[[str, str], float]] = None,
        local_length_evaluator: Optional[Callable[[float], float]] = None,
        transposition_cost_evaluator: Optional[
            Callable[[int, int], float]
        ] = None,
        transpositions_evaluator: Optional[
            Callable[[float, float], float]
        ] = None,
        **kwargs: Any
    ) -> None:
        """Initialize Sift4Extended instance.

        Parameters
        ----------
        max_offset : int
            The number of characters to search for matching letters
        max_distance : int
            The distance at which to stop and exit
        tokenizer : _Tokenizer
            A tokenizer instance (character tokenization by default)
        token_matcher : function
            A token matcher function of two parameters (equality by default).
            :math:`Sift4Extended.sift4_token_matcher` is also supplied.
        matching_evaluator : function
            A token match quality function of two parameters (1 by default).
            :math:`Sift4Extended.sift4_matching_evaluator` is also supplied.
        local_length_evaluator : function
            A local length evaluator function (its single parameter by
            default). :math:`Sift4Extended.reward_length_evaluator` and
            :math:`Sift4Extended.reward_length_evaluator_exp` are also
            supplied.
        transposition_cost_evaluator : function
            A transposition cost evaluator function of two parameters (1 by
            default).
            :math:`Sift4Extended.longer_transpositions_are_more_costly` is also
            supplied.
        transpositions_evaluator : function
            A transpositions evaluator function of two parameters (the second
            parameter subtracted from the first, by default).
        **kwargs
            Arbitrary keyword arguments


        .. versionadded:: 0.4.0

        """
        super(Sift4Extended, self).__init__(**kwargs)
        self._max_offset = max_offset
        self._max_distance = max_distance

        if tokenizer is not None:
            self._tokenizer = tokenizer
        else:
            self._tokenizer = CharacterTokenizer()

        if token_matcher is not None:
            self._token_matcher = token_matcher
        else:
            self._token_matcher = lambda t1, t2: t1 == t2

        if matching_evaluator is not None:
            self._matching_evaluator = matching_evaluator
        else:
            self._matching_evaluator = lambda t1, t2: 1

        if local_length_evaluator is not None:
            self._local_length_evaluator = local_length_evaluator
        else:
            self._local_length_evaluator = lambda local_cs: local_cs

        if transposition_cost_evaluator is not None:
            self._transposition_cost_evaluator = transposition_cost_evaluator
        else:
            self._transposition_cost_evaluator = lambda c1, c2: 1

        if transpositions_evaluator is not None:
            self._transpositions_evaluator = transpositions_evaluator
        else:
            self._transpositions_evaluator = lambda lcss, trans: lcss - trans

    def dist_abs(self, src: str, tar: str) -> float:
        """Return the Sift4 Extended distance between two strings.

        Parameters
        ----------
        src : str
            Source string for comparison
        tar : str
            Target string for comparison

        Returns
        -------
        int
            The Sift4 distance according to the extended formula

        Examples
        --------
        >>> cmp = Sift4Extended()
        >>> cmp.dist_abs('cat', 'hat')
        1
        >>> cmp.dist_abs('Niall', 'Neil')
        2
        >>> cmp.dist_abs('aluminum', 'Catalan')
        5
        >>> cmp.dist_abs('ATCG', 'TAGC')
        2


        .. versionadded:: 0.4.0

        """
        src_list = self._tokenizer.tokenize(src).get_list()
        tar_list = self._tokenizer.tokenize(tar).get_list()

        if not src_list:
            return len(tar_list)

        if not tar_list:
            return len(src_list)

        src_len = len(src_list)
        tar_len = len(tar_list)

        src_cur = 0
        tar_cur = 0
        lcss = 0.0
        local_cs = 0.0
        trans = 0.0
        offset_arr = []  # type: List[Dict[str, Union[int, bool]]]

        while (src_cur < src_len) and (tar_cur < tar_len):
            if self._token_matcher(src_list[src_cur], tar_list[tar_cur]):
                local_cs += self._matching_evaluator(
                    src_list[src_cur], tar_list[tar_cur]
                )
                is_trans = False
                i = 0
                while i < len(offset_arr):
                    ofs = offset_arr[i]
                    if src_cur <= ofs['src_cur'] or tar_cur <= ofs['tar_cur']:
                        is_trans = abs(tar_cur - src_cur) >= abs(
                            ofs['tar_cur'] - ofs['src_cur']
                        )
                        if is_trans:
                            trans += self._transposition_cost_evaluator(
                                src_cur, tar_cur
                            )
                        elif not ofs['trans']:
                            ofs['trans'] = True
                            trans += self._transposition_cost_evaluator(
                                ofs['tar_cur'], ofs['src_cur']
                            )
                        break
                    elif src_cur > ofs['tar_cur'] and tar_cur > ofs['src_cur']:
                        del offset_arr[i]
                    else:
                        i += 1

                offset_arr.append(
                    {'src_cur': src_cur, 'tar_cur': tar_cur, 'trans': is_trans}
                )
            else:
                lcss += self._local_length_evaluator(local_cs)
                local_cs = 0
                if src_cur != tar_cur:
                    src_cur = tar_cur = min(src_cur, tar_cur)
                for i in range(self._max_offset):
                    if not (
                        (src_cur + i < src_len) or (tar_cur + i < tar_len)
                    ):
                        break
                    if (src_cur + i < src_len) and (
                        self._token_matcher(
                            src_list[src_cur + i], tar_list[tar_cur]
                        )
                    ):
                        src_cur += i - 1
                        tar_cur -= 1
                        break
                    if (tar_cur + i < tar_len) and (
                        self._token_matcher(
                            src_list[src_cur], tar_list[tar_cur + i]
                        )
                    ):
                        src_cur -= 1
                        tar_cur += i - 1
                        break

            src_cur += 1
            tar_cur += 1

            if self._max_distance:
                temporary_distance = self._local_length_evaluator(
                    max(src_cur, tar_cur)
                ) - self._transpositions_evaluator(lcss, trans)
                if temporary_distance >= self._max_distance:
                    return round(temporary_distance)

            if (src_cur >= src_len) or (tar_cur >= tar_len):
                lcss += self._local_length_evaluator(local_cs)
                local_cs = 0
                src_cur = tar_cur = min(src_cur, tar_cur)

        lcss += self._local_length_evaluator(local_cs)
        return round(
            self._local_length_evaluator(max(src_len, tar_len))
            - self._transpositions_evaluator(lcss, trans)
        )

    @staticmethod
    def sift4_token_matcher(src: str, tar: str) -> bool:
        """Sift4 Token Matcher.

        Parameters
        ----------
        src : str
            Source string for comparison
        tar : str
            Target string for comparison

        Returns
        -------
        bool
            Whether the Sift4 similarity of the two tokens is over 0.7

        .. versionadded:: 0.4.0

        """
        return Sift4Extended.sift4_matching_evaluator(src, tar) > 0.7

    @staticmethod
    def sift4_matching_evaluator(src: str, tar: str) -> float:
        """Sift4 Matching Evaluator.

        Parameters
        ----------
        src : str
            Source string for comparison
        tar : str
            Target string for comparison

        Returns
        -------
        float
            The Sift4 similarity of the two tokens

        .. versionadded:: 0.4.0

        """
        return Sift4Extended._sift4.sim(src, tar)

    @staticmethod
    def reward_length_evaluator(length: int) -> float:
        """Reward Length Evaluator.

        Parameters
        ----------
        length : int
            The length of a local match

        Returns
        -------
        float
            A reward value that grows sub-linearly

        .. versionadded:: 0.4.0

        """
        if length < 1:
            return 1
        return length - 1 / (length + 1)

    @staticmethod
    def reward_length_evaluator_exp(length: int) -> float:
        """Reward Length Evaluator.

        Parameters
        ----------
        length : int
            The length of a local match

        Returns
        -------
        float
            A reward value that grows exponentially

        .. versionadded:: 0.4.0

        """
        return length ** 1.5

    @staticmethod
    def longer_transpositions_are_more_costly(pos1: int, pos2: int) -> float:
        """Longer Transpositions Are More Costly.

        Parameters
        ----------
        pos1 : int
            The position of the first transposition
        pos2 : int
            The position of the second transposition

        Returns
        -------
        float
            A cost that grows as difference in the positions increases

        .. versionadded:: 0.4.0

        """
        return abs(pos2 - pos1) / 9 + 1


if __name__ == '__main__':
    import doctest

    doctest.testmod()
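For reference, a short usage sketch exercising the constructor parameters documented above. It assumes Sift4Extended is importable from abydos.distance, the package's usual public path; only the first expected output is taken from the doctests in the listing.

from abydos.distance import Sift4Extended

# Default configuration: character tokenization, equality matching,
# unit match scores.
cmp = Sift4Extended()
print(cmp.dist_abs('Niall', 'Neil'))  # 2, per the doctest above

# Swapping in the alternative evaluators supplied as static methods.
cmp_custom = Sift4Extended(
    max_offset=10,
    token_matcher=Sift4Extended.sift4_token_matcher,
    matching_evaluator=Sift4Extended.sift4_matching_evaluator,
    local_length_evaluator=Sift4Extended.reward_length_evaluator,
    transposition_cost_evaluator=(
        Sift4Extended.longer_transpositions_are_more_costly
    ),
)
# The result depends on the evaluators chosen, so no value is asserted here.
print(cmp_custom.dist_abs('aluminum', 'Catalan'))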