Sift4Extended.__init__()   D
last analyzed

Complexity

Conditions 12

Size

Total Lines 85
Code Lines 36

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 24
CRAP Score 12

Importance

Changes 0
Metric Value
eloc 36
dl 0
loc 85
ccs 24
cts 24
cp 1
rs 4.8
c 0
b 0
f 0
cc 12
nop 10
crap 12

How to fix   Long Method    Complexity    Many Parameters   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:

Complexity

Complex classes like abydos.distance._sift4_extended.Sift4Extended.__init__() often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

Many Parameters

Methods with many parameters are not only hard to understand, but their parameters also often become inconsistent when you need more, or different data.

There are several approaches to avoid long parameter lists:

1
# Copyright 2019-2020 by Christopher C. Little.
2
# This file is part of Abydos.
3
#
4
# Abydos is free software: you can redistribute it and/or modify
5
# it under the terms of the GNU General Public License as published by
6
# the Free Software Foundation, either version 3 of the License, or
7
# (at your option) any later version.
8
#
9
# Abydos is distributed in the hope that it will be useful,
10
# but WITHOUT ANY WARRANTY; without even the implied warranty of
11
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
12
# GNU General Public License for more details.
13
#
14
# You should have received a copy of the GNU General Public License
15
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.
16
17
"""abydos.distance._sift4_extended.
18
19 1
Sift4 Extended approximate string distance
20
"""
21
22
from typing import Any, Callable, Dict, List, Optional, Union
23
24 1
from ._distance import _Distance
25
from ._sift4 import Sift4
26
from ..tokenizer import CharacterTokenizer, _Tokenizer
27
28
__all__ = ['Sift4Extended']
29
30
31 1
class Sift4Extended(_Distance):
32
    r"""Sift4 Extended version.
33 1
34 1
    This is an approximation of edit distance, described in
35 1
    :cite:`Zackwehdex:2014`.
36
37 1
    .. versionadded:: 0.4.0
38
    """
39
40 1
    _sift4 = Sift4()
41
42
    def __init__(
43
        self,
44
        max_offset: int = 5,
45
        max_distance: int = 0,
46
        tokenizer: Optional[_Tokenizer] = None,
47
        token_matcher: Optional[Callable[[str, str], bool]] = None,
48
        matching_evaluator: Optional[Callable[[str, str], float]] = None,
49 1
        local_length_evaluator: Optional[Callable[[float], float]] = None,
50
        transposition_cost_evaluator: Optional[
51 1
            Callable[[int, int], float]
52
        ] = None,
53
        transpositions_evaluator: Optional[
54
            Callable[[float, float], float]
55
        ] = None,
56
        **kwargs: Any
57
    ) -> None:
58
        """Initialize Sift4Extended instance.
59
60
        Parameters
61
        ----------
62
        max_offset : int
63
            The number of characters to search for matching letters
64
        max_distance : int
65
            The distance at which to stop and exit
66
        tokenizer : _Tokenizer
67
            A tokenizer instance (character tokenization by default)
68
        token_matcher : function
69
            A token matcher function of two parameters (equality by default).
70
            :math:`Sift4Extended.sift4_token_matcher` is also supplied.
71
        matching_evaluator : function
72
            A token match quality function of two parameters (1 by default).
73
            :math:`Sift4Extended.sift4_matching_evaluator` is also supplied.
74
        local_length_evaluator : function
75
            A local length evaluator function (its single parameter by
76
            default). :math:`Sift4Extended.reward_length_evaluator` and
77
            :math:`Sift4Extended.reward_length_evaluator_exp` are also
78
            supplied.
79
        transposition_cost_evaluator : function
80
            A transposition cost evaluator function of two parameters (1 by
81
            default).
82
            :math:`Sift4Extended.longer_transpositions_are_more_costly` is also
83
            supplied.
84
        transpositions_evaluator : function
85
            A transpositions evaluator function of two parameters (the second
86
            parameter subtracted from the first, by default).
87
        **kwargs
88
            Arbitrary keyword arguments
89
90
91
        .. versionadded:: 0.4.0
92
93
        """
94
        super(Sift4Extended, self).__init__(**kwargs)
95
        self._max_offset = max_offset
96
        self._max_distance = max_distance
97
98
        if tokenizer is not None:
99 1
            self._tokenizer = tokenizer
100 1
        else:
101 1
            self._tokenizer = CharacterTokenizer()
102 1
103 1
        if token_matcher is not None:
104 1
            self._token_matcher = token_matcher
105 1
        else:
106 1
            self._token_matcher = lambda t1, t2: t1 == t2
107 1
108
        if matching_evaluator is not None:
109 1
            self._matching_evaluator = matching_evaluator
110 1
        else:
111 1
            self._matching_evaluator = lambda t1, t2: 1
112 1
113 1
        if local_length_evaluator is not None:
114 1
            self._local_length_evaluator = local_length_evaluator
115 1
        else:
116 1
            self._local_length_evaluator = lambda local_cs: local_cs
117 1
118 1
        if transposition_cost_evaluator is not None:
119 1
            self._transposition_cost_evaluator = transposition_cost_evaluator
120 1
        else:
121
            self._transposition_cost_evaluator = lambda c1, c2: 1
122 1
123
        if transpositions_evaluator is not None:
124
            self._transpositions_evaluator = transpositions_evaluator
125
        else:
126
            self._transpositions_evaluator = lambda lcss, trans: lcss - trans
127
128
    def dist_abs(self, src: str, tar: str) -> float:
129
        """Return the Sift4 Extended distance between two strings.
130
131
        Parameters
132
        ----------
133
        src : str
134
            Source string for comparison
135
        tar : str
136
            Target string for comparison
137
138
        Returns
139
        -------
140
        int
141
            The Sift4 distance according to the extended formula
142
143
        Examples
144
        --------
145
        >>> cmp = Sift4Extended()
146
        >>> cmp.dist_abs('cat', 'hat')
147
        1
148
        >>> cmp.dist_abs('Niall', 'Neil')
149
        2
150
        >>> cmp.dist_abs('aluminum', 'Catalan')
151
        5
152
        >>> cmp.dist_abs('ATCG', 'TAGC')
153 1
        2
154 1
155
156 1
        .. versionadded:: 0.4.0
157 1
158
        """
159 1
        src_list = self._tokenizer.tokenize(src).get_list()
160 1
        tar_list = self._tokenizer.tokenize(tar).get_list()
161
162 1
        if not src_list:
163 1
            return len(tar_list)
164
165 1
        if not tar_list:
166 1
            return len(src_list)
167 1
168 1
        src_len = len(src_list)
169 1
        tar_len = len(tar_list)
170 1
171
        src_cur = 0
172 1
        tar_cur = 0
173 1
        lcss = 0.0
174 1
        local_cs = 0.0
175
        trans = 0.0
176
        offset_arr = []  # type: List[Dict[str, Union[int, bool]]]
177 1
178 1
        while (src_cur < src_len) and (tar_cur < tar_len):
179 1
            if self._token_matcher(src_list[src_cur], tar_list[tar_cur]):
180 1
                local_cs += self._matching_evaluator(
181 1
                    src_list[src_cur], tar_list[tar_cur]
182 1
                )
183
                is_trans = False
184
                i = 0
185 1
                while i < len(offset_arr):
186 1
                    ofs = offset_arr[i]
187
                    if src_cur <= ofs['src_cur'] or tar_cur <= ofs['tar_cur']:
188
                        is_trans = abs(tar_cur - src_cur) >= abs(
189 1
                            ofs['tar_cur'] - ofs['src_cur']
190 1
                        )
191 1
                        if is_trans:
192
                            trans += self._transposition_cost_evaluator(
193
                                src_cur, tar_cur
194 1
                            )
195 1
                        elif not ofs['trans']:
196 1
                            ofs['trans'] = True
197
                            trans += self._transposition_cost_evaluator(
198 1
                                ofs['tar_cur'], ofs['src_cur']
199
                            )
200 1
                        break
201
                    elif src_cur > ofs['tar_cur'] and tar_cur > ofs['src_cur']:
202
                        del offset_arr[i]
203
                    else:
204 1
                        i += 1
205 1
206 1
                offset_arr.append(
207 1
                    {'src_cur': src_cur, 'tar_cur': tar_cur, 'trans': is_trans}
208 1
                )
209 1
            else:
210
                lcss += self._local_length_evaluator(local_cs)
211
                local_cs = 0
212 1
                if src_cur != tar_cur:
213 1
                    src_cur = tar_cur = min(src_cur, tar_cur)
214
                for i in range(self._max_offset):
215
                    if not (
216 1
                        (src_cur + i < src_len) or (tar_cur + i < tar_len)
217 1
                    ):
218 1
                        break
219 1
                    if (src_cur + i < src_len) and (
220
                        self._token_matcher(
221
                            src_list[src_cur + i], tar_list[tar_cur]
222 1
                        )
223 1
                    ):
224 1
                        src_cur += i - 1
225
                        tar_cur -= 1
226 1
                        break
227 1
                    if (tar_cur + i < tar_len) and (
228
                        self._token_matcher(
229 1
                            src_list[src_cur], tar_list[tar_cur + i]
230 1
                        )
231
                    ):
232
                        src_cur -= 1
233 1
                        tar_cur += i - 1
234 1
                        break
235
236 1
            src_cur += 1
237 1
            tar_cur += 1
238 1
239 1
            if self._max_distance:
240
                temporary_distance = self._local_length_evaluator(
241 1
                    max(src_cur, tar_cur)
242 1
                ) - self._transpositions_evaluator(lcss, trans)
243
                if temporary_distance >= self._max_distance:
244
                    return round(temporary_distance)
245
246
            if (src_cur >= src_len) or (tar_cur >= tar_len):
247 1
                lcss += self._local_length_evaluator(local_cs)
248
                local_cs = 0
249
                src_cur = tar_cur = min(src_cur, tar_cur)
250
251
        lcss += self._local_length_evaluator(local_cs)
252
        return round(
253
            self._local_length_evaluator(max(src_len, tar_len))
254
            - self._transpositions_evaluator(lcss, trans)
255
        )
256
257
    @staticmethod
258
    def sift4_token_matcher(src: str, tar: str) -> bool:
259
        """Sift4 Token Matcher.
260
261
        Parameters
262
        ----------
263
        src : str
264
            Source string for comparison
265
        tar : str
266 1
            Target string for comparison
267
268 1
        Returns
269
        -------
270
        bool
271
            Whether the Sift4 similarity of the two tokens is over 0.7
272
273
        .. versionadded:: 0.4.0
274
275
        """
276
        return Sift4Extended.sift4_matching_evaluator(src, tar) > 0.7
277
278
    @staticmethod
279
    def sift4_matching_evaluator(src: str, tar: str) -> float:
280
        """Sift4 Matching Evaluator.
281
282
        Parameters
283
        ----------
284
        src : str
285
            Source string for comparison
286
        tar : str
287 1
            Target string for comparison
288
289 1
        Returns
290
        -------
291
        float
292
            The Sift4 similarity of the two tokens
293
294
        .. versionadded:: 0.4.0
295
296
        """
297
        return Sift4Extended._sift4.sim(src, tar)
298
299
    @staticmethod
300
    def reward_length_evaluator(length: int) -> float:
301
        """Reward Length Evaluator.
302
303
        Parameters
304
        ----------
305
        length : int
306 1
            The length of a local match
307 1
308 1
        Returns
309
        -------
310 1
        float
311
            A reward value that grows sub-linearly
312
313
        .. versionadded:: 0.4.0
314
315
        """
316
        if length < 1:
317
            return 1
318
        return length - 1 / (length + 1)
319
320
    @staticmethod
321
    def reward_length_evaluator_exp(length: int) -> float:
322
        """Reward Length Evaluator.
323
324
        Parameters
325
        ----------
326
        length : int
327 1
            The length of a local match
328
329 1
        Returns
330
        -------
331
        float
332
            A reward value that grows exponentially
333
334
        .. versionadded:: 0.4.0
335
336
        """
337
        return length ** 1.5
338
339
    @staticmethod
340
    def longer_transpositions_are_more_costly(pos1: int, pos2: int) -> float:
341
        """Longer Transpositions Are More Costly.
342
343
        Parameters
344
        ----------
345
        pos1 : int
346
            The position of the first transposition
347
        pos2 : int
348 1
            The position of the second transposition
349
350
        Returns
351
        -------
352
        float
353
            A cost that grows as difference in the positions increases
354
355
        .. versionadded:: 0.4.0
356
357
        """
358
        return abs(pos2 - pos1) / 9 + 1
359
360
361
if __name__ == '__main__':
362
    import doctest
363
364
    doctest.testmod()
365