Completed
Pull Request — master (#138)
by Chris
14:20
created

abydos.distance._sift4   B

Complexity

Total Complexity 44

Size/Duplication

Total Lines 350
Duplicated Lines 0 %

Test Coverage

Coverage 100%

Importance

Changes 0
Metric Value
wmc 44
eloc 127
dl 0
loc 350
ccs 109
cts 109
cp 1
rs 8.8798
c 0
b 0
f 0

4 Functions

Rating   Name   Duplication   Size   Complexity  
A sim_sift4() 0 23 1
A sift4_simplest() 0 19 1
A dist_sift4() 0 22 1
A sift4_common() 0 23 1

3 Methods

Rating   Name   Duplication   Size   Complexity  
F Sift4.dist_abs() 0 100 25
A Sift4.dist() 0 25 1
F Sift4Simplest.dist_abs() 0 65 14

How to fix   Complexity   

Complexity

Complex classes like abydos.distance._sift4 often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

1
# -*- coding: utf-8 -*-
2
3
# Copyright 2018 by Christopher C. Little.
4
# This file is part of Abydos.
5
#
6
# Abydos is free software: you can redistribute it and/or modify
7
# it under the terms of the GNU General Public License as published by
8
# the Free Software Foundation, either version 3 of the License, or
9
# (at your option) any later version.
10
#
11
# Abydos is distributed in the hope that it will be useful,
12
# but WITHOUT ANY WARRANTY; without even the implied warranty of
13
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
# GNU General Public License for more details.
15
#
16
# You should have received a copy of the GNU General Public License
17
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.
18
19 1
"""abydos.distance.sift4.
20
21
The distance.sift4 module implements Sift4 approximate string distance
22
functions.
23
"""
24
25 1
from __future__ import division, unicode_literals
26
27 1
from six.moves import range
28
29 1
from ._distance import Distance
30
31 1
__all__ = [
32
    'Sift4',
33
    'Sift4Simplest',
34
    'dist_sift4',
35
    'sift4_common',
36
    'sift4_simplest',
37
    'sim_sift4',
38
]
39
40
41 1
class Sift4(Distance):
0 ignored issues
show
Unused Code introduced by
The variable __class__ seems to be unused.
Loading history...
42
    """Sift4 Common version.
43
44
    This is an approximation of edit distance, described in
45
    :cite:`Zackwehdex:2014`.
46
    """
47
48 1
    def dist_abs(self, src, tar, max_offset=5, max_distance=0):
0 ignored issues
show
Comprehensibility introduced by
This function exceeds the maximum number of variables (18/15).
Loading history...
Bug introduced by
Parameters differ from overridden 'dist_abs' method
Loading history...
49
        """Return the "common" Sift4 distance between two terms.
50
51
        :param str src: source string for comparison
52
        :param str tar: target string for comparison
53
        :param max_offset: the number of characters to search for matching
54
            letters
55
        :param max_distance: the distance at which to stop and exit
56
        :returns: the Sift4 distance according to the common formula
57
        :rtype: int
58
59
        >>> cmp = Sift4()
60
        >>> cmp.dist_abs('cat', 'hat')
61
        1
62
        >>> cmp.dist_abs('Niall', 'Neil')
63
        2
64
        >>> cmp.dist_abs('Colin', 'Cuilen')
65
        3
66
        >>> cmp.dist_abs('ATCG', 'TAGC')
67
        2
68
        """
69 1
        if not src:
70 1
            return len(tar)
71
72 1
        if not tar:
73 1
            return len(src)
74
75 1
        src_len = len(src)
76 1
        tar_len = len(tar)
77
78 1
        src_cur = 0
79 1
        tar_cur = 0
80 1
        lcss = 0
81 1
        local_cs = 0
82 1
        trans = 0
83 1
        offset_arr = []
84
85 1
        while (src_cur < src_len) and (tar_cur < tar_len):
86 1
            if src[src_cur] == tar[tar_cur]:
87 1
                local_cs += 1
88 1
                is_trans = False
89 1
                i = 0
90 1
                while i < len(offset_arr):
91 1
                    ofs = offset_arr[i]
92 1
                    if src_cur <= ofs['src_cur'] or tar_cur <= ofs['tar_cur']:
93 1
                        is_trans = abs(tar_cur - src_cur) >= abs(
94
                            ofs['tar_cur'] - ofs['src_cur']
95
                        )
96 1
                        if is_trans:
97 1
                            trans += 1
98 1
                        elif not ofs['trans']:
99 1
                            ofs['trans'] = True
100 1
                            trans += 1
101 1
                        break
102 1
                    elif src_cur > ofs['tar_cur'] and tar_cur > ofs['src_cur']:
103 1
                        del offset_arr[i]
104
                    else:
105 1
                        i += 1
106
107 1
                offset_arr.append(
108
                    {'src_cur': src_cur, 'tar_cur': tar_cur, 'trans': is_trans}
109
                )
110
            else:
111 1
                lcss += local_cs
112 1
                local_cs = 0
113 1
                if src_cur != tar_cur:
114 1
                    src_cur = tar_cur = min(src_cur, tar_cur)
115 1
                for i in range(max_offset):
116 1
                    if not (
117
                        (src_cur + i < src_len) or (tar_cur + i < tar_len)
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
118
                    ):
119 1
                        break
120 1
                    if (src_cur + i < src_len) and (
121
                        src[src_cur + i] == tar[tar_cur]
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
122
                    ):
123 1
                        src_cur += i - 1
124 1
                        tar_cur -= 1
125 1
                        break
126 1
                    if (tar_cur + i < tar_len) and (
127
                        src[src_cur] == tar[tar_cur + i]
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
128
                    ):
129 1
                        src_cur -= 1
130 1
                        tar_cur += i - 1
131 1
                        break
132
133 1
            src_cur += 1
134 1
            tar_cur += 1
135
136 1
            if max_distance:
137 1
                temporary_distance = max(src_cur, tar_cur) - lcss + trans
138 1
                if temporary_distance >= max_distance:
139 1
                    return round(temporary_distance)
140
141 1
            if (src_cur >= src_len) or (tar_cur >= tar_len):
142 1
                lcss += local_cs
143 1
                local_cs = 0
144 1
                src_cur = tar_cur = min(src_cur, tar_cur)
145
146 1
        lcss += local_cs
147 1
        return round(max(src_len, tar_len) - lcss + trans)
148
149 1
    def dist(self, src, tar, max_offset=5, max_distance=0):
0 ignored issues
show
Bug introduced by
Parameters differ from overridden 'dist' method
Loading history...
150
        """Return the normalized "common" Sift4 distance between two terms.
151
152
        This is Sift4 distance, normalized to [0, 1].
153
154
        :param str src: source string for comparison
155
        :param str tar: target string for comparison
156
        :param max_offset: the number of characters to search for matching
157
            letters
158
        :param max_distance: the distance at which to stop and exit
159
        :returns: the normalized Sift4 distance
160
        :rtype: float
161
162
        >>> cmp = Sift4()
163
        >>> round(cmp.dist('cat', 'hat'), 12)
164
        0.333333333333
165
        >>> cmp.dist('Niall', 'Neil')
166
        0.4
167
        >>> cmp.dist('Colin', 'Cuilen')
168
        0.5
169
        >>> cmp.dist('ATCG', 'TAGC')
170
        0.5
171
        """
172 1
        return self.dist_abs(src, tar, max_offset, max_distance) / (
173
            max(len(src), len(tar), 1)
174
        )
175
176
177 1
def sift4_common(src, tar, max_offset=5, max_distance=0):
178
    """Return the "common" Sift4 distance between two terms.
179
180
    This is an approximation of edit distance, described in
181
    :cite:`Zackwehdex:2014`.
182
183
    :param str src: source string for comparison
184
    :param str tar: target string for comparison
185
    :param max_offset: the number of characters to search for matching letters
186
    :param max_distance: the distance at which to stop and exit
187
    :returns: the Sift4 distance according to the common formula
188
    :rtype: int
189
190
    >>> sift4_common('cat', 'hat')
191
    1
192
    >>> sift4_common('Niall', 'Neil')
193
    2
194
    >>> sift4_common('Colin', 'Cuilen')
195
    3
196
    >>> sift4_common('ATCG', 'TAGC')
197
    2
198
    """
199 1
    return Sift4().dist_abs(src, tar, max_offset, max_distance)
200
201
202 1
def dist_sift4(src, tar, max_offset=5, max_distance=0):
203
    """Return the normalized "common" Sift4 distance between two terms.
204
205
    This is Sift4 distance, normalized to [0, 1].
206
207
    :param str src: source string for comparison
208
    :param str tar: target string for comparison
209
    :param max_offset: the number of characters to search for matching letters
210
    :param max_distance: the distance at which to stop and exit
211
    :returns: the normalized Sift4 distance
212
    :rtype: float
213
214
    >>> round(dist_sift4('cat', 'hat'), 12)
215
    0.333333333333
216
    >>> dist_sift4('Niall', 'Neil')
217
    0.4
218
    >>> dist_sift4('Colin', 'Cuilen')
219
    0.5
220
    >>> dist_sift4('ATCG', 'TAGC')
221
    0.5
222
    """
223 1
    return Sift4().dist(src, tar, max_offset, max_distance)
224
225
226 1
def sim_sift4(src, tar, max_offset=5, max_distance=0):
227
    """Return the normalized "common" Sift4 similarity of two terms.
228
229
    Normalized Sift4 similarity is the complement of normalized Sift4 distance:
230
    :math:`sim_{Sift4} = 1 - dist_{Sift4}`.
231
232
    :param str src: source string for comparison
233
    :param str tar: target string for comparison
234
    :param max_offset: the number of characters to search for matching letters
235
    :param max_distance: the distance at which to stop and exit
236
    :returns: the normalized Sift4 similarity
237
    :rtype: float
238
239
    >>> round(sim_sift4('cat', 'hat'), 12)
240
    0.666666666667
241
    >>> sim_sift4('Niall', 'Neil')
242
    0.6
243
    >>> sim_sift4('Colin', 'Cuilen')
244
    0.5
245
    >>> sim_sift4('ATCG', 'TAGC')
246
    0.5
247
    """
248 1
    return Sift4().sim(src, tar, max_offset, max_distance)
249
250
251 1
class Sift4Simplest(Sift4):
0 ignored issues
show
Unused Code introduced by
The variable __class__ seems to be unused.
Loading history...
252
    """Sift4 Simplest version.
253
254
    This is an approximation of edit distance, described in
255
    :cite:`Zackwehdex:2014`.
256
    """
257
258 1
    def dist_abs(self, src, tar, max_offset=5):
0 ignored issues
show
Bug introduced by
Parameters differ from overridden 'dist_abs' method
Loading history...
259
        """Return the "simplest" Sift4 distance between two terms.
260
261
        :param str src: source string for comparison
262
        :param str tar: target string for comparison
263
        :param max_offset: the number of characters to search for matching
264
            letters
265
        :returns: the Sift4 distance according to the simplest formula
266
        :rtype: int
267
268
        >>> cmp = Sift4Simplest()
269
        >>> cmp.dist_abs('cat', 'hat')
270
        1
271
        >>> cmp.dist_abs('Niall', 'Neil')
272
        2
273
        >>> cmp.dist_abs('Colin', 'Cuilen')
274
        3
275
        >>> cmp.dist_abs('ATCG', 'TAGC')
276
        2
277
        """
278 1
        if not src:
279 1
            return len(tar)
280
281 1
        if not tar:
282 1
            return len(src)
283
284 1
        src_len = len(src)
285 1
        tar_len = len(tar)
286
287 1
        src_cur = 0
288 1
        tar_cur = 0
289 1
        lcss = 0
290 1
        local_cs = 0
291
292 1
        while (src_cur < src_len) and (tar_cur < tar_len):
293 1
            if src[src_cur] == tar[tar_cur]:
294 1
                local_cs += 1
295
            else:
296 1
                lcss += local_cs
297 1
                local_cs = 0
298 1
                if src_cur != tar_cur:
299 1
                    src_cur = tar_cur = max(src_cur, tar_cur)
300 1
                for i in range(max_offset):
301 1
                    if not (
302
                        (src_cur + i < src_len) or (tar_cur + i < tar_len)
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
303
                    ):
304 1
                        break
305 1
                    if (src_cur + i < src_len) and (
306
                        src[src_cur + i] == tar[tar_cur]
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
307
                    ):
308 1
                        src_cur += i
309 1
                        local_cs += 1
310 1
                        break
311 1
                    if (tar_cur + i < tar_len) and (
312
                        src[src_cur] == tar[tar_cur + i]
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
313
                    ):
314 1
                        tar_cur += i
315 1
                        local_cs += 1
316 1
                        break
317
318 1
            src_cur += 1
319 1
            tar_cur += 1
320
321 1
        lcss += local_cs
322 1
        return round(max(src_len, tar_len) - lcss)
323
324
325 1
def sift4_simplest(src, tar, max_offset=5):
326
    """Return the "simplest" Sift4 distance between two terms.
327
328
    :param str src: source string for comparison
329
    :param str tar: target string for comparison
330
    :param max_offset: the number of characters to search for matching letters
331
    :returns: the Sift4 distance according to the simplest formula
332
    :rtype: int
333
334
    >>> sift4_simplest('cat', 'hat')
335
    1
336
    >>> sift4_simplest('Niall', 'Neil')
337
    2
338
    >>> sift4_simplest('Colin', 'Cuilen')
339
    3
340
    >>> sift4_simplest('ATCG', 'TAGC')
341
    2
342
    """
343 1
    return Sift4Simplest().dist_abs(src, tar, max_offset)
344
345
346
if __name__ == '__main__':
347
    import doctest
348
349
    doctest.testmod()
350