Test Failed
Push — master ( 64abe2...a464fa )
by Chris
04:02 queued 11s
created

abydos.distance.sift4.sift4_simplest()   F

Complexity

Conditions 14

Size

Total Lines 60
Code Lines 33

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric Value
cc 14
eloc 33
nop 3
dl 0
loc 60
rs 3.6
c 0
b 0
f 0

How to fix   Long Method    Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:

Complexity

Complex classes like abydos.distance.sift4.sift4_simplest() often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

1
# -*- coding: utf-8 -*-
2
3
# Copyright 2018 by Christopher C. Little.
4
# This file is part of Abydos.
5
#
6
# Abydos is free software: you can redistribute it and/or modify
7
# it under the terms of the GNU General Public License as published by
8
# the Free Software Foundation, either version 3 of the License, or
9
# (at your option) any later version.
10
#
11
# Abydos is distributed in the hope that it will be useful,
12
# but WITHOUT ANY WARRANTY; without even the implied warranty of
13
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
# GNU General Public License for more details.
15
#
16
# You should have received a copy of the GNU General Public License
17
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.
18
19
"""abydos.distance.sift4.
20
21
The distance.sift4 module implements Sift4 approximate string distance
22
functions.
23
"""
24
25
from __future__ import division, unicode_literals
26
27
from six.moves import range
28
29
30
__all__ = ['dist_sift4', 'sift4_common', 'sift4_simplest', 'sim_sift4']
31
32
33
def sift4_simplest(src, tar, max_offset=5):
34
    """Return the "simplest" Sift4 distance between two terms.
35
36
    This is an approximation of edit distance, described in
37
    :cite:`Zackwehdex:2014`.
38
39
    :param str src: source string for comparison
40
    :param str tar: target string for comparison
41
    :param max_offset: the number of characters to search for matching letters
42
    :returns: the Sift4 distance according to the simplest formula
43
    :rtype: int
44
45
    >>> sift4_simplest('cat', 'hat')
46
    1
47
    >>> sift4_simplest('Niall', 'Neil')
48
    2
49
    >>> sift4_simplest('Colin', 'Cuilen')
50
    3
51
    >>> sift4_simplest('ATCG', 'TAGC')
52
    2
53
    """
54
    if not src:
55
        return len(tar)
56
57
    if not tar:
58
        return len(src)
59
60
    src_len = len(src)
61
    tar_len = len(tar)
62
63
    src_cur = 0
64
    tar_cur = 0
65
    lcss = 0
66
    local_cs = 0
67
68
    while (src_cur < src_len) and (tar_cur < tar_len):
69
        if src[src_cur] == tar[tar_cur]:
70
            local_cs += 1
71
        else:
72
            lcss += local_cs
73
            local_cs = 0
74
            if src_cur != tar_cur:
75
                src_cur = tar_cur = max(src_cur, tar_cur)
76
            for i in range(max_offset):
77
                if not ((src_cur+i < src_len) or (tar_cur+i < tar_len)):
78
                    break
79
                if (src_cur+i < src_len) and (src[src_cur+i] == tar[tar_cur]):
80
                    src_cur += i
81
                    local_cs += 1
82
                    break
83
                if (tar_cur+i < tar_len) and (src[src_cur] == tar[tar_cur+i]):
84
                    tar_cur += i
85
                    local_cs += 1
86
                    break
87
88
        src_cur += 1
89
        tar_cur += 1
90
91
    lcss += local_cs
92
    return round(max(src_len, tar_len) - lcss)
93
94
95
def sift4_common(src, tar, max_offset=5, max_distance=0):
96
    """Return the "common" Sift4 distance between two terms.
97
98
    This is an approximation of edit distance, described in
99
    :cite:`Zackwehdex:2014`.
100
101
    :param str src: source string for comparison
102
    :param str tar: target string for comparison
103
    :param max_offset: the number of characters to search for matching letters
104
    :param max_distance: the distance at which to stop and exit
105
    :returns: the Sift4 distance according to the common formula
106
    :rtype: int
107
108
    >>> sift4_common('cat', 'hat')
109
    1
110
    >>> sift4_common('Niall', 'Neil')
111
    2
112
    >>> sift4_common('Colin', 'Cuilen')
113
    3
114
    >>> sift4_common('ATCG', 'TAGC')
115
    2
116
    """
117
    if not src:
118
        return len(tar)
119
120
    if not tar:
121
        return len(src)
122
123
    src_len = len(src)
124
    tar_len = len(tar)
125
126
    src_cur = 0
127
    tar_cur = 0
128
    lcss = 0
129
    local_cs = 0
130
    trans = 0
131
    offset_arr = []
132
133
    while (src_cur < src_len) and (tar_cur < tar_len):
134
        if src[src_cur] == tar[tar_cur]:
135
            local_cs += 1
136
            is_trans = False
137
            i = 0
138
            while i < len(offset_arr):
139
                ofs = offset_arr[i]
140
                if src_cur <= ofs['src_cur'] or tar_cur <= ofs['tar_cur']:
141
                    is_trans = (abs(tar_cur-src_cur) >=
142
                                abs(ofs['tar_cur']-ofs['src_cur']))
143
                    if is_trans:
144
                        trans += 1
145
                    elif not ofs['trans']:
146
                        ofs['trans'] = True
147
                        trans += 1
148
                    break
149
                elif src_cur > ofs['tar_cur'] and tar_cur > ofs['src_cur']:
150
                    del offset_arr[i]
151
                else:
152
                    i += 1
153
154
            offset_arr.append({'src_cur': src_cur, 'tar_cur': tar_cur,
155
                               'trans': is_trans})
156
        else:
157
            lcss += local_cs
158
            local_cs = 0
159
            if src_cur != tar_cur:
160
                src_cur = tar_cur = min(src_cur, tar_cur)
161
            for i in range(max_offset):
162
                if not ((src_cur+i < src_len) or (tar_cur+i < tar_len)):
163
                    break
164
                if (src_cur+i < src_len) and (src[src_cur+i] == tar[tar_cur]):
165
                    src_cur += i-1
166
                    tar_cur -= 1
167
                    break
168
                if (tar_cur+i < tar_len) and (src[src_cur] == tar[tar_cur+i]):
169
                    src_cur -= 1
170
                    tar_cur += i-1
171
                    break
172
173
        src_cur += 1
174
        tar_cur += 1
175
176
        if max_distance:
177
            temporary_distance = max(src_cur, tar_cur) - lcss + trans
178
            if temporary_distance >= max_distance:
179
                return round(temporary_distance)
180
181
        if (src_cur >= src_len) or (tar_cur >= tar_len):
182
            lcss += local_cs
183
            local_cs = 0
184
            src_cur = tar_cur = min(src_cur, tar_cur)
185
186
    lcss += local_cs
187
    return round(max(src_len, tar_len) - lcss + trans)
188
189
190
def dist_sift4(src, tar, max_offset=5, max_distance=0):
191
    """Return the normalized "common" Sift4 distance between two terms.
192
193
    This is Sift4 distance, normalized to [0, 1].
194
195
    :param str src: source string for comparison
196
    :param str tar: target string for comparison
197
    :param max_offset: the number of characters to search for matching letters
198
    :param max_distance: the distance at which to stop and exit
199
    :returns: the normalized Sift4 distance
200
    :rtype: float
201
202
    >>> round(dist_sift4('cat', 'hat'), 12)
203
    0.333333333333
204
    >>> dist_sift4('Niall', 'Neil')
205
    0.4
206
    >>> dist_sift4('Colin', 'Cuilen')
207
    0.5
208
    >>> dist_sift4('ATCG', 'TAGC')
209
    0.5
210
    """
211
    return (sift4_common(src, tar, max_offset, max_distance) /
212
            (max(len(src), len(tar), 1)))
213
214
215
def sim_sift4(src, tar, max_offset=5, max_distance=0):
216
    """Return the normalized "common" Sift4 similarity of two terms.
217
218
    Normalized Sift4 similarity is the complement of normalized Sift4 distance:
219
    :math:`sim_{Sift4} = 1 - dist_{Sift4}`.
220
221
    :param str src: source string for comparison
222
    :param str tar: target string for comparison
223
    :param max_offset: the number of characters to search for matching letters
224
    :param max_distance: the distance at which to stop and exit
225
    :returns: the normalized Sift4 similarity
226
    :rtype: float
227
228
    >>> round(sim_sift4('cat', 'hat'), 12)
229
    0.666666666667
230
    >>> sim_sift4('Niall', 'Neil')
231
    0.6
232
    >>> sim_sift4('Colin', 'Cuilen')
233
    0.5
234
    >>> sim_sift4('ATCG', 'TAGC')
235
    0.5
236
    """
237
    return 1-dist_sift4(src, tar, max_offset, max_distance)
238
239
240
if __name__ == '__main__':
241
    import doctest
242
    doctest.testmod()
243