Completed
Branch master (78a222)
by Chris
14:36
created

abydos.distance._jaro.sim_jaro_winkler()   F

Complexity

Conditions 31

Size

Total Lines 160
Code Lines 73

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 57
CRAP Score 31

Importance

Changes 0
Metric Value
eloc 73
dl 0
loc 160
ccs 57
cts 57
cp 1
rs 0
c 0
b 0
f 0
cc 31
nop 7
crap 31

How to fix   Long Method    Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:

Complexity

Complex classes like abydos.distance._jaro.sim_jaro_winkler() often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

1
# -*- coding: utf-8 -*-
2
3
# Copyright 2014-2018 by Christopher C. Little.
4
# This file is part of Abydos.
5
#
6
# Abydos is free software: you can redistribute it and/or modify
7
# it under the terms of the GNU General Public License as published by
8
# the Free Software Foundation, either version 3 of the License, or
9
# (at your option) any later version.
10
#
11
# Abydos is distributed in the hope that it will be useful,
12
# but WITHOUT ANY WARRANTY; without even the implied warranty of
13
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
# GNU General Public License for more details.
15
#
16
# You should have received a copy of the GNU General Public License
17
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.
18
19 1
"""abydos.distance.jaro.
20
21
The distance.jaro module implements distance metrics based on
22
:cite:`Jaro:1989` and subsequent works:
23
24
    - Jaro distance
25
    - Jaro-Winkler distance
26
    - the strcmp95 algorithm variant of Jaro-Winkler distance
27
"""
28
29 1
from __future__ import division, unicode_literals
30
31 1
from collections import defaultdict
32
33 1
from six.moves import range
34
35 1
from ..tokenizer import QGrams
36
37
38 1
__all__ = [
39
    'dist_jaro_winkler',
40
    'dist_strcmp95',
41
    'sim_jaro_winkler',
42
    'sim_strcmp95',
43
]
44
45
46 1
def sim_strcmp95(src, tar, long_strings=False):
0 ignored issues
show
Comprehensibility introduced by
This function exceeds the maximum number of variables (23/15).
Loading history...
47
    """Return the strcmp95 similarity of two strings.
48
49
    This is a Python translation of the C code for strcmp95:
50
    http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c
51
    :cite:`Winkler:1994`.
52
    The above file is a US Government publication and, accordingly,
53
    in the public domain.
54
55
    This is based on the Jaro-Winkler distance, but also attempts to correct
56
    for some common typos and frequently confused characters. It is also
57
    limited to uppercase ASCII characters, so it is appropriate to American
58
    names, but not much else.
59
60
    :param str src: source string for comparison
61
    :param str tar: target string for comparison
62
    :param bool long_strings: set to True to "Increase the probability of a
63
        match when the number of matched characters is large.  This option
64
        allows for a little more tolerance when the strings are large. It is
65
        not an appropriate test when comparing fixed length fields such as
66
        phone and social security numbers."
67
    :returns: strcmp95 similarity
68
    :rtype: float
69
70
    >>> sim_strcmp95('cat', 'hat')
71
    0.7777777777777777
72
    >>> sim_strcmp95('Niall', 'Neil')
73
    0.8454999999999999
74
    >>> sim_strcmp95('aluminum', 'Catalan')
75
    0.6547619047619048
76
    >>> sim_strcmp95('ATCG', 'TAGC')
77
    0.8333333333333334
78
    """
79
80 1
    def _in_range(char):
81
        """Return True if char is in the range (0, 91)."""
82 1
        return 91 > ord(char) > 0
83
84 1
    ying = src.strip().upper()
85 1
    yang = tar.strip().upper()
86
87 1
    if ying == yang:
88 1
        return 1.0
89
    # If either string is blank - return - added in Version 2
90 1
    if not ying or not yang:
91 1
        return 0.0
92
93 1
    adjwt = defaultdict(int)
94 1
    sp_mx = (
95
        ('A', 'E'),
96
        ('A', 'I'),
97
        ('A', 'O'),
98
        ('A', 'U'),
99
        ('B', 'V'),
100
        ('E', 'I'),
101
        ('E', 'O'),
102
        ('E', 'U'),
103
        ('I', 'O'),
104
        ('I', 'U'),
105
        ('O', 'U'),
106
        ('I', 'Y'),
107
        ('E', 'Y'),
108
        ('C', 'G'),
109
        ('E', 'F'),
110
        ('W', 'U'),
111
        ('W', 'V'),
112
        ('X', 'K'),
113
        ('S', 'Z'),
114
        ('X', 'S'),
115
        ('Q', 'C'),
116
        ('U', 'V'),
117
        ('M', 'N'),
118
        ('L', 'I'),
119
        ('Q', 'O'),
120
        ('P', 'R'),
121
        ('I', 'J'),
122
        ('2', 'Z'),
123
        ('5', 'S'),
124
        ('8', 'B'),
125
        ('1', 'I'),
126
        ('1', 'L'),
127
        ('0', 'O'),
128
        ('0', 'Q'),
129
        ('C', 'K'),
130
        ('G', 'J'),
131
    )
132
133
    # Initialize the adjwt array on the first call to the function only.
134
    # The adjwt array is used to give partial credit for characters that
135
    # may be errors due to known phonetic or character recognition errors.
136
    # A typical example is to match the letter "O" with the number "0"
137 1
    for i in sp_mx:
138 1
        adjwt[(i[0], i[1])] = 3
139 1
        adjwt[(i[1], i[0])] = 3
140
141 1
    if len(ying) > len(yang):
142 1
        search_range = len(ying)
143 1
        minv = len(yang)
144
    else:
145 1
        search_range = len(yang)
146 1
        minv = len(ying)
147
148
    # Blank out the flags
149 1
    ying_flag = [0] * search_range
150 1
    yang_flag = [0] * search_range
151 1
    search_range = max(0, search_range // 2 - 1)
152
153
    # Looking only within the search range, count and flag the matched pairs.
154 1
    num_com = 0
155 1
    yl1 = len(yang) - 1
156 1
    for i in range(len(ying)):
0 ignored issues
show
unused-code introduced by
Consider using enumerate instead of iterating with range and len
Loading history...
157 1
        low_lim = (i - search_range) if (i >= search_range) else 0
158 1
        hi_lim = (i + search_range) if ((i + search_range) <= yl1) else yl1
159 1
        for j in range(low_lim, hi_lim + 1):
160 1
            if (yang_flag[j] == 0) and (yang[j] == ying[i]):
161 1
                yang_flag[j] = 1
162 1
                ying_flag[i] = 1
163 1
                num_com += 1
164 1
                break
165
166
    # If no characters in common - return
167 1
    if num_com == 0:
168 1
        return 0.0
169
170
    # Count the number of transpositions
171 1
    k = n_trans = 0
172 1
    for i in range(len(ying)):
0 ignored issues
show
unused-code introduced by
Consider using enumerate instead of iterating with range and len
Loading history...
173 1
        if ying_flag[i] != 0:
174 1
            j = 0
175 1
            for j in range(k, len(yang)):  # pragma: no branch
176 1
                if yang_flag[j] != 0:
177 1
                    k = j + 1
178 1
                    break
179 1
            if ying[i] != yang[j]:
180 1
                n_trans += 1
181 1
    n_trans //= 2
182
183
    # Adjust for similarities in unmatched characters
184 1
    n_simi = 0
185 1
    if minv > num_com:
0 ignored issues
show
unused-code introduced by
Too many nested blocks (6/5)
Loading history...
186 1
        for i in range(len(ying)):
0 ignored issues
show
unused-code introduced by
Consider using enumerate instead of iterating with range and len
Loading history...
187 1
            if ying_flag[i] == 0 and _in_range(ying[i]):
188 1
                for j in range(len(yang)):
0 ignored issues
show
unused-code introduced by
Consider using enumerate instead of iterating with range and len
Loading history...
189 1
                    if yang_flag[j] == 0 and _in_range(yang[j]):
190 1
                        if (ying[i], yang[j]) in adjwt:
191 1
                            n_simi += adjwt[(ying[i], yang[j])]
192 1
                            yang_flag[j] = 2
193 1
                            break
194 1
    num_sim = n_simi / 10.0 + num_com
195
196
    # Main weight computation
197 1
    weight = (
198
        num_sim / len(ying)
199
        + num_sim / len(yang)
200
        + (num_com - n_trans) / num_com
201
    )
202 1
    weight /= 3.0
203
204
    # Continue to boost the weight if the strings are similar
205 1
    if weight > 0.7:
206
207
        # Adjust for having up to the first 4 characters in common
208 1
        j = 4 if (minv >= 4) else minv
209 1
        i = 0
210 1
        while (i < j) and (ying[i] == yang[i]) and (not ying[i].isdigit()):
211 1
            i += 1
212 1
        if i:
213 1
            weight += i * 0.1 * (1.0 - weight)
214
215
        # Optionally adjust for long strings.
216
217
        # After agreeing beginning chars, at least two more must agree and
218
        # the agreeing characters must be > .5 of remaining characters.
219 1
        if (
220
            long_strings
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
221
            and (minv > 4)
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
222
            and (num_com > i + 1)
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
223
            and (2 * num_com >= minv + i)
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
224
        ):
225 1
            if not ying[0].isdigit():
226 1
                weight += (1.0 - weight) * (
227
                    (num_com - i - 1) / (len(ying) + len(yang) - i * 2 + 2)
228
                )
229
230 1
    return weight
231
232
233 1
def dist_strcmp95(src, tar, long_strings=False):
234
    """Return the strcmp95 distance between two strings.
235
236
    strcmp95 distance is the complement of strcmp95 similarity:
237
    :math:`dist_{strcmp95} = 1 - sim_{strcmp95}`.
238
239
    :param str src: source string for comparison
240
    :param str tar: target string for comparison
241
    :param bool long_strings: set to True to "Increase the probability of a
242
        match when the number of matched characters is large.  This option
243
        allows for a little more tolerance when the strings are large. It is
244
        not an appropriate test when comparing fixed length fields such as
245
        phone and social security numbers."
246
    :returns: strcmp95 distance
247
    :rtype: float
248
249
    >>> round(dist_strcmp95('cat', 'hat'), 12)
250
    0.222222222222
251
    >>> round(dist_strcmp95('Niall', 'Neil'), 12)
252
    0.1545
253
    >>> round(dist_strcmp95('aluminum', 'Catalan'), 12)
254
    0.345238095238
255
    >>> round(dist_strcmp95('ATCG', 'TAGC'), 12)
256
    0.166666666667
257
    """
258 1
    return 1 - sim_strcmp95(src, tar, long_strings)
259
260
261 1
def sim_jaro_winkler(
0 ignored issues
show
best-practice introduced by
Too many arguments (7/5)
Loading history...
Comprehensibility introduced by
This function exceeds the maximum number of variables (22/15).
Loading history...
262
    src,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
263
    tar,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
264
    qval=1,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
265
    mode='winkler',
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
266
    long_strings=False,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
267
    boost_threshold=0.7,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
268
    scaling_factor=0.1,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
269
):
270
    """Return the Jaro or Jaro-Winkler similarity of two strings.
271
272
    Jaro(-Winkler) distance is a string edit distance initially proposed by
273
    Jaro and extended by Winkler :cite:`Jaro:1989,Winkler:1990`.
274
275
    This is Python based on the C code for strcmp95:
276
    http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c
277
    :cite:`Winkler:1994`. The above file is a US Government publication and,
278
    accordingly, in the public domain.
279
280
    :param str src: source string for comparison
281
    :param str tar: target string for comparison
282
    :param int qval: the length of each q-gram (defaults to 1: character-wise
283
        matching)
284
    :param str mode: indicates which variant of this distance metric to
285
        compute:
286
287
            - 'winkler' -- computes the Jaro-Winkler distance (default) which
288
              increases the score for matches near the start of the word
289
            - 'jaro' -- computes the Jaro distance
290
291
    The following arguments apply only when mode is 'winkler':
292
293
    :param bool long_strings: set to True to "Increase the probability of a
294
        match when the number of matched characters is large.  This option
295
        allows for a little more tolerance when the strings are large.  It is
296
        not an appropriate test when comparing fixed length fields such as
297
        phone and social security numbers."
298
    :param float boost_threshold: a value between 0 and 1, below which the
299
        Winkler boost is not applied (defaults to 0.7)
300
    :param float scaling_factor: a value between 0 and 0.25, indicating by how
301
        much to boost scores for matching prefixes (defaults to 0.1)
302
303
    :returns: Jaro or Jaro-Winkler similarity
304
    :rtype: float
305
306
    >>> round(sim_jaro_winkler('cat', 'hat'), 12)
307
    0.777777777778
308
    >>> round(sim_jaro_winkler('Niall', 'Neil'), 12)
309
    0.805
310
    >>> round(sim_jaro_winkler('aluminum', 'Catalan'), 12)
311
    0.60119047619
312
    >>> round(sim_jaro_winkler('ATCG', 'TAGC'), 12)
313
    0.833333333333
314
315
    >>> round(sim_jaro_winkler('cat', 'hat', mode='jaro'), 12)
316
    0.777777777778
317
    >>> round(sim_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
318
    0.783333333333
319
    >>> round(sim_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
320
    0.60119047619
321
    >>> round(sim_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
322
    0.833333333333
323
    """
324 1
    if mode == 'winkler':
325 1
        if boost_threshold > 1 or boost_threshold < 0:
326 1
            raise ValueError(
327
                'Unsupported boost_threshold assignment; '
328
                + 'boost_threshold must be between 0 and 1.'
329
            )
330 1
        if scaling_factor > 0.25 or scaling_factor < 0:
331 1
            raise ValueError(
332
                'Unsupported scaling_factor assignment; '
333
                + 'scaling_factor must be between 0 and 0.25.'
334
            )
335
336 1
    if src == tar:
337 1
        return 1.0
338
339 1
    src = QGrams(src.strip(), qval).ordered_list
340 1
    tar = QGrams(tar.strip(), qval).ordered_list
341
342 1
    lens = len(src)
343 1
    lent = len(tar)
344
345
    # If either string is blank - return - added in Version 2
346 1
    if lens == 0 or lent == 0:
347 1
        return 0.0
348
349 1
    if lens > lent:
350 1
        search_range = lens
351 1
        minv = lent
352
    else:
353 1
        search_range = lent
354 1
        minv = lens
355
356
    # Zero out the flags
357 1
    src_flag = [0] * search_range
358 1
    tar_flag = [0] * search_range
359 1
    search_range = max(0, search_range // 2 - 1)
360
361
    # Looking only within the search range, count and flag the matched pairs.
362 1
    num_com = 0
363 1
    yl1 = lent - 1
364 1
    for i in range(lens):
365 1
        low_lim = (i - search_range) if (i >= search_range) else 0
366 1
        hi_lim = (i + search_range) if ((i + search_range) <= yl1) else yl1
367 1
        for j in range(low_lim, hi_lim + 1):
368 1
            if (tar_flag[j] == 0) and (tar[j] == src[i]):
369 1
                tar_flag[j] = 1
370 1
                src_flag[i] = 1
371 1
                num_com += 1
372 1
                break
373
374
    # If no characters in common - return
375 1
    if num_com == 0:
376 1
        return 0.0
377
378
    # Count the number of transpositions
379 1
    k = n_trans = 0
380 1
    for i in range(lens):
381 1
        if src_flag[i] != 0:
382 1
            j = 0
383 1
            for j in range(k, lent):  # pragma: no branch
384 1
                if tar_flag[j] != 0:
385 1
                    k = j + 1
386 1
                    break
387 1
            if src[i] != tar[j]:
388 1
                n_trans += 1
389 1
    n_trans //= 2
390
391
    # Main weight computation for Jaro distance
392 1
    weight = num_com / lens + num_com / lent + (num_com - n_trans) / num_com
393 1
    weight /= 3.0
394
395
    # Continue to boost the weight if the strings are similar
396
    # This is the Winkler portion of Jaro-Winkler distance
397 1
    if mode == 'winkler' and weight > boost_threshold:
398
399
        # Adjust for having up to the first 4 characters in common
400 1
        j = 4 if (minv >= 4) else minv
401 1
        i = 0
402 1
        while (i < j) and (src[i] == tar[i]):
403 1
            i += 1
404 1
        weight += i * scaling_factor * (1.0 - weight)
405
406
        # Optionally adjust for long strings.
407
408
        # After agreeing beginning chars, at least two more must agree and
409
        # the agreeing characters must be > .5 of remaining characters.
410 1
        if (
411
            long_strings
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
412
            and (minv > 4)
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
413
            and (num_com > i + 1)
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
414
            and (2 * num_com >= minv + i)
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
415
        ):
416 1
            weight += (1.0 - weight) * (
417
                (num_com - i - 1) / (lens + lent - i * 2 + 2)
418
            )
419
420 1
    return weight
421
422
423 1
def dist_jaro_winkler(
0 ignored issues
show
best-practice introduced by
Too many arguments (7/5)
Loading history...
424
    src,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
425
    tar,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
426
    qval=1,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
427
    mode='winkler',
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
428
    long_strings=False,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
429
    boost_threshold=0.7,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
430
    scaling_factor=0.1,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
431
):
432
    """Return the Jaro or Jaro-Winkler distance between two strings.
433
434
    Jaro(-Winkler) similarity is the complement of Jaro(-Winkler) distance:
435
    :math:`sim_{Jaro(-Winkler)} = 1 - dist_{Jaro(-Winkler)}`.
436
437
    :param str src: source string for comparison
438
    :param str tar: target string for comparison
439
    :param int qval: the length of each q-gram (defaults to 1: character-wise
440
        matching)
441
    :param str mode: indicates which variant of this distance metric to
442
        compute:
443
444
            - 'winkler' -- computes the Jaro-Winkler distance (default) which
445
              increases the score for matches near the start of the word
446
            - 'jaro' -- computes the Jaro distance
447
448
    The following arguments apply only when mode is 'winkler':
449
450
    :param bool long_strings: set to True to "Increase the probability of a
451
        match when the number of matched characters is large.  This option
452
        allows for a little more tolerance when the strings are large.  It is
453
        not an appropriate test when comparing fixed length fields such as
454
        phone and social security numbers."
455
    :param float boost_threshold: a value between 0 and 1, below which the
456
        Winkler boost is not applied (defaults to 0.7)
457
    :param float scaling_factor: a value between 0 and 0.25, indicating by how
458
        much to boost scores for matching prefixes (defaults to 0.1)
459
460
    :returns: Jaro or Jaro-Winkler distance
461
    :rtype: float
462
463
    >>> round(dist_jaro_winkler('cat', 'hat'), 12)
464
    0.222222222222
465
    >>> round(dist_jaro_winkler('Niall', 'Neil'), 12)
466
    0.195
467
    >>> round(dist_jaro_winkler('aluminum', 'Catalan'), 12)
468
    0.39880952381
469
    >>> round(dist_jaro_winkler('ATCG', 'TAGC'), 12)
470
    0.166666666667
471
472
    >>> round(dist_jaro_winkler('cat', 'hat', mode='jaro'), 12)
473
    0.222222222222
474
    >>> round(dist_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
475
    0.216666666667
476
    >>> round(dist_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
477
    0.39880952381
478
    >>> round(dist_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
479
    0.166666666667
480
    """
481 1
    return 1 - sim_jaro_winkler(
482
        src, tar, qval, mode, long_strings, boost_threshold, scaling_factor
483
    )
484
485
486
if __name__ == '__main__':
487
    import doctest
488
489
    doctest.testmod()
490