Completed
Pull Request — master (#141)
by Chris
13:03
created

abydos.distance._jaro   F

Complexity

Total Complexity 72

Size/Duplication

Total Lines 629
Duplicated Lines 0 %

Test Coverage

Coverage 100%

Importance

Changes 0
Metric Value
wmc 72
eloc 231
dl 0
loc 629
ccs 144
cts 144
cp 1
rs 2.64
c 0
b 0
f 0

4 Functions

Rating   Name                  Duplication   Size   Complexity
A        sim_strcmp95()        0             29     1
A        sim_jaro_winkler()    0             62     1
A        dist_strcmp95()       0             29     1
A        dist_jaro_winkler()   0             62     1

2 Methods

Rating   Name                Duplication   Size   Complexity
F        JaroWinkler.sim()   0             165    31
F        Strcmp95.sim()      0             150    37

How to fix   Complexity   

Complexity

Complex classes like abydos.distance._jaro often do a lot of different things. To break such a class down, we need to identify a cohesive component within it. A common way to find such a component is to look for fields and methods that share the same prefixes or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
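As a rough sketch of what such an extraction might look like for this module: the match-flagging loop and the transposition-counting loop appear in both Strcmp95.sim() and JaroWinkler.sim(), so they are one candidate for a shared component. The helper names `count_matches`, `count_transpositions` and `jaro_similarity` are hypothetical and not part of Abydos. The loop bodies mirror the ones in the listing, restricted to plain character-wise matching and assuming Python 3 division.

```python
def count_matches(src, tar):
    """Flag and count characters of src that match tar within the search range."""
    longest = max(len(src), len(tar))
    src_flag = [0] * longest
    tar_flag = [0] * longest
    search_range = max(0, longest // 2 - 1)

    num_com = 0
    last = len(tar) - 1
    for i, src_char in enumerate(src):
        low_lim = max(0, i - search_range)
        hi_lim = min(i + search_range, last)
        for j in range(low_lim, hi_lim + 1):
            if tar_flag[j] == 0 and tar[j] == src_char:
                tar_flag[j] = 1
                src_flag[i] = 1
                num_com += 1
                break
    return num_com, src_flag, tar_flag


def count_transpositions(src, tar, src_flag, tar_flag):
    """Count transpositions among the flagged (matched) characters."""
    k = n_trans = 0
    for i, src_char in enumerate(src):
        if src_flag[i] != 0:
            j = 0
            for j in range(k, len(tar)):
                if tar_flag[j] != 0:
                    k = j + 1
                    break
            if src_char != tar[j]:
                n_trans += 1
    return n_trans // 2


def jaro_similarity(src, tar):
    """Plain Jaro similarity built from the two extracted helpers."""
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    num_com, src_flag, tar_flag = count_matches(src, tar)
    if num_com == 0:
        return 0.0
    n_trans = count_transpositions(src, tar, src_flag, tar_flag)
    weight = (
        num_com / len(src)
        + num_com / len(tar)
        + (num_com - n_trans) / num_com
    )
    return weight / 3.0
```

With helpers like these, each sim() method would keep only its own adjustments (the similar-character credit in Strcmp95, the prefix boost in both), which should lower the per-method complexity and variable counts. The sketch also iterates with enumerate rather than range(len(...)).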

# -*- coding: utf-8 -*-

# Copyright 2014-2018 by Christopher C. Little.
# This file is part of Abydos.
#
# Abydos is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Abydos is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.

"""abydos.distance.jaro.

The distance.jaro module implements distance metrics based on
:cite:`Jaro:1989` and subsequent works:

    - Jaro distance
    - Jaro-Winkler distance
    - the strcmp95 algorithm variant of Jaro-Winkler distance
"""

from __future__ import division, unicode_literals

from collections import defaultdict

from six.moves import range

from ._distance import Distance
from ..tokenizer import QGrams

__all__ = [
    'JaroWinkler',
    'Strcmp95',
    'dist_jaro_winkler',
    'dist_strcmp95',
    'sim_jaro_winkler',
    'sim_strcmp95',
]


class Strcmp95(Distance):
    # Issue (Unused Code): The variable __class__ seems to be unused.
    """Strcmp95.

    This is a Python translation of the C code for strcmp95:
    http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c
    :cite:`Winkler:1994`.
    The above file is a US Government publication and, accordingly,
    in the public domain.

    This is based on the Jaro-Winkler distance, but also attempts to correct
    for some common typos and frequently confused characters. It is also
    limited to uppercase ASCII characters, so it is appropriate to American
    names, but not much else.
    """

    _sp_mx = (
        ('A', 'E'),
        ('A', 'I'),
        ('A', 'O'),
        ('A', 'U'),
        ('B', 'V'),
        ('E', 'I'),
        ('E', 'O'),
        ('E', 'U'),
        ('I', 'O'),
        ('I', 'U'),
        ('O', 'U'),
        ('I', 'Y'),
        ('E', 'Y'),
        ('C', 'G'),
        ('E', 'F'),
        ('W', 'U'),
        ('W', 'V'),
        ('X', 'K'),
        ('S', 'Z'),
        ('X', 'S'),
        ('Q', 'C'),
        ('U', 'V'),
        ('M', 'N'),
        ('L', 'I'),
        ('Q', 'O'),
        ('P', 'R'),
        ('I', 'J'),
        ('2', 'Z'),
        ('5', 'S'),
        ('8', 'B'),
        ('1', 'I'),
        ('1', 'L'),
        ('0', 'O'),
        ('0', 'Q'),
        ('C', 'K'),
        ('G', 'J'),
    )

    def sim(self, src, tar, long_strings=False):
        # Issue (Comprehensibility): This function exceeds the maximum number
        # of variables (24/15).
        # Issue (Bug): Parameters differ from overridden 'sim' method
        """Return the strcmp95 similarity of two strings.

        Args:
            src (str): Source string for comparison
            tar (str): Target string for comparison
            long_strings (bool): Set to True to increase the probability of a
                match when the number of matched characters is large. This
                option allows for a little more tolerance when the strings are
                large. It is not an appropriate test when comparing fixed
                length fields such as phone and social security numbers.

        Returns:
            float: Strcmp95 similarity

        Examples:
            >>> cmp = Strcmp95()
            >>> cmp.sim('cat', 'hat')
            0.7777777777777777
            >>> cmp.sim('Niall', 'Neil')
            0.8454999999999999
            >>> cmp.sim('aluminum', 'Catalan')
            0.6547619047619048
            >>> cmp.sim('ATCG', 'TAGC')
            0.8333333333333334

        """

        def _in_range(char):
            """Return True if char is in the range (0, 91).

            Args:
                char (str): The character to check

            Returns:
                bool: True if char is in the range (0, 91)

            """
            return 91 > ord(char) > 0

        ying = src.strip().upper()
        yang = tar.strip().upper()

        if ying == yang:
            return 1.0
        # If either string is blank - return - added in Version 2
        if not ying or not yang:
            return 0.0

        adjwt = defaultdict(int)

        # Build the adjwt table. (The C original initializes it on the first
        # call only; this translation rebuilds it on every call.)
        # The adjwt table is used to give partial credit for characters that
        # may be errors due to known phonetic or character recognition errors.
        # A typical example is to match the letter "O" with the number "0"
        for i in self._sp_mx:
            adjwt[(i[0], i[1])] = 3
            adjwt[(i[1], i[0])] = 3

        if len(ying) > len(yang):
            search_range = len(ying)
            minv = len(yang)
        else:
            search_range = len(yang)
            minv = len(ying)

        # Blank out the flags
        ying_flag = [0] * search_range
        yang_flag = [0] * search_range
        search_range = max(0, search_range // 2 - 1)

        # Looking only within the search range,
        # count and flag the matched pairs.
        num_com = 0
        yl1 = len(yang) - 1
        for i in range(len(ying)):
            # Issue (unused-code): Consider using enumerate instead of
            # iterating with range and len
            low_lim = (i - search_range) if (i >= search_range) else 0
            hi_lim = (i + search_range) if ((i + search_range) <= yl1) else yl1
            for j in range(low_lim, hi_lim + 1):
                if (yang_flag[j] == 0) and (yang[j] == ying[i]):
                    yang_flag[j] = 1
                    ying_flag[i] = 1
                    num_com += 1
                    break

        # If no characters in common - return
        if num_com == 0:
            return 0.0

        # Count the number of transpositions
        k = n_trans = 0
        for i in range(len(ying)):
            # Issue (unused-code): Consider using enumerate instead of
            # iterating with range and len
            if ying_flag[i] != 0:
                j = 0
                for j in range(k, len(yang)):  # pragma: no branch
                    if yang_flag[j] != 0:
                        k = j + 1
                        break
                if ying[i] != yang[j]:
                    n_trans += 1
        n_trans //= 2

        # Adjust for similarities in unmatched characters
        n_simi = 0
        if minv > num_com:
            # Issue (unused-code): Too many nested blocks (6/5)
            for i in range(len(ying)):
                # Issue (unused-code): Consider using enumerate instead of
                # iterating with range and len
                if ying_flag[i] == 0 and _in_range(ying[i]):
                    for j in range(len(yang)):
                        # Issue (unused-code): Consider using enumerate
                        # instead of iterating with range and len
                        if yang_flag[j] == 0 and _in_range(yang[j]):
                            if (ying[i], yang[j]) in adjwt:
                                n_simi += adjwt[(ying[i], yang[j])]
                                yang_flag[j] = 2
                                break
        num_sim = n_simi / 10.0 + num_com

        # Main weight computation
        weight = (
            num_sim / len(ying)
            + num_sim / len(yang)
            + (num_com - n_trans) / num_com
        )
        weight /= 3.0

        # Continue to boost the weight if the strings are similar
        if weight > 0.7:

            # Adjust for having up to the first 4 characters in common
            j = 4 if (minv >= 4) else minv
            i = 0
            while (i < j) and (ying[i] == yang[i]) and (not ying[i].isdigit()):
                i += 1
            if i:
                weight += i * 0.1 * (1.0 - weight)

            # Optionally adjust for long strings.

            # After agreeing beginning chars, at least two more must agree and
            # the agreeing characters must be > .5 of remaining characters.
            if (
                # Issue (Coding Style), reported on each of the 4 condition
                # lines below: Wrong hanging indentation before block
                # (add 4 spaces).
                long_strings
                and (minv > 4)
                and (num_com > i + 1)
                and (2 * num_com >= minv + i)
            ):
                if not ying[0].isdigit():
                    weight += (1.0 - weight) * (
                        (num_com - i - 1) / (len(ying) + len(yang) - i * 2 + 2)
                    )

        return weight


def sim_strcmp95(src, tar, long_strings=False):
    """Return the strcmp95 similarity of two strings.

    This is a wrapper for :py:meth:`Strcmp95.sim`.

    Args:
        src (str): Source string for comparison
        tar (str): Target string for comparison
        long_strings (bool): Set to True to increase the probability of a
            match when the number of matched characters is large. This option
            allows for a little more tolerance when the strings are large. It
            is not an appropriate test when comparing fixed length fields such
            as phone and social security numbers.

    Returns:
        float: strcmp95 similarity

    Examples:
        >>> sim_strcmp95('cat', 'hat')
        0.7777777777777777
        >>> sim_strcmp95('Niall', 'Neil')
        0.8454999999999999
        >>> sim_strcmp95('aluminum', 'Catalan')
        0.6547619047619048
        >>> sim_strcmp95('ATCG', 'TAGC')
        0.8333333333333334

    """
    return Strcmp95().sim(src, tar, long_strings)


def dist_strcmp95(src, tar, long_strings=False):
    """Return the strcmp95 distance between two strings.

    This is a wrapper for :py:meth:`Strcmp95.dist`.

    Args:
        src (str): Source string for comparison
        tar (str): Target string for comparison
        long_strings (bool): Set to True to increase the probability of a
            match when the number of matched characters is large. This option
            allows for a little more tolerance when the strings are large. It
            is not an appropriate test when comparing fixed length fields such
            as phone and social security numbers.

    Returns:
        float: strcmp95 distance

    Examples:
        >>> round(dist_strcmp95('cat', 'hat'), 12)
        0.222222222222
        >>> round(dist_strcmp95('Niall', 'Neil'), 12)
        0.1545
        >>> round(dist_strcmp95('aluminum', 'Catalan'), 12)
        0.345238095238
        >>> round(dist_strcmp95('ATCG', 'TAGC'), 12)
        0.166666666667

    """
    return Strcmp95().dist(src, tar, long_strings)


class JaroWinkler(Distance):
    # Issue (Unused Code): The variable __class__ seems to be unused.
    """Jaro-Winkler distance.

    Jaro(-Winkler) distance is a string edit distance initially proposed by
    Jaro and extended by Winkler :cite:`Jaro:1989,Winkler:1990`.

    This Python implementation is based on the C code for strcmp95:
    http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c
    :cite:`Winkler:1994`. The above file is a US Government publication and,
    accordingly, in the public domain.
    """

    def sim(
        # Issue (best-practice): Too many arguments (8/5)
        # Issue (Comprehensibility): This function exceeds the maximum number
        # of variables (24/15).
        # Issue (Bug): Parameters differ from overridden 'sim' method
        # Issue (Coding Style), reported on each of the 8 parameter lines
        # below: Wrong hanging indentation before block (add 4 spaces).
        self,
        src,
        tar,
        qval=1,
        mode='winkler',
        long_strings=False,
        boost_threshold=0.7,
        scaling_factor=0.1,
    ):
        """Return the Jaro or Jaro-Winkler similarity of two strings.

        Args:
            src (str): Source string for comparison
            tar (str): Target string for comparison
            qval (int): The length of each q-gram (defaults to 1:
                character-wise matching)
            mode (str): Indicates which variant of this distance metric to
                compute:
                    - ``winkler`` -- computes the Jaro-Winkler distance
                      (default) which increases the score for matches near the
                      start of the word
                    - ``jaro`` -- computes the Jaro distance
            long_strings (bool): Set to True to "Increase the probability of a
                match when the number of matched characters is large. This
                option allows for a little more tolerance when the strings are
                large. It is not an appropriate test when comparing fixed
                length fields such as phone and social security numbers."
                (Used in 'winkler' mode only.)
            boost_threshold (float): A value between 0 and 1, below which the
                Winkler boost is not applied (defaults to 0.7). (Used in
                'winkler' mode only.)
            scaling_factor (float): A value between 0 and 0.25, indicating by
                how much to boost scores for matching prefixes (defaults to
                0.1). (Used in 'winkler' mode only.)

        Returns:
            float: Jaro or Jaro-Winkler similarity

        Raises:
            ValueError: Unsupported boost_threshold assignment; boost_threshold
                must be between 0 and 1.
            ValueError: Unsupported scaling_factor assignment; scaling_factor
                must be between 0 and 0.25.

        Examples:
            >>> round(sim_jaro_winkler('cat', 'hat'), 12)
            0.777777777778
            >>> round(sim_jaro_winkler('Niall', 'Neil'), 12)
            0.805
            >>> round(sim_jaro_winkler('aluminum', 'Catalan'), 12)
            0.60119047619
            >>> round(sim_jaro_winkler('ATCG', 'TAGC'), 12)
            0.833333333333

            >>> round(sim_jaro_winkler('cat', 'hat', mode='jaro'), 12)
            0.777777777778
            >>> round(sim_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
            0.783333333333
            >>> round(sim_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
            0.60119047619
            >>> round(sim_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
            0.833333333333

        """
        if mode == 'winkler':
            if boost_threshold > 1 or boost_threshold < 0:
                raise ValueError(
                    'Unsupported boost_threshold assignment; '
                    + 'boost_threshold must be between 0 and 1.'
                )
            if scaling_factor > 0.25 or scaling_factor < 0:
                raise ValueError(
                    'Unsupported scaling_factor assignment; '
                    + 'scaling_factor must be between 0 and 0.25.'
                )

        if src == tar:
            return 1.0

        src = QGrams(src.strip(), qval).ordered_list
        tar = QGrams(tar.strip(), qval).ordered_list

        lens = len(src)
        lent = len(tar)

        # If either string is blank - return - added in Version 2
        if lens == 0 or lent == 0:
            return 0.0

        if lens > lent:
            search_range = lens
            minv = lent
        else:
            search_range = lent
            minv = lens

        # Zero out the flags
        src_flag = [0] * search_range
        tar_flag = [0] * search_range
        search_range = max(0, search_range // 2 - 1)

        # Looking only within the search range,
        # count and flag the matched pairs.
        num_com = 0
        yl1 = lent - 1
        for i in range(lens):
            low_lim = (i - search_range) if (i >= search_range) else 0
            hi_lim = (i + search_range) if ((i + search_range) <= yl1) else yl1
            for j in range(low_lim, hi_lim + 1):
                if (tar_flag[j] == 0) and (tar[j] == src[i]):
                    tar_flag[j] = 1
                    src_flag[i] = 1
                    num_com += 1
                    break

        # If no characters in common - return
        if num_com == 0:
            return 0.0

        # Count the number of transpositions
        k = n_trans = 0
        for i in range(lens):
            if src_flag[i] != 0:
                j = 0
                for j in range(k, lent):  # pragma: no branch
                    if tar_flag[j] != 0:
                        k = j + 1
                        break
                if src[i] != tar[j]:
                    n_trans += 1
        n_trans //= 2

        # Main weight computation for Jaro distance
        weight = (
            num_com / lens + num_com / lent + (num_com - n_trans) / num_com
        )
        weight /= 3.0

        # Continue to boost the weight if the strings are similar
        # This is the Winkler portion of Jaro-Winkler distance
        if mode == 'winkler' and weight > boost_threshold:

            # Adjust for having up to the first 4 characters in common
            j = 4 if (minv >= 4) else minv
            i = 0
            while (i < j) and (src[i] == tar[i]):
                i += 1
            weight += i * scaling_factor * (1.0 - weight)

            # Optionally adjust for long strings.

            # After agreeing beginning chars, at least two more must agree and
            # the agreeing characters must be > .5 of remaining characters.
            if (
                # Issue (Coding Style), reported on each of the 4 condition
                # lines below: Wrong hanging indentation before block
                # (add 4 spaces).
                long_strings
                and (minv > 4)
                and (num_com > i + 1)
                and (2 * num_com >= minv + i)
            ):
                weight += (1.0 - weight) * (
                    (num_com - i - 1) / (lens + lent - i * 2 + 2)
                )

        return weight


def sim_jaro_winkler(
    # Issue (best-practice): Too many arguments (7/5)
    # Issue (Coding Style), reported on each of the 7 parameter lines below:
    # Wrong hanging indentation before block (add 4 spaces).
    src,
    tar,
    qval=1,
    mode='winkler',
    long_strings=False,
    boost_threshold=0.7,
    scaling_factor=0.1,
):
    """Return the Jaro or Jaro-Winkler similarity of two strings.

    This is a wrapper for :py:meth:`JaroWinkler.sim`.

    Args:
        src (str): Source string for comparison
        tar (str): Target string for comparison
        qval (int): The length of each q-gram (defaults to 1:
            character-wise matching)
        mode (str): Indicates which variant of this distance metric to
            compute:
                - ``winkler`` -- computes the Jaro-Winkler distance (default)
                  which increases the score for matches near the start of the
                  word
                - ``jaro`` -- computes the Jaro distance
        long_strings (bool): Set to True to "Increase the probability of a
            match when the number of matched characters is large. This option
            allows for a little more tolerance when the strings are large. It
            is not an appropriate test when comparing fixed length fields such
            as phone and social security numbers." (Used in 'winkler' mode
            only.)
        boost_threshold (float): A value between 0 and 1, below which the
            Winkler boost is not applied (defaults to 0.7). (Used in 'winkler'
            mode only.)
        scaling_factor (float): A value between 0 and 0.25, indicating by how
            much to boost scores for matching prefixes (defaults to 0.1). (Used
            in 'winkler' mode only.)

    Returns:
        float: Jaro or Jaro-Winkler similarity

    Examples:
        >>> round(sim_jaro_winkler('cat', 'hat'), 12)
        0.777777777778
        >>> round(sim_jaro_winkler('Niall', 'Neil'), 12)
        0.805
        >>> round(sim_jaro_winkler('aluminum', 'Catalan'), 12)
        0.60119047619
        >>> round(sim_jaro_winkler('ATCG', 'TAGC'), 12)
        0.833333333333

        >>> round(sim_jaro_winkler('cat', 'hat', mode='jaro'), 12)
        0.777777777778
        >>> round(sim_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
        0.783333333333
        >>> round(sim_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
        0.60119047619
        >>> round(sim_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
        0.833333333333

    """
    return JaroWinkler().sim(
        src, tar, qval, mode, long_strings, boost_threshold, scaling_factor
    )


def dist_jaro_winkler(
    # Issue (best-practice): Too many arguments (7/5)
    # Issue (Coding Style), reported on each of the 7 parameter lines below:
    # Wrong hanging indentation before block (add 4 spaces).
    src,
    tar,
    qval=1,
    mode='winkler',
    long_strings=False,
    boost_threshold=0.7,
    scaling_factor=0.1,
):
    """Return the Jaro or Jaro-Winkler distance between two strings.

    This is a wrapper for :py:meth:`JaroWinkler.dist`.

    Args:
        src (str): Source string for comparison
        tar (str): Target string for comparison
        qval (int): The length of each q-gram (defaults to 1:
            character-wise matching)
        mode (str): Indicates which variant of this distance metric to
            compute:
                - ``winkler`` -- computes the Jaro-Winkler distance (default)
                  which increases the score for matches near the start of the
                  word
                - ``jaro`` -- computes the Jaro distance
        long_strings (bool): Set to True to "Increase the probability of a
            match when the number of matched characters is large. This option
            allows for a little more tolerance when the strings are large. It
            is not an appropriate test when comparing fixed length fields such
            as phone and social security numbers." (Used in 'winkler' mode
            only.)
        boost_threshold (float): A value between 0 and 1, below which the
            Winkler boost is not applied (defaults to 0.7). (Used in 'winkler'
            mode only.)
        scaling_factor (float): A value between 0 and 0.25, indicating by how
            much to boost scores for matching prefixes (defaults to 0.1). (Used
            in 'winkler' mode only.)

    Returns:
        float: Jaro or Jaro-Winkler distance

    Examples:
        >>> round(dist_jaro_winkler('cat', 'hat'), 12)
        0.222222222222
        >>> round(dist_jaro_winkler('Niall', 'Neil'), 12)
        0.195
        >>> round(dist_jaro_winkler('aluminum', 'Catalan'), 12)
        0.39880952381
        >>> round(dist_jaro_winkler('ATCG', 'TAGC'), 12)
        0.166666666667

        >>> round(dist_jaro_winkler('cat', 'hat', mode='jaro'), 12)
        0.222222222222
        >>> round(dist_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
        0.216666666667
        >>> round(dist_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
        0.39880952381
        >>> round(dist_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
        0.166666666667

    """
    return JaroWinkler().dist(
        src, tar, qval, mode, long_strings, boost_threshold, scaling_factor
    )


if __name__ == '__main__':
    import doctest

    doctest.testmod()
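Both sim() methods in the listing are flagged with "Parameters differ from overridden 'sim' method". One common way to address that kind of warning is to move the optional arguments into the constructor, so the override keeps the base-class signature. The sketch below is only an illustration under stated assumptions: the `Distance` class here is a stand-in whose two-argument signature is assumed rather than taken from Abydos, and `PrefixBoost` is a hypothetical toy measure, not one of the classes above.

```python
class Distance(object):
    """Stand-in base class; its (src, tar) signature is an assumption."""

    def sim(self, src, tar):
        return 1.0 - self.dist(src, tar)

    def dist(self, src, tar):
        return 1.0 - self.sim(src, tar)


class PrefixBoost(Distance):
    """Toy measure: options live on the instance, not in sim()'s signature."""

    def __init__(self, scaling_factor=0.1):
        # Validate once, at construction time, instead of on every call.
        if scaling_factor > 0.25 or scaling_factor < 0:
            raise ValueError('scaling_factor must be between 0 and 0.25.')
        self._scaling_factor = scaling_factor

    def sim(self, src, tar):
        # Same parameters as Distance.sim, so the override matches.
        if src == tar:
            return 1.0
        prefix = 0
        for src_char, tar_char in zip(src[:4], tar[:4]):
            if src_char != tar_char:
                break
            prefix += 1
        return prefix * self._scaling_factor
```

A caller would then write `PrefixBoost(scaling_factor=0.2).sim('abcd', 'abxx')`. The inherited dist() works unchanged because it only relies on the two-argument sim().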