Completed
Push — master ( f43547...71985b )
by Chris
12:00 queued 10s
created

abydos.distance._jaro_winkler.dist_jaro_winkler()   A

Complexity

Conditions 1

Size

Total Lines 69
Code Lines 10

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 2
CRAP Score 1

Importance

Changes 0
Metric Value
cc 1
eloc 10
nop 7
dl 0
loc 69
ccs 2
cts 2
cp 1
crap 1
rs 9.9
c 0
b 0
f 0

How to fix   Long Method   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:

1
# -*- coding: utf-8 -*-
2
3
# Copyright 2014-2018 by Christopher C. Little.
4
# This file is part of Abydos.
5
#
6
# Abydos is free software: you can redistribute it and/or modify
7
# it under the terms of the GNU General Public License as published by
8
# the Free Software Foundation, either version 3 of the License, or
9
# (at your option) any later version.
10
#
11
# Abydos is distributed in the hope that it will be useful,
12
# but WITHOUT ANY WARRANTY; without even the implied warranty of
13
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
# GNU General Public License for more details.
15
#
16
# You should have received a copy of the GNU General Public License
17
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.
18
19 1
"""abydos.distance._jaro_winkler.
20
21
The distance._JaroWinkler module implements distance metrics based on
22
:cite:`Jaro:1989` and subsequent works:
23
24
    - Jaro distance
25
    - Jaro-Winkler distance
26
"""
27
28 1
from __future__ import (
29
    absolute_import,
30
    division,
31
    print_function,
32
    unicode_literals,
33
)
34
35 1
from six.moves import range
36
37 1
from ._distance import _Distance
38 1
from ..tokenizer import QGrams
39
40 1
__all__ = ['JaroWinkler', 'dist_jaro_winkler', 'sim_jaro_winkler']
41
42
43 1
class JaroWinkler(_Distance):
0 ignored issues
show
Unused Code introduced by
The variable __class__ seems to be unused.
Loading history...
44
    """Jaro-Winkler distance.
45
46
    Jaro(-Winkler) distance is a string edit distance initially proposed by
47
    Jaro and extended by Winkler :cite:`Jaro:1989,Winkler:1990`.
48
49
    This is Python based on the C code for strcmp95:
50
    http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c
51
    :cite:`Winkler:1994`. The above file is a US Government publication and,
52
    accordingly, in the public domain.
53
    """
54
55 1
    def sim(
0 ignored issues
show
best-practice introduced by
Too many arguments (8/5)
Loading history...
Comprehensibility introduced by
This function exceeds the maximum number of variables (24/15).
Loading history...
Bug introduced by
Parameters differ from overridden 'sim' method
Loading history...
56
        self,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
57
        src,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
58
        tar,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
59
        qval=1,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
60
        mode='winkler',
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
61
        long_strings=False,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
62
        boost_threshold=0.7,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
63
        scaling_factor=0.1,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
64
    ):
65
        """Return the Jaro or Jaro-Winkler similarity of two strings.
66
67
        Parameters
68
        ----------
69
        src : str
70
            Source string for comparison
71
        tar : str
72
            Target string for comparison
73
        qval : int
74
            The length of each q-gram (defaults to 1: character-wise matching)
75
        mode : str
76
            Indicates which variant of this distance metric to compute:
77
78
                - ``winkler`` -- computes the Jaro-Winkler distance (default)
79
                  which increases the score for matches near the start of the
80
                  word
81
                - ``jaro`` -- computes the Jaro distance
82
83
        long_strings : bool
84
            Set to True to "Increase the probability of a match when the number
85
            of matched characters is large. This option allows for a little
86
            more tolerance when the strings are large. It is not an appropriate
87
            test when comparing fixed length fields such as phone and social
88
            security numbers." (Used in 'winkler' mode only.)
89
        boost_threshold : float
90
            A value between 0 and 1, below which the Winkler boost is not
91
            applied (defaults to 0.7). (Used in 'winkler' mode only.)
92
        scaling_factor : float
93
            A value between 0 and 0.25, indicating by how much to boost scores
94
            for matching prefixes (defaults to 0.1). (Used in 'winkler' mode
95
            only.)
96
97
        Returns
98
        -------
99
        float
100
            Jaro or Jaro-Winkler similarity
101
102
        Raises
103
        ------
104
        ValueError
105
            Unsupported boost_threshold assignment; boost_threshold must be
106
            between 0 and 1.
107
        ValueError
108
            Unsupported scaling_factor assignment; scaling_factor must be
109
            between 0 and 0.25.'
110
111
        Examples
112
        --------
113
        >>> round(sim_jaro_winkler('cat', 'hat'), 12)
114
        0.777777777778
115
        >>> round(sim_jaro_winkler('Niall', 'Neil'), 12)
116
        0.805
117
        >>> round(sim_jaro_winkler('aluminum', 'Catalan'), 12)
118
        0.60119047619
119
        >>> round(sim_jaro_winkler('ATCG', 'TAGC'), 12)
120
        0.833333333333
121
122
        >>> round(sim_jaro_winkler('cat', 'hat', mode='jaro'), 12)
123
        0.777777777778
124
        >>> round(sim_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
125
        0.783333333333
126
        >>> round(sim_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
127
        0.60119047619
128
        >>> round(sim_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
129
        0.833333333333
130
131
        """
132 1
        if mode == 'winkler':
133 1
            if boost_threshold > 1 or boost_threshold < 0:
134 1
                raise ValueError(
135
                    'Unsupported boost_threshold assignment; '
136
                    + 'boost_threshold must be between 0 and 1.'
137
                )
138 1
            if scaling_factor > 0.25 or scaling_factor < 0:
139 1
                raise ValueError(
140
                    'Unsupported scaling_factor assignment; '
141
                    + 'scaling_factor must be between 0 and 0.25.'
142
                )
143
144 1
        if src == tar:
145 1
            return 1.0
146
147 1
        src = QGrams(src.strip(), qval)._ordered_list
0 ignored issues
show
Coding Style Best Practice introduced by
It seems like _ordered_list was declared protected and should not be accessed from this context.

Prefixing a member variable _ is usually regarded as the equivalent of declaring it with protected visibility that exists in other languages. Consequentially, such a member should only be accessed from the same class or a child class:

class MyParent:
    def __init__(self):
        self._x = 1;
        self.y = 2;

class MyChild(MyParent):
    def some_method(self):
        return self._x    # Ok, since accessed from a child class

class AnotherClass:
    def some_method(self, instance_of_my_child):
        return instance_of_my_child._x   # Would be flagged as AnotherClass is not
                                         # a child class of MyParent
Loading history...
148 1
        tar = QGrams(tar.strip(), qval)._ordered_list
0 ignored issues
show
Coding Style Best Practice introduced by
It seems like _ordered_list was declared protected and should not be accessed from this context.

Prefixing a member variable _ is usually regarded as the equivalent of declaring it with protected visibility that exists in other languages. Consequentially, such a member should only be accessed from the same class or a child class:

class MyParent:
    def __init__(self):
        self._x = 1;
        self.y = 2;

class MyChild(MyParent):
    def some_method(self):
        return self._x    # Ok, since accessed from a child class

class AnotherClass:
    def some_method(self, instance_of_my_child):
        return instance_of_my_child._x   # Would be flagged as AnotherClass is not
                                         # a child class of MyParent
Loading history...
149
150 1
        lens = len(src)
151 1
        lent = len(tar)
152
153
        # If either string is blank - return - added in Version 2
154 1
        if lens == 0 or lent == 0:
155 1
            return 0.0
156
157 1
        if lens > lent:
158 1
            search_range = lens
159 1
            minv = lent
160
        else:
161 1
            search_range = lent
162 1
            minv = lens
163
164
        # Zero out the flags
165 1
        src_flag = [0] * search_range
166 1
        tar_flag = [0] * search_range
167 1
        search_range = max(0, search_range // 2 - 1)
168
169
        # Looking only within the search range,
170
        # count and flag the matched pairs.
171 1
        num_com = 0
172 1
        yl1 = lent - 1
173 1
        for i in range(lens):
174 1
            low_lim = (i - search_range) if (i >= search_range) else 0
175 1
            hi_lim = (i + search_range) if ((i + search_range) <= yl1) else yl1
176 1
            for j in range(low_lim, hi_lim + 1):
177 1
                if (tar_flag[j] == 0) and (tar[j] == src[i]):
178 1
                    tar_flag[j] = 1
179 1
                    src_flag[i] = 1
180 1
                    num_com += 1
181 1
                    break
182
183
        # If no characters in common - return
184 1
        if num_com == 0:
185 1
            return 0.0
186
187
        # Count the number of transpositions
188 1
        k = n_trans = 0
189 1
        for i in range(lens):
190 1
            if src_flag[i] != 0:
191 1
                j = 0
192 1
                for j in range(k, lent):  # pragma: no branch
193 1
                    if tar_flag[j] != 0:
194 1
                        k = j + 1
195 1
                        break
196 1
                if src[i] != tar[j]:
197 1
                    n_trans += 1
198 1
        n_trans //= 2
199
200
        # Main weight computation for Jaro distance
201 1
        weight = (
202
            num_com / lens + num_com / lent + (num_com - n_trans) / num_com
203
        )
204 1
        weight /= 3.0
205
206
        # Continue to boost the weight if the strings are similar
207
        # This is the Winkler portion of Jaro-Winkler distance
208 1
        if mode == 'winkler' and weight > boost_threshold:
209
210
            # Adjust for having up to the first 4 characters in common
211 1
            j = 4 if (minv >= 4) else minv
212 1
            i = 0
213 1
            while (i < j) and (src[i] == tar[i]):
214 1
                i += 1
215 1
            weight += i * scaling_factor * (1.0 - weight)
216
217
            # Optionally adjust for long strings.
218
219
            # After agreeing beginning chars, at least two more must agree and
220
            # the agreeing characters must be > .5 of remaining characters.
221 1
            if (
222
                long_strings
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
223
                and (minv > 4)
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
224
                and (num_com > i + 1)
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
225
                and (2 * num_com >= minv + i)
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
226
            ):
227 1
                weight += (1.0 - weight) * (
228
                    (num_com - i - 1) / (lens + lent - i * 2 + 2)
229
                )
230
231 1
        return weight
232
233
234 1
def sim_jaro_winkler(
0 ignored issues
show
best-practice introduced by
Too many arguments (7/5)
Loading history...
235
    src,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
236
    tar,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
237
    qval=1,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
238
    mode='winkler',
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
239
    long_strings=False,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
240
    boost_threshold=0.7,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
241
    scaling_factor=0.1,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
242
):
243
    """Return the Jaro or Jaro-Winkler similarity of two strings.
244
245
    This is a wrapper for :py:meth:`JaroWinkler.sim`.
246
247
    Parameters
248
    ----------
249
    src : str
250
        Source string for comparison
251
    tar : str
252
        Target string for comparison
253
    qval : int
254
        The length of each q-gram (defaults to 1: character-wise matching)
255
    mode : str
256
        Indicates which variant of this distance metric to compute:
257
258
            - ``winkler`` -- computes the Jaro-Winkler distance (default) which
259
              increases the score for matches near the start of the word
260
            - ``jaro`` -- computes the Jaro distance
261
262
    long_strings : bool
263
        Set to True to "Increase the probability of a match when the number of
264
        matched characters is large. This option allows for a little more
265
        tolerance when the strings are large. It is not an appropriate test
266
        when comparing fixedlength fields such as phone and social security
267
        numbers." (Used in 'winkler' mode only.)
268
    boost_threshold : float
269
        A value between 0 and 1, below which the Winkler boost is not applied
270
        (defaults to 0.7). (Used in 'winkler' mode only.)
271
    scaling_factor : float
272
        A value between 0 and 0.25, indicating by how much to boost scores for
273
        matching prefixes (defaults to 0.1). (Used in 'winkler' mode only.)
274
275
    Returns
276
    -------
277
    float
278
        Jaro or Jaro-Winkler similarity
279
280
    Examples
281
    --------
282
    >>> round(sim_jaro_winkler('cat', 'hat'), 12)
283
    0.777777777778
284
    >>> round(sim_jaro_winkler('Niall', 'Neil'), 12)
285
    0.805
286
    >>> round(sim_jaro_winkler('aluminum', 'Catalan'), 12)
287
    0.60119047619
288
    >>> round(sim_jaro_winkler('ATCG', 'TAGC'), 12)
289
    0.833333333333
290
291
    >>> round(sim_jaro_winkler('cat', 'hat', mode='jaro'), 12)
292
    0.777777777778
293
    >>> round(sim_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
294
    0.783333333333
295
    >>> round(sim_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
296
    0.60119047619
297
    >>> round(sim_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
298
    0.833333333333
299
300
    """
301 1
    return JaroWinkler().sim(
302
        src, tar, qval, mode, long_strings, boost_threshold, scaling_factor
303
    )
304
305
306 1
def dist_jaro_winkler(
0 ignored issues
show
best-practice introduced by
Too many arguments (7/5)
Loading history...
307
    src,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
308
    tar,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
309
    qval=1,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
310
    mode='winkler',
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
311
    long_strings=False,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
312
    boost_threshold=0.7,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
313
    scaling_factor=0.1,
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
314
):
315
    """Return the Jaro or Jaro-Winkler distance between two strings.
316
317
    This is a wrapper for :py:meth:`JaroWinkler.dist`.
318
319
    Parameters
320
    ----------
321
    src : str
322
        Source string for comparison
323
    tar : str
324
        Target string for comparison
325
    qval : int
326
        The length of each q-gram (defaults to 1: character-wise matching)
327
    mode : str
328
        Indicates which variant of this distance metric to compute:
329
330
            - ``winkler`` -- computes the Jaro-Winkler distance (default) which
331
              increases the score for matches near the start of the word
332
            - ``jaro`` -- computes the Jaro distance
333
334
    long_strings : bool
335
        Set to True to "Increase the probability of a match when the number of
336
        matched characters is large. This option allows for a little more
337
        tolerance when the strings are large. It is not an appropriate test
338
        when comparing fixedlength fields such as phone and social security
339
        numbers." (Used in 'winkler' mode only.)
340
    boost_threshold : float
341
        A value between 0 and 1, below which the Winkler boost is not applied
342
        (defaults to 0.7). (Used in 'winkler' mode only.)
343
    scaling_factor : float
344
        A value between 0 and 0.25, indicating by how much to boost scores for
345
        matching prefixes (defaults to 0.1). (Used in 'winkler' mode only.)
346
347
    Returns
348
    -------
349
    float
350
        Jaro or Jaro-Winkler distance
351
352
    Examples
353
    --------
354
    >>> round(dist_jaro_winkler('cat', 'hat'), 12)
355
    0.222222222222
356
    >>> round(dist_jaro_winkler('Niall', 'Neil'), 12)
357
    0.195
358
    >>> round(dist_jaro_winkler('aluminum', 'Catalan'), 12)
359
    0.39880952381
360
    >>> round(dist_jaro_winkler('ATCG', 'TAGC'), 12)
361
    0.166666666667
362
363
    >>> round(dist_jaro_winkler('cat', 'hat', mode='jaro'), 12)
364
    0.222222222222
365
    >>> round(dist_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
366
    0.216666666667
367
    >>> round(dist_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
368
    0.39880952381
369
    >>> round(dist_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
370
    0.166666666667
371
372
    """
373 1
    return JaroWinkler().dist(
374
        src, tar, qval, mode, long_strings, boost_threshold, scaling_factor
375
    )
376
377
378
if __name__ == '__main__':
379
    import doctest
380
381
    doctest.testmod()
382