abydos.distance._jaro_winkler.JaroWinkler.sim() - Code Metrics - Inspection of "Merge pull request #149 from chrislit/0.3.6" - chrislit/abydos - Measure and Improve Code Quality continuously with Scrutinizer

Completed

Push — master ( f43547...71985b )

by Chris

created 2018-11-17 08:52 UTC

abydos.distance._jaro_winkler.JaroWinkler.sim() F

↳ Parent: abydos.distance._jaro_winkler

Complexity

Conditions

Size

Total Lines	177
Code Lines	75

Duplication

Lines	0
Ratio	0 %

Code Coverage

Tests	57
CRAP Score	31

Importance

Changes

Metric	Value
cc	31
eloc	75
nop	8
dl	0
loc	177
ccs	57
cts	57
cp	1
crap	31
rs	0
c	0
b	0
f	0

How to fix Long Method Complexity Many Parameters

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:

If many parameters/temporary variables are present:
- Replace temporary variables with Query
- Introduce parameter object; often combined with preserve whole object
- If the above two are insufficient: Replace method with method object
If you have long conditionals: Decompose Conditional
Otherwise: Extract method

Complexity

Complex classes like abydos.distance._jaro_winkler.JaroWinkler.sim() often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

Many Parameters

Methods with many parameters are not only hard to understand, but their parameters also often become inconsistent when you need more, or different data.

There are several approaches to avoid long parameter lists:

If you can get the data from another object that the method knows about already: Replace the parameter with method call
If several parameters are part of another object: Preserve Whole Object
If parameters are not yet connected: Introduce Parameter Object

# -*- coding: utf-8 -*-

# Copyright 2014-2018 by Christopher C. Little.
# This file is part of Abydos.
#
# Abydos is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Abydos is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.

"""abydos.distance._jaro_winkler.

The distance._JaroWinkler module implements distance metrics based on
:cite:`Jaro:1989` and subsequent works:

    - Jaro distance
    - Jaro-Winkler distance
"""

from __future__ import (
    absolute_import,
    division,
    print_function,
    unicode_literals,
)

from six.moves import range

from ._distance import _Distance
from ..tokenizer import QGrams

__all__ = ['JaroWinkler', 'dist_jaro_winkler', 'sim_jaro_winkler']


class JaroWinkler(_Distance):

    """Jaro-Winkler distance.

    Jaro(-Winkler) distance is a string edit distance initially proposed by
    Jaro and extended by Winkler :cite:`Jaro:1989,Winkler:1990`.

    This is Python based on the C code for strcmp95:
    http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c
    :cite:`Winkler:1994`. The above file is a US Government publication and,
    accordingly, in the public domain.
    """

    def sim(

        self,

        src,

        tar,

        qval=1,

        mode='winkler',

        long_strings=False,

        boost_threshold=0.7,

        scaling_factor=0.1,

    ):
        """Return the Jaro or Jaro-Winkler similarity of two strings.

        Parameters
        ----------
        src : str
            Source string for comparison
        tar : str
            Target string for comparison
        qval : int
            The length of each q-gram (defaults to 1: character-wise matching)
        mode : str
            Indicates which variant of this distance metric to compute:

                - ``winkler`` -- computes the Jaro-Winkler distance (default)
                  which increases the score for matches near the start of the
                  word
                - ``jaro`` -- computes the Jaro distance

        long_strings : bool
            Set to True to "Increase the probability of a match when the number
            of matched characters is large. This option allows for a little
            more tolerance when the strings are large. It is not an appropriate
            test when comparing fixed length fields such as phone and social
            security numbers." (Used in 'winkler' mode only.)
        boost_threshold : float
            A value between 0 and 1, below which the Winkler boost is not
            applied (defaults to 0.7). (Used in 'winkler' mode only.)
        scaling_factor : float
            A value between 0 and 0.25, indicating by how much to boost scores
            for matching prefixes (defaults to 0.1). (Used in 'winkler' mode
            only.)

        Returns
        -------
        float
            Jaro or Jaro-Winkler similarity

        Raises
        ------
        ValueError
            Unsupported boost_threshold assignment; boost_threshold must be
            between 0 and 1.
        ValueError
            Unsupported scaling_factor assignment; scaling_factor must be
            between 0 and 0.25.'

        Examples
        --------
        >>> round(sim_jaro_winkler('cat', 'hat'), 12)
        0.777777777778
        >>> round(sim_jaro_winkler('Niall', 'Neil'), 12)
        0.805
        >>> round(sim_jaro_winkler('aluminum', 'Catalan'), 12)
        0.60119047619
        >>> round(sim_jaro_winkler('ATCG', 'TAGC'), 12)
        0.833333333333

        >>> round(sim_jaro_winkler('cat', 'hat', mode='jaro'), 12)
        0.777777777778
        >>> round(sim_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
        0.783333333333
        >>> round(sim_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
        0.60119047619
        >>> round(sim_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
        0.833333333333

        """
        if mode == 'winkler':
            if boost_threshold > 1 or boost_threshold < 0:
                raise ValueError(
                    'Unsupported boost_threshold assignment; '
                    + 'boost_threshold must be between 0 and 1.'
                )
            if scaling_factor > 0.25 or scaling_factor < 0:
                raise ValueError(
                    'Unsupported scaling_factor assignment; '
                    + 'scaling_factor must be between 0 and 0.25.'
                )

        if src == tar:
            return 1.0

        src = QGrams(src.strip(), qval)._ordered_list
class MyParent:
    def __init__(self):
        self._x = 1;
        self.y = 2;

class MyChild(MyParent):
    def some_method(self):
        return self._x    # Ok, since accessed from a child class

class AnotherClass:
    def some_method(self, instance_of_my_child):
        return instance_of_my_child._x   # Would be flagged as AnotherClass is not
                                         # a child class of MyParent
        tar = QGrams(tar.strip(), qval)._ordered_list
class MyParent:
    def __init__(self):
        self._x = 1;
        self.y = 2;

class MyChild(MyParent):
    def some_method(self):
        return self._x    # Ok, since accessed from a child class

class AnotherClass:
    def some_method(self, instance_of_my_child):
        return instance_of_my_child._x   # Would be flagged as AnotherClass is not
                                         # a child class of MyParent

        lens = len(src)
        lent = len(tar)

        # If either string is blank - return - added in Version 2
        if lens == 0 or lent == 0:
            return 0.0

        if lens > lent:
            search_range = lens
            minv = lent
        else:
            search_range = lent
            minv = lens

        # Zero out the flags
        src_flag = [0] * search_range
        tar_flag = [0] * search_range
        search_range = max(0, search_range // 2 - 1)

        # Looking only within the search range,
        # count and flag the matched pairs.
        num_com = 0
        yl1 = lent - 1
        for i in range(lens):
            low_lim = (i - search_range) if (i >= search_range) else 0
            hi_lim = (i + search_range) if ((i + search_range) <= yl1) else yl1
            for j in range(low_lim, hi_lim + 1):
                if (tar_flag[j] == 0) and (tar[j] == src[i]):
                    tar_flag[j] = 1
                    src_flag[i] = 1
                    num_com += 1
                    break

        # If no characters in common - return
        if num_com == 0:
            return 0.0

        # Count the number of transpositions
        k = n_trans = 0
        for i in range(lens):
            if src_flag[i] != 0:
                j = 0
                for j in range(k, lent):  # pragma: no branch
                    if tar_flag[j] != 0:
                        k = j + 1
                        break
                if src[i] != tar[j]:
                    n_trans += 1
        n_trans //= 2

        # Main weight computation for Jaro distance
        weight = (
            num_com / lens + num_com / lent + (num_com - n_trans) / num_com
        )
        weight /= 3.0

        # Continue to boost the weight if the strings are similar
        # This is the Winkler portion of Jaro-Winkler distance
        if mode == 'winkler' and weight > boost_threshold:

            # Adjust for having up to the first 4 characters in common
            j = 4 if (minv >= 4) else minv
            i = 0
            while (i < j) and (src[i] == tar[i]):
                i += 1
            weight += i * scaling_factor * (1.0 - weight)

            # Optionally adjust for long strings.

            # After agreeing beginning chars, at least two more must agree and
            # the agreeing characters must be > .5 of remaining characters.
            if (
                long_strings

                and (minv > 4)

                and (num_com > i + 1)

                and (2 * num_com >= minv + i)

            ):
                weight += (1.0 - weight) * (
                    (num_com - i - 1) / (lens + lent - i * 2 + 2)
                )

        return weight


def sim_jaro_winkler(

    src,

    tar,

    qval=1,

    mode='winkler',

    long_strings=False,

    boost_threshold=0.7,

    scaling_factor=0.1,

):
    """Return the Jaro or Jaro-Winkler similarity of two strings.

    This is a wrapper for :py:meth:`JaroWinkler.sim`.

    Parameters
    ----------
    src : str
        Source string for comparison
    tar : str
        Target string for comparison
    qval : int
        The length of each q-gram (defaults to 1: character-wise matching)
    mode : str
        Indicates which variant of this distance metric to compute:

            - ``winkler`` -- computes the Jaro-Winkler distance (default) which
              increases the score for matches near the start of the word
            - ``jaro`` -- computes the Jaro distance

    long_strings : bool
        Set to True to "Increase the probability of a match when the number of
        matched characters is large. This option allows for a little more
        tolerance when the strings are large. It is not an appropriate test
        when comparing fixedlength fields such as phone and social security
        numbers." (Used in 'winkler' mode only.)
    boost_threshold : float
        A value between 0 and 1, below which the Winkler boost is not applied
        (defaults to 0.7). (Used in 'winkler' mode only.)
    scaling_factor : float
        A value between 0 and 0.25, indicating by how much to boost scores for
        matching prefixes (defaults to 0.1). (Used in 'winkler' mode only.)

    Returns
    -------
    float
        Jaro or Jaro-Winkler similarity

    Examples
    --------
    >>> round(sim_jaro_winkler('cat', 'hat'), 12)
    0.777777777778
    >>> round(sim_jaro_winkler('Niall', 'Neil'), 12)
    0.805
    >>> round(sim_jaro_winkler('aluminum', 'Catalan'), 12)
    0.60119047619
    >>> round(sim_jaro_winkler('ATCG', 'TAGC'), 12)
    0.833333333333

    >>> round(sim_jaro_winkler('cat', 'hat', mode='jaro'), 12)
    0.777777777778
    >>> round(sim_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
    0.783333333333
    >>> round(sim_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
    0.60119047619
    >>> round(sim_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
    0.833333333333

    """
    return JaroWinkler().sim(
        src, tar, qval, mode, long_strings, boost_threshold, scaling_factor
    )


def dist_jaro_winkler(

    src,

    tar,

    qval=1,

    mode='winkler',

    long_strings=False,

    boost_threshold=0.7,

    scaling_factor=0.1,

):
    """Return the Jaro or Jaro-Winkler distance between two strings.

    This is a wrapper for :py:meth:`JaroWinkler.dist`.

    Parameters
    ----------
    src : str
        Source string for comparison
    tar : str
        Target string for comparison
    qval : int
        The length of each q-gram (defaults to 1: character-wise matching)
    mode : str
        Indicates which variant of this distance metric to compute:

            - ``winkler`` -- computes the Jaro-Winkler distance (default) which
              increases the score for matches near the start of the word
            - ``jaro`` -- computes the Jaro distance

    long_strings : bool
        Set to True to "Increase the probability of a match when the number of
        matched characters is large. This option allows for a little more
        tolerance when the strings are large. It is not an appropriate test
        when comparing fixedlength fields such as phone and social security
        numbers." (Used in 'winkler' mode only.)
    boost_threshold : float
        A value between 0 and 1, below which the Winkler boost is not applied
        (defaults to 0.7). (Used in 'winkler' mode only.)
    scaling_factor : float
        A value between 0 and 0.25, indicating by how much to boost scores for
        matching prefixes (defaults to 0.1). (Used in 'winkler' mode only.)

    Returns
    -------
    float
        Jaro or Jaro-Winkler distance

    Examples
    --------
    >>> round(dist_jaro_winkler('cat', 'hat'), 12)
    0.222222222222
    >>> round(dist_jaro_winkler('Niall', 'Neil'), 12)
    0.195
    >>> round(dist_jaro_winkler('aluminum', 'Catalan'), 12)
    0.39880952381
    >>> round(dist_jaro_winkler('ATCG', 'TAGC'), 12)
    0.166666666667

    >>> round(dist_jaro_winkler('cat', 'hat', mode='jaro'), 12)
    0.222222222222
    >>> round(dist_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
    0.216666666667
    >>> round(dist_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
    0.39880952381
    >>> round(dist_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
    0.166666666667

    """
    return JaroWinkler().dist(
        src, tar, qval, mode, long_strings, boost_threshold, scaling_factor
    )


if __name__ == '__main__':
    import doctest

    doctest.testmod()


1		# -- coding: utf-8 --
2
3		# Copyright 2014-2018 by Christopher C. Little.
4		# This file is part of Abydos.
5		#
6		# Abydos is free software: you can redistribute it and/or modify
7		# it under the terms of the GNU General Public License as published by
8		# the Free Software Foundation, either version 3 of the License, or
9		# (at your option) any later version.
10		#
11		# Abydos is distributed in the hope that it will be useful,
12		# but WITHOUT ANY WARRANTY; without even the implied warranty of
13		# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14		# GNU General Public License for more details.
15		#
16		# You should have received a copy of the GNU General Public License
17		# along with Abydos. If not, see <http://www.gnu.org/licenses/>.
18
19	1	"""abydos.distance._jaro_winkler.
20
21		The distance._JaroWinkler module implements distance metrics based on
22		:cite:`Jaro:1989` and subsequent works:
23
24		- Jaro distance
25		- Jaro-Winkler distance
26		"""
27
28	1	from __future__ import (
29		absolute_import,
30		division,
31		print_function,
32		unicode_literals,
33		)
34
35	1	from six.moves import range
36
37	1	from ._distance import _Distance
38	1	from ..tokenizer import QGrams
39
40	1	__all__ = ['JaroWinkler', 'dist_jaro_winkler', 'sim_jaro_winkler']
41
42
43	1	class JaroWinkler(_Distance):
		0 ignored issues – show Unused Code introduced 2018-11-10 01:42 UTC by Report Bug Copy Issue Report The variable `__class__` seems to be unused. Loading history...
44		"""Jaro-Winkler distance.
45
46		Jaro(-Winkler) distance is a string edit distance initially proposed by
47		Jaro and extended by Winkler :cite:`Jaro:1989,Winkler:1990`.
48
49		This is Python based on the C code for strcmp95:
50		http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c
51		:cite:`Winkler:1994`. The above file is a US Government publication and,
52		accordingly, in the public domain.
53		"""
54
55	1	def sim(
		0 ignored issues – show best-practice introduced 2018-11-04 08:02 UTC by Report Bug Copy Issue Report Too many arguments (8/5) Loading history... Comprehensibility introduced 2018-11-04 08:02 UTC by Report Bug Copy Issue Report This function exceeds the maximum number of variables (24/15). Loading history... Bug introduced 2018-11-04 08:02 UTC by Report Bug Copy Issue Report Parameters differ from overridden 'sim' method Loading history...
56		self,
		0 ignored issues – show Coding Style introduced 2018-11-04 08:02 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
57		src,
		0 ignored issues – show Coding Style introduced 2018-11-04 08:02 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
58		tar,
		0 ignored issues – show Coding Style introduced 2018-11-04 08:02 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
59		qval=1,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
60		mode='winkler',
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
61		long_strings=False,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
62		boost_threshold=0.7,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
63		scaling_factor=0.1,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
64		):
65		"""Return the Jaro or Jaro-Winkler similarity of two strings.
66
67		Parameters
68		----------
69		src : str
70		Source string for comparison
71		tar : str
72		Target string for comparison
73		qval : int
74		The length of each q-gram (defaults to 1: character-wise matching)
75		mode : str
76		Indicates which variant of this distance metric to compute:
77
78		- ``winkler`` -- computes the Jaro-Winkler distance (default)
79		which increases the score for matches near the start of the
80		word
81		- ``jaro`` -- computes the Jaro distance
82
83		long_strings : bool
84		Set to True to "Increase the probability of a match when the number
85		of matched characters is large. This option allows for a little
86		more tolerance when the strings are large. It is not an appropriate
87		test when comparing fixed length fields such as phone and social
88		security numbers." (Used in 'winkler' mode only.)
89		boost_threshold : float
90		A value between 0 and 1, below which the Winkler boost is not
91		applied (defaults to 0.7). (Used in 'winkler' mode only.)
92		scaling_factor : float
93		A value between 0 and 0.25, indicating by how much to boost scores
94		for matching prefixes (defaults to 0.1). (Used in 'winkler' mode
95		only.)
96
97		Returns
98		-------
99		float
100		Jaro or Jaro-Winkler similarity
101
102		Raises
103		------
104		ValueError
105		Unsupported boost_threshold assignment; boost_threshold must be
106		between 0 and 1.
107		ValueError
108		Unsupported scaling_factor assignment; scaling_factor must be
109		between 0 and 0.25.'
110
111		Examples
112		--------
113		>>> round(sim_jaro_winkler('cat', 'hat'), 12)
114		0.777777777778
115		>>> round(sim_jaro_winkler('Niall', 'Neil'), 12)
116		0.805
117		>>> round(sim_jaro_winkler('aluminum', 'Catalan'), 12)
118		0.60119047619
119		>>> round(sim_jaro_winkler('ATCG', 'TAGC'), 12)
120		0.833333333333
121
122		>>> round(sim_jaro_winkler('cat', 'hat', mode='jaro'), 12)
123		0.777777777778
124		>>> round(sim_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
125		0.783333333333
126		>>> round(sim_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
127		0.60119047619
128		>>> round(sim_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
129		0.833333333333
130
131		"""
132	1	if mode == 'winkler':
133	1	if boost_threshold > 1 or boost_threshold < 0:
134	1	raise ValueError(
135		'Unsupported boost_threshold assignment; '
136		+ 'boost_threshold must be between 0 and 1.'
137		)
138	1	if scaling_factor > 0.25 or scaling_factor < 0:
139	1	raise ValueError(
140		'Unsupported scaling_factor assignment; '
141		+ 'scaling_factor must be between 0 and 0.25.'
142		)
143
144	1	if src == tar:
145	1	return 1.0
146
147	1	src = QGrams(src.strip(), qval)._ordered_list
		0 ignored issues – show Coding Style Best Practice introduced 2018-11-17 01:13 UTC by Report Bug Copy Issue Report It seems like `_ordered_list` was declared protected and should not be accessed from this context. Prefixing a member variable `_` is usually regarded as the equivalent of declaring it with protected visibility that exists in other languages. Consequentially, such a member should only be accessed from the same class or a child class: class MyParent: def __init__(self): self._x = 1; self.y = 2; class MyChild(MyParent): def some_method(self): return self._x # Ok, since accessed from a child class class AnotherClass: def some_method(self, instance_of_my_child): return instance_of_my_child._x # Would be flagged as AnotherClass is not # a child class of MyParent Loading history...
148	1	tar = QGrams(tar.strip(), qval)._ordered_list
		0 ignored issues – show Coding Style Best Practice introduced 2018-11-17 01:13 UTC by Report Bug Copy Issue Report It seems like `_ordered_list` was declared protected and should not be accessed from this context. Prefixing a member variable `_` is usually regarded as the equivalent of declaring it with protected visibility that exists in other languages. Consequentially, such a member should only be accessed from the same class or a child class: class MyParent: def __init__(self): self._x = 1; self.y = 2; class MyChild(MyParent): def some_method(self): return self._x # Ok, since accessed from a child class class AnotherClass: def some_method(self, instance_of_my_child): return instance_of_my_child._x # Would be flagged as AnotherClass is not # a child class of MyParent Loading history...
149
150	1	lens = len(src)
151	1	lent = len(tar)
152
153		# If either string is blank - return - added in Version 2
154	1	if lens == 0 or lent == 0:
155	1	return 0.0
156
157	1	if lens > lent:
158	1	search_range = lens
159	1	minv = lent
160		else:
161	1	search_range = lent
162	1	minv = lens
163
164		# Zero out the flags
165	1	src_flag = [0] * search_range
166	1	tar_flag = [0] * search_range
167	1	search_range = max(0, search_range // 2 - 1)
168
169		# Looking only within the search range,
170		# count and flag the matched pairs.
171	1	num_com = 0
172	1	yl1 = lent - 1
173	1	for i in range(lens):
174	1	low_lim = (i - search_range) if (i >= search_range) else 0
175	1	hi_lim = (i + search_range) if ((i + search_range) <= yl1) else yl1
176	1	for j in range(low_lim, hi_lim + 1):
177	1	if (tar_flag[j] == 0) and (tar[j] == src[i]):
178	1	tar_flag[j] = 1
179	1	src_flag[i] = 1
180	1	num_com += 1
181	1	break
182
183		# If no characters in common - return
184	1	if num_com == 0:
185	1	return 0.0
186
187		# Count the number of transpositions
188	1	k = n_trans = 0
189	1	for i in range(lens):
190	1	if src_flag[i] != 0:
191	1	j = 0
192	1	for j in range(k, lent): # pragma: no branch
193	1	if tar_flag[j] != 0:
194	1	k = j + 1
195	1	break
196	1	if src[i] != tar[j]:
197	1	n_trans += 1
198	1	n_trans //= 2
199
200		# Main weight computation for Jaro distance
201	1	weight = (
202		num_com / lens + num_com / lent + (num_com - n_trans) / num_com
203		)
204	1	weight /= 3.0
205
206		# Continue to boost the weight if the strings are similar
207		# This is the Winkler portion of Jaro-Winkler distance
208	1	if mode == 'winkler' and weight > boost_threshold:
209
210		# Adjust for having up to the first 4 characters in common
211	1	j = 4 if (minv >= 4) else minv
212	1	i = 0
213	1	while (i < j) and (src[i] == tar[i]):
214	1	i += 1
215	1	weight += i * scaling_factor * (1.0 - weight)
216
217		# Optionally adjust for long strings.
218
219		# After agreeing beginning chars, at least two more must agree and
220		# the agreeing characters must be > .5 of remaining characters.
221	1	if (
222		long_strings
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
223		and (minv > 4)
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
224		and (num_com > i + 1)
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
225		and (2 * num_com >= minv + i)
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
226		):
227	1	weight += (1.0 - weight) * (
228		(num_com - i - 1) / (lens + lent - i * 2 + 2)
229		)
230
231	1	return weight
232
233
234	1	def sim_jaro_winkler(
		0 ignored issues – show best-practice introduced 2018-11-04 08:02 UTC by Report Bug Copy Issue Report Too many arguments (7/5) Loading history...
235		src,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
236		tar,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
237		qval=1,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
238		mode='winkler',
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
239		long_strings=False,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
240		boost_threshold=0.7,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
241		scaling_factor=0.1,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
242		):
243		"""Return the Jaro or Jaro-Winkler similarity of two strings.
244
245		This is a wrapper for :py:meth:`JaroWinkler.sim`.
246
247		Parameters
248		----------
249		src : str
250		Source string for comparison
251		tar : str
252		Target string for comparison
253		qval : int
254		The length of each q-gram (defaults to 1: character-wise matching)
255		mode : str
256		Indicates which variant of this distance metric to compute:
257
258		- ``winkler`` -- computes the Jaro-Winkler distance (default) which
259		increases the score for matches near the start of the word
260		- ``jaro`` -- computes the Jaro distance
261
262		long_strings : bool
263		Set to True to "Increase the probability of a match when the number of
264		matched characters is large. This option allows for a little more
265		tolerance when the strings are large. It is not an appropriate test
266		when comparing fixedlength fields such as phone and social security
267		numbers." (Used in 'winkler' mode only.)
268		boost_threshold : float
269		A value between 0 and 1, below which the Winkler boost is not applied
270		(defaults to 0.7). (Used in 'winkler' mode only.)
271		scaling_factor : float
272		A value between 0 and 0.25, indicating by how much to boost scores for
273		matching prefixes (defaults to 0.1). (Used in 'winkler' mode only.)
274
275		Returns
276		-------
277		float
278		Jaro or Jaro-Winkler similarity
279
280		Examples
281		--------
282		>>> round(sim_jaro_winkler('cat', 'hat'), 12)
283		0.777777777778
284		>>> round(sim_jaro_winkler('Niall', 'Neil'), 12)
285		0.805
286		>>> round(sim_jaro_winkler('aluminum', 'Catalan'), 12)
287		0.60119047619
288		>>> round(sim_jaro_winkler('ATCG', 'TAGC'), 12)
289		0.833333333333
290
291		>>> round(sim_jaro_winkler('cat', 'hat', mode='jaro'), 12)
292		0.777777777778
293		>>> round(sim_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
294		0.783333333333
295		>>> round(sim_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
296		0.60119047619
297		>>> round(sim_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
298		0.833333333333
299
300		"""
301	1	return JaroWinkler().sim(
302		src, tar, qval, mode, long_strings, boost_threshold, scaling_factor
303		)
304
305
306	1	def dist_jaro_winkler(
		0 ignored issues – show best-practice introduced 2018-11-04 08:02 UTC by Report Bug Copy Issue Report Too many arguments (7/5) Loading history...
307		src,
		0 ignored issues – show Coding Style introduced 2018-11-04 08:02 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
308		tar,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
309		qval=1,
		0 ignored issues – show Coding Style introduced 2018-11-04 08:02 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
310		mode='winkler',
		0 ignored issues – show Coding Style introduced 2018-11-04 08:02 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
311		long_strings=False,
		0 ignored issues – show Coding Style introduced 2018-11-04 08:02 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
312		boost_threshold=0.7,
		0 ignored issues – show Coding Style introduced 2018-11-04 08:02 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
313		scaling_factor=0.1,
		0 ignored issues – show Coding Style introduced 2018-11-04 08:02 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
314		):
315		"""Return the Jaro or Jaro-Winkler distance between two strings.
316
317		This is a wrapper for :py:meth:`JaroWinkler.dist`.
318
319		Parameters
320		----------
321		src : str
322		Source string for comparison
323		tar : str
324		Target string for comparison
325		qval : int
326		The length of each q-gram (defaults to 1: character-wise matching)
327		mode : str
328		Indicates which variant of this distance metric to compute:
329
330		- ``winkler`` -- computes the Jaro-Winkler distance (default) which
331		increases the score for matches near the start of the word
332		- ``jaro`` -- computes the Jaro distance
333
334		long_strings : bool
335		Set to True to "Increase the probability of a match when the number of
336		matched characters is large. This option allows for a little more
337		tolerance when the strings are large. It is not an appropriate test
338		when comparing fixedlength fields such as phone and social security
339		numbers." (Used in 'winkler' mode only.)
340		boost_threshold : float
341		A value between 0 and 1, below which the Winkler boost is not applied
342		(defaults to 0.7). (Used in 'winkler' mode only.)
343		scaling_factor : float
344		A value between 0 and 0.25, indicating by how much to boost scores for
345		matching prefixes (defaults to 0.1). (Used in 'winkler' mode only.)
346
347		Returns
348		-------
349		float
350		Jaro or Jaro-Winkler distance
351
352		Examples
353		--------
354		>>> round(dist_jaro_winkler('cat', 'hat'), 12)
355		0.222222222222
356		>>> round(dist_jaro_winkler('Niall', 'Neil'), 12)
357		0.195
358		>>> round(dist_jaro_winkler('aluminum', 'Catalan'), 12)
359		0.39880952381
360		>>> round(dist_jaro_winkler('ATCG', 'TAGC'), 12)
361		0.166666666667
362
363		>>> round(dist_jaro_winkler('cat', 'hat', mode='jaro'), 12)
364		0.222222222222
365		>>> round(dist_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
366		0.216666666667
367		>>> round(dist_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
368		0.39880952381
369		>>> round(dist_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
370		0.166666666667
371
372		"""
373	1	return JaroWinkler().dist(
374		src, tar, qval, mode, long_strings, boost_threshold, scaling_factor
375		)
376
377
378		if __name__ == '__main__':
379		import doctest
380
381		doctest.testmod()
382

chrislit / abydos

Push — master ( f43547...71985b )

abydos.distance._jaro_winkler.JaroWinkler.sim() F

Complexity

Size

Duplication

Code Coverage

Importance

How to fix Long Method Complexity Many Parameters

Long Method

Complexity

Many Parameters

Duplication Side-by-Side

Filter issues like