abydos.distance._jaro - Code Metrics - Inspection of "78a222a9f7d8976f6744d263e3d6d01a2a991c27" - chrislit/abydos - Measure and Improve Code Quality continuously with Scrutinizer

Completed

Branch — master (78a222)

by Chris

created 2018-10-26 11:30 UTC

abydos.distance._jaro F

↳ Parent: Project

Complexity

Total Complexity

Size/Duplication

Total Lines	490
Duplicated Lines	0 %

Test Coverage

Coverage

100%

Importance

Changes

Metric	Value
wmc	70
eloc	212
dl	0
loc	490
ccs	137
cts	137
cp	1
rs	2.8
c	0
b	0
f	0

4 Functions

Rating	Name	Size	Complexity
F	sim_strcmp95()	185	37
F	sim_jaro_winkler()	160	31
A	dist_strcmp95()	26	1
A	dist_jaro_winkler()	60	1

How to fix Complexity

# -*- coding: utf-8 -*-

# Copyright 2014-2018 by Christopher C. Little.
# This file is part of Abydos.
#
# Abydos is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Abydos is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.

"""abydos.distance.jaro.

The distance.jaro module implements distance metrics based on
:cite:`Jaro:1989` and subsequent works:

    - Jaro distance
    - Jaro-Winkler distance
    - the strcmp95 algorithm variant of Jaro-Winkler distance
"""

from __future__ import division, unicode_literals

from collections import defaultdict

from six.moves import range

from ..tokenizer import QGrams


__all__ = [
    'dist_jaro_winkler',
    'dist_strcmp95',
    'sim_jaro_winkler',
    'sim_strcmp95',
]


def sim_strcmp95(src, tar, long_strings=False):

    """Return the strcmp95 similarity of two strings.

    This is a Python translation of the C code for strcmp95:
    http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c
    :cite:`Winkler:1994`.
    The above file is a US Government publication and, accordingly,
    in the public domain.

    This is based on the Jaro-Winkler distance, but also attempts to correct
    for some common typos and frequently confused characters. It is also
    limited to uppercase ASCII characters, so it is appropriate to American
    names, but not much else.

    :param str src: source string for comparison
    :param str tar: target string for comparison
    :param bool long_strings: set to True to "Increase the probability of a
        match when the number of matched characters is large.  This option
        allows for a little more tolerance when the strings are large. It is
        not an appropriate test when comparing fixed length fields such as
        phone and social security numbers."
    :returns: strcmp95 similarity
    :rtype: float

    >>> sim_strcmp95('cat', 'hat')
    0.7777777777777777
    >>> sim_strcmp95('Niall', 'Neil')
    0.8454999999999999
    >>> sim_strcmp95('aluminum', 'Catalan')
    0.6547619047619048
    >>> sim_strcmp95('ATCG', 'TAGC')
    0.8333333333333334
    """

    def _in_range(char):
        """Return True if char is in the range (0, 91)."""
        return 91 > ord(char) > 0

    ying = src.strip().upper()
    yang = tar.strip().upper()

    if ying == yang:
        return 1.0
    # If either string is blank - return - added in Version 2
    if not ying or not yang:
        return 0.0

    adjwt = defaultdict(int)
    sp_mx = (
        ('A', 'E'),
        ('A', 'I'),
        ('A', 'O'),
        ('A', 'U'),
        ('B', 'V'),
        ('E', 'I'),
        ('E', 'O'),
        ('E', 'U'),
        ('I', 'O'),
        ('I', 'U'),
        ('O', 'U'),
        ('I', 'Y'),
        ('E', 'Y'),
        ('C', 'G'),
        ('E', 'F'),
        ('W', 'U'),
        ('W', 'V'),
        ('X', 'K'),
        ('S', 'Z'),
        ('X', 'S'),
        ('Q', 'C'),
        ('U', 'V'),
        ('M', 'N'),
        ('L', 'I'),
        ('Q', 'O'),
        ('P', 'R'),
        ('I', 'J'),
        ('2', 'Z'),
        ('5', 'S'),
        ('8', 'B'),
        ('1', 'I'),
        ('1', 'L'),
        ('0', 'O'),
        ('0', 'Q'),
        ('C', 'K'),
        ('G', 'J'),
    )

    # Initialize the adjwt array on the first call to the function only.
    # The adjwt array is used to give partial credit for characters that
    # may be errors due to known phonetic or character recognition errors.
    # A typical example is to match the letter "O" with the number "0"
    for i in sp_mx:
        adjwt[(i[0], i[1])] = 3
        adjwt[(i[1], i[0])] = 3

    if len(ying) > len(yang):
        search_range = len(ying)
        minv = len(yang)
    else:
        search_range = len(yang)
        minv = len(ying)

    # Blank out the flags
    ying_flag = [0] * search_range
    yang_flag = [0] * search_range
    search_range = max(0, search_range // 2 - 1)

    # Looking only within the search range, count and flag the matched pairs.
    num_com = 0
    yl1 = len(yang) - 1
    for i in range(len(ying)):

        low_lim = (i - search_range) if (i >= search_range) else 0
        hi_lim = (i + search_range) if ((i + search_range) <= yl1) else yl1
        for j in range(low_lim, hi_lim + 1):
            if (yang_flag[j] == 0) and (yang[j] == ying[i]):
                yang_flag[j] = 1
                ying_flag[i] = 1
                num_com += 1
                break

    # If no characters in common - return
    if num_com == 0:
        return 0.0

    # Count the number of transpositions
    k = n_trans = 0
    for i in range(len(ying)):

        if ying_flag[i] != 0:
            j = 0
            for j in range(k, len(yang)):  # pragma: no branch
                if yang_flag[j] != 0:
                    k = j + 1
                    break
            if ying[i] != yang[j]:
                n_trans += 1
    n_trans //= 2

    # Adjust for similarities in unmatched characters
    n_simi = 0
    if minv > num_com:

        for i in range(len(ying)):

            if ying_flag[i] == 0 and _in_range(ying[i]):
                for j in range(len(yang)):

                    if yang_flag[j] == 0 and _in_range(yang[j]):
                        if (ying[i], yang[j]) in adjwt:
                            n_simi += adjwt[(ying[i], yang[j])]
                            yang_flag[j] = 2
                            break
    num_sim = n_simi / 10.0 + num_com

    # Main weight computation
    weight = (
        num_sim / len(ying)
        + num_sim / len(yang)
        + (num_com - n_trans) / num_com
    )
    weight /= 3.0

    # Continue to boost the weight if the strings are similar
    if weight > 0.7:

        # Adjust for having up to the first 4 characters in common
        j = 4 if (minv >= 4) else minv
        i = 0
        while (i < j) and (ying[i] == yang[i]) and (not ying[i].isdigit()):
            i += 1
        if i:
            weight += i * 0.1 * (1.0 - weight)

        # Optionally adjust for long strings.

        # After agreeing beginning chars, at least two more must agree and
        # the agreeing characters must be > .5 of remaining characters.
        if (
            long_strings

            and (minv > 4)

            and (num_com > i + 1)

            and (2 * num_com >= minv + i)

        ):
            if not ying[0].isdigit():
                weight += (1.0 - weight) * (
                    (num_com - i - 1) / (len(ying) + len(yang) - i * 2 + 2)
                )

    return weight


def dist_strcmp95(src, tar, long_strings=False):
    """Return the strcmp95 distance between two strings.

    strcmp95 distance is the complement of strcmp95 similarity:
    :math:`dist_{strcmp95} = 1 - sim_{strcmp95}`.

    :param str src: source string for comparison
    :param str tar: target string for comparison
    :param bool long_strings: set to True to "Increase the probability of a
        match when the number of matched characters is large.  This option
        allows for a little more tolerance when the strings are large. It is
        not an appropriate test when comparing fixed length fields such as
        phone and social security numbers."
    :returns: strcmp95 distance
    :rtype: float

    >>> round(dist_strcmp95('cat', 'hat'), 12)
    0.222222222222
    >>> round(dist_strcmp95('Niall', 'Neil'), 12)
    0.1545
    >>> round(dist_strcmp95('aluminum', 'Catalan'), 12)
    0.345238095238
    >>> round(dist_strcmp95('ATCG', 'TAGC'), 12)
    0.166666666667
    """
    return 1 - sim_strcmp95(src, tar, long_strings)


def sim_jaro_winkler(

    src,

    tar,

    qval=1,

    mode='winkler',

    long_strings=False,

    boost_threshold=0.7,

    scaling_factor=0.1,

):
    """Return the Jaro or Jaro-Winkler similarity of two strings.

    Jaro(-Winkler) distance is a string edit distance initially proposed by
    Jaro and extended by Winkler :cite:`Jaro:1989,Winkler:1990`.

    This is Python based on the C code for strcmp95:
    http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c
    :cite:`Winkler:1994`. The above file is a US Government publication and,
    accordingly, in the public domain.

    :param str src: source string for comparison
    :param str tar: target string for comparison
    :param int qval: the length of each q-gram (defaults to 1: character-wise
        matching)
    :param str mode: indicates which variant of this distance metric to
        compute:

            - 'winkler' -- computes the Jaro-Winkler distance (default) which
              increases the score for matches near the start of the word
            - 'jaro' -- computes the Jaro distance

    The following arguments apply only when mode is 'winkler':

    :param bool long_strings: set to True to "Increase the probability of a
        match when the number of matched characters is large.  This option
        allows for a little more tolerance when the strings are large.  It is
        not an appropriate test when comparing fixed length fields such as
        phone and social security numbers."
    :param float boost_threshold: a value between 0 and 1, below which the
        Winkler boost is not applied (defaults to 0.7)
    :param float scaling_factor: a value between 0 and 0.25, indicating by how
        much to boost scores for matching prefixes (defaults to 0.1)

    :returns: Jaro or Jaro-Winkler similarity
    :rtype: float

    >>> round(sim_jaro_winkler('cat', 'hat'), 12)
    0.777777777778
    >>> round(sim_jaro_winkler('Niall', 'Neil'), 12)
    0.805
    >>> round(sim_jaro_winkler('aluminum', 'Catalan'), 12)
    0.60119047619
    >>> round(sim_jaro_winkler('ATCG', 'TAGC'), 12)
    0.833333333333

    >>> round(sim_jaro_winkler('cat', 'hat', mode='jaro'), 12)
    0.777777777778
    >>> round(sim_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
    0.783333333333
    >>> round(sim_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
    0.60119047619
    >>> round(sim_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
    0.833333333333
    """
    if mode == 'winkler':
        if boost_threshold > 1 or boost_threshold < 0:
            raise ValueError(
                'Unsupported boost_threshold assignment; '
                + 'boost_threshold must be between 0 and 1.'
            )
        if scaling_factor > 0.25 or scaling_factor < 0:
            raise ValueError(
                'Unsupported scaling_factor assignment; '
                + 'scaling_factor must be between 0 and 0.25.'
            )

    if src == tar:
        return 1.0

    src = QGrams(src.strip(), qval).ordered_list
    tar = QGrams(tar.strip(), qval).ordered_list

    lens = len(src)
    lent = len(tar)

    # If either string is blank - return - added in Version 2
    if lens == 0 or lent == 0:
        return 0.0

    if lens > lent:
        search_range = lens
        minv = lent
    else:
        search_range = lent
        minv = lens

    # Zero out the flags
    src_flag = [0] * search_range
    tar_flag = [0] * search_range
    search_range = max(0, search_range // 2 - 1)

    # Looking only within the search range, count and flag the matched pairs.
    num_com = 0
    yl1 = lent - 1
    for i in range(lens):
        low_lim = (i - search_range) if (i >= search_range) else 0
        hi_lim = (i + search_range) if ((i + search_range) <= yl1) else yl1
        for j in range(low_lim, hi_lim + 1):
            if (tar_flag[j] == 0) and (tar[j] == src[i]):
                tar_flag[j] = 1
                src_flag[i] = 1
                num_com += 1
                break

    # If no characters in common - return
    if num_com == 0:
        return 0.0

    # Count the number of transpositions
    k = n_trans = 0
    for i in range(lens):
        if src_flag[i] != 0:
            j = 0
            for j in range(k, lent):  # pragma: no branch
                if tar_flag[j] != 0:
                    k = j + 1
                    break
            if src[i] != tar[j]:
                n_trans += 1
    n_trans //= 2

    # Main weight computation for Jaro distance
    weight = num_com / lens + num_com / lent + (num_com - n_trans) / num_com
    weight /= 3.0

    # Continue to boost the weight if the strings are similar
    # This is the Winkler portion of Jaro-Winkler distance
    if mode == 'winkler' and weight > boost_threshold:

        # Adjust for having up to the first 4 characters in common
        j = 4 if (minv >= 4) else minv
        i = 0
        while (i < j) and (src[i] == tar[i]):
            i += 1
        weight += i * scaling_factor * (1.0 - weight)

        # Optionally adjust for long strings.

        # After agreeing beginning chars, at least two more must agree and
        # the agreeing characters must be > .5 of remaining characters.
        if (
            long_strings

            and (minv > 4)

            and (num_com > i + 1)

            and (2 * num_com >= minv + i)

        ):
            weight += (1.0 - weight) * (
                (num_com - i - 1) / (lens + lent - i * 2 + 2)
            )

    return weight


def dist_jaro_winkler(

    src,

    tar,

    qval=1,

    mode='winkler',

    long_strings=False,

    boost_threshold=0.7,

    scaling_factor=0.1,

):
    """Return the Jaro or Jaro-Winkler distance between two strings.

    Jaro(-Winkler) similarity is the complement of Jaro(-Winkler) distance:
    :math:`sim_{Jaro(-Winkler)} = 1 - dist_{Jaro(-Winkler)}`.

    :param str src: source string for comparison
    :param str tar: target string for comparison
    :param int qval: the length of each q-gram (defaults to 1: character-wise
        matching)
    :param str mode: indicates which variant of this distance metric to
        compute:

            - 'winkler' -- computes the Jaro-Winkler distance (default) which
              increases the score for matches near the start of the word
            - 'jaro' -- computes the Jaro distance

    The following arguments apply only when mode is 'winkler':

    :param bool long_strings: set to True to "Increase the probability of a
        match when the number of matched characters is large.  This option
        allows for a little more tolerance when the strings are large.  It is
        not an appropriate test when comparing fixed length fields such as
        phone and social security numbers."
    :param float boost_threshold: a value between 0 and 1, below which the
        Winkler boost is not applied (defaults to 0.7)
    :param float scaling_factor: a value between 0 and 0.25, indicating by how
        much to boost scores for matching prefixes (defaults to 0.1)

    :returns: Jaro or Jaro-Winkler distance
    :rtype: float

    >>> round(dist_jaro_winkler('cat', 'hat'), 12)
    0.222222222222
    >>> round(dist_jaro_winkler('Niall', 'Neil'), 12)
    0.195
    >>> round(dist_jaro_winkler('aluminum', 'Catalan'), 12)
    0.39880952381
    >>> round(dist_jaro_winkler('ATCG', 'TAGC'), 12)
    0.166666666667

    >>> round(dist_jaro_winkler('cat', 'hat', mode='jaro'), 12)
    0.222222222222
    >>> round(dist_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
    0.216666666667
    >>> round(dist_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
    0.39880952381
    >>> round(dist_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
    0.166666666667
    """
    return 1 - sim_jaro_winkler(
        src, tar, qval, mode, long_strings, boost_threshold, scaling_factor
    )


if __name__ == '__main__':
    import doctest

    doctest.testmod()


1		# -- coding: utf-8 --
2
3		# Copyright 2014-2018 by Christopher C. Little.
4		# This file is part of Abydos.
5		#
6		# Abydos is free software: you can redistribute it and/or modify
7		# it under the terms of the GNU General Public License as published by
8		# the Free Software Foundation, either version 3 of the License, or
9		# (at your option) any later version.
10		#
11		# Abydos is distributed in the hope that it will be useful,
12		# but WITHOUT ANY WARRANTY; without even the implied warranty of
13		# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14		# GNU General Public License for more details.
15		#
16		# You should have received a copy of the GNU General Public License
17		# along with Abydos. If not, see <http://www.gnu.org/licenses/>.
18
19	1	"""abydos.distance.jaro.
20
21		The distance.jaro module implements distance metrics based on
22		:cite:`Jaro:1989` and subsequent works:
23
24		- Jaro distance
25		- Jaro-Winkler distance
26		- the strcmp95 algorithm variant of Jaro-Winkler distance
27		"""
28
29	1	from __future__ import division, unicode_literals
30
31	1	from collections import defaultdict
32
33	1	from six.moves import range
34
35	1	from ..tokenizer import QGrams
36
37
38	1	__all__ = [
39		'dist_jaro_winkler',
40		'dist_strcmp95',
41		'sim_jaro_winkler',
42		'sim_strcmp95',
43		]
44
45
46	1	def sim_strcmp95(src, tar, long_strings=False):
		0 ignored issues – show Comprehensibility introduced 2018-08-02 19:04 UTC by Report Bug Copy Issue Report This function exceeds the maximum number of variables (23/15). Loading history...
47		"""Return the strcmp95 similarity of two strings.
48
49		This is a Python translation of the C code for strcmp95:
50		http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c
51		:cite:`Winkler:1994`.
52		The above file is a US Government publication and, accordingly,
53		in the public domain.
54
55		This is based on the Jaro-Winkler distance, but also attempts to correct
56		for some common typos and frequently confused characters. It is also
57		limited to uppercase ASCII characters, so it is appropriate to American
58		names, but not much else.
59
60		:param str src: source string for comparison
61		:param str tar: target string for comparison
62		:param bool long_strings: set to True to "Increase the probability of a
63		match when the number of matched characters is large. This option
64		allows for a little more tolerance when the strings are large. It is
65		not an appropriate test when comparing fixed length fields such as
66		phone and social security numbers."
67		:returns: strcmp95 similarity
68		:rtype: float
69
70		>>> sim_strcmp95('cat', 'hat')
71		0.7777777777777777
72		>>> sim_strcmp95('Niall', 'Neil')
73		0.8454999999999999
74		>>> sim_strcmp95('aluminum', 'Catalan')
75		0.6547619047619048
76		>>> sim_strcmp95('ATCG', 'TAGC')
77		0.8333333333333334
78		"""
79
80	1	def _in_range(char):
81		"""Return True if char is in the range (0, 91)."""
82	1	return 91 > ord(char) > 0
83
84	1	ying = src.strip().upper()
85	1	yang = tar.strip().upper()
86
87	1	if ying == yang:
88	1	return 1.0
89		# If either string is blank - return - added in Version 2
90	1	if not ying or not yang:
91	1	return 0.0
92
93	1	adjwt = defaultdict(int)
94	1	sp_mx = (
95		('A', 'E'),
96		('A', 'I'),
97		('A', 'O'),
98		('A', 'U'),
99		('B', 'V'),
100		('E', 'I'),
101		('E', 'O'),
102		('E', 'U'),
103		('I', 'O'),
104		('I', 'U'),
105		('O', 'U'),
106		('I', 'Y'),
107		('E', 'Y'),
108		('C', 'G'),
109		('E', 'F'),
110		('W', 'U'),
111		('W', 'V'),
112		('X', 'K'),
113		('S', 'Z'),
114		('X', 'S'),
115		('Q', 'C'),
116		('U', 'V'),
117		('M', 'N'),
118		('L', 'I'),
119		('Q', 'O'),
120		('P', 'R'),
121		('I', 'J'),
122		('2', 'Z'),
123		('5', 'S'),
124		('8', 'B'),
125		('1', 'I'),
126		('1', 'L'),
127		('0', 'O'),
128		('0', 'Q'),
129		('C', 'K'),
130		('G', 'J'),
131		)
132
133		# Initialize the adjwt array on the first call to the function only.
134		# The adjwt array is used to give partial credit for characters that
135		# may be errors due to known phonetic or character recognition errors.
136		# A typical example is to match the letter "O" with the number "0"
137	1	for i in sp_mx:
138	1	adjwt[(i[0], i[1])] = 3
139	1	adjwt[(i[1], i[0])] = 3
140
141	1	if len(ying) > len(yang):
142	1	search_range = len(ying)
143	1	minv = len(yang)
144		else:
145	1	search_range = len(yang)
146	1	minv = len(ying)
147
148		# Blank out the flags
149	1	ying_flag = [0] * search_range
150	1	yang_flag = [0] * search_range
151	1	search_range = max(0, search_range // 2 - 1)
152
153		# Looking only within the search range, count and flag the matched pairs.
154	1	num_com = 0
155	1	yl1 = len(yang) - 1
156	1	for i in range(len(ying)):
		0 ignored issues – show unused-code introduced 2018-08-02 19:04 UTC by Report Bug Copy Issue Report Consider using enumerate instead of iterating with range and len Loading history...
157	1	low_lim = (i - search_range) if (i >= search_range) else 0
158	1	hi_lim = (i + search_range) if ((i + search_range) <= yl1) else yl1
159	1	for j in range(low_lim, hi_lim + 1):
160	1	if (yang_flag[j] == 0) and (yang[j] == ying[i]):
161	1	yang_flag[j] = 1
162	1	ying_flag[i] = 1
163	1	num_com += 1
164	1	break
165
166		# If no characters in common - return
167	1	if num_com == 0:
168	1	return 0.0
169
170		# Count the number of transpositions
171	1	k = n_trans = 0
172	1	for i in range(len(ying)):
		0 ignored issues – show unused-code introduced 2018-08-02 19:04 UTC by Report Bug Copy Issue Report Consider using enumerate instead of iterating with range and len Loading history...
173	1	if ying_flag[i] != 0:
174	1	j = 0
175	1	for j in range(k, len(yang)): # pragma: no branch
176	1	if yang_flag[j] != 0:
177	1	k = j + 1
178	1	break
179	1	if ying[i] != yang[j]:
180	1	n_trans += 1
181	1	n_trans //= 2
182
183		# Adjust for similarities in unmatched characters
184	1	n_simi = 0
185	1	if minv > num_com:
		0 ignored issues – show unused-code introduced 2018-08-02 19:04 UTC by Report Bug Copy Issue Report Too many nested blocks (6/5) Loading history...
186	1	for i in range(len(ying)):
		0 ignored issues – show unused-code introduced 2018-08-02 19:04 UTC by Report Bug Copy Issue Report Consider using enumerate instead of iterating with range and len Loading history...
187	1	if ying_flag[i] == 0 and _in_range(ying[i]):
188	1	for j in range(len(yang)):
		0 ignored issues – show unused-code introduced 2018-08-02 19:04 UTC by Report Bug Copy Issue Report Consider using enumerate instead of iterating with range and len Loading history...
189	1	if yang_flag[j] == 0 and _in_range(yang[j]):
190	1	if (ying[i], yang[j]) in adjwt:
191	1	n_simi += adjwt[(ying[i], yang[j])]
192	1	yang_flag[j] = 2
193	1	break
194	1	num_sim = n_simi / 10.0 + num_com
195
196		# Main weight computation
197	1	weight = (
198		num_sim / len(ying)
199		+ num_sim / len(yang)
200		+ (num_com - n_trans) / num_com
201		)
202	1	weight /= 3.0
203
204		# Continue to boost the weight if the strings are similar
205	1	if weight > 0.7:
206
207		# Adjust for having up to the first 4 characters in common
208	1	j = 4 if (minv >= 4) else minv
209	1	i = 0
210	1	while (i < j) and (ying[i] == yang[i]) and (not ying[i].isdigit()):
211	1	i += 1
212	1	if i:
213	1	weight += i * 0.1 * (1.0 - weight)
214
215		# Optionally adjust for long strings.
216
217		# After agreeing beginning chars, at least two more must agree and
218		# the agreeing characters must be > .5 of remaining characters.
219	1	if (
220		long_strings
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
221		and (minv > 4)
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
222		and (num_com > i + 1)
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
223		and (2 * num_com >= minv + i)
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
224		):
225	1	if not ying[0].isdigit():
226	1	weight += (1.0 - weight) * (
227		(num_com - i - 1) / (len(ying) + len(yang) - i * 2 + 2)
228		)
229
230	1	return weight
231
232
233	1	def dist_strcmp95(src, tar, long_strings=False):
234		"""Return the strcmp95 distance between two strings.
235
236		strcmp95 distance is the complement of strcmp95 similarity:
237		:math:`dist_{strcmp95} = 1 - sim_{strcmp95}`.
238
239		:param str src: source string for comparison
240		:param str tar: target string for comparison
241		:param bool long_strings: set to True to "Increase the probability of a
242		match when the number of matched characters is large. This option
243		allows for a little more tolerance when the strings are large. It is
244		not an appropriate test when comparing fixed length fields such as
245		phone and social security numbers."
246		:returns: strcmp95 distance
247		:rtype: float
248
249		>>> round(dist_strcmp95('cat', 'hat'), 12)
250		0.222222222222
251		>>> round(dist_strcmp95('Niall', 'Neil'), 12)
252		0.1545
253		>>> round(dist_strcmp95('aluminum', 'Catalan'), 12)
254		0.345238095238
255		>>> round(dist_strcmp95('ATCG', 'TAGC'), 12)
256		0.166666666667
257		"""
258	1	return 1 - sim_strcmp95(src, tar, long_strings)
259
260
261	1	def sim_jaro_winkler(
		0 ignored issues – show best-practice introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Too many arguments (7/5) Loading history... Comprehensibility introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report This function exceeds the maximum number of variables (22/15). Loading history...
262		src,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
263		tar,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
264		qval=1,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
265		mode='winkler',
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
266		long_strings=False,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
267		boost_threshold=0.7,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
268		scaling_factor=0.1,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
269		):
270		"""Return the Jaro or Jaro-Winkler similarity of two strings.
271
272		Jaro(-Winkler) distance is a string edit distance initially proposed by
273		Jaro and extended by Winkler :cite:`Jaro:1989,Winkler:1990`.
274
275		This is Python based on the C code for strcmp95:
276		http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c
277		:cite:`Winkler:1994`. The above file is a US Government publication and,
278		accordingly, in the public domain.
279
280		:param str src: source string for comparison
281		:param str tar: target string for comparison
282		:param int qval: the length of each q-gram (defaults to 1: character-wise
283		matching)
284		:param str mode: indicates which variant of this distance metric to
285		compute:
286
287		- 'winkler' -- computes the Jaro-Winkler distance (default) which
288		increases the score for matches near the start of the word
289		- 'jaro' -- computes the Jaro distance
290
291		The following arguments apply only when mode is 'winkler':
292
293		:param bool long_strings: set to True to "Increase the probability of a
294		match when the number of matched characters is large. This option
295		allows for a little more tolerance when the strings are large. It is
296		not an appropriate test when comparing fixed length fields such as
297		phone and social security numbers."
298		:param float boost_threshold: a value between 0 and 1, below which the
299		Winkler boost is not applied (defaults to 0.7)
300		:param float scaling_factor: a value between 0 and 0.25, indicating by how
301		much to boost scores for matching prefixes (defaults to 0.1)
302
303		:returns: Jaro or Jaro-Winkler similarity
304		:rtype: float
305
306		>>> round(sim_jaro_winkler('cat', 'hat'), 12)
307		0.777777777778
308		>>> round(sim_jaro_winkler('Niall', 'Neil'), 12)
309		0.805
310		>>> round(sim_jaro_winkler('aluminum', 'Catalan'), 12)
311		0.60119047619
312		>>> round(sim_jaro_winkler('ATCG', 'TAGC'), 12)
313		0.833333333333
314
315		>>> round(sim_jaro_winkler('cat', 'hat', mode='jaro'), 12)
316		0.777777777778
317		>>> round(sim_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
318		0.783333333333
319		>>> round(sim_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
320		0.60119047619
321		>>> round(sim_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
322		0.833333333333
323		"""
324	1	if mode == 'winkler':
325	1	if boost_threshold > 1 or boost_threshold < 0:
326	1	raise ValueError(
327		'Unsupported boost_threshold assignment; '
328		+ 'boost_threshold must be between 0 and 1.'
329		)
330	1	if scaling_factor > 0.25 or scaling_factor < 0:
331	1	raise ValueError(
332		'Unsupported scaling_factor assignment; '
333		+ 'scaling_factor must be between 0 and 0.25.'
334		)
335
336	1	if src == tar:
337	1	return 1.0
338
339	1	src = QGrams(src.strip(), qval).ordered_list
340	1	tar = QGrams(tar.strip(), qval).ordered_list
341
342	1	lens = len(src)
343	1	lent = len(tar)
344
345		# If either string is blank - return - added in Version 2
346	1	if lens == 0 or lent == 0:
347	1	return 0.0
348
349	1	if lens > lent:
350	1	search_range = lens
351	1	minv = lent
352		else:
353	1	search_range = lent
354	1	minv = lens
355
356		# Zero out the flags
357	1	src_flag = [0] * search_range
358	1	tar_flag = [0] * search_range
359	1	search_range = max(0, search_range // 2 - 1)
360
361		# Looking only within the search range, count and flag the matched pairs.
362	1	num_com = 0
363	1	yl1 = lent - 1
364	1	for i in range(lens):
365	1	low_lim = (i - search_range) if (i >= search_range) else 0
366	1	hi_lim = (i + search_range) if ((i + search_range) <= yl1) else yl1
367	1	for j in range(low_lim, hi_lim + 1):
368	1	if (tar_flag[j] == 0) and (tar[j] == src[i]):
369	1	tar_flag[j] = 1
370	1	src_flag[i] = 1
371	1	num_com += 1
372	1	break
373
374		# If no characters in common - return
375	1	if num_com == 0:
376	1	return 0.0
377
378		# Count the number of transpositions
379	1	k = n_trans = 0
380	1	for i in range(lens):
381	1	if src_flag[i] != 0:
382	1	j = 0
383	1	for j in range(k, lent): # pragma: no branch
384	1	if tar_flag[j] != 0:
385	1	k = j + 1
386	1	break
387	1	if src[i] != tar[j]:
388	1	n_trans += 1
389	1	n_trans //= 2
390
391		# Main weight computation for Jaro distance
392	1	weight = num_com / lens + num_com / lent + (num_com - n_trans) / num_com
393	1	weight /= 3.0
394
395		# Continue to boost the weight if the strings are similar
396		# This is the Winkler portion of Jaro-Winkler distance
397	1	if mode == 'winkler' and weight > boost_threshold:
398
399		# Adjust for having up to the first 4 characters in common
400	1	j = 4 if (minv >= 4) else minv
401	1	i = 0
402	1	while (i < j) and (src[i] == tar[i]):
403	1	i += 1
404	1	weight += i * scaling_factor * (1.0 - weight)
405
406		# Optionally adjust for long strings.
407
408		# After agreeing beginning chars, at least two more must agree and
409		# the agreeing characters must be > .5 of remaining characters.
410	1	if (
411		long_strings
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
412		and (minv > 4)
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
413		and (num_com > i + 1)
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
414		and (2 * num_com >= minv + i)
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
415		):
416	1	weight += (1.0 - weight) * (
417		(num_com - i - 1) / (lens + lent - i * 2 + 2)
418		)
419
420	1	return weight
421
422
423	1	def dist_jaro_winkler(
		0 ignored issues – show best-practice introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Too many arguments (7/5) Loading history...
424		src,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
425		tar,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
426		qval=1,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
427		mode='winkler',
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
428		long_strings=False,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
429		boost_threshold=0.7,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
430		scaling_factor=0.1,
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
431		):
432		"""Return the Jaro or Jaro-Winkler distance between two strings.
433
434		Jaro(-Winkler) similarity is the complement of Jaro(-Winkler) distance:
435		:math:`sim_{Jaro(-Winkler)} = 1 - dist_{Jaro(-Winkler)}`.
436
437		:param str src: source string for comparison
438		:param str tar: target string for comparison
439		:param int qval: the length of each q-gram (defaults to 1: character-wise
440		matching)
441		:param str mode: indicates which variant of this distance metric to
442		compute:
443
444		- 'winkler' -- computes the Jaro-Winkler distance (default) which
445		increases the score for matches near the start of the word
446		- 'jaro' -- computes the Jaro distance
447
448		The following arguments apply only when mode is 'winkler':
449
450		:param bool long_strings: set to True to "Increase the probability of a
451		match when the number of matched characters is large. This option
452		allows for a little more tolerance when the strings are large. It is
453		not an appropriate test when comparing fixed length fields such as
454		phone and social security numbers."
455		:param float boost_threshold: a value between 0 and 1, below which the
456		Winkler boost is not applied (defaults to 0.7)
457		:param float scaling_factor: a value between 0 and 0.25, indicating by how
458		much to boost scores for matching prefixes (defaults to 0.1)
459
460		:returns: Jaro or Jaro-Winkler distance
461		:rtype: float
462
463		>>> round(dist_jaro_winkler('cat', 'hat'), 12)
464		0.222222222222
465		>>> round(dist_jaro_winkler('Niall', 'Neil'), 12)
466		0.195
467		>>> round(dist_jaro_winkler('aluminum', 'Catalan'), 12)
468		0.39880952381
469		>>> round(dist_jaro_winkler('ATCG', 'TAGC'), 12)
470		0.166666666667
471
472		>>> round(dist_jaro_winkler('cat', 'hat', mode='jaro'), 12)
473		0.222222222222
474		>>> round(dist_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
475		0.216666666667
476		>>> round(dist_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
477		0.39880952381
478		>>> round(dist_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
479		0.166666666667
480		"""
481	1	return 1 - sim_jaro_winkler(
482		src, tar, qval, mode, long_strings, boost_threshold, scaling_factor
483		)
484
485
486		if __name__ == '__main__':
487		import doctest
488
489		doctest.testmod()
490

chrislit / abydos

Branch — master (78a222)

abydos.distance._jaro F

Complexity

Size/Duplication

Test Coverage

Importance

4 Functions

How to fix Complexity

Complexity

Duplication Side-by-Side

Filter issues like