abydos.distance._levenshtein - Code Metrics - Inspection of "78a222a9f7d8976f6744d263e3d6d01a2a991c27" - chrislit/abydos - Measure and Improve Code Quality continuously with Scrutinizer

Completed

Branch — master (78a222)

by Chris

created 2018-10-26 11:30 UTC

abydos.distance._levenshtein B

↳ Parent: Project

Complexity

Total Complexity

Size/Duplication

Total Lines	483
Duplicated Lines	0 %

Test Coverage

Coverage

100%

Importance

Changes

Metric	Value
wmc	44
eloc	130
dl	0
loc	483
ccs	96
cts	96
cp	1
rs	8.8798
c	0
b	0
f	0

9 Functions

Rating	Name	Size	Complexity
A	dist_damerau()	37	2
F	damerau_levenshtein()	103	19
A	dist_levenshtein()	43	2
A	sim_indel()	21	1
F	levenshtein()	94	15
A	indel()	21	1
A	sim_damerau()	28	1
A	dist_indel()	23	2
A	sim_levenshtein()	37	1

How to fix Complexity

# -*- coding: utf-8 -*-

# Copyright 2014-2018 by Christopher C. Little.
# This file is part of Abydos.
#
# Abydos is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Abydos is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.

"""abydos.distance.levenshtein.

The distance.levenshtein module implements string edit distance functions
based on Levenshtein distance, including:

    - Levenshtein distance
    - Optimal String Alignment distance
    - Levenshtein-Damerau distance
    - Indel distance
"""

from __future__ import division, unicode_literals

from sys import maxsize

from numpy import int as np_int
from numpy import zeros as np_zeros

from six.moves import range


__all__ = [
    'damerau_levenshtein',
    'dist_damerau',
    'dist_indel',
    'dist_levenshtein',
    'levenshtein',
    'sim_damerau',
    'sim_indel',
    'sim_levenshtein',
]


def levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
    """Return the Levenshtein distance between two strings.

    This is the standard edit distance measure. Cf.
    :cite:`Levenshtein:1965,Levenshtein:1966`.

    Two additional variants: optimal string alignment (aka restricted
    Damerau-Levenshtein distance) :cite:`Boytsov:2011` and the
    Damerau-Levenshtein :cite:`Damerau:1964` distance are also supported.

    The ordinary Levenshtein & Optimal String Alignment distance both
    employ the Wagner-Fischer dynamic programming algorithm
    :cite:`Wagner:1974`.

    Levenshtein edit distance ordinarily has unit insertion, deletion, and
    substitution costs.

    :param str src: source string for comparison
    :param str tar: target string for comparison
    :param str mode: specifies a mode for computing the Levenshtein distance:

        - 'lev' (default) computes the ordinary Levenshtein distance,
          in which edits may include inserts, deletes, and substitutions
        - 'osa' computes the Optimal String Alignment distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions but substrings may only be edited once
        - 'dam' computes the Damerau-Levenshtein distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions and substrings may undergo repeated edits

    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits: inserts, deletes, substitutions, and transpositions,
        respectively (by default: (1, 1, 1, 1))
    :returns: the Levenshtein distance between src & tar
    :rtype: int (may return a float if cost has float values)

    >>> levenshtein('cat', 'hat')
    1
    >>> levenshtein('Niall', 'Neil')
    3
    >>> levenshtein('aluminum', 'Catalan')
    7
    >>> levenshtein('ATCG', 'TAGC')
    3

    >>> levenshtein('ATCG', 'TAGC', mode='osa')
    2
    >>> levenshtein('ACTG', 'TAGC', mode='osa')
    4

    >>> levenshtein('ATCG', 'TAGC', mode='dam')
    2
    >>> levenshtein('ACTG', 'TAGC', mode='dam')
    3
    """
    ins_cost, del_cost, sub_cost, trans_cost = cost

    if src == tar:
        return 0
    if not src:
        return len(tar) * ins_cost
    if not tar:
        return len(src) * del_cost

    if 'dam' in mode:
        return damerau_levenshtein(src, tar, cost)

    d_mat = np_zeros((len(src) + 1, len(tar) + 1), dtype=np_int)
    for i in range(len(src) + 1):
        d_mat[i, 0] = i * del_cost
    for j in range(len(tar) + 1):
        d_mat[0, j] = j * ins_cost

    for i in range(len(src)):

        for j in range(len(tar)):

            d_mat[i + 1, j + 1] = min(
                d_mat[i + 1, j] + ins_cost,  # ins
                d_mat[i, j + 1] + del_cost,  # del
                d_mat[i, j] + (sub_cost if src[i] != tar[j] else 0),  # sub/==
            )

            if mode == 'osa':
                if (
                    i + 1 > 1

                    and j + 1 > 1

                    and src[i] == tar[j - 1]

                    and src[i - 1] == tar[j]

                ):
                    # transposition
                    d_mat[i + 1, j + 1] = min(
                        d_mat[i + 1, j + 1], d_mat[i - 1, j - 1] + trans_cost
                    )

    return d_mat[len(src), len(tar)]


def dist_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
    """Return the normalized Levenshtein distance between two strings.

    The Levenshtein distance is normalized by dividing the Levenshtein distance
    (calculated by any of the three supported methods) by the greater of
    the number of characters in src times the cost of a delete and
    the number of characters in tar times the cost of an insert.
    For the case in which all operations have :math:`cost = 1`, this is
    equivalent to the greater of the length of the two strings src & tar.

    :param str src: source string for comparison
    :param str tar: target string for comparison
    :param str mode: specifies a mode for computing the Levenshtein distance:

        - 'lev' (default) computes the ordinary Levenshtein distance,
          in which edits may include inserts, deletes, and substitutions
        - 'osa' computes the Optimal String Alignment distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions but substrings may only be edited once
        - 'dam' computes the Damerau-Levenshtein distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions and substrings may undergo repeated edits

    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits: inserts, deletes, substitutions, and transpositions,
        respectively (by default: (1, 1, 1, 1))
    :returns: normalized Levenshtein distance
    :rtype: float

    >>> round(dist_levenshtein('cat', 'hat'), 12)
    0.333333333333
    >>> round(dist_levenshtein('Niall', 'Neil'), 12)
    0.6
    >>> dist_levenshtein('aluminum', 'Catalan')
    0.875
    >>> dist_levenshtein('ATCG', 'TAGC')
    0.75
    """
    if src == tar:
        return 0
    ins_cost, del_cost = cost[:2]
    return levenshtein(src, tar, mode, cost) / (
        max(len(src) * del_cost, len(tar) * ins_cost)
    )


def sim_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
    """Return the Levenshtein similarity of two strings.

    Normalized Levenshtein similarity is the complement of normalized
    Levenshtein distance:
    :math:`sim_{Levenshtein} = 1 - dist_{Levenshtein}`.

    :param str src: source string for comparison
    :param str tar: target string for comparison
    :param str mode: specifies a mode for computing the Levenshtein distance:

            - 'lev' (default) computes the ordinary Levenshtein distance,
              in which edits may include inserts, deletes, and substitutions
            - 'osa' computes the Optimal String Alignment distance, in which
              edits may include inserts, deletes, substitutions, and
              transpositions but substrings may only be edited once
            - 'dam' computes the Damerau-Levenshtein distance, in which
              edits may include inserts, deletes, substitutions, and
              transpositions and substrings may undergo repeated edits

    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits:
        inserts, deletes, substitutions, and transpositions, respectively
        (by default: (1, 1, 1, 1))
    :returns: normalized Levenshtein similarity
    :rtype: float

    >>> round(sim_levenshtein('cat', 'hat'), 12)
    0.666666666667
    >>> round(sim_levenshtein('Niall', 'Neil'), 12)
    0.4
    >>> sim_levenshtein('aluminum', 'Catalan')
    0.125
    >>> sim_levenshtein('ATCG', 'TAGC')
    0.25
    """
    return 1 - dist_levenshtein(src, tar, mode, cost)


def damerau_levenshtein(src, tar, cost=(1, 1, 1, 1)):

    """Return the Damerau-Levenshtein distance between two strings.

    This computes the Damerau-Levenshtein distance :cite:`Damerau:1964`.
    Damerau-Levenshtein code is based on Java code by Kevin L. Stern
    :cite:`Stern:2014`, under the MIT license:
    https://github.com/KevinStern/software-and-algorithms/blob/master/src/main/java/blogspot/software_and_algorithms/stern_library/string/DamerauLevenshteinAlgorithm.java

    :param str src: source string for comparison
    :param str tar: target string for comparison
    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits:
        inserts, deletes, substitutions, and transpositions, respectively
        (by default: (1, 1, 1, 1))
    :returns: the Damerau-Levenshtein distance between src & tar
    :rtype: int (may return a float if cost has float values)

    >>> damerau_levenshtein('cat', 'hat')
    1
    >>> damerau_levenshtein('Niall', 'Neil')
    3
    >>> damerau_levenshtein('aluminum', 'Catalan')
    7
    >>> damerau_levenshtein('ATCG', 'TAGC')
    2
    """
    ins_cost, del_cost, sub_cost, trans_cost = cost

    if src == tar:
        return 0
    if not src:
        return len(tar) * ins_cost
    if not tar:
        return len(src) * del_cost

    if 2 * trans_cost < ins_cost + del_cost:
        raise ValueError(
            'Unsupported cost assignment; the cost of two '
            + 'transpositions must not be less than the cost of '
            + 'an insert plus a delete.'
        )

    d_mat = np_zeros((len(src)) * (len(tar)), dtype=np_int).reshape(
        (len(src), len(tar))
    )

    if src[0] != tar[0]:
        d_mat[0, 0] = min(sub_cost, ins_cost + del_cost)

    src_index_by_character = {src[0]: 0}
    for i in range(1, len(src)):
        del_distance = d_mat[i - 1, 0] + del_cost
        ins_distance = (i + 1) * del_cost + ins_cost
        match_distance = i * del_cost + (0 if src[i] == tar[0] else sub_cost)
        d_mat[i, 0] = min(del_distance, ins_distance, match_distance)

    for j in range(1, len(tar)):
        del_distance = (j + 1) * ins_cost + del_cost
        ins_distance = d_mat[0, j - 1] + ins_cost
        match_distance = j * ins_cost + (0 if src[0] == tar[j] else sub_cost)
        d_mat[0, j] = min(del_distance, ins_distance, match_distance)

    for i in range(1, len(src)):
        max_src_letter_match_index = 0 if src[i] == tar[0] else -1
        for j in range(1, len(tar)):
            candidate_swap_index = (
                -1
                if tar[j] not in src_index_by_character
                else src_index_by_character[tar[j]]
            )
            j_swap = max_src_letter_match_index
            del_distance = d_mat[i - 1, j] + del_cost
            ins_distance = d_mat[i, j - 1] + ins_cost
            match_distance = d_mat[i - 1, j - 1]
            if src[i] != tar[j]:
                match_distance += sub_cost
            else:
                max_src_letter_match_index = j

            if candidate_swap_index != -1 and j_swap != -1:
                i_swap = candidate_swap_index

                if i_swap == 0 and j_swap == 0:
                    pre_swap_cost = 0
                else:
                    pre_swap_cost = d_mat[
                        max(0, i_swap - 1), max(0, j_swap - 1)
                    ]
                swap_distance = (
                    pre_swap_cost
                    + (i - i_swap - 1) * del_cost
                    + (j - j_swap - 1) * ins_cost
                    + trans_cost
                )
            else:
                swap_distance = maxsize

            d_mat[i, j] = min(
                del_distance, ins_distance, match_distance, swap_distance
            )
        src_index_by_character[src[i]] = i

    return d_mat[len(src) - 1, len(tar) - 1]


def dist_damerau(src, tar, cost=(1, 1, 1, 1)):
    """Return the Damerau-Levenshtein similarity of two strings.

    Damerau-Levenshtein distance normalized to the interval [0, 1].

    The Damerau-Levenshtein distance is normalized by dividing the
    Damerau-Levenshtein distance by the greater of
    the number of characters in src times the cost of a delete and
    the number of characters in tar times the cost of an insert.
    For the case in which all operations have :math:`cost = 1`, this is
    equivalent to the greater of the length of the two strings src & tar.

    The arguments are identical to those of the levenshtein() function.

    :param str src: source string for comparison
    :param str tar: target string for comparison
    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits:
        inserts, deletes, substitutions, and transpositions, respectively
        (by default: (1, 1, 1, 1))
    :returns: normalized Damerau-Levenshtein distance
    :rtype: float

    >>> round(dist_damerau('cat', 'hat'), 12)
    0.333333333333
    >>> round(dist_damerau('Niall', 'Neil'), 12)
    0.6
    >>> dist_damerau('aluminum', 'Catalan')
    0.875
    >>> dist_damerau('ATCG', 'TAGC')
    0.5
    """
    if src == tar:
        return 0
    ins_cost, del_cost = cost[:2]
    return damerau_levenshtein(src, tar, cost) / (
        max(len(src) * del_cost, len(tar) * ins_cost)
    )


def sim_damerau(src, tar, cost=(1, 1, 1, 1)):
    """Return the Damerau-Levenshtein similarity of two strings.

    Normalized Damerau-Levenshtein similarity the complement of normalized
    Damerau-Levenshtein distance:
    :math:`sim_{Damerau} = 1 - dist_{Damerau}`.

    The arguments are identical to those of the levenshtein() function.

    :param str src: source string for comparison
    :param str tar: target string for comparison
    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits:
        inserts, deletes, substitutions, and transpositions, respectively
        (by default: (1, 1, 1, 1))
    :returns: normalized Damerau-Levenshtein similarity
    :rtype: float

    >>> round(sim_damerau('cat', 'hat'), 12)
    0.666666666667
    >>> round(sim_damerau('Niall', 'Neil'), 12)
    0.4
    >>> sim_damerau('aluminum', 'Catalan')
    0.125
    >>> sim_damerau('ATCG', 'TAGC')
    0.5
    """
    return 1 - dist_damerau(src, tar, cost)


def indel(src, tar):
    """Return the indel distance between two strings.

    This is equivalent to Levenshtein distance, when only inserts and deletes
    are possible.

    :param str src: source string for comparison
    :param str tar: target string for comparison
    :returns: indel distance
    :rtype: float

    >>> round(dist_indel('cat', 'hat'), 12)
    0.333333333333
    >>> round(dist_indel('Niall', 'Neil'), 12)
    0.333333333333
    >>> round(dist_indel('Colin', 'Cuilen'), 12)
    0.454545454545
    >>> dist_indel('ATCG', 'TAGC')
    0.5
    """
    return levenshtein(src, tar, mode='lev', cost=(1, 1, 9999, 9999))


def dist_indel(src, tar):
    """Return the normalized indel distance between two strings.

    This is equivalent to normalized Levenshtein distance, when only inserts
    and deletes are possible.

    :param str src: source string for comparison
    :param str tar: target string for comparison
    :returns: indel distance
    :rtype: float

    >>> round(dist_indel('cat', 'hat'), 12)
    0.333333333333
    >>> round(dist_indel('Niall', 'Neil'), 12)
    0.333333333333
    >>> round(dist_indel('Colin', 'Cuilen'), 12)
    0.454545454545
    >>> dist_indel('ATCG', 'TAGC')
    0.5
    """
    if src == tar:
        return 0
    return indel(src, tar) / (len(src) + len(tar))


def sim_indel(src, tar):
    """Return the normalized indel similarity of two strings.

    This is equivalent to normalized Levenshtein similarity, when only inserts
    and deletes are possible.

    :param str src: source string for comparison
    :param str tar: target string for comparison
    :returns: indel similarity
    :rtype: float

    >>> round(sim_indel('cat', 'hat'), 12)
    0.666666666667
    >>> round(sim_indel('Niall', 'Neil'), 12)
    0.666666666667
    >>> round(sim_indel('Colin', 'Cuilen'), 12)
    0.545454545455
    >>> sim_indel('ATCG', 'TAGC')
    0.5
    """
    return 1 - dist_indel(src, tar)


if __name__ == '__main__':
    import doctest

    doctest.testmod()


1		# -- coding: utf-8 --
2
3		# Copyright 2014-2018 by Christopher C. Little.
4		# This file is part of Abydos.
5		#
6		# Abydos is free software: you can redistribute it and/or modify
7		# it under the terms of the GNU General Public License as published by
8		# the Free Software Foundation, either version 3 of the License, or
9		# (at your option) any later version.
10		#
11		# Abydos is distributed in the hope that it will be useful,
12		# but WITHOUT ANY WARRANTY; without even the implied warranty of
13		# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14		# GNU General Public License for more details.
15		#
16		# You should have received a copy of the GNU General Public License
17		# along with Abydos. If not, see <http://www.gnu.org/licenses/>.
18
19	1	"""abydos.distance.levenshtein.
20
21		The distance.levenshtein module implements string edit distance functions
22		based on Levenshtein distance, including:
23
24		- Levenshtein distance
25		- Optimal String Alignment distance
26		- Levenshtein-Damerau distance
27		- Indel distance
28		"""
29
30	1	from __future__ import division, unicode_literals
31
32	1	from sys import maxsize
33
34	1	from numpy import int as np_int
35	1	from numpy import zeros as np_zeros
36
37	1	from six.moves import range
38
39
40	1	__all__ = [
41		'damerau_levenshtein',
42		'dist_damerau',
43		'dist_indel',
44		'dist_levenshtein',
45		'levenshtein',
46		'sim_damerau',
47		'sim_indel',
48		'sim_levenshtein',
49		]
50
51
52	1	def levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
53		"""Return the Levenshtein distance between two strings.
54
55		This is the standard edit distance measure. Cf.
56		:cite:`Levenshtein:1965,Levenshtein:1966`.
57
58		Two additional variants: optimal string alignment (aka restricted
59		Damerau-Levenshtein distance) :cite:`Boytsov:2011` and the
60		Damerau-Levenshtein :cite:`Damerau:1964` distance are also supported.
61
62		The ordinary Levenshtein & Optimal String Alignment distance both
63		employ the Wagner-Fischer dynamic programming algorithm
64		:cite:`Wagner:1974`.
65
66		Levenshtein edit distance ordinarily has unit insertion, deletion, and
67		substitution costs.
68
69		:param str src: source string for comparison
70		:param str tar: target string for comparison
71		:param str mode: specifies a mode for computing the Levenshtein distance:
72
73		- 'lev' (default) computes the ordinary Levenshtein distance,
74		in which edits may include inserts, deletes, and substitutions
75		- 'osa' computes the Optimal String Alignment distance, in which
76		edits may include inserts, deletes, substitutions, and
77		transpositions but substrings may only be edited once
78		- 'dam' computes the Damerau-Levenshtein distance, in which
79		edits may include inserts, deletes, substitutions, and
80		transpositions and substrings may undergo repeated edits
81
82		:param tuple cost: a 4-tuple representing the cost of the four possible
83		edits: inserts, deletes, substitutions, and transpositions,
84		respectively (by default: (1, 1, 1, 1))
85		:returns: the Levenshtein distance between src & tar
86		:rtype: int (may return a float if cost has float values)
87
88		>>> levenshtein('cat', 'hat')
89		1
90		>>> levenshtein('Niall', 'Neil')
91		3
92		>>> levenshtein('aluminum', 'Catalan')
93		7
94		>>> levenshtein('ATCG', 'TAGC')
95		3
96
97		>>> levenshtein('ATCG', 'TAGC', mode='osa')
98		2
99		>>> levenshtein('ACTG', 'TAGC', mode='osa')
100		4
101
102		>>> levenshtein('ATCG', 'TAGC', mode='dam')
103		2
104		>>> levenshtein('ACTG', 'TAGC', mode='dam')
105		3
106		"""
107	1	ins_cost, del_cost, sub_cost, trans_cost = cost
108
109	1	if src == tar:
110	1	return 0
111	1	if not src:
112	1	return len(tar) * ins_cost
113	1	if not tar:
114	1	return len(src) * del_cost
115
116	1	if 'dam' in mode:
117	1	return damerau_levenshtein(src, tar, cost)
118
119	1	d_mat = np_zeros((len(src) + 1, len(tar) + 1), dtype=np_int)
120	1	for i in range(len(src) + 1):
121	1	d_mat[i, 0] = i * del_cost
122	1	for j in range(len(tar) + 1):
123	1	d_mat[0, j] = j * ins_cost
124
125	1	for i in range(len(src)):
		0 ignored issues – show unused-code introduced 2018-08-02 19:04 UTC by Report Bug Copy Issue Report Consider using enumerate instead of iterating with range and len Loading history...
126	1	for j in range(len(tar)):
		0 ignored issues – show unused-code introduced 2018-08-02 19:04 UTC by Report Bug Copy Issue Report Consider using enumerate instead of iterating with range and len Loading history...
127	1	d_mat[i + 1, j + 1] = min(
128		d_mat[i + 1, j] + ins_cost, # ins
129		d_mat[i, j + 1] + del_cost, # del
130		d_mat[i, j] + (sub_cost if src[i] != tar[j] else 0), # sub/==
131		)
132
133	1	if mode == 'osa':
134	1	if (
135		i + 1 > 1
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
136		and j + 1 > 1
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
137		and src[i] == tar[j - 1]
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
138		and src[i - 1] == tar[j]
		0 ignored issues – show Coding Style introduced 2018-10-24 06:00 UTC by Report Bug Copy Issue Report Wrong hanging indentation before block (add 4 spaces). Loading history...
139		):
140		# transposition
141	1	d_mat[i + 1, j + 1] = min(
142		d_mat[i + 1, j + 1], d_mat[i - 1, j - 1] + trans_cost
143		)
144
145	1	return d_mat[len(src), len(tar)]
146
147
148	1	def dist_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
149		"""Return the normalized Levenshtein distance between two strings.
150
151		The Levenshtein distance is normalized by dividing the Levenshtein distance
152		(calculated by any of the three supported methods) by the greater of
153		the number of characters in src times the cost of a delete and
154		the number of characters in tar times the cost of an insert.
155		For the case in which all operations have :math:`cost = 1`, this is
156		equivalent to the greater of the length of the two strings src & tar.
157
158		:param str src: source string for comparison
159		:param str tar: target string for comparison
160		:param str mode: specifies a mode for computing the Levenshtein distance:
161
162		- 'lev' (default) computes the ordinary Levenshtein distance,
163		in which edits may include inserts, deletes, and substitutions
164		- 'osa' computes the Optimal String Alignment distance, in which
165		edits may include inserts, deletes, substitutions, and
166		transpositions but substrings may only be edited once
167		- 'dam' computes the Damerau-Levenshtein distance, in which
168		edits may include inserts, deletes, substitutions, and
169		transpositions and substrings may undergo repeated edits
170
171		:param tuple cost: a 4-tuple representing the cost of the four possible
172		edits: inserts, deletes, substitutions, and transpositions,
173		respectively (by default: (1, 1, 1, 1))
174		:returns: normalized Levenshtein distance
175		:rtype: float
176
177		>>> round(dist_levenshtein('cat', 'hat'), 12)
178		0.333333333333
179		>>> round(dist_levenshtein('Niall', 'Neil'), 12)
180		0.6
181		>>> dist_levenshtein('aluminum', 'Catalan')
182		0.875
183		>>> dist_levenshtein('ATCG', 'TAGC')
184		0.75
185		"""
186	1	if src == tar:
187	1	return 0
188	1	ins_cost, del_cost = cost[:2]
189	1	return levenshtein(src, tar, mode, cost) / (
190		max(len(src) * del_cost, len(tar) * ins_cost)
191		)
192
193
194	1	def sim_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
195		"""Return the Levenshtein similarity of two strings.
196
197		Normalized Levenshtein similarity is the complement of normalized
198		Levenshtein distance:
199		:math:`sim_{Levenshtein} = 1 - dist_{Levenshtein}`.
200
201		:param str src: source string for comparison
202		:param str tar: target string for comparison
203		:param str mode: specifies a mode for computing the Levenshtein distance:
204
205		- 'lev' (default) computes the ordinary Levenshtein distance,
206		in which edits may include inserts, deletes, and substitutions
207		- 'osa' computes the Optimal String Alignment distance, in which
208		edits may include inserts, deletes, substitutions, and
209		transpositions but substrings may only be edited once
210		- 'dam' computes the Damerau-Levenshtein distance, in which
211		edits may include inserts, deletes, substitutions, and
212		transpositions and substrings may undergo repeated edits
213
214		:param tuple cost: a 4-tuple representing the cost of the four possible
215		edits:
216		inserts, deletes, substitutions, and transpositions, respectively
217		(by default: (1, 1, 1, 1))
218		:returns: normalized Levenshtein similarity
219		:rtype: float
220
221		>>> round(sim_levenshtein('cat', 'hat'), 12)
222		0.666666666667
223		>>> round(sim_levenshtein('Niall', 'Neil'), 12)
224		0.4
225		>>> sim_levenshtein('aluminum', 'Catalan')
226		0.125
227		>>> sim_levenshtein('ATCG', 'TAGC')
228		0.25
229		"""
230	1	return 1 - dist_levenshtein(src, tar, mode, cost)
231
232
233	1	def damerau_levenshtein(src, tar, cost=(1, 1, 1, 1)):
		0 ignored issues – show Comprehensibility introduced 2018-08-02 19:04 UTC by Report Bug Copy Issue Report This function exceeds the maximum number of variables (20/15). Loading history...
234		"""Return the Damerau-Levenshtein distance between two strings.
235
236		This computes the Damerau-Levenshtein distance :cite:`Damerau:1964`.
237		Damerau-Levenshtein code is based on Java code by Kevin L. Stern
238		:cite:`Stern:2014`, under the MIT license:
239		https://github.com/KevinStern/software-and-algorithms/blob/master/src/main/java/blogspot/software_and_algorithms/stern_library/string/DamerauLevenshteinAlgorithm.java
240
241		:param str src: source string for comparison
242		:param str tar: target string for comparison
243		:param tuple cost: a 4-tuple representing the cost of the four possible
244		edits:
245		inserts, deletes, substitutions, and transpositions, respectively
246		(by default: (1, 1, 1, 1))
247		:returns: the Damerau-Levenshtein distance between src & tar
248		:rtype: int (may return a float if cost has float values)
249
250		>>> damerau_levenshtein('cat', 'hat')
251		1
252		>>> damerau_levenshtein('Niall', 'Neil')
253		3
254		>>> damerau_levenshtein('aluminum', 'Catalan')
255		7
256		>>> damerau_levenshtein('ATCG', 'TAGC')
257		2
258		"""
259	1	ins_cost, del_cost, sub_cost, trans_cost = cost
260
261	1	if src == tar:
262	1	return 0
263	1	if not src:
264	1	return len(tar) * ins_cost
265	1	if not tar:
266	1	return len(src) * del_cost
267
268	1	if 2 * trans_cost < ins_cost + del_cost:
269	1	raise ValueError(
270		'Unsupported cost assignment; the cost of two '
271		+ 'transpositions must not be less than the cost of '
272		+ 'an insert plus a delete.'
273		)
274
275	1	d_mat = np_zeros((len(src)) * (len(tar)), dtype=np_int).reshape(
276		(len(src), len(tar))
277		)
278
279	1	if src[0] != tar[0]:
280	1	d_mat[0, 0] = min(sub_cost, ins_cost + del_cost)
281
282	1	src_index_by_character = {src[0]: 0}
283	1	for i in range(1, len(src)):
284	1	del_distance = d_mat[i - 1, 0] + del_cost
285	1	ins_distance = (i + 1) * del_cost + ins_cost
286	1	match_distance = i * del_cost + (0 if src[i] == tar[0] else sub_cost)
287	1	d_mat[i, 0] = min(del_distance, ins_distance, match_distance)
288
289	1	for j in range(1, len(tar)):
290	1	del_distance = (j + 1) * ins_cost + del_cost
291	1	ins_distance = d_mat[0, j - 1] + ins_cost
292	1	match_distance = j * ins_cost + (0 if src[0] == tar[j] else sub_cost)
293	1	d_mat[0, j] = min(del_distance, ins_distance, match_distance)
294
295	1	for i in range(1, len(src)):
296	1	max_src_letter_match_index = 0 if src[i] == tar[0] else -1
297	1	for j in range(1, len(tar)):
298	1	candidate_swap_index = (
299		-1
300		if tar[j] not in src_index_by_character
301		else src_index_by_character[tar[j]]
302		)
303	1	j_swap = max_src_letter_match_index
304	1	del_distance = d_mat[i - 1, j] + del_cost
305	1	ins_distance = d_mat[i, j - 1] + ins_cost
306	1	match_distance = d_mat[i - 1, j - 1]
307	1	if src[i] != tar[j]:
308	1	match_distance += sub_cost
309		else:
310	1	max_src_letter_match_index = j
311
312	1	if candidate_swap_index != -1 and j_swap != -1:
313	1	i_swap = candidate_swap_index
314
315	1	if i_swap == 0 and j_swap == 0:
316	1	pre_swap_cost = 0
317		else:
318	1	pre_swap_cost = d_mat[
319		max(0, i_swap - 1), max(0, j_swap - 1)
320		]
321	1	swap_distance = (
322		pre_swap_cost
323		+ (i - i_swap - 1) * del_cost
324		+ (j - j_swap - 1) * ins_cost
325		+ trans_cost
326		)
327		else:
328	1	swap_distance = maxsize
329
330	1	d_mat[i, j] = min(
331		del_distance, ins_distance, match_distance, swap_distance
332		)
333	1	src_index_by_character[src[i]] = i
334
335	1	return d_mat[len(src) - 1, len(tar) - 1]
336
337
338	1	def dist_damerau(src, tar, cost=(1, 1, 1, 1)):
339		"""Return the Damerau-Levenshtein similarity of two strings.
340
341		Damerau-Levenshtein distance normalized to the interval [0, 1].
342
343		The Damerau-Levenshtein distance is normalized by dividing the
344		Damerau-Levenshtein distance by the greater of
345		the number of characters in src times the cost of a delete and
346		the number of characters in tar times the cost of an insert.
347		For the case in which all operations have :math:`cost = 1`, this is
348		equivalent to the greater of the length of the two strings src & tar.
349
350		The arguments are identical to those of the levenshtein() function.
351
352		:param str src: source string for comparison
353		:param str tar: target string for comparison
354		:param tuple cost: a 4-tuple representing the cost of the four possible
355		edits:
356		inserts, deletes, substitutions, and transpositions, respectively
357		(by default: (1, 1, 1, 1))
358		:returns: normalized Damerau-Levenshtein distance
359		:rtype: float
360
361		>>> round(dist_damerau('cat', 'hat'), 12)
362		0.333333333333
363		>>> round(dist_damerau('Niall', 'Neil'), 12)
364		0.6
365		>>> dist_damerau('aluminum', 'Catalan')
366		0.875
367		>>> dist_damerau('ATCG', 'TAGC')
368		0.5
369		"""
370	1	if src == tar:
371	1	return 0
372	1	ins_cost, del_cost = cost[:2]
373	1	return damerau_levenshtein(src, tar, cost) / (
374		max(len(src) * del_cost, len(tar) * ins_cost)
375		)
376
377
378	1	def sim_damerau(src, tar, cost=(1, 1, 1, 1)):
379		"""Return the Damerau-Levenshtein similarity of two strings.
380
381		Normalized Damerau-Levenshtein similarity the complement of normalized
382		Damerau-Levenshtein distance:
383		:math:`sim_{Damerau} = 1 - dist_{Damerau}`.
384
385		The arguments are identical to those of the levenshtein() function.
386
387		:param str src: source string for comparison
388		:param str tar: target string for comparison
389		:param tuple cost: a 4-tuple representing the cost of the four possible
390		edits:
391		inserts, deletes, substitutions, and transpositions, respectively
392		(by default: (1, 1, 1, 1))
393		:returns: normalized Damerau-Levenshtein similarity
394		:rtype: float
395
396		>>> round(sim_damerau('cat', 'hat'), 12)
397		0.666666666667
398		>>> round(sim_damerau('Niall', 'Neil'), 12)
399		0.4
400		>>> sim_damerau('aluminum', 'Catalan')
401		0.125
402		>>> sim_damerau('ATCG', 'TAGC')
403		0.5
404		"""
405	1	return 1 - dist_damerau(src, tar, cost)
406
407
408	1	def indel(src, tar):
409		"""Return the indel distance between two strings.
410
411		This is equivalent to Levenshtein distance, when only inserts and deletes
412		are possible.
413
414		:param str src: source string for comparison
415		:param str tar: target string for comparison
416		:returns: indel distance
417		:rtype: float
418
419		>>> round(dist_indel('cat', 'hat'), 12)
420		0.333333333333
421		>>> round(dist_indel('Niall', 'Neil'), 12)
422		0.333333333333
423		>>> round(dist_indel('Colin', 'Cuilen'), 12)
424		0.454545454545
425		>>> dist_indel('ATCG', 'TAGC')
426		0.5
427		"""
428	1	return levenshtein(src, tar, mode='lev', cost=(1, 1, 9999, 9999))
429
430
431	1	def dist_indel(src, tar):
432		"""Return the normalized indel distance between two strings.
433
434		This is equivalent to normalized Levenshtein distance, when only inserts
435		and deletes are possible.
436
437		:param str src: source string for comparison
438		:param str tar: target string for comparison
439		:returns: indel distance
440		:rtype: float
441
442		>>> round(dist_indel('cat', 'hat'), 12)
443		0.333333333333
444		>>> round(dist_indel('Niall', 'Neil'), 12)
445		0.333333333333
446		>>> round(dist_indel('Colin', 'Cuilen'), 12)
447		0.454545454545
448		>>> dist_indel('ATCG', 'TAGC')
449		0.5
450		"""
451	1	if src == tar:
452	1	return 0
453	1	return indel(src, tar) / (len(src) + len(tar))
454
455
456	1	def sim_indel(src, tar):
457		"""Return the normalized indel similarity of two strings.
458
459		This is equivalent to normalized Levenshtein similarity, when only inserts
460		and deletes are possible.
461
462		:param str src: source string for comparison
463		:param str tar: target string for comparison
464		:returns: indel similarity
465		:rtype: float
466
467		>>> round(sim_indel('cat', 'hat'), 12)
468		0.666666666667
469		>>> round(sim_indel('Niall', 'Neil'), 12)
470		0.666666666667
471		>>> round(sim_indel('Colin', 'Cuilen'), 12)
472		0.545454545455
473		>>> sim_indel('ATCG', 'TAGC')
474		0.5
475		"""
476	1	return 1 - dist_indel(src, tar)
477
478
479		if __name__ == '__main__':
480		import doctest
481
482		doctest.testmod()
483

chrislit / abydos

Branch — master (78a222)

abydos.distance._levenshtein B

Complexity

Size/Duplication

Test Coverage

Importance

9 Functions

How to fix Complexity

Complexity

Duplication Side-by-Side

Filter issues like