Completed
Branch master (78a222)
by Chris
14:36
created

abydos.distance._levenshtein   B

Complexity

Total Complexity 44

Size/Duplication

Total Lines 483
Duplicated Lines 0 %

Test Coverage

Coverage 100%

Importance

Changes 0
Metric Value
wmc 44
eloc 130
dl 0
loc 483
ccs 96
cts 96
cp 1
rs 8.8798
c 0
b 0
f 0

9 Functions

Rating   Name   Duplication   Size   Complexity  
A dist_damerau() 0 37 2
F damerau_levenshtein() 0 103 19
A dist_levenshtein() 0 43 2
A sim_indel() 0 21 1
F levenshtein() 0 94 15
A indel() 0 21 1
A sim_damerau() 0 28 1
A dist_indel() 0 23 2
A sim_levenshtein() 0 37 1

How to fix   Complexity   

Complexity

Complex classes like abydos.distance._levenshtein often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

1
# -*- coding: utf-8 -*-
2
3
# Copyright 2014-2018 by Christopher C. Little.
4
# This file is part of Abydos.
5
#
6
# Abydos is free software: you can redistribute it and/or modify
7
# it under the terms of the GNU General Public License as published by
8
# the Free Software Foundation, either version 3 of the License, or
9
# (at your option) any later version.
10
#
11
# Abydos is distributed in the hope that it will be useful,
12
# but WITHOUT ANY WARRANTY; without even the implied warranty of
13
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
# GNU General Public License for more details.
15
#
16
# You should have received a copy of the GNU General Public License
17
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.
18
19 1
"""abydos.distance.levenshtein.
20
21
The distance.levenshtein module implements string edit distance functions
22
based on Levenshtein distance, including:
23
24
    - Levenshtein distance
25
    - Optimal String Alignment distance
26
    - Levenshtein-Damerau distance
27
    - Indel distance
28
"""
29
30 1
from __future__ import division, unicode_literals
31
32 1
from sys import maxsize
33
34 1
from numpy import int as np_int
35 1
from numpy import zeros as np_zeros
36
37 1
from six.moves import range
38
39
40 1
__all__ = [
41
    'damerau_levenshtein',
42
    'dist_damerau',
43
    'dist_indel',
44
    'dist_levenshtein',
45
    'levenshtein',
46
    'sim_damerau',
47
    'sim_indel',
48
    'sim_levenshtein',
49
]
50
51
52 1
def levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
53
    """Return the Levenshtein distance between two strings.
54
55
    This is the standard edit distance measure. Cf.
56
    :cite:`Levenshtein:1965,Levenshtein:1966`.
57
58
    Two additional variants: optimal string alignment (aka restricted
59
    Damerau-Levenshtein distance) :cite:`Boytsov:2011` and the
60
    Damerau-Levenshtein :cite:`Damerau:1964` distance are also supported.
61
62
    The ordinary Levenshtein & Optimal String Alignment distance both
63
    employ the Wagner-Fischer dynamic programming algorithm
64
    :cite:`Wagner:1974`.
65
66
    Levenshtein edit distance ordinarily has unit insertion, deletion, and
67
    substitution costs.
68
69
    :param str src: source string for comparison
70
    :param str tar: target string for comparison
71
    :param str mode: specifies a mode for computing the Levenshtein distance:
72
73
        - 'lev' (default) computes the ordinary Levenshtein distance,
74
          in which edits may include inserts, deletes, and substitutions
75
        - 'osa' computes the Optimal String Alignment distance, in which
76
          edits may include inserts, deletes, substitutions, and
77
          transpositions but substrings may only be edited once
78
        - 'dam' computes the Damerau-Levenshtein distance, in which
79
          edits may include inserts, deletes, substitutions, and
80
          transpositions and substrings may undergo repeated edits
81
82
    :param tuple cost: a 4-tuple representing the cost of the four possible
83
        edits: inserts, deletes, substitutions, and transpositions,
84
        respectively (by default: (1, 1, 1, 1))
85
    :returns: the Levenshtein distance between src & tar
86
    :rtype: int (may return a float if cost has float values)
87
88
    >>> levenshtein('cat', 'hat')
89
    1
90
    >>> levenshtein('Niall', 'Neil')
91
    3
92
    >>> levenshtein('aluminum', 'Catalan')
93
    7
94
    >>> levenshtein('ATCG', 'TAGC')
95
    3
96
97
    >>> levenshtein('ATCG', 'TAGC', mode='osa')
98
    2
99
    >>> levenshtein('ACTG', 'TAGC', mode='osa')
100
    4
101
102
    >>> levenshtein('ATCG', 'TAGC', mode='dam')
103
    2
104
    >>> levenshtein('ACTG', 'TAGC', mode='dam')
105
    3
106
    """
107 1
    ins_cost, del_cost, sub_cost, trans_cost = cost
108
109 1
    if src == tar:
110 1
        return 0
111 1
    if not src:
112 1
        return len(tar) * ins_cost
113 1
    if not tar:
114 1
        return len(src) * del_cost
115
116 1
    if 'dam' in mode:
117 1
        return damerau_levenshtein(src, tar, cost)
118
119 1
    d_mat = np_zeros((len(src) + 1, len(tar) + 1), dtype=np_int)
120 1
    for i in range(len(src) + 1):
121 1
        d_mat[i, 0] = i * del_cost
122 1
    for j in range(len(tar) + 1):
123 1
        d_mat[0, j] = j * ins_cost
124
125 1
    for i in range(len(src)):
0 ignored issues
show
unused-code introduced by
Consider using enumerate instead of iterating with range and len
Loading history...
126 1
        for j in range(len(tar)):
0 ignored issues
show
unused-code introduced by
Consider using enumerate instead of iterating with range and len
Loading history...
127 1
            d_mat[i + 1, j + 1] = min(
128
                d_mat[i + 1, j] + ins_cost,  # ins
129
                d_mat[i, j + 1] + del_cost,  # del
130
                d_mat[i, j] + (sub_cost if src[i] != tar[j] else 0),  # sub/==
131
            )
132
133 1
            if mode == 'osa':
134 1
                if (
135
                    i + 1 > 1
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
136
                    and j + 1 > 1
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
137
                    and src[i] == tar[j - 1]
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
138
                    and src[i - 1] == tar[j]
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
139
                ):
140
                    # transposition
141 1
                    d_mat[i + 1, j + 1] = min(
142
                        d_mat[i + 1, j + 1], d_mat[i - 1, j - 1] + trans_cost
143
                    )
144
145 1
    return d_mat[len(src), len(tar)]
146
147
148 1
def dist_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
149
    """Return the normalized Levenshtein distance between two strings.
150
151
    The Levenshtein distance is normalized by dividing the Levenshtein distance
152
    (calculated by any of the three supported methods) by the greater of
153
    the number of characters in src times the cost of a delete and
154
    the number of characters in tar times the cost of an insert.
155
    For the case in which all operations have :math:`cost = 1`, this is
156
    equivalent to the greater of the length of the two strings src & tar.
157
158
    :param str src: source string for comparison
159
    :param str tar: target string for comparison
160
    :param str mode: specifies a mode for computing the Levenshtein distance:
161
162
        - 'lev' (default) computes the ordinary Levenshtein distance,
163
          in which edits may include inserts, deletes, and substitutions
164
        - 'osa' computes the Optimal String Alignment distance, in which
165
          edits may include inserts, deletes, substitutions, and
166
          transpositions but substrings may only be edited once
167
        - 'dam' computes the Damerau-Levenshtein distance, in which
168
          edits may include inserts, deletes, substitutions, and
169
          transpositions and substrings may undergo repeated edits
170
171
    :param tuple cost: a 4-tuple representing the cost of the four possible
172
        edits: inserts, deletes, substitutions, and transpositions,
173
        respectively (by default: (1, 1, 1, 1))
174
    :returns: normalized Levenshtein distance
175
    :rtype: float
176
177
    >>> round(dist_levenshtein('cat', 'hat'), 12)
178
    0.333333333333
179
    >>> round(dist_levenshtein('Niall', 'Neil'), 12)
180
    0.6
181
    >>> dist_levenshtein('aluminum', 'Catalan')
182
    0.875
183
    >>> dist_levenshtein('ATCG', 'TAGC')
184
    0.75
185
    """
186 1
    if src == tar:
187 1
        return 0
188 1
    ins_cost, del_cost = cost[:2]
189 1
    return levenshtein(src, tar, mode, cost) / (
190
        max(len(src) * del_cost, len(tar) * ins_cost)
191
    )
192
193
194 1
def sim_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
195
    """Return the Levenshtein similarity of two strings.
196
197
    Normalized Levenshtein similarity is the complement of normalized
198
    Levenshtein distance:
199
    :math:`sim_{Levenshtein} = 1 - dist_{Levenshtein}`.
200
201
    :param str src: source string for comparison
202
    :param str tar: target string for comparison
203
    :param str mode: specifies a mode for computing the Levenshtein distance:
204
205
            - 'lev' (default) computes the ordinary Levenshtein distance,
206
              in which edits may include inserts, deletes, and substitutions
207
            - 'osa' computes the Optimal String Alignment distance, in which
208
              edits may include inserts, deletes, substitutions, and
209
              transpositions but substrings may only be edited once
210
            - 'dam' computes the Damerau-Levenshtein distance, in which
211
              edits may include inserts, deletes, substitutions, and
212
              transpositions and substrings may undergo repeated edits
213
214
    :param tuple cost: a 4-tuple representing the cost of the four possible
215
        edits:
216
        inserts, deletes, substitutions, and transpositions, respectively
217
        (by default: (1, 1, 1, 1))
218
    :returns: normalized Levenshtein similarity
219
    :rtype: float
220
221
    >>> round(sim_levenshtein('cat', 'hat'), 12)
222
    0.666666666667
223
    >>> round(sim_levenshtein('Niall', 'Neil'), 12)
224
    0.4
225
    >>> sim_levenshtein('aluminum', 'Catalan')
226
    0.125
227
    >>> sim_levenshtein('ATCG', 'TAGC')
228
    0.25
229
    """
230 1
    return 1 - dist_levenshtein(src, tar, mode, cost)
231
232
233 1
def damerau_levenshtein(src, tar, cost=(1, 1, 1, 1)):
0 ignored issues
show
Comprehensibility introduced by
This function exceeds the maximum number of variables (20/15).
Loading history...
234
    """Return the Damerau-Levenshtein distance between two strings.
235
236
    This computes the Damerau-Levenshtein distance :cite:`Damerau:1964`.
237
    Damerau-Levenshtein code is based on Java code by Kevin L. Stern
238
    :cite:`Stern:2014`, under the MIT license:
239
    https://github.com/KevinStern/software-and-algorithms/blob/master/src/main/java/blogspot/software_and_algorithms/stern_library/string/DamerauLevenshteinAlgorithm.java
240
241
    :param str src: source string for comparison
242
    :param str tar: target string for comparison
243
    :param tuple cost: a 4-tuple representing the cost of the four possible
244
        edits:
245
        inserts, deletes, substitutions, and transpositions, respectively
246
        (by default: (1, 1, 1, 1))
247
    :returns: the Damerau-Levenshtein distance between src & tar
248
    :rtype: int (may return a float if cost has float values)
249
250
    >>> damerau_levenshtein('cat', 'hat')
251
    1
252
    >>> damerau_levenshtein('Niall', 'Neil')
253
    3
254
    >>> damerau_levenshtein('aluminum', 'Catalan')
255
    7
256
    >>> damerau_levenshtein('ATCG', 'TAGC')
257
    2
258
    """
259 1
    ins_cost, del_cost, sub_cost, trans_cost = cost
260
261 1
    if src == tar:
262 1
        return 0
263 1
    if not src:
264 1
        return len(tar) * ins_cost
265 1
    if not tar:
266 1
        return len(src) * del_cost
267
268 1
    if 2 * trans_cost < ins_cost + del_cost:
269 1
        raise ValueError(
270
            'Unsupported cost assignment; the cost of two '
271
            + 'transpositions must not be less than the cost of '
272
            + 'an insert plus a delete.'
273
        )
274
275 1
    d_mat = np_zeros((len(src)) * (len(tar)), dtype=np_int).reshape(
276
        (len(src), len(tar))
277
    )
278
279 1
    if src[0] != tar[0]:
280 1
        d_mat[0, 0] = min(sub_cost, ins_cost + del_cost)
281
282 1
    src_index_by_character = {src[0]: 0}
283 1
    for i in range(1, len(src)):
284 1
        del_distance = d_mat[i - 1, 0] + del_cost
285 1
        ins_distance = (i + 1) * del_cost + ins_cost
286 1
        match_distance = i * del_cost + (0 if src[i] == tar[0] else sub_cost)
287 1
        d_mat[i, 0] = min(del_distance, ins_distance, match_distance)
288
289 1
    for j in range(1, len(tar)):
290 1
        del_distance = (j + 1) * ins_cost + del_cost
291 1
        ins_distance = d_mat[0, j - 1] + ins_cost
292 1
        match_distance = j * ins_cost + (0 if src[0] == tar[j] else sub_cost)
293 1
        d_mat[0, j] = min(del_distance, ins_distance, match_distance)
294
295 1
    for i in range(1, len(src)):
296 1
        max_src_letter_match_index = 0 if src[i] == tar[0] else -1
297 1
        for j in range(1, len(tar)):
298 1
            candidate_swap_index = (
299
                -1
300
                if tar[j] not in src_index_by_character
301
                else src_index_by_character[tar[j]]
302
            )
303 1
            j_swap = max_src_letter_match_index
304 1
            del_distance = d_mat[i - 1, j] + del_cost
305 1
            ins_distance = d_mat[i, j - 1] + ins_cost
306 1
            match_distance = d_mat[i - 1, j - 1]
307 1
            if src[i] != tar[j]:
308 1
                match_distance += sub_cost
309
            else:
310 1
                max_src_letter_match_index = j
311
312 1
            if candidate_swap_index != -1 and j_swap != -1:
313 1
                i_swap = candidate_swap_index
314
315 1
                if i_swap == 0 and j_swap == 0:
316 1
                    pre_swap_cost = 0
317
                else:
318 1
                    pre_swap_cost = d_mat[
319
                        max(0, i_swap - 1), max(0, j_swap - 1)
320
                    ]
321 1
                swap_distance = (
322
                    pre_swap_cost
323
                    + (i - i_swap - 1) * del_cost
324
                    + (j - j_swap - 1) * ins_cost
325
                    + trans_cost
326
                )
327
            else:
328 1
                swap_distance = maxsize
329
330 1
            d_mat[i, j] = min(
331
                del_distance, ins_distance, match_distance, swap_distance
332
            )
333 1
        src_index_by_character[src[i]] = i
334
335 1
    return d_mat[len(src) - 1, len(tar) - 1]
336
337
338 1
def dist_damerau(src, tar, cost=(1, 1, 1, 1)):
339
    """Return the Damerau-Levenshtein similarity of two strings.
340
341
    Damerau-Levenshtein distance normalized to the interval [0, 1].
342
343
    The Damerau-Levenshtein distance is normalized by dividing the
344
    Damerau-Levenshtein distance by the greater of
345
    the number of characters in src times the cost of a delete and
346
    the number of characters in tar times the cost of an insert.
347
    For the case in which all operations have :math:`cost = 1`, this is
348
    equivalent to the greater of the length of the two strings src & tar.
349
350
    The arguments are identical to those of the levenshtein() function.
351
352
    :param str src: source string for comparison
353
    :param str tar: target string for comparison
354
    :param tuple cost: a 4-tuple representing the cost of the four possible
355
        edits:
356
        inserts, deletes, substitutions, and transpositions, respectively
357
        (by default: (1, 1, 1, 1))
358
    :returns: normalized Damerau-Levenshtein distance
359
    :rtype: float
360
361
    >>> round(dist_damerau('cat', 'hat'), 12)
362
    0.333333333333
363
    >>> round(dist_damerau('Niall', 'Neil'), 12)
364
    0.6
365
    >>> dist_damerau('aluminum', 'Catalan')
366
    0.875
367
    >>> dist_damerau('ATCG', 'TAGC')
368
    0.5
369
    """
370 1
    if src == tar:
371 1
        return 0
372 1
    ins_cost, del_cost = cost[:2]
373 1
    return damerau_levenshtein(src, tar, cost) / (
374
        max(len(src) * del_cost, len(tar) * ins_cost)
375
    )
376
377
378 1
def sim_damerau(src, tar, cost=(1, 1, 1, 1)):
379
    """Return the Damerau-Levenshtein similarity of two strings.
380
381
    Normalized Damerau-Levenshtein similarity the complement of normalized
382
    Damerau-Levenshtein distance:
383
    :math:`sim_{Damerau} = 1 - dist_{Damerau}`.
384
385
    The arguments are identical to those of the levenshtein() function.
386
387
    :param str src: source string for comparison
388
    :param str tar: target string for comparison
389
    :param tuple cost: a 4-tuple representing the cost of the four possible
390
        edits:
391
        inserts, deletes, substitutions, and transpositions, respectively
392
        (by default: (1, 1, 1, 1))
393
    :returns: normalized Damerau-Levenshtein similarity
394
    :rtype: float
395
396
    >>> round(sim_damerau('cat', 'hat'), 12)
397
    0.666666666667
398
    >>> round(sim_damerau('Niall', 'Neil'), 12)
399
    0.4
400
    >>> sim_damerau('aluminum', 'Catalan')
401
    0.125
402
    >>> sim_damerau('ATCG', 'TAGC')
403
    0.5
404
    """
405 1
    return 1 - dist_damerau(src, tar, cost)
406
407
408 1
def indel(src, tar):
409
    """Return the indel distance between two strings.
410
411
    This is equivalent to Levenshtein distance, when only inserts and deletes
412
    are possible.
413
414
    :param str src: source string for comparison
415
    :param str tar: target string for comparison
416
    :returns: indel distance
417
    :rtype: float
418
419
    >>> round(dist_indel('cat', 'hat'), 12)
420
    0.333333333333
421
    >>> round(dist_indel('Niall', 'Neil'), 12)
422
    0.333333333333
423
    >>> round(dist_indel('Colin', 'Cuilen'), 12)
424
    0.454545454545
425
    >>> dist_indel('ATCG', 'TAGC')
426
    0.5
427
    """
428 1
    return levenshtein(src, tar, mode='lev', cost=(1, 1, 9999, 9999))
429
430
431 1
def dist_indel(src, tar):
432
    """Return the normalized indel distance between two strings.
433
434
    This is equivalent to normalized Levenshtein distance, when only inserts
435
    and deletes are possible.
436
437
    :param str src: source string for comparison
438
    :param str tar: target string for comparison
439
    :returns: indel distance
440
    :rtype: float
441
442
    >>> round(dist_indel('cat', 'hat'), 12)
443
    0.333333333333
444
    >>> round(dist_indel('Niall', 'Neil'), 12)
445
    0.333333333333
446
    >>> round(dist_indel('Colin', 'Cuilen'), 12)
447
    0.454545454545
448
    >>> dist_indel('ATCG', 'TAGC')
449
    0.5
450
    """
451 1
    if src == tar:
452 1
        return 0
453 1
    return indel(src, tar) / (len(src) + len(tar))
454
455
456 1
def sim_indel(src, tar):
457
    """Return the normalized indel similarity of two strings.
458
459
    This is equivalent to normalized Levenshtein similarity, when only inserts
460
    and deletes are possible.
461
462
    :param str src: source string for comparison
463
    :param str tar: target string for comparison
464
    :returns: indel similarity
465
    :rtype: float
466
467
    >>> round(sim_indel('cat', 'hat'), 12)
468
    0.666666666667
469
    >>> round(sim_indel('Niall', 'Neil'), 12)
470
    0.666666666667
471
    >>> round(sim_indel('Colin', 'Cuilen'), 12)
472
    0.545454545455
473
    >>> sim_indel('ATCG', 'TAGC')
474
    0.5
475
    """
476 1
    return 1 - dist_indel(src, tar)
477
478
479
if __name__ == '__main__':
480
    import doctest
481
482
    doctest.testmod()
483