Completed
Pull Request — master (#141)
by Chris
13:03
created

Levenshtein.dist_abs()   F

Complexity

Conditions 14

Size

Total Lines 78
Code Lines 30

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 20
CRAP Score 14

Importance

Changes 0
Metric Value
eloc 30
dl 0
loc 78
ccs 20
cts 20
cp 1
rs 3.6
c 0
b 0
f 0
cc 14
nop 5
crap 14

How to fix   Long Method    Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:

Complexity

Complex classes like abydos.distance._levenshtein.Levenshtein.dist_abs() often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

1
# -*- coding: utf-8 -*-
2
3
# Copyright 2014-2018 by Christopher C. Little.
4
# This file is part of Abydos.
5
#
6
# Abydos is free software: you can redistribute it and/or modify
7
# it under the terms of the GNU General Public License as published by
8
# the Free Software Foundation, either version 3 of the License, or
9
# (at your option) any later version.
10
#
11
# Abydos is distributed in the hope that it will be useful,
12
# but WITHOUT ANY WARRANTY; without even the implied warranty of
13
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
# GNU General Public License for more details.
15
#
16
# You should have received a copy of the GNU General Public License
17
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.
18
19 1
"""abydos.distance.levenshtein.
20
21
The distance.levenshtein module implements string edit distance functions
22
based on Levenshtein distance, including:
23
24
    - Levenshtein distance
25
    - Optimal String Alignment distance
26
    - Levenshtein-Damerau distance
27
    - Indel distance
28
"""
29
30 1
from __future__ import division, unicode_literals
31
32 1
from sys import maxsize
33
34 1
from numpy import int as np_int
35 1
from numpy import zeros as np_zeros
36
37 1
from six.moves import range
38
39 1
from ._distance import Distance
40
41 1
__all__ = [
42
    'DamerauLevenshtein',
43
    'Indel',
44
    'Levenshtein',
45
    'damerau_levenshtein',
46
    'dist_damerau',
47
    'dist_indel',
48
    'dist_levenshtein',
49
    'levenshtein',
50
    'sim_damerau',
51
    'sim_indel',
52
    'sim_levenshtein',
53
]
54
55
56 1
class Levenshtein(Distance):
0 ignored issues
show
Unused Code introduced by
The variable __class__ seems to be unused.
Loading history...
57
    """Levenshtein distance.
58
59
    This is the standard edit distance measure. Cf.
60
    :cite:`Levenshtein:1965,Levenshtein:1966`.
61
62
    Optimal string alignment (aka restricted
63
    Damerau-Levenshtein distance) :cite:`Boytsov:2011` is also supported.
64
65
    The ordinary Levenshtein & Optimal String Alignment distance both
66
    employ the Wagner-Fischer dynamic programming algorithm
67
    :cite:`Wagner:1974`.
68
69
    Levenshtein edit distance ordinarily has unit insertion, deletion, and
70
    substitution costs.
71
    """
72
73 1
    def dist_abs(self, src, tar, mode='lev', cost=(1, 1, 1, 1)):
0 ignored issues
show
Bug introduced by
Parameters differ from overridden 'dist_abs' method
Loading history...
74
        """Return the Levenshtein distance between two strings.
75
76
        Args:
77
            src (str): Source string for comparison
78
            tar (str): Target string for comparison
79
            mode (str): Specifies a mode for computing the Levenshtein
80
                distance:
81
                    - ``lev`` (default) computes the ordinary Levenshtein
82
                      distance, in which edits may include inserts, deletes,
83
                      and substitutions
84
                    - ``osa`` computes the Optimal String Alignment distance,
85
                      in which edits may include inserts, deletes,
86
                      substitutions, and transpositions but substrings may only
87
                      be edited once
88
            cost (tuple): A 4-tuple representing the cost of the four possible
89
                edits: inserts, deletes, substitutions, and transpositions,
90
                respectively (by default: (1, 1, 1, 1))
91
92
        Returns:
93
            int (may return a float if cost has float values): The Levenshtein
94
                distance between src & tar
95
96
        Examples:
97
            >>> cmp = Levenshtein()
98
            >>> cmp.dist_abs('cat', 'hat')
99
            1
100
            >>> cmp.dist_abs('Niall', 'Neil')
101
            3
102
            >>> cmp.dist_abs('aluminum', 'Catalan')
103
            7
104
            >>> cmp.dist_abs('ATCG', 'TAGC')
105
            3
106
107
            >>> cmp.dist_abs('ATCG', 'TAGC', mode='osa')
108
            2
109
            >>> cmp.dist_abs('ACTG', 'TAGC', mode='osa')
110
            4
111
112
        """
113 1
        ins_cost, del_cost, sub_cost, trans_cost = cost
114
115 1
        if src == tar:
116 1
            return 0
117 1
        if not src:
118 1
            return len(tar) * ins_cost
119 1
        if not tar:
120 1
            return len(src) * del_cost
121
122 1
        d_mat = np_zeros((len(src) + 1, len(tar) + 1), dtype=np_int)
123 1
        for i in range(len(src) + 1):
124 1
            d_mat[i, 0] = i * del_cost
125 1
        for j in range(len(tar) + 1):
126 1
            d_mat[0, j] = j * ins_cost
127
128 1
        for i in range(len(src)):
0 ignored issues
show
unused-code introduced by
Consider using enumerate instead of iterating with range and len
Loading history...
129 1
            for j in range(len(tar)):
0 ignored issues
show
unused-code introduced by
Consider using enumerate instead of iterating with range and len
Loading history...
130 1
                d_mat[i + 1, j + 1] = min(
131
                    d_mat[i + 1, j] + ins_cost,  # ins
132
                    d_mat[i, j + 1] + del_cost,  # del
133
                    d_mat[i, j]
134
                    + (sub_cost if src[i] != tar[j] else 0),  # sub/==
135
                )
136
137 1
                if mode == 'osa':
138 1
                    if (
139
                        i + 1 > 1
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
140
                        and j + 1 > 1
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
141
                        and src[i] == tar[j - 1]
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
142
                        and src[i - 1] == tar[j]
0 ignored issues
show
Coding Style introduced by
Wrong hanging indentation before block (add 4 spaces).
Loading history...
143
                    ):
144
                        # transposition
145 1
                        d_mat[i + 1, j + 1] = min(
146
                            d_mat[i + 1, j + 1],
147
                            d_mat[i - 1, j - 1] + trans_cost,
148
                        )
149
150 1
        return d_mat[len(src), len(tar)]
151
152 1
    def dist(self, src, tar, mode='lev', cost=(1, 1, 1, 1)):
0 ignored issues
show
Bug introduced by
Parameters differ from overridden 'dist' method
Loading history...
153
        """Return the normalized Levenshtein distance between two strings.
154
155
        The Levenshtein distance is normalized by dividing the Levenshtein
156
        distance (calculated by any of the three supported methods) by the
157
        greater of the number of characters in src times the cost of a delete
158
        and the number of characters in tar times the cost of an insert.
159
        For the case in which all operations have :math:`cost = 1`, this is
160
        equivalent to the greater of the length of the two strings src & tar.
161
162
        Args:
163
            src (str): Source string for comparison
164
            tar (str): Target string for comparison
165
            mode (str): specifies a mode for computing the Levenshtein
166
                distance:
167
                    - ``lev`` (default) computes the ordinary Levenshtein
168
                      distance, in which edits may include inserts, deletes,
169
                      and substitutions
170
                    - ``osa`` computes the Optimal String Alignment distance,
171
                      in which edits may include inserts, deletes,
172
                      substitutions, and transpositions but substrings may only
173
                      be edited once
174
            cost (tuple): a 4-tuple representing the cost of the four possible
175
                edits: inserts, deletes, substitutions, and transpositions,
176
                respectively (by default: (1, 1, 1, 1))
177
178
        Returns:
179
            float: The normalized Levenshtein distance between src & tar
180
181
        Examples:
182
            >>> cmp = Levenshtein()
183
            >>> round(cmp.dist('cat', 'hat'), 12)
184
            0.333333333333
185
            >>> round(cmp.dist('Niall', 'Neil'), 12)
186
            0.6
187
            >>> cmp.dist('aluminum', 'Catalan')
188
            0.875
189
            >>> cmp.dist('ATCG', 'TAGC')
190
            0.75
191
192
        """
193 1
        if src == tar:
194 1
            return 0
195 1
        ins_cost, del_cost = cost[:2]
196 1
        return levenshtein(src, tar, mode, cost) / (
197
            max(len(src) * del_cost, len(tar) * ins_cost)
198
        )
199
200
201 1
def levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
202
    """Return the Levenshtein distance between two strings.
203
204
    Args:
205
        src (str): Source string for comparison
206
        tar (str): Target string for comparison
207
        mode (str): specifies a mode for computing the Levenshtein
208
            distance:
209
                - ``lev`` (default) computes the ordinary Levenshtein distance,
210
                  in which edits may include inserts, deletes, and
211
                  substitutions
212
                - ``osa`` computes the Optimal String Alignment distance, in
213
                  which edits may include inserts, deletes, substitutions, and
214
                  transpositions but substrings may only be edited once
215
        cost (tuple): a 4-tuple representing the cost of the four possible
216
            edits: inserts, deletes, substitutions, and transpositions,
217
            respectively (by default: (1, 1, 1, 1))
218
219
    Returns:
220
        int (may return a float if cost has float values): The Levenshtein
221
            distance between src & tar
222
223
    Examples:
224
        >>> levenshtein('cat', 'hat')
225
        1
226
        >>> levenshtein('Niall', 'Neil')
227
        3
228
        >>> levenshtein('aluminum', 'Catalan')
229
        7
230
        >>> levenshtein('ATCG', 'TAGC')
231
        3
232
233
        >>> levenshtein('ATCG', 'TAGC', mode='osa')
234
        2
235
        >>> levenshtein('ACTG', 'TAGC', mode='osa')
236
        4
237
238
    """
239 1
    return Levenshtein().dist_abs(src, tar, mode, cost)
240
241
242 1
def dist_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
243
    """Return the normalized Levenshtein distance between two strings.
244
245
    The Levenshtein distance is normalized by dividing the Levenshtein distance
246
    (calculated by any of the three supported methods) by the greater of
247
    the number of characters in src times the cost of a delete and
248
    the number of characters in tar times the cost of an insert.
249
    For the case in which all operations have :math:`cost = 1`, this is
250
    equivalent to the greater of the length of the two strings src & tar.
251
252
    Args:
253
        src (str): Source string for comparison
254
        tar (str): Target string for comparison
255
        mode (str): specifies a mode for computing the Levenshtein
256
            distance:
257
                - ``lev`` (default) computes the ordinary Levenshtein distance,
258
                  in which edits may include inserts, deletes, and
259
                  substitutions
260
                - ``osa`` computes the Optimal String Alignment distance, in
261
                  which edits may include inserts, deletes, substitutions, and
262
                  transpositions but substrings may only be edited once
263
        cost (tuple): a 4-tuple representing the cost of the four possible
264
            edits: inserts, deletes, substitutions, and transpositions,
265
            respectively (by default: (1, 1, 1, 1))
266
267
    Returns:
268
        float: The Levenshtein distance between src & tar
269
270
    Examples:
271
        >>> round(dist_levenshtein('cat', 'hat'), 12)
272
        0.333333333333
273
        >>> round(dist_levenshtein('Niall', 'Neil'), 12)
274
        0.6
275
        >>> dist_levenshtein('aluminum', 'Catalan')
276
        0.875
277
        >>> dist_levenshtein('ATCG', 'TAGC')
278
        0.75
279
280
    """
281 1
    return Levenshtein().dist(src, tar, mode, cost)
282
283
284 1
def sim_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
285
    """Return the Levenshtein similarity of two strings.
286
287
    Normalized Levenshtein similarity is the complement of normalized
288
    Levenshtein distance:
289
    :math:`sim_{Levenshtein} = 1 - dist_{Levenshtein}`.
290
291
    Args:
292
        src (str): Source string for comparison
293
        tar (str): Target string for comparison
294
        mode (str): specifies a mode for computing the Levenshtein
295
            distance:
296
                - ``lev`` (default) computes the ordinary Levenshtein distance,
297
                  in which edits may include inserts, deletes, and
298
                  substitutions
299
                - ``osa`` computes the Optimal String Alignment distance, in
300
                  which edits may include inserts, deletes, substitutions, and
301
                  transpositions but substrings may only be edited once
302
        cost (tuple): a 4-tuple representing the cost of the four possible
303
            edits: inserts, deletes, substitutions, and transpositions,
304
            respectively (by default: (1, 1, 1, 1))
305
306
    Returns:
307
        float: The Levenshtein similarity between src & tar
308
309
    Examples:
310
        >>> round(sim_levenshtein('cat', 'hat'), 12)
311
        0.666666666667
312
        >>> round(sim_levenshtein('Niall', 'Neil'), 12)
313
        0.4
314
        >>> sim_levenshtein('aluminum', 'Catalan')
315
        0.125
316
        >>> sim_levenshtein('ATCG', 'TAGC')
317
        0.25
318
319
    """
320 1
    return Levenshtein().sim(src, tar, mode, cost)
321
322
323 1
class DamerauLevenshtein(Distance):
0 ignored issues
show
Unused Code introduced by
The variable __class__ seems to be unused.
Loading history...
324
    """Damerau-Levenshtein distance.
325
326
    This computes the Damerau-Levenshtein distance :cite:`Damerau:1964`.
327
    Damerau-Levenshtein code is based on Java code by Kevin L. Stern
328
    :cite:`Stern:2014`, under the MIT license:
329
    https://github.com/KevinStern/software-and-algorithms/blob/master/src/main/java/blogspot/software_and_algorithms/stern_library/string/DamerauLevenshteinAlgorithm.java
330
    """
331
332 1
    def dist_abs(self, src, tar, cost=(1, 1, 1, 1)):
0 ignored issues
show
Comprehensibility introduced by
This function exceeds the maximum number of variables (22/15).
Loading history...
Bug introduced by
Parameters differ from overridden 'dist_abs' method
Loading history...
333
        """Return the Damerau-Levenshtein distance between two strings.
334
335
        Args:
336
            src (str): Source string for comparison
337
            tar (str): Target string for comparison
338
            cost (tuple): a 4-tuple representing the cost of the four possible
339
                edits: inserts, deletes, substitutions, and transpositions,
340
                respectively (by default: (1, 1, 1, 1))
341
342
        Returns:
343
            int (may return a float if cost has float values): The
344
                Damerau-Levenshtein distance between src & tar
345
346
        Raises:
347
            ValueError: Unsupported cost assignment; the cost of two
348
                transpositions must not be less than the cost of an insert plus
349
                a delete.
350
351
        Examples:
352
            >>> cmp = DamerauLevenshtein()
353
            >>> cmp.dist_abs('cat', 'hat')
354
            1
355
            >>> cmp.dist_abs('Niall', 'Neil')
356
            3
357
            >>> cmp.dist_abs('aluminum', 'Catalan')
358
            7
359
            >>> cmp.dist_abs('ATCG', 'TAGC')
360
            2
361
362
        """
363 1
        ins_cost, del_cost, sub_cost, trans_cost = cost
364
365 1
        if src == tar:
366 1
            return 0
367 1
        if not src:
368 1
            return len(tar) * ins_cost
369 1
        if not tar:
370 1
            return len(src) * del_cost
371
372 1
        if 2 * trans_cost < ins_cost + del_cost:
373 1
            raise ValueError(
374
                'Unsupported cost assignment; the cost of two transpositions '
375
                + 'must not be less than the cost of an insert plus a delete.'
376
            )
377
378 1
        d_mat = np_zeros((len(src)) * (len(tar)), dtype=np_int).reshape(
379
            (len(src), len(tar))
380
        )
381
382 1
        if src[0] != tar[0]:
383 1
            d_mat[0, 0] = min(sub_cost, ins_cost + del_cost)
384
385 1
        src_index_by_character = {src[0]: 0}
386 1
        for i in range(1, len(src)):
387 1
            del_distance = d_mat[i - 1, 0] + del_cost
388 1
            ins_distance = (i + 1) * del_cost + ins_cost
389 1
            match_distance = i * del_cost + (
390
                0 if src[i] == tar[0] else sub_cost
391
            )
392 1
            d_mat[i, 0] = min(del_distance, ins_distance, match_distance)
393
394 1
        for j in range(1, len(tar)):
395 1
            del_distance = (j + 1) * ins_cost + del_cost
396 1
            ins_distance = d_mat[0, j - 1] + ins_cost
397 1
            match_distance = j * ins_cost + (
398
                0 if src[0] == tar[j] else sub_cost
399
            )
400 1
            d_mat[0, j] = min(del_distance, ins_distance, match_distance)
401
402 1
        for i in range(1, len(src)):
403 1
            max_src_letter_match_index = 0 if src[i] == tar[0] else -1
404 1
            for j in range(1, len(tar)):
405 1
                candidate_swap_index = (
406
                    -1
407
                    if tar[j] not in src_index_by_character
408
                    else src_index_by_character[tar[j]]
409
                )
410 1
                j_swap = max_src_letter_match_index
411 1
                del_distance = d_mat[i - 1, j] + del_cost
412 1
                ins_distance = d_mat[i, j - 1] + ins_cost
413 1
                match_distance = d_mat[i - 1, j - 1]
414 1
                if src[i] != tar[j]:
415 1
                    match_distance += sub_cost
416
                else:
417 1
                    max_src_letter_match_index = j
418
419 1
                if candidate_swap_index != -1 and j_swap != -1:
420 1
                    i_swap = candidate_swap_index
421
422 1
                    if i_swap == 0 and j_swap == 0:
423 1
                        pre_swap_cost = 0
424
                    else:
425 1
                        pre_swap_cost = d_mat[
426
                            max(0, i_swap - 1), max(0, j_swap - 1)
427
                        ]
428 1
                    swap_distance = (
429
                        pre_swap_cost
430
                        + (i - i_swap - 1) * del_cost
431
                        + (j - j_swap - 1) * ins_cost
432
                        + trans_cost
433
                    )
434
                else:
435 1
                    swap_distance = maxsize
436
437 1
                d_mat[i, j] = min(
438
                    del_distance, ins_distance, match_distance, swap_distance
439
                )
440 1
            src_index_by_character[src[i]] = i
441
442 1
        return d_mat[len(src) - 1, len(tar) - 1]
443
444 1
    def dist(self, src, tar, cost=(1, 1, 1, 1)):
0 ignored issues
show
Bug introduced by
Parameters differ from overridden 'dist' method
Loading history...
445
        """Return the Damerau-Levenshtein similarity of two strings.
446
447
        Damerau-Levenshtein distance normalized to the interval [0, 1].
448
449
        The Damerau-Levenshtein distance is normalized by dividing the
450
        Damerau-Levenshtein distance by the greater of
451
        the number of characters in src times the cost of a delete and
452
        the number of characters in tar times the cost of an insert.
453
        For the case in which all operations have :math:`cost = 1`, this is
454
        equivalent to the greater of the length of the two strings src & tar.
455
456
        Args:
457
            src (str): Source string for comparison
458
            tar (str): Target string for comparison
459
            cost (tuple): a 4-tuple representing the cost of the four possible
460
                edits: inserts, deletes, substitutions, and transpositions,
461
                respectively (by default: (1, 1, 1, 1))
462
463
        Returns:
464
            float: The normalized Damerau-Levenshtein distance
465
466
        Examples:
467
            >>> cmp = DamerauLevenshtein()
468
            >>> round(cmp.dist('cat', 'hat'), 12)
469
            0.333333333333
470
            >>> round(cmp.dist('Niall', 'Neil'), 12)
471
            0.6
472
            >>> cmp.dist('aluminum', 'Catalan')
473
            0.875
474
            >>> cmp.dist('ATCG', 'TAGC')
475
            0.5
476
477
        """
478 1
        if src == tar:
479 1
            return 0.0
480 1
        ins_cost, del_cost = cost[:2]
481 1
        return self.dist_abs(src, tar, cost) / (
482
            max(len(src) * del_cost, len(tar) * ins_cost)
483
        )
484
485
486 1
def damerau_levenshtein(src, tar, cost=(1, 1, 1, 1)):
487
    """Return the Damerau-Levenshtein distance between two strings.
488
489
    Args:
490
        src (str): Source string for comparison
491
        tar (str): Target string for comparison
492
        cost (tuple): a 4-tuple representing the cost of the four possible
493
            edits: inserts, deletes, substitutions, and transpositions,
494
            respectively (by default: (1, 1, 1, 1))
495
496
    Returns:
497
        int (may return a float if cost has float values): The
498
            Damerau-Levenshtein distance between src & tar
499
500
    Examples:
501
        >>> damerau_levenshtein('cat', 'hat')
502
        1
503
        >>> damerau_levenshtein('Niall', 'Neil')
504
        3
505
        >>> damerau_levenshtein('aluminum', 'Catalan')
506
        7
507
        >>> damerau_levenshtein('ATCG', 'TAGC')
508
        2
509
510
    """
511 1
    return DamerauLevenshtein().dist_abs(src, tar, cost)
512
513
514 1
def dist_damerau(src, tar, cost=(1, 1, 1, 1)):
515
    """Return the Damerau-Levenshtein similarity of two strings.
516
517
    Damerau-Levenshtein distance normalized to the interval [0, 1].
518
519
    The Damerau-Levenshtein distance is normalized by dividing the
520
    Damerau-Levenshtein distance by the greater of
521
    the number of characters in src times the cost of a delete and
522
    the number of characters in tar times the cost of an insert.
523
    For the case in which all operations have :math:`cost = 1`, this is
524
    equivalent to the greater of the length of the two strings src & tar.
525
526
    Args:
527
        src (str): Source string for comparison
528
        tar (str): Target string for comparison
529
        cost (tuple): a 4-tuple representing the cost of the four possible
530
            edits: inserts, deletes, substitutions, and transpositions,
531
            respectively (by default: (1, 1, 1, 1))
532
533
    Returns:
534
        float: The normalized Damerau-Levenshtein distance
535
536
    Examples:
537
        >>> round(dist_damerau('cat', 'hat'), 12)
538
        0.333333333333
539
        >>> round(dist_damerau('Niall', 'Neil'), 12)
540
        0.6
541
        >>> dist_damerau('aluminum', 'Catalan')
542
        0.875
543
        >>> dist_damerau('ATCG', 'TAGC')
544
        0.5
545
546
    """
547 1
    return DamerauLevenshtein().dist(src, tar, cost)
548
549
550 1
def sim_damerau(src, tar, cost=(1, 1, 1, 1)):
551
    """Return the Damerau-Levenshtein similarity of two strings.
552
553
    Normalized Damerau-Levenshtein similarity the complement of normalized
554
    Damerau-Levenshtein distance:
555
    :math:`sim_{Damerau} = 1 - dist_{Damerau}`.
556
557
    Args:
558
        src (str): Source string for comparison
559
        tar (str): Target string for comparison
560
        cost (tuple): a 4-tuple representing the cost of the four possible
561
            edits: inserts, deletes, substitutions, and transpositions,
562
            respectively (by default: (1, 1, 1, 1))
563
564
    Returns:
565
        float: The normalized Damerau-Levenshtein similarity
566
567
    Examples:
568
        >>> round(sim_damerau('cat', 'hat'), 12)
569
        0.666666666667
570
        >>> round(sim_damerau('Niall', 'Neil'), 12)
571
        0.4
572
        >>> sim_damerau('aluminum', 'Catalan')
573
        0.125
574
        >>> sim_damerau('ATCG', 'TAGC')
575
        0.5
576
577
    """
578 1
    return DamerauLevenshtein().sim(src, tar, cost)
579
580
581 1
class Indel(Distance):
0 ignored issues
show
Unused Code introduced by
The variable __class__ seems to be unused.
Loading history...
582
    """Indel distance.
583
584
    This is equivalent to Levenshtein distance, when only inserts and deletes
585
    are possible.
586
    """
587
588 1
    _lev = Levenshtein()
589
590 1
    def dist_abs(self, src, tar):
0 ignored issues
show
Bug introduced by
Parameters differ from overridden 'dist_abs' method
Loading history...
591
        """Return the indel distance between two strings.
592
593
        Args:
594
            src (str): Source string for comparison
595
            tar (str): Target string for comparison
596
597
        Returns:
598
            int: Indel distance
599
600
        Examples:
601
            >>> cmp = Indel()
602
            >>> cmp.dist_abs('cat', 'hat')
603
            2
604
            >>> cmp.dist_abs('Niall', 'Neil')
605
            3
606
            >>> cmp.dist_abs('Colin', 'Cuilen')
607
            5
608
            >>> cmp.dist_abs('ATCG', 'TAGC')
609
            4
610
611
        """
612 1
        return self._lev.dist_abs(
613
            src, tar, mode='lev', cost=(1, 1, 9999, 9999)
614
        )
615
616 1
    def dist(self, src, tar):
0 ignored issues
show
Bug introduced by
Parameters differ from overridden 'dist' method
Loading history...
617
        """Return the normalized indel distance between two strings.
618
619
        This is equivalent to normalized Levenshtein distance, when only
620
        inserts and deletes are possible.
621
622
        Args:
623
            src (str): Source string for comparison
624
            tar (str): Target string for comparison
625
626
        Returns:
627
            float: Normalized indel distance
628
629
        Examples:
630
            >>> cmp = Indel()
631
            >>> round(cmp.dist('cat', 'hat'), 12)
632
            0.333333333333
633
            >>> round(cmp.dist('Niall', 'Neil'), 12)
634
            0.333333333333
635
            >>> round(cmp.dist('Colin', 'Cuilen'), 12)
636
            0.454545454545
637
            >>> cmp.dist('ATCG', 'TAGC')
638
            0.5
639
640
        """
641 1
        if src == tar:
642 1
            return 0.0
643 1
        return self.dist_abs(src, tar) / (len(src) + len(tar))
644
645
646 1
def indel(src, tar):
647
    """Return the indel distance between two strings.
648
649
    Args:
650
        src (str): Source string for comparison
651
        tar (str): Target string for comparison
652
653
    Returns:
654
        int: Indel distance
655
656
    Examples:
657
        >>> indel('cat', 'hat')
658
        2
659
        >>> indel('Niall', 'Neil')
660
        3
661
        >>> indel('Colin', 'Cuilen')
662
        5
663
        >>> indel('ATCG', 'TAGC')
664
        4
665
666
    """
667
    return Indel().dist_abs(src, tar)
668
669
670 1
def dist_indel(src, tar):
671
    """Return the normalized indel distance between two strings.
672
673
    This is equivalent to normalized Levenshtein distance, when only inserts
674
    and deletes are possible.
675
676
    Args:
677
        src (str): Source string for comparison
678
        tar (str): Target string for comparison
679
680
    Returns:
681
        float: Normalized indel distance
682
683
    Examples:
684
        >>> round(dist_indel('cat', 'hat'), 12)
685
        0.333333333333
686
        >>> round(dist_indel('Niall', 'Neil'), 12)
687
        0.333333333333
688
        >>> round(dist_indel('Colin', 'Cuilen'), 12)
689
        0.454545454545
690
        >>> dist_indel('ATCG', 'TAGC')
691
        0.5
692
693
    """
694 1
    return Indel().dist(src, tar)
695
696
697 1
def sim_indel(src, tar):
698
    """Return the normalized indel similarity of two strings.
699
700
    This is equivalent to normalized Levenshtein similarity, when only inserts
701
    and deletes are possible.
702
703
    Args:
704
        src (str): Source string for comparison
705
        tar (str): Target string for comparison
706
707
    Returns:
708
        float: Normalized indel similarity
709
710
    Examples:
711
        >>> round(sim_indel('cat', 'hat'), 12)
712
        0.666666666667
713
        >>> round(sim_indel('Niall', 'Neil'), 12)
714
        0.666666666667
715
        >>> round(sim_indel('Colin', 'Cuilen'), 12)
716
        0.545454545455
717
        >>> sim_indel('ATCG', 'TAGC')
718
        0.5
719
720
    """
721 1
    return Indel().sim(src, tar)
722
723
724
if __name__ == '__main__':
725
    import doctest
726
727
    doctest.testmod()
728