Completed
Push to master (2d4fd8...e8b381) by Chris
created 12:21

abydos.distance.typo() (rated F)

Complexity
    Conditions: 13

Size
    Total Lines: 105
    Code Lines: 69

Duplication
    Lines: 0
    Ratio: 0 %

Importance
    Changes: 0
Metric   Value
cc       13
eloc     69
nop      4
dl       0
loc      105
rs       3.9545
c        0
b        0
f        0
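The cc value above is a cyclomatic complexity count. As a rough sketch of the methodology (an assumption for illustration, not the report's exact tooling), cyclomatic complexity can be approximated by starting at 1 and adding one for each decision point in the function:

```python
import ast


def cyclomatic_complexity(source):
    """Approximate McCabe cyclomatic complexity: 1 + decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.IfExp, ast.For, ast.While,
                             ast.ExceptHandler)):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' contributes len(values) - 1 decision points
            complexity += len(node.values) - 1
    return complexity


print(cyclomatic_complexity('''
def hamming(src, tar, difflens=True):
    if not difflens and len(src) != len(tar):
        raise ValueError('Undefined for sequences of unequal length')
    hdist = 0
    if difflens:
        hdist += abs(len(src) - len(tar))
    return hdist + sum(c1 != c2 for c1, c2 in zip(src, tar))
'''))  # prints 4: two ifs + one 'and' + base of 1
```

A count of 13, as reported here, means at least 13 linearly independent paths through the function, which is why the report grades it poorly.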

How to fix: Long Method, Complexity

Long Method

Small methods make your code easier to understand, particularly when combined with a good name. Moreover, when a method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, that is usually a sign that the commented part should be extracted into a new method, with the comment serving as a starting point for the new method's name.

Commonly applied refactorings include Extract Method.
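A minimal sketch of Extract Method, with hypothetical names loosely modeled on the transposition check in this module's OSA loop; the inline comment becomes the name of a small helper:

```python
# Before: a commented check buried inside a longer method.
def osa_cell(d_mat, src, tar, i, j, trans_cost):
    cell = d_mat[i + 1][j + 1]
    # transposition: adjacent characters swapped between src and tar
    if i > 0 and j > 0 and src[i] == tar[j - 1] and src[i - 1] == tar[j]:
        cell = min(cell, d_mat[i - 1][j - 1] + trans_cost)
    return cell


# After: the comment has become the helper's name.
def is_transposition(src, tar, i, j):
    return i > 0 and j > 0 and src[i] == tar[j - 1] and src[i - 1] == tar[j]


def osa_cell_refactored(d_mat, src, tar, i, j, trans_cost):
    cell = d_mat[i + 1][j + 1]
    if is_transposition(src, tar, i, j):
        cell = min(cell, d_mat[i - 1][j - 1] + trans_cost)
    return cell
```

The behavior is unchanged; the condition now reads as a sentence at the call site, and the helper can be tested in isolation.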

Complexity

Complex classes like abydos.distance.typo() often do a lot of different things. To break such a class down, we need to identify a cohesive component within it. A common approach to finding such a component is to look for fields and methods that share the same prefixes or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
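For illustration, the shared sim_ and dist_ prefixes in this module suggest one such grouping. A hypothetical Extract Class result (not abydos's actual API) could bundle each measure's raw distance, normalized distance, and similarity:

```python
class Levenshtein:
    """Hypothetical extracted class bundling the levenshtein(),
    dist_levenshtein(), and sim_levenshtein() trio."""

    def __init__(self, cost=(1, 1, 1)):
        # insertion, deletion, and substitution costs
        self.ins, self.dele, self.sub = cost

    def distance(self, src, tar):
        # classic two-row Wagner-Fischer dynamic program
        prev = [j * self.ins for j in range(len(tar) + 1)]
        for i, s in enumerate(src, 1):
            cur = [i * self.dele]
            for j, t in enumerate(tar, 1):
                cur.append(min(cur[j - 1] + self.ins,
                               prev[j] + self.dele,
                               prev[j - 1] + (0 if s == t else self.sub)))
            prev = cur
        return prev[-1]

    def dist(self, src, tar):
        # normalized to [0, 1], as in the module's dist_levenshtein()
        if src == tar:
            return 0.0
        return self.distance(src, tar) / max(len(src) * self.dele,
                                             len(tar) * self.ins)

    def sim(self, src, tar):
        return 1 - self.dist(src, tar)


lev = Levenshtein()
print(lev.distance('cat', 'hat'), lev.sim('cat', 'hat'))
```

Each measure then carries its cost configuration as state rather than threading a cost tuple through three free functions.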

# -*- coding: utf-8 -*-

# Copyright 2014-2018 by Christopher C. Little.
# This file is part of Abydos.
#
# Abydos is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Abydos is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.

"""abydos.distance.

The distance module implements string edit distance functions including:

    - Levenshtein distance
    - Optimal String Alignment distance
    - Levenshtein-Damerau distance
    - Hamming distance
    - Tversky index
    - Sørensen–Dice coefficient & distance
    - Jaccard similarity coefficient & distance
    - overlap similarity & distance
    - Tanimoto coefficient & distance
    - Minkowski distance & similarity
    - Manhattan distance & similarity
    - Euclidean distance & similarity
    - Chebyshev distance & similarity
    - cosine similarity & distance
    - Jaro distance
    - Jaro-Winkler distance (incl. the strcmp95 algorithm variant)
    - Longest common substring
    - Ratcliff-Obershelp similarity & distance
    - Match Rating Algorithm similarity
    - Normalized Compression Distance (NCD) & similarity
    - Monge-Elkan similarity & distance
    - Matrix similarity
    - Needleman-Wunsch score
    - Smith-Waterman score
    - Gotoh score
    - Length similarity
    - Prefix, Suffix, and Identity similarity & distance
    - Modified Language-Independent Product Name Search (MLIPNS) similarity &
      distance
    - Bag distance
    - Editex distance
    - Eudex distances
    - Sift4 distance
    - Baystat distance & similarity
    - Typo distance

Functions beginning with the prefixes 'sim' and 'dist' are guaranteed to be
in the range [0, 1], and sim_X = 1 - dist_X since the two are complements.
If a sim_X function is supplied identical src & tar arguments, it is guaranteed
to return 1; the corresponding dist_X function is guaranteed to return 0.
"""

from __future__ import division, unicode_literals

import codecs
import math
import sys
import types
import unicodedata
from collections import Counter, defaultdict

import numpy as np

from six import text_type
from six.moves import range

from .compression import ac_encode, ac_train, rle_encode
from .phonetic import eudex, mra
from .qgram import QGrams

try:
    import lzma
except ImportError:  # pragma: no cover
    # If the system lacks the lzma library, that's fine, but lzma compression
    # similarity won't be supported.
    pass


def levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
    """Return the Levenshtein distance between two strings.

    Levenshtein distance

    This is the standard edit distance measure. Cf.
    https://en.wikipedia.org/wiki/Levenshtein_distance

    Two additional variants: optimal string alignment (aka restricted
    Damerau-Levenshtein distance) and the Damerau-Levenshtein distance
    are also supported. Cf.
    https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

    The ordinary Levenshtein & Optimal String Alignment distance both
    employ the Wagner-Fischer dynamic programming algorithm. Cf.
    https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm

    Levenshtein edit distance ordinarily has unit insertion, deletion, and
    substitution costs.

    :param str src, tar: two strings to be compared
    :param str mode: specifies a mode for computing the Levenshtein distance:

        - 'lev' (default) computes the ordinary Levenshtein distance,
          in which edits may include inserts, deletes, and substitutions
        - 'osa' computes the Optimal String Alignment distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions but substrings may only be edited once
        - 'dam' computes the Damerau-Levenshtein distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions and substrings may undergo repeated edits

    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits: inserts, deletes, substitutions, and transpositions,
        respectively (by default: (1, 1, 1, 1))
    :returns: the Levenshtein distance between src & tar
    :rtype: int (may return a float if cost has float values)

    >>> levenshtein('cat', 'hat')
    1
    >>> levenshtein('Niall', 'Neil')
    3
    >>> levenshtein('aluminum', 'Catalan')
    7
    >>> levenshtein('ATCG', 'TAGC')
    3

    >>> levenshtein('ATCG', 'TAGC', mode='osa')
    2
    >>> levenshtein('ACTG', 'TAGC', mode='osa')
    4

    >>> levenshtein('ATCG', 'TAGC', mode='dam')
    2
    >>> levenshtein('ACTG', 'TAGC', mode='dam')
    3
    """
    ins_cost, del_cost, sub_cost, trans_cost = cost

    if src == tar:
        return 0
    if not src:
        return len(tar) * ins_cost
    if not tar:
        return len(src) * del_cost

    if 'dam' in mode:
        return damerau_levenshtein(src, tar, cost)

    # pylint: disable=no-member
    d_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.int)
    # pylint: enable=no-member
    for i in range(len(src)+1):
        d_mat[i, 0] = i * del_cost
    for j in range(len(tar)+1):
        d_mat[0, j] = j * ins_cost

    for i in range(len(src)):
        for j in range(len(tar)):
            d_mat[i+1, j+1] = min(
                d_mat[i+1, j] + ins_cost,  # ins
                d_mat[i, j+1] + del_cost,  # del
                d_mat[i, j] + (sub_cost if src[i] != tar[j] else 0)  # sub/==
            )

            if mode == 'osa':
                if ((i+1 > 1 and j+1 > 1 and src[i] == tar[j-1] and
                     src[i-1] == tar[j])):
                    # transposition
                    d_mat[i+1, j+1] = min(d_mat[i+1, j+1],
                                          d_mat[i-1, j-1] + trans_cost)

    return d_mat[len(src), len(tar)]


def dist_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
    """Return the normalized Levenshtein distance between two strings.

    Levenshtein distance normalized to the interval [0, 1]

    The Levenshtein distance is normalized by dividing the Levenshtein distance
    (calculated by any of the three supported methods) by the greater of
    the number of characters in src times the cost of a delete and
    the number of characters in tar times the cost of an insert.
    For the case in which all operations have :math:`cost = 1`, this is
    equivalent to the greater of the length of the two strings src & tar.

    :param str src, tar: two strings to be compared
    :param str mode: specifies a mode for computing the Levenshtein distance:

        - 'lev' (default) computes the ordinary Levenshtein distance,
          in which edits may include inserts, deletes, and substitutions
        - 'osa' computes the Optimal String Alignment distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions but substrings may only be edited once
        - 'dam' computes the Damerau-Levenshtein distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions and substrings may undergo repeated edits

    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits: inserts, deletes, substitutions, and transpositions,
        respectively (by default: (1, 1, 1, 1))
    :returns: normalized Levenshtein distance
    :rtype: float

    >>> dist_levenshtein('cat', 'hat')
    0.33333333333333331
    >>> dist_levenshtein('Niall', 'Neil')
    0.59999999999999998
    >>> dist_levenshtein('aluminum', 'Catalan')
    0.875
    >>> dist_levenshtein('ATCG', 'TAGC')
    0.75
    """
    if src == tar:
        return 0
    ins_cost, del_cost = cost[:2]
    return (levenshtein(src, tar, mode, cost) /
            (max(len(src)*del_cost, len(tar)*ins_cost)))


def sim_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
    """Return the Levenshtein similarity of two strings.

    Levenshtein similarity normalized to the interval [0, 1]

    Levenshtein similarity is the complement of Levenshtein distance:
    :math:`sim_{Levenshtein} = 1 - dist_{Levenshtein}`

    The arguments are identical to those of the levenshtein() function.

    :param str src, tar: two strings to be compared
    :param str mode: specifies a mode for computing the Levenshtein distance:

            - 'lev' (default) computes the ordinary Levenshtein distance,
              in which edits may include inserts, deletes, and substitutions
            - 'osa' computes the Optimal String Alignment distance, in which
              edits may include inserts, deletes, substitutions, and
              transpositions but substrings may only be edited once
            - 'dam' computes the Damerau-Levenshtein distance, in which
              edits may include inserts, deletes, substitutions, and
              transpositions and substrings may undergo repeated edits

    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits:
        inserts, deletes, substitutions, and transpositions, respectively
        (by default: (1, 1, 1, 1))
    :returns: normalized Levenshtein similarity
    :rtype: float

    >>> sim_levenshtein('cat', 'hat')
    0.66666666666666674
    >>> sim_levenshtein('Niall', 'Neil')
    0.40000000000000002
    >>> sim_levenshtein('aluminum', 'Catalan')
    0.125
    >>> sim_levenshtein('ATCG', 'TAGC')
    0.25
    """
    return 1 - dist_levenshtein(src, tar, mode, cost)


def damerau_levenshtein(src, tar, cost=(1, 1, 1, 1)):
    """Return the Damerau-Levenshtein distance between two strings.

    Damerau-Levenshtein distance

    This computes the Damerau-Levenshtein distance. Cf.
    https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

    Damerau-Levenshtein code based on Java code by Kevin L. Stern,
    under the MIT license:
    https://github.com/KevinStern/software-and-algorithms/blob/master/src/main/java/blogspot/software_and_algorithms/stern_library/string/DamerauLevenshteinAlgorithm.java

    :param str src, tar: two strings to be compared
    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits:
        inserts, deletes, substitutions, and transpositions, respectively
        (by default: (1, 1, 1, 1))
    :returns: the Damerau-Levenshtein distance between src & tar
    :rtype: int (may return a float if cost has float values)

    >>> damerau_levenshtein('cat', 'hat')
    1
    >>> damerau_levenshtein('Niall', 'Neil')
    3
    >>> damerau_levenshtein('aluminum', 'Catalan')
    7
    >>> damerau_levenshtein('ATCG', 'TAGC')
    2
    """
    ins_cost, del_cost, sub_cost, trans_cost = cost

    if src == tar:
        return 0
    if not src:
        return len(tar) * ins_cost
    if not tar:
        return len(src) * del_cost

    if 2*trans_cost < ins_cost + del_cost:
        raise ValueError('Unsupported cost assignment; the cost of two ' +
                         'transpositions must not be less than the cost of ' +
                         'an insert plus a delete.')

    # pylint: disable=no-member
    d_mat = (np.zeros((len(src))*(len(tar)), dtype=np.int).
             reshape((len(src), len(tar))))
    # pylint: enable=no-member

    if src[0] != tar[0]:
        d_mat[0, 0] = min(sub_cost, ins_cost + del_cost)

    src_index_by_character = {}
    src_index_by_character[src[0]] = 0
    for i in range(1, len(src)):
        del_distance = d_mat[i-1, 0] + del_cost
        ins_distance = (i+1) * del_cost + ins_cost
        match_distance = (i * del_cost +
                          (0 if src[i] == tar[0] else sub_cost))
        d_mat[i, 0] = min(del_distance, ins_distance, match_distance)

    for j in range(1, len(tar)):
        del_distance = (j+1) * ins_cost + del_cost
        ins_distance = d_mat[0, j-1] + ins_cost
        match_distance = (j * ins_cost +
                          (0 if src[0] == tar[j] else sub_cost))
        d_mat[0, j] = min(del_distance, ins_distance, match_distance)

    for i in range(1, len(src)):
        max_src_letter_match_index = (0 if src[i] == tar[0] else -1)
        for j in range(1, len(tar)):
            candidate_swap_index = (-1 if tar[j] not in
                                    src_index_by_character else
                                    src_index_by_character[tar[j]])
            j_swap = max_src_letter_match_index
            del_distance = d_mat[i-1, j] + del_cost
            ins_distance = d_mat[i, j-1] + ins_cost
            match_distance = d_mat[i-1, j-1]
            if src[i] != tar[j]:
                match_distance += sub_cost
            else:
                max_src_letter_match_index = j

            if candidate_swap_index != -1 and j_swap != -1:
                i_swap = candidate_swap_index

                if i_swap == 0 and j_swap == 0:
                    pre_swap_cost = 0
                else:
                    pre_swap_cost = d_mat[max(0, i_swap-1), max(0, j_swap-1)]
                swap_distance = (pre_swap_cost + (i - i_swap - 1) *
                                 del_cost + (j - j_swap - 1) * ins_cost +
                                 trans_cost)
            else:
                swap_distance = sys.maxsize

            d_mat[i, j] = min(del_distance, ins_distance,
                              match_distance, swap_distance)
        src_index_by_character[src[i]] = i

    return d_mat[len(src)-1, len(tar)-1]


def dist_damerau(src, tar, cost=(1, 1, 1, 1)):
    """Return the normalized Damerau-Levenshtein distance between two strings.

    Damerau-Levenshtein distance normalized to the interval [0, 1]

    The Damerau-Levenshtein distance is normalized by dividing the
    Damerau-Levenshtein distance by the greater of
    the number of characters in src times the cost of a delete and
    the number of characters in tar times the cost of an insert.
    For the case in which all operations have :math:`cost = 1`, this is
    equivalent to the greater of the length of the two strings src & tar.

    The arguments are identical to those of the levenshtein() function.

    :param str src, tar: two strings to be compared
    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits:
        inserts, deletes, substitutions, and transpositions, respectively
        (by default: (1, 1, 1, 1))
    :returns: normalized Damerau-Levenshtein distance
    :rtype: float

    >>> dist_damerau('cat', 'hat')
    0.33333333333333331
    >>> dist_damerau('Niall', 'Neil')
    0.59999999999999998
    >>> dist_damerau('aluminum', 'Catalan')
    0.875
    >>> dist_damerau('ATCG', 'TAGC')
    0.5
    """
    if src == tar:
        return 0
    ins_cost, del_cost = cost[:2]
    return (damerau_levenshtein(src, tar, cost) /
            (max(len(src)*del_cost, len(tar)*ins_cost)))


def sim_damerau(src, tar, cost=(1, 1, 1, 1)):
    """Return the Damerau-Levenshtein similarity of two strings.

    Damerau-Levenshtein similarity normalized to the interval [0, 1]

    Damerau-Levenshtein similarity is the complement of Damerau-Levenshtein
    distance:
    :math:`sim_{Damerau} = 1 - dist_{Damerau}`

    The arguments are identical to those of the levenshtein() function.

    :param str src, tar: two strings to be compared
    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits:
        inserts, deletes, substitutions, and transpositions, respectively
        (by default: (1, 1, 1, 1))
    :returns: normalized Damerau-Levenshtein similarity
    :rtype: float

    >>> sim_damerau('cat', 'hat')
    0.66666666666666674
    >>> sim_damerau('Niall', 'Neil')
    0.40000000000000002
    >>> sim_damerau('aluminum', 'Catalan')
    0.125
    >>> sim_damerau('ATCG', 'TAGC')
    0.5
    """
    return 1 - dist_damerau(src, tar, cost)


def hamming(src, tar, difflens=True):
    """Return the Hamming distance between two strings.

    Hamming distance

    Hamming distance equals the number of character positions at which two
    strings differ. For strings of unequal lengths, it is not normally defined.
    By default, this implementation calculates the Hamming distance of the
    first n characters where n is the lesser of the two strings' lengths and
    adds to this the difference in string lengths.

    :param str src, tar: two strings to be compared
    :param bool difflens:
        If True (default), this returns the Hamming distance for those
        characters that have a matching character in both strings plus the
        difference in the strings' lengths. This is equivalent to extending
        the shorter string with obligatorily non-matching characters.
        If False, an exception is raised in the case of strings of unequal
        lengths.
    :returns: the Hamming distance between src & tar
    :rtype: int

    >>> hamming('cat', 'hat')
    1
    >>> hamming('Niall', 'Neil')
    3
    >>> hamming('aluminum', 'Catalan')
    8
    >>> hamming('ATCG', 'TAGC')
    4
    """
    if not difflens and len(src) != len(tar):
        raise ValueError('Undefined for sequences of unequal length; set ' +
                         'difflens to True for Hamming distance between ' +
                         'strings of unequal lengths.')

    hdist = 0
    if difflens:
        hdist += abs(len(src)-len(tar))
    hdist += sum(c1 != c2 for c1, c2 in zip(src, tar))

    return hdist


def dist_hamming(src, tar, difflens=True):
    """Return the normalized Hamming distance between two strings.

    Hamming distance normalized to the interval [0, 1]

    The Hamming distance is normalized by dividing it
    by the greater of the number of characters in src & tar (unless difflens is
    set to False, in which case an exception is raised).

    The arguments are identical to those of the hamming() function.

    :param str src, tar: two strings to be compared
    :param bool difflens:
        If True (default), this returns the Hamming distance for those
        characters that have a matching character in both strings plus the
        difference in the strings' lengths. This is equivalent to extending
        the shorter string with obligatorily non-matching characters.
        If False, an exception is raised in the case of strings of unequal
        lengths.
    :returns: normalized Hamming distance
    :rtype: float

    >>> dist_hamming('cat', 'hat')
    0.3333333333333333
    >>> dist_hamming('Niall', 'Neil')
    0.6
    >>> dist_hamming('aluminum', 'Catalan')
    1.0
    >>> dist_hamming('ATCG', 'TAGC')
    1.0
    """
    if src == tar:
        return 0
    return hamming(src, tar, difflens) / max(len(src), len(tar))


def sim_hamming(src, tar, difflens=True):
    """Return the normalized Hamming similarity of two strings.

    Hamming similarity normalized to the interval [0, 1]

    Hamming similarity is the complement of normalized Hamming distance:
    :math:`sim_{Hamming} = 1 - dist_{Hamming}`

    Provided that difflens==True, the Hamming similarity is identical to the
    Language-Independent Product Name Search (LIPNS) similarity score. For
    further information, see the sim_mlipns documentation.

    The arguments are identical to those of the hamming() function.

    :param str src, tar: two strings to be compared
    :param bool difflens:
        If True (default), this returns the Hamming distance for those
        characters that have a matching character in both strings plus the
        difference in the strings' lengths. This is equivalent to extending
        the shorter string with obligatorily non-matching characters.
        If False, an exception is raised in the case of strings of unequal
        lengths.
    :returns: normalized Hamming similarity
    :rtype: float

    >>> sim_hamming('cat', 'hat')
    0.6666666666666667
    >>> sim_hamming('Niall', 'Neil')
    0.4
    >>> sim_hamming('aluminum', 'Catalan')
    0.0
    >>> sim_hamming('ATCG', 'TAGC')
    0.0
    """
    return 1 - dist_hamming(src, tar, difflens)


def _get_qgrams(src, tar, qval):
    """Return the Q-Grams in src & tar.

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :return: Q-Grams
    """
    if isinstance(src, Counter) and isinstance(tar, Counter):
        return src, tar
    if qval and qval > 0:
        return QGrams(src, qval), QGrams(tar, qval)
    return Counter(src.strip().split()), Counter(tar.strip().split())


def sim_tversky(src, tar, qval=2, alpha=1, beta=1, bias=None):
    r"""Return the Tversky index of two strings.

    Tversky index

    The Tversky index is defined as:
    For two sets X and Y:
    :math:`sim_{Tversky}(X, Y) = \\frac{|X \\cap Y|}
    {|X \\cap Y| + \\alpha|X - Y| + \\beta|Y - X|}`

    Cf. https://en.wikipedia.org/wiki/Tversky_index

    :math:`\\alpha = \\beta = 1` is equivalent to the Jaccard & Tanimoto
    similarity coefficients.

    :math:`\\alpha = \\beta = 0.5` is equivalent to the Sørensen-Dice
    similarity coefficient.

    Unequal α and β will tend to emphasize one or the other set's
    contributions:

        - :math:`\\alpha > \\beta` emphasizes the contributions of X over Y
        - :math:`\\alpha < \\beta` emphasizes the contributions of Y over X

    Parameter values' relation to 1 emphasizes different types of
    contributions:

        - :math:`\\alpha and \\beta > 1` emphasize unique contributions over
          the intersection
        - :math:`\\alpha and \\beta < 1` emphasize the intersection over
          unique contributions

    The symmetric variant is defined in Jiménez, Sergio, Claudio Becerra, and
    Alexander Gelbukh. 2013. SOFTCARDINALITY-CORE: Improving Text Overlap with
    Distributional Measures for Semantic Textual Similarity. This is activated
    by specifying a bias parameter.
    Cf. http://aclweb.org/anthology/S/S13/S13-1028.pdf

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param float alpha, beta: two Tversky index parameters as indicated in the
        description above
    :returns: Tversky similarity
    :rtype: float

    >>> sim_tversky('cat', 'hat')
    0.3333333333333333
    >>> sim_tversky('Niall', 'Neil')
    0.2222222222222222
    >>> sim_tversky('aluminum', 'Catalan')
    0.0625
    >>> sim_tversky('ATCG', 'TAGC')
    0.0
    """
    if alpha < 0 or beta < 0:
        raise ValueError('Unsupported weight assignment; alpha and beta ' +
                         'must be greater than or equal to 0.')

    if src == tar:
        return 1.0
    elif not src or not tar:
        return 0.0

    q_src, q_tar = _get_qgrams(src, tar, qval)
    q_src_mag = sum(q_src.values())
    q_tar_mag = sum(q_tar.values())
    q_intersection_mag = sum((q_src & q_tar).values())

    if not q_src or not q_tar:
        return 0.0

    if bias is None:
        return q_intersection_mag / (q_intersection_mag + alpha *
                                     (q_src_mag - q_intersection_mag) +
                                     beta * (q_tar_mag - q_intersection_mag))

    a_val = min(q_src_mag - q_intersection_mag,
                q_tar_mag - q_intersection_mag)
    b_val = max(q_src_mag - q_intersection_mag,
                q_tar_mag - q_intersection_mag)
    c_val = q_intersection_mag + bias
    return c_val / (beta * (alpha * a_val + (1 - alpha) * b_val) + c_val)


def dist_tversky(src, tar, qval=2, alpha=1, beta=1, bias=None):
    """Return the Tversky distance between two strings.

    Tversky distance

    Tversky distance is the complement of the Tversky index (similarity):
    :math:`dist_{Tversky} = 1-sim_{Tversky}`

    The symmetric variant is defined in Jiménez, Sergio, Claudio Becerra, and
    Alexander Gelbukh. 2013. SOFTCARDINALITY-CORE: Improving Text Overlap with
    Distributional Measures for Semantic Textual Similarity. This is activated
    by specifying a bias parameter.
    Cf. http://aclweb.org/anthology/S/S13/S13-1028.pdf

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param float alpha, beta: two Tversky index parameters as indicated in the
        description above
    :returns: Tversky distance
    :rtype: float

    >>> dist_tversky('cat', 'hat')
    0.6666666666666667
    >>> dist_tversky('Niall', 'Neil')
    0.7777777777777778
    >>> dist_tversky('aluminum', 'Catalan')
    0.9375
    >>> dist_tversky('ATCG', 'TAGC')
    1.0
    """
    return 1 - sim_tversky(src, tar, qval, alpha, beta, bias)


def sim_dice(src, tar, qval=2):
    r"""Return the Sørensen–Dice coefficient of two strings.

    Sørensen–Dice coefficient

    For two sets X and Y, the Sørensen–Dice coefficient is
    :math:`sim_{dice}(X, Y) = \\frac{2 \\cdot |X \\cap Y|}{|X| + |Y|}`

    This is identical to the Tversky index for :math:`\\alpha = \\beta = 0.5`

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: Sørensen–Dice similarity
    :rtype: float

    >>> sim_dice('cat', 'hat')
    0.5
    >>> sim_dice('Niall', 'Neil')
    0.36363636363636365
    >>> sim_dice('aluminum', 'Catalan')
    0.11764705882352941
    >>> sim_dice('ATCG', 'TAGC')
    0.0
    """
    return sim_tversky(src, tar, qval, 0.5, 0.5)


def dist_dice(src, tar, qval=2):
    """Return the Sørensen–Dice distance between two strings.

    Sørensen–Dice distance

    Sørensen–Dice distance is the complement of the Sørensen–Dice coefficient:
    :math:`dist_{dice} = 1 - sim_{dice}`

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: Sørensen–Dice distance
    :rtype: float

    >>> dist_dice('cat', 'hat')
    0.5
    >>> dist_dice('Niall', 'Neil')
    0.6363636363636364
    >>> dist_dice('aluminum', 'Catalan')
    0.8823529411764706
    >>> dist_dice('ATCG', 'TAGC')
    1.0
    """
    return 1 - sim_dice(src, tar, qval)


def sim_jaccard(src, tar, qval=2):
    r"""Return the Jaccard similarity of two strings.

    Jaccard similarity coefficient

    For two sets X and Y, the Jaccard similarity coefficient is
    :math:`sim_{jaccard}(X, Y) = \\frac{|X \\cap Y|}{|X \\cup Y|}`

    This is identical to the Tanimoto similarity coefficient
    and the Tversky index for :math:`\\alpha = \\beta = 1`

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: Jaccard similarity
    :rtype: float

    >>> sim_jaccard('cat', 'hat')
    0.3333333333333333
    >>> sim_jaccard('Niall', 'Neil')
    0.2222222222222222
    >>> sim_jaccard('aluminum', 'Catalan')
    0.0625
    >>> sim_jaccard('ATCG', 'TAGC')
    0.0
    """
    return sim_tversky(src, tar, qval, 1, 1)


def dist_jaccard(src, tar, qval=2):
    """Return the Jaccard distance between two strings.

    Jaccard distance

    Jaccard distance is the complement of the Jaccard similarity coefficient:
    :math:`dist_{Jaccard} = 1 - sim_{Jaccard}`

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: Jaccard distance
    :rtype: float

    >>> dist_jaccard('cat', 'hat')
    0.6666666666666667
    >>> dist_jaccard('Niall', 'Neil')
    0.7777777777777778
    >>> dist_jaccard('aluminum', 'Catalan')
    0.9375
    >>> dist_jaccard('ATCG', 'TAGC')
    1.0
    """
    return 1 - sim_jaccard(src, tar, qval)
def sim_overlap(src, tar, qval=2):
    r"""Return the overlap coefficient of two strings.

    Overlap coefficient

    For two sets X and Y, the overlap coefficient is
    :math:`sim_{overlap}(X, Y) = \\frac{|X \\cap Y|}{min(|X|, |Y|)}`

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: overlap similarity
    :rtype: float

    >>> sim_overlap('cat', 'hat')
    0.5
    >>> sim_overlap('Niall', 'Neil')
    0.4
    >>> sim_overlap('aluminum', 'Catalan')
    0.125
    >>> sim_overlap('ATCG', 'TAGC')
    0.0
    """
    if src == tar:
        return 1.0
    elif not src or not tar:
        return 0.0

    q_src, q_tar = _get_qgrams(src, tar, qval)
    q_src_mag = sum(q_src.values())
    q_tar_mag = sum(q_tar.values())
    q_intersection_mag = sum((q_src & q_tar).values())

    return q_intersection_mag / min(q_src_mag, q_tar_mag)


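The overlap coefficient above reduces to :math:`|X \cap Y| / min(|X|, |Y|)` over q-gram multisets. A self-contained sketch with `collections.Counter` (the helper names are hypothetical, and plain character bigrams are assumed rather than the padded q-grams the QGrams class produces, so values differ from `sim_overlap`):

```python
from collections import Counter


def _bigrams(word):
    # plain character bigrams, without the start/stop padding
    # that the module's QGrams class applies
    return Counter(word[i:i + 2] for i in range(len(word) - 1))


def overlap_sketch(src, tar):
    # |X n Y| / min(|X|, |Y|) over bigram multisets
    q_src, q_tar = _bigrams(src), _bigrams(tar)
    inter = sum((q_src & q_tar).values())
    denom = min(sum(q_src.values()), sum(q_tar.values()))
    return inter / denom if denom else 0.0
```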
def dist_overlap(src, tar, qval=2):
    """Return the overlap distance between two strings.

    Overlap distance

    Overlap distance is the complement of the overlap coefficient:
    :math:`dist_{overlap} = 1 - sim_{overlap}`

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: overlap distance
    :rtype: float

    >>> dist_overlap('cat', 'hat')
    0.5
    >>> dist_overlap('Niall', 'Neil')
    0.6
    >>> dist_overlap('aluminum', 'Catalan')
    0.875
    >>> dist_overlap('ATCG', 'TAGC')
    1.0
    """
    return 1 - sim_overlap(src, tar, qval)


def sim_tanimoto(src, tar, qval=2):
    r"""Return the Tanimoto similarity of two strings.

    Tanimoto similarity

    For two sets X and Y, the Tanimoto similarity coefficient is
    :math:`sim_{Tanimoto}(X, Y) = \\frac{|X \\cap Y|}{|X \\cup Y|}`
    This is identical to the Jaccard similarity coefficient
    and the Tversky index for :math:`\\alpha = \\beta = 1`

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: Tanimoto similarity
    :rtype: float

    >>> sim_tanimoto('cat', 'hat')
    0.3333333333333333
    >>> sim_tanimoto('Niall', 'Neil')
    0.2222222222222222
    >>> sim_tanimoto('aluminum', 'Catalan')
    0.0625
    >>> sim_tanimoto('ATCG', 'TAGC')
    0.0
    """
    return sim_jaccard(src, tar, qval)


def tanimoto(src, tar, qval=2):
    """Return the Tanimoto distance between two strings.

    Tanimoto distance

    Tanimoto distance is :math:`-log_{2}sim_{Tanimoto}`

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: Tanimoto distance
    :rtype: float

    >>> tanimoto('cat', 'hat')
    -1.5849625007211563
    >>> tanimoto('Niall', 'Neil')
    -2.1699250014423126
    >>> tanimoto('aluminum', 'Catalan')
    -4.0
    >>> tanimoto('ATCG', 'TAGC')
    -inf
    """
    coeff = sim_jaccard(src, tar, qval)
    if coeff != 0:
        return math.log(coeff, 2)

    return float('-inf')


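The same pipeline — multiset Jaccard followed by the :math:`-log_2` transform — can be sketched standalone, assuming unpadded character bigrams (function names here are hypothetical, and values differ from the module's padded q-grams):

```python
import math
from collections import Counter


def jaccard_sketch(src, tar):
    # |X n Y| / |X u Y| over unpadded character-bigram multisets
    q_src = Counter(src[i:i + 2] for i in range(len(src) - 1))
    q_tar = Counter(tar[i:i + 2] for i in range(len(tar) - 1))
    inter = sum((q_src & q_tar).values())
    union = sum((q_src | q_tar).values())
    return inter / union if union else 1.0


def tanimoto_sketch(src, tar):
    # -inf when the strings share no bigrams at all (log of 0)
    coeff = jaccard_sketch(src, tar)
    return math.log(coeff, 2) if coeff else float('-inf')
```

Identical strings give a coefficient of 1 and hence a Tanimoto distance of 0.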
def minkowski(src, tar, qval=2, pval=1, normalize=False):
    """Return the Minkowski distance (:math:`L^p`-norm) of two strings.

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param int or float pval: the :math:`p`-value of the :math:`L^p`-space
    :param bool normalize: normalizes to [0, 1] if True
    :returns: the Minkowski distance
    :rtype: float
    """
    q_src, q_tar = _get_qgrams(src, tar, qval)
    diffs = ((q_src - q_tar) + (q_tar - q_src)).values()

    normalizer = 1
    if normalize:
        totals = (q_src + q_tar).values()
        if pval == 0:
            normalizer = len(totals)
        else:
            normalizer = sum(_**pval for _ in totals)**(1 / pval)

    if pval == float('inf'):
        # Chebyshev distance
        return max(diffs)/normalizer
    if pval == 0:
        # This is the l_0 "norm" as developed by David Donoho
        return len(diffs)/normalizer
    return sum(_**pval for _ in diffs)**(1 / pval)/normalizer


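Stripped of the q-gram plumbing, the unnormalized distance is just an :math:`L^p` norm over absolute q-gram count differences. A minimal standalone sketch, assuming unpadded character bigrams (`minkowski_sketch` is a hypothetical name, not part of the module):

```python
from collections import Counter


def minkowski_sketch(src, tar, pval=1):
    # symmetric difference of bigram counts, then the L^p norm
    q_src = Counter(src[i:i + 2] for i in range(len(src) - 1))
    q_tar = Counter(tar[i:i + 2] for i in range(len(tar) - 1))
    diffs = ((q_src - q_tar) + (q_tar - q_src)).values()
    if not diffs:
        return 0.0  # identical bigram profiles
    if pval == float('inf'):
        return float(max(diffs))  # Chebyshev distance
    return sum(d ** pval for d in diffs) ** (1 / pval)
```

With `pval=1` this is the Manhattan distance over bigram counts; with `pval=2`, the Euclidean distance.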
def dist_minkowski(src, tar, qval=2, pval=1):
    """Return the Minkowski distance of two strings, normalized to [0, 1].

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param int or float pval: the :math:`p`-value of the :math:`L^p`-space
    :returns: the normalized Minkowski distance
    :rtype: float
    """
    return minkowski(src, tar, qval, pval, True)


def sim_minkowski(src, tar, qval=2, pval=1):
    """Return the Minkowski similarity of two strings, normalized to [0, 1].

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param int or float pval: the :math:`p`-value of the :math:`L^p`-space
    :returns: the normalized Minkowski similarity
    :rtype: float
    """
    return 1 - minkowski(src, tar, qval, pval, True)


def manhattan(src, tar, qval=2, normalize=False):
    """Return the Manhattan distance between two strings.

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param bool normalize: normalizes to [0, 1] if True
    :returns: the Manhattan distance
    :rtype: float
    """
    return minkowski(src, tar, qval, 1, normalize)


def dist_manhattan(src, tar, qval=2):
    """Return the Manhattan distance between two strings, normalized to [0, 1].

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: the normalized Manhattan distance
    :rtype: float
    """
    return manhattan(src, tar, qval, True)


def sim_manhattan(src, tar, qval=2):
    """Return the Manhattan similarity of two strings, normalized to [0, 1].

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: the normalized Manhattan similarity
    :rtype: float
    """
    return 1 - manhattan(src, tar, qval, True)


def euclidean(src, tar, qval=2, normalize=False):
    """Return the Euclidean distance between two strings.

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param bool normalize: normalizes to [0, 1] if True
    :returns: the Euclidean distance
    :rtype: float
    """
    return minkowski(src, tar, qval, 2, normalize)


def dist_euclidean(src, tar, qval=2):
    """Return the Euclidean distance between two strings, normalized to [0, 1].

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: the normalized Euclidean distance
    :rtype: float
    """
    return euclidean(src, tar, qval, True)


def sim_euclidean(src, tar, qval=2):
    """Return the Euclidean similarity of two strings, normalized to [0, 1].

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: the normalized Euclidean similarity
    :rtype: float
    """
    return 1 - euclidean(src, tar, qval, True)


def chebyshev(src, tar, qval=2, normalize=False):
    """Return the Chebyshev distance between two strings.

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param bool normalize: normalizes to [0, 1] if True
    :returns: the Chebyshev distance
    :rtype: float
    """
    return minkowski(src, tar, qval, float('inf'), normalize)


def dist_chebyshev(src, tar, qval=2):
    """Return the Chebyshev distance between two strings, normalized to [0, 1].

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: the normalized Chebyshev distance
    :rtype: float
    """
    return chebyshev(src, tar, qval, True)


def sim_chebyshev(src, tar, qval=2):
    """Return the Chebyshev similarity of two strings, normalized to [0, 1].

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: the normalized Chebyshev similarity
    :rtype: float
    """
    return 1 - chebyshev(src, tar, qval, True)


def sim_cosine(src, tar, qval=2):
    r"""Return the cosine similarity of two strings.

    Cosine similarity (Ochiai coefficient)

    For two sets X and Y, the cosine similarity (Ochiai coefficient) is:
    :math:`sim_{cosine}(X, Y) = \\frac{|X \\cap Y|}{\\sqrt{|X| \\cdot |Y|}}`

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: cosine similarity
    :rtype: float

    >>> sim_cosine('cat', 'hat')
    0.5
    >>> sim_cosine('Niall', 'Neil')
    0.3651483716701107
    >>> sim_cosine('aluminum', 'Catalan')
    0.11785113019775793
    >>> sim_cosine('ATCG', 'TAGC')
    0.0
    """
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0

    q_src, q_tar = _get_qgrams(src, tar, qval)
    q_src_mag = sum(q_src.values())
    q_tar_mag = sum(q_tar.values())
    q_intersection_mag = sum((q_src & q_tar).values())

    return q_intersection_mag / math.sqrt(q_src_mag * q_tar_mag)


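The Ochiai formula can likewise be sketched standalone with `collections.Counter`, again over unpadded character bigrams (a hypothetical helper, so values differ from `sim_cosine`, whose q-grams are padded):

```python
import math
from collections import Counter


def cosine_sketch(src, tar):
    # |X n Y| / sqrt(|X| * |Y|) over unpadded character-bigram multisets
    q_src = Counter(src[i:i + 2] for i in range(len(src) - 1))
    q_tar = Counter(tar[i:i + 2] for i in range(len(tar) - 1))
    inter = sum((q_src & q_tar).values())
    if not inter:
        return 0.0  # also covers empty or single-character inputs
    return inter / math.sqrt(sum(q_src.values()) * sum(q_tar.values()))
```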
def dist_cosine(src, tar, qval=2):
    """Return the cosine distance between two strings.

    Cosine distance

    Cosine distance is the complement of cosine similarity:
    :math:`dist_{cosine} = 1 - sim_{cosine}`

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: cosine distance
    :rtype: float

    >>> dist_cosine('cat', 'hat')
    0.5
    >>> dist_cosine('Niall', 'Neil')
    0.6348516283298893
    >>> dist_cosine('aluminum', 'Catalan')
    0.882148869802242
    >>> dist_cosine('ATCG', 'TAGC')
    1.0
    """
    return 1 - sim_cosine(src, tar, qval)


def sim_strcmp95(src, tar, long_strings=False):
    """Return the strcmp95 similarity of two strings.

    strcmp95 similarity

    This is a Python translation of the C code for strcmp95:
    http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c
    The above file is a US Government publication and, accordingly,
    in the public domain.

    This is based on the Jaro-Winkler distance, but also attempts to correct
    for some common typos and frequently confused characters. It is also
    limited to uppercase ASCII characters, so it is appropriate to American
    names, but not much else.

    :param str src, tar: two strings to be compared
    :param bool long_strings: set to True to "Increase the probability of a
        match when the number of matched characters is large.  This option
        allows for a little more tolerance when the strings are large. It is
        not an appropriate test when comparing fixed length fields such as
        phone and social security numbers."
    :returns: strcmp95 similarity
    :rtype: float

    >>> sim_strcmp95('cat', 'hat')
    0.7777777777777777
    >>> sim_strcmp95('Niall', 'Neil')
    0.8454999999999999
    >>> sim_strcmp95('aluminum', 'Catalan')
    0.6547619047619048
    >>> sim_strcmp95('ATCG', 'TAGC')
    0.8333333333333334
    """
    def _inrange(char):
        """Return True if char is in the range (0, 91)."""
        return ord(char) > 0 and ord(char) < 91

    ying = src.strip().upper()
    yang = tar.strip().upper()

    if ying == yang:
        return 1.0
    # If either string is blank - return - added in Version 2
    if not ying or not yang:
        return 0.0

    adjwt = defaultdict(int)
    sp_mx = (
        ('A', 'E'), ('A', 'I'), ('A', 'O'), ('A', 'U'), ('B', 'V'), ('E', 'I'),
        ('E', 'O'), ('E', 'U'), ('I', 'O'), ('I', 'U'), ('O', 'U'), ('I', 'Y'),
        ('E', 'Y'), ('C', 'G'), ('E', 'F'), ('W', 'U'), ('W', 'V'), ('X', 'K'),
        ('S', 'Z'), ('X', 'S'), ('Q', 'C'), ('U', 'V'), ('M', 'N'), ('L', 'I'),
        ('Q', 'O'), ('P', 'R'), ('I', 'J'), ('2', 'Z'), ('5', 'S'), ('8', 'B'),
        ('1', 'I'), ('1', 'L'), ('0', 'O'), ('0', 'Q'), ('C', 'K'), ('G', 'J')
    )

    # Initialize the adjwt array on the first call to the function only.
    # The adjwt array is used to give partial credit for characters that
    # may be errors due to known phonetic or character recognition errors.
    # A typical example is to match the letter "O" with the number "0"
    for i in sp_mx:
        adjwt[(i[0], i[1])] = 3
        adjwt[(i[1], i[0])] = 3

    if len(ying) > len(yang):
        search_range = len(ying)
        minv = len(yang)
    else:
        search_range = len(yang)
        minv = len(ying)

    # Blank out the flags
    ying_flag = [0] * search_range
    yang_flag = [0] * search_range
    search_range = max(0, search_range // 2 - 1)

    # Looking only within the search range, count and flag the matched pairs.
    num_com = 0
    yl1 = len(yang) - 1
    for i in range(len(ying)):
        lowlim = (i - search_range) if (i >= search_range) else 0
        hilim = (i + search_range) if ((i + search_range) <= yl1) else yl1
        for j in range(lowlim, hilim+1):
            if (yang_flag[j] == 0) and (yang[j] == ying[i]):
                yang_flag[j] = 1
                ying_flag[i] = 1
                num_com += 1
                break

    # If no characters in common - return
    if num_com == 0:
        return 0.0

    # Count the number of transpositions
    k = n_trans = 0
    for i in range(len(ying)):
        if ying_flag[i] != 0:
            for j in range(k, len(yang)):
                if yang_flag[j] != 0:
                    k = j + 1
                    break
            if ying[i] != yang[j]:
                n_trans += 1
    n_trans = n_trans // 2

    # Adjust for similarities in unmatched characters
    n_simi = 0
    if minv > num_com:
        for i in range(len(ying)):
            if ying_flag[i] == 0 and _inrange(ying[i]):
                for j in range(len(yang)):
                    if yang_flag[j] == 0 and _inrange(yang[j]):
                        if (ying[i], yang[j]) in adjwt:
                            n_simi += adjwt[(ying[i], yang[j])]
                            yang_flag[j] = 2
                            break
    num_sim = n_simi/10.0 + num_com

    # Main weight computation
    weight = num_sim / len(ying) + num_sim / len(yang) + \
        (num_com - n_trans) / num_com
    weight = weight / 3.0

    # Continue to boost the weight if the strings are similar
    if weight > 0.7:

        # Adjust for having up to the first 4 characters in common
        j = 4 if (minv >= 4) else minv
        i = 0
        while (i < j) and (ying[i] == yang[i]) and (not ying[i].isdigit()):
            i += 1
        if i:
            weight += i * 0.1 * (1.0 - weight)

        # Optionally adjust for long strings.

        # After agreeing beginning chars, at least two more must agree and
        # the agreeing characters must be > .5 of remaining characters.
        if (((long_strings) and (minv > 4) and (num_com > i+1) and
             (2*num_com >= minv+i))):
            if not ying[0].isdigit():
                weight += (1.0-weight) * ((num_com-i-1) /
                                          (len(ying)+len(yang)-i*2+2))

    return weight


def dist_strcmp95(src, tar, long_strings=False):
    """Return the strcmp95 distance between two strings.

    strcmp95 distance

    strcmp95 distance is 1 - strcmp95 similarity

    :param str src, tar: two strings to be compared
    :param bool long_strings: set to True to "Increase the probability of a
        match when the number of matched characters is large.  This option
        allows for a little more tolerance when the strings are large. It is
        not an appropriate test when comparing fixed length fields such as
        phone and social security numbers."
    :returns: strcmp95 distance
    :rtype: float

    >>> dist_strcmp95('cat', 'hat')
    0.22222222222222232
    >>> dist_strcmp95('Niall', 'Neil')
    0.15450000000000008
    >>> dist_strcmp95('aluminum', 'Catalan')
    0.34523809523809523
    >>> dist_strcmp95('ATCG', 'TAGC')
    0.16666666666666663
    """
    return 1 - sim_strcmp95(src, tar, long_strings)


def sim_jaro_winkler(src, tar, qval=1, mode='winkler', long_strings=False,
                     boost_threshold=0.7, scaling_factor=0.1):
    """Return the Jaro or Jaro-Winkler similarity of two strings.

    Jaro(-Winkler) distance

    This Python version is based on the C code for strcmp95:
    http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c
    The above file is a US Government publication and, accordingly,
    in the public domain.

    :param str src, tar: two strings to be compared
    :param int qval: the length of each q-gram (defaults to 1: character-wise
        matching)
    :param str mode: indicates which variant of this distance metric to
        compute:

            - 'winkler' -- computes the Jaro-Winkler distance (default) which
              increases the score for matches near the start of the word
            - 'jaro' -- computes the Jaro distance

    The following arguments apply only when mode is 'winkler':

    :param bool long_strings: set to True to "Increase the probability of a
        match when the number of matched characters is large.  This option
        allows for a little more tolerance when the strings are large.  It is
        not an appropriate test when comparing fixed length fields such as
        phone and social security numbers."
    :param float boost_threshold: a value between 0 and 1, below which the
        Winkler boost is not applied (defaults to 0.7)
    :param float scaling_factor: a value between 0 and 0.25, indicating by how
        much to boost scores for matching prefixes (defaults to 0.1)

    :returns: Jaro or Jaro-Winkler similarity
    :rtype: float

    >>> sim_jaro_winkler('cat', 'hat')
    0.7777777777777777
    >>> sim_jaro_winkler('Niall', 'Neil')
    0.8049999999999999
    >>> sim_jaro_winkler('aluminum', 'Catalan')
    0.6011904761904762
    >>> sim_jaro_winkler('ATCG', 'TAGC')
    0.8333333333333334

    >>> sim_jaro_winkler('cat', 'hat', mode='jaro')
    0.7777777777777777
    >>> sim_jaro_winkler('Niall', 'Neil', mode='jaro')
    0.7833333333333333
    >>> sim_jaro_winkler('aluminum', 'Catalan', mode='jaro')
    0.6011904761904762
    >>> sim_jaro_winkler('ATCG', 'TAGC', mode='jaro')
    0.8333333333333334
    """
    if mode == 'winkler':
        if boost_threshold > 1 or boost_threshold < 0:
            raise ValueError('Unsupported boost_threshold assignment; ' +
                             'boost_threshold must be between 0 and 1.')
        if scaling_factor > 0.25 or scaling_factor < 0:
            raise ValueError('Unsupported scaling_factor assignment; ' +
                             'scaling_factor must be between 0 and 0.25.')

    if src == tar:
        return 1.0

    src = QGrams(src.strip(), qval).ordered_list
    tar = QGrams(tar.strip(), qval).ordered_list

    lens = len(src)
    lent = len(tar)

    # If either string is blank - return - added in Version 2
    if lens == 0 or lent == 0:
        return 0.0

    if lens > lent:
        search_range = lens
        minv = lent
    else:
        search_range = lent
        minv = lens

    # Zero out the flags
    src_flag = [0] * search_range
    tar_flag = [0] * search_range
    search_range = max(0, search_range//2 - 1)

    # Looking only within the search range, count and flag the matched pairs.
    num_com = 0
    yl1 = lent - 1
    for i in range(lens):
        lowlim = (i - search_range) if (i >= search_range) else 0
        hilim = (i + search_range) if ((i + search_range) <= yl1) else yl1
        for j in range(lowlim, hilim+1):
            if (tar_flag[j] == 0) and (tar[j] == src[i]):
                tar_flag[j] = 1
                src_flag[i] = 1
                num_com += 1
                break

    # If no characters in common - return
    if num_com == 0:
        return 0.0

    # Count the number of transpositions
    k = n_trans = 0
    for i in range(lens):
        if src_flag[i] != 0:
            for j in range(k, lent):
                if tar_flag[j] != 0:
                    k = j + 1
                    break
            if src[i] != tar[j]:
                n_trans += 1
    n_trans = n_trans // 2

    # Main weight computation for Jaro distance
    weight = num_com / lens + num_com / lent + (num_com - n_trans) / num_com
    weight = weight / 3.0

    # Continue to boost the weight if the strings are similar
    # This is the Winkler portion of Jaro-Winkler distance
    if mode == 'winkler' and weight > boost_threshold:

        # Adjust for having up to the first 4 characters in common
        j = 4 if (minv >= 4) else minv
        i = 0
        while (i < j) and (src[i] == tar[i]):
            i += 1
        if i:
            weight += i * scaling_factor * (1.0 - weight)

        # Optionally adjust for long strings.

        # After agreeing beginning chars, at least two more must agree and
        # the agreeing characters must be > .5 of remaining characters.
        if (((long_strings) and (minv > 4) and (num_com > i+1) and
             (2*num_com >= minv+i))):
            weight += (1.0-weight) * ((num_com-i-1) / (lens+lent-i*2+2))

    return weight


def dist_jaro_winkler(src, tar, qval=1, mode='winkler', long_strings=False,
                      boost_threshold=0.7, scaling_factor=0.1):
    """Return the Jaro or Jaro-Winkler distance between two strings.

    Jaro(-Winkler) distance

    Jaro-Winkler distance is 1 - the Jaro-Winkler similarity

    :param str src, tar: two strings to be compared
    :param int qval: the length of each q-gram (defaults to 1: character-wise
        matching)
    :param str mode: indicates which variant of this distance metric to
        compute:

            - 'winkler' -- computes the Jaro-Winkler distance (default) which
              increases the score for matches near the start of the word
            - 'jaro' -- computes the Jaro distance

    The following arguments apply only when mode is 'winkler':

    :param bool long_strings: set to True to "Increase the probability of a
        match when the number of matched characters is large.  This option
        allows for a little more tolerance when the strings are large.  It is
        not an appropriate test when comparing fixed length fields such as
        phone and social security numbers."
    :param float boost_threshold: a value between 0 and 1, below which the
        Winkler boost is not applied (defaults to 0.7)
    :param float scaling_factor: a value between 0 and 0.25, indicating by how
        much to boost scores for matching prefixes (defaults to 0.1)

    :returns: Jaro or Jaro-Winkler distance
    :rtype: float

    >>> dist_jaro_winkler('cat', 'hat')
    0.22222222222222232
    >>> dist_jaro_winkler('Niall', 'Neil')
    0.19500000000000006
    >>> dist_jaro_winkler('aluminum', 'Catalan')
    0.39880952380952384
    >>> dist_jaro_winkler('ATCG', 'TAGC')
    0.16666666666666663

    >>> dist_jaro_winkler('cat', 'hat', mode='jaro')
    0.22222222222222232
    >>> dist_jaro_winkler('Niall', 'Neil', mode='jaro')
    0.21666666666666667
    >>> dist_jaro_winkler('aluminum', 'Catalan', mode='jaro')
    0.39880952380952384
    >>> dist_jaro_winkler('ATCG', 'TAGC', mode='jaro')
    0.16666666666666663
    """
    return 1 - sim_jaro_winkler(src, tar, qval, mode, long_strings,
                                boost_threshold, scaling_factor)


def lcsseq(src, tar):
    """Return the longest common subsequence of two strings.

    Longest common subsequence (LCSseq)

    Based on the dynamic programming algorithm from
    http://rosettacode.org/wiki/Longest_common_subsequence#Dynamic_Programming_6
    This is licensed GFDL 1.2

    Modifications include:
        conversion to a numpy array in place of a list of lists

    :param str src, tar: two strings to be compared
    :returns: the longest common subsequence
    :rtype: str

    >>> lcsseq('cat', 'hat')
    'at'
    >>> lcsseq('Niall', 'Neil')
    'Nil'
    >>> lcsseq('aluminum', 'Catalan')
    'aln'
    >>> lcsseq('ATCG', 'TAGC')
    'AC'
    """
    # pylint: disable=no-member
    lengths = np.zeros((len(src)+1, len(tar)+1), dtype=np.int_)
    # pylint: enable=no-member

    # row 0 and column 0 are initialized to 0 already
    for i, src_char in enumerate(src):
        for j, tar_char in enumerate(tar):
            if src_char == tar_char:
                lengths[i+1, j+1] = lengths[i, j] + 1
            else:
                lengths[i+1, j+1] = max(lengths[i+1, j], lengths[i, j+1])

    # read the substring out from the matrix
    result = ''
    i, j = len(src), len(tar)
    while i != 0 and j != 0:
        if lengths[i, j] == lengths[i-1, j]:
            i -= 1
        elif lengths[i, j] == lengths[i, j-1]:
            j -= 1
        else:
            result = src[i-1] + result
            i -= 1
            j -= 1
    return result


def sim_lcsseq(src, tar):
    r"""Return the longest common subsequence similarity of two strings.

    Longest common subsequence similarity (:math:`sim_{LCSseq}`)

    This employs the LCSseq function to derive a similarity metric:
    :math:`sim_{LCSseq}(s,t) = \\frac{|LCSseq(s,t)|}{max(|s|, |t|)}`

    :param str src, tar: two strings to be compared
    :returns: LCSseq similarity
    :rtype: float

    >>> sim_lcsseq('cat', 'hat')
    0.6666666666666666
    >>> sim_lcsseq('Niall', 'Neil')
    0.6
    >>> sim_lcsseq('aluminum', 'Catalan')
    0.375
    >>> sim_lcsseq('ATCG', 'TAGC')
    0.5
    """
    if src == tar:
        return 1.0
    elif not src or not tar:
        return 0.0
    return len(lcsseq(src, tar)) / max(len(src), len(tar))


def dist_lcsseq(src, tar):
    """Return the longest common subsequence distance between two strings.

    Longest common subsequence distance (:math:`dist_{LCSseq}`)

    This employs the LCSseq function to derive a distance metric:
    :math:`dist_{LCSseq}(s,t) = 1 - sim_{LCSseq}(s,t)`

    :param str src, tar: two strings to be compared
    :returns: LCSseq distance
    :rtype: float

    >>> dist_lcsseq('cat', 'hat')
    0.33333333333333337
    >>> dist_lcsseq('Niall', 'Neil')
    0.4
    >>> dist_lcsseq('aluminum', 'Catalan')
    0.625
    >>> dist_lcsseq('ATCG', 'TAGC')
    0.5
    """
    return 1 - sim_lcsseq(src, tar)


def lcsstr(src, tar):
    """Return the longest common substring of two strings.

    Longest common substring (LCSstr)

    Based on the code from
    https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring#Python
    This is licensed Creative Commons: Attribution-ShareAlike 3.0

    Modifications include:

        - conversion to a numpy array in place of a list of lists
        - conversion to Python 2/3-safe range from xrange via six

    :param str src, tar: two strings to be compared
    :returns: the longest common substring
    :rtype: str

    >>> lcsstr('cat', 'hat')
    'at'
    >>> lcsstr('Niall', 'Neil')
    'N'
    >>> lcsstr('aluminum', 'Catalan')
    'al'
    >>> lcsstr('ATCG', 'TAGC')
    'A'
    """
    # pylint: disable=no-member
    lengths = np.zeros((len(src)+1, len(tar)+1), dtype=np.int_)
    # pylint: enable=no-member
    longest, i_longest = 0, 0
    for i in range(1, len(src)+1):
        for j in range(1, len(tar)+1):
            if src[i-1] == tar[j-1]:
                lengths[i, j] = lengths[i-1, j-1] + 1
                if lengths[i, j] > longest:
                    longest = lengths[i, j]
                    i_longest = i
            else:
                lengths[i, j] = 0
    return src[i_longest - longest:i_longest]


def sim_lcsstr(src, tar):
    r"""Return the longest common substring similarity of two strings.

    Longest common substring similarity (:math:`sim_{LCSstr}`)

    This employs the LCS function to derive a similarity metric:
    :math:`sim_{LCSstr}(s,t) = \\frac{|LCSstr(s,t)|}{max(|s|, |t|)}`

    :param str src, tar: two strings to be compared
    :returns: LCSstr similarity
    :rtype: float

    >>> sim_lcsstr('cat', 'hat')
    0.6666666666666666
    >>> sim_lcsstr('Niall', 'Neil')
    0.2
    >>> sim_lcsstr('aluminum', 'Catalan')
    0.25
    >>> sim_lcsstr('ATCG', 'TAGC')
    0.25
    """
    if src == tar:
        return 1.0
    elif not src or not tar:
        return 0.0
    return len(lcsstr(src, tar)) / max(len(src), len(tar))


def dist_lcsstr(src, tar):
1692
    """Return the longest common substring distance between two strings.
1693
1694
    Longest common substring distance (:math:`dist_{LCSstr}`)
1695
1696
    This employs the LCS function to derive a similarity metric:
1697
    :math:`dist_{LCSstr}(s,t) = 1 - sim_{LCSstr}(s,t)`
1698
1699
    :param str src, tar: two strings to be compared
1700
    :returns: LCSstr distance
1701
    :rtype: float
1702
1703
    >>> dist_lcsstr('cat', 'hat')
1704
    0.33333333333333337
1705
    >>> dist_lcsstr('Niall', 'Neil')
1706
    0.8
1707
    >>> dist_lcsstr('aluminum', 'Catalan')
1708
    0.75
1709
    >>> dist_lcsstr('ATCG', 'TAGC')
1710
    0.75
1711
    """
1712
    return 1 - sim_lcsstr(src, tar)
1713
1714
def sim_ratcliff_obershelp(src, tar):
    """Return the Ratcliff-Obershelp similarity of two strings.

    Ratcliff-Obershelp similarity

    This follows the Ratcliff-Obershelp algorithm to derive a similarity
    measure:

        1. Find the length of the longest common substring in src & tar.
        2. Recurse on the strings to the left & right of this substring
           in src & tar. The base case is a 0 length common substring, in which
           case, return 0. Otherwise, return the sum of the current longest
           common substring and the left & right recursed sums.
        3. Multiply this length by 2 and divide by the sum of the lengths of
           src & tar.

    Cf.
    http://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970

    :param str src, tar: two strings to be compared
    :returns: Ratcliff-Obershelp similarity
    :rtype: float

    >>> sim_ratcliff_obershelp('cat', 'hat')
    0.66666666666666663
    >>> sim_ratcliff_obershelp('Niall', 'Neil')
    0.66666666666666663
    >>> sim_ratcliff_obershelp('aluminum', 'Catalan')
    0.40000000000000002
    >>> sim_ratcliff_obershelp('ATCG', 'TAGC')
    0.5
    """
    def _lcsstr_stl(src, tar):
        """Return start positions & length for Ratcliff-Obershelp.

        Return the start position in the source string, start position in
        the target string, and length of the longest common substring of
        strings src and tar.
        """
        # pylint: disable=no-member
        lengths = np.zeros((len(src)+1, len(tar)+1), dtype=np.int)
        # pylint: enable=no-member
        longest, src_longest, tar_longest = 0, 0, 0
        for i in range(1, len(src)+1):
            for j in range(1, len(tar)+1):
                if src[i-1] == tar[j-1]:
                    lengths[i, j] = lengths[i-1, j-1] + 1
                    if lengths[i, j] > longest:
                        longest = lengths[i, j]
                        src_longest = i
                        tar_longest = j
                else:
                    lengths[i, j] = 0
        return (src_longest-longest, tar_longest-longest, longest)

    def _sstr_matches(src, tar):
        """Return the sum of substring match lengths.

        This follows the Ratcliff-Obershelp algorithm:
             1. Find the length of the longest common substring in src & tar.
             2. Recurse on the strings to the left & right of this
                 substring in src & tar.
             3. Base case is a 0 length common substring, in which case,
                 return 0.
             4. Return the sum.
        """
        src_start, tar_start, length = _lcsstr_stl(src, tar)
        if length == 0:
            return 0
        return (_sstr_matches(src[:src_start], tar[:tar_start]) +
                length +
                _sstr_matches(src[src_start+length:], tar[tar_start+length:]))

    if src == tar:
        return 1.0
    elif not src or not tar:
        return 0.0
    return 2*_sstr_matches(src, tar)/(len(src)+len(tar))


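The standard library's difflib.SequenceMatcher implements essentially this gestalt (Ratcliff-Obershelp) matching, with extra junk and popularity heuristics for long inputs, so it offers a quick cross-check on short strings:

```python
from difflib import SequenceMatcher


def ro_ratio(src, tar):
    # ratio() is 2*M / T, where M is the total number of matched
    # characters found by recursive longest-match search and T is the
    # combined length of both strings
    return SequenceMatcher(None, src, tar).ratio()
```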
def dist_ratcliff_obershelp(src, tar):
    """Return the Ratcliff-Obershelp distance between two strings.

    Ratcliff-Obershelp distance

    Ratcliff-Obershelp distance is the complement of Ratcliff-Obershelp
    similarity:
    :math:`dist_{Ratcliff-Obershelp} = 1 - sim_{Ratcliff-Obershelp}`

    :param str src, tar: two strings to be compared
    :returns: Ratcliff-Obershelp distance
    :rtype: float

    >>> dist_ratcliff_obershelp('cat', 'hat')
    0.33333333333333337
    >>> dist_ratcliff_obershelp('Niall', 'Neil')
    0.33333333333333337
    >>> dist_ratcliff_obershelp('aluminum', 'Catalan')
    0.59999999999999998
    >>> dist_ratcliff_obershelp('ATCG', 'TAGC')
    0.5
    """
    return 1 - sim_ratcliff_obershelp(src, tar)


def mra_compare(src, tar):
    """Return the MRA comparison rating of two strings.

    Western Airlines Surname Match Rating Algorithm comparison rating

    A description of the algorithm can be found on page 18 of
    https://archive.org/details/accessingindivid00moor

    :param str src, tar: two strings to be compared
    :returns: MRA comparison rating
    :rtype: int

    >>> mra_compare('cat', 'hat')
    5
    >>> mra_compare('Niall', 'Neil')
    6
    >>> mra_compare('aluminum', 'Catalan')
    0
    >>> mra_compare('ATCG', 'TAGC')
    5
    """
    if src == tar:
        return 6
    if src == '' or tar == '':
        return 0
    src = list(mra(src))
    tar = list(mra(tar))

    if abs(len(src)-len(tar)) > 2:
        return 0

    length_sum = len(src) + len(tar)
    if length_sum < 5:
        min_rating = 5
    elif length_sum < 8:
        min_rating = 4
    elif length_sum < 12:
        min_rating = 3
    else:
        min_rating = 2

    for _ in range(2):
        new_src = []
        new_tar = []
        minlen = min(len(src), len(tar))
        for i in range(minlen):
            if src[i] != tar[i]:
                new_src.append(src[i])
                new_tar.append(tar[i])
        src = new_src+src[minlen:]
        tar = new_tar+tar[minlen:]
        src.reverse()
        tar.reverse()

    similarity = 6 - max(len(src), len(tar))

    if similarity >= min_rating:
        return similarity
    return 0


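The minimum-rating thresholds used above depend only on the combined length of the two MRA-encoded strings; as a standalone sketch of that lookup table (illustrative helper, not part of the module):

```python
def mra_min_rating(length_sum):
    # mirrors the threshold table in mra_compare: shorter combined
    # encodings must clear a higher minimum rating to count as a match
    if length_sum < 5:
        return 5
    elif length_sum < 8:
        return 4
    elif length_sum < 12:
        return 3
    return 2
```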
def sim_mra(src, tar):
    """Return the normalized MRA similarity of two strings.

    Normalized Match Rating Algorithm similarity

    This is the MRA normalized to :math:`[0, 1]`, given that MRA itself is
    constrained to the range :math:`[0, 6]`.

    :param str src, tar: two strings to be compared
    :returns: normalized MRA similarity
    :rtype: float

    >>> sim_mra('cat', 'hat')
    0.8333333333333334
    >>> sim_mra('Niall', 'Neil')
    1.0
    >>> sim_mra('aluminum', 'Catalan')
    0.0
    >>> sim_mra('ATCG', 'TAGC')
    0.8333333333333334
    """
    return mra_compare(src, tar)/6


def dist_mra(src, tar):
    """Return the normalized MRA distance between two strings.

    Normalized Match Rating Algorithm distance

    MRA distance is the complement of MRA similarity:
    :math:`dist_{MRA} = 1 - sim_{MRA}`

    :param str src, tar: two strings to be compared
    :returns: normalized MRA distance
    :rtype: float

    >>> dist_mra('cat', 'hat')
    0.16666666666666663
    >>> dist_mra('Niall', 'Neil')
    0.0
    >>> dist_mra('aluminum', 'Catalan')
    1.0
    >>> dist_mra('ATCG', 'TAGC')
    0.16666666666666663
    """
    return 1 - sim_mra(src, tar)


def dist_compression(src, tar, compressor='bz2', probs=None):
    """Return the normalized compression distance between two strings.

    Normalized compression distance (NCD)

    Cf.
    https://en.wikipedia.org/wiki/Normalized_compression_distance#Normalized_compression_distance

    :param str src, tar: two strings to be compared
    :param str compressor: a compression scheme to use for the similarity
        calculation, from the following:

            - `zlib` -- standard zlib/gzip
            - `bz2` -- bzip2 (default)
            - `lzma` -- Lempel–Ziv–Markov chain algorithm
            - `arith` -- arithmetic coding
            - `rle` -- run-length encoding
            - `bwtrle` -- Burrows-Wheeler transform followed by run-length
              encoding

    :param dict probs: a dictionary trained with ac_train (for the arith
        compressor only)
    :returns: compression distance
    :rtype: float

    >>> dist_compression('cat', 'hat')
    0.08
    >>> dist_compression('Niall', 'Neil')
    0.037037037037037035
    >>> dist_compression('aluminum', 'Catalan')
    0.20689655172413793
    >>> dist_compression('ATCG', 'TAGC')
    0.037037037037037035

    >>> dist_compression('Niall', 'Neil', compressor='zlib')
    0.45454545454545453
    >>> dist_compression('Niall', 'Neil', compressor='bz2')
    0.037037037037037035
    >>> dist_compression('Niall', 'Neil', compressor='lzma')
    0.16
    >>> dist_compression('Niall', 'Neil', compressor='arith')
    0.6875
    >>> dist_compression('Niall', 'Neil', compressor='rle')
    1.0
    >>> dist_compression('Niall', 'Neil', compressor='bwtrle')
    0.8333333333333334
    """
    if src == tar:
        return 0.0

    if compressor not in {'arith', 'rle', 'bwtrle'}:
        src = src.encode('utf-8')
        tar = tar.encode('utf-8')

    if compressor == 'bz2':
        src_comp = codecs.encode(src, 'bz2_codec')[15:]
        tar_comp = codecs.encode(tar, 'bz2_codec')[15:]
        concat_comp = codecs.encode(src+tar, 'bz2_codec')[15:]
        concat_comp2 = codecs.encode(tar+src, 'bz2_codec')[15:]
    elif compressor == 'lzma':
        if 'lzma' in sys.modules:
            src_comp = lzma.compress(src)[14:]
            tar_comp = lzma.compress(tar)[14:]
            concat_comp = lzma.compress(src+tar)[14:]
            concat_comp2 = lzma.compress(tar+src)[14:]
        else:  # pragma: no cover
            raise ValueError('Install the PylibLZMA module in order to use ' +
                             'lzma compression similarity')
    elif compressor == 'arith':
        if probs is None:
            # lacking a reasonable dictionary, train on the strings themselves
            probs = ac_train(src+tar)
        src_comp = ac_encode(src, probs)[1]
        tar_comp = ac_encode(tar, probs)[1]
        concat_comp = ac_encode(src+tar, probs)[1]
        concat_comp2 = ac_encode(tar+src, probs)[1]
        return ((min(concat_comp, concat_comp2) - min(src_comp, tar_comp)) /
                max(src_comp, tar_comp))
    elif compressor in {'rle', 'bwtrle'}:
        src_comp = rle_encode(src, (compressor == 'bwtrle'))
        tar_comp = rle_encode(tar, (compressor == 'bwtrle'))
        concat_comp = rle_encode(src+tar, (compressor == 'bwtrle'))
        concat_comp2 = rle_encode(tar+src, (compressor == 'bwtrle'))
    else:  # zlib
        src_comp = codecs.encode(src, 'zlib_codec')[2:]
        tar_comp = codecs.encode(tar, 'zlib_codec')[2:]
        concat_comp = codecs.encode(src+tar, 'zlib_codec')[2:]
        concat_comp2 = codecs.encode(tar+src, 'zlib_codec')[2:]
    return ((min(len(concat_comp), len(concat_comp2)) -
             min(len(src_comp), len(tar_comp))) /
            max(len(src_comp), len(tar_comp)))


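As a self-contained illustration of the NCD formula for the zlib case, using zlib.compress directly instead of the codecs interface (header trimming omitted here, so exact values differ slightly from the doctests above):

```python
import zlib


def ncd_zlib(src, tar):
    # NCD: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), taking the
    # smaller of the two concatenation orders as C(xy)
    x = zlib.compress(src.encode('utf-8'))
    y = zlib.compress(tar.encode('utf-8'))
    xy = zlib.compress((src + tar).encode('utf-8'))
    yx = zlib.compress((tar + src).encode('utf-8'))
    return ((min(len(xy), len(yx)) - min(len(x), len(y))) /
            max(len(x), len(y)))
```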
def sim_compression(src, tar, compressor='bz2', probs=None):
    """Return the normalized compression similarity of two strings.

    Normalized compression similarity (NCS)

    Normalized compression similarity is the complement of normalized
    compression distance:
    :math:`sim_{NCS} = 1 - dist_{NCD}`

    :param str src, tar: two strings to be compared
    :param str compressor: a compression scheme to use for the similarity
        calculation:

            - `zlib` -- standard zlib/gzip
            - `bz2` -- bzip2 (default)
            - `lzma` -- Lempel–Ziv–Markov chain algorithm
            - `arith` -- arithmetic coding
            - `rle` -- run-length encoding
            - `bwtrle` -- Burrows-Wheeler transform followed by run-length
              encoding

    :param dict probs: a dictionary trained with ac_train (for the arith
        compressor only)
    :returns: compression similarity
    :rtype: float

    >>> sim_compression('cat', 'hat')
    0.92
    >>> sim_compression('Niall', 'Neil')
    0.962962962962963
    >>> sim_compression('aluminum', 'Catalan')
    0.7931034482758621
    >>> sim_compression('ATCG', 'TAGC')
    0.962962962962963

    >>> sim_compression('Niall', 'Neil', compressor='zlib')
    0.5454545454545454
    >>> sim_compression('Niall', 'Neil', compressor='bz2')
    0.962962962962963
    >>> sim_compression('Niall', 'Neil', compressor='lzma')
    0.84
    >>> sim_compression('Niall', 'Neil', compressor='arith')
    0.3125
    >>> sim_compression('Niall', 'Neil', compressor='rle')
    0.0
    >>> sim_compression('Niall', 'Neil', compressor='bwtrle')
    0.16666666666666663
    """
    return 1 - dist_compression(src, tar, compressor, probs)


def sim_monge_elkan(src, tar, sim_func=sim_levenshtein, symmetric=False):
    """Return the Monge-Elkan similarity of two strings.

    Monge-Elkan similarity

    Monge-Elkan is defined in:
    Monge, Alvaro E. and Charles P. Elkan. 1996. "The field matching problem:
    Algorithms and applications." KDD-9 Proceedings.
    http://www.aaai.org/Papers/KDD/1996/KDD96-044.pdf

    Note: Monge-Elkan is NOT a symmetric similarity algorithm. Thus, the
    similarity of src to tar is not necessarily equal to the similarity of
    tar to src. If the symmetric argument is True, a symmetric value is
    calculated, at the cost of doubling the computation time (since
    :math:`sim_{Monge-Elkan}(src, tar)` and
    :math:`sim_{Monge-Elkan}(tar, src)` are both calculated and then averaged).

    :param str src, tar: two strings to be compared
    :param function sim_func: the internal similarity metric to employ
    :param bool symmetric: return a symmetric similarity measure
    :returns: Monge-Elkan similarity
    :rtype: float

    >>> sim_monge_elkan('cat', 'hat')
    0.75
    >>> sim_monge_elkan('Niall', 'Neil')
    0.66666666666666663
    >>> sim_monge_elkan('aluminum', 'Catalan')
    0.3888888888888889
    >>> sim_monge_elkan('ATCG', 'TAGC')
    0.5
    """
    if src == tar:
        return 1.0

    q_src = sorted(QGrams(src).elements())
    q_tar = sorted(QGrams(tar).elements())

    if not q_src or not q_tar:
        return 0.0

    sum_of_maxes = 0
    for q_s in q_src:
        max_sim = float('-inf')
        for q_t in q_tar:
            max_sim = max(max_sim, sim_func(q_s, q_t))
        sum_of_maxes += max_sim
    sim_em = sum_of_maxes / len(q_src)

    if symmetric:
        sim_em = (sim_em + sim_monge_elkan(tar, src, sim_func, False))/2

    return sim_em


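The core of Monge-Elkan — for each element of the source, take the best match in the target, then average — can be sketched generically (here over caller-supplied token lists, with difflib's ratio standing in for the inner metric; the function above instead iterates over q-grams with sim_levenshtein):

```python
from difflib import SequenceMatcher


def monge_elkan_tokens(src_tokens, tar_tokens,
                       inner=lambda a, b: SequenceMatcher(None, a, b).ratio()):
    # average, over source tokens, of the best inner-similarity match
    # found among the target tokens (asymmetric by construction)
    return (sum(max(inner(s, t) for t in tar_tokens) for s in src_tokens) /
            len(src_tokens))
```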
def dist_monge_elkan(src, tar, sim_func=sim_levenshtein, symmetric=False):
    """Return the Monge-Elkan distance between two strings.

    Monge-Elkan distance

    Monge-Elkan is defined in:
    Monge, Alvaro E. and Charles P. Elkan. 1996. "The field matching problem:
    Algorithms and applications." KDD-9 Proceedings.
    http://www.aaai.org/Papers/KDD/1996/KDD96-044.pdf

    Note: Monge-Elkan is NOT a symmetric similarity algorithm. Thus, the
    distance between src and tar is not necessarily equal to the distance
    between tar and src. If the symmetric argument is True, a symmetric value
    is calculated, at the cost of doubling the computation time (since
    :math:`sim_{Monge-Elkan}(src, tar)` and :math:`sim_{Monge-Elkan}(tar, src)`
    are both calculated and then averaged).

    :param str src, tar: two strings to be compared
    :param function sim_func: the internal similarity metric to employ
    :param bool symmetric: return a symmetric similarity measure
    :returns: Monge-Elkan distance
    :rtype: float

    >>> dist_monge_elkan('cat', 'hat')
    0.25
    >>> dist_monge_elkan('Niall', 'Neil')
    0.33333333333333337
    >>> dist_monge_elkan('aluminum', 'Catalan')
    0.61111111111111116
    >>> dist_monge_elkan('ATCG', 'TAGC')
    0.5
    """
    return 1 - sim_monge_elkan(src, tar, sim_func, symmetric)


def sim_ident(src, tar):
    """Return the identity similarity of two strings.

    Identity similarity

    This is 1 if the two strings are identical, otherwise 0.

    :param str src, tar: two strings to be compared
    :returns: identity similarity
    :rtype: int

    >>> sim_ident('cat', 'hat')
    0
    >>> sim_ident('cat', 'cat')
    1
    """
    return int(src == tar)


def dist_ident(src, tar):
    """Return the identity distance between two strings.

    Identity distance

    This is 0 if the two strings are identical, otherwise 1, i.e.
    :math:`dist_{identity} = 1 - sim_{identity}`

    :param str src, tar: two strings to be compared
    :returns: identity distance
    :rtype: int

    >>> dist_ident('cat', 'hat')
    1
    >>> dist_ident('cat', 'cat')
    0
    """
    return 1 - sim_ident(src, tar)


def sim_matrix(src, tar, mat=None, mismatch_cost=0, match_cost=1,
               symmetric=True, alphabet=None):
    """Return the matrix similarity of two strings.

    Matrix similarity

    With the default parameters, this is identical to sim_ident.
    It is possible for sim_matrix to return values outside of the range
    :math:`[0, 1]`, if values outside that range are present in mat,
    mismatch_cost, or match_cost.

    :param str src, tar: two strings to be compared
    :param dict mat: a dict mapping tuples to costs; the tuples are (src, tar)
        pairs of symbols from the alphabet parameter
    :param float mismatch_cost: the value returned if (src, tar) is absent from
        mat when src does not equal tar
    :param float match_cost: the value returned if (src, tar) is absent from
        mat when src equals tar
    :param bool symmetric: True if the cost of src not matching tar is
        identical to the cost of tar not matching src; in this case, the values
        in mat need only contain (src, tar) or (tar, src), not both
    :param str alphabet: a collection of tokens from which src and tar are
        drawn; if this is defined a ValueError is raised if either tar or src
        is not found in alphabet
    :returns: matrix similarity
    :rtype: float

    >>> sim_matrix('cat', 'hat')
    0
    >>> sim_matrix('hat', 'hat')
    1
    """
    if alphabet:
        alphabet = tuple(alphabet)
        for i in src:
            if i not in alphabet:
                raise ValueError('src value not in alphabet')
        for i in tar:
            if i not in alphabet:
                raise ValueError('tar value not in alphabet')

    if src == tar:
        if mat and (src, src) in mat:
            return mat[(src, src)]
        return match_cost
    if mat and (src, tar) in mat:
        return mat[(src, tar)]
    elif symmetric and mat and (tar, src) in mat:
        return mat[(tar, src)]
    return mismatch_cost


def needleman_wunsch(src, tar, gap_cost=1, sim_func=sim_ident):
    """Return the Needleman-Wunsch score of two strings.

    Needleman-Wunsch score

    This is the standard global sequence alignment measure.

    Cf. https://en.wikipedia.org/wiki/Needleman–Wunsch_algorithm

    Cf.
    http://csb.stanford.edu/class/public/readings/Bioinformatics_I_Lecture6/Needleman_Wunsch_JMB_70_Global_alignment.pdf

    :param str src, tar: two strings to be compared
    :param float gap_cost: the cost of an alignment gap (1 by default)
    :param function sim_func: a function that returns the similarity of two
        characters (identity similarity by default)
    :returns: Needleman-Wunsch score
    :rtype: int (in fact dependent on the gap_cost & return value of sim_func)

    >>> needleman_wunsch('cat', 'hat')
    2.0
    >>> needleman_wunsch('Niall', 'Neil')
    1.0
    >>> needleman_wunsch('aluminum', 'Catalan')
    -1.0
    >>> needleman_wunsch('ATCG', 'TAGC')
    0.0
    """
    # pylint: disable=no-member
    d_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.float)
    # pylint: enable=no-member

    for i in range(len(src)+1):
        d_mat[i, 0] = -(i * gap_cost)
    for j in range(len(tar)+1):
        d_mat[0, j] = -(j * gap_cost)
    for i in range(1, len(src)+1):
        for j in range(1, len(tar)+1):
            match = d_mat[i-1, j-1] + sim_func(src[i-1], tar[j-1])
            delete = d_mat[i-1, j] - gap_cost
            insert = d_mat[i, j-1] - gap_cost
            d_mat[i, j] = max(match, delete, insert)
    return d_mat[d_mat.shape[0]-1, d_mat.shape[1]-1]


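The same global-alignment recurrence can be written without numpy, keeping only two rows of the score matrix (an illustrative helper with the identity similarity baked in, not part of the module):

```python
def nw_score(src, tar, gap_cost=1):
    # global alignment: each cell keeps the best of match/mismatch,
    # deletion, and insertion; boundary cells are pure gap runs
    prev = [-j * gap_cost for j in range(len(tar) + 1)]
    for i in range(1, len(src) + 1):
        cur = [-i * gap_cost]
        for j in range(1, len(tar) + 1):
            match = prev[j - 1] + (1 if src[i - 1] == tar[j - 1] else 0)
            cur.append(max(match, prev[j] - gap_cost, cur[j - 1] - gap_cost))
        prev = cur
    return float(prev[-1])
```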
def smith_waterman(src, tar, gap_cost=1, sim_func=sim_ident):
    """Return the Smith-Waterman score of two strings.

    Smith-Waterman score

    This is the standard local sequence alignment measure.

    Cf. https://en.wikipedia.org/wiki/Smith–Waterman_algorithm

    :param str src, tar: two strings to be compared
    :param float gap_cost: the cost of an alignment gap (1 by default)
    :param function sim_func: a function that returns the similarity of two
        characters (identity similarity by default)
    :returns: Smith-Waterman score
    :rtype: int (in fact dependent on the gap_cost & return value of sim_func)

    >>> smith_waterman('cat', 'hat')
    2.0
    >>> smith_waterman('Niall', 'Neil')
    1.0
    >>> smith_waterman('aluminum', 'Catalan')
    0.0
    >>> smith_waterman('ATCG', 'TAGC')
    1.0
    """
    # pylint: disable=no-member
    d_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.float)
    # pylint: enable=no-member

    for i in range(len(src)+1):
        d_mat[i, 0] = 0
    for j in range(len(tar)+1):
        d_mat[0, j] = 0
    for i in range(1, len(src)+1):
        for j in range(1, len(tar)+1):
            match = d_mat[i-1, j-1] + sim_func(src[i-1], tar[j-1])
            delete = d_mat[i-1, j] - gap_cost
            insert = d_mat[i, j-1] - gap_cost
            d_mat[i, j] = max(0, match, delete, insert)
    return d_mat[d_mat.shape[0]-1, d_mat.shape[1]-1]


def gotoh(src, tar, gap_open=1, gap_ext=0.4, sim_func=sim_ident):
    """Return the Gotoh score of two strings.

    Gotoh score

    Gotoh's algorithm is essentially Needleman-Wunsch with affine gap
    penalties:
    https://www.cs.umd.edu/class/spring2003/cmsc838t/papers/gotoh1982.pdf

    :param str src, tar: two strings to be compared
    :param float gap_open: the cost of an open alignment gap (1 by default)
    :param float gap_ext: the cost of an alignment gap extension (0.4 by
        default)
    :param function sim_func: a function that returns the similarity of two
        characters (identity similarity by default)
    :returns: Gotoh score
    :rtype: float (in fact dependent on the gap costs & return value of
        sim_func)

    >>> gotoh('cat', 'hat')
    2.0
    >>> gotoh('Niall', 'Neil')
    1.0
    >>> gotoh('aluminum', 'Catalan')
    -0.40000000000000002
    >>> gotoh('cat', 'hat')
    2.0
    """
    # pylint: disable=no-member
    d_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.float)
    p_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.float)
    q_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.float)
    # pylint: enable=no-member

    d_mat[0, 0] = 0
    p_mat[0, 0] = float('-inf')
    q_mat[0, 0] = float('-inf')
    for i in range(1, len(src)+1):
        d_mat[i, 0] = float('-inf')
        p_mat[i, 0] = -gap_open - gap_ext*(i-1)
        q_mat[i, 0] = float('-inf')
        q_mat[i, 1] = -gap_open
    for j in range(1, len(tar)+1):
        d_mat[0, j] = float('-inf')
        p_mat[0, j] = float('-inf')
        p_mat[1, j] = -gap_open
        q_mat[0, j] = -gap_open - gap_ext*(j-1)

    for i in range(1, len(src)+1):
        for j in range(1, len(tar)+1):
            sim_val = sim_func(src[i-1], tar[j-1])
            d_mat[i, j] = max(d_mat[i-1, j-1] + sim_val,
                              p_mat[i-1, j-1] + sim_val,
                              q_mat[i-1, j-1] + sim_val)

            p_mat[i, j] = max(d_mat[i-1, j] - gap_open,
                              p_mat[i-1, j] - gap_ext)

            q_mat[i, j] = max(d_mat[i, j-1] - gap_open,
                              q_mat[i, j-1] - gap_ext)

    i, j = (n - 1 for n in d_mat.shape)
    return max(d_mat[i, j], p_mat[i, j], q_mat[i, j])


def sim_length(src, tar):
    """Return the length similarity of two strings.

    Length similarity

    This is the ratio of the length of the shorter string to the longer.

    :param str src, tar: two strings to be compared
    :returns: length similarity
    :rtype: float

    >>> sim_length('cat', 'hat')
    1.0
    >>> sim_length('Niall', 'Neil')
    0.8
    >>> sim_length('aluminum', 'Catalan')
    0.875
    >>> sim_length('ATCG', 'TAGC')
    1.0
    """
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    return len(src)/len(tar) if len(src) < len(tar) else len(tar)/len(src)


def dist_length(src, tar):
    """Return the length distance between two strings.

    Length distance

    Length distance is the complement of length similarity:
    :math:`dist_{length} = 1 - sim_{length}`

    :param str src, tar: two strings to be compared
    :returns: length distance
    :rtype: float

    >>> dist_length('cat', 'hat')
    0.0
    >>> dist_length('Niall', 'Neil')
    0.19999999999999996
    >>> dist_length('aluminum', 'Catalan')
    0.125
    >>> dist_length('ATCG', 'TAGC')
    0.0
    """
    return 1 - sim_length(src, tar)


def sim_prefix(src, tar):
    """Return the prefix similarity of two strings.

    Prefix similarity

    Prefix similarity is the ratio of the length of the shorter term that
    exactly matches the longer term to the length of the shorter term,
    beginning at the start of both terms.

    :param str src, tar: two strings to be compared
    :returns: prefix similarity
    :rtype: float

    >>> sim_prefix('cat', 'hat')
    0.0
    >>> sim_prefix('Niall', 'Neil')
    0.25
    >>> sim_prefix('aluminum', 'Catalan')
    0.0
    >>> sim_prefix('ATCG', 'TAGC')
    0.0
    """
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    min_word, max_word = (src, tar) if len(src) < len(tar) else (tar, src)
    min_len = len(min_word)
    for i in range(min_len, 0, -1):
        if min_word[:i] == max_word[:i]:
            return i/min_len
    return 0.0


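The same prefix ratio can be obtained via the standard library's os.path.commonprefix, which compares character-wise on arbitrary strings despite its name (an illustrative alternative, not part of the module):

```python
import os.path


def prefix_sim(src, tar):
    # length of the shared leading run over the length of the shorter string
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    return len(os.path.commonprefix([src, tar])) / min(len(src), len(tar))
```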
def dist_prefix(src, tar):
    """Return the prefix distance between two strings.

    Prefix distance

    Prefix distance is the complement of prefix similarity:
    :math:`dist_{prefix} = 1 - sim_{prefix}`

    :param str src, tar: two strings to be compared
    :returns: prefix distance
    :rtype: float

    >>> dist_prefix('cat', 'hat')
    1.0
    >>> dist_prefix('Niall', 'Neil')
    0.75
    >>> dist_prefix('aluminum', 'Catalan')
    1.0
    >>> dist_prefix('ATCG', 'TAGC')
    1.0
    """
    return 1 - sim_prefix(src, tar)


def sim_suffix(src, tar):
    """Return the suffix similarity of two strings.

    Suffix similarity

    Suffix similarity is the ratio of the length of the shorter term that
    exactly matches the longer term to the length of the shorter term,
    beginning at the end of both terms.

    :param str src, tar: two strings to be compared
    :returns: suffix similarity
    :rtype: float

    >>> sim_suffix('cat', 'hat')
    0.6666666666666666
    >>> sim_suffix('Niall', 'Neil')
    0.25
    >>> sim_suffix('aluminum', 'Catalan')
    0.0
    >>> sim_suffix('ATCG', 'TAGC')
    0.0
    """
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    min_word, max_word = (src, tar) if len(src) < len(tar) else (tar, src)
    min_len = len(min_word)
    for i in range(min_len, 0, -1):
        if min_word[-i:] == max_word[-i:]:
            return i/min_len
    return 0.0


def dist_suffix(src, tar):
    """Return the suffix distance between two strings.

    Suffix distance

    Suffix distance is the complement of suffix similarity:
    :math:`dist_{suffix} = 1 - sim_{suffix}`

    :param str src, tar: two strings to be compared
    :returns: suffix distance
    :rtype: float

    >>> dist_suffix('cat', 'hat')
    0.33333333333333337
    >>> dist_suffix('Niall', 'Neil')
    0.75
    >>> dist_suffix('aluminum', 'Catalan')
    1.0
    >>> dist_suffix('ATCG', 'TAGC')
    1.0
    """
    return 1 - sim_suffix(src, tar)


2573
def sim_mlipns(src, tar, threshold=0.25, maxmismatches=2):
2574
    """Return the MLIPNS similarity of two strings.
2575
2576
    Modified Language-Independent Product Name Search (MLIPNS)
2577
2578
    The MLIPNS algorithm is described in Shannaq, Boumedyen A. N. and Victor V.
2579
    Alexandrov. 2010. "Using Product Similarity for Adding Business." Global
2580
    Journal of Computer Science and Technology. 10(12). 2-8.
2581
    http://www.sial.iias.spb.su/files/386-386-1-PB.pdf
2582
2583
    This function returns only 1.0 (similar) or 0.0 (not similar).
2584
2585
    LIPNS similarity is identical to normalized Hamming similarity.
2586
2587
    :param str src, tar: two strings to be compared
2588
    :param float threshold: a number [0, 1] indicating the maximum similarity
2589
        score, below which the strings are considered 'similar' (0.25 by
2590
        default)
2591
    :param int maxmismatches: a number indicating the allowable number of
2592
        mismatches to remove before declaring two strings not similar (2 by
2593
        default)
2594
    :returns: MLIPNS similarity
2595
    :rtype: float
2596
2597
    >>> sim_mlipns('cat', 'hat')
2598
    1.0
2599
    >>> sim_mlipns('Niall', 'Neil')
2600
    0.0
2601
    >>> sim_mlipns('aluminum', 'Catalan')
2602
    0.0
2603
    >>> sim_mlipns('ATCG', 'TAGC')
2604
    0.0
2605
    """
2606
    if tar == src:
2607
        return 1.0
2608
    if not src or not tar:
2609
        return 0.0
2610
2611
    mismatches = 0
2612
    ham = hamming(src, tar, difflens=True)
2613
    maxlen = max(len(src), len(tar))
2614
    while src and tar and mismatches <= maxmismatches:
2615
        if maxlen < 1 or (1-(maxlen-ham)/maxlen) <= threshold:
2616
            return 1.0
2617
        else:
2618
            mismatches += 1
2619
            ham -= 1
2620
            maxlen -= 1
2621
2622
    if maxlen < 1:
2623
        return 1.0
2624
    return 0.0
2625
2626
2627
def dist_mlipns(src, tar, threshold=0.25, maxmismatches=2):
2628
    """Return the MLIPNS distance between two strings.
2629
2630
    Modified Language-Independent Product Name Search (MLIPNS)
2631
2632
    MLIPNS distance is the complement of MLIPNS similarity:
2633
    :math:`dist_{MLIPNS} = 1 - sim_{MLIPNS}`
2634
2635
    This function returns only 0.0 (distant) or 1.0 (not distant)
2636
2637
    :param str src, tar: two strings to be compared
2638
    :param float threshold: a number [0, 1] indicating the maximum similarity
2639
        score, below which the strings are considered 'similar' (0.25 by
2640
        default)
2641
    :param int maxmismatches: a number indicating the allowable number of
2642
        mismatches to remove before declaring two strings not similar (2 by
2643
        default)
2644
    :returns: MLIPNS distance
2645
    :rtype: float
2646
2647
    >>> dist_mlipns('cat', 'hat')
2648
    0.0
2649
    >>> dist_mlipns('Niall', 'Neil')
2650
    1.0
2651
    >>> dist_mlipns('aluminum', 'Catalan')
2652
    1.0
2653
    >>> dist_mlipns('ATCG', 'TAGC')
2654
    1.0
2655
    """
2656
    return 1.0 - sim_mlipns(src, tar, threshold, maxmismatches)
2657
2658
2659
def bag(src, tar):
2660
    """Return the bag distance between two strings.
2661
2662
    Bag distance
2663
2664
    Bag distance is proposed in Bartolini, Illaria, Paolo Ciaccia, and Marco
2665
    Patella. 2002. "String Matching with Metric Trees Using and Approximate
2666
    Distance. Proceedings of the 9th International Symposium on String
2667
    Processing and Information Retrieval, Lisbone, Portugal, September 2002.
2668
    271-283.
2669
    http://www-db.disi.unibo.it/research/papers/SPIRE02.pdf
2670
2671
    It is defined as:
2672
    :math:`max( |multiset(src)-multiset(tar)|, |multiset(tar)-multiset(src)| )`
2673
2674
    :param str src, tar: two strings to be compared
2675
    :returns: bag distance
2676
    :rtype: int
2677
2678
    >>> bag('cat', 'hat')
2679
    1
2680
    >>> bag('Niall', 'Neil')
2681
    2
2682
    >>> bag('aluminum', 'Catalan')
2683
    5
2684
    >>> bag('ATCG', 'TAGC')
2685
    0
2686
    >>> bag('abcdefg', 'hijklm')
2687
    7
2688
    >>> bag('abcdefg', 'hijklmno')
2689
    8
2690
    """
2691
    if tar == src:
2692
        return 0
2693
    elif not src:
2694
        return len(tar)
2695
    elif not tar:
2696
        return len(src)
2697
2698
    src_bag = Counter(src)
2699
    tar_bag = Counter(tar)
2700
    return max(sum((src_bag-tar_bag).values()),
2701
               sum((tar_bag-src_bag).values()))
2702
2703
2704
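The multiset-difference formula in the docstring above can be tried out on its own. The sketch below (with a hypothetical name, not part of this module's API) builds it directly on `collections.Counter`, whose subtraction operator already implements one-sided multiset difference:

```python
from collections import Counter


def bag_sketch(src, tar):
    """Illustrative bag distance: the larger of the two one-sided
    multiset differences between the strings' character bags."""
    src_bag, tar_bag = Counter(src), Counter(tar)
    return max(sum((src_bag - tar_bag).values()),
               sum((tar_bag - src_bag).values()))


# bag('Niall') - bag('Neil') leaves {a, l}; the reverse leaves {e},
# so the bag distance is max(2, 1) = 2
print(bag_sketch('Niall', 'Neil'))
```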
def dist_bag(src, tar):
    """Return the normalized bag distance between two strings.

    Normalized bag distance

    Bag distance is normalized by dividing by :math:`max( |src|, |tar| )`.

    :param str src, tar: two strings to be compared
    :returns: normalized bag distance
    :rtype: float

    >>> dist_bag('cat', 'hat')
    0.3333333333333333
    >>> dist_bag('Niall', 'Neil')
    0.4
    >>> dist_bag('aluminum', 'Catalan')
    0.375
    >>> dist_bag('ATCG', 'TAGC')
    0.0
    """
    if tar == src:
        return 0.0
    if not src or not tar:
        return 1.0

    maxlen = max(len(src), len(tar))

    return bag(src, tar)/maxlen


def sim_bag(src, tar):
    """Return the normalized bag similarity of two strings.

    Normalized bag similarity

    Normalized bag similarity is the complement of normalized bag distance:
    :math:`sim_{bag} = 1 - dist_{bag}`

    :param str src, tar: two strings to be compared
    :returns: normalized bag similarity
    :rtype: float

    >>> sim_bag('cat', 'hat')
    0.6666666666666667
    >>> sim_bag('Niall', 'Neil')
    0.6
    >>> sim_bag('aluminum', 'Catalan')
    0.625
    >>> sim_bag('ATCG', 'TAGC')
    1.0
    """
    return 1-dist_bag(src, tar)


def editex(src, tar, cost=(0, 1, 2), local=False):
    """Return the Editex distance between two strings.

    Editex distance

    As described on pages 3 & 4 of
    Zobel, Justin and Philip Dart. 1996. Phonetic string matching: Lessons from
    information retrieval. In: Proceedings of the ACM-SIGIR Conference on
    Research and Development in Information Retrieval, Zurich, Switzerland.
    166–173. https://doi.org/10.1145/243199.243258

    The local variant is based on
    Ring, Nicholas and Alexandra L. Uitdenbogerd. 2009. Finding ‘Lucy in
    Disguise’: The Misheard Lyric Matching Problem. In: Proceedings of the 5th
    Asia Information Retrieval Symposium, Sapporo, Japan. 157-167.
    http://www.seg.rmit.edu.au/research/download.php?manuscript=404

    :param str src, tar: two strings to be compared
    :param tuple cost: a 3-tuple representing the cost of the three possible
        edits: match, same-group, and mismatch, respectively
        (by default: (0, 1, 2))
    :param bool local: if True, the local variant of Editex is used
    :returns: Editex distance
    :rtype: int

    >>> editex('cat', 'hat')
    2
    >>> editex('Niall', 'Neil')
    2
    >>> editex('aluminum', 'Catalan')
    12
    >>> editex('ATCG', 'TAGC')
    6
    """
    match_cost, group_cost, mismatch_cost = cost
    letter_groups = ({'A', 'E', 'I', 'O', 'U', 'Y'},
                     {'B', 'P'},
                     {'C', 'K', 'Q'},
                     {'D', 'T'},
                     {'L', 'R'},
                     {'M', 'N'},
                     {'G', 'J'},
                     {'F', 'P', 'V'},
                     {'S', 'X', 'Z'},
                     {'C', 'S', 'Z'})
    all_letters = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'I', 'J', 'K', 'L', 'M',
                   'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Y', 'Z'}

    def r_cost(ch1, ch2):
        """Return r(a,b) according to Zobel & Dart's definition."""
        if ch1 == ch2:
            return match_cost
        if ch1 in all_letters and ch2 in all_letters:
            for group in letter_groups:
                if ch1 in group and ch2 in group:
                    return group_cost
        return mismatch_cost

    def d_cost(ch1, ch2):
        """Return d(a,b) according to Zobel & Dart's definition."""
        if ch1 != ch2 and (ch1 == 'H' or ch1 == 'W'):
            return group_cost
        return r_cost(ch1, ch2)

    # convert both src & tar to NFKD normalized unicode
    src = unicodedata.normalize('NFKD', text_type(src.upper()))
    tar = unicodedata.normalize('NFKD', text_type(tar.upper()))
    # convert ß to SS (for Python2)
    src = src.replace('ß', 'SS')
    tar = tar.replace('ß', 'SS')

    if src == tar:
        return 0
    if not src:
        return len(tar) * mismatch_cost
    if not tar:
        return len(src) * mismatch_cost

    # pylint: disable=no-member
    d_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.int)
    # pylint: enable=no-member
    lens = len(src)
    lent = len(tar)
    src = ' '+src
    tar = ' '+tar

    if not local:
        for i in range(1, lens+1):
            d_mat[i, 0] = d_mat[i-1, 0] + d_cost(src[i-1], src[i])
    for j in range(1, lent+1):
        d_mat[0, j] = d_mat[0, j-1] + d_cost(tar[j-1], tar[j])

    for i in range(1, lens+1):
        for j in range(1, lent+1):
            d_mat[i, j] = min(d_mat[i-1, j] + d_cost(src[i-1], src[i]),
                              d_mat[i, j-1] + d_cost(tar[j-1], tar[j]),
                              d_mat[i-1, j-1] + r_cost(src[i], tar[j]))

    return d_mat[lens, lent]

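The per-character cost r(a, b) above can be isolated for illustration. `editex_char_cost` below is a hypothetical standalone helper (not part of this module) using the same phonetic letter groups and the default costs of 0 for a match, 1 for letters in the same group, and 2 for a mismatch:

```python
def editex_char_cost(ch1, ch2, match=0, group=1, mismatch=2):
    """Illustrative Editex per-character cost (Zobel & Dart's r(a,b))."""
    groups = ({'A', 'E', 'I', 'O', 'U', 'Y'}, {'B', 'P'}, {'C', 'K', 'Q'},
              {'D', 'T'}, {'L', 'R'}, {'M', 'N'}, {'G', 'J'},
              {'F', 'P', 'V'}, {'S', 'X', 'Z'}, {'C', 'S', 'Z'})
    if ch1 == ch2:
        return match
    # letters drawn from the same phonetic group cost less than a
    # full mismatch
    if any(ch1 in grp and ch2 in grp for grp in groups):
        return group
    return mismatch


# 'B' and 'P' share a group; 'C' and 'H' do not
print(editex_char_cost('B', 'P'), editex_char_cost('C', 'H'))
```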
def dist_editex(src, tar, cost=(0, 1, 2), local=False):
    """Return the normalized Editex distance between two strings.

    Editex distance normalized to the interval [0, 1]

    The Editex distance is normalized by dividing the Editex distance by the
    greater of the number of characters in src times the cost of a delete and
    the number of characters in tar times the cost of an insert.
    For the case in which all operations have :math:`cost = 1`, this is
    equivalent to the greater of the length of the two strings src & tar.

    :param str src, tar: two strings to be compared
    :param tuple cost: a 3-tuple representing the cost of the three possible
        edits: match, same-group, and mismatch, respectively
        (by default: (0, 1, 2))
    :param bool local: if True, the local variant of Editex is used
    :returns: normalized Editex distance
    :rtype: float

    >>> dist_editex('cat', 'hat')
    0.33333333333333331
    >>> dist_editex('Niall', 'Neil')
    0.20000000000000001
    >>> dist_editex('aluminum', 'Catalan')
    0.75
    >>> dist_editex('ATCG', 'TAGC')
    0.75
    """
    if src == tar:
        return 0.0
    mismatch_cost = cost[2]
    return (editex(src, tar, cost, local) /
            (max(len(src)*mismatch_cost, len(tar)*mismatch_cost)))


def sim_editex(src, tar, cost=(0, 1, 2), local=False):
    """Return the normalized Editex similarity of two strings.

    Editex similarity normalized to the interval [0, 1]

    The Editex similarity is the complement of Editex distance:
    :math:`sim_{Editex} = 1 - dist_{Editex}`

    The arguments are identical to those of the editex() function.

    :param str src, tar: two strings to be compared
    :param tuple cost: a 3-tuple representing the cost of the three possible
        edits: match, same-group, and mismatch, respectively
        (by default: (0, 1, 2))
    :param bool local: if True, the local variant of Editex is used
    :returns: normalized Editex similarity
    :rtype: float

    >>> sim_editex('cat', 'hat')
    0.66666666666666674
    >>> sim_editex('Niall', 'Neil')
    0.80000000000000004
    >>> sim_editex('aluminum', 'Catalan')
    0.25
    >>> sim_editex('ATCG', 'TAGC')
    0.25
    """
    return 1 - dist_editex(src, tar, cost, local)


def eudex_hamming(src, tar, weights='exponential', maxlength=8,
                  normalized=False):
    """Calculate the Hamming distance between the Eudex hashes of two terms.

    If weights is set to None, a simple Hamming distance is calculated.
    If weights is set to 'exponential', weight decays by powers of 2, as
        proposed in the eudex specification: https://github.com/ticki/eudex.
    If weights is set to 'fibonacci', weight decays through the Fibonacci
        series, as in the eudex reference implementation.
    If weights is set to a callable function, this assumes it creates a
        generator and the generator is used to populate a series of weights.
    If weights is set to an iterable, the iterable's values should be integers
        and will be used as the weights.

    :param str src, tar: two strings to be compared
    :param weights: an iterable of integer weights, a function that generates
        them, 'exponential', 'fibonacci', or None
    :param int maxlength: the number of characters to encode as a eudex hash
    :param bool normalized: normalizes to [0, 1] if True
    :returns: the weighted Hamming distance between the Eudex hashes of src &
        tar
    :rtype: int or float
    """
    def _gen_fibonacci():
        """Yield the next Fibonacci number.

        Based on https://www.python-course.eu/generators.php
        Starts at Fibonacci number 3 (the second 1)
        """
        num_a, num_b = 1, 2
        while True:
            yield num_a
            num_a, num_b = num_b, num_a + num_b

    def _gen_exponential(base=2):
        """Yield the next value in an exponential series of the base.

        Based on https://www.python-course.eu/generators.php
        Starts at base**0
        """
        exp = 0
        while True:
            yield base ** exp
            exp += 1

    # Calculate the eudex hashes and XOR them
    xored = eudex(src, maxlength=maxlength) ^ eudex(tar, maxlength=maxlength)

    # Simple hamming distance (all bits are equal)
    if not weights:
        return bin(xored).count('1')

    # If weights is a function, it should create a generator,
    # which we now use to populate a list
    if callable(weights):
        weights = weights()
    elif weights == 'exponential':
        weights = _gen_exponential()
    elif weights == 'fibonacci':
        weights = _gen_fibonacci()
    if isinstance(weights, types.GeneratorType):
        weights = [next(weights) for _ in range(maxlength)][::-1]

    # Sum the weighted hamming distance
    dist = 0
    maxdist = 0
    while (xored or normalized) and weights:
        maxdist += 8*weights[-1]
        dist += bin(xored & 0xFF).count('1') * weights.pop()
        xored >>= 8

    if normalized:
        dist /= maxdist

    return dist

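The weighting loop above can be seen in isolation: each byte's popcount of the XOR is scaled by a weight that doubles toward the most significant byte. Below is a hypothetical standalone sketch that takes two plain integers standing in for Eudex hashes (it does not call `eudex()` itself):

```python
def weighted_byte_hamming(hash_a, hash_b, num_bytes=8):
    """Illustrative weighted Hamming distance over the bytes of two ints.

    The least significant byte gets weight 1; each higher byte's weight
    doubles, mirroring the 'exponential' decay described above.
    """
    xored = hash_a ^ hash_b
    weights = [2 ** i for i in range(num_bytes)]  # [1, 2, 4, ..., 128]
    dist = 0
    while xored and weights:
        # count differing bits in the lowest remaining byte, then weight them
        dist += bin(xored & 0xFF).count('1') * weights.pop(0)
        xored >>= 8
    return dist
```

A single differing bit in the second-lowest byte thus contributes 2 to the distance, while eight differing bits in the lowest byte contribute only 8.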
def dist_eudex(src, tar, weights='exponential', maxlength=8):
    """Return normalized Hamming distance between Eudex hashes of two terms.

    If weights is set to None, a simple Hamming distance is calculated.
    If weights is set to 'exponential', weight decays by powers of 2, as
        proposed in the eudex specification: https://github.com/ticki/eudex.
    If weights is set to 'fibonacci', weight decays through the Fibonacci
        series, as in the eudex reference implementation.
    If weights is set to a callable function, this assumes it creates a
        generator and the generator is used to populate a series of weights.
    If weights is set to an iterable, the iterable's values should be integers
        and will be used as the weights.

    :param str src, tar: two strings to be compared
    :param weights: an iterable of integer weights, a function that generates
        them, 'exponential', 'fibonacci', or None
    :param int maxlength: the number of characters to encode as a eudex hash
    :returns: the normalized Hamming distance between the Eudex hashes of
        src & tar
    :rtype: float
    """
    return eudex_hamming(src, tar, weights, maxlength, True)


def sim_eudex(src, tar, weights='exponential', maxlength=8):
    """Return normalized Hamming similarity between Eudex hashes of two terms.

    If weights is set to None, a simple Hamming distance is calculated.
    If weights is set to 'exponential', weight decays by powers of 2, as
        proposed in the eudex specification: https://github.com/ticki/eudex.
    If weights is set to 'fibonacci', weight decays through the Fibonacci
        series, as in the eudex reference implementation.
    If weights is set to a callable function, this assumes it creates a
        generator and the generator is used to populate a series of weights.
    If weights is set to an iterable, the iterable's values should be integers
        and will be used as the weights.

    :param str src, tar: two strings to be compared
    :param weights: an iterable of integer weights, a function that generates
        them, 'exponential', 'fibonacci', or None
    :param int maxlength: the number of characters to encode as a eudex hash
    :returns: the normalized Hamming similarity between the Eudex hashes of
        src & tar
    :rtype: float
    """
    return 1-dist_eudex(src, tar, weights, maxlength)


def sift4_simplest(src, tar, max_offset=0):
    """Return the "simplest" Sift4 distance between two terms.

    This is an approximation of edit distance, described in:
    Zackwehdex, Siderite. 2014. "Super Fast and Accurate string distance
    algorithm: Sift4."
    https://siderite.blogspot.com/2014/11/super-fast-and-accurate-string-distance.html

    :param str src, tar: two strings to be compared
    :param int max_offset: the number of characters to search for matching
        letters
    :returns: the Sift4 distance according to the simplest formula
    :rtype: int
    """
    if not src:
        return len(tar)

    if not tar:
        return len(src)

    src_len = len(src)
    tar_len = len(tar)

    src_cur = 0
    tar_cur = 0
    lcss = 0
    local_cs = 0

    while (src_cur < src_len) and (tar_cur < tar_len):
        if src[src_cur] == tar[tar_cur]:
            local_cs += 1
        else:
            lcss += local_cs
            local_cs = 0
            if src_cur != tar_cur:
                src_cur = tar_cur = max(src_cur, tar_cur)
            for i in range(max_offset):
                if not ((src_cur+i < src_len) or (tar_cur+i < tar_len)):
                    break
                if (src_cur+i < src_len) and (src[src_cur+i] == tar[tar_cur]):
                    src_cur += i
                    local_cs += 1
                    break
                if (tar_cur+i < tar_len) and (src[src_cur] == tar[tar_cur+i]):
                    tar_cur += i
                    local_cs += 1
                    break

        src_cur += 1
        tar_cur += 1

    lcss += local_cs
    return round(max(src_len, tar_len) - lcss)


def sift4_common(src, tar, max_offset=0, max_distance=0):
    """Return the "common" Sift4 distance between two terms.

    This is an approximation of edit distance, described in:
    Zackwehdex, Siderite. 2014. "Super Fast and Accurate string distance
    algorithm: Sift4."
    https://siderite.blogspot.com/2014/11/super-fast-and-accurate-string-distance.html

    :param str src, tar: two strings to be compared
    :param int max_offset: the number of characters to search for matching
        letters
    :param int max_distance: the distance at which to stop and exit
    :returns: the Sift4 distance according to the common formula
    :rtype: int
    """
    if not src:
        return len(tar)

    if not tar:
        return len(src)

    src_len = len(src)
    tar_len = len(tar)

    src_cur = 0
    tar_cur = 0
    lcss = 0
    local_cs = 0
    trans = 0
    offset_arr = []

    while (src_cur < src_len) and (tar_cur < tar_len):
        if src[src_cur] == tar[tar_cur]:
            local_cs += 1
            is_trans = False
            i = 0
            while i < len(offset_arr):
                ofs = offset_arr[i]
                if src_cur <= ofs['src_cur'] or tar_cur <= ofs['tar_cur']:
                    is_trans = (abs(tar_cur-src_cur) >=
                                abs(ofs['tar_cur']-ofs['src_cur']))
                    if is_trans:
                        trans += 1
                    elif not ofs['trans']:
                        ofs['trans'] = True
                        trans += 1
                    break
                elif src_cur > ofs['tar_cur'] and tar_cur > ofs['src_cur']:
                    del offset_arr[i]
                else:
                    i += 1

            offset_arr.append({'src_cur': src_cur, 'tar_cur': tar_cur,
                               'trans': is_trans})
        else:
            lcss += local_cs
            local_cs = 0
            if src_cur != tar_cur:
                src_cur = tar_cur = min(src_cur, tar_cur)
            for i in range(max_offset):
                if not ((src_cur+i < src_len) or (tar_cur+i < tar_len)):
                    break
                if (src_cur+i < src_len) and (src[src_cur+i] == tar[tar_cur]):
                    src_cur += i-1
                    tar_cur -= 1
                    break
                if (tar_cur+i < tar_len) and (src[src_cur] == tar[tar_cur+i]):
                    src_cur -= 1
                    tar_cur += i-1
                    break

        src_cur += 1
        tar_cur += 1

        if max_distance:
            temporary_distance = max(src_cur, tar_cur) - lcss + trans
            if temporary_distance >= max_distance:
                return round(temporary_distance)

        if (src_cur >= src_len) or (tar_cur >= tar_len):
            lcss += local_cs
            local_cs = 0
            src_cur = tar_cur = min(src_cur, tar_cur)

    lcss += local_cs
    return round(max(src_len, tar_len) - lcss + trans)

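Since Sift4 only approximates edit distance, a handy sanity check is to compare its output against a plain dynamic-programming Levenshtein distance. The function below is a generic textbook sketch added for illustration, not this module's own Levenshtein implementation:

```python
def levenshtein_ref(src, tar):
    """Textbook Levenshtein distance via the Wagner-Fischer dynamic
    program, kept to two rows of the DP table."""
    prev = list(range(len(tar) + 1))
    for i, s_ch in enumerate(src, 1):
        cur = [i]
        for j, t_ch in enumerate(tar, 1):
            cur.append(min(prev[j] + 1,                       # deletion
                           cur[j - 1] + 1,                    # insertion
                           prev[j - 1] + (s_ch != t_ch)))     # substitution
        prev = cur
    return prev[-1]
```

Comparing `levenshtein_ref(src, tar)` with a Sift4 variant over a word list gives a quick read on how far the approximation drifts for a given `max_offset`.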
def dist_sift4(src, tar, max_offset=0, max_distance=0):
    """Return the normalized "common" Sift4 distance between two terms.

    This is an approximation of edit distance, described in:
    Zackwehdex, Siderite. 2014. "Super Fast and Accurate string distance
    algorithm: Sift4."
    https://siderite.blogspot.com/2014/11/super-fast-and-accurate-string-distance.html

    :param str src, tar: two strings to be compared
    :param int max_offset: the number of characters to search for matching
        letters
    :param int max_distance: the distance at which to stop and exit
    :returns: the normalized Sift4 distance
    :rtype: float
    """
    # the additional 1 guards against division by zero when both strings
    # are empty
    return (sift4_common(src, tar, max_offset, max_distance) /
            (max(len(src), len(tar), 1)))


def sim_sift4(src, tar, max_offset=0, max_distance=0):
    """Return the normalized "common" Sift4 similarity of two terms.

    This is an approximation of edit distance, described in:
    Zackwehdex, Siderite. 2014. "Super Fast and Accurate string distance
    algorithm: Sift4."
    https://siderite.blogspot.com/2014/11/super-fast-and-accurate-string-distance.html

    :param str src, tar: two strings to be compared
    :param int max_offset: the number of characters to search for matching
        letters
    :param int max_distance: the distance at which to stop and exit
    :returns: the normalized Sift4 similarity
    :rtype: float
    """
    return 1-dist_sift4(src, tar, max_offset, max_distance)


def sim_baystat(src, tar, min_ss_len=None, left_ext=None, right_ext=None):
    """Return the Baystat similarity.

    Good results for shorter words are reported when setting min_ss_len to 1
    and either left_ext OR right_ext to 1.

    The Baystat similarity is defined in:
    Fürnrohr, Michael, Birgit Rimmelspacher, and Tilman von Roncador. 2002.
    "Zusammenführung von Datenbeständen ohne numerische Identifikatoren: ein
    Verfahren im Rahmen der Testuntersuchungen zu einem registergestützten
    Zensus." Bayern in Zahlen, 2002(7). 308--321.
    https://www.statistik.bayern.de/medien/statistik/zensus/zusammenf__hrung_von_datenbest__nden_ohne_numerische_identifikatoren.pdf

    This is ostensibly a port of the R module PPRL's implementation:
    https://github.com/cran/PPRL/blob/master/src/MTB_Baystat.cpp
    As such, this could be made more pythonic.

    :param str src, tar: two strings to be compared
    :param int min_ss_len: minimum substring length to be considered
    :param int left_ext: left-side extension length
    :param int right_ext: right-side extension length
    :returns: the Baystat similarity
    :rtype: float
    """
    if src == tar:
        return 1
    if not src or not tar:
        return 0

    max_len = max(len(src), len(tar))

    if not (min_ss_len and left_ext and right_ext):
        # These can be set via arguments to the function. Otherwise they are
        # set automatically based on values from the article.
        if max_len >= 7:
            min_ss_len = 2
            left_ext = 2
            right_ext = 2
        else:
            # The paper suggests that for short names, (exclusively) one or the
            # other of left_ext and right_ext can be 1, with good results.
            # I use 0 & 0 as the default in this case.
            min_ss_len = 1
            left_ext = 0
            right_ext = 0

    pos = 0
    match_len = 0

    while True:
        if pos + min_ss_len > len(src):
            return match_len/max_len

        hit_len = 0
        ix = 1

        substring = src[pos:pos + min_ss_len]
        search_begin = pos - left_ext

        if search_begin < 0:
            search_begin = 0
            left_ext_len = pos
        else:
            left_ext_len = left_ext

        if pos + min_ss_len + right_ext >= len(tar):
            right_ext_len = len(tar) - pos - min_ss_len
        else:
            right_ext_len = right_ext

        if (search_begin + left_ext_len + min_ss_len + right_ext_len >
                search_begin):
            search_val = tar[search_begin:(search_begin + left_ext_len +
                                           min_ss_len + right_ext_len)]
        else:
            search_val = ''

        flagged_tar = ''
        while substring in search_val and pos + ix <= len(src):
            hit_len = len(substring)
            flagged_tar = tar.replace(substring, '#'*hit_len)

            if pos + min_ss_len + ix <= len(src):
                substring = src[pos:pos + min_ss_len + ix]

            if pos+min_ss_len + right_ext_len + 1 <= len(tar):
                right_ext_len += 1

            if (search_begin + left_ext_len + min_ss_len + right_ext_len <=
                    len(tar)):
                search_val = tar[search_begin:(search_begin + left_ext_len +
                                               min_ss_len + right_ext_len)]

            ix += 1

        if hit_len > 0:
            tar = flagged_tar

        match_len += hit_len
        pos += ix


def dist_baystat(src, tar, min_ss_len=None, left_ext=None, right_ext=None):
    """Return the Baystat distance.

    :param str src, tar: two strings to be compared
    :param int min_ss_len: minimum substring length to be considered
    :param int left_ext: left-side extension length
    :param int right_ext: right-side extension length
    :returns: the Baystat distance
    :rtype: float
    """
    return 1-sim_baystat(src, tar, min_ss_len, left_ext, right_ext)

def typo(src, tar, metric='euclidean', cost=(1, 1, 0.5, 0.5)):
3328
    """Return the typo distance between two strings.
3329
3330
    This is inspired by Wayne Song's Typo-Distance
3331
    (https://github.com/wsong/Typo-Distance), and a fair bit of this was
3332
    copied from his module. Compared to the original, this has supports
3333
    different metrics for substitution.
3334
3335
    :param str src, tar: two strings to be compared
3336
    :param str metric: supported values include: 'euclidean', 'manhattan',
3337
          'log-euclidean', and 'log-manhattan'
3338
    :param tuple cost: a 4-tuple representing the cost of the four possible
3339
        edits: inserts, deletes, substitutions, and shift, respectively (by
3340
        default: (1, 1, 0.5, 0.5)) The substitution & shift costs should be
3341
        significantly less than the cost of an insertion & deletion unless
3342
        a log metric is used.
3343
    :return:
3344
    """
    ins_cost, del_cost, sub_cost, shift_cost = cost

    if src == tar:
        return 0.0
    if not src:
        return len(tar) * ins_cost
    if not tar:
        return len(src) * del_cost

    layout = {'QWERTY': (
        (('`', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', '-', '='),
         ('', 'q', 'w', 'e', 'r', 't', 'y', 'u', 'i', 'o', 'p', '[', ']',
          '\\'),
         ('', 'a', 's', 'd', 'f', 'g', 'h', 'j', 'k', 'l', ';', '\''),
         ('', 'z', 'x', 'c', 'v', 'b', 'n', 'm', ',', '.', '/'),
         ('', '', ' ', ' ', ' ', ' ', ' ', ' ', ' ')),
        (('~', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '+'),
         ('', 'Q', 'W', 'E', 'R', 'T', 'Y', 'U', 'I', 'O', 'P', '{', '}', '|'),
         ('', 'A', 'S', 'D', 'F', 'G', 'H', 'J', 'K', 'L', ':', '"'),
         ('', 'Z', 'X', 'C', 'V', 'B', 'N', 'M', '<', '>', '?'),
         ('', '', ' ', ' ', ' ', ' ', ' ', ' ', ' '))
    )}

    keyboard = layout['QWERTY']
    lowercase = {item for sublist in keyboard[0] for item in sublist}
    uppercase = {item for sublist in keyboard[1] for item in sublist}

    def _kb_array_for_char(char):
        """Return the keyboard layer (unshifted or shifted) containing char."""
        if char in lowercase:
            return keyboard[0]
        elif char in uppercase:
            return keyboard[1]
        else:
            raise ValueError(char + ' not found in any keyboard layout')

    def _get_char_coord(char, kb_array):
        """Return the row & column of char in the given keyboard array."""
        for row in kb_array:
            if char in row:
                return kb_array.index(row), row.index(char)
        raise ValueError(char + ' not found in given keyboard layout')

    def _euclidean_keyboard_distance(char1, char2):
        row1, col1 = _get_char_coord(char1, _kb_array_for_char(char1))
        row2, col2 = _get_char_coord(char2, _kb_array_for_char(char2))
        return ((row1 - row2) ** 2 + (col1 - col2) ** 2) ** 0.5

    def _manhattan_keyboard_distance(char1, char2):
        row1, col1 = _get_char_coord(char1, _kb_array_for_char(char1))
        row2, col2 = _get_char_coord(char2, _kb_array_for_char(char2))
        return abs(row1 - row2) + abs(col1 - col2)

    def _log_euclidean_keyboard_distance(char1, char2):
        return math.log(1 + _euclidean_keyboard_distance(char1, char2))

    def _log_manhattan_keyboard_distance(char1, char2):
        return math.log(1 + _manhattan_keyboard_distance(char1, char2))

    metric_dict = {'euclidean': _euclidean_keyboard_distance,
                   'manhattan': _manhattan_keyboard_distance,
                   'log-euclidean': _log_euclidean_keyboard_distance,
                   'log-manhattan': _log_manhattan_keyboard_distance}

    def _substitution_cost(char1, char2):
        cost = sub_cost
        cost *= (metric_dict[metric](char1, char2) +
                 shift_cost * (_kb_array_for_char(char1) !=
                               _kb_array_for_char(char2)))
        return cost

    d_mat = np.zeros((len(src) + 1, len(tar) + 1), dtype=np.float32)
    for i in range(len(src) + 1):
        d_mat[i, 0] = i * del_cost
    for j in range(len(tar) + 1):
        d_mat[0, j] = j * ins_cost

    for i in range(len(src)):
        for j in range(len(tar)):
            d_mat[i + 1, j + 1] = min(
                d_mat[i + 1, j] + ins_cost,  # ins
                d_mat[i, j + 1] + del_cost,  # del
                d_mat[i, j] + (_substitution_cost(src[i], tar[j])
                               if src[i] != tar[j] else 0)  # sub/==
            )

    return d_mat[len(src), len(tar)]


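The keyboard metrics used above can be illustrated with a small self-contained sketch. This is an assumption for illustration only, not abydos API: it uses plain lowercase rows and ignores the row-offset padding and shift layer of the full layout.

```python
# Illustrative sketch (not part of abydos): a simplified QWERTY grid
# without the offset padding or shift layer that typo() uses above.
QWERTY_ROWS = ('qwertyuiop', 'asdfghjkl', 'zxcvbnm')


def keyboard_coord(char):
    """Return (row, column) of a lowercase letter on the simplified grid."""
    for row_idx, row in enumerate(QWERTY_ROWS):
        if char in row:
            return row_idx, row.index(char)
    raise ValueError(char + ' not on the simplified keyboard')


def euclidean_key_distance(char1, char2):
    """Straight-line distance between two keys on the grid."""
    row1, col1 = keyboard_coord(char1)
    row2, col2 = keyboard_coord(char2)
    return ((row1 - row2) ** 2 + (col1 - col2) ** 2) ** 0.5


def manhattan_key_distance(char1, char2):
    """Grid-walk distance between two keys (rows plus columns)."""
    row1, col1 = keyboard_coord(char1)
    row2, col2 = keyboard_coord(char2)
    return abs(row1 - row2) + abs(col1 - col2)
```

Adjacent keys such as 'q' and 'w' get distance 1 under both metrics, while 'q' and 'x' differ (Euclidean sqrt(5) vs. Manhattan 3) -- this separation is what scales the substitution cost in typo().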
def dist_typo(src, tar, metric='euclidean', cost=(1, 1, 0.5, 0.5)):
    """Return the normalized typo distance between two strings.

    This is inspired by Wayne Song's Typo-Distance
    (https://github.com/wsong/Typo-Distance), and a fair bit of this was
    copied from his module. Compared to the original, this supports
    different metrics for substitution.

    :param str src, tar: two strings to be compared
    :param str metric: supported values include: 'euclidean', 'manhattan',
        'log-euclidean', and 'log-manhattan'
    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits: inserts, deletes, substitutions, and shift, respectively (by
        default: (1, 1, 0.5, 0.5)). The substitution & shift costs should be
        significantly less than the insertion & deletion costs unless a log
        metric is used.
    :returns: normalized typo distance
    :rtype: float
    """
    if src == tar:
        return 0.0
    ins_cost, del_cost = cost[:2]
    return (typo(src, tar, metric, cost) /
            max(len(src)*del_cost, len(tar)*ins_cost))


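The normalization in dist_typo divides the raw edit cost by the worst case, deleting all of src or inserting all of tar, which bounds the result to [0, 1]. A minimal sketch (the helper name is hypothetical, assuming unit insert/delete costs by default):

```python
def normalized_cost(raw_cost, src, tar, ins_cost=1, del_cost=1):
    """Hypothetical helper: divide a raw edit cost by the worst case,
    i.e. deleting all of src or inserting all of tar, whichever is larger.
    Assumes src and tar are not both empty (dist_typo handles src == tar
    before reaching this division)."""
    return raw_cost / max(len(src) * del_cost, len(tar) * ins_cost)
```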
def sim_typo(src, tar, metric='euclidean', cost=(1, 1, 0.5, 0.5)):
    """Return the normalized typo similarity between two strings.

    This is inspired by Wayne Song's Typo-Distance
    (https://github.com/wsong/Typo-Distance), and a fair bit of this was
    copied from his module. Compared to the original, this supports
    different metrics for substitution.

    :param str src, tar: two strings to be compared
    :param str metric: supported values include: 'euclidean', 'manhattan',
        'log-euclidean', and 'log-manhattan'
    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits: inserts, deletes, substitutions, and shift, respectively (by
        default: (1, 1, 0.5, 0.5)). The substitution & shift costs should be
        significantly less than the insertion & deletion costs unless a log
        metric is used.
    :returns: normalized typo similarity
    :rtype: float
    """
    return 1 - dist_typo(src, tar, metric, cost)


def sim_tfidf(src, tar, qval=2, docs_src=None, docs_tar=None):
    """Return the TF-IDF similarity of two strings.

    This is chiefly based on the "Formal Definition of TF/IDF Distance" at:
    http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/TfIdfDistance.html

    :param str src, tar: two strings to be compared (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param Counter docs_src: a Counter object or string representing the
        document corpus for the src string
    :param Counter docs_tar: a Counter object or string representing the
        document corpus for the tar string (or set to None to use docs_src
        for both)
    :returns: TF-IDF similarity
    :rtype: float
    """
    if src == tar:
        return 1.0  # TODO: confirm correctness of this when docs are different
    elif not src or not tar:
        return 0.0

    q_src, q_tar = _get_qgrams(src, tar, qval)

    if isinstance(docs_src, Counter):
        q_docs = docs_src
    elif qval and qval > 0:
        q_docs = QGrams(docs_src, qval)
    else:
        q_docs = Counter(docs_src.strip().split())

    if not q_src or not q_tar:
        return 0.0

    # TODO: finish implementation
    return 0.5  # hardcoded to half

###############################################################################


def sim(src, tar, method=sim_levenshtein):
    """Return a similarity of two strings.

    This is a generalized function for calling other similarity functions.

    :param str src, tar: two strings to be compared
    :param function method: specifies the similarity metric (Levenshtein by
        default)
    :returns: similarity according to the specified function
    :rtype: float

    >>> sim('cat', 'hat')
    0.66666666666666674
    >>> sim('Niall', 'Neil')
    0.40000000000000002
    >>> sim('aluminum', 'Catalan')
    0.125
    >>> sim('ATCG', 'TAGC')
    0.25
    """
    if callable(method):
        return method(src, tar)
    else:
        raise AttributeError('Unknown similarity function: ' + str(method))


def dist(src, tar, method=sim_levenshtein):
    """Return a distance between two strings.

    This is a generalized function for calling other distance functions.

    :param str src, tar: two strings to be compared
    :param function method: specifies the similarity metric (Levenshtein by
        default) -- Note that this takes a similarity metric function, not
        a distance metric function.
    :returns: distance according to the specified function
    :rtype: float

    >>> dist('cat', 'hat')
    0.33333333333333326
    >>> dist('Niall', 'Neil')
    0.59999999999999998
    >>> dist('aluminum', 'Catalan')
    0.875
    >>> dist('ATCG', 'TAGC')
    0.75
    """
    if callable(method):
        return 1 - method(src, tar)
    else:
        raise AttributeError('Unknown distance function: ' + str(method))
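The sim()/dist() dispatch pattern above can be sketched self-contained: any callable mapping (src, tar) to a similarity in [0, 1] can be plugged in. The hamming_sim metric below is a hypothetical toy example, not part of the module.

```python
def sim(src, tar, method):
    """Dispatch to any callable that maps (src, tar) to a similarity."""
    if callable(method):
        return method(src, tar)
    raise AttributeError('Unknown similarity function: ' + str(method))


def hamming_sim(src, tar):
    """Toy metric (hypothetical): fraction of positions whose characters
    match, normalized by the longer string's length."""
    matches = sum(char1 == char2 for char1, char2 in zip(src, tar))
    return matches / max(len(src), len(tar))
```

For example, sim('cat', 'hat', hamming_sim) matches 2 of 3 positions; the corresponding distance is simply 1 minus the similarity, as in dist() above.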