Completed
Push — master (e4e852...00b1e1)
created 12:16 by Chris

abydos.distance.dist_tversky()   A

Complexity:   Conditions 1
Size:         Total Lines 33, Code Lines 2
Duplication:  Lines 0, Ratio 0 %
Importance:   Changes 0

Metric  Value
cc      1
eloc    2
nop     6
dl      0
loc     33
rs      10
c       0
b       0
f       0
# -*- coding: utf-8 -*-

# Copyright 2014-2018 by Christopher C. Little.
# This file is part of Abydos.
#
# Abydos is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Abydos is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.

"""abydos.distance.

The distance module implements string edit distance functions including:

    - Levenshtein distance (incl. a [0, 1] normalized variant)
    - Optimal String Alignment distance (incl. a [0, 1] normalized variant)
    - Damerau-Levenshtein distance (incl. a [0, 1] normalized variant)
    - Hamming distance (incl. a [0, 1] normalized variant)
    - Tversky index
    - Sørensen–Dice coefficient & distance
    - Jaccard similarity coefficient & distance
    - overlap similarity & distance
    - Tanimoto coefficient & distance
    - Minkowski distance & similarity (incl. a [0, 1] normalized option)
    - Manhattan distance & similarity (incl. a [0, 1] normalized option)
    - Euclidean distance & similarity (incl. a [0, 1] normalized option)
    - Chebyshev distance & similarity (incl. a [0, 1] normalized option)
    - cosine similarity & distance
    - Jaro distance
    - Jaro-Winkler distance (incl. the strcmp95 algorithm variant)
    - Longest common substring
    - Ratcliff-Obershelp similarity & distance
    - Match Rating Algorithm similarity
    - Normalized Compression Distance (NCD) & similarity
    - Monge-Elkan similarity & distance
    - Matrix similarity
    - Needleman-Wunsch score
    - Smith-Waterman score
    - Gotoh score
    - Length similarity
    - Prefix, Suffix, and Identity similarity & distance
    - Modified Language-Independent Product Name Search (MLIPNS) similarity &
      distance
    - Bag distance (incl. a [0, 1] normalized variant)
    - Editex distance (incl. a [0, 1] normalized variant)
    - Eudex distances
    - Sift4 distance
    - TF-IDF similarity

Functions beginning with the prefixes 'sim' and 'dist' are guaranteed to be
in the range [0, 1], and sim_X = 1 - dist_X, since the two are complements.
If a sim_X function is supplied identical src & tar arguments, it is
guaranteed to return 1; the corresponding dist_X function is guaranteed to
return 0.
"""

from __future__ import division, unicode_literals

import codecs
import math
import sys
import types
import unicodedata
from collections import Counter, defaultdict

import numpy as np

from six import text_type
from six.moves import range

from .compression import ac_encode, ac_train, rle_encode
from .phonetic import eudex, mra
from .qgram import QGrams

try:
    import lzma
except ImportError:  # pragma: no cover
    # If the system lacks the lzma library, that's fine, but lzma compression
    # similarity won't be supported.
    pass


def levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
    """Return the Levenshtein distance between two strings.

    Levenshtein distance

    This is the standard edit distance measure. Cf.
    https://en.wikipedia.org/wiki/Levenshtein_distance

    Two additional variants: optimal string alignment (aka restricted
    Damerau-Levenshtein distance) and the Damerau-Levenshtein distance
    are also supported. Cf.
    https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

    The ordinary Levenshtein & Optimal String Alignment distance both
    employ the Wagner-Fischer dynamic programming algorithm. Cf.
    https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm

    Levenshtein edit distance ordinarily has unit insertion, deletion, and
    substitution costs.

    :param str src, tar: two strings to be compared
    :param str mode: specifies a mode for computing the Levenshtein distance:

        - 'lev' (default) computes the ordinary Levenshtein distance,
          in which edits may include inserts, deletes, and substitutions
        - 'osa' computes the Optimal String Alignment distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions but substrings may only be edited once
        - 'dam' computes the Damerau-Levenshtein distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions and substrings may undergo repeated edits

    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits: inserts, deletes, substitutions, and transpositions,
        respectively (by default: (1, 1, 1, 1))
    :returns: the Levenshtein distance between src & tar
    :rtype: int (may return a float if cost has float values)

    >>> levenshtein('cat', 'hat')
    1
    >>> levenshtein('Niall', 'Neil')
    3
    >>> levenshtein('aluminum', 'Catalan')
    7
    >>> levenshtein('ATCG', 'TAGC')
    3

    >>> levenshtein('ATCG', 'TAGC', mode='osa')
    2
    >>> levenshtein('ACTG', 'TAGC', mode='osa')
    4

    >>> levenshtein('ATCG', 'TAGC', mode='dam')
    2
    >>> levenshtein('ACTG', 'TAGC', mode='dam')
    3
    """
    ins_cost, del_cost, sub_cost, trans_cost = cost

    if src == tar:
        return 0
    if not src:
        return len(tar) * ins_cost
    if not tar:
        return len(src) * del_cost

    if 'dam' in mode:
        return damerau_levenshtein(src, tar, cost)

    # pylint: disable=no-member
    d_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.int)
    # pylint: enable=no-member
    for i in range(len(src)+1):
        d_mat[i, 0] = i * del_cost
    for j in range(len(tar)+1):
        d_mat[0, j] = j * ins_cost

    for i in range(len(src)):
        for j in range(len(tar)):
            d_mat[i+1, j+1] = min(
                d_mat[i+1, j] + ins_cost,  # ins
                d_mat[i, j+1] + del_cost,  # del
                d_mat[i, j] + (sub_cost if src[i] != tar[j] else 0)  # sub/==
            )

            if mode == 'osa':
                if ((i+1 > 1 and j+1 > 1 and src[i] == tar[j-1] and
                     src[i-1] == tar[j])):
                    # transposition
                    d_mat[i+1, j+1] = min(d_mat[i+1, j+1],
                                          d_mat[i-1, j-1] + trans_cost)

    return d_mat[len(src), len(tar)]
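
# A sketch of non-unit costs (an illustrative addition): with substitution
# weighted 2, 'cat' -> 'hat' costs as much as a delete plus an insert:
#
#   >>> levenshtein('cat', 'hat', cost=(1, 1, 2, 1))
#   2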


def dist_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
    """Return the normalized Levenshtein distance between two strings.

    Levenshtein distance normalized to the interval [0, 1]

    The Levenshtein distance is normalized by dividing the Levenshtein
    distance (calculated by any of the three supported methods) by the
    greater of the number of characters in src times the cost of a delete
    and the number of characters in tar times the cost of an insert.
    For the case in which all operations have :math:`cost = 1`, this is
    equivalent to the greater of the lengths of the two strings src & tar.

    :param str src, tar: two strings to be compared
    :param str mode: specifies a mode for computing the Levenshtein distance:

        - 'lev' (default) computes the ordinary Levenshtein distance,
          in which edits may include inserts, deletes, and substitutions
        - 'osa' computes the Optimal String Alignment distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions but substrings may only be edited once
        - 'dam' computes the Damerau-Levenshtein distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions and substrings may undergo repeated edits

    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits: inserts, deletes, substitutions, and transpositions,
        respectively (by default: (1, 1, 1, 1))
    :returns: normalized Levenshtein distance
    :rtype: float

    >>> dist_levenshtein('cat', 'hat')
    0.33333333333333331
    >>> dist_levenshtein('Niall', 'Neil')
    0.59999999999999998
    >>> dist_levenshtein('aluminum', 'Catalan')
    0.875
    >>> dist_levenshtein('ATCG', 'TAGC')
    0.75
    """
    if src == tar:
        return 0
    ins_cost, del_cost = cost[:2]
    return (levenshtein(src, tar, mode, cost) /
            (max(len(src)*del_cost, len(tar)*ins_cost)))


def sim_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
    """Return the Levenshtein similarity of two strings.

    Levenshtein similarity normalized to the interval [0, 1]

    Levenshtein similarity is the complement of Levenshtein distance:
    :math:`sim_{Levenshtein} = 1 - dist_{Levenshtein}`

    The arguments are identical to those of the levenshtein() function.

    :param str src, tar: two strings to be compared
    :param str mode: specifies a mode for computing the Levenshtein distance:

        - 'lev' (default) computes the ordinary Levenshtein distance,
          in which edits may include inserts, deletes, and substitutions
        - 'osa' computes the Optimal String Alignment distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions but substrings may only be edited once
        - 'dam' computes the Damerau-Levenshtein distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions and substrings may undergo repeated edits

    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits:
        inserts, deletes, substitutions, and transpositions, respectively
        (by default: (1, 1, 1, 1))
    :returns: normalized Levenshtein similarity
    :rtype: float

    >>> sim_levenshtein('cat', 'hat')
    0.66666666666666674
    >>> sim_levenshtein('Niall', 'Neil')
    0.40000000000000002
    >>> sim_levenshtein('aluminum', 'Catalan')
    0.125
    >>> sim_levenshtein('ATCG', 'TAGC')
    0.25
    """
    return 1 - dist_levenshtein(src, tar, mode, cost)


def damerau_levenshtein(src, tar, cost=(1, 1, 1, 1)):
    """Return the Damerau-Levenshtein distance between two strings.

    Damerau-Levenshtein distance

    This computes the Damerau-Levenshtein distance. Cf.
    https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

    Damerau-Levenshtein code based on Java code by Kevin L. Stern,
    under the MIT license:
    https://github.com/KevinStern/software-and-algorithms/blob/master/src/main/java/blogspot/software_and_algorithms/stern_library/string/DamerauLevenshteinAlgorithm.java

    :param str src, tar: two strings to be compared
    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits:
        inserts, deletes, substitutions, and transpositions, respectively
        (by default: (1, 1, 1, 1))
    :returns: the Damerau-Levenshtein distance between src & tar
    :rtype: int (may return a float if cost has float values)

    >>> damerau_levenshtein('cat', 'hat')
    1
    >>> damerau_levenshtein('Niall', 'Neil')
    3
    >>> damerau_levenshtein('aluminum', 'Catalan')
    7
    >>> damerau_levenshtein('ATCG', 'TAGC')
    2
    """
    ins_cost, del_cost, sub_cost, trans_cost = cost

    if src == tar:
        return 0
    if not src:
        return len(tar) * ins_cost
    if not tar:
        return len(src) * del_cost

    if 2*trans_cost < ins_cost + del_cost:
        raise ValueError('Unsupported cost assignment; the cost of two ' +
                         'transpositions must not be less than the cost of ' +
                         'an insert plus a delete.')

    # pylint: disable=no-member
    d_mat = (np.zeros((len(src))*(len(tar)), dtype=np.int).
             reshape((len(src), len(tar))))
    # pylint: enable=no-member

    if src[0] != tar[0]:
        d_mat[0, 0] = min(sub_cost, ins_cost + del_cost)

    src_index_by_character = {}
    src_index_by_character[src[0]] = 0
    for i in range(1, len(src)):
        del_distance = d_mat[i-1, 0] + del_cost
        ins_distance = (i+1) * del_cost + ins_cost
        match_distance = (i * del_cost +
                          (0 if src[i] == tar[0] else sub_cost))
        d_mat[i, 0] = min(del_distance, ins_distance, match_distance)

    for j in range(1, len(tar)):
        del_distance = (j+1) * ins_cost + del_cost
        ins_distance = d_mat[0, j-1] + ins_cost
        match_distance = (j * ins_cost +
                          (0 if src[0] == tar[j] else sub_cost))
        d_mat[0, j] = min(del_distance, ins_distance, match_distance)

    for i in range(1, len(src)):
        max_src_letter_match_index = (0 if src[i] == tar[0] else -1)
        for j in range(1, len(tar)):
            candidate_swap_index = (-1 if tar[j] not in
                                    src_index_by_character else
                                    src_index_by_character[tar[j]])
            j_swap = max_src_letter_match_index
            del_distance = d_mat[i-1, j] + del_cost
            ins_distance = d_mat[i, j-1] + ins_cost
            match_distance = d_mat[i-1, j-1]
            if src[i] != tar[j]:
                match_distance += sub_cost
            else:
                max_src_letter_match_index = j

            if candidate_swap_index != -1 and j_swap != -1:
                i_swap = candidate_swap_index

                if i_swap == 0 and j_swap == 0:
                    pre_swap_cost = 0
                else:
                    pre_swap_cost = d_mat[max(0, i_swap-1), max(0, j_swap-1)]
                swap_distance = (pre_swap_cost + (i - i_swap - 1) *
                                 del_cost + (j - j_swap - 1) * ins_cost +
                                 trans_cost)
            else:
                swap_distance = sys.maxsize

            d_mat[i, j] = min(del_distance, ins_distance,
                              match_distance, swap_distance)
        src_index_by_character[src[i]] = i

    return d_mat[len(src)-1, len(tar)-1]
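
# A sketch of the cost constraint enforced above (an illustrative addition):
# two transpositions (2*1) would undercut an insert plus a delete (2+2), so
# this raises a ValueError:
#
#   >>> damerau_levenshtein('ab', 'ba', cost=(2, 2, 2, 1))
#   Traceback (most recent call last):
#       ...
#   ValueError: Unsupported cost assignment; ...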


def dist_damerau(src, tar, cost=(1, 1, 1, 1)):
    """Return the normalized Damerau-Levenshtein distance between two strings.

    Damerau-Levenshtein distance normalized to the interval [0, 1]

    The Damerau-Levenshtein distance is normalized by dividing the
    Damerau-Levenshtein distance by the greater of
    the number of characters in src times the cost of a delete and
    the number of characters in tar times the cost of an insert.
    For the case in which all operations have :math:`cost = 1`, this is
    equivalent to the greater of the lengths of the two strings src & tar.

    The arguments are identical to those of the damerau_levenshtein()
    function.

    :param str src, tar: two strings to be compared
    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits:
        inserts, deletes, substitutions, and transpositions, respectively
        (by default: (1, 1, 1, 1))
    :returns: normalized Damerau-Levenshtein distance
    :rtype: float

    >>> dist_damerau('cat', 'hat')
    0.33333333333333331
    >>> dist_damerau('Niall', 'Neil')
    0.59999999999999998
    >>> dist_damerau('aluminum', 'Catalan')
    0.875
    >>> dist_damerau('ATCG', 'TAGC')
    0.5
    """
    if src == tar:
        return 0
    ins_cost, del_cost = cost[:2]
    return (damerau_levenshtein(src, tar, cost) /
            (max(len(src)*del_cost, len(tar)*ins_cost)))


def sim_damerau(src, tar, cost=(1, 1, 1, 1)):
    """Return the Damerau-Levenshtein similarity of two strings.

    Damerau-Levenshtein similarity normalized to the interval [0, 1]

    Damerau-Levenshtein similarity is the complement of Damerau-Levenshtein
    distance:
    :math:`sim_{Damerau} = 1 - dist_{Damerau}`

    The arguments are identical to those of the damerau_levenshtein()
    function.

    :param str src, tar: two strings to be compared
    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits:
        inserts, deletes, substitutions, and transpositions, respectively
        (by default: (1, 1, 1, 1))
    :returns: normalized Damerau-Levenshtein similarity
    :rtype: float

    >>> sim_damerau('cat', 'hat')
    0.66666666666666674
    >>> sim_damerau('Niall', 'Neil')
    0.40000000000000002
    >>> sim_damerau('aluminum', 'Catalan')
    0.125
    >>> sim_damerau('ATCG', 'TAGC')
    0.5
    """
    return 1 - dist_damerau(src, tar, cost)


def hamming(src, tar, difflens=True):
    """Return the Hamming distance between two strings.

    Hamming distance

    Hamming distance equals the number of character positions at which two
    strings differ. For strings of unequal lengths, it is not normally
    defined. By default, this implementation calculates the Hamming distance
    of the first n characters, where n is the lesser of the two strings'
    lengths, and adds to this the difference in string lengths.

    :param str src, tar: two strings to be compared
    :param bool difflens:
        If True (default), this returns the Hamming distance for those
        characters that have a matching character in both strings plus the
        difference in the strings' lengths. This is equivalent to extending
        the shorter string with obligatorily non-matching characters.
        If False, an exception is raised in the case of strings of unequal
        lengths.
    :returns: the Hamming distance between src & tar
    :rtype: int

    >>> hamming('cat', 'hat')
    1
    >>> hamming('Niall', 'Neil')
    3
    >>> hamming('aluminum', 'Catalan')
    8
    >>> hamming('ATCG', 'TAGC')
    4
    """
    if not difflens and len(src) != len(tar):
        raise ValueError('Undefined for sequences of unequal length; set ' +
                         'difflens to True for Hamming distance between ' +
                         'strings of unequal lengths.')

    hdist = 0
    if difflens:
        hdist += abs(len(src)-len(tar))
    hdist += sum(c1 != c2 for c1, c2 in zip(src, tar))

    return hdist
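
# A sketch of the difflens behavior (an illustrative addition): the length
# difference counts as mismatches by default, or can raise an exception:
#
#   >>> hamming('cat', 'cats')
#   1
#   >>> hamming('cat', 'cats', difflens=False)
#   Traceback (most recent call last):
#       ...
#   ValueError: Undefined for sequences of unequal length; ...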


def dist_hamming(src, tar, difflens=True):
    """Return the normalized Hamming distance between two strings.

    Hamming distance normalized to the interval [0, 1]

    The Hamming distance is normalized by dividing it by the greater of the
    number of characters in src & tar (unless difflens is set to False, in
    which case an exception is raised for strings of unequal lengths).

    The arguments are identical to those of the hamming() function.

    :param str src, tar: two strings to be compared
    :param bool difflens:
        If True (default), this returns the Hamming distance for those
        characters that have a matching character in both strings plus the
        difference in the strings' lengths. This is equivalent to extending
        the shorter string with obligatorily non-matching characters.
        If False, an exception is raised in the case of strings of unequal
        lengths.
    :returns: normalized Hamming distance
    :rtype: float

    >>> dist_hamming('cat', 'hat')
    0.3333333333333333
    >>> dist_hamming('Niall', 'Neil')
    0.6
    >>> dist_hamming('aluminum', 'Catalan')
    1.0
    >>> dist_hamming('ATCG', 'TAGC')
    1.0
    """
    if src == tar:
        return 0
    return hamming(src, tar, difflens) / max(len(src), len(tar))


def sim_hamming(src, tar, difflens=True):
    """Return the normalized Hamming similarity of two strings.

    Hamming similarity normalized to the interval [0, 1]

    Hamming similarity is the complement of normalized Hamming distance:
    :math:`sim_{Hamming} = 1 - dist_{Hamming}`

    Provided that difflens==True, the Hamming similarity is identical to the
    Language-Independent Product Name Search (LIPNS) similarity score. For
    further information, see the sim_mlipns documentation.

    The arguments are identical to those of the hamming() function.

    :param str src, tar: two strings to be compared
    :param bool difflens:
        If True (default), this returns the Hamming distance for those
        characters that have a matching character in both strings plus the
        difference in the strings' lengths. This is equivalent to extending
        the shorter string with obligatorily non-matching characters.
        If False, an exception is raised in the case of strings of unequal
        lengths.
    :returns: normalized Hamming similarity
    :rtype: float

    >>> sim_hamming('cat', 'hat')
    0.6666666666666667
    >>> sim_hamming('Niall', 'Neil')
    0.4
    >>> sim_hamming('aluminum', 'Catalan')
    0.0
    >>> sim_hamming('ATCG', 'TAGC')
    0.0
    """
    return 1 - dist_hamming(src, tar, difflens)


def _get_qgrams(src, tar, qval):
    """Return the Q-Grams in src & tar.

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :return: Q-Grams
    """
    if isinstance(src, Counter) and isinstance(tar, Counter):
        return src, tar
    if qval and qval > 0:
        return QGrams(src, qval), QGrams(tar, qval)
    return Counter(src.strip().split()), Counter(tar.strip().split())
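
# A sketch of the non-q-gram fallback (an illustrative addition): with
# qval=None, the inputs are tokenized into whitespace-delimited words rather
# than q-grams:
#
#   >>> src_toks, tar_toks = _get_qgrams('the cat', 'the hat', None)
#   >>> sorted(src_toks.elements())
#   ['cat', 'the']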


def sim_tversky(src, tar, qval=2, alpha=1, beta=1, bias=None):
    r"""Return the Tversky index of two strings.

    Tversky index

    For two sets X and Y, the Tversky index is defined as:
    :math:`sim_{Tversky}(X, Y) = \\frac{|X \\cap Y|}
    {|X \\cap Y| + \\alpha|X - Y| + \\beta|Y - X|}`

    Cf. https://en.wikipedia.org/wiki/Tversky_index

    :math:`\\alpha = \\beta = 1` is equivalent to the Jaccard & Tanimoto
    similarity coefficients.

    :math:`\\alpha = \\beta = 0.5` is equivalent to the Sørensen-Dice
    similarity coefficient.

    Unequal :math:`\\alpha` and :math:`\\beta` will tend to emphasize one
    or the other set's contributions:

        - :math:`\\alpha > \\beta` emphasizes the contributions of X over Y
        - :math:`\\alpha < \\beta` emphasizes the contributions of Y over X

    The relation of the parameter values to 1 emphasizes different types of
    contributions:

        - :math:`\\alpha, \\beta > 1` emphasizes unique contributions over
          the intersection
        - :math:`\\alpha, \\beta < 1` emphasizes the intersection over
          unique contributions

    The symmetric variant is defined in Jiménez, Sergio, Claudio Becerra, and
    Alexander Gelbukh. 2013. SOFTCARDINALITY-CORE: Improving Text Overlap with
    Distributional Measures for Semantic Textual Similarity. This is activated
    by specifying a bias parameter.
    Cf. http://aclweb.org/anthology/S/S13/S13-1028.pdf

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param float alpha, beta: two Tversky index parameters as described above
    :param float bias: the symmetric Tversky index bias parameter
    :returns: Tversky similarity
    :rtype: float

    >>> sim_tversky('cat', 'hat')
    0.3333333333333333
    >>> sim_tversky('Niall', 'Neil')
    0.2222222222222222
    >>> sim_tversky('aluminum', 'Catalan')
    0.0625
    >>> sim_tversky('ATCG', 'TAGC')
    0.0
    """
    if alpha < 0 or beta < 0:
        raise ValueError('Unsupported weight assignment; alpha and beta ' +
                         'must be greater than or equal to 0.')

    if src == tar:
        return 1.0
    elif not src or not tar:
        return 0.0

    q_src, q_tar = _get_qgrams(src, tar, qval)
    q_src_mag = sum(q_src.values())
    q_tar_mag = sum(q_tar.values())
    q_intersection_mag = sum((q_src & q_tar).values())

    if not q_src or not q_tar:
        return 0.0

    if bias is None:
        return q_intersection_mag / (q_intersection_mag + alpha *
                                     (q_src_mag - q_intersection_mag) +
                                     beta * (q_tar_mag - q_intersection_mag))

    a_val = min(q_src_mag - q_intersection_mag,
                q_tar_mag - q_intersection_mag)
    b_val = max(q_src_mag - q_intersection_mag,
                q_tar_mag - q_intersection_mag)
    c_val = q_intersection_mag + bias
    return c_val / (beta * (alpha * a_val + (1 - alpha) * b_val) + c_val)
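
# A sketch of the parameter equivalences noted above (an illustrative
# addition); these hold by construction, since sim_jaccard() and sim_dice()
# below delegate to sim_tversky():
#
#   >>> sim_tversky('Niall', 'Neil') == sim_jaccard('Niall', 'Neil')
#   True
#   >>> sim_tversky('Niall', 'Neil', 2, 0.5, 0.5) == sim_dice('Niall', 'Neil')
#   True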


def dist_tversky(src, tar, qval=2, alpha=1, beta=1, bias=None):
    """Return the Tversky distance between two strings.

    Tversky distance

    Tversky distance is the complement of the Tversky index (similarity):
    :math:`dist_{Tversky} = 1 - sim_{Tversky}`

    The symmetric variant is defined in Jiménez, Sergio, Claudio Becerra, and
    Alexander Gelbukh. 2013. SOFTCARDINALITY-CORE: Improving Text Overlap with
    Distributional Measures for Semantic Textual Similarity. This is activated
    by specifying a bias parameter.
    Cf. http://aclweb.org/anthology/S/S13/S13-1028.pdf

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param float alpha, beta: two Tversky index parameters as described in
        sim_tversky()
    :param float bias: the symmetric Tversky index bias parameter
    :returns: Tversky distance
    :rtype: float

    >>> dist_tversky('cat', 'hat')
    0.6666666666666667
    >>> dist_tversky('Niall', 'Neil')
    0.7777777777777778
    >>> dist_tversky('aluminum', 'Catalan')
    0.9375
    >>> dist_tversky('ATCG', 'TAGC')
    1.0
    """
    return 1 - sim_tversky(src, tar, qval, alpha, beta, bias)


def sim_dice(src, tar, qval=2):
    r"""Return the Sørensen–Dice coefficient of two strings.

    Sørensen–Dice coefficient

    For two sets X and Y, the Sørensen–Dice coefficient is
    :math:`sim_{dice}(X, Y) = \\frac{2 \\cdot |X \\cap Y|}{|X| + |Y|}`

    This is identical to the Tversky index for
    :math:`\\alpha = \\beta = 0.5`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: Sørensen–Dice similarity
    :rtype: float

    >>> sim_dice('cat', 'hat')
    0.5
    >>> sim_dice('Niall', 'Neil')
    0.36363636363636365
    >>> sim_dice('aluminum', 'Catalan')
    0.11764705882352941
    >>> sim_dice('ATCG', 'TAGC')
    0.0
    """
    return sim_tversky(src, tar, qval, 0.5, 0.5)


def dist_dice(src, tar, qval=2):
    """Return the Sørensen–Dice distance between two strings.

    Sørensen–Dice distance

    Sørensen–Dice distance is the complement of the Sørensen–Dice
    coefficient:
    :math:`dist_{dice} = 1 - sim_{dice}`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: Sørensen–Dice distance
    :rtype: float

    >>> dist_dice('cat', 'hat')
    0.5
    >>> dist_dice('Niall', 'Neil')
    0.6363636363636364
    >>> dist_dice('aluminum', 'Catalan')
    0.8823529411764706
    >>> dist_dice('ATCG', 'TAGC')
    1.0
    """
    return 1 - sim_dice(src, tar, qval)


def sim_jaccard(src, tar, qval=2):
    r"""Return the Jaccard similarity of two strings.

    Jaccard similarity coefficient

    For two sets X and Y, the Jaccard similarity coefficient is
    :math:`sim_{jaccard}(X, Y) = \\frac{|X \\cap Y|}{|X \\cup Y|}`

    This is identical to the Tanimoto similarity coefficient
    and the Tversky index for :math:`\\alpha = \\beta = 1`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: Jaccard similarity
    :rtype: float

    >>> sim_jaccard('cat', 'hat')
    0.3333333333333333
    >>> sim_jaccard('Niall', 'Neil')
    0.2222222222222222
    >>> sim_jaccard('aluminum', 'Catalan')
    0.0625
    >>> sim_jaccard('ATCG', 'TAGC')
    0.0
    """
    return sim_tversky(src, tar, qval, 1, 1)


def dist_jaccard(src, tar, qval=2):
    """Return the Jaccard distance between two strings.

    Jaccard distance

    Jaccard distance is the complement of the Jaccard similarity coefficient:
    :math:`dist_{Jaccard} = 1 - sim_{Jaccard}`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: Jaccard distance
    :rtype: float

    >>> dist_jaccard('cat', 'hat')
    0.6666666666666667
    >>> dist_jaccard('Niall', 'Neil')
    0.7777777777777778
    >>> dist_jaccard('aluminum', 'Catalan')
    0.9375
    >>> dist_jaccard('ATCG', 'TAGC')
    1.0
    """
    return 1 - sim_jaccard(src, tar, qval)


def sim_overlap(src, tar, qval=2):
    r"""Return the overlap coefficient of two strings.

    Overlap coefficient

    For two sets X and Y, the overlap coefficient is
    :math:`sim_{overlap}(X, Y) = \\frac{|X \\cap Y|}{min(|X|, |Y|)}`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: overlap similarity
    :rtype: float

    >>> sim_overlap('cat', 'hat')
    0.5
    >>> sim_overlap('Niall', 'Neil')
    0.4
    >>> sim_overlap('aluminum', 'Catalan')
    0.125
    >>> sim_overlap('ATCG', 'TAGC')
    0.0
    """
    if src == tar:
        return 1.0
    elif not src or not tar:
        return 0.0

    q_src, q_tar = _get_qgrams(src, tar, qval)
    q_src_mag = sum(q_src.values())
    q_tar_mag = sum(q_tar.values())
    q_intersection_mag = sum((q_src & q_tar).values())

    return q_intersection_mag / min(q_src_mag, q_tar_mag)


def dist_overlap(src, tar, qval=2):
    """Return the overlap distance between two strings.

    Overlap distance

    Overlap distance is the complement of the overlap coefficient:
    :math:`dist_{overlap} = 1 - sim_{overlap}`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: overlap distance
    :rtype: float

    >>> dist_overlap('cat', 'hat')
    0.5
    >>> dist_overlap('Niall', 'Neil')
    0.6
    >>> dist_overlap('aluminum', 'Catalan')
    0.875
    >>> dist_overlap('ATCG', 'TAGC')
    1.0
    """
    return 1 - sim_overlap(src, tar, qval)


def sim_tanimoto(src, tar, qval=2):
    r"""Return the Tanimoto similarity of two strings.

    Tanimoto similarity

    For two sets X and Y, the Tanimoto similarity coefficient is
    :math:`sim_{Tanimoto}(X, Y) = \\frac{|X \\cap Y|}{|X \\cup Y|}`

    This is identical to the Jaccard similarity coefficient
    and the Tversky index for :math:`\\alpha = \\beta = 1`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: Tanimoto similarity
    :rtype: float

    >>> sim_tanimoto('cat', 'hat')
    0.3333333333333333
    >>> sim_tanimoto('Niall', 'Neil')
    0.2222222222222222
    >>> sim_tanimoto('aluminum', 'Catalan')
    0.0625
    >>> sim_tanimoto('ATCG', 'TAGC')
    0.0
    """
    return sim_jaccard(src, tar, qval)


def tanimoto(src, tar, qval=2):
    """Return the Tanimoto distance between two strings.

    Tanimoto distance

    Tanimoto distance is :math:`-log_{2} sim_{Tanimoto}`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: Tanimoto distance
    :rtype: float

    >>> tanimoto('cat', 'hat')
    -1.5849625007211563
    >>> tanimoto('Niall', 'Neil')
    -2.1699250014423126
    >>> tanimoto('aluminum', 'Catalan')
    -4.0
    >>> tanimoto('ATCG', 'TAGC')
    -inf
    """
    coeff = sim_jaccard(src, tar, qval)
    if coeff != 0:
        return math.log(coeff, 2)

    return float('-inf')


def minkowski(src, tar, qval=2, pval=1, normalize=False):
    """Return the Minkowski distance (:math:`L^p-norm`) of two strings.

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param int or float pval: the :math:`p`-value of the :math:`L^p`-space
    :param bool normalize: normalizes to [0, 1] if True
    :returns: the Minkowski distance
    :rtype: float
    """
    q_src, q_tar = _get_qgrams(src, tar, qval)
    diffs = ((q_src - q_tar) + (q_tar - q_src)).values()

    normalizer = 1
    if normalize:
        totals = (q_src + q_tar).values()
        if pval == 0:
            normalizer = len(totals)
        else:
            normalizer = sum(_**pval for _ in totals)**(1 / pval)

    if pval == float('inf'):
        # Chebyshev distance
        return max(diffs)/normalizer
    if pval == 0:
        # This is the l_0 "norm" as developed by David Donoho
        return len(diffs)
    return sum(_**pval for _ in diffs)**(1 / pval)/normalizer
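
# A sketch of the special cases wrapped below (an illustrative addition):
# the Manhattan, Euclidean, and Chebyshev distances are minkowski() with
# pval 1, 2, and infinity, respectively:
#
#   >>> minkowski('Niall', 'Neil', pval=1) == manhattan('Niall', 'Neil')
#   True
#   >>> minkowski('Niall', 'Neil', pval=2) == euclidean('Niall', 'Neil')
#   True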


def dist_minkowski(src, tar, qval=2, pval=1):
    """Return Minkowski distance of two strings, normalized to [0, 1].

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param int or float pval: the :math:`p`-value of the :math:`L^p`-space
    :returns: normalized Minkowski distance
    :rtype: float
    """
    return minkowski(src, tar, qval, pval, True)


def sim_minkowski(src, tar, qval=2, pval=1):
    """Return Minkowski similarity of two strings, normalized to [0, 1].

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param int or float pval: the :math:`p`-value of the :math:`L^p`-space
    :returns: normalized Minkowski similarity
    :rtype: float
    """
    return 1-minkowski(src, tar, qval, pval, True)


def manhattan(src, tar, qval=2, normalize=False):
    """Return the Manhattan distance between two strings.

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param bool normalize: normalizes to [0, 1] if True
    :returns: the Manhattan distance
    :rtype: float
    """
    return minkowski(src, tar, qval, 1, normalize)


def dist_manhattan(src, tar, qval=2):
    """Return the Manhattan distance between two strings, normalized to
    [0, 1].

    This is identical to Canberra distance.

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: normalized Manhattan distance
    :rtype: float
    """
    return manhattan(src, tar, qval, True)


def sim_manhattan(src, tar, qval=2):
    """Return the Manhattan similarity of two strings, normalized to [0, 1].

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: normalized Manhattan similarity
    :rtype: float
    """
    return 1-manhattan(src, tar, qval, True)


def euclidean(src, tar, qval=2, normalize=False):
    """Return the Euclidean distance between two strings.

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param bool normalize: normalizes to [0, 1] if True
    :returns: the Euclidean distance
    :rtype: float
    """
    return minkowski(src, tar, qval, 2, normalize)


def dist_euclidean(src, tar, qval=2):
    """Return the Euclidean distance between two strings, normalized to
    [0, 1].

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: normalized Euclidean distance
    :rtype: float
    """
    return euclidean(src, tar, qval, True)


def sim_euclidean(src, tar, qval=2):
    """Return the Euclidean similarity of two strings, normalized to [0, 1].

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: normalized Euclidean similarity
    :rtype: float
    """
    return 1-euclidean(src, tar, qval, True)


def chebyshev(src, tar, qval=2, normalize=False):
    """Return the Chebyshev distance between two strings.

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param bool normalize: normalizes to [0, 1] if True
    :returns: the Chebyshev distance
    :rtype: float
    """
    return minkowski(src, tar, qval, float('inf'), normalize)


def dist_chebyshev(src, tar, qval=2):
    """Return the Chebyshev distance between two strings, normalized to
    [0, 1].

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: normalized Chebyshev distance
    :rtype: float
    """
    return chebyshev(src, tar, qval, True)


def sim_chebyshev(src, tar, qval=2):
    """Return the Chebyshev similarity of two strings, normalized to [0, 1].

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: normalized Chebyshev similarity
    :rtype: float
    """
    return 1 - chebyshev(src, tar, qval, True)


def sim_cosine(src, tar, qval=2):
    r"""Return the cosine similarity of two strings.

    Cosine similarity (Ochiai coefficient)

    For two sets X and Y, the cosine similarity (Ochiai coefficient) is:
    :math:`sim_{cosine}(X, Y) = \\frac{|X \\cap Y|}{\\sqrt{|X| \\cdot |Y|}}`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: cosine similarity
    :rtype: float

    >>> sim_cosine('cat', 'hat')
    0.5
    >>> sim_cosine('Niall', 'Neil')
    0.3651483716701107
    >>> sim_cosine('aluminum', 'Catalan')
    0.11785113019775793
    >>> sim_cosine('ATCG', 'TAGC')
    0.0
    """
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0

    q_src, q_tar = _get_qgrams(src, tar, qval)
    q_src_mag = sum(q_src.values())
    q_tar_mag = sum(q_tar.values())
    q_intersection_mag = sum((q_src & q_tar).values())

    return q_intersection_mag / math.sqrt(q_src_mag * q_tar_mag)


def dist_cosine(src, tar, qval=2):
    """Return the cosine distance between two strings.

    Cosine distance

    Cosine distance is the complement of cosine similarity:
    :math:`dist_{cosine} = 1 - sim_{cosine}`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: cosine distance
    :rtype: float

    >>> dist_cosine('cat', 'hat')
    0.5
    >>> dist_cosine('Niall', 'Neil')
    0.6348516283298893
    >>> dist_cosine('aluminum', 'Catalan')
    0.882148869802242
    >>> dist_cosine('ATCG', 'TAGC')
    1.0
    """
    return 1 - sim_cosine(src, tar, qval)


def sim_strcmp95(src, tar, long_strings=False):
    """Return the strcmp95 similarity of two strings.

    strcmp95 similarity

    This is a Python translation of the C code for strcmp95:
    http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c
    The above file is a US Government publication and, accordingly,
    in the public domain.

    This is based on the Jaro-Winkler distance, but also attempts to correct
    for some common typos and frequently confused characters. It is also
    limited to uppercase ASCII characters, so it is appropriate for American
    names, but not much else.

    :param str src, tar: two strings to be compared
    :param bool long_strings: set to True to "Increase the probability of a
        match when the number of matched characters is large.  This option
        allows for a little more tolerance when the strings are large. It is
        not an appropriate test when comparing fixed length fields such as
        phone and social security numbers."
    :returns: strcmp95 similarity
    :rtype: float

    >>> sim_strcmp95('cat', 'hat')
    0.7777777777777777
    >>> sim_strcmp95('Niall', 'Neil')
    0.8454999999999999
    >>> sim_strcmp95('aluminum', 'Catalan')
    0.6547619047619048
    >>> sim_strcmp95('ATCG', 'TAGC')
    0.8333333333333334
    """
    def _inrange(char):
        """Return True if char is in the range (0, 91)."""
        return ord(char) > 0 and ord(char) < 91

    ying = src.strip().upper()
    yang = tar.strip().upper()

    if ying == yang:
        return 1.0
    # If either string is blank - return - added in Version 2
    if not ying or not yang:
        return 0.0

    adjwt = defaultdict(int)
    sp_mx = (
        ('A', 'E'), ('A', 'I'), ('A', 'O'), ('A', 'U'), ('B', 'V'), ('E', 'I'),
        ('E', 'O'), ('E', 'U'), ('I', 'O'), ('I', 'U'), ('O', 'U'), ('I', 'Y'),
        ('E', 'Y'), ('C', 'G'), ('E', 'F'), ('W', 'U'), ('W', 'V'), ('X', 'K'),
        ('S', 'Z'), ('X', 'S'), ('Q', 'C'), ('U', 'V'), ('M', 'N'), ('L', 'I'),
        ('Q', 'O'), ('P', 'R'), ('I', 'J'), ('2', 'Z'), ('5', 'S'), ('8', 'B'),
        ('1', 'I'), ('1', 'L'), ('0', 'O'), ('0', 'Q'), ('C', 'K'), ('G', 'J')
    )

    # Initialize the adjwt array on the first call to the function only.
    # The adjwt array is used to give partial credit for characters that
    # may be errors due to known phonetic or character recognition errors.
    # A typical example is to match the letter "O" with the number "0"
    for i in sp_mx:
        adjwt[(i[0], i[1])] = 3
        adjwt[(i[1], i[0])] = 3

    if len(ying) > len(yang):
        search_range = len(ying)
        minv = len(yang)
    else:
        search_range = len(yang)
        minv = len(ying)

    # Blank out the flags
    ying_flag = [0] * search_range
    yang_flag = [0] * search_range
    search_range = max(0, search_range // 2 - 1)

    # Looking only within the search range, count and flag the matched pairs.
    num_com = 0
    yl1 = len(yang) - 1
    for i in range(len(ying)):
        lowlim = (i - search_range) if (i >= search_range) else 0
        hilim = (i + search_range) if ((i + search_range) <= yl1) else yl1
        for j in range(lowlim, hilim+1):
            if (yang_flag[j] == 0) and (yang[j] == ying[i]):
                yang_flag[j] = 1
                ying_flag[i] = 1
                num_com += 1
                break

    # If no characters in common - return
    if num_com == 0:
        return 0.0

    # Count the number of transpositions
    k = n_trans = 0
    for i in range(len(ying)):
        if ying_flag[i] != 0:
            for j in range(k, len(yang)):
                if yang_flag[j] != 0:
                    k = j + 1
                    break
            if ying[i] != yang[j]:
                n_trans += 1
    n_trans = n_trans // 2

    # Adjust for similarities in unmatched characters
    n_simi = 0
    if minv > num_com:
        for i in range(len(ying)):
            if ying_flag[i] == 0 and _inrange(ying[i]):
                for j in range(len(yang)):
                    if yang_flag[j] == 0 and _inrange(yang[j]):
                        if (ying[i], yang[j]) in adjwt:
                            n_simi += adjwt[(ying[i], yang[j])]
                            yang_flag[j] = 2
                            break
    num_sim = n_simi/10.0 + num_com

    # Main weight computation
    weight = num_sim / len(ying) + num_sim / len(yang) + \
        (num_com - n_trans) / num_com
    weight = weight / 3.0

    # Continue to boost the weight if the strings are similar
    if weight > 0.7:

        # Adjust for having up to the first 4 characters in common
        j = 4 if (minv >= 4) else minv
        i = 0
        while (i < j) and (ying[i] == yang[i]) and (not ying[i].isdigit()):
            i += 1
        if i:
            weight += i * 0.1 * (1.0 - weight)

        # Optionally adjust for long strings.

        # After agreeing beginning chars, at least two more must agree and
        # the agreeing characters must be > .5 of remaining characters.
        if (((long_strings) and (minv > 4) and (num_com > i+1) and
             (2*num_com >= minv+i))):
            if not ying[0].isdigit():
                weight += (1.0-weight) * ((num_com-i-1) /
                                          (len(ying)+len(yang)-i*2+2))

    return weight


def dist_strcmp95(src, tar, long_strings=False):
    """Return the strcmp95 distance between two strings.

    strcmp95 distance

    strcmp95 distance is 1 - strcmp95 similarity

    :param str src, tar: two strings to be compared
    :param bool long_strings: set to True to "Increase the probability of a
        match when the number of matched characters is large.  This option
        allows for a little more tolerance when the strings are large. It is
        not an appropriate test when comparing fixed length fields such as
        phone and social security numbers."
    :returns: strcmp95 distance
    :rtype: float

    >>> dist_strcmp95('cat', 'hat')
    0.22222222222222232
    >>> dist_strcmp95('Niall', 'Neil')
    0.15450000000000008
    >>> dist_strcmp95('aluminum', 'Catalan')
    0.34523809523809523
    >>> dist_strcmp95('ATCG', 'TAGC')
    0.16666666666666663
    """
    return 1 - sim_strcmp95(src, tar, long_strings)


def sim_jaro_winkler(src, tar, qval=1, mode='winkler', long_strings=False,
                     boost_threshold=0.7, scaling_factor=0.1):
    """Return the Jaro or Jaro-Winkler similarity of two strings.

    Jaro(-Winkler) distance

    This Python code is based on the C code for strcmp95:
    http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c
    The above file is a US Government publication and, accordingly,
    in the public domain.

    :param str src, tar: two strings to be compared
    :param int qval: the length of each q-gram (defaults to 1: character-wise
        matching)
    :param str mode: indicates which variant of this distance metric to
        compute:

            - 'winkler' -- computes the Jaro-Winkler distance (default) which
              increases the score for matches near the start of the word
            - 'jaro' -- computes the Jaro distance

    The following arguments apply only when mode is 'winkler':

    :param bool long_strings: set to True to "Increase the probability of a
        match when the number of matched characters is large.  This option
        allows for a little more tolerance when the strings are large.  It is
        not an appropriate test when comparing fixed length fields such as
        phone and social security numbers."
    :param float boost_threshold: a value between 0 and 1, below which the
        Winkler boost is not applied (defaults to 0.7)
    :param float scaling_factor: a value between 0 and 0.25, indicating by how
        much to boost scores for matching prefixes (defaults to 0.1)

    :returns: Jaro or Jaro-Winkler similarity
    :rtype: float

    >>> sim_jaro_winkler('cat', 'hat')
    0.7777777777777777
    >>> sim_jaro_winkler('Niall', 'Neil')
    0.8049999999999999
    >>> sim_jaro_winkler('aluminum', 'Catalan')
    0.6011904761904762
    >>> sim_jaro_winkler('ATCG', 'TAGC')
    0.8333333333333334

    >>> sim_jaro_winkler('cat', 'hat', mode='jaro')
    0.7777777777777777
    >>> sim_jaro_winkler('Niall', 'Neil', mode='jaro')
    0.7833333333333333
    >>> sim_jaro_winkler('aluminum', 'Catalan', mode='jaro')
    0.6011904761904762
    >>> sim_jaro_winkler('ATCG', 'TAGC', mode='jaro')
    0.8333333333333334
    """
    if mode == 'winkler':
        if boost_threshold > 1 or boost_threshold < 0:
            raise ValueError('Unsupported boost_threshold assignment; ' +
                             'boost_threshold must be between 0 and 1.')
        if scaling_factor > 0.25 or scaling_factor < 0:
            raise ValueError('Unsupported scaling_factor assignment; ' +
                             'scaling_factor must be between 0 and 0.25.')

    if src == tar:
        return 1.0

    src = QGrams(src.strip(), qval).ordered_list
    tar = QGrams(tar.strip(), qval).ordered_list

    lens = len(src)
    lent = len(tar)

    # If either string is blank - return - added in Version 2
    if lens == 0 or lent == 0:
        return 0.0

    if lens > lent:
        search_range = lens
        minv = lent
    else:
        search_range = lent
        minv = lens

    # Zero out the flags
    src_flag = [0] * search_range
    tar_flag = [0] * search_range
    search_range = max(0, search_range//2 - 1)

    # Looking only within the search range, count and flag the matched pairs.
    num_com = 0
    yl1 = lent - 1
    for i in range(lens):
        lowlim = (i - search_range) if (i >= search_range) else 0
        hilim = (i + search_range) if ((i + search_range) <= yl1) else yl1
        for j in range(lowlim, hilim+1):
            if (tar_flag[j] == 0) and (tar[j] == src[i]):
                tar_flag[j] = 1
                src_flag[i] = 1
                num_com += 1
                break

    # If no characters in common - return
    if num_com == 0:
        return 0.0

    # Count the number of transpositions
    k = n_trans = 0
    for i in range(lens):
        if src_flag[i] != 0:
            for j in range(k, lent):
                if tar_flag[j] != 0:
                    k = j + 1
                    break
            if src[i] != tar[j]:
                n_trans += 1
    n_trans = n_trans // 2

    # Main weight computation for Jaro distance
    weight = num_com / lens + num_com / lent + (num_com - n_trans) / num_com
    weight = weight / 3.0

    # Continue to boost the weight if the strings are similar
    # This is the Winkler portion of Jaro-Winkler distance
    if mode == 'winkler' and weight > boost_threshold:

        # Adjust for having up to the first 4 characters in common
        j = 4 if (minv >= 4) else minv
        i = 0
        while (i < j) and (src[i] == tar[i]):
            i += 1
        if i:
            weight += i * scaling_factor * (1.0 - weight)

        # Optionally adjust for long strings.

        # After agreeing beginning chars, at least two more must agree and
        # the agreeing characters must be > .5 of remaining characters.
        if (((long_strings) and (minv > 4) and (num_com > i+1) and
             (2*num_com >= minv+i))):
            weight += (1.0-weight) * ((num_com-i-1) / (lens+lent-i*2+2))

    return weight
1460
def dist_jaro_winkler(src, tar, qval=1, mode='winkler', long_strings=False,
1461
                      boost_threshold=0.7, scaling_factor=0.1):
1462
    """Return the Jaro or Jaro-Winkler distance between two strings.
1463
1464
    Jaro(-Winkler) distance
1465
1466
    Jaro-Winkler distance is 1 - the Jaro-Winkler similarity
1467
1468
    :param str src, tar: two strings to be compared
1469
    :param int qval: the length of each q-gram (defaults to 1: character-wise
1470
        matching)
1471
    :param str mode: indicates which variant of this distance metric to
1472
        compute:
1473
1474
            - 'winkler' -- computes the Jaro-Winkler distance (default) which
1475
              increases the score for matches near the start of the word
1476
            - 'jaro' -- computes the Jaro distance
1477
1478
    The following arguments apply only when mode is 'winkler':
1479
1480
    :param bool long_strings: set to True to "Increase the probability of a
1481
        match when the number of matched characters is large.  This option
1482
        allows for a little more tolerance when the strings are large.  It is
1483
        not an appropriate test when comparing fixed length fields such as
1484
        phone and social security numbers."
1485
    :param float boost_threshold: a value between 0 and 1, below which the
1486
        Winkler boost is not applied (defaults to 0.7)
1487
    :param float scaling_factor: a value between 0 and 0.25, indicating by how
1488
        much to boost scores for matching prefixes (defaults to 0.1)
1489
1490
    :returns: Jaro or Jaro-Winkler distance
1491
    :rtype: float
1492
1493
    >>> dist_jaro_winkler('cat', 'hat')
1494
    0.22222222222222232
1495
    >>> dist_jaro_winkler('Niall', 'Neil')
1496
    0.19500000000000006
1497
    >>> dist_jaro_winkler('aluminum', 'Catalan')
1498
    0.39880952380952384
1499
    >>> dist_jaro_winkler('ATCG', 'TAGC')
1500
    0.16666666666666663
1501
1502
    >>> dist_jaro_winkler('cat', 'hat', mode='jaro')
1503
    0.22222222222222232
1504
    >>> dist_jaro_winkler('Niall', 'Neil', mode='jaro')
1505
    0.21666666666666667
1506
    >>> dist_jaro_winkler('aluminum', 'Catalan', mode='jaro')
1507
    0.39880952380952384
1508
    >>> dist_jaro_winkler('ATCG', 'TAGC', mode='jaro')
1509
    0.16666666666666663
1510
    """
1511
    return 1 - sim_jaro_winkler(src, tar, qval, mode, long_strings,
1512
                                boost_threshold, scaling_factor)
1513
1514
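# A worked sketch of the Winkler boost above (values assume the defaults
# qval=1, boost_threshold=0.7, and scaling_factor=0.1):
#
#     sim_jaro_winkler('DIXON', 'DICKSONX', mode='jaro')     # ~0.7666...
#     sim_jaro_winkler('DIXON', 'DICKSONX', mode='winkler')  # ~0.8133...
#
# The two strings share the 2-character prefix 'DI', so the Jaro score is
# boosted by 2 * 0.1 * (1 - 0.7666...) ~= 0.0467.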
1515
def lcsseq(src, tar):
1516
    """Return the longest common subsequence of two strings.
1517
1518
    Longest common subsequence (LCSseq)
1519
1520
    Based on the dynamic programming algorithm from
1521
    http://rosettacode.org/wiki/Longest_common_subsequence#Dynamic_Programming_6
1522
    This is licensed GFDL 1.2
1523
1524
    Modifications include:
1525
        conversion to a numpy array in place of a list of lists
1526
1527
    :param str src, tar: two strings to be compared
1528
    :returns: the longest common subsequence
1529
    :rtype: str
1530
1531
    >>> lcsseq('cat', 'hat')
1532
    'at'
1533
    >>> lcsseq('Niall', 'Neil')
1534
    'Nil'
1535
    >>> lcsseq('aluminum', 'Catalan')
1536
    'aln'
1537
    >>> lcsseq('ATCG', 'TAGC')
1538
    'AC'
1539
    """
1540
    # pylint: disable=no-member
1541
    lengths = np.zeros((len(src)+1, len(tar)+1), dtype=np.int)
1542
    # pylint: enable=no-member
1543
1544
    # row 0 and column 0 are initialized to 0 already
1545
    for i, src_char in enumerate(src):
1546
        for j, tar_char in enumerate(tar):
1547
            if src_char == tar_char:
1548
                lengths[i+1, j+1] = lengths[i, j] + 1
1549
            else:
1550
                lengths[i+1, j+1] = max(lengths[i+1, j], lengths[i, j+1])
1551
1552
    # read the substring out from the matrix
1553
    result = ''
1554
    i, j = len(src), len(tar)
1555
    while i != 0 and j != 0:
1556
        if lengths[i, j] == lengths[i-1, j]:
1557
            i -= 1
1558
        elif lengths[i, j] == lengths[i, j-1]:
1559
            j -= 1
1560
        else:
1561
            result = src[i-1] + result
1562
            i -= 1
1563
            j -= 1
1564
    return result
1565
1566
1567
def sim_lcsseq(src, tar):
1568
    r"""Return the longest common subsequence similarity of two strings.
1569
1570
    Longest common subsequence similarity (:math:`sim_{LCSseq}`)
1571
1572
    This employs the LCSseq function to derive a similarity metric:
1573
    :math:`sim_{LCSseq}(s,t) = \\frac{|LCSseq(s,t)|}{max(|s|, |t|)}`
1574
1575
    :param str src, tar: two strings to be compared
1576
    :returns: LCSseq similarity
1577
    :rtype: float
1578
1579
    >>> sim_lcsseq('cat', 'hat')
1580
    0.6666666666666666
1581
    >>> sim_lcsseq('Niall', 'Neil')
1582
    0.6
1583
    >>> sim_lcsseq('aluminum', 'Catalan')
1584
    0.375
1585
    >>> sim_lcsseq('ATCG', 'TAGC')
1586
    0.5
1587
    """
1588
    if src == tar:
1589
        return 1.0
1590
    elif not src or not tar:
1591
        return 0.0
1592
    return len(lcsseq(src, tar)) / max(len(src), len(tar))
1593
1594
1595
def dist_lcsseq(src, tar):
1596
    """Return the longest common subsequence distance between two strings.
1597
1598
    Longest common subsequence distance (:math:`dist_{LCSseq}`)
1599
1600
    This employs the LCSseq function to derive a distance metric:
1601
    :math:`dist_{LCSseq}(s,t) = 1 - sim_{LCSseq}(s,t)`
1602
1603
    :param str src, tar: two strings to be compared
1604
    :returns: LCSseq distance
1605
    :rtype: float
1606
1607
    >>> dist_lcsseq('cat', 'hat')
1608
    0.33333333333333337
1609
    >>> dist_lcsseq('Niall', 'Neil')
1610
    0.4
1611
    >>> dist_lcsseq('aluminum', 'Catalan')
1612
    0.625
1613
    >>> dist_lcsseq('ATCG', 'TAGC')
1614
    0.5
1615
    """
1616
    return 1 - sim_lcsseq(src, tar)
1617
1618
1619
def lcsstr(src, tar):
1620
    """Return the longest common substring of two strings.
1621
1622
    Longest common substring (LCSstr)
1623
1624
    Based on the code from
1625
    https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring#Python
1626
    This is licensed Creative Commons: Attribution-ShareAlike 3.0
1627
1628
    Modifications include:
1629
1630
        - conversion to a numpy array in place of a list of lists
1631
        - conversion to Python 2/3-safe range from xrange via six
1632
1633
    :param str src, tar: two strings to be compared
1634
    :returns: the longest common substring
1635
    :rtype: str
1636
1637
    >>> lcsstr('cat', 'hat')
1638
    'at'
1639
    >>> lcsstr('Niall', 'Neil')
1640
    'N'
1641
    >>> lcsstr('aluminum', 'Catalan')
1642
    'al'
1643
    >>> lcsstr('ATCG', 'TAGC')
1644
    'A'
1645
    """
1646
    # pylint: disable=no-member
1647
    lengths = np.zeros((len(src)+1, len(tar)+1), dtype=np.int)
1648
    # pylint: enable=no-member
1649
    longest, i_longest = 0, 0
1650
    for i in range(1, len(src)+1):
1651
        for j in range(1, len(tar)+1):
1652
            if src[i-1] == tar[j-1]:
1653
                lengths[i, j] = lengths[i-1, j-1] + 1
1654
                if lengths[i, j] > longest:
1655
                    longest = lengths[i, j]
1656
                    i_longest = i
1657
            else:
1658
                lengths[i, j] = 0
1659
    return src[i_longest - longest:i_longest]
1660
1661
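# Note the contrast with lcsseq() above: a subsequence may skip characters,
# while a substring must be contiguous. For example (values taken from the
# doctests above):
#     lcsseq('Niall', 'Neil')  ->  'Nil'  (skips are allowed)
#     lcsstr('Niall', 'Neil')  ->  'N'    (longest contiguous run)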
1662
def sim_lcsstr(src, tar):
1663
    r"""Return the longest common substring similarity of two strings.
1664
1665
    Longest common substring similarity (:math:`sim_{LCSstr}`)
1666
1667
    This employs the LCSstr function to derive a similarity metric:
1668
    :math:`sim_{LCSstr}(s,t) = \\frac{|LCSstr(s,t)|}{max(|s|, |t|)}`
1669
1670
    :param str src, tar: two strings to be compared
1671
    :returns: LCSstr similarity
1672
    :rtype: float
1673
1674
    >>> sim_lcsstr('cat', 'hat')
1675
    0.6666666666666666
1676
    >>> sim_lcsstr('Niall', 'Neil')
1677
    0.2
1678
    >>> sim_lcsstr('aluminum', 'Catalan')
1679
    0.25
1680
    >>> sim_lcsstr('ATCG', 'TAGC')
1681
    0.25
1682
    """
1683
    if src == tar:
1684
        return 1.0
1685
    elif not src or not tar:
1686
        return 0.0
1687
    return len(lcsstr(src, tar)) / max(len(src), len(tar))
1688
1689
1690
def dist_lcsstr(src, tar):
1691
    """Return the longest common substring distance between two strings.
1692
1693
    Longest common substring distance (:math:`dist_{LCSstr}`)
1694
1695
    This employs the LCSstr function to derive a distance metric:
1696
    :math:`dist_{LCSstr}(s,t) = 1 - sim_{LCSstr}(s,t)`
1697
1698
    :param str src, tar: two strings to be compared
1699
    :returns: LCSstr distance
1700
    :rtype: float
1701
1702
    >>> dist_lcsstr('cat', 'hat')
1703
    0.33333333333333337
1704
    >>> dist_lcsstr('Niall', 'Neil')
1705
    0.8
1706
    >>> dist_lcsstr('aluminum', 'Catalan')
1707
    0.75
1708
    >>> dist_lcsstr('ATCG', 'TAGC')
1709
    0.75
1710
    """
1711
    return 1 - sim_lcsstr(src, tar)
1712
1713
1714
def sim_ratcliff_obershelp(src, tar):
1715
    """Return the Ratcliff-Obershelp similarity of two strings.
1716
1717
    Ratcliff-Obershelp similarity
1718
1719
    This follows the Ratcliff-Obershelp algorithm to derive a similarity
1720
    measure:
1721
1722
        1. Find the length of the longest common substring in src & tar.
1723
        2. Recurse on the strings to the left & right of this substring
1724
           in src & tar. The base case is a 0 length common substring, in which
1725
           case, return 0. Otherwise, return the sum of the current longest
1726
           common substring and the left & right recursed sums.
1727
        3. Multiply this length by 2 and divide by the sum of the lengths of
1728
           src & tar.
1729
1730
    Cf.
1731
    http://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970
1732
1733
    :param str src, tar: two strings to be compared
1734
    :returns: Ratcliff-Obershelp similarity
1735
    :rtype: float
1736
1737
    >>> sim_ratcliff_obershelp('cat', 'hat')
1738
    0.66666666666666663
1739
    >>> sim_ratcliff_obershelp('Niall', 'Neil')
1740
    0.66666666666666663
1741
    >>> sim_ratcliff_obershelp('aluminum', 'Catalan')
1742
    0.40000000000000002
1743
    >>> sim_ratcliff_obershelp('ATCG', 'TAGC')
1744
    0.5
1745
    """
1746
    def _lcsstr_stl(src, tar):
1747
        """Return start positions & length for Ratcliff-Obershelp.
1748
1749
        Return the start position in the source string, start position in
1750
        the target string, and length of the longest common substring of
1751
        strings src and tar.
1752
        """
1753
        # pylint: disable=no-member
1754
        lengths = np.zeros((len(src)+1, len(tar)+1), dtype=np.int)
1755
        # pylint: enable=no-member
1756
        longest, src_longest, tar_longest = 0, 0, 0
1757
        for i in range(1, len(src)+1):
1758
            for j in range(1, len(tar)+1):
1759
                if src[i-1] == tar[j-1]:
1760
                    lengths[i, j] = lengths[i-1, j-1] + 1
1761
                    if lengths[i, j] > longest:
1762
                        longest = lengths[i, j]
1763
                        src_longest = i
1764
                        tar_longest = j
1765
                else:
1766
                    lengths[i, j] = 0
1767
        return (src_longest-longest, tar_longest-longest, longest)
1768
1769
    def _sstr_matches(src, tar):
1770
        """Return the sum of substring match lengths.
1771
1772
        This follows the Ratcliff-Obershelp algorithm:
1773
             1. Find the length of the longest common substring in src & tar.
1774
             2. Recurse on the strings to the left & right of this
1775
                 substring in src & tar.
1776
             3. Base case is a 0 length common substring, in which case,
1777
                 return 0.
1778
             4. Return the sum.
1779
        """
1780
        src_start, tar_start, length = _lcsstr_stl(src, tar)
1781
        if length == 0:
1782
            return 0
1783
        return (_sstr_matches(src[:src_start], tar[:tar_start]) +
1784
                length +
1785
                _sstr_matches(src[src_start+length:], tar[tar_start+length:]))
1786
1787
    if src == tar:
1788
        return 1.0
1789
    elif not src or not tar:
1790
        return 0.0
1791
    return 2*_sstr_matches(src, tar)/(len(src)+len(tar))
1792
1793
1794
def dist_ratcliff_obershelp(src, tar):
1795
    """Return the Ratcliff-Obershelp distance between two strings.
1796
1797
    Ratcliff-Obershelp distance
1798
1799
    Ratcliff-Obershelp distance is the complement of Ratcliff-Obershelp
1800
    similarity:
1801
    :math:`dist_{Ratcliff-Obershelp} = 1 - sim_{Ratcliff-Obershelp}`
1802
1803
    :param str src, tar: two strings to be compared
1804
    :returns: Ratcliff-Obershelp distance
1805
    :rtype: float
1806
1807
    >>> dist_ratcliff_obershelp('cat', 'hat')
1808
    0.33333333333333337
1809
    >>> dist_ratcliff_obershelp('Niall', 'Neil')
1810
    0.33333333333333337
1811
    >>> dist_ratcliff_obershelp('aluminum', 'Catalan')
1812
    0.59999999999999998
1813
    >>> dist_ratcliff_obershelp('ATCG', 'TAGC')
1814
    0.5
1815
    """
1816
    return 1 - sim_ratcliff_obershelp(src, tar)
1817
1818
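# Python's standard library difflib implements a closely related gestalt
# pattern-matching ratio; on short strings like these it agrees with
# sim_ratcliff_obershelp, though difflib applies extra heuristics (e.g.
# autojunk) on longer inputs, so the equivalence is only approximate:
#
#     from difflib import SequenceMatcher
#     SequenceMatcher(None, 'cat', 'hat').ratio()  # 0.666..., as above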
1819
def mra_compare(src, tar):
1820
    """Return the MRA comparison rating of two strings.
1821
1822
    Western Airlines Surname Match Rating Algorithm comparison rating
1823
1824
    A description of the algorithm can be found on page 18 of
1825
    https://archive.org/details/accessingindivid00moor
1826
1827
    :param str src, tar: two strings to be compared
1828
    :returns: MRA comparison rating
1829
    :rtype: int
1830
1831
    >>> mra_compare('cat', 'hat')
1832
    5
1833
    >>> mra_compare('Niall', 'Neil')
1834
    6
1835
    >>> mra_compare('aluminum', 'Catalan')
1836
    0
1837
    >>> mra_compare('ATCG', 'TAGC')
1838
    5
1839
    """
1840
    if src == tar:
1841
        return 6
1842
    if src == '' or tar == '':
1843
        return 0
1844
    src = list(mra(src))
1845
    tar = list(mra(tar))
1846
1847
    if abs(len(src)-len(tar)) > 2:
1848
        return 0
1849
1850
    length_sum = len(src) + len(tar)
1851
    if length_sum < 5:
1852
        min_rating = 5
1853
    elif length_sum < 8:
1854
        min_rating = 4
1855
    elif length_sum < 12:
1856
        min_rating = 3
1857
    else:
1858
        min_rating = 2
1859
1860
    for _ in range(2):
1861
        new_src = []
1862
        new_tar = []
1863
        minlen = min(len(src), len(tar))
1864
        for i in range(minlen):
1865
            if src[i] != tar[i]:
1866
                new_src.append(src[i])
1867
                new_tar.append(tar[i])
1868
        src = new_src+src[minlen:]
1869
        tar = new_tar+tar[minlen:]
1870
        src.reverse()
1871
        tar.reverse()
1872
1873
    similarity = 6 - max(len(src), len(tar))
1874
1875
    if similarity >= min_rating:
1876
        return similarity
1877
    return 0
1878
1879
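# A rough trace of mra_compare('Niall', 'Neil'), assuming mra() reduces
# both names to 'NL': the codex lengths differ by 0 (<= 2), the summed
# length of 4 yields a minimum rating of 5, and the comparison passes strip
# all matching characters from both codices, leaving a similarity of
# 6 - 0 = 6 >= 5, so 6 is returned.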
1880
def sim_mra(src, tar):
1881
    """Return the normalized MRA similarity of two strings.
1882
1883
    Normalized Match Rating Algorithm similarity
1884
1885
    This is the MRA normalized to :math:`[0, 1]`, given that MRA itself is
1886
    constrained to the range :math:`[0, 6]`.
1887
1888
    :param str src, tar: two strings to be compared
1889
    :returns: normalized MRA similarity
1890
    :rtype: float
1891
1892
    >>> sim_mra('cat', 'hat')
1893
    0.8333333333333334
1894
    >>> sim_mra('Niall', 'Neil')
1895
    1.0
1896
    >>> sim_mra('aluminum', 'Catalan')
1897
    0.0
1898
    >>> sim_mra('ATCG', 'TAGC')
1899
    0.8333333333333334
1900
    """
1901
    return mra_compare(src, tar)/6
1902
1903
1904
def dist_mra(src, tar):
1905
    """Return the normalized MRA distance between two strings.
1906
1907
    Normalized Match Rating Algorithm distance
1908
1909
    MRA distance is the complement of MRA similarity:
1910
    :math:`dist_{MRA} = 1 - sim_{MRA}`
1911
1912
    :param str src, tar: two strings to be compared
1913
    :returns: normalized MRA distance
1914
    :rtype: float
1915
1916
    >>> dist_mra('cat', 'hat')
1917
    0.16666666666666663
1918
    >>> dist_mra('Niall', 'Neil')
1919
    0.0
1920
    >>> dist_mra('aluminum', 'Catalan')
1921
    1.0
1922
    >>> dist_mra('ATCG', 'TAGC')
1923
    0.16666666666666663
1924
    """
1925
    return 1 - sim_mra(src, tar)
1926
1927
1928
def dist_compression(src, tar, compressor='bz2', probs=None):
1929
    """Return the normalized compression distance between two strings.
1930
1931
    Normalized compression distance (NCD)
1932
1933
    Cf.
1934
    https://en.wikipedia.org/wiki/Normalized_compression_distance#Normalized_compression_distance
1935
1936
    :param str src, tar: two strings to be compared
1937
    :param str compressor: a compression scheme to use for the similarity
1938
        calculation, from the following:
1939
1940
            - `zlib` -- standard zlib/gzip
1941
            - `bz2` -- bzip2 (default)
1942
            - `lzma` -- Lempel–Ziv–Markov chain algorithm
1943
            - `arith` -- arithmetic coding
1944
            - `rle` -- run-length encoding
1945
            - `bwtrle` -- Burrows-Wheeler transform followed by run-length
1946
              encoding
1947
1948
    :param dict probs: a dictionary trained with ac_train (for the arith
1949
        compressor only)
1950
    :returns: compression distance
1951
    :rtype: float
1952
1953
    >>> dist_compression('cat', 'hat')
1954
    0.08
1955
    >>> dist_compression('Niall', 'Neil')
1956
    0.037037037037037035
1957
    >>> dist_compression('aluminum', 'Catalan')
1958
    0.20689655172413793
1959
    >>> dist_compression('ATCG', 'TAGC')
1960
    0.037037037037037035
1961
1962
    >>> dist_compression('Niall', 'Neil', compressor='zlib')
1963
    0.45454545454545453
1964
    >>> dist_compression('Niall', 'Neil', compressor='bz2')
1965
    0.037037037037037035
1966
    >>> dist_compression('Niall', 'Neil', compressor='lzma')
1967
    0.16
1968
    >>> dist_compression('Niall', 'Neil', compressor='arith')
1969
    0.6875
1970
    >>> dist_compression('Niall', 'Neil', compressor='rle')
1971
    1.0
1972
    >>> dist_compression('Niall', 'Neil', compressor='bwtrle')
1973
    0.8333333333333334
1974
    """
1975
    if src == tar:
1976
        return 0.0
1977
1978
    if compressor not in {'arith', 'rle', 'bwtrle'}:
1979
        src = src.encode('utf-8')
1980
        tar = tar.encode('utf-8')
1981
1982
    if compressor == 'bz2':
1983
        src_comp = codecs.encode(src, 'bz2_codec')[15:]
1984
        tar_comp = codecs.encode(tar, 'bz2_codec')[15:]
1985
        concat_comp = codecs.encode(src+tar, 'bz2_codec')[15:]
1986
        concat_comp2 = codecs.encode(tar+src, 'bz2_codec')[15:]
1987
    elif compressor == 'lzma':
1988
        if 'lzma' in sys.modules:
1989
            src_comp = lzma.compress(src)[14:]
1990
            tar_comp = lzma.compress(tar)[14:]
1991
            concat_comp = lzma.compress(src+tar)[14:]
1992
            concat_comp2 = lzma.compress(tar+src)[14:]
1993
        else:  # pragma: no cover
1994
            raise ValueError('Install the PylibLZMA module in order to use ' +
1995
                             'lzma compression similarity')
1996
    elif compressor == 'arith':
1997
        if probs is None:
1998
            # lacking a reasonable dictionary, train on the strings themselves
1999
            probs = ac_train(src+tar)
2000
        src_comp = ac_encode(src, probs)[1]
2001
        tar_comp = ac_encode(tar, probs)[1]
2002
        concat_comp = ac_encode(src+tar, probs)[1]
2003
        concat_comp2 = ac_encode(tar+src, probs)[1]
2004
        return ((min(concat_comp, concat_comp2) - min(src_comp, tar_comp)) /
2005
                max(src_comp, tar_comp))
2006
    elif compressor in {'rle', 'bwtrle'}:
2007
        src_comp = rle_encode(src, (compressor == 'bwtrle'))
2008
        tar_comp = rle_encode(tar, (compressor == 'bwtrle'))
2009
        concat_comp = rle_encode(src+tar, (compressor == 'bwtrle'))
2010
        concat_comp2 = rle_encode(tar+src, (compressor == 'bwtrle'))
2011
    else:  # zlib
2012
        src_comp = codecs.encode(src, 'zlib_codec')[2:]
2013
        tar_comp = codecs.encode(tar, 'zlib_codec')[2:]
2014
        concat_comp = codecs.encode(src+tar, 'zlib_codec')[2:]
2015
        concat_comp2 = codecs.encode(tar+src, 'zlib_codec')[2:]
2016
    return ((min(len(concat_comp), len(concat_comp2)) -
2017
             min(len(src_comp), len(tar_comp))) /
2018
            max(len(src_comp), len(tar_comp)))
2019
2020
2021
def sim_compression(src, tar, compressor='bz2', probs=None):
2022
    """Return the normalized compression similarity of two strings.
2023
2024
    Normalized compression similarity (NCS)
2025
2026
    Normalized compression similarity is the complement of normalized
2027
    compression distance:
2028
    :math:`sim_{NCS} = 1 - dist_{NCD}`
2029
2030
    :param str src, tar: two strings to be compared
2031
    :param str compressor: a compression scheme to use for the similarity
2032
        calculation:
2033
2034
            - `zlib` -- standard zlib/gzip
2035
            - `bz2` -- bzip2 (default)
2036
            - `lzma` -- Lempel–Ziv–Markov chain algorithm
2037
            - `arith` -- arithmetic coding
2038
            - `rle` -- run-length encoding
2039
            - `bwtrle` -- Burrows-Wheeler transform followed by run-length
2040
              encoding
2041
2042
    :param dict probs: a dictionary trained with ac_train (for the arith
2043
        compressor only)
2044
    :returns: compression similarity
2045
    :rtype: float
2046
2047
    >>> sim_compression('cat', 'hat')
2048
    0.92
2049
    >>> sim_compression('Niall', 'Neil')
2050
    0.962962962962963
2051
    >>> sim_compression('aluminum', 'Catalan')
2052
    0.7931034482758621
2053
    >>> sim_compression('ATCG', 'TAGC')
2054
    0.962962962962963
2055
2056
    >>> sim_compression('Niall', 'Neil', compressor='zlib')
2057
    0.5454545454545454
2058
    >>> sim_compression('Niall', 'Neil', compressor='bz2')
2059
    0.962962962962963
2060
    >>> sim_compression('Niall', 'Neil', compressor='lzma')
2061
    0.84
2062
    >>> sim_compression('Niall', 'Neil', compressor='arith')
2063
    0.3125
2064
    >>> sim_compression('Niall', 'Neil', compressor='rle')
2065
    0.0
2066
    >>> sim_compression('Niall', 'Neil', compressor='bwtrle')
2067
    0.16666666666666663
2068
    """
2069
    return 1 - dist_compression(src, tar, compressor, probs)
2070
2071
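# A minimal standalone sketch of the same NCD idea using only the standard
# zlib module (unlike dist_compression above, it does not strip compressor
# header bytes or try both concatenation orders, so values differ slightly):
#
#     import zlib
#     def _ncd_sketch(x, y):
#         # x and y are bytes objects
#         c_x = len(zlib.compress(x))
#         c_y = len(zlib.compress(y))
#         c_xy = len(zlib.compress(x + y))
#         return (c_xy - min(c_x, c_y)) / max(c_x, c_y)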
2072
def sim_monge_elkan(src, tar, sim_func=sim_levenshtein, symmetric=False):
2073
    """Return the Monge-Elkan similarity of two strings.
2074
2075
    Monge-Elkan similarity
2076
2077
    Monge-Elkan is defined in:
2078
    Monge, Alvaro E. and Charles P. Elkan. 1996. "The field matching problem:
2079
    Algorithms and applications." KDD-9 Proceedings.
2080
    http://www.aaai.org/Papers/KDD/1996/KDD96-044.pdf
2081
2082
    Note: Monge-Elkan is NOT a symmetric similarity algorithm. Thus, the
2083
    similarity of src to tar is not necessarily equal to the similarity of
2084
    tar to src. If the symmetric argument is True, a symmetric value is
2085
    calculated, at the cost of doubling the computation time (since the
2086
    :math:`sim_{Monge-Elkan}(src, tar)` and
2087
    :math:`sim_{Monge-Elkan}(tar, src)` are both calculated and then averaged).
2088
2089
    :param str src, tar: two strings to be compared
2090
    :param function sim_func: the internal similarity metric to employ
2091
    :param bool symmetric: return a symmetric similarity measure
2092
    :returns: Monge-Elkan similarity
2093
    :rtype: float
2094
2095
    >>> sim_monge_elkan('cat', 'hat')
2096
    0.75
2097
    >>> sim_monge_elkan('Niall', 'Neil')
2098
    0.66666666666666663
2099
    >>> sim_monge_elkan('aluminum', 'Catalan')
2100
    0.3888888888888889
2101
    >>> sim_monge_elkan('ATCG', 'TAGC')
2102
    0.5
2103
    """
2104
    if src == tar:
2105
        return 1.0
2106
2107
    q_src = sorted(QGrams(src).elements())
2108
    q_tar = sorted(QGrams(tar).elements())
2109
2110
    if not q_src or not q_tar:
2111
        return 0.0
2112
2113
    sum_of_maxes = 0
2114
    for q_s in q_src:
2115
        max_sim = float('-inf')
2116
        for q_t in q_tar:
2117
            max_sim = max(max_sim, sim_func(q_s, q_t))
2118
        sum_of_maxes += max_sim
2119
    sim_em = sum_of_maxes / len(q_src)
2120
2121
    if symmetric:
2122
        sim_em = (sim_em + sim_monge_elkan(tar, src, sim_func, False))/2
2123
2124
    return sim_em
2125
2126
2127
def dist_monge_elkan(src, tar, sim_func=sim_levenshtein, symmetric=False):
2128
    """Return the Monge-Elkan distance between two strings.
2129
2130
    Monge-Elkan distance
2131
2132
    Monge-Elkan is defined in:
2133
    Monge, Alvaro E. and Charles P. Elkan. 1996. "The field matching problem:
2134
    Algorithms and applications." KDD-96 Proceedings.
2135
    http://www.aaai.org/Papers/KDD/1996/KDD96-044.pdf
2136
2137
    Note: Monge-Elkan is NOT a symmetric similarity algorithm. Thus, the
2138
    distance between src and tar is not necessarily equal to the distance
2139
    between tar and src. If the symmetric argument is True, a symmetric
2140
    value is calculated, at the cost of doubling the computation time (since
2141
    :math:`sim_{Monge-Elkan}(src, tar)` and :math:`sim_{Monge-Elkan}(tar, src)`
2142
    are both calculated and then averaged).
2143
2144
    :param str src, tar: two strings to be compared
2145
    :param function sim_func: the internal similarity metric to employ
2146
    :param bool symmetric: return a symmetric similarity measure
2147
    :returns: Monge-Elkan distance
2148
    :rtype: float
2149
2150
    >>> dist_monge_elkan('cat', 'hat')
2151
    0.25
2152
    >>> dist_monge_elkan('Niall', 'Neil')
2153
    0.33333333333333337
2154
    >>> dist_monge_elkan('aluminum', 'Catalan')
2155
    0.61111111111111116
2156
    >>> dist_monge_elkan('ATCG', 'TAGC')
2157
    0.5
2158
    """
2159
    return 1 - sim_monge_elkan(src, tar, sim_func, symmetric)
2160
2161
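# Any [0, 1] similarity from this module can serve as the internal metric;
# e.g. (illustrative only, output not shown):
#
#     sim_monge_elkan('Niall', 'Neil', sim_func=sim_jaro_winkler)
#
# With symmetric=True the score is averaged with the tar-to-src direction,
# which removes the asymmetry noted in the docstrings above.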
2162
def sim_ident(src, tar):
2163
    """Return the identity similarity of two strings.
2164
2165
    Identity similarity
2166
2167
    This is 1 if the two strings are identical, otherwise 0.
2168
2169
    :param str src, tar: two strings to be compared
2170
    :returns: identity similarity
2171
    :rtype: int
2172
2173
    >>> sim_ident('cat', 'hat')
2174
    0
2175
    >>> sim_ident('cat', 'cat')
2176
    1
2177
    """
2178
    return int(src == tar)
2179
2180
2181
def dist_ident(src, tar):
2182
    """Return the identity distance between two strings.
2183
2184
    Identity distance
2185
2186
    This is 0 if the two strings are identical, otherwise 1, i.e.
2187
    :math:`dist_{identity} = 1 - sim_{identity}`
2188
2189
    :param str src, tar: two strings to be compared
2190
    :returns: identity distance
2191
    :rtype: int
2192
2193
    >>> dist_ident('cat', 'hat')
2194
    1
2195
    >>> dist_ident('cat', 'cat')
2196
    0
2197
    """
2198
    return 1 - sim_ident(src, tar)
2199
2200
2201
def sim_matrix(src, tar, mat=None, mismatch_cost=0, match_cost=1,
2202
               symmetric=True, alphabet=None):
2203
    """Return the matrix similarity of two strings.
2204
2205
    Matrix similarity
2206
2207
    With the default parameters, this is identical to sim_ident.
2208
    It is possible for sim_matrix to return values outside of the range
2209
    :math:`[0, 1]`, if values outside that range are present in mat,
2210
    mismatch_cost, or match_cost.
2211
2212
    :param str src, tar: two strings to be compared
2213
    :param dict mat: a dict mapping tuples to costs; the tuples are (src, tar)
2214
        pairs of symbols from the alphabet parameter
2215
    :param float mismatch_cost: the value returned if (src, tar) is absent from
2216
        mat when src does not equal tar
2217
    :param float match_cost: the value returned if (src, tar) is absent from
2218
        mat when src equals tar
2219
    :param bool symmetric: True if the cost of src not matching tar is
2220
        identical to the cost of tar not matching src; in this case, the values
2221
        in mat need only contain (src, tar) or (tar, src), not both
2222
    :param str alphabet: a collection of tokens from which src and tar are
2223
        drawn; if this is defined a ValueError is raised if either tar or src
2224
        is not found in alphabet
2225
    :returns: matrix similarity
2226
    :rtype: float
2227
2228
    >>> sim_matrix('cat', 'hat')
2229
    0
2230
    >>> sim_matrix('hat', 'hat')
2231
    1
2232
    """
2233
    if alphabet:
2234
        alphabet = tuple(alphabet)
2235
        for i in src:
2236
            if i not in alphabet:
2237
                raise ValueError('src value not in alphabet')
2238
        for i in tar:
2239
            if i not in alphabet:
2240
                raise ValueError('tar value not in alphabet')
2241
2242
    if src == tar:
2243
        if mat and (src, src) in mat:
2244
            return mat[(src, src)]
2245
        return match_cost
2246
    if mat and (src, tar) in mat:
2247
        return mat[(src, tar)]
2248
    elif symmetric and mat and (tar, src) in mat:
2249
        return mat[(tar, src)]
2250
    return mismatch_cost
2251
2252
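# Example of a custom substitution matrix (hypothetical costs): DNA
# transitions (A<->G, C<->T) scored higher than other mismatches. Since
# symmetric=True by default, only one ordering of each pair is needed:
#
#     dna_mat = {('A', 'G'): 0.5, ('C', 'T'): 0.5}
#     sim_matrix('A', 'G', mat=dna_mat, mismatch_cost=-1)  # -> 0.5
#     sim_matrix('A', 'C', mat=dna_mat, mismatch_cost=-1)  # -> -1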
2253
def needleman_wunsch(src, tar, gap_cost=1, sim_func=sim_ident):
2254
    """Return the Needleman-Wunsch score of two strings.
2255
2256
    Needleman-Wunsch score
2257
2258
    This is the standard global sequence alignment score.
2259
2260
    Cf. https://en.wikipedia.org/wiki/Needleman–Wunsch_algorithm
2261
2262
    Cf.
2263
    http://csb.stanford.edu/class/public/readings/Bioinformatics_I_Lecture6/Needleman_Wunsch_JMB_70_Global_alignment.pdf
2264
2265
    :param str src, tar: two strings to be compared
2266
    :param float gap_cost: the cost of an alignment gap (1 by default)
2267
    :param function sim_func: a function that returns the similarity of two
2268
        characters (identity similarity by default)
2269
    :returns: Needleman-Wunsch score
2270
    :rtype: float (in fact dependent on the gap_cost & return value of
        sim_func)
2271
2272
    >>> needleman_wunsch('cat', 'hat')
2273
    2.0
2274
    >>> needleman_wunsch('Niall', 'Neil')
2275
    1.0
2276
    >>> needleman_wunsch('aluminum', 'Catalan')
2277
    -1.0
2278
    >>> needleman_wunsch('ATCG', 'TAGC')
2279
    0.0
2280
    """
2281
    # pylint: disable=no-member
2282
    d_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.float)
2283
    # pylint: enable=no-member
2284
2285
    for i in range(len(src)+1):
2286
        d_mat[i, 0] = -(i * gap_cost)
2287
    for j in range(len(tar)+1):
2288
        d_mat[0, j] = -(j * gap_cost)
2289
    for i in range(1, len(src)+1):
2290
        for j in range(1, len(tar)+1):
2291
            match = d_mat[i-1, j-1] + sim_func(src[i-1], tar[j-1])
2292
            delete = d_mat[i-1, j] - gap_cost
2293
            insert = d_mat[i, j-1] - gap_cost
2294
            d_mat[i, j] = max(match, delete, insert)
2295
    return d_mat[d_mat.shape[0]-1, d_mat.shape[1]-1]
2296
2297
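# Needleman-Wunsch is typically parameterized with a substitution matrix;
# a sketch reusing the hypothetical dna_mat from the sim_matrix example:
#
#     def _dna_sim(src, tar):
#         return sim_matrix(src, tar, mat=dna_mat, mismatch_cost=-1)
#     needleman_wunsch('ATCG', 'TAGC', gap_cost=1, sim_func=_dna_sim)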
2298
def smith_waterman(src, tar, gap_cost=1, sim_func=sim_ident):
2299
    """Return the Smith-Waterman score of two strings.
2300
2301
    Smith-Waterman score
2302
2303
    This is the standard local sequence alignment score.
2304
2305
    Cf. https://en.wikipedia.org/wiki/Smith–Waterman_algorithm
2306
2307
    :param str src, tar: two strings to be compared
2308
    :param float gap_cost: the cost of an alignment gap (1 by default)
2309
    :param function sim_func: a function that returns the similarity of two
2310
        characters (identity similarity by default)
2311
    :returns: Smith-Waterman score
2312
    :rtype: float (in fact dependent on the gap_cost & return value of
        sim_func)
2313
2314
    >>> smith_waterman('cat', 'hat')
2315
    2.0
2316
    >>> smith_waterman('Niall', 'Neil')
2317
    1.0
2318
    >>> smith_waterman('aluminum', 'Catalan')
2319
    0.0
2320
    >>> smith_waterman('ATCG', 'TAGC')
2321
    1.0
2322
    """
2323
    # pylint: disable=no-member
2324
    d_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.float)
2325
    # pylint: enable=no-member
2326
2327
    for i in range(len(src)+1):
2328
        d_mat[i, 0] = 0
2329
    for j in range(len(tar)+1):
2330
        d_mat[0, j] = 0
2331
    for i in range(1, len(src)+1):
2332
        for j in range(1, len(tar)+1):
2333
            match = d_mat[i-1, j-1] + sim_func(src[i-1], tar[j-1])
2334
            delete = d_mat[i-1, j] - gap_cost
2335
            insert = d_mat[i, j-1] - gap_cost
2336
            d_mat[i, j] = max(0, match, delete, insert)
2337
    return d_mat[d_mat.shape[0]-1, d_mat.shape[1]-1]
2338
2339
2340
def gotoh(src, tar, gap_open=1, gap_ext=0.4, sim_func=sim_ident):
2341
    """Return the Gotoh score of two strings.
2342
2343
    Gotoh score
2344
2345
    Gotoh's algorithm is essentially Needleman-Wunsch with affine gap
2346
    penalties:
2347
    https://www.cs.umd.edu/class/spring2003/cmsc838t/papers/gotoh1982.pdf
2348
2349
    :param str src, tar: two strings to be compared
2350
    :param float gap_open: the cost of an open alignment gap (1 by default)
2351
    :param float gap_ext: the cost of an alignment gap extension (0.4 by
2352
        default)
2353
    :param function sim_func: a function that returns the similarity of two
2354
        characters (identity similarity by default)
2355
    :returns: Gotoh score
2356
    :rtype: float (in fact dependent on the gap costs & return value of
2357
        sim_func)
2358
2359
    >>> gotoh('cat', 'hat')
2360
    2.0
2361
    >>> gotoh('Niall', 'Neil')
2362
    1.0
2363
    >>> gotoh('aluminum', 'Catalan')
2364
    -0.40000000000000002
2365
    >>> gotoh('ATCG', 'TAGC')
2366
    0.0
2367
    """
2368
    # pylint: disable=no-member
2369
    d_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.float)
2370
    p_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.float)
2371
    q_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.float)
2372
    # pylint: enable=no-member
2373
2374
    d_mat[0, 0] = 0
2375
    p_mat[0, 0] = float('-inf')
2376
    q_mat[0, 0] = float('-inf')
2377
    for i in range(1, len(src)+1):
2378
        d_mat[i, 0] = float('-inf')
2379
        p_mat[i, 0] = -gap_open - gap_ext*(i-1)
2380
        q_mat[i, 0] = float('-inf')
2381
        q_mat[i, 1] = -gap_open
2382
    for j in range(1, len(tar)+1):
2383
        d_mat[0, j] = float('-inf')
2384
        p_mat[0, j] = float('-inf')
2385
        p_mat[1, j] = -gap_open
2386
        q_mat[0, j] = -gap_open - gap_ext*(j-1)
2387
2388
    for i in range(1, len(src)+1):
2389
        for j in range(1, len(tar)+1):
2390
            sim_val = sim_func(src[i-1], tar[j-1])
2391
            d_mat[i, j] = max(d_mat[i-1, j-1] + sim_val,
2392
                              p_mat[i-1, j-1] + sim_val,
2393
                              q_mat[i-1, j-1] + sim_val)
2394
2395
            p_mat[i, j] = max(d_mat[i-1, j] - gap_open,
2396
                              p_mat[i-1, j] - gap_ext)
2397
2398
            q_mat[i, j] = max(d_mat[i, j-1] - gap_open,
2399
                              q_mat[i, j-1] - gap_ext)
2400
2401
    i, j = (n - 1 for n in d_mat.shape)
2402
    return max(d_mat[i, j], p_mat[i, j], q_mat[i, j])
2403
2404
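# Under the affine gap model above, a gap of length L costs
# gap_open + (L-1)*gap_ext. With the defaults (1, 0.4), a 3-character gap
# costs 1 + 2*0.4 = 1.8, versus 3 under Needleman-Wunsch's linear model
# with gap_cost=1, so Gotoh penalizes one long gap less than several
# scattered ones.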
2405
def sim_length(src, tar):
2406
    """Return the length similarty of two strings.
2407
2408
    Length similarity
2409
2410
    This is the ratio of the length of the shorter string to the longer.
2411
2412
    :param str src, tar: two strings to be compared
2413
    :returns: length similarity
2414
    :rtype: float
2415
2416
    >>> sim_length('cat', 'hat')
2417
    1.0
2418
    >>> sim_length('Niall', 'Neil')
2419
    0.8
2420
    >>> sim_length('aluminum', 'Catalan')
2421
    0.875
2422
    >>> sim_length('ATCG', 'TAGC')
2423
    1.0
2424
    """
2425
    if src == tar:
2426
        return 1.0
2427
    if not src or not tar:
2428
        return 0.0
2429
    return len(src)/len(tar) if len(src) < len(tar) else len(tar)/len(src)
2430
2431
2432
def dist_length(src, tar):
2433
    """Return the length distance between two strings.
2434
2435
    Length distance
2436
2437
    Length distance is the complement of length similarity:
2438
    :math:`dist_{length} = 1 - sim_{length}`
2439
2440
    :param str src, tar: two strings to be compared
2441
    :returns: length distance
2442
    :rtype: float
2443
2444
    >>> dist_length('cat', 'hat')
2445
    0.0
2446
    >>> dist_length('Niall', 'Neil')
2447
    0.19999999999999996
2448
    >>> dist_length('aluminum', 'Catalan')
2449
    0.125
2450
    >>> dist_length('ATCG', 'TAGC')
2451
    0.0
2452
    """
2453
    return 1 - sim_length(src, tar)
2454
2455
2456
def sim_prefix(src, tar):
2457
    """Return the prefix similarty of two strings.
2458
2459
    Prefix similarity
2460
2461
    Prefix similarity is the ratio of the length of the shorter term that
2462
    exactly matches the longer term to the length of the shorter term,
2463
    beginning at the start of both terms.
2464
2465
    :param str src, tar: two strings to be compared
2466
    :returns: prefix similarity
2467
    :rtype: float
2468
2469
    >>> sim_prefix('cat', 'hat')
2470
    0.0
2471
    >>> sim_prefix('Niall', 'Neil')
2472
    0.25
2473
    >>> sim_prefix('aluminum', 'Catalan')
2474
    0.0
2475
    >>> sim_prefix('ATCG', 'TAGC')
2476
    0.0
2477
    """
2478
    if src == tar:
2479
        return 1.0
2480
    if not src or not tar:
2481
        return 0.0
2482
    min_word, max_word = (src, tar) if len(src) < len(tar) else (tar, src)
2483
    min_len = len(min_word)
2484
    for i in range(min_len, 0, -1):
2485
        if min_word[:i] == max_word[:i]:
2486
            return i/min_len
2487
    return 0.0
2488
2489
2490
def dist_prefix(src, tar):
2491
    """Return the prefix distance between two strings.
2492
2493
    Prefix distance
2494
2495
    Prefix distance is the complement of prefix similarity:
2496
    :math:`dist_{prefix} = 1 - sim_{prefix}`
2497
2498
    :param str src, tar: two strings to be compared
2499
    :returns: prefix distance
2500
    :rtype: float
2501
2502
    >>> dist_prefix('cat', 'hat')
2503
    1.0
2504
    >>> dist_prefix('Niall', 'Neil')
2505
    0.75
2506
    >>> dist_prefix('aluminum', 'Catalan')
2507
    1.0
2508
    >>> dist_prefix('ATCG', 'TAGC')
2509
    1.0
2510
    """
2511
    return 1 - sim_prefix(src, tar)
2512
2513
2514
def sim_suffix(src, tar):
2515
    """Return the suffix similarity of two strings.
2516
2517
    Suffix similarity
2518
2519
    Suffix similarity is the ratio of the length of the shorter term that
2520
    exactly matches the longer term to the length of the shorter term,
2521
    beginning at the end of both terms.
2522
2523
    :param str src, tar: two strings to be compared
2524
    :returns: suffix similarity
2525
    :rtype: float
2526
2527
    >>> sim_suffix('cat', 'hat')
2528
    0.6666666666666666
2529
    >>> sim_suffix('Niall', 'Neil')
2530
    0.25
2531
    >>> sim_suffix('aluminum', 'Catalan')
2532
    0.0
2533
    >>> sim_suffix('ATCG', 'TAGC')
2534
    0.0
2535
    """
2536
    if src == tar:
2537
        return 1.0
2538
    if not src or not tar:
2539
        return 0.0
2540
    min_word, max_word = (src, tar) if len(src) < len(tar) else (tar, src)
2541
    min_len = len(min_word)
2542
    for i in range(min_len, 0, -1):
2543
        if min_word[-i:] == max_word[-i:]:
2544
            return i/min_len
2545
    return 0.0
2546
2547
2548
def dist_suffix(src, tar):
2549
    """Return the suffix distance between two strings.
2550
2551
    Suffix distance
2552
2553
    Suffix distance is the complement of suffix similarity:
2554
    :math:`dist_{suffix} = 1 - sim_{suffix}`
2555
2556
    :param str src, tar: two strings to be compared
2557
    :returns: suffix distance
2558
    :rtype: float
2559
2560
    >>> dist_suffix('cat', 'hat')
2561
    0.33333333333333337
2562
    >>> dist_suffix('Niall', 'Neil')
2563
    0.75
2564
    >>> dist_suffix('aluminum', 'Catalan')
2565
    1.0
2566
    >>> dist_suffix('ATCG', 'TAGC')
2567
    1.0
2568
    """
2569
    return 1 - sim_suffix(src, tar)
2570
2571
2572
def sim_mlipns(src, tar, threshold=0.25, maxmismatches=2):
2573
    """Return the MLIPNS similarity of two strings.
2574
2575
    Modified Language-Independent Product Name Search (MLIPNS)
2576
2577
    The MLIPNS algorithm is described in Shannaq, Boumedyen A. N. and Victor V.
2578
    Alexandrov. 2010. "Using Product Similarity for Adding Business." Global
2579
    Journal of Computer Science and Technology. 10(12). 2-8.
2580
    http://www.sial.iias.spb.su/files/386-386-1-PB.pdf
2581
2582
    This function returns only 1.0 (similar) or 0.0 (not similar).
2583
2584
    LIPNS similarity is identical to normalized Hamming similarity.
2585
2586
    :param str src, tar: two strings to be compared
2587
    :param float threshold: a number [0, 1] indicating the maximum distance
2588
        score, at or below which the strings are considered 'similar' (0.25 by
2589
        default)
2590
    :param int maxmismatches: a number indicating the allowable number of
2591
        mismatches to remove before declaring two strings not similar (2 by
2592
        default)
2593
    :returns: MLIPNS similarity
2594
    :rtype: float
2595
2596
    >>> sim_mlipns('cat', 'hat')
2597
    1.0
2598
    >>> sim_mlipns('Niall', 'Neil')
2599
    0.0
2600
    >>> sim_mlipns('aluminum', 'Catalan')
2601
    0.0
2602
    >>> sim_mlipns('ATCG', 'TAGC')
2603
    0.0
2604
    """
2605
    if tar == src:
2606
        return 1.0
2607
    if not src or not tar:
2608
        return 0.0
2609
2610
    mismatches = 0
2611
    ham = hamming(src, tar, difflens=True)
2612
    maxlen = max(len(src), len(tar))
2613
    while src and tar and mismatches <= maxmismatches:
2614
        if maxlen < 1 or (1-(maxlen-ham)/maxlen) <= threshold:
2615
            return 1.0
2616
        else:
2617
            mismatches += 1
2618
            ham -= 1
2619
            maxlen -= 1
2620
2621
    if maxlen < 1:
2622
        return 1.0
2623
    return 0.0
2624
2625
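# A short trace of sim_mlipns('cat', 'hat') with the defaults: the Hamming
# distance is 1 and maxlen is 3, so 1 - (3-1)/3 = 0.333 > 0.25; one
# mismatch is removed (ham=0, maxlen=2), giving 1 - (2-0)/2 = 0.0 <= 0.25,
# so the strings are judged similar and 1.0 is returned.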
2626
def dist_mlipns(src, tar, threshold=0.25, maxmismatches=2):
2627
    """Return the MLIPNS distance between two strings.
2628
2629
    Modified Language-Independent Product Name Search (MLIPNS)
2630
2631
    MLIPNS distance is the complement of MLIPNS similarity:
2632
    :math:`dist_{MLIPNS} = 1 - sim_{MLIPNS}`
2633
2634
    This function returns only 0.0 (distant) or 1.0 (not distant)
2635
2636
    :param str src, tar: two strings to be compared
2637
    :param float threshold: a number [0, 1] indicating the maximum distance
2638
        score, at or below which the strings are considered 'similar' (0.25 by
2639
        default)
2640
    :param int maxmismatches: a number indicating the allowable number of
2641
        mismatches to remove before declaring two strings not similar (2 by
2642
        default)
2643
    :returns: MLIPNS distance
2644
    :rtype: float
2645
2646
    >>> dist_mlipns('cat', 'hat')
2647
    0.0
2648
    >>> dist_mlipns('Niall', 'Neil')
2649
    1.0
2650
    >>> dist_mlipns('aluminum', 'Catalan')
2651
    1.0
2652
    >>> dist_mlipns('ATCG', 'TAGC')
2653
    1.0
2654
    """
2655
    return 1.0 - sim_mlipns(src, tar, threshold, maxmismatches)
2656
2657
2658
def bag(src, tar):
2659
    """Return the bag distance between two strings.
2660
2661
    Bag distance
2662
2663
    Bag distance is proposed in Bartolini, Ilaria, Paolo Ciaccia, and Marco
2664
    Patella. 2002. "String Matching with Metric Trees Using an Approximate
2665
    Distance." Proceedings of the 9th International Symposium on String
2666
    Processing and Information Retrieval, Lisbon, Portugal, September 2002.
2667
    271-283.
2668
    http://www-db.disi.unibo.it/research/papers/SPIRE02.pdf
2669
2670
    It is defined as:
2671
    :math:`max( |multiset(src)-multiset(tar)|, |multiset(tar)-multiset(src)| )`
2672
2673
    :param str src, tar: two strings to be compared
2674
    :returns: bag distance
2675
    :rtype: int
2676
2677
    >>> bag('cat', 'hat')
2678
    1
2679
    >>> bag('Niall', 'Neil')
2680
    2
2681
    >>> bag('aluminum', 'Catalan')
2682
    5
2683
    >>> bag('ATCG', 'TAGC')
2684
    0
2685
    >>> bag('abcdefg', 'hijklm')
2686
    7
2687
    >>> bag('abcdefg', 'hijklmno')
2688
    8
2689
    """
2690
    if tar == src:
2691
        return 0
2692
    elif not src:
2693
        return len(tar)
2694
    elif not tar:
2695
        return len(src)
2696
2697
    src_bag = Counter(src)
2698
    tar_bag = Counter(tar)
2699
    return max(sum((src_bag-tar_bag).values()),
2700
               sum((tar_bag-src_bag).values()))
2701
2702
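# A worked example of the multiset difference above:
#     Counter('Niall') - Counter('Neil') == {'a': 1, 'l': 1}  (2 leftovers)
#     Counter('Neil') - Counter('Niall') == {'e': 1}          (1 leftover)
# so bag('Niall', 'Neil') == max(2, 1) == 2.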
2703
def dist_bag(src, tar):
2704
    """Return the normalized bag distance between two strings.
2705
2706
    Normalized bag distance
2707
2708
    Bag distance is normalized by dividing by :math:`max( |src|, |tar| )`.
2709
2710
    :param str src, tar: two strings to be compared
2711
    :returns: normalized bag distance
2712
    :rtype: float
2713
2714
    >>> dist_bag('cat', 'hat')
2715
    0.3333333333333333
2716
    >>> dist_bag('Niall', 'Neil')
2717
    0.4
2718
    >>> dist_bag('aluminum', 'Catalan')
2719
    0.375
2720
    >>> dist_bag('ATCG', 'TAGC')
2721
    0.0
2722
    """
2723
    if tar == src:
2724
        return 0.0
2725
    if not src or not tar:
2726
        return 1.0
2727
2728
    maxlen = max(len(src), len(tar))
2729
2730
    return bag(src, tar)/maxlen
2731
2732
2733
def sim_bag(src, tar):
2734
    """Return the normalized bag similarity of two strings.
2735
2736
    Normalized bag similarity
2737
2738
    Normalized bag similarity is the complement of normalized bag distance:
2739
    :math:`sim_{bag} = 1 - dist_{bag}`
2740
2741
    :param str src, tar: two strings to be compared
2742
    :returns: normalized bag similarity
2743
    :rtype: float
2744
2745
    >>> sim_bag('cat', 'hat')
2746
    0.6666666666666667
2747
    >>> sim_bag('Niall', 'Neil')
2748
    0.6
2749
    >>> sim_bag('aluminum', 'Catalan')
2750
    0.625
2751
    >>> sim_bag('ATCG', 'TAGC')
2752
    1.0
2753
    """
2754
    return 1-dist_bag(src, tar)
2755
2756
2757
def editex(src, tar, cost=(0, 1, 2), local=False):
2758
    """Return the Editex distance between two strings.
2759
2760
    Editex distance
2761
2762
    As described on pages 3 & 4 of
2763
    Zobel, Justin and Philip Dart. 1996. Phonetic string matching: Lessons from
2764
    information retrieval. In: Proceedings of the ACM-SIGIR Conference on
2765
    Research and Development in Information Retrieval, Zurich, Switzerland.
2766
    166–173. http://goanna.cs.rmit.edu.au/~jz/fulltext/sigir96.pdf
2767
2768
    The local variant is based on
2769
    Ring, Nicholas and Alexandra L. Uitdenbogerd. 2009. Finding ‘Lucy in
2770
    Disguise’: The Misheard Lyric Matching Problem. In: Proceedings of the 5th
2771
    Asia Information Retrieval Symposium, Sapporo, Japan. 157-167.
2772
    http://www.seg.rmit.edu.au/research/download.php?manuscript=404
2773
2774
    :param str src, tar: two strings to be compared
2775
    :param tuple cost: a 3-tuple representing the cost of the three possible
2776
        edits:
2777
        match, same-group, and mismatch respectively (by default: (0, 1, 2))
2778
    :param bool local: if True, the local variant of Editex is used
2779
    :returns: Editex distance
2780
    :rtype: int
2781
2782
    >>> editex('cat', 'hat')
2783
    2
2784
    >>> editex('Niall', 'Neil')
2785
    2
2786
    >>> editex('aluminum', 'Catalan')
2787
    12
2788
    >>> editex('ATCG', 'TAGC')
2789
    6
2790
    """
2791
    match_cost, group_cost, mismatch_cost = cost
2792
    letter_groups = ({'A', 'E', 'I', 'O', 'U', 'Y'},
2793
                     {'B', 'P'},
2794
                     {'C', 'K', 'Q'},
2795
                     {'D', 'T'},
2796
                     {'L', 'R'},
2797
                     {'M', 'N'},
2798
                     {'G', 'J'},
2799
                     {'F', 'P', 'V'},
2800
                     {'S', 'X', 'Z'},
2801
                     {'C', 'S', 'Z'})
2802
    all_letters = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'I', 'J', 'K', 'L', 'M',
2803
                   'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Y', 'Z'}
2804
2805
    def r_cost(ch1, ch2):
2806
        """Return r(a,b) according to Zobel & Dart's definition."""
2807
        if ch1 == ch2:
2808
            return match_cost
2809
        if ch1 in all_letters and ch2 in all_letters:
2810
            for group in letter_groups:
2811
                if ch1 in group and ch2 in group:
2812
                    return group_cost
2813
        return mismatch_cost
2814
2815
    def d_cost(ch1, ch2):
2816
        """Return d(a,b) according to Zobel & Dart's definition."""
2817
        if ch1 != ch2 and ch1 in {'H', 'W'}:
2818
            return group_cost
2819
        return r_cost(ch1, ch2)
2820
2821
    # convert both src & tar to NFKD normalized unicode
2822
    src = unicodedata.normalize('NFKD', text_type(src.upper()))
2823
    tar = unicodedata.normalize('NFKD', text_type(tar.upper()))
2824
    # convert ß to SS (for Python2)
2825
    src = src.replace('ß', 'SS')
2826
    tar = tar.replace('ß', 'SS')
2827
2828
    if src == tar:
2829
        return 0
2830
    if not src:
2831
        return len(tar) * mismatch_cost
2832
    if not tar:
2833
        return len(src) * mismatch_cost
2834
2835
    # pylint: disable=no-member
2836
    d_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.int)
2837
    # pylint: enable=no-member
2838
    lens = len(src)
2839
    lent = len(tar)
2840
    src = ' '+src
2841
    tar = ' '+tar
2842
2843
    if not local:
2844
        for i in range(1, lens+1):
2845
            d_mat[i, 0] = d_mat[i-1, 0] + d_cost(src[i-1], src[i])
2846
    for j in range(1, lent+1):
2847
        d_mat[0, j] = d_mat[0, j-1] + d_cost(tar[j-1], tar[j])
2848
2849
    for i in range(1, lens+1):
2850
        for j in range(1, lent+1):
2851
            d_mat[i, j] = min(d_mat[i-1, j] + d_cost(src[i-1], src[i]),
2852
                              d_mat[i, j-1] + d_cost(tar[j-1], tar[j]),
2853
                              d_mat[i-1, j-1] + r_cost(src[i], tar[j]))
2854
2855
    return d_mat[lens, lent]
2856
2857
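# The letter groups drive the intermediate cost: editex('cat', 'hat') == 2
# because 'C' and 'H' share no group, so the substitution takes the full
# mismatch cost, whereas editex('mat', 'nat') should cost only 1, since
# 'M' and 'N' fall in the same group (illustrative; not from the doctests
# above).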
2858
def dist_editex(src, tar, cost=(0, 1, 2), local=False):
2859
    """Return the normalized Editex distance between two strings.
2860
2861
    Editex distance normalized to the interval [0, 1]
2862
2863
    The Editex distance is normalized by dividing the Editex distance
2864
    (calculated by either the standard or local variant) by the greater of
2865
    the number of characters in src times the cost of a delete and
2866
    the number of characters in tar times the cost of an insert.
2867
    For the case in which all operations have :math:`cost = 1`, this is
2868
    equivalent to the greater of the length of the two strings src & tar.
2869
2870
    :param str src, tar: two strings to be compared
2871
    :param tuple cost: a 3-tuple representing the cost of the three possible
2872
        edits:
2873
        match, same-group, and mismatch respectively (by default: (0, 1, 2))
2874
    :param bool local: if True, the local variant of Editex is used
2875
    :returns: normalized Editex distance
2876
    :rtype: float
2877
2878
    >>> dist_editex('cat', 'hat')
2879
    0.33333333333333331
2880
    >>> dist_editex('Niall', 'Neil')
2881
    0.20000000000000001
2882
    >>> dist_editex('aluminum', 'Catalan')
2883
    0.75
2884
    >>> dist_editex('ATCG', 'TAGC')
2885
    0.75
2886
    """
2887
    if src == tar:
2888
        return 0
2889
    mismatch_cost = cost[2]
2890
    return (editex(src, tar, cost, local) /
2891
            (max(len(src)*mismatch_cost, len(tar)*mismatch_cost)))
2892
2893
2894
def sim_editex(src, tar, cost=(0, 1, 2), local=False):
2895
    """Return the normalized Editex similarity of two strings.
2896
2897
    Editex similarity normalized to the interval [0, 1]
2898
2899
    The Editex similarity is the complement of Editex distance
2900
    :math:`sim_{Editex} = 1 - dist_{Editex}`
2901
2902
    The arguments are identical to those of the editex() function.
2903
2904
    :param str src, tar: two strings to be compared
2905
    :param tuple cost: a 3-tuple representing the cost of the three possible
2906
        edits:
2907
        match, same-group, and mismatch respectively (by default: (0, 1, 2))
2908
    :param bool local: if True, the local variant of Editex is used
2909
    :returns: normalized Editex similarity
2910
    :rtype: float
2911
2912
    >>> sim_editex('cat', 'hat')
2913
    0.66666666666666674
2914
    >>> sim_editex('Niall', 'Neil')
2915
    0.80000000000000004
2916
    >>> sim_editex('aluminum', 'Catalan')
2917
    0.25
2918
    >>> sim_editex('ATCG', 'TAGC')
2919
    0.25
2920
    """
2921
    return 1 - dist_editex(src, tar, cost, local)
2922
2923
2924
def eudex_hamming(src, tar, weights='exponential', maxlength=8,
2925
                  normalized=False):
2926
    """Calculate the Hamming distance between the Eudex hashes of two terms.
2927
2928
    If weights is set to None, a simple Hamming distance is calculated.
2929
    If weights is set to 'exponential', weight decays by powers of 2, as
2930
         proposed in the eudex specification: https://github.com/ticki/eudex.
2931
    If weights is set to 'fibonacci', weight decays through the Fibonacci
2932
         series, as in the eudex reference implementation.
2933
    If weights is set to a callable function, this assumes it creates a
2934
         generator and the generator is used to populate a series of weights.
2935
    If weights is set to an iterable, the iterable's values should be integers
2936
         and will be used as the weights.
2937
2938
    :param str src, tar: two strings to be compared
2939
    :param iterable or generator function weights:
2940
    :param maxlength: the number of characters to encode as a eudex hash
2941
    :return:
2942
    """
2943
2944
    def _gen_fibonacci():
2945
        """Yield the next Fibonacci number.
2946
2947
        Based on https://www.python-course.eu/generators.php
2948
        Starts at Fibonacci number 3 (the second 1)
2949
        """
2950
        num_a, num_b = 1, 2
2951
        while True:
2952
            yield num_a
2953
            num_a, num_b = num_b, num_a + num_b
2954
2955
    def _gen_exponential(base=2):
2956
        """Yield the next value in an exponential series of the base.
2957
2958
        Based on https://www.python-course.eu/generators.php
2959
        Starts at base**0
2960
        """
2961
        exp = 0
2962
        while True:
2963
            yield base ** exp
2964
            exp += 1
2965
2966
    # Calculate the eudex hashes and XOR them
2967
    xored = eudex(src, maxlength=maxlength) ^ eudex(tar, maxlength=maxlength)
2968
2969
    # Simple hamming distance (all bits are equal)
2970
    if not weights:
2971
        return bin(xored).count('1')
2972
2973
    # If weights is a function, it should create a generator,
2974
    # which we now use to populate a list
2975
    if callable(weights):
2976
        weights = weights()
2977
    elif weights == 'exponential':
2978
        weights = _gen_exponential()
2979
    elif weights == 'fibonacci':
2980
        weights = _gen_fibonacci()
2981
    if isinstance(weights, types.GeneratorType):
2982
        weights = [next(weights) for _ in range(maxlength)][::-1]
2983
2984
    # Sum the weighted hamming distance
2985
    dist = 0
0 ignored issues
show
Comprehensibility Bug introduced by
dist is re-defining a name which is already available in the outer-scope (previously defined on line 3278).

It is generally a bad practice to shadow variables from the outer-scope. In most cases, this is done unintentionally and might lead to unexpected behavior:

param = 5

class Foo:
    def __init__(self, param):   # "param" would be flagged here
        self.param = param
Loading history...
2986
    maxdist = 0
2987
    while (xored or normalized) and weights:
2988
        maxdist += 8*weights[-1]
2989
        dist += bin(xored & 0xFF).count('1') * weights.pop()
2990
        xored >>= 8
2991
2992
    if normalized:
2993
        dist /= maxdist
2994
2995
    return dist
2996
2997
2998
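# An illustrative sketch of the weights options for eudex_hamming (call
# forms only; no result values are claimed, since they depend on the eudex
# hashes):
#
#     eudex_hamming('Niall', 'Neil')                       # exponential decay
#     eudex_hamming('Niall', 'Neil', weights='fibonacci')  # Fibonacci decay
#     eudex_hamming('Niall', 'Neil', weights=None)         # plain popcount
#
# An explicit list is consumed from its end via pop(), so a descending list
# puts the largest weight on the most significant byte of the XORed hashes
# (the leading characters):
#
#     eudex_hamming('Niall', 'Neil', weights=[128, 64, 32, 16, 8, 4, 2, 1])
#
# A callable must return a generator; its first maxlength values are
# collected, reversed, and consumed the same way, so an ascending generator
# is equivalent to the descending list above:
#
#     def _doubling():
#         weight = 1
#         while True:
#             yield weight
#             weight *= 2
#     eudex_hamming('Niall', 'Neil', weights=_doubling)

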
def dist_eudex(src, tar, weights='exponential', maxlength=8):
    """Return normalized Hamming distance between Eudex hashes of two terms.

    If weights is set to None, a simple Hamming distance is calculated.
    If weights is set to 'exponential', weight decays by powers of 2, as
         proposed in the eudex specification: https://github.com/ticki/eudex.
    If weights is set to 'fibonacci', weight decays through the Fibonacci
         series, as in the eudex reference implementation.
    If weights is set to a callable function, this assumes it creates a
         generator and the generator is used to populate a series of weights.
    If weights is set to an iterable, the iterable's values should be integers
         and will be used as the weights.

    :param str src, tar: two strings to be compared
    :param iterable or generator function weights: the byte weights, a scheme
        name ('exponential' or 'fibonacci'), or a generator function that
        produces the weights
    :param maxlength: the number of characters to encode as a eudex hash
    :returns: normalized Hamming distance between the Eudex hashes
    :rtype: float
    """
    return eudex_hamming(src, tar, weights, maxlength, True)


def sim_eudex(src, tar, weights='exponential', maxlength=8):
    """Return normalized Hamming similarity between Eudex hashes of two terms.

    If weights is set to None, a simple Hamming distance is calculated.
    If weights is set to 'exponential', weight decays by powers of 2, as
         proposed in the eudex specification: https://github.com/ticki/eudex.
    If weights is set to 'fibonacci', weight decays through the Fibonacci
         series, as in the eudex reference implementation.
    If weights is set to a callable function, this assumes it creates a
         generator and the generator is used to populate a series of weights.
    If weights is set to an iterable, the iterable's values should be integers
         and will be used as the weights.

    :param str src, tar: two strings to be compared
    :param iterable or generator function weights: the byte weights, a scheme
        name ('exponential' or 'fibonacci'), or a generator function that
        produces the weights
    :param maxlength: the number of characters to encode as a eudex hash
    :returns: normalized Hamming similarity between the Eudex hashes
    :rtype: float
    """
    return 1 - dist_eudex(src, tar, weights, maxlength)


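# By construction, sim_eudex is the complement of dist_eudex under any
# weights scheme (an identity that holds for all inputs, not a measured
# value):
#
#     sim_eudex('Niall', 'Neil') == 1 - dist_eudex('Niall', 'Neil')

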
def sift4_simplest(src, tar, max_offset=0):
    """Return the "simplest" Sift4 distance between two terms.

    This is an approximation of edit distance, described in:
    Zackwehdex, Siderite. 2014. "Super Fast and Accurate string distance
    algorithm: Sift4."
    https://siderite.blogspot.com/2014/11/super-fast-and-accurate-string-distance.html

    :param str src, tar: two strings to be compared
    :param max_offset: the number of characters to search for matching
        letters (by default 0, which disables the offset search)
    :returns: the Sift4 distance according to the simplest formula
    :rtype: int
    """
    if not src:
        return len(tar)

    if not tar:
        return len(src)

    src_len = len(src)
    tar_len = len(tar)

    src_cur = 0
    tar_cur = 0
    lcss = 0
    local_cs = 0

    while (src_cur < src_len) and (tar_cur < tar_len):
        if src[src_cur] == tar[tar_cur]:
            local_cs += 1
        else:
            lcss += local_cs
            local_cs = 0
            if src_cur != tar_cur:
                src_cur = tar_cur = max(src_cur, tar_cur)
            for i in range(max_offset):
                if not ((src_cur+i < src_len) or (tar_cur+i < tar_len)):
                    break
                if (src_cur+i < src_len) and (src[src_cur+i] == tar[tar_cur]):
                    src_cur += i
                    local_cs += 1
                    break
                if (tar_cur+i < tar_len) and (src[src_cur] == tar[tar_cur+i]):
                    tar_cur += i
                    local_cs += 1
                    break

        src_cur += 1
        tar_cur += 1

    lcss += local_cs
    return round(max(src_len, tar_len) - lcss)


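# A hand-traced sketch of the offset search (values walked through the loop
# above by hand rather than doctest-verified):
#
#     sift4_simplest('ATCG', 'TAGC')                # 4: no offset search,
#                                                   #    so nothing matches
#     sift4_simplest('ATCG', 'TAGC', max_offset=2)  # 2: offsets recover the
#                                                   #    transposed pairs

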
def sift4_common(src, tar, max_offset=0, max_distance=0):
    """Return the "common" Sift4 distance between two terms.

    This is an approximation of edit distance, described in:
    Zackwehdex, Siderite. 2014. "Super Fast and Accurate string distance
    algorithm: Sift4."
    https://siderite.blogspot.com/2014/11/super-fast-and-accurate-string-distance.html

    :param str src, tar: two strings to be compared
    :param max_offset: the number of characters to search for matching
        letters (by default 0, which disables the offset search)
    :param max_distance: the distance at which to stop and exit early
        (by default 0, which disables early exit)
    :returns: the common Sift4 distance
    :rtype: int
    """
    if not src:
        return len(tar)

    if not tar:
        return len(src)

    src_len = len(src)
    tar_len = len(tar)

    src_cur = 0
    tar_cur = 0
    lcss = 0
    local_cs = 0
    trans = 0
    offset_arr = []

    while (src_cur < src_len) and (tar_cur < tar_len):
        if src[src_cur] == tar[tar_cur]:
            local_cs += 1
            is_trans = False
            i = 0
            while i < len(offset_arr):
                ofs = offset_arr[i]
                if src_cur <= ofs['src_cur'] or tar_cur <= ofs['tar_cur']:
                    is_trans = (abs(tar_cur-src_cur) >=
                                abs(ofs['tar_cur']-ofs['src_cur']))
                    if is_trans:
                        trans += 1
                    elif not ofs['trans']:
                        ofs['trans'] = True
                        trans += 1
                    break
                elif src_cur > ofs['tar_cur'] and tar_cur > ofs['src_cur']:
                    del offset_arr[i]
                else:
                    i += 1

            offset_arr.append({'src_cur': src_cur, 'tar_cur': tar_cur,
                               'trans': is_trans})
        else:
            lcss += local_cs
            local_cs = 0
            if src_cur != tar_cur:
                src_cur = tar_cur = min(src_cur, tar_cur)
            for i in range(max_offset):
                if not ((src_cur+i < src_len) or (tar_cur+i < tar_len)):
                    break
                if (src_cur+i < src_len) and (src[src_cur+i] == tar[tar_cur]):
                    src_cur += i-1
                    tar_cur -= 1
                    break
                if (tar_cur+i < tar_len) and (src[src_cur] == tar[tar_cur+i]):
                    src_cur -= 1
                    tar_cur += i-1
                    break

        src_cur += 1
        tar_cur += 1

        if max_distance:
            temporary_distance = max(src_cur, tar_cur) - lcss + trans
            if temporary_distance >= max_distance:
                return round(temporary_distance)

        if (src_cur >= src_len) or (tar_cur >= tar_len):
            lcss += local_cs
            local_cs = 0
            src_cur = tar_cur = min(src_cur, tar_cur)

    lcss += local_cs
    return round(max(src_len, tar_len) - lcss + trans)


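# A hand-traced sketch (values walked through the loop above by hand rather
# than doctest-verified):
#
#     sift4_common('cat', 'hat')                  # 1
#     sift4_common('ATCG', 'TAGC', max_offset=2)  # 2

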
def dist_sift4(src, tar, max_offset=0, max_distance=0):
    """Return the normalized "common" Sift4 distance between two terms.

    This is an approximation of edit distance, described in:
    Zackwehdex, Siderite. 2014. "Super Fast and Accurate string distance
    algorithm: Sift4."
    https://siderite.blogspot.com/2014/11/super-fast-and-accurate-string-distance.html

    :param str src, tar: two strings to be compared
    :param max_offset: the number of characters to search for matching
        letters (by default 0, which disables the offset search)
    :param max_distance: the distance at which to stop and exit early
        (by default 0, which disables early exit)
    :returns: the normalized Sift4 distance
    :rtype: float
    """
    # the 1 in the denominator guards against division by zero when both
    # strings are empty
    return (sift4_common(src, tar, max_offset, max_distance) /
            (max(1, len(src), len(tar))))


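# A worked example building on the hand-traced sift4_common values above
# (illustrative, not doctest-verified):
#
#     dist_sift4('cat', 'hat')                   # 1/3 ~= 0.333...
#     dist_sift4('ATCG', 'TAGC', max_offset=2)   # 2/4 = 0.5

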
def sim_sift4(src, tar, max_offset=0, max_distance=0):
    """Return the normalized "common" Sift4 similarity of two terms.

    This is an approximation of edit distance, described in:
    Zackwehdex, Siderite. 2014. "Super Fast and Accurate string distance
    algorithm: Sift4."
    https://siderite.blogspot.com/2014/11/super-fast-and-accurate-string-distance.html

    Sift4 similarity is the complement of Sift4 distance:
    :math:`sim_{Sift4} = 1 - dist_{Sift4}`

    :param str src, tar: two strings to be compared
    :param max_offset: the number of characters to search for matching
        letters (by default 0, which disables the offset search)
    :param max_distance: the distance at which to stop and exit early
        (by default 0, which disables early exit)
    :returns: the normalized Sift4 similarity
    :rtype: float
    """
    return 1 - dist_sift4(src, tar, max_offset, max_distance)


def sim_tfidf(src, tar, qval=2, docs_src=None, docs_tar=None):
    """Return the TF-IDF similarity of two strings.

    TF-IDF similarity

    This is chiefly based on the "Formal Definition of TF/IDF Distance" at:
    http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/TfIdfDistance.html

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param Counter docs_src: a Counter object or string representing the
        document corpus for the src string
    :param Counter docs_tar: a Counter object or string representing the
        document corpus for the tar string (or set to None to use the docs_src
        for both)
    :returns: TF-IDF similarity
    :rtype: float
    """
    if src == tar:
        return 1.0  # TODO: confirm correctness of this when docs are different
    elif not src or not tar:
        return 0.0

    q_src, q_tar = _get_qgrams(src, tar, qval)

    if not q_src or not q_tar:
        return 0.0

    def _to_counter(docs):
        """Coerce a corpus argument to a Counter of terms."""
        if docs is None:
            # with no corpus, every term gets the same idf below, so the
            # measure degenerates to a tf-only cosine
            return Counter()
        elif isinstance(docs, Counter):
            return docs
        elif qval and qval > 0:
            return QGrams(docs, qval)
        return Counter(docs.strip().split())

    q_docs_src = _to_counter(docs_src)
    q_docs_tar = q_docs_src if docs_tar is None else _to_counter(docs_tar)

    # This completes the previously unfinished implementation (which
    # returned a hardcoded 0.5) as a cosine of tf-idf-weighted q-gram
    # vectors, loosely following the LingPipe definition cited above. It
    # assumes each corpus Counter maps a term to its document frequency
    # and takes the corpus size to be the sum of those frequencies; tf is
    # dampened as sqrt(count) and idf is ln(1 + n/df).
    def _tfidf_vector(profile, docs):
        """Weight a q-gram profile by tf-idf against a corpus Counter."""
        n_docs = max(1, sum(docs.values()))
        return {term: math.sqrt(count) *
                math.log(1 + n_docs/max(1, docs[term]))
                for term, count in profile.items()}

    vec_src = _tfidf_vector(q_src, q_docs_src)
    vec_tar = _tfidf_vector(q_tar, q_docs_tar)

    product = sum(weight*vec_tar[term] for term, weight in vec_src.items()
                  if term in vec_tar)
    norms = (math.sqrt(sum(weight**2 for weight in vec_src.values())) *
             math.sqrt(sum(weight**2 for weight in vec_tar.values())))
    return product/norms if norms else 0.0

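# An illustrative call (the corpus Counter here is hypothetical, so no exact
# similarity value is claimed):
#
#     corpus = Counter({'ni': 3, 'ia': 2, 'al': 2, 'll': 1})
#     sim_tfidf('Niall', 'Neil', qval=2, docs_src=corpus)
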
###############################################################################


def sim(src, tar, method=sim_levenshtein):
    """Return a similarity of two strings.

    This is a generalized function for calling other similarity functions.

    :param str src, tar: two strings to be compared
    :param function method: specifies the similarity metric (Levenshtein by
        default)
    :returns: similarity according to the specified function
    :rtype: float

    >>> sim('cat', 'hat')
    0.66666666666666674
    >>> sim('Niall', 'Neil')
    0.40000000000000002
    >>> sim('aluminum', 'Catalan')
    0.125
    >>> sim('ATCG', 'TAGC')
    0.25
    """
    if callable(method):
        return method(src, tar)
    raise AttributeError('Unknown similarity function: ' + str(method))


def dist(src, tar, method=sim_levenshtein):
    """Return a distance between two strings.

    This is a generalized function for calling other distance functions.

    :param str src, tar: two strings to be compared
    :param function method: specifies the similarity metric (Levenshtein by
        default) -- Note that this takes a similarity metric function, not
        a distance metric function.
    :returns: distance according to the specified function
    :rtype: float

    >>> dist('cat', 'hat')
    0.33333333333333326
    >>> dist('Niall', 'Neil')
    0.59999999999999998
    >>> dist('aluminum', 'Catalan')
    0.875
    >>> dist('ATCG', 'TAGC')
    0.75
    """
    if callable(method):
        return 1 - method(src, tar)
    raise AttributeError('Unknown distance function: ' + str(method))

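
# Any sim_* function from this module can be passed as method (illustrative
# calls; the exact values depend on the chosen metric):
#
#     sim('cat', 'hat', method=sim_jaro_winkler)
#     dist('cat', 'hat', method=sim_jaro_winkler)  # complement of the above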