Completed — Push to master (f7faf9...ef8477) by Chris, created 12:37

abydos.distance.synoname() — rated F

Complexity
    Conditions    15

Size
    Total Lines   100
    Code Lines    73

Duplication
    Lines         0
    Ratio         0 %

Importance
    Changes       0

Metric   Value
cc       15
eloc     73
nop      5
dl       0
loc      100
rs       2.9836
c        0
b        0
f        0

How to fix

Long Method

Small methods make your code easier to understand, particularly when combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, that is usually a good sign that the commented part should be extracted into a new method; the comment then makes a good starting point for naming the new method.

Commonly applied refactorings include Extract Method, as sketched below.
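A minimal sketch of Extract Method (the names normalize_scores and _clamp_scores are hypothetical, not taken from abydos): the commented block in the first version becomes a helper whose name is derived from the comment.

# Before: the comment marks a block worth extracting.
def normalize_scores(scores):
    # clamp each score into the interval [0, 1]
    clamped = [max(0.0, min(1.0, score)) for score in scores]
    total = sum(clamped)
    return [score / total for score in clamped] if total else clamped

# After: the commented block is a named method, and the comment is gone.
def _clamp_scores(scores):
    """Clamp each score into the interval [0, 1]."""
    return [max(0.0, min(1.0, score)) for score in scores]

def normalize_scores(scores):
    clamped = _clamp_scores(scores)
    total = sum(clamped)
    return [score / total for score in clamped] if total else clamped

The calling method now reads as a sequence of named steps, which is also what makes finding good names easier.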

Complexity

Complex classes like abydos.distance.synoname() often do a lot of different things. To break such a class down, we need to identify a cohesive component within it. A common approach to finding such a component is to look for fields and methods that share the same prefixes or suffixes.

Once you have determined which fields belong together, you can apply the Extract Class refactoring, as sketched below. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often the faster route.
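A minimal sketch of Extract Class, again with hypothetical names (NameMatcher, MatchCache): the fields sharing the cache_ prefix form a cohesive component and move into a class of their own.

# Before: cache_* fields cluster inside a class that does other work too.
class NameMatcher:
    def __init__(self):
        self.cache_hits = 0
        self.cache_store = {}

    def cached(self, key):
        # the cache_* members always change together
        if key in self.cache_store:
            self.cache_hits += 1
            return self.cache_store[key]
        return None

# After: the cache_* component is its own class; the prefix disappears.
class MatchCache:
    def __init__(self):
        self.hits = 0
        self.store = {}

    def get(self, key):
        if key in self.store:
            self.hits += 1
            return self.store[key]
        return None

class NameMatcher:
    def __init__(self):
        self.cache = MatchCache()

    def cached(self, key):
        return self.cache.get(key)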

# -*- coding: utf-8 -*-
# [coding-style] Too many lines in module (3680/1000)

# Copyright 2014-2018 by Christopher C. Little.
# This file is part of Abydos.
#
# Abydos is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Abydos is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.

"""abydos.distance.

The distance module implements string edit distance functions including:

    - Levenshtein distance
    - Optimal String Alignment distance
    - Damerau-Levenshtein distance
    - Hamming distance
    - Tversky index
    - Sørensen–Dice coefficient & distance
    - Jaccard similarity coefficient & distance
    - overlap similarity & distance
    - Tanimoto coefficient & distance
    - Minkowski distance & similarity
    - Manhattan distance & similarity
    - Euclidean distance & similarity
    - Chebyshev distance & similarity
    - cosine similarity & distance
    - Jaro distance
    - Jaro-Winkler distance (incl. the strcmp95 algorithm variant)
    - Longest common substring
    - Ratcliff-Obershelp similarity & distance
    - Match Rating Algorithm similarity
    - Normalized Compression Distance (NCD) & similarity
    - Monge-Elkan similarity & distance
    - Matrix similarity
    - Needleman-Wunsch score
    - Smith-Waterman score
    - Gotoh score
    - Length similarity
    - Prefix, Suffix, and Identity similarity & distance
    - Modified Language-Independent Product Name Search (MLIPNS) similarity &
      distance
    - Bag distance
    - Editex distance
    - Eudex distances
    - Sift4 distance
    - Baystat distance & similarity
    - Typo distance

Functions beginning with the prefixes 'sim' and 'dist' are guaranteed to be
in the range [0, 1], and sim_X = 1 - dist_X since the two are complements.
If a sim_X function is supplied identical src & tar arguments, it is
guaranteed to return 1; the corresponding dist_X function is guaranteed to
return 0.
"""

from __future__ import division, unicode_literals

import codecs
import math
import sys
import types
import unicodedata
from collections import Counter, Iterable, defaultdict

import numpy as np

from six import text_type
from six.moves import range

from .compression import ac_encode, ac_train, rle_encode
from .fingerprint import synoname_toolcode
from .phonetic import eudex, mra
from .qgram import QGrams

try:
    import lzma
except ImportError:  # pragma: no cover
    # If the system lacks the lzma library, that's fine, but lzma compression
    # similarity won't be supported.
    pass
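
# A quick check of the sim_/dist_ complement guarantee stated in the module
# docstring (an added illustration, not part of the original module; it uses
# only functions defined below):
#
#     >>> sim_levenshtein('cat', 'hat') == 1 - dist_levenshtein('cat', 'hat')
#     True
#     >>> dist_hamming('cat', 'cat')
#     0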


def levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
    """Return the Levenshtein distance between two strings.

    Levenshtein distance

    This is the standard edit distance measure. Cf.
    https://en.wikipedia.org/wiki/Levenshtein_distance

    Two additional variants: optimal string alignment (aka restricted
    Damerau-Levenshtein distance) and the Damerau-Levenshtein distance
    are also supported. Cf.
    https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

    The ordinary Levenshtein & Optimal String Alignment distance both
    employ the Wagner-Fischer dynamic programming algorithm. Cf.
    https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm

    Levenshtein edit distance ordinarily has unit insertion, deletion, and
    substitution costs.

    :param str src, tar: two strings to be compared
    :param str mode: specifies a mode for computing the Levenshtein distance:

        - 'lev' (default) computes the ordinary Levenshtein distance,
          in which edits may include inserts, deletes, and substitutions
        - 'osa' computes the Optimal String Alignment distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions but substrings may only be edited once
        - 'dam' computes the Damerau-Levenshtein distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions and substrings may undergo repeated edits

    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits: inserts, deletes, substitutions, and transpositions,
        respectively (by default: (1, 1, 1, 1))
    :returns: the Levenshtein distance between src & tar
    :rtype: int (may return a float if cost has float values)

    >>> levenshtein('cat', 'hat')
    1
    >>> levenshtein('Niall', 'Neil')
    3
    >>> levenshtein('aluminum', 'Catalan')
    7
    >>> levenshtein('ATCG', 'TAGC')
    3

    >>> levenshtein('ATCG', 'TAGC', mode='osa')
    2
    >>> levenshtein('ACTG', 'TAGC', mode='osa')
    4

    >>> levenshtein('ATCG', 'TAGC', mode='dam')
    2
    >>> levenshtein('ACTG', 'TAGC', mode='dam')
    3
    """
    ins_cost, del_cost, sub_cost, trans_cost = cost

    if src == tar:
        return 0
    if not src:
        return len(tar) * ins_cost
    if not tar:
        return len(src) * del_cost

    if 'dam' in mode:
        return damerau_levenshtein(src, tar, cost)

    # pylint: disable=no-member
    d_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.int)
    # pylint: enable=no-member
    for i in range(len(src)+1):
        d_mat[i, 0] = i * del_cost
    for j in range(len(tar)+1):
        d_mat[0, j] = j * ins_cost

    # [unused-code] Consider using enumerate instead of iterating with
    # range and len (flagged here and on the inner loop below)
    for i in range(len(src)):
        for j in range(len(tar)):
            d_mat[i+1, j+1] = min(
                d_mat[i+1, j] + ins_cost,  # ins
                d_mat[i, j+1] + del_cost,  # del
                d_mat[i, j] + (sub_cost if src[i] != tar[j] else 0)  # sub/==
            )

            if mode == 'osa':
                if ((i+1 > 1 and j+1 > 1 and src[i] == tar[j-1] and
                     src[i-1] == tar[j])):
                    # transposition
                    d_mat[i+1, j+1] = min(d_mat[i+1, j+1],
                                          d_mat[i-1, j-1] + trans_cost)

    return d_mat[len(src), len(tar)]


def dist_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
    """Return the normalized Levenshtein distance between two strings.

    Levenshtein distance normalized to the interval [0, 1]

    The Levenshtein distance is normalized by dividing the Levenshtein
    distance (calculated by any of the three supported methods) by the
    greater of the number of characters in src times the cost of a delete
    and the number of characters in tar times the cost of an insert.
    For the case in which all operations have :math:`cost = 1`, this is
    equivalent to the greater of the length of the two strings src & tar.

    :param str src, tar: two strings to be compared
    :param str mode: specifies a mode for computing the Levenshtein distance:

        - 'lev' (default) computes the ordinary Levenshtein distance,
          in which edits may include inserts, deletes, and substitutions
        - 'osa' computes the Optimal String Alignment distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions but substrings may only be edited once
        - 'dam' computes the Damerau-Levenshtein distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions and substrings may undergo repeated edits

    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits: inserts, deletes, substitutions, and transpositions,
        respectively (by default: (1, 1, 1, 1))
    :returns: normalized Levenshtein distance
    :rtype: float

    >>> dist_levenshtein('cat', 'hat')
    0.33333333333333331
    >>> dist_levenshtein('Niall', 'Neil')
    0.59999999999999998
    >>> dist_levenshtein('aluminum', 'Catalan')
    0.875
    >>> dist_levenshtein('ATCG', 'TAGC')
    0.75
    """
    if src == tar:
        return 0
    ins_cost, del_cost = cost[:2]
    return (levenshtein(src, tar, mode, cost) /
            (max(len(src)*del_cost, len(tar)*ins_cost)))


def sim_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1)):
    """Return the Levenshtein similarity of two strings.

    Levenshtein similarity normalized to the interval [0, 1]

    Levenshtein similarity is the complement of Levenshtein distance:
    :math:`sim_{Levenshtein} = 1 - dist_{Levenshtein}`

    The arguments are identical to those of the levenshtein() function.

    :param str src, tar: two strings to be compared
    :param str mode: specifies a mode for computing the Levenshtein distance:

        - 'lev' (default) computes the ordinary Levenshtein distance,
          in which edits may include inserts, deletes, and substitutions
        - 'osa' computes the Optimal String Alignment distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions but substrings may only be edited once
        - 'dam' computes the Damerau-Levenshtein distance, in which
          edits may include inserts, deletes, substitutions, and
          transpositions and substrings may undergo repeated edits

    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits: inserts, deletes, substitutions, and transpositions,
        respectively (by default: (1, 1, 1, 1))
    :returns: normalized Levenshtein similarity
    :rtype: float

    >>> sim_levenshtein('cat', 'hat')
    0.66666666666666674
    >>> sim_levenshtein('Niall', 'Neil')
    0.40000000000000002
    >>> sim_levenshtein('aluminum', 'Catalan')
    0.125
    >>> sim_levenshtein('ATCG', 'TAGC')
    0.25
    """
    return 1 - dist_levenshtein(src, tar, mode, cost)


# [comprehensibility] This function exceeds the maximum number of
# variables (20/15).
def damerau_levenshtein(src, tar, cost=(1, 1, 1, 1)):
    """Return the Damerau-Levenshtein distance between two strings.

    Damerau-Levenshtein distance

    This computes the Damerau-Levenshtein distance. Cf.
    https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

    Damerau-Levenshtein code based on Java code by Kevin L. Stern,
    under the MIT license:
    https://github.com/KevinStern/software-and-algorithms/blob/master/src/main/java/blogspot/software_and_algorithms/stern_library/string/DamerauLevenshteinAlgorithm.java

    :param str src, tar: two strings to be compared
    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits: inserts, deletes, substitutions, and transpositions,
        respectively (by default: (1, 1, 1, 1))
    :returns: the Damerau-Levenshtein distance between src & tar
    :rtype: int (may return a float if cost has float values)

    >>> damerau_levenshtein('cat', 'hat')
    1
    >>> damerau_levenshtein('Niall', 'Neil')
    3
    >>> damerau_levenshtein('aluminum', 'Catalan')
    7
    >>> damerau_levenshtein('ATCG', 'TAGC')
    2
    """
    ins_cost, del_cost, sub_cost, trans_cost = cost

    if src == tar:
        return 0
    if not src:
        return len(tar) * ins_cost
    if not tar:
        return len(src) * del_cost

    if 2*trans_cost < ins_cost + del_cost:
        raise ValueError('Unsupported cost assignment; the cost of two ' +
                         'transpositions must not be less than the cost of ' +
                         'an insert plus a delete.')

    # pylint: disable=no-member
    d_mat = (np.zeros((len(src))*(len(tar)), dtype=np.int).
             reshape((len(src), len(tar))))
    # pylint: enable=no-member

    if src[0] != tar[0]:
        d_mat[0, 0] = min(sub_cost, ins_cost + del_cost)

    src_index_by_character = {}
    src_index_by_character[src[0]] = 0
    for i in range(1, len(src)):
        del_distance = d_mat[i-1, 0] + del_cost
        ins_distance = (i+1) * del_cost + ins_cost
        match_distance = (i * del_cost +
                          (0 if src[i] == tar[0] else sub_cost))
        d_mat[i, 0] = min(del_distance, ins_distance, match_distance)

    for j in range(1, len(tar)):
        del_distance = (j+1) * ins_cost + del_cost
        ins_distance = d_mat[0, j-1] + ins_cost
        match_distance = (j * ins_cost +
                          (0 if src[0] == tar[j] else sub_cost))
        d_mat[0, j] = min(del_distance, ins_distance, match_distance)

    for i in range(1, len(src)):
        max_src_letter_match_index = (0 if src[i] == tar[0] else -1)
        for j in range(1, len(tar)):
            candidate_swap_index = (-1 if tar[j] not in
                                    src_index_by_character else
                                    src_index_by_character[tar[j]])
            j_swap = max_src_letter_match_index
            del_distance = d_mat[i-1, j] + del_cost
            ins_distance = d_mat[i, j-1] + ins_cost
            match_distance = d_mat[i-1, j-1]
            if src[i] != tar[j]:
                match_distance += sub_cost
            else:
                max_src_letter_match_index = j

            if candidate_swap_index != -1 and j_swap != -1:
                i_swap = candidate_swap_index

                if i_swap == 0 and j_swap == 0:
                    pre_swap_cost = 0
                else:
                    pre_swap_cost = d_mat[max(0, i_swap-1), max(0, j_swap-1)]
                swap_distance = (pre_swap_cost + (i - i_swap - 1) *
                                 del_cost + (j - j_swap - 1) * ins_cost +
                                 trans_cost)
            else:
                swap_distance = sys.maxsize

            d_mat[i, j] = min(del_distance, ins_distance,
                              match_distance, swap_distance)
        src_index_by_character[src[i]] = i

    return d_mat[len(src)-1, len(tar)-1]


def dist_damerau(src, tar, cost=(1, 1, 1, 1)):
    """Return the normalized Damerau-Levenshtein distance between two strings.

    Damerau-Levenshtein distance normalized to the interval [0, 1]

    The Damerau-Levenshtein distance is normalized by dividing the
    Damerau-Levenshtein distance by the greater of
    the number of characters in src times the cost of a delete and
    the number of characters in tar times the cost of an insert.
    For the case in which all operations have :math:`cost = 1`, this is
    equivalent to the greater of the length of the two strings src & tar.

    The arguments are identical to those of the levenshtein() function.

    :param str src, tar: two strings to be compared
    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits: inserts, deletes, substitutions, and transpositions,
        respectively (by default: (1, 1, 1, 1))
    :returns: normalized Damerau-Levenshtein distance
    :rtype: float

    >>> dist_damerau('cat', 'hat')
    0.33333333333333331
    >>> dist_damerau('Niall', 'Neil')
    0.59999999999999998
    >>> dist_damerau('aluminum', 'Catalan')
    0.875
    >>> dist_damerau('ATCG', 'TAGC')
    0.5
    """
    if src == tar:
        return 0
    ins_cost, del_cost = cost[:2]
    return (damerau_levenshtein(src, tar, cost) /
            (max(len(src)*del_cost, len(tar)*ins_cost)))


def sim_damerau(src, tar, cost=(1, 1, 1, 1)):
    """Return the Damerau-Levenshtein similarity of two strings.

    Damerau-Levenshtein similarity normalized to the interval [0, 1]

    Damerau-Levenshtein similarity is the complement of Damerau-Levenshtein
    distance:
    :math:`sim_{Damerau} = 1 - dist_{Damerau}`

    The arguments are identical to those of the levenshtein() function.

    :param str src, tar: two strings to be compared
    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits: inserts, deletes, substitutions, and transpositions,
        respectively (by default: (1, 1, 1, 1))
    :returns: normalized Damerau-Levenshtein similarity
    :rtype: float

    >>> sim_damerau('cat', 'hat')
    0.66666666666666674
    >>> sim_damerau('Niall', 'Neil')
    0.40000000000000002
    >>> sim_damerau('aluminum', 'Catalan')
    0.125
    >>> sim_damerau('ATCG', 'TAGC')
    0.5
    """
    return 1 - dist_damerau(src, tar, cost)


def hamming(src, tar, difflens=True):
    """Return the Hamming distance between two strings.

    Hamming distance

    Hamming distance equals the number of character positions at which two
    strings differ. For strings of unequal lengths, it is not normally
    defined. By default, this implementation calculates the Hamming distance
    of the first n characters, where n is the lesser of the two strings'
    lengths, and adds to this the difference in string lengths.

    :param str src, tar: two strings to be compared
    :param bool difflens:
        If True (default), this returns the Hamming distance for those
        characters that have a matching character in both strings plus the
        difference in the strings' lengths. This is equivalent to extending
        the shorter string with obligatorily non-matching characters.
        If False, an exception is raised in the case of strings of unequal
        lengths.
    :returns: the Hamming distance between src & tar
    :rtype: int

    >>> hamming('cat', 'hat')
    1
    >>> hamming('Niall', 'Neil')
    3
    >>> hamming('aluminum', 'Catalan')
    8
    >>> hamming('ATCG', 'TAGC')
    4
    """
    if not difflens and len(src) != len(tar):
        raise ValueError('Undefined for sequences of unequal length; set ' +
                         'difflens to True for Hamming distance between ' +
                         'strings of unequal lengths.')

    hdist = 0
    if difflens:
        hdist += abs(len(src)-len(tar))
    hdist += sum(c1 != c2 for c1, c2 in zip(src, tar))

    return hdist


def dist_hamming(src, tar, difflens=True):
    """Return the normalized Hamming distance between two strings.

    Hamming distance normalized to the interval [0, 1]

    The Hamming distance is normalized by dividing it
    by the greater of the number of characters in src & tar (unless difflens
    is set to False, in which case an exception is raised).

    The arguments are identical to those of the hamming() function.

    :param str src, tar: two strings to be compared
    :param bool difflens:
        If True (default), this returns the Hamming distance for those
        characters that have a matching character in both strings plus the
        difference in the strings' lengths. This is equivalent to extending
        the shorter string with obligatorily non-matching characters.
        If False, an exception is raised in the case of strings of unequal
        lengths.
    :returns: normalized Hamming distance
    :rtype: float

    >>> dist_hamming('cat', 'hat')
    0.3333333333333333
    >>> dist_hamming('Niall', 'Neil')
    0.6
    >>> dist_hamming('aluminum', 'Catalan')
    1.0
    >>> dist_hamming('ATCG', 'TAGC')
    1.0
    """
    if src == tar:
        return 0
    return hamming(src, tar, difflens) / max(len(src), len(tar))


def sim_hamming(src, tar, difflens=True):
    """Return the normalized Hamming similarity of two strings.

    Hamming similarity normalized to the interval [0, 1]

    Hamming similarity is the complement of normalized Hamming distance:
    :math:`sim_{Hamming} = 1 - dist_{Hamming}`

    Provided that difflens==True, the Hamming similarity is identical to the
    Language-Independent Product Name Search (LIPNS) similarity score. For
    further information, see the sim_mlipns documentation.

    The arguments are identical to those of the hamming() function.

    :param str src, tar: two strings to be compared
    :param bool difflens:
        If True (default), this returns the Hamming distance for those
        characters that have a matching character in both strings plus the
        difference in the strings' lengths. This is equivalent to extending
        the shorter string with obligatorily non-matching characters.
        If False, an exception is raised in the case of strings of unequal
        lengths.
    :returns: normalized Hamming similarity
    :rtype: float

    >>> sim_hamming('cat', 'hat')
    0.6666666666666667
    >>> sim_hamming('Niall', 'Neil')
    0.4
    >>> sim_hamming('aluminum', 'Catalan')
    0.0
    >>> sim_hamming('ATCG', 'TAGC')
    0.0
    """
    return 1 - dist_hamming(src, tar, difflens)


def _get_qgrams(src, tar, qval):
    """Return the Q-Grams in src & tar.

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :return: Q-Grams
    """
    if isinstance(src, Counter) and isinstance(tar, Counter):
        return src, tar
    if qval and qval > 0:
        return QGrams(src, qval), QGrams(tar, qval)
    return Counter(src.strip().split()), Counter(tar.strip().split())


# [best-practice] Too many arguments (6/5)
def sim_tversky(src, tar, qval=2, alpha=1, beta=1, bias=None):
    r"""Return the Tversky index of two strings.

    Tversky index

    For two sets X and Y, the Tversky index is defined as:
    :math:`sim_{Tversky}(X, Y) = \\frac{|X \\cap Y|}
    {|X \\cap Y| + \\alpha|X - Y| + \\beta|Y - X|}`

    Cf. https://en.wikipedia.org/wiki/Tversky_index

    :math:`\\alpha = \\beta = 1` is equivalent to the Jaccard & Tanimoto
    similarity coefficients.

    :math:`\\alpha = \\beta = 0.5` is equivalent to the Sørensen-Dice
    similarity coefficient.

    Unequal α and β will tend to emphasize one or the other set's
    contributions:

        - :math:`\\alpha > \\beta` emphasizes the contributions of X over Y
        - :math:`\\alpha < \\beta` emphasizes the contributions of Y over X

    Parameter values' relation to 1 emphasizes different types of
    contributions:

        - :math:`\\alpha, \\beta > 1` emphasizes unique contributions over
          the intersection
        - :math:`\\alpha, \\beta < 1` emphasizes the intersection over unique
          contributions

    The symmetric variant is defined in Jimenez, Sergio, Claudio Becerra, and
    Alexander Gelbukh. 2013. SOFTCARDINALITY-CORE: Improving Text Overlap with
    Distributional Measures for Semantic Textual Similarity. This is activated
    by specifying a bias parameter.
    Cf. http://aclweb.org/anthology/S/S13/S13-1028.pdf

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param float alpha, beta: two Tversky index parameters as indicated in the
        description above
    :param float bias: the symmetric Tversky index bias parameter
    :returns: Tversky similarity
    :rtype: float

    >>> sim_tversky('cat', 'hat')
    0.3333333333333333
    >>> sim_tversky('Niall', 'Neil')
    0.2222222222222222
    >>> sim_tversky('aluminum', 'Catalan')
    0.0625
    >>> sim_tversky('ATCG', 'TAGC')
    0.0
    """
    if alpha < 0 or beta < 0:
        raise ValueError('Unsupported weight assignment; alpha and beta ' +
                         'must be greater than or equal to 0.')

    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0

    q_src, q_tar = _get_qgrams(src, tar, qval)
    q_src_mag = sum(q_src.values())
    q_tar_mag = sum(q_tar.values())
    q_intersection_mag = sum((q_src & q_tar).values())

    if not q_src or not q_tar:
        return 0.0

    if bias is None:
        return q_intersection_mag / (q_intersection_mag + alpha *
                                     (q_src_mag - q_intersection_mag) +
                                     beta * (q_tar_mag - q_intersection_mag))

    a_val = min(q_src_mag - q_intersection_mag,
                q_tar_mag - q_intersection_mag)
    b_val = max(q_src_mag - q_intersection_mag,
                q_tar_mag - q_intersection_mag)
    c_val = q_intersection_mag + bias
    return c_val / (beta * (alpha * a_val + (1 - alpha) * b_val) + c_val)


# [best-practice] Too many arguments (6/5)
def dist_tversky(src, tar, qval=2, alpha=1, beta=1, bias=None):
    """Return the Tversky distance between two strings.

    Tversky distance

    Tversky distance is the complement of the Tversky index (similarity):
    :math:`dist_{Tversky} = 1-sim_{Tversky}`

    The symmetric variant is defined in Jimenez, Sergio, Claudio Becerra, and
    Alexander Gelbukh. 2013. SOFTCARDINALITY-CORE: Improving Text Overlap with
    Distributional Measures for Semantic Textual Similarity. This is activated
    by specifying a bias parameter.
    Cf. http://aclweb.org/anthology/S/S13/S13-1028.pdf

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param float alpha, beta: two Tversky index parameters as described in
        sim_tversky()
    :param float bias: the symmetric Tversky index bias parameter
    :returns: Tversky distance
    :rtype: float

    >>> dist_tversky('cat', 'hat')
    0.6666666666666667
    >>> dist_tversky('Niall', 'Neil')
    0.7777777777777778
    >>> dist_tversky('aluminum', 'Catalan')
    0.9375
    >>> dist_tversky('ATCG', 'TAGC')
    1.0
    """
    return 1 - sim_tversky(src, tar, qval, alpha, beta, bias)


def sim_dice(src, tar, qval=2):
    r"""Return the Sørensen–Dice coefficient of two strings.

    Sørensen–Dice coefficient

    For two sets X and Y, the Sørensen–Dice coefficient is
    :math:`sim_{dice}(X, Y) = \\frac{2 \\cdot |X \\cap Y|}{|X| + |Y|}`

    This is equivalent to the Tversky index for
    :math:`\\alpha = \\beta = 0.5`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: Sørensen–Dice similarity
    :rtype: float

    >>> sim_dice('cat', 'hat')
    0.5
    >>> sim_dice('Niall', 'Neil')
    0.36363636363636365
    >>> sim_dice('aluminum', 'Catalan')
    0.11764705882352941
    >>> sim_dice('ATCG', 'TAGC')
    0.0
    """
    return sim_tversky(src, tar, qval, 0.5, 0.5)


def dist_dice(src, tar, qval=2):
    """Return the Sørensen–Dice distance between two strings.

    Sørensen–Dice distance

    Sørensen–Dice distance is the complement of the Sørensen–Dice coefficient:
    :math:`dist_{dice} = 1 - sim_{dice}`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: Sørensen–Dice distance
    :rtype: float

    >>> dist_dice('cat', 'hat')
    0.5
    >>> dist_dice('Niall', 'Neil')
    0.6363636363636364
    >>> dist_dice('aluminum', 'Catalan')
    0.8823529411764706
    >>> dist_dice('ATCG', 'TAGC')
    1.0
    """
    return 1 - sim_dice(src, tar, qval)
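
# The parameter equivalences noted in the Tversky docstrings can be checked
# directly (an added illustration, not part of the original module):
#
#     >>> sim_tversky('Niall', 'Neil', alpha=0.5, beta=0.5) == \
#     ...     sim_dice('Niall', 'Neil')
#     True
#     >>> sim_tversky('Niall', 'Neil', alpha=1, beta=1) == \
#     ...     sim_jaccard('Niall', 'Neil')
#     True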


def sim_jaccard(src, tar, qval=2):
    r"""Return the Jaccard similarity of two strings.

    Jaccard similarity coefficient

    For two sets X and Y, the Jaccard similarity coefficient is
    :math:`sim_{jaccard}(X, Y) = \\frac{|X \\cap Y|}{|X \\cup Y|}`

    This is identical to the Tanimoto similarity coefficient
    and the Tversky index for :math:`\\alpha = \\beta = 1`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: Jaccard similarity
    :rtype: float

    >>> sim_jaccard('cat', 'hat')
    0.3333333333333333
    >>> sim_jaccard('Niall', 'Neil')
    0.2222222222222222
    >>> sim_jaccard('aluminum', 'Catalan')
    0.0625
    >>> sim_jaccard('ATCG', 'TAGC')
    0.0
    """
    return sim_tversky(src, tar, qval, 1, 1)


def dist_jaccard(src, tar, qval=2):
    """Return the Jaccard distance between two strings.

    Jaccard distance

    Jaccard distance is the complement of the Jaccard similarity coefficient:
    :math:`dist_{Jaccard} = 1 - sim_{Jaccard}`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: Jaccard distance
    :rtype: float

    >>> dist_jaccard('cat', 'hat')
    0.6666666666666667
    >>> dist_jaccard('Niall', 'Neil')
    0.7777777777777778
    >>> dist_jaccard('aluminum', 'Catalan')
    0.9375
    >>> dist_jaccard('ATCG', 'TAGC')
    1.0
    """
    return 1 - sim_jaccard(src, tar, qval)


def sim_overlap(src, tar, qval=2):
    r"""Return the overlap coefficient of two strings.

    Overlap coefficient

    For two sets X and Y, the overlap coefficient is
    :math:`sim_{overlap}(X, Y) = \\frac{|X \\cap Y|}{min(|X|, |Y|)}`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: overlap similarity
    :rtype: float

    >>> sim_overlap('cat', 'hat')
    0.5
    >>> sim_overlap('Niall', 'Neil')
    0.4
    >>> sim_overlap('aluminum', 'Catalan')
    0.125
    >>> sim_overlap('ATCG', 'TAGC')
    0.0
    """
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0

    q_src, q_tar = _get_qgrams(src, tar, qval)
    q_src_mag = sum(q_src.values())
    q_tar_mag = sum(q_tar.values())
    q_intersection_mag = sum((q_src & q_tar).values())

    return q_intersection_mag / min(q_src_mag, q_tar_mag)


def dist_overlap(src, tar, qval=2):
    """Return the overlap distance between two strings.

    Overlap distance

    Overlap distance is the complement of the overlap coefficient:
    :math:`dist_{overlap} = 1 - sim_{overlap}`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: overlap distance
    :rtype: float

    >>> dist_overlap('cat', 'hat')
    0.5
    >>> dist_overlap('Niall', 'Neil')
    0.6
    >>> dist_overlap('aluminum', 'Catalan')
    0.875
    >>> dist_overlap('ATCG', 'TAGC')
    1.0
    """
    return 1 - sim_overlap(src, tar, qval)


def sim_tanimoto(src, tar, qval=2):
    r"""Return the Tanimoto similarity of two strings.

    Tanimoto similarity

    For two sets X and Y, the Tanimoto similarity coefficient is
    :math:`sim_{Tanimoto}(X, Y) = \\frac{|X \\cap Y|}{|X \\cup Y|}`
    This is identical to the Jaccard similarity coefficient
    and the Tversky index for :math:`\\alpha = \\beta = 1`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: Tanimoto similarity
    :rtype: float

    >>> sim_tanimoto('cat', 'hat')
    0.3333333333333333
    >>> sim_tanimoto('Niall', 'Neil')
    0.2222222222222222
    >>> sim_tanimoto('aluminum', 'Catalan')
    0.0625
    >>> sim_tanimoto('ATCG', 'TAGC')
    0.0
    """
    return sim_jaccard(src, tar, qval)


def tanimoto(src, tar, qval=2):
    """Return the Tanimoto distance between two strings.

    Tanimoto distance

    Tanimoto distance is :math:`-log_{2}sim_{Tanimoto}`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: Tanimoto distance
    :rtype: float

    >>> tanimoto('cat', 'hat')
    -1.5849625007211563
    >>> tanimoto('Niall', 'Neil')
    -2.1699250014423126
    >>> tanimoto('aluminum', 'Catalan')
    -4.0
    >>> tanimoto('ATCG', 'TAGC')
    -inf
    """
    coeff = sim_jaccard(src, tar, qval)
    if coeff != 0:
        return math.log(coeff, 2)

    return float('-inf')


def minkowski(src, tar, qval=2, pval=1, normalize=False):
    """Return the Minkowski distance (:math:`L^p-norm`) of two strings.

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param int or float pval: the :math:`p`-value of the :math:`L^p`-space
    :param bool normalize: normalizes the distance to [0, 1] if True
    :returns: the Minkowski distance
    :rtype: float
    """
    q_src, q_tar = _get_qgrams(src, tar, qval)
    diffs = ((q_src - q_tar) + (q_tar - q_src)).values()

    normalizer = 1
    if normalize:
        totals = (q_src + q_tar).values()
        if pval == 0:
            normalizer = len(totals)
        else:
            normalizer = sum(_**pval for _ in totals)**(1 / pval)

    if pval == float('inf'):
        # Chebyshev distance
        return max(diffs)/normalizer
    if pval == 0:
        # This is the l_0 "norm" as developed by David Donoho
        return len(diffs)
    return sum(_**pval for _ in diffs)**(1 / pval)/normalizer


def dist_minkowski(src, tar, qval=2, pval=1):
    """Return Minkowski distance of two strings, normalized to [0, 1].

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param int or float pval: the :math:`p`-value of the :math:`L^p`-space
    :returns: normalized Minkowski distance
    :rtype: float
    """
    return minkowski(src, tar, qval, pval, True)


def sim_minkowski(src, tar, qval=2, pval=1):
    """Return Minkowski similarity of two strings, normalized to [0, 1].

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param int or float pval: the :math:`p`-value of the :math:`L^p`-space
    :returns: normalized Minkowski similarity
    :rtype: float
    """
    return 1 - minkowski(src, tar, qval, pval, True)


def manhattan(src, tar, qval=2, normalize=False):
    """Return the Manhattan distance between two strings.

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param bool normalize: normalizes the distance to [0, 1] if True
    :returns: the Manhattan distance
    :rtype: float
    """
    return minkowski(src, tar, qval, 1, normalize)


def dist_manhattan(src, tar, qval=2):
    """Return the Manhattan distance between two strings, normalized to
    [0, 1].

    This is identical to Canberra distance.

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: normalized Manhattan distance
    :rtype: float
    """
    return manhattan(src, tar, qval, True)


def sim_manhattan(src, tar, qval=2):
    """Return the Manhattan similarity of two strings, normalized to [0, 1].

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: normalized Manhattan similarity
    :rtype: float
    """
    return 1 - manhattan(src, tar, qval, True)


def euclidean(src, tar, qval=2, normalize=False):
    """Return the Euclidean distance between two strings.

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param bool normalize: normalizes the distance to [0, 1] if True
    :returns: the Euclidean distance
    :rtype: float
    """
    return minkowski(src, tar, qval, 2, normalize)


def dist_euclidean(src, tar, qval=2):
    """Return the Euclidean distance between two strings, normalized to
    [0, 1].

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: normalized Euclidean distance
    :rtype: float
    """
    return euclidean(src, tar, qval, True)


def sim_euclidean(src, tar, qval=2):
    """Return the Euclidean similarity of two strings, normalized to [0, 1].

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: normalized Euclidean similarity
    :rtype: float
    """
    return 1 - euclidean(src, tar, qval, True)


def chebyshev(src, tar, qval=2, normalize=False):
    """Return the Chebyshev distance between two strings.

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param bool normalize: normalizes the distance to [0, 1] if True
    :returns: the Chebyshev distance
    :rtype: float
    """
    return minkowski(src, tar, qval, float('inf'), normalize)


def dist_chebyshev(src, tar, qval=2):
    """Return the Chebyshev distance between two strings, normalized to
    [0, 1].

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: normalized Chebyshev distance
    :rtype: float
    """
    return chebyshev(src, tar, qval, True)


def sim_chebyshev(src, tar, qval=2):
    """Return the Chebyshev similarity of two strings, normalized to [0, 1].

    :param str src, tar: two strings to be compared
        (or QGrams/Counter objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: normalized Chebyshev similarity
    :rtype: float
    """
    return 1 - chebyshev(src, tar, qval, True)


def sim_cosine(src, tar, qval=2):
    r"""Return the cosine similarity of two strings.

    Cosine similarity (Ochiai coefficient)

    For two sets X and Y, the cosine similarity (Ochiai coefficient) is:
    :math:`sim_{cosine}(X, Y) = \\frac{|X \\cap Y|}{\\sqrt{|X| \\cdot |Y|}}`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: cosine similarity
    :rtype: float

    >>> sim_cosine('cat', 'hat')
    0.5
    >>> sim_cosine('Niall', 'Neil')
    0.3651483716701107
    >>> sim_cosine('aluminum', 'Catalan')
    0.11785113019775793
    >>> sim_cosine('ATCG', 'TAGC')
    0.0
    """
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0

    q_src, q_tar = _get_qgrams(src, tar, qval)
    q_src_mag = sum(q_src.values())
    q_tar_mag = sum(q_tar.values())
    q_intersection_mag = sum((q_src & q_tar).values())

    return q_intersection_mag / math.sqrt(q_src_mag * q_tar_mag)


def dist_cosine(src, tar, qval=2):
    """Return the cosine distance between two strings.

    Cosine distance

    Cosine distance is the complement of cosine similarity:
    :math:`dist_{cosine} = 1 - sim_{cosine}`

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :returns: cosine distance
    :rtype: float

    >>> dist_cosine('cat', 'hat')
    0.5
    >>> dist_cosine('Niall', 'Neil')
    0.6348516283298893
    >>> dist_cosine('aluminum', 'Catalan')
    0.882148869802242
    >>> dist_cosine('ATCG', 'TAGC')
    1.0
    """
    return 1 - sim_cosine(src, tar, qval)


# [comprehensibility] This function exceeds the maximum number of
# variables (23/15).
def sim_strcmp95(src, tar, long_strings=False):
    """Return the strcmp95 similarity of two strings.

    strcmp95 similarity

    This is a Python translation of the C code for strcmp95:
    http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c
    The above file is a US Government publication and, accordingly,
    in the public domain.

    This is based on the Jaro-Winkler distance, but also attempts to correct
    for some common typos and frequently confused characters. It is also
    limited to uppercase ASCII characters, so it is appropriate to American
    names, but not much else.

    :param str src, tar: two strings to be compared
    :param bool long_strings: set to True to "Increase the probability of a
        match when the number of matched characters is large.  This option
        allows for a little more tolerance when the strings are large. It is
        not an appropriate test when comparing fixed length fields such as
        phone and social security numbers."
    :returns: strcmp95 similarity
    :rtype: float

    >>> sim_strcmp95('cat', 'hat')
    0.7777777777777777
    >>> sim_strcmp95('Niall', 'Neil')
    0.8454999999999999
    >>> sim_strcmp95('aluminum', 'Catalan')
    0.6547619047619048
    >>> sim_strcmp95('ATCG', 'TAGC')
    0.8333333333333334
    """
    def _inrange(char):
        """Return True if char is in the range (0, 91)."""
        return ord(char) > 0 and ord(char) < 91

    ying = src.strip().upper()
    yang = tar.strip().upper()

    if ying == yang:
        return 1.0
    # If either string is blank - return - added in Version 2
    if not ying or not yang:
        return 0.0

    adjwt = defaultdict(int)
    sp_mx = (
        ('A', 'E'), ('A', 'I'), ('A', 'O'), ('A', 'U'), ('B', 'V'), ('E', 'I'),
        ('E', 'O'), ('E', 'U'), ('I', 'O'), ('I', 'U'), ('O', 'U'), ('I', 'Y'),
        ('E', 'Y'), ('C', 'G'), ('E', 'F'), ('W', 'U'), ('W', 'V'), ('X', 'K'),
        ('S', 'Z'), ('X', 'S'), ('Q', 'C'), ('U', 'V'), ('M', 'N'), ('L', 'I'),
        ('Q', 'O'), ('P', 'R'), ('I', 'J'), ('2', 'Z'), ('5', 'S'), ('8', 'B'),
        ('1', 'I'), ('1', 'L'), ('0', 'O'), ('0', 'Q'), ('C', 'K'), ('G', 'J')
    )

    # Initialize the adjwt array on the first call to the function only.
    # The adjwt array is used to give partial credit for characters that
    # may be errors due to known phonetic or character recognition errors.
    # A typical example is to match the letter "O" with the number "0"
    for i in sp_mx:
        adjwt[(i[0], i[1])] = 3
        adjwt[(i[1], i[0])] = 3

    if len(ying) > len(yang):
        search_range = len(ying)
        minv = len(yang)
    else:
        search_range = len(yang)
        minv = len(ying)

    # Blank out the flags
    ying_flag = [0] * search_range
    yang_flag = [0] * search_range
    search_range = max(0, search_range // 2 - 1)

    # Looking only within the search range, count and flag the matched pairs.
    num_com = 0
    yl1 = len(yang) - 1
    # [unused-code] Consider using enumerate instead of iterating with
    # range and len (flagged here and on the similar loops below)
    for i in range(len(ying)):
        lowlim = (i - search_range) if (i >= search_range) else 0
        hilim = (i + search_range) if ((i + search_range) <= yl1) else yl1
        for j in range(lowlim, hilim+1):
            if (yang_flag[j] == 0) and (yang[j] == ying[i]):
                yang_flag[j] = 1
                ying_flag[i] = 1
                num_com += 1
                break

    # If no characters in common - return
    if num_com == 0:
        return 0.0

    # Count the number of transpositions
    k = n_trans = 0
    for i in range(len(ying)):
        if ying_flag[i] != 0:
            for j in range(k, len(yang)):
                if yang_flag[j] != 0:
                    k = j + 1
                    break
            # [analysis] The variable j does not seem to be defined for all
            # execution paths.
            if ying[i] != yang[j]:
                n_trans += 1
    n_trans = n_trans // 2

    # Adjust for similarities in unmatched characters
    n_simi = 0
    # [unused-code] Too many nested blocks (6/5)
    if minv > num_com:
        for i in range(len(ying)):
            if ying_flag[i] == 0 and _inrange(ying[i]):
                for j in range(len(yang)):
                    if yang_flag[j] == 0 and _inrange(yang[j]):
                        if (ying[i], yang[j]) in adjwt:
                            n_simi += adjwt[(ying[i], yang[j])]
                            yang_flag[j] = 2
                            break
    num_sim = n_simi/10.0 + num_com

    # Main weight computation
    weight = num_sim / len(ying) + num_sim / len(yang) + \
        (num_com - n_trans) / num_com
    weight = weight / 3.0

    # Continue to boost the weight if the strings are similar
    if weight > 0.7:

        # Adjust for having up to the first 4 characters in common
        j = 4 if (minv >= 4) else minv
        i = 0
        while (i < j) and (ying[i] == yang[i]) and (not ying[i].isdigit()):
            i += 1
        if i:
            weight += i * 0.1 * (1.0 - weight)

        # Optionally adjust for long strings.

        # After agreeing beginning chars, at least two more must agree and
        # the agreeing characters must be > .5 of remaining characters.
        if (((long_strings) and (minv > 4) and (num_com > i+1) and
             (2*num_com >= minv+i))):
            if not ying[0].isdigit():
                weight += (1.0-weight) * ((num_com-i-1) /
                                          (len(ying)+len(yang)-i*2+2))

    return weight


def dist_strcmp95(src, tar, long_strings=False):
    """Return the strcmp95 distance between two strings.

    strcmp95 distance

    strcmp95 distance is 1 - strcmp95 similarity

    :param str src, tar: two strings to be compared
    :param bool long_strings: set to True to "Increase the probability of a
        match when the number of matched characters is large.  This option
        allows for a little more tolerance when the strings are large. It is
        not an appropriate test when comparing fixed length fields such as
        phone and social security numbers."
    :returns: strcmp95 distance
    :rtype: float

    >>> dist_strcmp95('cat', 'hat')
    0.22222222222222232
    >>> dist_strcmp95('Niall', 'Neil')
    0.15450000000000008
    >>> dist_strcmp95('aluminum', 'Catalan')
    0.34523809523809523
    >>> dist_strcmp95('ATCG', 'TAGC')
    0.16666666666666663
    """
    return 1 - sim_strcmp95(src, tar, long_strings)


# [best-practice] Too many arguments (7/5)
# [comprehensibility] This function exceeds the maximum number of
# variables (22/15).
def sim_jaro_winkler(src, tar, qval=1, mode='winkler', long_strings=False,
                     boost_threshold=0.7, scaling_factor=0.1):
    """Return the Jaro or Jaro-Winkler similarity of two strings.

    Jaro(-Winkler) distance

    This is a Python implementation based on the C code for strcmp95:
    http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c
    The above file is a US Government publication and, accordingly,
    in the public domain.

    :param str src, tar: two strings to be compared
    :param int qval: the length of each q-gram (defaults to 1: character-wise
        matching)
    :param str mode: indicates which variant of this distance metric to
        compute:

            - 'winkler' -- computes the Jaro-Winkler distance (default) which
              increases the score for matches near the start of the word
            - 'jaro' -- computes the Jaro distance

    The following arguments apply only when mode is 'winkler':

    :param bool long_strings: set to True to "Increase the probability of a
        match when the number of matched characters is large.  This option
        allows for a little more tolerance when the strings are large.  It is
        not an appropriate test when comparing fixed length fields such as
        phone and social security numbers."
    :param float boost_threshold: a value between 0 and 1, below which the
        Winkler boost is not applied (defaults to 0.7)
    :param float scaling_factor: a value between 0 and 0.25, indicating by how
        much to boost scores for matching prefixes (defaults to 0.1)

    :returns: Jaro or Jaro-Winkler similarity
    :rtype: float

    >>> sim_jaro_winkler('cat', 'hat')
    0.7777777777777777
    >>> sim_jaro_winkler('Niall', 'Neil')
    0.8049999999999999
    >>> sim_jaro_winkler('aluminum', 'Catalan')
    0.6011904761904762
    >>> sim_jaro_winkler('ATCG', 'TAGC')
    0.8333333333333334

    >>> sim_jaro_winkler('cat', 'hat', mode='jaro')
    0.7777777777777777
    >>> sim_jaro_winkler('Niall', 'Neil', mode='jaro')
    0.7833333333333333
    >>> sim_jaro_winkler('aluminum', 'Catalan', mode='jaro')
    0.6011904761904762
    >>> sim_jaro_winkler('ATCG', 'TAGC', mode='jaro')
    0.8333333333333334
    """
    if mode == 'winkler':
        if boost_threshold > 1 or boost_threshold < 0:
            raise ValueError('Unsupported boost_threshold assignment; ' +
                             'boost_threshold must be between 0 and 1.')
        if scaling_factor > 0.25 or scaling_factor < 0:
            raise ValueError('Unsupported scaling_factor assignment; ' +
                             'scaling_factor must be between 0 and 0.25.')

    if src == tar:
        return 1.0

    src = QGrams(src.strip(), qval).ordered_list
    tar = QGrams(tar.strip(), qval).ordered_list

    lens = len(src)
    lent = len(tar)

    # If either string is blank - return - added in Version 2
    if lens == 0 or lent == 0:
        return 0.0

    if lens > lent:
        search_range = lens
        minv = lent
    else:
        search_range = lent
        minv = lens

    # Zero out the flags
    src_flag = [0] * search_range
    tar_flag = [0] * search_range
    search_range = max(0, search_range//2 - 1)

    # Looking only within the search range, count and flag the matched pairs.
    num_com = 0
    yl1 = lent - 1
    for i in range(lens):
        lowlim = (i - search_range) if (i >= search_range) else 0
        hilim = (i + search_range) if ((i + search_range) <= yl1) else yl1
        for j in range(lowlim, hilim+1):
            if (tar_flag[j] == 0) and (tar[j] == src[i]):
                tar_flag[j] = 1
                src_flag[i] = 1
                num_com += 1
                break

    # If no characters in common - return
    if num_com == 0:
        return 0.0

    # Count the number of transpositions
    k = n_trans = 0
    for i in range(lens):
        if src_flag[i] != 0:
            for j in range(k, lent):
                if tar_flag[j] != 0:
                    k = j + 1
                    break
            # [analysis] The variable j does not seem to be defined for all
            # execution paths.
            if src[i] != tar[j]:
                n_trans += 1
    n_trans = n_trans // 2

    # Main weight computation for Jaro distance
    weight = num_com / lens + num_com / lent + (num_com - n_trans) / num_com
    weight = weight / 3.0

    # Continue to boost the weight if the strings are similar
    # This is the Winkler portion of Jaro-Winkler distance
    if mode == 'winkler' and weight > boost_threshold:

        # Adjust for having up to the first 4 characters in common
        j = 4 if (minv >= 4) else minv
        i = 0
        while (i < j) and (src[i] == tar[i]):
            i += 1
        if i:
            weight += i * scaling_factor * (1.0 - weight)

        # Optionally adjust for long strings.

        # After agreeing beginning chars, at least two more must agree and
        # the agreeing characters must be > .5 of remaining characters.
        if (((long_strings) and (minv > 4) and (num_com > i+1) and
             (2*num_com >= minv+i))):
            weight += (1.0-weight) * ((num_com-i-1) / (lens+lent-i*2+2))

    return weight
def dist_jaro_winkler(src, tar, qval=1, mode='winkler', long_strings=False,
                      boost_threshold=0.7, scaling_factor=0.1):
    """Return the Jaro or Jaro-Winkler distance between two strings.

    Jaro(-Winkler) distance

    Jaro-Winkler distance is 1 - the Jaro-Winkler similarity

    :param str src, tar: two strings to be compared
    :param int qval: the length of each q-gram (defaults to 1: character-wise
        matching)
    :param str mode: indicates which variant of this distance metric to
        compute:

            - 'winkler' -- computes the Jaro-Winkler distance (default) which
              increases the score for matches near the start of the word
            - 'jaro' -- computes the Jaro distance

    The following arguments apply only when mode is 'winkler':

    :param bool long_strings: set to True to "Increase the probability of a
        match when the number of matched characters is large.  This option
        allows for a little more tolerance when the strings are large.  It is
        not an appropriate test when comparing fixed length fields such as
        phone and social security numbers."
    :param float boost_threshold: a value between 0 and 1, below which the
        Winkler boost is not applied (defaults to 0.7)
    :param float scaling_factor: a value between 0 and 0.25, indicating by how
        much to boost scores for matching prefixes (defaults to 0.1)

    :returns: Jaro or Jaro-Winkler distance
    :rtype: float

    >>> dist_jaro_winkler('cat', 'hat')
    0.22222222222222232
    >>> dist_jaro_winkler('Niall', 'Neil')
    0.19500000000000006
    >>> dist_jaro_winkler('aluminum', 'Catalan')
    0.39880952380952384
    >>> dist_jaro_winkler('ATCG', 'TAGC')
    0.16666666666666663

    >>> dist_jaro_winkler('cat', 'hat', mode='jaro')
    0.22222222222222232
    >>> dist_jaro_winkler('Niall', 'Neil', mode='jaro')
    0.21666666666666667
    >>> dist_jaro_winkler('aluminum', 'Catalan', mode='jaro')
    0.39880952380952384
    >>> dist_jaro_winkler('ATCG', 'TAGC', mode='jaro')
    0.16666666666666663
    """
    return 1 - sim_jaro_winkler(src, tar, qval, mode, long_strings,
                                boost_threshold, scaling_factor)


def lcsseq(src, tar):
    """Return the longest common subsequence of two strings.

    Longest common subsequence (LCSseq)

    Based on the dynamic programming algorithm from
    http://rosettacode.org/wiki/Longest_common_subsequence#Dynamic_Programming_6
    This is licensed GFDL 1.2

    Modifications include:
        conversion to a numpy array in place of a list of lists

    :param str src, tar: two strings to be compared
    :returns: the longest common subsequence
    :rtype: str

    >>> lcsseq('cat', 'hat')
    'at'
    >>> lcsseq('Niall', 'Neil')
    'Nil'
    >>> lcsseq('aluminum', 'Catalan')
    'aln'
    >>> lcsseq('ATCG', 'TAGC')
    'AC'
    """
    # pylint: disable=no-member
    lengths = np.zeros((len(src)+1, len(tar)+1), dtype=np.int)
    # pylint: enable=no-member

    # row 0 and column 0 are initialized to 0 already
    for i, src_char in enumerate(src):
        for j, tar_char in enumerate(tar):
            if src_char == tar_char:
                lengths[i+1, j+1] = lengths[i, j] + 1
            else:
                lengths[i+1, j+1] = max(lengths[i+1, j], lengths[i, j+1])

    # read the substring out from the matrix
    result = ''
    i, j = len(src), len(tar)
    while i != 0 and j != 0:
        if lengths[i, j] == lengths[i-1, j]:
            i -= 1
        elif lengths[i, j] == lengths[i, j-1]:
            j -= 1
        else:
            result = src[i-1] + result
            i -= 1
            j -= 1
    return result


def sim_lcsseq(src, tar):
    r"""Return the longest common subsequence similarity of two strings.

    Longest common subsequence similarity (:math:`sim_{LCSseq}`)

    This employs the LCSseq function to derive a similarity metric:
    :math:`sim_{LCSseq}(s,t) = \\frac{|LCSseq(s,t)|}{max(|s|, |t|)}`

    :param str src, tar: two strings to be compared
    :returns: LCSseq similarity
    :rtype: float

    >>> sim_lcsseq('cat', 'hat')
    0.6666666666666666
    >>> sim_lcsseq('Niall', 'Neil')
    0.6
    >>> sim_lcsseq('aluminum', 'Catalan')
    0.375
    >>> sim_lcsseq('ATCG', 'TAGC')
    0.5
    """
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    return len(lcsseq(src, tar)) / max(len(src), len(tar))


def dist_lcsseq(src, tar):
    """Return the longest common subsequence distance between two strings.

    Longest common subsequence distance (:math:`dist_{LCSseq}`)

    This employs the LCSseq function to derive a similarity metric:
    :math:`dist_{LCSseq}(s,t) = 1 - sim_{LCSseq}(s,t)`

    :param str src, tar: two strings to be compared
    :returns: LCSseq distance
    :rtype: float

    >>> dist_lcsseq('cat', 'hat')
    0.33333333333333337
    >>> dist_lcsseq('Niall', 'Neil')
    0.4
    >>> dist_lcsseq('aluminum', 'Catalan')
    0.625
    >>> dist_lcsseq('ATCG', 'TAGC')
    0.5
    """
    return 1 - sim_lcsseq(src, tar)


def lcsstr(src, tar):
    """Return the longest common substring of two strings.

    Longest common substring (LCSstr)

    Based on the code from
    https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring#Python
    This is licensed Creative Commons: Attribution-ShareAlike 3.0

    Modifications include:

        - conversion to a numpy array in place of a list of lists
        - conversion to Python 2/3-safe range from xrange via six

    :param str src, tar: two strings to be compared
    :returns: the longest common substring
    :rtype: str

    >>> lcsstr('cat', 'hat')
    'at'
    >>> lcsstr('Niall', 'Neil')
    'N'
    >>> lcsstr('aluminum', 'Catalan')
    'al'
    >>> lcsstr('ATCG', 'TAGC')
    'A'
    """
    # pylint: disable=no-member
    lengths = np.zeros((len(src)+1, len(tar)+1), dtype=np.int)
    # pylint: enable=no-member
    longest, i_longest = 0, 0
    for i in range(1, len(src)+1):
        for j in range(1, len(tar)+1):
            if src[i-1] == tar[j-1]:
                lengths[i, j] = lengths[i-1, j-1] + 1
                if lengths[i, j] > longest:
                    longest = lengths[i, j]
                    i_longest = i
            else:
                lengths[i, j] = 0
    return src[i_longest - longest:i_longest]


def sim_lcsstr(src, tar):
    r"""Return the longest common substring similarity of two strings.

    Longest common substring similarity (:math:`sim_{LCSstr}`)

    This employs the LCS function to derive a similarity metric:
    :math:`sim_{LCSstr}(s,t) = \\frac{|LCSstr(s,t)|}{max(|s|, |t|)}`

    :param str src, tar: two strings to be compared
    :returns: LCSstr similarity
    :rtype: float

    >>> sim_lcsstr('cat', 'hat')
    0.6666666666666666
    >>> sim_lcsstr('Niall', 'Neil')
    0.2
    >>> sim_lcsstr('aluminum', 'Catalan')
    0.25
    >>> sim_lcsstr('ATCG', 'TAGC')
    0.25
    """
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    return len(lcsstr(src, tar)) / max(len(src), len(tar))


def dist_lcsstr(src, tar):
    """Return the longest common substring distance between two strings.

    Longest common substring distance (:math:`dist_{LCSstr}`)

    This employs the LCS function to derive a similarity metric:
    :math:`dist_{LCSstr}(s,t) = 1 - sim_{LCSstr}(s,t)`

    :param str src, tar: two strings to be compared
    :returns: LCSstr distance
    :rtype: float

    >>> dist_lcsstr('cat', 'hat')
    0.33333333333333337
    >>> dist_lcsstr('Niall', 'Neil')
    0.8
    >>> dist_lcsstr('aluminum', 'Catalan')
    0.75
    >>> dist_lcsstr('ATCG', 'TAGC')
    0.75
    """
    return 1 - sim_lcsstr(src, tar)


def sim_ratcliff_obershelp(src, tar):
    """Return the Ratcliff-Obershelp similarity of two strings.

    Ratcliff-Obershelp similarity

    This follows the Ratcliff-Obershelp algorithm to derive a similarity
    measure:

        1. Find the length of the longest common substring in src & tar.
        2. Recurse on the strings to the left & right of this substring
           in src & tar. The base case is a 0 length common substring, in
           which case, return 0. Otherwise, return the sum of the length of
           the current longest common substring and the left & right
           recursed sums.
        3. Multiply this length by 2 and divide by the sum of the lengths of
           src & tar.

    Cf.
    http://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970

    :param str src, tar: two strings to be compared
    :returns: Ratcliff-Obershelp similarity
    :rtype: float

    >>> sim_ratcliff_obershelp('cat', 'hat')
    0.66666666666666663
    >>> sim_ratcliff_obershelp('Niall', 'Neil')
    0.66666666666666663
    >>> sim_ratcliff_obershelp('aluminum', 'Catalan')
    0.40000000000000002
    >>> sim_ratcliff_obershelp('ATCG', 'TAGC')
    0.5
    """
    def _lcsstr_stl(src, tar):
        """Return start positions & length for Ratcliff-Obershelp.

        Return the start position in the source string, start position in
        the target string, and length of the longest common substring of
        strings src and tar.
        """
        # pylint: disable=no-member
        lengths = np.zeros((len(src)+1, len(tar)+1), dtype=np.int)
        # pylint: enable=no-member
        longest, src_longest, tar_longest = 0, 0, 0
        for i in range(1, len(src)+1):
            for j in range(1, len(tar)+1):
                if src[i-1] == tar[j-1]:
                    lengths[i, j] = lengths[i-1, j-1] + 1
                    if lengths[i, j] > longest:
                        longest = lengths[i, j]
                        src_longest = i
                        tar_longest = j
                else:
                    lengths[i, j] = 0
        return (src_longest-longest, tar_longest-longest, longest)

    def _sstr_matches(src, tar):
        """Return the sum of substring match lengths.

        This follows the Ratcliff-Obershelp algorithm:
             1. Find the length of the longest common substring in src & tar.
             2. Recurse on the strings to the left & right of this
                 substring in src & tar.
             3. Base case is a 0 length common substring, in which case,
                 return 0.
             4. Return the sum.
        """
        src_start, tar_start, length = _lcsstr_stl(src, tar)
        if length == 0:
            return 0
        return (_sstr_matches(src[:src_start], tar[:tar_start]) +
                length +
                _sstr_matches(src[src_start+length:], tar[tar_start+length:]))

    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    return 2*_sstr_matches(src, tar)/(len(src)+len(tar))


def dist_ratcliff_obershelp(src, tar):
    """Return the Ratcliff-Obershelp distance between two strings.

    Ratcliff-Obershelp distance

    Ratcliff-Obershelp distance is the complement of Ratcliff-Obershelp
    similarity:
    :math:`dist_{Ratcliff-Obershelp} = 1 - sim_{Ratcliff-Obershelp}`

    :param str src, tar: two strings to be compared
    :returns: Ratcliff-Obershelp distance
    :rtype: float

    >>> dist_ratcliff_obershelp('cat', 'hat')
    0.33333333333333337
    >>> dist_ratcliff_obershelp('Niall', 'Neil')
    0.33333333333333337
    >>> dist_ratcliff_obershelp('aluminum', 'Catalan')
    0.59999999999999998
    >>> dist_ratcliff_obershelp('ATCG', 'TAGC')
    0.5
    """
    return 1 - sim_ratcliff_obershelp(src, tar)


def mra_compare(src, tar):
    """Return the MRA comparison rating of two strings.

    Western Airlines Surname Match Rating Algorithm comparison rating

    A description of the algorithm can be found on page 18 of
    https://archive.org/details/accessingindivid00moor

    :param str src, tar: two strings to be compared
    :returns: MRA comparison rating
    :rtype: int

    >>> mra_compare('cat', 'hat')
    5
    >>> mra_compare('Niall', 'Neil')
    6
    >>> mra_compare('aluminum', 'Catalan')
    0
    >>> mra_compare('ATCG', 'TAGC')
    5
    """
    if src == tar:
        return 6
    if src == '' or tar == '':
        return 0
    src = list(mra(src))
    tar = list(mra(tar))

    if abs(len(src)-len(tar)) > 2:
        return 0

    length_sum = len(src) + len(tar)
    if length_sum < 5:
        min_rating = 5
    elif length_sum < 8:
        min_rating = 4
    elif length_sum < 12:
        min_rating = 3
    else:
        min_rating = 2

    for _ in range(2):
        new_src = []
        new_tar = []
        minlen = min(len(src), len(tar))
        for i in range(minlen):
            if src[i] != tar[i]:
                new_src.append(src[i])
                new_tar.append(tar[i])
        src = new_src+src[minlen:]
        tar = new_tar+tar[minlen:]
        src.reverse()
        tar.reverse()

    similarity = 6 - max(len(src), len(tar))

    if similarity >= min_rating:
        return similarity
    return 0


def sim_mra(src, tar):
    """Return the normalized MRA similarity of two strings.

    Normalized Match Rating Algorithm similarity

    This is the MRA normalized to :math:`[0, 1]`, given that MRA itself is
    constrained to the range :math:`[0, 6]`.

    :param str src, tar: two strings to be compared
    :returns: normalized MRA similarity
    :rtype: float

    >>> sim_mra('cat', 'hat')
    0.8333333333333334
    >>> sim_mra('Niall', 'Neil')
    1.0
    >>> sim_mra('aluminum', 'Catalan')
    0.0
    >>> sim_mra('ATCG', 'TAGC')
    0.8333333333333334
    """
    return mra_compare(src, tar)/6


def dist_mra(src, tar):
    """Return the normalized MRA distance between two strings.

    Normalized Match Rating Algorithm distance

    MRA distance is the complement of MRA similarity:
    :math:`dist_{MRA} = 1 - sim_{MRA}`

    :param str src, tar: two strings to be compared
    :returns: normalized MRA distance
    :rtype: float

    >>> dist_mra('cat', 'hat')
    0.16666666666666663
    >>> dist_mra('Niall', 'Neil')
    0.0
    >>> dist_mra('aluminum', 'Catalan')
    1.0
    >>> dist_mra('ATCG', 'TAGC')
    0.16666666666666663
    """
    return 1 - sim_mra(src, tar)


def dist_compression(src, tar, compressor='bz2', probs=None):
    """Return the normalized compression distance between two strings.

    Normalized compression distance (NCD)

    Cf.
    https://en.wikipedia.org/wiki/Normalized_compression_distance#Normalized_compression_distance

    :param str src, tar: two strings to be compared
    :param str compressor: a compression scheme to use for the similarity
        calculation, from the following:

            - `zlib` -- standard zlib/gzip
            - `bz2` -- bzip2 (default)
            - `lzma` -- Lempel–Ziv–Markov chain algorithm
            - `arith` -- arithmetic coding
            - `rle` -- run-length encoding
            - `bwtrle` -- Burrows-Wheeler transform followed by run-length
              encoding

    :param dict probs: a dictionary trained with ac_train (for the arith
        compressor only)
    :returns: compression distance
    :rtype: float

    >>> dist_compression('cat', 'hat')
    0.08
    >>> dist_compression('Niall', 'Neil')
    0.037037037037037035
    >>> dist_compression('aluminum', 'Catalan')
    0.20689655172413793
    >>> dist_compression('ATCG', 'TAGC')
    0.037037037037037035

    >>> dist_compression('Niall', 'Neil', compressor='zlib')
    0.45454545454545453
    >>> dist_compression('Niall', 'Neil', compressor='bz2')
    0.037037037037037035
    >>> dist_compression('Niall', 'Neil', compressor='lzma')
    0.16
    >>> dist_compression('Niall', 'Neil', compressor='arith')
    0.6875
    >>> dist_compression('Niall', 'Neil', compressor='rle')
    1.0
    >>> dist_compression('Niall', 'Neil', compressor='bwtrle')
    0.8333333333333334
    """
    if src == tar:
        return 0.0

    if compressor not in {'arith', 'rle', 'bwtrle'}:
        src = src.encode('utf-8')
        tar = tar.encode('utf-8')

    # the slices below drop leading codec framing bytes, apparently so that
    # fixed overhead does not dominate the compressed lengths of short strings
    if compressor == 'bz2':
        src_comp = codecs.encode(src, 'bz2_codec')[15:]
        tar_comp = codecs.encode(tar, 'bz2_codec')[15:]
        concat_comp = codecs.encode(src+tar, 'bz2_codec')[15:]
        concat_comp2 = codecs.encode(tar+src, 'bz2_codec')[15:]
    elif compressor == 'lzma':
        if 'lzma' in sys.modules:
            src_comp = lzma.compress(src)[14:]
            tar_comp = lzma.compress(tar)[14:]
            concat_comp = lzma.compress(src+tar)[14:]
            concat_comp2 = lzma.compress(tar+src)[14:]
        else:  # pragma: no cover
            raise ValueError('Install the PylibLZMA module in order to use ' +
                             'lzma compression similarity')
    elif compressor == 'arith':
        if probs is None:
            # lacking a reasonable dictionary, train on the strings themselves
            probs = ac_train(src+tar)
        src_comp = ac_encode(src, probs)[1]
        tar_comp = ac_encode(tar, probs)[1]
        concat_comp = ac_encode(src+tar, probs)[1]
        concat_comp2 = ac_encode(tar+src, probs)[1]
        return ((min(concat_comp, concat_comp2) - min(src_comp, tar_comp)) /
                max(src_comp, tar_comp))
    elif compressor in {'rle', 'bwtrle'}:
        src_comp = rle_encode(src, (compressor == 'bwtrle'))
        tar_comp = rle_encode(tar, (compressor == 'bwtrle'))
        concat_comp = rle_encode(src+tar, (compressor == 'bwtrle'))
        concat_comp2 = rle_encode(tar+src, (compressor == 'bwtrle'))
    else:  # zlib
        src_comp = codecs.encode(src, 'zlib_codec')[2:]
        tar_comp = codecs.encode(tar, 'zlib_codec')[2:]
        concat_comp = codecs.encode(src+tar, 'zlib_codec')[2:]
        concat_comp2 = codecs.encode(tar+src, 'zlib_codec')[2:]
    return ((min(len(concat_comp), len(concat_comp2)) -
             min(len(src_comp), len(tar_comp))) /
            max(len(src_comp), len(tar_comp)))


def sim_compression(src, tar, compressor='bz2', probs=None):
    """Return the normalized compression similarity of two strings.

    Normalized compression similarity (NCS)

    Normalized compression similarity is the complement of normalized
    compression distance:
    :math:`sim_{NCS} = 1 - dist_{NCD}`

    :param str src, tar: two strings to be compared
    :param str compressor: a compression scheme to use for the similarity
        calculation:

            - `zlib` -- standard zlib/gzip
            - `bz2` -- bzip2 (default)
            - `lzma` -- Lempel–Ziv–Markov chain algorithm
            - `arith` -- arithmetic coding
            - `rle` -- run-length encoding
            - `bwtrle` -- Burrows-Wheeler transform followed by run-length
              encoding

    :param dict probs: a dictionary trained with ac_train (for the arith
        compressor only)
    :returns: compression similarity
    :rtype: float

    >>> sim_compression('cat', 'hat')
    0.92
    >>> sim_compression('Niall', 'Neil')
    0.962962962962963
    >>> sim_compression('aluminum', 'Catalan')
    0.7931034482758621
    >>> sim_compression('ATCG', 'TAGC')
    0.962962962962963

    >>> sim_compression('Niall', 'Neil', compressor='zlib')
    0.5454545454545454
    >>> sim_compression('Niall', 'Neil', compressor='bz2')
    0.962962962962963
    >>> sim_compression('Niall', 'Neil', compressor='lzma')
    0.84
    >>> sim_compression('Niall', 'Neil', compressor='arith')
    0.3125
    >>> sim_compression('Niall', 'Neil', compressor='rle')
    0.0
    >>> sim_compression('Niall', 'Neil', compressor='bwtrle')
    0.16666666666666663
    """
    return 1 - dist_compression(src, tar, compressor, probs)


def sim_monge_elkan(src, tar, sim_func=sim_levenshtein, symmetric=False):
    """Return the Monge-Elkan similarity of two strings.

    Monge-Elkan similarity

    Monge-Elkan is defined in:
    Monge, Alvaro E. and Charles P. Elkan. 1996. "The field matching problem:
    Algorithms and applications." KDD-9 Proceedings.
    http://www.aaai.org/Papers/KDD/1996/KDD96-044.pdf

    Note: Monge-Elkan is NOT a symmetric similarity algorithm. Thus, the
    similarity of src to tar is not necessarily equal to the similarity of
    tar to src. If the symmetric argument is True, a symmetric value is
    calculated, at the cost of doubling the computation time (since
    :math:`sim_{Monge-Elkan}(src, tar)` and
    :math:`sim_{Monge-Elkan}(tar, src)` are both calculated and then averaged).

    :param str src, tar: two strings to be compared
    :param function sim_func: the internal similarity metric to employ
    :param bool symmetric: return a symmetric similarity measure
    :returns: Monge-Elkan similarity
    :rtype: float

    >>> sim_monge_elkan('cat', 'hat')
    0.75
    >>> sim_monge_elkan('Niall', 'Neil')
    0.66666666666666663
    >>> sim_monge_elkan('aluminum', 'Catalan')
    0.3888888888888889
    >>> sim_monge_elkan('ATCG', 'TAGC')
    0.5
    """
    if src == tar:
        return 1.0

    q_src = sorted(QGrams(src).elements())
    q_tar = sorted(QGrams(tar).elements())

    if not q_src or not q_tar:
        return 0.0

    sum_of_maxes = 0
    for q_s in q_src:
        max_sim = float('-inf')
        for q_t in q_tar:
            max_sim = max(max_sim, sim_func(q_s, q_t))
        sum_of_maxes += max_sim
    sim_em = sum_of_maxes / len(q_src)

    if symmetric:
        sim_em = (sim_em + sim_monge_elkan(tar, src, sim_func, False))/2

    return sim_em


def dist_monge_elkan(src, tar, sim_func=sim_levenshtein, symmetric=False):
    """Return the Monge-Elkan distance between two strings.

    Monge-Elkan distance

    Monge-Elkan is defined in:
    Monge, Alvaro E. and Charles P. Elkan. 1996. "The field matching problem:
    Algorithms and applications." KDD-9 Proceedings.
    http://www.aaai.org/Papers/KDD/1996/KDD96-044.pdf

    Note: Monge-Elkan is NOT a symmetric similarity algorithm. Thus, the
    distance between src and tar is not necessarily equal to the distance
    between tar and src. If the symmetric argument is True, a symmetric value
    is calculated, at the cost of doubling the computation time (since
    :math:`sim_{Monge-Elkan}(src, tar)` and :math:`sim_{Monge-Elkan}(tar, src)`
    are both calculated and then averaged).

    :param str src, tar: two strings to be compared
    :param function sim_func: the internal similarity metric to employ
    :param bool symmetric: return a symmetric similarity measure
    :returns: Monge-Elkan distance
    :rtype: float

    >>> dist_monge_elkan('cat', 'hat')
    0.25
    >>> dist_monge_elkan('Niall', 'Neil')
    0.33333333333333337
    >>> dist_monge_elkan('aluminum', 'Catalan')
    0.61111111111111116
    >>> dist_monge_elkan('ATCG', 'TAGC')
    0.5
    """
    return 1 - sim_monge_elkan(src, tar, sim_func, symmetric)


def sim_ident(src, tar):
    """Return the identity similarity of two strings.

    Identity similarity

    This is 1 if the two strings are identical, otherwise 0.

    :param str src, tar: two strings to be compared
    :returns: identity similarity
    :rtype: int

    >>> sim_ident('cat', 'hat')
    0
    >>> sim_ident('cat', 'cat')
    1
    """
    return int(src == tar)


def dist_ident(src, tar):
    """Return the identity distance between two strings.

    Identity distance

    This is 0 if the two strings are identical, otherwise 1, i.e.
    :math:`dist_{identity} = 1 - sim_{identity}`

    :param str src, tar: two strings to be compared
    :returns: identity distance
    :rtype: int

    >>> dist_ident('cat', 'hat')
    1
    >>> dist_ident('cat', 'cat')
    0
    """
    return 1 - sim_ident(src, tar)


def sim_matrix(src, tar, mat=None, mismatch_cost=0, match_cost=1,
               symmetric=True, alphabet=None):
    """Return the matrix similarity of two strings.

    Matrix similarity

    With the default parameters, this is identical to sim_ident.
    It is possible for sim_matrix to return values outside of the range
    :math:`[0, 1]`, if values outside that range are present in mat,
    mismatch_cost, or match_cost.

    :param str src, tar: two strings to be compared
    :param dict mat: a dict mapping tuples to costs; the tuples are (src, tar)
        pairs of symbols from the alphabet parameter
    :param float mismatch_cost: the value returned if (src, tar) is absent from
        mat when src does not equal tar
    :param float match_cost: the value returned if (src, tar) is absent from
        mat when src equals tar
    :param bool symmetric: True if the cost of src not matching tar is
        identical to the cost of tar not matching src; in this case, the values
        in mat need only contain (src, tar) or (tar, src), not both
    :param str alphabet: a collection of tokens from which src and tar are
        drawn; if this is defined a ValueError is raised if either tar or src
        is not found in alphabet
    :returns: matrix similarity
    :rtype: float

    >>> sim_matrix('cat', 'hat')
    0
    >>> sim_matrix('hat', 'hat')
    1
    """
    if alphabet:
        alphabet = tuple(alphabet)
        for i in src:
            if i not in alphabet:
                raise ValueError('src value not in alphabet')
        for i in tar:
            if i not in alphabet:
                raise ValueError('tar value not in alphabet')

    if src == tar:
        if mat and (src, src) in mat:
            return mat[(src, src)]
        return match_cost
    if mat and (src, tar) in mat:
        return mat[(src, tar)]
    if symmetric and mat and (tar, src) in mat:
        return mat[(tar, src)]
    return mismatch_cost


def needleman_wunsch(src, tar, gap_cost=1, sim_func=sim_ident):
    """Return the Needleman-Wunsch score of two strings.

    Needleman-Wunsch score

    This is the standard global alignment score.

    Cf. https://en.wikipedia.org/wiki/Needleman–Wunsch_algorithm

    Cf.
    http://csb.stanford.edu/class/public/readings/Bioinformatics_I_Lecture6/Needleman_Wunsch_JMB_70_Global_alignment.pdf

    :param str src, tar: two strings to be compared
    :param float gap_cost: the cost of an alignment gap (1 by default)
    :param function sim_func: a function that returns the similarity of two
        characters (identity similarity by default)
    :returns: Needleman-Wunsch score
    :rtype: int (in fact dependent on the gap_cost & return value of sim_func)

    >>> needleman_wunsch('cat', 'hat')
    2.0
    >>> needleman_wunsch('Niall', 'Neil')
    1.0
    >>> needleman_wunsch('aluminum', 'Catalan')
    -1.0
    >>> needleman_wunsch('ATCG', 'TAGC')
    0.0
    """
    # pylint: disable=no-member
    d_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.float)
    # pylint: enable=no-member

    for i in range(len(src)+1):
        d_mat[i, 0] = -(i * gap_cost)
    for j in range(len(tar)+1):
        d_mat[0, j] = -(j * gap_cost)
    for i in range(1, len(src)+1):
        for j in range(1, len(tar)+1):
            match = d_mat[i-1, j-1] + sim_func(src[i-1], tar[j-1])
            delete = d_mat[i-1, j] - gap_cost
            insert = d_mat[i, j-1] - gap_cost
            d_mat[i, j] = max(match, delete, insert)
    return d_mat[d_mat.shape[0]-1, d_mat.shape[1]-1]


def smith_waterman(src, tar, gap_cost=1, sim_func=sim_ident):
    """Return the Smith-Waterman score of two strings.

    Smith-Waterman score

    This is the standard local alignment score.

    Cf. https://en.wikipedia.org/wiki/Smith–Waterman_algorithm

    :param str src, tar: two strings to be compared
    :param float gap_cost: the cost of an alignment gap (1 by default)
    :param function sim_func: a function that returns the similarity of two
        characters (identity similarity by default)
    :returns: Smith-Waterman score
    :rtype: int (in fact dependent on the gap_cost & return value of sim_func)

    >>> smith_waterman('cat', 'hat')
    2.0
    >>> smith_waterman('Niall', 'Neil')
    1.0
    >>> smith_waterman('aluminum', 'Catalan')
    0.0
    >>> smith_waterman('ATCG', 'TAGC')
    1.0
    """
    # pylint: disable=no-member
    d_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.float)
    # pylint: enable=no-member

    for i in range(len(src)+1):
        d_mat[i, 0] = 0
    for j in range(len(tar)+1):
        d_mat[0, j] = 0
    for i in range(1, len(src)+1):
        for j in range(1, len(tar)+1):
            match = d_mat[i-1, j-1] + sim_func(src[i-1], tar[j-1])
            delete = d_mat[i-1, j] - gap_cost
            insert = d_mat[i, j-1] - gap_cost
            d_mat[i, j] = max(0, match, delete, insert)
    return d_mat[d_mat.shape[0]-1, d_mat.shape[1]-1]


def gotoh(src, tar, gap_open=1, gap_ext=0.4, sim_func=sim_ident):
    """Return the Gotoh score of two strings.

    Gotoh score

    Gotoh's algorithm is essentially Needleman-Wunsch with affine gap
    penalties:
    https://www.cs.umd.edu/class/spring2003/cmsc838t/papers/gotoh1982.pdf

    :param str src, tar: two strings to be compared
    :param float gap_open: the cost of an open alignment gap (1 by default)
    :param float gap_ext: the cost of an alignment gap extension (0.4 by
        default)
    :param function sim_func: a function that returns the similarity of two
        characters (identity similarity by default)
    :returns: Gotoh score
    :rtype: float (in fact dependent on the gap_cost & return value of
        sim_func)

    >>> gotoh('cat', 'hat')
    2.0
    >>> gotoh('Niall', 'Neil')
    1.0
    >>> gotoh('aluminum', 'Catalan')
    -0.40000000000000002
    >>> gotoh('cat', 'hat')
    2.0
    """
    # pylint: disable=no-member
    d_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.float)
    p_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.float)
    q_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.float)
    # pylint: enable=no-member

    d_mat[0, 0] = 0
    p_mat[0, 0] = float('-inf')
    q_mat[0, 0] = float('-inf')
    for i in range(1, len(src)+1):
        d_mat[i, 0] = float('-inf')
        p_mat[i, 0] = -gap_open - gap_ext*(i-1)
        q_mat[i, 0] = float('-inf')
        q_mat[i, 1] = -gap_open
    for j in range(1, len(tar)+1):
        d_mat[0, j] = float('-inf')
        p_mat[0, j] = float('-inf')
        p_mat[1, j] = -gap_open
        q_mat[0, j] = -gap_open - gap_ext*(j-1)

    for i in range(1, len(src)+1):
        for j in range(1, len(tar)+1):
            sim_val = sim_func(src[i-1], tar[j-1])
            d_mat[i, j] = max(d_mat[i-1, j-1] + sim_val,
                              p_mat[i-1, j-1] + sim_val,
                              q_mat[i-1, j-1] + sim_val)

            p_mat[i, j] = max(d_mat[i-1, j] - gap_open,
                              p_mat[i-1, j] - gap_ext)

            q_mat[i, j] = max(d_mat[i, j-1] - gap_open,
                              q_mat[i, j-1] - gap_ext)

    i, j = (n - 1 for n in d_mat.shape)
    return max(d_mat[i, j], p_mat[i, j], q_mat[i, j])


def sim_length(src, tar):
    """Return the length similarity of two strings.

    Length similarity

    This is the ratio of the length of the shorter string to the longer.

    :param str src, tar: two strings to be compared
    :returns: length similarity
    :rtype: float

    >>> sim_length('cat', 'hat')
    1.0
    >>> sim_length('Niall', 'Neil')
    0.8
    >>> sim_length('aluminum', 'Catalan')
    0.875
    >>> sim_length('ATCG', 'TAGC')
    1.0
    """
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    return len(src)/len(tar) if len(src) < len(tar) else len(tar)/len(src)


def dist_length(src, tar):
    """Return the length distance between two strings.

    Length distance

    Length distance is the complement of length similarity:
    :math:`dist_{length} = 1 - sim_{length}`

    :param str src, tar: two strings to be compared
    :returns: length distance
    :rtype: float

    >>> dist_length('cat', 'hat')
    0.0
    >>> dist_length('Niall', 'Neil')
    0.19999999999999996
    >>> dist_length('aluminum', 'Catalan')
    0.125
    >>> dist_length('ATCG', 'TAGC')
    0.0
    """
    return 1 - sim_length(src, tar)


def sim_prefix(src, tar):
    """Return the prefix similarity of two strings.

    Prefix similarity

    Prefix similarity is the ratio of the length of the shorter term that
    exactly matches the longer term to the length of the shorter term,
    beginning at the start of both terms.

    :param str src, tar: two strings to be compared
    :returns: prefix similarity
    :rtype: float

    >>> sim_prefix('cat', 'hat')
    0.0
    >>> sim_prefix('Niall', 'Neil')
    0.25
    >>> sim_prefix('aluminum', 'Catalan')
    0.0
    >>> sim_prefix('ATCG', 'TAGC')
    0.0
    """
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    min_word, max_word = (src, tar) if len(src) < len(tar) else (tar, src)
    min_len = len(min_word)
    for i in range(min_len, 0, -1):
        if min_word[:i] == max_word[:i]:
            return i/min_len
    return 0.0


def dist_prefix(src, tar):
    """Return the prefix distance between two strings.

    Prefix distance

    Prefix distance is the complement of prefix similarity:
    :math:`dist_{prefix} = 1 - sim_{prefix}`

    :param str src, tar: two strings to be compared
    :returns: prefix distance
    :rtype: float

    >>> dist_prefix('cat', 'hat')
    1.0
    >>> dist_prefix('Niall', 'Neil')
    0.75
    >>> dist_prefix('aluminum', 'Catalan')
    1.0
    >>> dist_prefix('ATCG', 'TAGC')
    1.0
    """
    return 1 - sim_prefix(src, tar)


def sim_suffix(src, tar):
    """Return the suffix similarity of two strings.

    Suffix similarity

    Suffix similarity is the ratio of the length of the shorter term that
    exactly matches the longer term to the length of the shorter term,
    beginning at the end of both terms.

    :param str src, tar: two strings to be compared
    :returns: suffix similarity
    :rtype: float

    >>> sim_suffix('cat', 'hat')
    0.6666666666666666
    >>> sim_suffix('Niall', 'Neil')
    0.25
    >>> sim_suffix('aluminum', 'Catalan')
    0.0
    >>> sim_suffix('ATCG', 'TAGC')
    0.0
    """
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    min_word, max_word = (src, tar) if len(src) < len(tar) else (tar, src)
    min_len = len(min_word)
    for i in range(min_len, 0, -1):
        if min_word[-i:] == max_word[-i:]:
            return i/min_len
    return 0.0


def dist_suffix(src, tar):
    """Return the suffix distance between two strings.

    Suffix distance

    Suffix distance is the complement of suffix similarity:
    :math:`dist_{suffix} = 1 - sim_{suffix}`

    :param str src, tar: two strings to be compared
    :returns: suffix distance
    :rtype: float

    >>> dist_suffix('cat', 'hat')
    0.33333333333333337
    >>> dist_suffix('Niall', 'Neil')
    0.75
    >>> dist_suffix('aluminum', 'Catalan')
    1.0
    >>> dist_suffix('ATCG', 'TAGC')
    1.0
    """
    return 1 - sim_suffix(src, tar)


def sim_mlipns(src, tar, threshold=0.25, maxmismatches=2):
    """Return the MLIPNS similarity of two strings.

    Modified Language-Independent Product Name Search (MLIPNS)

    The MLIPNS algorithm is described in Shannaq, Boumedyen A. N. and Victor V.
    Alexandrov. 2010. "Using Product Similarity for Adding Business." Global
    Journal of Computer Science and Technology. 10(12). 2-8.
    http://www.sial.iias.spb.su/files/386-386-1-PB.pdf

    This function returns only 1.0 (similar) or 0.0 (not similar).

    LIPNS similarity is identical to normalized Hamming similarity.

    :param str src, tar: two strings to be compared
    :param float threshold: a number [0, 1] indicating the maximum similarity
        score, below which the strings are considered 'similar' (0.25 by
        default)
    :param int maxmismatches: a number indicating the allowable number of
        mismatches to remove before declaring two strings not similar (2 by
        default)
    :returns: MLIPNS similarity
    :rtype: float

    >>> sim_mlipns('cat', 'hat')
    1.0
    >>> sim_mlipns('Niall', 'Neil')
    0.0
    >>> sim_mlipns('aluminum', 'Catalan')
    0.0
    >>> sim_mlipns('ATCG', 'TAGC')
    0.0
    """
    if tar == src:
        return 1.0
    if not src or not tar:
        return 0.0

    mismatches = 0
    ham = hamming(src, tar, difflens=True)
    maxlen = max(len(src), len(tar))
    while src and tar and mismatches <= maxmismatches:
        if maxlen < 1 or (1-(maxlen-ham)/maxlen) <= threshold:
            return 1.0
        mismatches += 1
        ham -= 1
        maxlen -= 1

    if maxlen < 1:
        return 1.0
    return 0.0


def dist_mlipns(src, tar, threshold=0.25, maxmismatches=2):
    """Return the MLIPNS distance between two strings.

    Modified Language-Independent Product Name Search (MLIPNS)

    MLIPNS distance is the complement of MLIPNS similarity:
    :math:`dist_{MLIPNS} = 1 - sim_{MLIPNS}`

    This function returns only 0.0 (distant) or 1.0 (not distant)

    :param str src, tar: two strings to be compared
    :param float threshold: a number [0, 1] indicating the maximum similarity
        score, below which the strings are considered 'similar' (0.25 by
        default)
    :param int maxmismatches: a number indicating the allowable number of
        mismatches to remove before declaring two strings not similar (2 by
        default)
    :returns: MLIPNS distance
    :rtype: float

    >>> dist_mlipns('cat', 'hat')
    0.0
    >>> dist_mlipns('Niall', 'Neil')
    1.0
    >>> dist_mlipns('aluminum', 'Catalan')
    1.0
    >>> dist_mlipns('ATCG', 'TAGC')
    1.0
    """
    return 1.0 - sim_mlipns(src, tar, threshold, maxmismatches)


def bag(src, tar):
    """Return the bag distance between two strings.

    Bag distance

    Bag distance is proposed in Bartolini, Ilaria, Paolo Ciaccia, and Marco
    Patella. 2002. "String Matching with Metric Trees Using an Approximate
    Distance." Proceedings of the 9th International Symposium on String
    Processing and Information Retrieval, Lisbon, Portugal, September 2002.
    271-283.
    http://www-db.disi.unibo.it/research/papers/SPIRE02.pdf

    It is defined as:
    :math:`max( |multiset(src)-multiset(tar)|, |multiset(tar)-multiset(src)| )`

    :param str src, tar: two strings to be compared
    :returns: bag distance
    :rtype: int

    >>> bag('cat', 'hat')
    1
    >>> bag('Niall', 'Neil')
    2
    >>> bag('aluminum', 'Catalan')
    5
    >>> bag('ATCG', 'TAGC')
    0
    >>> bag('abcdefg', 'hijklm')
    7
    >>> bag('abcdefg', 'hijklmno')
    8
    """
    if tar == src:
        return 0
    if not src:
        return len(tar)
    if not tar:
        return len(src)

    src_bag = Counter(src)
    tar_bag = Counter(tar)
    return max(sum((src_bag-tar_bag).values()),
               sum((tar_bag-src_bag).values()))


def dist_bag(src, tar):
    """Return the normalized bag distance between two strings.

    Normalized bag distance

    Bag distance is normalized by dividing by :math:`max( |src|, |tar| )`.

    :param str src, tar: two strings to be compared
    :returns: normalized bag distance
    :rtype: float

    >>> dist_bag('cat', 'hat')
    0.3333333333333333
    >>> dist_bag('Niall', 'Neil')
    0.4
    >>> dist_bag('aluminum', 'Catalan')
    0.375
    >>> dist_bag('ATCG', 'TAGC')
    0.0
    """
    if tar == src:
        return 0.0
    if not src or not tar:
        return 1.0

    maxlen = max(len(src), len(tar))

    return bag(src, tar)/maxlen


def sim_bag(src, tar):
    """Return the normalized bag similarity of two strings.

    Normalized bag similarity

    Normalized bag similarity is the complement of normalized bag distance:
    :math:`sim_{bag} = 1 - dist_{bag}`

    :param str src, tar: two strings to be compared
    :returns: normalized bag similarity
    :rtype: float

    >>> sim_bag('cat', 'hat')
    0.6666666666666667
    >>> sim_bag('Niall', 'Neil')
    0.6
    >>> sim_bag('aluminum', 'Catalan')
    0.625
    >>> sim_bag('ATCG', 'TAGC')
    1.0
    """
    return 1 - dist_bag(src, tar)


def editex(src, tar, cost=(0, 1, 2), local=False):
    """Return the Editex distance between two strings.

    Editex distance

    As described on pages 3 & 4 of
    Zobel, Justin and Philip Dart. 1996. Phonetic string matching: Lessons from
    information retrieval. In: Proceedings of the ACM-SIGIR Conference on
    Research and Development in Information Retrieval, Zurich, Switzerland.
    166–173. https://doi.org/10.1145/243199.243258

    The local variant is based on
    Ring, Nicholas and Alexandra L. Uitdenbogerd. 2009. Finding ‘Lucy in
    Disguise’: The Misheard Lyric Matching Problem. In: Proceedings of the 5th
    Asia Information Retrieval Symposium, Sapporo, Japan. 157-167.
    http://www.seg.rmit.edu.au/research/download.php?manuscript=404

    :param str src, tar: two strings to be compared
    :param tuple cost: a 3-tuple representing the cost of the three possible
        edit types: match, same-group, and mismatch respectively
        (by default: (0, 1, 2))
    :param bool local: if True, the local variant of Editex is used
    :returns: Editex distance
    :rtype: int

    >>> editex('cat', 'hat')
    2
    >>> editex('Niall', 'Neil')
    2
    >>> editex('aluminum', 'Catalan')
    12
    >>> editex('ATCG', 'TAGC')
    6
    """
    match_cost, group_cost, mismatch_cost = cost
    letter_groups = ({'A', 'E', 'I', 'O', 'U', 'Y'},
                     {'B', 'P'},
                     {'C', 'K', 'Q'},
                     {'D', 'T'},
                     {'L', 'R'},
                     {'M', 'N'},
                     {'G', 'J'},
                     {'F', 'P', 'V'},
                     {'S', 'X', 'Z'},
                     {'C', 'S', 'Z'})
    all_letters = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'I', 'J', 'K', 'L', 'M',
                   'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Y', 'Z'}

    def r_cost(ch1, ch2):
        """Return r(a,b) according to Zobel & Dart's definition."""
        if ch1 == ch2:
            return match_cost
        if ch1 in all_letters and ch2 in all_letters:
            for group in letter_groups:
                if ch1 in group and ch2 in group:
                    return group_cost
        return mismatch_cost

    def d_cost(ch1, ch2):
        """Return d(a,b) according to Zobel & Dart's definition."""
        if ch1 != ch2 and ch1 in {'H', 'W'}:
            return group_cost
        return r_cost(ch1, ch2)

    # convert both src & tar to NFKD normalized unicode
    src = unicodedata.normalize('NFKD', text_type(src.upper()))
    tar = unicodedata.normalize('NFKD', text_type(tar.upper()))
    # convert ß to SS (for Python2)
    src = src.replace('ß', 'SS')
    tar = tar.replace('ß', 'SS')

    if src == tar:
        return 0
    if not src:
        return len(tar) * mismatch_cost
    if not tar:
        return len(src) * mismatch_cost

    # pylint: disable=no-member
    d_mat = np.zeros((len(src)+1, len(tar)+1), dtype=np.int)
    # pylint: enable=no-member
    lens = len(src)
    lent = len(tar)
    src = ' '+src
    tar = ' '+tar

    if not local:
        for i in range(1, lens+1):
            d_mat[i, 0] = d_mat[i-1, 0] + d_cost(src[i-1], src[i])
    for j in range(1, lent+1):
        d_mat[0, j] = d_mat[0, j-1] + d_cost(tar[j-1], tar[j])

    for i in range(1, lens+1):
        for j in range(1, lent+1):
            d_mat[i, j] = min(d_mat[i-1, j] + d_cost(src[i-1], src[i]),
                              d_mat[i, j-1] + d_cost(tar[j-1], tar[j]),
                              d_mat[i-1, j-1] + r_cost(src[i], tar[j]))

    return d_mat[lens, lent]


def dist_editex(src, tar, cost=(0, 1, 2), local=False):
    """Return the normalized Editex distance between two strings.

    Editex distance normalized to the interval [0, 1]

    The Editex distance is normalized by dividing the Editex distance
    (calculated by any of the three supported methods) by the greater of
    the number of characters in src times the cost of a delete and
    the number of characters in tar times the cost of an insert.
    For the case in which all operations have :math:`cost = 1`, this is
    equivalent to the greater of the length of the two strings src & tar.

    :param str src, tar: two strings to be compared
    :param tuple cost: a 3-tuple representing the cost of the three possible
        edit types: match, same-group, and mismatch respectively
        (by default: (0, 1, 2))
    :param bool local: if True, the local variant of Editex is used
    :returns: normalized Editex distance
    :rtype: float

    >>> dist_editex('cat', 'hat')
    0.33333333333333331
    >>> dist_editex('Niall', 'Neil')
    0.20000000000000001
    >>> dist_editex('aluminum', 'Catalan')
    0.75
    >>> dist_editex('ATCG', 'TAGC')
    0.75
    """
    if src == tar:
        return 0
    mismatch_cost = cost[2]
    return (editex(src, tar, cost, local) /
            (max(len(src)*mismatch_cost, len(tar)*mismatch_cost)))


def sim_editex(src, tar, cost=(0, 1, 2), local=False):
    """Return the normalized Editex similarity of two strings.

    Editex similarity normalized to the interval [0, 1]

    The Editex similarity is the complement of Editex distance:
    :math:`sim_{Editex} = 1 - dist_{Editex}`

    The arguments are identical to those of the editex() function.

    :param str src, tar: two strings to be compared
    :param tuple cost: a 3-tuple representing the cost of the three possible
        edit types: match, same-group, and mismatch respectively
        (by default: (0, 1, 2))
    :param bool local: if True, the local variant of Editex is used
    :returns: normalized Editex similarity
    :rtype: float

    >>> sim_editex('cat', 'hat')
    0.66666666666666674
    >>> sim_editex('Niall', 'Neil')
    0.80000000000000004
    >>> sim_editex('aluminum', 'Catalan')
    0.25
    >>> sim_editex('ATCG', 'TAGC')
    0.25
    """
    return 1 - dist_editex(src, tar, cost, local)


def eudex_hamming(src, tar, weights='exponential', maxlength=8,
                  normalized=False):
    """Calculate the Hamming distance between the Eudex hashes of two terms.

    If weights is set to None, a simple Hamming distance is calculated.
    If weights is set to 'exponential', weight decays by powers of 2, as
         proposed in the eudex specification: https://github.com/ticki/eudex.
    If weights is set to 'fibonacci', weight decays through the Fibonacci
         series, as in the eudex reference implementation.
    If weights is set to a callable function, this assumes it creates a
         generator and the generator is used to populate a series of weights.
    If weights is set to an iterable, the iterable's values should be integers
         and will be used as the weights.

    :param str src, tar: two strings to be compared
    :param iterable or generator function weights: the weights (or the means
        of generating them) as described above
    :param maxlength: the number of characters to encode as a eudex hash
    :param bool normalized: if True, normalize the result by the maximum
        possible weighted distance
    :returns: the (optionally normalized) weighted Hamming distance between
        the Eudex hashes of src & tar
    """
    def _gen_fibonacci():
        """Yield the next Fibonacci number.

        Based on https://www.python-course.eu/generators.php
        Starts at Fibonacci number 3 (the second 1)
        """
        num_a, num_b = 1, 2
        while True:
            yield num_a
            num_a, num_b = num_b, num_a + num_b

    def _gen_exponential(base=2):
        """Yield the next value in an exponential series of the base.

        Based on https://www.python-course.eu/generators.php
        Starts at base**0
        """
        exp = 0
        while True:
            yield base ** exp
            exp += 1

    # Calculate the eudex hashes and XOR them
    xored = eudex(src, maxlength=maxlength) ^ eudex(tar, maxlength=maxlength)

    # Simple hamming distance (all bits are equal)
    if not weights:
        return bin(xored).count('1')

    # If weights is a function, it should create a generator,
    # which we now use to populate a list
    if callable(weights):
        weights = weights()
    elif weights == 'exponential':
        weights = _gen_exponential()
    elif weights == 'fibonacci':
        weights = _gen_fibonacci()
    if isinstance(weights, types.GeneratorType):
        weights = [next(weights) for _ in range(maxlength)][::-1]

    # Sum the weighted hamming distance
    dist = 0
    maxdist = 0
    while (xored or normalized) and weights:
        maxdist += 8*weights[-1]
        dist += bin(xored & 0xFF).count('1') * weights.pop()
        xored >>= 8

    if normalized:
        dist /= maxdist

    return dist


def dist_eudex(src, tar, weights='exponential', maxlength=8):
    """Return normalized Hamming distance between Eudex hashes of two terms.

    If weights is set to None, a simple Hamming distance is calculated.
    If weights is set to 'exponential', weight decays by powers of 2, as
        proposed in the eudex specification: https://github.com/ticki/eudex.
    If weights is set to 'fibonacci', weight decays through the Fibonacci
        series, as in the eudex reference implementation.
    If weights is set to a callable function, it is assumed to create a
        generator, and the generator is used to populate a series of weights.
    If weights is set to an iterable, the iterable's values should be
        integers and will be used as the weights.

    :param str src, tar: two strings to be compared
    :param weights: an iterable of integer weights, a generator function, or
        one of the presets described above
    :param int maxlength: the number of characters to encode as a eudex hash
    :returns: normalized Hamming distance between the Eudex hashes
    :rtype: float
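
    A boundary-case example (nontrivial values depend on the Eudex hashes,
    so none are asserted here):

    >>> dist_eudex('cat', 'cat')
    0.0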
    """
    return eudex_hamming(src, tar, weights, maxlength, True)


def sim_eudex(src, tar, weights='exponential', maxlength=8):
    """Return normalized Hamming similarity between Eudex hashes of two terms.

    If weights is set to None, a simple Hamming distance is calculated.
    If weights is set to 'exponential', weight decays by powers of 2, as
        proposed in the eudex specification: https://github.com/ticki/eudex.
    If weights is set to 'fibonacci', weight decays through the Fibonacci
        series, as in the eudex reference implementation.
    If weights is set to a callable function, it is assumed to create a
        generator, and the generator is used to populate a series of weights.
    If weights is set to an iterable, the iterable's values should be
        integers and will be used as the weights.

    :param str src, tar: two strings to be compared
    :param weights: an iterable of integer weights, a generator function, or
        one of the presets described above
    :param int maxlength: the number of characters to encode as a eudex hash
    :returns: normalized Hamming similarity between the Eudex hashes
    :rtype: float
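
    A boundary-case example (nontrivial values depend on the Eudex hashes,
    so none are asserted here):

    >>> sim_eudex('cat', 'cat')
    1.0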
    """
    return 1 - dist_eudex(src, tar, weights, maxlength)


def sift4_simplest(src, tar, max_offset=0):
    """Return the "simplest" Sift4 distance between two terms.

    This is an approximation of edit distance, described in:
    Zackwehdex, Siderite. 2014. "Super Fast and Accurate string distance
    algorithm: Sift4."
    https://siderite.blogspot.com/2014/11/super-fast-and-accurate-string-distance.html

    :param str src, tar: two strings to be compared
    :param int max_offset: the number of characters to search for matching
        letters
    :returns: the Sift4 distance according to the simplest formula
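
    Boundary-case examples (other values depend on max_offset, so none are
    asserted here):

    >>> sift4_simplest('cat', '')
    3
    >>> sift4_simplest('', 'hat')
    3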
    """
    if not src:
        return len(tar)

    if not tar:
        return len(src)

    src_len = len(src)
    tar_len = len(tar)

    src_cur = 0
    tar_cur = 0
    lcss = 0
    local_cs = 0

    while (src_cur < src_len) and (tar_cur < tar_len):
        if src[src_cur] == tar[tar_cur]:
            local_cs += 1
        else:
            lcss += local_cs
            local_cs = 0
            if src_cur != tar_cur:
                src_cur = tar_cur = max(src_cur, tar_cur)
            for i in range(max_offset):
                if not ((src_cur+i < src_len) or (tar_cur+i < tar_len)):
                    break
                if (src_cur+i < src_len) and (src[src_cur+i] == tar[tar_cur]):
                    src_cur += i
                    local_cs += 1
                    break
                if (tar_cur+i < tar_len) and (src[src_cur] == tar[tar_cur+i]):
                    tar_cur += i
                    local_cs += 1
                    break

        src_cur += 1
        tar_cur += 1

    lcss += local_cs
    return round(max(src_len, tar_len) - lcss)


def sift4_common(src, tar, max_offset=0, max_distance=0):
    """Return the "common" Sift4 distance between two terms.

    This is an approximation of edit distance, described in:
    Zackwehdex, Siderite. 2014. "Super Fast and Accurate string distance
    algorithm: Sift4."
    https://siderite.blogspot.com/2014/11/super-fast-and-accurate-string-distance.html

    :param str src, tar: two strings to be compared
    :param int max_offset: the number of characters to search for matching
        letters
    :param int max_distance: the distance at which to stop and exit
    :returns: the Sift4 distance according to the common formula
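
    Boundary-case examples (other values depend on max_offset &
    max_distance, so none are asserted here):

    >>> sift4_common('cat', '')
    3
    >>> sift4_common('', 'hat')
    3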
    """
    if not src:
        return len(tar)

    if not tar:
        return len(src)

    src_len = len(src)
    tar_len = len(tar)

    src_cur = 0
    tar_cur = 0
    lcss = 0
    local_cs = 0
    trans = 0
    offset_arr = []

    while (src_cur < src_len) and (tar_cur < tar_len):
        if src[src_cur] == tar[tar_cur]:
            local_cs += 1
            is_trans = False
            i = 0
            while i < len(offset_arr):
                ofs = offset_arr[i]
                if src_cur <= ofs['src_cur'] or tar_cur <= ofs['tar_cur']:
                    is_trans = (abs(tar_cur-src_cur) >=
                                abs(ofs['tar_cur']-ofs['src_cur']))
                    if is_trans:
                        trans += 1
                    elif not ofs['trans']:
                        ofs['trans'] = True
                        trans += 1
                    break
                elif src_cur > ofs['tar_cur'] and tar_cur > ofs['src_cur']:
                    del offset_arr[i]
                else:
                    i += 1

            offset_arr.append({'src_cur': src_cur, 'tar_cur': tar_cur,
                               'trans': is_trans})
        else:
            lcss += local_cs
            local_cs = 0
            if src_cur != tar_cur:
                src_cur = tar_cur = min(src_cur, tar_cur)
            for i in range(max_offset):
                if not ((src_cur+i < src_len) or (tar_cur+i < tar_len)):
                    break
                if (src_cur+i < src_len) and (src[src_cur+i] == tar[tar_cur]):
                    src_cur += i-1
                    tar_cur -= 1
                    break
                if (tar_cur+i < tar_len) and (src[src_cur] == tar[tar_cur+i]):
                    src_cur -= 1
                    tar_cur += i-1
                    break

        src_cur += 1
        tar_cur += 1

        if max_distance:
            temporary_distance = max(src_cur, tar_cur) - lcss + trans
            if temporary_distance >= max_distance:
                return round(temporary_distance)

        if (src_cur >= src_len) or (tar_cur >= tar_len):
            lcss += local_cs
            local_cs = 0
            src_cur = tar_cur = min(src_cur, tar_cur)

    lcss += local_cs
    return round(max(src_len, tar_len) - lcss + trans)


def dist_sift4(src, tar, max_offset=0, max_distance=0):
    """Return the normalized "common" Sift4 distance between two terms.

    This is an approximation of edit distance, described in:
    Zackwehdex, Siderite. 2014. "Super Fast and Accurate string distance
    algorithm: Sift4."
    https://siderite.blogspot.com/2014/11/super-fast-and-accurate-string-distance.html

    :param str src, tar: two strings to be compared
    :param int max_offset: the number of characters to search for matching
        letters
    :param int max_distance: the distance at which to stop and exit
    :returns: the normalized Sift4 distance
    :rtype: float
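
    A boundary-case example, for illustration:

    >>> dist_sift4('cat', 'cat')
    0.0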
    """
    return (sift4_common(src, tar, max_offset, max_distance) /
            (max(len(src), len(tar))))


def sim_sift4(src, tar, max_offset=0, max_distance=0):
    """Return the normalized "common" Sift4 similarity of two terms.

    This is an approximation of edit distance, described in:
    Zackwehdex, Siderite. 2014. "Super Fast and Accurate string distance
    algorithm: Sift4."
    https://siderite.blogspot.com/2014/11/super-fast-and-accurate-string-distance.html

    :param str src, tar: two strings to be compared
    :param int max_offset: the number of characters to search for matching
        letters
    :param int max_distance: the distance at which to stop and exit
    :returns: the normalized Sift4 similarity
    :rtype: float
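
    A boundary-case example, for illustration:

    >>> sim_sift4('cat', 'cat')
    1.0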
    """
    return 1 - dist_sift4(src, tar, max_offset, max_distance)


def sim_baystat(src, tar, min_ss_len=None, left_ext=None, right_ext=None):
    """Return the Baystat similarity.

    Good results for shorter words are reported when setting min_ss_len to 1
    and either left_ext OR right_ext to 1.

    The Baystat similarity is defined in:
    Fürnrohr, Michael, Birgit Rimmelspacher, and Tilman von Roncador. 2002.
    "Zusammenführung von Datenbeständen ohne numerische Identifikatoren: ein
    Verfahren im Rahmen der Testuntersuchungen zu einem registergestützten
    Zensus." Bayern in Zahlen, 2002(7). 308--321.
    https://www.statistik.bayern.de/medien/statistik/zensus/zusammenf__hrung_von_datenbest__nden_ohne_numerische_identifikatoren.pdf

    This is ostensibly a port of the R module PPRL's implementation:
    https://github.com/cran/PPRL/blob/master/src/MTB_Baystat.cpp
    As such, it could be made more Pythonic.

    :param str src, tar: two strings to be compared
    :param int min_ss_len: minimum substring length to be considered
    :param int left_ext: left-side extension length
    :param int right_ext: right-side extension length
    :returns: the Baystat similarity
    :rtype: float
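
    Boundary-case examples (nontrivial values depend on the parameters, so
    none are asserted here):

    >>> sim_baystat('cat', 'cat')
    1
    >>> sim_baystat('', 'hat')
    0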
    """
    if src == tar:
        return 1
    if not src or not tar:
        return 0

    max_len = max(len(src), len(tar))

    if min_ss_len is None or left_ext is None or right_ext is None:
        # These can be set via arguments to the function. Otherwise they are
        # set automatically based on values from the article.
        if max_len >= 7:
            min_ss_len = 2
            left_ext = 2
            right_ext = 2
        else:
            # The paper suggests that for short names, (exclusively) one or
            # the other of left_ext and right_ext can be 1, with good
            # results. I use 0 & 0 as the default in this case.
            min_ss_len = 1
            left_ext = 0
            right_ext = 0

    pos = 0
    match_len = 0

    while True:
        if pos + min_ss_len > len(src):
            return match_len/max_len

        hit_len = 0
        ix = 1

        substring = src[pos:pos + min_ss_len]
        search_begin = pos - left_ext

        if search_begin < 0:
            search_begin = 0
            left_ext_len = pos
        else:
            left_ext_len = left_ext

        if pos + min_ss_len + right_ext >= len(tar):
            right_ext_len = len(tar) - pos - min_ss_len
        else:
            right_ext_len = right_ext

        if (search_begin + left_ext_len + min_ss_len + right_ext_len >
                search_begin):
            search_val = tar[search_begin:(search_begin + left_ext_len +
                                           min_ss_len + right_ext_len)]
        else:
            search_val = ''

        flagged_tar = ''
        while substring in search_val and pos + ix <= len(src):
            hit_len = len(substring)
            flagged_tar = tar.replace(substring, '#'*hit_len)

            if pos + min_ss_len + ix <= len(src):
                substring = src[pos:pos + min_ss_len + ix]

            if pos+min_ss_len + right_ext_len + 1 <= len(tar):
                right_ext_len += 1

            if (search_begin + left_ext_len + min_ss_len + right_ext_len <=
                    len(tar)):
                search_val = tar[search_begin:(search_begin + left_ext_len +
                                               min_ss_len + right_ext_len)]

            ix += 1

        if hit_len > 0:
            tar = flagged_tar

        match_len += hit_len
        pos += ix


def dist_baystat(src, tar, min_ss_len=None, left_ext=None, right_ext=None):
    """Return the Baystat distance.

    :param str src, tar: two strings to be compared
    :param int min_ss_len: minimum substring length to be considered
    :param int left_ext: left-side extension length
    :param int right_ext: right-side extension length
    :returns: the Baystat distance
    :rtype: float
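
    A boundary-case example, for illustration:

    >>> dist_baystat('cat', 'cat')
    0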
    """
    return 1 - sim_baystat(src, tar, min_ss_len, left_ext, right_ext)


def typo(src, tar, metric='euclidean', cost=(1, 1, 0.5, 0.5)):
    """Return the typo distance between two strings.

    This is inspired by Wayne Song's Typo-Distance
    (https://github.com/wsong/Typo-Distance), and a fair bit of this was
    copied from his module. Compared to the original, this supports
    different metrics for substitution.

    :param str src, tar: two strings to be compared
    :param str metric: supported values include: 'euclidean', 'manhattan',
        'log-euclidean', and 'log-manhattan'
    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits: inserts, deletes, substitutions, and shift, respectively (by
        default: (1, 1, 0.5, 0.5)). The substitution & shift costs should be
        significantly less than the cost of an insertion & deletion unless
        a log metric is used.
    :returns: typo distance
    :rtype: float
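
    Boundary-case examples (values for distinct, nonempty strings depend on
    the metric and keyboard geometry, so none are asserted here):

    >>> typo('cat', 'cat')
    0.0
    >>> typo('', 'cat')
    3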
    """
    ins_cost, del_cost, sub_cost, shift_cost = cost

    if src == tar:
        return 0.0
    if not src:
        return len(tar) * ins_cost
    if not tar:
        return len(src) * del_cost

    layout = {'QWERTY': (
        (('`', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', '-', '='),
         ('', 'q', 'w', 'e', 'r', 't', 'y', 'u', 'i', 'o', 'p', '[', ']',
          '\\'),
         ('', 'a', 's', 'd', 'f', 'g', 'h', 'j', 'k', 'l', ';', '\''),
         ('', 'z', 'x', 'c', 'v', 'b', 'n', 'm', ',', '.', '/'),
         ('', '', ' ', ' ', ' ', ' ', ' ', ' ', ' ')),
        (('~', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '+'),
         ('', 'Q', 'W', 'E', 'R', 'T', 'Y', 'U', 'I', 'O', 'P', '{', '}', '|'),
         ('', 'A', 'S', 'D', 'F', 'G', 'H', 'J', 'K', 'L', ':', '"'),
         ('', 'Z', 'X', 'C', 'V', 'B', 'N', 'M', '<', '>', '?'),
         ('', '', ' ', ' ', ' ', ' ', ' ', ' ', ' '))
    )}

    keyboard = layout['QWERTY']
    lowercase = {item for sublist in keyboard[0] for item in sublist}
    uppercase = {item for sublist in keyboard[1] for item in sublist}

    def _kb_array_for_char(char):
        """Return the keyboard layout that contains char."""
        if char in lowercase:
            return keyboard[0]
        if char in uppercase:
            return keyboard[1]
        raise ValueError(char + ' not found in any keyboard layouts')

    def _get_char_coord(char, kb_array):
        """Return the row & column of char in the keyboard array."""
        for row in kb_array:
            if char in row:
                return kb_array.index(row), row.index(char)
        raise ValueError(char + ' not found in given keyboard layout')

    def _euclidean_keyboard_distance(char1, char2):
        row1, col1 = _get_char_coord(char1, _kb_array_for_char(char1))
        row2, col2 = _get_char_coord(char2, _kb_array_for_char(char2))
        return ((row1 - row2) ** 2 + (col1 - col2) ** 2) ** 0.5

    def _manhattan_keyboard_distance(char1, char2):
        row1, col1 = _get_char_coord(char1, _kb_array_for_char(char1))
        row2, col2 = _get_char_coord(char2, _kb_array_for_char(char2))
        return abs(row1 - row2) + abs(col1 - col2)

    def _log_euclidean_keyboard_distance(char1, char2):
        return math.log(1 + _euclidean_keyboard_distance(char1, char2))

    def _log_manhattan_keyboard_distance(char1, char2):
        return math.log(1 + _manhattan_keyboard_distance(char1, char2))

    metric_dict = {'euclidean': _euclidean_keyboard_distance,
                   'manhattan': _manhattan_keyboard_distance,
                   'log-euclidean': _log_euclidean_keyboard_distance,
                   'log-manhattan': _log_manhattan_keyboard_distance}

    def substitution_cost(char1, char2):
        cost = sub_cost
        cost *= (metric_dict[metric](char1, char2) +
                 shift_cost * (_kb_array_for_char(char1) !=
                               _kb_array_for_char(char2)))
        return cost

    d_mat = np.zeros((len(src) + 1, len(tar) + 1), dtype=np.float32)
    for i in range(len(src) + 1):
        d_mat[i, 0] = i * del_cost
    for j in range(len(tar) + 1):
        d_mat[0, j] = j * ins_cost

    for i in range(len(src)):
        for j in range(len(tar)):
            d_mat[i + 1, j + 1] = min(
                d_mat[i + 1, j] + ins_cost,  # ins
                d_mat[i, j + 1] + del_cost,  # del
                d_mat[i, j] + (substitution_cost(src[i], tar[j])
                               if src[i] != tar[j] else 0)  # sub/==
            )

    return d_mat[len(src), len(tar)]


def dist_typo(src, tar, metric='euclidean', cost=(1, 1, 0.5, 0.5)):
    """Return the normalized typo distance between two strings.

    This is inspired by Wayne Song's Typo-Distance
    (https://github.com/wsong/Typo-Distance), and a fair bit of this was
    copied from his module. Compared to the original, this supports
    different metrics for substitution.

    :param str src, tar: two strings to be compared
    :param str metric: supported values include: 'euclidean', 'manhattan',
        'log-euclidean', and 'log-manhattan'
    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits: inserts, deletes, substitutions, and shift, respectively (by
        default: (1, 1, 0.5, 0.5)). The substitution & shift costs should be
        significantly less than the cost of an insertion & deletion unless
        a log metric is used.
    :returns: normalized typo distance
    :rtype: float
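
    A boundary-case example, for illustration:

    >>> dist_typo('cat', 'cat')
    0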
    """
    if src == tar:
        return 0
    ins_cost, del_cost = cost[:2]
    return (typo(src, tar, metric, cost) /
            (max(len(src)*del_cost, len(tar)*ins_cost)))


def sim_typo(src, tar, metric='euclidean', cost=(1, 1, 0.5, 0.5)):
    """Return the normalized typo similarity between two strings.

    This is inspired by Wayne Song's Typo-Distance
    (https://github.com/wsong/Typo-Distance), and a fair bit of this was
    copied from his module. Compared to the original, this supports
    different metrics for substitution.

    :param str src, tar: two strings to be compared
    :param str metric: supported values include: 'euclidean', 'manhattan',
        'log-euclidean', and 'log-manhattan'
    :param tuple cost: a 4-tuple representing the cost of the four possible
        edits: inserts, deletes, substitutions, and shift, respectively (by
        default: (1, 1, 0.5, 0.5)). The substitution & shift costs should be
        significantly less than the cost of an insertion & deletion unless
        a log metric is used.
    :returns: normalized typo similarity
    :rtype: float
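
    A boundary-case example, for illustration:

    >>> sim_typo('cat', 'cat')
    1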
    """
    return 1 - dist_typo(src, tar, metric, cost)


def synoname(src, tar, word_approx_min=0.3, char_approx_min=0.73,
             tests=2**11-1):
    """Return the Synoname similarity type of two words.

    NB: this function is incomplete and still under development, so its
    return value is not yet meaningful.

    :param src: a source name, either as a (last, first, qualifier) tuple or
        as a '#'-delimited string with the last name, first name, and
        qualifier in fields 1 through 3
    :param tar: a target name, in the same format as src
    :param float word_approx_min: the minimum word approximation value to
        signal a 'word_approx' match
    :param float char_approx_min: the minimum character approximation value
        to signal a 'char_approx' match
    :param tests: either an integer bitmask of the tests to perform or an
        iterable of test names to perform
    :returns: a Synoname match type
    """
    punct = {'=', "'", '-', '|', '.', ' '}
    test_dict = {val: 2**n for n, val in enumerate([
        'exact', 'omission', 'substitution', 'transposition', 'punctuation',
        'initials', 'extended', 'inclusion', 'no_first', 'word_approx',
        'confusions', 'char_approx'])}
    match_type_dict = {val: n for n, val in enumerate([
        'exact', 'omission', 'substitution', 'transposition', 'punctuation',
        'initials', 'extended', 'inclusion', 'no_first', 'word_approx',
        'confusions', 'char_approx'], 1)}

    if isinstance(tests, Iterable):
        new_tests = 0
        for term in tests:
            if term in test_dict:
                new_tests += test_dict[term]
        tests = new_tests

    if isinstance(src, tuple):
        src_ln, src_fn, src_qual = src
    else:
        src_ln, src_fn, src_qual = src.split('#')[1:4]
    if isinstance(tar, tuple):
        tar_ln, tar_fn, tar_qual = tar
    else:
        tar_ln, tar_fn, tar_qual = tar.split('#')[1:4]

    # 1. Preprocessing

    # Lowercasing
    src_fn = src_fn.strip().lower()
    src_ln = src_ln.strip().lower()
    src_qual = src_qual.strip().lower()

    tar_fn = tar_fn.strip().lower()
    tar_ln = tar_ln.strip().lower()
    tar_qual = tar_qual.strip().lower()

    # Create toolcodes
    src_fn, src_ln, src_tc = synoname_toolcode(src_fn, src_ln, src_qual)
    tar_fn, tar_ln, tar_tc = synoname_toolcode(tar_fn, tar_ln, tar_qual)

    src_qualcode = int(src_tc[0])
    src_punctcode = int(src_tc[1])
    src_generation = int(src_tc[2])
    src_romancode = int(src_tc[3:6])
    src_len_first = int(src_tc[6:8])
    src_len_last = int(src_tc[8:10])
    src_tc = src_tc.split('#')
    src_specials = src_tc[1]
    src_search_range = src_tc[2]
    src_len_specials = len(src_specials)

    tar_qualcode = int(tar_tc[0])
    tar_punctcode = int(tar_tc[1])
    tar_generation = int(tar_tc[2])
    tar_romancode = int(tar_tc[3:6])
    tar_len_first = int(tar_tc[6:8])
    tar_len_last = int(tar_tc[8:10])
    tar_tc = tar_tc.split('#')
    tar_specials = tar_tc[1]
    tar_search_range = tar_tc[2]
    tar_len_specials = len(tar_specials)

    qual_conflict = src_qual != tar_qual
    gen_conflict = (src_generation != tar_generation and
                    (src_generation or tar_generation))
    roman_conflict = (src_romancode != tar_romancode and
                      (src_romancode or tar_romancode))

    # approx_c
    def approx_c():
        if gen_conflict or roman_conflict:
            return False, 0.0
        full_name = ' '.join((tar_ln, tar_fn))
        if full_name.startswith('master '):
            full_name = full_name[len('master '):]
            for intro in ['of the ', 'of ', 'known as the ', 'with the ',
                          'with ']:
                if full_name.startswith(intro):
                    full_name = full_name[len(intro):]

        # In the full algorithm, a capitalized form of the name is compared:
        # ca_ratio = simil(cap_full_name, full_name)
        # As a placeholder (so that this work in progress runs), compare the
        # source full name to the target full name instead.
        src_full_name = ' '.join((src_ln, src_fn))
        ca_ratio = simil(src_full_name, full_name)
        return ca_ratio >= char_approx_min, ca_ratio

    def simil(src, tar):
        return 100*sim_ratcliff_obershelp(src, tar)

    approx_c_result, ca_ratio = approx_c()

    if ca_ratio >= char_approx_min and ca_ratio >= 70:
        if tests & test_dict['exact'] and src == tar:
            return 1


def sim_tfidf(src, tar, qval=2, docs_src=None, docs_tar=None):
    """Return the TF-IDF similarity of two strings.

    NB: this implementation is unfinished and currently returns a
    placeholder value for distinct strings.

    This is chiefly based on the "Formal Definition of TF/IDF Distance" at:
    http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/TfIdfDistance.html

    :param str src, tar: two strings to be compared (or QGrams/Counter
        objects)
    :param int qval: the length of each q-gram; 0 or None for non-q-gram
        version
    :param Counter docs_src: a Counter object or string representing the
        document corpus for the src string
    :param Counter docs_tar: a Counter object or string representing the
        document corpus for the tar string (or set to None to use the
        docs_src for both)
    :returns: TF-IDF similarity
    :rtype: float
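
    While the implementation is unfinished, only the identical-strings case
    is stable, so that is the sole example asserted here:

    >>> sim_tfidf('cat', 'cat')
    1.0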
    """
    if src == tar:
        return 1.0  # TODO: confirm correctness of this when docs are different
    if not src or not tar:
        return 0.0

    q_src, q_tar = _get_qgrams(src, tar, qval)

    if isinstance(docs_src, Counter):
        q_docs = docs_src
    elif qval and qval > 0:
        q_docs = QGrams(docs_src, qval)
    else:
        q_docs = Counter(docs_src.strip().split())

    if not q_src or not q_tar:
        return 0.0

    # TODO: finish implementation
    return 0.5  # hardcoded to half

###############################################################################


def sim(src, tar, method=sim_levenshtein):
    """Return a similarity of two strings.

    This is a generalized function for calling other similarity functions.

    :param str src, tar: two strings to be compared
    :param function method: specifies the similarity metric (Levenshtein by
        default)
    :returns: similarity according to the specified function
    :rtype: float

    >>> sim('cat', 'hat')
    0.66666666666666674
    >>> sim('Niall', 'Neil')
    0.40000000000000002
    >>> sim('aluminum', 'Catalan')
    0.125
    >>> sim('ATCG', 'TAGC')
    0.25
    """
    if callable(method):
        return method(src, tar)
    raise AttributeError('Unknown similarity function: ' + str(method))


def dist(src, tar, method=sim_levenshtein):
    """Return a distance between two strings.

    This is a generalized function for calling other distance functions.

    :param str src, tar: two strings to be compared
    :param function method: specifies the similarity metric (Levenshtein by
        default) -- Note that this takes a similarity metric function, not
        a distance metric function.
    :returns: distance according to the specified function
    :rtype: float

    >>> dist('cat', 'hat')
    0.33333333333333326
    >>> dist('Niall', 'Neil')
    0.59999999999999998
    >>> dist('aluminum', 'Catalan')
    0.875
    >>> dist('ATCG', 'TAGC')
    0.75
    """
    if callable(method):
        return 1 - method(src, tar)
    raise AttributeError('Unknown distance function: ' + str(method))


if __name__ == '__main__':
    import doctest
    doctest.testmod()