Completed
Push — master ( 6ed6e1...91db7a )
by Chris
13:26
created

abydos.phonetic.soundex.soundex()   F

Complexity

Conditions 13

Size

Total Lines 146
Code Lines 68

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 31
CRAP Score 13

Importance

Changes 0

Metric Value
cc 13
eloc 68
nop 5
dl 0
loc 146
ccs 31
cts 31
cp 1
crap 13
rs 3.9872
c 0
b 0
f 0
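
The CRAP score of 13 in the table is consistent with the commonly cited Change Risk Anti-Patterns formula, assuming that cc is the cyclomatic complexity and cp is the coverage expressed as a fraction (1 meaning fully covered). The page does not state the formula it uses, so the sketch below is an assumption, and the function name crap_score is hypothetical.

```python
def crap_score(complexity, coverage):
    """Return complexity^2 * (1 - coverage)^3 + complexity.

    complexity: cyclomatic complexity (assumed to be the "cc" metric)
    coverage:   covered fraction in [0, 1] (assumed to be the "cp" metric)
    """
    return complexity ** 2 * (1.0 - coverage) ** 3 + complexity


# With full coverage the risk term vanishes and the score equals cc.
score = crap_score(13, 1.0)
```

With full coverage the score reduces to the complexity itself, so under this formula lowering the score of soundex() would mean reducing its number of conditions.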

How to fix: Long Method, Complexity

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:

Complexity

Complex code units like abydos.phonetic.soundex.soundex() often do a lot of different things. To break one down, we need to identify a cohesive component within it. For a class, a common approach to finding such a component is to look for fields/methods that share the same prefixes or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

# -*- coding: utf-8 -*-
Issue (Coding Style): Too many lines in module (1178/1000)

# Copyright 2014-2018 by Christopher C. Little.
# This file is part of Abydos.
#
# Abydos is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Abydos is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.

"""abydos.phonetic.soundex.

The phonetic.soundex module implements phonetic algorithms that are generally
Soundex-like, including:

    - American Soundex
    - Refined Soundex
    - Fuzzy Soundex
    - Phonex
    - Phonix
    - Lein
    - PSHP Soundex/Viewex Coding

Being Soundex-like, for the purposes of this module means: targeted at English,
returning a code that starts with a letter and continues with (usually 3)
numerals, and mostly based on a simple translation table.
"""

from __future__ import unicode_literals

from unicodedata import normalize as unicode_normalize

from six import text_type
from six.moves import range

from . import _delete_consecutive_repeats

__all__ = [
    'fuzzy_soundex',
    'lein',
    'phonex',
    'phonix',
    'pshp_soundex_first',
    'pshp_soundex_last',
    'refined_soundex',
    'soundex',
]


def soundex(word, max_length=4, var='American', reverse=False, zero_pad=True):
    """Return the Soundex code for a word.

    :param str word: the word to transform
    :param int max_length: the length of the code returned (defaults to 4)
    :param str var: the variant of the algorithm to employ (defaults to
        'American'):

        - 'American' follows the American Soundex algorithm, as described at
          :cite:`US:2007` and in :cite:`Knuth:1998`; this is also called
          Miracode
        - 'special' follows the rules from the 1880-1910 US Census
          retrospective re-analysis, in which h & w are not treated as blocking
          consonants but as vowels. Cf. :cite:`Repici:2013`.
        - 'Census' follows the rules laid out in GIL 55 :cite:`US:1997` by the
          US Census, including coding prefixed and unprefixed versions of some
          names

    :param bool reverse: reverse the word before computing the selected Soundex
        (defaults to False); This results in "Reverse Soundex", which is useful
        for blocking in cases where the initial elements may be in error.
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
        max_length string
    :returns: the Soundex value
    :rtype: str

    >>> soundex("Christopher")
    'C623'
    >>> soundex("Niall")
    'N400'
    >>> soundex('Smith')
    'S530'
    >>> soundex('Schmidt')
    'S530'

    >>> soundex('Christopher', max_length=-1)
    'C623160000000000000000000000000000000000000000000000000000000000'
    >>> soundex('Christopher', max_length=-1, zero_pad=False)
    'C62316'

    >>> soundex('Christopher', reverse=True)
    'R132'

    >>> soundex('Ashcroft')
    'A261'
    >>> soundex('Asicroft')
    'A226'
    >>> soundex('Ashcroft', var='special')
    'A226'
    >>> soundex('Asicroft', var='special')
    'A226'
    """
    _soundex_translation = dict(
        zip(
            (ord(_) for _ in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
Issue (Comprehensibility, Best Practice): The variable _ does not seem to be defined.
            '01230129022455012623019202',
        )
    )

    # Require a max_length of at least 4 and not more than 64
    if max_length != -1:
        max_length = min(max(4, max_length), 64)
    else:
        max_length = 64

    # uppercase, normalize, decompose, and filter non-A-Z out
    word = unicode_normalize('NFKD', text_type(word.upper()))
    word = word.replace('ß', 'SS')

    if var == 'Census':
        # TODO: Should these prefixes be supplemented? (VANDE, DELA, VON)
Issue (Coding Style): TODO and FIXME comments should generally be avoided.
        if word[:3] in {'VAN', 'CON'} and len(word) > 4:
            return (
                soundex(word, max_length, 'American', reverse, zero_pad),
                soundex(word[3:], max_length, 'American', reverse, zero_pad),
            )
        if word[:2] in {'DE', 'DI', 'LA', 'LE'} and len(word) > 3:
            return (
                soundex(word, max_length, 'American', reverse, zero_pad),
                soundex(word[2:], max_length, 'American', reverse, zero_pad),
            )
        # Otherwise, proceed as usual (var='American' mode, ostensibly)

    word = ''.join(
        c
        for c in word
        if c
        in {
            'A',
            'B',
            'C',
            'D',
            'E',
            'F',
            'G',
            'H',
            'I',
            'J',
            'K',
            'L',
            'M',
            'N',
            'O',
            'P',
            'Q',
            'R',
            'S',
            'T',
            'U',
            'V',
            'W',
            'X',
            'Y',
            'Z',
        }
    )

    # Nothing to convert, return base case
    if not word:
        if zero_pad:
            return '0' * max_length
        return '0'

    # Reverse word if computing Reverse Soundex
    if reverse:
        word = word[::-1]

    # apply the Soundex algorithm
    sdx = word.translate(_soundex_translation)

    if var == 'special':
        sdx = sdx.replace('9', '0')  # special rule for 1880-1910 census
    else:
        sdx = sdx.replace('9', '')  # rule 1
    sdx = _delete_consecutive_repeats(sdx)  # rule 3

    if word[0] in 'HW':
        sdx = word[0] + sdx
    else:
        sdx = word[0] + sdx[1:]
    sdx = sdx.replace('0', '')  # rule 1

    if zero_pad:
        sdx += '0' * max_length  # rule 4

    return sdx[:max_length]


def refined_soundex(word, max_length=-1, zero_pad=False, retain_vowels=False):
    """Return the Refined Soundex code for a word.

    This is Soundex, but with more character classes. It was defined at
    :cite:`Boyce:1998`.

    :param word: the word to transform
    :param max_length: the length of the code returned (defaults to unlimited)
    :param zero_pad: pad the end of the return value with 0s to achieve a
        max_length string
    :param retain_vowels: retain vowels (as 0) in the resulting code
    :returns: the Refined Soundex value
    :rtype: str

    >>> refined_soundex('Christopher')
    'C393619'
    >>> refined_soundex('Niall')
    'N87'
    >>> refined_soundex('Smith')
    'S386'
    >>> refined_soundex('Schmidt')
    'S386'
    """
    _ref_soundex_translation = dict(
        zip(
            (ord(_) for _ in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
Issue (Comprehensibility, Best Practice): The variable _ does not seem to be defined.
            '01360240043788015936020505',
        )
    )

    # uppercase, normalize, decompose, and filter non-A-Z out
    word = unicode_normalize('NFKD', text_type(word.upper()))
    word = word.replace('ß', 'SS')
    word = ''.join(
        c
        for c in word
        if c
        in {
            'A',
            'B',
            'C',
            'D',
            'E',
            'F',
            'G',
            'H',
            'I',
            'J',
            'K',
            'L',
            'M',
            'N',
            'O',
            'P',
            'Q',
            'R',
            'S',
            'T',
            'U',
            'V',
            'W',
            'X',
            'Y',
            'Z',
        }
    )

    # apply the Soundex algorithm
    sdx = word[:1] + word.translate(_ref_soundex_translation)
    sdx = _delete_consecutive_repeats(sdx)
    if not retain_vowels:
        sdx = sdx.replace('0', '')  # Delete vowels, H, W, Y

    if max_length > 0:
        if zero_pad:
            sdx += '0' * max_length
        sdx = sdx[:max_length]

    return sdx


def fuzzy_soundex(word, max_length=5, zero_pad=True):
    """Return the Fuzzy Soundex code for a word.

    Fuzzy Soundex is an algorithm derived from Soundex, defined in
    :cite:`Holmes:2002`.

    :param str word: the word to transform
    :param int max_length: the length of the code returned (defaults to 5)
    :param bool zero_pad: pad the end of the return value with 0s to achieve
        a max_length string
    :returns: the Fuzzy Soundex value
    :rtype: str

    >>> fuzzy_soundex('Christopher')
    'K6931'
    >>> fuzzy_soundex('Niall')
    'N4000'
    >>> fuzzy_soundex('Smith')
    'S5300'
    >>> fuzzy_soundex('Smith')
    'S5300'
    """
    _fuzzy_soundex_translation = dict(
        zip(
            (ord(_) for _ in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
Issue (Comprehensibility, Best Practice): The variable _ does not seem to be defined.
            '0193017-07745501769301-7-9',
        )
    )

    word = unicode_normalize('NFKD', text_type(word.upper()))
    word = word.replace('ß', 'SS')

    # Clamp max_length to [4, 64]
    if max_length != -1:
        max_length = min(max(4, max_length), 64)
    else:
        max_length = 64

    if not word:
        if zero_pad:
            return '0' * max_length
        return '0'

    if word[:2] in {'CS', 'CZ', 'TS', 'TZ'}:
        word = 'SS' + word[2:]
    elif word[:2] == 'GN':
        word = 'NN' + word[2:]
    elif word[:2] in {'HR', 'WR'}:
        word = 'RR' + word[2:]
    elif word[:2] == 'HW':
        word = 'WW' + word[2:]
    elif word[:2] in {'KN', 'NG'}:
        word = 'NN' + word[2:]

    if word[-2:] == 'CH':
        word = word[:-2] + 'KK'
    elif word[-2:] == 'NT':
        word = word[:-2] + 'TT'
    elif word[-2:] == 'RT':
        word = word[:-2] + 'RR'
    elif word[-3:] == 'RDT':
        word = word[:-3] + 'RR'

    word = word.replace('CA', 'KA')
    word = word.replace('CC', 'KK')
    word = word.replace('CK', 'KK')
    word = word.replace('CE', 'SE')
    word = word.replace('CHL', 'KL')
    word = word.replace('CL', 'KL')
    word = word.replace('CHR', 'KR')
    word = word.replace('CR', 'KR')
    word = word.replace('CI', 'SI')
    word = word.replace('CO', 'KO')
    word = word.replace('CU', 'KU')
    word = word.replace('CY', 'SY')
    word = word.replace('DG', 'GG')
    word = word.replace('GH', 'HH')
    word = word.replace('MAC', 'MK')
    word = word.replace('MC', 'MK')
    word = word.replace('NST', 'NSS')
    word = word.replace('PF', 'FF')
    word = word.replace('PH', 'FF')
    word = word.replace('SCH', 'SSS')
    word = word.replace('TIO', 'SIO')
    word = word.replace('TIA', 'SIO')
    word = word.replace('TCH', 'CHH')

    sdx = word.translate(_fuzzy_soundex_translation)
    sdx = sdx.replace('-', '')

    # remove repeating characters
    sdx = _delete_consecutive_repeats(sdx)

    if word[0] in {'H', 'W', 'Y'}:
        sdx = word[0] + sdx
    else:
        sdx = word[0] + sdx[1:]

    sdx = sdx.replace('0', '')

    if zero_pad:
        sdx += '0' * max_length

    return sdx[:max_length]


def phonex(word, max_length=4, zero_pad=True):
    """Return the Phonex code for a word.

    Phonex is an algorithm derived from Soundex, defined in :cite:`Lait:1996`.

    :param str word: the word to transform
    :param int max_length: the length of the code returned (defaults to 4)
    :param bool zero_pad: pad the end of the return value with 0s to achieve
        a max_length string
    :returns: the Phonex value
    :rtype: str

    >>> phonex('Christopher')
    'C623'
    >>> phonex('Niall')
    'N400'
    >>> phonex('Schmidt')
    'S253'
    >>> phonex('Smith')
    'S530'
    """
    name = unicode_normalize('NFKD', text_type(word.upper()))
    name = name.replace('ß', 'SS')

    # Clamp max_length to [4, 64]
    if max_length != -1:
        max_length = min(max(4, max_length), 64)
    else:
        max_length = 64

    name_code = last = ''

    # Deletions effected by replacing with next letter which
    # will be ignored due to duplicate handling of Soundex code.
    # This is faster than 'moving' all subsequent letters.

    # Remove any trailing Ss
    while name[-1:] == 'S':
        name = name[:-1]

    # Phonetic equivalents of first 2 characters
    # Works since duplicate letters are ignored
    if name[:2] == 'KN':
        name = 'N' + name[2:]  # KN.. == N..
    elif name[:2] == 'PH':
        name = 'F' + name[2:]  # PH.. == F.. (H ignored anyway)
    elif name[:2] == 'WR':
        name = 'R' + name[2:]  # WR.. == R..

    if name:
        # Special case, ignore H first letter (subsequent Hs ignored anyway)
        # Works since duplicate letters are ignored
        if name[0] == 'H':
            name = name[1:]

    if name:
        # Phonetic equivalents of first character
        if name[0] in {'A', 'E', 'I', 'O', 'U', 'Y'}:
            name = 'A' + name[1:]
        elif name[0] in {'B', 'P'}:
            name = 'B' + name[1:]
        elif name[0] in {'V', 'F'}:
            name = 'F' + name[1:]
        elif name[0] in {'C', 'K', 'Q'}:
            name = 'C' + name[1:]
        elif name[0] in {'G', 'J'}:
            name = 'G' + name[1:]
        elif name[0] in {'S', 'Z'}:
            name = 'S' + name[1:]

        name_code = last = name[0]

    # Modified Soundex code
    for i in range(1, len(name)):
        code = '0'
        if name[i] in {'B', 'F', 'P', 'V'}:
            code = '1'
        elif name[i] in {'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'}:
            code = '2'
        elif name[i] in {'D', 'T'}:
            if name[i + 1 : i + 2] != 'C':
                code = '3'
        elif name[i] == 'L':
            if name[i + 1 : i + 2] in {
                'A',
                'E',
                'I',
                'O',
                'U',
                'Y',
Issue (Coding Style, reported for each of the six lines above): Wrong hanging indentation before block (add 4 spaces).
            } or i + 1 == len(name):
                code = '4'
        elif name[i] in {'M', 'N'}:
            if name[i + 1 : i + 2] in {'D', 'G'}:
                name = name[: i + 1] + name[i] + name[i + 2 :]
            code = '5'
        elif name[i] == 'R':
            if name[i + 1 : i + 2] in {
                'A',
                'E',
                'I',
                'O',
                'U',
                'Y',
Issue (Coding Style, reported for each of the six lines above): Wrong hanging indentation before block (add 4 spaces).
            } or i + 1 == len(name):
                code = '6'

        if code != last and code != '0' and i != 0:
            name_code += code

        last = name_code[-1]

    if zero_pad:
        name_code += '0' * max_length
    if not name_code:
        name_code = '0'
    return name_code[:max_length]


def phonix(word, max_length=4, zero_pad=True):
    """Return the Phonix code for a word.

    Phonix is a Soundex-like algorithm defined in :cite:`Gadd:1990`.

    This implementation is based on:
    - :cite:`Pfeifer:2000`
    - :cite:`Christen:2011`
    - :cite:`Kollar:2007`

    :param str word: the word to transform
    :param int max_length: the length of the code returned (defaults to 4)
    :param bool zero_pad: pad the end of the return value with 0s to achieve
        a max_length string
    :returns: the Phonix value
    :rtype: str

    >>> phonix('Christopher')
    'K683'
    >>> phonix('Niall')
    'N400'
    >>> phonix('Smith')
    'S530'
    >>> phonix('Schmidt')
    'S530'
    """

    def _start_repl(word, src, tar, post=None):
        r"""Replace src with tar at the start of word."""
        if post:
            for i in post:
                if word.startswith(src + i):
                    return tar + word[len(src) :]
        elif word.startswith(src):
            return tar + word[len(src) :]
        return word

    def _end_repl(word, src, tar, pre=None):
        r"""Replace src with tar at the end of word."""
        if pre:
            for i in pre:
                if word.endswith(i + src):
                    return word[: -len(src)] + tar
        elif word.endswith(src):
            return word[: -len(src)] + tar
        return word

    def _mid_repl(word, src, tar, pre=None, post=None):
        r"""Replace src with tar in the middle of word."""
        if pre or post:
            if not pre:
                return word[0] + _all_repl(word[1:], src, tar, pre, post)
            elif not post:
                return _all_repl(word[:-1], src, tar, pre, post) + word[-1]
            return _all_repl(word, src, tar, pre, post)
        return word[0] + _all_repl(word[1:-1], src, tar, pre, post) + word[-1]

    def _all_repl(word, src, tar, pre=None, post=None):
        r"""Replace src with tar anywhere in word."""
        if pre or post:
            if post:
                post = post
            else:
                post = frozenset(('',))
            if pre:
                pre = pre
            else:
                pre = frozenset(('',))

            for i, j in ((i, j) for i in pre for j in post):
                word = word.replace(i + src + j, i + tar + j)
            return word
        else:
            return word.replace(src, tar)

    _vow = {'A', 'E', 'I', 'O', 'U'}
    _con = {
        'B',
        'C',
        'D',
        'F',
        'G',
        'H',
        'J',
        'K',
        'L',
        'M',
        'N',
        'P',
        'Q',
        'R',
        'S',
        'T',
        'V',
        'W',
        'X',
        'Y',
        'Z',
    }

    _phonix_substitutions = (
        (_all_repl, 'DG', 'G'),
        (_all_repl, 'CO', 'KO'),
        (_all_repl, 'CA', 'KA'),
        (_all_repl, 'CU', 'KU'),
        (_all_repl, 'CY', 'SI'),
        (_all_repl, 'CI', 'SI'),
        (_all_repl, 'CE', 'SE'),
        (_start_repl, 'CL', 'KL', _vow),
        (_all_repl, 'CK', 'K'),
        (_end_repl, 'GC', 'K'),
        (_end_repl, 'JC', 'K'),
        (_start_repl, 'CHR', 'KR', _vow),
        (_start_repl, 'CR', 'KR', _vow),
        (_start_repl, 'WR', 'R'),
        (_all_repl, 'NC', 'NK'),
        (_all_repl, 'CT', 'KT'),
        (_all_repl, 'PH', 'F'),
        (_all_repl, 'AA', 'AR'),
        (_all_repl, 'SCH', 'SH'),
        (_all_repl, 'BTL', 'TL'),
        (_all_repl, 'GHT', 'T'),
        (_all_repl, 'AUGH', 'ARF'),
        (_mid_repl, 'LJ', 'LD', _vow, _vow),
        (_all_repl, 'LOUGH', 'LOW'),
        (_start_repl, 'Q', 'KW'),
        (_start_repl, 'KN', 'N'),
        (_end_repl, 'GN', 'N'),
        (_all_repl, 'GHN', 'N'),
        (_end_repl, 'GNE', 'N'),
        (_all_repl, 'GHNE', 'NE'),
        (_end_repl, 'GNES', 'NS'),
        (_start_repl, 'GN', 'N'),
        (_mid_repl, 'GN', 'N', None, _con),
        (_end_repl, 'GN', 'N'),
        (_start_repl, 'PS', 'S'),
        (_start_repl, 'PT', 'T'),
        (_start_repl, 'CZ', 'C'),
        (_mid_repl, 'WZ', 'Z', _vow),
        (_mid_repl, 'CZ', 'CH'),
        (_all_repl, 'LZ', 'LSH'),
        (_all_repl, 'RZ', 'RSH'),
        (_mid_repl, 'Z', 'S', None, _vow),
        (_all_repl, 'ZZ', 'TS'),
        (_mid_repl, 'Z', 'TS', _con),
        (_all_repl, 'HROUG', 'REW'),
        (_all_repl, 'OUGH', 'OF'),
        (_mid_repl, 'Q', 'KW', _vow, _vow),
        (_mid_repl, 'J', 'Y', _vow, _vow),
        (_start_repl, 'YJ', 'Y', _vow),
        (_start_repl, 'GH', 'G'),
        (_end_repl, 'GH', 'E', _vow),
        (_start_repl, 'CY', 'S'),
        (_all_repl, 'NX', 'NKS'),
        (_start_repl, 'PF', 'F'),
        (_end_repl, 'DT', 'T'),
        (_end_repl, 'TL', 'TIL'),
        (_end_repl, 'DL', 'DIL'),
        (_all_repl, 'YTH', 'ITH'),
        (_start_repl, 'TJ', 'CH', _vow),
        (_start_repl, 'TSJ', 'CH', _vow),
        (_start_repl, 'TS', 'T', _vow),
        (_all_repl, 'TCH', 'CH'),
        (_mid_repl, 'WSK', 'VSKIE', _vow),
        (_end_repl, 'WSK', 'VSKIE', _vow),
        (_start_repl, 'MN', 'N', _vow),
        (_start_repl, 'PN', 'N', _vow),
        (_mid_repl, 'STL', 'SL', _vow),
        (_end_repl, 'STL', 'SL', _vow),
        (_end_repl, 'TNT', 'ENT'),
        (_end_repl, 'EAUX', 'OH'),
        (_all_repl, 'EXCI', 'ECS'),
        (_all_repl, 'X', 'ECS'),
        (_end_repl, 'NED', 'ND'),
        (_all_repl, 'JR', 'DR'),
        (_end_repl, 'EE', 'EA'),
        (_all_repl, 'ZS', 'S'),
        (_mid_repl, 'R', 'AH', _vow, _con),
        (_end_repl, 'R', 'AH', _vow),
        (_mid_repl, 'HR', 'AH', _vow, _con),
        (_end_repl, 'HR', 'AH', _vow),
        (_end_repl, 'HR', 'AH', _vow),
        (_end_repl, 'RE', 'AR'),
        (_end_repl, 'R', 'AH', _vow),
        (_all_repl, 'LLE', 'LE'),
        (_end_repl, 'LE', 'ILE', _con),
        (_end_repl, 'LES', 'ILES', _con),
        (_end_repl, 'E', ''),
        (_end_repl, 'ES', 'S'),
        (_end_repl, 'SS', 'AS', _vow),
        (_end_repl, 'MB', 'M', _vow),
        (_all_repl, 'MPTS', 'MPS'),
        (_all_repl, 'MPS', 'MS'),
        (_all_repl, 'MPT', 'MT'),
    )

    _phonix_translation = dict(
        zip(
            (ord(_) for _ in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
Issue (Comprehensibility, Best Practice): The variable _ does not seem to be defined.
            '01230720022455012683070808',
        )
    )

    sdx = ''

    word = unicode_normalize('NFKD', text_type(word.upper()))
    word = word.replace('ß', 'SS')
    word = ''.join(
        c
        for c in word
        if c
        in {
            'A',
            'B',
            'C',
            'D',
            'E',
            'F',
            'G',
            'H',
            'I',
            'J',
            'K',
            'L',
            'M',
            'N',
            'O',
            'P',
            'Q',
            'R',
            'S',
            'T',
            'U',
            'V',
            'W',
            'X',
            'Y',
            'Z',
        }
    )
    if word:
        for trans in _phonix_substitutions:
            word = trans[0](word, *trans[1:])
        if word[0] in {'A', 'E', 'I', 'O', 'U', 'Y'}:
            sdx = 'v' + word[1:].translate(_phonix_translation)
        else:
            sdx = word[0] + word[1:].translate(_phonix_translation)
        sdx = _delete_consecutive_repeats(sdx)
        sdx = sdx.replace('0', '')

    # Clamp max_length to [4, 64]
    if max_length != -1:
        max_length = min(max(4, max_length), 64)
    else:
        max_length = 64

    if zero_pad:
        sdx += '0' * max_length
    if not sdx:
        sdx = '0'
    return sdx[:max_length]


def lein(word, max_length=4, zero_pad=True):
    """Return the Lein code for a word.

    This is Lein name coding, described in :cite:`Moore:1977`.

    :param str word: the word to transform
    :param int max_length: the maximum length (default 4) of the code to return
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
        max_length string
    :returns: the Lein code
    :rtype: str

    >>> lein('Christopher')
    'C351'
    >>> lein('Niall')
    'N300'
    >>> lein('Smith')
    'S210'
    >>> lein('Schmidt')
    'S521'
    """
    _lein_translation = dict(
        zip((ord(_) for _ in 'BCDFGJKLMNPQRSTVXZ'), '451455532245351455')
Issue (Comprehensibility, Best Practice): The variable _ does not seem to be defined.
    )

    # uppercase, normalize, decompose, and filter non-A-Z out
    word = unicode_normalize('NFKD', text_type(word.upper()))
    word = word.replace('ß', 'SS')
    word = ''.join(
        c
        for c in word
        if c
        in {
            'A',
            'B',
            'C',
            'D',
            'E',
            'F',
            'G',
            'H',
            'I',
            'J',
            'K',
            'L',
            'M',
            'N',
            'O',
            'P',
            'Q',
            'R',
            'S',
            'T',
            'U',
            'V',
            'W',
            'X',
            'Y',
            'Z',
        }
    )

    code = word[:1]  # Rule 1
    word = word[1:].translate(
        {
            32: None,
            65: None,
            69: None,
            72: None,
            73: None,
            79: None,
            85: None,
            87: None,
            89: None,
        }
    )  # Rule 2
    word = _delete_consecutive_repeats(word)  # Rule 3
    code += word.translate(_lein_translation)  # Rule 4

    if zero_pad:
        code += '0' * max_length  # Rule 4

    return code[:max_length]


def pshp_soundex_last(lname, max_length=4, german=False):
    """Calculate the PSHP Soundex/Viewex Coding of a last name.

    This coding is based on :cite:`Hershberg:1976`.

    Reference was also made to the German version of the same:
    :cite:`Hershberg:1979`.

    A separate function, pshp_soundex_first() is used for first names.

    :param str lname: the last name to encode
    :param int max_length: the length of the code returned (defaults to 4)
    :param bool german: set to True if the name is German (different rules
        apply)
    :returns: the PSHP Soundex/Viewex Coding
    :rtype: str

    >>> pshp_soundex_last('Smith')
    'S530'
    >>> pshp_soundex_last('Waters')
    'W350'
    >>> pshp_soundex_last('James')
    'J500'
    >>> pshp_soundex_last('Schmidt')
    'S530'
    >>> pshp_soundex_last('Ashcroft')
    'A225'
    """
    lname = unicode_normalize('NFKD', text_type(lname.upper()))
    lname = lname.replace('ß', 'SS')
    lname = ''.join(
        c
        for c in lname
        if c
        in {
            'A',
            'B',
            'C',
            'D',
            'E',
            'F',
            'G',
            'H',
            'I',
            'J',
            'K',
            'L',
            'M',
            'N',
            'O',
            'P',
            'Q',
            'R',
            'S',
            'T',
            'U',
            'V',
            'W',
            'X',
            'Y',
            'Z',
        }
    )

    # A. Prefix treatment
    if lname[:3] == 'VON' or lname[:3] == 'VAN':
        lname = lname[3:].strip()

    # The rule implemented below says "MC, MAC become 1". I believe it meant to
    # say they become M except in German data (where superscripted 1 indicates
    # "except in German data"). It doesn't make sense for them to become 1
    # (BPFV -> 1) or to apply outside German. Unfortunately, both articles have
    # this error(?).
    if not german:
        if lname[:3] == 'MAC':
            lname = 'M' + lname[3:]
        elif lname[:2] == 'MC':
            lname = 'M' + lname[2:]

    # The non-German-only rule to strip ' is unnecessary due to filtering

    if lname[:1] in {'E', 'I', 'O', 'U'}:
        lname = 'A' + lname[1:]
    elif lname[:2] in {'GE', 'GI', 'GY'}:
        lname = 'J' + lname[1:]
    elif lname[:2] in {'CE', 'CI', 'CY'}:
        lname = 'S' + lname[1:]
    elif lname[:3] == 'CHR':
        lname = 'K' + lname[1:]
    elif lname[:1] == 'C' and lname[:2] != 'CH':
        lname = 'K' + lname[1:]

    if lname[:2] == 'KN':
        lname = 'N' + lname[1:]
    elif lname[:2] == 'PH':
        lname = 'F' + lname[1:]
    elif lname[:3] in {'WIE', 'WEI'}:
        lname = 'V' + lname[1:]

    if german and lname[:1] in {'W', 'M', 'Y', 'Z'}:
        lname = {'W': 'V', 'M': 'N', 'Y': 'J', 'Z': 'S'}[lname[0]] + lname[1:]

    code = lname[:1]

    # B. Postfix treatment
    if german:  # moved from end of postfix treatment due to blocking
        if lname[-3:] == 'TES':
            lname = lname[:-3]
        elif lname[-2:] == 'TS':
            lname = lname[:-2]
        if lname[-3:] == 'TZE':
            lname = lname[:-3]
        elif lname[-2:] == 'ZE':
            lname = lname[:-2]
        if lname[-1:] == 'Z':
            lname = lname[:-1]
        elif lname[-2:] == 'TE':
            lname = lname[:-2]

    if lname[-1:] == 'R':
        lname = lname[:-1] + 'N'
    elif lname[-2:] in {'SE', 'CE'}:
        lname = lname[:-2]
    if lname[-2:] == 'SS':
        lname = lname[:-2]
    elif lname[-1:] == 'S':
        lname = lname[:-1]

    if not german:
        l5_repl = {'STOWN': 'SAWON', 'MPSON': 'MASON'}
        l4_repl = {
            'NSEN': 'ASEN',
            'MSON': 'ASON',
            'STEN': 'SAEN',
            'STON': 'SAON',
        }
        if lname[-5:] in l5_repl:
            lname = lname[:-5] + l5_repl[lname[-5:]]
        elif lname[-4:] in l4_repl:
            lname = lname[:-4] + l4_repl[lname[-4:]]

    if lname[-2:] in {'NG', 'ND'}:
        lname = lname[:-1]
    if not german and lname[-3:] in {'GAN', 'GEN'}:
        lname = lname[:-3] + 'A' + lname[-2:]

    # C. Infix Treatment
    lname = lname.replace('CK', 'C')
    lname = lname.replace('SCH', 'S')
    lname = lname.replace('DT', 'T')
    lname = lname.replace('ND', 'N')
    lname = lname.replace('NG', 'N')
    lname = lname.replace('LM', 'M')
    lname = lname.replace('MN', 'M')
    lname = lname.replace('WIE', 'VIE')
    lname = lname.replace('WEI', 'VEI')

    # D. Soundexing
    # codes for X & Y are unspecified, but presumably are 2 & 0
    _pshp_translation = dict(
        zip(
            (ord(_) for _ in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
            '01230120022455012523010202',
        )
    )

    lname = lname.translate(_pshp_translation)
    lname = _delete_consecutive_repeats(lname)

    code += lname[1:]
    code = code.replace('0', '')  # rule 1

    if max_length != -1:
        if len(code) < max_length:
            code += '0' * (max_length - len(code))
        else:
            code = code[:max_length]

    return code


def pshp_soundex_first(fname, max_length=4, german=False):
    """Calculate the PSHP Soundex/Viewex Coding of a first name.

    This coding is based on :cite:`Hershberg:1976`.

    Reference was also made to the German version of the same:
    :cite:`Hershberg:1979`.

    A separate function, pshp_soundex_last(), is used for last names.

    :param str fname: the first name to encode
    :param int max_length: the length of the code returned (defaults to 4)
    :param bool german: set to True if the name is German (different rules
        apply)
    :returns: the PSHP Soundex/Viewex Coding
    :rtype: str

    >>> pshp_soundex_first('Smith')
    'S530'
    >>> pshp_soundex_first('Waters')
    'W352'
    >>> pshp_soundex_first('James')
    'J700'
    >>> pshp_soundex_first('Schmidt')
    'S500'
    >>> pshp_soundex_first('Ashcroft')
    'A220'
    >>> pshp_soundex_first('John')
    'J500'
    >>> pshp_soundex_first('Colin')
    'K400'
    >>> pshp_soundex_first('Niall')
    'N400'
    >>> pshp_soundex_first('Sally')
    'S400'
    >>> pshp_soundex_first('Jane')
    'J500'
    """
    fname = unicode_normalize('NFKD', text_type(fname.upper()))
    fname = fname.replace('ß', 'SS')
    fname = ''.join(
        c
        for c in fname
        if c
        in {
            'A',
            'B',
            'C',
            'D',
            'E',
            'F',
            'G',
            'H',
            'I',
            'J',
            'K',
            'L',
            'M',
            'N',
            'O',
            'P',
            'Q',
            'R',
            'S',
            'T',
            'U',
            'V',
            'W',
            'X',
            'Y',
            'Z',
        }
    )

    # special rules
    if fname == 'JAMES':
        code = 'J7'
    elif fname == 'PAT':
        code = 'P7'

    else:
        # A. Prefix treatment
        if fname[:2] in {'GE', 'GI', 'GY'}:
            fname = 'J' + fname[1:]
        elif fname[:2] in {'CE', 'CI', 'CY'}:
            fname = 'S' + fname[1:]
        elif fname[:3] == 'CHR':
            fname = 'K' + fname[1:]
        elif fname[:1] == 'C' and fname[:2] != 'CH':
            fname = 'K' + fname[1:]

        if fname[:2] == 'KN':
            fname = 'N' + fname[1:]
        elif fname[:2] == 'PH':
            fname = 'F' + fname[1:]
        elif fname[:3] in {'WIE', 'WEI'}:
            fname = 'V' + fname[1:]

        if german and fname[:1] in {'W', 'M', 'Y', 'Z'}:
            fname = {'W': 'V', 'M': 'N', 'Y': 'J', 'Z': 'S'}[fname[0]] + fname[
                1:
            ]

        code = fname[:1]

        # B. Soundex coding
        # code for Y unspecified, but presumably is 0
        _pshp_translation = dict(
            zip(
                (ord(_) for _ in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
                '01230120022455012523010202',
            )
        )

        fname = fname.translate(_pshp_translation)
        fname = _delete_consecutive_repeats(fname)

        code += fname[1:]
        syl_ptr = code.find('0')
        syl2_ptr = code[syl_ptr + 1 :].find('0')
        if syl_ptr != -1 and syl2_ptr != -1 and syl2_ptr - syl_ptr > -1:
            code = code[: syl_ptr + 2]

        code = code.replace('0', '')  # rule 1

    if max_length != -1:
        if len(code) < max_length:
            code += '0' * (max_length - len(code))
        else:
            code = code[:max_length]

    return code


if __name__ == '__main__':
    import doctest

    doctest.testmod()
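
The soundexing and padding steps that both PSHP functions share can be tried in isolation. What follows is a minimal sketch, not the Abydos implementation: `soundex_digits` and the `groupby`-based `delete_consecutive_repeats` are hypothetical stand-ins (the module's own `_delete_consecutive_repeats` helper is defined outside this listing), the input is assumed to be already uppercased and filtered to A-Z, and the prefix, postfix, infix, and first-name syllable rules are left out.

```python
from itertools import groupby

# Letter-to-digit table copied from the listing (A-Z in order).
PSHP_DIGITS = dict(
    zip(
        (ord(ch) for ch in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
        '01230120022455012523010202',
    )
)


def delete_consecutive_repeats(text):
    """Collapse each run of identical characters to one (hypothetical helper)."""
    return ''.join(ch for ch, _ in groupby(text))


def soundex_digits(name, max_length=4):
    """Apply only the digit mapping, zero removal, and padding steps.

    Assumes `name` is already uppercase A-Z; pass max_length=-1 to skip
    padding and truncation, mirroring the listing's convention.
    """
    digits = delete_consecutive_repeats(name.translate(PSHP_DIGITS))
    # Keep the first letter, append the remaining digits, drop vowel zeros.
    code = (name[:1] + digits[1:]).replace('0', '')
    if max_length != -1:
        code = code[:max_length].ljust(max_length, '0')
    return code


print(soundex_digits('SMITH'))  # S530
```

For 'SMITH' the letters map to 25030, the zeros are dropped after the leading letter is restored, and the result is padded to four characters, which agrees with the 'S530' doctest for both functions.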