Completed
Pull Request — master (#138)
by Chris
14:20
created

abydos.phonetic._soundex.PSHPSoundexLast.encode()   F

Complexity

Conditions 36

Size

Total Lines 137
Code Lines 85

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 80
CRAP Score 36

Importance

Changes 0
Metric Value
eloc 85
dl 0
loc 137
ccs 80
cts 80
cp 1
rs 0
c 0
b 0
f 0
cc 36
nop 4
crap 36
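
The CRAP score in the table can be reproduced from two of the other values. This is a minimal sketch, assuming the commonly published Change Risk Anti-Patterns formula, CRAP = cc^2 * (1 - coverage)^3 + cc, with coverage as a fraction (the `cp` value). The function name `crap_score` is illustrative, and whether the report computes the score exactly this way is an assumption.

```python
def crap_score(complexity, coverage):
    """Return the CRAP score for a method.

    complexity: cyclomatic complexity (the 'cc' metric)
    coverage:   test coverage as a fraction in [0, 1] (the 'cp' metric)
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError('coverage must be a fraction between 0 and 1')
    # Uncovered complexity is penalized sharply; full coverage leaves
    # only the complexity term itself.
    return complexity ** 2 * (1.0 - coverage) ** 3 + complexity


print(crap_score(36, 1.0))  # fully covered: score equals the complexity
print(crap_score(36, 0.0))  # same method with no tests at all
```

With cc = 36 and cp = 1 the first call gives 36, matching the reported value: full coverage cannot push the score below the complexity, so only a refactoring that lowers cc would reduce it.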

How to fix

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include Extract Method.

Complexity

Complex code like abydos.phonetic._soundex.PSHPSoundexLast.encode() often does many different things. To break it down, identify a cohesive component within the enclosing class. A common way to find such a component is to look for fields or methods that share the same prefix or suffix.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
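
As a sketch of Extract Class, consider the letter-set fields used in the listing on this page, which share the `_uc_` prefix (`_uc_set`, `_uc_v_set`, `_uc_vy_set`, `_uc_c_set`). The classes `LetterSets` and `Encoder` below are hypothetical, and the vowel sets are assumed values; only the consonant set, derived as all letters minus vowels, mirrors code that is visible in the listing.

```python
class LetterSets(object):
    """Hypothetical extracted class for the fields sharing the _uc_ prefix."""

    def __init__(self):
        self.all = frozenset('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
        self.vowels = frozenset('AEIOU')  # assumed value of _uc_v_set
        self.vowels_y = self.vowels | frozenset('Y')  # assumed _uc_vy_set
        # Mirrors _uc_set - _uc_v_set from Phonix.__init__
        self.consonants = self.all - self.vowels


class Encoder(object):
    """Hypothetical encoder that delegates letter classification."""

    def __init__(self, letters=None):
        self._letters = letters or LetterSets()

    def keep_letters(self, word):
        """Uppercase word and drop everything outside A-Z."""
        return ''.join(c for c in word.upper() if c in self._letters.all)


print(Encoder().keep_letters("O'Brien"))
```

The encoder no longer owns the letter sets; it receives them, which also makes it possible to substitute a different alphabet in tests.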

# -*- coding: utf-8 -*-
# Issue (Coding Style): Too many lines in module (1286/1000)

# Copyright 2014-2018 by Christopher C. Little.
# This file is part of Abydos.
#
# Abydos is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Abydos is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.

"""abydos.phonetic._soundex.

The phonetic._soundex module implements phonetic algorithms that are generally
Soundex-like, including:

    - American Soundex
    - Refined Soundex
    - Fuzzy Soundex
    - Phonex
    - Phonix
    - Lein
    - PSHP Soundex/Viewex Coding

Being Soundex-like, for the purposes of this module means: targeted at English,
returning a code that starts with a letter and continues with (usually 3)
numerals, and mostly based on a simple translation table.
"""

from __future__ import unicode_literals

from unicodedata import normalize as unicode_normalize

from six import text_type
from six.moves import range

from ._phonetic import Phonetic

__all__ = [
    'FuzzySoundex',
    'Lein',
    'Phonex',
    'Phonix',
    'PSHPSoundexFirst',
    'PSHPSoundexLast',
    'RefinedSoundex',
    'Soundex',
    'fuzzy_soundex',
    'lein',
    'phonex',
    'phonix',
    'pshp_soundex_first',
    'pshp_soundex_last',
    'refined_soundex',
    'soundex',
]


class Soundex(Phonetic):
    # Issue (Unused Code): The variable __class__ seems to be unused.
    """Soundex.

    Three variants of Soundex are implemented:

    - 'American' follows the American Soundex algorithm, as described at
      :cite:`US:2007` and in :cite:`Knuth:1998`; this is also called
      Miracode
    - 'special' follows the rules from the 1880-1910 US Census
      retrospective re-analysis, in which h & w are not treated as blocking
      consonants but as vowels. Cf. :cite:`Repici:2013`.
    - 'Census' follows the rules laid out in GIL 55 :cite:`US:1997` by the
      US Census, including coding prefixed and unprefixed versions of some
      names
    """

    _trans = dict(
        zip(
            (ord(_) for _ in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
            # Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
            '01230129022455012623019202',
        )
    )

    def encode(
        # Issue (Best Practice): Too many arguments (6/5)
        # Issue (Bug): Parameters differ from overridden 'encode' method
        self, word, max_length=4, var='American', reverse=False, zero_pad=True
        # Issue (Coding Style): Wrong hanging indentation before block (add 4 spaces).
    ):
        """Return the Soundex code for a word.

        :param str word: the word to transform
        :param int max_length: the length of the code returned (defaults to 4)
        :param str var: the variant of the algorithm to employ (defaults to
            'American'):

            - 'American' follows the American Soundex algorithm, as described
              at :cite:`US:2007` and in :cite:`Knuth:1998`; this is also called
              Miracode
            - 'special' follows the rules from the 1880-1910 US Census
              retrospective re-analysis, in which h & w are not treated as
              blocking consonants but as vowels. Cf. :cite:`Repici:2013`.
            - 'Census' follows the rules laid out in GIL 55 :cite:`US:1997` by
              the US Census, including coding prefixed and unprefixed versions
              of some names

        :param bool reverse: reverse the word before computing the selected
            Soundex (defaults to False); This results in "Reverse Soundex",
            which is useful for blocking in cases where the initial elements
            may be in error.
        :param bool zero_pad: pad the end of the return value with 0s to
            achieve a max_length string
        :returns: the Soundex value
        :rtype: str

        >>> pe = Soundex()
        >>> pe.encode("Christopher")
        'C623'
        >>> pe.encode("Niall")
        'N400'
        >>> pe.encode('Smith')
        'S530'
        >>> pe.encode('Schmidt')
        'S530'

        >>> pe.encode('Christopher', max_length=-1)
        'C623160000000000000000000000000000000000000000000000000000000000'
        >>> pe.encode('Christopher', max_length=-1, zero_pad=False)
        'C62316'

        >>> pe.encode('Christopher', reverse=True)
        'R132'

        >>> pe.encode('Ashcroft')
        'A261'
        >>> pe.encode('Asicroft')
        'A226'
        >>> pe.encode('Ashcroft', var='special')
        'A226'
        >>> pe.encode('Asicroft', var='special')
        'A226'
        """
        # Require a max_length of at least 4 and not more than 64
        if max_length != -1:
            max_length = min(max(4, max_length), 64)
        else:
            max_length = 64

        # uppercase, normalize, decompose, and filter non-A-Z out
        word = unicode_normalize('NFKD', text_type(word.upper()))
        word = word.replace('ß', 'SS')

        if var == 'Census':
            # TODO: Should these prefixes be supplemented? (VANDE, DELA, VON)
            # Issue (Coding Style): TODO and FIXME comments should generally be avoided.
            if word[:3] in {'VAN', 'CON'} and len(word) > 4:
                return (
                    soundex(word, max_length, 'American', reverse, zero_pad),
                    soundex(
                        word[3:], max_length, 'American', reverse, zero_pad
                    ),
                )
            if word[:2] in {'DE', 'DI', 'LA', 'LE'} and len(word) > 3:
                return (
                    soundex(word, max_length, 'American', reverse, zero_pad),
                    soundex(
                        word[2:], max_length, 'American', reverse, zero_pad
                    ),
                )
            # Otherwise, proceed as usual (var='American' mode, ostensibly)

        word = ''.join(c for c in word if c in self._uc_set)

        # Nothing to convert, return base case
        if not word:
            if zero_pad:
                return '0' * max_length
            return '0'

        # Reverse word if computing Reverse Soundex
        if reverse:
            word = word[::-1]

        # apply the Soundex algorithm
        sdx = word.translate(self._trans)

        if var == 'special':
            sdx = sdx.replace('9', '0')  # special rule for 1880-1910 census
        else:
            sdx = sdx.replace('9', '')  # rule 1
        sdx = self._delete_consecutive_repeats(sdx)  # rule 3

        if word[0] in 'HW':
            sdx = word[0] + sdx
        else:
            sdx = word[0] + sdx[1:]
        sdx = sdx.replace('0', '')  # rule 1

        if zero_pad:
            sdx += '0' * max_length  # rule 4

        return sdx[:max_length]


def soundex(word, max_length=4, var='American', reverse=False, zero_pad=True):
    """Return the Soundex code for a word.

    This is a wrapper for :py:meth:`Soundex.encode`.

    :param str word: the word to transform
    :param int max_length: the length of the code returned (defaults to 4)
    :param str var: the variant of the algorithm to employ (defaults to
        'American'):

        - 'American' follows the American Soundex algorithm, as described at
          :cite:`US:2007` and in :cite:`Knuth:1998`; this is also called
          Miracode
        - 'special' follows the rules from the 1880-1910 US Census
          retrospective re-analysis, in which h & w are not treated as blocking
          consonants but as vowels. Cf. :cite:`Repici:2013`.
        - 'Census' follows the rules laid out in GIL 55 :cite:`US:1997` by the
          US Census, including coding prefixed and unprefixed versions of some
          names

    :param bool reverse: reverse the word before computing the selected Soundex
        (defaults to False); This results in "Reverse Soundex", which is useful
        for blocking in cases where the initial elements may be in error.
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
        max_length string
    :returns: the Soundex value
    :rtype: str

    >>> soundex("Christopher")
    'C623'
    >>> soundex("Niall")
    'N400'
    >>> soundex('Smith')
    'S530'
    >>> soundex('Schmidt')
    'S530'

    >>> soundex('Christopher', max_length=-1)
    'C623160000000000000000000000000000000000000000000000000000000000'
    >>> soundex('Christopher', max_length=-1, zero_pad=False)
    'C62316'

    >>> soundex('Christopher', reverse=True)
    'R132'

    >>> soundex('Ashcroft')
    'A261'
    >>> soundex('Asicroft')
    'A226'
    >>> soundex('Ashcroft', var='special')
    'A226'
    >>> soundex('Asicroft', var='special')
    'A226'
    """
    return Soundex().encode(word, max_length, var, reverse, zero_pad)


class RefinedSoundex(Phonetic):
    # Issue (Unused Code): The variable __class__ seems to be unused.
    """Refined Soundex.

    This is Soundex, but with more character classes. It was defined at
    :cite:`Boyce:1998`.
    """

    _trans = dict(
        zip(
            (ord(_) for _ in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
            # Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
            '01360240043788015936020505',
        )
    )

    def encode(self, word, max_length=-1, zero_pad=False, retain_vowels=False):
        # Issue (Bug): Parameters differ from overridden 'encode' method
        """Return the Refined Soundex code for a word.

        :param word: the word to transform
        :param max_length: the length of the code returned (defaults to
            unlimited)
        :param zero_pad: pad the end of the return value with 0s to achieve a
            max_length string
        :param retain_vowels: retain vowels (as 0) in the resulting code
        :returns: the Refined Soundex value
        :rtype: str

        >>> pe = RefinedSoundex()
        >>> pe.encode('Christopher')
        'C393619'
        >>> pe.encode('Niall')
        'N87'
        >>> pe.encode('Smith')
        'S386'
        >>> pe.encode('Schmidt')
        'S386'
        """
        # uppercase, normalize, decompose, and filter non-A-Z out
        word = unicode_normalize('NFKD', text_type(word.upper()))
        word = word.replace('ß', 'SS')
        word = ''.join(c for c in word if c in self._uc_set)

        # apply the Soundex algorithm
        sdx = word[:1] + word.translate(self._trans)
        sdx = self._delete_consecutive_repeats(sdx)
        if not retain_vowels:
            sdx = sdx.replace('0', '')  # Delete vowels, H, W, Y

        if max_length > 0:
            if zero_pad:
                sdx += '0' * max_length
            sdx = sdx[:max_length]

        return sdx


def refined_soundex(word, max_length=-1, zero_pad=False, retain_vowels=False):
    """Return the Refined Soundex code for a word.

    This is a wrapper for :py:meth:`RefinedSoundex.encode`.

    :param word: the word to transform
    :param max_length: the length of the code returned (defaults to unlimited)
    :param zero_pad: pad the end of the return value with 0s to achieve a
        max_length string
    :param retain_vowels: retain vowels (as 0) in the resulting code
    :returns: the Refined Soundex value
    :rtype: str

    >>> refined_soundex('Christopher')
    'C393619'
    >>> refined_soundex('Niall')
    'N87'
    >>> refined_soundex('Smith')
    'S386'
    >>> refined_soundex('Schmidt')
    'S386'
    """
    return RefinedSoundex().encode(word, max_length, zero_pad, retain_vowels)


class FuzzySoundex(Phonetic):
    # Issue (Unused Code): The variable __class__ seems to be unused.
    """Fuzzy Soundex.

    Fuzzy Soundex is an algorithm derived from Soundex, defined in
    :cite:`Holmes:2002`.
    """

    _trans = dict(
        zip(
            (ord(_) for _ in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
            # Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
            '0193017-07745501769301-7-9',
        )
    )

    def encode(self, word, max_length=5, zero_pad=True):
        # Issue (Bug): Parameters differ from overridden 'encode' method
        """Return the Fuzzy Soundex code for a word.

        :param str word: the word to transform
        :param int max_length: the length of the code returned (defaults to 5)
        :param bool zero_pad: pad the end of the return value with 0s to
            achieve a max_length string
        :returns: the Fuzzy Soundex value
        :rtype: str

        >>> pe = FuzzySoundex()
        >>> pe.encode('Christopher')
        'K6931'
        >>> pe.encode('Niall')
        'N4000'
        >>> pe.encode('Smith')
        'S5300'
        >>> pe.encode('Schmidt')
        'S5300'
        """
        word = unicode_normalize('NFKD', text_type(word.upper()))
        word = word.replace('ß', 'SS')

        # Clamp max_length to [4, 64]
        if max_length != -1:
            max_length = min(max(4, max_length), 64)
        else:
            max_length = 64

        if not word:
            if zero_pad:
                return '0' * max_length
            return '0'

        if word[:2] in {'CS', 'CZ', 'TS', 'TZ'}:
            word = 'SS' + word[2:]
        elif word[:2] == 'GN':
            word = 'NN' + word[2:]
        elif word[:2] in {'HR', 'WR'}:
            word = 'RR' + word[2:]
        elif word[:2] == 'HW':
            word = 'WW' + word[2:]
        elif word[:2] in {'KN', 'NG'}:
            word = 'NN' + word[2:]

        if word[-2:] == 'CH':
            word = word[:-2] + 'KK'
        elif word[-2:] == 'NT':
            word = word[:-2] + 'TT'
        elif word[-2:] == 'RT':
            word = word[:-2] + 'RR'
        elif word[-3:] == 'RDT':
            word = word[:-3] + 'RR'

        word = word.replace('CA', 'KA')
        word = word.replace('CC', 'KK')
        word = word.replace('CK', 'KK')
        word = word.replace('CE', 'SE')
        word = word.replace('CHL', 'KL')
        word = word.replace('CL', 'KL')
        word = word.replace('CHR', 'KR')
        word = word.replace('CR', 'KR')
        word = word.replace('CI', 'SI')
        word = word.replace('CO', 'KO')
        word = word.replace('CU', 'KU')
        word = word.replace('CY', 'SY')
        word = word.replace('DG', 'GG')
        word = word.replace('GH', 'HH')
        word = word.replace('MAC', 'MK')
        word = word.replace('MC', 'MK')
        word = word.replace('NST', 'NSS')
        word = word.replace('PF', 'FF')
        word = word.replace('PH', 'FF')
        word = word.replace('SCH', 'SSS')
        word = word.replace('TIO', 'SIO')
        word = word.replace('TIA', 'SIO')
        word = word.replace('TCH', 'CHH')

        sdx = word.translate(self._trans)
        sdx = sdx.replace('-', '')

        # remove repeating characters
        sdx = self._delete_consecutive_repeats(sdx)

        if word[0] in {'H', 'W', 'Y'}:
            sdx = word[0] + sdx
        else:
            sdx = word[0] + sdx[1:]

        sdx = sdx.replace('0', '')

        if zero_pad:
            sdx += '0' * max_length

        return sdx[:max_length]


def fuzzy_soundex(word, max_length=5, zero_pad=True):
    """Return the Fuzzy Soundex code for a word.

    This is a wrapper for :py:meth:`FuzzySoundex.encode`.

    :param str word: the word to transform
    :param int max_length: the length of the code returned (defaults to 5)
    :param bool zero_pad: pad the end of the return value with 0s to achieve
        a max_length string
    :returns: the Fuzzy Soundex value
    :rtype: str

    >>> fuzzy_soundex('Christopher')
    'K6931'
    >>> fuzzy_soundex('Niall')
    'N4000'
    >>> fuzzy_soundex('Smith')
    'S5300'
    >>> fuzzy_soundex('Schmidt')
    'S5300'
    """
    return FuzzySoundex().encode(word, max_length, zero_pad)


class Phonex(Phonetic):
    # Issue (Unused Code): The variable __class__ seems to be unused.
    """Phonex code.

    Phonex is an algorithm derived from Soundex, defined in :cite:`Lait:1996`.
    """

    def encode(self, word, max_length=4, zero_pad=True):
        # Issue (Bug): Parameters differ from overridden 'encode' method
        """Return the Phonex code for a word.

        :param str word: the word to transform
        :param int max_length: the length of the code returned (defaults to 4)
        :param bool zero_pad: pad the end of the return value with 0s to
            achieve a max_length string
        :returns: the Phonex value
        :rtype: str

        >>> pe = Phonex()
        >>> pe.encode('Christopher')
        'C623'
        >>> pe.encode('Niall')
        'N400'
        >>> pe.encode('Schmidt')
        'S253'
        >>> pe.encode('Smith')
        'S530'
        """
        name = unicode_normalize('NFKD', text_type(word.upper()))
        name = name.replace('ß', 'SS')

        # Clamp max_length to [4, 64]
        if max_length != -1:
            max_length = min(max(4, max_length), 64)
        else:
            max_length = 64

        name_code = last = ''

        # Deletions effected by replacing with next letter which
        # will be ignored due to duplicate handling of Soundex code.
        # This is faster than 'moving' all subsequent letters.

        # Remove any trailing Ss
        while name[-1:] == 'S':
            name = name[:-1]

        # Phonetic equivalents of first 2 characters
        # Works since duplicate letters are ignored
        if name[:2] == 'KN':
            name = 'N' + name[2:]  # KN.. == N..
        elif name[:2] == 'PH':
            name = 'F' + name[2:]  # PH.. == F.. (H ignored anyway)
        elif name[:2] == 'WR':
            name = 'R' + name[2:]  # WR.. == R..

        if name:
            # Special case, ignore H first letter (subsequent Hs ignored
            # anyway)
            # Works since duplicate letters are ignored
            if name[0] == 'H':
                name = name[1:]

        if name:
            # Phonetic equivalents of first character
            if name[0] in self._uc_vy_set:
                name = 'A' + name[1:]
            elif name[0] in {'B', 'P'}:
                name = 'B' + name[1:]
            elif name[0] in {'V', 'F'}:
                name = 'F' + name[1:]
            elif name[0] in {'C', 'K', 'Q'}:
                name = 'C' + name[1:]
            elif name[0] in {'G', 'J'}:
                name = 'G' + name[1:]
            elif name[0] in {'S', 'Z'}:
                name = 'S' + name[1:]

            name_code = last = name[0]

        # Modified Soundex code
        for i in range(1, len(name)):
            code = '0'
            if name[i] in {'B', 'F', 'P', 'V'}:
                code = '1'
            elif name[i] in {'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'}:
                code = '2'
            elif name[i] in {'D', 'T'}:
                if name[i + 1 : i + 2] != 'C':
                    code = '3'
            elif name[i] == 'L':
                if name[i + 1 : i + 2] in self._uc_vy_set or i + 1 == len(
                    name
                    # Issue (Coding Style): Wrong hanging indentation before block (add 4 spaces).
                ):
                    code = '4'
            elif name[i] in {'M', 'N'}:
                if name[i + 1 : i + 2] in {'D', 'G'}:
                    name = name[: i + 1] + name[i] + name[i + 2 :]
                code = '5'
            elif name[i] == 'R':
                if name[i + 1 : i + 2] in self._uc_vy_set or i + 1 == len(
                    name
                    # Issue (Coding Style): Wrong hanging indentation before block (add 4 spaces).
                ):
                    code = '6'

            if code != last and code != '0' and i != 0:
                name_code += code

            last = name_code[-1]

        if zero_pad:
            name_code += '0' * max_length
        if not name_code:
            name_code = '0'
        return name_code[:max_length]


def phonex(word, max_length=4, zero_pad=True):
    """Return the Phonex code for a word.

    This is a wrapper for :py:meth:`Phonex.encode`.

    :param str word: the word to transform
    :param int max_length: the length of the code returned (defaults to 4)
    :param bool zero_pad: pad the end of the return value with 0s to achieve
        a max_length string
    :returns: the Phonex value
    :rtype: str

    >>> phonex('Christopher')
    'C623'
    >>> phonex('Niall')
    'N400'
    >>> phonex('Schmidt')
    'S253'
    >>> phonex('Smith')
    'S530'
    """
    return Phonex().encode(word, max_length, zero_pad)


class Phonix(Phonetic):
    # Issue (Unused Code): The variable __class__ seems to be unused.
    """Phonix code.

    Phonix is a Soundex-like algorithm defined in :cite:`Gadd:1990`.

    This implementation is based on:
    - :cite:`Pfeifer:2000`
    - :cite:`Christen:2011`
    - :cite:`Kollar:2007`
    """

    _uc_c_set = None

    _substitutions = None

    _trans = dict(
        zip(
            (ord(_) for _ in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
            # Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
            '01230720022455012683070808',
        )
    )

    def __init__(self):
        """Initialize Phonix."""
        self._uc_c_set = (
            super(Phonix, self)._uc_set - super(Phonix, self)._uc_v_set
        )

        self._substitutions = (
            (3, 'DG', 'G'),
            (3, 'CO', 'KO'),
            (3, 'CA', 'KA'),
            (3, 'CU', 'KU'),
            (3, 'CY', 'SI'),
            (3, 'CI', 'SI'),
            (3, 'CE', 'SE'),
            (0, 'CL', 'KL', super(Phonix, self)._uc_v_set),
            (3, 'CK', 'K'),
            (1, 'GC', 'K'),
            (1, 'JC', 'K'),
            (0, 'CHR', 'KR', super(Phonix, self)._uc_v_set),
            (0, 'CR', 'KR', super(Phonix, self)._uc_v_set),
            (0, 'WR', 'R'),
            (3, 'NC', 'NK'),
            (3, 'CT', 'KT'),
            (3, 'PH', 'F'),
            (3, 'AA', 'AR'),
            (3, 'SCH', 'SH'),
            (3, 'BTL', 'TL'),
            (3, 'GHT', 'T'),
            (3, 'AUGH', 'ARF'),
            (
                2,
                'LJ',
                'LD',
                super(Phonix, self)._uc_v_set,
                super(Phonix, self)._uc_v_set,
            ),
            (3, 'LOUGH', 'LOW'),
            (0, 'Q', 'KW'),
            (0, 'KN', 'N'),
            (1, 'GN', 'N'),
            (3, 'GHN', 'N'),
            (1, 'GNE', 'N'),
            (3, 'GHNE', 'NE'),
            (1, 'GNES', 'NS'),
            (0, 'GN', 'N'),
            (2, 'GN', 'N', None, self._uc_c_set),
            (1, 'GN', 'N'),
            (0, 'PS', 'S'),
            (0, 'PT', 'T'),
            (0, 'CZ', 'C'),
            (2, 'WZ', 'Z', super(Phonix, self)._uc_v_set),
            (2, 'CZ', 'CH'),
            (3, 'LZ', 'LSH'),
            (3, 'RZ', 'RSH'),
            (2, 'Z', 'S', None, super(Phonix, self)._uc_v_set),
            (3, 'ZZ', 'TS'),
            (2, 'Z', 'TS', self._uc_c_set),
            (3, 'HROUG', 'REW'),
            (3, 'OUGH', 'OF'),
            (
                2,
                'Q',
                'KW',
                super(Phonix, self)._uc_v_set,
                super(Phonix, self)._uc_v_set,
            ),
            (
                2,
                'J',
                'Y',
                super(Phonix, self)._uc_v_set,
                super(Phonix, self)._uc_v_set,
            ),
            (0, 'YJ', 'Y', super(Phonix, self)._uc_v_set),
            (0, 'GH', 'G'),
            (1, 'GH', 'E', super(Phonix, self)._uc_v_set),
            (0, 'CY', 'S'),
            (3, 'NX', 'NKS'),
            (0, 'PF', 'F'),
            (1, 'DT', 'T'),
            (1, 'TL', 'TIL'),
            (1, 'DL', 'DIL'),
            (3, 'YTH', 'ITH'),
            (0, 'TJ', 'CH', super(Phonix, self)._uc_v_set),
            (0, 'TSJ', 'CH', super(Phonix, self)._uc_v_set),
            (0, 'TS', 'T', super(Phonix, self)._uc_v_set),
            (3, 'TCH', 'CH'),
            (2, 'WSK', 'VSKIE', super(Phonix, self)._uc_v_set),
            (1, 'WSK', 'VSKIE', super(Phonix, self)._uc_v_set),
            (0, 'MN', 'N', super(Phonix, self)._uc_v_set),
            (0, 'PN', 'N', super(Phonix, self)._uc_v_set),
            (2, 'STL', 'SL', super(Phonix, self)._uc_v_set),
            (1, 'STL', 'SL', super(Phonix, self)._uc_v_set),
            (1, 'TNT', 'ENT'),
            (1, 'EAUX', 'OH'),
            (3, 'EXCI', 'ECS'),
            (3, 'X', 'ECS'),
            (1, 'NED', 'ND'),
            (3, 'JR', 'DR'),
            (1, 'EE', 'EA'),
            (3, 'ZS', 'S'),
            (2, 'R', 'AH', super(Phonix, self)._uc_v_set, self._uc_c_set),
            (1, 'R', 'AH', super(Phonix, self)._uc_v_set),
            (2, 'HR', 'AH', super(Phonix, self)._uc_v_set, self._uc_c_set),
            (1, 'HR', 'AH', super(Phonix, self)._uc_v_set),
            (1, 'HR', 'AH', super(Phonix, self)._uc_v_set),
            (1, 'RE', 'AR'),
            (1, 'R', 'AH', super(Phonix, self)._uc_v_set),
            (3, 'LLE', 'LE'),
            (1, 'LE', 'ILE', self._uc_c_set),
            (1, 'LES', 'ILES', self._uc_c_set),
            (1, 'E', ''),
            (1, 'ES', 'S'),
            (1, 'SS', 'AS', super(Phonix, self)._uc_v_set),
            (1, 'MB', 'M', super(Phonix, self)._uc_v_set),
            (3, 'MPTS', 'MPS'),
            (3, 'MPS', 'MS'),
            (3, 'MPT', 'MT'),
        )

    def encode(self, word, max_length=4, zero_pad=True):
        # Issue (Bug): Parameters differ from overridden 'encode' method
        """Return the Phonix code for a word.

        :param str word: the word to transform
        :param int max_length: the length of the code returned (defaults to 4)
        :param bool zero_pad: pad the end of the return value with 0s to
            achieve a max_length string
        :returns: the Phonix value
        :rtype: str

        >>> pe = Phonix()
        >>> pe.encode('Christopher')
        'K683'
        >>> pe.encode('Niall')
        'N400'
        >>> pe.encode('Smith')
        'S530'
        >>> pe.encode('Schmidt')
        'S530'
        """

        def _start_repl(word, src, tar, post=None):
            """Replace src with tar at the start of word."""
            if post:
                for i in post:
                    if word.startswith(src + i):
                        return tar + word[len(src) :]
            elif word.startswith(src):
                return tar + word[len(src) :]
            return word

        def _end_repl(word, src, tar, pre=None):
            """Replace src with tar at the end of word."""
            if pre:
                for i in pre:
                    if word.endswith(i + src):
                        return word[: -len(src)] + tar
            elif word.endswith(src):
                return word[: -len(src)] + tar
            return word

        def _mid_repl(word, src, tar, pre=None, post=None):
            """Replace src with tar in the middle of word."""
            if pre or post:
                if not pre:
                    return word[0] + _all_repl(word[1:], src, tar, pre, post)
                elif not post:
                    return _all_repl(word[:-1], src, tar, pre, post) + word[-1]
                return _all_repl(word, src, tar, pre, post)
            return (
                word[0] + _all_repl(word[1:-1], src, tar, pre, post) + word[-1]
            )

        def _all_repl(word, src, tar, pre=None, post=None):
            """Replace src with tar anywhere in word."""
            if pre or post:
                # When only one context is given, use a single empty
                # string for the other so the loop below still runs.
                post = post or frozenset(('',))
                pre = pre or frozenset(('',))

                for i, j in ((i, j) for i in pre for j in post):
                    word = word.replace(i + src + j, i + tar + j)
                return word
            else:
                return word.replace(src, tar)
829
830 1
        repl_at = (_start_repl, _end_repl, _mid_repl, _all_repl)
831
832 1
        sdx = ''
833
834 1
        word = unicode_normalize('NFKD', text_type(word.upper()))
835 1
        word = word.replace('ß', 'SS')
836 1
        word = ''.join(c for c in word if c in self._uc_set)
837 1
        if word:
838 1
            for trans in self._substitutions:
839 1
                word = repl_at[trans[0]](word, *trans[1:])
840 1
            if word[0] in self._uc_vy_set:
841 1
                sdx = 'v' + word[1:].translate(self._trans)
842
            else:
843 1
                sdx = word[0] + word[1:].translate(self._trans)
844 1
            sdx = self._delete_consecutive_repeats(sdx)
845 1
            sdx = sdx.replace('0', '')
846
847
        # Clamp max_length to [4, 64]
848 1
        if max_length != -1:
849 1
            max_length = min(max(4, max_length), 64)
850
        else:
851 1
            max_length = 64
852
853 1
        if zero_pad:
854 1
            sdx += '0' * max_length
855 1
        if not sdx:
856 1
            sdx = '0'
857 1
        return sdx[:max_length]


def phonix(word, max_length=4, zero_pad=True):
    """Return the Phonix code for a word.

    This is a wrapper for :py:meth:`Phonix.encode`.

    :param str word: the word to transform
    :param int max_length: the length of the code returned (defaults to 4;
        values are clamped to the range 4 to 64, and -1 selects 64)
    :param bool zero_pad: pad the end of the return value with 0s to achieve
        a max_length string
    :returns: the Phonix value
    :rtype: str

    >>> phonix('Christopher')
    'K683'
    >>> phonix('Niall')
    'N400'
    >>> phonix('Smith')
    'S530'
    >>> phonix('Schmidt')
    'S530'
    """
    return Phonix().encode(word, max_length, zero_pad)
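The positional substitution rules used by Phonix.encode can be illustrated in isolation. The following is a minimal Python 3 sketch, not the library's implementation: it keys the replacers by name instead of by integer index, omits the middle-of-word case, and uses a small rule table in which the 'KN' rule is a hypothetical example (the 'MB', 'MPTS' and 'MPS' rules mirror entries shown in the listing above). All function and variable names here are assumptions made for illustration.

```python
VOWELS = frozenset('AEIOU')


def start_repl(word, src, tar, post=None):
    """Replace src with tar at the start of word."""
    if post:
        # Only replace when src is followed by one of the characters in post
        if any(word.startswith(src + ch) for ch in post):
            return tar + word[len(src):]
    elif word.startswith(src):
        return tar + word[len(src):]
    return word


def end_repl(word, src, tar, pre=None):
    """Replace src with tar at the end of word."""
    if pre:
        # Only replace when src is preceded by one of the characters in pre
        if any(word.endswith(ch + src) for ch in pre):
            return word[:-len(src)] + tar
    elif word.endswith(src):
        return word[:-len(src)] + tar
    return word


def all_repl(word, src, tar):
    """Replace src with tar anywhere in word."""
    return word.replace(src, tar)


# Hypothetical name-keyed dispatch table (the listing uses a tuple and indices)
REPLACERS = {'start': start_repl, 'end': end_repl, 'all': all_repl}

# Rules are applied in order, so earlier rewrites feed later ones
RULES = (
    ('start', 'KN', 'N'),  # hypothetical rule, for illustration only
    ('end', 'MB', 'M', VOWELS),
    ('all', 'MPTS', 'MPS'),
    ('all', 'MPS', 'MS'),
)


def apply_rules(word, rules=RULES):
    """Apply each (position, src, tar, *context) rule to word in sequence."""
    for position, *args in rules:
        word = REPLACERS[position](word, *args)
    return word


print(apply_rules('KNOMB'))
print(apply_rules('TEMPTS'))
```

Because the rules run sequentially, 'TEMPTS' is first rewritten by the 'MPTS' rule and the result is then rewritten again by the 'MPS' rule, which is why rule order in the table matters.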


# Issue (Unused Code): The variable __class__ seems to be unused.
class Lein(Phonetic):
    """Lein code.

    This is Lein name coding, described in :cite:`Moore:1977`.
    """

    _trans = dict(
        # Issue (Comprehensibility, Best Practice): The variable _ does not
        # seem to be defined.
        zip((ord(_) for _ in 'BCDFGJKLMNPQRSTVXZ'), '451455532245351455')
    )

    # Code points of the characters deleted by Rule 2:
    # space, A, E, H, I, O, U, W, Y
    _del_trans = {num: None for num in (32, 65, 69, 72, 73, 79, 85, 87, 89)}

    # Issue (Bug): Parameters differ from overridden 'encode' method
    def encode(self, word, max_length=4, zero_pad=True):
        """Return the Lein code for a word.

        :param str word: the word to transform
        :param int max_length: the maximum length (default 4) of the code to
            return
        :param bool zero_pad: pad the end of the return value with 0s to
            achieve a max_length string
        :returns: the Lein code
        :rtype: str

        >>> pe = Lein()
        >>> pe.encode('Christopher')
        'C351'
        >>> pe.encode('Niall')
        'N300'
        >>> pe.encode('Smith')
        'S210'
        >>> pe.encode('Schmidt')
        'S521'
        """
        # uppercase, normalize, decompose, and filter non-A-Z out
        word = unicode_normalize('NFKD', text_type(word.upper()))
        word = word.replace('ß', 'SS')
        word = ''.join(c for c in word if c in self._uc_set)

        code = word[:1]  # Rule 1
        word = word[1:].translate(self._del_trans)  # Rule 2
        word = self._delete_consecutive_repeats(word)  # Rule 3
        code += word.translate(self._trans)  # Rule 4

        if zero_pad:
            code += '0' * max_length  # Rule 4

        return code[:max_length]


def lein(word, max_length=4, zero_pad=True):
    """Return the Lein code for a word.

    This is a wrapper for :py:meth:`Lein.encode`.

    :param str word: the word to transform
    :param int max_length: the maximum length (default 4) of the code to return
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
        max_length string
    :returns: the Lein code
    :rtype: str

    >>> lein('Christopher')
    'C351'
    >>> lein('Niall')
    'N300'
    >>> lein('Smith')
    'S210'
    >>> lein('Schmidt')
    'S521'
    """
    return Lein().encode(word, max_length, zero_pad)
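The four Lein rules can be traced end to end in a standalone form. This is a rough Python 3 sketch under stated assumptions rather than the class above: the A-Z filter stands in for the inherited _uc_set, and itertools.groupby stands in for the inherited _delete_consecutive_repeats helper, whose implementation is not shown in this listing. The function name lein_sketch is hypothetical.

```python
from itertools import groupby
from unicodedata import normalize

# Rule 2: characters dropped after the first letter (space, A, E, H, I, O, U, W, Y)
_DELETE = {ord(c): None for c in ' AEHIOUWY'}

# Rule 4: consonant-to-digit table, copied from Lein._trans in the listing
_CODES = dict(zip((ord(c) for c in 'BCDFGJKLMNPQRSTVXZ'), '451455532245351455'))


def lein_sketch(word, max_length=4, zero_pad=True):
    """Return a Lein-style code for word (illustrative sketch)."""
    # Uppercase, decompose, and keep only A-Z (assumed equivalent of _uc_set)
    word = normalize('NFKD', word.upper())
    word = ''.join(c for c in word if 'A' <= c <= 'Z')

    code = word[:1]  # Rule 1: keep the first letter
    rest = word[1:].translate(_DELETE)  # Rule 2: delete vowels, H, W, Y
    # Rule 3: collapse runs of the same letter (assumed helper behavior)
    rest = ''.join(ch for ch, _ in groupby(rest))
    code += rest.translate(_CODES)  # Rule 4: map the remaining letters

    if zero_pad:
        code += '0' * max_length
    return code[:max_length]


print(lein_sketch('Christopher'))
print(lein_sketch('Niall'))
```

Note that repeats are collapsed on the letters before they are translated, so 'Schmidt' keeps both the D and the T even though they share the digit 1.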


# Issue (Unused Code): The variable __class__ seems to be unused.
class PSHPSoundexLast(Phonetic):
    """PSHP Soundex/Viewex Coding of a last name.

    This coding is based on :cite:`Hershberg:1976`.

    Reference was also made to the German version of the same:
    :cite:`Hershberg:1979`.

    A separate class, :py:class:`PSHPSoundexFirst`, is used for first names.
    """

    _trans = dict(
        zip(
            # Issue (Comprehensibility, Best Practice): The variable _ does
            # not seem to be defined.
            (ord(_) for _ in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
            '01230120022455012523010202',
        )
    )

    # Issue (Bug): Parameters differ from overridden 'encode' method
    def encode(self, lname, max_length=4, german=False):
        """Calculate the PSHP Soundex/Viewex Coding of a last name.

        :param str lname: the last name to encode
        :param int max_length: the length of the code returned (defaults to 4;
            -1 returns the code without padding or truncation)
        :param bool german: set to True if the name is German (different rules
            apply)
        :returns: the PSHP Soundex/Viewex Coding
        :rtype: str

        >>> pe = PSHPSoundexLast()
        >>> pe.encode('Smith')
        'S530'
        >>> pe.encode('Waters')
        'W350'
        >>> pe.encode('James')
        'J500'
        >>> pe.encode('Schmidt')
        'S530'
        >>> pe.encode('Ashcroft')
        'A225'
        """
        lname = unicode_normalize('NFKD', text_type(lname.upper()))
        lname = lname.replace('ß', 'SS')
        lname = ''.join(c for c in lname if c in self._uc_set)

        # A. Prefix treatment
        if lname[:3] == 'VON' or lname[:3] == 'VAN':
            lname = lname[3:].strip()

        # The rule implemented below says "MC, MAC become 1". I believe it
        # meant to say they become M except in German data (where superscripted
        # 1 indicates "except in German data"). It doesn't make sense for them
        # to become 1 (BPFV -> 1) or to apply outside German. Unfortunately,
        # both articles have this error(?).
        if not german:
            if lname[:3] == 'MAC':
                lname = 'M' + lname[3:]
            elif lname[:2] == 'MC':
                lname = 'M' + lname[2:]

        # The non-German-only rule to strip ' is unnecessary due to filtering

        if lname[:1] in {'E', 'I', 'O', 'U'}:
            lname = 'A' + lname[1:]
        elif lname[:2] in {'GE', 'GI', 'GY'}:
            lname = 'J' + lname[1:]
        elif lname[:2] in {'CE', 'CI', 'CY'}:
            lname = 'S' + lname[1:]
        elif lname[:3] == 'CHR':
            lname = 'K' + lname[1:]
        elif lname[:1] == 'C' and lname[:2] != 'CH':
            lname = 'K' + lname[1:]

        if lname[:2] == 'KN':
            lname = 'N' + lname[1:]
        elif lname[:2] == 'PH':
            lname = 'F' + lname[1:]
        elif lname[:3] in {'WIE', 'WEI'}:
            lname = 'V' + lname[1:]

        if german and lname[:1] in {'W', 'M', 'Y', 'Z'}:
            lname = {'W': 'V', 'M': 'N', 'Y': 'J', 'Z': 'S'}[lname[0]] + lname[
                1:
            ]

        code = lname[:1]

        # B. Postfix treatment
        if german:  # moved from end of postfix treatment due to blocking
            if lname[-3:] == 'TES':
                lname = lname[:-3]
            elif lname[-2:] == 'TS':
                lname = lname[:-2]
            if lname[-3:] == 'TZE':
                lname = lname[:-3]
            elif lname[-2:] == 'ZE':
                lname = lname[:-2]
            if lname[-1:] == 'Z':
                lname = lname[:-1]
            elif lname[-2:] == 'TE':
                lname = lname[:-2]

        if lname[-1:] == 'R':
            lname = lname[:-1] + 'N'
        elif lname[-2:] in {'SE', 'CE'}:
            lname = lname[:-2]
        if lname[-2:] == 'SS':
            lname = lname[:-2]
        elif lname[-1:] == 'S':
            lname = lname[:-1]

        if not german:
            l5_repl = {'STOWN': 'SAWON', 'MPSON': 'MASON'}
            l4_repl = {
                'NSEN': 'ASEN',
                'MSON': 'ASON',
                'STEN': 'SAEN',
                'STON': 'SAON',
            }
            if lname[-5:] in l5_repl:
                lname = lname[:-5] + l5_repl[lname[-5:]]
            elif lname[-4:] in l4_repl:
                lname = lname[:-4] + l4_repl[lname[-4:]]

        if lname[-2:] in {'NG', 'ND'}:
            lname = lname[:-1]
        if not german and lname[-3:] in {'GAN', 'GEN'}:
            lname = lname[:-3] + 'A' + lname[-2:]

        # C. Infix Treatment
        lname = lname.replace('CK', 'C')
        lname = lname.replace('SCH', 'S')
        lname = lname.replace('DT', 'T')
        lname = lname.replace('ND', 'N')
        lname = lname.replace('NG', 'N')
        lname = lname.replace('LM', 'M')
        lname = lname.replace('MN', 'M')
        lname = lname.replace('WIE', 'VIE')
        lname = lname.replace('WEI', 'VEI')

        # D. Soundexing
        # codes for X & Y are unspecified, but presumably are 2 & 0

        lname = lname.translate(self._trans)
        lname = self._delete_consecutive_repeats(lname)

        code += lname[1:]
        code = code.replace('0', '')  # rule 1

        if max_length != -1:
            if len(code) < max_length:
                code += '0' * (max_length - len(code))
            else:
                code = code[:max_length]

        return code


def pshp_soundex_last(lname, max_length=4, german=False):
    """Calculate the PSHP Soundex/Viewex Coding of a last name.

    This is a wrapper for :py:meth:`PSHPSoundexLast.encode`.

    :param str lname: the last name to encode
    :param int max_length: the length of the code returned (defaults to 4;
        -1 returns the code without padding or truncation)
    :param bool german: set to True if the name is German (different rules
        apply)
    :returns: the PSHP Soundex/Viewex Coding
    :rtype: str

    >>> pshp_soundex_last('Smith')
    'S530'
    >>> pshp_soundex_last('Waters')
    'W350'
    >>> pshp_soundex_last('James')
    'J500'
    >>> pshp_soundex_last('Schmidt')
    'S530'
    >>> pshp_soundex_last('Ashcroft')
    'A225'
    """
    return PSHPSoundexLast().encode(lname, max_length, german)
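Step D of the last-name coding (translate, collapse repeats, drop zeros, then pad or truncate) is easiest to follow on a name that has already passed through the prefix, postfix and infix steps. The sketch below is an illustrative Python 3 reconstruction under assumptions: soundex_tail is a hypothetical name, the first letter is passed in separately because the listing captures it before the postfix step, and itertools.groupby stands in for the inherited _delete_consecutive_repeats helper.

```python
from itertools import groupby

# Letter-to-digit table copied from PSHPSoundexLast._trans in the listing
_TRANS = dict(
    zip(
        (ord(c) for c in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
        '01230120022455012523010202',
    )
)


def soundex_tail(first_letter, name, max_length=4):
    """Code an already-normalized uppercase name (illustrative sketch)."""
    digits = name.translate(_TRANS)
    # Collapse runs of identical digits (assumed helper behavior)
    digits = ''.join(d for d, _ in groupby(digits))

    # Keep the saved first letter, skip the first digit, then drop the zeros
    code = (first_letter + digits[1:]).replace('0', '')

    if max_length != -1:
        # Truncate to max_length, or pad with zeros up to max_length
        code = code[:max_length].ljust(max_length, '0')
    return code


print(soundex_tail('S', 'SMITH'))
print(soundex_tail('A', 'ASHCROFT', max_length=-1))
```

Passing max_length=-1 skips both padding and truncation, so the full digit string is returned.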


# Issue (Unused Code): The variable __class__ seems to be unused.
class PSHPSoundexFirst(Phonetic):
    """PSHP Soundex/Viewex Coding of a first name.

    This coding is based on :cite:`Hershberg:1976`.

    Reference was also made to the German version of the same:
    :cite:`Hershberg:1979`.

    A separate class, :py:class:`PSHPSoundexLast`, is used for last names.
    """

    _trans = dict(
        zip(
            # Issue (Comprehensibility, Best Practice): The variable _ does
            # not seem to be defined.
            (ord(_) for _ in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
            '01230120022455012523010202',
        )
    )

    # Issue (Bug): Parameters differ from overridden 'encode' method
    def encode(self, fname, max_length=4, german=False):
        """Calculate the PSHP Soundex/Viewex Coding of a first name.

        :param str fname: the first name to encode
        :param int max_length: the length of the code returned (defaults to 4;
            -1 returns the code without padding or truncation)
        :param bool german: set to True if the name is German (different rules
            apply)
        :returns: the PSHP Soundex/Viewex Coding
        :rtype: str

        >>> pe = PSHPSoundexFirst()
        >>> pe.encode('Smith')
        'S530'
        >>> pe.encode('Waters')
        'W352'
        >>> pe.encode('James')
        'J700'
        >>> pe.encode('Schmidt')
        'S500'
        >>> pe.encode('Ashcroft')
        'A220'
        >>> pe.encode('John')
        'J500'
        >>> pe.encode('Colin')
        'K400'
        >>> pe.encode('Niall')
        'N400'
        >>> pe.encode('Sally')
        'S400'
        >>> pe.encode('Jane')
        'J500'
        """
        fname = unicode_normalize('NFKD', text_type(fname.upper()))
        fname = fname.replace('ß', 'SS')
        fname = ''.join(c for c in fname if c in self._uc_set)

        # special rules
        if fname == 'JAMES':
            code = 'J7'
        elif fname == 'PAT':
            code = 'P7'

        else:
            # A. Prefix treatment
            if fname[:2] in {'GE', 'GI', 'GY'}:
                fname = 'J' + fname[1:]
            elif fname[:2] in {'CE', 'CI', 'CY'}:
                fname = 'S' + fname[1:]
            elif fname[:3] == 'CHR':
                fname = 'K' + fname[1:]
            elif fname[:1] == 'C' and fname[:2] != 'CH':
                fname = 'K' + fname[1:]

            if fname[:2] == 'KN':
                fname = 'N' + fname[1:]
            elif fname[:2] == 'PH':
                fname = 'F' + fname[1:]
            elif fname[:3] in {'WIE', 'WEI'}:
                fname = 'V' + fname[1:]

            if german and fname[:1] in {'W', 'M', 'Y', 'Z'}:
                fname = {'W': 'V', 'M': 'N', 'Y': 'J', 'Z': 'S'}[
                    fname[0]
                ] + fname[1:]

            code = fname[:1]

            # B. Soundex coding
            # code for Y unspecified, but presumably is 0
            fname = fname.translate(self._trans)
            fname = self._delete_consecutive_repeats(fname)

            code += fname[1:]
            # syl_ptr is an absolute index into code; syl2_ptr is an offset
            # relative to the character that follows syl_ptr
            syl_ptr = code.find('0')
            syl2_ptr = code[syl_ptr + 1 :].find('0')
            if syl_ptr != -1 and syl2_ptr != -1 and syl2_ptr - syl_ptr > -1:
                code = code[: syl_ptr + 2]

            code = code.replace('0', '')  # rule 1

        if max_length != -1:
            if len(code) < max_length:
                code += '0' * (max_length - len(code))
            else:
                code = code[:max_length]

        return code


def pshp_soundex_first(fname, max_length=4, german=False):
    """Calculate the PSHP Soundex/Viewex Coding of a first name.

    This is a wrapper for :py:meth:`PSHPSoundexFirst.encode`.

    :param str fname: the first name to encode
    :param int max_length: the length of the code returned (defaults to 4;
        -1 returns the code without padding or truncation)
    :param bool german: set to True if the name is German (different rules
        apply)
    :returns: the PSHP Soundex/Viewex Coding
    :rtype: str

    >>> pshp_soundex_first('Smith')
    'S530'
    >>> pshp_soundex_first('Waters')
    'W352'
    >>> pshp_soundex_first('James')
    'J700'
    >>> pshp_soundex_first('Schmidt')
    'S500'
    >>> pshp_soundex_first('Ashcroft')
    'A220'
    >>> pshp_soundex_first('John')
    'J500'
    >>> pshp_soundex_first('Colin')
    'K400'
    >>> pshp_soundex_first('Niall')
    'N400'
    >>> pshp_soundex_first('Sally')
    'S400'
    >>> pshp_soundex_first('Jane')
    'J500'
    """
    return PSHPSoundexFirst().encode(fname, max_length, german)
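The zero-pointer test in the first-name coding is terse, so it may help to run it on intermediate strings. The sketch below copies the condition exactly as it appears in PSHPSoundexFirst.encode and applies it to codes that still contain their '0' placeholders; the function name truncate_at_zeros is hypothetical, and the inputs are intermediate values worked out by hand from the listing's translation table, not values produced by the library.

```python
def truncate_at_zeros(code):
    """Apply the syl_ptr/syl2_ptr truncation, then drop the zeros (sketch)."""
    # Absolute index of the first '0' in code, or -1 if there is none
    syl_ptr = code.find('0')
    # Offset of the next '0', counted from the character after syl_ptr
    syl2_ptr = code[syl_ptr + 1:].find('0')

    # Condition copied as written: it compares a relative offset with an
    # absolute index, so it only holds when the second '0' is at least as
    # far past the first as the first is from the start of the code
    if syl_ptr != -1 and syl2_ptr != -1 and syl2_ptr - syl_ptr > -1:
        code = code[: syl_ptr + 2]

    return code.replace('0', '')


# 'A2025013' is the intermediate code for ASHCROFT; it is cut to 'A202'
print(truncate_at_zeros('A2025013'))
# 'S5030' is the intermediate code for SMITH; the condition fails, no cut
print(truncate_at_zeros('S5030'))
```

This accounts for the difference between the first-name and last-name doctests: 'Ashcroft' yields 'A220' here but 'A225' in the last-name coding, because the first-name code is cut one character after its first zero before the zeros are removed.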


if __name__ == '__main__':
    import doctest

    doctest.testmod()