Completed
Push to master (ae7271...8fcab5), created by Chris, 11:49

abydos.phonetic.alpha_sis()   F

Complexity
    Conditions: 14

Size
    Total Lines: 102
    Code Lines: 63

Duplication
    Lines: 0
    Ratio: 0 %

Importance
    Changes: 0

Metric  Value
cc      14
eloc    63
nop     2
dl      0
loc     102
rs      3.6
c       0
b       0
f       0

How to fix   Long Method    Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
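
As a rough illustration of that advice, the sketch below takes two commented steps and turns each comment into the name of a small helper. The function names `code_before`, `code_after`, and `_uppercase_and_filter_non_a_z` are hypothetical and chosen only for this example; the sketch assumes Python 3 and is not the refactoring the project itself applies.

```python
from itertools import groupby
from unicodedata import normalize

_A_TO_Z = frozenset('ABCDEFGHIJKLMNOPQRSTUVWXYZ')


def code_before(word):
    """Long form: every commented step is a candidate for extraction."""
    # uppercase, normalize, decompose, and filter non-A-Z out
    word = normalize('NFKD', word.upper())
    word = ''.join(c for c in word if c in _A_TO_Z)
    # delete consecutive repeats
    return ''.join(char for char, _ in groupby(word))


def _uppercase_and_filter_non_a_z(word):
    """Uppercase, NFKD-decompose, and keep only the letters A-Z."""
    word = normalize('NFKD', word.upper())
    return ''.join(c for c in word if c in _A_TO_Z)


def _delete_consecutive_repeats(word):
    """Collapse runs of the same character to a single instance."""
    return ''.join(char for char, _ in groupby(word))


def code_after(word):
    """Short form: the old comments now live on as helper names."""
    return _delete_consecutive_repeats(_uppercase_and_filter_non_a_z(word))
```

Both versions return the same value for the same input; the second one simply reads as a sequence of named steps, and each helper can be tested and reused on its own.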

Commonly applied refactorings include:

Complexity

Complex classes like abydos.phonetic.alpha_sis() often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
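
One way to picture Extract Class on this module, purely as a sketch: the three functions whose names start with `russell_index` form a cohesive group, so they could be gathered into one class. The class name `RussellIndex` and its method names are assumptions made for this example, the code targets Python 3 only (no `six`), and it is not presented as the project's actual design.

```python
from itertools import groupby
from unicodedata import normalize


class RussellIndex:
    """Hypothetical class gathering the russell_index* functions."""

    _letters = 'ABCDEFGIKLMNOPQRSTUVXYZ'
    # letter -> digit table, same mapping as in russell_index()
    _to_digits = str.maketrans(_letters, '12341231356712383412313')
    # digit -> letter table, same mapping as in russell_index_num_to_alpha()
    _to_alpha = str.maketrans('12345678', 'ABCDLMNR')

    def encode(self, word):
        """Return the Russell Index of a word as an int (NaN if empty)."""
        word = normalize('NFKD', word.upper())
        word = word.replace('ß', 'SS')
        word = word.replace('GH', '')  # discard gh
        word = word.rstrip('SZ')  # discard trailing s/z
        word = ''.join(c for c in word if c in self._letters)
        sdx = word.translate(self._to_digits)

        # remove any 1s after the first occurrence
        one = sdx.find('1') + 1
        if one:
            sdx = sdx[:one] + sdx[one:].replace('1', '')

        # remove repeating characters
        sdx = ''.join(char for char, _ in groupby(sdx))
        return int(sdx) if sdx else float('NaN')

    def num_to_alpha(self, num):
        """Convert a Russell Index integer to its alphabetic form."""
        digits = ''.join(c for c in str(num) if c in '12345678')
        return digits.translate(self._to_alpha)

    def encode_alpha(self, word):
        """Return the Russell Index of a word as an alphabetic string."""
        return self.num_to_alpha(self.encode(word)) if word else ''
```

With this grouping, the translation tables are built once at class level instead of on every call, and the shared prefix disappears from the method names because the class name carries it.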

# -*- coding: utf-8 -*-
# Review note (coding-style): Too many lines in module (6524/1000)

# Copyright 2014-2018 by Christopher C. Little.
# This file is part of Abydos.
#
# Abydos is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Abydos is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.

"""abydos.phonetic.

The phonetic module implements phonetic algorithms including:

    - Robert C. Russell's Index
    - American Soundex
    - Refined Soundex
    - Daitch-Mokotoff Soundex
    - Kölner Phonetik
    - NYSIIS
    - Match Rating Algorithm
    - Metaphone
    - Double Metaphone
    - Caverphone
    - Alpha Search Inquiry System
    - Fuzzy Soundex
    - Phonex
    - Phonem
    - Phonix
    - SfinxBis
    - phonet
    - Standardized Phonetic Frequency Code
    - Statistics Canada
    - Lein
    - Roger Root
    - Oxford Name Compression Algorithm (ONCA)
    - Eudex phonetic hash
    - Haase Phonetik
    - Reth-Schek Phonetik
    - FONEM
    - Parmar-Kumbharana
    - Davidson's Consonant Code
    - SoundD
    - PSHP Soundex/Viewex Coding
    - an early version of Henry Code
    - Norphone
    - Dolby Code
    - Phonetic Spanish
    - Spanish Metaphone
    - MetaSoundex
    - SoundexBR
    - Beider-Morse Phonetic Matching
"""

from __future__ import division, unicode_literals

from collections import Counter
from itertools import groupby, product
from re import compile as re_compile
from re import match as re_match
from unicodedata import normalize

from six import text_type
from six.moves import range

from ._bm import _bmpm

_INFINITY = float('inf')


def _delete_consecutive_repeats(word):
    """Delete consecutive repeated characters in a word.

    :param str word: the word to transform
    :returns: word with consecutive repeating characters collapsed to
        a single instance
    :rtype: str
    """
    return ''.join(char for char, _ in groupby(word))


def russell_index(word):
    """Return the Russell Index (integer output) of a word.

    This follows Robert C. Russell's Index algorithm, as described in
    :cite:`Russell:1917`.

    :param str word: the word to transform
    :returns: the Russell Index value
    :rtype: int

    >>> russell_index('Christopher')
    3813428
    >>> russell_index('Niall')
    715
    >>> russell_index('Smith')
    3614
    >>> russell_index('Schmidt')
    3614
    """
    # Review note (Comprehensibility, Best Practice): The variable _ does not
    # seem to be defined.
    _russell_translation = dict(zip((ord(_) for _ in
                                     'ABCDEFGIKLMNOPQRSTUVXYZ'),
                                    '12341231356712383412313'))

    word = normalize('NFKD', text_type(word.upper()))
    word = word.replace('ß', 'SS')
    word = word.replace('GH', '')  # discard gh (rule 3)
    word = word.rstrip('SZ')  # discard /[sz]$/ (rule 3)

    # translate according to Russell's mapping
    word = ''.join(c for c in word if c in
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'I', 'K', 'L', 'M', 'N',
                    'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Y', 'Z'})
    sdx = word.translate(_russell_translation)

    # remove any 1s after the first occurrence
    one = sdx.find('1')+1
    if one:
        sdx = sdx[:one] + ''.join(c for c in sdx[one:] if c != '1')

    # remove repeating characters
    sdx = _delete_consecutive_repeats(sdx)

    # return as an int
    return int(sdx) if sdx else float('NaN')


def russell_index_num_to_alpha(num):
    """Convert the Russell Index integer to an alphabetic string.

    This follows Robert C. Russell's Index algorithm, as described in
    :cite:`Russell:1917`.

    :param int num: a Russell Index integer value
    :returns: the Russell Index as an alphabetic string
    :rtype: str

    >>> russell_index_num_to_alpha(3813428)
    'CRACDBR'
    >>> russell_index_num_to_alpha(715)
    'NAL'
    >>> russell_index_num_to_alpha(3614)
    'CMAD'
    """
    # Review note (Comprehensibility, Best Practice): The variable _ does not
    # seem to be defined.
    _russell_num_translation = dict(zip((ord(_) for _ in '12345678'),
                                        'ABCDLMNR'))
    num = ''.join(c for c in text_type(num) if c in {'1', '2', '3', '4', '5',
                                                     '6', '7', '8'})
    if num:
        return num.translate(_russell_num_translation)
    return ''


def russell_index_alpha(word):
    """Return the Russell Index (alphabetic output) for the word.

    This follows Robert C. Russell's Index algorithm, as described in
    :cite:`Russell:1917`.

    :param str word: the word to transform
    :returns: the Russell Index value as an alphabetic string
    :rtype: str

    >>> russell_index_alpha('Christopher')
    'CRACDBR'
    >>> russell_index_alpha('Niall')
    'NAL'
    >>> russell_index_alpha('Smith')
    'CMAD'
    >>> russell_index_alpha('Schmidt')
    'CMAD'
    """
    if word:
        return russell_index_num_to_alpha(russell_index(word))
    return ''


def soundex(word, maxlength=4, var='American', reverse=False, zero_pad=True):
    """Return the Soundex code for a word.

    :param str word: the word to transform
    :param int maxlength: the length of the code returned (defaults to 4)
    :param str var: the variant of the algorithm to employ (defaults to
        'American'):

        - 'American' follows the American Soundex algorithm, as described at
          :cite:`US:2007` and in :cite:`Knuth:1998`; this is also called
          Miracode
        - 'special' follows the rules from the 1880-1910 US Census
          retrospective re-analysis, in which h & w are not treated as blocking
          consonants but as vowels. Cf. :cite:`Repici:2013`.
        - 'Census' follows the rules laid out in GIL 55 :cite:`US:1997` by the
          US Census, including coding prefixed and unprefixed versions of some
          names

    :param bool reverse: reverse the word before computing the selected Soundex
        (defaults to False); This results in "Reverse Soundex"
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
        maxlength string
    :returns: the Soundex value
    :rtype: str

    >>> soundex("Christopher")
    'C623'
    >>> soundex("Niall")
    'N400'
    >>> soundex('Smith')
    'S530'
    >>> soundex('Schmidt')
    'S530'


    >>> soundex('Christopher', maxlength=_INFINITY)
    'C623160000000000000000000000000000000000000000000000000000000000'
    >>> soundex('Christopher', maxlength=_INFINITY, zero_pad=False)
    'C62316'

    >>> soundex('Christopher', reverse=True)
    'R132'

    >>> soundex('Ashcroft')
    'A261'
    >>> soundex('Asicroft')
    'A226'
    >>> soundex('Ashcroft', var='special')
    'A226'
    >>> soundex('Asicroft', var='special')
    'A226'
    """
    # Review note (Comprehensibility, Best Practice): The variable _ does not
    # seem to be defined.
    _soundex_translation = dict(zip((ord(_) for _ in
                                     'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
                                    '01230129022455012623019202'))

    # Require a maxlength of at least 4 and not more than 64
    if maxlength is not None:
        maxlength = min(max(4, maxlength), 64)
    else:
        maxlength = 64

    # uppercase, normalize, decompose, and filter non-A-Z out
    word = normalize('NFKD', text_type(word.upper()))
    word = word.replace('ß', 'SS')

    if var == 'Census':
        # Should these prefixes be supplemented? (VANDE, DELA, VON)
        if word[:3] in {'VAN', 'CON'} and len(word) > 4:
            return (soundex(word, maxlength, 'American', reverse, zero_pad),
                    soundex(word[3:], maxlength, 'American', reverse,
                            zero_pad))
        if word[:2] in {'DE', 'DI', 'LA', 'LE'} and len(word) > 3:
            return (soundex(word, maxlength, 'American', reverse, zero_pad),
                    soundex(word[2:], maxlength, 'American', reverse,
                            zero_pad))
        # Otherwise, proceed as usual (var='American' mode, ostensibly)

    word = ''.join(c for c in word if c in
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
                    'Y', 'Z'})

    # Nothing to convert, return base case
    if not word:
        if zero_pad:
            return '0'*maxlength
        return '0'

    # Reverse word if computing Reverse Soundex
    if reverse:
        word = word[::-1]

    # apply the Soundex algorithm
    sdx = word.translate(_soundex_translation)

    if var == 'special':
        sdx = sdx.replace('9', '0')  # special rule for 1880-1910 census
    else:
        sdx = sdx.replace('9', '')  # rule 1
    sdx = _delete_consecutive_repeats(sdx)  # rule 3

    if word[0] in 'HW':
        sdx = word[0] + sdx
    else:
        sdx = word[0] + sdx[1:]
    sdx = sdx.replace('0', '')  # rule 1

    if zero_pad:
        sdx += ('0'*maxlength)  # rule 4

    return sdx[:maxlength]


def refined_soundex(word, maxlength=_INFINITY, reverse=False, zero_pad=False,
                    retain_vowels=False):
    """Return the Refined Soundex code for a word.

    This is Soundex, but with more character classes. It was defined at
    :cite:`Boyce:1998`.

    :param word: the word to transform
    :param maxlength: the length of the code returned (defaults to unlimited)
    :param reverse: reverse the word before computing the selected Soundex
        (defaults to False); This results in "Reverse Soundex"
    :param zero_pad: pad the end of the return value with 0s to achieve a
        maxlength string
    :param retain_vowels: retain vowels (as 0) in the resulting code
    :returns: the Refined Soundex value
    :rtype: str

    >>> refined_soundex('Christopher')
    'C393619'
    >>> refined_soundex('Niall')
    'N87'
    >>> refined_soundex('Smith')
    'S386'
    >>> refined_soundex('Schmidt')
    'S386'
    """
    # Review note (Comprehensibility, Best Practice): The variable _ does not
    # seem to be defined.
    _ref_soundex_translation = dict(zip((ord(_) for _ in
                                         'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
                                        '01360240043788015936020505'))

    # uppercase, normalize, decompose, and filter non-A-Z out
    word = normalize('NFKD', text_type(word.upper()))
    word = word.replace('ß', 'SS')
    word = ''.join(c for c in word if c in
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
                    'Y', 'Z'})

    # Reverse word if computing Reverse Soundex
    if reverse:
        word = word[::-1]

    # apply the Soundex algorithm
    sdx = word[:1] + word.translate(_ref_soundex_translation)
    sdx = _delete_consecutive_repeats(sdx)
    if not retain_vowels:
        sdx = sdx.replace('0', '')  # Delete vowels, H, W, Y

    if maxlength < _INFINITY:
        if zero_pad:
            sdx += ('0' * maxlength)
        if maxlength:
            sdx = sdx[:maxlength]

    return sdx


def dm_soundex(word, maxlength=6, reverse=False, zero_pad=True):
    """Return the Daitch-Mokotoff Soundex code for a word.

    Based on Daitch-Mokotoff Soundex :cite:`Mokotoff:1997`, this returns values
    of a word as a set. A collection is necessary since there can be multiple
    values for a single word.

    :param word: the word to transform
    :param maxlength: the length of the code returned (defaults to 6)
    :param reverse: reverse the word before computing the selected Soundex
        (defaults to False); This results in "Reverse Soundex"
    :param zero_pad: pad the end of the return value with 0s to achieve a
        maxlength string
    :returns: the Daitch-Mokotoff Soundex value
    :rtype: set

    >>> sorted(dm_soundex('Christopher'))
    ['494379', '594379']
    >>> dm_soundex('Niall')
    {'680000'}
    >>> dm_soundex('Smith')
    {'463000'}
    >>> dm_soundex('Schmidt')
    {'463000'}

    >>> sorted(dm_soundex('The quick brown fox', maxlength=20, zero_pad=False))
    ['35457976754', '3557976754']
    """
    _dms_table = {'STCH': (2, 4, 4), 'DRZ': (4, 4, 4), 'ZH': (4, 4, 4),
                  'ZHDZH': (2, 4, 4), 'DZH': (4, 4, 4), 'DRS': (4, 4, 4),
                  'DZS': (4, 4, 4), 'SCHTCH': (2, 4, 4), 'SHTSH': (2, 4, 4),
                  'SZCZ': (2, 4, 4), 'TZS': (4, 4, 4), 'SZCS': (2, 4, 4),
                  'STSH': (2, 4, 4), 'SHCH': (2, 4, 4), 'D': (3, 3, 3),
                  'H': (5, 5, '_'), 'TTSCH': (4, 4, 4), 'THS': (4, 4, 4),
                  'L': (8, 8, 8), 'P': (7, 7, 7), 'CHS': (5, 54, 54),
                  'T': (3, 3, 3), 'X': (5, 54, 54), 'OJ': (0, 1, '_'),
                  'OI': (0, 1, '_'), 'SCHTSH': (2, 4, 4), 'OY': (0, 1, '_'),
                  'Y': (1, '_', '_'), 'TSH': (4, 4, 4), 'ZDZ': (2, 4, 4),
                  'TSZ': (4, 4, 4), 'SHT': (2, 43, 43), 'SCHTSCH': (2, 4, 4),
                  'TTSZ': (4, 4, 4), 'TTZ': (4, 4, 4), 'SCH': (4, 4, 4),
                  'TTS': (4, 4, 4), 'SZD': (2, 43, 43), 'AI': (0, 1, '_'),
                  'PF': (7, 7, 7), 'TCH': (4, 4, 4), 'PH': (7, 7, 7),
                  'TTCH': (4, 4, 4), 'SZT': (2, 43, 43), 'ZDZH': (2, 4, 4),
                  'EI': (0, 1, '_'), 'G': (5, 5, 5), 'EJ': (0, 1, '_'),
                  'ZD': (2, 43, 43), 'IU': (1, '_', '_'), 'K': (5, 5, 5),
                  'O': (0, '_', '_'), 'SHTCH': (2, 4, 4), 'S': (4, 4, 4),
                  'TRZ': (4, 4, 4), 'SHD': (2, 43, 43), 'DSH': (4, 4, 4),
                  'CSZ': (4, 4, 4), 'EU': (1, 1, '_'), 'TRS': (4, 4, 4),
                  'ZS': (4, 4, 4), 'STRZ': (2, 4, 4), 'UY': (0, 1, '_'),
                  'STRS': (2, 4, 4), 'CZS': (4, 4, 4),
                  'MN': ('6_6', '6_6', '6_6'), 'UI': (0, 1, '_'),
                  'UJ': (0, 1, '_'), 'UE': (0, '_', '_'), 'EY': (0, 1, '_'),
                  'W': (7, 7, 7), 'IA': (1, '_', '_'), 'FB': (7, 7, 7),
                  'STSCH': (2, 4, 4), 'SCHT': (2, 43, 43),
                  'NM': ('6_6', '6_6', '6_6'), 'SCHD': (2, 43, 43),
                  'B': (7, 7, 7), 'DSZ': (4, 4, 4), 'F': (7, 7, 7),
                  'N': (6, 6, 6), 'CZ': (4, 4, 4), 'R': (9, 9, 9),
                  'U': (0, '_', '_'), 'V': (7, 7, 7), 'CS': (4, 4, 4),
                  'Z': (4, 4, 4), 'SZ': (4, 4, 4), 'TSCH': (4, 4, 4),
                  'KH': (5, 5, 5), 'ST': (2, 43, 43), 'KS': (5, 54, 54),
                  'SH': (4, 4, 4), 'SC': (2, 4, 4), 'SD': (2, 43, 43),
                  'DZ': (4, 4, 4), 'ZHD': (2, 43, 43), 'DT': (3, 3, 3),
                  'ZSH': (4, 4, 4), 'DS': (4, 4, 4), 'TZ': (4, 4, 4),
                  'TS': (4, 4, 4), 'TH': (3, 3, 3), 'TC': (4, 4, 4),
                  'A': (0, '_', '_'), 'E': (0, '_', '_'), 'I': (0, '_', '_'),
                  'AJ': (0, 1, '_'), 'M': (6, 6, 6), 'Q': (5, 5, 5),
                  'AU': (0, 7, '_'), 'IO': (1, '_', '_'), 'AY': (0, 1, '_'),
                  'IE': (1, '_', '_'), 'ZSCH': (4, 4, 4),
                  'CH': ((5, 4), (5, 4), (5, 4)),
                  'CK': ((5, 45), (5, 45), (5, 45)),
                  'C': ((5, 4), (5, 4), (5, 4)),
                  'J': ((1, 4), ('_', 4), ('_', 4)),
                  'RZ': ((94, 4), (94, 4), (94, 4)),
                  'RS': ((94, 4), (94, 4), (94, 4))}

    _dms_order = {'A': ('AI', 'AJ', 'AU', 'AY', 'A'),
                  'B': ('B'),
                  'C': ('CHS', 'CSZ', 'CZS', 'CH', 'CK', 'CS', 'CZ', 'C'),
                  'D': ('DRS', 'DRZ', 'DSH', 'DSZ', 'DZH', 'DZS', 'DS', 'DT',
                        'DZ', 'D'),
                  'E': ('EI', 'EJ', 'EU', 'EY', 'E'),
                  'F': ('FB', 'F'),
                  'G': ('G'),
                  'H': ('H'),
                  'I': ('IA', 'IE', 'IO', 'IU', 'I'),
                  'J': ('J'),
                  'K': ('KH', 'KS', 'K'),
                  'L': ('L'),
                  'M': ('MN', 'M'),
                  'N': ('NM', 'N'),
                  'O': ('OI', 'OJ', 'OY', 'O'),
                  'P': ('PF', 'PH', 'P'),
                  'Q': ('Q'),
                  'R': ('RS', 'RZ', 'R'),
                  'S': ('SCHTSCH', 'SCHTCH', 'SCHTSH', 'SHTCH', 'SHTSH',
                        'STSCH', 'SCHD', 'SCHT', 'SHCH', 'STCH', 'STRS',
                        'STRZ', 'STSH', 'SZCS', 'SZCZ', 'SCH', 'SHD', 'SHT',
                        'SZD', 'SZT', 'SC', 'SD', 'SH', 'ST', 'SZ', 'S'),
                  'T': ('TTSCH', 'TSCH', 'TTCH', 'TTSZ', 'TCH', 'THS', 'TRS',
                        'TRZ', 'TSH', 'TSZ', 'TTS', 'TTZ', 'TZS', 'TC', 'TH',
                        'TS', 'TZ', 'T'),
                  'U': ('UE', 'UI', 'UJ', 'UY', 'U'),
                  'V': ('V'),
                  'W': ('W'),
                  'X': ('X'),
                  'Y': ('Y'),
                  'Z': ('ZHDZH', 'ZDZH', 'ZSCH', 'ZDZ', 'ZHD', 'ZSH', 'ZD',
                        'ZH', 'ZS', 'Z')}

    _vowels = {'A', 'E', 'I', 'J', 'O', 'U', 'Y'}
    dms = ['']  # initialize empty code list

    # Require a maxlength of at least 6 and not more than 64
    if maxlength is not None:
        maxlength = min(max(6, maxlength), 64)
    else:
        maxlength = 64

    # uppercase, normalize, decompose, and filter non-A-Z
    word = normalize('NFKD', text_type(word.upper()))
    word = word.replace('ß', 'SS')
    word = ''.join(c for c in word if c in
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
                    'Y', 'Z'})

    # Nothing to convert, return base case
    if not word:
        if zero_pad:
            return {'0'*maxlength}
        return {'0'}

    # Reverse word if computing Reverse Soundex
    if reverse:
        word = word[::-1]

    pos = 0
    while pos < len(word):
        # Iterate through _dms_order, which specifies the possible substrings
        # for which codes exist in the Daitch-Mokotoff coding
        for sstr in _dms_order[word[pos]]:  # pragma: no branch
            if word[pos:].startswith(sstr):
                # Having determined a valid substring start, retrieve the code
                dm_val = _dms_table[sstr]

                # Having retrieved the code (triple), determine the correct
                # positional variant (first, pre-vocalic, elsewhere)
                if pos == 0:
                    dm_val = dm_val[0]
                elif (pos+len(sstr) < len(word) and
                      word[pos+len(sstr)] in _vowels):
                    dm_val = dm_val[1]
                else:
                    dm_val = dm_val[2]

                # Build the code strings
                if isinstance(dm_val, tuple):
                    dms = [_ + text_type(dm_val[0]) for _ in dms] \
                            + [_ + text_type(dm_val[1]) for _ in dms]
                else:
                    dms = [_ + text_type(dm_val) for _ in dms]
                pos += len(sstr)
                break

    # Filter out double letters and _ placeholders
    # Review note (Comprehensibility, Best Practice): The variable _ does not
    # seem to be defined.
    dms = (''.join(c for c in _delete_consecutive_repeats(_) if c != '_')
           for _ in dms)

    # Trim codes and return set
    if zero_pad:
        dms = ((_ + ('0'*maxlength))[:maxlength] for _ in dms)
    else:
        dms = (_[:maxlength] for _ in dms)
    return set(dms)


def koelner_phonetik(word):
    """Return the Kölner Phonetik (numeric output) code for a word.

    Based on the algorithm defined by :cite:`Postel:1969`.

    While the output code is numeric, it is still a str because 0s can lead
    the code.

    :param str word: the word to transform
    :returns: the Kölner Phonetik value as a numeric string
    :rtype: str

    >>> koelner_phonetik('Christopher')
    '478237'
    >>> koelner_phonetik('Niall')
    '65'
    >>> koelner_phonetik('Smith')
    '862'
    >>> koelner_phonetik('Schmidt')
    '862'
    >>> koelner_phonetik('Müller')
    '657'
    >>> koelner_phonetik('Zimmermann')
    '86766'
    """
    # pylint: disable=too-many-branches
    def _after(word, i, letters):
        """Return True if word[i] follows one of the supplied letters."""
        if i > 0 and word[i-1] in letters:
            return True
        return False

    def _before(word, i, letters):
        """Return True if word[i] precedes one of the supplied letters."""
        if i+1 < len(word) and word[i+1] in letters:
            return True
        return False

    _vowels = {'A', 'E', 'I', 'J', 'O', 'U', 'Y'}

    sdx = ''

    word = normalize('NFKD', text_type(word.upper()))
    word = word.replace('ß', 'SS')

    word = word.replace('Ä', 'AE')
    word = word.replace('Ö', 'OE')
    word = word.replace('Ü', 'UE')
    word = ''.join(c for c in word if c in
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
                    'Y', 'Z'})

    # Nothing to convert, return base case
    if not word:
        return sdx

    # Review note (unused-code): Consider using enumerate instead of iterating
    # with range and len
    for i in range(len(word)):
        # Review note (Duplication): This code seems to be duplicated in your
        # project.
        if word[i] in _vowels:
            sdx += '0'
        elif word[i] == 'B':
            sdx += '1'
        elif word[i] == 'P':
            if _before(word, i, {'H'}):
                sdx += '3'
            else:
                sdx += '1'
        elif word[i] in {'D', 'T'}:
            if _before(word, i, {'C', 'S', 'Z'}):
                sdx += '8'
            else:
                sdx += '2'
        elif word[i] in {'F', 'V', 'W'}:
            sdx += '3'
        elif word[i] in {'G', 'K', 'Q'}:
            sdx += '4'
        elif word[i] == 'C':
            if _after(word, i, {'S', 'Z'}):
                sdx += '8'
            elif i == 0:
                if _before(word, i, {'A', 'H', 'K', 'L', 'O', 'Q', 'R', 'U',
                                     'X'}):
                    sdx += '4'
                else:
                    sdx += '8'
            elif _before(word, i, {'A', 'H', 'K', 'O', 'Q', 'U', 'X'}):
                sdx += '4'
            else:
                sdx += '8'
        elif word[i] == 'X':
            if _after(word, i, {'C', 'K', 'Q'}):
                sdx += '8'
            else:
                sdx += '48'
        elif word[i] == 'L':
            sdx += '5'
        elif word[i] in {'M', 'N'}:
            sdx += '6'
        elif word[i] == 'R':
            sdx += '7'
        elif word[i] in {'S', 'Z'}:
            sdx += '8'

    sdx = _delete_consecutive_repeats(sdx)

    if sdx:
        sdx = sdx[:1] + sdx[1:].replace('0', '')

    return sdx


def koelner_phonetik_num_to_alpha(num):
    """Convert a Kölner Phonetik code from numeric to alphabetic.

    :param str num: a numeric Kölner Phonetik representation
    :returns: an alphabetic representation of the same word
    :rtype: str

    >>> koelner_phonetik_num_to_alpha(862)
    'SNT'
    >>> koelner_phonetik_num_to_alpha(657)
    'NLR'
    >>> koelner_phonetik_num_to_alpha(86766)
    'SNRNN'
    """
    # Review note (Comprehensibility, Best Practice): The variable _ does not
    # seem to be defined.
    _koelner_num_translation = dict(zip((ord(_) for _ in '012345678'),
                                        'APTFKLNRS'))
    num = ''.join(c for c in text_type(num) if c in {'0', '1', '2', '3', '4',
                                                     '5', '6', '7', '8'})
    return num.translate(_koelner_num_translation)


def koelner_phonetik_alpha(word):
    """Return the Kölner Phonetik (alphabetic output) code for a word.

    :param str word: the word to transform
    :returns: the Kölner Phonetik value as an alphabetic string
    :rtype: str

    >>> koelner_phonetik_alpha('Smith')
    'SNT'
    >>> koelner_phonetik_alpha('Schmidt')
    'SNT'
    >>> koelner_phonetik_alpha('Müller')
    'NLR'
    >>> koelner_phonetik_alpha('Zimmermann')
    'SNRNN'
    """
    return koelner_phonetik_num_to_alpha(koelner_phonetik(word))


684
def nysiis(word, maxlength=6, modified=False):
685
    """Return the NYSIIS code for a word.
686
687
    The New York State Identification and Intelligence System algorithm is
688
    defined in :cite:`Taft:1970`.
689
690
    The modified version of this algorithm is described in Appendix B of
691
    :cite:`Lynch:1977`.
692
693
    :param str word: the word to transform
694
    :param int maxlength: the maximum length (default 6) of the code to return
695
    :param bool modified: indicates whether to use USDA modified NYSIIS
696
    :returns: the NYSIIS value
697
    :rtype: str
698
699
    >>> nysiis('Christopher')
700
    'CRASTA'
701
    >>> nysiis('Niall')
702
    'NAL'
703
    >>> nysiis('Smith')
704
    'SNAT'
705
    >>> nysiis('Schmidt')
706
    'SNAD'
707
708
    >>> nysiis('Christopher', maxlength=_INFINITY)
709
    'CRASTAFAR'
710
711
    >>> nysiis('Christopher', maxlength=8, modified=True)
712
    'CRASTAFA'
713
    >>> nysiis('Niall', maxlength=8, modified=True)
714
    'NAL'
715
    >>> nysiis('Smith', maxlength=8, modified=True)
716
    'SNAT'
717
    >>> nysiis('Schmidt', maxlength=8, modified=True)
718
    'SNAD'
719
    """
720
    # Require a maxlength of at least 6
721
    if maxlength:
722
        maxlength = max(6, maxlength)
723
724
    _vowels = {'A', 'E', 'I', 'O', 'U'}
725
726
    word = ''.join(c for c in word.upper() if c.isalpha())
727
    word = word.replace('ß', 'SS')
728
729
    # exit early if there are no alphas
730
    if not word:
731
        return ''
732
733
    if modified:
734
        original_first_char = word[0]
735
736
    if word[:3] == 'MAC':
737
        word = 'MCC'+word[3:]
738
    elif word[:2] == 'KN':
739
        word = 'NN'+word[2:]
740
    elif word[:1] == 'K':
741
        word = 'C'+word[1:]
742
    elif word[:2] in {'PH', 'PF'}:
743
        word = 'FF'+word[2:]
744
    elif word[:3] == 'SCH':
745
        word = 'SSS'+word[3:]
746
    elif modified:
747
        if word[:2] == 'WR':
748
            word = 'RR'+word[2:]
749
        elif word[:2] == 'RH':
750
            word = 'RR'+word[2:]
751
        elif word[:2] == 'DG':
752
            word = 'GG'+word[2:]
753
        elif word[:1] in _vowels:
754
            word = 'A'+word[1:]
755
756
    if modified and word[-1:] in {'S', 'Z'}:
757
        word = word[:-1]
758
759
    if word[-2:] == 'EE' or word[-2:] == 'IE' or (modified and
760
                                                  word[-2:] == 'YE'):
761
        word = word[:-2]+'Y'
762
    elif word[-2:] in {'DT', 'RT', 'RD'}:
763
        word = word[:-2]+'D'
764
    elif word[-2:] in {'NT', 'ND'}:
765
        word = word[:-2]+('N' if modified else 'D')
766
    elif modified:
767
        if word[-2:] == 'IX':
768
            word = word[:-2]+'ICK'
769
        elif word[-2:] == 'EX':
770
            word = word[:-2]+'ECK'
771
        elif word[-2:] in {'JR', 'SR'}:
772
            return 'ERROR'  # TODO: decide how best to return an error
0 ignored issues
show
Coding Style introduced by
TODO and FIXME comments should generally be avoided.
Loading history...
773
    key = word[:1]

    skip = 0
    for i in range(1, len(word)):
        if i >= len(word):
            continue
        elif skip:
            skip -= 1
            continue
        elif word[i:i+2] == 'EV':
            word = word[:i] + 'AF' + word[i+2:]
            skip = 1
        elif word[i] in _vowels:
            word = word[:i] + 'A' + word[i+1:]
        elif modified and i != len(word)-1 and word[i] == 'Y':
            word = word[:i] + 'A' + word[i+1:]
        elif word[i] == 'Q':
            word = word[:i] + 'G' + word[i+1:]
        elif word[i] == 'Z':
            word = word[:i] + 'S' + word[i+1:]
        elif word[i] == 'M':
            word = word[:i] + 'N' + word[i+1:]
        elif word[i:i+2] == 'KN':
            word = word[:i] + 'N' + word[i+2:]
        elif word[i] == 'K':
            word = word[:i] + 'C' + word[i+1:]
        elif modified and i == len(word)-3 and word[i:i+3] == 'SCH':
            word = word[:i] + 'SSA'
            skip = 2
        elif word[i:i+3] == 'SCH':
            word = word[:i] + 'SSS' + word[i+3:]
            skip = 2
        elif modified and i == len(word)-2 and word[i:i+2] == 'SH':
            word = word[:i] + 'SA'
            skip = 1
        elif word[i:i+2] == 'SH':
            word = word[:i] + 'SS' + word[i+2:]
            skip = 1
        elif word[i:i+2] == 'PH':
            word = word[:i] + 'FF' + word[i+2:]
            skip = 1
        elif modified and word[i:i+3] == 'GHT':
            word = word[:i] + 'TTT' + word[i+3:]
            skip = 2
        elif modified and word[i:i+2] == 'DG':
            word = word[:i] + 'GG' + word[i+2:]
            skip = 1
        elif modified and word[i:i+2] == 'WR':
            word = word[:i] + 'RR' + word[i+2:]
            skip = 1
        elif word[i] == 'H' and (word[i-1] not in _vowels or
                                 word[i+1:i+2] not in _vowels):
            word = word[:i] + word[i-1] + word[i+1:]
        elif word[i] == 'W' and word[i-1] in _vowels:
            word = word[:i] + word[i-1] + word[i+1:]

        if word[i:i+skip+1] != key[-1:]:
            key += word[i:i+skip+1]

    key = _delete_consecutive_repeats(key)

    if key[-1:] == 'S':
        key = key[:-1]
    if key[-2:] == 'AY':
        key = key[:-2] + 'Y'
    if key[-1:] == 'A':
        key = key[:-1]
    if modified and key[:1] == 'A':
        key = original_first_char + key[1:]

    if maxlength and maxlength < _INFINITY:
        key = key[:maxlength]

    return key
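
The leading-prefix substitutions that alpha_sis() applies before its main loop behave like an ordered lookup, so they could be written as a table rather than an if/elif chain. The following is a minimal sketch covering only the unmodified rules; the names `_PREFIX_RULES` and `replace_prefix` are hypothetical and are not part of Abydos.

```python
# Ordered (prefix, replacement) pairs; order matters because 'KN' must be
# tested before the shorter 'K', mirroring the original elif chain.
_PREFIX_RULES = (
    ('MAC', 'MCC'),
    ('KN', 'NN'),
    ('K', 'C'),
    ('PH', 'FF'),
    ('PF', 'FF'),
    ('SCH', 'SSS'),
)


def replace_prefix(word, rules=_PREFIX_RULES):
    """Return word with the first matching prefix rule applied, if any."""
    for prefix, replacement in rules:
        if word.startswith(prefix):
            return replacement + word[len(prefix):]
    return word
```

Each replacement has the same length as the prefix it replaces, so the substitution leaves the word length unchanged, as in the original code.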


def mra(word):
    """Return the MRA personal numeric identifier (PNI) for a word.

    A description of the Western Airlines Surname Match Rating Algorithm can
    be found on page 18 of :cite:`Moore:1977`.

    :param str word: the word to transform
    :returns: the MRA PNI
    :rtype: str

    >>> mra('Christopher')
    'CHRPHR'
    >>> mra('Niall')
    'NL'
    >>> mra('Smith')
    'SMTH'
    >>> mra('Schmidt')
    'SCHMDT'
    """
    if not word:
        return word
    word = word.upper()
    word = word.replace('ß', 'SS')
    word = word[0]+''.join(c for c in word[1:] if
                           c not in {'A', 'E', 'I', 'O', 'U'})
    word = _delete_consecutive_repeats(word)
    if len(word) > 6:
        word = word[:3]+word[-3:]
    return word
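
To try the MRA encoding outside the module, a self-contained sketch is given below. It assumes that `_delete_consecutive_repeats`, which is defined elsewhere in the module and not shown here, collapses runs of identical adjacent characters. The stand-in `collapse_repeats` and the name `mra_sketch` are hypothetical.

```python
from itertools import groupby


def collapse_repeats(word):
    """Collapse runs of identical adjacent characters (assumed behavior)."""
    return ''.join(char for char, _ in groupby(word))


def mra_sketch(word):
    """Return an MRA-style code following the steps of mra() above."""
    if not word:
        return word
    word = word.upper().replace('ß', 'SS')
    # Keep the first letter, then drop every later vowel.
    word = word[0] + ''.join(c for c in word[1:] if c not in 'AEIOU')
    word = collapse_repeats(word)
    # Codes longer than six characters keep the first three and last three.
    if len(word) > 6:
        word = word[:3] + word[-3:]
    return word
```

Under that assumption, the sketch gives the same values as the docstring examples for mra().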


def metaphone(word, maxlength=_INFINITY):
    """Return the Metaphone code for a word.

    Based on Lawrence Philips' Pick BASIC code from 1990 :cite:`Philips:1990`,
    as described in :cite:`Philips:1990b`.
    This incorporates some corrections to the above code, particularly
    some of those suggested by Michael Kuhn in :cite:`Kuhn:1995`.

    :param str word: the word to transform
    :param int maxlength: the maximum length of the returned Metaphone code
        (defaults to unlimited, but in Philips' original implementation
        this was 4)
    :returns: the Metaphone value
    :rtype: str

    >>> metaphone('Christopher')
    'KRSTFR'
    >>> metaphone('Niall')
    'NL'
    >>> metaphone('Smith')
    'SM0'
    >>> metaphone('Schmidt')
    'SKMTT'
    """
    # pylint: disable=too-many-branches
    _vowels = {'A', 'E', 'I', 'O', 'U'}
    _frontv = {'E', 'I', 'Y'}
    # As in variable sound--those modified by adding an "h"
    _varson = {'C', 'G', 'P', 'S', 'T'}

    # Require a maxlength of at least 4
    if maxlength is not None:
        maxlength = max(4, maxlength)
    else:
        maxlength = 64

    # Delete nonalphanumeric characters and make all caps
    ename = ''.join(c for c in word.upper() if c.isalnum())
    ename = ename.replace('ß', 'SS')

    if not ename:
        return ''
    if ename[0:2] in {'PN', 'AE', 'KN', 'GN', 'WR'}:
        ename = ename[1:]
    elif ename[0] == 'X':
        ename = 'S' + ename[1:]
    elif ename[0:2] == 'WH':
        ename = 'W' + ename[2:]

    # Convert to metaph
    elen = len(ename)-1
    metaph = ''
    for i in range(len(ename)):
        if len(metaph) >= maxlength:
            break
        if ((ename[i] not in {'G', 'T'} and
             i > 0 and ename[i-1] == ename[i])):
            continue

        if ename[i] in _vowels and i == 0:
            metaph = ename[i]

        elif ename[i] == 'B':
            if i != elen or ename[i-1] != 'M':
                metaph += ename[i]

        elif ename[i] == 'C':
            if not (i > 0 and ename[i-1] == 'S' and ename[i+1:i+2] in _frontv):
                if ename[i+1:i+3] == 'IA':
                    metaph += 'X'
                elif ename[i+1:i+2] in _frontv:
                    metaph += 'S'
                elif i > 0 and ename[i-1:i+2] == 'SCH':
                    metaph += 'K'
                elif ename[i+1:i+2] == 'H':
                    if i == 0 and i+1 < elen and ename[i+2:i+3] not in _vowels:
                        metaph += 'K'
                    else:
                        metaph += 'X'
                else:
                    metaph += 'K'

        elif ename[i] == 'D':
            if ename[i+1:i+2] == 'G' and ename[i+2:i+3] in _frontv:
                metaph += 'J'
            else:
                metaph += 'T'

        elif ename[i] == 'G':
            if ename[i+1:i+2] == 'H' and not (i+1 == elen or
                                              ename[i+2:i+3] not in _vowels):
                continue
            elif i > 0 and ((i+1 == elen and ename[i+1] == 'N') or
                            (i+3 == elen and ename[i+1:i+4] == 'NED')):
                continue
            elif (i-1 > 0 and i+1 <= elen and ename[i-1] == 'D' and
                  ename[i+1] in _frontv):
                continue
            elif ename[i+1:i+2] == 'G':
                continue
            elif ename[i+1:i+2] in _frontv:
                if i == 0 or ename[i-1] != 'G':
                    metaph += 'J'
                else:
                    metaph += 'K'
            else:
                metaph += 'K'

        elif ename[i] == 'H':
            if ((i > 0 and ename[i-1] in _vowels and
                 ename[i+1:i+2] not in _vowels)):
                continue
            elif i > 0 and ename[i-1] in _varson:
                continue
            else:
                metaph += 'H'

        elif ename[i] in {'F', 'J', 'L', 'M', 'N', 'R'}:
            metaph += ename[i]

        elif ename[i] == 'K':
            if i > 0 and ename[i-1] == 'C':
                continue
            else:
                metaph += 'K'

        elif ename[i] == 'P':
            if ename[i+1:i+2] == 'H':
                metaph += 'F'
            else:
                metaph += 'P'

        elif ename[i] == 'Q':
            metaph += 'K'

        elif ename[i] == 'S':
            if ((i > 0 and i+2 <= elen and ename[i+1] == 'I' and
                 ename[i+2] in 'OA')):
                metaph += 'X'
            elif ename[i+1:i+2] == 'H':
                metaph += 'X'
            else:
                metaph += 'S'

        elif ename[i] == 'T':
            if ((i > 0 and i+2 <= elen and ename[i+1] == 'I' and
                 ename[i+2] in {'A', 'O'})):
                metaph += 'X'
            elif ename[i+1:i+2] == 'H':
                metaph += '0'
            elif ename[i+1:i+3] != 'CH':
                if ename[i-1:i] != 'T':
                    metaph += 'T'

        elif ename[i] == 'V':
            metaph += 'F'

        elif ename[i] in 'WY':
            if ename[i+1:i+2] in _vowels:
                metaph += ename[i]

        elif ename[i] == 'X':
            metaph += 'KS'

        elif ename[i] == 'Z':
            metaph += 'S'

    return metaph
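
The word-initial exceptions that metaphone() handles before its main loop can be isolated into a small helper, which makes them easy to check on their own. This is a sketch that assumes the input has already been upper-cased and stripped of non-alphanumeric characters; `strip_initial` is a hypothetical name, not an Abydos function.

```python
def strip_initial(ename):
    """Apply Metaphone's word-initial exceptions to an upper-case string."""
    # PN-, AE-, KN-, GN- and WR- drop their first character.
    if ename[:2] in {'PN', 'AE', 'KN', 'GN', 'WR'}:
        return ename[1:]
    # Initial X is rewritten as S.
    if ename[:1] == 'X':
        return 'S' + ename[1:]
    # Initial WH is reduced to W.
    if ename[:2] == 'WH':
        return 'W' + ename[2:]
    return ename
```

Slicing with `ename[:1]` instead of indexing with `ename[0]` lets the helper accept an empty string, which metaphone() itself rules out earlier with its `if not ename` check.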


def double_metaphone(word, maxlength=_INFINITY):
    """Return the Double Metaphone code for a word.

    Based on Lawrence Philips' (Visual) C++ code from 1999
    :cite:`Philips:2000`.

    :param word: the word to transform
    :param maxlength: the maximum length of the returned Double Metaphone codes
        (defaults to unlimited, but in Philips' original implementation this
        was 4)
    :returns: the Double Metaphone value(s)
    :rtype: tuple

    >>> double_metaphone('Christopher')
    ('KRSTFR', '')
    >>> double_metaphone('Niall')
    ('NL', '')
    >>> double_metaphone('Smith')
    ('SM0', 'XMT')
    >>> double_metaphone('Schmidt')
    ('XMT', 'SMT')
    """
    # pylint: disable=too-many-branches
    # Require a maxlength of at least 4
    if maxlength is not None:
        maxlength = max(4, maxlength)
    else:
        maxlength = 64

    primary = ''
    secondary = ''

    def _slavo_germanic():
        """Return True if the word appears to be Slavic or Germanic."""
        if 'W' in word or 'K' in word or 'CZ' in word:
            return True
        return False

    def _metaph_add(pri, sec=''):
        """Return a new metaphone tuple with the supplied elements."""
        newpri = primary
        newsec = secondary
        if pri:
            newpri += pri
        if sec:
            if sec != ' ':
                newsec += sec
        else:
            newsec += pri
        return (newpri, newsec)

    def _is_vowel(pos):
        """Return True if the character at word[pos] is a vowel."""
        if pos >= 0 and word[pos] in {'A', 'E', 'I', 'O', 'U', 'Y'}:
            return True
        return False

    def _get_at(pos):
        """Return the character at word[pos]."""
        return word[pos]

    def _string_at(pos, slen, substrings):
        """Return True if word[pos:pos+slen] is in substrings."""
        if pos < 0:
            return False
        return word[pos:pos+slen] in substrings

    current = 0
    length = len(word)
    if length < 1:
        return ('', '')
    last = length - 1

    word = word.upper()
    word = word.replace('ß', 'SS')

    # Pad the original string so that we can index beyond the edge of the world
    word += '     '

    # Skip these when at start of word
    if word[0:2] in {'GN', 'KN', 'PN', 'WR', 'PS'}:
        current += 1

    # Initial 'X' is pronounced 'Z' e.g. 'Xavier'
    if _get_at(0) == 'X':
        (primary, secondary) = _metaph_add('S')  # 'Z' maps to 'S'
        current += 1

    # Main loop
    while True:
        if current >= length:
            break

        if _get_at(current) in {'A', 'E', 'I', 'O', 'U', 'Y'}:
            if current == 0:
                # All init vowels now map to 'A'
                (primary, secondary) = _metaph_add('A')
            current += 1
            continue

        elif _get_at(current) == 'B':
            # "-mb", e.g., "dumb", already skipped over...
            (primary, secondary) = _metaph_add('P')
            if _get_at(current + 1) == 'B':
                current += 2
            else:
                current += 1
            continue

        elif _get_at(current) == 'Ç':
            (primary, secondary) = _metaph_add('S')
            current += 1
            continue

        elif _get_at(current) == 'C':
            # Various Germanic
            if (current > 1 and not _is_vowel(current - 2) and
                    _string_at((current - 1), 3, {'ACH'}) and
                    ((_get_at(current + 2) != 'I') and
                     ((_get_at(current + 2) != 'E') or
                      _string_at((current - 2), 6,
                                 {'BACHER', 'MACHER'})))):
                (primary, secondary) = _metaph_add('K')
                current += 2
                continue

            # Special case 'caesar'
            elif current == 0 and _string_at(current, 6, {'CAESAR'}):
                (primary, secondary) = _metaph_add('S')
                current += 2
                continue

            # Italian 'chianti'
            elif _string_at(current, 4, {'CHIA'}):
                (primary, secondary) = _metaph_add('K')
                current += 2
                continue

            elif _string_at(current, 2, {'CH'}):
                # Find 'Michael'
                if current > 0 and _string_at(current, 4, {'CHAE'}):
                    (primary, secondary) = _metaph_add('K', 'X')
                    current += 2
                    continue

                # Greek roots e.g. 'chemistry', 'chorus'
                elif (current == 0 and
                      (_string_at((current + 1), 5,
                                  {'HARAC', 'HARIS'}) or
                       _string_at((current + 1), 3,
                                  {'HOR', 'HYM', 'HIA', 'HEM'})) and
                      not _string_at(0, 5, {'CHORE'})):
                    (primary, secondary) = _metaph_add('K')
                    current += 2
                    continue

                # Germanic, Greek, or otherwise 'ch' for 'kh' sound
                elif ((_string_at(0, 4, {'VAN ', 'VON '}) or
                       _string_at(0, 3, {'SCH'})) or
                      # 'architect', 'orchestra', 'orchid', but not 'arch'
                      _string_at((current - 2), 6,
                                 {'ORCHES', 'ARCHIT', 'ORCHID'}) or
                      _string_at((current + 2), 1, {'T', 'S'}) or
                      ((_string_at((current - 1), 1,
                                   {'A', 'O', 'U', 'E'}) or
                        (current == 0)) and
                       # e.g., 'wachtler', 'wechsler', but not 'tichner'
                       _string_at((current + 2), 1,
                                  {'L', 'R', 'N', 'M', 'B', 'H', 'F', 'V', 'W',
                                   ' '}))):
                    (primary, secondary) = _metaph_add('K')

                else:
                    if current > 0:
                        if _string_at(0, 2, {'MC'}):
                            # e.g., "McHugh"
                            (primary, secondary) = _metaph_add('K')
                        else:
                            (primary, secondary) = _metaph_add('X', 'K')
                    else:
                        (primary, secondary) = _metaph_add('X')

                current += 2
                continue

            # e.g., 'czerny'
            elif (_string_at(current, 2, {'CZ'}) and
                  not _string_at((current - 2), 4, {'WICZ'})):
                (primary, secondary) = _metaph_add('S', 'X')
                current += 2
                continue

            # e.g., 'focaccia'
            elif _string_at((current + 1), 3, {'CIA'}):
                (primary, secondary) = _metaph_add('X')
                current += 3

            # double 'C', but not if e.g. 'McClellan'
            elif (_string_at(current, 2, {'CC'}) and
                  not ((current == 1) and (_get_at(0) == 'M'))):
                # 'bellocchio' but not 'bacchus'
                if ((_string_at((current + 2), 1,
                                {'I', 'E', 'H'}) and
                     not _string_at((current + 2), 2, {'HU'}))):
                    # 'accident', 'accede', 'succeed'
                    if ((((current == 1) and _get_at(current - 1) == 'A') or
                         _string_at((current - 1), 5,
                                    {'UCCEE', 'UCCES'}))):
                        (primary, secondary) = _metaph_add('KS')
                    # 'bacci', 'bertucci', other Italian
                    else:
                        (primary, secondary) = _metaph_add('X')
                    current += 3
                    continue
                else:  # Pierce's rule
                    (primary, secondary) = _metaph_add('K')
                    current += 2
                    continue

            elif _string_at(current, 2, {'CK', 'CG', 'CQ'}):
                (primary, secondary) = _metaph_add('K')
                current += 2
                continue

            elif _string_at(current, 2, {'CI', 'CE', 'CY'}):
                # Italian vs. English
                if _string_at(current, 3, {'CIO', 'CIE', 'CIA'}):
                    (primary, secondary) = _metaph_add('S', 'X')
                else:
                    (primary, secondary) = _metaph_add('S')
                current += 2
                continue

            # else
            else:
                (primary, secondary) = _metaph_add('K')

                # name sent in 'mac caffrey', 'mac gregor'
                if _string_at((current + 1), 2, {' C', ' Q', ' G'}):
                    current += 3
                elif (_string_at((current + 1), 1,
                                 {'C', 'K', 'Q'}) and
                      not _string_at((current + 1), 2, {'CE', 'CI'})):
                    current += 2
                else:
                    current += 1
                continue

        elif _get_at(current) == 'D':
            if _string_at(current, 2, {'DG'}):
                if _string_at((current + 2), 1, {'I', 'E', 'Y'}):
                    # e.g. 'edge'
                    (primary, secondary) = _metaph_add('J')
                    current += 3
                    continue
                else:
                    # e.g. 'edgar'
                    (primary, secondary) = _metaph_add('TK')
                    current += 2
                    continue

            elif _string_at(current, 2, {'DT', 'DD'}):
                (primary, secondary) = _metaph_add('T')
                current += 2
                continue

            # else
            else:
                (primary, secondary) = _metaph_add('T')
                current += 1
                continue

        elif _get_at(current) == 'F':
            if _get_at(current + 1) == 'F':
                current += 2
            else:
                current += 1
            (primary, secondary) = _metaph_add('F')
            continue

        elif _get_at(current) == 'G':
            if _get_at(current + 1) == 'H':
                if (current > 0) and not _is_vowel(current - 1):
                    (primary, secondary) = _metaph_add('K')
                    current += 2
                    continue

                # 'ghislane', 'ghiradelli'
                elif current == 0:
                    if _get_at(current + 2) == 'I':
                        (primary, secondary) = _metaph_add('J')
                    else:
                        (primary, secondary) = _metaph_add('K')
                    current += 2
                    continue

                # Parker's rule (with some further refinements) - e.g., 'hugh'
                elif (((current > 1) and
                       _string_at((current - 2), 1, {'B', 'H', 'D'})) or
                      # e.g., 'bough'
                      ((current > 2) and
                       _string_at((current - 3), 1, {'B', 'H', 'D'})) or
                      # e.g., 'broughton'
                      ((current > 3) and
                       _string_at((current - 4), 1, {'B', 'H'}))):
                    current += 2
                    continue
                else:
                    # e.g. 'laugh', 'McLaughlin', 'cough',
                    #      'gough', 'rough', 'tough'
                    if ((current > 2) and
                            (_get_at(current - 1) == 'U') and
                            (_string_at((current - 3), 1,
                                        {'C', 'G', 'L', 'R', 'T'}))):
                        (primary, secondary) = _metaph_add('F')
                    elif (current > 0) and _get_at(current - 1) != 'I':
                        (primary, secondary) = _metaph_add('K')
                    current += 2
                    continue

            elif _get_at(current + 1) == 'N':
                if (current == 1) and _is_vowel(0) and not _slavo_germanic():
                    (primary, secondary) = _metaph_add('KN', 'N')
                # not e.g. 'cagney'
                elif (not _string_at((current + 2), 2, {'EY'}) and
                      (_get_at(current + 1) != 'Y') and
                      not _slavo_germanic()):
                    (primary, secondary) = _metaph_add('N', 'KN')
                else:
                    (primary, secondary) = _metaph_add('KN')
                current += 2
                continue

            # 'tagliaro'
            elif (_string_at((current + 1), 2, {'LI'}) and
                  not _slavo_germanic()):
                (primary, secondary) = _metaph_add('KL', 'L')
                current += 2
                continue

            # -ges-, -gep-, -gel-, -gie- at beginning
            elif ((current == 0) and
                  ((_get_at(current + 1) == 'Y') or
                   _string_at((current + 1), 2, {'ES', 'EP', 'EB', 'EL', 'EY',
                                                 'IB', 'IL', 'IN', 'IE', 'EI',
                                                 'ER'}))):
                (primary, secondary) = _metaph_add('K', 'J')
                current += 2
                continue

            # -ger-, -gy-
            elif ((_string_at((current + 1), 2, {'ER'}) or
                   (_get_at(current + 1) == 'Y')) and not
                  _string_at(0, 6, {'DANGER', 'RANGER', 'MANGER'}) and not
                  _string_at((current - 1), 1, {'E', 'I'}) and not
                  _string_at((current - 1), 3, {'RGY', 'OGY'})):
                (primary, secondary) = _metaph_add('K', 'J')
                current += 2
                continue

            # Italian, e.g., 'biaggi'
            elif (_string_at((current + 1), 1, {'E', 'I', 'Y'}) or
                  _string_at((current - 1), 4, {'AGGI', 'OGGI'})):
                # obvious Germanic
                if (((_string_at(0, 4, {'VAN ', 'VON '}) or
                      _string_at(0, 3, {'SCH'})) or
                     _string_at((current + 1), 2, {'ET'}))):
                    (primary, secondary) = _metaph_add('K')
                elif _string_at((current + 1), 4, {'IER '}):
                    (primary, secondary) = _metaph_add('J')
                else:
                    (primary, secondary) = _metaph_add('J', 'K')
                current += 2
                continue

            else:
                if _get_at(current + 1) == 'G':
                    current += 2
                else:
                    current += 1
                (primary, secondary) = _metaph_add('K')
                continue

        elif _get_at(current) == 'H':
            # only keep if first and before a vowel, or between two vowels
            if ((((current == 0) or _is_vowel(current - 1)) and
                 _is_vowel(current + 1))):
                (primary, secondary) = _metaph_add('H')
                current += 2
            else:  # also takes care of 'HH'
                current += 1
            continue

        elif _get_at(current) == 'J':
            # obvious Spanish, 'jose', 'san jacinto'
            if _string_at(current, 4, {'JOSE'}) or _string_at(0, 4, {'SAN '}):
                if ((((current == 0) and (_get_at(current + 4) == ' ')) or
                     _string_at(0, 4, {'SAN '}))):
                    (primary, secondary) = _metaph_add('H')
                else:
                    (primary, secondary) = _metaph_add('J', 'H')
                current += 1
                continue

            elif (current == 0) and not _string_at(current, 4, {'JOSE'}):
                # Yankelovich/Jankelowicz
                (primary, secondary) = _metaph_add('J', 'A')
            # Spanish pron. of e.g. 'bajador'
            elif (_is_vowel(current - 1) and
                  not _slavo_germanic() and
                  ((_get_at(current + 1) == 'A') or
                   (_get_at(current + 1) == 'O'))):
                (primary, secondary) = _metaph_add('J', 'H')
            elif current == last:
                (primary, secondary) = _metaph_add('J', ' ')
            elif (not _string_at((current + 1), 1,
                                 {'L', 'T', 'K', 'S', 'N', 'M', 'B', 'Z'}) and
                  not _string_at((current - 1), 1, {'S', 'K', 'L'})):
                (primary, secondary) = _metaph_add('J')

            if _get_at(current + 1) == 'J':  # it could happen!
                current += 2
            else:
                current += 1
            continue

        elif _get_at(current) == 'K':
            if _get_at(current + 1) == 'K':
                current += 2
            else:
                current += 1
            (primary, secondary) = _metaph_add('K')
            continue

        elif _get_at(current) == 'L':
            if _get_at(current + 1) == 'L':
                # Spanish e.g. 'cabrillo', 'gallegos'
                if (((current == (length - 3)) and
                     _string_at((current - 1), 4, {'ILLO', 'ILLA', 'ALLE'})) or
                        ((_string_at((last - 1), 2, {'AS', 'OS'}) or
                          _string_at(last, 1, {'A', 'O'})) and
                         _string_at((current - 1), 4, {'ALLE'}))):
                    (primary, secondary) = _metaph_add('L', ' ')
                    current += 2
                    continue
                current += 2
            else:
                current += 1
            (primary, secondary) = _metaph_add('L')
            continue

        elif _get_at(current) == 'M':
            if (((_string_at((current - 1), 3, {'UMB'}) and
                  (((current + 1) == last) or
                   _string_at((current + 2), 2, {'ER'}))) or
                 # 'dumb', 'thumb'
                 (_get_at(current + 1) == 'M'))):
                current += 2
            else:
                current += 1
            (primary, secondary) = _metaph_add('M')
            continue

        elif _get_at(current) == 'N':
            if _get_at(current + 1) == 'N':
                current += 2
            else:
                current += 1
            (primary, secondary) = _metaph_add('N')
            continue

        elif _get_at(current) == 'Ñ':
            current += 1
            (primary, secondary) = _metaph_add('N')
            continue

        elif _get_at(current) == 'P':
            if _get_at(current + 1) == 'H':
                (primary, secondary) = _metaph_add('F')
                current += 2
                continue

            # also account for "campbell", "raspberry"
            elif _string_at((current + 1), 1, {'P', 'B'}):
                current += 2
            else:
                current += 1
            (primary, secondary) = _metaph_add('P')
            continue

        elif _get_at(current) == 'Q':
            if _get_at(current + 1) == 'Q':
                current += 2
            else:
                current += 1
            (primary, secondary) = _metaph_add('K')
            continue

        elif _get_at(current) == 'R':
            # French e.g. 'rogier', but exclude 'hochmeier'
            if (((current == last) and
                 not _slavo_germanic() and
                 _string_at((current - 2), 2, {'IE'}) and
                 not _string_at((current - 4), 2, {'ME', 'MA'}))):
                (primary, secondary) = _metaph_add('', 'R')
            else:
                (primary, secondary) = _metaph_add('R')

            if _get_at(current + 1) == 'R':
                current += 2
            else:
                current += 1
            continue

1566
        elif _get_at(current) == 'S':
1567
            # special cases 'island', 'isle', 'carlisle', 'carlysle'
1568
            if _string_at((current - 1), 3, {'ISL', 'YSL'}):
1569
                current += 1
1570
                continue
1571
1572
            # special case 'sugar-'
1573
            elif (current == 0) and _string_at(current, 5, {'SUGAR'}):
1574
                (primary, secondary) = _metaph_add('X', 'S')
1575
                current += 1
1576
                continue
1577
1578
            elif _string_at(current, 2, {'SH'}):
1579
                # Germanic
1580
                if _string_at((current + 1), 4,
1581
                              {'HEIM', 'HOEK', 'HOLM', 'HOLZ'}):
1582
                    (primary, secondary) = _metaph_add('S')
1583
                else:
1584
                    (primary, secondary) = _metaph_add('X')
1585
                current += 2
1586
                continue
1587
1588
            # Italian & Armenian
1589
            elif (_string_at(current, 3, {'SIO', 'SIA'}) or
1590
                  _string_at(current, 4, {'SIAN'})):
1591
                if not _slavo_germanic():
1592
                    (primary, secondary) = _metaph_add('S', 'X')
1593
                else:
1594
                    (primary, secondary) = _metaph_add('S')
1595
                current += 3
1596
                continue
1597
1598
            # German & anglicisations, e.g. 'smith' match 'schmidt',
1599
            #                               'snider' match 'schneider'
1600
            # also, -sz- in Slavic language although in Hungarian it is
1601
            #       pronounced 's'
1602
            elif (((current == 0) and
1603
                   _string_at((current + 1), 1, {'M', 'N', 'L', 'W'})) or
1604
                  _string_at((current + 1), 1, {'Z'})):
1605
                (primary, secondary) = _metaph_add('S', 'X')
1606
                if _string_at((current + 1), 1, {'Z'}):
1607
                    current += 2
1608
                else:
1609
                    current += 1
1610
                continue
1611
1612
            elif _string_at(current, 2, {'SC'}):
1613
                # Schlesinger's rule
1614
                if _get_at(current + 2) == 'H':
1615
                    # Dutch origin, e.g. 'school', 'schooner'
1616
                    if _string_at((current + 3), 2,
1617
                                  {'OO', 'ER', 'EN', 'UY', 'ED', 'EM'}):
1618
                        # 'schermerhorn', 'schenker'
1619
                        if _string_at((current + 3), 2, {'ER', 'EN'}):
1620
                            (primary, secondary) = _metaph_add('X', 'SK')
1621
                        else:
1622
                            (primary, secondary) = _metaph_add('SK')
1623
                        current += 3
1624
                        continue
1625
                    else:
1626
                        if (((current == 0) and not _is_vowel(3) and
1627
                             (_get_at(3) != 'W'))):
1628
                            (primary, secondary) = _metaph_add('X', 'S')
1629
                        else:
1630
                            (primary, secondary) = _metaph_add('X')
1631
                        current += 3
1632
                        continue
1633
1634
                elif _string_at((current + 2), 1, {'I', 'E', 'Y'}):
1635
                    (primary, secondary) = _metaph_add('S')
1636
                    current += 3
1637
                    continue
1638
1639
                # else
1640
                else:
1641
                    (primary, secondary) = _metaph_add('SK')
1642
                    current += 3
1643
                    continue
1644
1645
            else:
1646
                # French, e.g. 'resnais', 'artois'
1647
                if (current == last) and _string_at((current - 2), 2,
1648
                                                    {'AI', 'OI'}):
1649
                    (primary, secondary) = _metaph_add('', 'S')
1650
                else:
1651
                    (primary, secondary) = _metaph_add('S')
1652
1653
                if _string_at((current + 1), 1, {'S', 'Z'}):
1654
                    current += 2
1655
                else:
1656
                    current += 1
1657
                continue
1658
1659
        elif _get_at(current) == 'T':
1660
            if _string_at(current, 4, {'TION'}):
1661
                (primary, secondary) = _metaph_add('X')
1662
                current += 3
1663
                continue
1664
1665
            elif _string_at(current, 3, {'TIA', 'TCH'}):
1666
                (primary, secondary) = _metaph_add('X')
1667
                current += 3
1668
                continue
1669
1670
            elif (_string_at(current, 2, {'TH'}) or
1671
                  _string_at(current, 3, {'TTH'})):
1672
                # special case 'thomas', 'thames' or Germanic
1673
                if ((_string_at((current + 2), 2, {'OM', 'AM'}) or
1674
                     _string_at(0, 4, {'VAN ', 'VON '}) or
1675
                     _string_at(0, 3, {'SCH'}))):
1676
                    (primary, secondary) = _metaph_add('T')
1677
                else:
1678
                    (primary, secondary) = _metaph_add('0', 'T')
1679
                current += 2
1680
                continue
1681
1682
            elif _string_at((current + 1), 1, {'T', 'D'}):
1683
                current += 2
1684
            else:
1685
                current += 1
1686
            (primary, secondary) = _metaph_add('T')
1687
            continue
1688
1689
        elif _get_at(current) == 'V':
1690
            if _get_at(current + 1) == 'V':
1691
                current += 2
1692
            else:
1693
                current += 1
1694
            (primary, secondary) = _metaph_add('F')
1695
            continue
1696
1697
        elif _get_at(current) == 'W':
1698
            # can also be in middle of word
1699
            if _string_at(current, 2, {'WR'}):
1700
                (primary, secondary) = _metaph_add('R')
1701
                current += 2
1702
                continue
1703
            elif ((current == 0) and
1704
                  (_is_vowel(current + 1) or _string_at(current, 2, {'WH'}))):
1705
                # Wasserman should match Vasserman
1706
                if _is_vowel(current + 1):
1707
                    (primary, secondary) = _metaph_add('A', 'F')
1708
                else:
1709
                    # need Uomo to match Womo
1710
                    (primary, secondary) = _metaph_add('A')
1711
1712
            # Arnow should match Arnoff
1713
            if ((((current == last) and _is_vowel(current - 1)) or
1714
                 _string_at((current - 1), 5,
1715
                            {'EWSKI', 'EWSKY', 'OWSKI', 'OWSKY'}) or
1716
                 _string_at(0, 3, {'SCH'}))):
1717
                (primary, secondary) = _metaph_add('', 'F')
1718
                current += 1
1719
                continue
1720
            # Polish e.g. 'filipowicz'
1721
            elif _string_at(current, 4, {'WICZ', 'WITZ'}):
1722
                (primary, secondary) = _metaph_add('TS', 'FX')
1723
                current += 4
1724
                continue
1725
            # else skip it
1726
            else:
1727
                current += 1
1728
                continue
1729
1730
        elif _get_at(current) == 'X':
1731
            # French e.g. breaux
1732
            if (not ((current == last) and
1733
                     (_string_at((current - 3), 3, {'IAU', 'EAU'}) or
1734
                      _string_at((current - 2), 2, {'AU', 'OU'})))):
1735
                (primary, secondary) = _metaph_add('KS')
1736
1737
            if _string_at((current + 1), 1, {'C', 'X'}):
1738
                current += 2
1739
            else:
1740
                current += 1
1741
            continue
1742
1743
        elif _get_at(current) == 'Z':
1744
            # Chinese Pinyin e.g. 'zhao'
1745
            if _get_at(current + 1) == 'H':
1746
                (primary, secondary) = _metaph_add('J')
1747
                current += 2
1748
                continue
1749
            elif (_string_at((current + 1), 2, {'ZO', 'ZI', 'ZA'}) or
1750
                  (_slavo_germanic() and ((current > 0) and
1751
                                          _get_at(current - 1) != 'T'))):
1752
                (primary, secondary) = _metaph_add('S', 'TS')
1753
            else:
1754
                (primary, secondary) = _metaph_add('S')
1755
1756
            if _get_at(current + 1) == 'Z':
1757
                current += 2
1758
            else:
1759
                current += 1
1760
            continue
1761
1762
        else:
1763
            current += 1
1764
1765
    if maxlength and maxlength < _INFINITY:
1766
        primary = primary[:maxlength]
1767
        secondary = secondary[:maxlength]
1768
    if primary == secondary:
1769
        secondary = ''
1770
1771
    return (primary, secondary)
1772
1773
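# A minimal sketch of the finalization step that closes the function above:
# both codes are cut to maxlength, and a secondary code identical to the
# primary is blanked. The name finalize_codes is hypothetical, and
# float('inf') stands in for the module's _INFINITY constant, whose
# definition is not shown in this listing.

```python
def finalize_codes(primary, secondary, maxlength=float('inf')):
    """Truncate both codes and drop a secondary that adds no information."""
    # A falsy maxlength (None or 0) or an infinite one means "do not truncate".
    if maxlength and maxlength < float('inf'):
        primary = primary[:maxlength]
        secondary = secondary[:maxlength]
    # Truncation can make the two codes collide, so compare afterwards.
    if primary == secondary:
        secondary = ''
    return primary, secondary
```

# Comparing after truncation matters: two codes that differ only past
# maxlength collapse to a single primary code.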
1774
def caverphone(word, version=2):
1775
    """Return the Caverphone code for a word.
1776
1777
    A description of version 1 of the algorithm can be found in
1778
    :cite:`Hood:2002`.
1779
1780
    A description of version 2 of the algorithm can be found in
1781
    :cite:`Hood:2004`.
1782
1783
    :param str word: the word to transform
1784
    :param int version: the version of Caverphone to employ for encoding
1785
        (defaults to 2)
1786
    :returns: the Caverphone value
1787
    :rtype: str
1788
1789
    >>> caverphone('Christopher')
1790
    'KRSTFA1111'
1791
    >>> caverphone('Niall')
1792
    'NA11111111'
1793
    >>> caverphone('Smith')
1794
    'SMT1111111'
1795
    >>> caverphone('Schmidt')
1796
    'SKMT111111'
1797
1798
    >>> caverphone('Christopher', 1)
1799
    'KRSTF1'
1800
    >>> caverphone('Niall', 1)
1801
    'N11111'
1802
    >>> caverphone('Smith', 1)
1803
    'SMT111'
1804
    >>> caverphone('Schmidt', 1)
1805
    'SKMT11'
1806
    """
1807
    _vowels = {'a', 'e', 'i', 'o', 'u'}
1808
1809
    word = word.lower()
1810
    word = ''.join(c for c in word if c in
1811
                   {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
1812
                    'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x',
1813
                    'y', 'z'})
1814
1815
    def _squeeze_replace(word, char, new_char):
1816
        """Convert strings of char in word to one instance of new_char."""
1817
        while char * 2 in word:
1818
            word = word.replace(char * 2, char)
1819
        return word.replace(char, new_char)
1820
1821
    # the main replacement algorithm
1822
    if version != 1 and word[-1:] == 'e':
1823
        word = word[:-1]
1824
    if word:
1825
        if word[:5] == 'cough':
1826
            word = 'cou2f'+word[5:]
1827
        if word[:5] == 'rough':
1828
            word = 'rou2f'+word[5:]
1829
        if word[:5] == 'tough':
1830
            word = 'tou2f'+word[5:]
1831
        if word[:6] == 'enough':
1832
            word = 'enou2f'+word[6:]
1833
        if version != 1 and word[:6] == 'trough':
1834
            word = 'trou2f'+word[6:]
1835
        if word[:2] == 'gn':
1836
            word = '2n'+word[2:]
1837
        if word[-2:] == 'mb':
1838
            word = word[:-1]+'2'
1839
        word = word.replace('cq', '2q')
1840
        word = word.replace('ci', 'si')
1841
        word = word.replace('ce', 'se')
1842
        word = word.replace('cy', 'sy')
1843
        word = word.replace('tch', '2ch')
1844
        word = word.replace('c', 'k')
1845
        word = word.replace('q', 'k')
1846
        word = word.replace('x', 'k')
1847
        word = word.replace('v', 'f')
1848
        word = word.replace('dg', '2g')
1849
        word = word.replace('tio', 'sio')
1850
        word = word.replace('tia', 'sia')
1851
        word = word.replace('d', 't')
1852
        word = word.replace('ph', 'fh')
1853
        word = word.replace('b', 'p')
1854
        word = word.replace('sh', 's2')
1855
        word = word.replace('z', 's')
1856
        if word[0] in _vowels:
1857
            word = 'A'+word[1:]
1858
        word = word.replace('a', '3')
1859
        word = word.replace('e', '3')
1860
        word = word.replace('i', '3')
1861
        word = word.replace('o', '3')
1862
        word = word.replace('u', '3')
1863
        if version != 1:
1864
            word = word.replace('j', 'y')
1865
            if word[:2] == 'y3':
1866
                word = 'Y3'+word[2:]
1867
            if word[:1] == 'y':
1868
                word = 'A'+word[1:]
1869
            word = word.replace('y', '3')
1870
        word = word.replace('3gh3', '3kh3')
1871
        word = word.replace('gh', '22')
1872
        word = word.replace('g', 'k')
1873
1874
        word = _squeeze_replace(word, 's', 'S')
1875
        word = _squeeze_replace(word, 't', 'T')
1876
        word = _squeeze_replace(word, 'p', 'P')
1877
        word = _squeeze_replace(word, 'k', 'K')
1878
        word = _squeeze_replace(word, 'f', 'F')
1879
        word = _squeeze_replace(word, 'm', 'M')
1880
        word = _squeeze_replace(word, 'n', 'N')
1881
1882
        word = word.replace('w3', 'W3')
1883
        if version == 1:
1884
            word = word.replace('wy', 'Wy')
1885
        word = word.replace('wh3', 'Wh3')
1886
        if version == 1:
1887
            word = word.replace('why', 'Why')
1888
        if version != 1 and word[-1:] == 'w':
1889
            word = word[:-1]+'3'
1890
        word = word.replace('w', '2')
1891
        if word[:1] == 'h':
1892
            word = 'A'+word[1:]
1893
        word = word.replace('h', '2')
1894
        word = word.replace('r3', 'R3')
1895
        if version == 1:
1896
            word = word.replace('ry', 'Ry')
1897
        if version != 1 and word[-1:] == 'r':
1898
            word = word[:-1]+'3'
1899
        word = word.replace('r', '2')
1900
        word = word.replace('l3', 'L3')
1901
        if version == 1:
1902
            word = word.replace('ly', 'Ly')
1903
        if version != 1 and word[-1:] == 'l':
1904
            word = word[:-1]+'3'
1905
        word = word.replace('l', '2')
1906
        if version == 1:
1907
            word = word.replace('j', 'y')
1908
            word = word.replace('y3', 'Y3')
1909
            word = word.replace('y', '2')
1910
        word = word.replace('2', '')
1911
        if version != 1 and word[-1:] == '3':
1912
            word = word[:-1]+'A'
1913
        word = word.replace('3', '')
1914
1915
    # pad with 1s, then extract the necessary length of code
1916
    word = word+'1'*10
1917
    if version != 1:
1918
        word = word[:10]
1919
    else:
1920
        word = word[:6]
1921
1922
    return word
1923
1924
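# A standalone sketch of two small steps used by caverphone() above: the
# run-collapsing helper and the final pad-and-cut. squeeze_replace mirrors the
# nested _squeeze_replace; pad_code is a hypothetical name for the padding
# logic, which the function above performs inline.

```python
def squeeze_replace(word, char, new_char):
    """Collapse each run of char in word to one instance of new_char."""
    # Halve runs repeatedly until no doubled char remains.
    while char * 2 in word:
        word = word.replace(char * 2, char)
    return word.replace(char, new_char)


def pad_code(code, length=10, pad='1'):
    """Right-pad code with pad characters, then cut it to exactly length."""
    return (code + pad * length)[:length]
```

# For instance, squeeze_replace('mississippi', 's', 'S') yields 'miSiSippi',
# and pad_code('KRSTFA') yields 'KRSTFA1111'.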
1925
def alpha_sis(word, maxlength=14):
1926
    """Return the IBM Alpha Search Inquiry System code for a word.
1927
1928
    The Alpha Search Inquiry System code is defined in :cite:`IBM:1973`.
1929
    This implementation is based on the description in :cite:`Moore:1977`.
1930
1931
    A collection is necessary since there can be multiple values for a
1932
    single word. But the collection must be ordered since the first value
1933
    is the primary coding.
1934
1935
    :param str word: the word to transform
1936
    :param int maxlength: the length of the code returned (defaults to 14)
1937
    :returns: the Alpha SIS value
1938
    :rtype: tuple
1939
1940
    >>> alpha_sis('Christopher')
1941
    ('06401840000000', '07040184000000', '04018400000000')
1942
    >>> alpha_sis('Niall')
1943
    ('02500000000000',)
1944
    >>> alpha_sis('Smith')
1945
    ('03100000000000',)
1946
    >>> alpha_sis('Schmidt')
1947
    ('06310000000000',)
1948
    """
1949
    _alpha_sis_initials = {'GF': '08', 'GM': '03', 'GN': '02', 'KN': '02',
1950
                           'PF': '08', 'PN': '02', 'PS': '00', 'WR': '04',
1951
                           'A': '1', 'E': '1', 'H': '2', 'I': '1', 'J': '3',
1952
                           'O': '1', 'U': '1', 'W': '4', 'Y': '5'}
1953
    _alpha_sis_initials_order = ('GF', 'GM', 'GN', 'KN', 'PF', 'PN', 'PS',
1954
                                 'WR', 'A', 'E', 'H', 'I', 'J', 'O', 'U', 'W',
1955
                                 'Y')
1956
    _alpha_sis_basic = {'SCH': '6', 'CZ': ('70', '6', '0'),
1957
                        'CH': ('6', '70', '0'), 'CK': ('7', '6'),
1958
                        'DS': ('0', '10'), 'DZ': ('0', '10'),
1959
                        'TS': ('0', '10'), 'TZ': ('0', '10'), 'CI': '0',
1960
                        'CY': '0', 'CE': '0', 'SH': '6', 'DG': '7', 'PH': '8',
1961
                        'C': ('7', '6'), 'K': ('7', '6'), 'Z': '0', 'S': '0',
1962
                        'D': '1', 'T': '1', 'N': '2', 'M': '3', 'R': '4',
1963
                        'L': '5', 'J': '6', 'G': '7', 'Q': '7', 'X': '7',
1964
                        'F': '8', 'V': '8', 'B': '9', 'P': '9'}
1965
    _alpha_sis_basic_order = ('SCH', 'CZ', 'CH', 'CK', 'DS', 'DZ', 'TS', 'TZ',
1966
                              'CI', 'CY', 'CE', 'SH', 'DG', 'PH', 'C', 'K',
1967
                              'Z', 'S', 'D', 'T', 'N', 'M', 'R', 'L', 'J', 'G',
                              'Q', 'X', 'F', 'V', 'B', 'P')
1969
1970
    alpha = ['']
1971
    pos = 0
1972
    word = normalize('NFKD', text_type(word.upper()))
1973
    word = word.replace('ß', 'SS')
1974
    word = ''.join(c for c in word if c in
1975
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
1976
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
1977
                    'Y', 'Z'})
1978
1979
    # Clamp maxlength to [4, 64]
1980
    if maxlength is not None:
1981
        maxlength = min(max(4, maxlength), 64)
1982
    else:
1983
        maxlength = 64
1984
1985
    # Do special processing for initial substrings
1986
    for k in _alpha_sis_initials_order:
1987
        if word.startswith(k):
1988
            alpha[0] += _alpha_sis_initials[k]
1989
            pos += len(k)
1990
            break
1991
1992
    # Add a '0' if alpha is still empty
1993
    if not alpha[0]:
1994
        alpha[0] += '0'
1995
1996
    # Whether or not any special initial codes were encoded, iterate
1997
    # through the length of the word in the main encoding loop
1998
    while pos < len(word):
1999
        origpos = pos
2000
        for k in _alpha_sis_basic_order:
2001
            if word[pos:].startswith(k):
2002
                if isinstance(_alpha_sis_basic[k], tuple):
2003
                    newalpha = []
2004
                    for i in range(len(_alpha_sis_basic[k])):
2005
                        newalpha += [_ + _alpha_sis_basic[k][i] for _ in alpha]
2006
                    alpha = newalpha
2007
                else:
2008
                    alpha = [_ + _alpha_sis_basic[k] for _ in alpha]
2009
                pos += len(k)
2010
                break
2011
        if pos == origpos:
2012
            alpha = [_ + '_' for _ in alpha]
2013
            pos += 1
2014
2015
    # Trim doublets and placeholders
2016
    for i in range(len(alpha)):
2017
        pos = 1
2018
        while pos < len(alpha[i]):
2019
            if alpha[i][pos] == alpha[i][pos-1]:
2020
                alpha[i] = alpha[i][:pos]+alpha[i][pos+1:]
2021
            pos += 1
2022
    alpha = (_.replace('_', '') for _ in alpha)
    # Issue (Comprehensibility, Best Practice): The variable _ does not seem
    # to be defined.
2023
2024
    # Trim codes and return tuple
2025
    alpha = ((_ + ('0'*maxlength))[:maxlength] for _ in alpha)
2026
    return tuple(alpha)
2027
2028
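# A minimal sketch of how alpha_sis() above fans out into several codes: a
# plain string value extends every partial code, while a tuple of alternatives
# multiplies the list, keeping the primary alternative first. The name
# extend_codes is hypothetical; the function above does this inline.

```python
def extend_codes(codes, value):
    """Append value to every partial code; a tuple of alternatives fans out."""
    if isinstance(value, tuple):
        # Outer loop over alternatives keeps all primary codings first.
        return [code + alt for alt in value for code in codes]
    return [code + value for code in codes]


codes = ['0']
codes = extend_codes(codes, ('6', '70', '0'))  # alternatives listed for 'CH'
codes = extend_codes(codes, '4')               # single value listed for 'R'
```

# After these two steps codes is ['064', '0704', '004'], which is why the
# returned collection must stay ordered.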
2029
def fuzzy_soundex(word, maxlength=5, zero_pad=True):
2030
    """Return the Fuzzy Soundex code for a word.
2031
2032
    Fuzzy Soundex is an algorithm derived from Soundex, defined in
2033
    :cite:`Holmes:2002`.
2034
2035
    :param str word: the word to transform
2036
    :param int maxlength: the length of the code returned (defaults to 5)
2037
    :param bool zero_pad: pad the end of the return value with 0s to achieve
2038
        a maxlength string
2039
    :returns: the Fuzzy Soundex value
2040
    :rtype: str
2041
2042
    >>> fuzzy_soundex('Christopher')
2043
    'K6931'
2044
    >>> fuzzy_soundex('Niall')
2045
    'N4000'
2046
    >>> fuzzy_soundex('Smith')
2047
    'S5300'
2048
    >>> fuzzy_soundex('Schmidt')
2049
    'S5300'
2050
    """
2051
    _fuzzy_soundex_translation = dict(zip((ord(_) for _ in
    # Issue (Comprehensibility, Best Practice): The variable _ does not seem
    # to be defined.
2052
                                           'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
2053
                                          '0193017-07745501769301-7-9'))
2054
2055
    word = normalize('NFKD', text_type(word.upper()))
2056
    word = word.replace('ß', 'SS')
2057
2058
    # Clamp maxlength to [4, 64]
2059
    if maxlength is not None:
2060
        maxlength = min(max(4, maxlength), 64)
2061
    else:
2062
        maxlength = 64
2063
2064
    if not word:
2065
        if zero_pad:
2066
            return '0' * maxlength
2067
        return '0'
2068
2069
    if word[:2] in {'CS', 'CZ', 'TS', 'TZ'}:
2070
        word = 'SS' + word[2:]
2071
    elif word[:2] == 'GN':
2072
        word = 'NN' + word[2:]
2073
    elif word[:2] in {'HR', 'WR'}:
2074
        word = 'RR' + word[2:]
2075
    elif word[:2] == 'HW':
2076
        word = 'WW' + word[2:]
2077
    elif word[:2] in {'KN', 'NG'}:
2078
        word = 'NN' + word[2:]
2079
2080
    if word[-2:] == 'CH':
2081
        word = word[:-2] + 'KK'
2082
    elif word[-2:] == 'NT':
2083
        word = word[:-2] + 'TT'
2084
    elif word[-2:] == 'RT':
2085
        word = word[:-2] + 'RR'
2086
    elif word[-3:] == 'RDT':
2087
        word = word[:-3] + 'RR'
2088
2089
    word = word.replace('CA', 'KA')
2090
    word = word.replace('CC', 'KK')
2091
    word = word.replace('CK', 'KK')
2092
    word = word.replace('CE', 'SE')
2093
    word = word.replace('CHL', 'KL')
2094
    word = word.replace('CL', 'KL')
2095
    word = word.replace('CHR', 'KR')
2096
    word = word.replace('CR', 'KR')
2097
    word = word.replace('CI', 'SI')
2098
    word = word.replace('CO', 'KO')
2099
    word = word.replace('CU', 'KU')
2100
    word = word.replace('CY', 'SY')
2101
    word = word.replace('DG', 'GG')
2102
    word = word.replace('GH', 'HH')
2103
    word = word.replace('MAC', 'MK')
2104
    word = word.replace('MC', 'MK')
2105
    word = word.replace('NST', 'NSS')
2106
    word = word.replace('PF', 'FF')
2107
    word = word.replace('PH', 'FF')
2108
    word = word.replace('SCH', 'SSS')
2109
    word = word.replace('TIO', 'SIO')
2110
    word = word.replace('TIA', 'SIO')
2111
    word = word.replace('TCH', 'CHH')
2112
2113
    sdx = word.translate(_fuzzy_soundex_translation)
2114
    sdx = sdx.replace('-', '')
2115
2116
    # remove repeating characters
2117
    sdx = _delete_consecutive_repeats(sdx)
2118
2119
    if word[0] in {'H', 'W', 'Y'}:
2120
        sdx = word[0] + sdx
2121
    else:
2122
        sdx = word[0] + sdx[1:]
2123
2124
    sdx = sdx.replace('0', '')
2125
2126
    if zero_pad:
2127
        sdx += ('0'*maxlength)
2128
2129
    return sdx[:maxlength]
2130
2131
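# A sketch of the letter-to-digit step in fuzzy_soundex() above, assuming
# Python 3, where str.maketrans can build the table directly. collapse_repeats
# is a hypothetical stand-in for _delete_consecutive_repeats, whose body is
# not shown in this listing; raw_digits is likewise an illustrative name.

```python
from itertools import groupby

LETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
DIGITS = '0193017-07745501769301-7-9'
TABLE = str.maketrans(LETTERS, DIGITS)


def collapse_repeats(text):
    """Keep one character from each run of identical characters."""
    return ''.join(key for key, _group in groupby(text))


def raw_digits(word):
    """Translate letters to digits, drop '-' placeholders, collapse runs."""
    digits = word.upper().translate(TABLE).replace('-', '')
    return collapse_repeats(digits)
```

# raw_digits('Smith') gives '9503'; the function above then restores the
# initial letter and removes the '0' vowel markers before padding.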
2132
def phonex(word, maxlength=4, zero_pad=True):
2133
    """Return the Phonex code for a word.
2134
2135
    Phonex is an algorithm derived from Soundex, defined in :cite:`Lait:1996`.
2136
2137
    :param str word: the word to transform
2138
    :param int maxlength: the length of the code returned (defaults to 4)
2139
    :param bool zero_pad: pad the end of the return value with 0s to achieve
2140
        a maxlength string
2141
    :returns: the Phonex value
2142
    :rtype: str
2143
2144
    >>> phonex('Christopher')
2145
    'C623'
2146
    >>> phonex('Niall')
2147
    'N400'
2148
    >>> phonex('Schmidt')
2149
    'S253'
2150
    >>> phonex('Smith')
2151
    'S530'
2152
    """
2153
    name = normalize('NFKD', text_type(word.upper()))
2154
    name = name.replace('ß', 'SS')
2155
2156
    # Clamp maxlength to [4, 64]
2157
    if maxlength is not None:
2158
        maxlength = min(max(4, maxlength), 64)
2159
    else:
2160
        maxlength = 64
2161
2162
    name_code = last = ''
2163
2164
    # Deletions effected by replacing with next letter which
2165
    # will be ignored due to duplicate handling of Soundex code.
2166
    # This is faster than 'moving' all subsequent letters.
2167
2168
    # Remove any trailing Ss
2169
    while name[-1:] == 'S':
2170
        name = name[:-1]
2171
2172
    # Phonetic equivalents of first 2 characters
2173
    # Works since duplicate letters are ignored
2174
    if name[:2] == 'KN':
2175
        name = 'N' + name[2:]  # KN.. == N..
2176
    elif name[:2] == 'PH':
2177
        name = 'F' + name[2:]  # PH.. == F.. (H ignored anyway)
2178
    elif name[:2] == 'WR':
2179
        name = 'R' + name[2:]  # WR.. == R..
2180
2181
    if name:
2182
        # Special case, ignore H first letter (subsequent Hs ignored anyway)
2183
        # Works since duplicate letters are ignored
2184
        if name[0] == 'H':
2185
            name = name[1:]
2186
2187
    if name:
2188
        # Phonetic equivalents of first character
2189
        if name[0] in {'A', 'E', 'I', 'O', 'U', 'Y'}:
2190
            name = 'A' + name[1:]
2191
        elif name[0] in {'B', 'P'}:
2192
            name = 'B' + name[1:]
2193
        elif name[0] in {'V', 'F'}:
2194
            name = 'F' + name[1:]
2195
        elif name[0] in {'C', 'K', 'Q'}:
2196
            name = 'C' + name[1:]
2197
        elif name[0] in {'G', 'J'}:
2198
            name = 'G' + name[1:]
2199
        elif name[0] in {'S', 'Z'}:
2200
            name = 'S' + name[1:]
2201
2202
        name_code = last = name[0]
2203
2204
    # MODIFIED SOUNDEX CODE
2205
    for i in range(1, len(name)):
2206
        code = '0'
2207
        if name[i] in {'B', 'F', 'P', 'V'}:
2208
            code = '1'
2209
        elif name[i] in {'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'}:
2210
            code = '2'
2211
        elif name[i] in {'D', 'T'}:
2212
            if name[i+1:i+2] != 'C':
2213
                code = '3'
2214
        elif name[i] == 'L':
2215
            if (name[i+1:i+2] in {'A', 'E', 'I', 'O', 'U', 'Y'} or
2216
                    i+1 == len(name)):
2217
                code = '4'
2218
        elif name[i] in {'M', 'N'}:
2219
            if name[i+1:i+2] in {'D', 'G'}:
2220
                name = name[:i+1] + name[i] + name[i+2:]
2221
            code = '5'
2222
        elif name[i] == 'R':
2223
            if (name[i+1:i+2] in {'A', 'E', 'I', 'O', 'U', 'Y'} or
2224
                    i+1 == len(name)):
2225
                code = '6'
2226
2227
        if code != last and code != '0':
2228
            name_code += code
2229
2230
        last = name_code[-1]
2231
2232
    if zero_pad:
2233
        name_code += '0' * maxlength
2234
    if not name_code:
2235
        name_code = '0'
2236
    return name_code[:maxlength]
2237
2238
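# The "Clamp maxlength to [4, 64]" block appears verbatim in alpha_sis(),
# fuzzy_soundex() and phonex(). One possible extraction is sketched here;
# clamp_maxlength and its lower/upper parameters are hypothetical names, not
# part of the module as listed.

```python
def clamp_maxlength(maxlength, lower=4, upper=64):
    """Return maxlength limited to [lower, upper]; None means the upper bound."""
    if maxlength is None:
        return upper
    return min(max(lower, maxlength), upper)
```

# Each caller could then replace its four-line block with
# maxlength = clamp_maxlength(maxlength).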
2239
def phonem(word):
2240
    """Return the Phonem code for a word.
2241
2242
    Phonem is defined in :cite:`Wilde:1988`.
2243
2244
    This version is based on the Perl implementation documented at
2245
    :cite:`Wilz:2005`.
2246
    It includes some enhancements presented in the Java port at
2247
    :cite:`dcm4che:2011`.
2248
2249
    Phonem is intended chiefly for German names/words.
2250
2251
    :param str word: the word to transform
2252
    :returns: the Phonem value
2253
    :rtype: str
2254
2255
    >>> phonem('Christopher')
2256
    'CRYSDOVR'
2257
    >>> phonem('Niall')
2258
    'NYAL'
2259
    >>> phonem('Smith')
2260
    'SMYD'
2261
    >>> phonem('Schmidt')
2262
    'CMYD'
2263
    """
2264
    _phonem_substitutions = (('SC', 'C'), ('SZ', 'C'), ('CZ', 'C'),
2265
                             ('TZ', 'C'), ('TS', 'C'), ('KS', 'X'),
2266
                             ('PF', 'V'), ('QU', 'KW'), ('PH', 'V'),
2267
                             ('UE', 'Y'), ('AE', 'E'), ('OE', 'Ö'),
2268
                             ('EI', 'AY'), ('EY', 'AY'), ('EU', 'OY'),
2269
                             ('AU', 'A§'), ('OU', '§'))
2270
    _phonem_translation = dict(zip((ord(_) for _ in
    # Issue (Comprehensibility, Best Practice): The variable _ does not seem
    # to be defined.
2271
                                    'ZKGQÇÑßFWPTÁÀÂÃÅÄÆÉÈÊËIJÌÍÎÏÜݧÚÙÛÔÒÓÕØ'),
2272
                                   'CCCCCNSVVBDAAAAAEEEEEEYYYYYYYYUUUUOOOOÖ'))
2273
2274
    word = normalize('NFC', text_type(word.upper()))
2275
    for i, j in _phonem_substitutions:
2276
        word = word.replace(i, j)
2277
    word = word.translate(_phonem_translation)
2278
2279
    return ''.join(c for c in _delete_consecutive_repeats(word)
2280
                   if c in {'A', 'B', 'C', 'D', 'L', 'M', 'N', 'O', 'R', 'S',
2281
                            'U', 'V', 'W', 'X', 'Y', 'Ö'})
2282
2283
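# A small sketch of the ordered substitution pass in phonem() above. The pairs
# are applied one after another, so each later rule sees the output of the
# earlier ones. apply_substitutions is a hypothetical name, and SUBS is only a
# three-rule excerpt of _phonem_substitutions.

```python
def apply_substitutions(word, substitutions):
    """Apply (source, target) replacements to word in the given order."""
    for src, tar in substitutions:
        word = word.replace(src, tar)
    return word


SUBS = (('SC', 'C'), ('PH', 'V'), ('QU', 'KW'))
```

# With this excerpt, 'SCHMIDT' becomes 'CHMIDT' and 'QUELLE' becomes 'KWELLE';
# the remaining letters are handled by the translation table afterwards.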
2284
def phonix(word, maxlength=4, zero_pad=True):
2285
    """Return the Phonix code for a word.
2286
2287
    Phonix is a Soundex-like algorithm defined in :cite:`Gadd:1990`.
2288
2289
    This implementation is based on:
2290
    - :cite:`Pfeifer:2000`
2291
    - :cite:`Christen:2011`
2292
    - :cite:`Kollar:2007`
2293
2294
    :param str word: the word to transform
2295
    :param int maxlength: the length of the code returned (defaults to 4)
2296
    :param bool zero_pad: pad the end of the return value with 0s to achieve
2297
        a maxlength string
2298
    :returns: the Phonix value
2299
    :rtype: str
2300
2301
    >>> phonix('Christopher')
2302
    'K683'
2303
    >>> phonix('Niall')
2304
    'N400'
2305
    >>> phonix('Smith')
2306
    'S530'
2307
    >>> phonix('Schmidt')
2308
    'S530'
2309
    """
2310
    # pylint: disable=too-many-branches
2311
    def _start_repl(word, src, tar, post=None):
2312
        r"""Replace src with tar at the start of word."""
2313
        if post:
2314
            for i in post:
2315
                if word.startswith(src+i):
2316
                    return tar + word[len(src):]
2317
        elif word.startswith(src):
2318
            return tar + word[len(src):]
2319
        return word
2320
2321
    def _end_repl(word, src, tar, pre=None):
2322
        r"""Replace src with tar at the end of word."""
2323
        if pre:
2324
            for i in pre:
2325
                if word.endswith(i+src):
2326
                    return word[:-len(src)] + tar
2327
        elif word.endswith(src):
2328
            return word[:-len(src)] + tar
2329
        return word
2330
2331
    def _mid_repl(word, src, tar, pre=None, post=None):
2332
        r"""Replace src with tar in the middle of word."""
2333
        if pre or post:
2334
            if not pre:
2335
                return word[0] + _all_repl(word[1:], src, tar, pre, post)
2336
            elif not post:
2337
                return _all_repl(word[:-1], src, tar, pre, post) + word[-1]
2338
            return _all_repl(word, src, tar, pre, post)
2339
        return (word[0] + _all_repl(word[1:-1], src, tar, pre, post) +
2340
                word[-1])
2341
2342
    def _all_repl(word, src, tar, pre=None, post=None):
2343
        r"""Replace src with tar anywhere in word."""
2344
        if pre or post:
2345
            post = post or frozenset(('',))
            pre = pre or frozenset(('',))
2353
2354
            for i, j in ((i, j) for i in pre for j in post):
2355
                word = word.replace(i+src+j, i+tar+j)
2356
            return word
2357
        else:
2358
            return word.replace(src, tar)
2359
2360
    _vow = {'A', 'E', 'I', 'O', 'U'}
2361
    _con = {'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N', 'P', 'Q',
2362
            'R', 'S', 'T', 'V', 'W', 'X', 'Y', 'Z'}
2363
2364
    _phonix_substitutions = ((_all_repl, 'DG', 'G'),
2365
                             (_all_repl, 'CO', 'KO'),
2366
                             (_all_repl, 'CA', 'KA'),
2367
                             (_all_repl, 'CU', 'KU'),
2368
                             (_all_repl, 'CY', 'SI'),
2369
                             (_all_repl, 'CI', 'SI'),
2370
                             (_all_repl, 'CE', 'SE'),
2371
                             (_start_repl, 'CL', 'KL', _vow),
2372
                             (_all_repl, 'CK', 'K'),
2373
                             (_end_repl, 'GC', 'K'),
2374
                             (_end_repl, 'JC', 'K'),
2375
                             (_start_repl, 'CHR', 'KR', _vow),
2376
                             (_start_repl, 'CR', 'KR', _vow),
2377
                             (_start_repl, 'WR', 'R'),
2378
                             (_all_repl, 'NC', 'NK'),
2379
                             (_all_repl, 'CT', 'KT'),
2380
                             (_all_repl, 'PH', 'F'),
2381
                             (_all_repl, 'AA', 'AR'),
2382
                             (_all_repl, 'SCH', 'SH'),
2383
                             (_all_repl, 'BTL', 'TL'),
2384
                             (_all_repl, 'GHT', 'T'),
2385
                             (_all_repl, 'AUGH', 'ARF'),
2386
                             (_mid_repl, 'LJ', 'LD', _vow, _vow),
2387
                             (_all_repl, 'LOUGH', 'LOW'),
2388
                             (_start_repl, 'Q', 'KW'),
2389
                             (_start_repl, 'KN', 'N'),
2390
                             (_end_repl, 'GN', 'N'),
2391
                             (_all_repl, 'GHN', 'N'),
2392
                             (_end_repl, 'GNE', 'N'),
2393
                             (_all_repl, 'GHNE', 'NE'),
2394
                             (_end_repl, 'GNES', 'NS'),
2395
                             (_start_repl, 'GN', 'N'),
2396
                             (_mid_repl, 'GN', 'N', None, _con),
2397
                             (_end_repl, 'GN', 'N'),
2398
                             (_start_repl, 'PS', 'S'),
2399
                             (_start_repl, 'PT', 'T'),
2400
                             (_start_repl, 'CZ', 'C'),
2401
                             (_mid_repl, 'WZ', 'Z', _vow),
2402
                             (_mid_repl, 'CZ', 'CH'),
2403
                             (_all_repl, 'LZ', 'LSH'),
2404
                             (_all_repl, 'RZ', 'RSH'),
2405
                             (_mid_repl, 'Z', 'S', None, _vow),
2406
                             (_all_repl, 'ZZ', 'TS'),
2407
                             (_mid_repl, 'Z', 'TS', _con),
2408
                             (_all_repl, 'HROUG', 'REW'),
2409
                             (_all_repl, 'OUGH', 'OF'),
2410
                             (_mid_repl, 'Q', 'KW', _vow, _vow),
2411
                             (_mid_repl, 'J', 'Y', _vow, _vow),
2412
                             (_start_repl, 'YJ', 'Y', _vow),
2413
                             (_start_repl, 'GH', 'G'),
2414
                             (_end_repl, 'GH', 'E', _vow),
2415
                             (_start_repl, 'CY', 'S'),
2416
                             (_all_repl, 'NX', 'NKS'),
2417
                             (_start_repl, 'PF', 'F'),
2418
                             (_end_repl, 'DT', 'T'),
2419
                             (_end_repl, 'TL', 'TIL'),
2420
                             (_end_repl, 'DL', 'DIL'),
2421
                             (_all_repl, 'YTH', 'ITH'),
2422
                             (_start_repl, 'TJ', 'CH', _vow),
2423
                             (_start_repl, 'TSJ', 'CH', _vow),
2424
                             (_start_repl, 'TS', 'T', _vow),
2425
                             (_all_repl, 'TCH', 'CH'),
2426
                             (_mid_repl, 'WSK', 'VSKIE', _vow),
2427
                             (_end_repl, 'WSK', 'VSKIE', _vow),
2428
                             (_start_repl, 'MN', 'N', _vow),
2429
                             (_start_repl, 'PN', 'N', _vow),
2430
                             (_mid_repl, 'STL', 'SL', _vow),
2431
                             (_end_repl, 'STL', 'SL', _vow),
2432
                             (_end_repl, 'TNT', 'ENT'),
2433
                             (_end_repl, 'EAUX', 'OH'),
2434
                             (_all_repl, 'EXCI', 'ECS'),
2435
                             (_all_repl, 'X', 'ECS'),
2436
                             (_end_repl, 'NED', 'ND'),
2437
                             (_all_repl, 'JR', 'DR'),
2438
                             (_end_repl, 'EE', 'EA'),
2439
                             (_all_repl, 'ZS', 'S'),
2440
                             (_mid_repl, 'R', 'AH', _vow, _con),
2441
                             (_end_repl, 'R', 'AH', _vow),
2442
                             (_mid_repl, 'HR', 'AH', _vow, _con),
2443
                             (_end_repl, 'HR', 'AH', _vow),
2444
                             (_end_repl, 'HR', 'AH', _vow),
2445
                             (_end_repl, 'RE', 'AR'),
2446
                             (_end_repl, 'R', 'AH', _vow),
2447
                             (_all_repl, 'LLE', 'LE'),
2448
                             (_end_repl, 'LE', 'ILE', _con),
2449
                             (_end_repl, 'LES', 'ILES', _con),
2450
                             (_end_repl, 'E', ''),
2451
                             (_end_repl, 'ES', 'S'),
2452
                             (_end_repl, 'SS', 'AS', _vow),
2453
                             (_end_repl, 'MB', 'M', _vow),
2454
                             (_all_repl, 'MPTS', 'MPS'),
2455
                             (_all_repl, 'MPS', 'MS'),
2456
                             (_all_repl, 'MPT', 'MT'))
2457
2458
    _phonix_translation = dict(zip((ord(_) for _ in
2459
                                    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
2460
                                   '01230720022455012683070808'))
2461
2462
    sdx = ''
2463
2464
    word = normalize('NFKD', text_type(word.upper()))
2465
    word = word.replace('ß', 'SS')
2466
    word = ''.join(c for c in word if c in
2467
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
2468
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
2469
                    'Y', 'Z'})
2470
    if word:
2471
        for trans in _phonix_substitutions:
2472
            word = trans[0](word, *trans[1:])
2473
        if word[0] in {'A', 'E', 'I', 'O', 'U', 'Y'}:
2474
            sdx = 'v' + word[1:].translate(_phonix_translation)
2475
        else:
2476
            sdx = word[0] + word[1:].translate(_phonix_translation)
2477
        sdx = _delete_consecutive_repeats(sdx)
2478
        sdx = sdx.replace('0', '')
2479
2480
    # Clamp maxlength to [4, 64]
2481
    if maxlength is not None:
2482
        maxlength = min(max(4, maxlength), 64)
2483
    else:
2484
        maxlength = 64
2485
2486
    if zero_pad:
2487
        sdx += '0' * maxlength
2488
    if not sdx:
2489
        sdx = '0'
2490
    return sdx[:maxlength]
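The substitution table above is data-driven: each entry names a helper and the arguments to pass to it, and the main loop applies the entries in order. The following is a minimal standalone sketch of that pattern, not the module's actual code. The names `start_repl`, `end_repl`, `RULES`, and `apply_rules` are hypothetical, the three rules are a small subset chosen for illustration, and the tuple unpacking assumes Python 3.

```python
VOWELS = frozenset('AEIOU')


def start_repl(word, src, tar, post=None):
    """Replace src with tar at the start of word.

    If post is given, src must be followed by one of its members.
    """
    if post:
        for follower in post:
            if word.startswith(src + follower):
                return tar + word[len(src):]
    elif word.startswith(src):
        return tar + word[len(src):]
    return word


def end_repl(word, src, tar, pre=None):
    """Replace src with tar at the end of word.

    If pre is given, src must be preceded by one of its members.
    """
    if pre:
        for leader in pre:
            if word.endswith(leader + src):
                return word[:-len(src)] + tar
    elif word.endswith(src):
        return word[:-len(src)] + tar
    return word


# Illustrative subset of rules: (helper, source, target[, context set]).
RULES = (
    (start_repl, 'KN', 'N'),
    (start_repl, 'CL', 'KL', VOWELS),
    (end_repl, 'MB', 'M', VOWELS),
)


def apply_rules(word, rules=RULES):
    """Apply every rule in order, feeding each result into the next rule."""
    for func, *args in rules:
        word = func(word, *args)
    return word
```

Because the rules run sequentially, an earlier substitution can enable or block a later one, so the order of the table is part of the algorithm.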
2491
2492
2493
def sfinxbis(word, maxlength=None):
2494
    """Return the SfinxBis code for a word.
2495
2496
    SfinxBis is a Soundex-like algorithm defined in :cite:`Axelsson:2009`.
2497
2498
    This implementation follows the reference implementation:
2499
    :cite:`Sjoo:2009`.
2500
2501
    SfinxBis is intended chiefly for Swedish names.
2502
2503
    :param str word: the word to transform
2504
    :param int maxlength: the length of the code returned (defaults to
2505
        unlimited)
2506
    :returns: the SfinxBis value
2507
    :rtype: tuple
2508
2509
    >>> sfinxbis('Christopher')
2510
    ('K68376',)
2511
    >>> sfinxbis('Niall')
2512
    ('N4',)
2513
    >>> sfinxbis('Smith')
2514
    ('S53',)
2515
    >>> sfinxbis('Schmidt')
2516
    ('S53',)
2517
2518
    >>> sfinxbis('Johansson')
2519
    ('J585',)
2520
    >>> sfinxbis('Sjöberg')
2521
    ('#162',)
2522
    """
2523
    adelstitler = (' DE LA ', ' DE LAS ', ' DE LOS ', ' VAN DE ', ' VAN DEN ',
2524
                   ' VAN DER ', ' VON DEM ', ' VON DER ',
2525
                   ' AF ', ' AV ', ' DA ', ' DE ', ' DEL ', ' DEN ', ' DES ',
2526
                   ' DI ', ' DO ', ' DON ', ' DOS ', ' DU ', ' E ', ' IN ',
2527
                   ' LA ', ' LE ', ' MAC ', ' MC ', ' VAN ', ' VON ', ' Y ',
2528
                   ' S:T ')
2529
2530
    _harde_vokaler = {'A', 'O', 'U', 'Å'}
2531
    _mjuka_vokaler = {'E', 'I', 'Y', 'Ä', 'Ö'}
2532
    _konsonanter = {'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N', 'P',
2533
                    'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Z'}
2534
    _alfabet = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
2535
                'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
2536
                'Y', 'Z', 'Ä', 'Å', 'Ö'}
2537
2538
    _sfinxbis_translation = dict(zip((ord(_) for _ in
2539
                                      'BCDFGHJKLMNPQRSTVZAOUÅEIYÄÖ'),
2540
                                     '123729224551268378999999999'))
2541
2542
    _sfinxbis_substitutions = dict(zip((ord(_) for _ in
2543
                                        'WZÀÁÂÃÆÇÈÉÊËÌÍÎÏÑÒÓÔÕØÙÚÛÜÝ'),
2544
                                       'VSAAAAÄCEEEEIIIINOOOOÖUUUYY'))
2545
2546
    def _foersvensker(ordet):
2547
        """Return the Swedish-ized form of the word."""
2548
        ordet = ordet.replace('STIERN', 'STJÄRN')
2549
        ordet = ordet.replace('HIE', 'HJ')
2550
        ordet = ordet.replace('SIÖ', 'SJÖ')
2551
        ordet = ordet.replace('SCH', 'SH')
2552
        ordet = ordet.replace('QU', 'KV')
2553
        ordet = ordet.replace('IO', 'JO')
2554
        ordet = ordet.replace('PH', 'F')
2555
2556
        for i in _harde_vokaler:
2557
            ordet = ordet.replace(i+'Ü', i+'J')
2558
            ordet = ordet.replace(i+'Y', i+'J')
2559
            ordet = ordet.replace(i+'I', i+'J')
2560
        for i in _mjuka_vokaler:
2561
            ordet = ordet.replace(i+'Ü', i+'J')
2562
            ordet = ordet.replace(i+'Y', i+'J')
2563
            ordet = ordet.replace(i+'I', i+'J')
2564
2565
        if 'H' in ordet:
2566
            for i in _konsonanter:
2567
                ordet = ordet.replace('H'+i, i)
2568
2569
        ordet = ordet.translate(_sfinxbis_substitutions)
2570
2571
        ordet = ordet.replace('Ð', 'ETH')
2572
        ordet = ordet.replace('Þ', 'TH')
2573
        ordet = ordet.replace('ß', 'SS')
2574
2575
        return ordet
2576
2577
    def _koda_foersta_ljudet(ordet):
2578
        """Return the word with the first sound coded."""
2579
        if ordet[0:1] in _mjuka_vokaler or ordet[0:1] in _harde_vokaler:
2580
            ordet = '$' + ordet[1:]
2581
        elif ordet[0:2] in ('DJ', 'GJ', 'HJ', 'LJ'):
2582
            ordet = 'J' + ordet[2:]
2583
        elif ordet[0:1] == 'G' and ordet[1:2] in _mjuka_vokaler:
2584
            ordet = 'J' + ordet[1:]
2585
        elif ordet[0:1] == 'Q':
2586
            ordet = 'K' + ordet[1:]
2587
        elif (ordet[0:2] == 'CH' and
2588
              ordet[2:3] in frozenset(_mjuka_vokaler | _harde_vokaler)):
2589
            ordet = '#' + ordet[2:]
2590
        elif ordet[0:1] == 'C' and ordet[1:2] in _harde_vokaler:
2591
            ordet = 'K' + ordet[1:]
2592
        elif ordet[0:1] == 'C' and ordet[1:2] in _konsonanter:
2593
            ordet = 'K' + ordet[1:]
2594
        elif ordet[0:1] == 'X':
2595
            ordet = 'S' + ordet[1:]
2596
        elif ordet[0:1] == 'C' and ordet[1:2] in _mjuka_vokaler:
2597
            ordet = 'S' + ordet[1:]
2598
        elif ordet[0:3] in ('SKJ', 'STJ', 'SCH'):
2599
            ordet = '#' + ordet[3:]
2600
        elif ordet[0:2] in ('SH', 'KJ', 'TJ', 'SJ'):
2601
            ordet = '#' + ordet[2:]
2602
        elif ordet[0:2] == 'SK' and ordet[2:3] in _mjuka_vokaler:
2603
            ordet = '#' + ordet[2:]
2604
        elif ordet[0:1] == 'K' and ordet[1:2] in _mjuka_vokaler:
2605
            ordet = '#' + ordet[1:]
2606
        return ordet
2607
2608
    # Step 1: Uppercase
2609
    word = normalize('NFC', text_type(word.upper()))
2610
    word = word.replace('ß', 'SS')
2611
    word = word.replace('-', ' ')
2612
2613
    # Step 2: Remove nobility prefixes
2614
    for adelstitel in adelstitler:
2615
        while adelstitel in word:
2616
            word = word.replace(adelstitel, ' ')
2617
        if word.startswith(adelstitel[1:]):
2618
            word = word[len(adelstitel)-1:]
2619
2620
    # Split word into tokens
2621
    ordlista = word.split()
2622
2623
    # Step 3: Remove doubled letters at the start of the name
2624
    ordlista = [_delete_consecutive_repeats(ordet) for ordet in ordlista]
2625
    if not ordlista:
2626
        return ('',)
2627
2628
    # Step 4: Swedification
2629
    ordlista = [_foersvensker(ordet) for ordet in ordlista]
2630
2631
    # Step 5: Remove all characters that are not A-Ö (65-90, 196, 197, 214)
2632
    ordlista = [''.join(c for c in ordet if c in _alfabet)
2633
                for ordet in ordlista]
2634
2635
    # Step 6: Encode the first sound
2636
    ordlista = [_koda_foersta_ljudet(ordet) for ordet in ordlista]
2637
2638
    # Step 7: Split the name into two parts
2639
    rest = [ordet[1:] for ordet in ordlista]
2640
2641
    # Step 8: Perform the phonetic transformation on the rest
2642
    rest = [ordet.replace('DT', 'T') for ordet in rest]
2643
    rest = [ordet.replace('X', 'KS') for ordet in rest]
2644
2645
    # Step 9: Encode the rest as a numeric code
2646
    for vokal in _mjuka_vokaler:
2647
        rest = [ordet.replace('C'+vokal, '8'+vokal) for ordet in rest]
2648
    rest = [ordet.translate(_sfinxbis_translation) for ordet in rest]
2649
2650
    # Step 10: Remove adjacent duplicates
2651
    rest = [_delete_consecutive_repeats(ordet) for ordet in rest]
2652
2653
    # Step 11: Remove all "9"s
2654
    rest = [ordet.replace('9', '') for ordet in rest]
2655
2656
    # Step 12: Join the parts back together
2657
    ordlista = [''.join(ordet) for ordet in
2658
                zip((_[0:1] for _ in ordlista), rest)]
2659
2660
    # Truncate each code if maxlength is set
2661
    if maxlength and maxlength < _INFINITY:
2662
        ordlista = [ordet[:maxlength] for ordet in ordlista]
2663
2664
    return tuple(ordlista)
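Steps 9 through 11 of the function above turn the tail of each name into digits, collapse runs, and drop the vowel placeholder. The sketch below isolates just those steps as a rough illustration. `encode_rest` and `delete_consecutive_repeats` are hypothetical names; the latter is an assumed stand-in (built on `itertools.groupby`) for the module's `_delete_consecutive_repeats`, whose definition is not shown here. The `C` before soft vowel special case from step 9 is omitted.

```python
from itertools import groupby

# Same letter-to-digit table as in the function above; vowels map to '9'.
TRANSLATION = dict(zip((ord(c) for c in 'BCDFGHJKLMNPQRSTVZAOUÅEIYÄÖ'),
                       '123729224551268378999999999'))


def delete_consecutive_repeats(text):
    """Collapse each run of identical characters to a single character."""
    return ''.join(char for char, _ in groupby(text))


def encode_rest(rest):
    """Encode everything after the first sound of an uppercase name."""
    digits = rest.translate(TRANSLATION)          # letters to digits
    digits = delete_consecutive_repeats(digits)   # collapse adjacent repeats
    return digits.replace('9', '')                # drop the placeholder


# 'SMITH' keeps its first letter 'S'; the rest 'MITH' becomes '53'.
print('S' + encode_rest('MITH'))
```

Collapsing repeats before removing the `9` placeholders matters: a vowel or `H` between two consonants with the same digit keeps them from merging into one.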
2665
2666
2667
def phonet(word, mode=1, lang='de', trace=False):
2668
    """Return the phonet code for a word.
2669
2670
    phonet ("Hannoveraner Phonetik") was developed by Jörg Michael and
2671
    documented in :cite:`Michael:1999`.
2672
2673
    This is a port of Jesper Zedlitz's code, which is licensed LGPL
2674
    :cite:`Zedlitz:2015`.
2675
2676
    That is, in turn, based on Michael's C code, which is also licensed LGPL
2677
    :cite:`Michael:2007`.
2678
2679
    :param str word: the word to transform
2680
    :param int mode: the phonet variant to employ (1 or 2)
2681
    :param str lang: 'de' (default) for German
2682
            'none' for no language
2683
    :param bool trace: prints debugging info if True
2684
    :returns: the phonet value
2685
    :rtype: str
2686
2687
    >>> phonet('Christopher')
2688
    'KRISTOFA'
2689
    >>> phonet('Niall')
2690
    'NIAL'
2691
    >>> phonet('Smith')
2692
    'SMIT'
2693
    >>> phonet('Schmidt')
2694
    'SHMIT'
2695
2696
    >>> phonet('Christopher', mode=2)
2697
    'KRIZTUFA'
2698
    >>> phonet('Niall', mode=2)
2699
    'NIAL'
2700
    >>> phonet('Smith', mode=2)
2701
    'ZNIT'
2702
    >>> phonet('Schmidt', mode=2)
2703
    'ZNIT'
2704
2705
    >>> phonet('Christopher', lang='none')
2706
    'CHRISTOPHER'
2707
    >>> phonet('Niall', lang='none')
2708
    'NIAL'
2709
    >>> phonet('Smith', lang='none')
2710
    'SMITH'
2711
    >>> phonet('Schmidt', lang='none')
2712
    'SCHMIDT'
2713
    """
2714
    # pylint: disable=too-many-branches
2715
2716
    _phonet_rules_no_lang = (  # separator chars
2717
        '´', ' ', ' ',
2718
        '"', ' ', ' ',
2719
        '`$', '', '',
2720
        '\'', ' ', ' ',
2721
        ',', ',', ',',
2722
        ';', ',', ',',
2723
        '-', ' ', ' ',
2724
        ' ', ' ', ' ',
2725
        '.', '.', '.',
2726
        ':', '.', '.',
2727
        # German umlauts
2728
        'Ä', 'AE', 'AE',
2729
        'Ö', 'OE', 'OE',
2730
        'Ü', 'UE', 'UE',
2731
        'ß', 'S', 'S',
2732
        # international umlauts
2733
        'À', 'A', 'A',
2734
        'Á', 'A', 'A',
2735
        'Â', 'A', 'A',
2736
        'Ã', 'A', 'A',
2737
        'Å', 'A', 'A',
2738
        'Æ', 'AE', 'AE',
2739
        'Ç', 'C', 'C',
2740
        'Ð', 'DJ', 'DJ',
2741
        'È', 'E', 'E',
2742
        'É', 'E', 'E',
2743
        'Ê', 'E', 'E',
2744
        'Ë', 'E', 'E',
2745
        'Ì', 'I', 'I',
2746
        'Í', 'I', 'I',
2747
        'Î', 'I', 'I',
2748
        'Ï', 'I', 'I',
2749
        'Ñ', 'NH', 'NH',
2750
        'Ò', 'O', 'O',
2751
        'Ó', 'O', 'O',
2752
        'Ô', 'O', 'O',
2753
        'Õ', 'O', 'O',
2754
        'Œ', 'OE', 'OE',
2755
        'Ø', 'OE', 'OE',
2756
        'Š', 'SH', 'SH',
2757
        'Þ', 'TH', 'TH',
2758
        'Ù', 'U', 'U',
2759
        'Ú', 'U', 'U',
2760
        'Û', 'U', 'U',
2761
        'Ý', 'Y', 'Y',
2762
        'Ÿ', 'Y', 'Y',
2763
        # 'normal' letters (A-Z)
2764
        'MC^', 'MAC', 'MAC',
2765
        'MC^', 'MAC', 'MAC',
2766
        'M´^', 'MAC', 'MAC',
2767
        'M\'^', 'MAC', 'MAC',
2768
        'O´^', 'O', 'O',
2769
        'O\'^', 'O', 'O',
2770
        'VAN DEN ^', 'VANDEN', 'VANDEN',
2771
        None, None, None)
2772
2773
    _phonet_rules_german = (  # separator chars
2774
        '´', ' ', ' ',
2775
        '"', ' ', ' ',
2776
        '`$', '', '',
2777
        '\'', ' ', ' ',
2778
        ',', ' ', ' ',
2779
        ';', ' ', ' ',
2780
        '-', ' ', ' ',
2781
        ' ', ' ', ' ',
2782
        '.', '.', '.',
2783
        ':', '.', '.',
2784
        # German umlauts
2785
        'ÄE', 'E', 'E',
2786
        'ÄU<', 'EU', 'EU',
2787
        'ÄV(AEOU)-<', 'EW', None,
2788
        'Ä$', 'Ä', None,
2789
        'Ä<', None, 'E',
2790
        'Ä', 'E', None,
2791
        'ÖE', 'Ö', 'Ö',
2792
        'ÖU', 'Ö', 'Ö',
2793
        'ÖVER--<', 'ÖW', None,
2794
        'ÖV(AOU)-', 'ÖW', None,
2795
        'ÜBEL(GNRW)-^^', 'ÜBL ', 'IBL ',
2796
        'ÜBER^^', 'ÜBA', 'IBA',
2797
        'ÜE', 'Ü', 'I',
2798
        'ÜVER--<', 'ÜW', None,
2799
        'ÜV(AOU)-', 'ÜW', None,
2800
        'Ü', None, 'I',
2801
        'ßCH<', None, 'Z',
2802
        'ß<', 'S', 'Z',
2803
        # international umlauts
2804
        'À<', 'A', 'A',
2805
        'Á<', 'A', 'A',
2806
        'Â<', 'A', 'A',
2807
        'Ã<', 'A', 'A',
2808
        'Å<', 'A', 'A',
2809
        'ÆER-', 'E', 'E',
2810
        'ÆU<', 'EU', 'EU',
2811
        'ÆV(AEOU)-<', 'EW', None,
2812
        'Æ$', 'Ä', None,
2813
        'Æ<', None, 'E',
2814
        'Æ', 'E', None,
2815
        'Ç', 'Z', 'Z',
2816
        'ÐÐ-', '', '',
2817
        'Ð', 'DI', 'TI',
2818
        'È<', 'E', 'E',
2819
        'É<', 'E', 'E',
2820
        'Ê<', 'E', 'E',
2821
        'Ë', 'E', 'E',
2822
        'Ì<', 'I', 'I',
2823
        'Í<', 'I', 'I',
2824
        'Î<', 'I', 'I',
2825
        'Ï', 'I', 'I',
2826
        'ÑÑ-', '', '',
2827
        'Ñ', 'NI', 'NI',
2828
        'Ò<', 'O', 'U',
2829
        'Ó<', 'O', 'U',
2830
        'Ô<', 'O', 'U',
2831
        'Õ<', 'O', 'U',
2832
        'Œ<', 'Ö', 'Ö',
2833
        'Ø(IJY)-<', 'E', 'E',
2834
        'Ø<', 'Ö', 'Ö',
2835
        'Š', 'SH', 'Z',
2836
        'Þ', 'T', 'T',
2837
        'Ù<', 'U', 'U',
2838
        'Ú<', 'U', 'U',
2839
        'Û<', 'U', 'U',
2840
        'Ý<', 'I', 'I',
2841
        'Ÿ<', 'I', 'I',
2842
        # 'normal' letters (A-Z)
2843
        'ABELLE$', 'ABL', 'ABL',
2844
        'ABELL$', 'ABL', 'ABL',
2845
        'ABIENNE$', 'ABIN', 'ABIN',
2846
        'ACHME---^', 'ACH', 'AK',
2847
        'ACEY$', 'AZI', 'AZI',
2848
        'ADV', 'ATW', None,
2849
        'AEGL-', 'EK', None,
2850
        'AEU<', 'EU', 'EU',
2851
        'AE2', 'E', 'E',
2852
        'AFTRAUBEN------', 'AFT ', 'AFT ',
2853
        'AGL-1', 'AK', None,
2854
        'AGNI-^', 'AKN', 'AKN',
2855
        'AGNIE-', 'ANI', 'ANI',
2856
        'AGN(AEOU)-$', 'ANI', 'ANI',
2857
        'AH(AIOÖUÜY)-', 'AH', None,
2858
        'AIA2', 'AIA', 'AIA',
2859
        'AIE$', 'E', 'E',
2860
        'AILL(EOU)-', 'ALI', 'ALI',
2861
        'AINE$', 'EN', 'EN',
2862
        'AIRE$', 'ER', 'ER',
2863
        'AIR-', 'E', 'E',
2864
        'AISE$', 'ES', 'EZ',
2865
        'AISSANCE$', 'ESANS', 'EZANZ',
2866
        'AISSE$', 'ES', 'EZ',
2867
        'AIX$', 'EX', 'EX',
2868
        'AJ(AÄEÈÉÊIOÖUÜ)--', 'A', 'A',
2869
        'AKTIE', 'AXIE', 'AXIE',
2870
        'AKTUEL', 'AKTUEL', None,
2871
        'ALOI^', 'ALOI', 'ALUI',  # Don't merge these rules
2872
        'ALOY^', 'ALOI', 'ALUI',  # needed by 'check_rules'
2873
        'AMATEU(RS)-', 'AMATÖ', 'ANATÖ',
2874
        'ANCH(OEI)-', 'ANSH', 'ANZ',
2875
        'ANDERGEGANG----', 'ANDA GE', 'ANTA KE',
2876
        'ANDERGEHE----', 'ANDA ', 'ANTA ',
2877
        'ANDERGESETZ----', 'ANDA GE', 'ANTA KE',
2878
        'ANDERGING----', 'ANDA ', 'ANTA ',
2879
        'ANDERSETZ(ET)-----', 'ANDA ', 'ANTA ',
2880
        'ANDERZUGEHE----', 'ANDA ZU ', 'ANTA ZU ',
2881
        'ANDERZUSETZE-----', 'ANDA ZU ', 'ANTA ZU ',
2882
        'ANER(BKO)---^^', 'AN', None,
2883
        'ANHAND---^$', 'AN H', 'AN ',
2884
        'ANH(AÄEIOÖUÜY)--^^', 'AN', None,
2885
        'ANIELLE$', 'ANIEL', 'ANIL',
2886
        'ANIEL', 'ANIEL', None,
2887
        'ANSTELLE----^$', 'AN ST', 'AN ZT',
2888
        'ANTI^^', 'ANTI', 'ANTI',
2889
        'ANVER^^', 'ANFA', 'ANFA',
2890
        'ATIA$', 'ATIA', 'ATIA',
2891
        'ATIA(NS)--', 'ATI', 'ATI',
2892
        'ATI(AÄOÖUÜ)-', 'AZI', 'AZI',
2893
        'AUAU--', '', '',
2894
        'AUERE$', 'AUERE', None,
2895
        'AUERE(NS)-$', 'AUERE', None,
2896
        'AUERE(AIOUY)--', 'AUER', None,
2897
        'AUER(AÄIOÖUÜY)-', 'AUER', None,
2898
        'AUER<', 'AUA', 'AUA',
2899
        'AUF^^', 'AUF', 'AUF',
2900
        'AULT$', 'O', 'U',
2901
        'AUR(BCDFGKLMNQSTVWZ)-', 'AUA', 'AUA',
2902
        'AUR$', 'AUA', 'AUA',
2903
        'AUSSE$', 'OS', 'UZ',
2904
        'AUS(ST)-^', 'AUS', 'AUS',
2905
        'AUS^^', 'AUS', 'AUS',
2906
        'AUTOFAHR----', 'AUTO ', 'AUTU ',
2907
        'AUTO^^', 'AUTO', 'AUTU',
2908
        'AUX(IY)-', 'AUX', 'AUX',
2909
        'AUX', 'O', 'U',
2910
        'AU', 'AU', 'AU',
2911
        'AVER--<', 'AW', None,
2912
        'AVIER$', 'AWIE', 'AFIE',
2913
        'AV(EÈÉÊI)-^', 'AW', None,
2914
        'AV(AOU)-', 'AW', None,
2915
        'AYRE$', 'EIRE', 'EIRE',
2916
        'AYRE(NS)-$', 'EIRE', 'EIRE',
2917
        'AYRE(AIOUY)--', 'EIR', 'EIR',
2918
        'AYR(AÄIOÖUÜY)-', 'EIR', 'EIR',
2919
        'AYR<', 'EIA', 'EIA',
2920
        'AYER--<', 'EI', 'EI',
2921
        'AY(AÄEIOÖUÜY)--', 'A', 'A',
2922
        'AË', 'E', 'E',
2923
        'A(IJY)<', 'EI', 'EI',
2924
        'BABY^$', 'BEBI', 'BEBI',
2925
        'BAB(IY)^', 'BEBI', 'BEBI',
2926
        'BEAU^$', 'BO', None,
2927
        'BEA(BCMNRU)-^', 'BEA', 'BEA',
2928
        'BEAT(AEIMORU)-^', 'BEAT', 'BEAT',
2929
        'BEE$', 'BI', 'BI',
2930
        'BEIGE^$', 'BESH', 'BEZ',
2931
        'BENOIT--', 'BENO', 'BENU',
2932
        'BER(DT)-', 'BER', None,
2933
        'BERN(DT)-', 'BERN', None,
2934
        'BE(LMNRST)-^', 'BE', 'BE',
2935
        'BETTE$', 'BET', 'BET',
2936
        'BEVOR^$', 'BEFOR', None,
2937
        'BIC$', 'BIZ', 'BIZ',
2938
        'BOWL(EI)-', 'BOL', 'BUL',
2939
        'BP(AÄEÈÉÊIÌÍÎOÖRUÜY)-', 'B', 'B',
2940
        'BRINGEND-----^', 'BRI', 'BRI',
2941
        'BRINGEND-----', ' BRI', ' BRI',
2942
        'BROW(NS)-', 'BRAU', 'BRAU',
2943
        'BUDGET7', 'BÜGE', 'BIKE',
2944
        'BUFFET7', 'BÜFE', 'BIFE',
2945
        'BYLLE$', 'BILE', 'BILE',
2946
        'BYLL$', 'BIL', 'BIL',
2947
        'BYPA--^', 'BEI', 'BEI',
2948
        'BYTE<', 'BEIT', 'BEIT',
2949
        'BY9^', 'BÜ', None,
2950
        'B(SßZ)$', 'BS', None,
2951
        'CACH(EI)-^', 'KESH', 'KEZ',
2952
        'CAE--', 'Z', 'Z',
2953
        'CA(IY)$', 'ZEI', 'ZEI',
2954
        'CE(EIJUY)--', 'Z', 'Z',
2955
        'CENT<', 'ZENT', 'ZENT',
2956
        'CERST(EI)----^', 'KE', 'KE',
2957
        'CER$', 'ZA', 'ZA',
2958
        'CE3', 'ZE', 'ZE',
2959
        'CH\'S$', 'X', 'X',
2960
        'CH´S$', 'X', 'X',
2961
        'CHAO(ST)-', 'KAO', 'KAU',
2962
        'CHAMPIO-^', 'SHEMPI', 'ZENBI',
2963
        'CHAR(AI)-^', 'KAR', 'KAR',
2964
        'CHAU(CDFSVWXZ)-', 'SHO', 'ZU',
2965
        'CHÄ(CF)-', 'SHE', 'ZE',
2966
        'CHE(CF)-', 'SHE', 'ZE',
2967
        'CHEM-^', 'KE', 'KE',  # or: 'CHE', 'KE'
2968
        'CHEQUE<', 'SHEK', 'ZEK',
2969
        'CHI(CFGPVW)-', 'SHI', 'ZI',
2970
        'CH(AEUY)-<^', 'SH', 'Z',
2971
        'CHK-', '', '',
2972
        'CHO(CKPS)-^', 'SHO', 'ZU',
2973
        'CHRIS-', 'KRI', None,
2974
        'CHRO-', 'KR', None,
2975
        'CH(LOR)-<^', 'K', 'K',
2976
        'CHST-', 'X', 'X',
2977
        'CH(SßXZ)3', 'X', 'X',
2978
        'CHTNI-3', 'CHN', 'KN',
2979
        'CH^', 'K', 'K',  # or: 'CH', 'K'
2980
        'CH', 'CH', 'K',
2981
        'CIC$', 'ZIZ', 'ZIZ',
2982
        'CIENCEFICT----', 'EIENS ', 'EIENZ ',
2983
        'CIENCE$', 'EIENS', 'EIENZ',
2984
        'CIER$', 'ZIE', 'ZIE',
2985
        'CYB-^', 'ZEI', 'ZEI',
2986
        'CY9^', 'ZÜ', 'ZI',
2987
        'C(IJY)-<3', 'Z', 'Z',
2988
        'CLOWN-', 'KLAU', 'KLAU',
2989
        'CCH', 'Z', 'Z',
2990
        'CCE-', 'X', 'X',
2991
        'C(CK)-', '', '',
2992
        'CLAUDET---', 'KLO', 'KLU',
2993
        'CLAUDINE^$', 'KLODIN', 'KLUTIN',
2994
        'COACH', 'KOSH', 'KUZ',
2995
        'COLE$', 'KOL', 'KUL',
2996
        'COUCH', 'KAUSH', 'KAUZ',
2997
        'COW', 'KAU', 'KAU',
2998
        'CQUES$', 'K', 'K',
2999
        'CQUE', 'K', 'K',
3000
        'CRASH--9', 'KRE', 'KRE',
3001
        'CREAT-^', 'KREA', 'KREA',
3002
        'CST', 'XT', 'XT',
3003
        'CS<^', 'Z', 'Z',
3004
        'C(SßX)', 'X', 'X',
3005
        'CT\'S$', 'X', 'X',
3006
        'CT(SßXZ)', 'X', 'X',
3007
        'CZ<', 'Z', 'Z',
3008
        'C(ÈÉÊÌÍÎÝ)3', 'Z', 'Z',
3009
        'C.^', 'C.', 'C.',
3010
        'CÄ-', 'Z', 'Z',
3011
        'CÜ$', 'ZÜ', 'ZI',
3012
        'C\'S$', 'X', 'X',
3013
        'C<', 'K', 'K',
3014
        'DAHER^$', 'DAHER', None,
3015
        'DARAUFFOLGE-----', 'DARAUF ', 'TARAUF ',
3016
        'DAVO(NR)-^$', 'DAFO', 'TAFU',
3017
        'DD(SZ)--<', '', '',
3018
        'DD9', 'D', None,
3019
        'DEPOT7', 'DEPO', 'TEBU',
3020
        'DESIGN', 'DISEIN', 'TIZEIN',
3021
        'DE(LMNRST)-3^', 'DE', 'TE',
3022
        'DETTE$', 'DET', 'TET',
3023
        'DH$', 'T', None,
3024
        'DIC$', 'DIZ', 'TIZ',
3025
        'DIDR-^', 'DIT', None,
3026
        'DIEDR-^', 'DIT', None,
3027
        'DJ(AEIOU)-^', 'I', 'I',
3028
        'DMITR-^', 'DIMIT', 'TINIT',
3029
        'DRY9^', 'DRÜ', None,
3030
        'DT-', '', '',
3031
        'DUIS-^', 'DÜ', 'TI',
3032
        'DURCH^^', 'DURCH', 'TURK',
3033
        'DVA$', 'TWA', None,
3034
        'DY9^', 'DÜ', None,
3035
        'DYS$', 'DIS', None,
3036
        'DS(CH)--<', 'T', 'T',
3037
        'DST', 'ZT', 'ZT',
3038
        'DZS(CH)--', 'T', 'T',
3039
        'D(SßZ)', 'Z', 'Z',
3040
        'D(AÄEIOÖRUÜY)-', 'D', None,
3041
        'D(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'D', None,
3042
        'D\'H^', 'D', 'T',
3043
        'D´H^', 'D', 'T',
3044
        'D`H^', 'D', 'T',
3045
        'D\'S3$', 'Z', 'Z',
3046
        'D´S3$', 'Z', 'Z',
3047
        'D^', 'D', None,
3048
        'D', 'T', 'T',
3049
        'EAULT$', 'O', 'U',
3050
        'EAUX$', 'O', 'U',
3051
        'EAU', 'O', 'U',
3052
        'EAV', 'IW', 'IF',
3053
        'EAS3$', 'EAS', None,
3054
        'EA(AÄEIOÖÜY)-3', 'EA', 'EA',
3055
        'EA3$', 'EA', 'EA',
3056
        'EA3', 'I', 'I',
3057
        'EBENSO^$', 'EBNSO', 'EBNZU',
3058
        'EBENSO^^', 'EBNSO ', 'EBNZU ',
3059
        'EBEN^^', 'EBN', 'EBN',
3060
        'EE9', 'E', 'E',
3061
        'EGL-1', 'EK', None,
3062
        'EHE(IUY)--1', 'EH', None,
3063
        'EHUNG---1', 'E', None,
3064
        'EH(AÄIOÖUÜY)-1', 'EH', None,
3065
        'EIEI--', '', '',
3066
        'EIERE^$', 'EIERE', None,
3067
        'EIERE$', 'EIERE', None,
3068
        'EIERE(NS)-$', 'EIERE', None,
3069
        'EIERE(AIOUY)--', 'EIER', None,
3070
        'EIER(AÄIOÖUÜY)-', 'EIER', None,
3071
        'EIER<', 'EIA', None,
3072
        'EIGL-1', 'EIK', None,
3073
        'EIGH$', 'EI', 'EI',
3074
        'EIH--', 'E', 'E',
3075
        'EILLE$', 'EI', 'EI',
3076
        'EIR(BCDFGKLMNQSTVWZ)-', 'EIA', 'EIA',
3077
        'EIR$', 'EIA', 'EIA',
3078
        'EITRAUBEN------', 'EIT ', 'EIT ',
3079
        'EI', 'EI', 'EI',
3080
        'EJ$', 'EI', 'EI',
3081
        'ELIZ^', 'ELIS', None,
3082
        'ELZ^', 'ELS', None,
3083
        'EL-^', 'E', 'E',
3084
        'ELANG----1', 'E', 'E',
3085
        'EL(DKL)--1', 'E', 'E',
3086
        'EL(MNT)--1$', 'E', 'E',
3087
        'ELYNE$', 'ELINE', 'ELINE',
3088
        'ELYN$', 'ELIN', 'ELIN',
3089
        'EL(AÄEÈÉÊIÌÍÎOÖUÜY)-1', 'EL', 'EL',
3090
        'EL-1', 'L', 'L',
3091
        'EM-^', None, 'E',
3092
        'EM(DFKMPQT)--1', None, 'E',
3093
        'EM(AÄEÈÉÊIÌÍÎOÖUÜY)--1', None, 'E',
3094
        'EM-1', None, 'N',
3095
        'ENGAG-^', 'ANGA', 'ANKA',
3096
        'EN-^', 'E', 'E',
3097
        'ENTUEL', 'ENTUEL', None,
3098
        'EN(CDGKQSTZ)--1', 'E', 'E',
3099
        'EN(AÄEÈÉÊIÌÍÎNOÖUÜY)-1', 'EN', 'EN',
3100
        'EN-1', '', '',
3101
        'ERH(AÄEIOÖUÜ)-^', 'ERH', 'ER',
3102
        'ER-^', 'E', 'E',
3103
        'ERREGEND-----', ' ER', ' ER',
3104
        'ERT1$', 'AT', None,
3105
        'ER(DGLKMNRQTZß)-1', 'ER', None,
3106
        'ER(AÄEÈÉÊIÌÍÎOÖUÜY)-1', 'ER', 'A',
3107
        'ER1$', 'A', 'A',
3108
        'ER<1', 'A', 'A',
3109
        'ETAT7', 'ETA', 'ETA',
3110
        'ETI(AÄOÖÜU)-', 'EZI', 'EZI',
3111
        'EUERE$', 'EUERE', None,
3112
        'EUERE(NS)-$', 'EUERE', None,
3113
        'EUERE(AIOUY)--', 'EUER', None,
3114
        'EUER(AÄIOÖUÜY)-', 'EUER', None,
3115
        'EUER<', 'EUA', None,
3116
        'EUEU--', '', '',
3117
        'EUILLE$', 'Ö', 'Ö',
3118
        'EUR$', 'ÖR', 'ÖR',
3119
        'EUX', 'Ö', 'Ö',
3120
        'EUSZ$', 'EUS', None,
3121
        'EUTZ$', 'EUS', None,
3122
        'EUYS$', 'EUS', 'EUZ',
3123
        'EUZ$', 'EUS', None,
3124
        'EU', 'EU', 'EU',
3125
        'EVER--<1', 'EW', None,
3126
        'EV(ÄOÖUÜ)-1', 'EW', None,
3127
        'EYER<', 'EIA', 'EIA',
3128
        'EY<', 'EI', 'EI',
3129
        'FACETTE', 'FASET', 'FAZET',
3130
        'FANS--^$', 'FE', 'FE',
3131
        'FAN-^$', 'FE', 'FE',
3132
        'FAULT-', 'FOL', 'FUL',
3133
        'FEE(DL)-', 'FI', 'FI',
3134
        'FEHLER', 'FELA', 'FELA',
3135
        'FE(LMNRST)-3^', 'FE', 'FE',
3136
        'FOERDERN---^', 'FÖRD', 'FÖRT',
3137
        'FOERDERN---', ' FÖRD', ' FÖRT',
3138
        'FOND7', 'FON', 'FUN',
3139
        'FRAIN$', 'FRA', 'FRA',
3140
        'FRISEU(RS)-', 'FRISÖ', 'FRIZÖ',
3141
        'FY9^', 'FÜ', None,
3142
        'FÖRDERN---^', 'FÖRD', 'FÖRT',
3143
        'FÖRDERN---', ' FÖRD', ' FÖRT',
3144
        'GAGS^$', 'GEX', 'KEX',
3145
        'GAG^$', 'GEK', 'KEK',
3146
        'GD', 'KT', 'KT',
3147
        'GEGEN^^', 'GEGN', 'KEKN',
3148
        'GEGENGEKOM-----', 'GEGN ', 'KEKN ',
3149
        'GEGENGESET-----', 'GEGN ', 'KEKN ',
3150
        'GEGENKOMME-----', 'GEGN ', 'KEKN ',
3151
        'GEGENZUKOM---', 'GEGN ZU ', 'KEKN ZU ',
3152
        'GENDETWAS-----$', 'GENT ', 'KENT ',
3153
        'GENRE', 'IORE', 'IURE',
3154
        'GE(LMNRST)-3^', 'GE', 'KE',
3155
        'GER(DKT)-', 'GER', None,
3156
        'GETTE$', 'GET', 'KET',
3157
        'GGF.', 'GF.', None,
3158
        'GG-', '', '',
3159
        'GH', 'G', None,
3160
        'GI(AOU)-^', 'I', 'I',
3161
        'GION-3', 'KIO', 'KIU',
3162
        'G(CK)-', '', '',
3163
        'GJ(AEIOU)-^', 'I', 'I',
3164
        'GMBH^$', 'GMBH', 'GMBH',
3165
        'GNAC$', 'NIAK', 'NIAK',
3166
        'GNON$', 'NION', 'NIUN',
3167
        'GN$', 'N', 'N',
3168
        'GONCAL-^', 'GONZA', 'KUNZA',
3169
        'GRY9^', 'GRÜ', None,
3170
        'G(SßXZ)-<', 'K', 'K',
3171
        'GUCK-', 'KU', 'KU',
3172
        'GUISEP-^', 'IUSE', 'IUZE',
3173
        'GUI-^', 'G', 'K',
3174
        'GUTAUSSEH------^', 'GUT ', 'KUT ',
3175
        'GUTGEHEND------^', 'GUT ', 'KUT ',
3176
        'GY9^', 'GÜ', None,
3177
        'G(AÄEILOÖRUÜY)-', 'G', None,
3178
        'G(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'G', None,
3179
        'G\'S$', 'X', 'X',
3180
        'G´S$', 'X', 'X',
3181
        'G^', 'G', None,
3182
        'G', 'K', 'K',
3183
        'HA(HIUY)--1', 'H', None,
3184
        'HANDVOL---^', 'HANT ', 'ANT ',
3185
        'HANNOVE-^', 'HANOF', None,
3186
        'HAVEN7$', 'HAFN', None,
3187
        'HEAD-', 'HE', 'E',
3188
        'HELIEGEN------', 'E ', 'E ',
3189
        'HESTEHEN------', 'E ', 'E ',
3190
        'HE(LMNRST)-3^', 'HE', 'E',
3191
        'HE(LMN)-1', 'E', 'E',
3192
        'HEUR1$', 'ÖR', 'ÖR',
3193
        'HE(HIUY)--1', 'H', None,
3194
        'HIH(AÄEIOÖUÜY)-1', 'IH', None,
3195
        'HLH(AÄEIOÖUÜY)-1', 'LH', None,
3196
        'HMH(AÄEIOÖUÜY)-1', 'MH', None,
3197
        'HNH(AÄEIOÖUÜY)-1', 'NH', None,
3198
        'HOBBY9^', 'HOBI', None,
3199
        'HOCHBEGAB-----^', 'HOCH ', 'UK ',
3200
        'HOCHTALEN-----^', 'HOCH ', 'UK ',
3201
        'HOCHZUFRI-----^', 'HOCH ', 'UK ',
3202
        'HO(HIY)--1', 'H', None,
3203
        'HRH(AÄEIOÖUÜY)-1', 'RH', None,
3204
        'HUH(AÄEIOÖUÜY)-1', 'UH', None,
3205
        'HUIS^^', 'HÜS', 'IZ',
3206
        'HUIS$', 'ÜS', 'IZ',
3207
        'HUI--1', 'H', None,
3208
        'HYGIEN^', 'HÜKIEN', None,
3209
        'HY9^', 'HÜ', None,
3210
        'HY(BDGMNPST)-', 'Ü', None,
3211
        'H.^', None, 'H.',
3212
        'HÄU--1', 'H', None,
3213
        'H^', 'H', '',
3214
        'H', '', '',
3215
        'ICHELL---', 'ISH', 'IZ',
3216
        'ICHI$', 'ISHI', 'IZI',
3217
        'IEC$', 'IZ', 'IZ',
3218
        'IEDENSTELLE------', 'IDN ', 'ITN ',
3219
        'IEI-3', '', '',
3220
        'IELL3', 'IEL', 'IEL',
3221
        'IENNE$', 'IN', 'IN',
3222
        'IERRE$', 'IER', 'IER',
3223
        'IERZULAN---', 'IR ZU ', 'IR ZU ',
3224
        'IETTE$', 'IT', 'IT',
3225
        'IEU', 'IÖ', 'IÖ',
3226
        'IE<4', 'I', 'I',
3227
        'IGL-1', 'IK', None,
3228
        'IGHT3$', 'EIT', 'EIT',
3229
        'IGNI(EO)-', 'INI', 'INI',
3230
        'IGN(AEOU)-$', 'INI', 'INI',
3231
        'IHER(DGLKRT)--1', 'IHE', None,
3232
        'IHE(IUY)--', 'IH', None,
3233
        'IH(AIOÖUÜY)-', 'IH', None,
3234
        'IJ(AOU)-', 'I', 'I',
3235
        'IJ$', 'I', 'I',
3236
        'IJ<', 'EI', 'EI',
3237
        'IKOLE$', 'IKOL', 'IKUL',
3238
        'ILLAN(STZ)--4', 'ILIA', 'ILIA',
3239
        'ILLAR(DT)--4', 'ILIA', 'ILIA',
3240
        'IMSTAN----^', 'IM ', 'IN ',
3241
        'INDELERREGE------', 'INDL ', 'INTL ',
3242
        'INFRAGE-----^$', 'IN ', 'IN ',
3243
        'INTERN(AOU)-^', 'INTAN', 'INTAN',
3244
        'INVER-', 'INWE', 'INFE',
3245
        'ITI(AÄIOÖUÜ)-', 'IZI', 'IZI',
3246
        'IUSZ$', 'IUS', None,
3247
        'IUTZ$', 'IUS', None,
3248
        'IUZ$', 'IUS', None,
3249
        'IVER--<', 'IW', None,
3250
        'IVIER$', 'IWIE', 'IFIE',
3251
        'IV(ÄOÖUÜ)-', 'IW', None,
3252
        'IV<3', 'IW', None,
3253
        'IY2', 'I', None,
3254
        'I(ÈÉÊ)<4', 'I', 'I',
3255
        'JAVIE---<^', 'ZA', 'ZA',
3256
        'JEANS^$', 'JINS', 'INZ',
3257
        'JEANNE^$', 'IAN', 'IAN',
3258
        'JEAN-^', 'IA', 'IA',
3259
        'JER-^', 'IE', 'IE',
3260
        'JE(LMNST)-', 'IE', 'IE',
3261
        'JI^', 'JI', None,
3262
        'JOR(GK)^$', 'IÖRK', 'IÖRK',
3263
        'J', 'I', 'I',
3264
        'KC(ÄEIJ)-', 'X', 'X',
3265
        'KD', 'KT', None,
3266
        'KE(LMNRST)-3^', 'KE', 'KE',
3267
        'KG(AÄEILOÖRUÜY)-', 'K', None,
3268
        'KH<^', 'K', 'K',
3269
        'KIC$', 'KIZ', 'KIZ',
3270
        'KLE(LMNRST)-3^', 'KLE', 'KLE',
3271
        'KOTELE-^', 'KOTL', 'KUTL',
3272
        'KREAT-^', 'KREA', 'KREA',
3273
        'KRÜS(TZ)--^', 'KRI', None,
3274
        'KRYS(TZ)--^', 'KRI', None,
3275
        'KRY9^', 'KRÜ', None,
3276
        'KSCH---', 'K', 'K',
3277
        'KSH--', 'K', 'K',
3278
        'K(SßXZ)7', 'X', 'X',  # implies 'KST' -> 'XT'
3279
        'KT\'S$', 'X', 'X',
3280
        'KTI(AIOU)-3', 'XI', 'XI',
3281
        'KT(SßXZ)', 'X', 'X',
3282
        'KY9^', 'KÜ', None,
3283
        'K\'S$', 'X', 'X',
3284
        'K´S$', 'X', 'X',
3285
        'LANGES$', ' LANGES', ' LANKEZ',
3286
        'LANGE$', ' LANGE', ' LANKE',
3287
        'LANG$', ' LANK', ' LANK',
3288
        'LARVE-', 'LARF', 'LARF',
3289
        'LD(SßZ)$', 'LS', 'LZ',
3290
        'LD\'S$', 'LS', 'LZ',
3291
        'LD´S$', 'LS', 'LZ',
3292
        'LEAND-^', 'LEAN', 'LEAN',
3293
        'LEERSTEHE-----^', 'LER ', 'LER ',
3294
        'LEICHBLEIB-----', 'LEICH ', 'LEIK ',
3295
        'LEICHLAUTE-----', 'LEICH ', 'LEIK ',
3296
        'LEIDERREGE------', 'LEIT ', 'LEIT ',
3297
        'LEIDGEPR----^', 'LEIT ', 'LEIT ',
3298
        'LEINSTEHE-----', 'LEIN ', 'LEIN ',
3299
        'LEL-', 'LE', 'LE',
3300
        'LE(MNRST)-3^', 'LE', 'LE',
3301
        'LETTE$', 'LET', 'LET',
3302
        'LFGNAG-', 'LFGAN', 'LFKAN',
3303
        'LICHERWEIS----', 'LICHA ', 'LIKA ',
3304
        'LIC$', 'LIZ', 'LIZ',
3305
        'LIVE^$', 'LEIF', 'LEIF',
3306
        'LT(SßZ)$', 'LS', 'LZ',
3307
        'LT\'S$', 'LS', 'LZ',
3308
        'LT´S$', 'LS', 'LZ',
3309
        'LUI(GS)--', 'LU', 'LU',
3310
        'LV(AIO)-', 'LW', None,
3311
        'LY9^', 'LÜ', None,
3312
        'LSTS$', 'LS', 'LZ',
3313
        'LZ(BDFGKLMNPQRSTVWX)-', 'LS', None,
3314
        'L(SßZ)$', 'LS', None,
3315
        'MAIR-<', 'MEI', 'NEI',
3316
        'MANAG-', 'MENE', 'NENE',
3317
        'MANUEL', 'MANUEL', None,
3318
        'MASSEU(RS)-', 'MASÖ', 'NAZÖ',
3319
        'MATCH', 'MESH', 'NEZ',
3320
        'MAURICE', 'MORIS', 'NURIZ',
3321
        'MBH^$', 'MBH', 'MBH',
3322
        'MB(ßZ)$', 'MS', None,
3323
        'MB(SßTZ)-', 'M', 'N',
3324
        'MCG9^', 'MAK', 'NAK',
3325
        'MC9^', 'MAK', 'NAK',
3326
        'MEMOIR-^', 'MEMOA', 'NENUA',
3327
        'MERHAVEN$', 'MAHAFN', None,
3328
        'ME(LMNRST)-3^', 'ME', 'NE',
3329
        'MEN(STZ)--3', 'ME', None,
3330
        'MEN$', 'MEN', None,
3331
        'MIGUEL-', 'MIGE', 'NIKE',
3332
        'MIKE^$', 'MEIK', 'NEIK',
3333
        'MITHILFE----^$', 'MIT H', 'NIT ',
3334
        'MN$', 'M', None,
3335
        'MN', 'N', 'N',
3336
        'MPJUTE-', 'MPUT', 'NBUT',
3337
        'MP(ßZ)$', 'MS', None,
3338
        'MP(SßTZ)-', 'M', 'N',
3339
        'MP(BDJLMNPQVW)-', 'MB', 'NB',
3340
        'MY9^', 'MÜ', None,
3341
        'M(ßZ)$', 'MS', None,
3342
        'M´G7^', 'MAK', 'NAK',
3343
        'M\'G7^', 'MAK', 'NAK',
3344
        'M´^', 'MAK', 'NAK',
3345
        'M\'^', 'MAK', 'NAK',
3346
        'M', None, 'N',
3347
        'NACH^^', 'NACH', 'NAK',
3348
        'NADINE', 'NADIN', 'NATIN',
3349
        'NAIV--', 'NA', 'NA',
3350
        'NAISE$', 'NESE', 'NEZE',
3351
        'NAUGENOMM------', 'NAU ', 'NAU ',
3352
        'NAUSOGUT$', 'NAUSO GUT', 'NAUZU KUT',
3353
        'NCH$', 'NSH', 'NZ',
3354
        'NCOISE$', 'SOA', 'ZUA',
3355
        'NCOIS$', 'SOA', 'ZUA',
3356
        'NDAR$', 'NDA', 'NTA',
3357
        'NDERINGEN------', 'NDE ', 'NTE ',
3358
        'NDRO(CDKTZ)-', 'NTRO', None,
3359
        'ND(BFGJLMNPQVW)-', 'NT', None,
3360
        'ND(SßZ)$', 'NS', 'NZ',
3361
        'ND\'S$', 'NS', 'NZ',
3362
        'ND´S$', 'NS', 'NZ',
3363
        'NEBEN^^', 'NEBN', 'NEBN',
3364
        'NENGELERN------', 'NEN ', 'NEN ',
3365
        'NENLERN(ET)---', 'NEN LE', 'NEN LE',
3366
        'NENZULERNE---', 'NEN ZU LE', 'NEN ZU LE',
3367
        'NE(LMNRST)-3^', 'NE', 'NE',
3368
        'NEN-3', 'NE', 'NE',
3369
        'NETTE$', 'NET', 'NET',
3370
        'NGU^^', 'NU', 'NU',
3371
        'NG(BDFJLMNPQRTVW)-', 'NK', 'NK',
3372
        'NH(AUO)-$', 'NI', 'NI',
3373
        'NICHTSAHNEN-----', 'NIX ', 'NIX ',
3374
        'NICHTSSAGE----', 'NIX ', 'NIX ',
3375
        'NICHTS^^', 'NIX', 'NIX',
3376
        'NICHT^^', 'NICHT', 'NIKT',
3377
        'NINE$', 'NIN', 'NIN',
3378
        'NON^^', 'NON', 'NUN',
3379
        'NOTLEIDE-----^', 'NOT ', 'NUT ',
3380
        'NOT^^', 'NOT', 'NUT',
3381
        'NTI(AIOU)-3', 'NZI', 'NZI',
3382
        'NTIEL--3', 'NZI', 'NZI',
3383
        'NT(SßZ)$', 'NS', 'NZ',
3384
        'NT\'S$', 'NS', 'NZ',
3385
        'NT´S$', 'NS', 'NZ',
3386
        'NYLON', 'NEILON', 'NEILUN',
3387
        'NY9^', 'NÜ', None,
3388
        'NSTZUNEH---', 'NST ZU ', 'NZT ZU ',
3389
        'NSZ-', 'NS', None,
3390
        'NSTS$', 'NS', 'NZ',
3391
        'NZ(BDFGKLMNPQRSTVWX)-', 'NS', None,
3392
        'N(SßZ)$', 'NS', None,
3393
        'OBERE-', 'OBER', None,
3394
        'OBER^^', 'OBA', 'UBA',
3395
        'OEU2', 'Ö', 'Ö',
3396
        'OE<2', 'Ö', 'Ö',
3397
        'OGL-', 'OK', None,
3398
        'OGNIE-', 'ONI', 'UNI',
3399
        'OGN(AEOU)-$', 'ONI', 'UNI',
3400
        'OH(AIOÖUÜY)-', 'OH', None,
3401
        'OIE$', 'Ö', 'Ö',
3402
        'OIRE$', 'OA', 'UA',
3403
        'OIR$', 'OA', 'UA',
3404
        'OIX', 'OA', 'UA',
3405
        'OI<3', 'EU', 'EU',
3406
        'OKAY^$', 'OKE', 'UKE',
3407
        'OLYN$', 'OLIN', 'ULIN',
3408
        'OO(DLMZ)-', 'U', None,
3409
        'OO$', 'U', None,
3410
        'OO-', '', '',
3411
        'ORGINAL-----', 'ORI', 'URI',
3412
        'OTI(AÄOÖUÜ)-', 'OZI', 'UZI',
3413
        'OUI^', 'WI', 'FI',
3414
        'OUILLE$', 'ULIE', 'ULIE',
3415
        'OU(DT)-^', 'AU', 'AU',
3416
        'OUSE$', 'AUS', 'AUZ',
3417
        'OUT-', 'AU', 'AU',
3418
        'OU', 'U', 'U',
3419
        'O(FV)$', 'AU', 'AU',  # due to 'OW$' -> 'AU'
3420
        'OVER--<', 'OW', None,
3421
        'OV(AOU)-', 'OW', None,
3422
        'OW$', 'AU', 'AU',
3423
        'OWS$', 'OS', 'UZ',
3424
        'OJ(AÄEIOÖUÜ)--', 'O', 'U',
3425
        'OYER', 'OIA', None,
3426
        'OY(AÄEIOÖUÜ)--', 'O', 'U',
3427
        'O(JY)<', 'EU', 'EU',
3428
        'OZ$', 'OS', None,
3429
        'O´^', 'O', 'U',
3430
        'O\'^', 'O', 'U',
3431
        'O', None, 'U',
3432
        'PATIEN--^', 'PAZI', 'PAZI',
3433
        'PENSIO-^', 'PANSI', 'PANZI',
3434
        'PE(LMNRST)-3^', 'PE', 'PE',
3435
        'PFER-^', 'FE', 'FE',
3436
        'P(FH)<', 'F', 'F',
3437
        'PIC^$', 'PIK', 'PIK',
3438
        'PIC$', 'PIZ', 'PIZ',
3439
        'PIPELINE', 'PEIBLEIN', 'PEIBLEIN',
3440
        'POLYP-', 'POLÜ', None,
3441
        'POLY^^', 'POLI', 'PULI',
3442
        'PORTRAIT7', 'PORTRE', 'PURTRE',
3443
        'POWER7', 'PAUA', 'PAUA',
3444
        'PP(FH)--<', 'B', 'B',
3445
        'PP-', '', '',
3446
        'PRODUZ-^', 'PRODU', 'BRUTU',
3447
        'PRODUZI--', ' PRODU', ' BRUTU',
3448
        'PRIX^$', 'PRI', 'PRI',
3449
        'PS-^^', 'P', None,
3450
        'P(SßZ)^', None, 'Z',
3451
        'P(SßZ)$', 'BS', None,
3452
        'PT-^', '', '',
3453
        'PTI(AÄOÖUÜ)-3', 'BZI', 'BZI',
3454
        'PY9^', 'PÜ', None,
3455
        'P(AÄEIOÖRUÜY)-', 'P', 'P',
3456
        'P(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'P', None,
3457
        'P.^', None, 'P.',
3458
        'P^', 'P', None,
3459
        'P', 'B', 'B',
3460
        'QI-', 'Z', 'Z',
3461
        'QUARANT--', 'KARA', 'KARA',
3462
        'QUE(LMNRST)-3', 'KWE', 'KFE',
3463
        'QUE$', 'K', 'K',
3464
        'QUI(NS)$', 'KI', 'KI',
3465
        'QUIZ7', 'KWIS', None,
3466
        'Q(UV)7', 'KW', 'KF',
3467
        'Q<', 'K', 'K',
3468
        'RADFAHR----', 'RAT ', 'RAT ',
3469
        'RAEFTEZEHRE-----', 'REFTE ', 'REFTE ',
3470
        'RCH', 'RCH', 'RK',
3471
        'REA(DU)---3^', 'R', None,
3472
        'REBSERZEUG------', 'REBS ', 'REBZ ',
3473
        'RECHERCH^', 'RESHASH', 'REZAZ',
3474
        'RECYCL--', 'RIZEI', 'RIZEI',
3475
        'RE(ALST)-3^', 'RE', None,
3476
        'REE$', 'RI', 'RI',
3477
        'RER$', 'RA', 'RA',
3478
        'RE(MNR)-4', 'RE', 'RE',
3479
        'RETTE$', 'RET', 'RET',
3480
        'REUZ$', 'REUZ', None,
3481
        'REW$', 'RU', 'RU',
3482
        'RH<^', 'R', 'R',
3483
        'RJA(MN)--', 'RI', 'RI',
3484
        'ROWD-^', 'RAU', 'RAU',
3485
        'RTEMONNAIE-', 'RTMON', 'RTNUN',
3486
        'RTI(AÄOÖUÜ)-3', 'RZI', 'RZI',
3487
        'RTIEL--3', 'RZI', 'RZI',
3488
        'RV(AEOU)-3', 'RW', None,
3489
        'RY(KN)-$', 'RI', 'RI',
3490
        'RY9^', 'RÜ', None,
3491
        'RÄFTEZEHRE-----', 'REFTE ', 'REFTE ',
3492
        'SAISO-^', 'SES', 'ZEZ',
3493
        'SAFE^$', 'SEIF', 'ZEIF',
3494
        'SAUCE-^', 'SOS', 'ZUZ',
3495
        'SCHLAGGEBEN-----<', 'SHLAK ', 'ZLAK ',
3496
        'SCHSCH---7', '', '',
3497
        'SCHTSCH', 'SH', 'Z',
3498
        'SC(HZ)<', 'SH', 'Z',
3499
        'SC', 'SK', 'ZK',
3500
        'SELBSTST--7^^', 'SELB', 'ZELB',
3501
        'SELBST7^^', 'SELBST', 'ZELBZT',
3502
        'SERVICE7^', 'SÖRWIS', 'ZÖRFIZ',
3503
        'SERVI-^', 'SERW', None,
3504
        'SE(LMNRST)-3^', 'SE', 'ZE',
3505
        'SETTE$', 'SET', 'ZET',
3506
        'SHP-^', 'S', 'Z',
3507
        'SHST', 'SHT', 'ZT',
3508
        'SHTSH', 'SH', 'Z',
3509
        'SHT', 'ST', 'Z',
3510
        'SHY9^', 'SHÜ', None,
3511
        'SH^^', 'SH', None,
3512
        'SH3', 'SH', 'Z',
3513
        'SICHERGEGAN-----^', 'SICHA ', 'ZIKA ',
3514
        'SICHERGEHE----^', 'SICHA ', 'ZIKA ',
3515
        'SICHERGESTEL------^', 'SICHA ', 'ZIKA ',
3516
        'SICHERSTELL-----^', 'SICHA ', 'ZIKA ',
3517
        'SICHERZU(GS)--^', 'SICHA ZU ', 'ZIKA ZU ',
3518
        'SIEGLI-^', 'SIKL', 'ZIKL',
3519
        'SIGLI-^', 'SIKL', 'ZIKL',
3520
        'SIGHT', 'SEIT', 'ZEIT',
3521
        'SIGN', 'SEIN', 'ZEIN',
3522
        'SKI(NPZ)-', 'SKI', 'ZKI',
3523
        'SKI<^', 'SHI', 'ZI',
3524
        'SODASS^$', 'SO DAS', 'ZU TAZ',
3525
        'SODAß^$', 'SO DAS', 'ZU TAZ',
3526
        'SOGENAN--^', 'SO GEN', 'ZU KEN',
3527
        'SOUND-', 'SAUN', 'ZAUN',
3528
        'STAATS^^', 'STAZ', 'ZTAZ',
3529
        'STADT^^', 'STAT', 'ZTAT',
3530
        'STANDE$', ' STANDE', ' ZTANTE',
3531
        'START^^', 'START', 'ZTART',
3532
        'STAURANT7', 'STORAN', 'ZTURAN',
3533
        'STEAK-', 'STE', 'ZTE',
3534
        'STEPHEN-^$', 'STEW', None,
3535
        'STERN', 'STERN', None,
3536
        'STRAF^^', 'STRAF', 'ZTRAF',
3537
        'ST\'S$', 'Z', 'Z',
3538
        'ST´S$', 'Z', 'Z',
3539
        'STST--', '', '',
3540
        'STS(ACEÈÉÊHIÌÍÎOUÄÜÖ)--', 'ST', 'ZT',
3541
        'ST(SZ)', 'Z', 'Z',
3542
        'SPAREN---^', 'SPA', 'ZPA',
3543
        'SPAREND----', ' SPA', ' ZPA',
3544
        'S(PTW)-^^', 'S', None,
3545
        'SP', 'SP', None,
3546
        'STYN(AE)-$', 'STIN', 'ZTIN',
3547
        'ST', 'ST', 'ZT',
3548
        'SUITE<', 'SIUT', 'ZIUT',
3549
        'SUKE--$', 'S', 'Z',
3550
        'SURF(EI)-', 'SÖRF', 'ZÖRF',
3551
        'SV(AEÈÉÊIÌÍÎOU)-<^', 'SW', None,
3552
        'SYB(IY)--^', 'SIB', None,
3553
        'SYL(KVW)--^', 'SI', None,
3554
        'SY9^', 'SÜ', None,
3555
        'SZE(NPT)-^', 'ZE', 'ZE',
3556
        'SZI(ELN)-^', 'ZI', 'ZI',
3557
        'SZCZ<', 'SH', 'Z',
3558
        'SZT<', 'ST', 'ZT',
3559
        'SZ<3', 'SH', 'Z',
3560
        'SÜL(KVW)--^', 'SI', None,
3561
        'S', None, 'Z',
3562
        'TCH', 'SH', 'Z',
3563
        'TD(AÄEIOÖRUÜY)-', 'T', None,
3564
        'TD(ÀÁÂÃÅÈÉÊËÌÍÎÏÒÓÔÕØÙÚÛÝŸ)-', 'T', None,
3565
        'TEAT-^', 'TEA', 'TEA',
3566
        'TERRAI7^', 'TERA', 'TERA',
3567
        'TE(LMNRST)-3^', 'TE', 'TE',
3568
        'TH<', 'T', 'T',
3569
        'TICHT-', 'TIK', 'TIK',
3570
        'TICH$', 'TIK', 'TIK',
3571
        'TIC$', 'TIZ', 'TIZ',
3572
        'TIGGESTELL-------', 'TIK ', 'TIK ',
3573
        'TIGSTELL-----', 'TIK ', 'TIK ',
3574
        'TOAS-^', 'TO', 'TU',
3575
        'TOILET-', 'TOLE', 'TULE',
3576
        'TOIN-', 'TOA', 'TUA',
3577
        'TRAECHTI-^', 'TRECHT', 'TREKT',
3578
        'TRAECHTIG--', ' TRECHT', ' TREKT',
3579
        'TRAINI-', 'TREN', 'TREN',
3580
        'TRÄCHTI-^', 'TRECHT', 'TREKT',
3581
        'TRÄCHTIG--', ' TRECHT', ' TREKT',
3582
        'TSCH', 'SH', 'Z',
3583
        'TSH', 'SH', 'Z',
3584
        'TST', 'ZT', 'ZT',
3585
        'T(Sß)', 'Z', 'Z',
3586
        'TT(SZ)--<', '', '',
3587
        'TT9', 'T', 'T',
3588
        'TV^$', 'TV', 'TV',
3589
        'TX(AEIOU)-3', 'SH', 'Z',
3590
        'TY9^', 'TÜ', None,
3591
        'TZ-', '', '',
3592
        'T\'S3$', 'Z', 'Z',
3593
        'T´S3$', 'Z', 'Z',
3594
        'UEBEL(GNRW)-^^', 'ÜBL ', 'IBL ',
3595
        'UEBER^^', 'ÜBA', 'IBA',
3596
        'UE2', 'Ü', 'I',
3597
        'UGL-', 'UK', None,
3598
        'UH(AOÖUÜY)-', 'UH', None,
3599
        'UIE$', 'Ü', 'I',
3600
        'UM^^', 'UM', 'UN',
3601
        'UNTERE--3', 'UNTE', 'UNTE',
3602
        'UNTER^^', 'UNTA', 'UNTA',
3603
        'UNVER^^', 'UNFA', 'UNFA',
3604
        'UN^^', 'UN', 'UN',
3605
        'UTI(AÄOÖUÜ)-', 'UZI', 'UZI',
3606
        'UVE-4', 'UW', None,
3607
        'UY2', 'UI', None,
3608
        'UZZ', 'AS', 'AZ',
3609
        'VACL-^', 'WAZ', 'FAZ',
3610
        'VAC$', 'WAZ', 'FAZ',
3611
        'VAN DEN ^', 'FANDN', 'FANTN',
3612
        'VANES-^', 'WANE', None,
3613
        'VATRO-', 'WATR', None,
3614
        'VA(DHJNT)--^', 'F', None,
3615
        'VEDD-^', 'FE', 'FE',
3616
        'VE(BEHIU)--^', 'F', None,
3617
        'VEL(BDLMNT)-^', 'FEL', None,
3618
        'VENTZ-^', 'FEN', None,
3619
        'VEN(NRSZ)-^', 'FEN', None,
3620
        'VER(AB)-^$', 'WER', None,
3621
        'VERBAL^$', 'WERBAL', None,
3622
        'VERBAL(EINS)-^', 'WERBAL', None,
3623
        'VERTEBR--', 'WERTE', None,
3624
        'VEREIN-----', 'F', None,
3625
        'VEREN(AEIOU)-^', 'WEREN', None,
3626
        'VERIFI', 'WERIFI', None,
3627
        'VERON(AEIOU)-^', 'WERON', None,
3628
        'VERSEN^', 'FERSN', 'FAZN',
3629
        'VERSIERT--^', 'WERSI', None,
3630
        'VERSIO--^', 'WERS', None,
3631
        'VERSUS', 'WERSUS', None,
3632
        'VERTI(GK)-', 'WERTI', None,
3633
        'VER^^', 'FER', 'FA',
3634
        'VERSPRECHE-------', ' FER', ' FA',
3635
        'VER$', 'WA', None,
3636
        'VER', 'FA', 'FA',
3637
        'VET(HT)-^', 'FET', 'FET',
3638
        'VETTE$', 'WET', 'FET',
3639
        'VE^', 'WE', None,
3640
        'VIC$', 'WIZ', 'FIZ',
3641
        'VIELSAGE----', 'FIL ', 'FIL ',
3642
        'VIEL', 'FIL', 'FIL',
3643
        'VIEW', 'WIU', 'FIU',
3644
        'VILL(AE)-', 'WIL', None,
3645
        'VIS(ACEIKUVWZ)-<^', 'WIS', None,
3646
        'VI(ELS)--^', 'F', None,
3647
        'VILLON--', 'WILI', 'FILI',
3648
        'VIZE^^', 'FIZE', 'FIZE',
3649
        'VLIE--^', 'FL', None,
3650
        'VL(AEIOU)--', 'W', None,
3651
        'VOKA-^', 'WOK', None,
3652
        'VOL(ATUVW)--^', 'WO', None,
3653
        'VOR^^', 'FOR', 'FUR',
3654
        'VR(AEIOU)--', 'W', None,
3655
        'VV9', 'W', None,
3656
        'VY9^', 'WÜ', 'FI',
3657
        'V(ÜY)-', 'W', None,
3658
        'V(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'W', None,
3659
        'V(AEIJLRU)-<', 'W', None,
3660
        'V.^', 'V.', None,
3661
        'V<', 'F', 'F',
3662
        'WEITERENTWI-----^', 'WEITA ', 'FEITA ',
3663
        'WEITREICH-----^', 'WEIT ', 'FEIT ',
3664
        'WEITVER^', 'WEIT FER', 'FEIT FA',
3665
        'WE(LMNRST)-3^', 'WE', 'FE',
3666
        'WER(DST)-', 'WER', None,
3667
        'WIC$', 'WIZ', 'FIZ',
3668
        'WIEDERU--', 'WIDE', 'FITE',
3669
        'WIEDER^$', 'WIDA', 'FITA',
3670
        'WIEDER^^', 'WIDA ', 'FITA ',
3671
        'WIEVIEL', 'WI FIL', 'FI FIL',
3672
        'WISUEL', 'WISUEL', None,
3673
        'WR-^', 'W', None,
3674
        'WY9^', 'WÜ', 'FI',
3675
        'W(BDFGJKLMNPQRSTZ)-', 'F', None,
3676
        'W$', 'F', None,
3677
        'W', None, 'F',
3678
        'X<^', 'Z', 'Z',
3679
        'XHAVEN$', 'XAFN', None,
3680
        'X(CSZ)', 'X', 'X',
3681
        'XTS(CH)--', 'XT', 'XT',
3682
        'XT(SZ)', 'Z', 'Z',
3683
        'YE(LMNRST)-3^', 'IE', 'IE',
3684
        'YE-3', 'I', 'I',
3685
        'YOR(GK)^$', 'IÖRK', 'IÖRK',
3686
        'Y(AOU)-<7', 'I', 'I',
3687
        'Y(BKLMNPRSTX)-1', 'Ü', None,
3688
        'YVES^$', 'IF', 'IF',
3689
        'YVONNE^$', 'IWON', 'IFUN',
3690
        'Y.^', 'Y.', None,
3691
        'Y', 'I', 'I',
3692
        'ZC(AOU)-', 'SK', 'ZK',
3693
        'ZE(LMNRST)-3^', 'ZE', 'ZE',
3694
        'ZIEJ$', 'ZI', 'ZI',
3695
        'ZIGERJA(HR)-3', 'ZIGA IA', 'ZIKA IA',
3696
        'ZL(AEIOU)-', 'SL', None,
3697
        'ZS(CHT)--', '', '',
3698
        'ZS', 'SH', 'Z',
3699
        'ZUERST', 'ZUERST', 'ZUERST',
3700
        'ZUGRUNDE^$', 'ZU GRUNDE', 'ZU KRUNTE',
3701
        'ZUGRUNDE', 'ZU GRUNDE ', 'ZU KRUNTE ',
3702
        'ZUGUNSTEN', 'ZU GUNSTN', 'ZU KUNZTN',
3703
        'ZUHAUSE-', 'ZU HAUS', 'ZU AUZ',
3704
        'ZULASTEN^$', 'ZU LASTN', 'ZU LAZTN',
3705
        'ZURUECK^^', 'ZURÜK', 'ZURIK',
3706
        'ZURZEIT', 'ZUR ZEIT', 'ZUR ZEIT',
3707
        'ZURÜCK^^', 'ZURÜK', 'ZURIK',
3708
        'ZUSTANDE', 'ZU STANDE', 'ZU ZTANTE',
3709
        'ZUTAGE', 'ZU TAGE', 'ZU TAKE',
3710
        'ZUVER^^', 'ZUFA', 'ZUFA',
3711
        'ZUVIEL', 'ZU FIL', 'ZU FIL',
3712
        'ZUWENIG', 'ZU WENIK', 'ZU FENIK',
3713
        'ZY9^', 'ZÜ', None,
3714
        'ZYK3$', 'ZIK', None,
3715
        'Z(VW)7^', 'SW', None,
3716
        None, None, None)
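
Each rule pattern in the table above packs several pieces into one string: literal letters, an optional parenthesised set of alternatives, trailing `-` markers, an optional `<`, an optional one-digit priority, and `^` / `$` anchors. The sketch below splits a pattern into those pieces, based only on how the matching loop in this listing reads them. The names `parse_rule` and `_RULE_RE` are hypothetical and not part of Abydos, and the sketch assumes the pattern begins with an ordinary character.

```python
import re

# Hypothetical helper, not part of Abydos: split a rule pattern into parts.
_RULE_RE = re.compile(
    r'^(?P<literal>[^(\-<^$0-9]+)'   # leading literal characters
    r'(?:\((?P<group>[^)]*)\))?'     # optional set of alternatives, e.g. (DKT)
    r'(?P<minus>-*)'                 # each '-' un-consumes one matched letter
    r'(?P<lt><?)'                    # optional '<' flag
    r'(?P<priority>[0-9]?)'          # optional one-digit priority
    r'(?P<start>\^{0,2})'            # '^' or '^^': anchor at word start
    r'(?P<end>\$?)$'                 # '$': anchor at word end
)


def parse_rule(pattern):
    """Return the components of a phonet-style rule pattern as a dict."""
    match = _RULE_RE.match(pattern)
    if match is None:
        raise ValueError('unrecognised rule pattern: {!r}'.format(pattern))
    return {
        'literal': match.group('literal'),
        'group': match.group('group') or '',
        'not_consumed': len(match.group('minus')),
        'lt_flag': bool(match.group('lt')),
        # The matching loop falls back to priority 5 when no digit is given.
        'priority': int(match.group('priority') or 5),
        'start_anchor': len(match.group('start')),
        'end_anchor': bool(match.group('end')),
    }


if __name__ == '__main__':
    for example in ('EVER--<1', 'K(SßXZ)7', 'FANS--^$'):
        print(example, parse_rule(example))
```

The real matcher walks the pattern character by character rather than using a regular expression; the regex form is only meant to make the field order easy to see.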
3717
3718
    phonet_hash = Counter()
3719
    alpha_pos = Counter()
3720
3721
    phonet_hash_1 = Counter()
3722
    phonet_hash_2 = Counter()
3723
3724
    _phonet_upper_translation = dict(zip((ord(_) for _ in
3725
                                          'abcdefghijklmnopqrstuvwxyzàáâãåäæ' +
3726
                                          'çðèéêëìíîïñòóôõöøœšßþùúûüýÿ'),
3727
                                         'ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÅÄÆ' +
3728
                                         'ÇÐÈÉÊËÌÍÎÏÑÒÓÔÕÖØŒŠßÞÙÚÛÜÝŸ'))
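
The table above maps the code point of each lower-case letter to its upper-case form so that `str.translate` can upper-case a term in a single pass. A minimal standalone sketch of the same idea follows; the names `LOWER`, `UPPER`, `UPPER_TABLE` and `to_upper` are illustrative and do not appear in Abydos.

```python
# Illustrative names only; the listing builds the same mapping inline.
LOWER = 'abcdefghijklmnopqrstuvwxyzàáâãåäæçðèéêëìíîïñòóôõöøœšßþùúûüýÿ'
UPPER = 'ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÅÄÆÇÐÈÉÊËÌÍÎÏÑÒÓÔÕÖØŒŠßÞÙÚÛÜÝŸ'

# str.translate expects a mapping keyed by integer code points.
UPPER_TABLE = dict(zip((ord(c) for c in LOWER), UPPER))


def to_upper(term):
    """Upper-case a term one character at a time, leaving 'ß' as 'ß'."""
    return term.translate(UPPER_TABLE)


if __name__ == '__main__':
    print(to_upper('straße'))
    print('straße'.upper())
```

One plausible reason for an explicit table rather than `str.upper()` is that `'ß'.upper()` yields `'SS'` in Python 3, which changes the string length and would bypass the rules written for `ß`.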
3729
3730
    def _trinfo(text, rule, err_text, lang):
3731
        """Output debug information."""
3732
        if lang == 'none':
3733
            _phonet_rules = _phonet_rules_no_lang
3734
        else:
3735
            _phonet_rules = _phonet_rules_german
3736
3737
        from_rule = ('(NULL)' if _phonet_rules[rule] is None else
3738
                     _phonet_rules[rule])
3739
        to_rule1 = ('(NULL)' if (_phonet_rules[rule + 1] is None) else
3740
                    _phonet_rules[rule + 1])
3741
        to_rule2 = ('(NULL)' if (_phonet_rules[rule + 2] is None) else
3742
                    _phonet_rules[rule + 2])
3743
        print('"{} {}:  "{}"{}"{}" {}'.format(text, ((rule // 3) + 1),
3744
                                              from_rule, to_rule1, to_rule2,
3745
                                              err_text))
3746
3747
    def _initialize_phonet(lang):
3748
        """Initialize phonet variables."""
3749
        if lang == 'none':
3750
            _phonet_rules = _phonet_rules_no_lang
3751
        else:
3752
            _phonet_rules = _phonet_rules_german
3753
3754
        phonet_hash[''] = -1
3755
3756
        # German and international umlauts
3757
        for j in {'À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë',
3758
                  'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø',
3759
                  'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'Þ', 'ß', 'Œ', 'Š', 'Ÿ'}:
3760
            alpha_pos[j] = 1
3761
            phonet_hash[j] = -1
3762
3763
        # "normal" letters ('A'-'Z')
3764
        for i, j in enumerate('ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
3765
            alpha_pos[j] = i + 2
3766
            phonet_hash[j] = -1
3767
3768
        for i in range(26):
3769
            for j in range(28):
3770
                phonet_hash_1[i, j] = -1
3771
                phonet_hash_2[i, j] = -1
3772
3773
        # for each phonetic rule
3774
        for i in range(len(_phonet_rules)):
3775
            rule = _phonet_rules[i]
3776
3777
            if rule and i % 3 == 0:
3778
                # calculate first hash value
3779
                k = _phonet_rules[i][0]
3780
3781
                if phonet_hash[k] < 0 and (_phonet_rules[i+1] or
3782
                                           _phonet_rules[i+2]):
3783
                    phonet_hash[k] = i
3784
3785
                # calculate second hash values
3786
                if k and alpha_pos[k] >= 2:
3787
                    k = alpha_pos[k]
3788
3789
                    j = k-2
3790
                    rule = rule[1:]
3791
3792
                    if not rule:
3793
                        rule = ' '
3794
                    elif rule[0] == '(':
3795
                        rule = rule[1:]
3796
                    else:
3797
                        rule = rule[0]
3798
3799
                    while rule and (rule[0] != ')'):
3800
                        k = alpha_pos[rule[0]]
3801
3802
                        if k > 0:
3803
                            # add hash value for this letter
3804
                            if phonet_hash_1[j, k] < 0:
3805
                                phonet_hash_1[j, k] = i
3806
                                phonet_hash_2[j, k] = i
3807
3808
                            if phonet_hash_2[j, k] >= (i-30):
3809
                                phonet_hash_2[j, k] = i
3810
                            else:
3811
                                k = -1
3812
3813
                        if k <= 0:
3814
                            # add hash value for all letters
3815
                            if phonet_hash_1[j, 0] < 0:
3816
                                phonet_hash_1[j, 0] = i
3817
3818
                            phonet_hash_2[j, 0] = i
3819
3820
                        rule = rule[1:]
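
The first-level lookup built by `_initialize_phonet` records, for each leading character, the index of the first rule triple that has at least one replacement. A simplified sketch of that step alone is shown here, using a plain dict instead of the `Counter` objects in the listing. `first_rule_index` and `SAMPLE_RULES` are hypothetical names, and the two-letter second-level tables are deliberately left out.

```python
# Hypothetical sample in the same flat (pattern, repl1, repl2) layout.
SAMPLE_RULES = ('GD', 'KT', 'KT',
                'GH', 'G', None,
                'G', 'K', 'K',
                'HEAD-', 'HE', 'E',
                None, None, None)


def first_rule_index(rules):
    """Map each leading character to the index of its first usable rule."""
    index = {}
    for i in range(0, len(rules), 3):
        pattern, repl1, repl2 = rules[i:i + 3]
        # Mirrors the truthiness test in the listing: the rule is recorded
        # only if at least one replacement is a non-empty string.
        if pattern and (repl1 or repl2):
            index.setdefault(pattern[0], i)
    return index


if __name__ == '__main__':
    print(first_rule_index(SAMPLE_RULES))
```

Because the rules for one letter are stored contiguously, knowing the first index is enough for the matcher to scan forward until the leading character changes.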
3821
3822
    def _phonet(term, mode, lang, trace):
3823
        """Return the phonet coded form of a term."""
3824
        if lang == 'none':
3825
            _phonet_rules = _phonet_rules_no_lang
3826
        else:
3827
            _phonet_rules = _phonet_rules_german
3828
3829
        char0 = ''
3830
        dest = term
3831
3832
        if not term:
3833
            return ''
3834
3835
        term_length = len(term)
3836
3837
        # convert input string to upper-case
3838
        src = term.translate(_phonet_upper_translation)
3839
3840
        # check "src"
3841
        i = 0
3842
        j = 0
3843
        zeta = 0
3844
3845
        while i < len(src):
3846
            char = src[i]
3847
3848
            if trace:
3849
                print('\ncheck position {}:  src = "{}",  dest = "{}"'.format
3850
                      (j, src[i:], dest[:j]))
3851
3852
            pos = alpha_pos[char]
3853
3854
            if pos >= 2:
3855
                xpos = pos-2
3856
3857
                if i+1 == len(src):
3858
                    pos = alpha_pos['']
3859
                else:
3860
                    pos = alpha_pos[src[i+1]]
3861
3862
                start1 = phonet_hash_1[xpos, pos]
3863
                start2 = phonet_hash_1[xpos, 0]
3864
                end1 = phonet_hash_2[xpos, pos]
3865
                end2 = phonet_hash_2[xpos, 0]
3866
3867
                # preserve rule priorities
3868
                if (start2 >= 0) and ((start1 < 0) or (start2 < start1)):
3869
                    pos = start1
3870
                    start1 = start2
3871
                    start2 = pos
3872
                    pos = end1
3873
                    end1 = end2
3874
                    end2 = pos
3875
3876
                if (end1 >= start2) and (start2 >= 0):
3877
                    if end2 > end1:
3878
                        end1 = end2
3879
3880
                    start2 = -1
3881
                    end2 = -1
3882
            else:
3883
                pos = phonet_hash[char]
3884
                start1 = pos
3885
                end1 = 10000
3886
                start2 = -1
3887
                end2 = -1
3888
3889
            pos = start1
3890
            zeta0 = 0
3891
3892
            if pos >= 0:
3893
                # check rules for this char
3894
                while ((_phonet_rules[pos] is None) or
3895
                       (_phonet_rules[pos][0] == char)):
3896
                    if pos > end1:
3897
                        if start2 > 0:
3898
                            pos = start2
3899
                            start1 = start2
3900
                            start2 = -1
3901
                            end1 = end2
3902
                            end2 = -1
3903
                            continue
3904
3905
                        break
3906
3907
                    if (((_phonet_rules[pos] is None) or
3908
                         (_phonet_rules[pos + mode] is None))):
3909
                        # no conversion rule available
3910
                        pos += 3
3911
                        continue
3912
3913
                    if trace:
3914
                        _trinfo('> rule no.', pos, 'is being checked', lang)
3915
3916
                    # check whole string
3917
                    matches = 1  # number of matching letters
3918
                    priority = 5  # default priority
3919
                    rule = _phonet_rules[pos]
3920
                    rule = rule[1:]
3921
3922
                    while (rule and
3923
                           (len(src) > (i + matches)) and
3924
                           (src[i + matches] == rule[0]) and
3925
                           not rule[0].isdigit() and
3926
                           (rule[0] not in '(-<^$')):
3927
                        matches += 1
3928
                        rule = rule[1:]
3929
3930
                    if rule and (rule[0] == '('):
3931
                        # check an array of letters
3932
                        if (((len(src) > (i + matches)) and
3933
                             src[i + matches].isalpha() and
3934
                             (src[i + matches] in rule[1:]))):
3935
                            matches += 1
3936
3937
                            while rule and rule[0] != ')':
3938
                                rule = rule[1:]
3939
3940
                            # if rule[0] == ')':
3941
                            rule = rule[1:]
3942
3943
                    if rule:
3944
                        priority0 = ord(rule[0])
3945
                    else:
3946
                        priority0 = 0
3947
3948
                    matches0 = matches
3949
3950
                    while rule and rule[0] == '-' and matches > 1:
3951
                        matches -= 1
3952
                        rule = rule[1:]
3953
3954
                    if rule and rule[0] == '<':
3955
                        rule = rule[1:]
3956
3957
                    if rule and rule[0].isdigit():
3958
                        # read priority
3959
                        priority = int(rule[0])
3960
                        rule = rule[1:]
3961
3962
                    if rule and rule[0:2] == '^^':
3963
                        rule = rule[1:]
3964
3965
                    if (not rule or
3966
                            ((rule[0] == '^') and
3967
                             ((i == 0) or not src[i-1].isalpha()) and
3968
                             ((rule[1:2] != '$') or
3969
                              (not (src[i+matches0:i+matches0+1].isalpha()) and
3970
                               (src[i+matches0:i+matches0+1] != '.')))) or
3971
                            ((rule[0] == '$') and (i > 0) and
3972
                             src[i-1].isalpha() and
3973
                             ((not src[i+matches0:i+matches0+1].isalpha()) and
3974
                              (src[i+matches0:i+matches0+1] != '.')))):
3975
                        # look for continuation, if:
3976
                        # matches > 1 and NO '-' in first string
3977
                        pos0 = -1
3978
3979
                        start3 = 0
3980
                        start4 = 0
3981
                        end3 = 0
3982
                        end4 = 0
3983
3984
                        if (((matches > 1) and
3985
                             src[i+matches:i+matches+1] and
3986
                             (priority0 != ord('-')))):
3987
                            char0 = src[i+matches-1]
3988
                            pos0 = alpha_pos[char0]
3989
3990
                            if pos0 >= 2 and src[i+matches]:
3991
                                xpos = pos0 - 2
3992
                                pos0 = alpha_pos[src[i+matches]]
3993
                                start3 = phonet_hash_1[xpos, pos0]
3994
                                start4 = phonet_hash_1[xpos, 0]
3995
                                end3 = phonet_hash_2[xpos, pos0]
3996
                                end4 = phonet_hash_2[xpos, 0]
3997
3998
                                # preserve rule priorities
3999
                                if (((start4 >= 0) and
4000
                                     ((start3 < 0) or (start4 < start3)))):
4001
                                    pos0 = start3
4002
                                    start3 = start4
4003
                                    start4 = pos0
4004
                                    pos0 = end3
4005
                                    end3 = end4
4006
                                    end4 = pos0
4007
4008
                                if (end3 >= start4) and (start4 >= 0):
4009
                                    if end4 > end3:
4010
                                        end3 = end4
4011
4012
                                    start4 = -1
4013
                                    end4 = -1
4014
                            else:
4015
                                pos0 = phonet_hash[char0]
4016
                                start3 = pos0
4017
                                end3 = 10000
4018
                                start4 = -1
4019
                                end4 = -1
4020
4021
                            pos0 = start3
4022
4023
                        # check continuation rules for src[i+matches]
4024
                        if pos0 >= 0:
4025
                            while ((_phonet_rules[pos0] is None) or
4026
                                   (_phonet_rules[pos0][0] == char0)):
4027
                                if pos0 > end3:
4028
                                    if start4 > 0:
4029
                                        pos0 = start4
4030
                                        start3 = start4
4031
                                        start4 = -1
4032
                                        end3 = end4
4033
                                        end4 = -1
4034
                                        continue
4035
4036
                                    priority0 = -1
4037
4038
                                    # important
4039
                                    break
4040
4041
                                if (((_phonet_rules[pos0] is None) or
4042
                                     (_phonet_rules[pos0 + mode] is None))):
4043
                                    # no conversion rule available
4044
                                    pos0 += 3
4045
                                    continue
4046
4047
                                if trace:
4048
                                    _trinfo('> > continuation rule no.', pos0,
4049
                                            'is being checked', lang)
4050
4051
                                # check whole string
4052
                                matches0 = matches
4053
                                priority0 = 5
4054
                                rule = _phonet_rules[pos0]
4055
                                rule = rule[1:]
4056
4057
                                while (rule and
4058
                                       (src[i+matches0:i+matches0+1] ==
4059
                                        rule[0]) and
4060
                                       (not rule[0].isdigit() or
4061
                                        (rule in '(-<^$'))):
4062
                                    matches0 += 1
4063
                                    rule = rule[1:]
4064
4065
                                if rule and rule[0] == '(':
4066
                                    # check an array of letters
4067
                                    if ((src[i+matches0:i+matches0+1]
4068
                                         .isalpha() and
4069
                                         (src[i+matches0] in rule[1:]))):
4070
                                        matches0 += 1
4071
4072
                                        while rule and rule[0] != ')':
4073
                                            rule = rule[1:]
4074
4075
                                        # if rule[0] == ')':
4076
                                        rule = rule[1:]
4077
4078
                                while rule and rule[0] == '-':
4079
                                    # "matches0" is NOT decremented
4080
                                    # because of "if (matches0 == matches)"
4081
                                    rule = rule[1:]
4082
4083
                                if rule and rule[0] == '<':
4084
                                    rule = rule[1:]
4085
4086
                                if rule and rule[0].isdigit():
4087
                                    priority0 = int(rule[0])
4088
                                    rule = rule[1:]
4089
4090
                                if (not rule or
4091
                                        # rule == '^' is not possible here
4092
                                        ((rule[0] == '$') and not
4093
                                         src[i+matches0:i+matches0+1]
4094
                                         .isalpha() and
4095
                                         (src[i+matches0:i+matches0+1]
4096
                                          != '.'))):
4097
                                    if matches0 == matches:
4098
                                        # this is only a partial string
4099
                                        if trace:
4100
                                            _trinfo('> > continuation ' +
4101
                                                    'rule no.',
4102
                                                    pos0,
4103
                                                    'not used (too short)',
4104
                                                    lang)
4105
4106
                                        pos0 += 3
4107
                                        continue
4108
4109
                                    if priority0 < priority:
4110
                                        # priority is too low
4111
                                        if trace:
4112
                                            _trinfo('> > continuation ' +
4113
                                                    'rule no.',
4114
                                                    pos0,
4115
                                                    'not used (priority)',
4116
                                                    lang)
4117
4118
                                        pos0 += 3
4119
                                        continue
4120
4121
                                    # continuation rule found
4122
                                    break
4123
4124
                                if trace:
4125
                                    _trinfo('> > continuation rule no.', pos0,
4126
                                            'not used', lang)
4127
4128
                                pos0 += 3
4129
4130
                            # end of "while"
4131
                            if ((priority0 >= priority) and
4132
                                    ((_phonet_rules[pos0] is not None) and
4133
                                     (_phonet_rules[pos0][0] == char0))):
4134
4135
                                if trace:
4136
                                    _trinfo('> rule no.', pos, '', lang)
4137
                                    _trinfo('> not used because of ' +
4138
                                            'continuation', pos0, '', lang)
4139
4140
                                pos += 3
4141
                                continue
4142
4143
                        # replace string
4144
                        if trace:
4145
                            _trinfo('Rule no.', pos, 'is applied', lang)
4146
4147
                        if ((_phonet_rules[pos] and
4148
                             ('<' in _phonet_rules[pos][1:]))):
4149
                            priority0 = 1
4150
                        else:
4151
                            priority0 = 0
4152
4153
                        rule = _phonet_rules[pos + mode]
4154
4155
                        if (priority0 == 1) and (zeta == 0):
4156
                            # rule with '<' is applied
4157
                            if ((j > 0) and rule and
4158
                                    ((dest[j-1] == char) or
4159
                                     (dest[j-1] == rule[0]))):
4160
                                j -= 1
4161
4162
                            zeta0 = 1
4163
                            zeta += 1
4164
                            matches0 = 0
4165
4166
                            while rule and src[i+matches0]:
4167
                                src = (src[0:i+matches0] + rule[0] +
4168
                                       src[i+matches0+1:])
4169
                                matches0 += 1
4170
                                rule = rule[1:]
4171
4172
                            if matches0 < matches:
4173
                                src = (src[0:i+matches0] +
4174
                                       src[i+matches:])
4175
4176
                            char = src[i]
4177
                        else:
4178
                            i = i + matches - 1
4179
                            zeta = 0
4180
4181
                            while len(rule) > 1:
4182
                                if (j == 0) or (dest[j - 1] != rule[0]):
4183
                                    dest = (dest[0:j] + rule[0] +
4184
                                            dest[min(len(dest), j+1):])
4185
                                    j += 1
4186
4187
                                rule = rule[1:]
4188
4189
                            # new "current char"
4190
                            if not rule:
4191
                                rule = ''
4192
                                char = ''
4193
                            else:
4194
                                char = rule[0]
4195
4196
                            if ((_phonet_rules[pos] and
4197
                                 '^^' in _phonet_rules[pos][1:])):
4198
                                if char:  # pragma: no branch
4199
                                    dest = (dest[0:j] + char +
4200
                                            dest[min(len(dest), j + 1):])
4201
                                    j += 1
4202
4203
                                src = src[i + 1:]
4204
                                i = 0
4205
                                zeta0 = 1
4206
4207
                        break
4208
4209
                    pos += 3
4210
4211
                    if pos > end1 and start2 > 0:
4212
                        pos = start2
4213
                        start1 = start2
4214
                        end1 = end2
4215
                        start2 = -1
4216
                        end2 = -1
4217
4218
            if zeta0 == 0:
4219
                if char and ((j == 0) or (dest[j-1] != char)):
4220
                    # delete multiple letters only
4221
                    dest = dest[0:j] + char + dest[min(j+1, term_length):]
4222
                    j += 1
4223
4224
                i += 1
4225
                zeta = 0
4226
4227
        dest = dest[0:j]
4228
4229
        return dest
4230
4231
    _initialize_phonet(lang)
4232
4233
    word = normalize('NFKC', text_type(word))
4234
    return _phonet(word, mode, lang, trace)
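
Annotation: after the literal letters (and an optional parenthesised letter set), the matching loop above reads a short modifier tail from each search pattern: any number of '-' characters, an optional '<', and an optional single-digit priority that defaults to 5, followed by '^' or '$' anchors. The sketch below isolates only that parsing step. `parse_rule_suffix` is a hypothetical helper that does not exist in abydos, and it deliberately ignores the `matches > 1` bound that the real loop applies while consuming '-'.

```python
def parse_rule_suffix(suffix):
    """Split the modifier tail of a phonet-style search pattern.

    Hypothetical helper for illustration only. Returns a tuple of
    (number of '-' marks, whether '<' was present, priority, rest).
    """
    minus_count = 0
    while suffix.startswith('-'):
        minus_count += 1
        suffix = suffix[1:]

    reapply = suffix.startswith('<')
    if reapply:
        suffix = suffix[1:]

    priority = 5  # same default the listing assigns before reading a digit
    if suffix[:1].isdigit():
        priority = int(suffix[0])
        suffix = suffix[1:]

    # whatever is left should be anchors such as '^', '^^' or '$'
    return minus_count, reapply, priority, suffix


print(parse_rule_suffix('--<3^'))
```

Keeping this step separate from the letter matching is one way the long method could be split up, since the same tail is parsed again for continuation rules.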
4235
4236
4237
def spfc(word):
4238
    """Return the Standardized Phonetic Frequency Code (SPFC) of a word.
4239
4240
    Standardized Phonetic Frequency Code is roughly Soundex-like.
4241
    This implementation is based on pages 19-21 of :cite:`Moore:1977`.
4242
4243
    :param str word: the word to transform
4244
    :returns: the SPFC value
4245
    :rtype: str
4246
4247
    >>> spfc('Christopher Smith')
4248
    '01160'
4249
    >>> spfc('Christopher Schmidt')
4250
    '01160'
4251
    >>> spfc('Niall Smith')
4252
    '01660'
4253
    >>> spfc('Niall Schmidt')
4254
    '01660'
4255
4256
    >>> spfc('L.Smith')
4257
    '01960'
4258
    >>> spfc('R.Miller')
4259
    '65490'
4260
4261
    >>> spfc(('L', 'Smith'))
4262
    '01960'
4263
    >>> spfc(('R', 'Miller'))
4264
    '65490'
4265
    """
4266
    _pf1 = dict(zip((ord(_) for _ in 'SZCKQVFPUWABLORDHIEMNXGJT'),
Issue (Comprehensibility, Best Practice): The variable _ does not seem to be defined.
4267
                    '0011112222334445556666777'))
4268
    _pf2 = dict(zip((ord(_) for _ in
4269
                     'SZCKQFPXABORDHIMNGJTUVWEL'),
4270
                    '0011122233445556677788899'))
4271
    _pf3 = dict(zip((ord(_) for _ in
4272
                     'BCKQVDTFLPGJXMNRSZAEHIOUWY'),
4273
                    '00000112223334456677777777'))
4274
4275
    _substitutions = (('DK', 'K'), ('DT', 'T'), ('SC', 'S'), ('KN', 'N'),
4276
                      ('MN', 'N'))
4277
4278
    def _raise_word_ex():
4279
        """Raise an AttributeError."""
4280
        raise AttributeError('word attribute must be a string with a space ' +
4281
                             'or period dividing the first and last names ' +
4282
                             'or a tuple/list consisting of the first and ' +
4283
                             'last names')
4284
4285
    if not word:
4286
        return ''
4287
4288
    if isinstance(word, (str, text_type)):
4289
        names = word.split('.', 1)
4290
        if len(names) != 2:
4291
            names = word.split(' ', 1)
4292
            if len(names) != 2:
4293
                _raise_word_ex()
4294
    elif hasattr(word, '__iter__'):
4295
        if len(word) != 2:
4296
            _raise_word_ex()
4297
        names = word
4298
    else:
4299
        _raise_word_ex()
4300
4301
    names = [normalize('NFKD', text_type(_.strip()
4302
                                         .replace('ß', 'SS')
4303
                                         .upper()))
4304
             for _ in names]
Issue: The variable names does not seem to be defined for all execution paths.
4305
    code = ''
4306
4307
    def steps_one_to_three(name):
4308
        """Perform the first three steps of SPFC."""
4309
        # filter out non A-Z
4310
        name = ''.join(_ for _ in name if _ in
4311
                       {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K',
4312
                        'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V',
4313
                        'W', 'X', 'Y', 'Z'})
4314
4315
        # 1. In the field, convert DK to K, DT to T, SC to S, KN to N,
4316
        # and MN to N
4317
        for subst in _substitutions:
4318
            name = name.replace(subst[0], subst[1])
4319
4320
        # 2. In the name field, replace multiple letters with a single letter
4321
        name = _delete_consecutive_repeats(name)
4322
4323
        # 3. Remove vowels, W, H, and Y, but keep the first letter in the name
4324
        # field.
4325
        if name:
4326
            name = name[0] + ''.join(_ for _ in name[1:] if _ not in
4327
                                     {'A', 'E', 'H', 'I', 'O', 'U', 'W', 'Y'})
4328
        return name
4329
4330
    names = [steps_one_to_three(_) for _ in names]
4331
4332
    # 4. The first digit of the code is obtained using PF1 and the first letter
4333
    # of the name field. Remove this letter after coding.
4334
    if names[1]:
4335
        code += names[1][0].translate(_pf1)
4336
        names[1] = names[1][1:]
4337
4338
    # 5. Using the last letters of the name, use Table PF3 to obtain the
4339
    # second digit of the code. Use as many letters as possible and remove
4340
    # after coding.
4341
    if names[1]:
4342
        if names[1][-3:] == 'STN' or names[1][-3:] == 'PRS':
4343
            code += '8'
4344
            names[1] = names[1][:-3]
4345
        elif names[1][-2:] == 'SN':
4346
            code += '8'
4347
            names[1] = names[1][:-2]
4348
        elif names[1][-3:] == 'STR':
4349
            code += '9'
4350
            names[1] = names[1][:-3]
4351
        elif names[1][-2:] in {'SR', 'TN', 'TD'}:
4352
            code += '9'
4353
            names[1] = names[1][:-2]
4354
        elif names[1][-3:] == 'DRS':
4355
            code += '7'
4356
            names[1] = names[1][:-3]
4357
        elif names[1][-2:] in {'TR', 'MN'}:
4358
            code += '7'
4359
            names[1] = names[1][:-2]
4360
        else:
4361
            code += names[1][-1].translate(_pf3)
4362
            names[1] = names[1][:-1]
4363
4364
    # 6. The third digit is found using Table PF2 and the first character of
4365
    # the first name. Remove after coding.
4366
    if names[0]:
4367
        code += names[0][0].translate(_pf2)
4368
        names[0] = names[0][1:]
4369
4370
    # 7. The fourth digit is found using Table PF2 and the first character of
4371
    # the name field. If no letters remain use zero. After coding remove the
4372
    # letter.
4373
    # 8. The fifth digit is found in the same manner as the fourth using the
4374
    # remaining characters of the name field if any.
4375
    for _ in range(2):
4376
        if names[1]:
4377
            code += names[1][0].translate(_pf2)
4378
            names[1] = names[1][1:]
4379
        else:
4380
            code += '0'
4381
4382
    return code
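
Annotation: the docstring shows 'Smith' and 'Schmidt' receiving the same SPFC code. A minimal sketch of steps 1 to 3 makes the reason visible: both surnames reduce to the same consonant skeleton before any table lookup happens. `collapse_repeats` is a stand-in for the module's `_delete_consecutive_repeats` (assumed here to collapse runs of identical letters), and the input is assumed to be uppercased already.

```python
from itertools import groupby

# Digraph simplifications from step 1 of the SPFC description.
SUBSTITUTIONS = (('DK', 'K'), ('DT', 'T'), ('SC', 'S'), ('KN', 'N'),
                 ('MN', 'N'))
# Letters removed in step 3 (everywhere except the first position).
DROPPED = set('AEHIOUWY')


def collapse_repeats(text):
    """Stand-in helper: collapse runs of the same letter into one letter."""
    return ''.join(char for char, _ in groupby(text))


def steps_one_to_three(name):
    """Apply SPFC steps 1-3 to an already uppercased name."""
    name = ''.join(c for c in name if 'A' <= c <= 'Z')  # keep A-Z only
    for old, new in SUBSTITUTIONS:
        name = name.replace(old, new)
    name = collapse_repeats(name)
    return name[:1] + ''.join(c for c in name[1:] if c not in DROPPED)


# Table PF1, built the same way as in the listing: code point -> digit.
PF1 = dict(zip((ord(c) for c in 'SZCKQVFPUWABLORDHIEMNXGJT'),
               '0011112222334445556666777'))

print(steps_one_to_three('SMITH'))    # SMT
print(steps_one_to_three('SCHMIDT'))  # SMT
print('S'.translate(PF1))             # 0
```

Because `str.translate` accepts a mapping keyed by code point, the `dict(zip(ord(...), digits))` idiom turns each lookup table into a single call.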
4383
4384
4385
def statistics_canada(word, maxlength=4):
4386
    """Return the Statistics Canada code for a word.
4387
4388
    The original description of this algorithm could not be located, and
    may only have been specified in an unpublished technical report. The
    coding does not appear to be in use by Statistics Canada any longer. In
    its place, this is an implementation of the "Census modified Statistics
    Canada name coding procedure".
4393
4394
    The modified version of this algorithm is described in Appendix B of
4395
    :cite:`Moore:1977`.
4396
4397
    :param str word: the word to transform
4398
    :param int maxlength: the maximum length (default 4) of the code to return
4400
    :returns: the Statistics Canada name code value
4401
    :rtype: str
4402
4403
    >>> statistics_canada('Christopher')
4404
    'CHRS'
4405
    >>> statistics_canada('Niall')
4406
    'NL'
4407
    >>> statistics_canada('Smith')
4408
    'SMTH'
4409
    >>> statistics_canada('Schmidt')
4410
    'SCHM'
4411
    """
4412
    # uppercase, normalize, decompose, and filter non-A-Z out
4413
    word = normalize('NFKD', text_type(word.upper()))
4414
    word = word.replace('ß', 'SS')
4415
    word = ''.join(c for c in word if c in
4416
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
4417
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
4418
                    'Y', 'Z'})
4419
    if not word:
4420
        return ''
4421
4422
    code = word[1:]
4423
    for vowel in {'A', 'E', 'I', 'O', 'U', 'Y'}:
4424
        code = code.replace(vowel, '')
4425
    code = word[0]+code
4426
    code = _delete_consecutive_repeats(code)
4427
    code = code.replace(' ', '')
4428
4429
    return code[:maxlength]
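
Annotation: the whole procedure fits in a few lines once the helper is inlined. The sketch below is a condensed restatement for Python 3, not the module's code: `statistics_canada_sketch` is a hypothetical name, and `groupby` stands in for `_delete_consecutive_repeats`. It reproduces the docstring examples.

```python
from itertools import groupby
from unicodedata import normalize

VOWELS = set('AEIOUY')


def statistics_canada_sketch(word, maxlength=4):
    """Condensed restatement of the coding above (Python 3 only)."""
    # uppercase, decompose accents, and keep only A-Z
    word = normalize('NFKD', word.upper()).replace('ß', 'SS')
    word = ''.join(c for c in word if 'A' <= c <= 'Z')
    if not word:
        return ''

    # keep the first letter, drop vowels and Y from the rest
    code = word[0] + ''.join(c for c in word[1:] if c not in VOWELS)
    # collapse runs of identical letters, then truncate
    code = ''.join(c for c, _ in groupby(code))
    return code[:maxlength]


print(statistics_canada_sketch('Christopher'))  # CHRS
print(statistics_canada_sketch('Niall'))        # NL
```

Note that truncation happens last, so raising `maxlength` only exposes more of the same skeleton ('CHRSTP' for 'Christopher' at length 6).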
4430
4431
4432
def lein(word, maxlength=4, zero_pad=True):
4433
    """Return the Lein code for a word.
4434
4435
    This is Lein name coding, described in :cite:`Moore:1977`.
4436
4437
    :param str word: the word to transform
4438
    :param int maxlength: the maximum length (default 4) of the code to return
4439
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
4440
        maxlength string
4441
    :returns: the Lein code
4442
    :rtype: str
4443
4444
    >>> lein('Christopher')
4445
    'C351'
4446
    >>> lein('Niall')
4447
    'N300'
4448
    >>> lein('Smith')
4449
    'S210'
4450
    >>> lein('Schmidt')
4451
    'S521'
4452
    """
4453
    _lein_translation = dict(zip((ord(_) for _ in
Issue (Comprehensibility, Best Practice): The variable _ does not seem to be defined.
4454
                                  'BCDFGJKLMNPQRSTVXZ'),
4455
                                 '451455532245351455'))
4456
4457
    # uppercase, normalize, decompose, and filter non-A-Z out
4458
    word = normalize('NFKD', text_type(word.upper()))
4459
    word = word.replace('ß', 'SS')
4460
    word = ''.join(c for c in word if c in
4461
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
4462
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
4463
                    'Y', 'Z'})
4464
4465
    if not word:
4466
        return ''
4467
4468
    code = word[0]  # Rule 1
4469
    word = word[1:].translate({32: None, 65: None, 69: None, 72: None,
4470
                               73: None, 79: None, 85: None, 87: None,
4471
                               89: None})  # Rule 2
4472
    word = _delete_consecutive_repeats(word)  # Rule 3
4473
    code += word.translate(_lein_translation)  # Rule 4
4474
4475
    if zero_pad:
4476
        code += ('0'*maxlength)  # Rule 4
4477
4478
    return code[:maxlength]
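
Annotation: the order of Rules 3 and 4 matters. Repeated letters are collapsed before the letters are translated to digits, so two different letters that share a digit (D and T both map to 1) each still contribute a digit. The sketch below is a reduced ASCII-only restatement under that reading; `lein_sketch` is a hypothetical name and the Unicode normalisation step is omitted.

```python
from itertools import groupby

# Same letter-to-digit table as in the listing, built with str.maketrans.
LEIN_DIGITS = str.maketrans('BCDFGJKLMNPQRSTVXZ', '451455532245351455')
DROPPED = set('AEHIOUWY')


def lein_sketch(word, maxlength=4, zero_pad=True):
    """ASCII-only restatement of the Lein rules (illustrative)."""
    word = ''.join(c for c in word.upper() if 'A' <= c <= 'Z')
    if not word:
        return ''

    tail = ''.join(c for c in word[1:] if c not in DROPPED)  # Rule 2
    tail = ''.join(c for c, _ in groupby(tail))              # Rule 3
    code = word[0] + tail.translate(LEIN_DIGITS)             # Rule 4

    if zero_pad:
        code += '0' * maxlength
    return code[:maxlength]


print(lein_sketch('Schmidt'))  # S521
```

In 'Schmidt' the tail 'CMDT' becomes '5211': the D and the T are distinct letters, so neither is dropped even though both encode as 1.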
4479
4480
4481
def roger_root(word, maxlength=5, zero_pad=True):
4482
    """Return the Roger Root code for a word.
4483
4484
    This is Roger Root name coding, described in :cite:`Moore:1977`.
4485
4486
    :param str word: the word to transform
4487
    :param int maxlength: the maximum length (default 5) of the code to return
4488
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
4489
        maxlength string
4490
    :returns: the Roger Root code
4491
    :rtype: str
4492
4493
    >>> roger_root('Christopher')
4494
    '06401'
4495
    >>> roger_root('Niall')
4496
    '02500'
4497
    >>> roger_root('Smith')
4498
    '00310'
4499
    >>> roger_root('Schmidt')
4500
    '06310'
4501
    """
4502
    # uppercase, normalize, decompose, and filter non-A-Z out
4503
    word = normalize('NFKD', text_type(word.upper()))
4504
    word = word.replace('ß', 'SS')
4505
    word = ''.join(c for c in word if c in
4506
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
4507
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
4508
                    'Y', 'Z'})
4509
4510
    if not word:
4511
        return ''
4512
4513
    # '*' is used to prevent combining by _delete_consecutive_repeats()
4514
    _init_patterns = {4: {'TSCH': '06'},
4515
                      3: {'TSH': '06', 'SCH': '06'},
4516
                      2: {'CE': '0*0', 'CH': '06', 'CI': '0*0', 'CY': '0*0',
4517
                          'DG': '07', 'GF': '08', 'GM': '03', 'GN': '02',
4518
                          'KN': '02', 'PF': '08', 'PH': '08', 'PN': '02',
4519
                          'SH': '06', 'TS': '0*0', 'WR': '04'},
4520
                      1: {'A': '1', 'B': '09', 'C': '07', 'D': '01', 'E': '1',
4521
                          'F': '08', 'G': '07', 'H': '2', 'I': '1', 'J': '3',
4522
                          'K': '07', 'L': '05', 'M': '03', 'N': '02', 'O': '1',
4523
                          'P': '09', 'Q': '07', 'R': '04', 'S': '0*0',
4524
                          'T': '01', 'U': '1', 'V': '08', 'W': '4', 'X': '07',
4525
                          'Y': '5', 'Z': '0*0'}}
4526
4527
    _med_patterns = {4: {'TSCH': '6'},
4528
                     3: {'TSH': '6', 'SCH': '6'},
4529
                     2: {'CE': '0', 'CH': '6', 'CI': '0', 'CY': '0', 'DG': '7',
4530
                         'PH': '8', 'SH': '6', 'TS': '0'},
4531
                     1: {'B': '9', 'C': '7', 'D': '1', 'F': '8', 'G': '7',
4532
                         'J': '6', 'K': '7', 'L': '5', 'M': '3', 'N': '2',
4533
                         'P': '9', 'Q': '7', 'R': '4', 'S': '0', 'T': '1',
4534
                         'V': '8', 'X': '7', 'Z': '0',
4535
                         'A': '*', 'E': '*', 'H': '*', 'I': '*', 'O': '*',
4536
                         'U': '*', 'W': '*', 'Y': '*'}}
4537
4538
    code = ''
4539
    pos = 0
4540
4541
    # Do first digit(s) first
4542
    for num in range(4, 0, -1):
4543
        if word[:num] in _init_patterns[num]:
4544
            code = _init_patterns[num][word[:num]]
4545
            pos += num
4546
            break
4547
    else:
4548
        pos += 1  # Advance if nothing is recognized
4549
4550
    # Then code subsequent digits
4551
    while pos < len(word):
4552
        for num in range(4, 0, -1):
4553
            if word[pos:pos+num] in _med_patterns[num]:
4554
                code += _med_patterns[num][word[pos:pos+num]]
4555
                pos += num
4556
                break
4557
        else:
4558
            pos += 1  # Advance if nothing is recognized
4559
4560
    code = _delete_consecutive_repeats(code)
4561
    code = code.replace('*', '')
4562
4563
    if zero_pad:
4564
        code += '0'*maxlength
4565
4566
    return code[:maxlength]
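
Annotation: both coding loops above use the same idiom: try the longest pattern first, fall through to shorter ones, and rely on the `else` clause of the `for` loop to advance one character when nothing matched. The sketch below shows that idiom on its own. `PATTERNS` is a reduced, hypothetical subset of the medial table, and `encode_longest_first` is not an abydos function.

```python
# Reduced, hypothetical subset of a pattern table, keyed by pattern length.
PATTERNS = {3: {'SCH': '6'},
            2: {'CH': '6', 'PH': '8'},
            1: {'C': '7', 'D': '1', 'M': '3', 'S': '0', 'T': '1'}}


def encode_longest_first(word, patterns=PATTERNS):
    """Encode word by always consuming the longest known pattern."""
    code = ''
    pos = 0
    while pos < len(word):
        for num in sorted(patterns, reverse=True):
            chunk = word[pos:pos + num]
            if chunk in patterns[num]:
                code += patterns[num][chunk]
                pos += num
                break
        else:
            # no pattern of any length matched: skip this character
            pos += 1
    return code


print(encode_longest_first('SCHMIDT'))  # 6311
```

Here 'SCH' is consumed as one unit, 'I' is skipped by the `else` branch, and 'D' and 'T' each yield '1'. The full function then collapses repeated digits and strips the '*' placeholders.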
4567
4568
4569
def onca(word, maxlength=4, zero_pad=True):
4570
    """Return the Oxford Name Compression Algorithm (ONCA) code for a word.
4571
4572
    This is the Oxford Name Compression Algorithm, based on :cite:`Gill:1997`.
4573
4574
    I can find no complete description of the "anglicised version of the NYSIIS
4575
    method" identified as the first step in this algorithm, so this is likely
4576
    not a precisely correct implementation, in that it employs the standard
4577
    NYSIIS algorithm.
4578
4579
    :param str word: the word to transform
4580
    :param int maxlength: the maximum length (default 4) of the code to return
4581
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
4582
        maxlength string
4583
    :returns: the ONCA code
4584
    :rtype: str
4585
4586
    >>> onca('Christopher')
4587
    'C623'
4588
    >>> onca('Niall')
4589
    'N400'
4590
    >>> onca('Smith')
4591
    'S530'
4592
    >>> onca('Schmidt')
4593
    'S530'
4594
    """
4595
    # In the most extreme case, 3 characters of NYSIIS input can be compressed
4596
    # to one character of output, so give it triple the maxlength.
4597
    return soundex(nysiis(word, maxlength=maxlength*3), maxlength,
4598
                   zero_pad=zero_pad)
4599
4600
4601
def eudex(word, maxlength=8):
4602
    """Return the eudex phonetic hash of a word.
4603
4604
    This implementation of eudex phonetic hashing is based on the specification
4605
    (not the reference implementation) at :cite:`Ticki:2016`.
4606
4607
    Further details can be found at :cite:`Ticki:2016b`.
4608
4609
    :param str word: the word to transform
4610
    :param int maxlength: the length of the hash returned, in bytes (default 8)
4611
    :returns: the eudex hash
4612
    :rtype: int
4613
    """
4614
    _trailing_phones = {
4615
        'a': 0,  # a
4616
        'b': 0b01001000,  # b
4617
        'c': 0b00001100,  # c
4618
        'd': 0b00011000,  # d
4619
        'e': 0,  # e
4620
        'f': 0b01000100,  # f
4621
        'g': 0b00001000,  # g
4622
        'h': 0b00000100,  # h
4623
        'i': 1,  # i
4624
        'j': 0b00000101,  # j
4625
        'k': 0b00001001,  # k
4626
        'l': 0b10100000,  # l
4627
        'm': 0b00000010,  # m
4628
        'n': 0b00010010,  # n
4629
        'o': 0,  # o
4630
        'p': 0b01001001,  # p
4631
        'q': 0b10101000,  # q
4632
        'r': 0b10100001,  # r
4633
        's': 0b00010100,  # s
4634
        't': 0b00011101,  # t
4635
        'u': 1,  # u
4636
        'v': 0b01000101,  # v
4637
        'w': 0b00000000,  # w
4638
        'x': 0b10000100,  # x
4639
        'y': 1,  # y
4640
        'z': 0b10010100,  # z
4641
4642
        'ß': 0b00010101,  # ß
4643
        'à': 0,  # à
4644
        'á': 0,  # á
4645
        'â': 0,  # â
4646
        'ã': 0,  # ã
4647
        'ä': 0,  # ä[æ]
4648
        'å': 1,  # å[oː]
4649
        'æ': 0,  # æ[æ]
4650
        'ç': 0b10010101,  # ç[t͡ʃ]
4651
        'è': 1,  # è
4652
        'é': 1,  # é
4653
        'ê': 1,  # ê
4654
        'ë': 1,  # ë
4655
        'ì': 1,  # ì
4656
        'í': 1,  # í
4657
        'î': 1,  # î
4658
        'ï': 1,  # ï
4659
        'ð': 0b00010101,  # ð[ð̠](represented as a non-plosive T)
4660
        'ñ': 0b00010111,  # ñ[nj](represented as a combination of n and j)
4661
        'ò': 0,  # ò
4662
        'ó': 0,  # ó
4663
        'ô': 0,  # ô
4664
        'õ': 0,  # õ
4665
        'ö': 1,  # ö[ø]
4666
        '÷': 0b11111111,  # ÷
4667
        'ø': 1,  # ø[ø]
4668
        'ù': 1,  # ù
4669
        'ú': 1,  # ú
4670
        'û': 1,  # û
4671
        'ü': 1,  # ü
4672
        'ý': 1,  # ý
4673
        'þ': 0b00010101,  # þ[ð̠](represented as a non-plosive T)
4674
        'ÿ': 1,  # ÿ
4675
    }
4676
4677
    _initial_phones = {
4678
        'a': 0b10000100,  # a*
4679
        'b': 0b00100100,  # b
4680
        'c': 0b00000110,  # c
4681
        'd': 0b00001100,  # d
4682
        'e': 0b11011000,  # e*
4683
        'f': 0b00100010,  # f
4684
        'g': 0b00000100,  # g
4685
        'h': 0b00000010,  # h
4686
        'i': 0b11111000,  # i*
4687
        'j': 0b00000011,  # j
4688
        'k': 0b00000101,  # k
4689
        'l': 0b01010000,  # l
4690
        'm': 0b00000001,  # m
4691
        'n': 0b00001001,  # n
4692
        'o': 0b10010100,  # o*
4693
        'p': 0b00100101,  # p
4694
        'q': 0b01010100,  # q
4695
        'r': 0b01010001,  # r
4696
        's': 0b00001010,  # s
4697
        't': 0b00001110,  # t
4698
        'u': 0b11100000,  # u*
4699
        'v': 0b00100011,  # v
4700
        'w': 0b00000000,  # w
4701
        'x': 0b01000010,  # x
4702
        'y': 0b11100100,  # y*
4703
        'z': 0b01001010,  # z
4704
4705
        'ß': 0b00001011,  # ß
4706
        'à': 0b10000101,  # à
4707
        'á': 0b10000101,  # á
4708
        'â': 0b10000000,  # â
4709
        'ã': 0b10000110,  # ã
4710
        'ä': 0b10100110,  # ä [æ]
4711
        'å': 0b11000010,  # å [oː]
4712
        'æ': 0b10100111,  # æ [æ]
4713
        'ç': 0b01010100,  # ç [t͡ʃ]
4714
        'è': 0b11011001,  # è
4715
        'é': 0b11011001,  # é
4716
        'ê': 0b11011001,  # ê
4717
        'ë': 0b11000110,  # ë [ə] or [œ]
4718
        'ì': 0b11111001,  # ì
4719
        'í': 0b11111001,  # í
4720
        'î': 0b11111001,  # î
4721
        'ï': 0b11111001,  # ï
4722
        'ð': 0b00001011,  # ð [ð̠] (represented as a non-plosive T)
4723
        'ñ': 0b00001011,  # ñ [nj] (represented as a combination of n and j)
4724
        'ò': 0b10010101,  # ò
4725
        'ó': 0b10010101,  # ó
4726
        'ô': 0b10010101,  # ô
4727
        'õ': 0b10010101,  # õ
4728
        'ö': 0b11011100,  # ö [œ] or [ø]
4729
        '÷': 0b11111111,  # ÷
4730
        'ø': 0b11011101,  # ø [œ] or [ø]
4731
        'ù': 0b11100001,  # ù
4732
        'ú': 0b11100001,  # ú
4733
        'û': 0b11100001,  # û
4734
        'ü': 0b11100101,  # ü
4735
        'ý': 0b11100101,  # ý
4736
        'þ': 0b00001011,  # þ [ð̠] (represented as a non-plosive T)
4737
        'ÿ': 0b11100101,  # ÿ
4738
    }
4739
    # Lowercase input & filter unknown characters
4740
    word = ''.join(char for char in word.lower() if char in _initial_phones)
4741
4742
    if not word:
4743
        word = '÷'
4744
4745
    # Perform initial eudex coding of each character
4746
    values = [_initial_phones[word[0]]]
4747
    values += [_trailing_phones[char] for char in word[1:]]
4748
4749
    # Right-shift by one to determine if second instance should be skipped
4750
    shifted_values = [_ >> 1 for _ in values]
4751
    condensed_values = [values[0]]
4752
    for n in range(1, len(shifted_values)):
4753
        if shifted_values[n] != shifted_values[n-1]:
4754
            condensed_values.append(values[n])
4755
4756
    # Add padding after first character & trim beyond maxlength
4757
    values = ([condensed_values[0]] +
4758
              [0]*max(0, maxlength - len(condensed_values)) +
4759
              condensed_values[1:maxlength])
4760
4761
    # Combine individual character values into eudex hash
4762
    hash_value = 0
4763
    for val in values:
4764
        hash_value = (hash_value << 8) | val
4765
4766
    return hash_value
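

# Illustrative sketch only, not part of the eudex implementation above. It
# isolates the final step (packing 8-bit character values into one integer)
# and shows one common way such hashes can be compared, by counting differing
# bits. Both helper names are hypothetical and exist purely for illustration.
def _sketch_pack_bytes(values):
    """Pack 8-bit values into one integer, first value most significant."""
    packed = 0
    for val in values:
        # Shift what we have so far one byte left, then add the new byte
        packed = (packed << 8) | (val & 0xFF)
    return packed


def _sketch_bit_difference(hash1, hash2):
    """Count the bits in which two packed hashes differ."""
    return bin(hash1 ^ hash2).count('1')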
4767
4768
4769
def haase_phonetik(word, primary_only=False):
4770
    """Return the Haase Phonetik (numeric output) code for a word.
4771
4772
    Based on the algorithm described at :cite:`Prante:2015`.
4773
4774
    Based on the original :cite:`Haase:2000`.
4775
4776
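    While the output codes are numeric, they are nevertheless strs.

    :param str word: the word to transform
    :param bool primary_only: if True, only the primary code is returned
        (no variant spellings are generated)
    :returns: the Haase Phonetik value(s) as a tuple of numeric strings
    :rtype: tuple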
4781
    """
4782
    def _after(word, i, letters):
4783
        """Return True if word[i] follows one of the supplied letters."""
4784
        if i > 0 and word[i-1] in letters:
4785
            return True
4786
        return False
4787
4788
    def _before(word, i, letters):
4789
        """Return True if word[i] precedes one of the supplied letters."""
4790
        if i+1 < len(word) and word[i+1] in letters:
4791
            return True
4792
        return False
4793
4794
    _vowels = {'A', 'E', 'I', 'J', 'O', 'U', 'Y'}
4795
4796
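    word = text_type(word.upper())
    word = word.replace('ß', 'SS')

    # Expand umlauts before NFKD decomposition; afterwards the precomposed
    # characters no longer exist and these replacements would never match
    word = word.replace('Ä', 'AE')
    word = word.replace('Ö', 'OE')
    word = word.replace('Ü', 'UE')
    word = normalize('NFKD', word)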
4802
    word = ''.join(c for c in word if c in
4803
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
4804
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
4805
                    'Y', 'Z'})
4806
4807
    # Nothing to convert, return base case
4808
    if not word:
4809
        return ''
4810
4811
    variants = []
4812
    if primary_only:
4813
        variants = [word]
4814
    else:
4815
        pos = 0
4816
        if word[:2] == 'CH':
4817
            variants.append(('CH', 'SCH'))
4818
            pos += 2
4819
        len_3_vars = {'OWN': 'AUN', 'WSK': 'RSK', 'SCH': 'CH', 'GLI': 'LI',
4820
                      'AUX': 'O', 'EUX': 'O'}
4821
        while pos < len(word):
4822
            if word[pos:pos+4] == 'ILLE':
4823
                variants.append(('ILLE', 'I'))
4824
                pos += 4
4825
            elif word[pos:pos+3] in len_3_vars:
4826
                variants.append((word[pos:pos+3], len_3_vars[word[pos:pos+3]]))
4827
                pos += 3
4828
            elif word[pos:pos+2] == 'RB':
4829
                variants.append(('RB', 'RW'))
4830
                pos += 2
4831
            elif len(word[pos:]) == 3 and word[pos:] == 'EAU':
4832
                variants.append(('EAU', 'O'))
4833
                pos += 3
4834
            elif len(word[pos:]) == 1 and word[pos:] in {'A', 'O'}:
4835
                if word[pos:] == 'O':
4836
                    variants.append(('O', 'OW'))
4837
                else:
4838
                    variants.append(('A', 'AR'))
4839
                pos += 1
4840
            else:
4841
                variants.append((word[pos],))
4842
                pos += 1
4843
4844
        variants = [''.join(letters) for letters in product(*variants)]
4845
4846
    def _haase_code(word):
4847
        sdx = ''
4848
        for i in range(len(word)):
            if word[i] in _vowels:
4850
                sdx += '9'
4851
            elif word[i] == 'B':
4852
                sdx += '1'
4853
            elif word[i] == 'P':
4854
                if _before(word, i, {'H'}):
4855
                    sdx += '3'
4856
                else:
4857
                    sdx += '1'
4858
            elif word[i] in {'D', 'T'}:
4859
                if _before(word, i, {'C', 'S', 'Z'}):
4860
                    sdx += '8'
4861
                else:
4862
                    sdx += '2'
4863
            elif word[i] in {'F', 'V', 'W'}:
4864
                sdx += '3'
4865
            elif word[i] in {'G', 'K', 'Q'}:
4866
                sdx += '4'
4867
            elif word[i] == 'C':
4868
                if _after(word, i, {'S', 'Z'}):
4869
                    sdx += '8'
4870
                elif i == 0:
4871
                    if _before(word, i, {'A', 'H', 'K', 'L', 'O', 'Q', 'R',
4872
                                         'U', 'X'}):
4873
                        sdx += '4'
4874
                    else:
4875
                        sdx += '8'
4876
                elif _before(word, i, {'A', 'H', 'K', 'O', 'Q', 'U', 'X'}):
4877
                    sdx += '4'
4878
                else:
4879
                    sdx += '8'
4880
            elif word[i] == 'X':
4881
                if _after(word, i, {'C', 'K', 'Q'}):
4882
                    sdx += '8'
4883
                else:
4884
                    sdx += '48'
4885
            elif word[i] == 'L':
4886
                sdx += '5'
4887
            elif word[i] in {'M', 'N'}:
4888
                sdx += '6'
4889
            elif word[i] == 'R':
4890
                sdx += '7'
4891
            elif word[i] in {'S', 'Z'}:
4892
                sdx += '8'
4893
4894
        sdx = _delete_consecutive_repeats(sdx)
4895
4896
        # if sdx:
4897
        #     sdx = sdx[0] + sdx[1:].replace('9', '')
4898
4899
        return sdx
4900
4901
    return tuple(_haase_code(word) for word in variants)
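

# Illustrative sketch only: the variant generation in haase_phonetik() builds
# a list of tuples (one tuple of alternative spellings per segment) and then
# takes their Cartesian product. The helper below is hypothetical and shows
# just that expansion step on hand-written segments.
from itertools import product


def _sketch_expand_variants(segments):
    """Return every spelling obtainable by picking one option per segment."""
    return [''.join(parts) for parts in product(*segments)]


# Example: two segments with two alternatives each yield four spellings,
# e.g. _sketch_expand_variants([('CH', 'SCH'), ('A',), ('O', 'OW')])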
4902
4903
4904
def reth_schek_phonetik(word):
4905
    """Return Reth-Schek Phonetik code for a word.
4906
4907
    This algorithm is proposed in :cite:`Reth:1977`.
4908
4909
    Since I couldn't secure a copy of that document (maybe I'll look for it
    next time I'm in Germany), this implementation is based on what I could
    glean from the implementations published by the German Record Linkage
    Center (www.record-linkage.de):
4913
4914
    - Privacy-preserving Record Linkage (PPRL) (in R) :cite:`Rukasz:2018`
4915
    - Merge ToolBox (in Java) :cite:`Schnell:2004`
4916
4917
    Rules that are unclear:
4918
4919
    - Should 'C' become 'G' or 'Z'? (PPRL has both, 'Z' rule blocked)
4920
    - Should 'CC' become 'G'? (PPRL has blocked 'CK' that may be typo)
4921
    - Should 'TUI' -> 'ZUI' rule exist? (PPRL has rule, but I can't
4922
      think of a German word with '-tui-' in it.)
4923
    - Should we really change 'SCH' -> 'CH' and then 'CH' -> 'SCH'?
4924
4925
    :param str word: the word to transform
    :returns: the Reth-Schek Phonetik code
    :rtype: str
4927
    """
4928
    replacements = {3: {'AEH': 'E', 'IEH': 'I', 'OEH': 'OE', 'UEH': 'UE',
4929
                        'SCH': 'CH', 'ZIO': 'TIO', 'TIU': 'TIO', 'ZIU': 'TIO',
4930
                        'CHS': 'X', 'CKS': 'X', 'AEU': 'OI'},
4931
                    2: {'LL': 'L', 'AA': 'A', 'AH': 'A', 'BB': 'B', 'PP': 'B',
4932
                        'BP': 'B', 'PB': 'B', 'DD': 'D', 'DT': 'D', 'TT': 'D',
4933
                        'TH': 'D', 'EE': 'E', 'EH': 'E', 'AE': 'E', 'FF': 'F',
4934
                        'PH': 'F', 'KK': 'K', 'GG': 'G', 'GK': 'G', 'KG': 'G',
4935
                        'CK': 'G', 'CC': 'C', 'IE': 'I', 'IH': 'I', 'MM': 'M',
4936
                        'NN': 'N', 'OO': 'O', 'OH': 'O', 'SZ': 'S', 'UH': 'U',
4937
                        'GS': 'X', 'KS': 'X', 'TZ': 'Z', 'AY': 'AI',
4938
                        'EI': 'AI', 'EY': 'AI', 'EU': 'OI', 'RR': 'R',
4939
                        'SS': 'S', 'KW': 'QU'},
4940
                    1: {'P': 'B', 'T': 'D', 'V': 'F', 'W': 'F', 'C': 'G',
4941
                        'K': 'G', 'Y': 'I'}}
4942
4943
    # Uppercase
4944
    word = word.upper()
4945
4946
    # Replace umlauts/eszett
4947
    word = word.replace('Ä', 'AE')
4948
    word = word.replace('Ö', 'OE')
4949
    word = word.replace('Ü', 'UE')
4950
    word = word.replace('ß', 'SS')
4951
4952
    # Main loop, using above replacements table
4953
    pos = 0
4954
    while pos < len(word):
4955
        for num in range(3, 0, -1):
4956
            if word[pos:pos+num] in replacements[num]:
4957
                word = (word[:pos] + replacements[num][word[pos:pos+num]]
4958
                        + word[pos+num:])
4959
                pos += 1
4960
                break
4961
        else:
4962
            pos += 1  # Advance if nothing is recognized
4963
4964
    # Change 'CH' back(?) to 'SCH'
4965
    word = word.replace('CH', 'SCH')
4966
4967
    # Replace final sequences
4968
    if word[-2:] == 'ER':
4969
        word = word[:-2]+'R'
4970
    elif word[-2:] == 'EL':
4971
        word = word[:-2]+'L'
4972
    elif word[-1:] == 'H':  # slice, so an empty word cannot raise IndexError
4973
        word = word[:-1]
4974
4975
    return word
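

# Illustrative sketch only: the main loop of reth_schek_phonetik() tries the
# longest replacement keys first at each position and then advances by one
# character. The helper below is hypothetical; it reproduces that scanning
# strategy with a caller-supplied table of the form {key_length: {key: repl}}.
def _sketch_longest_match_replace(word, table, max_len=3):
    """Replace substrings left to right, preferring the longest match."""
    pos = 0
    while pos < len(word):
        for num in range(max_len, 0, -1):
            chunk = word[pos:pos+num]
            if chunk in table.get(num, {}):
                word = word[:pos] + table[num][chunk] + word[pos+num:]
                break
        pos += 1  # Advance whether or not anything was replaced
    return word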
4976
4977
4978
def fonem(word):
4979
    """Return the FONEM code of a word.
4980
4981
    FONEM is a phonetic algorithm designed for French (particularly surnames in
4982
    Saguenay, Canada), defined in :cite:`Bouchard:1981`.
4983
4984
    Guillaume Plique's JavaScript implementation :cite:`Plique:2018` at
4985
    https://github.com/Yomguithereal/talisman/blob/master/src/phonetics/french/fonem.js
4986
    was also consulted for this implementation.
4987
4988
    :param str word: the word to transform
4989
    :returns: the FONEM code
4990
    :rtype: str
4991
    """
4992
    # I don't see a sane way of doing this without regexps :(
4993
    rule_table = {
4994
        # Vowels & groups of vowels
4995
        'V-1':     (re_compile('E?AU'), 'O'),
4996
        'V-2,5':   (re_compile('(E?AU|O)L[TX]$'), 'O'),
4997
        'V-3,4':   (re_compile('E?AU[TX]$'), 'O'),
4998
        'V-6':     (re_compile('E?AUL?D$'), 'O'),
4999
        'V-7':     (re_compile(r'(?<!G)AY$'), 'E'),
5000
        'V-8':     (re_compile('EUX$'), 'EU'),
5001
        'V-9':     (re_compile('EY(?=$|[BCDFGHJKLMNPQRSTVWXZ])'), 'E'),
5002
        'V-10':    ('Y', 'I'),
5003
        'V-11':    (re_compile('(?<=[AEIOUY])I(?=[AEIOUY])'), 'Y'),
5004
        'V-12':    (re_compile('(?<=[AEIOUY])ILL'), 'Y'),
5005
        'V-13':    (re_compile('OU(?=[AEOU]|I(?!LL))'), 'W'),
5006
        'V-14':    (re_compile(r'([AEIOUY])(?=\1)'), ''),
5007
        # Nasal vowels
5008
        'V-15':    (re_compile('[AE]M(?=[BCDFGHJKLMPQRSTVWXZ])(?!$)'), 'EN'),
5009
        'V-16':    (re_compile('OM(?=[BCDFGHJKLMPQRSTVWXZ])'), 'ON'),
5010
        'V-17':    (re_compile('AN(?=[BCDFGHJKLMNPQRSTVWXZ])'), 'EN'),
5011
        'V-18':    (re_compile('(AI[MN]|EIN)(?=[BCDFGHJKLMNPQRSTVWXZ]|$)'),
5012
                    'IN'),
5013
        'V-19':    (re_compile('B(O|U|OU)RNE?$'), 'BURN'),
5014
        'V-20':    (re_compile('(^IM|(?<=[BCDFGHJKLMNPQRSTVWXZ])' +
5015
                               'IM(?=[BCDFGHJKLMPQRSTVWXZ]))'), 'IN'),
5016
        # Consonants and groups of consonants
5017
        'C-1':     ('BV', 'V'),
5018
        'C-2':     (re_compile('(?<=[AEIOUY])C(?=[EIY])'), 'SS'),
5019
        'C-3':     (re_compile('(?<=[BDFGHJKLMNPQRSTVWZ])C(?=[EIY])'), 'S'),
5020
        'C-4':     (re_compile('^C(?=[EIY])'), 'S'),
5021
        'C-5':     (re_compile('^C(?=[OUA])'), 'K'),
5022
        'C-6':     (re_compile('(?<=[AEIOUY])C$'), 'K'),
5023
        'C-7':     (re_compile('C(?=[BDFGJKLMNPQRSTVWXZ])'), 'K'),
5024
        'C-8':     (re_compile('CC(?=[AOU])'), 'K'),
5025
        'C-9':     (re_compile('CC(?=[EIY])'), 'X'),
5026
        'C-10':    (re_compile('G(?=[EIY])'), 'J'),
5027
        'C-11':    (re_compile('GA(?=I?[MN])'), 'G#'),
5028
        'C-12':    (re_compile('GE(O|AU)'), 'JO'),
5029
        'C-13':    (re_compile('GNI(?=[AEIOUY])'), 'GN'),
5030
        'C-14':    (re_compile('(?<![PCS])H'), ''),
5031
        'C-15':    ('JEA', 'JA'),
5032
        'C-16':    (re_compile('^MAC(?=[BCDFGHJKLMNPQRSTVWXZ])'), 'MA#'),
5033
        'C-17':    (re_compile('^MC'), 'MA#'),
5034
        'C-18':    ('PH', 'F'),
5035
        'C-19':    ('QU', 'K'),
5036
        'C-20':    (re_compile('^SC(?=[EIY])'), 'S'),
5037
        'C-21':    (re_compile('(?<=.)SC(?=[EIY])'), 'SS'),
5038
        'C-22':    (re_compile('(?<=.)SC(?=[AOU])'), 'SK'),
5039
        'C-23':    ('SH', 'CH'),
5040
        'C-24':    (re_compile('TIA$'), 'SSIA'),
5041
        'C-25':    (re_compile('(?<=[AIOUY])W'), ''),
5042
        'C-26':    (re_compile('X[CSZ]'), 'X'),
5043
        'C-27':    (re_compile('(?<=[AEIOUY])Z|(?<=[BCDFGHJKLMNPQRSTVWXZ])' +
5044
                               'Z(?=[BCDFGHJKLMNPQRSTVWXZ])'), 'S'),
5045
        'C-28':    (re_compile(r'([BDFGHJKMNPQRTVWXZ])\1'), r'\1'),
5046
        'C-28a':   (re_compile('CC(?=[BCDFGHJKLMNPQRSTVWXZ]|$)'), 'C'),
5047
        'C-28b':   (re_compile('((?<=[BCDFGHJKLMNPQRSTVWXZ])|^)SS'), 'S'),
5048
        'C-28bb':  (re_compile('SS(?=[BCDFGHJKLMNPQRSTVWXZ]|$)'), 'S'),
5049
        'C-28c':   (re_compile('((?<=[^I])|^)LL'), 'L'),
5050
        'C-28d':   (re_compile('ILE$'), 'ILLE'),
5051
        'C-29':    (re_compile('(ILS|[CS]H|[MN]P|R[CFKLNSX])$|([BCDFGHJKL' +
5052
                               'MNPQRSTVWXZ])[BCDFGHJKLMNPQRSTVWXZ]$'),
5053
                    lambda m: (m.group(1) or '') + (m.group(2) or '')),
5054
        'C-30,32': (re_compile('^(SA?INT?|SEI[NM]|CINQ?|ST)(?!E)-?'), 'ST-'),
5055
        'C-31,33': (re_compile('^(SAINTE|STE)-?'), 'STE-'),
5056
        # Rules to undo rule bleeding prevention in C-11, C-16, C-17
5057
        'C-34':    ('G#', 'GA'),
5058
        'C-35':    ('MA#', 'MAC')
5059
    }
5060
    rule_order = [
5061
        'V-14', 'C-28', 'C-28a', 'C-28b', 'C-28bb', 'C-28c', 'C-28d',
5062
        'C-12',
5063
        'C-8', 'C-9', 'C-10',
5064
        'C-16', 'C-17', 'C-2', 'C-3', 'C-7',
5065
        'V-2,5', 'V-3,4', 'V-6',
5066
        'V-1', 'C-14',
5067
        'C-31,33', 'C-30,32',
5068
        'C-11', 'V-15', 'V-17', 'V-18',
5069
        'V-7', 'V-8', 'V-9', 'V-10', 'V-11', 'V-12', 'V-13', 'V-16',
5070
        'V-19', 'V-20',
5071
        'C-1', 'C-4', 'C-5', 'C-6', 'C-13', 'C-15',
5072
        'C-18', 'C-19', 'C-20', 'C-21', 'C-22', 'C-23', 'C-24',
5073
        'C-25', 'C-26', 'C-27',
5074
        'C-29',
5075
        'V-14', 'C-28', 'C-28a', 'C-28b', 'C-28bb', 'C-28c', 'C-28d',
5076
        'C-34', 'C-35'
5077
    ]
5078
5079
    # normalize, upper-case, and filter non-French letters
5080
    word = normalize('NFKD', text_type(word.upper()))
5081
    word = word.translate({198: 'AE', 338: 'OE'})
5082
    word = ''.join(c for c in word if c in
5083
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
5084
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
5085
                    'Y', 'Z', '-'})
5086
5087
    for rule in rule_order:
5088
        regex, repl = rule_table[rule]
5089
        if isinstance(regex, text_type):
5090
            word = word.replace(regex, repl)
5091
        else:
5092
            word = regex.sub(repl, word)
5093
        # print(rule, word)
5094
5095
    return word
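

# Illustrative sketch only: fonem() keeps its rules in a dict and applies them
# in an explicit order, using str.replace for plain strings and regex.sub for
# compiled patterns. The rule names and patterns below are made up for the
# demonstration and are not FONEM rules; Python 3 str is assumed in place of
# text_type.
import re


def _sketch_apply_rules(word, rule_table, rule_order):
    """Apply each named rule in order and return the rewritten word."""
    for name in rule_order:
        pattern, repl = rule_table[name]
        if isinstance(pattern, str):
            word = word.replace(pattern, repl)
        else:
            word = pattern.sub(repl, word)
    return word


_SKETCH_RULES = {
    'demo-1': ('PH', 'F'),                           # plain substring rule
    'demo-2': (re.compile(r'([BDFPT])\1'), r'\1'),   # collapse doubled letter
    'demo-3': (re.compile('E?AU'), 'O'),             # optional-prefix regex
}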
5096
5097
5098
def parmar_kumbharana(word):
5099
    """Return the Parmar-Kumbharana encoding of a word.
5100
5101
    This is based on the phonetic algorithm proposed in :cite:`Parmar:2014`.
5102
5103
    :param str word: the word to transform
    :returns: the Parmar-Kumbharana encoding
    :rtype: str
5105
    """
5106
    rule_table = {4: {'OUGH': 'F'},
5107
                  3: {'DGE': 'J',
5108
                      'OUL': 'U',
5109
                      'GHT': 'T'},
5110
                  2: {'CE': 'S', 'CI': 'S', 'CY': 'S',
5111
                      'GE': 'J', 'GI': 'J', 'GY': 'J',
5112
                      'WR': 'R',
5113
                      'GN': 'N', 'KN': 'N', 'PN': 'N',
5114
                      'CK': 'K',
5115
                      'SH': 'S'}}
5116
    vowel_trans = {65: '', 69: '', 73: '', 79: '', 85: '', 89: ''}
5117
5118
    word = word.upper()  # Rule 3
5119
    word = _delete_consecutive_repeats(word)  # Rule 4
5120
5121
    # Rule 5
5122
    i = 0
5123
    while i < len(word):
5124
        for match_len in range(4, 1, -1):
5125
            if word[i:i+match_len] in rule_table[match_len]:
5126
                repl = rule_table[match_len][word[i:i+match_len]]
5127
                word = (word[:i] + repl + word[i+match_len:])
5128
                i += len(repl)
5129
                break
5130
        else:
5131
            i += 1
5132
5133
    word = word[:1]+word[1:].translate(vowel_trans)  # Rule 6
5134
    return word
5135
5136
5137
def davidson(lname, fname='.', omit_fname=False):
5138
    """Return Davidson's Consonant Code.
5139
5140
    This is based on the name compression system described in
5141
    :cite:`Davidson:1962`.
5142
5143
    :cite:`Dolby:1970` identifies this as having been the name compression
5144
    algorithm used by SABRE.
5145
5146
    :param str lname: Last name (or word) to be encoded
5147
    :param str fname: First name (optional), of which the first character is
5148
        included in the code.
5149
    :param bool omit_fname: Set to True to completely omit the first character
5150
        of the first name
5151
    :return: Davidson's Consonant Code
5152
    """
5153
    trans = {65: '', 69: '', 73: '', 79: '', 85: '', 72: '', 87: '', 89: ''}
5154
5155
    lname = text_type(lname.upper())
5156
    code = _delete_consecutive_repeats(lname[:1] + lname[1:].translate(trans))
5157
    code = code[:4] + (4-len(code))*' '
5158
5159
    if not omit_fname:
5160
        code += fname[:1].upper()
5161
5162
    return code
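

# Illustrative sketch only: several functions in this module call
# _delete_consecutive_repeats(), which is defined elsewhere. The helper below
# is a guess at its behaviour (collapsing runs of identical characters), given
# a hypothetical name so it cannot shadow the real one. The second helper
# mirrors the fixed-width padding used by davidson().
from itertools import groupby


def _sketch_delete_consecutive_repeats(word):
    """Collapse each run of identical characters to a single character."""
    return ''.join(char for char, _ in groupby(word))


def _sketch_pad_code(code, width=4):
    """Truncate to width, or right-pad with spaces up to width."""
    return code[:width].ljust(width)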
5163
5164
5165
def sound_d(word, maxlength=4):
5166
    """Return the SoundD code.
5167
5168
    SoundD is defined in :cite:`Varol:2012`.
5169
5170
    :param str word: the word to transform
5171
    :param int maxlength: the length of the code returned (defaults to 4)
5172
    :returns: the SoundD code
    :rtype: str
5173
    """
5174
    _ref_soundd_translation = dict(zip((ord(_) for _ in
5175
                                        'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
5176
                                       '01230120022455012623010202'))
5177
5178
    word = normalize('NFKD', text_type(word.upper()))
5179
    word = word.replace('ß', 'SS')
5180
    word = ''.join(c for c in word if c in
5181
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
5182
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
5183
                    'Y', 'Z'})
5184
5185
    if word[:2] in {'KN', 'GN', 'PN', 'AC', 'WR'}:
5186
        word = word[1:]
5187
    elif word[:1] == 'X':
5188
        word = 'S'+word[1:]
5189
    elif word[:2] == 'WH':
5190
        word = 'W'+word[2:]
5191
5192
    word = word.replace('DGE', '20').replace('DGI', '20').replace('GH', '0')
5193
5194
    word = word.translate(_ref_soundd_translation)
5195
    word = _delete_consecutive_repeats(word)
5196
    word = word.replace('0', '')
5197
5198
    if maxlength is not None:
5199
        if len(word) < maxlength:
5200
            word += '0' * (maxlength-len(word))
5201
        else:
5202
            word = word[:maxlength]
5203
5204
    return word
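

# Illustrative sketch only: the core of sound_d() is a Soundex-style pipeline
# (letter-to-digit translation, collapsing repeated digits, dropping zeros,
# then padding or trimming to a fixed length). The hypothetical helper below
# shows just those four steps with the same digit table and omits the prefix
# and digraph rules, so its output can differ from sound_d() itself.
def _sketch_digit_code(word, maxlength=4):
    """Return a fixed-length digit code for an ASCII word."""
    table = dict(zip((ord(c) for c in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
                     '01230120022455012623010202'))
    code = word.upper().translate(table)
    # Keep a digit only if it differs from the one immediately before it
    code = ''.join(d for i, d in enumerate(code)
                   if i == 0 or d != code[i-1])
    code = code.replace('0', '')
    return code[:maxlength].ljust(maxlength, '0')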
5205
5206
5207
def pshp_soundex_last(lname, maxlength=4, german=False):
5208
    """Calculate the PSHP Soundex/Viewex Coding of a last name.
5209
5210
    This coding is based on :cite:`Hershberg:1976`.
5211
5212
    Reference was also made to the German version of the same:
5213
    :cite:`Hershberg:1979`.
5214
5215
    A separate function, pshp_soundex_first() is used for first names.
5216
5217
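    :param str lname: the last name to encode
    :param int maxlength: the length of the code returned (defaults to 4)
    :param bool german: set to True if the name is German (different rules
        apply)
    :returns: the PSHP Soundex/Viewex Coding
    :rtype: str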
5220
    """
5221
    lname = normalize('NFKD', text_type(lname.upper()))
5222
    lname = lname.replace('ß', 'SS')
5223
    lname = ''.join(c for c in lname if c in
5224
                    {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K',
5225
                     'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V',
5226
                     'W', 'X', 'Y', 'Z'})
5227
5228
    # A. Prefix treatment
5229
    if lname[:3] == 'VON' or lname[:3] == 'VAN':
5230
        lname = lname[3:].strip()
5231
5232
    # The rule implemented below says "MC, MAC become 1". I believe it meant to
5233
    # say they become M except in German data (where superscripted 1 indicates
5234
    # "except in German data"). It doesn't make sense for them to become 1
5235
    # (BPFV -> 1) or to apply outside German. Unfortunately, both articles have
5236
    # this error(?).
5237
    if not german:
5238
        if lname[:3] == 'MAC':
5239
            lname = 'M'+lname[3:]
5240
        elif lname[:2] == 'MC':
5241
            lname = 'M'+lname[2:]
5242
5243
    # The non-German-only rule to strip ' is unnecessary due to filtering
5244
5245
    if lname[:1] in {'E', 'I', 'O', 'U'}:
5246
        lname = 'A' + lname[1:]
5247
    elif lname[:2] in {'GE', 'GI', 'GY'}:
5248
        lname = 'J' + lname[1:]
5249
    elif lname[:2] in {'CE', 'CI', 'CY'}:
5250
        lname = 'S' + lname[1:]
5251
    elif lname[:3] == 'CHR':
5252
        lname = 'K' + lname[1:]
5253
    elif lname[:1] == 'C' and lname[:2] != 'CH':
5254
        lname = 'K' + lname[1:]
5255
5256
    if lname[:2] == 'KN':
5257
        lname = 'N' + lname[1:]
5258
    elif lname[:2] == 'PH':
5259
        lname = 'F' + lname[1:]
5260
    elif lname[:3] in {'WIE', 'WEI'}:
5261
        lname = 'V' + lname[1:]
5262
5263
    if german and lname[:1] in {'W', 'M', 'Y', 'Z'}:
5264
        lname = {'W': 'V', 'M': 'N', 'Y': 'J', 'Z': 'S'}[lname[0]]+lname[1:]
5265
5266
    code = lname[:1]
5267
5268
    # B. Postfix treatment
5269
    if lname[-1:] == 'R':
5270
        lname = lname[:-1] + 'N'
5271
    elif lname[-2:] in {'SE', 'CE'}:
5272
        lname = lname[:-2]
5273
    if lname[-2:] == 'SS':
5274
        lname = lname[:-2]
5275
    elif lname[-1:] == 'S':
5276
        lname = lname[:-1]
5277
5278
    if not german:
5279
        l5_repl = {'STOWN': 'SAWON', 'MPSON': 'MASON'}
5280
        l4_repl = {'NSEN': 'ASEN', 'MSON': 'ASON', 'STEN': 'SAEN',
5281
                   'STON': 'SAON'}
5282
        if lname[-5:] in l5_repl:
5283
            lname = lname[:-5] + l5_repl[lname[-5:]]
5284
        elif lname[-4:] in l4_repl:
5285
            lname = lname[:-4] + l4_repl[lname[-4:]]
5286
5287
    if lname[-2:] in {'NG', 'ND'}:
5288
        lname = lname[:-1]
5289
    if not german and lname[-3:] in {'GAN', 'GEN'}:
5290
        lname = lname[:-3]+'A'+lname[-2:]
5291
5292
    if german:
5293
        if lname[-3:] == 'TES':
5294
            lname = lname[:-3]
5295
        elif lname[-2:] == 'TS':
5296
            lname = lname[:-2]
5297
        if lname[-3:] == 'TZE':
5298
            lname = lname[:-3]
5299
        elif lname[-2:] == 'ZE':
5300
            lname = lname[:-2]
5301
        if lname[-1:] == 'Z':
5302
            lname = lname[:-1]
5303
        elif lname[-2:] == 'TE':
5304
            lname = lname[:-2]
5305
5306
    # C. Infix Treatment
5307
    lname = lname.replace('CK', 'C')
5308
    lname = lname.replace('SCH', 'S')
5309
    lname = lname.replace('DT', 'T')
5310
    lname = lname.replace('ND', 'N')
5311
    lname = lname.replace('NG', 'N')
5312
    lname = lname.replace('LM', 'M')
5313
    lname = lname.replace('MN', 'M')
5314
    lname = lname.replace('WIE', 'VIE')
5315
    lname = lname.replace('WEI', 'VEI')
5316
5317
    # D. Soundexing
5318
    # code for X & Y are unspecified, but presumably are 2 & 0
5319
    _pshp_translation = dict(zip((ord(_) for _ in
5320
                                  'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
5321
                                 '01230120022455012523010202'))
5322
5323
    lname = lname.translate(_pshp_translation)
5324
    lname = _delete_consecutive_repeats(lname)
5325
5326
    code += lname[1:]
5327
    code = code.replace('0', '')  # rule 1
5328
5329
    if maxlength is not None:
5330
        if len(code) < maxlength:
5331
            code += '0' * (maxlength-len(code))
5332
        else:
5333
            code = code[:maxlength]
5334
5335
    return code
5336
5337
5338
def pshp_soundex_first(fname, maxlength=4, german=False):
5339
    """Calculate the PSHP Soundex/Viewex Coding of a first name.
5340
5341
    This coding is based on :cite:`Hershberg:1976`.
5342
5343
    Reference was also made to the German version of the same:
5344
    :cite:`Hershberg:1979`.
5345
5346
    A separate function, pshp_soundex_last() is used for last names.
5347
5348
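    :param str fname: the first name to encode
    :param int maxlength: the length of the code returned (defaults to 4)
    :param bool german: set to True if the name is German (different rules
        apply)
    :returns: the PSHP Soundex/Viewex Coding
    :rtype: str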
5351
    """
5352
    fname = normalize('NFKD', text_type(fname.upper()))
5353
    fname = fname.replace('ß', 'SS')
5354
    fname = ''.join(c for c in fname if c in
5355
                    {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K',
5356
                     'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V',
5357
                     'W', 'X', 'Y', 'Z'})
5358
5359
    # special rules
5360
    if fname == 'JAMES':
5361
        code = 'J7'
5362
    elif fname == 'PAT':
5363
        code = 'P7'
5364
5365
    else:
5366
        # A. Prefix treatment
5367
        if fname[:2] in {'GE', 'GI', 'GY'}:
5368
            fname = 'J' + fname[1:]
5369
        elif fname[:2] in {'CE', 'CI', 'CY'}:
5370
            fname = 'S' + fname[1:]
5371
        elif fname[:3] == 'CHR':
5372
            fname = 'K' + fname[1:]
5373
        elif fname[:1] == 'C' and fname[:2] != 'CH':
5374
            fname = 'K' + fname[1:]
5375
5376
        if fname[:2] == 'KN':
5377
            fname = 'N' + fname[1:]
5378
        elif fname[:2] == 'PH':
5379
            fname = 'F' + fname[1:]
5380
        elif fname[:3] in {'WIE', 'WEI'}:
5381
            fname = 'V' + fname[1:]
5382
5383
        if german and fname[:1] in {'W', 'M', 'Y', 'Z'}:
5384
            fname = ({'W': 'V', 'M': 'N', 'Y': 'J', 'Z': 'S'}[fname[0]] +
5385
                     fname[1:])
5386
5387
        code = fname[:1]
5388
5389
        # B. Soundex coding
5390
        # code for Y unspecified, but presumably is 0
5391
        _pshp_translation = dict(zip((ord(_) for _ in
5392
                                      'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
5393
                                     '01230120022455012523010202'))
5394
5395
        fname = fname.translate(_pshp_translation)
5396
        fname = _delete_consecutive_repeats(fname)
5397
5398
        code += fname[1:]
5399
        syl_ptr = code.find('0')
5400
        syl2_ptr = code[syl_ptr + 1:].find('0')
5401
        if syl_ptr != -1 and syl2_ptr != -1 and syl2_ptr - syl_ptr > -1:
5402
            code = code[:syl_ptr + 2]
5403
5404
        code = code.replace('0', '')  # rule 1
5405
5406
    if maxlength is not None:
5407
        if len(code) < maxlength:
5408
            code += '0' * (maxlength-len(code))
5409
        else:
5410
            code = code[:maxlength]
5411
5412
    return code
5413
5414
5415
def henry_early(word, maxlength=3):
5416
    """Calculate the early version of the Henry code for a word.
5417
5418
    The early version of Henry coding is given in :cite:`Legare:1972`. This is
5419
    different from the later version defined in :cite:`Henry:1976`.
5420
5421
    :param str word: the word to transform
    :param int maxlength: the length of the code returned (defaults to 3)
    :returns: the early Henry code
    :rtype: str
5424
    """
5425
    _cons = {'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N', 'P', 'Q',
5426
             'R', 'S', 'T', 'V', 'W', 'X', 'Z'}
5427
    _vows = {'A', 'E', 'I', 'O', 'U', 'Y'}
5428
    _diph = {'AI': 'E', 'AY': 'E', 'EI': 'E', 'AU': 'O', 'OI': 'O', 'OU': 'O',
5429
             'EU': 'U'}
5430
    _unaltered = {'B', 'D', 'F', 'J', 'K', 'L', 'M', 'N', 'R', 'T', 'V'}
5431
    _simple = {'W': 'V', 'X': 'S', 'Z': 'S'}
5432
5433
    word = normalize('NFKD', text_type(word.upper()))
5434
    word = ''.join(c for c in word if c in
5435
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
5436
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
5437
                    'Y', 'Z'})
5438
5439
    if not word:
5440
        return ''
5441
5442
    # Rule Ia seems to be covered entirely in II
5443
5444
    # Rule Ib
5445
    if word[0] in _vows:
5446
        # Ib1
5447
        if (((word[1:2] in _cons-{'M', 'N'} and word[2:3] in _cons) or
5448
             (word[1:2] in _cons and word[2:3] not in _cons))):
5449
            if word[0] == 'Y':
5450
                word = 'I'+word[1:]
5451
        # Ib2
5452
        elif word[1:2] in {'M', 'N'} and word[2:3] in _cons:
5453
            if word[0] == 'E':
5454
                word = 'A'+word[1:]
5455
            elif word[0] in {'I', 'U', 'Y'}:
5456
                word = 'E'+word[1:]
5457
        # Ib3
5458
        elif word[:2] in _diph:
5459
            word = _diph[word[:2]]+word[2:]
5460
        # Ib4
5461
        elif word[1:2] in _vows and word[0] == 'Y':
5462
            word = 'I' + word[1:]
5463
5464
    code = ''
5465
    skip = 0
5466
5467
    # Rule II
5468
    for pos, char in enumerate(word):
5469
        nxch = word[pos+1:pos+2]  # next character ('' at end of word)
        prev = word[pos-1:pos]  # previous character ('' at start of word)
5471
5472
        if skip:
5473
            skip -= 1
5474
        elif char in _vows:
5475
            code += char
5476
        # IIc
5477
        elif char == nxch:
5478
            skip = 1
5479
            code += char
5480
        elif word[pos:pos+2] in {'CQ', 'DT', 'SC'}:
5481
            skip = 1
5482
            code += word[pos+1]
5483
        # IId
5484
        elif char == 'H' and prev in _cons:
5485
            continue
5486
        elif char == 'S' and nxch in _cons:
5487
            continue
5488
        elif char in _cons-{'L', 'R'} and nxch in _cons-{'L', 'R'}:
5489
            continue
5490
        elif char == 'L' and nxch in {'M', 'N'}:
5491
            continue
5492
        elif char in {'M', 'N'} and prev in _vows and nxch in _cons:
5493
            continue
5494
        # IIa
5495
        elif char in _unaltered:
5496
            code += char
5497
        # IIb
5498
        elif char in _simple:
5499
            code += _simple[char]
5500
        elif char in {'C', 'G', 'P', 'Q', 'S'}:
5501
            if char == 'C':
5502
                if nxch in {'A', 'O', 'U', 'L', 'R'}:
5503
                    code += 'K'
5504
                elif nxch in {'E', 'I', 'Y'}:
5505
                    code += 'J'
5506
                elif nxch == 'H':
5507
                    if word[pos+2:pos+3] in _vows:
5508
                        code += 'C'
5509
                    elif word[pos+2:pos+3] in {'R', 'L'}:
5510
                        code += 'K'
5511
            elif char == 'G':
5512
                if nxch in {'A', 'O', 'U', 'L', 'R'}:
5513
                    code += 'G'
5514
                elif nxch in {'E', 'I', 'Y'}:
5515
                    code += 'J'
5516
                elif nxch == 'N':
5517
                    code += 'N'
5518
            elif char == 'P':
5519
                if nxch != 'H':
5520
                    code += 'P'
5521
                else:
5522
                    code += 'F'
5523
            elif char == 'Q':
5524
                if word[pos+1:pos+3] in {'UE', 'UI', 'UY'}:
                    code += 'G'
                elif word[pos+1:pos+3] in {'UA', 'UO'}:
                    code += 'K'
5528
            elif char == 'S':
5529
                if word[pos:pos+6] == 'SAINTE':
5530
                    code += 'X'
5531
                    skip = 5
5532
                elif word[pos:pos+5] == 'SAINT':
5533
                    code += 'X'
5534
                    skip = 4
5535
                elif word[pos:pos+3] == 'STE':
5536
                    code += 'X'
5537
                    skip = 2
5538
                elif word[pos:pos+2] == 'ST':
5539
                    code += 'X'
5540
                    skip = 1
5541
                else:
5542
                    code += 'S'
5543
        else:  # this should not be possible
5544
            continue
5545
5546
    # IIe1
5547
    if code[-4:] in {'AULT', 'EULT', 'OULT'}:
5548
        code = code[:-2]
5549
    elif code[-4:-3] in _vows and code[-3:] == 'MPS':
5550
        code = code[:-3]
5551
    elif code[-3:-2] in _vows and code[-2:] in {'MB', 'MP', 'ND', 'NS', 'NT'}:
5552
        code = code[:-2]
5553
    elif code[-2:-1] == 'R' and code[-1:] in _cons:
5554
        code = code[:-1]
5555
    # IIe2
5556
    elif code[-2:-1] in _vows and code[-1:] in {'D', 'M', 'N', 'S', 'T'}:
5557
        code = code[:-1]
5558
    elif code[-2:] == 'ER':
5559
        code = code[:-1]
5560
5561
    # Drop non-initial vowels
5562
    code = code[:1]+code[1:].translate({65: '', 69: '', 73: '', 79: '', 85: '',
5563
                                        89: ''})
5564
5565
    if maxlength is not None:
5566
        code = code[:maxlength]
5567
5568
    return code
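
The "Drop non-initial vowels" step above builds its translation table from raw code points (65 is 'A', 69 is 'E', and so on). A minimal sketch of the same step follows, assuming Python 3 only; the function name `drop_non_initial_vowels` is hypothetical and not part of the module.

```python
def drop_non_initial_vowels(code):
    """Keep the first character; delete A, E, I, O, U, Y from the rest."""
    # The third argument of str.maketrans lists characters to delete,
    # which avoids spelling out the ordinals 65, 69, 73, 79, 85, 89.
    vowel_deleter = str.maketrans('', '', 'AEIOUY')
    return code[:1] + code[1:].translate(vowel_deleter)


print(drop_non_initial_vowels('ABEILLE'))  # prints ABLL
```

Slicing with `code[:1]` rather than `code[0]` keeps the function safe on an empty string.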
5569
5570
5571
def norphone(word):
5572
    """Return the Norphone code.
5573
5574
    The reference implementation by Lars Marius Garshol is available in
5575
    :cite:`Garshol:2015`.
5576
5577
    Norphone was designed for Norwegian, but this implementation has been
5578
    extended to support Swedish vowels as well. This function incorporates
5579
    the "not implemented" rules from the above file's rule set.
5580
5581
    :param word: the word to transform
    :return: the Norphone code
5583
    """
5584
    _vowels = {'A', 'E', 'I', 'O', 'U', 'Y', 'Å', 'Æ', 'Ø', 'Ä', 'Ö'}
5585
5586
    replacements = {4: {'SKEI': 'X'},
5587
                    3: {'SKJ': 'X', 'KEI': 'X'},
5588
                    2: {'CH': 'K', 'CK': 'K', 'GJ': 'J', 'GH': 'K', 'HG': 'K',
5589
                        'HJ': 'J', 'HL': 'L', 'HR': 'R', 'KJ': 'X', 'KI': 'X',
5590
                        'LD': 'L', 'ND': 'N', 'PH': 'F', 'TH': 'T', 'SJ': 'X'},
5591
                    1: {'W': 'V', 'X': 'KS', 'Z': 'S', 'D': 'T', 'G': 'K'}}
5592
5593
    word = word.upper()
5594
5595
    code = ''
5596
    skip = 0
5597
5598
    if word[0:2] == 'AA':
5599
        code = 'Å'
5600
        skip = 2
5601
    elif word[0:2] == 'GI':
5602
        code = 'J'
5603
        skip = 2
5604
    elif word[0:3] == 'SKY':
5605
        code = 'X'
5606
        skip = 3
5607
    elif word[0:2] == 'EI':
5608
        code = 'Æ'
5609
        skip = 2
5610
    elif word[0:2] == 'KY':
5611
        code = 'X'
5612
        skip = 2
5613
    elif word[:1] == 'C':
5614
        code = 'K'
5615
        skip = 1
5616
    elif word[:1] == 'Ä':
5617
        code = 'Æ'
5618
        skip = 1
5619
    elif word[:1] == 'Ö':
5620
        code = 'Ø'
5621
        skip = 1
5622
5623
    if word[-2:] == 'DT':
5624
        word = word[:-2]+'T'
5625
    # Though the rules indicate this rule applies in all positions, the
5626
    # reference implementation indicates it applies only in final position.
5627
    elif word[-2:-1] in _vowels and word[-1:] == 'D':
5628
        word = word[:-2]
5629
5630
    for pos, char in enumerate(word):
5631
        if skip:
5632
            skip -= 1
5633
        else:
5634
            for length in sorted(replacements, reverse=True):
5635
                if word[pos:pos+length] in replacements[length]:
5636
                    code += replacements[length][word[pos:pos+length]]
5637
                    skip = length-1
5638
                    break
5639
            else:
5640
                if not pos or char not in _vowels:
5641
                    code += char
5642
5643
    code = _delete_consecutive_repeats(code)
5644
5645
    return code
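
The main loop of norphone() tries the longest replacement keys first and falls back to shorter ones. The sketch below isolates that longest-match-first lookup. It is a simplification: it leaves out the prefix rules and the vowel dropping, the name `encode_longest_first` is hypothetical, and `TABLE` is only a small subset of the replacements shown above.

```python
def encode_longest_first(word, replacements):
    """Rewrite word left to right, preferring the longest matching key."""
    lengths = sorted(replacements, reverse=True)
    out = []
    pos = 0
    while pos < len(word):
        for length in lengths:
            chunk = word[pos:pos + length]
            if chunk in replacements[length]:
                out.append(replacements[length][chunk])
                pos += length
                break
        else:
            # No table entry starts here: copy the character unchanged.
            out.append(word[pos])
            pos += 1
    return ''.join(out)


# A small subset of the replacement table, keyed by key length.
TABLE = {3: {'SKJ': 'X'}, 2: {'KJ': 'X', 'PH': 'F'}, 1: {'W': 'V'}}
print(encode_longest_first('SKJWPH', TABLE))  # prints XVF
```

Advancing `pos` by the matched length plays the same role as the `skip` counter in the original loop.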
5646
5647
5648
def dolby(word, maxlength=None, keep_vowels=False, vowel_char='*'):
5649
    r"""Return the Dolby Code of a name.
5650
5651
    This follows "A Spelling Equivalent Abbreviation Algorithm For Personal
5652
    Names" from :cite:`Dolby:1970` and :cite:`Cunningham:1969`.
5653
5654
    :param word: the word to encode
5655
    :param maxlength: maximum length of the returned Dolby code -- this also
5656
        activates the fixed-length code mode
5657
    :param keep_vowels: if True, retains all vowel markers
5658
    :param vowel_char: the vowel marker character (defaults to \*)
5659
    :return: the Dolby Code
5660
    """
5661
    _vowels = {'A', 'E', 'I', 'O', 'U', 'Y'}
5662
5663
    # uppercase, normalize, decompose, and filter non-A-Z out
5664
    word = normalize('NFKD', text_type(word.upper()))
5665
    word = word.replace('ß', 'SS')
5666
    word = ''.join(c for c in word if c in
5667
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
5668
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
5669
                    'Y', 'Z'})
5670
5671
    # Rule 1 (FL2)
5672
    if word[:3] in {'MCG', 'MAG', 'MAC'}:
5673
        word = 'MK'+word[3:]
5674
    elif word[:2] == 'MC':
5675
        word = 'MK'+word[2:]
5676
5677
    # Rule 2 (FL3)
5678
    pos = len(word)-2
5679
    while pos > -1:
5680
        if word[pos:pos+2] in {'DT', 'LD', 'ND', 'NT', 'RC', 'RD', 'RT', 'SC',
5681
                               'SK', 'ST'}:
5682
            word = word[:pos+1]+word[pos+2:]
5683
            pos += 1
5684
        pos -= 1
5685
5686
    # Rule 3 (FL4)
5687
    # Although the rule indicates "after the first letter", the test cases make
5688
    # it clear that these apply to the first letter also.
5689
    word = word.replace('X', 'KS')
5690
    word = word.replace('CE', 'SE')
5691
    word = word.replace('CI', 'SI')
5692
    word = word.replace('CY', 'SI')
5693
5694
    # not in the rule set, but they seem to have intended it
5695
    word = word.replace('TCH', 'CH')
5696
5697
    pos = word.find('CH', 1)
5698
    while pos != -1:
5699
        if word[pos-1:pos] not in _vowels:
5700
            word = word[:pos]+'S'+word[pos+1:]
5701
        pos = word.find('CH', pos+1)
5702
5703
    word = word.replace('C', 'K')
5704
    word = word.replace('Z', 'S')
5705
5706
    word = word.replace('WR', 'R')
5707
    word = word.replace('DG', 'G')
5708
    word = word.replace('QU', 'K')
5709
    word = word.replace('T', 'D')
5710
    word = word.replace('PH', 'F')
5711
5712
    # Rule 4 (FL5)
5713
    # Although the rule indicates "after the first letter", the test cases make
5714
    # it clear that these apply to the first letter also.
5715
    pos = word.find('K', 0)
5716
    while pos != -1:
5717
        if pos > 1 and word[pos-1:pos] not in _vowels | {'L', 'N', 'R'}:
5718
            word = word[:pos-1]+word[pos:]
5719
            pos -= 1
5720
        pos = word.find('K', pos+1)
5721
5722
    # Rule FL6
5723
    if maxlength and word[-1:] == 'E':
5724
        word = word[:-1]
5725
5726
    # Rule 5 (FL7)
5727
    word = _delete_consecutive_repeats(word)
5728
5729
    # Rule 6 (FL8)
5730
    if word[:2] == 'PF':
5731
        word = word[1:]
5732
    if word[-2:] == 'PF':
5733
        word = word[:-1]
5734
    elif word[-2:] == 'GH':
5735
        if word[-3:-2] in _vowels:
5736
            word = word[:-2]+'F'
5737
        else:
5738
            word = word[:-2]+'G'
5739
    word = word.replace('GH', '')
5740
5741
    # Rule FL9
5742
    if maxlength:
5743
        word = word.replace('V', 'F')
5744
5745
    # Rules 7-9 (FL10-FL12)
5746
    first = 1 + (1 if maxlength else 0)
5747
    code = ''
5748
    for pos, char in enumerate(word):
5749
        if char in _vowels:
5750
            if first or keep_vowels:
5751
                code += vowel_char
5752
                first -= 1
5753
            else:
5754
                continue
5755
        elif pos > 0 and char in {'W', 'H'}:
5756
            continue
5757
        else:
5758
            code += char
5759
5760
    if maxlength:
5761
        # Rule FL13
5762
        if len(code) > maxlength and code[-1:] == 'S':
5763
            code = code[:-1]
5764
        if keep_vowels:
5765
            code = code[:maxlength]
5766
        else:
5767
            # Rule FL14
5768
            code = code[:maxlength + 2]
5769
            # Rule FL15
5770
            while len(code) > maxlength:
5771
                vowels = len(code) - maxlength
5772
                excess = vowels - 1
5773
                word = code
5774
                code = ''
5775
                for char in word:
5776
                    if char == vowel_char:
5777
                        if vowels:
5778
                            code += char
5779
                            vowels -= 1
5780
                    else:
5781
                        code += char
5782
                code = code[:maxlength + excess]
5783
5784
        # Rule FL16
5785
        code += ' ' * (maxlength - len(code))
5786
5787
    return code
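
Both norphone() and dolby() call `_delete_consecutive_repeats`, which is defined elsewhere in the module and not shown in this listing. The sketch below is an assumption about its behaviour, inferred from its name and from its use in Rule 5 (FL7): each run of identical adjacent characters collapses to one character. The unprefixed name marks it as a stand-in, not the module's own helper.

```python
from itertools import groupby


def delete_consecutive_repeats(word):
    """Collapse each run of identical adjacent characters to one character."""
    # groupby yields one (key, run) pair per run of equal neighbours,
    # so keeping only the keys removes the repeats.
    return ''.join(char for char, _run in groupby(word))


print(delete_consecutive_repeats('AABBBC'))  # prints ABC
```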
5788
5789
5790
def phonetic_spanish(word, maxlength=None):
5791
    """Return the PhoneticSpanish coding of word.
5792
5793
    This follows the coding described in :cite:`Amon:2012` and
5794
    :cite:`delPilarAngeles:2015`.
5795
5796
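    :param word: the word to transform
    :param maxlength: the maximum length of the returned code (defaults to
        unlimited)
    :return: the PhoneticSpanish code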
5798
    """
5799
    _es_soundex_translation = dict(zip((ord(_) for _ in
5800
                                        'BCDFGHJKLMNPQRSTVXYZ'),
5801
                                       '14328287566079431454'))
5802
5803
    # uppercase, normalize, and decompose, filter to A-Z minus vowels & W
5804
    word = normalize('NFKD', text_type(word.upper()))
5805
    word = ''.join(c for c in word if c in
5806
                   {'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N',
5807
                    'P', 'Q', 'R', 'S', 'T', 'V', 'X', 'Y', 'Z'})
5808
5809
    # merge repeated Ls & Rs
5810
    word = word.replace('LL', 'L')
5811
    word = word.replace('RR', 'R')
5812
5813
    # apply the Soundex algorithm
5814
    sdx = word.translate(_es_soundex_translation)
5815
5816
    if maxlength:
5817
        sdx = sdx[:maxlength]
5818
5819
    return sdx
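
The steps of phonetic_spanish() can be sketched as a self-contained function. This is an illustration under stated assumptions, not the module's implementation: it is Python 3 only, uses `str.maketrans` instead of `dict(zip(...))`, and follows the comment's stated intent of merging repeated Ls and Rs. The name `phonetic_spanish_sketch` is hypothetical; the letter and digit strings are copied from the table above.

```python
import unicodedata

CONSONANTS = 'BCDFGHJKLMNPQRSTVXYZ'
DIGITS = '14328287566079431454'
ES_TABLE = str.maketrans(CONSONANTS, DIGITS)


def phonetic_spanish_sketch(word, maxlength=None):
    """Encode word with the consonant-to-digit table shown above."""
    # Decompose accented letters so the base letter survives the filter.
    word = unicodedata.normalize('NFKD', word.upper())
    word = ''.join(c for c in word if c in CONSONANTS)
    # Merge doubled L and R before translating.
    word = word.replace('LL', 'L').replace('RR', 'R')
    code = word.translate(ES_TABLE)
    return code[:maxlength] if maxlength else code


print(phonetic_spanish_sketch('Perez'))      # prints 094
print(phonetic_spanish_sketch('Gutierrez'))  # prints 8394
```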
5820
5821
5822
def spanish_metaphone(word, maxlength=6, modified=False):
5823
    """Return the Spanish Metaphone of a word.
5824
5825
    This is a quick rewrite of the Spanish Metaphone Algorithm, as presented at
5826
    https://github.com/amsqr/Spanish-Metaphone and discussed in
5827
    :cite:`Mosquera:2012`.
5828
5829
    Modified version based on :cite:`delPilarAngeles:2016`.
5830
5831
    :param word: the word to transform
    :param maxlength: the maximum length of the returned key (defaults to 6)
5833
    :param modified: Set to True to use del Pilar Angeles & Bailón-Miguel's
5834
        modified version of the algorithm
5835
    :return: the Spanish Metaphone key
5836
    """
5837
    def _is_vowel(pos):
5838
        """Return True if the character at word[pos] is a vowel."""
5839
        if pos < len(word) and word[pos] in {'A', 'E', 'I', 'O', 'U'}:
5840
            return True
5841
        return False
5842
5843
    word = normalize('NFC', text_type(word.upper()))
5844
5845
    meta_key = ''
5846
    pos = 0
5847
5848
    # do some replacements for the modified version
5849
    if modified:
5850
        word = word.replace('MB', 'NB')
5851
        word = word.replace('MP', 'NP')
5852
        word = word.replace('BS', 'S')
5853
        if word[:2] == 'PS':
5854
            word = word[1:]
5855
5856
    # simple replacements
5857
    word = word.replace('Á', 'A')
5858
    word = word.replace('CH', 'X')
5859
    word = word.replace('Ç', 'S')
5860
    word = word.replace('É', 'E')
5861
    word = word.replace('Í', 'I')
5862
    word = word.replace('Ó', 'O')
5863
    word = word.replace('Ú', 'U')
5864
    word = word.replace('Ñ', 'NY')
5865
    word = word.replace('GÜ', 'W')
5866
    word = word.replace('Ü', 'U')
5867
    word = word.replace('B', 'V')
5868
    word = word.replace('LL', 'Y')
5869
5870
    while len(meta_key) < maxlength:
5871
        if pos >= len(word):
5872
            break
5873
5874
        # get the next character
5875
        current_char = word[pos]
5876
5877
        # if a vowel in pos 0, add to key
5878
        if _is_vowel(pos) and pos == 0:
5879
            meta_key += current_char
5880
            pos += 1
5881
        # otherwise, do consonant rules
5882
        else:
5883
            # simple consonants (unmutated)
5884
            if current_char in {'D', 'F', 'J', 'K', 'M', 'N', 'P', 'T', 'V',
5885
                                'L', 'Y'}:
5886
                meta_key += current_char
5887
                # skip doubled consonants
5888
                if word[pos+1:pos+2] == current_char:
5889
                    pos += 2
5890
                else:
5891
                    pos += 1
5892
            else:
5893
                if current_char == 'C':
5894
                    # special case 'acción', 'reacción',etc.
5895
                    if word[pos+1:pos+2] == 'C':
5896
                        meta_key += 'X'
5897
                        pos += 2
5898
                    # special case 'cesar', 'cien', 'cid', 'conciencia'
5899
                    elif word[pos+1:pos+2] in {'E', 'I'}:
5900
                        meta_key += 'Z'
5901
                        pos += 2
5902
                    # base case
5903
                    else:
5904
                        meta_key += 'K'
5905
                        pos += 1
5906
                elif current_char == 'G':
5907
                    # special case 'gente', 'ecologia',etc
5908
                    if word[pos + 1:pos + 2] in {'E', 'I'}:
5909
                        meta_key += 'J'
5910
                        pos += 2
5911
                    # base case
5912
                    else:
5913
                        meta_key += 'G'
5914
                        pos += 1
5915
                elif current_char == 'H':
5916
                    # since the letter 'H' is silent in Spanish,
5917
                    # set the meta key to the vowel after the letter 'H'
5918
                    if _is_vowel(pos+1):
5919
                        meta_key += word[pos+1]
5920
                        pos += 2
5921
                    else:
5922
                        meta_key += 'H'
5923
                        pos += 1
5924
                elif current_char == 'Q':
5925
                    if word[pos+1:pos+2] == 'U':
5926
                        pos += 2
5927
                    else:
5928
                        pos += 1
5929
                    meta_key += 'K'
5930
                elif current_char == 'W':
5931
                    meta_key += 'U'
5932
                    pos += 1
5933
                elif current_char == 'R':
5934
                    meta_key += 'R'
5935
                    pos += 1
5936
                elif current_char == 'S':
5937
                    if not _is_vowel(pos+1) and pos == 0:
5938
                        meta_key += 'ES'
5939
                        pos += 1
5940
                    else:
5941
                        meta_key += 'S'
5942
                        pos += 1
5943
                elif current_char == 'Z':
5944
                    meta_key += 'Z'
5945
                    pos += 1
5946
                elif current_char == 'X':
5947
                    if len(word) > 1 and pos == 0 and not _is_vowel(pos+1):
5948
                        meta_key += 'EX'
5949
                        pos += 1
5950
                    else:
5951
                        meta_key += 'X'
5952
                        pos += 1
5953
                else:
5954
                    pos += 1
5955
5956
    # Final change from S to Z in modified version
5957
    if modified:
5958
        meta_key = meta_key.replace('S', 'Z')
5959
5960
    return meta_key
5961
5962
5963
def metasoundex(word, language='en'):
5964
    """Return the MetaSoundex code for a word.
5965
5966
    This is based on :cite:`Koneru:2017`.
5967
5968
    :param word: the word to transform
5969
    :param language: either 'en' for English or 'es' for Spanish
5970
    :return: the MetaSoundex code
5971
    """
5972
    _metasoundex_translation = dict(zip((ord(_) for _ in
5973
                                         'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
5974
                                        '07430755015866075943077514'))
5975
5976
    if language == 'es':
5977
        return phonetic_spanish(spanish_metaphone(word))
5978
5979
    word = soundex(metaphone(word))
5980
    word = word[0].translate(_metasoundex_translation)+word[1:]
5981
5982
    return word
5983
5984
5985
def soundex_br(word, maxlength=4, zero_pad=True):
5986
    """Return the SoundexBR encoding of a word.
5987
5988
    This is based on :cite:`Marcelino:2015`.
5989
5990
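    :param word: the word to transform
    :param maxlength: the length of the returned code (defaults to 4)
    :param zero_pad: if True, pad the end of the code with 0s up to
        maxlength
    :return: the SoundexBR code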
5992
    """
5993
    _soundex_br_translation = dict(zip((ord(_) for _ in
5994
                                        'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
5995
                                       '01230120022455012623010202'))
5996
5997
    word = normalize('NFKD', text_type(word.upper()))
5998
    word = ''.join(c for c in word if c in
5999
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
6000
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
6001
                    'Y', 'Z'})
6002
6003
    if word[:2] == 'WA':
6004
        first = 'V'
6005
    elif word[:1] == 'K' and word[1:2] in {'A', 'O', 'U'}:
6006
        first = 'C'
6007
    elif word[:1] == 'C' and word[1:2] in {'I', 'E'}:
6008
        first = 'S'
6009
    elif word[:1] == 'G' and word[1:2] in {'E', 'I'}:
6010
        first = 'J'
6011
    elif word[:1] == 'Y':
6012
        first = 'I'
6013
    elif word[:1] == 'H':
6014
        first = word[1:2]
6015
        word = word[1:]
6016
    else:
6017
        first = word[:1]
6018
6019
    sdx = first + word[1:].translate(_soundex_br_translation)
6020
    sdx = _delete_consecutive_repeats(sdx)
6021
    sdx = sdx.replace('0', '')
6022
6023
    if zero_pad:
6024
        sdx += ('0'*maxlength)
6025
6026
    return sdx[:maxlength]
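
The last two steps of soundex_br() give a fixed-width code by appending `maxlength` zeros and then truncating. A minimal sketch of that pad-then-truncate idiom follows; the helper name `pad_and_truncate` is hypothetical and does not exist in the module.

```python
def pad_and_truncate(code, maxlength=4, zero_pad=True):
    """Return code cut to maxlength, right-padded with '0' if requested."""
    if zero_pad:
        # Appending maxlength zeros guarantees enough characters to slice,
        # whatever the current length of code is.
        code += '0' * maxlength
    return code[:maxlength]


print(pad_and_truncate('O4'))                  # prints O400
print(pad_and_truncate('A12345'))              # prints A123
print(pad_and_truncate('O4', zero_pad=False))  # prints O4
```

Padding unconditionally and slicing afterwards avoids computing how many zeros are missing.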
6027
6028
6029
def nrl(word):
6030
    """Return the Naval Research Laboratory phonetic encoding of a word.
6031
6032
    This is defined by :cite:`Elovitz:1976`.
6033
6034
    :param word: the word to transform
    :return: the NRL phonetic encoding
6036
    """
6037
6038
    def to_regex(pattern, left=True):
6039
        new_pattern = ''
6040
        replacements = {'#': '[AEIOU]+',
6041
                        ':': '[BCDFGHJKLMNPQRSTVWXYZ]*',
6042
                        '^': '[BCDFGHJKLMNPQRSTVWXYZ]',
6043
                        '.': '[BDVGJLMNTWZ]',
6044
                        '%': '(ER|E|ES|ED|ING|ELY)',
6045
                        '+': '[EIY]',
6046
                        ' ': '^'}
6047
        for char in pattern:
6048
            new_pattern += (replacements[char] if char in replacements
6049
                            else char)
6050
6051
        if left:
6052
            new_pattern += '$'
6053
            if '^' not in pattern:
6054
                new_pattern = '^.*' + new_pattern
6055
        else:
6056
            new_pattern = '^' + new_pattern.replace('^', '$')
6057
            if '$' not in new_pattern:
6058
                new_pattern += '.*$'
6059
6060
        return new_pattern
6061
6062
    rules = {' ': (('', ' ', '', ' '),
6063
                   ('', '-', '', ''),
6064
                   ('.', '\'S', '', 'z'),
6065
                   ('#:.E', '\'S', '', 'z'),
6066
                   ('#', '\'S', '', 'z'),
6067
                   ('', '\'', '', ''),
6068
                   ('', ',', '', ' '),
6069
                   ('', '.', '', ' '),
6070
                   ('', '?', '', ' '),
6071
                   ('', '!', '', ' ')),
6072
             'A': (('', 'A', ' ', 'AX'),
6073
                   (' ', 'ARE', ' ', 'AAr'),
6074
                   (' ', 'AR', 'O', 'AXr'),
6075
                   ('', 'AR', '#', 'EHr'),
6076
                   ('^', 'AS', '#', 'EYs'),
6077
                   ('', 'A', 'WA', 'AX'),
6078
                   ('', 'AW', '', 'AO'),
6079
                   (' :', 'ANY', '', 'EHnIY'),
6080
                   ('', 'A', '^+#', 'EY'),
6081
                   ('#:', 'ALLY', '', 'AXlIY'),
6082
                   (' ', 'AL', '#', 'AXl'),
6083
                   ('', 'AGAIN', '', 'AXgEHn'),
6084
                   ('#:', 'AG', 'E', 'IHj'),
6085
                   ('', 'A', '^+:#', 'AE'),
6086
                   (' :', 'A', '^+ ', 'EY'),
6087
                   ('', 'A', '^%', 'EY'),
6088
                   (' ', 'ARR', '', 'AXr'),
6089
                   ('', 'ARR', '', 'AEr'),
6090
                   (' :', 'AR', ' ', 'AAr'),
6091
                   ('', 'AR', ' ', 'ER'),
6092
                   ('', 'AR', '', 'AAr'),
6093
                   ('', 'AIR', '', 'EHr'),
6094
                   ('', 'AI', '', 'EY'),
6095
                   ('', 'AY', '', 'EY'),
6096
                   ('', 'AU', '', 'AO'),
6097
                   ('#:', 'AL', ' ', 'AXl'),
6098
                   ('#:', 'ALS', ' ', 'AXlz'),
6099
                   ('', 'ALK', '', 'AOk'),
6100
                   ('', 'AL', '^', 'AOl'),
6101
                   (' :', 'ABLE', '', 'EYbAXl'),
6102
                   ('', 'ABLE', '', 'AXbAXl'),
6103
                   ('', 'ANG', '+', 'EYnj'),
6104
                   ('', 'A', '', 'AE')),
6105
             'B': ((' ', 'BE', '^#', 'bIH'),
6106
                   ('', 'BEING', '', 'bIYIHNG'),
6107
                   (' ', 'BOTH', ' ', 'bOWTH'),
6108
                   (' ', 'BUS', '#', 'bIHz'),
6109
                   ('', 'BUIL', '', 'bIHl'),
6110
                   ('', 'B', '', 'b')),
6111
             'C': ((' ', 'CH', '^', 'k'),
6112
                   ('^E', 'CH', '', 'k'),
6113
                   ('', 'CH', '', 'CH'),
6114
                   (' S', 'CI', '#', 'sAY'),
6115
                   ('', 'CI', 'A', 'SH'),
6116
                   ('', 'CI', 'O', 'SH'),
6117
                   ('', 'CI', 'EN', 'SH'),
6118
                   ('', 'C', '+', 's'),
6119
                   ('', 'CK', '', 'k'),
6120
                   ('', 'COM', '%', 'kAHm'),
6121
                   ('', 'C', '', 'k')),
6122
             'D': (('#:', 'DED', ' ', 'dIHd'),
6123
                   ('.E', 'D', ' ', 'd'),
6124
                   ('#:^E', 'D', ' ', 't'),
6125
                   (' ', 'DE', '^#', 'dIH'),
6126
                   (' ', 'DO', ' ', 'dUW'),
6127
                   (' ', 'DOES', '', 'dAHz'),
6128
                   (' ', 'DOING', '', 'dUWIHNG'),
6129
                   (' ', 'DOW', '', 'dAW'),
6130
                   ('', 'DU', 'A', 'jUW'),
6131
                   ('', 'D', '', 'd')),
6132
             'E': (('#:', 'E', ' ', ''),
6133
                   ('\':^', 'E', ' ', ''),
6134
                   (' :', 'E', ' ', 'IY'),
6135
                   ('#', 'ED', ' ', 'd'),
6136
                   ('#:', 'E', 'D ', ''),
6137
                   ('', 'EV', 'ER', 'EHv'),
6138
                   ('', 'E', '^%', 'IY'),
6139
                   ('', 'ERI', '#', 'IYrIY'),
6140
                   ('', 'ERI', '', 'EHrIH'),
6141
                   ('#:', 'ER', '#', 'ER'),
6142
                   ('', 'ER', '#', 'EHr'),
6143
                   ('', 'ER', '', 'ER'),
6144
                   (' ', 'EVEN', '', 'IYvEHn'),
6145
                   ('#:', 'E', 'W', ''),
6146
                   ('T', 'EW', '', 'UW'),
6147
                   ('S', 'EW', '', 'UW'),
6148
                   ('R', 'EW', '', 'UW'),
6149
                   ('D', 'EW', '', 'UW'),
6150
                   ('L', 'EW', '', 'UW'),
6151
                   ('Z', 'EW', '', 'UW'),
6152
                   ('N', 'EW', '', 'UW'),
6153
                   ('J', 'EW', '', 'UW'),
6154
                   ('TH', 'EW', '', 'UW'),
6155
                   ('CH', 'EW', '', 'UW'),
6156
                   ('SH', 'EW', '', 'UW'),
6157
                   ('', 'EW', '', 'yUW'),
6158
                   ('', 'E', 'O', 'IY'),
6159
                   ('#:S', 'ES', ' ', 'IHz'),
6160
                   ('#:C', 'ES', ' ', 'IHz'),
6161
                   ('#:G', 'ES', ' ', 'IHz'),
6162
                   ('#:Z', 'ES', ' ', 'IHz'),
6163
                   ('#:X', 'ES', ' ', 'IHz'),
6164
                   ('#:J', 'ES', ' ', 'IHz'),
6165
                   ('#:CH', 'ES', ' ', 'IHz'),
6166
                   ('#:SH', 'ES', ' ', 'IHz'),
6167
                   ('#:', 'E', 'S ', ''),
6168
                   ('#:', 'ELY', ' ', 'lIY'),
6169
                   ('#:', 'EMENT', '', 'mEHnt'),
6170
                   ('', 'EFUL', '', 'fUHl'),
6171
                   ('', 'EE', '', 'IY'),
6172
                   ('', 'EARN', '', 'ERn'),
6173
                   (' ', 'EAR', '^', 'ER'),
6174
                   ('', 'EAD', '', 'EHd'),
6175
                   ('#:', 'EA', ' ', 'IYAX'),
6176
                   ('', 'EA', 'SU', 'EH'),
6177
                   ('', 'EA', '', 'IY'),
6178
                   ('', 'EIGH', '', 'EY'),
6179
                   ('', 'EI', '', 'IY'),
6180
                   (' ', 'EYE', '', 'AY'),
6181
                   ('', 'EY', '', 'IY'),
6182
                   ('', 'EU', '', 'yUW'),
6183
                   ('', 'E', '', 'EH')),
6184
             'F': (('', 'FUL', '', 'fUHl'),
6185
                   ('', 'F', '', 'f')),
6186
             'G': (('', 'GIV', '', 'gIHv'),
6187
                   (' ', 'G', 'I^', 'g'),
6188
                   ('', 'GE', 'T', 'gEH'),
6189
                   ('SU', 'GGES', '', 'gjEHs'),
6190
                   ('', 'GG', '', 'g'),
6191
                   (' B#', 'G', '', 'g'),
6192
                   ('', 'G', '+', 'j'),
6193
                   ('', 'GREAT', '', 'grEYt'),
6194
                   ('#', 'GH', '', ''),
6195
                   ('', 'G', '', 'g')),
6196
             'H': ((' ', 'HAV', '', 'hAEv'),
6197
                   (' ', 'HERE', '', 'hIYr'),
6198
                   (' ', 'HOUR', '', 'AWER'),
6199
                   ('', 'HOW', '', 'hAW'),
6200
                   ('', 'H', '#', 'h'),
6201
                   ('', 'H', '', '')),
6202
             'I': ((' ', 'IN', '', 'IHn'),
6203
                   (' ', 'I', ' ', 'AY'),
6204
                   ('', 'IN', 'D', 'AYn'),
6205
                   ('', 'IER', '', 'IYER'),
6206
                   ('#:R', 'IED', '', 'IYd'),
6207
                   ('', 'IED', ' ', 'AYd'),
6208
                   ('', 'IEN', '', 'IYEHn'),
6209
                   ('', 'IE', 'T', 'AYEH'),
6210
                   (' :', 'I', '%', 'AY'),
6211
                   ('', 'I', '%', 'IY'),
6212
                   ('', 'IE', '', 'IY'),
6213
                   ('', 'I', '^+:#', 'IH'),
6214
                   ('', 'IR', '#', 'AYr'),
6215
                   ('', 'IZ', '%', 'AYz'),
6216
                   ('', 'IS', '%', 'AYz'),
6217
                   ('', 'I', 'D%', 'AY'),
6218
                   ('+^', 'I', '^+', 'IH'),
6219
                   ('', 'I', 'T%', 'AY'),
6220
                   ('#:^', 'I', '^+', 'IH'),
6221
                   ('', 'I', '^+', 'AY'),
6222
                   ('', 'IR', '', 'ER'),
6223
                   ('', 'IGH', '', 'AY'),
6224
                   ('', 'ILD', '', 'AYld'),
6225
                   ('', 'IGN', ' ', 'AYn'),
6226
                   ('', 'IGN', '^', 'AYn'),
6227
                   ('', 'IGN', '%', 'AYn'),
6228
                   ('', 'IQUE', '', 'IYk'),
6229
                   ('', 'I', '', 'IH')),
6230
             'J': (('', 'J', '', 'j'),),
6231
             'K': ((' ', 'K', 'N', ''),
6232
                   ('', 'K', '', 'k')),
6233
             'L': (('', 'LO', 'C#', 'lOW'),
6234
                   ('L', 'L', '', ''),
6235
                   ('#:^', 'L', '%', 'AXl'),
6236
                   ('', 'LEAD', '', 'lIYd'),
6237
                   ('', 'L', '', 'l')),
6238
             'M': (('', 'MOV', '', 'mUWv'),
6239
                   ('', 'M', '', 'm')),
6240
             'N': (('E', 'NG', '+', 'nj'),
6241
                   ('', 'NG', 'R', 'NGg'),
6242
                   ('', 'NG', '#', 'NGg'),
6243
                   ('', 'NGL', '%', 'NGgAXl'),
6244
                   ('', 'NG', '', 'NG'),
6245
                   ('', 'NK', '', 'NGk'),
6246
                   (' ', 'NOW', ' ', 'nAW'),
6247
                   ('', 'N', '', 'n')),
6248
             'O': (('', 'OF', ' ', 'AXv'),
6249
                   ('', 'OROUGH', '', 'EROW'),
6250
                   ('#:', 'OR', ' ', 'ER'),
6251
                   ('#:', 'ORS', ' ', 'ERz'),
6252
                   ('', 'OR', '', 'AOr'),
                   (' ', 'ONE', '', 'wAHn'),
                   ('', 'OW', '', 'OW'),
                   (' ', 'OVER', '', 'OWvER'),
                   ('', 'OV', '', 'AHv'),
                   ('', 'O', '^%', 'OW'),
                   ('', 'O', '^EN', 'OW'),
                   ('', 'O', '^I#', 'OW'),
                   ('', 'OL', 'D', 'OWl'),
                   ('', 'OUGHT', '', 'AOt'),
                   ('', 'OUGH', '', 'AHf'),
                   (' ', 'OU', '', 'AW'),
                   ('H', 'OU', 'S#', 'AW'),
                   ('', 'OUS', '', 'AXs'),
                   ('', 'OUR', '', 'AOr'),
                   ('', 'OULD', '', 'UHd'),
                   ('^', 'OU', '^L', 'AH'),
                   ('', 'OUP', '', 'UWp'),
                   ('', 'OU', '', 'AW'),
                   ('', 'OY', '', 'OY'),
                   ('', 'OING', '', 'OWIHNG'),
                   ('', 'OI', '', 'OY'),
                   ('', 'OOR', '', 'AOr'),
                   ('', 'OOK', '', 'UHk'),
                   ('', 'OOD', '', 'UHd'),
                   ('', 'OO', '', 'UW'),
                   ('', 'O', 'E', 'OW'),
                   ('', 'O', ' ', 'OW'),
                   ('', 'OA', '', 'OW'),
                   (' ', 'ONLY', '', 'OWnlIY'),
                   (' ', 'ONCE', '', 'wAHns'),
                   ('', 'ON\'T', '', 'OWnt'),
                   ('C', 'O', 'N', 'AA'),
                   ('', 'O', 'NG', 'AO'),
                   (' :^', 'O', 'N', 'AH'),
                   ('I', 'ON', '', 'AXn'),
                   ('#:', 'ON', ' ', 'AXn'),
                   ('#^', 'ON', '', 'AXn'),
                   ('', 'O', 'ST ', 'OW'),
                   ('', 'OF', '^', 'AOf'),
                   ('', 'OTHER', '', 'AHDHER'),
                   ('', 'OSS', ' ', 'AOs'),
                   ('#:^', 'OM', '', 'AHm'),
                   ('', 'O', '', 'AA')),
             'P': (('', 'PH', '', 'f'),
                   ('', 'PEOP', '', 'pIYp'),
                   ('', 'POW', '', 'pAW'),
                   ('', 'PUT', ' ', 'pUHt'),
                   ('', 'P', '', 'p')),
             'Q': (('', 'QUAR', '', 'kwAOr'),
                   ('', 'QU', '', 'kw'),
                   ('', 'Q', '', 'k')),
             'R': ((' ', 'RE', '^#', 'rIY'),
                   ('', 'R', '', 'r')),
             'S': (('', 'SH', '', 'SH'),
                   ('#', 'SION', '', 'ZHAXn'),
                   ('', 'SOME', '', 'sAHm'),
                   ('#', 'SUR', '#', 'ZHER'),
                   ('', 'SUR', '#', 'SHER'),
                   ('#', 'SU', '#', 'ZHUW'),
                   ('#', 'SSU', '#', 'SHUW'),
                   ('#', 'SED', ' ', 'zd'),
                   ('#', 'S', '#', 'z'),
                   ('', 'SAID', '', 'sEHd'),
                   ('^', 'SION', '', 'SHAXn'),
                   ('', 'S', 'S', ''),
                   ('.', 'S', ' ', 'z'),
                   ('#:.E', 'S', ' ', 'z'),
                   ('#:^##', 'S', ' ', 'z'),
                   ('#:^#', 'S', ' ', 's'),
                   ('U', 'S', ' ', 's'),
                   (' :#', 'S', ' ', 'z'),
                   (' ', 'SCH', '', 'sk'),
                   ('', 'S', 'C+', ''),
                   ('#', 'SM', '', 'zm'),
                   ('#', 'SN', '\'', 'zAXn'),
                   ('', 'S', '', 's')),
             'T': ((' ', 'THE', ' ', 'DHAX'),
                   ('', 'TO', ' ', 'tUW'),
                   ('', 'THAT', ' ', 'DHAEt'),
                   (' ', 'THIS', ' ', 'DHIHs'),
                   (' ', 'THEY', '', 'DHEY'),
                   (' ', 'THERE', '', 'DHEHr'),
                   ('', 'THER', '', 'DHER'),
                   ('', 'THEIR', '', 'DHEHr'),
                   (' ', 'THAN', ' ', 'DHAEn'),
                   (' ', 'THEM', ' ', 'DHEHm'),
                   ('', 'THESE', ' ', 'DHIYz'),
                   (' ', 'THEN', '', 'DHEHn'),
                   ('', 'THROUGH', '', 'THrUW'),
                   ('', 'THOSE', '', 'DHOWz'),
                   ('', 'THOUGH', ' ', 'DHOW'),
                   (' ', 'THUS', '', 'DHAHs'),
                   ('', 'TH', '', 'TH'),
                   ('#:', 'TED', ' ', 'tIHd'),
                   ('S', 'TI', '#N', 'CH'),
                   ('', 'TI', 'O', 'SH'),
                   ('', 'TI', 'A', 'SH'),
                   ('', 'TIEN', '', 'SHAXn'),
                   ('', 'TUR', '#', 'CHER'),
                   ('', 'TU', 'A', 'CHUW'),
                   (' ', 'TWO', '', 'tUW'),
                   ('', 'T', '', 't')),
             'U': ((' ', 'UN', 'I', 'yUWn'),
                   (' ', 'UN', '', 'AHn'),
                   (' ', 'UPON', '', 'AXpAOn'),
                   ('T', 'UR', '#', 'UHr'),
                   ('S', 'UR', '#', 'UHr'),
                   ('R', 'UR', '#', 'UHr'),
                   ('D', 'UR', '#', 'UHr'),
                   ('L', 'UR', '#', 'UHr'),
                   ('Z', 'UR', '#', 'UHr'),
                   ('N', 'UR', '#', 'UHr'),
                   ('J', 'UR', '#', 'UHr'),
                   ('TH', 'UR', '#', 'UHr'),
                   ('CH', 'UR', '#', 'UHr'),
                   ('SH', 'UR', '#', 'UHr'),
                   ('', 'UR', '#', 'yUHr'),
                   ('', 'UR', '', 'ER'),
                   ('', 'U', '^ ', 'AH'),
                   ('', 'U', '^^', 'AH'),
                   ('', 'UY', '', 'AY'),
                   (' G', 'U', '#', ''),
                   ('G', 'U', '%', ''),
                   ('G', 'U', '#', 'w'),
                   ('#N', 'U', '', 'yUW'),
                   ('T', 'U', '', 'UW'),
                   ('S', 'U', '', 'UW'),
                   ('R', 'U', '', 'UW'),
                   ('D', 'U', '', 'UW'),
                   ('L', 'U', '', 'UW'),
                   ('Z', 'U', '', 'UW'),
                   ('N', 'U', '', 'UW'),
                   ('J', 'U', '', 'UW'),
                   ('TH', 'U', '', 'UW'),
                   ('CH', 'U', '', 'UW'),
                   ('SH', 'U', '', 'UW'),
                   ('', 'U', '', 'yUW')),
             'V': (('', 'VIEW', '', 'vyUW'),
                   ('', 'V', '', 'v')),
             'W': ((' ', 'WERE', '', 'wER'),
                   ('', 'WA', 'S', 'wAA'),
                   ('', 'WA', 'T', 'wAA'),
                   ('', 'WHERE', '', 'WHEHr'),
                   ('', 'WHAT', '', 'WHAAt'),
                   ('', 'WHOL', '', 'hOWl'),
                   ('', 'WHO', '', 'hUW'),
                   ('', 'WH', '', 'WH'),
                   ('', 'WAR', '', 'wAOr'),
                   ('', 'WOR', '^', 'wER'),
                   ('', 'WR', '', 'r'),
                   ('', 'W', '', 'w')),
             'X': (('', 'X', '', 'ks'),),
             'Y': (('', 'YOUNG', '', 'yAHNG'),
                   (' ', 'YOU', '', 'yUW'),
                   (' ', 'YES', '', 'yEHs'),
                   (' ', 'Y', '', 'y'),
                   ('#:^', 'Y', ' ', 'IY'),
                   ('#:^', 'Y', 'I', 'IY'),
                   (' :', 'Y', ' ', 'AY'),
                   (' :', 'Y', '#', 'AY'),
                   (' :', 'Y', '^+:#', 'IH'),
                   (' :', 'Y', '^#', 'AY'),
                   ('', 'Y', '', 'IH')),
             'Z': (('', 'Z', '', 'z'),)}

    word = word.upper()

    pron = ''
    pos = 0
    while pos < len(word):
        left_orig = word[:pos]
        right_orig = word[pos:]
        first = word[pos] if word[pos] in rules else ' '
        for rule in rules[first]:
            left, match, right, out = rule
            if right_orig.startswith(match):
                # Bind both patterns on every path. An empty context always
                # matches, so its placeholder pattern is never evaluated.
                l_pattern = to_regex(left, left=True) if left else ''
                r_pattern = to_regex(right, left=False) if right else ''
                if ((not left or re_match(l_pattern, left_orig)) and
                        (not right or
                         re_match(r_pattern, right_orig[len(match):]))):
                    pron += out
                    pos += len(match)
                    break
        else:
            # No rule applied: copy the character through unchanged.
            pron += word[pos]
            pos += 1

    return pron


def bmpm(word, language_arg=0, name_mode='gen', match_mode='approx',
         concat=False, filter_langs=False):
    """Return the Beider-Morse Phonetic Matching algorithm code for a word.

    The Beider-Morse Phonetic Matching algorithm is described in
    :cite:`Beider:2008`.
    The reference implementation is licensed under GPLv3.

    :param str word: the word to transform
    :param str language_arg: the language of the term; supported values
        include:

            - 'any'
            - 'arabic'
            - 'cyrillic'
            - 'czech'
            - 'dutch'
            - 'english'
            - 'french'
            - 'german'
            - 'greek'
            - 'greeklatin'
            - 'hebrew'
            - 'hungarian'
            - 'italian'
            - 'polish'
            - 'portuguese'
            - 'romanian'
            - 'russian'
            - 'spanish'
            - 'turkish'
            - 'germandjsg'
            - 'polishdjskp'
            - 'russiandjsre'

    :param str name_mode: the name mode of the algorithm:

            - 'gen' -- general (default)
            - 'ash' -- Ashkenazi
            - 'sep' -- Sephardic

    :param str match_mode: matching mode: 'approx' or 'exact'
    :param bool concat: concatenation mode
    :param bool filter_langs: filter out incompatible languages
    :returns: the BMPM value(s), separated by spaces
    :rtype: str

    >>> bmpm('Christopher')
    'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
    xristYfir xristopi xritopir xritopi xristofi xritofir xritofi tzristopir
    tzristofir zristopir zristopi zritopir zritopi zristofir zristofi zritofir
    zritofi'
    >>> bmpm('Niall')
    'nial niol'
    >>> bmpm('Smith')
    'zmit'
    >>> bmpm('Schmidt')
    'zmit stzmit'

    >>> bmpm('Christopher', language_arg='German')
    'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
    xristYfir'
    >>> bmpm('Christopher', language_arg='English')
    'tzristofir tzrQstofir tzristafir tzrQstafir xristofir xrQstofir xristafir
    xrQstafir'
    >>> bmpm('Christopher', language_arg='German', name_mode='ash')
    'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
    xristYfir'

    >>> bmpm('Christopher', language_arg='German', match_mode='exact')
    'xriStopher xriStofer xristopher xristofer'
    """
    return _bmpm(word, language_arg, name_mode, match_mode,
                 concat, filter_langs)


if __name__ == '__main__':
    import doctest
    doctest.testmod()
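
The rule-driven loop that ends just before `bmpm` (first matching rule wins, with optional left and right contexts, and unmatched characters copied through) can be illustrated in isolation. The following is a minimal sketch, not the module's implementation: `SYMBOLS`, `context_to_regex`, `transcribe`, and the reduced `RULES` table are hypothetical names introduced here, the symbol meanings are assumed from the usual letter-to-sound rule conventions, and the `%` and `.` context symbols are deliberately left out. The P, Q, R, X, and Z rules are copied from the listing.

```python
import re

# Assumed meanings of the context symbols (illustrative subset only).
CONSONANT = '[BCDFGHJKLMNPQRSTVWXYZ]'
SYMBOLS = {
    '#': '[AEIOU]+',       # one or more vowels
    ':': CONSONANT + '*',  # zero or more consonants
    '^': CONSONANT,        # exactly one consonant
    '+': '[EIY]',          # one front vowel
}


def context_to_regex(context, left):
    """Translate a rule context into a regular expression (hypothetical)."""
    # A space is a word boundary: start of word on the left, end on the right.
    boundary = '^' if left else '$'
    pattern = ''.join(boundary if ch == ' ' else SYMBOLS.get(ch, re.escape(ch))
                      for ch in context)
    # A left context must end exactly where the matched text begins.
    return pattern + '$' if left else pattern


# A few rules from the listing: (left context, match, right context, output).
RULES = {
    'P': (('', 'PH', '', 'f'),
          ('', 'P', '', 'p')),
    'Q': (('', 'QU', '', 'kw'),
          ('', 'Q', '', 'k')),
    'R': ((' ', 'RE', '^#', 'rIY'),
          ('', 'R', '', 'r')),
    'X': (('', 'X', '', 'ks'),),
    'Z': (('', 'Z', '', 'z'),),
}


def transcribe(word, rules):
    """Apply the first matching rule at each position of the word."""
    word = word.upper()
    out = []
    pos = 0
    while pos < len(word):
        before, rest = word[:pos], word[pos:]
        for left, match, right, phones in rules.get(word[pos], ()):
            if not rest.startswith(match):
                continue
            after = rest[len(match):]
            # Contexts are only compiled and tested when they are non-empty.
            if left and not re.search(context_to_regex(left, True), before):
                continue
            if right and not re.match(context_to_regex(right, False), after):
                continue
            out.append(phones)
            pos += len(match)
            break
        else:
            # No rule applied: copy the character through unchanged.
            out.append(word[pos])
            pos += 1
    return ''.join(out)


print(transcribe('REPAZ', RULES))  # prints rIYpAz
print(transcribe('PREP', RULES))   # prints prEp
```

In `REPAZ` the word-initial `RE` rule fires because a consonant and a vowel follow it; in `PREP` the same rule is rejected by its left context (the `R` is not word-initial), so the fallback `R` rule applies and the `E`, which has no rules in this reduced table, is copied through. Testing each context only when it is non-empty also avoids the possibly-undefined-variable pattern that static analysers flag in loops of this shape.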