Completed
Push — master (c05634...e2d099)
by Chris
created 10:30

abydos.phonetic.phonex()   F

Complexity

Conditions 33

Size

Total Lines 107
Code Lines 61

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric Value
cc 33
eloc 61
nop 3
dl 0
loc 107
rs 0
c 0
b 0
f 0

How to fix

Long Method

Small methods make your code easier to understand, particularly when combined with a good name. And if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, that is usually a sign that the commented part should be extracted into a new method, with the comment serving as a starting point for naming it.

Commonly applied refactorings include Extract Method, as sketched below.
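
For illustration, here is a minimal sketch of Extract Method; the function and helper names are hypothetical and not taken from the abydos codebase:

# Before: one function, with comments marking its distinct steps.
def encode_sketch(word):
    # normalize: uppercase and keep only ASCII letters
    word = ''.join(c for c in word.upper() if 'A' <= c <= 'Z')
    # collapse consecutive repeated characters
    collapsed = ''
    for char in word:
        if not collapsed or collapsed[-1] != char:
            collapsed += char
    return collapsed

# After: each commented step becomes a helper named after its comment.
def _normalize(word):
    """Uppercase the word and keep only ASCII letters."""
    return ''.join(c for c in word.upper() if 'A' <= c <= 'Z')

def _collapse_repeats(word):
    """Collapse consecutive repeated characters."""
    collapsed = ''
    for char in word:
        if not collapsed or collapsed[-1] != char:
            collapsed += char
    return collapsed

def encode_sketch(word):
    return _collapse_repeats(_normalize(word))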

Complexity

Complex functions like abydos.phonetic.phonex() often do a lot of different things. To break such a function down, we need to identify a cohesive component within it. A common approach to finding such a component is to look for variables and statements that share the same prefixes or suffixes.

Once you have determined which pieces belong together, you can apply the Extract Class refactoring. If the component makes sense as a subclass, Extract Subclass is also a candidate, and is often faster.
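
In a function like this one, where the Conditions count comes mostly from long if/elif chains over single characters, a complementary option is to move the branching into a lookup table and dispatch through a small helper. The sketch below is illustrative only (table-driven dispatch rather than Extract Class), and is not the abydos implementation:

# Hypothetical sketch: a data table replaces a chain of elif branches.
_CHAR_CODES = {
    'B': '1', 'P': '1',
    'D': '2', 'T': '2',
    'F': '3', 'V': '3', 'W': '3',
    'G': '4', 'K': '4', 'Q': '4',
    'L': '5',
    'M': '6', 'N': '6',
    'R': '7',
    'S': '8', 'Z': '8',
}

def code_char(char, default='0'):
    """Return the code for one character: a single lookup instead of many branches."""
    return _CHAR_CODES.get(char, default)

def code_word(word):
    """Code every character of an uppercase word via the table."""
    return ''.join(code_char(c) for c in word)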

1
# -*- coding: utf-8 -*-
Issue (coding-style): Too many lines in module (5272/1000).
2
3
# Copyright 2014-2018 by Christopher C. Little.
4
# This file is part of Abydos.
5
#
6
# Abydos is free software: you can redistribute it and/or modify
7
# it under the terms of the GNU General Public License as published by
8
# the Free Software Foundation, either version 3 of the License, or
9
# (at your option) any later version.
10
#
11
# Abydos is distributed in the hope that it will be useful,
12
# but WITHOUT ANY WARRANTY; without even the implied warranty of
13
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
# GNU General Public License for more details.
15
#
16
# You should have received a copy of the GNU General Public License
17
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.
18
19
"""abydos.phonetic.
20
21
The phonetic module implements phonetic algorithms including:
22
23
    - Robert C. Russell's Index
24
    - American Soundex
25
    - Refined Soundex
26
    - Daitch-Mokotoff Soundex
27
    - Kölner Phonetik
28
    - NYSIIS
29
    - Match Rating Algorithm
30
    - Metaphone
31
    - Double Metaphone
32
    - Caverphone
33
    - Alpha Search Inquiry System
34
    - Fuzzy Soundex
35
    - Phonex
36
    - Phonem
37
    - Phonix
38
    - SfinxBis
39
    - phonet
40
    - Standardized Phonetic Frequency Code
41
    - Statistics Canada
42
    - Lein
43
    - Roger Root
44
    - Oxford Name Compression Algorithm (ONCA)
45
    - Eudex phonetic hash
46
    - Haase Phonetik
47
    - Reth-Schek Phonetik
48
    - FONEM
49
    - Parmar-Kumbharana
50
    - Beider-Morse Phonetic Matching
51
"""
52
53
from __future__ import division, unicode_literals
54
55
import re
56
import unicodedata
57
from collections import Counter
58
from itertools import groupby, product
59
60
from six import text_type
61
from six.moves import range
62
63
from ._bm import _bmpm
64
65
_INFINITY = float('inf')
66
67
68
def _delete_consecutive_repeats(word):
69
    """Delete consecutive repeated characters in a word.
70
71
    :param str word: the word to transform
72
    :returns: word with consecutive repeating characters collapsed to
73
        a single instance
74
    :rtype: str
75
    """
76
    return ''.join(char for char, _ in groupby(word))
77
78
79
def russell_index(word):
80
    """Return the Russell Index (integer output) of a word.
81
82
    This follows Robert C. Russell's Index algorithm, as described in
83
    US Patent 1,261,167 (1917)
84
85
    :param str word: the word to transform
86
    :returns: the Russell Index value
87
    :rtype: int
88
89
    >>> russell_index('Christopher')
90
    3813428
91
    >>> russell_index('Niall')
92
    715
93
    >>> russell_index('Smith')
94
    3614
95
    >>> russell_index('Schmidt')
96
    3614
97
    """
98
    _russell_translation = dict(zip((ord(_) for _ in
Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
99
                                     'ABCDEFGIKLMNOPQRSTUVXYZ'),
100
                                    '12341231356712383412313'))
101
102
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
103
    word = word.replace('ß', 'SS')
104
    word = word.replace('GH', '')  # discard gh (rule 3)
105
    word = word.rstrip('SZ')  # discard /[sz]$/ (rule 3)
106
107
    # translate according to Russell's mapping
108
    word = ''.join(c for c in word if c in
109
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'I', 'K', 'L', 'M', 'N',
110
                    'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Y', 'Z'})
111
    sdx = word.translate(_russell_translation)
112
113
    # remove any 1s after the first occurrence
114
    one = sdx.find('1')+1
115
    if one:
116
        sdx = sdx[:one] + ''.join(c for c in sdx[one:] if c != '1')
117
118
    # remove repeating characters
119
    sdx = _delete_consecutive_repeats(sdx)
120
121
    # return as an int
122
    return int(sdx) if sdx else float('NaN')
123
124
125
def russell_index_num_to_alpha(num):
126
    """Convert the Russell Index integer to an alphabetic string.
127
128
    This follows Robert C. Russell's Index algorithm, as described in
129
    US Patent 1,261,167 (1917)
130
131
    :param int num: a Russell Index integer value
132
    :returns: the Russell Index as an alphabetic string
133
    :rtype: str
134
135
    >>> russell_index_num_to_alpha(3813428)
136
    'CRACDBR'
137
    >>> russell_index_num_to_alpha(715)
138
    'NAL'
139
    >>> russell_index_num_to_alpha(3614)
140
    'CMAD'
141
    """
142
    _russell_num_translation = dict(zip((ord(_) for _ in '12345678'),
Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
143
                                        'ABCDLMNR'))
144
    num = ''.join(c for c in text_type(num) if c in {'1', '2', '3', '4', '5',
145
                                                     '6', '7', '8'})
146
    if num:
147
        return num.translate(_russell_num_translation)
148
    return ''
149
150
151
def russell_index_alpha(word):
152
    """Return the Russell Index (alphabetic output) for the word.
153
154
    This follows Robert C. Russell's Index algorithm, as described in
155
    US Patent 1,261,167 (1917)
156
157
    :param str word: the word to transform
158
    :returns: the Russell Index value as an alphabetic string
159
    :rtype: str
160
161
    >>> russell_index_alpha('Christopher')
162
    'CRACDBR'
163
    >>> russell_index_alpha('Niall')
164
    'NAL'
165
    >>> russell_index_alpha('Smith')
166
    'CMAD'
167
    >>> russell_index_alpha('Schmidt')
168
    'CMAD'
169
    """
170
    if word:
171
        return russell_index_num_to_alpha(russell_index(word))
172
    return ''
173
174
175
def soundex(word, maxlength=4, var='American', reverse=False, zero_pad=True):
176
    """Return the Soundex code for a word.
177
178
    :param str word: the word to transform
179
    :param int maxlength: the length of the code returned (defaults to 4)
180
    :param str var: the variant of the algorithm to employ (defaults to
181
        'American'):
182
183
        - 'American' follows the American Soundex algorithm, as described at
184
          http://www.archives.gov/publications/general-info-leaflets/55-census.html
185
          and in Knuth(1998:394); this is also called Miracode
186
        - 'special' follows the rules from the 1880-1910 US Census
187
          retrospective re-analysis, in which h & w are not treated as blocking
188
          consonants but as vowels.
189
          Cf. http://creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm
190
        - 'Census' follows the rules laid out in GIL 55 by the US Census,
191
          including coding prefixed and unprefixed versions of some names
192
193
    :param bool reverse: reverse the word before computing the selected Soundex
194
        (defaults to False); This results in "Reverse Soundex"
195
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
196
        maxlength string
197
    :returns: the Soundex value
198
    :rtype: str
199
200
    >>> soundex("Christopher")
201
    'C623'
202
    >>> soundex("Niall")
203
    'N400'
204
    >>> soundex('Smith')
205
    'S530'
206
    >>> soundex('Schmidt')
207
    'S530'
208
209
210
    >>> soundex('Christopher', maxlength=_INFINITY)
211
    'C623160000000000000000000000000000000000000000000000000000000000'
212
    >>> soundex('Christopher', maxlength=_INFINITY, zero_pad=False)
213
    'C62316'
214
215
    >>> soundex('Christopher', reverse=True)
216
    'R132'
217
218
    >>> soundex('Ashcroft')
219
    'A261'
220
    >>> soundex('Asicroft')
221
    'A226'
222
    >>> soundex('Ashcroft', var='special')
223
    'A226'
224
    >>> soundex('Asicroft', var='special')
225
    'A226'
226
    """
227
    _soundex_translation = dict(zip((ord(_) for _ in
Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
228
                                     'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
229
                                    '01230129022455012623019202'))
230
231
    # Require a maxlength of at least 4 and not more than 64
232
    if maxlength is not None:
233
        maxlength = min(max(4, maxlength), 64)
234
    else:
235
        maxlength = 64
236
237
    # uppercase, normalize, decompose, and filter non-A-Z out
238
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
239
    word = word.replace('ß', 'SS')
240
241
    if var == 'Census':
242
        # Should these prefixes be supplemented? (VANDE, DELA, VON)
243
        if word[:3] in {'VAN', 'CON'} and len(word) > 4:
244
            return (soundex(word, maxlength, 'American', reverse, zero_pad),
245
                    soundex(word[3:], maxlength, 'American', reverse,
246
                            zero_pad))
247
        if word[:2] in {'DE', 'DI', 'LA', 'LE'} and len(word) > 3:
248
            return (soundex(word, maxlength, 'American', reverse, zero_pad),
249
                    soundex(word[2:], maxlength, 'American', reverse,
250
                            zero_pad))
251
        # Otherwise, proceed as usual (var='American' mode, ostensibly)
252
253
    word = ''.join(c for c in word if c in
254
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
255
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
256
                    'Y', 'Z'})
257
258
    # Nothing to convert, return base case
259
    if not word:
260
        if zero_pad:
261
            return '0'*maxlength
262
        return '0'
263
264
    # Reverse word if computing Reverse Soundex
265
    if reverse:
266
        word = word[::-1]
267
268
    # apply the Soundex algorithm
269
    sdx = word.translate(_soundex_translation)
270
271
    if var == 'special':
272
        sdx = sdx.replace('9', '0')  # special rule for 1880-1910 census
273
    else:
274
        sdx = sdx.replace('9', '')  # rule 1
275
    sdx = _delete_consecutive_repeats(sdx)  # rule 3
276
277
    if word[0] in 'HW':
278
        sdx = word[0] + sdx
279
    else:
280
        sdx = word[0] + sdx[1:]
281
    sdx = sdx.replace('0', '')  # rule 1
282
283
    if zero_pad:
284
        sdx += ('0'*maxlength)  # rule 4
285
286
    return sdx[:maxlength]
287
288
289
def refined_soundex(word, maxlength=_INFINITY, reverse=False, zero_pad=False,
290
                    retain_vowels=False):
291
    """Return the Refined Soundex code for a word.
292
293
    This is Soundex, but with more character classes. It was defined by
294
    Carolyn B. Boyce:
295
    https://web.archive.org/web/20010513121003/http://www.bluepoof.com:80/Soundex/info2.html
296
297
    :param word: the word to transform
298
    :param maxlength: the length of the code returned (defaults to unlimited)
299
    :param reverse: reverse the word before computing the selected Soundex
300
        (defaults to False); This results in "Reverse Soundex"
301
    :param zero_pad: pad the end of the return value with 0s to achieve a
302
        maxlength string
303
    :param retain_vowels: retain vowels (as 0) in the resulting code
304
    :returns: the Refined Soundex value
305
    :rtype: str
306
307
    >>> refined_soundex('Christopher')
308
    'C3090360109'
309
    >>> refined_soundex('Niall')
310
    'N807'
311
    >>> refined_soundex('Smith')
312
    'S38060'
313
    >>> refined_soundex('Schmidt')
314
    'S30806'
315
    """
316
    _ref_soundex_translation = dict(zip((ord(_) for _ in
Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
317
                                         'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
318
                                        '01360240043788015936020505'))
319
320
    # uppercase, normalize, decompose, and filter non-A-Z out
321
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
322
    word = word.replace('ß', 'SS')
323
    word = ''.join(c for c in word if c in
324
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
325
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
326
                    'Y', 'Z'})
327
328
    # Reverse word if computing Reverse Soundex
329
    if reverse:
330
        word = word[::-1]
331
332
    # apply the Soundex algorithm
333
    sdx = word[0] + word.translate(_ref_soundex_translation)
334
    sdx = _delete_consecutive_repeats(sdx)
335
    if not retain_vowels:
336
        sdx = sdx.replace('0', '')  # Delete vowels, H, W, Y
337
338
    if maxlength < _INFINITY:
339
        if zero_pad:
340
            sdx += ('0' * maxlength)
341
        if maxlength:
342
            sdx = sdx[:maxlength]
343
344
    return sdx
345
346
347
def dm_soundex(word, maxlength=6, reverse=False, zero_pad=True):
348
    """Return the Daitch-Mokotoff Soundex code for a word.
349
350
    Returns values of a word as a set. A collection is necessary since there
351
    can be multiple values for a single word.
352
353
    :param word: the word to transform
354
    :param maxlength: the length of the code returned (defaults to 6)
355
    :param reverse: reverse the word before computing the selected Soundex
356
        (defaults to False); This results in "Reverse Soundex"
357
    :param zero_pad: pad the end of the return value with 0s to achieve a
358
        maxlength string
359
    :returns: the Daitch-Mokotoff Soundex value
360
    :rtype: str
361
362
    >>> dm_soundex('Christopher')
363
    {'494379', '594379'}
364
    >>> dm_soundex('Niall')
365
    {'680000'}
366
    >>> dm_soundex('Smith')
367
    {'463000'}
368
    >>> dm_soundex('Schmidt')
369
    {'463000'}
370
371
    >>> dm_soundex('The quick brown fox', maxlength=20, zero_pad=False)
372
    {'35457976754', '3557976754'}
373
    """
374
    _dms_table = {'STCH': (2, 4, 4), 'DRZ': (4, 4, 4), 'ZH': (4, 4, 4),
375
                  'ZHDZH': (2, 4, 4), 'DZH': (4, 4, 4), 'DRS': (4, 4, 4),
376
                  'DZS': (4, 4, 4), 'SCHTCH': (2, 4, 4), 'SHTSH': (2, 4, 4),
377
                  'SZCZ': (2, 4, 4), 'TZS': (4, 4, 4), 'SZCS': (2, 4, 4),
378
                  'STSH': (2, 4, 4), 'SHCH': (2, 4, 4), 'D': (3, 3, 3),
379
                  'H': (5, 5, '_'), 'TTSCH': (4, 4, 4), 'THS': (4, 4, 4),
380
                  'L': (8, 8, 8), 'P': (7, 7, 7), 'CHS': (5, 54, 54),
381
                  'T': (3, 3, 3), 'X': (5, 54, 54), 'OJ': (0, 1, '_'),
382
                  'OI': (0, 1, '_'), 'SCHTSH': (2, 4, 4), 'OY': (0, 1, '_'),
383
                  'Y': (1, '_', '_'), 'TSH': (4, 4, 4), 'ZDZ': (2, 4, 4),
384
                  'TSZ': (4, 4, 4), 'SHT': (2, 43, 43), 'SCHTSCH': (2, 4, 4),
385
                  'TTSZ': (4, 4, 4), 'TTZ': (4, 4, 4), 'SCH': (4, 4, 4),
386
                  'TTS': (4, 4, 4), 'SZD': (2, 43, 43), 'AI': (0, 1, '_'),
387
                  'PF': (7, 7, 7), 'TCH': (4, 4, 4), 'PH': (7, 7, 7),
388
                  'TTCH': (4, 4, 4), 'SZT': (2, 43, 43), 'ZDZH': (2, 4, 4),
389
                  'EI': (0, 1, '_'), 'G': (5, 5, 5), 'EJ': (0, 1, '_'),
390
                  'ZD': (2, 43, 43), 'IU': (1, '_', '_'), 'K': (5, 5, 5),
391
                  'O': (0, '_', '_'), 'SHTCH': (2, 4, 4), 'S': (4, 4, 4),
392
                  'TRZ': (4, 4, 4), 'SHD': (2, 43, 43), 'DSH': (4, 4, 4),
393
                  'CSZ': (4, 4, 4), 'EU': (1, 1, '_'), 'TRS': (4, 4, 4),
394
                  'ZS': (4, 4, 4), 'STRZ': (2, 4, 4), 'UY': (0, 1, '_'),
395
                  'STRS': (2, 4, 4), 'CZS': (4, 4, 4),
396
                  'MN': ('6_6', '6_6', '6_6'), 'UI': (0, 1, '_'),
397
                  'UJ': (0, 1, '_'), 'UE': (0, '_', '_'), 'EY': (0, 1, '_'),
398
                  'W': (7, 7, 7), 'IA': (1, '_', '_'), 'FB': (7, 7, 7),
399
                  'STSCH': (2, 4, 4), 'SCHT': (2, 43, 43),
400
                  'NM': ('6_6', '6_6', '6_6'), 'SCHD': (2, 43, 43),
401
                  'B': (7, 7, 7), 'DSZ': (4, 4, 4), 'F': (7, 7, 7),
402
                  'N': (6, 6, 6), 'CZ': (4, 4, 4), 'R': (9, 9, 9),
403
                  'U': (0, '_', '_'), 'V': (7, 7, 7), 'CS': (4, 4, 4),
404
                  'Z': (4, 4, 4), 'SZ': (4, 4, 4), 'TSCH': (4, 4, 4),
405
                  'KH': (5, 5, 5), 'ST': (2, 43, 43), 'KS': (5, 54, 54),
406
                  'SH': (4, 4, 4), 'SC': (2, 4, 4), 'SD': (2, 43, 43),
407
                  'DZ': (4, 4, 4), 'ZHD': (2, 43, 43), 'DT': (3, 3, 3),
408
                  'ZSH': (4, 4, 4), 'DS': (4, 4, 4), 'TZ': (4, 4, 4),
409
                  'TS': (4, 4, 4), 'TH': (3, 3, 3), 'TC': (4, 4, 4),
410
                  'A': (0, '_', '_'), 'E': (0, '_', '_'), 'I': (0, '_', '_'),
411
                  'AJ': (0, 1, '_'), 'M': (6, 6, 6), 'Q': (5, 5, 5),
412
                  'AU': (0, 7, '_'), 'IO': (1, '_', '_'), 'AY': (0, 1, '_'),
413
                  'IE': (1, '_', '_'), 'ZSCH': (4, 4, 4),
414
                  'CH': ((5, 4), (5, 4), (5, 4)),
415
                  'CK': ((5, 45), (5, 45), (5, 45)),
416
                  'C': ((5, 4), (5, 4), (5, 4)),
417
                  'J': ((1, 4), ('_', 4), ('_', 4)),
418
                  'RZ': ((94, 4), (94, 4), (94, 4)),
419
                  'RS': ((94, 4), (94, 4), (94, 4))}
420
421
    _dms_order = {'A': ('AI', 'AJ', 'AU', 'AY', 'A'),
422
                  'B': ('B'),
423
                  'C': ('CHS', 'CSZ', 'CZS', 'CH', 'CK', 'CS', 'CZ', 'C'),
424
                  'D': ('DRS', 'DRZ', 'DSH', 'DSZ', 'DZH', 'DZS', 'DS', 'DT',
425
                        'DZ', 'D'),
426
                  'E': ('EI', 'EJ', 'EU', 'EY', 'E'),
427
                  'F': ('FB', 'F'),
428
                  'G': ('G'),
429
                  'H': ('H'),
430
                  'I': ('IA', 'IE', 'IO', 'IU', 'I'),
431
                  'J': ('J'),
432
                  'K': ('KH', 'KS', 'K'),
433
                  'L': ('L'),
434
                  'M': ('MN', 'M'),
435
                  'N': ('NM', 'N'),
436
                  'O': ('OI', 'OJ', 'OY', 'O'),
437
                  'P': ('PF', 'PH', 'P'),
438
                  'Q': ('Q'),
439
                  'R': ('RS', 'RZ', 'R'),
440
                  'S': ('SCHTSCH', 'SCHTCH', 'SCHTSH', 'SHTCH', 'SHTSH',
441
                        'STSCH', 'SCHD', 'SCHT', 'SHCH', 'STCH', 'STRS',
442
                        'STRZ', 'STSH', 'SZCS', 'SZCZ', 'SCH', 'SHD', 'SHT',
443
                        'SZD', 'SZT', 'SC', 'SD', 'SH', 'ST', 'SZ', 'S'),
444
                  'T': ('TTSCH', 'TSCH', 'TTCH', 'TTSZ', 'TCH', 'THS', 'TRS',
445
                        'TRZ', 'TSH', 'TSZ', 'TTS', 'TTZ', 'TZS', 'TC', 'TH',
446
                        'TS', 'TZ', 'T'),
447
                  'U': ('UE', 'UI', 'UJ', 'UY', 'U'),
448
                  'V': ('V'),
449
                  'W': ('W'),
450
                  'X': ('X'),
451
                  'Y': ('Y'),
452
                  'Z': ('ZHDZH', 'ZDZH', 'ZSCH', 'ZDZ', 'ZHD', 'ZSH', 'ZD',
453
                        'ZH', 'ZS', 'Z')}
454
455
    _vowels = {'A', 'E', 'I', 'J', 'O', 'U', 'Y'}
456
    dms = ['']  # initialize empty code list
457
458
    # Require a maxlength of at least 6 and not more than 64
459
    if maxlength is not None:
460
        maxlength = min(max(6, maxlength), 64)
461
    else:
462
        maxlength = 64
463
464
    # uppercase, normalize, decompose, and filter non-A-Z
465
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
466
    word = word.replace('ß', 'SS')
467
    word = ''.join(c for c in word if c in
468
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
469
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
470
                    'Y', 'Z'})
471
472
    # Nothing to convert, return base case
473
    if not word:
474
        if zero_pad:
475
            return {'0'*maxlength}
476
        return {'0'}
477
478
    # Reverse word if computing Reverse Soundex
479
    if reverse:
480
        word = word[::-1]
481
482
    pos = 0
483
    while pos < len(word):
484
        # Iterate through _dms_order, which specifies the possible substrings
485
        # for which codes exist in the Daitch-Mokotoff coding
486
        for sstr in _dms_order[word[pos]]:  # pragma: no branch
487
            if word[pos:].startswith(sstr):
488
                # Having determined a valid substring start, retrieve the code
489
                dm_val = _dms_table[sstr]
490
491
                # Having retried the code (triple), determine the correct
492
                # positional variant (first, pre-vocalic, elsewhere)
493
                if pos == 0:
494
                    dm_val = dm_val[0]
495
                elif (pos+len(sstr) < len(word) and
496
                      word[pos+len(sstr)] in _vowels):
497
                    dm_val = dm_val[1]
498
                else:
499
                    dm_val = dm_val[2]
500
501
                # Build the code strings
502
                if isinstance(dm_val, tuple):
503
                    dms = [_ + text_type(dm_val[0]) for _ in dms] \
504
                            + [_ + text_type(dm_val[1]) for _ in dms]
505
                else:
506
                    dms = [_ + text_type(dm_val) for _ in dms]
507
                pos += len(sstr)
508
                break
509
510
    # Filter out double letters and _ placeholders
511
    dms = (''.join(c for c in _delete_consecutive_repeats(_) if c != '_')
Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
512
           for _ in dms)
513
514
    # Trim codes and return set
515
    if zero_pad:
516
        dms = ((_ + ('0'*maxlength))[:maxlength] for _ in dms)
517
    else:
518
        dms = (_[:maxlength] for _ in dms)
519
    return set(dms)
520
521
522
def koelner_phonetik(word):
523
    """Return the Kölner Phonetik (numeric output) code for a word.
524
525
    Based on the algorithm described at
526
    https://de.wikipedia.org/wiki/Kölner_Phonetik
527
528
    While the output code is numeric, it is still a str because 0s can lead
529
    the code.
530
531
    :param str word: the word to transform
532
    :returns: the Kölner Phonetik value as a numeric string
533
    :rtype: str
534
535
    >>> koelner_phonetik('Christopher')
536
    '478237'
537
    >>> koelner_phonetik('Niall')
538
    '65'
539
    >>> koelner_phonetik('Smith')
540
    '862'
541
    >>> koelner_phonetik('Schmidt')
542
    '862'
543
    >>> koelner_phonetik('Müller')
544
    '657'
545
    >>> koelner_phonetik('Zimmermann')
546
    '86766'
547
    """
548
    # pylint: disable=too-many-branches
549
    def _after(word, i, letters):
550
        """Return True if word[i] follows one of the supplied letters."""
551
        if i > 0 and word[i-1] in letters:
552
            return True
553
        return False
554
555
    def _before(word, i, letters):
556
        """Return True if word[i] precedes one of the supplied letters."""
557
        if i+1 < len(word) and word[i+1] in letters:
558
            return True
559
        return False
560
561
    _vowels = {'A', 'E', 'I', 'J', 'O', 'U', 'Y'}
562
563
    sdx = ''
564
565
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
566
    word = word.replace('ß', 'SS')
567
568
    word = word.replace('Ä', 'AE')
569
    word = word.replace('Ö', 'OE')
570
    word = word.replace('Ü', 'UE')
571
    word = ''.join(c for c in word if c in
572
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
573
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
574
                    'Y', 'Z'})
575
576
    # Nothing to convert, return base case
577
    if not word:
578
        return sdx
579
580
    for i in range(len(word)):
Issue (unused-code): Consider using enumerate instead of iterating with range and len.
581
        if word[i] in _vowels:
Issue (Duplication): This code seems to be duplicated in your project.
582
            sdx += '0'
583
        elif word[i] == 'B':
584
            sdx += '1'
585
        elif word[i] == 'P':
586
            if _before(word, i, {'H'}):
587
                sdx += '3'
588
            else:
589
                sdx += '1'
590
        elif word[i] in {'D', 'T'}:
591
            if _before(word, i, {'C', 'S', 'Z'}):
592
                sdx += '8'
593
            else:
594
                sdx += '2'
595
        elif word[i] in {'F', 'V', 'W'}:
596
            sdx += '3'
597
        elif word[i] in {'G', 'K', 'Q'}:
598
            sdx += '4'
599
        elif word[i] == 'C':
600
            if _after(word, i, {'S', 'Z'}):
601
                sdx += '8'
602
            elif i == 0:
603
                if _before(word, i, {'A', 'H', 'K', 'L', 'O', 'Q', 'R', 'U',
604
                                     'X'}):
605
                    sdx += '4'
606
                else:
607
                    sdx += '8'
608
            elif _before(word, i, {'A', 'H', 'K', 'O', 'Q', 'U', 'X'}):
609
                sdx += '4'
610
            else:
611
                sdx += '8'
612
        elif word[i] == 'X':
613
            if _after(word, i, {'C', 'K', 'Q'}):
614
                sdx += '8'
615
            else:
616
                sdx += '48'
617
        elif word[i] == 'L':
618
            sdx += '5'
619
        elif word[i] in {'M', 'N'}:
620
            sdx += '6'
621
        elif word[i] == 'R':
622
            sdx += '7'
623
        elif word[i] in {'S', 'Z'}:
624
            sdx += '8'
625
626
    sdx = _delete_consecutive_repeats(sdx)
627
628
    if sdx:
629
        sdx = sdx[0] + sdx[1:].replace('0', '')
630
631
    return sdx
632
633
634
def koelner_phonetik_num_to_alpha(num):
635
    """Convert a Kölner Phonetik code from numeric to alphabetic.
636
637
    :param str num: a numeric Kölner Phonetik representation
638
    :returns: an alphabetic representation of the same word
639
    :rtype: str
640
641
    >>> koelner_phonetik_num_to_alpha(862)
642
    'SNT'
643
    >>> koelner_phonetik_num_to_alpha(657)
644
    'NLR'
645
    >>> koelner_phonetik_num_to_alpha(86766)
646
    'SNRNN'
647
    """
648
    _koelner_num_translation = dict(zip((ord(_) for _ in '012345678'),
Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
649
                                        'APTFKLNRS'))
650
    num = ''.join(c for c in text_type(num) if c in {'0', '1', '2', '3', '4',
651
                                                     '5', '6', '7', '8'})
652
    return num.translate(_koelner_num_translation)
653
654
655
def koelner_phonetik_alpha(word):
656
    """Return the Kölner Phonetik (alphabetic output) code for a word.
657
658
    :param str word: the word to transform
659
    :returns: the Kölner Phonetik value as an alphabetic string
660
    :rtype: str
661
662
    >>> koelner_phonetik_alpha('Smith')
663
    'SNT'
664
    >>> koelner_phonetik_alpha('Schmidt')
665
    'SNT'
666
    >>> koelner_phonetik_alpha('Müller')
667
    'NLR'
668
    >>> koelner_phonetik_alpha('Zimmermann')
669
    'SNRNN'
670
    """
671
    return koelner_phonetik_num_to_alpha(koelner_phonetik(word))
672
673
674
def nysiis(word, maxlength=6, modified=False):
675
    """Return the NYSIIS code for a word.
676
677
    A description of the New York State Identification and Intelligence System
678
    algorithm can be found at
679
    https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System
680
681
    The modified version of this algorithm is described in Appendix B of
682
    Lynch, Billy T. and William L. Arends. `Selection of a Surname Coding
683
    Procedure for the SRS Record Linkage System.` Statistical Reporting
684
    Service, U.S. Department of Agriculture, Washington, D.C. February 1977.
685
    https://naldc.nal.usda.gov/download/27833/PDF
686
687
    :param str word: the word to transform
688
    :param int maxlength: the maximum length (default 6) of the code to return
689
    :param bool modified: indicates whether to use USDA modified NYSIIS
690
    :returns: the NYSIIS value
691
    :rtype: str
692
693
    >>> nysiis('Christopher')
694
    'CRASTA'
695
    >>> nysiis('Niall')
696
    'NAL'
697
    >>> nysiis('Smith')
698
    'SNAT'
699
    >>> nysiis('Schmidt')
700
    'SNAD'
701
702
    >>> nysiis('Christopher', maxlength=_INFINITY)
703
    'CRASTAFAR'
704
705
    >>> nysiis('Christopher', maxlength=8, modified=True)
706
    'CRASTAFA'
707
    >>> nysiis('Niall', maxlength=8, modified=True)
708
    'NAL'
709
    >>> nysiis('Smith', maxlength=8, modified=True)
710
    'SNAT'
711
    >>> nysiis('Schmidt', maxlength=8, modified=True)
712
    'SNAD'
713
    """
714
    # Require a maxlength of at least 6
715
    if maxlength:
716
        maxlength = max(6, maxlength)
717
718
    _vowels = {'A', 'E', 'I', 'O', 'U'}
719
720
    word = ''.join(c for c in word.upper() if c.isalpha())
721
    word = word.replace('ß', 'SS')
722
723
    # exit early if there are no alphas
724
    if not word:
725
        return ''
726
727
    if modified:
728
        original_first_char = word[0]
729
730
    if word[:3] == 'MAC':
731
        word = 'MCC'+word[3:]
732
    elif word[:2] == 'KN':
733
        word = 'NN'+word[2:]
734
    elif word[:1] == 'K':
735
        word = 'C'+word[1:]
736
    elif word[:2] in {'PH', 'PF'}:
737
        word = 'FF'+word[2:]
738
    elif word[:3] == 'SCH':
739
        word = 'SSS'+word[3:]
740
    elif modified:
741
        if word[:2] == 'WR':
742
            word = 'RR'+word[2:]
743
        elif word[:2] == 'RH':
744
            word = 'RR'+word[2:]
745
        elif word[:2] == 'DG':
746
            word = 'GG'+word[2:]
747
        elif word[:1] in _vowels:
748
            word = 'A'+word[1:]
749
750
    if modified and word[-1] in {'S', 'Z'}:
751
        word = word[:-1]
752
753
    if word[-2:] == 'EE' or word[-2:] == 'IE' or (modified and
754
                                                  word[-2:] == 'YE'):
755
        word = word[:-2]+'Y'
756
    elif word[-2:] in {'DT', 'RT', 'RD'}:
757
        word = word[:-2]+'D'
758
    elif word[-2:] in {'NT', 'ND'}:
759
        word = word[:-2]+('N' if modified else 'D')
760
    elif modified:
761
        if word[-2:] == 'IX':
762
            word = word[:-2]+'ICK'
763
        elif word[-2:] == 'EX':
764
            word = word[:-2]+'ECK'
765
        elif word[-2:] in {'JR', 'SR'}:
766
            return 'ERROR'  # TODO: decide how best to return an error
Issue (Coding Style): TODO and FIXME comments should generally be avoided.
767
768
    key = word[0]
769
770
    skip = 0
771
    for i in range(1, len(word)):
772
        if i >= len(word):
773
            continue
774
        elif skip:
775
            skip -= 1
776
            continue
777
        elif word[i:i+2] == 'EV':
778
            word = word[:i] + 'AF' + word[i+2:]
779
            skip = 1
780
        elif word[i] in _vowels:
781
            word = word[:i] + 'A' + word[i+1:]
782
        elif modified and i != len(word)-1 and word[i] == 'Y':
783
            word = word[:i] + 'A' + word[i+1:]
784
        elif word[i] == 'Q':
785
            word = word[:i] + 'G' + word[i+1:]
786
        elif word[i] == 'Z':
787
            word = word[:i] + 'S' + word[i+1:]
788
        elif word[i] == 'M':
789
            word = word[:i] + 'N' + word[i+1:]
790
        elif word[i:i+2] == 'KN':
791
            word = word[:i] + 'N' + word[i+2:]
792
        elif word[i] == 'K':
793
            word = word[:i] + 'C' + word[i+1:]
794
        elif modified and i == len(word)-3 and word[i:i+3] == 'SCH':
795
            word = word[:i] + 'SSA'
796
            skip = 2
797
        elif word[i:i+3] == 'SCH':
798
            word = word[:i] + 'SSS' + word[i+3:]
799
            skip = 2
800
        elif modified and i == len(word)-2 and word[i:i+2] == 'SH':
801
            word = word[:i] + 'SA'
802
            skip = 1
803
        elif word[i:i+2] == 'SH':
804
            word = word[:i] + 'SS' + word[i+2:]
805
            skip = 1
806
        elif word[i:i+2] == 'PH':
807
            word = word[:i] + 'FF' + word[i+2:]
808
            skip = 1
809
        elif modified and word[i:i+3] == 'GHT':
810
            word = word[:i] + 'TTT' + word[i+3:]
811
            skip = 2
812
        elif modified and word[i:i+2] == 'DG':
813
            word = word[:i] + 'GG' + word[i+2:]
814
            skip = 1
815
        elif modified and word[i:i+2] == 'WR':
816
            word = word[:i] + 'RR' + word[i+2:]
817
            skip = 1
818
        elif word[i] == 'H' and (word[i-1] not in _vowels or
819
                                 word[i+1:i+2] not in _vowels):
820
            word = word[:i] + word[i-1] + word[i+1:]
821
        elif word[i] == 'W' and word[i-1] in _vowels:
822
            word = word[:i] + word[i-1] + word[i+1:]
823
824
        if word[i:i+skip+1] != key[-1:]:
825
            key += word[i:i+skip+1]
826
827
    key = _delete_consecutive_repeats(key)
828
829
    if key[-1] == 'S':
830
        key = key[:-1]
831
    if key[-2:] == 'AY':
832
        key = key[:-2] + 'Y'
833
    if key[-1:] == 'A':
834
        key = key[:-1]
835
    if modified and key[0] == 'A':
836
        key = original_first_char + key[1:]
Issue: The variable original_first_char does not seem to be defined in case modified on line 727 is False. Are you sure this can never be the case?
837
838
    if maxlength and maxlength < _INFINITY:
839
        key = key[:maxlength]
840
841
    return key
842
843
844
def mra(word):
845
    """Return the MRA personal numeric identifier (PNI) for a word.
846
847
    A description of the Western Airlines Surname Match Rating Algorithm can
848
    be found on page 18 of
849
    https://archive.org/details/accessingindivid00moor
850
851
    :param str word: the word to transform
852
    :returns: the MRA PNI
853
    :rtype: str
854
855
    >>> mra('Christopher')
856
    'CHRPHR'
857
    >>> mra('Niall')
858
    'NL'
859
    >>> mra('Smith')
860
    'SMTH'
861
    >>> mra('Schmidt')
862
    'SCHMDT'
863
    """
864
    if not word:
865
        return word
866
    word = word.upper()
867
    word = word.replace('ß', 'SS')
868
    word = word[0]+''.join(c for c in word[1:] if
869
                           c not in {'A', 'E', 'I', 'O', 'U'})
870
    word = _delete_consecutive_repeats(word)
871
    if len(word) > 6:
872
        word = word[:3]+word[-3:]
873
    return word
874
875
876
def metaphone(word, maxlength=_INFINITY):
877
    """Return the Metaphone code for a word.
878
879
    Based on Lawrence Philips' Pick BASIC code from 1990:
880
    http://aspell.net/metaphone/metaphone.basic
881
    This incorporates some corrections to the above code, particularly
882
    some of those suggested by Michael Kuhn in:
883
    http://aspell.net/metaphone/metaphone-kuhn.txt
884
885
    :param str word: the word to transform
886
    :param int maxlength: the maximum length of the returned Metaphone code
887
        (defaults to unlimited, but in Philips' original implementation
888
        this was 4)
889
    :returns: the Metaphone value
890
    :rtype: str
891
892
893
    >>> metaphone('Christopher')
894
    'KRSTFR'
895
    >>> metaphone('Niall')
896
    'NL'
897
    >>> metaphone('Smith')
898
    'SM0'
899
    >>> metaphone('Schmidt')
900
    'SKMTT'
901
    """
902
    # pylint: disable=too-many-branches
903
    _vowels = {'A', 'E', 'I', 'O', 'U'}
904
    _frontv = {'E', 'I', 'Y'}
905
    _varson = {'C', 'G', 'P', 'S', 'T'}
906
907
    # Require a maxlength of at least 4
908
    if maxlength is not None:
909
        maxlength = max(4, maxlength)
910
    else:
911
        maxlength = 64
912
913
    # As in variable sound--those modified by adding an "h"
914
    ename = ''.join(c for c in word.upper() if c.isalnum())
915
    ename = ename.replace('ß', 'SS')
916
917
    # Delete nonalphanumeric characters and make all caps
918
    if not ename:
919
        return ''
920
    if ename[0:2] in {'PN', 'AE', 'KN', 'GN', 'WR'}:
921
        ename = ename[1:]
922
    elif ename[0] == 'X':
923
        ename = 'S' + ename[1:]
924
    elif ename[0:2] == 'WH':
925
        ename = 'W' + ename[2:]
926
927
    # Convert to metaph
928
    elen = len(ename)-1
929
    metaph = ''
930
    for i in range(len(ename)):
Issue (unused-code): Consider using enumerate instead of iterating with range and len.
931
        if len(metaph) >= maxlength:
932
            break
933
        if ((ename[i] not in {'G', 'T'} and
934
             i > 0 and ename[i-1] == ename[i])):
935
            continue
936
937
        if ename[i] in _vowels and i == 0:
938
            metaph = ename[i]
939
940
        elif ename[i] == 'B':
941
            if i != elen or ename[i-1] != 'M':
942
                metaph += ename[i]
943
944
        elif ename[i] == 'C':
945
            if not (i > 0 and ename[i-1] == 'S' and ename[i+1:i+2] in _frontv):
946
                if ename[i+1:i+3] == 'IA':
947
                    metaph += 'X'
948
                elif ename[i+1:i+2] in _frontv:
949
                    metaph += 'S'
950
                elif i > 0 and ename[i-1:i+2] == 'SCH':
951
                    metaph += 'K'
952
                elif ename[i+1:i+2] == 'H':
953
                    if i == 0 and i+1 < elen and ename[i+2:i+3] not in _vowels:
954
                        metaph += 'K'
955
                    else:
956
                        metaph += 'X'
957
                else:
958
                    metaph += 'K'
959
960
        elif ename[i] == 'D':
961
            if ename[i+1:i+2] == 'G' and ename[i+2:i+3] in _frontv:
962
                metaph += 'J'
963
            else:
964
                metaph += 'T'
965
966
        elif ename[i] == 'G':
967
            if ename[i+1:i+2] == 'H' and not (i+1 == elen or
968
                                              ename[i+2:i+3] not in _vowels):
969
                continue
970
            elif i > 0 and ((i+1 == elen and ename[i+1] == 'N') or
971
                            (i+3 == elen and ename[i+1:i+4] == 'NED')):
972
                continue
973
            elif (i-1 > 0 and i+1 <= elen and ename[i-1] == 'D' and
974
                  ename[i+1] in _frontv):
975
                continue
976
            elif ename[i+1:i+2] == 'G':
977
                continue
978
            elif ename[i+1:i+2] in _frontv:
979
                if i == 0 or ename[i-1] != 'G':
980
                    metaph += 'J'
981
                else:
982
                    metaph += 'K'
983
            else:
984
                metaph += 'K'
985
986
        elif ename[i] == 'H':
987
            if ((i > 0 and ename[i-1] in _vowels and
988
                 ename[i+1:i+2] not in _vowels)):
989
                continue
990
            elif i > 0 and ename[i-1] in _varson:
991
                continue
992
            else:
993
                metaph += 'H'
994
995
        elif ename[i] in {'F', 'J', 'L', 'M', 'N', 'R'}:
996
            metaph += ename[i]
997
998
        elif ename[i] == 'K':
999
            if i > 0 and ename[i-1] == 'C':
1000
                continue
1001
            else:
1002
                metaph += 'K'
1003
1004
        elif ename[i] == 'P':
1005
            if ename[i+1:i+2] == 'H':
1006
                metaph += 'F'
1007
            else:
1008
                metaph += 'P'
1009
1010
        elif ename[i] == 'Q':
1011
            metaph += 'K'
1012
1013
        elif ename[i] == 'S':
1014
            if ((i > 0 and i+2 <= elen and ename[i+1] == 'I' and
1015
                 ename[i+2] in 'OA')):
1016
                metaph += 'X'
1017
            elif ename[i+1:i+2] == 'H':
1018
                metaph += 'X'
1019
            else:
1020
                metaph += 'S'
1021
1022
        elif ename[i] == 'T':
1023
            if ((i > 0 and i+2 <= elen and ename[i+1] == 'I' and
1024
                 ename[i+2] in {'A', 'O'})):
1025
                metaph += 'X'
1026
            elif ename[i+1:i+2] == 'H':
1027
                metaph += '0'
1028
            elif ename[i+1:i+3] != 'CH':
1029
                if ename[i-1:i] != 'T':
1030
                    metaph += 'T'
1031
1032
        elif ename[i] == 'V':
1033
            metaph += 'F'
1034
1035
        elif ename[i] in 'WY':
1036
            if ename[i+1:i+2] in _vowels:
1037
                metaph += ename[i]
1038
1039
        elif ename[i] == 'X':
1040
            metaph += 'KS'
1041
1042
        elif ename[i] == 'Z':
1043
            metaph += 'S'
1044
1045
    return metaph
1046
1047
1048
def double_metaphone(word, maxlength=_INFINITY):
1049
    """Return the Double Metaphone code for a word.
1050
1051
    Based on Lawrence Philips' (Visual) C++ code from 1999:
1052
    http://aspell.net/metaphone/dmetaph.cpp
1053
1054
    :param word: the word to transform
1055
    :param maxlength: the maximum length of the returned Double Metaphone codes
1056
        (defaults to unlimited, but in Philips' original implementation this
1057
        was 4)
1058
    :returns: the Double Metaphone value(s)
1059
    :rtype: tuple
1060
1061
    >>> double_metaphone('Christopher')
1062
    ('KRSTFR', '')
1063
    >>> double_metaphone('Niall')
1064
    ('NL', '')
1065
    >>> double_metaphone('Smith')
1066
    ('SM0', 'XMT')
1067
    >>> double_metaphone('Schmidt')
1068
    ('XMT', 'SMT')
1069
    """
1070
    # pylint: disable=too-many-branches
1071
    # Require a maxlength of at least 4
1072
    if maxlength is not None:
1073
        maxlength = max(4, maxlength)
1074
    else:
1075
        maxlength = 64
1076
1077
    primary = ''
1078
    secondary = ''
1079
1080
    def _slavo_germanic():
1081
        """Return True if the word appears to be Slavic or Germanic."""
1082
        if 'W' in word or 'K' in word or 'CZ' in word:
1083
            return True
1084
        return False
1085
1086
    def _metaph_add(pri, sec=''):
1087
        """Return a new metaphone tuple with the supplied elements."""
1088
        newpri = primary
1089
        newsec = secondary
1090
        if pri:
1091
            newpri += pri
1092
        if sec:
1093
            if sec != ' ':
1094
                newsec += sec
1095
        else:
1096
            newsec += pri
1097
        return (newpri, newsec)
1098
1099
    def _is_vowel(pos):
1100
        """Return True if the character at word[pos] is a vowel."""
1101
        if pos >= 0 and word[pos] in {'A', 'E', 'I', 'O', 'U', 'Y'}:
1102
            return True
1103
        return False
1104
1105
    def _get_at(pos):
1106
        """Return the character at word[pos]."""
1107
        return word[pos]
1108
1109
    def _string_at(pos, slen, substrings):
1110
        """Return True if word[pos:pos+slen] is in substrings."""
1111
        if pos < 0:
1112
            return False
1113
        return word[pos:pos+slen] in substrings
1114
1115
    current = 0
1116
    length = len(word)
1117
    if length < 1:
1118
        return ('', '')
1119
    last = length - 1
1120
1121
    word = word.upper()
1122
    word = word.replace('ß', 'SS')
1123
1124
    # Pad the original string so that we can index beyond the edge of the world
1125
    word += '     '
1126
1127
    # Skip these when at start of word
1128
    if word[0:2] in {'GN', 'KN', 'PN', 'WR', 'PS'}:
1129
        current += 1
1130
1131
    # Initial 'X' is pronounced 'Z' e.g. 'Xavier'
1132
    if _get_at(0) == 'X':
1133
        (primary, secondary) = _metaph_add('S')  # 'Z' maps to 'S'
1134
        current += 1
1135
1136
    # Main loop
1137
    while True:
Issue (unused-code): Too many nested blocks (6/5).
1138
        if current >= length:
1139
            break
1140
1141
        if _get_at(current) in {'A', 'E', 'I', 'O', 'U', 'Y'}:
1142
            if current == 0:
1143
                # All init vowels now map to 'A'
1144
                (primary, secondary) = _metaph_add('A')
1145
            current += 1
1146
            continue
1147
1148
        elif _get_at(current) == 'B':
1149
            # "-mb", e.g", "dumb", already skipped over...
1150
            (primary, secondary) = _metaph_add('P')
1151
            if _get_at(current + 1) == 'B':
1152
                current += 2
1153
            else:
1154
                current += 1
1155
            continue
1156
1157
        elif _get_at(current) == 'Ç':
1158
            (primary, secondary) = _metaph_add('S')
1159
            current += 1
1160
            continue
1161
1162
        elif _get_at(current) == 'C':
1163
            # Various Germanic
1164
            if (current > 1 and not _is_vowel(current - 2) and
Issue (best-practice): Too many boolean expressions in if statement (6/5).
1165
                    _string_at((current - 1), 3, {'ACH'}) and
1166
                    ((_get_at(current + 2) != 'I') and
1167
                     ((_get_at(current + 2) != 'E') or
1168
                      _string_at((current - 2), 6,
1169
                                 {'BACHER', 'MACHER'})))):
1170
                (primary, secondary) = _metaph_add('K')
1171
                current += 2
1172
                continue
1173
1174
            # Special case 'caesar'
1175
            elif current == 0 and _string_at(current, 6, {'CAESAR'}):
1176
                (primary, secondary) = _metaph_add('S')
1177
                current += 2
1178
                continue
1179
1180
            # Italian 'chianti'
1181
            elif _string_at(current, 4, {'CHIA'}):
1182
                (primary, secondary) = _metaph_add('K')
1183
                current += 2
1184
                continue
1185
1186
            elif _string_at(current, 2, {'CH'}):
1187
                # Find 'Michael'
1188
                if current > 0 and _string_at(current, 4, {'CHAE'}):
1189
                    (primary, secondary) = _metaph_add('K', 'X')
1190
                    current += 2
1191
                    continue
1192
1193
                # Greek roots e.g. 'chemistry', 'chorus'
1194
                elif (current == 0 and
1195
                      (_string_at((current + 1), 5,
1196
                                  {'HARAC', 'HARIS'}) or
1197
                       _string_at((current + 1), 3,
1198
                                  {'HOR', 'HYM', 'HIA', 'HEM'})) and
1199
                      not _string_at(0, 5, {'CHORE'})):
1200
                    (primary, secondary) = _metaph_add('K')
1201
                    current += 2
1202
                    continue
1203
1204
                # Germanic, Greek, or otherwise 'ch' for 'kh' sound
1205
                elif ((_string_at(0, 4, {'VAN ', 'VON '}) or
Issue (best-practice): Too many boolean expressions in if statement (7/5).
1206
                       _string_at(0, 3, {'SCH'})) or
1207
                      # 'architect but not 'arch', 'orchestra', 'orchid'
1208
                      _string_at((current - 2), 6,
1209
                                 {'ORCHES', 'ARCHIT', 'ORCHID'}) or
1210
                      _string_at((current + 2), 1, {'T', 'S'}) or
1211
                      ((_string_at((current - 1), 1,
1212
                                   {'A', 'O', 'U', 'E'}) or
1213
                        (current == 0)) and
1214
                       # e.g., 'wachtler', 'wechsler', but not 'tichner'
1215
                       _string_at((current + 2), 1,
1216
                                  {'L', 'R', 'N', 'M', 'B', 'H', 'F', 'V', 'W',
1217
                                   ' '}))):
1218
                    (primary, secondary) = _metaph_add('K')
1219
1220
                else:
1221
                    if current > 0:
1222
                        if _string_at(0, 2, {'MC'}):
1223
                            # e.g., "McHugh"
1224
                            (primary, secondary) = _metaph_add('K')
1225
                        else:
1226
                            (primary, secondary) = _metaph_add('X', 'K')
1227
                    else:
1228
                        (primary, secondary) = _metaph_add('X')
1229
1230
                current += 2
1231
                continue
1232
1233
            # e.g, 'czerny'
1234
            elif (_string_at(current, 2, {'CZ'}) and
1235
                  not _string_at((current - 2), 4, {'WICZ'})):
1236
                (primary, secondary) = _metaph_add('S', 'X')
1237
                current += 2
1238
                continue
1239
1240
            # e.g., 'focaccia'
1241
            elif _string_at((current + 1), 3, {'CIA'}):
1242
                (primary, secondary) = _metaph_add('X')
1243
                current += 3
1244
1245
            # double 'C', but not if e.g. 'McClellan'
1246
            elif (_string_at(current, 2, {'CC'}) and
1247
                  not ((current == 1) and (_get_at(0) == 'M'))):
1248
                # 'bellocchio' but not 'bacchus'
1249
                if ((_string_at((current + 2), 1,
1250
                                {'I', 'E', 'H'}) and
1251
                     not _string_at((current + 2), 2, ['HU']))):
1252
                    # 'accident', 'accede' 'succeed'
1253
                    if ((((current == 1) and _get_at(current - 1) == 'A') or
1254
                         _string_at((current - 1), 5,
1255
                                    {'UCCEE', 'UCCES'}))):
1256
                        (primary, secondary) = _metaph_add('KS')
1257
                    # 'bacci', 'bertucci', other italian
1258
                    else:
1259
                        (primary, secondary) = _metaph_add('X')
1260
                    current += 3
1261
                    continue
1262
                else:  # Pierce's rule
1263
                    (primary, secondary) = _metaph_add('K')
1264
                    current += 2
1265
                    continue
1266
1267
            elif _string_at(current, 2, {'CK', 'CG', 'CQ'}):
1268
                (primary, secondary) = _metaph_add('K')
1269
                current += 2
1270
                continue
1271
1272
            elif _string_at(current, 2, {'CI', 'CE', 'CY'}):
1273
                # Italian vs. English
1274
                if _string_at(current, 3, {'CIO', 'CIE', 'CIA'}):
1275
                    (primary, secondary) = _metaph_add('S', 'X')
1276
                else:
1277
                    (primary, secondary) = _metaph_add('S')
1278
                current += 2
1279
                continue
1280
1281
            # else
1282
            else:
1283
                (primary, secondary) = _metaph_add('K')
1284
1285
                # name sent in 'mac caffrey', 'mac gregor
1286
                if _string_at((current + 1), 2, {' C', ' Q', ' G'}):
1287
                    current += 3
1288
                elif (_string_at((current + 1), 1,
1289
                                 {'C', 'K', 'Q'}) and
1290
                      not _string_at((current + 1), 2, {'CE', 'CI'})):
1291
                    current += 2
1292
                else:
1293
                    current += 1
1294
                continue
1295
1296
        elif _get_at(current) == 'D':
1297
            if _string_at(current, 2, {'DG'}):
1298
                if _string_at((current + 2), 1, {'I', 'E', 'Y'}):
1299
                    # e.g. 'edge'
1300
                    (primary, secondary) = _metaph_add('J')
1301
                    current += 3
1302
                    continue
1303
                else:
1304
                    # e.g. 'edgar'
1305
                    (primary, secondary) = _metaph_add('TK')
1306
                    current += 2
1307
                    continue
1308
1309
            elif _string_at(current, 2, {'DT', 'DD'}):
1310
                (primary, secondary) = _metaph_add('T')
1311
                current += 2
1312
                continue
1313
1314
            # else
1315
            else:
1316
                (primary, secondary) = _metaph_add('T')
1317
                current += 1
1318
                continue
1319
1320
        elif _get_at(current) == 'F':
1321
            if _get_at(current + 1) == 'F':
1322
                current += 2
1323
            else:
1324
                current += 1
1325
            (primary, secondary) = _metaph_add('F')
1326
            continue
1327
1328
        elif _get_at(current) == 'G':
1329
            if _get_at(current + 1) == 'H':
1330
                if (current > 0) and not _is_vowel(current - 1):
1331
                    (primary, secondary) = _metaph_add('K')
1332
                    current += 2
1333
                    continue
1334
1335
                # 'ghislane', ghiradelli
1336
                elif current == 0:
1337
                    if _get_at(current + 2) == 'I':
1338
                        (primary, secondary) = _metaph_add('J')
1339
                    else:
1340
                        (primary, secondary) = _metaph_add('K')
1341
                    current += 2
1342
                    continue
1343
1344
                # Parker's rule (with some further refinements) - e.g., 'hugh'
1345
                elif (((current > 1) and
Issue (best-practice): Too many boolean expressions in if statement (6/5).
1346
                       _string_at((current - 2), 1, {'B', 'H', 'D'})) or
1347
                      # e.g., 'bough'
1348
                      ((current > 2) and
1349
                       _string_at((current - 3), 1, {'B', 'H', 'D'})) or
1350
                      # e.g., 'broughton'
1351
                      ((current > 3) and
1352
                       _string_at((current - 4), 1, {'B', 'H'}))):
1353
                    current += 2
1354
                    continue
1355
                else:
1356
                    # e.g. 'laugh', 'McLaughlin', 'cough',
1357
                    #      'gough', 'rough', 'tough'
1358
                    if ((current > 2) and
1359
                            (_get_at(current - 1) == 'U') and
1360
                            (_string_at((current - 3), 1,
1361
                                        {'C', 'G', 'L', 'R', 'T'}))):
1362
                        (primary, secondary) = _metaph_add('F')
1363
                    elif (current > 0) and _get_at(current - 1) != 'I':
1364
                        (primary, secondary) = _metaph_add('K')
1365
                    current += 2
1366
                    continue
1367
1368
            elif _get_at(current + 1) == 'N':
1369
                if (current == 1) and _is_vowel(0) and not _slavo_germanic():
1370
                    (primary, secondary) = _metaph_add('KN', 'N')
1371
                # not e.g. 'cagney'
1372
                elif (not _string_at((current + 2), 2, {'EY'}) and
1373
                      (_get_at(current + 1) != 'Y') and
1374
                      not _slavo_germanic()):
1375
                    (primary, secondary) = _metaph_add('N', 'KN')
1376
                else:
1377
                    (primary, secondary) = _metaph_add('KN')
1378
                current += 2
1379
                continue
1380
1381
            # 'tagliaro'
1382
            elif (_string_at((current + 1), 2, {'LI'}) and
1383
                  not _slavo_germanic()):
1384
                (primary, secondary) = _metaph_add('KL', 'L')
1385
                current += 2
1386
                continue
1387
1388
            # -ges-, -gep-, -gel-, -gie- at beginning
1389
            elif ((current == 0) and
1390
                  ((_get_at(current + 1) == 'Y') or
1391
                   _string_at((current + 1), 2, {'ES', 'EP', 'EB', 'EL', 'EY',
1392
                                                 'IB', 'IL', 'IN', 'IE', 'EI',
1393
                                                 'ER'}))):
1394
                (primary, secondary) = _metaph_add('K', 'J')
1395
                current += 2
1396
                continue
1397
1398
            #  -ger-,  -gy-
1399
            elif ((_string_at((current + 1), 2, {'ER'}) or
1400
                   (_get_at(current + 1) == 'Y')) and not
1401
                  _string_at(0, 6, {'DANGER', 'RANGER', 'MANGER'}) and not
1402
                  _string_at((current - 1), 1, {'E', 'I'}) and not
1403
                  _string_at((current - 1), 3, {'RGY', 'OGY'})):
1404
                (primary, secondary) = _metaph_add('K', 'J')
1405
                current += 2
1406
                continue
1407
1408
            #  italian e.g., 'biaggi'
1409
            elif (_string_at((current + 1), 1, {'E', 'I', 'Y'}) or
1410
                  _string_at((current - 1), 4, {'AGGI', 'OGGI'})):
1411
                # obvious germanic
1412
                if (((_string_at(0, 4, {'VAN ', 'VON '}) or
1413
                      _string_at(0, 3, {'SCH'})) or
1414
                     _string_at((current + 1), 2, {'ET'}))):
1415
                    (primary, secondary) = _metaph_add('K')
1416
                elif _string_at((current + 1), 4, {'IER '}):
1417
                    (primary, secondary) = _metaph_add('J')
1418
                else:
1419
                    (primary, secondary) = _metaph_add('J', 'K')
1420
                current += 2
1421
                continue
1422
1423
            else:
1424
                if _get_at(current + 1) == 'G':
1425
                    current += 2
1426
                else:
1427
                    current += 1
1428
                (primary, secondary) = _metaph_add('K')
1429
                continue
1430
1431
        elif _get_at(current) == 'H':
1432
            # only keep if first & before vowel or btw. 2 vowels
1433
            if ((((current == 0) or _is_vowel(current - 1)) and
1434
                 _is_vowel(current + 1))):
1435
                (primary, secondary) = _metaph_add('H')
1436
                current += 2
1437
            else:  # also takes care of 'HH'
1438
                current += 1
1439
            continue
1440
1441
        elif _get_at(current) == 'J':
1442
            # obvious spanish, 'jose', 'san jacinto'
1443
            if _string_at(current, 4, {'JOSE'}) or _string_at(0, 4, {'SAN '}):
1444
                if ((((current == 0) and (_get_at(current + 4) == ' ')) or
1445
                     _string_at(0, 4, {'SAN '}))):
1446
                    (primary, secondary) = _metaph_add('H')
1447
                else:
1448
                    (primary, secondary) = _metaph_add('J', 'H')
1449
                current += 1
1450
                continue
1451
1452
            elif (current == 0) and not _string_at(current, 4, {'JOSE'}):
1453
                # Yankelovich/Jankelowicz
1454
                (primary, secondary) = _metaph_add('J', 'A')
1455
            # Spanish pron. of e.g. 'bajador'
1456
            elif (_is_vowel(current - 1) and
1457
                  not _slavo_germanic() and
1458
                  ((_get_at(current + 1) == 'A') or
1459
                   (_get_at(current + 1) == 'O'))):
1460
                (primary, secondary) = _metaph_add('J', 'H')
1461
            elif current == last:
1462
                (primary, secondary) = _metaph_add('J', ' ')
1463
            elif (not _string_at((current + 1), 1,
1464
                                 {'L', 'T', 'K', 'S', 'N', 'M', 'B', 'Z'}) and
1465
                  not _string_at((current - 1), 1, {'S', 'K', 'L'})):
1466
                (primary, secondary) = _metaph_add('J')
1467
1468
            if _get_at(current + 1) == 'J':  # it could happen!
1469
                current += 2
1470
            else:
1471
                current += 1
1472
            continue
1473
1474
        elif _get_at(current) == 'K':
1475
            if _get_at(current + 1) == 'K':
1476
                current += 2
1477
            else:
1478
                current += 1
1479
            (primary, secondary) = _metaph_add('K')
1480
            continue
1481
1482
        elif _get_at(current) == 'L':
1483
            if _get_at(current + 1) == 'L':
1484
                # Spanish e.g. 'cabrillo', 'gallegos'
1485
                if (((current == (length - 3)) and
1486
                     _string_at((current - 1), 4, {'ILLO', 'ILLA', 'ALLE'})) or
1487
                        ((_string_at((last - 1), 2, {'AS', 'OS'}) or
1488
                          _string_at(last, 1, {'A', 'O'})) and
1489
                         _string_at((current - 1), 4, {'ALLE'}))):
1490
                    (primary, secondary) = _metaph_add('L', ' ')
1491
                    current += 2
1492
                    continue
1493
                current += 2
1494
            else:
1495
                current += 1
1496
            (primary, secondary) = _metaph_add('L')
1497
            continue
1498
1499
        elif _get_at(current) == 'M':
1500
            if (((_string_at((current - 1), 3, {'UMB'}) and
1501
                  (((current + 1) == last) or
1502
                   _string_at((current + 2), 2, {'ER'}))) or
1503
                 # 'dumb', 'thumb'
1504
                 (_get_at(current + 1) == 'M'))):
1505
                current += 2
1506
            else:
1507
                current += 1
1508
            (primary, secondary) = _metaph_add('M')
1509
            continue
1510
1511
        elif _get_at(current) == 'N':
1512
            if _get_at(current + 1) == 'N':
1513
                current += 2
1514
            else:
1515
                current += 1
1516
            (primary, secondary) = _metaph_add('N')
1517
            continue
1518
1519
        elif _get_at(current) == 'Ñ':
1520
            current += 1
1521
            (primary, secondary) = _metaph_add('N')
1522
            continue
1523
1524
        elif _get_at(current) == 'P':
1525
            if _get_at(current + 1) == 'H':
1526
                (primary, secondary) = _metaph_add('F')
1527
                current += 2
1528
                continue
1529
1530
            # also account for "campbell", "raspberry"
1531
            elif _string_at((current + 1), 1, {'P', 'B'}):
1532
                current += 2
1533
            else:
1534
                current += 1
1535
            (primary, secondary) = _metaph_add('P')
1536
            continue
1537
1538
        elif _get_at(current) == 'Q':
1539
            if _get_at(current + 1) == 'Q':
1540
                current += 2
1541
            else:
1542
                current += 1
1543
            (primary, secondary) = _metaph_add('K')
1544
            continue
1545
1546
        elif _get_at(current) == 'R':
1547
            # french e.g. 'rogier', but exclude 'hochmeier'
1548
            if (((current == last) and
1549
                 not _slavo_germanic() and
1550
                 _string_at((current - 2), 2, {'IE'}) and
1551
                 not _string_at((current - 4), 2, {'ME', 'MA'}))):
1552
                (primary, secondary) = _metaph_add('', 'R')
1553
            else:
1554
                (primary, secondary) = _metaph_add('R')
1555
1556
            if _get_at(current + 1) == 'R':
1557
                current += 2
1558
            else:
1559
                current += 1
1560
            continue
1561
1562
        elif _get_at(current) == 'S':
1563
            # special cases 'island', 'isle', 'carlisle', 'carlysle'
1564
            if _string_at((current - 1), 3, {'ISL', 'YSL'}):
1565
                current += 1
1566
                continue
1567
1568
            # special case 'sugar-'
1569
            elif (current == 0) and _string_at(current, 5, {'SUGAR'}):
1570
                (primary, secondary) = _metaph_add('X', 'S')
1571
                current += 1
1572
                continue
1573
1574
            elif _string_at(current, 2, {'SH'}):
1575
                # Germanic
1576
                if _string_at((current + 1), 4,
1577
                              {'HEIM', 'HOEK', 'HOLM', 'HOLZ'}):
1578
                    (primary, secondary) = _metaph_add('S')
1579
                else:
1580
                    (primary, secondary) = _metaph_add('X')
1581
                current += 2
1582
                continue
1583
1584
            # Italian & Armenian
1585
            elif (_string_at(current, 3, {'SIO', 'SIA'}) or
1586
                  _string_at(current, 4, {'SIAN'})):
1587
                if not _slavo_germanic():
1588
                    (primary, secondary) = _metaph_add('S', 'X')
1589
                else:
1590
                    (primary, secondary) = _metaph_add('S')
1591
                current += 3
1592
                continue
1593
1594
            # German & anglicisations, e.g. 'smith' match 'schmidt',
1595
            #                               'snider' match 'schneider'
1596
            # also, -sz- in Slavic languages, although in Hungarian it is
1597
            #       pronounced 's'
1598
            elif (((current == 0) and
1599
                   _string_at((current + 1), 1, {'M', 'N', 'L', 'W'})) or
1600
                  _string_at((current + 1), 1, {'Z'})):
1601
                (primary, secondary) = _metaph_add('S', 'X')
1602
                if _string_at((current + 1), 1, {'Z'}):
1603
                    current += 2
1604
                else:
1605
                    current += 1
1606
                continue
1607
1608
            elif _string_at(current, 2, {'SC'}):
1609
                # Schlesinger's rule
1610
                if _get_at(current + 2) == 'H':
1611
                    # dutch origin, e.g. 'school', 'schooner'
1612
                    if _string_at((current + 3), 2,
1613
                                  {'OO', 'ER', 'EN', 'UY', 'ED', 'EM'}):
1614
                        # 'schermerhorn', 'schenker'
1615
                        if _string_at((current + 3), 2, {'ER', 'EN'}):
1616
                            (primary, secondary) = _metaph_add('X', 'SK')
1617
                        else:
1618
                            (primary, secondary) = _metaph_add('SK')
1619
                        current += 3
1620
                        continue
1621
                    else:
1622
                        if (((current == 0) and not _is_vowel(3) and
1623
                             (_get_at(3) != 'W'))):
1624
                            (primary, secondary) = _metaph_add('X', 'S')
1625
                        else:
1626
                            (primary, secondary) = _metaph_add('X')
1627
                        current += 3
1628
                        continue
1629
1630
                elif _string_at((current + 2), 1, {'I', 'E', 'Y'}):
1631
                    (primary, secondary) = _metaph_add('S')
1632
                    current += 3
1633
                    continue
1634
1635
                # else
1636
                else:
1637
                    (primary, secondary) = _metaph_add('SK')
1638
                    current += 3
1639
                    continue
1640
1641
            else:
1642
                # french e.g. 'resnais', 'artois'
1643
                if (current == last) and _string_at((current - 2), 2,
1644
                                                    {'AI', 'OI'}):
1645
                    (primary, secondary) = _metaph_add('', 'S')
1646
                else:
1647
                    (primary, secondary) = _metaph_add('S')
1648
1649
                if _string_at((current + 1), 1, {'S', 'Z'}):
1650
                    current += 2
1651
                else:
1652
                    current += 1
1653
                continue
1654
1655
        elif _get_at(current) == 'T':
1656
            if _string_at(current, 4, {'TION'}):
1657
                (primary, secondary) = _metaph_add('X')
1658
                current += 3
1659
                continue
1660
1661
            elif _string_at(current, 3, {'TIA', 'TCH'}):
1662
                (primary, secondary) = _metaph_add('X')
1663
                current += 3
1664
                continue
1665
1666
            elif (_string_at(current, 2, {'TH'}) or
1667
                  _string_at(current, 3, {'TTH'})):
1668
                # special case 'thomas', 'thames' or germanic
1669
                if ((_string_at((current + 2), 2, {'OM', 'AM'}) or
1670
                     _string_at(0, 4, {'VAN ', 'VON '}) or
1671
                     _string_at(0, 3, {'SCH'}))):
1672
                    (primary, secondary) = _metaph_add('T')
1673
                else:
1674
                    (primary, secondary) = _metaph_add('0', 'T')
1675
                current += 2
1676
                continue
1677
1678
            elif _string_at((current + 1), 1, {'T', 'D'}):
1679
                current += 2
1680
            else:
1681
                current += 1
1682
            (primary, secondary) = _metaph_add('T')
1683
            continue
1684
1685
        elif _get_at(current) == 'V':
1686
            if _get_at(current + 1) == 'V':
1687
                current += 2
1688
            else:
1689
                current += 1
1690
            (primary, secondary) = _metaph_add('F')
1691
            continue
1692
1693
        elif _get_at(current) == 'W':
1694
            # can also be in middle of word
1695
            if _string_at(current, 2, {'WR'}):
1696
                (primary, secondary) = _metaph_add('R')
1697
                current += 2
1698
                continue
1699
            elif ((current == 0) and
1700
                  (_is_vowel(current + 1) or _string_at(current, 2, {'WH'}))):
1701
                # Wasserman should match Vasserman
1702
                if _is_vowel(current + 1):
1703
                    (primary, secondary) = _metaph_add('A', 'F')
1704
                else:
1705
                    # need Uomo to match Womo
1706
                    (primary, secondary) = _metaph_add('A')
1707
1708
            # Arnow should match Arnoff
1709
            if ((((current == last) and _is_vowel(current - 1)) or
1710
                 _string_at((current - 1), 5,
1711
                            {'EWSKI', 'EWSKY', 'OWSKI', 'OWSKY'}) or
1712
                 _string_at(0, 3, {'SCH'}))):
1713
                (primary, secondary) = _metaph_add('', 'F')
1714
                current += 1
1715
                continue
1716
            # Polish e.g. 'filipowicz'
1717
            elif _string_at(current, 4, {'WICZ', 'WITZ'}):
1718
                (primary, secondary) = _metaph_add('TS', 'FX')
1719
                current += 4
1720
                continue
1721
            # else skip it
1722
            else:
1723
                current += 1
1724
                continue
1725
1726
        elif _get_at(current) == 'X':
1727
            # French e.g. breaux
1728
            if (not ((current == last) and
1729
                     (_string_at((current - 3), 3, {'IAU', 'EAU'}) or
1730
                      _string_at((current - 2), 2, {'AU', 'OU'})))):
1731
                (primary, secondary) = _metaph_add('KS')
1732
1733
            if _string_at((current + 1), 1, {'C', 'X'}):
1734
                current += 2
1735
            else:
1736
                current += 1
1737
            continue
1738
1739
        elif _get_at(current) == 'Z':
1740
            # Chinese Pinyin e.g. 'zhao'
1741
            if _get_at(current + 1) == 'H':
1742
                (primary, secondary) = _metaph_add('J')
1743
                current += 2
1744
                continue
1745
            elif (_string_at((current + 1), 2, {'ZO', 'ZI', 'ZA'}) or
1746
                  (_slavo_germanic() and ((current > 0) and
1747
                                          _get_at(current - 1) != 'T'))):
1748
                (primary, secondary) = _metaph_add('S', 'TS')
1749
            else:
1750
                (primary, secondary) = _metaph_add('S')
1751
1752
            if _get_at(current + 1) == 'Z':
1753
                current += 2
1754
            else:
1755
                current += 1
1756
            continue
1757
1758
        else:
1759
            current += 1
1760
1761
    if maxlength and maxlength < _INFINITY:
1762
        primary = primary[:maxlength]
1763
        secondary = secondary[:maxlength]
1764
    if primary == secondary:
1765
        secondary = ''
1766
1767
    return (primary, secondary)
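

# A minimal usage sketch (not part of the original module): the function
# above returns a (primary, secondary) pair, and two words are commonly
# treated as a match when any of their non-empty codes coincide. The helper
# name below is hypothetical and purely illustrative.
def _dm_codes_match(codes1, codes2):
    """Return True if two (primary, secondary) code pairs overlap."""
    return bool((set(codes1) - {''}) & (set(codes2) - {''}))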
1768
1769
1770
def caverphone(word, version=2):
1771
    """Return the Caverphone code for a word.
1772
1773
    A description of version 1 of the algorithm can be found at:
1774
    http://caversham.otago.ac.nz/files/working/ctp060902.pdf
1775
1776
    A description of version 2 of the algorithm can be found at:
1777
    http://caversham.otago.ac.nz/files/working/ctp150804.pdf
1778
1779
    :param str word: the word to transform
1780
    :param int version: the version of Caverphone to employ for encoding
1781
        (defaults to 2)
1782
    :returns: the Caverphone value
1783
    :rtype: str
1784
1785
    >>> caverphone('Christopher')
1786
    'KRSTFA1111'
1787
    >>> caverphone('Niall')
1788
    'NA11111111'
1789
    >>> caverphone('Smith')
1790
    'SMT1111111'
1791
    >>> caverphone('Schmidt')
1792
    'SKMT111111'
1793
1794
    >>> caverphone('Christopher', 1)
1795
    'KRSTF1'
1796
    >>> caverphone('Niall', 1)
1797
    'N11111'
1798
    >>> caverphone('Smith', 1)
1799
    'SMT111'
1800
    >>> caverphone('Schmidt', 1)
1801
    'SKMT11'
1802
    """
1803
    _vowels = {'a', 'e', 'i', 'o', 'u'}
1804
1805
    word = word.lower()
1806
    word = ''.join(c for c in word if c in
1807
                   {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
1808
                    'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x',
1809
                    'y', 'z'})
1810
1811
    def _squeeze_replace(word, char, new_char):
1812
        """Convert strings of char in word to one instance of new_char."""
1813
        while char * 2 in word:
1814
            word = word.replace(char * 2, char)
1815
        return word.replace(char, new_char)
1816
1817
    # the main replacement algorithm
1818
    if version != 1 and word[-1:] == 'e':
1819
        word = word[:-1]
1820
    if word:
1821
        if word[:5] == 'cough':
1822
            word = 'cou2f'+word[5:]
1823
        if word[:5] == 'rough':
1824
            word = 'rou2f'+word[5:]
1825
        if word[:5] == 'tough':
1826
            word = 'tou2f'+word[5:]
1827
        if word[:6] == 'enough':
1828
            word = 'enou2f'+word[6:]
1829
        if version != 1 and word[:6] == 'trough':
1830
            word = 'trou2f'+word[6:]
1831
        if word[:2] == 'gn':
1832
            word = '2n'+word[2:]
1833
        if word[-2:] == 'mb':
1834
            word = word[:-1]+'2'
1835
        word = word.replace('cq', '2q')
1836
        word = word.replace('ci', 'si')
1837
        word = word.replace('ce', 'se')
1838
        word = word.replace('cy', 'sy')
1839
        word = word.replace('tch', '2ch')
1840
        word = word.replace('c', 'k')
1841
        word = word.replace('q', 'k')
1842
        word = word.replace('x', 'k')
1843
        word = word.replace('v', 'f')
1844
        word = word.replace('dg', '2g')
1845
        word = word.replace('tio', 'sio')
1846
        word = word.replace('tia', 'sia')
1847
        word = word.replace('d', 't')
1848
        word = word.replace('ph', 'fh')
1849
        word = word.replace('b', 'p')
1850
        word = word.replace('sh', 's2')
1851
        word = word.replace('z', 's')
1852
        if word[0] in _vowels:
1853
            word = 'A'+word[1:]
1854
        word = word.replace('a', '3')
1855
        word = word.replace('e', '3')
1856
        word = word.replace('i', '3')
1857
        word = word.replace('o', '3')
1858
        word = word.replace('u', '3')
1859
        if version != 1:
1860
            word = word.replace('j', 'y')
1861
            if word[:2] == 'y3':
1862
                word = 'Y3'+word[2:]
1863
            if word[:1] == 'y':
1864
                word = 'A'+word[1:]
1865
            word = word.replace('y', '3')
1866
        word = word.replace('3gh3', '3kh3')
1867
        word = word.replace('gh', '22')
1868
        word = word.replace('g', 'k')
1869
1870
        word = _squeeze_replace(word, 's', 'S')
1871
        word = _squeeze_replace(word, 't', 'T')
1872
        word = _squeeze_replace(word, 'p', 'P')
1873
        word = _squeeze_replace(word, 'k', 'K')
1874
        word = _squeeze_replace(word, 'f', 'F')
1875
        word = _squeeze_replace(word, 'm', 'M')
1876
        word = _squeeze_replace(word, 'n', 'N')
1877
1878
        word = word.replace('w3', 'W3')
1879
        if version == 1:
1880
            word = word.replace('wy', 'Wy')
1881
        word = word.replace('wh3', 'Wh3')
1882
        if version == 1:
1883
            word = word.replace('why', 'Why')
1884
        if version != 1 and word[-1:] == 'w':
1885
            word = word[:-1]+'3'
1886
        word = word.replace('w', '2')
1887
        if word[:1] == 'h':
1888
            word = 'A'+word[1:]
1889
        word = word.replace('h', '2')
1890
        word = word.replace('r3', 'R3')
1891
        if version == 1:
1892
            word = word.replace('ry', 'Ry')
1893
        if version != 1 and word[-1:] == 'r':
1894
            word = word[:-1]+'3'
1895
        word = word.replace('r', '2')
1896
        word = word.replace('l3', 'L3')
1897
        if version == 1:
1898
            word = word.replace('ly', 'Ly')
1899
        if version != 1 and word[-1:] == 'l':
1900
            word = word[:-1]+'3'
1901
        word = word.replace('l', '2')
1902
        if version == 1:
1903
            word = word.replace('j', 'y')
1904
            word = word.replace('y3', 'Y3')
1905
            word = word.replace('y', '2')
1906
        word = word.replace('2', '')
1907
        if version != 1 and word[-1:] == '3':
1908
            word = word[:-1]+'A'
1909
        word = word.replace('3', '')
1910
1911
    # pad with 1s, then extract the necessary length of code
1912
    word = word+'1'*10
1913
    if version != 1:
1914
        word = word[:10]
1915
    else:
1916
        word = word[:6]
1917
1918
    return word
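

# A minimal usage sketch (not part of the original module): since Caverphone
# codes are fixed-length strings (see the doctests above, e.g. 'Christopher'
# -> 'KRSTFA1111'), names can be bucketed by code for blocking or duplicate
# detection. The helper below is hypothetical.
def _caverphone_blocks(names, version=2):
    """Group names that share a Caverphone code (illustrative only)."""
    blocks = {}
    for name in names:
        blocks.setdefault(caverphone(name, version), []).append(name)
    return blocks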
1919
1920
1921
def alpha_sis(word, maxlength=14):
1922
    """Return the IBM Alpha Search Inquiry System code for a word.
1923
1924
    Based on the algorithm described in "Accessing individual records from
1925
    personal data files using non-unique identifiers" / Gwendolyn B. Moore,
1926
    et al.; prepared for the Institute for Computer Sciences and Technology,
1927
    National Bureau of Standards, Washington, D.C (1977):
1928
    https://archive.org/stream/accessingindivid00moor#page/15/mode/1up
1929
1930
    A collection is necessary since there can be multiple values for a
1931
    single word. But the collection must be ordered since the first value
1932
    is the primary coding.
1933
1934
    :param str word: the word to transform
1935
    :param int maxlength: the length of the code returned (defaults to 14)
1936
    :returns: the Alpha SIS value
1937
    :rtype: tuple
1938
1939
    >>> alpha_sis('Christopher')
1940
    ('06401840000000', '07040184000000', '04018400000000')
1941
    >>> alpha_sis('Niall')
1942
    ('02500000000000',)
1943
    >>> alpha_sis('Smith')
1944
    ('03100000000000',)
1945
    >>> alpha_sis('Schmidt')
1946
    ('06310000000000',)
1947
    """
1948
    _alpha_sis_initials = {'GF': '08', 'GM': '03', 'GN': '02', 'KN': '02',
1949
                           'PF': '08', 'PN': '02', 'PS': '00', 'WR': '04',
1950
                           'A': '1', 'E': '1', 'H': '2', 'I': '1', 'J': '3',
1951
                           'O': '1', 'U': '1', 'W': '4', 'Y': '5'}
1952
    _alpha_sis_initials_order = ('GF', 'GM', 'GN', 'KN', 'PF', 'PN', 'PS',
1953
                                 'WR', 'A', 'E', 'H', 'I', 'J', 'O', 'U', 'W',
1954
                                 'Y')
1955
    _alpha_sis_basic = {'SCH': '6', 'CZ': ('70', '6', '0'),
1956
                        'CH': ('6', '70', '0'), 'CK': ('7', '6'),
1957
                        'DS': ('0', '10'), 'DZ': ('0', '10'),
1958
                        'TS': ('0', '10'), 'TZ': ('0', '10'), 'CI': '0',
1959
                        'CY': '0', 'CE': '0', 'SH': '6', 'DG': '7', 'PH': '8',
1960
                        'C': ('7', '6'), 'K': ('7', '6'), 'Z': '0', 'S': '0',
1961
                        'D': '1', 'T': '1', 'N': '2', 'M': '3', 'R': '4',
1962
                        'L': '5', 'J': '6', 'G': '7', 'Q': '7', 'X': '7',
1963
                        'F': '8', 'V': '8', 'B': '9', 'P': '9'}
1964
    _alpha_sis_basic_order = ('SCH', 'CZ', 'CH', 'CK', 'DS', 'DZ', 'TS', 'TZ',
1965
                              'CI', 'CY', 'CE', 'SH', 'DG', 'PH', 'C', 'K',
1966
                              'Z', 'S', 'D', 'T', 'N', 'M', 'R', 'L', 'J', 'C',
1967
                              'G', 'K', 'Q', 'X', 'F', 'V', 'B', 'P')
1968
1969
    alpha = ['']
1970
    pos = 0
1971
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
1972
    word = word.replace('ß', 'SS')
1973
    word = ''.join(c for c in word if c in
1974
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
1975
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
1976
                    'Y', 'Z'})
1977
1978
    # Clamp maxlength to [4, 64]
1979
    if maxlength is not None:
1980
        maxlength = min(max(4, maxlength), 64)
1981
    else:
1982
        maxlength = 64
1983
1984
    # Do special processing for initial substrings
1985
    for k in _alpha_sis_initials_order:
1986
        if word.startswith(k):
1987
            alpha[0] += _alpha_sis_initials[k]
1988
            pos += len(k)
1989
            break
1990
1991
    # Add a '0' if alpha is still empty
1992
    if not alpha[0]:
1993
        alpha[0] += '0'
1994
1995
    # Whether or not any special initial codes were encoded, iterate
1996
    # through the length of the word in the main encoding loop
1997
    while pos < len(word):
1998
        origpos = pos
1999
        for k in _alpha_sis_basic_order:
2000
            if word[pos:].startswith(k):
2001
                if isinstance(_alpha_sis_basic[k], tuple):
2002
                    newalpha = []
2003
                    for i in range(len(_alpha_sis_basic[k])):
2004
                        newalpha += [_ + _alpha_sis_basic[k][i] for _ in alpha]
2005
                    alpha = newalpha
2006
                else:
2007
                    alpha = [_ + _alpha_sis_basic[k] for _ in alpha]
2008
                pos += len(k)
2009
                break
2010
        if pos == origpos:
2011
            alpha = [_ + '_' for _ in alpha]
2012
            pos += 1
2013
2014
    # Trim doublets and placeholders
2015
    for i in range(len(alpha)):
2016
        pos = 1
2017
        while pos < len(alpha[i]):
2018
            if alpha[i][pos] == alpha[i][pos-1]:
2019
                alpha[i] = alpha[i][:pos]+alpha[i][pos+1:]
2020
            pos += 1
2021
    alpha = (_.replace('_', '') for _ in alpha)
2022
2023
    # Trim codes and return tuple
2024
    alpha = ((_ + ('0'*maxlength))[:maxlength] for _ in alpha)
2025
    return tuple(alpha)
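

# A minimal usage sketch (not part of the original module): the first element
# of the tuple returned by alpha_sis() is the primary coding, so a strict
# comparison can look only at primaries, while the full tuples allow looser
# matching. The helper below is hypothetical.
def _alpha_sis_primary_match(word1, word2):
    """Return True if two words share their primary Alpha SIS coding."""
    return alpha_sis(word1)[0] == alpha_sis(word2)[0]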
2026
2027
2028
def fuzzy_soundex(word, maxlength=5, zero_pad=True):
2029
    """Return the Fuzzy Soundex code for a word.
2030
2031
    Fuzzy Soundex is an algorithm derived from Soundex, defined in:
2032
    Holmes, David and M. Catherine McCabe. "Improving Precision and Recall for
2033
    Soundex Retrieval."
2034
    http://wayback.archive.org/web/20100629121128/http://www.ir.iit.edu/publications/downloads/IEEESoundexV5.pdf
2035
2036
    :param str word: the word to transform
2037
    :param int maxlength: the length of the code returned (defaults to 5)
2038
    :param bool zero_pad: pad the end of the return value with 0s to achieve
2039
        a maxlength string
2040
    :returns: the Fuzzy Soundex value
2041
    :rtype: str
2042
2043
    >>> fuzzy_soundex('Christopher')
2044
    'K6931'
2045
    >>> fuzzy_soundex('Niall')
2046
    'N4000'
2047
    >>> fuzzy_soundex('Smith')
2048
    'S5300'
2049
    >>> fuzzy_soundex('Schmidt')
2050
    'S5300'
2051
    """
2052
    _fuzzy_soundex_translation = dict(zip((ord(_) for _ in
2053
                                           'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
2054
                                          '0193017-07745501769301-7-9'))
2055
2056
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
2057
    word = word.replace('ß', 'SS')
2058
2059
    # Clamp maxlength to [4, 64]
2060
    if maxlength is not None:
2061
        maxlength = min(max(4, maxlength), 64)
2062
    else:
2063
        maxlength = 64
2064
2065
    if not word:
2066
        if zero_pad:
2067
            return '0' * maxlength
2068
        return '0'
2069
2070
    if word[:2] in {'CS', 'CZ', 'TS', 'TZ'}:
2071
        word = 'SS' + word[2:]
2072
    elif word[:2] == 'GN':
2073
        word = 'NN' + word[2:]
2074
    elif word[:2] in {'HR', 'WR'}:
2075
        word = 'RR' + word[2:]
2076
    elif word[:2] == 'HW':
2077
        word = 'WW' + word[2:]
2078
    elif word[:2] in {'KN', 'NG'}:
2079
        word = 'NN' + word[2:]
2080
2081
    if word[-2:] == 'CH':
2082
        word = word[:-2] + 'KK'
2083
    elif word[-2:] == 'NT':
2084
        word = word[:-2] + 'TT'
2085
    elif word[-2:] == 'RT':
2086
        word = word[:-2] + 'RR'
2087
    elif word[-3:] == 'RDT':
2088
        word = word[:-3] + 'RR'
2089
2090
    word = word.replace('CA', 'KA')
2091
    word = word.replace('CC', 'KK')
2092
    word = word.replace('CK', 'KK')
2093
    word = word.replace('CE', 'SE')
2094
    word = word.replace('CHL', 'KL')
2095
    word = word.replace('CL', 'KL')
2096
    word = word.replace('CHR', 'KR')
2097
    word = word.replace('CR', 'KR')
2098
    word = word.replace('CI', 'SI')
2099
    word = word.replace('CO', 'KO')
2100
    word = word.replace('CU', 'KU')
2101
    word = word.replace('CY', 'SY')
2102
    word = word.replace('DG', 'GG')
2103
    word = word.replace('GH', 'HH')
2104
    word = word.replace('MAC', 'MK')
2105
    word = word.replace('MC', 'MK')
2106
    word = word.replace('NST', 'NSS')
2107
    word = word.replace('PF', 'FF')
2108
    word = word.replace('PH', 'FF')
2109
    word = word.replace('SCH', 'SSS')
2110
    word = word.replace('TIO', 'SIO')
2111
    word = word.replace('TIA', 'SIO')
2112
    word = word.replace('TCH', 'CHH')
2113
2114
    sdx = word.translate(_fuzzy_soundex_translation)
2115
    sdx = sdx.replace('-', '')
2116
2117
    # remove repeating characters
2118
    sdx = _delete_consecutive_repeats(sdx)
2119
2120
    if word[0] in {'H', 'W', 'Y'}:
2121
        sdx = word[0] + sdx
2122
    else:
2123
        sdx = word[0] + sdx[1:]
2124
2125
    sdx = sdx.replace('0', '')
2126
2127
    if zero_pad:
2128
        sdx += ('0'*maxlength)
2129
2130
    return sdx[:maxlength]
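

# A minimal usage sketch (not part of the original module): the hypothetical
# helper below gathers Fuzzy Soundex codes at a few settings to show the
# effect of the maxlength and zero_pad parameters.
def _fuzzy_soundex_demo(names=('Christopher', 'Niall')):
    """Return codes at a few settings, keyed by name (illustrative only)."""
    return {name: (fuzzy_soundex(name),
                   fuzzy_soundex(name, maxlength=8),
                   fuzzy_soundex(name, maxlength=4, zero_pad=False))
            for name in names}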
2131
2132
2133
def phonex(word, maxlength=4, zero_pad=True):
2134
    """Return the Phonex code for a word.
2135
2136
    Phonex is an algorithm derived from Soundex, defined in:
2137
    Lait, A. J. and B. Randell. "An Assessment of Name Matching Algorithms".
2138
    http://homepages.cs.ncl.ac.uk/brian.randell/Genealogy/NameMatching.pdf
2139
2140
    :param str word: the word to transform
2141
    :param int maxlength: the length of the code returned (defaults to 4)
2142
    :param bool zero_pad: pad the end of the return value with 0s to achieve
2143
        a maxlength string
2144
    :returns: the Phonex value
2145
    :rtype: str
2146
2147
    >>> phonex('Christopher')
2148
    'C623'
2149
    >>> phonex('Niall')
2150
    'N400'
2151
    >>> phonex('Schmidt')
2152
    'S253'
2153
    >>> phonex('Smith')
2154
    'S530'
2155
    """
2156
    name = unicodedata.normalize('NFKD', text_type(word.upper()))
2157
    name = name.replace('ß', 'SS')
2158
2159
    # Clamp maxlength to [4, 64]
2160
    if maxlength is not None:
2161
        maxlength = min(max(4, maxlength), 64)
2162
    else:
2163
        maxlength = 64
2164
2165
    name_code = last = ''
2166
2167
    # Deletions effected by replacing with next letter which
2168
    # will be ignored due to duplicate handling of Soundex code.
2169
    # This is faster than 'moving' all subsequent letters.
2170
2171
    # Remove any trailing Ss
2172
    while name[-1:] == 'S':
2173
        name = name[:-1]
2174
2175
    # Phonetic equivalents of first 2 characters
2176
    # Works since duplicate letters are ignored
2177
    if name[:2] == 'KN':
2178
        name = 'N' + name[2:]  # KN.. == N..
2179
    elif name[:2] == 'PH':
2180
        name = 'F' + name[2:]  # PH.. == F.. (H ignored anyway)
2181
    elif name[:2] == 'WR':
2182
        name = 'R' + name[2:]  # WR.. == R..
2183
2184
    if name:
2185
        # Special case, ignore H first letter (subsequent Hs ignored anyway)
2186
        # Works since duplicate letters are ignored
2187
        if name[0] == 'H':
2188
            name = name[1:]
2189
2190
    if name:
2191
        # Phonetic equivalents of first character
2192
        if name[0] in {'A', 'E', 'I', 'O', 'U', 'Y'}:
2193
            name = 'A' + name[1:]
2194
        elif name[0] in {'B', 'P'}:
2195
            name = 'B' + name[1:]
2196
        elif name[0] in {'V', 'F'}:
2197
            name = 'F' + name[1:]
2198
        elif name[0] in {'C', 'K', 'Q'}:
2199
            name = 'C' + name[1:]
2200
        elif name[0] in {'G', 'J'}:
2201
            name = 'G' + name[1:]
2202
        elif name[0] in {'S', 'Z'}:
2203
            name = 'S' + name[1:]
2204
2205
        name_code = last = name[0]
2206
2207
    # MODIFIED SOUNDEX CODE
2208
    for i in range(1, len(name)):
2209
        code = '0'
2210
        if name[i] in {'B', 'F', 'P', 'V'}:
2211
            code = '1'
2212
        elif name[i] in {'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'}:
2213
            code = '2'
2214
        elif name[i] in {'D', 'T'}:
2215
            if name[i+1:i+2] != 'C':
2216
                code = '3'
2217
        elif name[i] == 'L':
2218
            if (name[i+1:i+2] in {'A', 'E', 'I', 'O', 'U', 'Y'} or
2219
                    i+1 == len(name)):
2220
                code = '4'
2221
        elif name[i] in {'M', 'N'}:
2222
            if name[i+1:i+2] in {'D', 'G'}:
2223
                name = name[:i+1] + name[i] + name[i+2:]
2224
            code = '5'
2225
        elif name[i] == 'R':
2226
            if (name[i+1:i+2] in {'A', 'E', 'I', 'O', 'U', 'Y'} or
2227
                    i+1 == len(name)):
2228
                code = '6'
2229
2230
        if code != last and code != '0' and i != 0:
2231
            name_code += code
2232
2233
        last = name_code[-1]
2234
2235
    if zero_pad:
2236
        name_code += '0' * maxlength
2237
    if not name_code:
2238
        name_code = '0'
2239
    return name_code[:maxlength]
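

# A minimal usage sketch (not part of the original module): because Phonex
# strips trailing 'S' before coding (see above), singular and plural surname
# forms such as 'William' and 'Williams' receive the same code. The helper
# below is hypothetical.
def _phonex_equivalent(name1, name2, maxlength=4):
    """Return True if two names share a Phonex code (illustrative only)."""
    return phonex(name1, maxlength) == phonex(name2, maxlength)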
2240
2241
2242
def phonem(word):
2243
    """Return the Phonem code for a word.
2244
2245
    Phonem is defined in:
2246
    Wilde, Georg and Carsten Meyer. 1988. "Nicht wörtlich genommen,
2247
    'Schreibweisentolerante' Suchroutine in dBASE implementiert." c't Magazin
2248
    für Computer Technik. Oct. 1988. 126--131.
2249
2250
    This version is based on the Perl implementation documented at:
2251
    http://ifl.phil-fak.uni-koeln.de/sites/linguistik/Phonetik/import/Phonetik_Files/Allgemeine_Dateien/Martin_Wilz.pdf
2252
    It includes some enhancements presented in the Java port at:
2253
    https://github.com/dcm4che/dcm4che/blob/master/dcm4che-soundex/src/main/java/org/dcm4che3/soundex/Phonem.java
2254
2255
    Phonem is intended chiefly for German names/words.
2256
2257
    :param str word: the word to transform
2258
    :returns: the Phonem value
2259
    :rtype: str
2260
2261
    >>> phonem('Christopher')
2262
    'CRYSDOVR'
2263
    >>> phonem('Niall')
2264
    'NYAL'
2265
    >>> phonem('Smith')
2266
    'SMYD'
2267
    >>> phonem('Schmidt')
2268
    'CMYD'
2269
    """
2270
    _phonem_substitutions = (('SC', 'C'), ('SZ', 'C'), ('CZ', 'C'),
2271
                             ('TZ', 'C'), ('TS', 'C'), ('KS', 'X'),
2272
                             ('PF', 'V'), ('QU', 'KW'), ('PH', 'V'),
2273
                             ('UE', 'Y'), ('AE', 'E'), ('OE', 'Ö'),
2274
                             ('EI', 'AY'), ('EY', 'AY'), ('EU', 'OY'),
2275
                             ('AU', 'A§'), ('OU', '§'))
2276
    _phonem_translation = dict(zip((ord(_) for _ in
2277
                                    'ZKGQÇÑßFWPTÁÀÂÃÅÄÆÉÈÊËIJÌÍÎÏÜݧÚÙÛÔÒÓÕØ'),
2278
                                   'CCCCCNSVVBDAAAAAEEEEEEYYYYYYYYUUUUOOOOÖ'))
2279
2280
    word = unicodedata.normalize('NFC', text_type(word.upper()))
2281
    for i, j in _phonem_substitutions:
2282
        word = word.replace(i, j)
2283
    word = word.translate(_phonem_translation)
2284
2285
    return ''.join(c for c in _delete_consecutive_repeats(word)
2286
                   if c in {'A', 'B', 'C', 'D', 'L', 'M', 'N', 'O', 'R', 'S',
2287
                            'U', 'V', 'W', 'X', 'Y', 'Ö'})
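

# A minimal usage sketch (not part of the original module): Phonem codes are
# variable-length, so they are most naturally used as dictionary keys. The
# hypothetical helper below builds a reverse index from code to words.
def _phonem_index(words):
    """Map each Phonem code to the words that produce it (illustrative)."""
    index = {}
    for word in words:
        index.setdefault(phonem(word), set()).add(word)
    return index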
2288
2289
2290
def phonix(word, maxlength=4, zero_pad=True):
2291
    """Return the Phonix code for a word.
2292
2293
    Phonix is a Soundex-like algorithm defined in:
2294
    T.N. Gadd: PHONIX --- The Algorithm, Program 24/4, 1990, p.363-366.
2295
2296
    This implementation is based on
2297
    http://cpansearch.perl.org/src/ULPFR/WAIT-1.800/soundex.c
2298
    http://cs.anu.edu.au/people/Peter.Christen/Febrl/febrl-0.4.01/encode.py
2299
    and
2300
    https://metacpan.org/pod/Text::Phonetic::Phonix
2301
2302
    :param str word: the word to transform
2303
    :param int maxlength: the length of the code returned (defaults to 4)
2304
    :param bool zero_pad: pad the end of the return value with 0s to achieve
2305
        a maxlength string
2306
    :returns: the Phonix value
2307
    :rtype: str
2308
2309
    >>> phonix('Christopher')
2310
    'K683'
2311
    >>> phonix('Niall')
2312
    'N400'
2313
    >>> phonix('Smith')
2314
    'S530'
2315
    >>> phonix('Schmidt')
2316
    'S530'
2317
    """
2318
    # pylint: disable=too-many-branches
2319
    def _start_repl(word, src, tar, post=None):
2320
        r"""Replace src with tar at the start of word."""
2321
        if post:
2322
            for i in post:
2323
                if word.startswith(src+i):
2324
                    return tar + word[len(src):]
2325
        elif word.startswith(src):
2326
            return tar + word[len(src):]
2327
        return word
2328
2329
    def _end_repl(word, src, tar, pre=None):
2330
        r"""Replace src with tar at the end of word."""
2331
        if pre:
2332
            for i in pre:
2333
                if word.endswith(i+src):
2334
                    return word[:-len(src)] + tar
2335
        elif word.endswith(src):
2336
            return word[:-len(src)] + tar
2337
        return word
2338
2339
    def _mid_repl(word, src, tar, pre=None, post=None):
2340
        r"""Replace src with tar in the middle of word."""
2341
        if pre or post:
2342
            if not pre:
2343
                return word[0] + _all_repl(word[1:], src, tar, pre, post)
2344
            elif not post:
2345
                return _all_repl(word[:-1], src, tar, pre, post) + word[-1]
2346
            return _all_repl(word, src, tar, pre, post)
2347
        return (word[0] + _all_repl(word[1:-1], src, tar, pre, post) +
2348
                word[-1])
2349
2350
    def _all_repl(word, src, tar, pre=None, post=None):
2351
        r"""Replace src with tar anywhere in word."""
2352
        if pre or post:
2353
            if not post:
                post = frozenset(('',))
            if not pre:
                pre = frozenset(('',))
2361
2362
            for i, j in ((i, j) for i in pre for j in post):
2363
                word = word.replace(i+src+j, i+tar+j)
2364
            return word
2365
        else:
2366
            return word.replace(src, tar)
2367
2368
    _vow = {'A', 'E', 'I', 'O', 'U'}
2369
    _con = {'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N', 'P', 'Q',
2370
            'R', 'S', 'T', 'V', 'W', 'X', 'Y', 'Z'}
2371
2372
    _phonix_substitutions = ((_all_repl, 'DG', 'G'),
2373
                             (_all_repl, 'CO', 'KO'),
2374
                             (_all_repl, 'CA', 'KA'),
2375
                             (_all_repl, 'CU', 'KU'),
2376
                             (_all_repl, 'CY', 'SI'),
2377
                             (_all_repl, 'CI', 'SI'),
2378
                             (_all_repl, 'CE', 'SE'),
2379
                             (_start_repl, 'CL', 'KL', _vow),
2380
                             (_all_repl, 'CK', 'K'),
2381
                             (_end_repl, 'GC', 'K'),
2382
                             (_end_repl, 'JC', 'K'),
2383
                             (_start_repl, 'CHR', 'KR', _vow),
2384
                             (_start_repl, 'CR', 'KR', _vow),
2385
                             (_start_repl, 'WR', 'R'),
2386
                             (_all_repl, 'NC', 'NK'),
2387
                             (_all_repl, 'CT', 'KT'),
2388
                             (_all_repl, 'PH', 'F'),
2389
                             (_all_repl, 'AA', 'AR'),
2390
                             (_all_repl, 'SCH', 'SH'),
2391
                             (_all_repl, 'BTL', 'TL'),
2392
                             (_all_repl, 'GHT', 'T'),
2393
                             (_all_repl, 'AUGH', 'ARF'),
2394
                             (_mid_repl, 'LJ', 'LD', _vow, _vow),
2395
                             (_all_repl, 'LOUGH', 'LOW'),
2396
                             (_start_repl, 'Q', 'KW'),
2397
                             (_start_repl, 'KN', 'N'),
2398
                             (_end_repl, 'GN', 'N'),
2399
                             (_all_repl, 'GHN', 'N'),
2400
                             (_end_repl, 'GNE', 'N'),
2401
                             (_all_repl, 'GHNE', 'NE'),
2402
                             (_end_repl, 'GNES', 'NS'),
2403
                             (_start_repl, 'GN', 'N'),
2404
                             (_mid_repl, 'GN', 'N', None, _con),
2405
                             (_end_repl, 'GN', 'N'),
2406
                             (_start_repl, 'PS', 'S'),
2407
                             (_start_repl, 'PT', 'T'),
2408
                             (_start_repl, 'CZ', 'C'),
2409
                             (_mid_repl, 'WZ', 'Z', _vow),
2410
                             (_mid_repl, 'CZ', 'CH'),
2411
                             (_all_repl, 'LZ', 'LSH'),
2412
                             (_all_repl, 'RZ', 'RSH'),
2413
                             (_mid_repl, 'Z', 'S', None, _vow),
2414
                             (_all_repl, 'ZZ', 'TS'),
2415
                             (_mid_repl, 'Z', 'TS', _con),
2416
                             (_all_repl, 'HROUG', 'REW'),
2417
                             (_all_repl, 'OUGH', 'OF'),
2418
                             (_mid_repl, 'Q', 'KW', _vow, _vow),
2419
                             (_mid_repl, 'J', 'Y', _vow, _vow),
2420
                             (_start_repl, 'YJ', 'Y', _vow),
2421
                             (_start_repl, 'GH', 'G'),
2422
                             (_end_repl, 'GH', 'E', _vow),
2423
                             (_start_repl, 'CY', 'S'),
2424
                             (_all_repl, 'NX', 'NKS'),
2425
                             (_start_repl, 'PF', 'F'),
2426
                             (_end_repl, 'DT', 'T'),
2427
                             (_end_repl, 'TL', 'TIL'),
2428
                             (_end_repl, 'DL', 'DIL'),
2429
                             (_all_repl, 'YTH', 'ITH'),
2430
                             (_start_repl, 'TJ', 'CH', _vow),
2431
                             (_start_repl, 'TSJ', 'CH', _vow),
2432
                             (_start_repl, 'TS', 'T', _vow),
2433
                             (_all_repl, 'TCH', 'CH'),
2434
                             (_mid_repl, 'WSK', 'VSKIE', _vow),
2435
                             (_end_repl, 'WSK', 'VSKIE', _vow),
2436
                             (_start_repl, 'MN', 'N', _vow),
2437
                             (_start_repl, 'PN', 'N', _vow),
2438
                             (_mid_repl, 'STL', 'SL', _vow),
2439
                             (_end_repl, 'STL', 'SL', _vow),
2440
                             (_end_repl, 'TNT', 'ENT'),
2441
                             (_end_repl, 'EAUX', 'OH'),
2442
                             (_all_repl, 'EXCI', 'ECS'),
2443
                             (_all_repl, 'X', 'ECS'),
2444
                             (_end_repl, 'NED', 'ND'),
2445
                             (_all_repl, 'JR', 'DR'),
2446
                             (_end_repl, 'EE', 'EA'),
2447
                             (_all_repl, 'ZS', 'S'),
2448
                             (_mid_repl, 'R', 'AH', _vow, _con),
2449
                             (_end_repl, 'R', 'AH', _vow),
2450
                             (_mid_repl, 'HR', 'AH', _vow, _con),
2451
                             (_end_repl, 'HR', 'AH', _vow),
2452
                             (_end_repl, 'HR', 'AH', _vow),
2453
                             (_end_repl, 'RE', 'AR'),
2454
                             (_end_repl, 'R', 'AH', _vow),
2455
                             (_all_repl, 'LLE', 'LE'),
2456
                             (_end_repl, 'LE', 'ILE', _con),
2457
                             (_end_repl, 'LES', 'ILES', _con),
2458
                             (_end_repl, 'E', ''),
2459
                             (_end_repl, 'ES', 'S'),
2460
                             (_end_repl, 'SS', 'AS', _vow),
2461
                             (_end_repl, 'MB', 'M', _vow),
2462
                             (_all_repl, 'MPTS', 'MPS'),
2463
                             (_all_repl, 'MPS', 'MS'),
2464
                             (_all_repl, 'MPT', 'MT'))
2465
2466
    _phonix_translation = dict(zip((ord(_) for _ in
2467
                                    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
2468
                                   '01230720022455012683070808'))
2469
2470
    sdx = ''
2471
2472
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
2473
    word = word.replace('ß', 'SS')
2474
    word = ''.join(c for c in word if c in
2475
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
2476
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
2477
                    'Y', 'Z'})
2478
    if word:
2479
        for trans in _phonix_substitutions:
2480
            word = trans[0](word, *trans[1:])
2481
        if word[0] in {'A', 'E', 'I', 'O', 'U', 'Y'}:
2482
            sdx = 'v' + word[1:].translate(_phonix_translation)
2483
        else:
2484
            sdx = word[0] + word[1:].translate(_phonix_translation)
2485
        sdx = _delete_consecutive_repeats(sdx)
2486
        sdx = sdx.replace('0', '')
2487
2488
    # Clamp maxlength to [4, 64]
2489
    if maxlength is not None:
2490
        maxlength = min(max(4, maxlength), 64)
2491
    else:
2492
        maxlength = 64
2493
2494
    if zero_pad:
2495
        sdx += '0' * maxlength
2496
    if not sdx:
2497
        sdx = '0'
2498
    return sdx[:maxlength]
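

# A minimal usage sketch (not part of the original module): with
# zero_pad=False the Phonix code is truncated but not padded, which allows a
# prefix-style comparison of codes of unequal length. The helper below is
# hypothetical.
def _phonix_prefix_match(name1, name2, maxlength=4):
    """Compare unpadded Phonix codes by shared prefix (illustrative only)."""
    code1 = phonix(name1, maxlength, zero_pad=False)
    code2 = phonix(name2, maxlength, zero_pad=False)
    return code1.startswith(code2) or code2.startswith(code1)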
2499
2500
2501
def sfinxbis(word, maxlength=None):
2502
    """Return the SfinxBis code for a word.
2503
2504
    SfinxBis is a Soundex-like algorithm defined in:
2505
    http://www.swami.se/download/18.248ad5af12aa8136533800091/SfinxBis.pdf
2506
2507
    This implementation follows the reference implementation:
2508
    http://www.swami.se/download/18.248ad5af12aa8136533800093/swamiSfinxBis.java.txt
2509
2510
    SfinxBis is intended chiefly for Swedish names.
2511
2512
    :param str word: the word to transform
2513
    :param int maxlength: the length of the code returned (defaults to
2514
        unlimited)
2515
    :returns: the SfinxBis value
2516
    :rtype: tuple
2517
2518
    >>> sfinxbis('Christopher')
2519
    ('K68376',)
2520
    >>> sfinxbis('Niall')
2521
    ('N4',)
2522
    >>> sfinxbis('Smith')
2523
    ('S53',)
2524
    >>> sfinxbis('Schmidt')
2525
    ('S53',)
2526
2527
    >>> sfinxbis('Johansson')
2528
    ('J585',)
2529
    >>> sfinxbis('Sjöberg')
2530
    ('#162',)
2531
    """
2532
    adelstitler = (' DE LA ', ' DE LAS ', ' DE LOS ', ' VAN DE ', ' VAN DEN ',
2533
                   ' VAN DER ', ' VON DEM ', ' VON DER ',
2534
                   ' AF ', ' AV ', ' DA ', ' DE ', ' DEL ', ' DEN ', ' DES ',
2535
                   ' DI ', ' DO ', ' DON ', ' DOS ', ' DU ', ' E ', ' IN ',
2536
                   ' LA ', ' LE ', ' MAC ', ' MC ', ' VAN ', ' VON ', ' Y ',
2537
                   ' S:T ')
2538
2539
    _harde_vokaler = {'A', 'O', 'U', 'Å'}
2540
    _mjuka_vokaler = {'E', 'I', 'Y', 'Ä', 'Ö'}
2541
    _konsonanter = {'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N', 'P',
2542
                    'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Z'}
2543
    _alfabet = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
2544
                'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
2545
                'Y', 'Z', 'Ä', 'Å', 'Ö'}
2546
2547
    _sfinxbis_translation = dict(zip((ord(_) for _ in
2548
                                      'BCDFGHJKLMNPQRSTVZAOUÅEIYÄÖ'),
2549
                                     '123729224551268378999999999'))
2550
2551
    _sfinxbis_substitutions = dict(zip((ord(_) for _ in
2552
                                        'WZÀÁÂÃÆÇÈÉÊËÌÍÎÏÑÒÓÔÕØÙÚÛÜÝ'),
2553
                                       'VSAAAAÄCEEEEIIIINOOOOÖUUUYY'))
2554
2555
    def _foersvensker(ordet):
2556
        """Return the Swedish-ized form of the word."""
2557
        ordet = ordet.replace('STIERN', 'STJÄRN')
2558
        ordet = ordet.replace('HIE', 'HJ')
2559
        ordet = ordet.replace('SIÖ', 'SJÖ')
2560
        ordet = ordet.replace('SCH', 'SH')
2561
        ordet = ordet.replace('QU', 'KV')
2562
        ordet = ordet.replace('IO', 'JO')
2563
        ordet = ordet.replace('PH', 'F')
2564
2565
        for i in _harde_vokaler:
2566
            ordet = ordet.replace(i+'Ü', i+'J')
2567
            ordet = ordet.replace(i+'Y', i+'J')
2568
            ordet = ordet.replace(i+'I', i+'J')
2569
        for i in _mjuka_vokaler:
2570
            ordet = ordet.replace(i+'Ü', i+'J')
2571
            ordet = ordet.replace(i+'Y', i+'J')
2572
            ordet = ordet.replace(i+'I', i+'J')
2573
2574
        if 'H' in ordet:
2575
            for i in _konsonanter:
2576
                ordet = ordet.replace('H'+i, i)
2577
2578
        ordet = ordet.translate(_sfinxbis_substitutions)
2579
2580
        ordet = ordet.replace('Ð', 'ETH')
2581
        ordet = ordet.replace('Þ', 'TH')
2582
        ordet = ordet.replace('ß', 'SS')
2583
2584
        return ordet
2585
2586
    def _koda_foersta_ljudet(ordet):
2587
        """Return the word with the first sound coded."""
2588
        if ordet[0:1] in _mjuka_vokaler or ordet[0:1] in _harde_vokaler:
2589
            ordet = '$' + ordet[1:]
2590
        elif ordet[0:2] in ('DJ', 'GJ', 'HJ', 'LJ'):
2591
            ordet = 'J' + ordet[2:]
2592
        elif ordet[0:1] == 'G' and ordet[1:2] in _mjuka_vokaler:
2593
            ordet = 'J' + ordet[1:]
2594
        elif ordet[0:1] == 'Q':
2595
            ordet = 'K' + ordet[1:]
2596
        elif (ordet[0:2] == 'CH' and
2597
              ordet[2:3] in frozenset(_mjuka_vokaler | _harde_vokaler)):
2598
            ordet = '#' + ordet[2:]
2599
        elif ordet[0:1] == 'C' and ordet[1:2] in _harde_vokaler:
2600
            ordet = 'K' + ordet[1:]
2601
        elif ordet[0:1] == 'C' and ordet[1:2] in _konsonanter:
2602
            ordet = 'K' + ordet[1:]
2603
        elif ordet[0:1] == 'X':
2604
            ordet = 'S' + ordet[1:]
2605
        elif ordet[0:1] == 'C' and ordet[1:2] in _mjuka_vokaler:
2606
            ordet = 'S' + ordet[1:]
2607
        elif ordet[0:3] in ('SKJ', 'STJ', 'SCH'):
2608
            ordet = '#' + ordet[3:]
2609
        elif ordet[0:2] in ('SH', 'KJ', 'TJ', 'SJ'):
2610
            ordet = '#' + ordet[2:]
2611
        elif ordet[0:2] == 'SK' and ordet[2:3] in _mjuka_vokaler:
2612
            ordet = '#' + ordet[2:]
2613
        elif ordet[0:1] == 'K' and ordet[1:2] in _mjuka_vokaler:
2614
            ordet = '#' + ordet[1:]
2615
        return ordet
2616
2617
    # Step 1: convert to uppercase
2618
    word = unicodedata.normalize('NFC', text_type(word.upper()))
2619
    word = word.replace('ß', 'SS')
2620
    word = word.replace('-', ' ')
2621
2622
    # Step 2: remove nobility prefixes
2623
    for adelstitel in adelstitler:
2624
        while adelstitel in word:
2625
            word = word.replace(adelstitel, ' ')
2626
        if word.startswith(adelstitel[1:]):
2627
            word = word[len(adelstitel)-1:]
2628
2629
    # Split word into tokens
2630
    ordlista = word.split()
2631
2632
    # Step 3: remove doubled letters at the start of the name
2633
    ordlista = [_delete_consecutive_repeats(ordet) for ordet in ordlista]
2634
    if not ordlista:
2635
        return ('',)
2636
2637
    # Step 4: Swedish-ize the name parts
2638
    ordlista = [_foersvensker(ordet) for ordet in ordlista]
2639
2640
    # Step 5: remove all characters other than A-Ö (65-90,196,197,214)
2641
    ordlista = [''.join(c for c in ordet if c in _alfabet)
2642
                for ordet in ordlista]
2643
2644
    # Step 6: code the first sound
2645
    ordlista = [_koda_foersta_ljudet(ordet) for ordet in ordlista]
2646
2647
    # Step 7: split the name into two parts
2648
    rest = [ordet[1:] for ordet in ordlista]
2649
2650
    # Step 8: apply phonetic transformations to the remainder
2651
    rest = [ordet.replace('DT', 'T') for ordet in rest]
2652
    rest = [ordet.replace('X', 'KS') for ordet in rest]
2653
2654
    # Step 9: code the remainder to a numeric code
2655
    for vokal in _mjuka_vokaler:
2656
        rest = [ordet.replace('C'+vokal, '8'+vokal) for ordet in rest]
2657
    rest = [ordet.translate(_sfinxbis_translation) for ordet in rest]
2658
2659
    # Step 10: remove adjacent duplicates
2660
    rest = [_delete_consecutive_repeats(ordet) for ordet in rest]
2661
2662
    # Step 11: remove all '9's
2663
    rest = [ordet.replace('9', '') for ordet in rest]
2664
2665
    # Step 12: rejoin the parts
2666
    ordlista = [''.join(ordet) for ordet in
2667
                zip((_[0:1] for _ in ordlista), rest)]
2668
2669
    # truncate, if maxlength is set
2670
    if maxlength and maxlength < _INFINITY:
2671
        ordlista = [ordet[:maxlength] for ordet in ordlista]
2672
2673
    return tuple(ordlista)
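

# A minimal usage sketch (not part of the original module): sfinxbis()
# returns one code per name part (after splitting on spaces and hyphens), so
# a multi-part name yields a tuple of codes. The hypothetical helper below
# joins them for display.
def _sfinxbis_display(full_name, maxlength=None):
    """Return a space-joined SfinxBis coding of a name (illustrative only)."""
    return ' '.join(sfinxbis(full_name, maxlength))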
2674
2675
2676
def phonet(word, mode=1, lang='de', trace=False):
2677
    """Return the phonet code for a word.
2678
2679
    phonet ("Hannoveraner Phonetik") was developed by Jörg Michael and
2680
    documented in c't magazine vol. 25/1999, p. 252. It is a phonetic
2681
    algorithm designed primarily for German.
2682
    Cf. http://www.heise.de/ct/ftp/99/25/252/
2683
2684
    This is a port of Jesper Zedlitz's code, which is licensed LGPL:
2685
    https://github.com/jze/phonet4java/blob/master/src/main/java/de/zedlitz/phonet4java/Phonet.java
2686
2687
    That is, in turn, based on Michael's C code, which is also licensed LGPL:
2688
    ftp://ftp.heise.de/pub/ct/listings/phonet.zip
2689
2690
    :param str word: the word to transform
2691
    :param int mode: the phonet variant to employ (1 or 2)
2692
    :param str lang: 'de' (default) for German
2693
            'none' for no language
2694
    :param bool trace: prints debugging info if True
2695
    :returns: the phonet value
2696
    :rtype: str
2697
2698
    >>> phonet('Christopher')
2699
    'KRISTOFA'
2700
    >>> phonet('Niall')
2701
    'NIAL'
2702
    >>> phonet('Smith')
2703
    'SMIT'
2704
    >>> phonet('Schmidt')
2705
    'SHMIT'
2706
2707
    >>> phonet('Christopher', mode=2)
2708
    'KRIZTUFA'
2709
    >>> phonet('Niall', mode=2)
2710
    'NIAL'
2711
    >>> phonet('Smith', mode=2)
2712
    'ZNIT'
2713
    >>> phonet('Schmidt', mode=2)
2714
    'ZNIT'
2715
2716
    >>> phonet('Christopher', lang='none')
2717
    'CHRISTOPHER'
2718
    >>> phonet('Niall', lang='none')
2719
    'NIAL'
2720
    >>> phonet('Smith', lang='none')
2721
    'SMITH'
2722
    >>> phonet('Schmidt', lang='none')
2723
    'SCHMIDT'
2724
    """
2725
    # pylint: disable=too-many-branches
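    # Reading aid (comment added for clarity): every rule in the tables below
    # is a triple of (pattern, replacement for variant 1, replacement for
    # variant 2); the matcher selects the column via _phonet_rules[pos + mode],
    # and a replacement of ``None`` means the rule is skipped in that variant.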
    _phonet_rules_no_lang = (  # separator chars
2728
        '´', ' ', ' ',
2729
        '"', ' ', ' ',
2730
        '`$', '', '',
2731
        '\'', ' ', ' ',
2732
        ',', ',', ',',
2733
        ';', ',', ',',
2734
        '-', ' ', ' ',
2735
        ' ', ' ', ' ',
2736
        '.', '.', '.',
2737
        ':', '.', '.',
2738
        # German umlauts
2739
        'Ä', 'AE', 'AE',
2740
        'Ö', 'OE', 'OE',
2741
        'Ü', 'UE', 'UE',
2742
        'ß', 'S', 'S',
2743
        # international umlauts
2744
        'À', 'A', 'A',
2745
        'Á', 'A', 'A',
2746
        'Â', 'A', 'A',
2747
        'Ã', 'A', 'A',
2748
        'Å', 'A', 'A',
2749
        'Æ', 'AE', 'AE',
2750
        'Ç', 'C', 'C',
2751
        'Ð', 'DJ', 'DJ',
2752
        'È', 'E', 'E',
2753
        'É', 'E', 'E',
2754
        'Ê', 'E', 'E',
2755
        'Ë', 'E', 'E',
2756
        'Ì', 'I', 'I',
2757
        'Í', 'I', 'I',
2758
        'Î', 'I', 'I',
2759
        'Ï', 'I', 'I',
2760
        'Ñ', 'NH', 'NH',
2761
        'Ò', 'O', 'O',
2762
        'Ó', 'O', 'O',
2763
        'Ô', 'O', 'O',
2764
        'Õ', 'O', 'O',
2765
        'Œ', 'OE', 'OE',
2766
        'Ø', 'OE', 'OE',
2767
        'Š', 'SH', 'SH',
2768
        'Þ', 'TH', 'TH',
2769
        'Ù', 'U', 'U',
2770
        'Ú', 'U', 'U',
2771
        'Û', 'U', 'U',
2772
        'Ý', 'Y', 'Y',
2773
        'Ÿ', 'Y', 'Y',
2774
        # 'normal' letters (A-Z)
2775
        'MC^', 'MAC', 'MAC',
2776
        'MC^', 'MAC', 'MAC',
2777
        'M´^', 'MAC', 'MAC',
2778
        'M\'^', 'MAC', 'MAC',
2779
        'O´^', 'O', 'O',
2780
        'O\'^', 'O', 'O',
2781
        'VAN DEN ^', 'VANDEN', 'VANDEN',
2782
        None, None, None)
2783
2784
    _phonet_rules_german = (  # separator chars
2785
        '´', ' ', ' ',
2786
        '"', ' ', ' ',
2787
        '`$', '', '',
2788
        '\'', ' ', ' ',
2789
        ',', ' ', ' ',
2790
        ';', ' ', ' ',
2791
        '-', ' ', ' ',
2792
        ' ', ' ', ' ',
2793
        '.', '.', '.',
2794
        ':', '.', '.',
2795
        # German umlauts
2796
        'ÄE', 'E', 'E',
2797
        'ÄU<', 'EU', 'EU',
2798
        'ÄV(AEOU)-<', 'EW', None,
2799
        'Ä$', 'Ä', None,
2800
        'Ä<', None, 'E',
2801
        'Ä', 'E', None,
2802
        'ÖE', 'Ö', 'Ö',
2803
        'ÖU', 'Ö', 'Ö',
2804
        'ÖVER--<', 'ÖW', None,
2805
        'ÖV(AOU)-', 'ÖW', None,
2806
        'ÜBEL(GNRW)-^^', 'ÜBL ', 'IBL ',
2807
        'ÜBER^^', 'ÜBA', 'IBA',
2808
        'ÜE', 'Ü', 'I',
2809
        'ÜVER--<', 'ÜW', None,
2810
        'ÜV(AOU)-', 'ÜW', None,
2811
        'Ü', None, 'I',
2812
        'ßCH<', None, 'Z',
2813
        'ß<', 'S', 'Z',
2814
        # international umlauts
2815
        'À<', 'A', 'A',
2816
        'Á<', 'A', 'A',
2817
        'Â<', 'A', 'A',
2818
        'Ã<', 'A', 'A',
2819
        'Å<', 'A', 'A',
2820
        'ÆER-', 'E', 'E',
2821
        'ÆU<', 'EU', 'EU',
2822
        'ÆV(AEOU)-<', 'EW', None,
2823
        'Æ$', 'Ä', None,
2824
        'Æ<', None, 'E',
2825
        'Æ', 'E', None,
2826
        'Ç', 'Z', 'Z',
2827
        'ÐÐ-', '', '',
2828
        'Ð', 'DI', 'TI',
2829
        'È<', 'E', 'E',
2830
        'É<', 'E', 'E',
2831
        'Ê<', 'E', 'E',
2832
        'Ë', 'E', 'E',
2833
        'Ì<', 'I', 'I',
2834
        'Í<', 'I', 'I',
2835
        'Î<', 'I', 'I',
2836
        'Ï', 'I', 'I',
2837
        'ÑÑ-', '', '',
2838
        'Ñ', 'NI', 'NI',
2839
        'Ò<', 'O', 'U',
2840
        'Ó<', 'O', 'U',
2841
        'Ô<', 'O', 'U',
2842
        'Õ<', 'O', 'U',
2843
        'Œ<', 'Ö', 'Ö',
2844
        'Ø(IJY)-<', 'E', 'E',
2845
        'Ø<', 'Ö', 'Ö',
2846
        'Š', 'SH', 'Z',
2847
        'Þ', 'T', 'T',
2848
        'Ù<', 'U', 'U',
2849
        'Ú<', 'U', 'U',
2850
        'Û<', 'U', 'U',
2851
        'Ý<', 'I', 'I',
2852
        'Ÿ<', 'I', 'I',
2853
        # 'normal' letters (A-Z)
2854
        'ABELLE$', 'ABL', 'ABL',
2855
        'ABELL$', 'ABL', 'ABL',
2856
        'ABIENNE$', 'ABIN', 'ABIN',
2857
        'ACHME---^', 'ACH', 'AK',
2858
        'ACEY$', 'AZI', 'AZI',
2859
        'ADV', 'ATW', None,
2860
        'AEGL-', 'EK', None,
2861
        'AEU<', 'EU', 'EU',
2862
        'AE2', 'E', 'E',
2863
        'AFTRAUBEN------', 'AFT ', 'AFT ',
2864
        'AGL-1', 'AK', None,
2865
        'AGNI-^', 'AKN', 'AKN',
2866
        'AGNIE-', 'ANI', 'ANI',
2867
        'AGN(AEOU)-$', 'ANI', 'ANI',
2868
        'AH(AIOÖUÜY)-', 'AH', None,
2869
        'AIA2', 'AIA', 'AIA',
2870
        'AIE$', 'E', 'E',
2871
        'AILL(EOU)-', 'ALI', 'ALI',
2872
        'AINE$', 'EN', 'EN',
2873
        'AIRE$', 'ER', 'ER',
2874
        'AIR-', 'E', 'E',
2875
        'AISE$', 'ES', 'EZ',
2876
        'AISSANCE$', 'ESANS', 'EZANZ',
2877
        'AISSE$', 'ES', 'EZ',
2878
        'AIX$', 'EX', 'EX',
2879
        'AJ(AÄEÈÉÊIOÖUÜ)--', 'A', 'A',
2880
        'AKTIE', 'AXIE', 'AXIE',
2881
        'AKTUEL', 'AKTUEL', None,
2882
        'ALOI^', 'ALOI', 'ALUI',  # Don't merge these rules
2883
        'ALOY^', 'ALOI', 'ALUI',  # needed by 'check_rules'
2884
        'AMATEU(RS)-', 'AMATÖ', 'ANATÖ',
2885
        'ANCH(OEI)-', 'ANSH', 'ANZ',
2886
        'ANDERGEGANG----', 'ANDA GE', 'ANTA KE',
2887
        'ANDERGEHE----', 'ANDA ', 'ANTA ',
2888
        'ANDERGESETZ----', 'ANDA GE', 'ANTA KE',
2889
        'ANDERGING----', 'ANDA ', 'ANTA ',
2890
        'ANDERSETZ(ET)-----', 'ANDA ', 'ANTA ',
2891
        'ANDERZUGEHE----', 'ANDA ZU ', 'ANTA ZU ',
2892
        'ANDERZUSETZE-----', 'ANDA ZU ', 'ANTA ZU ',
2893
        'ANER(BKO)---^^', 'AN', None,
2894
        'ANHAND---^$', 'AN H', 'AN ',
2895
        'ANH(AÄEIOÖUÜY)--^^', 'AN', None,
2896
        'ANIELLE$', 'ANIEL', 'ANIL',
2897
        'ANIEL', 'ANIEL', None,
2898
        'ANSTELLE----^$', 'AN ST', 'AN ZT',
2899
        'ANTI^^', 'ANTI', 'ANTI',
2900
        'ANVER^^', 'ANFA', 'ANFA',
2901
        'ATIA$', 'ATIA', 'ATIA',
2902
        'ATIA(NS)--', 'ATI', 'ATI',
2903
        'ATI(AÄOÖUÜ)-', 'AZI', 'AZI',
2904
        'AUAU--', '', '',
2905
        'AUERE$', 'AUERE', None,
2906
        'AUERE(NS)-$', 'AUERE', None,
2907
        'AUERE(AIOUY)--', 'AUER', None,
2908
        'AUER(AÄIOÖUÜY)-', 'AUER', None,
2909
        'AUER<', 'AUA', 'AUA',
2910
        'AUF^^', 'AUF', 'AUF',
2911
        'AULT$', 'O', 'U',
2912
        'AUR(BCDFGKLMNQSTVWZ)-', 'AUA', 'AUA',
2913
        'AUR$', 'AUA', 'AUA',
2914
        'AUSSE$', 'OS', 'UZ',
2915
        'AUS(ST)-^', 'AUS', 'AUS',
2916
        'AUS^^', 'AUS', 'AUS',
2917
        'AUTOFAHR----', 'AUTO ', 'AUTU ',
2918
        'AUTO^^', 'AUTO', 'AUTU',
2919
        'AUX(IY)-', 'AUX', 'AUX',
2920
        'AUX', 'O', 'U',
2921
        'AU', 'AU', 'AU',
2922
        'AVER--<', 'AW', None,
2923
        'AVIER$', 'AWIE', 'AFIE',
2924
        'AV(EÈÉÊI)-^', 'AW', None,
2925
        'AV(AOU)-', 'AW', None,
2926
        'AYRE$', 'EIRE', 'EIRE',
2927
        'AYRE(NS)-$', 'EIRE', 'EIRE',
2928
        'AYRE(AIOUY)--', 'EIR', 'EIR',
2929
        'AYR(AÄIOÖUÜY)-', 'EIR', 'EIR',
2930
        'AYR<', 'EIA', 'EIA',
2931
        'AYER--<', 'EI', 'EI',
2932
        'AY(AÄEIOÖUÜY)--', 'A', 'A',
2933
        'AË', 'E', 'E',
2934
        'A(IJY)<', 'EI', 'EI',
2935
        'BABY^$', 'BEBI', 'BEBI',
2936
        'BAB(IY)^', 'BEBI', 'BEBI',
2937
        'BEAU^$', 'BO', None,
2938
        'BEA(BCMNRU)-^', 'BEA', 'BEA',
2939
        'BEAT(AEIMORU)-^', 'BEAT', 'BEAT',
2940
        'BEE$', 'BI', 'BI',
2941
        'BEIGE^$', 'BESH', 'BEZ',
2942
        'BENOIT--', 'BENO', 'BENU',
2943
        'BER(DT)-', 'BER', None,
2944
        'BERN(DT)-', 'BERN', None,
2945
        'BE(LMNRST)-^', 'BE', 'BE',
2946
        'BETTE$', 'BET', 'BET',
2947
        'BEVOR^$', 'BEFOR', None,
2948
        'BIC$', 'BIZ', 'BIZ',
2949
        'BOWL(EI)-', 'BOL', 'BUL',
2950
        'BP(AÄEÈÉÊIÌÍÎOÖRUÜY)-', 'B', 'B',
2951
        'BRINGEND-----^', 'BRI', 'BRI',
2952
        'BRINGEND-----', ' BRI', ' BRI',
2953
        'BROW(NS)-', 'BRAU', 'BRAU',
2954
        'BUDGET7', 'BÜGE', 'BIKE',
2955
        'BUFFET7', 'BÜFE', 'BIFE',
2956
        'BYLLE$', 'BILE', 'BILE',
2957
        'BYLL$', 'BIL', 'BIL',
2958
        'BYPA--^', 'BEI', 'BEI',
2959
        'BYTE<', 'BEIT', 'BEIT',
2960
        'BY9^', 'BÜ', None,
2961
        'B(SßZ)$', 'BS', None,
2962
        'CACH(EI)-^', 'KESH', 'KEZ',
2963
        'CAE--', 'Z', 'Z',
2964
        'CA(IY)$', 'ZEI', 'ZEI',
2965
        'CE(EIJUY)--', 'Z', 'Z',
2966
        'CENT<', 'ZENT', 'ZENT',
2967
        'CERST(EI)----^', 'KE', 'KE',
2968
        'CER$', 'ZA', 'ZA',
2969
        'CE3', 'ZE', 'ZE',
2970
        'CH\'S$', 'X', 'X',
2971
        'CH´S$', 'X', 'X',
2972
        'CHAO(ST)-', 'KAO', 'KAU',
2973
        'CHAMPIO-^', 'SHEMPI', 'ZENBI',
2974
        'CHAR(AI)-^', 'KAR', 'KAR',
2975
        'CHAU(CDFSVWXZ)-', 'SHO', 'ZU',
2976
        'CHÄ(CF)-', 'SHE', 'ZE',
2977
        'CHE(CF)-', 'SHE', 'ZE',
2978
        'CHEM-^', 'KE', 'KE',  # or: 'CHE', 'KE'
2979
        'CHEQUE<', 'SHEK', 'ZEK',
2980
        'CHI(CFGPVW)-', 'SHI', 'ZI',
2981
        'CH(AEUY)-<^', 'SH', 'Z',
2982
        'CHK-', '', '',
2983
        'CHO(CKPS)-^', 'SHO', 'ZU',
2984
        'CHRIS-', 'KRI', None,
2985
        'CHRO-', 'KR', None,
2986
        'CH(LOR)-<^', 'K', 'K',
2987
        'CHST-', 'X', 'X',
2988
        'CH(SßXZ)3', 'X', 'X',
2989
        'CHTNI-3', 'CHN', 'KN',
2990
        'CH^', 'K', 'K',  # or: 'CH', 'K'
2991
        'CH', 'CH', 'K',
2992
        'CIC$', 'ZIZ', 'ZIZ',
2993
        'CIENCEFICT----', 'EIENS ', 'EIENZ ',
2994
        'CIENCE$', 'EIENS', 'EIENZ',
2995
        'CIER$', 'ZIE', 'ZIE',
2996
        'CYB-^', 'ZEI', 'ZEI',
2997
        'CY9^', 'ZÜ', 'ZI',
2998
        'C(IJY)-<3', 'Z', 'Z',
2999
        'CLOWN-', 'KLAU', 'KLAU',
3000
        'CCH', 'Z', 'Z',
3001
        'CCE-', 'X', 'X',
3002
        'C(CK)-', '', '',
3003
        'CLAUDET---', 'KLO', 'KLU',
3004
        'CLAUDINE^$', 'KLODIN', 'KLUTIN',
3005
        'COACH', 'KOSH', 'KUZ',
3006
        'COLE$', 'KOL', 'KUL',
3007
        'COUCH', 'KAUSH', 'KAUZ',
3008
        'COW', 'KAU', 'KAU',
3009
        'CQUES$', 'K', 'K',
3010
        'CQUE', 'K', 'K',
3011
        'CRASH--9', 'KRE', 'KRE',
3012
        'CREAT-^', 'KREA', 'KREA',
3013
        'CST', 'XT', 'XT',
3014
        'CS<^', 'Z', 'Z',
3015
        'C(SßX)', 'X', 'X',
3016
        'CT\'S$', 'X', 'X',
3017
        'CT(SßXZ)', 'X', 'X',
3018
        'CZ<', 'Z', 'Z',
3019
        'C(ÈÉÊÌÍÎÝ)3', 'Z', 'Z',
3020
        'C.^', 'C.', 'C.',
3021
        'CÄ-', 'Z', 'Z',
3022
        'CÜ$', 'ZÜ', 'ZI',
3023
        'C\'S$', 'X', 'X',
3024
        'C<', 'K', 'K',
3025
        'DAHER^$', 'DAHER', None,
3026
        'DARAUFFOLGE-----', 'DARAUF ', 'TARAUF ',
3027
        'DAVO(NR)-^$', 'DAFO', 'TAFU',
3028
        'DD(SZ)--<', '', '',
3029
        'DD9', 'D', None,
3030
        'DEPOT7', 'DEPO', 'TEBU',
3031
        'DESIGN', 'DISEIN', 'TIZEIN',
3032
        'DE(LMNRST)-3^', 'DE', 'TE',
3033
        'DETTE$', 'DET', 'TET',
3034
        'DH$', 'T', None,
3035
        'DIC$', 'DIZ', 'TIZ',
3036
        'DIDR-^', 'DIT', None,
3037
        'DIEDR-^', 'DIT', None,
3038
        'DJ(AEIOU)-^', 'I', 'I',
3039
        'DMITR-^', 'DIMIT', 'TINIT',
3040
        'DRY9^', 'DRÜ', None,
3041
        'DT-', '', '',
3042
        'DUIS-^', 'DÜ', 'TI',
3043
        'DURCH^^', 'DURCH', 'TURK',
3044
        'DVA$', 'TWA', None,
3045
        'DY9^', 'DÜ', None,
3046
        'DYS$', 'DIS', None,
3047
        'DS(CH)--<', 'T', 'T',
3048
        'DST', 'ZT', 'ZT',
3049
        'DZS(CH)--', 'T', 'T',
3050
        'D(SßZ)', 'Z', 'Z',
3051
        'D(AÄEIOÖRUÜY)-', 'D', None,
3052
        'D(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'D', None,
3053
        'D\'H^', 'D', 'T',
3054
        'D´H^', 'D', 'T',
3055
        'D`H^', 'D', 'T',
3056
        'D\'S3$', 'Z', 'Z',
3057
        'D´S3$', 'Z', 'Z',
3058
        'D^', 'D', None,
3059
        'D', 'T', 'T',
3060
        'EAULT$', 'O', 'U',
3061
        'EAUX$', 'O', 'U',
3062
        'EAU', 'O', 'U',
3063
        'EAV', 'IW', 'IF',
3064
        'EAS3$', 'EAS', None,
3065
        'EA(AÄEIOÖÜY)-3', 'EA', 'EA',
3066
        'EA3$', 'EA', 'EA',
3067
        'EA3', 'I', 'I',
3068
        'EBENSO^$', 'EBNSO', 'EBNZU',
3069
        'EBENSO^^', 'EBNSO ', 'EBNZU ',
3070
        'EBEN^^', 'EBN', 'EBN',
3071
        'EE9', 'E', 'E',
3072
        'EGL-1', 'EK', None,
3073
        'EHE(IUY)--1', 'EH', None,
3074
        'EHUNG---1', 'E', None,
3075
        'EH(AÄIOÖUÜY)-1', 'EH', None,
3076
        'EIEI--', '', '',
3077
        'EIERE^$', 'EIERE', None,
3078
        'EIERE$', 'EIERE', None,
3079
        'EIERE(NS)-$', 'EIERE', None,
3080
        'EIERE(AIOUY)--', 'EIER', None,
3081
        'EIER(AÄIOÖUÜY)-', 'EIER', None,
3082
        'EIER<', 'EIA', None,
3083
        'EIGL-1', 'EIK', None,
3084
        'EIGH$', 'EI', 'EI',
3085
        'EIH--', 'E', 'E',
3086
        'EILLE$', 'EI', 'EI',
3087
        'EIR(BCDFGKLMNQSTVWZ)-', 'EIA', 'EIA',
3088
        'EIR$', 'EIA', 'EIA',
3089
        'EITRAUBEN------', 'EIT ', 'EIT ',
3090
        'EI', 'EI', 'EI',
3091
        'EJ$', 'EI', 'EI',
3092
        'ELIZ^', 'ELIS', None,
3093
        'ELZ^', 'ELS', None,
3094
        'EL-^', 'E', 'E',
3095
        'ELANG----1', 'E', 'E',
3096
        'EL(DKL)--1', 'E', 'E',
3097
        'EL(MNT)--1$', 'E', 'E',
3098
        'ELYNE$', 'ELINE', 'ELINE',
3099
        'ELYN$', 'ELIN', 'ELIN',
3100
        'EL(AÄEÈÉÊIÌÍÎOÖUÜY)-1', 'EL', 'EL',
3101
        'EL-1', 'L', 'L',
3102
        'EM-^', None, 'E',
3103
        'EM(DFKMPQT)--1', None, 'E',
3104
        'EM(AÄEÈÉÊIÌÍÎOÖUÜY)--1', None, 'E',
3105
        'EM-1', None, 'N',
3106
        'ENGAG-^', 'ANGA', 'ANKA',
3107
        'EN-^', 'E', 'E',
3108
        'ENTUEL', 'ENTUEL', None,
3109
        'EN(CDGKQSTZ)--1', 'E', 'E',
3110
        'EN(AÄEÈÉÊIÌÍÎNOÖUÜY)-1', 'EN', 'EN',
3111
        'EN-1', '', '',
3112
        'ERH(AÄEIOÖUÜ)-^', 'ERH', 'ER',
3113
        'ER-^', 'E', 'E',
3114
        'ERREGEND-----', ' ER', ' ER',
3115
        'ERT1$', 'AT', None,
3116
        'ER(DGLKMNRQTZß)-1', 'ER', None,
3117
        'ER(AÄEÈÉÊIÌÍÎOÖUÜY)-1', 'ER', 'A',
3118
        'ER1$', 'A', 'A',
3119
        'ER<1', 'A', 'A',
3120
        'ETAT7', 'ETA', 'ETA',
3121
        'ETI(AÄOÖÜU)-', 'EZI', 'EZI',
3122
        'EUERE$', 'EUERE', None,
3123
        'EUERE(NS)-$', 'EUERE', None,
3124
        'EUERE(AIOUY)--', 'EUER', None,
3125
        'EUER(AÄIOÖUÜY)-', 'EUER', None,
3126
        'EUER<', 'EUA', None,
3127
        'EUEU--', '', '',
3128
        'EUILLE$', 'Ö', 'Ö',
3129
        'EUR$', 'ÖR', 'ÖR',
3130
        'EUX', 'Ö', 'Ö',
3131
        'EUSZ$', 'EUS', None,
3132
        'EUTZ$', 'EUS', None,
3133
        'EUYS$', 'EUS', 'EUZ',
3134
        'EUZ$', 'EUS', None,
3135
        'EU', 'EU', 'EU',
3136
        'EVER--<1', 'EW', None,
3137
        'EV(ÄOÖUÜ)-1', 'EW', None,
3138
        'EYER<', 'EIA', 'EIA',
3139
        'EY<', 'EI', 'EI',
3140
        'FACETTE', 'FASET', 'FAZET',
3141
        'FANS--^$', 'FE', 'FE',
3142
        'FAN-^$', 'FE', 'FE',
3143
        'FAULT-', 'FOL', 'FUL',
3144
        'FEE(DL)-', 'FI', 'FI',
3145
        'FEHLER', 'FELA', 'FELA',
3146
        'FE(LMNRST)-3^', 'FE', 'FE',
3147
        'FOERDERN---^', 'FÖRD', 'FÖRT',
3148
        'FOERDERN---', ' FÖRD', ' FÖRT',
3149
        'FOND7', 'FON', 'FUN',
3150
        'FRAIN$', 'FRA', 'FRA',
3151
        'FRISEU(RS)-', 'FRISÖ', 'FRIZÖ',
3152
        'FY9^', 'FÜ', None,
3153
        'FÖRDERN---^', 'FÖRD', 'FÖRT',
3154
        'FÖRDERN---', ' FÖRD', ' FÖRT',
3155
        'GAGS^$', 'GEX', 'KEX',
3156
        'GAG^$', 'GEK', 'KEK',
3157
        'GD', 'KT', 'KT',
3158
        'GEGEN^^', 'GEGN', 'KEKN',
3159
        'GEGENGEKOM-----', 'GEGN ', 'KEKN ',
3160
        'GEGENGESET-----', 'GEGN ', 'KEKN ',
3161
        'GEGENKOMME-----', 'GEGN ', 'KEKN ',
3162
        'GEGENZUKOM---', 'GEGN ZU ', 'KEKN ZU ',
3163
        'GENDETWAS-----$', 'GENT ', 'KENT ',
3164
        'GENRE', 'IORE', 'IURE',
3165
        'GE(LMNRST)-3^', 'GE', 'KE',
3166
        'GER(DKT)-', 'GER', None,
3167
        'GETTE$', 'GET', 'KET',
3168
        'GGF.', 'GF.', None,
3169
        'GG-', '', '',
3170
        'GH', 'G', None,
3171
        'GI(AOU)-^', 'I', 'I',
3172
        'GION-3', 'KIO', 'KIU',
3173
        'G(CK)-', '', '',
3174
        'GJ(AEIOU)-^', 'I', 'I',
3175
        'GMBH^$', 'GMBH', 'GMBH',
3176
        'GNAC$', 'NIAK', 'NIAK',
3177
        'GNON$', 'NION', 'NIUN',
3178
        'GN$', 'N', 'N',
3179
        'GONCAL-^', 'GONZA', 'KUNZA',
3180
        'GRY9^', 'GRÜ', None,
3181
        'G(SßXZ)-<', 'K', 'K',
3182
        'GUCK-', 'KU', 'KU',
3183
        'GUISEP-^', 'IUSE', 'IUZE',
3184
        'GUI-^', 'G', 'K',
3185
        'GUTAUSSEH------^', 'GUT ', 'KUT ',
3186
        'GUTGEHEND------^', 'GUT ', 'KUT ',
3187
        'GY9^', 'GÜ', None,
3188
        'G(AÄEILOÖRUÜY)-', 'G', None,
3189
        'G(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'G', None,
3190
        'G\'S$', 'X', 'X',
3191
        'G´S$', 'X', 'X',
3192
        'G^', 'G', None,
3193
        'G', 'K', 'K',
3194
        'HA(HIUY)--1', 'H', None,
3195
        'HANDVOL---^', 'HANT ', 'ANT ',
3196
        'HANNOVE-^', 'HANOF', None,
3197
        'HAVEN7$', 'HAFN', None,
3198
        'HEAD-', 'HE', 'E',
3199
        'HELIEGEN------', 'E ', 'E ',
3200
        'HESTEHEN------', 'E ', 'E ',
3201
        'HE(LMNRST)-3^', 'HE', 'E',
3202
        'HE(LMN)-1', 'E', 'E',
3203
        'HEUR1$', 'ÖR', 'ÖR',
3204
        'HE(HIUY)--1', 'H', None,
3205
        'HIH(AÄEIOÖUÜY)-1', 'IH', None,
3206
        'HLH(AÄEIOÖUÜY)-1', 'LH', None,
3207
        'HMH(AÄEIOÖUÜY)-1', 'MH', None,
3208
        'HNH(AÄEIOÖUÜY)-1', 'NH', None,
3209
        'HOBBY9^', 'HOBI', None,
3210
        'HOCHBEGAB-----^', 'HOCH ', 'UK ',
3211
        'HOCHTALEN-----^', 'HOCH ', 'UK ',
3212
        'HOCHZUFRI-----^', 'HOCH ', 'UK ',
3213
        'HO(HIY)--1', 'H', None,
3214
        'HRH(AÄEIOÖUÜY)-1', 'RH', None,
3215
        'HUH(AÄEIOÖUÜY)-1', 'UH', None,
3216
        'HUIS^^', 'HÜS', 'IZ',
3217
        'HUIS$', 'ÜS', 'IZ',
3218
        'HUI--1', 'H', None,
3219
        'HYGIEN^', 'HÜKIEN', None,
3220
        'HY9^', 'HÜ', None,
3221
        'HY(BDGMNPST)-', 'Ü', None,
3222
        'H.^', None, 'H.',
3223
        'HÄU--1', 'H', None,
3224
        'H^', 'H', '',
3225
        'H', '', '',
3226
        'ICHELL---', 'ISH', 'IZ',
3227
        'ICHI$', 'ISHI', 'IZI',
3228
        'IEC$', 'IZ', 'IZ',
3229
        'IEDENSTELLE------', 'IDN ', 'ITN ',
3230
        'IEI-3', '', '',
3231
        'IELL3', 'IEL', 'IEL',
3232
        'IENNE$', 'IN', 'IN',
3233
        'IERRE$', 'IER', 'IER',
3234
        'IERZULAN---', 'IR ZU ', 'IR ZU ',
3235
        'IETTE$', 'IT', 'IT',
3236
        'IEU', 'IÖ', 'IÖ',
3237
        'IE<4', 'I', 'I',
3238
        'IGL-1', 'IK', None,
3239
        'IGHT3$', 'EIT', 'EIT',
3240
        'IGNI(EO)-', 'INI', 'INI',
3241
        'IGN(AEOU)-$', 'INI', 'INI',
3242
        'IHER(DGLKRT)--1', 'IHE', None,
3243
        'IHE(IUY)--', 'IH', None,
3244
        'IH(AIOÖUÜY)-', 'IH', None,
3245
        'IJ(AOU)-', 'I', 'I',
3246
        'IJ$', 'I', 'I',
3247
        'IJ<', 'EI', 'EI',
3248
        'IKOLE$', 'IKOL', 'IKUL',
3249
        'ILLAN(STZ)--4', 'ILIA', 'ILIA',
3250
        'ILLAR(DT)--4', 'ILIA', 'ILIA',
3251
        'IMSTAN----^', 'IM ', 'IN ',
3252
        'INDELERREGE------', 'INDL ', 'INTL ',
3253
        'INFRAGE-----^$', 'IN ', 'IN ',
3254
        'INTERN(AOU)-^', 'INTAN', 'INTAN',
3255
        'INVER-', 'INWE', 'INFE',
3256
        'ITI(AÄIOÖUÜ)-', 'IZI', 'IZI',
3257
        'IUSZ$', 'IUS', None,
3258
        'IUTZ$', 'IUS', None,
3259
        'IUZ$', 'IUS', None,
3260
        'IVER--<', 'IW', None,
3261
        'IVIER$', 'IWIE', 'IFIE',
3262
        'IV(ÄOÖUÜ)-', 'IW', None,
3263
        'IV<3', 'IW', None,
3264
        'IY2', 'I', None,
3265
        'I(ÈÉÊ)<4', 'I', 'I',
3266
        'JAVIE---<^', 'ZA', 'ZA',
3267
        'JEANS^$', 'JINS', 'INZ',
3268
        'JEANNE^$', 'IAN', 'IAN',
3269
        'JEAN-^', 'IA', 'IA',
3270
        'JER-^', 'IE', 'IE',
3271
        'JE(LMNST)-', 'IE', 'IE',
3272
        'JI^', 'JI', None,
3273
        'JOR(GK)^$', 'IÖRK', 'IÖRK',
3274
        'J', 'I', 'I',
3275
        'KC(ÄEIJ)-', 'X', 'X',
3276
        'KD', 'KT', None,
3277
        'KE(LMNRST)-3^', 'KE', 'KE',
3278
        'KG(AÄEILOÖRUÜY)-', 'K', None,
3279
        'KH<^', 'K', 'K',
3280
        'KIC$', 'KIZ', 'KIZ',
3281
        'KLE(LMNRST)-3^', 'KLE', 'KLE',
3282
        'KOTELE-^', 'KOTL', 'KUTL',
3283
        'KREAT-^', 'KREA', 'KREA',
3284
        'KRÜS(TZ)--^', 'KRI', None,
3285
        'KRYS(TZ)--^', 'KRI', None,
3286
        'KRY9^', 'KRÜ', None,
3287
        'KSCH---', 'K', 'K',
3288
        'KSH--', 'K', 'K',
3289
        'K(SßXZ)7', 'X', 'X',  # implies 'KST' -> 'XT'
3290
        'KT\'S$', 'X', 'X',
3291
        'KTI(AIOU)-3', 'XI', 'XI',
3292
        'KT(SßXZ)', 'X', 'X',
3293
        'KY9^', 'KÜ', None,
3294
        'K\'S$', 'X', 'X',
3295
        'K´S$', 'X', 'X',
3296
        'LANGES$', ' LANGES', ' LANKEZ',
3297
        'LANGE$', ' LANGE', ' LANKE',
3298
        'LANG$', ' LANK', ' LANK',
3299
        'LARVE-', 'LARF', 'LARF',
3300
        'LD(SßZ)$', 'LS', 'LZ',
3301
        'LD\'S$', 'LS', 'LZ',
3302
        'LD´S$', 'LS', 'LZ',
3303
        'LEAND-^', 'LEAN', 'LEAN',
3304
        'LEERSTEHE-----^', 'LER ', 'LER ',
3305
        'LEICHBLEIB-----', 'LEICH ', 'LEIK ',
3306
        'LEICHLAUTE-----', 'LEICH ', 'LEIK ',
3307
        'LEIDERREGE------', 'LEIT ', 'LEIT ',
3308
        'LEIDGEPR----^', 'LEIT ', 'LEIT ',
3309
        'LEINSTEHE-----', 'LEIN ', 'LEIN ',
3310
        'LEL-', 'LE', 'LE',
3311
        'LE(MNRST)-3^', 'LE', 'LE',
3312
        'LETTE$', 'LET', 'LET',
3313
        'LFGNAG-', 'LFGAN', 'LFKAN',
3314
        'LICHERWEIS----', 'LICHA ', 'LIKA ',
3315
        'LIC$', 'LIZ', 'LIZ',
3316
        'LIVE^$', 'LEIF', 'LEIF',
3317
        'LT(SßZ)$', 'LS', 'LZ',
3318
        'LT\'S$', 'LS', 'LZ',
3319
        'LT´S$', 'LS', 'LZ',
3320
        'LUI(GS)--', 'LU', 'LU',
3321
        'LV(AIO)-', 'LW', None,
3322
        'LY9^', 'LÜ', None,
3323
        'LSTS$', 'LS', 'LZ',
3324
        'LZ(BDFGKLMNPQRSTVWX)-', 'LS', None,
3325
        'L(SßZ)$', 'LS', None,
3326
        'MAIR-<', 'MEI', 'NEI',
3327
        'MANAG-', 'MENE', 'NENE',
3328
        'MANUEL', 'MANUEL', None,
3329
        'MASSEU(RS)-', 'MASÖ', 'NAZÖ',
3330
        'MATCH', 'MESH', 'NEZ',
3331
        'MAURICE', 'MORIS', 'NURIZ',
3332
        'MBH^$', 'MBH', 'MBH',
3333
        'MB(ßZ)$', 'MS', None,
3334
        'MB(SßTZ)-', 'M', 'N',
3335
        'MCG9^', 'MAK', 'NAK',
3336
        'MC9^', 'MAK', 'NAK',
3337
        'MEMOIR-^', 'MEMOA', 'NENUA',
3338
        'MERHAVEN$', 'MAHAFN', None,
3339
        'ME(LMNRST)-3^', 'ME', 'NE',
3340
        'MEN(STZ)--3', 'ME', None,
3341
        'MEN$', 'MEN', None,
3342
        'MIGUEL-', 'MIGE', 'NIKE',
3343
        'MIKE^$', 'MEIK', 'NEIK',
3344
        'MITHILFE----^$', 'MIT H', 'NIT ',
3345
        'MN$', 'M', None,
3346
        'MN', 'N', 'N',
3347
        'MPJUTE-', 'MPUT', 'NBUT',
3348
        'MP(ßZ)$', 'MS', None,
3349
        'MP(SßTZ)-', 'M', 'N',
3350
        'MP(BDJLMNPQVW)-', 'MB', 'NB',
3351
        'MY9^', 'MÜ', None,
3352
        'M(ßZ)$', 'MS', None,
3353
        'M´G7^', 'MAK', 'NAK',
3354
        'M\'G7^', 'MAK', 'NAK',
3355
        'M´^', 'MAK', 'NAK',
3356
        'M\'^', 'MAK', 'NAK',
3357
        'M', None, 'N',
3358
        'NACH^^', 'NACH', 'NAK',
3359
        'NADINE', 'NADIN', 'NATIN',
3360
        'NAIV--', 'NA', 'NA',
3361
        'NAISE$', 'NESE', 'NEZE',
3362
        'NAUGENOMM------', 'NAU ', 'NAU ',
3363
        'NAUSOGUT$', 'NAUSO GUT', 'NAUZU KUT',
3364
        'NCH$', 'NSH', 'NZ',
3365
        'NCOISE$', 'SOA', 'ZUA',
3366
        'NCOIS$', 'SOA', 'ZUA',
3367
        'NDAR$', 'NDA', 'NTA',
3368
        'NDERINGEN------', 'NDE ', 'NTE ',
3369
        'NDRO(CDKTZ)-', 'NTRO', None,
3370
        'ND(BFGJLMNPQVW)-', 'NT', None,
3371
        'ND(SßZ)$', 'NS', 'NZ',
3372
        'ND\'S$', 'NS', 'NZ',
3373
        'ND´S$', 'NS', 'NZ',
3374
        'NEBEN^^', 'NEBN', 'NEBN',
3375
        'NENGELERN------', 'NEN ', 'NEN ',
3376
        'NENLERN(ET)---', 'NEN LE', 'NEN LE',
3377
        'NENZULERNE---', 'NEN ZU LE', 'NEN ZU LE',
3378
        'NE(LMNRST)-3^', 'NE', 'NE',
3379
        'NEN-3', 'NE', 'NE',
3380
        'NETTE$', 'NET', 'NET',
3381
        'NGU^^', 'NU', 'NU',
3382
        'NG(BDFJLMNPQRTVW)-', 'NK', 'NK',
3383
        'NH(AUO)-$', 'NI', 'NI',
3384
        'NICHTSAHNEN-----', 'NIX ', 'NIX ',
3385
        'NICHTSSAGE----', 'NIX ', 'NIX ',
3386
        'NICHTS^^', 'NIX', 'NIX',
3387
        'NICHT^^', 'NICHT', 'NIKT',
3388
        'NINE$', 'NIN', 'NIN',
3389
        'NON^^', 'NON', 'NUN',
3390
        'NOTLEIDE-----^', 'NOT ', 'NUT ',
3391
        'NOT^^', 'NOT', 'NUT',
3392
        'NTI(AIOU)-3', 'NZI', 'NZI',
3393
        'NTIEL--3', 'NZI', 'NZI',
3394
        'NT(SßZ)$', 'NS', 'NZ',
3395
        'NT\'S$', 'NS', 'NZ',
3396
        'NT´S$', 'NS', 'NZ',
3397
        'NYLON', 'NEILON', 'NEILUN',
3398
        'NY9^', 'NÜ', None,
3399
        'NSTZUNEH---', 'NST ZU ', 'NZT ZU ',
3400
        'NSZ-', 'NS', None,
3401
        'NSTS$', 'NS', 'NZ',
3402
        'NZ(BDFGKLMNPQRSTVWX)-', 'NS', None,
3403
        'N(SßZ)$', 'NS', None,
3404
        'OBERE-', 'OBER', None,
3405
        'OBER^^', 'OBA', 'UBA',
3406
        'OEU2', 'Ö', 'Ö',
3407
        'OE<2', 'Ö', 'Ö',
3408
        'OGL-', 'OK', None,
3409
        'OGNIE-', 'ONI', 'UNI',
3410
        'OGN(AEOU)-$', 'ONI', 'UNI',
3411
        'OH(AIOÖUÜY)-', 'OH', None,
3412
        'OIE$', 'Ö', 'Ö',
3413
        'OIRE$', 'OA', 'UA',
3414
        'OIR$', 'OA', 'UA',
3415
        'OIX', 'OA', 'UA',
3416
        'OI<3', 'EU', 'EU',
3417
        'OKAY^$', 'OKE', 'UKE',
3418
        'OLYN$', 'OLIN', 'ULIN',
3419
        'OO(DLMZ)-', 'U', None,
3420
        'OO$', 'U', None,
3421
        'OO-', '', '',
3422
        'ORGINAL-----', 'ORI', 'URI',
3423
        'OTI(AÄOÖUÜ)-', 'OZI', 'UZI',
3424
        'OUI^', 'WI', 'FI',
3425
        'OUILLE$', 'ULIE', 'ULIE',
3426
        'OU(DT)-^', 'AU', 'AU',
3427
        'OUSE$', 'AUS', 'AUZ',
3428
        'OUT-', 'AU', 'AU',
3429
        'OU', 'U', 'U',
3430
        'O(FV)$', 'AU', 'AU',  # due to 'OW$' -> 'AU'
3431
        'OVER--<', 'OW', None,
3432
        'OV(AOU)-', 'OW', None,
3433
        'OW$', 'AU', 'AU',
3434
        'OWS$', 'OS', 'UZ',
3435
        'OJ(AÄEIOÖUÜ)--', 'O', 'U',
3436
        'OYER', 'OIA', None,
3437
        'OY(AÄEIOÖUÜ)--', 'O', 'U',
3438
        'O(JY)<', 'EU', 'EU',
3439
        'OZ$', 'OS', None,
3440
        'O´^', 'O', 'U',
3441
        'O\'^', 'O', 'U',
3442
        'O', None, 'U',
3443
        'PATIEN--^', 'PAZI', 'PAZI',
3444
        'PENSIO-^', 'PANSI', 'PANZI',
3445
        'PE(LMNRST)-3^', 'PE', 'PE',
3446
        'PFER-^', 'FE', 'FE',
3447
        'P(FH)<', 'F', 'F',
3448
        'PIC^$', 'PIK', 'PIK',
3449
        'PIC$', 'PIZ', 'PIZ',
3450
        'PIPELINE', 'PEIBLEIN', 'PEIBLEIN',
3451
        'POLYP-', 'POLÜ', None,
3452
        'POLY^^', 'POLI', 'PULI',
3453
        'PORTRAIT7', 'PORTRE', 'PURTRE',
3454
        'POWER7', 'PAUA', 'PAUA',
3455
        'PP(FH)--<', 'B', 'B',
3456
        'PP-', '', '',
3457
        'PRODUZ-^', 'PRODU', 'BRUTU',
3458
        'PRODUZI--', ' PRODU', ' BRUTU',
3459
        'PRIX^$', 'PRI', 'PRI',
3460
        'PS-^^', 'P', None,
3461
        'P(SßZ)^', None, 'Z',
3462
        'P(SßZ)$', 'BS', None,
3463
        'PT-^', '', '',
3464
        'PTI(AÄOÖUÜ)-3', 'BZI', 'BZI',
3465
        'PY9^', 'PÜ', None,
3466
        'P(AÄEIOÖRUÜY)-', 'P', 'P',
3467
        'P(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'P', None,
3468
        'P.^', None, 'P.',
3469
        'P^', 'P', None,
3470
        'P', 'B', 'B',
3471
        'QI-', 'Z', 'Z',
3472
        'QUARANT--', 'KARA', 'KARA',
3473
        'QUE(LMNRST)-3', 'KWE', 'KFE',
3474
        'QUE$', 'K', 'K',
3475
        'QUI(NS)$', 'KI', 'KI',
3476
        'QUIZ7', 'KWIS', None,
3477
        'Q(UV)7', 'KW', 'KF',
3478
        'Q<', 'K', 'K',
3479
        'RADFAHR----', 'RAT ', 'RAT ',
3480
        'RAEFTEZEHRE-----', 'REFTE ', 'REFTE ',
3481
        'RCH', 'RCH', 'RK',
3482
        'REA(DU)---3^', 'R', None,
3483
        'REBSERZEUG------', 'REBS ', 'REBZ ',
3484
        'RECHERCH^', 'RESHASH', 'REZAZ',
3485
        'RECYCL--', 'RIZEI', 'RIZEI',
3486
        'RE(ALST)-3^', 'RE', None,
3487
        'REE$', 'RI', 'RI',
3488
        'RER$', 'RA', 'RA',
3489
        'RE(MNR)-4', 'RE', 'RE',
3490
        'RETTE$', 'RET', 'RET',
3491
        'REUZ$', 'REUZ', None,
3492
        'REW$', 'RU', 'RU',
3493
        'RH<^', 'R', 'R',
3494
        'RJA(MN)--', 'RI', 'RI',
3495
        'ROWD-^', 'RAU', 'RAU',
3496
        'RTEMONNAIE-', 'RTMON', 'RTNUN',
3497
        'RTI(AÄOÖUÜ)-3', 'RZI', 'RZI',
3498
        'RTIEL--3', 'RZI', 'RZI',
3499
        'RV(AEOU)-3', 'RW', None,
3500
        'RY(KN)-$', 'RI', 'RI',
3501
        'RY9^', 'RÜ', None,
3502
        'RÄFTEZEHRE-----', 'REFTE ', 'REFTE ',
3503
        'SAISO-^', 'SES', 'ZEZ',
3504
        'SAFE^$', 'SEIF', 'ZEIF',
3505
        'SAUCE-^', 'SOS', 'ZUZ',
3506
        'SCHLAGGEBEN-----<', 'SHLAK ', 'ZLAK ',
3507
        'SCHSCH---7', '', '',
3508
        'SCHTSCH', 'SH', 'Z',
3509
        'SC(HZ)<', 'SH', 'Z',
3510
        'SC', 'SK', 'ZK',
3511
        'SELBSTST--7^^', 'SELB', 'ZELB',
3512
        'SELBST7^^', 'SELBST', 'ZELBZT',
3513
        'SERVICE7^', 'SÖRWIS', 'ZÖRFIZ',
3514
        'SERVI-^', 'SERW', None,
3515
        'SE(LMNRST)-3^', 'SE', 'ZE',
3516
        'SETTE$', 'SET', 'ZET',
3517
        'SHP-^', 'S', 'Z',
3518
        'SHST', 'SHT', 'ZT',
3519
        'SHTSH', 'SH', 'Z',
3520
        'SHT', 'ST', 'Z',
3521
        'SHY9^', 'SHÜ', None,
3522
        'SH^^', 'SH', None,
3523
        'SH3', 'SH', 'Z',
3524
        'SICHERGEGAN-----^', 'SICHA ', 'ZIKA ',
3525
        'SICHERGEHE----^', 'SICHA ', 'ZIKA ',
3526
        'SICHERGESTEL------^', 'SICHA ', 'ZIKA ',
3527
        'SICHERSTELL-----^', 'SICHA ', 'ZIKA ',
3528
        'SICHERZU(GS)--^', 'SICHA ZU ', 'ZIKA ZU ',
3529
        'SIEGLI-^', 'SIKL', 'ZIKL',
3530
        'SIGLI-^', 'SIKL', 'ZIKL',
3531
        'SIGHT', 'SEIT', 'ZEIT',
3532
        'SIGN', 'SEIN', 'ZEIN',
3533
        'SKI(NPZ)-', 'SKI', 'ZKI',
3534
        'SKI<^', 'SHI', 'ZI',
3535
        'SODASS^$', 'SO DAS', 'ZU TAZ',
3536
        'SODAß^$', 'SO DAS', 'ZU TAZ',
3537
        'SOGENAN--^', 'SO GEN', 'ZU KEN',
3538
        'SOUND-', 'SAUN', 'ZAUN',
3539
        'STAATS^^', 'STAZ', 'ZTAZ',
3540
        'STADT^^', 'STAT', 'ZTAT',
3541
        'STANDE$', ' STANDE', ' ZTANTE',
3542
        'START^^', 'START', 'ZTART',
3543
        'STAURANT7', 'STORAN', 'ZTURAN',
3544
        'STEAK-', 'STE', 'ZTE',
3545
        'STEPHEN-^$', 'STEW', None,
3546
        'STERN', 'STERN', None,
3547
        'STRAF^^', 'STRAF', 'ZTRAF',
3548
        'ST\'S$', 'Z', 'Z',
3549
        'ST´S$', 'Z', 'Z',
3550
        'STST--', '', '',
3551
        'STS(ACEÈÉÊHIÌÍÎOUÄÜÖ)--', 'ST', 'ZT',
3552
        'ST(SZ)', 'Z', 'Z',
3553
        'SPAREN---^', 'SPA', 'ZPA',
3554
        'SPAREND----', ' SPA', ' ZPA',
3555
        'S(PTW)-^^', 'S', None,
3556
        'SP', 'SP', None,
3557
        'STYN(AE)-$', 'STIN', 'ZTIN',
3558
        'ST', 'ST', 'ZT',
3559
        'SUITE<', 'SIUT', 'ZIUT',
3560
        'SUKE--$', 'S', 'Z',
3561
        'SURF(EI)-', 'SÖRF', 'ZÖRF',
3562
        'SV(AEÈÉÊIÌÍÎOU)-<^', 'SW', None,
3563
        'SYB(IY)--^', 'SIB', None,
3564
        'SYL(KVW)--^', 'SI', None,
3565
        'SY9^', 'SÜ', None,
3566
        'SZE(NPT)-^', 'ZE', 'ZE',
3567
        'SZI(ELN)-^', 'ZI', 'ZI',
3568
        'SZCZ<', 'SH', 'Z',
3569
        'SZT<', 'ST', 'ZT',
3570
        'SZ<3', 'SH', 'Z',
3571
        'SÜL(KVW)--^', 'SI', None,
3572
        'S', None, 'Z',
3573
        'TCH', 'SH', 'Z',
3574
        'TD(AÄEIOÖRUÜY)-', 'T', None,
3575
        'TD(ÀÁÂÃÅÈÉÊËÌÍÎÏÒÓÔÕØÙÚÛÝŸ)-', 'T', None,
3576
        'TEAT-^', 'TEA', 'TEA',
3577
        'TERRAI7^', 'TERA', 'TERA',
3578
        'TE(LMNRST)-3^', 'TE', 'TE',
3579
        'TH<', 'T', 'T',
3580
        'TICHT-', 'TIK', 'TIK',
3581
        'TICH$', 'TIK', 'TIK',
3582
        'TIC$', 'TIZ', 'TIZ',
3583
        'TIGGESTELL-------', 'TIK ', 'TIK ',
3584
        'TIGSTELL-----', 'TIK ', 'TIK ',
3585
        'TOAS-^', 'TO', 'TU',
3586
        'TOILET-', 'TOLE', 'TULE',
3587
        'TOIN-', 'TOA', 'TUA',
3588
        'TRAECHTI-^', 'TRECHT', 'TREKT',
3589
        'TRAECHTIG--', ' TRECHT', ' TREKT',
3590
        'TRAINI-', 'TREN', 'TREN',
3591
        'TRÄCHTI-^', 'TRECHT', 'TREKT',
3592
        'TRÄCHTIG--', ' TRECHT', ' TREKT',
3593
        'TSCH', 'SH', 'Z',
3594
        'TSH', 'SH', 'Z',
3595
        'TST', 'ZT', 'ZT',
3596
        'T(Sß)', 'Z', 'Z',
3597
        'TT(SZ)--<', '', '',
3598
        'TT9', 'T', 'T',
3599
        'TV^$', 'TV', 'TV',
3600
        'TX(AEIOU)-3', 'SH', 'Z',
3601
        'TY9^', 'TÜ', None,
3602
        'TZ-', '', '',
3603
        'T\'S3$', 'Z', 'Z',
3604
        'T´S3$', 'Z', 'Z',
3605
        'UEBEL(GNRW)-^^', 'ÜBL ', 'IBL ',
3606
        'UEBER^^', 'ÜBA', 'IBA',
3607
        'UE2', 'Ü', 'I',
3608
        'UGL-', 'UK', None,
3609
        'UH(AOÖUÜY)-', 'UH', None,
3610
        'UIE$', 'Ü', 'I',
3611
        'UM^^', 'UM', 'UN',
3612
        'UNTERE--3', 'UNTE', 'UNTE',
3613
        'UNTER^^', 'UNTA', 'UNTA',
3614
        'UNVER^^', 'UNFA', 'UNFA',
3615
        'UN^^', 'UN', 'UN',
3616
        'UTI(AÄOÖUÜ)-', 'UZI', 'UZI',
3617
        'UVE-4', 'UW', None,
3618
        'UY2', 'UI', None,
3619
        'UZZ', 'AS', 'AZ',
3620
        'VACL-^', 'WAZ', 'FAZ',
3621
        'VAC$', 'WAZ', 'FAZ',
3622
        'VAN DEN ^', 'FANDN', 'FANTN',
3623
        'VANES-^', 'WANE', None,
3624
        'VATRO-', 'WATR', None,
3625
        'VA(DHJNT)--^', 'F', None,
3626
        'VEDD-^', 'FE', 'FE',
3627
        'VE(BEHIU)--^', 'F', None,
3628
        'VEL(BDLMNT)-^', 'FEL', None,
3629
        'VENTZ-^', 'FEN', None,
3630
        'VEN(NRSZ)-^', 'FEN', None,
3631
        'VER(AB)-^$', 'WER', None,
3632
        'VERBAL^$', 'WERBAL', None,
3633
        'VERBAL(EINS)-^', 'WERBAL', None,
3634
        'VERTEBR--', 'WERTE', None,
3635
        'VEREIN-----', 'F', None,
3636
        'VEREN(AEIOU)-^', 'WEREN', None,
3637
        'VERIFI', 'WERIFI', None,
3638
        'VERON(AEIOU)-^', 'WERON', None,
3639
        'VERSEN^', 'FERSN', 'FAZN',
3640
        'VERSIERT--^', 'WERSI', None,
3641
        'VERSIO--^', 'WERS', None,
3642
        'VERSUS', 'WERSUS', None,
3643
        'VERTI(GK)-', 'WERTI', None,
3644
        'VER^^', 'FER', 'FA',
3645
        'VERSPRECHE-------', ' FER', ' FA',
3646
        'VER$', 'WA', None,
3647
        'VER', 'FA', 'FA',
3648
        'VET(HT)-^', 'FET', 'FET',
3649
        'VETTE$', 'WET', 'FET',
3650
        'VE^', 'WE', None,
3651
        'VIC$', 'WIZ', 'FIZ',
3652
        'VIELSAGE----', 'FIL ', 'FIL ',
3653
        'VIEL', 'FIL', 'FIL',
3654
        'VIEW', 'WIU', 'FIU',
3655
        'VILL(AE)-', 'WIL', None,
3656
        'VIS(ACEIKUVWZ)-<^', 'WIS', None,
3657
        'VI(ELS)--^', 'F', None,
3658
        'VILLON--', 'WILI', 'FILI',
3659
        'VIZE^^', 'FIZE', 'FIZE',
3660
        'VLIE--^', 'FL', None,
3661
        'VL(AEIOU)--', 'W', None,
3662
        'VOKA-^', 'WOK', None,
3663
        'VOL(ATUVW)--^', 'WO', None,
3664
        'VOR^^', 'FOR', 'FUR',
3665
        'VR(AEIOU)--', 'W', None,
3666
        'VV9', 'W', None,
3667
        'VY9^', 'WÜ', 'FI',
3668
        'V(ÜY)-', 'W', None,
3669
        'V(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'W', None,
3670
        'V(AEIJLRU)-<', 'W', None,
3671
        'V.^', 'V.', None,
3672
        'V<', 'F', 'F',
3673
        'WEITERENTWI-----^', 'WEITA ', 'FEITA ',
3674
        'WEITREICH-----^', 'WEIT ', 'FEIT ',
3675
        'WEITVER^', 'WEIT FER', 'FEIT FA',
3676
        'WE(LMNRST)-3^', 'WE', 'FE',
3677
        'WER(DST)-', 'WER', None,
3678
        'WIC$', 'WIZ', 'FIZ',
3679
        'WIEDERU--', 'WIDE', 'FITE',
3680
        'WIEDER^$', 'WIDA', 'FITA',
3681
        'WIEDER^^', 'WIDA ', 'FITA ',
3682
        'WIEVIEL', 'WI FIL', 'FI FIL',
3683
        'WISUEL', 'WISUEL', None,
3684
        'WR-^', 'W', None,
3685
        'WY9^', 'WÜ', 'FI',
3686
        'W(BDFGJKLMNPQRSTZ)-', 'F', None,
3687
        'W$', 'F', None,
3688
        'W', None, 'F',
3689
        'X<^', 'Z', 'Z',
3690
        'XHAVEN$', 'XAFN', None,
3691
        'X(CSZ)', 'X', 'X',
3692
        'XTS(CH)--', 'XT', 'XT',
3693
        'XT(SZ)', 'Z', 'Z',
3694
        'YE(LMNRST)-3^', 'IE', 'IE',
3695
        'YE-3', 'I', 'I',
3696
        'YOR(GK)^$', 'IÖRK', 'IÖRK',
3697
        'Y(AOU)-<7', 'I', 'I',
3698
        'Y(BKLMNPRSTX)-1', 'Ü', None,
3699
        'YVES^$', 'IF', 'IF',
3700
        'YVONNE^$', 'IWON', 'IFUN',
3701
        'Y.^', 'Y.', None,
3702
        'Y', 'I', 'I',
3703
        'ZC(AOU)-', 'SK', 'ZK',
3704
        'ZE(LMNRST)-3^', 'ZE', 'ZE',
3705
        'ZIEJ$', 'ZI', 'ZI',
3706
        'ZIGERJA(HR)-3', 'ZIGA IA', 'ZIKA IA',
3707
        'ZL(AEIOU)-', 'SL', None,
3708
        'ZS(CHT)--', '', '',
3709
        'ZS', 'SH', 'Z',
3710
        'ZUERST', 'ZUERST', 'ZUERST',
3711
        'ZUGRUNDE^$', 'ZU GRUNDE', 'ZU KRUNTE',
3712
        'ZUGRUNDE', 'ZU GRUNDE ', 'ZU KRUNTE ',
3713
        'ZUGUNSTEN', 'ZU GUNSTN', 'ZU KUNZTN',
3714
        'ZUHAUSE-', 'ZU HAUS', 'ZU AUZ',
3715
        'ZULASTEN^$', 'ZU LASTN', 'ZU LAZTN',
3716
        'ZURUECK^^', 'ZURÜK', 'ZURIK',
3717
        'ZURZEIT', 'ZUR ZEIT', 'ZUR ZEIT',
3718
        'ZURÜCK^^', 'ZURÜK', 'ZURIK',
3719
        'ZUSTANDE', 'ZU STANDE', 'ZU ZTANTE',
3720
        'ZUTAGE', 'ZU TAGE', 'ZU TAKE',
3721
        'ZUVER^^', 'ZUFA', 'ZUFA',
3722
        'ZUVIEL', 'ZU FIL', 'ZU FIL',
3723
        'ZUWENIG', 'ZU WENIK', 'ZU FENIK',
3724
        'ZY9^', 'ZÜ', None,
3725
        'ZYK3$', 'ZIK', None,
3726
        'Z(VW)7^', 'SW', None,
3727
        None, None, None)
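    # Reading aid (comment added for clarity): trailing marker characters in a
    # pattern above ('^', '$', '<', '-', digits, and '(...)' letter groups)
    # encode context and priority rather than literal text; e.g. the triple
    # 'SCHTSCH', 'SH', 'Z' rewrites the letter group SCHTSCH as 'SH' in
    # variant 1 and as 'Z' in variant 2.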

    phonet_hash = Counter()
    alpha_pos = Counter()

    phonet_hash_1 = Counter()
    phonet_hash_2 = Counter()

    _phonet_upper_translation = dict(zip((ord(_) for _ in
                                          'abcdefghijklmnopqrstuvwxyzàáâãåäæ' +
                                          'çðèéêëìíîïñòóôõöøœšßþùúûüýÿ'),
                                         'ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÅÄÆ' +
                                         'ÇÐÈÉÊËÌÍÎÏÑÒÓÔÕÖØŒŠßÞÙÚÛÜÝŸ'))
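    # Illustrative note on the table just above (plain ``str.translate``
    # semantics): it maps lower-case and accented letters to their upper-case
    # forms, e.g. 'müller'.translate(_phonet_upper_translation) == 'MÜLLER'.
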
    def _trinfo(text, rule, err_text, lang):
3742
        """Output debug information."""
3743
        if lang == 'none':
3744
            _phonet_rules = _phonet_rules_no_lang
3745
        else:
3746
            _phonet_rules = _phonet_rules_german
3747
3748
        from_rule = ('(NULL)' if _phonet_rules[rule] is None else
3749
                     _phonet_rules[rule])
3750
        to_rule1 = ('(NULL)' if (_phonet_rules[rule + 1] is None) else
3751
                    _phonet_rules[rule + 1])
3752
        to_rule2 = ('(NULL)' if (_phonet_rules[rule + 2] is None) else
3753
                    _phonet_rules[rule + 2])
3754
        print('"{} {}:  "{}"{}"{}" {}'.format(text, ((rule / 3) + 1),
3755
                                              from_rule, to_rule1, to_rule2,
3756
                                              err_text))
3757
3758
    def _initialize_phonet(lang):
3759
        """Initialize phonet variables."""
3760
        if lang == 'none':
3761
            _phonet_rules = _phonet_rules_no_lang
3762
        else:
3763
            _phonet_rules = _phonet_rules_german
3764
3765
        phonet_hash[''] = -1
3766
3767
        # German and international umlauts
3768
        for j in {'À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë',
3769
                  'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø',
3770
                  'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'Þ', 'ß', 'Œ', 'Š', 'Ÿ'}:
3771
            alpha_pos[j] = 1
3772
            phonet_hash[j] = -1
3773
3774
        # "normal" letters ('A'-'Z')
3775
        for i, j in enumerate('ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
3776
            alpha_pos[j] = i + 2
3777
            phonet_hash[j] = -1
3778
3779
        for i in range(26):
3780
            for j in range(28):
3781
                phonet_hash_1[i, j] = -1
3782
                phonet_hash_2[i, j] = -1
3783
3784
        # for each phonetic rule
3785
        for i in range(len(_phonet_rules)):
3786
            rule = _phonet_rules[i]
3787
3788
            if rule and i % 3 == 0:
3789
                # calculate first hash value
3790
                k = _phonet_rules[i][0]
3791
3792
                if phonet_hash[k] < 0 and (_phonet_rules[i+1] or
3793
                                           _phonet_rules[i+2]):
3794
                    phonet_hash[k] = i
3795
3796
                # calculate second hash values
3797
                if k and alpha_pos[k] >= 2:
3798
                    k = alpha_pos[k]
3799
3800
                    j = k-2
3801
                    rule = rule[1:]
3802
3803
                    if not rule:
3804
                        rule = ' '
3805
                    elif rule[0] == '(':
3806
                        rule = rule[1:]
3807
                    else:
3808
                        rule = rule[0]
3809
3810
                    while rule and (rule[0] != ')'):
3811
                        k = alpha_pos[rule[0]]
3812
3813
                        if k > 0:
3814
                            # add hash value for this letter
3815
                            if phonet_hash_1[j, k] < 0:
3816
                                phonet_hash_1[j, k] = i
3817
                                phonet_hash_2[j, k] = i
3818
3819
                            if phonet_hash_2[j, k] >= (i-30):
3820
                                phonet_hash_2[j, k] = i
3821
                            else:
3822
                                k = -1
3823
3824
                        if k <= 0:
3825
                            # add hash value for all letters
3826
                            if phonet_hash_1[j, 0] < 0:
3827
                                phonet_hash_1[j, 0] = i
3828
3829
                            phonet_hash_2[j, 0] = i
3830
3831
                        rule = rule[1:]
3832
3833
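    # Reading aid (comment added for clarity): _phonet() below indexes the
    # rule tables by the current letter pair; e.g. with 'S' followed by 'C',
    # the first candidate rule is phonet_hash_1[alpha_pos['S'] - 2,
    # alpha_pos['C']], while phonet_hash_1[..., 0] covers rules that accept
    # any following letter.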
    def _phonet(term, mode, lang, trace):
3834
        """Return the phonet coded form of a term."""
3835
        if lang == 'none':
3836
            _phonet_rules = _phonet_rules_no_lang
3837
        else:
3838
            _phonet_rules = _phonet_rules_german
3839
3840
        char0 = ''
3841
        dest = term
3842
3843
        if not term:
3844
            return ''
3845
3846
        term_length = len(term)
3847
3848
        # convert input string to upper-case
3849
        src = term.translate(_phonet_upper_translation)
3850
3851
        # check "src"
3852
        i = 0
3853
        j = 0
3854
        zeta = 0
3855
3856
        while i < len(src):
3857
            char = src[i]
3858
3859
            if trace:
3860
                print('\ncheck position {}:  src = "{}",  dest = "{}"'.format
3861
                      (j, src[i:], dest[:j]))
3862
3863
            pos = alpha_pos[char]
3864
3865
            if pos >= 2:
3866
                xpos = pos-2
3867
3868
                if i+1 == len(src):
3869
                    pos = alpha_pos['']
3870
                else:
3871
                    pos = alpha_pos[src[i+1]]
3872
3873
                start1 = phonet_hash_1[xpos, pos]
3874
                start2 = phonet_hash_1[xpos, 0]
3875
                end1 = phonet_hash_2[xpos, pos]
3876
                end2 = phonet_hash_2[xpos, 0]
3877
3878
                # preserve rule priorities
3879
                if (start2 >= 0) and ((start1 < 0) or (start2 < start1)):
3880
                    pos = start1
3881
                    start1 = start2
3882
                    start2 = pos
3883
                    pos = end1
3884
                    end1 = end2
3885
                    end2 = pos
3886
3887
                if (end1 >= start2) and (start2 >= 0):
3888
                    if end2 > end1:
3889
                        end1 = end2
3890
3891
                    start2 = -1
3892
                    end2 = -1
3893
            else:
3894
                pos = phonet_hash[char]
3895
                start1 = pos
3896
                end1 = 10000
3897
                start2 = -1
3898
                end2 = -1
3899
3900
            pos = start1
3901
            zeta0 = 0
3902
3903
            if pos >= 0:
3904
                # check rules for this char
3905
                while ((_phonet_rules[pos] is None) or
3906
                       (_phonet_rules[pos][0] == char)):
3907
                    if pos > end1:
3908
                        if start2 > 0:
3909
                            pos = start2
3910
                            start1 = start2
3911
                            start2 = -1
3912
                            end1 = end2
3913
                            end2 = -1
3914
                            continue
3915
3916
                        break
3917
3918
                    if (((_phonet_rules[pos] is None) or
3919
                         (_phonet_rules[pos + mode] is None))):
3920
                        # no conversion rule available
3921
                        pos += 3
3922
                        continue
3923
3924
                    if trace:
3925
                        _trinfo('> rule no.', pos, 'is being checked', lang)
3926
3927
                    # check whole string
3928
                    matches = 1  # number of matching letters
3929
                    priority = 5  # default priority
3930
                    rule = _phonet_rules[pos]
3931
                    rule = rule[1:]
3932
3933
                    while (rule and
3934
                           (len(src) > (i + matches)) and
3935
                           (src[i + matches] == rule[0]) and
3936
                           not rule[0].isdigit() and
3937
                           (rule not in '(-<^$')):
3938
                        matches += 1
3939
                        rule = rule[1:]
3940
3941
                    if rule and (rule[0] == '('):
3942
                        # check an array of letters
3943
                        if (((len(src) > (i + matches)) and
3944
                             src[i + matches].isalpha() and
3945
                             (src[i + matches] in rule[1:]))):
3946
                            matches += 1
3947
3948
                            while rule and rule[0] != ')':
3949
                                rule = rule[1:]
3950
3951
                            # if rule[0] == ')':
3952
                            rule = rule[1:]
3953
3954
                    if rule:
3955
                        priority0 = ord(rule[0])
3956
                    else:
3957
                        priority0 = 0
3958
3959
                    matches0 = matches
3960
3961
                    while rule and rule[0] == '-' and matches > 1:
3962
                        matches -= 1
3963
                        rule = rule[1:]
3964
3965
                    if rule and rule[0] == '<':
3966
                        rule = rule[1:]
3967
3968
                    if rule and rule[0].isdigit():
3969
                        # read priority
3970
                        priority = int(rule[0])
3971
                        rule = rule[1:]
3972
3973
                    if rule and rule[0:2] == '^^':
3974
                        rule = rule[1:]
3975
3976
                    if (not rule or
3977
                            ((rule[0] == '^') and
3978
                             ((i == 0) or not src[i-1].isalpha()) and
3979
                             ((rule[1:2] != '$') or
3980
                              (not (src[i+matches0:i+matches0+1].isalpha()) and
3981
                               (src[i+matches0:i+matches0+1] != '.')))) or
3982
                            ((rule[0] == '$') and (i > 0) and
3983
                             src[i-1].isalpha() and
3984
                             ((not src[i+matches0:i+matches0+1].isalpha()) and
3985
                              (src[i+matches0:i+matches0+1] != '.')))):
3986
                        # look for continuation, if:
3987
                        # matches > 1 and NO '-' in first string
3988
                        pos0 = -1
3989
3990
                        start3 = 0
3991
                        start4 = 0
3992
                        end3 = 0
3993
                        end4 = 0
3994
3995
                        if (((matches > 1) and
3996
                             src[i+matches:i+matches+1] and
3997
                             (priority0 != ord('-')))):
3998
                            char0 = src[i+matches-1]
3999
                            pos0 = alpha_pos[char0]
4000
4001
                            if pos0 >= 2 and src[i+matches]:
4002
                                xpos = pos0 - 2
4003
                                pos0 = alpha_pos[src[i+matches]]
4004
                                start3 = phonet_hash_1[xpos, pos0]
4005
                                start4 = phonet_hash_1[xpos, 0]
4006
                                end3 = phonet_hash_2[xpos, pos0]
4007
                                end4 = phonet_hash_2[xpos, 0]
4008
4009
                                # preserve rule priorities
4010
                                if (((start4 >= 0) and
4011
                                     ((start3 < 0) or (start4 < start3)))):
4012
                                    pos0 = start3
4013
                                    start3 = start4
4014
                                    start4 = pos0
4015
                                    pos0 = end3
4016
                                    end3 = end4
4017
                                    end4 = pos0
4018
4019
                                if (end3 >= start4) and (start4 >= 0):
4020
                                    if end4 > end3:
4021
                                        end3 = end4
4022
4023
                                    start4 = -1
4024
                                    end4 = -1
4025
                            else:
4026
                                pos0 = phonet_hash[char0]
4027
                                start3 = pos0
4028
                                end3 = 10000
4029
                                start4 = -1
4030
                                end4 = -1
4031
4032
                            pos0 = start3
4033
4034
                        # check continuation rules for src[i+matches]
4035
                        if pos0 >= 0:
4036
                            while ((_phonet_rules[pos0] is None) or
4037
                                   (_phonet_rules[pos0][0] == char0)):
4038
                                if pos0 > end3:
4039
                                    if start4 > 0:
4040
                                        pos0 = start4
4041
                                        start3 = start4
4042
                                        start4 = -1
4043
                                        end3 = end4
4044
                                        end4 = -1
4045
                                        continue
4046
4047
                                    priority0 = -1
4048
4049
                                    # important
4050
                                    break
4051
4052
                                if (((_phonet_rules[pos0] is None) or
4053
                                     (_phonet_rules[pos0 + mode] is None))):
4054
                                    # no conversion rule available
4055
                                    pos0 += 3
4056
                                    continue
4057
4058
                                if trace:
4059
                                    _trinfo('> > continuation rule no.', pos0,
4060
                                            'is being checked', lang)
4061
4062
                                # check whole string
4063
                                matches0 = matches
4064
                                priority0 = 5
4065
                                rule = _phonet_rules[pos0]
4066
                                rule = rule[1:]
4067
4068
                                while (rule and
4069
                                       (src[i+matches0:i+matches0+1] ==
4070
                                        rule[0]) and
4071
                                       (not rule[0].isdigit() or
4072
                                        (rule in '(-<^$'))):
4073
                                    matches0 += 1
4074
                                    rule = rule[1:]
4075
4076
                                if rule and rule[0] == '(':
4077
                                    # check an array of letters
4078
                                    if ((src[i+matches0:i+matches0+1]
4079
                                         .isalpha() and
4080
                                         (src[i+matches0] in rule[1:]))):
4081
                                        matches0 += 1
4082
4083
                                        while rule and rule[0] != ')':
4084
                                            rule = rule[1:]
4085
4086
                                        # if rule[0] == ')':
4087
                                        rule = rule[1:]
4088
4089
                                while rule and rule[0] == '-':
4090
                                    # "matches0" is NOT decremented
4091
                                    # because of  "if (matches0 == matches)"
4092
                                    rule = rule[1:]
4093
4094
                                if rule and rule[0] == '<':
4095
                                    rule = rule[1:]
4096
4097
                                if rule and rule[0].isdigit():
4098
                                    priority0 = int(rule[0])
4099
                                    rule = rule[1:]
4100
4101
                                if (not rule or
4102
                                        # rule == '^' is not possible here
4103
                                        ((rule[0] == '$') and not
4104
                                         src[i+matches0:i+matches0+1]
4105
                                         .isalpha() and
4106
                                         (src[i+matches0:i+matches0+1]
4107
                                          != '.'))):
4108
                                    if matches0 == matches:
4109
                                        # this is only a partial string
4110
                                        if trace:
4111
                                            _trinfo('> > continuation ' +
4112
                                                    'rule no.',
4113
                                                    pos0,
4114
                                                    'not used (too short)',
4115
                                                    lang)
4116
4117
                                        pos0 += 3
4118
                                        continue
4119
4120
                                    if priority0 < priority:
4121
                                        # priority is too low
4122
                                        if trace:
4123
                                            _trinfo('> > continuation ' +
4124
                                                    'rule no.',
4125
                                                    pos0,
4126
                                                    'not used (priority)',
4127
                                                    lang)
4128
4129
                                        pos0 += 3
4130
                                        continue
4131
4132
                                    # continuation rule found
4133
                                    break
4134
4135
                                if trace:
4136
                                    _trinfo('> > continuation rule no.', pos0,
4137
                                            'not used', lang)
4138
4139
                                pos0 += 3
4140
4141
                            # end of "while"
4142
                            if ((priority0 >= priority) and
4143
                                    ((_phonet_rules[pos0] is not None) and
4144
                                     (_phonet_rules[pos0][0] == char0))):
4145
4146
                                if trace:
4147
                                    _trinfo('> rule no.', pos, '', lang)
4148
                                    _trinfo('> not used because of ' +
4149
                                            'continuation', pos0, '', lang)
4150
4151
                                pos += 3
4152
                                continue
4153
4154
                        # replace string
4155
                        if trace:
4156
                            _trinfo('Rule no.', pos, 'is applied', lang)
4157
4158
                        if ((_phonet_rules[pos] and
4159
                             ('<' in _phonet_rules[pos][1:]))):
4160
                            priority0 = 1
4161
                        else:
4162
                            priority0 = 0
4163
4164
                        rule = _phonet_rules[pos + mode]
4165
4166
                        if (priority0 == 1) and (zeta == 0):
4167
                            # rule with '<' is applied
4168
                            if ((j > 0) and rule and
4169
                                    ((dest[j-1] == char) or
4170
                                     (dest[j-1] == rule[0]))):
4171
                                j -= 1
4172
4173
                            zeta0 = 1
4174
                            zeta += 1
4175
                            matches0 = 0
4176
4177
                            while rule and src[i+matches0]:
4178
                                src = (src[0:i+matches0] + rule[0] +
4179
                                       src[i+matches0+1:])
4180
                                matches0 += 1
4181
                                rule = rule[1:]
4182
4183
                            if matches0 < matches:
4184
                                src = (src[0:i+matches0] +
4185
                                       src[i+matches:])
4186
4187
                            char = src[i]
4188
                        else:
4189
                            i = i + matches - 1
4190
                            zeta = 0
4191
4192
                            while len(rule) > 1:
4193
                                if (j == 0) or (dest[j - 1] != rule[0]):
4194
                                    dest = (dest[0:j] + rule[0] +
4195
                                            dest[min(len(dest), j+1):])
4196
                                    j += 1
4197
4198
                                rule = rule[1:]
4199
4200
                            # new "current char"
4201
                            if not rule:
4202
                                rule = ''
4203
                                char = ''
4204
                            else:
4205
                                char = rule[0]
4206
4207
                            if ((_phonet_rules[pos] and
4208
                                 '^^' in _phonet_rules[pos][1:])):
4209
                                if char:  # pragma: no branch
4210
                                    dest = (dest[0:j] + char +
4211
                                            dest[min(len(dest), j + 1):])
4212
                                    j += 1
4213
4214
                                src = src[i + 1:]
4215
                                i = 0
4216
                                zeta0 = 1
4217
4218
                        break
4219
4220
                    pos += 3
4221
4222
                    if pos > end1 and start2 > 0:
4223
                        pos = start2
4224
                        start1 = start2
4225
                        end1 = end2
4226
                        start2 = -1
4227
                        end2 = -1
4228
4229
            if zeta0 == 0:
4230
                if char and ((j == 0) or (dest[j-1] != char)):
4231
                    # delete multiple letters only
4232
                    dest = dest[0:j] + char + dest[min(j+1, term_length):]
4233
                    j += 1
4234
4235
                i += 1
4236
                zeta = 0
4237
4238
        dest = dest[0:j]
4239
4240
        return dest
4241
4242
    _initialize_phonet(lang)
4243
4244
    word = unicodedata.normalize('NFKC', text_type(word))
4245
    return _phonet(word, mode, lang, trace)
4246
4247
4248
def spfc(word):
4249
    """Return the Standardized Phonetic Frequency Code (SPFC) of a word.
4250
4251
    Standardized Phonetic Frequency Code is roughly Soundex-like.
4252
    This implementation is based on page 19-21 of
4253
    https://archive.org/stream/accessingindivid00moor#page/19/mode/1up
4254
4255
    :param str word: the word to transform
4256
    :returns: the SPFC value
4257
    :rtype: str
4258
4259
    >>> spfc('Christopher Smith')
4260
    '01160'
4261
    >>> spfc('Christopher Schmidt')
4262
    '01160'
4263
    >>> spfc('Niall Smith')
4264
    '01660'
4265
    >>> spfc('Niall Schmidt')
    '01660'

    >>> spfc('L.Smith')
4268
    '01960'
4269
    >>> spfc('R.Miller')
4270
    '65490'
4271
4272
    >>> spfc(('L', 'Smith'))
4273
    '01960'
4274
    >>> spfc(('R', 'Miller'))
4275
    '65490'
4276
    """
4277
    _pf1 = dict(zip((ord(_) for _ in 'SZCKQVFPUWABLORDHIEMNXGJT'),
4278
                    '0011112222334445556666777'))
4279
    _pf2 = dict(zip((ord(_) for _ in
4280
                     'SZCKQFPXABORDHIMNGJTUVWEL'),
4281
                    '0011122233445556677788899'))
4282
    _pf3 = dict(zip((ord(_) for _ in
4283
                     'BCKQVDTFLPGJXMNRSZAEHIOUWY'),
4284
                    '00000112223334456677777777'))
4285
4286
    _substitutions = (('DK', 'K'), ('DT', 'T'), ('SC', 'S'), ('KN', 'N'),
4287
                      ('MN', 'N'))
4288
4289
    def _raise_word_ex():
4290
        """Raise an AttributeError."""
4291
        raise AttributeError('word attribute must be a string with a space ' +
4292
                             'or period dividing the first and last names ' +
4293
                             'or a tuple/list consisting of the first and ' +
4294
                             'last names')
4295
4296
    if not word:
4297
        return ''
4298
4299
    if isinstance(word, (str, text_type)):
4300
        names = word.split('.', 1)
4301
        if len(names) != 2:
4302
            names = word.split(' ', 1)
4303
            if len(names) != 2:
4304
                _raise_word_ex()
4305
    elif hasattr(word, '__iter__'):
4306
        if len(word) != 2:
4307
            _raise_word_ex()
4308
        names = word
4309
    else:
4310
        _raise_word_ex()
4311
4312
    names = [unicodedata.normalize('NFKD', text_type(_.strip()
4313
                                                     .replace('ß', 'SS')
4314
                                                     .upper()))
4315
             for _ in names]
4316
    code = ''
4317
4318
    def steps_one_to_three(name):
4319
        """Perform the first three steps of SPFC."""
4320
        # filter out non A-Z
4321
        name = ''.join(_ for _ in name if _ in
4322
                       {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K',
4323
                        'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V',
4324
                        'W', 'X', 'Y', 'Z'})
4325
4326
        # 1. In the field, convert DK to K, DT to T, SC to S, KN to N,
4327
        # and MN to N
4328
        for subst in _substitutions:
4329
            name = name.replace(subst[0], subst[1])
4330
4331
        # 2. In the name field, replace multiple letters with a single letter
4332
        name = _delete_consecutive_repeats(name)
4333
4334
        # 3. Remove vowels, W, H, and Y, but keep the first letter in the name
4335
        # field.
4336
        if name:
4337
            name = name[0] + ''.join(_ for _ in name[1:] if _ not in
4338
                                     {'A', 'E', 'H', 'I', 'O', 'U', 'W', 'Y'})
4339
        return name
4340
4341
    names = [steps_one_to_three(_) for _ in names]
4342
4343
    # 4. The first digit of the code is obtained using PF1 and the first letter
4344
    # of the name field. Remove this letter after coding.
4345
    if names[1]:
4346
        code += names[1][0].translate(_pf1)
4347
        names[1] = names[1][1:]
4348
4349
    # 5. Using the last letters of the name, use Table PF3 to obtain the
4350
    # second digit of the code. Use as many letters as possible and remove
4351
    # after coding.
4352
    if names[1]:
4353
        if names[1][-3:] == 'STN' or names[1][-3:] == 'PRS':
4354
            code += '8'
4355
            names[1] = names[1][:-3]
4356
        elif names[1][-2:] == 'SN':
4357
            code += '8'
4358
            names[1] = names[1][:-2]
4359
        elif names[1][-3:] == 'STR':
4360
            code += '9'
4361
            names[1] = names[1][:-3]
4362
        elif names[1][-2:] in {'SR', 'TN', 'TD'}:
4363
            code += '9'
4364
            names[1] = names[1][:-2]
4365
        elif names[1][-3:] == 'DRS':
4366
            code += '7'
4367
            names[1] = names[1][:-3]
4368
        elif names[1][-2:] in {'TR', 'MN'}:
4369
            code += '7'
4370
            names[1] = names[1][:-2]
4371
        else:
4372
            code += names[1][-1].translate(_pf3)
4373
            names[1] = names[1][:-1]
4374
4375
    # 6. The third digit is found using Table PF2 and the first character of
4376
    # the first name. Remove after coding.
4377
    if names[0]:
4378
        code += names[0][0].translate(_pf2)
4379
        names[0] = names[0][1:]
4380
4381
    # 7. The fourth digit is found using Table PF2 and the first character of
4382
    # the name field. If no letters remain use zero. After coding remove the
4383
    # letter.
4384
    # 8. The fifth digit is found in the same manner as the fourth using the
4385
    # remaining characters of the name field if any.
4386
    for _ in range(2):
4387
        if names[1]:
4388
            code += names[1][0].translate(_pf2)
4389
            names[1] = names[1][1:]
4390
        else:
4391
            code += '0'
4392
4393
    return code
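

# A minimal usage sketch for spfc() (illustrative only; the helper below is
# hypothetical and not part of the module's API). It shows the input check
# performed by _raise_word_ex() above: a value without a space or period
# dividing the first and last names raises an AttributeError.
def _spfc_input_check_sketch(value='Smith'):
    """Return the SPFC code, or None if the value cannot be split."""
    try:
        return spfc(value)
    except AttributeError:
        return None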
4394
4395
4396
def statistics_canada(word, maxlength=4):
4397
    """Return the Statistics Canada code for a word.
4398
4399
    The original description of this algorithm could not be located, and
4400
    may only have been specified in an unpublished TR. The coding does not
4401
    appear to be in use by Statistics Canada any longer. In its place, this is
4402
    an implementation of the "Census modified Statistics Canada name coding
4403
    procedure".
4404
4405
    The modified version of this algorithm is described in Appendix B of
4406
    Lynch, Billy T. and William L. Arends. `Selection of a Surname Coding
4407
    Procedure for the SRS Record Linkage System.` Statistical Reporting
4408
    Service, U.S. Department of Agriculture, Washington, D.C. February 1977.
4409
    https://naldc.nal.usda.gov/download/27833/PDF
4410
4411
    :param str word: the word to transform
4412
    :param int maxlength: the maximum length (default 4) of the code to return
    :returns: the Statistics Canada name code value
4415
    :rtype: str
4416
4417
    >>> statistics_canada('Christopher')
4418
    'CHRS'
4419
    >>> statistics_canada('Niall')
4420
    'NL'
4421
    >>> statistics_canada('Smith')
4422
    'SMTH'
4423
    >>> statistics_canada('Schmidt')
4424
    'SCHM'
4425
    """
4426
    # uppercase, normalize, decompose, and filter non-A-Z out
4427
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
4428
    word = word.replace('ß', 'SS')
4429
    word = ''.join(c for c in word if c in
4430
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
4431
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
4432
                    'Y', 'Z'})
4433
    if not word:
4434
        return ''
4435
4436
    code = word[1:]
4437
    for vowel in {'A', 'E', 'I', 'O', 'U', 'Y'}:
4438
        code = code.replace(vowel, '')
4439
    code = word[0]+code
4440
    code = _delete_consecutive_repeats(code)
4441
    code = code.replace(' ', '')
4442
4443
    return code[:maxlength]
4444
4445
4446
def lein(word, maxlength=4, zero_pad=True):
4447
    """Return the Lein code for a word.
4448
4449
    This is Lein name coding, based on
4450
    https://naldc.nal.usda.gov/download/27833/PDF
4451
4452
    :param str word: the word to transform
4453
    :param int maxlength: the maximum length (default 4) of the code to return
4454
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
4455
        maxlength string
4456
    :returns: the Lein code
4457
    :rtype: str
4458
4459
    >>> lein('Christopher')
4460
    'C351'
4461
    >>> lein('Niall')
4462
    'N300'
4463
    >>> lein('Smith')
4464
    'S210'
4465
    >>> lein('Schmidt')
4466
    'S521'
4467
    """
4468
    _lein_translation = dict(zip((ord(_) for _ in
4469
                                  'BCDFGJKLMNPQRSTVXZ'),
4470
                                 '451455532245351455'))
4471
4472
    # uppercase, normalize, decompose, and filter non-A-Z out
4473
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
4474
    word = word.replace('ß', 'SS')
4475
    word = ''.join(c for c in word if c in
4476
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
4477
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
4478
                    'Y', 'Z'})
4479
4480
    if not word:
4481
        return ''
4482
4483
    code = word[0]  # Rule 1
4484
    word = word[1:].translate({32: None, 65: None, 69: None, 72: None,
4485
                               73: None, 79: None, 85: None, 87: None,
4486
                               89: None})  # Rule 2
4487
    word = _delete_consecutive_repeats(word)  # Rule 3
4488
    code += word.translate(_lein_translation)  # Rule 4
4489
4490
    if zero_pad:
4491
        code += ('0'*maxlength)  # Rule 4
4492
4493
    return code[:maxlength]
4494
4495
4496
def roger_root(word, maxlength=5, zero_pad=True):
4497
    """Return the Roger Root code for a word.
4498
4499
    This is Roger Root name coding, based on
4500
    https://naldc.nal.usda.gov/download/27833/PDF
4501
4502
    :param str word: the word to transform
4503
    :param int maxlength: the maximum length (default 5) of the code to return
4504
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
4505
        maxlength string
4506
    :returns: the Roger Root code
4507
    :rtype: str
4508
4509
    >>> roger_root('Christopher')
4510
    '06401'
4511
    >>> roger_root('Niall')
4512
    '02500'
4513
    >>> roger_root('Smith')
4514
    '00310'
4515
    >>> roger_root('Schmidt')
4516
    '06310'
4517
    """
4518
    # uppercase, normalize, decompose, and filter non-A-Z out
4519
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
4520
    word = word.replace('ß', 'SS')
4521
    word = ''.join(c for c in word if c in
4522
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
4523
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
4524
                    'Y', 'Z'})
4525
4526
    if not word:
4527
        return ''
4528
4529
    # '*' is used to prevent combining by _delete_consecutive_repeats()
4530
    _init_patterns = {4: {'TSCH': '06'},
4531
                      3: {'TSH': '06', 'SCH': '06'},
4532
                      2: {'CE': '0*0', 'CH': '06', 'CI': '0*0', 'CY': '0*0',
4533
                          'DG': '07', 'GF': '08', 'GM': '03', 'GN': '02',
4534
                          'KN': '02', 'PF': '08', 'PH': '08', 'PN': '02',
4535
                          'SH': '06', 'TS': '0*0', 'WR': '04'},
4536
                      1: {'A': '1', 'B': '09', 'C': '07', 'D': '01', 'E': '1',
4537
                          'F': '08', 'G': '07', 'H': '2', 'I': '1', 'J': '3',
4538
                          'K': '07', 'L': '05', 'M': '03', 'N': '02', 'O': '1',
4539
                          'P': '09', 'Q': '07', 'R': '04', 'S': '0*0',
4540
                          'T': '01', 'U': '1', 'V': '08', 'W': '4', 'X': '07',
4541
                          'Y': '5', 'Z': '0*0'}}
4542
4543
    _med_patterns = {4: {'TSCH': '6'},
4544
                     3: {'TSH': '6', 'SCH': '6'},
4545
                     2: {'CE': '0', 'CH': '6', 'CI': '0', 'CY': '0', 'DG': '7',
4546
                         'PH': '8', 'SH': '6', 'TS': '0'},
4547
                     1: {'B': '9', 'C': '7', 'D': '1', 'F': '8', 'G': '7',
4548
                         'J': '6', 'K': '7', 'L': '5', 'M': '3', 'N': '2',
4549
                         'P': '9', 'Q': '7', 'R': '4', 'S': '0', 'T': '1',
4550
                         'V': '8', 'X': '7', 'Z': '0',
4551
                         'A': '*', 'E': '*', 'H': '*', 'I': '*', 'O': '*',
4552
                         'U': '*', 'W': '*', 'Y': '*'}}
4553
4554
    code = ''
4555
    pos = 0
4556
4557
    # Do first digit(s) first
4558
    for num in range(4, 0, -1):
4559
        if word[:num] in _init_patterns[num]:
4560
            code = _init_patterns[num][word[:num]]
4561
            pos += num
4562
            break
4563
    else:
4564
        pos += 1  # Advance if nothing is recognized
4565
4566
    # Then code subsequent digits
4567
    while pos < len(word):
4568
        for num in range(4, 0, -1):
4569
            if word[pos:pos+num] in _med_patterns[num]:
4570
                code += _med_patterns[num][word[pos:pos+num]]
4571
                pos += num
4572
                break
4573
        else:
4574
            pos += 1  # Advance if nothing is recognized
4575
4576
    code = _delete_consecutive_repeats(code)
4577
    code = code.replace('*', '')
4578
4579
    if zero_pad:
4580
        code += '0'*maxlength
4581
4582
    return code[:maxlength]
4583
4584
4585
def onca(word, maxlength=4, zero_pad=True):
4586
    """Return the Oxford Name Compression Algorithm (ONCA) code for a word.
4587
4588
    This is the Oxford Name Compression Algorithm, based on:
4589
    Gill, Leicester E. 1997. "OX-LINK: The Oxford Medical Record Linkage
4590
    System." In ``Record Linkage Techniques -- 1997``. Arlington, VA. March
4591
    20--21, 1997.
4592
    https://nces.ed.gov/FCSM/pdf/RLT97.pdf
4593
4594
    I can find no complete description of the "anglicised version of the NYSIIS
4595
    method" identified as the first step in this algorithm, so this is likely
4596
    not a correct implementation, in that it employs the standard NYSIIS
4597
    algorithm.
4598
4599
    :param str word: the word to transform
4600
    :param int maxlength: the maximum length (default 4) of the code to return
4601
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
4602
        maxlength string
4603
    :returns: the ONCA code
4604
    :rtype: str
4605
4606
    >>> onca('Christopher')
4607
    'C623'
4608
    >>> onca('Niall')
4609
    'N400'
4610
    >>> onca('Smith')
4611
    'S530'
4612
    >>> onca('Schmidt')
4613
    'S530'
4614
    """
4615
    # In the most extreme case, 3 characters of NYSIIS input can be compressed
4616
    # to one character of output, so give it triple the maxlength.
4617
    return soundex(nysiis(word, maxlength=maxlength*3), maxlength,
4618
                   zero_pad=zero_pad)
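

# A minimal sketch (illustrative only; the helper name is hypothetical): per
# the definition above, the ONCA code is just the Soundex of the
# (triple-length) NYSIIS code, so the two expressions below should agree.
def _onca_composition_sketch(name='Christopher'):
    """Check that onca() matches chaining nysiis() and soundex() by hand."""
    expected = soundex(nysiis(name, maxlength=4*3), 4, zero_pad=True)
    return onca(name) == expected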
4619
4620
4621
def eudex(word, maxlength=8):
4622
    """Return the eudex phonetic hash of a word.
4623
4624
    This implementation of eudex phonetic hashing is based on the specification
4625
    (not the reference implementation) at:
4626
    Ticki. 2017. "Eudex: A blazingly fast phonetic reduction/hashing
4627
    algorithm." https://docs.rs/crate/eudex
4628
4629
    Further details can be found at
4630
    http://ticki.github.io/blog/the-eudex-algorithm/
4631
4632
    :param str word: the word to transform
4633
    :param int maxlength: the number of encoded characters retained in the
        hash (defaults to 8)
4634
    :returns: the eudex hash
4635
    :rtype: int
4636
    """
4637
    _trailing_phones = {
4638
        'a': 0,  # a
4639
        'b': 0b01001000,  # b
4640
        'c': 0b00001100,  # c
4641
        'd': 0b00011000,  # d
4642
        'e': 0,  # e
4643
        'f': 0b01000100,  # f
4644
        'g': 0b00001000,  # g
4645
        'h': 0b00000100,  # h
4646
        'i': 1,  # i
4647
        'j': 0b00000101,  # j
4648
        'k': 0b00001001,  # k
4649
        'l': 0b10100000,  # l
4650
        'm': 0b00000010,  # m
4651
        'n': 0b00010010,  # n
4652
        'o': 0,  # o
4653
        'p': 0b01001001,  # p
4654
        'q': 0b10101000,  # q
4655
        'r': 0b10100001,  # r
4656
        's': 0b00010100,  # s
4657
        't': 0b00011101,  # t
4658
        'u': 1,  # u
4659
        'v': 0b01000101,  # v
4660
        'w': 0b00000000,  # w
4661
        'x': 0b10000100,  # x
4662
        'y': 1,  # y
4663
        'z': 0b10010100,  # z
4664
4665
        'ß': 0b00010101,  # ß
4666
        'à': 0,  # à
4667
        'á': 0,  # á
4668
        'â': 0,  # â
4669
        'ã': 0,  # ã
4670
        'ä': 0,  # ä[æ]
4671
        'å': 1,  # å[oː]
4672
        'æ': 0,  # æ[æ]
4673
        'ç': 0b10010101,  # ç[t͡ʃ]
4674
        'è': 1,  # è
4675
        'é': 1,  # é
4676
        'ê': 1,  # ê
4677
        'ë': 1,  # ë
4678
        'ì': 1,  # ì
4679
        'í': 1,  # í
4680
        'î': 1,  # î
4681
        'ï': 1,  # ï
4682
        'ð': 0b00010101,  # ð[ð̠](represented as a non-plosive T)
4683
        'ñ': 0b00010111,  # ñ[nj](represented as a combination of n and j)
4684
        'ò': 0,  # ò
4685
        'ó': 0,  # ó
4686
        'ô': 0,  # ô
4687
        'õ': 0,  # õ
4688
        'ö': 1,  # ö[ø]
4689
        '÷': 0b11111111,  # ÷
4690
        'ø': 1,  # ø[ø]
4691
        'ù': 1,  # ù
4692
        'ú': 1,  # ú
4693
        'û': 1,  # û
4694
        'ü': 1,  # ü
4695
        'ý': 1,  # ý
4696
        'þ': 0b00010101,  # þ[ð̠](represented as a non-plosive T)
4697
        'ÿ': 1,  # ÿ
4698
    }
4699
4700
    _initial_phones = {
4701
        'a': 0b10000100,  # a*
4702
        'b': 0b00100100,  # b
4703
        'c': 0b00000110,  # c
4704
        'd': 0b00001100,  # d
4705
        'e': 0b11011000,  # e*
4706
        'f': 0b00100010,  # f
4707
        'g': 0b00000100,  # g
4708
        'h': 0b00000010,  # h
4709
        'i': 0b11111000,  # i*
4710
        'j': 0b00000011,  # j
4711
        'k': 0b00000101,  # k
4712
        'l': 0b01010000,  # l
4713
        'm': 0b00000001,  # m
4714
        'n': 0b00001001,  # n
4715
        'o': 0b10010100,  # o*
4716
        'p': 0b00100101,  # p
4717
        'q': 0b01010100,  # q
4718
        'r': 0b01010001,  # r
4719
        's': 0b00001010,  # s
4720
        't': 0b00001110,  # t
4721
        'u': 0b11100000,  # u*
4722
        'v': 0b00100011,  # v
4723
        'w': 0b00000000,  # w
4724
        'x': 0b01000010,  # x
4725
        'y': 0b11100100,  # y*
4726
        'z': 0b01001010,  # z
4727
4728
        'ß': 0b00001011,  # ß
4729
        'à': 0b10000101,  # à
4730
        'á': 0b10000101,  # á
4731
        'â': 0b10000000,  # â
4732
        'ã': 0b10000110,  # ã
4733
        'ä': 0b10100110,  # ä [æ]
4734
        'å': 0b11000010,  # å [oː]
4735
        'æ': 0b10100111,  # æ [æ]
4736
        'ç': 0b01010100,  # ç [t͡ʃ]
4737
        'è': 0b11011001,  # è
4738
        'é': 0b11011001,  # é
4739
        'ê': 0b11011001,  # ê
4740
        'ë': 0b11000110,  # ë [ə] or [œ]
4741
        'ì': 0b11111001,  # ì
4742
        'í': 0b11111001,  # í
4743
        'î': 0b11111001,  # î
4744
        'ï': 0b11111001,  # ï
4745
        'ð': 0b00001011,  # ð [ð̠] (represented as a non-plosive T)
4746
        'ñ': 0b00001011,  # ñ [nj] (represented as a combination of n and j)
4747
        'ò': 0b10010101,  # ò
4748
        'ó': 0b10010101,  # ó
4749
        'ô': 0b10010101,  # ô
4750
        'õ': 0b10010101,  # õ
4751
        'ö': 0b11011100,  # ö [œ] or [ø]
4752
        '÷': 0b11111111,  # ÷
4753
        'ø': 0b11011101,  # ø [œ] or [ø]
4754
        'ù': 0b11100001,  # ù
4755
        'ú': 0b11100001,  # ú
4756
        'û': 0b11100001,  # û
4757
        'ü': 0b11100101,  # ü
4758
        'ý': 0b11100101,  # ý
4759
        'þ': 0b00001011,  # þ [ð̠] (represented as a non-plosive T)
4760
        'ÿ': 0b11100101,  # ÿ
4761
    }
4762
    # Lowercase input & filter unknown characters
4763
    word = ''.join(char for char in word.lower() if char in _initial_phones)
4764
4765
    # Perform initial eudex coding of each character
4766
    values = [_initial_phones[word[0]]]
4767
    values += [_trailing_phones[char] for char in word[1:]]
4768
4769
    # Right-shift by one to determine if second instance should be skipped
4770
    shifted_values = [_ >> 1 for _ in values]
4771
    condensed_values = [values[0]]
4772
    for n in range(1, len(shifted_values)):
4773
        if shifted_values[n] != shifted_values[n-1]:
4774
            condensed_values.append(values[n])
4775
4776
    # Add padding after first character & trim beyond maxlength
4777
    values = ([condensed_values[0]] +
4778
              [0]*max(0, maxlength - len(condensed_values)) +
4779
              condensed_values[1:maxlength])
4780
4781
    # Combine individual character values into eudex hash
4782
    hash_value = 0
4783
    for val in values:
4784
        hash_value = (hash_value << 8) | val
4785
4786
    return hash_value
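

# A minimal usage sketch for eudex() (illustrative only; the helper below is
# hypothetical and no particular values are asserted). Since eudex() returns
# an integer hash, one simple way to compare two words is to count the bits
# that differ between their hashes.
def _eudex_xor_distance_sketch(word1, word2):
    """Count differing bits between the eudex hashes of two words."""
    return bin(eudex(word1) ^ eudex(word2)).count('1')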
4787
4788
4789
def haase_phonetik(word, primary_only=False):
4790
    """Return the Haase Phonetik (numeric output) code for a word.
4791
4792
    Based on the algorithm described at
4793
    https://github.com/elastic/elasticsearch/blob/master/plugins/analysis-phonetic/src/main/java/org/elasticsearch/index/analysis/phonetic/HaasePhonetik.java
4794
4795
    Based on the original
4796
    Haase, Martin and Kai Heitmann. 2000. Die Erweiterte Kölner Phonetik.
4797
4798
    While the output code is numeric, it is still a str.
4799
4800
    :param str word: the word to transform
    :param bool primary_only: if True, skip the variant expansion and encode
        only the word as given
    :returns: the Haase Phonetik value(s) as numeric string(s)
    :rtype: tuple
4803
    """
4804
    def _after(word, i, letters):
4805
        """Return True if word[i] follows one of the supplied letters."""
4806
        if i > 0 and word[i-1] in letters:
4807
            return True
4808
        return False
4809
4810
    def _before(word, i, letters):
4811
        """Return True if word[i] precedes one of the supplied letters."""
4812
        if i+1 < len(word) and word[i+1] in letters:
4813
            return True
4814
        return False
4815
4816
    _vowels = {'A', 'E', 'I', 'J', 'O', 'U', 'Y'}
4817
4818
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
4819
    word = word.replace('ß', 'SS')
4820
4821
    word = word.replace('Ä', 'AE')
4822
    word = word.replace('Ö', 'OE')
4823
    word = word.replace('Ü', 'UE')
4824
    word = ''.join(c for c in word if c in
4825
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
4826
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
4827
                    'Y', 'Z'})
4828
4829
    # Nothing to convert, return base case
4830
    if not word:
4831
        return ''
4832
4833
    variants = []
4834
    if primary_only:
4835
        variants = [word]
4836
    else:
4837
        pos = 0
4838
        if word[:2] == 'CH':
4839
            variants.append(('CH', 'SCH'))
4840
            pos += 2
4841
        len_3_vars = {'OWN': 'AUN', 'WSK': 'RSK', 'SCH': 'CH', 'GLI': 'LI',
4842
                      'AUX': 'O', 'EUX': 'O'}
4843
        while pos < len(word):
4844
            if word[pos:pos+4] == 'ILLE':
4845
                variants.append(('ILLE', 'I'))
4846
                pos += 4
4847
            elif word[pos:pos+3] in len_3_vars:
4848
                variants.append((word[pos:pos+3], len_3_vars[word[pos:pos+3]]))
4849
                pos += 3
4850
            elif word[pos:pos+2] == 'RB':
4851
                variants.append(('RB', 'RW'))
4852
                pos += 2
4853
            elif len(word[pos:]) == 3 and word[pos:] == 'EAU':
4854
                variants.append(('EAU', 'O'))
4855
                pos += 3
4856
            elif len(word[pos:]) == 1 and word[pos:] in {'A', 'O'}:
4857
                if word[pos:] == 'O':
4858
                    variants.append(('O', 'OW'))
4859
                else:
4860
                    variants.append(('A', 'AR'))
4861
                pos += 1
4862
            else:
4863
                variants.append((word[pos],))
4864
                pos += 1
4865
4866
        variants = [''.join(letters) for letters in product(*variants)]
4867
4868
    def _haase_code(word):
4869
        sdx = ''
4870
        for i in range(len(word)):
            if word[i] in _vowels:
4872
                sdx += '9'
4873
            elif word[i] == 'B':
4874
                sdx += '1'
4875
            elif word[i] == 'P':
4876
                if _before(word, i, {'H'}):
4877
                    sdx += '3'
4878
                else:
4879
                    sdx += '1'
4880
            elif word[i] in {'D', 'T'}:
4881
                if _before(word, i, {'C', 'S', 'Z'}):
4882
                    sdx += '8'
4883
                else:
4884
                    sdx += '2'
4885
            elif word[i] in {'F', 'V', 'W'}:
4886
                sdx += '3'
4887
            elif word[i] in {'G', 'K', 'Q'}:
4888
                sdx += '4'
4889
            elif word[i] == 'C':
4890
                if _after(word, i, {'S', 'Z'}):
4891
                    sdx += '8'
4892
                elif i == 0:
4893
                    if _before(word, i, {'A', 'H', 'K', 'L', 'O', 'Q', 'R',
4894
                                         'U', 'X'}):
4895
                        sdx += '4'
4896
                    else:
4897
                        sdx += '8'
4898
                elif _before(word, i, {'A', 'H', 'K', 'O', 'Q', 'U', 'X'}):
4899
                    sdx += '4'
4900
                else:
4901
                    sdx += '8'
4902
            elif word[i] == 'X':
4903
                if _after(word, i, {'C', 'K', 'Q'}):
4904
                    sdx += '8'
4905
                else:
4906
                    sdx += '48'
4907
            elif word[i] == 'L':
4908
                sdx += '5'
4909
            elif word[i] in {'M', 'N'}:
4910
                sdx += '6'
4911
            elif word[i] == 'R':
4912
                sdx += '7'
4913
            elif word[i] in {'S', 'Z'}:
4914
                sdx += '8'
4915
4916
        sdx = _delete_consecutive_repeats(sdx)
4917
4918
        # if sdx:
4919
        #     sdx = sdx[0] + sdx[1:].replace('9', '')
4920
4921
        return sdx
4922
4923
    return tuple(_haase_code(word) for word in variants)
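

# A minimal usage sketch for haase_phonetik() (illustrative only; the helper
# below is hypothetical and no particular codes are asserted). The function
# returns a tuple with one code per generated spelling variant, while
# primary_only=True skips the variant expansion.
def _haase_phonetik_sketch(name='Mueller'):
    """Return all variant codes plus the primary-only code."""
    return haase_phonetik(name), haase_phonetik(name, primary_only=True)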
4924
4925
4926
def reth_schek_phonetik(word):
4927
    """Return Reth-Schek Phonetik code for a word.
4928
4929
    This algorithm is proposed in:
4930
    von Reth, Hans-Peter and Schek, Hans-Jörg. 1977. "Eine Zugriffsmethode für
4931
    die phonetische Ähnlichkeitssuche." Heidelberg Scientific Center technical
4932
    reports 77.03.002. IBM Deutschland GmbH.
4933
4934
    Since I couldn't secure a copy of that document (maybe I'll look for it
4935
    next time I'm in Germany), this implementation is based on what I could
4936
    glean from the implementations published by German Record Linkage
4937
    Center (www.record-linkage.de):
4938
    - Privacy-preserving Record Linkage (PPRL) (in R)
4939
    - Merge ToolBox (in Java)
4940
4941
    Rules that are unclear:
4942
    - Should 'C' become 'G' or 'Z'? (PPRL has both, 'Z' rule blocked)
4943
    - Should 'CC' become 'G'? (PPRL has blocked 'CK' that may be typo)
4944
    - Should 'TUI' -> 'ZUI' rule exist? (PPRL has rule, but I can't
4945
        think of a German word with '-tui-' in it.)
4946
    - Should we really change 'SCH' -> 'CH' and then 'CH' -> 'SCH'?
4947
4948
    :param str word: the word to transform
    :returns: the Reth-Schek Phonetik code
    :rtype: str
4950
    """
4951
    replacements = {3: {'AEH': 'E', 'IEH': 'I', 'OEH': 'OE', 'UEH': 'UE',
4952
                        'SCH': 'CH', 'ZIO': 'TIO', 'TIU': 'TIO', 'ZIU': 'TIO',
4953
                        'CHS': 'X', 'CKS': 'X', 'AEU': 'OI'},
4954
                    2: {'LL': 'L', 'AA': 'A', 'AH': 'A', 'BB': 'B', 'PP': 'B',
4955
                        'BP': 'B', 'PB': 'B', 'DD': 'D', 'DT': 'D', 'TT': 'D',
4956
                        'TH': 'D', 'EE': 'E', 'EH': 'E', 'AE': 'E', 'FF': 'F',
4957
                        'PH': 'F', 'KK': 'K', 'GG': 'G', 'GK': 'G', 'KG': 'G',
4958
                        'CK': 'G', 'CC': 'C', 'IE': 'I', 'IH': 'I', 'MM': 'M',
4959
                        'NN': 'N', 'OO': 'O', 'OH': 'O', 'SZ': 'S', 'UH': 'U',
4960
                        'GS': 'X', 'KS': 'X', 'TZ': 'Z', 'AY': 'AI',
4961
                        'EI': 'AI', 'EY': 'AI', 'EU': 'OI', 'RR': 'R',
4962
                        'SS': 'S', 'KW': 'QU'},
4963
                    1: {'P': 'B', 'T': 'D', 'V': 'F', 'W': 'F', 'C': 'G',
4964
                        'K': 'G', 'Y': 'I'}}
4965
4966
    # Uppercase
4967
    word = word.upper()
4968
4969
    # Replace umlauts/eszett
4970
    word = word.replace('Ä', 'AE')
4971
    word = word.replace('Ö', 'OE')
4972
    word = word.replace('Ü', 'UE')
4973
    word = word.replace('ß', 'SS')
4974
4975
    # Main loop, using above replacements table
4976
    pos = 0
4977
    while pos < len(word):
4978
        for num in range(3, 0, -1):
4979
            if word[pos:pos+num] in replacements[num]:
4980
                word = (word[:pos] + replacements[num][word[pos:pos+num]]
4981
                        + word[pos+num:])
4982
                pos += 1
4983
                break
4984
        else:
4985
            pos += 1  # Advance if nothing is recognized
4986
4987
    # Change 'CH' back(?) to 'SCH'
4988
    word = word.replace('CH', 'SCH')
4989
4990
    # Replace final sequences
4991
    if word[-2:] == 'ER':
4992
        word = word[:-2]+'R'
4993
    elif word[-2:] == 'EL':
4994
        word = word[:-2]+'L'
4995
    elif word[-1] == 'H':
4996
        word = word[:-1]
4997
4998
    return word
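

# A minimal usage sketch for reth_schek_phonetik() (illustrative only; the
# helper below is hypothetical, the sample surnames are arbitrary, and no
# particular codes are asserted).
def _reth_schek_sketch():
    """Map a few German surname spellings to their Reth-Schek codes."""
    return {name: reth_schek_phonetik(name)
            for name in ('Meyer', 'Maier', 'Schmidt', 'Schmitt')}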
4999
5000
5001
def fonem(word):
5002
    """Return the FONEM code of a word.
5003
5004
    FONEM is a phonetic algorithm designed for French (particularly surnames in
5005
    Saguenay, Canada), defined in:
5006
    Bouchard, Gérard, Patrick Brard, and Yolande Lavoie. 1981. "FONEM: Un code
5007
    de transcription phonétique pour la reconstitution automatique des
5008
    familles saguenayennes." Population. 36(6). 1085--1103.
5009
    https://doi.org/10.2307/1532326
5010
    http://www.persee.fr/doc/pop_0032-4663_1981_num_36_6_17248
5011
5012
    Guillaume Plique's Javascript implementation at
5013
    https://github.com/Yomguithereal/talisman/blob/master/src/phonetics/french/fonem.js
5014
    was also consulted for this implementation.
5015
5016
    :param str word: the word to transform
5017
    :returns: the FONEM code
5018
    :rtype: str
5019
    """
5020
    # I don't see a sane way of doing this without regexps :(
5021
    rule_table = {
5022
        # Vowels & groups of vowels
5023
        'V-1':     (re.compile('E?AU'), 'O'),
5024
        'V-2,5':   (re.compile('(E?AU|O)L[TX]$'), 'O'),
5025
        'V-3,4':   (re.compile('E?AU[TX]$'), 'O'),
5026
        'V-6':     (re.compile('E?AUL?D$'), 'O'),
5027
        'V-7':     (re.compile(r'(?<!G)AY$'), 'E'),
5028
        'V-8':     (re.compile('EUX$'), 'EU'),
5029
        'V-9':     (re.compile('EY(?=$|[BCDFGHJKLMNPQRSTVWXZ])'), 'E'),
5030
        'V-10':    ('Y', 'I'),
5031
        'V-11':    (re.compile('(?<=[AEIOUY])I(?=[AEIOUY])'), 'Y'),
5032
        'V-12':    (re.compile('(?<=[AEIOUY])ILL'), 'Y'),
5033
        'V-13':    (re.compile('OU(?=[AEOU]|I(?!LL))'), 'W'),
5034
        'V-14':    (re.compile(r'([AEIOUY])(?=\1)'), ''),
5035
        # Nasal vowels
5036
        'V-15':    (re.compile('[AE]M(?=[BCDFGHJKLMPQRSTVWXZ])(?!$)'), 'EN'),
5037
        'V-16':    (re.compile('OM(?=[BCDFGHJKLMPQRSTVWXZ])'), 'ON'),
5038
        'V-17':    (re.compile('AN(?=[BCDFGHJKLMNPQRSTVWXZ])'), 'EN'),
5039
        'V-18':    (re.compile('(AI[MN]|EIN)(?=[BCDFGHJKLMNPQRSTVWXZ]|$)'),
5040
                    'IN'),
5041
        'V-19':    (re.compile('B(O|U|OU)RNE?$'), 'BURN'),
5042
        'V-20':    (re.compile('(^IM|(?<=[BCDFGHJKLMNPQRSTVWXZ])IM(?=[BCDFGHJKLMPQRSTVWXZ]))'),
5043
                    'IN'),
5044
        # Consonants and groups of consonants
5045
        'C-1':     ('BV', 'V'),
5046
        'C-2':     (re.compile('(?<=[AEIOUY])C(?=[EIY])'), 'SS'),
5047
        'C-3':     (re.compile('(?<=[BDFGHJKLMNPQRSTVWZ])C(?=[EIY])'), 'S'),
5048
        'C-4':     (re.compile('^C(?=[EIY])'), 'S'),
5049
        'C-5':     (re.compile('^C(?=[OUA])'), 'K'),
5050
        'C-6':     (re.compile('(?<=[AEIOUY])C$'), 'K'),
5051
        'C-7':     (re.compile('C(?=[BDFGJKLMNPQRSTVWXZ])'), 'K'),
5052
        'C-8':     (re.compile('CC(?=[AOU])'), 'K'),
5053
        'C-9':     (re.compile('CC(?=[EIY])'), 'X'),
5054
        'C-10':    (re.compile('G(?=[EIY])'), 'J'),
5055
        'C-11':    (re.compile('GA(?=I?[MN])'), 'G#'),
5056
        'C-12':    (re.compile('GE(O|AU)'), 'JO'),
5057
        'C-13':    (re.compile('GNI(?=[AEIOUY])'), 'GN'),
5058
        'C-14':    (re.compile('(?<![PCS])H'), ''),
5059
        'C-15':    ('JEA', 'JA'),
5060
        'C-16':    (re.compile('^MAC(?=[BCDFGHJKLMNPQRSTVWXZ])'), 'MA#'),
5061
        'C-17':    (re.compile('^MC'), 'MA#'),
5062
        'C-18':    ('PH', 'F'),
5063
        'C-19':    ('QU', 'K'),
5064
        'C-20':    (re.compile('^SC(?=[EIY])'), 'S'),
5065
        'C-21':    (re.compile('(?<=.)SC(?=[EIY])'), 'SS'),
5066
        'C-22':    (re.compile('(?<=.)SC(?=[AOU])'), 'SK'),
5067
        'C-23':    ('SH', 'CH'),
5068
        'C-24':    (re.compile('TIA$'), 'SSIA'),
5069
        'C-25':    (re.compile('(?<=[AIOUY])W'), ''),
5070
        'C-26':    (re.compile('X[CSZ]'), 'X'),
5071
        'C-27':    (re.compile('(?<=[AEIOUY])Z|(?<=[BCDFGHJKLMNPQRSTVWXZ])Z(?=[BCDFGHJKLMNPQRSTVWXZ])'), 'S'),
5072
        'C-28':    (re.compile(r'([BDFGHJKMNPQRTVWXZ])\1'), r'\1'),
5073
        'C-28a':   (re.compile('CC(?=[BCDFGHJKLMNPQRSTVWXZ]|$)'), 'C'),
5074
        'C-28b':   (re.compile('((?<=[BCDFGHJKLMNPQRSTVWXZ])|^)SS'), 'S'),
5075
        'C-28bb':  (re.compile('SS(?=[BCDFGHJKLMNPQRSTVWXZ]|$)'), 'S'),
5076
        'C-28c':   (re.compile('((?<=[^I])|^)LL'), 'L'),
5077
        'C-28d':   (re.compile('ILE$'), 'ILLE'),
5078
        'C-29':    (re.compile('(ILS|[CS]H|[MN]P|R[CFKLNSX])$|([BCDFGHJKLMNPQRSTVWXZ])[BCDFGHJKLMNPQRSTVWXZ]$'), r'\1\2'),
5079
        'C-30,32': (re.compile('^(SA?INT?|SEI[NM]|CINQ?|ST)(?!E)-?'), 'ST-'),
5080
        'C-31,33': (re.compile('^(SAINTE|STE)-?'), 'STE-'),
5081
        # Rules to undo rule bleeding prevention in C-11, C-16, C-17
5082
        'C-34':    ('G#', 'GA'),
5083
        'C-35':    ('MA#', 'MAC')
5084
    }
5085
    rule_order = [
5086
        'V-14', 'C-28', 'C-28a', 'C-28b', 'C-28bb', 'C-28c', 'C-28d',
5087
        'C-12',
5088
        'C-8', 'C-9', 'C-10',
5089
        'C-16', 'C-17', 'C-2', 'C-3', 'C-7',
5090
        'V-2,5', 'V-3,4', 'V-6',
5091
        'V-1', 'C-14',
5092
        'C-31,33', 'C-30,32',
5093
        'C-11', 'V-15', 'V-17', 'V-18',
5094
        'V-7', 'V-8', 'V-9', 'V-10', 'V-11', 'V-12', 'V-13', 'V-16',
5095
        'V-19', 'V-20',
5096
        'C-1', 'C-4', 'C-5', 'C-6', 'C-13', 'C-15',
5097
        'C-18', 'C-19', 'C-20', 'C-21', 'C-22', 'C-23', 'C-24',
5098
        'C-25', 'C-26', 'C-27',
5099
        'C-29',
5100
        'V-14', 'C-28', 'C-28a', 'C-28b', 'C-28bb', 'C-28c', 'C-28d',
5101
        'C-34', 'C-35'
5102
    ]
5103
5104
    # normalize, upper-case, and filter non-French letters
5105
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
5106
    word = word.translate({198: 'AE', 338: 'OE'})
5107
    word = ''.join(c for c in word if c in
5108
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
5109
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
5110
                    'Y', 'Z', '-'})
5111
5112
    for rule in rule_order:
5113
        regex, repl = rule_table[rule]
5114
        if isinstance(regex, text_type):
5115
            word = word.replace(regex, repl)
5116
        else:
5117
            word = regex.sub(repl, word)
5118
        # print(rule, word)
5119
5120
    return word
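

# A minimal usage sketch for fonem() (illustrative only; the helper below is
# hypothetical, the sample surnames are arbitrary, and no particular codes
# are asserted).
def _fonem_sketch():
    """Map a few French surname spellings to their FONEM codes."""
    return {name: fonem(name)
            for name in ('Beaulieu', 'Gauthier', 'Tremblay', 'Gagne')}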
5121
5122
5123
def parmar_kumbharana(word):
5124
    """Return the Parmar-Kumbharana encoding of a word.
5125
5126
    This is based on the phonetic algorithm proposed in
5127
    Parmar, Vimal P. and CK Kumbharana. 2014. "Study Existing Various Phonetic
5128
    Algorithms and Designing and Development of a working model for the New
5129
    Developed Algorithm and Comparison by implementing it with Existing
5130
    Algorithm(s)." International Journal of Computer Applications. 98(19).
5131
    https://doi.org/10.5120/17295-7795
5132
5133
    :param str word: the word to transform
    :returns: the Parmar-Kumbharana encoding
    :rtype: str
5135
    """
5136
    rule_table = {4: {'OUGH': 'F'},
5137
                  3: {'DGE': 'J',
5138
                      'OUL': 'U',
5139
                      'GHT': 'T'},
5140
                  2: {'CE': 'S', 'CI': 'S', 'CY': 'S',
5141
                      'GE': 'J', 'GI': 'J', 'GY': 'J',
5142
                      'WR': 'R',
5143
                      'GN': 'N', 'KN': 'N', 'PN': 'N',
5144
                      'CK': 'K',
5145
                      'SH': 'S'}}
5146
    vowel_trans = {65: '', 69: '', 73: '', 79: '', 85: '', 89: ''}
5147
5148
    word = word.upper()  # Rule 3
5149
    word = _delete_consecutive_repeats(word)  # Rule 4
5150
5151
    # Rule 5
5152
    i = 0
5153
    while i < len(word):
5154
        for match_len in range(4, 1, -1):
            if word[i:i+match_len] in rule_table[match_len]:
                repl = rule_table[match_len][word[i:i+match_len]]
                word = (word[:i] + repl + word[i+match_len:])
                i += len(repl)
                break
        else:  # only advance by one when no rule matched at this position
            i += 1
5161
5162
    word = word[0]+word[1:].translate(vowel_trans)  # Rule 6
5163
    return word
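

# A minimal usage sketch for parmar_kumbharana() (illustrative only; the
# helper below is hypothetical and no particular encodings are asserted).
def _parmar_kumbharana_sketch():
    """Map a few sample words to their Parmar-Kumbharana encodings."""
    return {name: parmar_kumbharana(name)
            for name in ('Knight', 'Night', 'Wright', 'Write')}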
5164
5165
5166
def davidson(lname, fname='.', omit_fname=False):
5167
    """Return Davidson's Consonant Code
5168
5169
    This is based on the name compression system described in:
5170
    Davidson, Leon. 1962. "Retrieval of Misspelled Names in an Airline
5171
    Passenger Record System." Communications of the ACM. 5(3). 169--171.
5172
    https://dl.acm.org/citation.cfm?id=366913
5173
5174
    :param str lname: Last name (or word) to be encoded
5175
    :param str fname: First name (optional), of which the first character is
5176
        included in the code.
5177
    :param str omit_fname: Set to True to completely omit the first character
5178
        of the first name
5179
    :return: Davidson's Consonant Code
5180
    """
5181
    trans = {65: '', 69: '', 73: '', 79: '', 85: '', 72: '', 87: '', 89: ''}
5182
5183
    lname = lname.upper()
5184
    code = _delete_consecutive_repeats(lname[:1] + lname[1:].translate(trans))
5185
    code = code[:4] + (4-len(code))*' '
5186
5187
    if not omit_fname:
5188
        code += fname[:1].upper()
5189
5190
    return code
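

# A minimal usage sketch for davidson() (illustrative only; the helper below
# is hypothetical and no particular codes are asserted). The last name
# supplies the four-character consonant code; the first name contributes only
# its initial.
def _davidson_sketch():
    """Return Davidson codes for a couple of sample (last, first) names."""
    return [davidson(last, first)
            for last, first in (('Smith', 'John'), ('Smythe', 'Jon'))]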
5191
5192
5193
def bmpm(word, language_arg=0, name_mode='gen', match_mode='approx',
5194
         concat=False, filter_langs=False):
5195
    """Return the Beider-Morse Phonetic Matching algorithm code for a word.
5196
5197
    The Beider-Morse Phonetic Matching algorithm is described at:
5198
    http://stevemorse.org/phonetics/bmpm.htm
5199
    The reference implementation is licensed under GPLv3 and available at:
5200
    http://stevemorse.org/phoneticinfo.htm
5201
5202
    :param str word: the word to transform
5203
    :param str language_arg: the language of the term; supported values
5204
        include:
5205
5206
            - 'any'
5207
            - 'arabic'
5208
            - 'cyrillic'
5209
            - 'czech'
5210
            - 'dutch'
5211
            - 'english'
5212
            - 'french'
5213
            - 'german'
5214
            - 'greek'
5215
            - 'greeklatin'
5216
            - 'hebrew'
5217
            - 'hungarian'
5218
            - 'italian'
5219
            - 'polish'
5220
            - 'portuguese'
5221
            - 'romanian'
5222
            - 'russian'
5223
            - 'spanish'
5224
            - 'turkish'
5225
            - 'germandjsg'
5226
            - 'polishdjskp'
5227
            - 'russiandjsre'
5228
5229
    :param str name_mode: the name mode of the algorithm:
5230
5231
            - 'gen' -- general (default)
5232
            - 'ash' -- Ashkenazi
5233
            - 'sep' -- Sephardic
5234
5235
    :param str match_mode: matching mode: 'approx' or 'exact'
5236
    :param bool concat: concatenation mode
5237
    :param bool filter_langs: filter out incompatible languages
5238
    :returns: the BMPM value(s)
5239
    :rtype: tuple
5240
5241
    >>> bmpm('Christopher')
5242
    'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
5243
    xristYfir xristopi xritopir xritopi xristofi xritofir xritofi tzristopir
5244
    tzristofir zristopir zristopi zritopir zritopi zristofir zristofi zritofir
5245
    zritofi'
5246
    >>> bmpm('Niall')
5247
    'nial niol'
5248
    >>> bmpm('Smith')
5249
    'zmit'
5250
    >>> bmpm('Schmidt')
5251
    'zmit stzmit'
5252
5253
    >>> bmpm('Christopher', language_arg='German')
5254
    'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
5255
    xristYfir'
5256
    >>> bmpm('Christopher', language_arg='English')
5257
    'tzristofir tzrQstofir tzristafir tzrQstafir xristofir xrQstofir xristafir
5258
    xrQstafir'
5259
    >>> bmpm('Christopher', language_arg='German', name_mode='ash')
5260
    'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
5261
    xristYfir'
5262
5263
    >>> bmpm('Christopher', language_arg='German', match_mode='exact')
5264
    'xriStopher xriStofer xristopher xristofer'
5265
    """
5266
    return _bmpm(word, language_arg, name_mode, match_mode,
5267
                 concat, filter_langs)
5268
5269
5270
if __name__ == '__main__':
5271
    import doctest
5272
    doctest.testmod()
5273