Completed
Push — master ( 76e221...14a933 )
by Chris
08:59
created

abydos.phonetic.haase_phonetik()   F

Complexity
    Conditions: 31

Size
    Total Lines: 112
    Code Lines: 82

Duplication
    Lines: 44
    Ratio: 39.29 %

Importance
    Changes: 0

Metric  Value
cc      31
eloc    82
nop     1
dl      44
loc     112
rs      0
c       0
b       0
f       0

How to fix

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include Extract Method.

Complexity

Complex code such as abydos.phonetic.haase_phonetik() often does a lot of different things. To break it down, we need to identify a cohesive component within it. A common way to find such a component is to look for fields/methods that share the same prefixes or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
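
To illustrate the prefix heuristic with names from the listing below: russell_index(), russell_index_num_to_alpha() and russell_index_alpha() share the prefix `russell_index`, so they are one candidate for Extract Class. The following is only a sketch under stated assumptions: the class name `RussellIndex` and its method names are hypothetical, Unicode normalization is left out for brevity, Python 3 is assumed, and an empty result is returned as `None` rather than the NaN used in the reviewed code.

```python
from itertools import groupby


class RussellIndex(object):
    """Hypothetical class grouping the three russell_index* functions."""

    _letters = 'ABCDEFGIKLMNOPQRSTUVXYZ'
    _to_num = str.maketrans(_letters, '12341231356712383412313')
    _to_alpha = str.maketrans('12345678', 'ABCDLMNR')

    def encode(self, word):
        """Return the integer Russell Index (simplified: no NFKD step)."""
        word = word.upper().replace('GH', '').rstrip('SZ')
        word = ''.join(c for c in word if c in self._letters)
        code = word.translate(self._to_num)
        # Keep the first '1' and drop every later one.
        first_one = code.find('1') + 1
        if first_one:
            code = code[:first_one] + code[first_one:].replace('1', '')
        # Collapse consecutive repeats.
        code = ''.join(ch for ch, _ in groupby(code))
        return int(code) if code else None

    def to_alpha(self, num):
        """Convert an integer Russell Index to its alphabetic form."""
        digits = ''.join(c for c in str(num) if c in '12345678')
        return digits.translate(self._to_alpha)

    def encode_alpha(self, word):
        """Return the alphabetic Russell Index of a word."""
        num = self.encode(word)
        return self.to_alpha(num) if num is not None else ''
```

With this grouping the translation tables are built once at class creation instead of on every call, and the shared prefix disappears from the method names.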

# -*- coding: utf-8 -*-
# [issue: coding-style] Too many lines in module (5044/1000)

# Copyright 2014-2018 by Christopher C. Little.
# This file is part of Abydos.
#
# Abydos is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Abydos is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.

"""abydos.phonetic.

The phonetic module implements phonetic algorithms including:

    - Robert C. Russell's Index
    - American Soundex
    - Refined Soundex
    - Daitch-Mokotoff Soundex
    - Kölner Phonetik
    - NYSIIS
    - Match Rating Algorithm
    - Metaphone
    - Double Metaphone
    - Caverphone
    - Alpha Search Inquiry System
    - Fuzzy Soundex
    - Phonex
    - Phonem
    - Phonix
    - SfinxBis
    - phonet
    - Standardized Phonetic Frequency Code
    - Statistics Canada
    - Lein
    - Roger Root
    - Oxford Name Compression Algorithm (ONCA)
    - Eudex phonetic hash
    - Beider-Morse Phonetic Matching
"""

from __future__ import division, unicode_literals

import unicodedata
from collections import Counter
from itertools import groupby

from six import text_type
from six.moves import range

from ._bm import _bmpm

_INFINITY = float('inf')


def _delete_consecutive_repeats(word):
    """Delete consecutive repeated characters in a word.

    :param str word: the word to transform
    :returns: word with consecutive repeating characters collapsed to
        a single instance
    :rtype: str
    """
    return ''.join(char for char, _ in groupby(word))


def russell_index(word):
    """Return the Russell Index (integer output) of a word.

    This follows Robert C. Russell's Index algorithm, as described in
    US Patent 1,261,167 (1917)

    :param str word: the word to transform
    :returns: the Russell Index value
    :rtype: int

    >>> russell_index('Christopher')
    3813428
    >>> russell_index('Niall')
    715
    >>> russell_index('Smith')
    3614
    >>> russell_index('Schmidt')
    3614
    """
    _russell_translation = dict(zip((ord(_) for _ in
                                     # [issue: Comprehensibility Best Practice] The variable _ does not seem to be defined.
                                     'ABCDEFGIKLMNOPQRSTUVXYZ'),
                                    '12341231356712383412313'))

    word = unicodedata.normalize('NFKD', text_type(word.upper()))
    word = word.replace('ß', 'SS')
    word = word.replace('GH', '')  # discard gh (rule 3)
    word = word.rstrip('SZ')  # discard /[sz]$/ (rule 3)

    # translate according to Russell's mapping
    word = ''.join(c for c in word if c in
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'I', 'K', 'L', 'M', 'N',
                    'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Y', 'Z'})
    sdx = word.translate(_russell_translation)

    # remove any 1s after the first occurrence
    one = sdx.find('1')+1
    if one:
        sdx = sdx[:one] + ''.join(c for c in sdx[one:] if c != '1')

    # remove repeating characters
    sdx = _delete_consecutive_repeats(sdx)

    # return as an int
    return int(sdx) if sdx else float('NaN')


def russell_index_num_to_alpha(num):
    """Convert the Russell Index integer to an alphabetic string.

    This follows Robert C. Russell's Index algorithm, as described in
    US Patent 1,261,167 (1917)

    :param int num: a Russell Index integer value
    :returns: the Russell Index as an alphabetic string
    :rtype: str

    >>> russell_index_num_to_alpha(3813428)
    'CRACDBR'
    >>> russell_index_num_to_alpha(715)
    'NAL'
    >>> russell_index_num_to_alpha(3614)
    'CMAD'
    """
    _russell_num_translation = dict(zip((ord(_) for _ in '12345678'),
                                        # [issue: Comprehensibility Best Practice] The variable _ does not seem to be defined.
                                        'ABCDLMNR'))
    num = ''.join(c for c in text_type(num) if c in {'1', '2', '3', '4', '5',
                                                     '6', '7', '8'})
    if num:
        return num.translate(_russell_num_translation)
    return ''


def russell_index_alpha(word):
    """Return the Russell Index (alphabetic output) for the word.

    This follows Robert C. Russell's Index algorithm, as described in
    US Patent 1,261,167 (1917)

    :param str word: the word to transform
    :returns: the Russell Index value as an alphabetic string
    :rtype: str

    >>> russell_index_alpha('Christopher')
    'CRACDBR'
    >>> russell_index_alpha('Niall')
    'NAL'
    >>> russell_index_alpha('Smith')
    'CMAD'
    >>> russell_index_alpha('Schmidt')
    'CMAD'
    """
    if word:
        return russell_index_num_to_alpha(russell_index(word))
    return ''


def soundex(word, maxlength=4, var='American', reverse=False, zero_pad=True):
    """Return the Soundex code for a word.

    :param str word: the word to transform
    :param int maxlength: the length of the code returned (defaults to 4)
    :param str var: the variant of the algorithm to employ (defaults to
        'American'):

        - 'American' follows the American Soundex algorithm, as described at
          http://www.archives.gov/publications/general-info-leaflets/55-census.html
          and in Knuth(1998:394); this is also called Miracode
        - 'special' follows the rules from the 1880-1910 US Census
          retrospective re-analysis, in which h & w are not treated as blocking
          consonants but as vowels.
          Cf. http://creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm
        - 'dm' computes the Daitch-Mokotoff Soundex

    :param bool reverse: reverse the word before computing the selected Soundex
        (defaults to False); This results in "Reverse Soundex"
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
        maxlength string
    :returns: the Soundex value
    :rtype: str

    >>> soundex("Christopher")
    'C623'
    >>> soundex("Niall")
    'N400'
    >>> soundex('Smith')
    'S530'
    >>> soundex('Schmidt')
    'S530'


    >>> soundex('Christopher', maxlength=_INFINITY)
    'C623160000000000000000000000000000000000000000000000000000000000'
    >>> soundex('Christopher', maxlength=_INFINITY, zero_pad=False)
    'C62316'

    >>> soundex('Christopher', reverse=True)
    'R132'

    >>> soundex('Ashcroft')
    'A261'
    >>> soundex('Asicroft')
    'A226'
    >>> soundex('Ashcroft', var='special')
    'A226'
    >>> soundex('Asicroft', var='special')
    'A226'

    >>> soundex('Christopher', var='dm')
    {'494379', '594379'}
    >>> soundex('Niall', var='dm')
    {'680000'}
    >>> soundex('Smith', var='dm')
    {'463000'}
    >>> soundex('Schmidt', var='dm')
    {'463000'}
    """
    _soundex_translation = dict(zip((ord(_) for _ in
                                     # [issue: Comprehensibility Best Practice] The variable _ does not seem to be defined.
                                     'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
                                    '01230129022455012623019202'))

    # Call the D-M Soundex function itself if requested
    if var == 'dm':
        # [issue: unused-code] Unnecessary "elif" after "return"
        return dm_soundex(word, maxlength, reverse, zero_pad)
    elif var == 'refined':
        return refined_soundex(word, maxlength, reverse, zero_pad)

    # Require a maxlength of at least 4 and not more than 64
    if maxlength is not None:
        maxlength = min(max(4, maxlength), 64)
    else:
        maxlength = 64

    # uppercase, normalize, decompose, and filter non-A-Z out
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
    word = word.replace('ß', 'SS')
    word = ''.join(c for c in word if c in
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
                    'Y', 'Z'})

    # Nothing to convert, return base case
    if not word:
        if zero_pad:
            return '0'*maxlength
        return '0'

    # Reverse word if computing Reverse Soundex
    if reverse:
        word = word[::-1]

    # apply the Soundex algorithm
    sdx = word.translate(_soundex_translation)

    if var == 'special':
        sdx = sdx.replace('9', '0')  # special rule for 1880-1910 census
    else:
        sdx = sdx.replace('9', '')  # rule 1
    sdx = _delete_consecutive_repeats(sdx)  # rule 3

    if word[0] in 'HW':
        sdx = word[0] + sdx
    else:
        sdx = word[0] + sdx[1:]
    sdx = sdx.replace('0', '')  # rule 1

    if zero_pad:
        sdx += ('0'*maxlength)  # rule 4

    return sdx[:maxlength]
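
The docstring examples 'Ashcroft' -> 'A261' (American) and 'A226' ('special') differ only in how H and W (code 9) are treated. The following sketch replays the steps of soundex() for an already uppercased A-Z word and records each intermediate string. The function name `soundex_stages` is hypothetical, the fixed length of 4 is an assumption matching the default, and Python 3 is assumed.

```python
from itertools import groupby

# Same letter-to-digit table as in soundex(); 9 marks H and W.
_CODES = dict(zip('ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                  '01230129022455012623019202'))


def soundex_stages(word, special=False):
    """Return the intermediate strings produced while coding word.

    Hypothetical tracing helper: word must be non-empty, uppercase A-Z.
    """
    stages = []
    sdx = ''.join(_CODES[c] for c in word)
    stages.append(sdx)                          # raw translation
    sdx = sdx.replace('9', '0' if special else '')
    stages.append(sdx)                          # H/W handled
    sdx = ''.join(ch for ch, _ in groupby(sdx))
    stages.append(sdx)                          # repeats collapsed
    if word[0] in 'HW':
        sdx = word[0] + sdx
    else:
        sdx = word[0] + sdx[1:]
    sdx = sdx.replace('0', '')
    stages.append(sdx)                          # first letter kept, vowels dropped
    stages.append((sdx + '0000')[:4])           # zero-padded to length 4
    return stages
```

In the American variant the 9 for H is deleted before repeats are collapsed, so the S and C codes (both 2) merge; in the 'special' variant the 9 becomes a 0 that keeps them apart, which is why the second 2 survives.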


def refined_soundex(word, maxlength=_INFINITY, reverse=False, zero_pad=False):
    # [issue: Unused Code] The argument zero_pad seems to be unused.
    """Return the Refined Soundex code for a word.

    This is Soundex, but with more character classes. It appears to have been
    defined by the Apache Commons:
    https://commons.apache.org/proper/commons-codec/apidocs/src-html/org/apache/commons/codec/language/RefinedSoundex.html

    :param word: the word to transform
    :param maxlength: the length of the code returned (defaults to unlimited)
    :param reverse: reverse the word before computing the selected Soundex
        (defaults to False); This results in "Reverse Soundex"
    :param zero_pad: pad the end of the return value with 0s to achieve a
        maxlength string
    :returns: the Refined Soundex value
    :rtype: str

    >>> refined_soundex('Christopher')
    'C3090360109'
    >>> refined_soundex('Niall')
    'N807'
    >>> refined_soundex('Smith')
    'S38060'
    >>> refined_soundex('Schmidt')
    'S30806'
    """
    _ref_soundex_translation = dict(zip((ord(_) for _ in
                                         # [issue: Comprehensibility Best Practice] The variable _ does not seem to be defined.
                                         'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
                                        '01360240043788015936020505'))

    # uppercase, normalize, decompose, and filter non-A-Z out
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
    word = word.replace('ß', 'SS')
    word = ''.join(c for c in word if c in
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
                    'Y', 'Z'})

    # Reverse word if computing Reverse Soundex
    if reverse:
        word = word[::-1]

    # apply the Soundex algorithm
    sdx = word[0] + word.translate(_ref_soundex_translation)
    sdx = _delete_consecutive_repeats(sdx)

    if maxlength and maxlength < _INFINITY:
        sdx = sdx[:maxlength]
        sdx += ('0' * maxlength)  # rule 4

    return sdx
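
The reviewer's note that zero_pad is unused points at the final block of refined_soundex(): with a finite maxlength the code is trimmed first and then always padded, so the result ends up longer than maxlength whether or not padding was requested. One possible way to honor the parameter is sketched below; the helper name `pad_and_trim` is hypothetical and this is not taken from the project.

```python
_INFINITY = float('inf')


def pad_and_trim(code, maxlength=_INFINITY, zero_pad=False):
    """Limit code to maxlength, zero-padding first only when requested.

    Hypothetical helper sketching how zero_pad could take effect.
    """
    if maxlength and maxlength < _INFINITY:
        if zero_pad:
            code += '0' * maxlength   # pad before trimming
        code = code[:maxlength]       # then cut to the requested length
    return code
```

For comparison, the listing's block turns 'N807' with maxlength=6 into 'N807000000' (ten characters), while this sketch yields 'N80700' with zero_pad=True and 'N807' without it.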


def dm_soundex(word, maxlength=6, reverse=False, zero_pad=True):
    """Return the Daitch-Mokotoff Soundex code for a word.

    Returns values of a word as a set. A collection is necessary since there
    can be multiple values for a single word.

    :param word: the word to transform
    :param maxlength: the length of the code returned (defaults to 6)
    :param reverse: reverse the word before computing the selected Soundex
        (defaults to False); This results in "Reverse Soundex"
    :param zero_pad: pad the end of the return value with 0s to achieve a
        maxlength string
    :returns: the Daitch-Mokotoff Soundex value
    :rtype: set

    >>> dm_soundex('Christopher')
    {'494379', '594379'}
    >>> dm_soundex('Niall')
    {'680000'}
    >>> dm_soundex('Smith')
    {'463000'}
    >>> dm_soundex('Schmidt')
    {'463000'}

    >>> dm_soundex('The quick brown fox', maxlength=20, zero_pad=False)
    {'35457976754', '3557976754'}
    """
    _dms_table = {'STCH': (2, 4, 4), 'DRZ': (4, 4, 4), 'ZH': (4, 4, 4),
                  'ZHDZH': (2, 4, 4), 'DZH': (4, 4, 4), 'DRS': (4, 4, 4),
                  'DZS': (4, 4, 4), 'SCHTCH': (2, 4, 4), 'SHTSH': (2, 4, 4),
                  'SZCZ': (2, 4, 4), 'TZS': (4, 4, 4), 'SZCS': (2, 4, 4),
                  'STSH': (2, 4, 4), 'SHCH': (2, 4, 4), 'D': (3, 3, 3),
                  'H': (5, 5, '_'), 'TTSCH': (4, 4, 4), 'THS': (4, 4, 4),
                  'L': (8, 8, 8), 'P': (7, 7, 7), 'CHS': (5, 54, 54),
                  'T': (3, 3, 3), 'X': (5, 54, 54), 'OJ': (0, 1, '_'),
                  'OI': (0, 1, '_'), 'SCHTSH': (2, 4, 4), 'OY': (0, 1, '_'),
                  'Y': (1, '_', '_'), 'TSH': (4, 4, 4), 'ZDZ': (2, 4, 4),
                  'TSZ': (4, 4, 4), 'SHT': (2, 43, 43), 'SCHTSCH': (2, 4, 4),
                  'TTSZ': (4, 4, 4), 'TTZ': (4, 4, 4), 'SCH': (4, 4, 4),
                  'TTS': (4, 4, 4), 'SZD': (2, 43, 43), 'AI': (0, 1, '_'),
                  'PF': (7, 7, 7), 'TCH': (4, 4, 4), 'PH': (7, 7, 7),
                  'TTCH': (4, 4, 4), 'SZT': (2, 43, 43), 'ZDZH': (2, 4, 4),
                  'EI': (0, 1, '_'), 'G': (5, 5, 5), 'EJ': (0, 1, '_'),
                  'ZD': (2, 43, 43), 'IU': (1, '_', '_'), 'K': (5, 5, 5),
                  'O': (0, '_', '_'), 'SHTCH': (2, 4, 4), 'S': (4, 4, 4),
                  'TRZ': (4, 4, 4), 'SHD': (2, 43, 43), 'DSH': (4, 4, 4),
                  'CSZ': (4, 4, 4), 'EU': (1, 1, '_'), 'TRS': (4, 4, 4),
                  'ZS': (4, 4, 4), 'STRZ': (2, 4, 4), 'UY': (0, 1, '_'),
                  'STRS': (2, 4, 4), 'CZS': (4, 4, 4),
                  'MN': ('6_6', '6_6', '6_6'), 'UI': (0, 1, '_'),
                  'UJ': (0, 1, '_'), 'UE': (0, '_', '_'), 'EY': (0, 1, '_'),
                  'W': (7, 7, 7), 'IA': (1, '_', '_'), 'FB': (7, 7, 7),
                  'STSCH': (2, 4, 4), 'SCHT': (2, 43, 43),
                  'NM': ('6_6', '6_6', '6_6'), 'SCHD': (2, 43, 43),
                  'B': (7, 7, 7), 'DSZ': (4, 4, 4), 'F': (7, 7, 7),
                  'N': (6, 6, 6), 'CZ': (4, 4, 4), 'R': (9, 9, 9),
                  'U': (0, '_', '_'), 'V': (7, 7, 7), 'CS': (4, 4, 4),
                  'Z': (4, 4, 4), 'SZ': (4, 4, 4), 'TSCH': (4, 4, 4),
                  'KH': (5, 5, 5), 'ST': (2, 43, 43), 'KS': (5, 54, 54),
                  'SH': (4, 4, 4), 'SC': (2, 4, 4), 'SD': (2, 43, 43),
                  'DZ': (4, 4, 4), 'ZHD': (2, 43, 43), 'DT': (3, 3, 3),
                  'ZSH': (4, 4, 4), 'DS': (4, 4, 4), 'TZ': (4, 4, 4),
                  'TS': (4, 4, 4), 'TH': (3, 3, 3), 'TC': (4, 4, 4),
                  'A': (0, '_', '_'), 'E': (0, '_', '_'), 'I': (0, '_', '_'),
                  'AJ': (0, 1, '_'), 'M': (6, 6, 6), 'Q': (5, 5, 5),
                  'AU': (0, 7, '_'), 'IO': (1, '_', '_'), 'AY': (0, 1, '_'),
                  'IE': (1, '_', '_'), 'ZSCH': (4, 4, 4),
                  'CH': ((5, 4), (5, 4), (5, 4)),
                  'CK': ((5, 45), (5, 45), (5, 45)),
                  'C': ((5, 4), (5, 4), (5, 4)),
                  'J': ((1, 4), ('_', 4), ('_', 4)),
                  'RZ': ((94, 4), (94, 4), (94, 4)),
                  'RS': ((94, 4), (94, 4), (94, 4))}

    _dms_order = {'A': ('AI', 'AJ', 'AU', 'AY', 'A'),
                  'B': ('B'),
                  'C': ('CHS', 'CSZ', 'CZS', 'CH', 'CK', 'CS', 'CZ', 'C'),
                  'D': ('DRS', 'DRZ', 'DSH', 'DSZ', 'DZH', 'DZS', 'DS', 'DT',
                        'DZ', 'D'),
                  'E': ('EI', 'EJ', 'EU', 'EY', 'E'),
                  'F': ('FB', 'F'),
                  'G': ('G'),
                  'H': ('H'),
                  'I': ('IA', 'IE', 'IO', 'IU', 'I'),
                  'J': ('J'),
                  'K': ('KH', 'KS', 'K'),
                  'L': ('L'),
                  'M': ('MN', 'M'),
                  'N': ('NM', 'N'),
                  'O': ('OI', 'OJ', 'OY', 'O'),
                  'P': ('PF', 'PH', 'P'),
                  'Q': ('Q'),
                  'R': ('RS', 'RZ', 'R'),
                  'S': ('SCHTSCH', 'SCHTCH', 'SCHTSH', 'SHTCH', 'SHTSH',
                        'STSCH', 'SCHD', 'SCHT', 'SHCH', 'STCH', 'STRS',
                        'STRZ', 'STSH', 'SZCS', 'SZCZ', 'SCH', 'SHD', 'SHT',
                        'SZD', 'SZT', 'SC', 'SD', 'SH', 'ST', 'SZ', 'S'),
                  'T': ('TTSCH', 'TSCH', 'TTCH', 'TTSZ', 'TCH', 'THS', 'TRS',
                        'TRZ', 'TSH', 'TSZ', 'TTS', 'TTZ', 'TZS', 'TC', 'TH',
                        'TS', 'TZ', 'T'),
                  'U': ('UE', 'UI', 'UJ', 'UY', 'U'),
                  'V': ('V'),
                  'W': ('W'),
                  'X': ('X'),
                  'Y': ('Y'),
                  'Z': ('ZHDZH', 'ZDZH', 'ZSCH', 'ZDZ', 'ZHD', 'ZSH', 'ZD',
                        'ZH', 'ZS', 'Z')}

    _vowels = {'A', 'E', 'I', 'J', 'O', 'U', 'Y'}
    dms = ['']  # initialize empty code list

    # Require a maxlength of at least 6 and not more than 64
    if maxlength is not None:
        maxlength = min(max(6, maxlength), 64)
    else:
        maxlength = 64

    # uppercase, normalize, decompose, and filter non-A-Z
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
    word = word.replace('ß', 'SS')
    word = ''.join(c for c in word if c in
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
                    'Y', 'Z'})

    # Nothing to convert, return base case
    if not word:
        if zero_pad:
            return {'0'*maxlength}
        return {'0'}

    # Reverse word if computing Reverse Soundex
    if reverse:
        word = word[::-1]

    pos = 0
    while pos < len(word):
        # Iterate through _dms_order, which specifies the possible substrings
        # for which codes exist in the Daitch-Mokotoff coding
        for sstr in _dms_order[word[pos]]:  # pragma: no branch
            if word[pos:].startswith(sstr):
                # Having determined a valid substring start, retrieve the code
                dm_val = _dms_table[sstr]

                # Having retrieved the code (triple), determine the correct
                # positional variant (first, pre-vocalic, elsewhere)
                if pos == 0:
                    dm_val = dm_val[0]
                elif (pos+len(sstr) < len(word) and
                      word[pos+len(sstr)] in _vowels):
                    dm_val = dm_val[1]
                else:
                    dm_val = dm_val[2]

                # Build the code strings
                if isinstance(dm_val, tuple):
                    dms = [_ + text_type(dm_val[0]) for _ in dms] \
                            + [_ + text_type(dm_val[1]) for _ in dms]
                else:
                    dms = [_ + text_type(dm_val) for _ in dms]
                pos += len(sstr)
                break

    # Filter out double letters and _ placeholders
    dms = (''.join(c for c in _delete_consecutive_repeats(_) if c != '_')
           # [issue: Comprehensibility Best Practice] The variable _ does not seem to be defined.
           for _ in dms)

    # Trim codes and return set
    if zero_pad:
        dms = ((_ + ('0'*maxlength))[:maxlength] for _ in dms)
    else:
        dms = (_[:maxlength] for _ in dms)
    return set(dms)
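
The "Build the code strings" step is what makes dm_soundex() return a set: a tuple value such as (5, 4) for 'CH' doubles the list of partial codes. The sketch below isolates that step and replays it with the values the table yields for 'Christopher' and 'Niall' (worked out by hand from _dms_table and the positional rules above). The function names `extend_codes` and `finish_codes` are hypothetical and Python 3 is assumed.

```python
from itertools import groupby


def extend_codes(codes, value):
    """Append one table value to every partial code, branching on tuples."""
    if isinstance(value, tuple):
        return ([code + str(value[0]) for code in codes] +
                [code + str(value[1]) for code in codes])
    return [code + str(value) for code in codes]


def finish_codes(codes, maxlength=6):
    """Collapse repeats, drop '_' placeholders, then zero-pad and trim."""
    finished = set()
    for code in codes:
        collapsed = ''.join(ch for ch, _ in groupby(code))
        digits = collapsed.replace('_', '')
        finished.add((digits + '0' * maxlength)[:maxlength])
    return finished


def replay(values):
    """Feed a sequence of already selected table values through both steps."""
    codes = ['']
    for value in values:
        codes = extend_codes(codes, value)
    return finish_codes(codes)


# Values selected for 'CHRISTOPHER': CH, R, I, ST, O, PH, E, R
christopher = replay([(5, 4), 9, '_', 43, '_', 7, '_', 9])
# Values selected for 'NIALL': N, IA, L, L
niall = replay([6, '_', 8, 8])
```

The '_' placeholders matter before they are removed: they sit between digits during the repeat-collapsing pass, so only digits that are truly adjacent in the word are merged.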


def koelner_phonetik(word):
    """Return the Kölner Phonetik (numeric output) code for a word.

    Based on the algorithm described at
    https://de.wikipedia.org/wiki/Kölner_Phonetik

    While the output code is numeric, it is still a str because 0s can lead
    the code.

    :param str word: the word to transform
    :returns: the Kölner Phonetik value as a numeric string
    :rtype: str

    >>> koelner_phonetik('Christopher')
    '478237'
    >>> koelner_phonetik('Niall')
    '65'
    >>> koelner_phonetik('Smith')
    '862'
    >>> koelner_phonetik('Schmidt')
    '862'
    >>> koelner_phonetik('Müller')
    '657'
    >>> koelner_phonetik('Zimmermann')
    '86766'
    """
    # pylint: disable=too-many-branches
    def _after(word, i, letters):
        """Return True if word[i] follows one of the supplied letters."""
        if i > 0 and word[i-1] in letters:
            return True
        return False

    def _before(word, i, letters):
        """Return True if word[i] precedes one of the supplied letters."""
        if i+1 < len(word) and word[i+1] in letters:
            return True
        return False

    _vowels = {'A', 'E', 'I', 'J', 'O', 'U', 'Y'}

    sdx = ''

    word = unicodedata.normalize('NFKD', text_type(word.upper()))
    word = word.replace('ß', 'SS')

    word = word.replace('Ä', 'AE')
    word = word.replace('Ö', 'OE')
    word = word.replace('Ü', 'UE')
    word = ''.join(c for c in word if c in
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
                    'Y', 'Z'})

    # Nothing to convert, return base case
    if not word:
        return sdx

    for i in range(len(word)):
        # [issue: unused-code] Consider using enumerate instead of iterating with range and len
        if word[i] in _vowels:
            # [issue: Duplication] This code seems to be duplicated in your project.
            sdx += '0'
        elif word[i] == 'B':
            sdx += '1'
        elif word[i] == 'P':
            if _before(word, i, {'H'}):
                sdx += '3'
            else:
                sdx += '1'
        elif word[i] in {'D', 'T'}:
            if _before(word, i, {'C', 'S', 'Z'}):
                sdx += '8'
            else:
                sdx += '2'
        elif word[i] in {'F', 'V', 'W'}:
            sdx += '3'
        elif word[i] in {'G', 'K', 'Q'}:
            sdx += '4'
        elif word[i] == 'C':
            if _after(word, i, {'S', 'Z'}):
                sdx += '8'
            elif i == 0:
                if _before(word, i, {'A', 'H', 'K', 'L', 'O', 'Q', 'R', 'U',
                                     'X'}):
                    sdx += '4'
                else:
                    sdx += '8'
            elif _before(word, i, {'A', 'H', 'K', 'O', 'Q', 'U', 'X'}):
                sdx += '4'
            else:
                sdx += '8'
        elif word[i] == 'X':
            if _after(word, i, {'C', 'K', 'Q'}):
                sdx += '8'
            else:
                sdx += '48'
        elif word[i] == 'L':
            sdx += '5'
        elif word[i] in {'M', 'N'}:
            sdx += '6'
        elif word[i] == 'R':
            sdx += '7'
        elif word[i] in {'S', 'Z'}:
            sdx += '8'

    sdx = _delete_consecutive_repeats(sdx)

    if sdx:
        sdx = sdx[0] + sdx[1:].replace('0', '')

    return sdx
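
The two notes on this function (iterate with enumerate, and the disabled too-many-branches check) suggest one possible restructuring: a lookup table for letters whose code never depends on context, plus a small function for the context-sensitive letters P, D, T, C and X. This is only a sketch under stated assumptions: the names `_SIMPLE_CODES`, `_context_code` and `koelner_sketch` are hypothetical, the input is assumed to be already uppercased and reduced to A-Z, and Python 3 is assumed.

```python
from itertools import groupby

# Letters whose code never depends on their neighbours.
_SIMPLE_CODES = {}
for _letters, _code in (('AEIJOUY', '0'), ('B', '1'), ('FVW', '3'),
                        ('GKQ', '4'), ('L', '5'), ('MN', '6'), ('R', '7'),
                        ('SZ', '8')):
    for _letter in _letters:
        _SIMPLE_CODES[_letter] = _code


def _context_code(prev, char, nxt, is_first):
    """Code a context-sensitive letter; prev/nxt are '' at the word edges."""
    if char == 'P':
        return '3' if nxt == 'H' else '1'
    if char in {'D', 'T'}:
        return '8' if nxt in {'C', 'S', 'Z'} else '2'
    if char == 'C':
        if prev in {'S', 'Z'}:
            return '8'
        hard = set('AHKLOQRUX') if is_first else set('AHKOQUX')
        return '4' if nxt in hard else '8'
    if char == 'X':
        return '8' if prev in {'C', 'K', 'Q'} else '48'
    return ''  # H contributes nothing


def koelner_sketch(word):
    """Kölner Phonetik for an uppercase A-Z word (normalization omitted)."""
    codes = []
    for i, char in enumerate(word):
        if char in _SIMPLE_CODES:
            codes.append(_SIMPLE_CODES[char])
        else:
            prev = word[i - 1] if i > 0 else ''
            nxt = word[i + 1] if i + 1 < len(word) else ''
            codes.append(_context_code(prev, char, nxt, i == 0))
    sdx = ''.join(ch for ch, _ in groupby(''.join(codes)))
    return sdx[:1] + sdx[1:].replace('0', '')
```

The neighbour tests use sets rather than strings on purpose: at a word edge the neighbour is the empty string, and `'' in 'CSZ'` would be True for a string but is False for a set.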


def koelner_phonetik_num_to_alpha(num):
    """Convert a Kölner Phonetik code from numeric to alphabetic.

    :param str num: a numeric Kölner Phonetik representation
    :returns: an alphabetic representation of the same word
    :rtype: str

    >>> koelner_phonetik_num_to_alpha(862)
    'SNT'
    >>> koelner_phonetik_num_to_alpha(657)
    'NLR'
    >>> koelner_phonetik_num_to_alpha(86766)
    'SNRNN'
    """
    _koelner_num_translation = dict(zip((ord(_) for _ in '012345678'),
                                        # [issue: Comprehensibility Best Practice] The variable _ does not seem to be defined.
                                        'APTFKLNRS'))
    num = ''.join(c for c in text_type(num) if c in {'0', '1', '2', '3', '4',
                                                     '5', '6', '7', '8'})
    return num.translate(_koelner_num_translation)
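
The koelner_phonetik() docstring notes that codes stay strings because a 0 can lead the code, while the examples here pass integers. The short sketch below shows why that distinction matters for this conversion; `num_to_alpha` is a hypothetical stand-alone copy of the function's logic written for Python 3, and '06' is the code the rules above give for a word such as 'Anna'.

```python
def num_to_alpha(num):
    """Translate Kölner Phonetik digits to letters (stand-alone sketch)."""
    table = str.maketrans('012345678', 'APTFKLNRS')
    digits = ''.join(c for c in str(num) if c in '012345678')
    return digits.translate(table)


as_text = num_to_alpha('06')        # leading zero preserved
as_number = num_to_alpha(int('06'))  # int() has already dropped the zero
```

Passing the code as a string keeps the leading vowel marker ('AN'), whereas the integer route silently loses it ('N').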


def koelner_phonetik_alpha(word):
    """Return the Kölner Phonetik (alphabetic output) code for a word.

    :param str word: the word to transform
    :returns: the Kölner Phonetik value as an alphabetic string
    :rtype: str

    >>> koelner_phonetik_alpha('Smith')
    'SNT'
    >>> koelner_phonetik_alpha('Schmidt')
    'SNT'
    >>> koelner_phonetik_alpha('Müller')
    'NLR'
    >>> koelner_phonetik_alpha('Zimmermann')
    'SNRNN'
    """
    return koelner_phonetik_num_to_alpha(koelner_phonetik(word))


def nysiis(word, maxlength=6, modified=False):
    """Return the NYSIIS code for a word.

    A description of the New York State Identification and Intelligence System
    algorithm can be found at
    https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System

    The modified version of this algorithm is described in Appendix B of
    Lynch, Billy T. and William L. Arends. `Selection of a Surname Coding
    Procedure for the SRS Record Linkage System.` Statistical Reporting
    Service, U.S. Department of Agriculture, Washington, D.C. February 1977.
    https://naldc.nal.usda.gov/download/27833/PDF

    :param str word: the word to transform
    :param int maxlength: the maximum length (default 6) of the code to return
    :param bool modified: indicates whether to use USDA modified NYSIIS
    :returns: the NYSIIS value
    :rtype: str

    >>> nysiis('Christopher')
    'CRASTA'
    >>> nysiis('Niall')
    'NAL'
    >>> nysiis('Smith')
    'SNAT'
    >>> nysiis('Schmidt')
    'SNAD'

    >>> nysiis('Christopher', maxlength=_INFINITY)
    'CRASTAFAR'

    >>> nysiis('Christopher', maxlength=8, modified=True)
    'CRASTAFA'
    >>> nysiis('Niall', maxlength=8, modified=True)
    'NAL'
    >>> nysiis('Smith', maxlength=8, modified=True)
    'SNAT'
    >>> nysiis('Schmidt', maxlength=8, modified=True)
    'SNAD'
    """
    # Require a maxlength of at least 6
    if maxlength:
        maxlength = max(6, maxlength)

    _vowels = {'A', 'E', 'I', 'O', 'U'}

    word = ''.join(c for c in word.upper() if c.isalpha())
    word = word.replace('ß', 'SS')

    # exit early if there are no alphas
    if not word:
        return ''

    if modified:
        original_first_char = word[0]

    if word[:3] == 'MAC':
        word = 'MCC'+word[3:]
    elif word[:2] == 'KN':
        word = 'NN'+word[2:]
    elif word[:1] == 'K':
        word = 'C'+word[1:]
    elif word[:2] in {'PH', 'PF'}:
        word = 'FF'+word[2:]
    elif word[:3] == 'SCH':
        word = 'SSS'+word[3:]
    elif modified:
        if word[:2] == 'WR':
            word = 'RR'+word[2:]
        elif word[:2] == 'RH':
            word = 'RR'+word[2:]
        elif word[:2] == 'DG':
            word = 'GG'+word[2:]
        elif word[:1] in _vowels:
            word = 'A'+word[1:]

    if modified and word[-1] in {'S', 'Z'}:
        word = word[:-1]

    if word[-2:] == 'EE' or word[-2:] == 'IE' or (modified and
                                                  word[-2:] == 'YE'):
        word = word[:-2]+'Y'
    elif word[-2:] in {'DT', 'RT', 'RD'}:
        word = word[:-2]+'D'
    elif word[-2:] in {'NT', 'ND'}:
        word = word[:-2]+('N' if modified else 'D')
    elif modified:
        if word[-2:] == 'IX':
            word = word[:-2]+'ICK'
        elif word[-2:] == 'EX':
            word = word[:-2]+'ECK'
        elif word[-2:] in {'JR', 'SR'}:
            return 'ERROR'  # TODO: decide how best to return an error
            # [issue: Coding Style] TODO and FIXME comments should generally be avoided.

    key = word[0]

    skip = 0
    for i in range(1, len(word)):
        if i >= len(word):
            continue
        elif skip:
            skip -= 1
            continue
        elif word[i:i+2] == 'EV':
            word = word[:i] + 'AF' + word[i+2:]
            skip = 1
        elif word[i] in _vowels:
            word = word[:i] + 'A' + word[i+1:]
        elif modified and i != len(word)-1 and word[i] == 'Y':
            word = word[:i] + 'A' + word[i+1:]
        elif word[i] == 'Q':
            word = word[:i] + 'G' + word[i+1:]
        elif word[i] == 'Z':
            word = word[:i] + 'S' + word[i+1:]
        elif word[i] == 'M':
            word = word[:i] + 'N' + word[i+1:]
        elif word[i:i+2] == 'KN':
            word = word[:i] + 'N' + word[i+2:]
        elif word[i] == 'K':
            word = word[:i] + 'C' + word[i+1:]
        elif modified and i == len(word)-3 and word[i:i+3] == 'SCH':
            word = word[:i] + 'SSA'
            skip = 2
        elif word[i:i+3] == 'SCH':
            word = word[:i] + 'SSS' + word[i+3:]
            skip = 2
        elif modified and i == len(word)-2 and word[i:i+2] == 'SH':
            word = word[:i] + 'SA'
            skip = 1
        elif word[i:i+2] == 'SH':
            word = word[:i] + 'SS' + word[i+2:]
            skip = 1
        elif word[i:i+2] == 'PH':
            word = word[:i] + 'FF' + word[i+2:]
            skip = 1
        elif modified and word[i:i+3] == 'GHT':
            word = word[:i] + 'TTT' + word[i+3:]
            skip = 2
        elif modified and word[i:i+2] == 'DG':
            word = word[:i] + 'GG' + word[i+2:]
            skip = 1
        elif modified and word[i:i+2] == 'WR':
            word = word[:i] + 'RR' + word[i+2:]
            skip = 1
        elif word[i] == 'H' and (word[i-1] not in _vowels or
                                 word[i+1:i+2] not in _vowels):
            word = word[:i] + word[i-1] + word[i+1:]
        elif word[i] == 'W' and word[i-1] in _vowels:
            word = word[:i] + word[i-1] + word[i+1:]
813
814
        if word[i:i+skip+1] != key[-1:]:
815
            key += word[i:i+skip+1]
816
817
    key = _delete_consecutive_repeats(key)
818
819
    if key[-1] == 'S':
820
        key = key[:-1]
821
    if key[-2:] == 'AY':
822
        key = key[:-2] + 'Y'
823
    if key[-1:] == 'A':
824
        key = key[:-1]
825
    if modified and key[:1] == 'A':
826
        key = original_first_char + key[1:]
Issue: The variable original_first_char does not seem to be defined in case modified on line 717 is False. Are you sure this can never be the case?
827
828
    if maxlength and maxlength < _INFINITY:
829
        key = key[:maxlength]
830
831
    return key
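
The leading-prefix rewrites at the top of the function above could be pulled out into an ordered rule table, which is one way to act on the Long Method advice. The following is only a sketch covering the rules that apply when `modified` is false; `rewrite_prefix` and `_PREFIX_RULES` are hypothetical names that do not exist in Abydos.

```python
# Ordered (prefix, replacement) pairs; the order mirrors the if/elif chain,
# so 'KN' is tested before the shorter 'K'.
_PREFIX_RULES = (
    ('MAC', 'MCC'),
    ('KN', 'NN'),
    ('K', 'C'),
    ('PH', 'FF'),
    ('PF', 'FF'),
    ('SCH', 'SSS'),
)


def rewrite_prefix(word):
    """Apply the first matching prefix rule to an upper-case word."""
    for prefix, replacement in _PREFIX_RULES:
        if word.startswith(prefix):
            return replacement + word[len(prefix):]
    return word
```

Because only the first matching rule fires, the table behaves like the original chain for these prefixes while keeping the data separate from the control flow.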
832
833
834
def mra(word):
835
    """Return the MRA personal numeric identifier (PNI) for a word.
836
837
    A description of the Western Airlines Surname Match Rating Algorithm can
838
    be found on page 18 of
839
    https://archive.org/details/accessingindivid00moor
840
841
    :param str word: the word to transform
842
    :returns: the MRA PNI
843
    :rtype: str
844
845
    >>> mra('Christopher')
846
    'CHRPHR'
847
    >>> mra('Niall')
848
    'NL'
849
    >>> mra('Smith')
850
    'SMTH'
851
    >>> mra('Schmidt')
852
    'SCHMDT'
853
    """
854
    if not word:
855
        return word
856
    word = word.upper()
857
    word = word.replace('ß', 'SS')
858
    word = word[0]+''.join(c for c in word[1:] if
859
                           c not in {'A', 'E', 'I', 'O', 'U'})
860
    word = _delete_consecutive_repeats(word)
861
    if len(word) > 6:
862
        word = word[:3]+word[-3:]
863
    return word
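
The steps of `mra()` above can be traced with a self-contained sketch. The module helper `_delete_consecutive_repeats` is not shown in this listing, so the version below substitutes an assumed stand-in built on `itertools.groupby`; `mra_sketch` and `delete_consecutive_repeats` are illustrative names only.

```python
from itertools import groupby


def delete_consecutive_repeats(word):
    """Collapse runs of the same character (assumed stand-in for the helper)."""
    return ''.join(char for char, _ in groupby(word))


def mra_sketch(word):
    """Follow the same steps as mra() in the listing."""
    if not word:
        return word
    word = word.upper().replace('ß', 'SS')
    # Keep the first letter, then drop every later vowel.
    word = word[0] + ''.join(c for c in word[1:] if c not in 'AEIOU')
    word = delete_consecutive_repeats(word)
    # Codes longer than six characters keep the first and last three.
    if len(word) > 6:
        word = word[:3] + word[-3:]
    return word
```

For 'Christopher', the vowel pass yields 'CHRSTPHR', which is longer than six characters and is therefore cut to 'CHR' plus 'PHR'.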
864
865
866
def metaphone(word, maxlength=_INFINITY):
867
    """Return the Metaphone code for a word.
868
869
    Based on Lawrence Philips' Pick BASIC code from 1990:
870
    http://aspell.net/metaphone/metaphone.basic
871
    This incorporates some corrections to the above code, particularly
872
    some of those suggested by Michael Kuhn in:
873
    http://aspell.net/metaphone/metaphone-kuhn.txt
874
875
    :param str word: the word to transform
876
    :param int maxlength: the maximum length of the returned Metaphone code
877
        (defaults to unlimited, but in Philips' original implementation
878
        this was 4)
879
    :returns: the Metaphone value
880
    :rtype: str
881
882
883
    >>> metaphone('Christopher')
884
    'KRSTFR'
885
    >>> metaphone('Niall')
886
    'NL'
887
    >>> metaphone('Smith')
888
    'SM0'
889
    >>> metaphone('Schmidt')
890
    'SKMTT'
891
    """
892
    # pylint: disable=too-many-branches
893
    _vowels = {'A', 'E', 'I', 'O', 'U'}
894
    _frontv = {'E', 'I', 'Y'}
895
    _varson = {'C', 'G', 'P', 'S', 'T'}  # variable sounds (modified by 'H')
896
897
    # Require a maxlength of at least 4
898
    if maxlength is not None:
899
        maxlength = max(4, maxlength)
900
    else:
901
        maxlength = 64
902
903
    # Delete nonalphanumeric characters and make all caps
904
    ename = ''.join(c for c in word.upper() if c.isalnum())
905
    ename = ename.replace('ß', 'SS')
906
907
    # Exit early if nothing remains
908
    if not ename:
909
        return ''
910
    if ename[0:2] in {'PN', 'AE', 'KN', 'GN', 'WR'}:
911
        ename = ename[1:]
912
    elif ename[0] == 'X':
913
        ename = 'S' + ename[1:]
914
    elif ename[0:2] == 'WH':
915
        ename = 'W' + ename[2:]
916
917
    # Convert to metaph
918
    elen = len(ename)-1
919
    metaph = ''
920
    for i in range(len(ename)):
Issue (unused-code): Consider using enumerate instead of iterating with range and len
921
        if len(metaph) >= maxlength:
922
            break
923
        if ((ename[i] not in {'G', 'T'} and
924
             i > 0 and ename[i-1] == ename[i])):
925
            continue
926
927
        if ename[i] in _vowels and i == 0:
928
            metaph = ename[i]
929
930
        elif ename[i] == 'B':
931
            if i != elen or ename[i-1] != 'M':
932
                metaph += ename[i]
933
934
        elif ename[i] == 'C':
935
            if not (i > 0 and ename[i-1] == 'S' and ename[i+1:i+2] in _frontv):
936
                if ename[i+1:i+3] == 'IA':
937
                    metaph += 'X'
938
                elif ename[i+1:i+2] in _frontv:
939
                    metaph += 'S'
940
                elif i > 0 and ename[i-1:i+2] == 'SCH':
941
                    metaph += 'K'
942
                elif ename[i+1:i+2] == 'H':
943
                    if i == 0 and i+1 < elen and ename[i+2:i+3] not in _vowels:
944
                        metaph += 'K'
945
                    else:
946
                        metaph += 'X'
947
                else:
948
                    metaph += 'K'
949
950
        elif ename[i] == 'D':
951
            if ename[i+1:i+2] == 'G' and ename[i+2:i+3] in _frontv:
952
                metaph += 'J'
953
            else:
954
                metaph += 'T'
955
956
        elif ename[i] == 'G':
957
            if ename[i+1:i+2] == 'H' and not (i+1 == elen or
958
                                              ename[i+2:i+3] not in _vowels):
959
                continue
960
            elif i > 0 and ((i+1 == elen and ename[i+1] == 'N') or
961
                            (i+3 == elen and ename[i+1:i+4] == 'NED')):
962
                continue
963
            elif (i-1 > 0 and i+1 <= elen and ename[i-1] == 'D' and
964
                  ename[i+1] in _frontv):
965
                continue
966
            elif ename[i+1:i+2] == 'G':
967
                continue
968
            elif ename[i+1:i+2] in _frontv:
969
                if i == 0 or ename[i-1] != 'G':
970
                    metaph += 'J'
971
                else:
972
                    metaph += 'K'
973
            else:
974
                metaph += 'K'
975
976
        elif ename[i] == 'H':
977
            if ((i > 0 and ename[i-1] in _vowels and
978
                 ename[i+1:i+2] not in _vowels)):
979
                continue
980
            elif i > 0 and ename[i-1] in _varson:
981
                continue
982
            else:
983
                metaph += 'H'
984
985
        elif ename[i] in {'F', 'J', 'L', 'M', 'N', 'R'}:
986
            metaph += ename[i]
987
988
        elif ename[i] == 'K':
989
            if i > 0 and ename[i-1] == 'C':
990
                continue
991
            else:
992
                metaph += 'K'
993
994
        elif ename[i] == 'P':
995
            if ename[i+1:i+2] == 'H':
996
                metaph += 'F'
997
            else:
998
                metaph += 'P'
999
1000
        elif ename[i] == 'Q':
1001
            metaph += 'K'
1002
1003
        elif ename[i] == 'S':
1004
            if ((i > 0 and i+2 <= elen and ename[i+1] == 'I' and
1005
                 ename[i+2] in 'OA')):
1006
                metaph += 'X'
1007
            elif ename[i+1:i+2] == 'H':
1008
                metaph += 'X'
1009
            else:
1010
                metaph += 'S'
1011
1012
        elif ename[i] == 'T':
1013
            if ((i > 0 and i+2 <= elen and ename[i+1] == 'I' and
1014
                 ename[i+2] in {'A', 'O'})):
1015
                metaph += 'X'
1016
            elif ename[i+1:i+2] == 'H':
1017
                metaph += '0'
1018
            elif ename[i+1:i+3] != 'CH':
1019
                if ename[i-1:i] != 'T':
1020
                    metaph += 'T'
1021
1022
        elif ename[i] == 'V':
1023
            metaph += 'F'
1024
1025
        elif ename[i] in 'WY':
1026
            if ename[i+1:i+2] in _vowels:
1027
                metaph += ename[i]
1028
1029
        elif ename[i] == 'X':
1030
            metaph += 'KS'
1031
1032
        elif ename[i] == 'Z':
1033
            metaph += 'S'
1034
1035
    return metaph
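
The initial-letter handling that `metaphone()` performs before its main loop can be isolated as a small function. This is a sketch of that one step only, assuming the input has already been upper-cased and filtered; `normalize_initial` is a hypothetical name, not an Abydos function.

```python
def normalize_initial(ename):
    """Rewrite the start of an upper-case name as metaphone() does."""
    # Silent first letter: drop it and keep the rest.
    if ename[:2] in {'PN', 'AE', 'KN', 'GN', 'WR'}:
        return ename[1:]
    # Initial 'X' is treated as 'S'.
    if ename[:1] == 'X':
        return 'S' + ename[1:]
    # Initial 'WH' is treated as 'W'.
    if ename[:2] == 'WH':
        return 'W' + ename[2:]
    return ename
```

Slices such as `ename[:1]` are used instead of `ename[0]` so that the sketch also accepts an empty string.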
1036
1037
1038
def double_metaphone(word, maxlength=_INFINITY):
1039
    """Return the Double Metaphone code for a word.
1040
1041
    Based on Lawrence Philips' (Visual) C++ code from 1999:
1042
    http://aspell.net/metaphone/dmetaph.cpp
1043
1044
    :param word: the word to transform
1045
    :param maxlength: the maximum length of the returned Double Metaphone codes
1046
        (defaults to unlimited, but in Philips' original implementation this
1047
        was 4)
1048
    :returns: the Double Metaphone value(s)
1049
    :rtype: tuple
1050
1051
    >>> double_metaphone('Christopher')
1052
    ('KRSTFR', '')
1053
    >>> double_metaphone('Niall')
1054
    ('NL', '')
1055
    >>> double_metaphone('Smith')
1056
    ('SM0', 'XMT')
1057
    >>> double_metaphone('Schmidt')
1058
    ('XMT', 'SMT')
1059
    """
1060
    # pylint: disable=too-many-branches
1061
    # Require a maxlength of at least 4
1062
    if maxlength is not None:
1063
        maxlength = max(4, maxlength)
1064
    else:
1065
        maxlength = 64
1066
1067
    primary = ''
1068
    secondary = ''
1069
1070
    def _slavo_germanic():
1071
        """Return True if the word appears to be Slavic or Germanic."""
1072
        if 'W' in word or 'K' in word or 'CZ' in word:
1073
            return True
1074
        return False
1075
1076
    def _metaph_add(pri, sec=''):
1077
        """Return a new metaphone tuple with the supplied elements."""
1078
        newpri = primary
1079
        newsec = secondary
1080
        if pri:
1081
            newpri += pri
1082
        if sec:
1083
            if sec != ' ':
1084
                newsec += sec
1085
        else:
1086
            newsec += pri
1087
        return (newpri, newsec)
1088
1089
    def _is_vowel(pos):
1090
        """Return True if the character at word[pos] is a vowel."""
1091
        if pos >= 0 and word[pos] in {'A', 'E', 'I', 'O', 'U', 'Y'}:
1092
            return True
1093
        return False
1094
1095
    def _get_at(pos):
1096
        """Return the character at word[pos]."""
1097
        return word[pos]
1098
1099
    def _string_at(pos, slen, substrings):
1100
        """Return True if word[pos:pos+slen] is in substrings."""
1101
        if pos < 0:
1102
            return False
1103
        return word[pos:pos+slen] in substrings
1104
1105
    current = 0
1106
    if not word:
1107
        return ('', '')
1108
    word = word.upper()
1109
    word = word.replace('ß', 'SS')
1110
    # Measure after normalizing: upper-casing 'ß' can lengthen the word
1111
    length = len(word)
1112
    last = length - 1
1113
1114
    # Pad the original string so that we can index beyond the edge of the world
1115
    word += '     '
1116
1117
    # Skip these when at start of word
1118
    if word[0:2] in {'GN', 'KN', 'PN', 'WR', 'PS'}:
1119
        current += 1
1120
1121
    # Initial 'X' is pronounced 'Z' e.g. 'Xavier'
1122
    if _get_at(0) == 'X':
1123
        (primary, secondary) = _metaph_add('S')  # 'Z' maps to 'S'
1124
        current += 1
1125
1126
    # Main loop
1127
    while True:
Issue (unused-code): Too many nested blocks (6/5)
1128
        if current >= length:
1129
            break
1130
1131
        if _get_at(current) in {'A', 'E', 'I', 'O', 'U', 'Y'}:
1132
            if current == 0:
1133
                # All init vowels now map to 'A'
1134
                (primary, secondary) = _metaph_add('A')
1135
            current += 1
1136
            continue
1137
1138
        elif _get_at(current) == 'B':
1139
            # "-mb", e.g., "dumb", already skipped over...
1140
            (primary, secondary) = _metaph_add('P')
1141
            if _get_at(current + 1) == 'B':
1142
                current += 2
1143
            else:
1144
                current += 1
1145
            continue
1146
1147
        elif _get_at(current) == 'Ç':
1148
            (primary, secondary) = _metaph_add('S')
1149
            current += 1
1150
            continue
1151
1152
        elif _get_at(current) == 'C':
1153
            # Various Germanic
1154
            if (current > 1 and not _is_vowel(current - 2) and
Issue (best-practice): Too many boolean expressions in if statement (6/5)
1155
                    _string_at((current - 1), 3, {'ACH'}) and
1156
                    ((_get_at(current + 2) != 'I') and
1157
                     ((_get_at(current + 2) != 'E') or
1158
                      _string_at((current - 2), 6,
1159
                                 {'BACHER', 'MACHER'})))):
1160
                (primary, secondary) = _metaph_add('K')
1161
                current += 2
1162
                continue
1163
1164
            # Special case 'caesar'
1165
            elif current == 0 and _string_at(current, 6, {'CAESAR'}):
1166
                (primary, secondary) = _metaph_add('S')
1167
                current += 2
1168
                continue
1169
1170
            # Italian 'chianti'
1171
            elif _string_at(current, 4, {'CHIA'}):
1172
                (primary, secondary) = _metaph_add('K')
1173
                current += 2
1174
                continue
1175
1176
            elif _string_at(current, 2, {'CH'}):
1177
                # Find 'Michael'
1178
                if current > 0 and _string_at(current, 4, {'CHAE'}):
1179
                    (primary, secondary) = _metaph_add('K', 'X')
1180
                    current += 2
1181
                    continue
1182
1183
                # Greek roots e.g. 'chemistry', 'chorus'
1184
                elif (current == 0 and
1185
                      (_string_at((current + 1), 5,
1186
                                  {'HARAC', 'HARIS'}) or
1187
                       _string_at((current + 1), 3,
1188
                                  {'HOR', 'HYM', 'HIA', 'HEM'})) and
1189
                      not _string_at(0, 5, {'CHORE'})):
1190
                    (primary, secondary) = _metaph_add('K')
1191
                    current += 2
1192
                    continue
1193
1194
                # Germanic, Greek, or otherwise 'ch' for 'kh' sound
1195
                elif ((_string_at(0, 4, {'VAN ', 'VON '}) or
Issue (best-practice): Too many boolean expressions in if statement (7/5)
1196
                       _string_at(0, 3, {'SCH'})) or
1197
                      # 'architect', 'orchestra', 'orchid', but not 'arch'
1198
                      _string_at((current - 2), 6,
1199
                                 {'ORCHES', 'ARCHIT', 'ORCHID'}) or
1200
                      _string_at((current + 2), 1, {'T', 'S'}) or
1201
                      ((_string_at((current - 1), 1,
1202
                                   {'A', 'O', 'U', 'E'}) or
1203
                        (current == 0)) and
1204
                       # e.g., 'wachtler', 'wechsler', but not 'tichner'
1205
                       _string_at((current + 2), 1,
1206
                                  {'L', 'R', 'N', 'M', 'B', 'H', 'F', 'V', 'W',
1207
                                   ' '}))):
1208
                    (primary, secondary) = _metaph_add('K')
1209
1210
                else:
1211
                    if current > 0:
1212
                        if _string_at(0, 2, {'MC'}):
1213
                            # e.g., "McHugh"
1214
                            (primary, secondary) = _metaph_add('K')
1215
                        else:
1216
                            (primary, secondary) = _metaph_add('X', 'K')
1217
                    else:
1218
                        (primary, secondary) = _metaph_add('X')
1219
1220
                current += 2
1221
                continue
1222
1223
            # e.g., 'czerny'
1224
            elif (_string_at(current, 2, {'CZ'}) and
1225
                  not _string_at((current - 2), 4, {'WICZ'})):
1226
                (primary, secondary) = _metaph_add('S', 'X')
1227
                current += 2
1228
                continue
1229
1230
            # e.g., 'focaccia'
1231
            elif _string_at((current + 1), 3, {'CIA'}):
1232
                (primary, secondary) = _metaph_add('X')
1233
                current += 3
1234
1235
            # double 'C', but not if e.g. 'McClellan'
1236
            elif (_string_at(current, 2, {'CC'}) and
1237
                  not ((current == 1) and (_get_at(0) == 'M'))):
1238
                # 'bellocchio' but not 'bacchus'
1239
                if ((_string_at((current + 2), 1,
1240
                                {'I', 'E', 'H'}) and
1241
                     not _string_at((current + 2), 2, {'HU'}))):
1242
                    # 'accident', 'accede', 'succeed'
1243
                    if ((((current == 1) and _get_at(current - 1) == 'A') or
1244
                         _string_at((current - 1), 5,
1245
                                    {'UCCEE', 'UCCES'}))):
1246
                        (primary, secondary) = _metaph_add('KS')
1247
                    # 'bacci', 'bertucci', other Italian
1248
                    else:
1249
                        (primary, secondary) = _metaph_add('X')
1250
                    current += 3
1251
                    continue
1252
                else:  # Pierce's rule
1253
                    (primary, secondary) = _metaph_add('K')
1254
                    current += 2
1255
                    continue
1256
1257
            elif _string_at(current, 2, {'CK', 'CG', 'CQ'}):
1258
                (primary, secondary) = _metaph_add('K')
1259
                current += 2
1260
                continue
1261
1262
            elif _string_at(current, 2, {'CI', 'CE', 'CY'}):
1263
                # Italian vs. English
1264
                if _string_at(current, 3, {'CIO', 'CIE', 'CIA'}):
1265
                    (primary, secondary) = _metaph_add('S', 'X')
1266
                else:
1267
                    (primary, secondary) = _metaph_add('S')
1268
                current += 2
1269
                continue
1270
1271
            # else
1272
            else:
1273
                (primary, secondary) = _metaph_add('K')
1274
1275
                # name sent in 'mac caffrey', 'mac gregor'
1276
                if _string_at((current + 1), 2, {' C', ' Q', ' G'}):
1277
                    current += 3
1278
                elif (_string_at((current + 1), 1,
1279
                                 {'C', 'K', 'Q'}) and
1280
                      not _string_at((current + 1), 2, {'CE', 'CI'})):
1281
                    current += 2
1282
                else:
1283
                    current += 1
1284
                continue
1285
1286
        elif _get_at(current) == 'D':
1287
            if _string_at(current, 2, {'DG'}):
1288
                if _string_at((current + 2), 1, {'I', 'E', 'Y'}):
1289
                    # e.g. 'edge'
1290
                    (primary, secondary) = _metaph_add('J')
1291
                    current += 3
1292
                    continue
1293
                else:
1294
                    # e.g. 'edgar'
1295
                    (primary, secondary) = _metaph_add('TK')
1296
                    current += 2
1297
                    continue
1298
1299
            elif _string_at(current, 2, {'DT', 'DD'}):
1300
                (primary, secondary) = _metaph_add('T')
1301
                current += 2
1302
                continue
1303
1304
            # else
1305
            else:
1306
                (primary, secondary) = _metaph_add('T')
1307
                current += 1
1308
                continue
1309
1310
        elif _get_at(current) == 'F':
1311
            if _get_at(current + 1) == 'F':
1312
                current += 2
1313
            else:
1314
                current += 1
1315
            (primary, secondary) = _metaph_add('F')
1316
            continue
1317
1318
        elif _get_at(current) == 'G':
1319
            if _get_at(current + 1) == 'H':
1320
                if (current > 0) and not _is_vowel(current - 1):
1321
                    (primary, secondary) = _metaph_add('K')
1322
                    current += 2
1323
                    continue
1324
1325
                # 'ghislane', 'ghiradelli'
1326
                elif current == 0:
1327
                    if _get_at(current + 2) == 'I':
1328
                        (primary, secondary) = _metaph_add('J')
1329
                    else:
1330
                        (primary, secondary) = _metaph_add('K')
1331
                    current += 2
1332
                    continue
1333
1334
                # Parker's rule (with some further refinements) - e.g., 'hugh'
1335
                elif (((current > 1) and
Issue (best-practice): Too many boolean expressions in if statement (6/5)
1336
                       _string_at((current - 2), 1, {'B', 'H', 'D'})) or
1337
                      # e.g., 'bough'
1338
                      ((current > 2) and
1339
                       _string_at((current - 3), 1, {'B', 'H', 'D'})) or
1340
                      # e.g., 'broughton'
1341
                      ((current > 3) and
1342
                       _string_at((current - 4), 1, {'B', 'H'}))):
1343
                    current += 2
1344
                    continue
1345
                else:
1346
                    # e.g. 'laugh', 'McLaughlin', 'cough',
1347
                    #      'gough', 'rough', 'tough'
1348
                    if ((current > 2) and
1349
                            (_get_at(current - 1) == 'U') and
1350
                            (_string_at((current - 3), 1,
1351
                                        {'C', 'G', 'L', 'R', 'T'}))):
1352
                        (primary, secondary) = _metaph_add('F')
1353
                    elif (current > 0) and _get_at(current - 1) != 'I':
1354
                        (primary, secondary) = _metaph_add('K')
1355
                    current += 2
1356
                    continue
1357
1358
            elif _get_at(current + 1) == 'N':
1359
                if (current == 1) and _is_vowel(0) and not _slavo_germanic():
1360
                    (primary, secondary) = _metaph_add('KN', 'N')
1361
                # not e.g. 'cagney'
1362
                elif (not _string_at((current + 2), 2, {'EY'}) and
1363
                      (_get_at(current + 1) != 'Y') and
1364
                      not _slavo_germanic()):
1365
                    (primary, secondary) = _metaph_add('N', 'KN')
1366
                else:
1367
                    (primary, secondary) = _metaph_add('KN')
1368
                current += 2
1369
                continue
1370
1371
            # 'tagliaro'
1372
            elif (_string_at((current + 1), 2, {'LI'}) and
1373
                  not _slavo_germanic()):
1374
                (primary, secondary) = _metaph_add('KL', 'L')
1375
                current += 2
1376
                continue
1377
1378
            # -ges-, -gep-, -gel-, -gie- at beginning
1379
            elif ((current == 0) and
1380
                  ((_get_at(current + 1) == 'Y') or
1381
                   _string_at((current + 1), 2, {'ES', 'EP', 'EB', 'EL', 'EY',
1382
                                                 'IB', 'IL', 'IN', 'IE', 'EI',
1383
                                                 'ER'}))):
1384
                (primary, secondary) = _metaph_add('K', 'J')
1385
                current += 2
1386
                continue
1387
1388
            #  -ger-,  -gy-
1389
            elif ((_string_at((current + 1), 2, {'ER'}) or
1390
                   (_get_at(current + 1) == 'Y')) and not
1391
                  _string_at(0, 6, {'DANGER', 'RANGER', 'MANGER'}) and not
1392
                  _string_at((current - 1), 1, {'E', 'I'}) and not
1393
                  _string_at((current - 1), 3, {'RGY', 'OGY'})):
1394
                (primary, secondary) = _metaph_add('K', 'J')
1395
                current += 2
1396
                continue
1397
1398
            # Italian, e.g., 'biaggi'
1399
            elif (_string_at((current + 1), 1, {'E', 'I', 'Y'}) or
1400
                  _string_at((current - 1), 4, {'AGGI', 'OGGI'})):
1401
                # obvious germanic
1402
                if (((_string_at(0, 4, {'VAN ', 'VON '}) or
1403
                      _string_at(0, 3, {'SCH'})) or
1404
                     _string_at((current + 1), 2, {'ET'}))):
1405
                    (primary, secondary) = _metaph_add('K')
1406
                elif _string_at((current + 1), 4, {'IER '}):
1407
                    (primary, secondary) = _metaph_add('J')
1408
                else:
1409
                    (primary, secondary) = _metaph_add('J', 'K')
1410
                current += 2
1411
                continue
1412
1413
            else:
1414
                if _get_at(current + 1) == 'G':
1415
                    current += 2
1416
                else:
1417
                    current += 1
1418
                (primary, secondary) = _metaph_add('K')
1419
                continue
1420
1421
        elif _get_at(current) == 'H':
1422
            # only keep if first & before vowel or between 2 vowels
1423
            if ((((current == 0) or _is_vowel(current - 1)) and
1424
                 _is_vowel(current + 1))):
1425
                (primary, secondary) = _metaph_add('H')
1426
                current += 2
1427
            else:  # also takes care of 'HH'
1428
                current += 1
1429
            continue
1430
1431
        elif _get_at(current) == 'J':
1432
            # obvious Spanish, 'jose', 'san jacinto'
1433
            if _string_at(current, 4, {'JOSE'}) or _string_at(0, 4, {'SAN '}):
1434
                if ((((current == 0) and (_get_at(current + 4) == ' ')) or
1435
                     _string_at(0, 4, {'SAN '}))):
1436
                    (primary, secondary) = _metaph_add('H')
1437
                else:
1438
                    (primary, secondary) = _metaph_add('J', 'H')
1439
                current += 1
1440
                continue
1441
1442
            elif (current == 0) and not _string_at(current, 4, {'JOSE'}):
1443
                # Yankelovich/Jankelowicz
1444
                (primary, secondary) = _metaph_add('J', 'A')
1445
            # Spanish pron. of e.g. 'bajador'
1446
            elif (_is_vowel(current - 1) and
1447
                  not _slavo_germanic() and
1448
                  ((_get_at(current + 1) == 'A') or
1449
                   (_get_at(current + 1) == 'O'))):
1450
                (primary, secondary) = _metaph_add('J', 'H')
1451
            elif current == last:
1452
                (primary, secondary) = _metaph_add('J', ' ')
1453
            elif (not _string_at((current + 1), 1,
1454
                                 {'L', 'T', 'K', 'S', 'N', 'M', 'B', 'Z'}) and
1455
                  not _string_at((current - 1), 1, {'S', 'K', 'L'})):
1456
                (primary, secondary) = _metaph_add('J')
1457
1458
            if _get_at(current + 1) == 'J':  # it could happen!
1459
                current += 2
1460
            else:
1461
                current += 1
1462
            continue
1463
1464
        elif _get_at(current) == 'K':
1465
            if _get_at(current + 1) == 'K':
1466
                current += 2
1467
            else:
1468
                current += 1
1469
            (primary, secondary) = _metaph_add('K')
1470
            continue
1471
1472
        elif _get_at(current) == 'L':
1473
            if _get_at(current + 1) == 'L':
1474
                # Spanish e.g. 'cabrillo', 'gallegos'
1475
                if (((current == (length - 3)) and
1476
                     _string_at((current - 1), 4, {'ILLO', 'ILLA', 'ALLE'})) or
1477
                        ((_string_at((last - 1), 2, {'AS', 'OS'}) or
1478
                          _string_at(last, 1, {'A', 'O'})) and
1479
                         _string_at((current - 1), 4, {'ALLE'}))):
1480
                    (primary, secondary) = _metaph_add('L', ' ')
1481
                    current += 2
1482
                    continue
1483
                current += 2
1484
            else:
1485
                current += 1
1486
            (primary, secondary) = _metaph_add('L')
1487
            continue
1488
1489
        elif _get_at(current) == 'M':
1490
            if (((_string_at((current - 1), 3, {'UMB'}) and
1491
                  (((current + 1) == last) or
1492
                   _string_at((current + 2), 2, {'ER'}))) or
1493
                 # 'dumb', 'thumb'
1494
                 (_get_at(current + 1) == 'M'))):
1495
                current += 2
1496
            else:
1497
                current += 1
1498
            (primary, secondary) = _metaph_add('M')
1499
            continue
1500
1501
        elif _get_at(current) == 'N':
1502
            if _get_at(current + 1) == 'N':
1503
                current += 2
1504
            else:
1505
                current += 1
1506
            (primary, secondary) = _metaph_add('N')
1507
            continue
1508
1509
        elif _get_at(current) == 'Ñ':
1510
            current += 1
1511
            (primary, secondary) = _metaph_add('N')
1512
            continue
1513
1514
        elif _get_at(current) == 'P':
1515
            if _get_at(current + 1) == 'H':
1516
                (primary, secondary) = _metaph_add('F')
1517
                current += 2
1518
                continue
1519
1520
            # also account for "campbell", "raspberry"
1521
            elif _string_at((current + 1), 1, {'P', 'B'}):
1522
                current += 2
1523
            else:
1524
                current += 1
1525
            (primary, secondary) = _metaph_add('P')
1526
            continue
1527
1528
        elif _get_at(current) == 'Q':
1529
            if _get_at(current + 1) == 'Q':
1530
                current += 2
1531
            else:
1532
                current += 1
1533
            (primary, secondary) = _metaph_add('K')
1534
            continue
1535
1536
        elif _get_at(current) == 'R':
1537
            # French e.g. 'rogier', but exclude 'hochmeier'
1538
            if (((current == last) and
1539
                 not _slavo_germanic() and
1540
                 _string_at((current - 2), 2, {'IE'}) and
1541
                 not _string_at((current - 4), 2, {'ME', 'MA'}))):
1542
                (primary, secondary) = _metaph_add('', 'R')
1543
            else:
1544
                (primary, secondary) = _metaph_add('R')
1545
1546
            if _get_at(current + 1) == 'R':
1547
                current += 2
1548
            else:
1549
                current += 1
1550
            continue
1551
1552
        elif _get_at(current) == 'S':
1553
            # special cases 'island', 'isle', 'carlisle', 'carlysle'
1554
            if _string_at((current - 1), 3, {'ISL', 'YSL'}):
1555
                current += 1
1556
                continue
1557
1558
            # special case 'sugar-'
1559
            elif (current == 0) and _string_at(current, 5, {'SUGAR'}):
1560
                (primary, secondary) = _metaph_add('X', 'S')
1561
                current += 1
1562
                continue
1563
1564
            elif _string_at(current, 2, {'SH'}):
1565
                # Germanic
1566
                if _string_at((current + 1), 4,
1567
                              {'HEIM', 'HOEK', 'HOLM', 'HOLZ'}):
1568
                    (primary, secondary) = _metaph_add('S')
1569
                else:
1570
                    (primary, secondary) = _metaph_add('X')
1571
                current += 2
1572
                continue
1573
1574
            # Italian & Armenian
1575
            elif (_string_at(current, 3, {'SIO', 'SIA'}) or
1576
                  _string_at(current, 4, {'SIAN'})):
1577
                if not _slavo_germanic():
1578
                    (primary, secondary) = _metaph_add('S', 'X')
1579
                else:
1580
                    (primary, secondary) = _metaph_add('S')
1581
                current += 3
1582
                continue
1583
1584
            # German & anglicisations, e.g. 'smith' match 'schmidt',
1585
            #                               'snider' match 'schneider'
1586
            # also, -sz- in Slavic language although in Hungarian it is
1587
            #       pronounced 's'
1588
            elif (((current == 0) and
1589
                   _string_at((current + 1), 1, {'M', 'N', 'L', 'W'})) or
1590
                  _string_at((current + 1), 1, {'Z'})):
1591
                (primary, secondary) = _metaph_add('S', 'X')
1592
                if _string_at((current + 1), 1, {'Z'}):
1593
                    current += 2
1594
                else:
1595
                    current += 1
1596
                continue
1597
1598
            elif _string_at(current, 2, {'SC'}):
1599
                # Schlesinger's rule
1600
                if _get_at(current + 2) == 'H':
1601
                    # Dutch origin, e.g. 'school', 'schooner'
1602
                    if _string_at((current + 3), 2,
1603
                                  {'OO', 'ER', 'EN', 'UY', 'ED', 'EM'}):
1604
                        # 'schermerhorn', 'schenker'
1605
                        if _string_at((current + 3), 2, {'ER', 'EN'}):
1606
                            (primary, secondary) = _metaph_add('X', 'SK')
1607
                        else:
1608
                            (primary, secondary) = _metaph_add('SK')
1609
                        current += 3
1610
                        continue
1611
                    else:
1612
                        if (((current == 0) and not _is_vowel(3) and
1613
                             (_get_at(3) != 'W'))):
1614
                            (primary, secondary) = _metaph_add('X', 'S')
1615
                        else:
1616
                            (primary, secondary) = _metaph_add('X')
1617
                        current += 3
1618
                        continue
1619
1620
                elif _string_at((current + 2), 1, {'I', 'E', 'Y'}):
1621
                    (primary, secondary) = _metaph_add('S')
1622
                    current += 3
1623
                    continue
1624
1625
                # else
1626
                else:
1627
                    (primary, secondary) = _metaph_add('SK')
1628
                    current += 3
1629
                    continue
1630
1631
            else:
1632
                # French, e.g. 'resnais', 'artois'
1633
                if (current == last) and _string_at((current - 2), 2,
1634
                                                    {'AI', 'OI'}):
1635
                    (primary, secondary) = _metaph_add('', 'S')
1636
                else:
1637
                    (primary, secondary) = _metaph_add('S')
1638
1639
                if _string_at((current + 1), 1, {'S', 'Z'}):
1640
                    current += 2
1641
                else:
1642
                    current += 1
1643
                continue
1644
1645
        elif _get_at(current) == 'T':
1646
            if _string_at(current, 4, {'TION'}):
1647
                (primary, secondary) = _metaph_add('X')
1648
                current += 3
1649
                continue
1650
1651
            elif _string_at(current, 3, {'TIA', 'TCH'}):
1652
                (primary, secondary) = _metaph_add('X')
1653
                current += 3
1654
                continue
1655
1656
            elif (_string_at(current, 2, {'TH'}) or
1657
                  _string_at(current, 3, {'TTH'})):
1658
                # special case 'thomas', 'thames' or Germanic
1659
                if ((_string_at((current + 2), 2, {'OM', 'AM'}) or
1660
                     _string_at(0, 4, {'VAN ', 'VON '}) or
1661
                     _string_at(0, 3, {'SCH'}))):
1662
                    (primary, secondary) = _metaph_add('T')
1663
                else:
1664
                    (primary, secondary) = _metaph_add('0', 'T')
1665
                current += 2
1666
                continue
1667
1668
            elif _string_at((current + 1), 1, {'T', 'D'}):
1669
                current += 2
1670
            else:
1671
                current += 1
1672
            (primary, secondary) = _metaph_add('T')
1673
            continue
1674
1675
        elif _get_at(current) == 'V':
1676
            if _get_at(current + 1) == 'V':
1677
                current += 2
1678
            else:
1679
                current += 1
1680
            (primary, secondary) = _metaph_add('F')
1681
            continue
1682
1683
        elif _get_at(current) == 'W':
1684
            # can also be in middle of word
1685
            if _string_at(current, 2, {'WR'}):
1686
                (primary, secondary) = _metaph_add('R')
1687
                current += 2
1688
                continue
1689
            elif ((current == 0) and
1690
                  (_is_vowel(current + 1) or _string_at(current, 2, {'WH'}))):
1691
                # Wasserman should match Vasserman
1692
                if _is_vowel(current + 1):
1693
                    (primary, secondary) = _metaph_add('A', 'F')
1694
                else:
1695
                    # need Uomo to match Womo
1696
                    (primary, secondary) = _metaph_add('A')
1697
1698
            # Arnow should match Arnoff
1699
            if ((((current == last) and _is_vowel(current - 1)) or
1700
                 _string_at((current - 1), 5,
1701
                            {'EWSKI', 'EWSKY', 'OWSKI', 'OWSKY'}) or
1702
                 _string_at(0, 3, {'SCH'}))):
1703
                (primary, secondary) = _metaph_add('', 'F')
1704
                current += 1
1705
                continue
1706
            # Polish e.g. 'filipowicz'
1707
            elif _string_at(current, 4, {'WICZ', 'WITZ'}):
1708
                (primary, secondary) = _metaph_add('TS', 'FX')
1709
                current += 4
1710
                continue
1711
            # else skip it
1712
            else:
1713
                current += 1
1714
                continue
1715
1716
        elif _get_at(current) == 'X':
1717
            # French e.g. breaux
1718
            if (not ((current == last) and
1719
                     (_string_at((current - 3), 3, {'IAU', 'EAU'}) or
1720
                      _string_at((current - 2), 2, {'AU', 'OU'})))):
1721
                (primary, secondary) = _metaph_add('KS')
1722
1723
            if _string_at((current + 1), 1, {'C', 'X'}):
1724
                current += 2
1725
            else:
1726
                current += 1
1727
            continue
1728
1729
        elif _get_at(current) == 'Z':
1730
            # Chinese Pinyin e.g. 'zhao'
1731
            if _get_at(current + 1) == 'H':
1732
                (primary, secondary) = _metaph_add('J')
1733
                current += 2
1734
                continue
1735
            elif (_string_at((current + 1), 2, {'ZO', 'ZI', 'ZA'}) or
1736
                  (_slavo_germanic() and ((current > 0) and
1737
                                          _get_at(current - 1) != 'T'))):
1738
                (primary, secondary) = _metaph_add('S', 'TS')
1739
            else:
1740
                (primary, secondary) = _metaph_add('S')
1741
1742
            if _get_at(current + 1) == 'Z':
1743
                current += 2
1744
            else:
1745
                current += 1
1746
            continue
1747
1748
        else:
1749
            current += 1
1750
1751
    if maxlength and maxlength < _INFINITY:
1752
        primary = primary[:maxlength]
1753
        secondary = secondary[:maxlength]
1754
    if primary == secondary:
1755
        secondary = ''
1756
1757
    return (primary, secondary)
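The closing steps of the Double Metaphone routine above (truncate both codes, then drop a secondary code that merely repeats the primary) can be sketched in isolation. This is a minimal sketch, not the library's code: `finalize_codes` is a hypothetical name, and it treats a falsy `maxlength` as "no limit" rather than comparing against the module's `_INFINITY` constant.

```python
def finalize_codes(primary, secondary, maxlength=None):
    """Truncate both codes, then blank a secondary that adds nothing.

    Hypothetical helper for illustration; a falsy maxlength means no limit.
    """
    if maxlength:
        primary = primary[:maxlength]
        secondary = secondary[:maxlength]
    # The comparison runs after truncation, so two codes that differ
    # only beyond maxlength are treated as identical.
    if primary == secondary:
        secondary = ''
    return primary, secondary


print(finalize_codes('ABCDE', 'ABCDF', 4))
```

Because the equality check comes after truncation, a short `maxlength` can erase a secondary code that would otherwise have been distinct.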
1758
1759
1760
def caverphone(word, version=2):
1761
    """Return the Caverphone code for a word.
1762
1763
    A description of version 1 of the algorithm can be found at:
1764
    http://caversham.otago.ac.nz/files/working/ctp060902.pdf
1765
1766
    A description of version 2 of the algorithm can be found at:
1767
    http://caversham.otago.ac.nz/files/working/ctp150804.pdf
1768
1769
    :param str word: the word to transform
1770
    :param int version: the version of Caverphone to employ for encoding
1771
        (defaults to 2)
1772
    :returns: the Caverphone value
1773
    :rtype: str
1774
1775
    >>> caverphone('Christopher')
1776
    'KRSTFA1111'
1777
    >>> caverphone('Niall')
1778
    'NA11111111'
1779
    >>> caverphone('Smith')
1780
    'SMT1111111'
1781
    >>> caverphone('Schmidt')
1782
    'SKMT111111'
1783
1784
    >>> caverphone('Christopher', 1)
1785
    'KRSTF1'
1786
    >>> caverphone('Niall', 1)
1787
    'N11111'
1788
    >>> caverphone('Smith', 1)
1789
    'SMT111'
1790
    >>> caverphone('Schmidt', 1)
1791
    'SKMT11'
1792
    """
1793
    _vowels = {'a', 'e', 'i', 'o', 'u'}
1794
1795
    word = word.lower()
1796
    word = ''.join(c for c in word if c in
1797
                   {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
1798
                    'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x',
1799
                    'y', 'z'})
1800
1801
    def _squeeze_replace(word, char, new_char):
1802
        """Convert strings of char in word to one instance of new_char."""
1803
        while char * 2 in word:
1804
            word = word.replace(char * 2, char)
1805
        return word.replace(char, new_char)
1806
1807
    # the main replacement algorithm
1808
    if version != 1 and word[-1:] == 'e':
1809
        word = word[:-1]
1810
    if word:
1811
        if word[:5] == 'cough':
1812
            word = 'cou2f'+word[5:]
1813
        if word[:5] == 'rough':
1814
            word = 'rou2f'+word[5:]
1815
        if word[:5] == 'tough':
1816
            word = 'tou2f'+word[5:]
1817
        if word[:6] == 'enough':
1818
            word = 'enou2f'+word[6:]
1819
        if version != 1 and word[:6] == 'trough':
1820
            word = 'trou2f'+word[6:]
1821
        if word[:2] == 'gn':
1822
            word = '2n'+word[2:]
1823
        if word[-2:] == 'mb':
1824
            word = word[:-1]+'2'
1825
        word = word.replace('cq', '2q')
1826
        word = word.replace('ci', 'si')
1827
        word = word.replace('ce', 'se')
1828
        word = word.replace('cy', 'sy')
1829
        word = word.replace('tch', '2ch')
1830
        word = word.replace('c', 'k')
1831
        word = word.replace('q', 'k')
1832
        word = word.replace('x', 'k')
1833
        word = word.replace('v', 'f')
1834
        word = word.replace('dg', '2g')
1835
        word = word.replace('tio', 'sio')
1836
        word = word.replace('tia', 'sia')
1837
        word = word.replace('d', 't')
1838
        word = word.replace('ph', 'fh')
1839
        word = word.replace('b', 'p')
1840
        word = word.replace('sh', 's2')
1841
        word = word.replace('z', 's')
1842
        if word[0] in _vowels:
1843
            word = 'A'+word[1:]
1844
        word = word.replace('a', '3')
1845
        word = word.replace('e', '3')
1846
        word = word.replace('i', '3')
1847
        word = word.replace('o', '3')
1848
        word = word.replace('u', '3')
1849
        if version != 1:
1850
            word = word.replace('j', 'y')
1851
            if word[:2] == 'y3':
1852
                word = 'Y3'+word[2:]
1853
            if word[:1] == 'y':
1854
                word = 'A'+word[1:]
1855
            word = word.replace('y', '3')
1856
        word = word.replace('3gh3', '3kh3')
1857
        word = word.replace('gh', '22')
1858
        word = word.replace('g', 'k')
1859
1860
        word = _squeeze_replace(word, 's', 'S')
1861
        word = _squeeze_replace(word, 't', 'T')
1862
        word = _squeeze_replace(word, 'p', 'P')
1863
        word = _squeeze_replace(word, 'k', 'K')
1864
        word = _squeeze_replace(word, 'f', 'F')
1865
        word = _squeeze_replace(word, 'm', 'M')
1866
        word = _squeeze_replace(word, 'n', 'N')
1867
1868
        word = word.replace('w3', 'W3')
1869
        if version == 1:
1870
            word = word.replace('wy', 'Wy')
1871
        word = word.replace('wh3', 'Wh3')
1872
        if version == 1:
1873
            word = word.replace('why', 'Why')
1874
        if version != 1 and word[-1:] == 'w':
1875
            word = word[:-1]+'3'
1876
        word = word.replace('w', '2')
1877
        if word[:1] == 'h':
1878
            word = 'A'+word[1:]
1879
        word = word.replace('h', '2')
1880
        word = word.replace('r3', 'R3')
1881
        if version == 1:
1882
            word = word.replace('ry', 'Ry')
1883
        if version != 1 and word[-1:] == 'r':
1884
            word = word[:-1]+'3'
1885
        word = word.replace('r', '2')
1886
        word = word.replace('l3', 'L3')
1887
        if version == 1:
1888
            word = word.replace('ly', 'Ly')
1889
        if version != 1 and word[-1:] == 'l':
1890
            word = word[:-1]+'3'
1891
        word = word.replace('l', '2')
1892
        if version == 1:
1893
            word = word.replace('j', 'y')
1894
            word = word.replace('y3', 'Y3')
1895
            word = word.replace('y', '2')
1896
        word = word.replace('2', '')
1897
        if version != 1 and word[-1:] == '3':
1898
            word = word[:-1]+'A'
1899
        word = word.replace('3', '')
1900
1901
    # pad with 1s, then extract the necessary length of code
1902
    word = word+'1'*10
1903
    if version != 1:
1904
        word = word[:10]
1905
    else:
1906
        word = word[:6]
1907
1908
    return word
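Two small steps in `caverphone` carry most of its bookkeeping: collapsing runs of a letter before recoding it, and padding the result with `1`s to a fixed width. The sketch below restates them as standalone functions. `squeeze_replace` mirrors the nested `_squeeze_replace` helper, while `pad_code` is a hypothetical name introduced here for the pad-and-slice step.

```python
def squeeze_replace(word, char, new_char):
    """Collapse every run of char to one occurrence, then recode it."""
    while char * 2 in word:
        word = word.replace(char * 2, char)
    return word.replace(char, new_char)


def pad_code(code, length=10, pad='1'):
    """Right-pad code with pad characters and cut it to length.

    Hypothetical helper: length 10 corresponds to version 2 and
    length 6 to version 1 in the function above.
    """
    return (code + pad * length)[:length]


print(squeeze_replace('mississippi', 's', 'S'))
print(pad_code('KRSTFA'))
```

Appending a full `length` of padding before slicing avoids a separate branch for codes that are already long enough.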
1909
1910
1911
def alpha_sis(word, maxlength=14):
1912
    """Return the IBM Alpha Search Inquiry System code for a word.
1913
1914
    Based on the algorithm described in "Accessing individual records from
1915
    personal data files using non-unique identifiers" / Gwendolyn B. Moore,
1916
    et al.; prepared for the Institute for Computer Sciences and Technology,
1917
    National Bureau of Standards, Washington, D.C (1977):
1918
    https://archive.org/stream/accessingindivid00moor#page/15/mode/1up
1919
1920
    A collection is necessary since there can be multiple values for a
1921
    single word. But the collection must be ordered since the first value
1922
    is the primary coding.
1923
1924
    :param str word: the word to transform
1925
    :param int maxlength: the length of the code returned (defaults to 14)
1926
    :returns: the Alpha SIS value
1927
    :rtype: tuple
1928
1929
    >>> alpha_sis('Christopher')
1930
    ('06401840000000', '07040184000000', '04018400000000')
1931
    >>> alpha_sis('Niall')
1932
    ('02500000000000',)
1933
    >>> alpha_sis('Smith')
1934
    ('03100000000000',)
1935
    >>> alpha_sis('Schmidt')
1936
    ('06310000000000',)
1937
    """
1938
    _alpha_sis_initials = {'GF': '08', 'GM': '03', 'GN': '02', 'KN': '02',
1939
                           'PF': '08', 'PN': '02', 'PS': '00', 'WR': '04',
1940
                           'A': '1', 'E': '1', 'H': '2', 'I': '1', 'J': '3',
1941
                           'O': '1', 'U': '1', 'W': '4', 'Y': '5'}
1942
    _alpha_sis_initials_order = ('GF', 'GM', 'GN', 'KN', 'PF', 'PN', 'PS',
1943
                                 'WR', 'A', 'E', 'H', 'I', 'J', 'O', 'U', 'W',
1944
                                 'Y')
1945
    _alpha_sis_basic = {'SCH': '6', 'CZ': ('70', '6', '0'),
1946
                        'CH': ('6', '70', '0'), 'CK': ('7', '6'),
1947
                        'DS': ('0', '10'), 'DZ': ('0', '10'),
1948
                        'TS': ('0', '10'), 'TZ': ('0', '10'), 'CI': '0',
1949
                        'CY': '0', 'CE': '0', 'SH': '6', 'DG': '7', 'PH': '8',
1950
                        'C': ('7', '6'), 'K': ('7', '6'), 'Z': '0', 'S': '0',
1951
                        'D': '1', 'T': '1', 'N': '2', 'M': '3', 'R': '4',
1952
                        'L': '5', 'J': '6', 'G': '7', 'Q': '7', 'X': '7',
1953
                        'F': '8', 'V': '8', 'B': '9', 'P': '9'}
1954
    _alpha_sis_basic_order = ('SCH', 'CZ', 'CH', 'CK', 'DS', 'DZ', 'TS', 'TZ',
1955
                              'CI', 'CY', 'CE', 'SH', 'DG', 'PH', 'C', 'K',
1956
                              'Z', 'S', 'D', 'T', 'N', 'M', 'R', 'L', 'J', 'C',
1957
                              'G', 'K', 'Q', 'X', 'F', 'V', 'B', 'P')
1958
1959
    alpha = ['']
1960
    pos = 0
1961
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
1962
    word = word.replace('ß', 'SS')
1963
    word = ''.join(c for c in word if c in
1964
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
1965
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
1966
                    'Y', 'Z'})
1967
1968
    # Clamp maxlength to [4, 64]
1969
    if maxlength is not None:
1970
        maxlength = min(max(4, maxlength), 64)
1971
    else:
1972
        maxlength = 64
1973
1974
    # Do special processing for initial substrings
1975
    for k in _alpha_sis_initials_order:
1976
        if word.startswith(k):
1977
            alpha[0] += _alpha_sis_initials[k]
1978
            pos += len(k)
1979
            break
1980
1981
    # Add a '0' if alpha is still empty
1982
    if not alpha[0]:
1983
        alpha[0] += '0'
1984
1985
    # Whether or not any special initial codes were encoded, iterate
1986
    # through the length of the word in the main encoding loop
1987
    while pos < len(word):
1988
        origpos = pos
1989
        for k in _alpha_sis_basic_order:
1990
            if word[pos:].startswith(k):
1991
                if isinstance(_alpha_sis_basic[k], tuple):
1992
                    newalpha = []
1993
                    for i in range(len(_alpha_sis_basic[k])):
1994
                        newalpha += [_ + _alpha_sis_basic[k][i] for _ in alpha]
1995
                    alpha = newalpha
1996
                else:
1997
                    alpha = [_ + _alpha_sis_basic[k] for _ in alpha]
1998
                pos += len(k)
1999
                break
2000
        if pos == origpos:
2001
            alpha = [_ + '_' for _ in alpha]
2002
            pos += 1
2003
2004
    # Trim doublets and placeholders
2005
    for i in range(len(alpha)):
2006
        pos = 1
2007
        while pos < len(alpha[i]):
2008
            if alpha[i][pos] == alpha[i][pos-1]:
2009
                alpha[i] = alpha[i][:pos]+alpha[i][pos+1:]
2010
            pos += 1
2011
    alpha = (_.replace('_', '') for _ in alpha)
2012
2013
    # Trim codes and return tuple
2014
    alpha = ((_ + ('0'*maxlength))[:maxlength] for _ in alpha)
2015
    return tuple(alpha)
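The reason `alpha_sis` returns a tuple is the fan-out in its main loop: when a letter group maps to several alternative digit strings, every partial code is duplicated once per alternative. A minimal sketch of that single step follows; `extend_codes` is a hypothetical name, and the ordering (all codes for the first alternative, then all codes for the second, and so on) is chosen to match the loop above so that the primary coding stays first.

```python
def extend_codes(codes, digits):
    """Append digits to every partial code.

    digits is either one string or a tuple of alternatives; a tuple
    multiplies the number of codes. Hypothetical helper for illustration.
    """
    if isinstance(digits, tuple):
        # Outer loop over alternatives, inner loop over existing codes.
        return [code + d for d in digits for code in codes]
    return [code + digits for code in codes]


codes = extend_codes(['0'], ('6', '70', '0'))   # 'CH' has three readings
codes = extend_codes(codes, '4')                # 'R' has a single reading
print(codes)
```

Each ambiguous group multiplies the number of candidate codes, so a word with several such groups can yield many results.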
2016
2017
2018
def fuzzy_soundex(word, maxlength=5, zero_pad=True):
    """Return the Fuzzy Soundex code for a word.

    Fuzzy Soundex is an algorithm derived from Soundex, defined in:
    Holmes, David and M. Catherine McCabe. "Improving Precision and Recall for
    Soundex Retrieval."
    http://wayback.archive.org/web/20100629121128/http://www.ir.iit.edu/publications/downloads/IEEESoundexV5.pdf

    :param str word: the word to transform
    :param int maxlength: the length of the code returned (defaults to 5)
    :param bool zero_pad: pad the end of the return value with 0s to achieve
        a maxlength string
    :returns: the Fuzzy Soundex value
    :rtype: str

    >>> fuzzy_soundex('Christopher')
    'K6931'
    >>> fuzzy_soundex('Niall')
    'N4000'
    >>> fuzzy_soundex('Smith')
    'S5300'
    >>> fuzzy_soundex('Schmidt')
    'S5300'
    """
2042
    _fuzzy_soundex_translation = dict(zip((ord(_) for _ in
2043
                                           'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
2044
                                          '0193017-07745501769301-7-9'))
2045
2046
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
2047
    word = word.replace('ß', 'SS')
2048
2049
    # Clamp maxlength to [4, 64]
2050
    if maxlength is not None:
2051
        maxlength = min(max(4, maxlength), 64)
2052
    else:
2053
        maxlength = 64
2054
2055
    if not word:
2056
        if zero_pad:
2057
            return '0' * maxlength
2058
        return '0'
2059
2060
    if word[:2] in {'CS', 'CZ', 'TS', 'TZ'}:
2061
        word = 'SS' + word[2:]
2062
    elif word[:2] == 'GN':
2063
        word = 'NN' + word[2:]
2064
    elif word[:2] in {'HR', 'WR'}:
2065
        word = 'RR' + word[2:]
2066
    elif word[:2] == 'HW':
2067
        word = 'WW' + word[2:]
2068
    elif word[:2] in {'KN', 'NG'}:
2069
        word = 'NN' + word[2:]
2070
2071
    if word[-2:] == 'CH':
2072
        word = word[:-2] + 'KK'
2073
    elif word[-2:] == 'NT':
2074
        word = word[:-2] + 'TT'
2075
    elif word[-2:] == 'RT':
2076
        word = word[:-2] + 'RR'
2077
    elif word[-3:] == 'RDT':
2078
        word = word[:-3] + 'RR'
2079
2080
    word = word.replace('CA', 'KA')
2081
    word = word.replace('CC', 'KK')
2082
    word = word.replace('CK', 'KK')
2083
    word = word.replace('CE', 'SE')
2084
    word = word.replace('CHL', 'KL')
2085
    word = word.replace('CL', 'KL')
2086
    word = word.replace('CHR', 'KR')
2087
    word = word.replace('CR', 'KR')
2088
    word = word.replace('CI', 'SI')
2089
    word = word.replace('CO', 'KO')
2090
    word = word.replace('CU', 'KU')
2091
    word = word.replace('CY', 'SY')
2092
    word = word.replace('DG', 'GG')
2093
    word = word.replace('GH', 'HH')
2094
    word = word.replace('MAC', 'MK')
2095
    word = word.replace('MC', 'MK')
2096
    word = word.replace('NST', 'NSS')
2097
    word = word.replace('PF', 'FF')
2098
    word = word.replace('PH', 'FF')
2099
    word = word.replace('SCH', 'SSS')
2100
    word = word.replace('TIO', 'SIO')
2101
    word = word.replace('TIA', 'SIO')
2102
    word = word.replace('TCH', 'CHH')
2103
2104
    sdx = word.translate(_fuzzy_soundex_translation)
2105
    sdx = sdx.replace('-', '')
2106
2107
    # remove repeating characters
2108
    sdx = _delete_consecutive_repeats(sdx)
2109
2110
    if word[0] in {'H', 'W', 'Y'}:
2111
        sdx = word[0] + sdx
2112
    else:
2113
        sdx = word[0] + sdx[1:]
2114
2115
    sdx = sdx.replace('0', '')
2116
2117
    if zero_pad:
2118
        sdx += ('0'*maxlength)
2119
2120
    return sdx[:maxlength]
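The digit-mapping core of `fuzzy_soundex` is a character translation followed by removal of adjacent repeats. The sketch below isolates those two steps. It assumes that `_delete_consecutive_repeats`, which is defined elsewhere in the module, collapses each run of identical characters to one; the `itertools.groupby` version here is a stand-in written for illustration, and `digits_for` is a hypothetical name.

```python
from itertools import groupby

# Same letter-to-digit table as in the function above; '-' marks
# letters (H, W, Y) that are dropped after translation.
TABLE = dict(zip((ord(c) for c in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
                 '0193017-07745501769301-7-9'))


def delete_consecutive_repeats(text):
    """Collapse each run of identical characters to a single one."""
    return ''.join(key for key, _run in groupby(text))


def digits_for(word):
    """Translate letters to digits, drop '-', and collapse repeats."""
    sdx = word.upper().translate(TABLE).replace('-', '')
    return delete_consecutive_repeats(sdx)


print(digits_for('SMITH'), digits_for('SSSMIDT'))
```

`'SSSMIDT'` is what `'SCHMIDT'` becomes after the `SCH` to `SSS` substitution, so both inputs reduce to digit strings that differ only in a trailing repeat.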
2121
2122
2123
def phonex(word, maxlength=4, zero_pad=True):
2124
    """Return the Phonex code for a word.
2125
2126
    Phonex is an algorithm derived from Soundex, defined in:
2127
    Lait, A. J. and B. Randell. "An Assessment of Name Matching Algorithms".
2128
    http://homepages.cs.ncl.ac.uk/brian.randell/Genealogy/NameMatching.pdf
2129
2130
    :param str word: the word to transform
2131
    :param int maxlength: the length of the code returned (defaults to 4)
2132
    :param bool zero_pad: pad the end of the return value with 0s to achieve
2133
        a maxlength string
2134
    :returns: the Phonex value
2135
    :rtype: str
2136
2137
    >>> phonex('Christopher')
2138
    'C623'
2139
    >>> phonex('Niall')
2140
    'N400'
2141
    >>> phonex('Schmidt')
2142
    'S253'
2143
    >>> phonex('Smith')
2144
    'S530'
2145
    """
2146
    name = unicodedata.normalize('NFKD', text_type(word.upper()))
2147
    name = name.replace('ß', 'SS')
2148
2149
    # Clamp maxlength to [4, 64]
2150
    if maxlength is not None:
2151
        maxlength = min(max(4, maxlength), 64)
2152
    else:
2153
        maxlength = 64
2154
2155
    name_code = last = ''
2156
2157
    # Deletions effected by replacing with next letter which
2158
    # will be ignored due to duplicate handling of Soundex code.
2159
    # This is faster than 'moving' all subsequent letters.
2160
2161
    # Remove any trailing Ss
2162
    while name[-1:] == 'S':
2163
        name = name[:-1]
2164
2165
    # Phonetic equivalents of first 2 characters
2166
    # Works since duplicate letters are ignored
2167
    if name[:2] == 'KN':
2168
        name = 'N' + name[2:]  # KN.. == N..
2169
    elif name[:2] == 'PH':
2170
        name = 'F' + name[2:]  # PH.. == F.. (H ignored anyway)
2171
    elif name[:2] == 'WR':
2172
        name = 'R' + name[2:]  # WR.. == R..
2173
2174
    if name:
2175
        # Special case, ignore H first letter (subsequent Hs ignored anyway)
2176
        # Works since duplicate letters are ignored
2177
        if name[0] == 'H':
2178
            name = name[1:]
2179
2180
    if name:
2181
        # Phonetic equivalents of first character
2182
        if name[0] in {'A', 'E', 'I', 'O', 'U', 'Y'}:
2183
            name = 'A' + name[1:]
2184
        elif name[0] in {'B', 'P'}:
2185
            name = 'B' + name[1:]
2186
        elif name[0] in {'V', 'F'}:
2187
            name = 'F' + name[1:]
2188
        elif name[0] in {'C', 'K', 'Q'}:
2189
            name = 'C' + name[1:]
2190
        elif name[0] in {'G', 'J'}:
2191
            name = 'G' + name[1:]
2192
        elif name[0] in {'S', 'Z'}:
2193
            name = 'S' + name[1:]
2194
2195
        name_code = last = name[0]
2196
2197
    # MODIFIED SOUNDEX CODE
2198
    for i in range(1, len(name)):
2199
        code = '0'
2200
        if name[i] in {'B', 'F', 'P', 'V'}:
2201
            code = '1'
2202
        elif name[i] in {'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'}:
2203
            code = '2'
2204
        elif name[i] in {'D', 'T'}:
2205
            if name[i+1:i+2] != 'C':
2206
                code = '3'
2207
        elif name[i] == 'L':
2208
            if (name[i+1:i+2] in {'A', 'E', 'I', 'O', 'U', 'Y'} or
2209
                    i+1 == len(name)):
2210
                code = '4'
2211
        elif name[i] in {'M', 'N'}:
2212
            if name[i+1:i+2] in {'D', 'G'}:
2213
                name = name[:i+1] + name[i] + name[i+2:]
2214
            code = '5'
2215
        elif name[i] == 'R':
2216
            if (name[i+1:i+2] in {'A', 'E', 'I', 'O', 'U', 'Y'} or
2217
                    i+1 == len(name)):
2218
                code = '6'
2219
2220
        if code != last and code != '0' and i != 0:
2221
            name_code += code
2222
2223
        last = name_code[-1]
2224
2225
    if zero_pad:
2226
        name_code += '0' * maxlength
2227
    if not name_code:
2228
        name_code = '0'
2229
    return name_code[:maxlength]
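`alpha_sis`, `fuzzy_soundex` and `phonex` all clamp `maxlength` to the range 4 to 64 with the same five lines, and `phonex` finishes with a pad, a fallback for empty codes, and a slice. One way this could be factored out is sketched below. Both function names are hypothetical and nothing like them is shown in the module.

```python
def clamp_maxlength(maxlength, low=4, high=64):
    """Clamp maxlength to [low, high]; None means 'use the upper bound'."""
    if maxlength is None:
        return high
    return min(max(low, maxlength), high)


def finish_code(code, maxlength, zero_pad=True):
    """Pad with zeros if requested, fall back to '0', then truncate."""
    if zero_pad:
        code += '0' * maxlength
    if not code:
        code = '0'
    return code[:maxlength]


print(clamp_maxlength(2), clamp_maxlength(None))
print(finish_code('S53', clamp_maxlength(4)))
```

Keeping the clamp in one place would also make it harder for a docstring default to drift from the signature.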
2230
2231
2232
def phonem(word):
2233
    """Return the Phonem code for a word.
2234
2235
    Phonem is defined in Wilde, Georg and Carsten Meyer. 1999. "Doppelgaenger
2236
    gesucht - Ein Programm fuer kontextsensitive phonetische Textumwandlung."
2237
    ct Magazin fuer Computer & Technik 25/1999.
2238
2239
    This version is based on the Perl implementation documented at:
2240
    http://phonetik.phil-fak.uni-koeln.de/fileadmin/home/ritters/Allgemeine_Dateien/Martin_Wilz.pdf
2241
    It includes some enhancements presented in the Java port at:
2242
    https://github.com/dcm4che/dcm4che/blob/master/dcm4che-soundex/src/main/java/org/dcm4che3/soundex/Phonem.java
2243
2244
    Phonem is intended chiefly for German names/words.
2245
2246
    :param str word: the word to transform
2247
    :returns: the Phonem value
2248
    :rtype: str
2249
2250
    >>> phonem('Christopher')
2251
    'CRYSDOVR'
2252
    >>> phonem('Niall')
2253
    'NYAL'
2254
    >>> phonem('Smith')
2255
    'SMYD'
2256
    >>> phonem('Schmidt')
2257
    'CMYD'
2258
    """
2259
    _phonem_substitutions = (('SC', 'C'), ('SZ', 'C'), ('CZ', 'C'),
2260
                             ('TZ', 'C'), ('TS', 'C'), ('KS', 'X'),
2261
                             ('PF', 'V'), ('QU', 'KW'), ('PH', 'V'),
2262
                             ('UE', 'Y'), ('AE', 'E'), ('OE', 'Ö'),
2263
                             ('EI', 'AY'), ('EY', 'AY'), ('EU', 'OY'),
2264
                             ('AU', 'A§'), ('OU', '§'))
2265
    _phonem_translation = dict(zip((ord(_) for _ in
2266
                                    'ZKGQÇÑßFWPTÁÀÂÃÅÄÆÉÈÊËIJÌÍÎÏÜݧÚÙÛÔÒÓÕØ'),
2267
                                   'CCCCCNSVVBDAAAAAEEEEEEYYYYYYYYUUUUOOOOÖ'))
2268
2269
    word = unicodedata.normalize('NFC', text_type(word.upper()))
2270
    for i, j in _phonem_substitutions:
2271
        word = word.replace(i, j)
2272
    word = word.translate(_phonem_translation)
2273
2274
    return ''.join(c for c in _delete_consecutive_repeats(word)
2275
                   if c in {'A', 'B', 'C', 'D', 'L', 'M', 'N', 'O', 'R', 'S',
2276
                            'U', 'V', 'W', 'X', 'Y', 'Ö'})
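`phonem` applies its substitution pairs strictly in sequence, so an earlier rule can consume letters that a later rule would otherwise have matched. The sketch below shows that order dependence on a two-rule subset of the table. `apply_substitutions` is a hypothetical name, and the reversed rule order is included only to show the contrast.

```python
def apply_substitutions(word, substitutions):
    """Apply (source, target) pairs one after another, in the given order."""
    for src, tar in substitutions:
        word = word.replace(src, tar)
    return word


# Order used in the function above: 'QU' is rewritten before 'UE'.
in_order = (('QU', 'KW'), ('UE', 'Y'))
# Reversed order, for contrast only.
reversed_order = (('UE', 'Y'), ('QU', 'KW'))

print(apply_substitutions('QUELLE', in_order))
print(apply_substitutions('QUELLE', reversed_order))
```

This is why the substitutions are stored as an ordered tuple of pairs rather than a dictionary.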
2277
2278
2279
def phonix(word, maxlength=4, zero_pad=True):
2280
    """Return the Phonix code for a word.
2281
2282
    Phonix is a Soundex-like algorithm defined in:
2283
    T.N. Gadd: PHONIX --- The Algorithm, Program 24/4, 1990, p.363-366.
2284
2285
    This implementation is based on
2286
    http://cpansearch.perl.org/src/ULPFR/WAIT-1.800/soundex.c
2287
    http://cs.anu.edu.au/people/Peter.Christen/Febrl/febrl-0.4.01/encode.py
2288
    and
2289
    https://metacpan.org/pod/Text::Phonetic::Phonix
2290
2291
    :param str word: the word to transform
2292
    :param int maxlength: the length of the code returned (defaults to 4)
2293
    :param bool zero_pad: pad the end of the return value with 0s to achieve
2294
        a maxlength string
2295
    :returns: the Phonix value
2296
    :rtype: str
2297
2298
    >>> phonix('Christopher')
2299
    'K683'
2300
    >>> phonix('Niall')
2301
    'N400'
2302
    >>> phonix('Smith')
2303
    'S530'
2304
    >>> phonix('Schmidt')
2305
    'S530'
2306
    """
2307
    # pylint: disable=too-many-branches
    def _start_repl(word, src, tar, post=None):
        r"""Replace src with tar at the start of word.

        If post is given, src must be followed by one of its members.
        """
        if post:
            for i in post:
                if word.startswith(src+i):
                    return tar + word[len(src):]
        elif word.startswith(src):
            return tar + word[len(src):]
        return word

    def _end_repl(word, src, tar, pre=None):
        r"""Replace src with tar at the end of word.

        If pre is given, src must be preceded by one of its members.
        """
        if pre:
            for i in pre:
                if word.endswith(i+src):
                    return word[:-len(src)] + tar
        elif word.endswith(src):
            return word[:-len(src)] + tar
        return word

    def _mid_repl(word, src, tar, pre=None, post=None):
        r"""Replace src with tar in the middle of word."""
        if pre or post:
            if not pre:
                return word[0] + _all_repl(word[1:], src, tar, pre, post)
            elif not post:
                return _all_repl(word[:-1], src, tar, pre, post) + word[-1]
            return _all_repl(word, src, tar, pre, post)
        return (word[0] + _all_repl(word[1:-1], src, tar, pre, post) +
                word[-1])

    def _all_repl(word, src, tar, pre=None, post=None):
        r"""Replace src with tar anywhere in word."""
        if pre or post:
            # A missing context is treated as the empty string, which
            # matches anywhere.
            post = post if post else frozenset(('',))
            pre = pre if pre else frozenset(('',))

            for i, j in ((i, j) for i in pre for j in post):
                word = word.replace(i+src+j, i+tar+j)
            return word
        else:
            return word.replace(src, tar)

    _vow = {'A', 'E', 'I', 'O', 'U'}
    _con = {'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N', 'P', 'Q',
            'R', 'S', 'T', 'V', 'W', 'X', 'Y', 'Z'}

    _phonix_substitutions = ((_all_repl, 'DG', 'G'),
                             (_all_repl, 'CO', 'KO'),
                             (_all_repl, 'CA', 'KA'),
                             (_all_repl, 'CU', 'KU'),
                             (_all_repl, 'CY', 'SI'),
                             (_all_repl, 'CI', 'SI'),
                             (_all_repl, 'CE', 'SE'),
                             (_start_repl, 'CL', 'KL', _vow),
                             (_all_repl, 'CK', 'K'),
                             (_end_repl, 'GC', 'K'),
                             (_end_repl, 'JC', 'K'),
                             (_start_repl, 'CHR', 'KR', _vow),
                             (_start_repl, 'CR', 'KR', _vow),
                             (_start_repl, 'WR', 'R'),
                             (_all_repl, 'NC', 'NK'),
                             (_all_repl, 'CT', 'KT'),
                             (_all_repl, 'PH', 'F'),
                             (_all_repl, 'AA', 'AR'),
                             (_all_repl, 'SCH', 'SH'),
                             (_all_repl, 'BTL', 'TL'),
                             (_all_repl, 'GHT', 'T'),
                             (_all_repl, 'AUGH', 'ARF'),
                             (_mid_repl, 'LJ', 'LD', _vow, _vow),
                             (_all_repl, 'LOUGH', 'LOW'),
                             (_start_repl, 'Q', 'KW'),
                             (_start_repl, 'KN', 'N'),
                             (_end_repl, 'GN', 'N'),
                             (_all_repl, 'GHN', 'N'),
                             (_end_repl, 'GNE', 'N'),
                             (_all_repl, 'GHNE', 'NE'),
                             (_end_repl, 'GNES', 'NS'),
                             (_start_repl, 'GN', 'N'),
                             (_mid_repl, 'GN', 'N', None, _con),
                             (_end_repl, 'GN', 'N'),
                             (_start_repl, 'PS', 'S'),
                             (_start_repl, 'PT', 'T'),
                             (_start_repl, 'CZ', 'C'),
                             (_mid_repl, 'WZ', 'Z', _vow),
                             (_mid_repl, 'CZ', 'CH'),
                             (_all_repl, 'LZ', 'LSH'),
                             (_all_repl, 'RZ', 'RSH'),
                             (_mid_repl, 'Z', 'S', None, _vow),
                             (_all_repl, 'ZZ', 'TS'),
                             (_mid_repl, 'Z', 'TS', _con),
                             (_all_repl, 'HROUG', 'REW'),
                             (_all_repl, 'OUGH', 'OF'),
                             (_mid_repl, 'Q', 'KW', _vow, _vow),
                             (_mid_repl, 'J', 'Y', _vow, _vow),
                             (_start_repl, 'YJ', 'Y', _vow),
                             (_start_repl, 'GH', 'G'),
                             (_end_repl, 'GH', 'E', _vow),
                             (_start_repl, 'CY', 'S'),
                             (_all_repl, 'NX', 'NKS'),
                             (_start_repl, 'PF', 'F'),
                             (_end_repl, 'DT', 'T'),
                             (_end_repl, 'TL', 'TIL'),
                             (_end_repl, 'DL', 'DIL'),
                             (_all_repl, 'YTH', 'ITH'),
                             (_start_repl, 'TJ', 'CH', _vow),
                             (_start_repl, 'TSJ', 'CH', _vow),
                             (_start_repl, 'TS', 'T', _vow),
                             (_all_repl, 'TCH', 'CH'),
                             (_mid_repl, 'WSK', 'VSKIE', _vow),
                             (_end_repl, 'WSK', 'VSKIE', _vow),
                             (_start_repl, 'MN', 'N', _vow),
                             (_start_repl, 'PN', 'N', _vow),
                             (_mid_repl, 'STL', 'SL', _vow),
                             (_end_repl, 'STL', 'SL', _vow),
                             (_end_repl, 'TNT', 'ENT'),
                             (_end_repl, 'EAUX', 'OH'),
                             (_all_repl, 'EXCI', 'ECS'),
                             (_all_repl, 'X', 'ECS'),
                             (_end_repl, 'NED', 'ND'),
                             (_all_repl, 'JR', 'DR'),
                             (_end_repl, 'EE', 'EA'),
                             (_all_repl, 'ZS', 'S'),
                             (_mid_repl, 'R', 'AH', _vow, _con),
                             (_end_repl, 'R', 'AH', _vow),
                             (_mid_repl, 'HR', 'AH', _vow, _con),
                             (_end_repl, 'HR', 'AH', _vow),
                             (_end_repl, 'HR', 'AH', _vow),
                             (_end_repl, 'RE', 'AR'),
                             (_end_repl, 'R', 'AH', _vow),
                             (_all_repl, 'LLE', 'LE'),
                             (_end_repl, 'LE', 'ILE', _con),
                             (_end_repl, 'LES', 'ILES', _con),
                             (_end_repl, 'E', ''),
                             (_end_repl, 'ES', 'S'),
                             (_end_repl, 'SS', 'AS', _vow),
                             (_end_repl, 'MB', 'M', _vow),
                             (_all_repl, 'MPTS', 'MPS'),
                             (_all_repl, 'MPS', 'MS'),
                             (_all_repl, 'MPT', 'MT'))

    _phonix_translation = dict(zip((ord(_) for _ in
                                    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
                                   '01230720022455012683070808'))

    sdx = ''

    # Uppercase, decompose accented characters, and keep only A-Z
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
    word = word.replace('ß', 'SS')
    word = ''.join(c for c in word if c in
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
                    'Y', 'Z'})
    if word:
        # Apply each (function, *args) substitution rule in order
        for trans in _phonix_substitutions:
            word = trans[0](word, *trans[1:])
        # An initial vowel (or Y) is coded as 'v'; any other initial letter
        # is kept, and the remainder is translated to digits
        if word[0] in {'A', 'E', 'I', 'O', 'U', 'Y'}:
            sdx = 'v' + word[1:].translate(_phonix_translation)
        else:
            sdx = word[0] + word[1:].translate(_phonix_translation)
        sdx = _delete_consecutive_repeats(sdx)
        sdx = sdx.replace('0', '')

    # Clamp maxlength to [4, 64]
    if maxlength is not None:
        maxlength = min(max(4, maxlength), 64)
    else:
        maxlength = 64

    if zero_pad:
        sdx += '0' * maxlength
    if not sdx:
        sdx = '0'
    return sdx[:maxlength]
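

# Illustrative sketch, not part of the Abydos API. It isolates the final,
# Soundex-like stage of phonix() above: keep the first letter (an initial
# vowel or Y becomes 'v'), translate the remainder to digits, collapse runs
# of repeated characters, drop the zeros, then pad and truncate. The names
# _sketch_collapse_repeats and _sketch_phonix_tail are hypothetical.
# _sketch_collapse_repeats only assumes what _delete_consecutive_repeats()
# does, since that helper is defined elsewhere in this module. The rule
# substitutions and the maxlength clamping of phonix() are left out.
def _sketch_collapse_repeats(text):
    """Collapse each run of identical characters to one (assumed behavior)."""
    from itertools import groupby
    return ''.join(char for char, _ in groupby(text))


def _sketch_phonix_tail(word, maxlength=4):
    """Encode an already-substituted, uppercase A-Z word (sketch only)."""
    table = dict(zip((ord(char) for char in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
                     '01230720022455012683070808'))
    if not word:
        return '0' * maxlength
    head = 'v' if word[0] in 'AEIOUY' else word[0]
    code = _sketch_collapse_repeats(head + word[1:].translate(table))
    code = code.replace('0', '')
    # Pad with zeros, then cut to the requested length
    return (code + '0' * maxlength)[:maxlength]


# For instance, _sketch_phonix_tail('SMITH') gives 'S530', the same value
# as the phonix('Smith') doctest above.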


def sfinxbis(word, maxlength=None):
    """Return the SfinxBis code for a word.

    SfinxBis is a Soundex-like algorithm defined in:
    http://www.swami.se/download/18.248ad5af12aa8136533800091/SfinxBis.pdf

    This implementation follows the reference implementation:
    http://www.swami.se/download/18.248ad5af12aa8136533800093/swamiSfinxBis.java.txt

    SfinxBis is intended chiefly for Swedish names.

    :param str word: the word to transform
    :param int maxlength: the maximum length of each code returned (defaults
        to unlimited)
    :returns: the SfinxBis value, one code per token of the name
    :rtype: tuple

    >>> sfinxbis('Christopher')
    ('K68376',)
    >>> sfinxbis('Niall')
    ('N4',)
    >>> sfinxbis('Smith')
    ('S53',)
    >>> sfinxbis('Schmidt')
    ('S53',)

    >>> sfinxbis('Johansson')
    ('J585',)
    >>> sfinxbis('Sjöberg')
    ('#162',)
    """
    adelstitler = (' DE LA ', ' DE LAS ', ' DE LOS ', ' VAN DE ', ' VAN DEN ',
                   ' VAN DER ', ' VON DEM ', ' VON DER ',
                   ' AF ', ' AV ', ' DA ', ' DE ', ' DEL ', ' DEN ', ' DES ',
                   ' DI ', ' DO ', ' DON ', ' DOS ', ' DU ', ' E ', ' IN ',
                   ' LA ', ' LE ', ' MAC ', ' MC ', ' VAN ', ' VON ', ' Y ',
                   ' S:T ')

    _harde_vokaler = {'A', 'O', 'U', 'Å'}
    _mjuka_vokaler = {'E', 'I', 'Y', 'Ä', 'Ö'}
    _konsonanter = {'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N', 'P',
                    'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Z'}
    _alfabet = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
                'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
                'Y', 'Z', 'Ä', 'Å', 'Ö'}

    _sfinxbis_translation = dict(zip((ord(_) for _ in
                                      'BCDFGHJKLMNPQRSTVZAOUÅEIYÄÖ'),
                                     '123729224551268378999999999'))

    _sfinxbis_substitutions = dict(zip((ord(_) for _ in
                                        'WZÀÁÂÃÆÇÈÉÊËÌÍÎÏÑÒÓÔÕØÙÚÛÜÝ'),
                                       'VSAAAAÄCEEEEIIIINOOOOÖUUUYY'))

    def _foersvensker(ordet):
        """Return the Swedish-ized form of the word."""
        ordet = ordet.replace('STIERN', 'STJÄRN')
        ordet = ordet.replace('HIE', 'HJ')
        ordet = ordet.replace('SIÖ', 'SJÖ')
        ordet = ordet.replace('SCH', 'SH')
        ordet = ordet.replace('QU', 'KV')
        ordet = ordet.replace('IO', 'JO')
        ordet = ordet.replace('PH', 'F')

        for i in _harde_vokaler:
            ordet = ordet.replace(i+'Ü', i+'J')
            ordet = ordet.replace(i+'Y', i+'J')
            ordet = ordet.replace(i+'I', i+'J')
        for i in _mjuka_vokaler:
            ordet = ordet.replace(i+'Ü', i+'J')
            ordet = ordet.replace(i+'Y', i+'J')
            ordet = ordet.replace(i+'I', i+'J')

        if 'H' in ordet:
            for i in _konsonanter:
                ordet = ordet.replace('H'+i, i)

        ordet = ordet.translate(_sfinxbis_substitutions)

        ordet = ordet.replace('Ð', 'ETH')
        ordet = ordet.replace('Þ', 'TH')
        ordet = ordet.replace('ß', 'SS')

        return ordet

    def _koda_foersta_ljudet(ordet):
        """Return the word with the first sound coded."""
        if ordet[0:1] in _mjuka_vokaler or ordet[0:1] in _harde_vokaler:
            ordet = '$' + ordet[1:]
        elif ordet[0:2] in ('DJ', 'GJ', 'HJ', 'LJ'):
            ordet = 'J' + ordet[2:]
        elif ordet[0:1] == 'G' and ordet[1:2] in _mjuka_vokaler:
            ordet = 'J' + ordet[1:]
        elif ordet[0:1] == 'Q':
            ordet = 'K' + ordet[1:]
        elif (ordet[0:2] == 'CH' and
              ordet[2:3] in frozenset(_mjuka_vokaler | _harde_vokaler)):
            ordet = '#' + ordet[2:]
        elif ordet[0:1] == 'C' and ordet[1:2] in _harde_vokaler:
            ordet = 'K' + ordet[1:]
        elif ordet[0:1] == 'C' and ordet[1:2] in _konsonanter:
            ordet = 'K' + ordet[1:]
        elif ordet[0:1] == 'X':
            ordet = 'S' + ordet[1:]
        elif ordet[0:1] == 'C' and ordet[1:2] in _mjuka_vokaler:
            ordet = 'S' + ordet[1:]
        elif ordet[0:3] in ('SKJ', 'STJ', 'SCH'):
            ordet = '#' + ordet[3:]
        elif ordet[0:2] in ('SH', 'KJ', 'TJ', 'SJ'):
            ordet = '#' + ordet[2:]
        elif ordet[0:2] == 'SK' and ordet[2:3] in _mjuka_vokaler:
            ordet = '#' + ordet[2:]
        elif ordet[0:1] == 'K' and ordet[1:2] in _mjuka_vokaler:
            ordet = '#' + ordet[1:]
        return ordet

    # Step 1: convert to uppercase
    word = unicodedata.normalize('NFC', text_type(word.upper()))
    word = word.replace('ß', 'SS')
    word = word.replace('-', ' ')

    # Step 2: remove nobility prefixes
    for adelstitel in adelstitler:
        while adelstitel in word:
            word = word.replace(adelstitel, ' ')
        if word.startswith(adelstitel[1:]):
            word = word[len(adelstitel)-1:]

    # Split word into tokens
    ordlista = word.split()

    # Step 3: remove doubled letters at the start of the name
    ordlista = [_delete_consecutive_repeats(ordet) for ordet in ordlista]
    if not ordlista:
        return ('',)

    # Step 4: Swedification
    ordlista = [_foersvensker(ordet) for ordet in ordlista]

    # Step 5: remove all characters that are not A-Ö (65-90,196,197,214)
    ordlista = [''.join(c for c in ordet if c in _alfabet)
                for ordet in ordlista]

    # Step 6: encode the first sound
    ordlista = [_koda_foersta_ljudet(ordet) for ordet in ordlista]

    # Step 7: split the name into two parts
    rest = [ordet[1:] for ordet in ordlista]

    # Step 8: perform the phonetic transformation on the rest
    rest = [ordet.replace('DT', 'T') for ordet in rest]
    rest = [ordet.replace('X', 'KS') for ordet in rest]

    # Step 9: encode the rest as a numeric code
    for vokal in _mjuka_vokaler:
        rest = [ordet.replace('C'+vokal, '8'+vokal) for ordet in rest]
    rest = [ordet.translate(_sfinxbis_translation) for ordet in rest]

    # Step 10: remove adjacent duplicates
    rest = [_delete_consecutive_repeats(ordet) for ordet in rest]

    # Step 11: remove all "9"s
    rest = [ordet.replace('9', '') for ordet in rest]

    # Step 12: put the parts back together
    ordlista = [''.join(ordet) for ordet in
                zip((_[0:1] for _ in ordlista), rest)]

    # truncate, if maxlength is set
    if maxlength and maxlength < _INFINITY:
        ordlista = [ordet[:maxlength] for ordet in ordlista]

    return tuple(ordlista)
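

# Illustrative sketch, not part of the Abydos API. It isolates step 2 of
# sfinxbis() above (removal of nobility prefixes) together with the token
# split that follows it. The name _sketch_strip_titles and its shortened
# default title list are assumptions made for this example; sfinxbis()
# itself uses the full adelstitler tuple. Longer titles are listed first,
# as in adelstitler, so that ' VON DER ' is tried before ' VON '.
def _sketch_strip_titles(name, titles=(' VON DER ', ' AF ', ' DE ', ' VON ')):
    """Remove space-delimited titles from a name and split it into tokens."""
    name = name.upper().replace('-', ' ')
    for title in titles:
        # A title inside the name is replaced by a single space
        while title in name:
            name = name.replace(title, ' ')
        # A title at the very start has no leading space to match on
        if name.startswith(title[1:]):
            name = name[len(title)-1:]
    return name.split()


# For instance, _sketch_strip_titles('Carl von Linne') gives
# ['CARL', 'LINNE'] and _sketch_strip_titles('De Geer') gives ['GEER'].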


def phonet(word, mode=1, lang='de', trace=False):
    """Return the phonet code for a word.

    phonet ("Hannoveraner Phonetik") was developed by Jörg Michael and
    documented in c't magazine vol. 25/1999, p. 252. It is a phonetic
    algorithm designed primarily for German.
    Cf. http://www.heise.de/ct/ftp/99/25/252/

    This is a port of Jesper Zedlitz's code, which is licensed LGPL:
    https://code.google.com/p/phonet4java/source/browse/trunk/src/main/java/com/googlecode/phonet4java/Phonet.java

    That is, in turn, based on Michael's C code, which is also licensed LGPL:
    ftp://ftp.heise.de/pub/ct/listings/phonet.zip

    :param str word: the word to transform
    :param int mode: the phonet variant to employ (1 or 2)
    :param str lang: 'de' (default) for German
            'none' for no language
    :param bool trace: prints debugging info if True
    :returns: the phonet value
    :rtype: str

    >>> phonet('Christopher')
    'KRISTOFA'
    >>> phonet('Niall')
    'NIAL'
    >>> phonet('Smith')
    'SMIT'
    >>> phonet('Schmidt')
    'SHMIT'

    >>> phonet('Christopher', mode=2)
    'KRIZTUFA'
    >>> phonet('Niall', mode=2)
    'NIAL'
    >>> phonet('Smith', mode=2)
    'ZNIT'
    >>> phonet('Schmidt', mode=2)
    'ZNIT'

    >>> phonet('Christopher', lang='none')
    'CHRISTOPHER'
    >>> phonet('Niall', lang='none')
    'NIAL'
    >>> phonet('Smith', lang='none')
    'SMITH'
    >>> phonet('Schmidt', lang='none')
    'SCHMIDT'
    """
    # pylint: disable=too-many-branches

    _phonet_rules_no_lang = (  # separator chars
        '´', ' ', ' ',
        '"', ' ', ' ',
        '`$', '', '',
        '\'', ' ', ' ',
        ',', ',', ',',
        ';', ',', ',',
        '-', ' ', ' ',
        ' ', ' ', ' ',
        '.', '.', '.',
        ':', '.', '.',
        # German umlauts
        'Ä', 'AE', 'AE',
        'Ö', 'OE', 'OE',
        'Ü', 'UE', 'UE',
        'ß', 'S', 'S',
        # international umlauts
        'À', 'A', 'A',
        'Á', 'A', 'A',
        'Â', 'A', 'A',
        'Ã', 'A', 'A',
        'Å', 'A', 'A',
        'Æ', 'AE', 'AE',
        'Ç', 'C', 'C',
        'Ð', 'DJ', 'DJ',
        'È', 'E', 'E',
        'É', 'E', 'E',
        'Ê', 'E', 'E',
        'Ë', 'E', 'E',
        'Ì', 'I', 'I',
        'Í', 'I', 'I',
        'Î', 'I', 'I',
        'Ï', 'I', 'I',
        'Ñ', 'NH', 'NH',
        'Ò', 'O', 'O',
        'Ó', 'O', 'O',
        'Ô', 'O', 'O',
        'Õ', 'O', 'O',
        'Œ', 'OE', 'OE',
        'Ø', 'OE', 'OE',
        'Š', 'SH', 'SH',
        'Þ', 'TH', 'TH',
        'Ù', 'U', 'U',
        'Ú', 'U', 'U',
        'Û', 'U', 'U',
        'Ý', 'Y', 'Y',
        'Ÿ', 'Y', 'Y',
        # 'normal' letters (A-Z)
        'MC^', 'MAC', 'MAC',
        'MC^', 'MAC', 'MAC',
        'M´^', 'MAC', 'MAC',
        'M\'^', 'MAC', 'MAC',
        'O´^', 'O', 'O',
        'O\'^', 'O', 'O',
        'VAN DEN ^', 'VANDEN', 'VANDEN',
        None, None, None)

    _phonet_rules_german = (  # separator chars
        '´', ' ', ' ',
        '"', ' ', ' ',
        '`$', '', '',
        '\'', ' ', ' ',
        ',', ' ', ' ',
        ';', ' ', ' ',
        '-', ' ', ' ',
        ' ', ' ', ' ',
        '.', '.', '.',
        ':', '.', '.',
        # German umlauts
        'ÄE', 'E', 'E',
        'ÄU<', 'EU', 'EU',
        'ÄV(AEOU)-<', 'EW', None,
        'Ä$', 'Ä', None,
        'Ä<', None, 'E',
        'Ä', 'E', None,
        'ÖE', 'Ö', 'Ö',
        'ÖU', 'Ö', 'Ö',
        'ÖVER--<', 'ÖW', None,
        'ÖV(AOU)-', 'ÖW', None,
        'ÜBEL(GNRW)-^^', 'ÜBL ', 'IBL ',
        'ÜBER^^', 'ÜBA', 'IBA',
        'ÜE', 'Ü', 'I',
        'ÜVER--<', 'ÜW', None,
        'ÜV(AOU)-', 'ÜW', None,
        'Ü', None, 'I',
        'ßCH<', None, 'Z',
        'ß<', 'S', 'Z',
        # international umlauts
        'À<', 'A', 'A',
        'Á<', 'A', 'A',
        'Â<', 'A', 'A',
        'Ã<', 'A', 'A',
        'Å<', 'A', 'A',
        'ÆER-', 'E', 'E',
        'ÆU<', 'EU', 'EU',
        'ÆV(AEOU)-<', 'EW', None,
        'Æ$', 'Ä', None,
        'Æ<', None, 'E',
        'Æ', 'E', None,
        'Ç', 'Z', 'Z',
        'ÐÐ-', '', '',
        'Ð', 'DI', 'TI',
        'È<', 'E', 'E',
        'É<', 'E', 'E',
        'Ê<', 'E', 'E',
        'Ë', 'E', 'E',
        'Ì<', 'I', 'I',
        'Í<', 'I', 'I',
        'Î<', 'I', 'I',
        'Ï', 'I', 'I',
        'ÑÑ-', '', '',
        'Ñ', 'NI', 'NI',
        'Ò<', 'O', 'U',
        'Ó<', 'O', 'U',
        'Ô<', 'O', 'U',
        'Õ<', 'O', 'U',
        'Œ<', 'Ö', 'Ö',
        'Ø(IJY)-<', 'E', 'E',
        'Ø<', 'Ö', 'Ö',
        'Š', 'SH', 'Z',
        'Þ', 'T', 'T',
        'Ù<', 'U', 'U',
        'Ú<', 'U', 'U',
        'Û<', 'U', 'U',
        'Ý<', 'I', 'I',
        'Ÿ<', 'I', 'I',
        # 'normal' letters (A-Z)
        'ABELLE$', 'ABL', 'ABL',
        'ABELL$', 'ABL', 'ABL',
        'ABIENNE$', 'ABIN', 'ABIN',
        'ACHME---^', 'ACH', 'AK',
        'ACEY$', 'AZI', 'AZI',
        'ADV', 'ATW', None,
        'AEGL-', 'EK', None,
        'AEU<', 'EU', 'EU',
        'AE2', 'E', 'E',
        'AFTRAUBEN------', 'AFT ', 'AFT ',
        'AGL-1', 'AK', None,
        'AGNI-^', 'AKN', 'AKN',
        'AGNIE-', 'ANI', 'ANI',
        'AGN(AEOU)-$', 'ANI', 'ANI',
        'AH(AIOÖUÜY)-', 'AH', None,
        'AIA2', 'AIA', 'AIA',
        'AIE$', 'E', 'E',
        'AILL(EOU)-', 'ALI', 'ALI',
        'AINE$', 'EN', 'EN',
        'AIRE$', 'ER', 'ER',
        'AIR-', 'E', 'E',
        'AISE$', 'ES', 'EZ',
        'AISSANCE$', 'ESANS', 'EZANZ',
        'AISSE$', 'ES', 'EZ',
        'AIX$', 'EX', 'EX',
        'AJ(AÄEÈÉÊIOÖUÜ)--', 'A', 'A',
        'AKTIE', 'AXIE', 'AXIE',
        'AKTUEL', 'AKTUEL', None,
        'ALOI^', 'ALOI', 'ALUI',  # Don't merge these rules
        'ALOY^', 'ALOI', 'ALUI',  # needed by 'check_rules'
        'AMATEU(RS)-', 'AMATÖ', 'ANATÖ',
        'ANCH(OEI)-', 'ANSH', 'ANZ',
        'ANDERGEGANG----', 'ANDA GE', 'ANTA KE',
        'ANDERGEHE----', 'ANDA ', 'ANTA ',
        'ANDERGESETZ----', 'ANDA GE', 'ANTA KE',
        'ANDERGING----', 'ANDA ', 'ANTA ',
        'ANDERSETZ(ET)-----', 'ANDA ', 'ANTA ',
        'ANDERZUGEHE----', 'ANDA ZU ', 'ANTA ZU ',
        'ANDERZUSETZE-----', 'ANDA ZU ', 'ANTA ZU ',
        'ANER(BKO)---^^', 'AN', None,
        'ANHAND---^$', 'AN H', 'AN ',
        'ANH(AÄEIOÖUÜY)--^^', 'AN', None,
        'ANIELLE$', 'ANIEL', 'ANIL',
        'ANIEL', 'ANIEL', None,
        'ANSTELLE----^$', 'AN ST', 'AN ZT',
        'ANTI^^', 'ANTI', 'ANTI',
        'ANVER^^', 'ANFA', 'ANFA',
        'ATIA$', 'ATIA', 'ATIA',
        'ATIA(NS)--', 'ATI', 'ATI',
        'ATI(AÄOÖUÜ)-', 'AZI', 'AZI',
        'AUAU--', '', '',
        'AUERE$', 'AUERE', None,
        'AUERE(NS)-$', 'AUERE', None,
        'AUERE(AIOUY)--', 'AUER', None,
        'AUER(AÄIOÖUÜY)-', 'AUER', None,
        'AUER<', 'AUA', 'AUA',
        'AUF^^', 'AUF', 'AUF',
        'AULT$', 'O', 'U',
        'AUR(BCDFGKLMNQSTVWZ)-', 'AUA', 'AUA',
        'AUR$', 'AUA', 'AUA',
        'AUSSE$', 'OS', 'UZ',
        'AUS(ST)-^', 'AUS', 'AUS',
        'AUS^^', 'AUS', 'AUS',
        'AUTOFAHR----', 'AUTO ', 'AUTU ',
        'AUTO^^', 'AUTO', 'AUTU',
        'AUX(IY)-', 'AUX', 'AUX',
        'AUX', 'O', 'U',
        'AU', 'AU', 'AU',
        'AVER--<', 'AW', None,
        'AVIER$', 'AWIE', 'AFIE',
        'AV(EÈÉÊI)-^', 'AW', None,
        'AV(AOU)-', 'AW', None,
        'AYRE$', 'EIRE', 'EIRE',
        'AYRE(NS)-$', 'EIRE', 'EIRE',
        'AYRE(AIOUY)--', 'EIR', 'EIR',
        'AYR(AÄIOÖUÜY)-', 'EIR', 'EIR',
        'AYR<', 'EIA', 'EIA',
        'AYER--<', 'EI', 'EI',
        'AY(AÄEIOÖUÜY)--', 'A', 'A',
        'AË', 'E', 'E',
        'A(IJY)<', 'EI', 'EI',
        'BABY^$', 'BEBI', 'BEBI',
        'BAB(IY)^', 'BEBI', 'BEBI',
        'BEAU^$', 'BO', None,
        'BEA(BCMNRU)-^', 'BEA', 'BEA',
        'BEAT(AEIMORU)-^', 'BEAT', 'BEAT',
        'BEE$', 'BI', 'BI',
        'BEIGE^$', 'BESH', 'BEZ',
        'BENOIT--', 'BENO', 'BENU',
        'BER(DT)-', 'BER', None,
        'BERN(DT)-', 'BERN', None,
        'BE(LMNRST)-^', 'BE', 'BE',
        'BETTE$', 'BET', 'BET',
        'BEVOR^$', 'BEFOR', None,
        'BIC$', 'BIZ', 'BIZ',
        'BOWL(EI)-', 'BOL', 'BUL',
        'BP(AÄEÈÉÊIÌÍÎOÖRUÜY)-', 'B', 'B',
        'BRINGEND-----^', 'BRI', 'BRI',
        'BRINGEND-----', ' BRI', ' BRI',
        'BROW(NS)-', 'BRAU', 'BRAU',
        'BUDGET7', 'BÜGE', 'BIKE',
        'BUFFET7', 'BÜFE', 'BIFE',
        'BYLLE$', 'BILE', 'BILE',
        'BYLL$', 'BIL', 'BIL',
        'BYPA--^', 'BEI', 'BEI',
        'BYTE<', 'BEIT', 'BEIT',
        'BY9^', 'BÜ', None,
        'B(SßZ)$', 'BS', None,
        'CACH(EI)-^', 'KESH', 'KEZ',
        'CAE--', 'Z', 'Z',
        'CA(IY)$', 'ZEI', 'ZEI',
        'CE(EIJUY)--', 'Z', 'Z',
        'CENT<', 'ZENT', 'ZENT',
        'CERST(EI)----^', 'KE', 'KE',
        'CER$', 'ZA', 'ZA',
        'CE3', 'ZE', 'ZE',
        'CH\'S$', 'X', 'X',
        'CH´S$', 'X', 'X',
        'CHAO(ST)-', 'KAO', 'KAU',
        'CHAMPIO-^', 'SHEMPI', 'ZENBI',
        'CHAR(AI)-^', 'KAR', 'KAR',
        'CHAU(CDFSVWXZ)-', 'SHO', 'ZU',
        'CHÄ(CF)-', 'SHE', 'ZE',
        'CHE(CF)-', 'SHE', 'ZE',
        'CHEM-^', 'KE', 'KE',  # or: 'CHE', 'KE'
        'CHEQUE<', 'SHEK', 'ZEK',
        'CHI(CFGPVW)-', 'SHI', 'ZI',
        'CH(AEUY)-<^', 'SH', 'Z',
        'CHK-', '', '',
        'CHO(CKPS)-^', 'SHO', 'ZU',
        'CHRIS-', 'KRI', None,
        'CHRO-', 'KR', None,
        'CH(LOR)-<^', 'K', 'K',
        'CHST-', 'X', 'X',
        'CH(SßXZ)3', 'X', 'X',
        'CHTNI-3', 'CHN', 'KN',
        'CH^', 'K', 'K',  # or: 'CH', 'K'
        'CH', 'CH', 'K',
        'CIC$', 'ZIZ', 'ZIZ',
        'CIENCEFICT----', 'EIENS ', 'EIENZ ',
        'CIENCE$', 'EIENS', 'EIENZ',
        'CIER$', 'ZIE', 'ZIE',
        'CYB-^', 'ZEI', 'ZEI',
        'CY9^', 'ZÜ', 'ZI',
        'C(IJY)-<3', 'Z', 'Z',
        'CLOWN-', 'KLAU', 'KLAU',
        'CCH', 'Z', 'Z',
        'CCE-', 'X', 'X',
        'C(CK)-', '', '',
        'CLAUDET---', 'KLO', 'KLU',
        'CLAUDINE^$', 'KLODIN', 'KLUTIN',
        'COACH', 'KOSH', 'KUZ',
        'COLE$', 'KOL', 'KUL',
        'COUCH', 'KAUSH', 'KAUZ',
        'COW', 'KAU', 'KAU',
        'CQUES$', 'K', 'K',
        'CQUE', 'K', 'K',
        'CRASH--9', 'KRE', 'KRE',
        'CREAT-^', 'KREA', 'KREA',
        'CST', 'XT', 'XT',
        'CS<^', 'Z', 'Z',
        'C(SßX)', 'X', 'X',
        'CT\'S$', 'X', 'X',
        'CT(SßXZ)', 'X', 'X',
        'CZ<', 'Z', 'Z',
        'C(ÈÉÊÌÍÎÝ)3', 'Z', 'Z',
        'C.^', 'C.', 'C.',
        'CÄ-', 'Z', 'Z',
        'CÜ$', 'ZÜ', 'ZI',
        'C\'S$', 'X', 'X',
        'C<', 'K', 'K',
        'DAHER^$', 'DAHER', None,
        'DARAUFFOLGE-----', 'DARAUF ', 'TARAUF ',
        'DAVO(NR)-^$', 'DAFO', 'TAFU',
        'DD(SZ)--<', '', '',
        'DD9', 'D', None,
        'DEPOT7', 'DEPO', 'TEBU',
        'DESIGN', 'DISEIN', 'TIZEIN',
        'DE(LMNRST)-3^', 'DE', 'TE',
        'DETTE$', 'DET', 'TET',
        'DH$', 'T', None,
        'DIC$', 'DIZ', 'TIZ',
        'DIDR-^', 'DIT', None,
        'DIEDR-^', 'DIT', None,
        'DJ(AEIOU)-^', 'I', 'I',
        'DMITR-^', 'DIMIT', 'TINIT',
        'DRY9^', 'DRÜ', None,
        'DT-', '', '',
        'DUIS-^', 'DÜ', 'TI',
        'DURCH^^', 'DURCH', 'TURK',
        'DVA$', 'TWA', None,
        'DY9^', 'DÜ', None,
        'DYS$', 'DIS', None,
        'DS(CH)--<', 'T', 'T',
        'DST', 'ZT', 'ZT',
        'DZS(CH)--', 'T', 'T',
        'D(SßZ)', 'Z', 'Z',
        'D(AÄEIOÖRUÜY)-', 'D', None,
        'D(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'D', None,
        'D\'H^', 'D', 'T',
        'D´H^', 'D', 'T',
        'D`H^', 'D', 'T',
        'D\'S3$', 'Z', 'Z',
        'D´S3$', 'Z', 'Z',
        'D^', 'D', None,
        'D', 'T', 'T',
        'EAULT$', 'O', 'U',
        'EAUX$', 'O', 'U',
        'EAU', 'O', 'U',
        'EAV', 'IW', 'IF',
        'EAS3$', 'EAS', None,
        'EA(AÄEIOÖÜY)-3', 'EA', 'EA',
        'EA3$', 'EA', 'EA',
        'EA3', 'I', 'I',
        'EBENSO^$', 'EBNSO', 'EBNZU',
        'EBENSO^^', 'EBNSO ', 'EBNZU ',
        'EBEN^^', 'EBN', 'EBN',
        'EE9', 'E', 'E',
        'EGL-1', 'EK', None,
        'EHE(IUY)--1', 'EH', None,
        'EHUNG---1', 'E', None,
        'EH(AÄIOÖUÜY)-1', 'EH', None,
        'EIEI--', '', '',
        'EIERE^$', 'EIERE', None,
        'EIERE$', 'EIERE', None,
        'EIERE(NS)-$', 'EIERE', None,
        'EIERE(AIOUY)--', 'EIER', None,
        'EIER(AÄIOÖUÜY)-', 'EIER', None,
        'EIER<', 'EIA', None,
        'EIGL-1', 'EIK', None,
        'EIGH$', 'EI', 'EI',
        'EIH--', 'E', 'E',
        'EILLE$', 'EI', 'EI',
        'EIR(BCDFGKLMNQSTVWZ)-', 'EIA', 'EIA',
        'EIR$', 'EIA', 'EIA',
        'EITRAUBEN------', 'EIT ', 'EIT ',
        'EI', 'EI', 'EI',
        'EJ$', 'EI', 'EI',
        'ELIZ^', 'ELIS', None,
        'ELZ^', 'ELS', None,
        'EL-^', 'E', 'E',
        'ELANG----1', 'E', 'E',
        'EL(DKL)--1', 'E', 'E',
        'EL(MNT)--1$', 'E', 'E',
        'ELYNE$', 'ELINE', 'ELINE',
        'ELYN$', 'ELIN', 'ELIN',
        'EL(AÄEÈÉÊIÌÍÎOÖUÜY)-1', 'EL', 'EL',
        'EL-1', 'L', 'L',
        'EM-^', None, 'E',
        'EM(DFKMPQT)--1', None, 'E',
        'EM(AÄEÈÉÊIÌÍÎOÖUÜY)--1', None, 'E',
        'EM-1', None, 'N',
        'ENGAG-^', 'ANGA', 'ANKA',
        'EN-^', 'E', 'E',
        'ENTUEL', 'ENTUEL', None,
        'EN(CDGKQSTZ)--1', 'E', 'E',
        'EN(AÄEÈÉÊIÌÍÎNOÖUÜY)-1', 'EN', 'EN',
        'EN-1', '', '',
        'ERH(AÄEIOÖUÜ)-^', 'ERH', 'ER',
        'ER-^', 'E', 'E',
        'ERREGEND-----', ' ER', ' ER',
        'ERT1$', 'AT', None,
        'ER(DGLKMNRQTZß)-1', 'ER', None,
        'ER(AÄEÈÉÊIÌÍÎOÖUÜY)-1', 'ER', 'A',
        'ER1$', 'A', 'A',
        'ER<1', 'A', 'A',
        'ETAT7', 'ETA', 'ETA',
        'ETI(AÄOÖÜU)-', 'EZI', 'EZI',
        'EUERE$', 'EUERE', None,
        'EUERE(NS)-$', 'EUERE', None,
        'EUERE(AIOUY)--', 'EUER', None,
        'EUER(AÄIOÖUÜY)-', 'EUER', None,
        'EUER<', 'EUA', None,
        'EUEU--', '', '',
        'EUILLE$', 'Ö', 'Ö',
        'EUR$', 'ÖR', 'ÖR',
        'EUX', 'Ö', 'Ö',
        'EUSZ$', 'EUS', None,
        'EUTZ$', 'EUS', None,
        'EUYS$', 'EUS', 'EUZ',
        'EUZ$', 'EUS', None,
        'EU', 'EU', 'EU',
        'EVER--<1', 'EW', None,
        'EV(ÄOÖUÜ)-1', 'EW', None,
        'EYER<', 'EIA', 'EIA',
        'EY<', 'EI', 'EI',
3129
        'FACETTE', 'FASET', 'FAZET',
3130
        'FANS--^$', 'FE', 'FE',
3131
        'FAN-^$', 'FE', 'FE',
3132
        'FAULT-', 'FOL', 'FUL',
3133
        'FEE(DL)-', 'FI', 'FI',
3134
        'FEHLER', 'FELA', 'FELA',
3135
        'FE(LMNRST)-3^', 'FE', 'FE',
3136
        'FOERDERN---^', 'FÖRD', 'FÖRT',
3137
        'FOERDERN---', ' FÖRD', ' FÖRT',
3138
        'FOND7', 'FON', 'FUN',
3139
        'FRAIN$', 'FRA', 'FRA',
3140
        'FRISEU(RS)-', 'FRISÖ', 'FRIZÖ',
3141
        'FY9^', 'FÜ', None,
3142
        'FÖRDERN---^', 'FÖRD', 'FÖRT',
3143
        'FÖRDERN---', ' FÖRD', ' FÖRT',
3144
        'GAGS^$', 'GEX', 'KEX',
3145
        'GAG^$', 'GEK', 'KEK',
3146
        'GD', 'KT', 'KT',
3147
        'GEGEN^^', 'GEGN', 'KEKN',
3148
        'GEGENGEKOM-----', 'GEGN ', 'KEKN ',
3149
        'GEGENGESET-----', 'GEGN ', 'KEKN ',
3150
        'GEGENKOMME-----', 'GEGN ', 'KEKN ',
3151
        'GEGENZUKOM---', 'GEGN ZU ', 'KEKN ZU ',
3152
        'GENDETWAS-----$', 'GENT ', 'KENT ',
3153
        'GENRE', 'IORE', 'IURE',
3154
        'GE(LMNRST)-3^', 'GE', 'KE',
3155
        'GER(DKT)-', 'GER', None,
3156
        'GETTE$', 'GET', 'KET',
3157
        'GGF.', 'GF.', None,
3158
        'GG-', '', '',
3159
        'GH', 'G', None,
3160
        'GI(AOU)-^', 'I', 'I',
3161
        'GION-3', 'KIO', 'KIU',
3162
        'G(CK)-', '', '',
3163
        'GJ(AEIOU)-^', 'I', 'I',
3164
        'GMBH^$', 'GMBH', 'GMBH',
3165
        'GNAC$', 'NIAK', 'NIAK',
3166
        'GNON$', 'NION', 'NIUN',
3167
        'GN$', 'N', 'N',
3168
        'GONCAL-^', 'GONZA', 'KUNZA',
3169
        'GRY9^', 'GRÜ', None,
3170
        'G(SßXZ)-<', 'K', 'K',
3171
        'GUCK-', 'KU', 'KU',
3172
        'GUISEP-^', 'IUSE', 'IUZE',
3173
        'GUI-^', 'G', 'K',
3174
        'GUTAUSSEH------^', 'GUT ', 'KUT ',
3175
        'GUTGEHEND------^', 'GUT ', 'KUT ',
3176
        'GY9^', 'GÜ', None,
3177
        'G(AÄEILOÖRUÜY)-', 'G', None,
3178
        'G(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'G', None,
3179
        'G\'S$', 'X', 'X',
3180
        'G´S$', 'X', 'X',
3181
        'G^', 'G', None,
3182
        'G', 'K', 'K',
3183
        'HA(HIUY)--1', 'H', None,
3184
        'HANDVOL---^', 'HANT ', 'ANT ',
3185
        'HANNOVE-^', 'HANOF', None,
3186
        'HAVEN7$', 'HAFN', None,
3187
        'HEAD-', 'HE', 'E',
3188
        'HELIEGEN------', 'E ', 'E ',
3189
        'HESTEHEN------', 'E ', 'E ',
3190
        'HE(LMNRST)-3^', 'HE', 'E',
3191
        'HE(LMN)-1', 'E', 'E',
3192
        'HEUR1$', 'ÖR', 'ÖR',
3193
        'HE(HIUY)--1', 'H', None,
3194
        'HIH(AÄEIOÖUÜY)-1', 'IH', None,
3195
        'HLH(AÄEIOÖUÜY)-1', 'LH', None,
3196
        'HMH(AÄEIOÖUÜY)-1', 'MH', None,
3197
        'HNH(AÄEIOÖUÜY)-1', 'NH', None,
3198
        'HOBBY9^', 'HOBI', None,
3199
        'HOCHBEGAB-----^', 'HOCH ', 'UK ',
3200
        'HOCHTALEN-----^', 'HOCH ', 'UK ',
3201
        'HOCHZUFRI-----^', 'HOCH ', 'UK ',
3202
        'HO(HIY)--1', 'H', None,
3203
        'HRH(AÄEIOÖUÜY)-1', 'RH', None,
3204
        'HUH(AÄEIOÖUÜY)-1', 'UH', None,
3205
        'HUIS^^', 'HÜS', 'IZ',
3206
        'HUIS$', 'ÜS', 'IZ',
3207
        'HUI--1', 'H', None,
3208
        'HYGIEN^', 'HÜKIEN', None,
3209
        'HY9^', 'HÜ', None,
3210
        'HY(BDGMNPST)-', 'Ü', None,
3211
        'H.^', None, 'H.',
3212
        'HÄU--1', 'H', None,
3213
        'H^', 'H', '',
3214
        'H', '', '',
3215
        'ICHELL---', 'ISH', 'IZ',
3216
        'ICHI$', 'ISHI', 'IZI',
3217
        'IEC$', 'IZ', 'IZ',
3218
        'IEDENSTELLE------', 'IDN ', 'ITN ',
3219
        'IEI-3', '', '',
3220
        'IELL3', 'IEL', 'IEL',
3221
        'IENNE$', 'IN', 'IN',
3222
        'IERRE$', 'IER', 'IER',
3223
        'IERZULAN---', 'IR ZU ', 'IR ZU ',
3224
        'IETTE$', 'IT', 'IT',
3225
        'IEU', 'IÖ', 'IÖ',
3226
        'IE<4', 'I', 'I',
3227
        'IGL-1', 'IK', None,
3228
        'IGHT3$', 'EIT', 'EIT',
3229
        'IGNI(EO)-', 'INI', 'INI',
3230
        'IGN(AEOU)-$', 'INI', 'INI',
3231
        'IHER(DGLKRT)--1', 'IHE', None,
3232
        'IHE(IUY)--', 'IH', None,
3233
        'IH(AIOÖUÜY)-', 'IH', None,
3234
        'IJ(AOU)-', 'I', 'I',
3235
        'IJ$', 'I', 'I',
3236
        'IJ<', 'EI', 'EI',
3237
        'IKOLE$', 'IKOL', 'IKUL',
3238
        'ILLAN(STZ)--4', 'ILIA', 'ILIA',
3239
        'ILLAR(DT)--4', 'ILIA', 'ILIA',
3240
        'IMSTAN----^', 'IM ', 'IN ',
3241
        'INDELERREGE------', 'INDL ', 'INTL ',
3242
        'INFRAGE-----^$', 'IN ', 'IN ',
3243
        'INTERN(AOU)-^', 'INTAN', 'INTAN',
3244
        'INVER-', 'INWE', 'INFE',
3245
        'ITI(AÄIOÖUÜ)-', 'IZI', 'IZI',
3246
        'IUSZ$', 'IUS', None,
3247
        'IUTZ$', 'IUS', None,
3248
        'IUZ$', 'IUS', None,
3249
        'IVER--<', 'IW', None,
3250
        'IVIER$', 'IWIE', 'IFIE',
3251
        'IV(ÄOÖUÜ)-', 'IW', None,
3252
        'IV<3', 'IW', None,
3253
        'IY2', 'I', None,
3254
        'I(ÈÉÊ)<4', 'I', 'I',
3255
        'JAVIE---<^', 'ZA', 'ZA',
3256
        'JEANS^$', 'JINS', 'INZ',
3257
        'JEANNE^$', 'IAN', 'IAN',
3258
        'JEAN-^', 'IA', 'IA',
3259
        'JER-^', 'IE', 'IE',
3260
        'JE(LMNST)-', 'IE', 'IE',
3261
        'JI^', 'JI', None,
3262
        'JOR(GK)^$', 'IÖRK', 'IÖRK',
3263
        'J', 'I', 'I',
3264
        'KC(ÄEIJ)-', 'X', 'X',
3265
        'KD', 'KT', None,
3266
        'KE(LMNRST)-3^', 'KE', 'KE',
3267
        'KG(AÄEILOÖRUÜY)-', 'K', None,
3268
        'KH<^', 'K', 'K',
3269
        'KIC$', 'KIZ', 'KIZ',
3270
        'KLE(LMNRST)-3^', 'KLE', 'KLE',
3271
        'KOTELE-^', 'KOTL', 'KUTL',
3272
        'KREAT-^', 'KREA', 'KREA',
3273
        'KRÜS(TZ)--^', 'KRI', None,
3274
        'KRYS(TZ)--^', 'KRI', None,
3275
        'KRY9^', 'KRÜ', None,
3276
        'KSCH---', 'K', 'K',
3277
        'KSH--', 'K', 'K',
3278
        'K(SßXZ)7', 'X', 'X',  # implies 'KST' -> 'XT'
3279
        'KT\'S$', 'X', 'X',
3280
        'KTI(AIOU)-3', 'XI', 'XI',
3281
        'KT(SßXZ)', 'X', 'X',
3282
        'KY9^', 'KÜ', None,
3283
        'K\'S$', 'X', 'X',
3284
        'K´S$', 'X', 'X',
3285
        'LANGES$', ' LANGES', ' LANKEZ',
3286
        'LANGE$', ' LANGE', ' LANKE',
3287
        'LANG$', ' LANK', ' LANK',
3288
        'LARVE-', 'LARF', 'LARF',
3289
        'LD(SßZ)$', 'LS', 'LZ',
3290
        'LD\'S$', 'LS', 'LZ',
3291
        'LD´S$', 'LS', 'LZ',
3292
        'LEAND-^', 'LEAN', 'LEAN',
3293
        'LEERSTEHE-----^', 'LER ', 'LER ',
3294
        'LEICHBLEIB-----', 'LEICH ', 'LEIK ',
3295
        'LEICHLAUTE-----', 'LEICH ', 'LEIK ',
3296
        'LEIDERREGE------', 'LEIT ', 'LEIT ',
3297
        'LEIDGEPR----^', 'LEIT ', 'LEIT ',
3298
        'LEINSTEHE-----', 'LEIN ', 'LEIN ',
3299
        'LEL-', 'LE', 'LE',
3300
        'LE(MNRST)-3^', 'LE', 'LE',
3301
        'LETTE$', 'LET', 'LET',
3302
        'LFGNAG-', 'LFGAN', 'LFKAN',
3303
        'LICHERWEIS----', 'LICHA ', 'LIKA ',
3304
        'LIC$', 'LIZ', 'LIZ',
3305
        'LIVE^$', 'LEIF', 'LEIF',
3306
        'LT(SßZ)$', 'LS', 'LZ',
3307
        'LT\'S$', 'LS', 'LZ',
3308
        'LT´S$', 'LS', 'LZ',
3309
        'LUI(GS)--', 'LU', 'LU',
3310
        'LV(AIO)-', 'LW', None,
3311
        'LY9^', 'LÜ', None,
3312
        'LSTS$', 'LS', 'LZ',
3313
        'LZ(BDFGKLMNPQRSTVWX)-', 'LS', None,
3314
        'L(SßZ)$', 'LS', None,
3315
        'MAIR-<', 'MEI', 'NEI',
3316
        'MANAG-', 'MENE', 'NENE',
3317
        'MANUEL', 'MANUEL', None,
3318
        'MASSEU(RS)-', 'MASÖ', 'NAZÖ',
3319
        'MATCH', 'MESH', 'NEZ',
3320
        'MAURICE', 'MORIS', 'NURIZ',
3321
        'MBH^$', 'MBH', 'MBH',
3322
        'MB(ßZ)$', 'MS', None,
3323
        'MB(SßTZ)-', 'M', 'N',
3324
        'MCG9^', 'MAK', 'NAK',
3325
        'MC9^', 'MAK', 'NAK',
3326
        'MEMOIR-^', 'MEMOA', 'NENUA',
3327
        'MERHAVEN$', 'MAHAFN', None,
3328
        'ME(LMNRST)-3^', 'ME', 'NE',
3329
        'MEN(STZ)--3', 'ME', None,
3330
        'MEN$', 'MEN', None,
3331
        'MIGUEL-', 'MIGE', 'NIKE',
3332
        'MIKE^$', 'MEIK', 'NEIK',
3333
        'MITHILFE----^$', 'MIT H', 'NIT ',
3334
        'MN$', 'M', None,
3335
        'MN', 'N', 'N',
3336
        'MPJUTE-', 'MPUT', 'NBUT',
3337
        'MP(ßZ)$', 'MS', None,
3338
        'MP(SßTZ)-', 'M', 'N',
3339
        'MP(BDJLMNPQVW)-', 'MB', 'NB',
3340
        'MY9^', 'MÜ', None,
3341
        'M(ßZ)$', 'MS', None,
3342
        'M´G7^', 'MAK', 'NAK',
3343
        'M\'G7^', 'MAK', 'NAK',
3344
        'M´^', 'MAK', 'NAK',
3345
        'M\'^', 'MAK', 'NAK',
3346
        'M', None, 'N',
3347
        'NACH^^', 'NACH', 'NAK',
3348
        'NADINE', 'NADIN', 'NATIN',
3349
        'NAIV--', 'NA', 'NA',
3350
        'NAISE$', 'NESE', 'NEZE',
3351
        'NAUGENOMM------', 'NAU ', 'NAU ',
3352
        'NAUSOGUT$', 'NAUSO GUT', 'NAUZU KUT',
3353
        'NCH$', 'NSH', 'NZ',
3354
        'NCOISE$', 'SOA', 'ZUA',
3355
        'NCOIS$', 'SOA', 'ZUA',
3356
        'NDAR$', 'NDA', 'NTA',
3357
        'NDERINGEN------', 'NDE ', 'NTE ',
3358
        'NDRO(CDKTZ)-', 'NTRO', None,
3359
        'ND(BFGJLMNPQVW)-', 'NT', None,
3360
        'ND(SßZ)$', 'NS', 'NZ',
3361
        'ND\'S$', 'NS', 'NZ',
3362
        'ND´S$', 'NS', 'NZ',
3363
        'NEBEN^^', 'NEBN', 'NEBN',
3364
        'NENGELERN------', 'NEN ', 'NEN ',
3365
        'NENLERN(ET)---', 'NEN LE', 'NEN LE',
3366
        'NENZULERNE---', 'NEN ZU LE', 'NEN ZU LE',
3367
        'NE(LMNRST)-3^', 'NE', 'NE',
3368
        'NEN-3', 'NE', 'NE',
3369
        'NETTE$', 'NET', 'NET',
3370
        'NGU^^', 'NU', 'NU',
3371
        'NG(BDFJLMNPQRTVW)-', 'NK', 'NK',
3372
        'NH(AUO)-$', 'NI', 'NI',
3373
        'NICHTSAHNEN-----', 'NIX ', 'NIX ',
3374
        'NICHTSSAGE----', 'NIX ', 'NIX ',
3375
        'NICHTS^^', 'NIX', 'NIX',
3376
        'NICHT^^', 'NICHT', 'NIKT',
3377
        'NINE$', 'NIN', 'NIN',
3378
        'NON^^', 'NON', 'NUN',
3379
        'NOTLEIDE-----^', 'NOT ', 'NUT ',
3380
        'NOT^^', 'NOT', 'NUT',
3381
        'NTI(AIOU)-3', 'NZI', 'NZI',
3382
        'NTIEL--3', 'NZI', 'NZI',
3383
        'NT(SßZ)$', 'NS', 'NZ',
3384
        'NT\'S$', 'NS', 'NZ',
3385
        'NT´S$', 'NS', 'NZ',
3386
        'NYLON', 'NEILON', 'NEILUN',
3387
        'NY9^', 'NÜ', None,
3388
        'NSTZUNEH---', 'NST ZU ', 'NZT ZU ',
3389
        'NSZ-', 'NS', None,
3390
        'NSTS$', 'NS', 'NZ',
3391
        'NZ(BDFGKLMNPQRSTVWX)-', 'NS', None,
3392
        'N(SßZ)$', 'NS', None,
3393
        'OBERE-', 'OBER', None,
3394
        'OBER^^', 'OBA', 'UBA',
3395
        'OEU2', 'Ö', 'Ö',
3396
        'OE<2', 'Ö', 'Ö',
3397
        'OGL-', 'OK', None,
3398
        'OGNIE-', 'ONI', 'UNI',
3399
        'OGN(AEOU)-$', 'ONI', 'UNI',
3400
        'OH(AIOÖUÜY)-', 'OH', None,
3401
        'OIE$', 'Ö', 'Ö',
3402
        'OIRE$', 'OA', 'UA',
3403
        'OIR$', 'OA', 'UA',
3404
        'OIX', 'OA', 'UA',
3405
        'OI<3', 'EU', 'EU',
3406
        'OKAY^$', 'OKE', 'UKE',
3407
        'OLYN$', 'OLIN', 'ULIN',
3408
        'OO(DLMZ)-', 'U', None,
3409
        'OO$', 'U', None,
3410
        'OO-', '', '',
3411
        'ORGINAL-----', 'ORI', 'URI',
3412
        'OTI(AÄOÖUÜ)-', 'OZI', 'UZI',
3413
        'OUI^', 'WI', 'FI',
3414
        'OUILLE$', 'ULIE', 'ULIE',
3415
        'OU(DT)-^', 'AU', 'AU',
3416
        'OUSE$', 'AUS', 'AUZ',
3417
        'OUT-', 'AU', 'AU',
3418
        'OU', 'U', 'U',
3419
        'O(FV)$', 'AU', 'AU',  # due to 'OW$' -> 'AU'
3420
        'OVER--<', 'OW', None,
3421
        'OV(AOU)-', 'OW', None,
3422
        'OW$', 'AU', 'AU',
3423
        'OWS$', 'OS', 'UZ',
3424
        'OJ(AÄEIOÖUÜ)--', 'O', 'U',
3425
        'OYER', 'OIA', None,
3426
        'OY(AÄEIOÖUÜ)--', 'O', 'U',
3427
        'O(JY)<', 'EU', 'EU',
3428
        'OZ$', 'OS', None,
3429
        'O´^', 'O', 'U',
3430
        'O\'^', 'O', 'U',
3431
        'O', None, 'U',
3432
        'PATIEN--^', 'PAZI', 'PAZI',
3433
        'PENSIO-^', 'PANSI', 'PANZI',
3434
        'PE(LMNRST)-3^', 'PE', 'PE',
3435
        'PFER-^', 'FE', 'FE',
3436
        'P(FH)<', 'F', 'F',
3437
        'PIC^$', 'PIK', 'PIK',
3438
        'PIC$', 'PIZ', 'PIZ',
3439
        'PIPELINE', 'PEIBLEIN', 'PEIBLEIN',
3440
        'POLYP-', 'POLÜ', None,
3441
        'POLY^^', 'POLI', 'PULI',
3442
        'PORTRAIT7', 'PORTRE', 'PURTRE',
3443
        'POWER7', 'PAUA', 'PAUA',
3444
        'PP(FH)--<', 'B', 'B',
3445
        'PP-', '', '',
3446
        'PRODUZ-^', 'PRODU', 'BRUTU',
3447
        'PRODUZI--', ' PRODU', ' BRUTU',
3448
        'PRIX^$', 'PRI', 'PRI',
3449
        'PS-^^', 'P', None,
3450
        'P(SßZ)^', None, 'Z',
3451
        'P(SßZ)$', 'BS', None,
3452
        'PT-^', '', '',
3453
        'PTI(AÄOÖUÜ)-3', 'BZI', 'BZI',
3454
        'PY9^', 'PÜ', None,
3455
        'P(AÄEIOÖRUÜY)-', 'P', 'P',
3456
        'P(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'P', None,
3457
        'P.^', None, 'P.',
3458
        'P^', 'P', None,
3459
        'P', 'B', 'B',
3460
        'QI-', 'Z', 'Z',
3461
        'QUARANT--', 'KARA', 'KARA',
3462
        'QUE(LMNRST)-3', 'KWE', 'KFE',
3463
        'QUE$', 'K', 'K',
3464
        'QUI(NS)$', 'KI', 'KI',
3465
        'QUIZ7', 'KWIS', None,
3466
        'Q(UV)7', 'KW', 'KF',
3467
        'Q<', 'K', 'K',
3468
        'RADFAHR----', 'RAT ', 'RAT ',
3469
        'RAEFTEZEHRE-----', 'REFTE ', 'REFTE ',
3470
        'RCH', 'RCH', 'RK',
3471
        'REA(DU)---3^', 'R', None,
3472
        'REBSERZEUG------', 'REBS ', 'REBZ ',
3473
        'RECHERCH^', 'RESHASH', 'REZAZ',
3474
        'RECYCL--', 'RIZEI', 'RIZEI',
3475
        'RE(ALST)-3^', 'RE', None,
3476
        'REE$', 'RI', 'RI',
3477
        'RER$', 'RA', 'RA',
3478
        'RE(MNR)-4', 'RE', 'RE',
3479
        'RETTE$', 'RET', 'RET',
3480
        'REUZ$', 'REUZ', None,
3481
        'REW$', 'RU', 'RU',
3482
        'RH<^', 'R', 'R',
3483
        'RJA(MN)--', 'RI', 'RI',
3484
        'ROWD-^', 'RAU', 'RAU',
3485
        'RTEMONNAIE-', 'RTMON', 'RTNUN',
3486
        'RTI(AÄOÖUÜ)-3', 'RZI', 'RZI',
3487
        'RTIEL--3', 'RZI', 'RZI',
3488
        'RV(AEOU)-3', 'RW', None,
3489
        'RY(KN)-$', 'RI', 'RI',
3490
        'RY9^', 'RÜ', None,
3491
        'RÄFTEZEHRE-----', 'REFTE ', 'REFTE ',
3492
        'SAISO-^', 'SES', 'ZEZ',
3493
        'SAFE^$', 'SEIF', 'ZEIF',
3494
        'SAUCE-^', 'SOS', 'ZUZ',
3495
        'SCHLAGGEBEN-----<', 'SHLAK ', 'ZLAK ',
3496
        'SCHSCH---7', '', '',
3497
        'SCHTSCH', 'SH', 'Z',
3498
        'SC(HZ)<', 'SH', 'Z',
3499
        'SC', 'SK', 'ZK',
3500
        'SELBSTST--7^^', 'SELB', 'ZELB',
3501
        'SELBST7^^', 'SELBST', 'ZELBZT',
3502
        'SERVICE7^', 'SÖRWIS', 'ZÖRFIZ',
3503
        'SERVI-^', 'SERW', None,
3504
        'SE(LMNRST)-3^', 'SE', 'ZE',
3505
        'SETTE$', 'SET', 'ZET',
3506
        'SHP-^', 'S', 'Z',
3507
        'SHST', 'SHT', 'ZT',
3508
        'SHTSH', 'SH', 'Z',
3509
        'SHT', 'ST', 'Z',
3510
        'SHY9^', 'SHÜ', None,
3511
        'SH^^', 'SH', None,
3512
        'SH3', 'SH', 'Z',
3513
        'SICHERGEGAN-----^', 'SICHA ', 'ZIKA ',
3514
        'SICHERGEHE----^', 'SICHA ', 'ZIKA ',
3515
        'SICHERGESTEL------^', 'SICHA ', 'ZIKA ',
3516
        'SICHERSTELL-----^', 'SICHA ', 'ZIKA ',
3517
        'SICHERZU(GS)--^', 'SICHA ZU ', 'ZIKA ZU ',
3518
        'SIEGLI-^', 'SIKL', 'ZIKL',
3519
        'SIGLI-^', 'SIKL', 'ZIKL',
3520
        'SIGHT', 'SEIT', 'ZEIT',
3521
        'SIGN', 'SEIN', 'ZEIN',
3522
        'SKI(NPZ)-', 'SKI', 'ZKI',
3523
        'SKI<^', 'SHI', 'ZI',
3524
        'SODASS^$', 'SO DAS', 'ZU TAZ',
3525
        'SODAß^$', 'SO DAS', 'ZU TAZ',
3526
        'SOGENAN--^', 'SO GEN', 'ZU KEN',
3527
        'SOUND-', 'SAUN', 'ZAUN',
3528
        'STAATS^^', 'STAZ', 'ZTAZ',
3529
        'STADT^^', 'STAT', 'ZTAT',
3530
        'STANDE$', ' STANDE', ' ZTANTE',
3531
        'START^^', 'START', 'ZTART',
3532
        'STAURANT7', 'STORAN', 'ZTURAN',
3533
        'STEAK-', 'STE', 'ZTE',
3534
        'STEPHEN-^$', 'STEW', None,
3535
        'STERN', 'STERN', None,
3536
        'STRAF^^', 'STRAF', 'ZTRAF',
3537
        'ST\'S$', 'Z', 'Z',
3538
        'ST´S$', 'Z', 'Z',
3539
        'STST--', '', '',
3540
        'STS(ACEÈÉÊHIÌÍÎOUÄÜÖ)--', 'ST', 'ZT',
3541
        'ST(SZ)', 'Z', 'Z',
3542
        'SPAREN---^', 'SPA', 'ZPA',
3543
        'SPAREND----', ' SPA', ' ZPA',
3544
        'S(PTW)-^^', 'S', None,
3545
        'SP', 'SP', None,
3546
        'STYN(AE)-$', 'STIN', 'ZTIN',
3547
        'ST', 'ST', 'ZT',
3548
        'SUITE<', 'SIUT', 'ZIUT',
3549
        'SUKE--$', 'S', 'Z',
3550
        'SURF(EI)-', 'SÖRF', 'ZÖRF',
3551
        'SV(AEÈÉÊIÌÍÎOU)-<^', 'SW', None,
3552
        'SYB(IY)--^', 'SIB', None,
3553
        'SYL(KVW)--^', 'SI', None,
3554
        'SY9^', 'SÜ', None,
3555
        'SZE(NPT)-^', 'ZE', 'ZE',
3556
        'SZI(ELN)-^', 'ZI', 'ZI',
3557
        'SZCZ<', 'SH', 'Z',
3558
        'SZT<', 'ST', 'ZT',
3559
        'SZ<3', 'SH', 'Z',
3560
        'SÜL(KVW)--^', 'SI', None,
3561
        'S', None, 'Z',
3562
        'TCH', 'SH', 'Z',
3563
        'TD(AÄEIOÖRUÜY)-', 'T', None,
3564
        'TD(ÀÁÂÃÅÈÉÊËÌÍÎÏÒÓÔÕØÙÚÛÝŸ)-', 'T', None,
3565
        'TEAT-^', 'TEA', 'TEA',
3566
        'TERRAI7^', 'TERA', 'TERA',
3567
        'TE(LMNRST)-3^', 'TE', 'TE',
3568
        'TH<', 'T', 'T',
3569
        'TICHT-', 'TIK', 'TIK',
3570
        'TICH$', 'TIK', 'TIK',
3571
        'TIC$', 'TIZ', 'TIZ',
3572
        'TIGGESTELL-------', 'TIK ', 'TIK ',
3573
        'TIGSTELL-----', 'TIK ', 'TIK ',
3574
        'TOAS-^', 'TO', 'TU',
3575
        'TOILET-', 'TOLE', 'TULE',
3576
        'TOIN-', 'TOA', 'TUA',
3577
        'TRAECHTI-^', 'TRECHT', 'TREKT',
3578
        'TRAECHTIG--', ' TRECHT', ' TREKT',
3579
        'TRAINI-', 'TREN', 'TREN',
3580
        'TRÄCHTI-^', 'TRECHT', 'TREKT',
3581
        'TRÄCHTIG--', ' TRECHT', ' TREKT',
3582
        'TSCH', 'SH', 'Z',
3583
        'TSH', 'SH', 'Z',
3584
        'TST', 'ZT', 'ZT',
3585
        'T(Sß)', 'Z', 'Z',
3586
        'TT(SZ)--<', '', '',
3587
        'TT9', 'T', 'T',
3588
        'TV^$', 'TV', 'TV',
3589
        'TX(AEIOU)-3', 'SH', 'Z',
3590
        'TY9^', 'TÜ', None,
3591
        'TZ-', '', '',
3592
        'T\'S3$', 'Z', 'Z',
3593
        'T´S3$', 'Z', 'Z',
3594
        'UEBEL(GNRW)-^^', 'ÜBL ', 'IBL ',
3595
        'UEBER^^', 'ÜBA', 'IBA',
3596
        'UE2', 'Ü', 'I',
3597
        'UGL-', 'UK', None,
3598
        'UH(AOÖUÜY)-', 'UH', None,
3599
        'UIE$', 'Ü', 'I',
3600
        'UM^^', 'UM', 'UN',
3601
        'UNTERE--3', 'UNTE', 'UNTE',
3602
        'UNTER^^', 'UNTA', 'UNTA',
3603
        'UNVER^^', 'UNFA', 'UNFA',
3604
        'UN^^', 'UN', 'UN',
3605
        'UTI(AÄOÖUÜ)-', 'UZI', 'UZI',
3606
        'UVE-4', 'UW', None,
3607
        'UY2', 'UI', None,
3608
        'UZZ', 'AS', 'AZ',
3609
        'VACL-^', 'WAZ', 'FAZ',
3610
        'VAC$', 'WAZ', 'FAZ',
3611
        'VAN DEN ^', 'FANDN', 'FANTN',
3612
        'VANES-^', 'WANE', None,
3613
        'VATRO-', 'WATR', None,
3614
        'VA(DHJNT)--^', 'F', None,
3615
        'VEDD-^', 'FE', 'FE',
3616
        'VE(BEHIU)--^', 'F', None,
3617
        'VEL(BDLMNT)-^', 'FEL', None,
3618
        'VENTZ-^', 'FEN', None,
3619
        'VEN(NRSZ)-^', 'FEN', None,
3620
        'VER(AB)-^$', 'WER', None,
3621
        'VERBAL^$', 'WERBAL', None,
3622
        'VERBAL(EINS)-^', 'WERBAL', None,
3623
        'VERTEBR--', 'WERTE', None,
3624
        'VEREIN-----', 'F', None,
3625
        'VEREN(AEIOU)-^', 'WEREN', None,
3626
        'VERIFI', 'WERIFI', None,
3627
        'VERON(AEIOU)-^', 'WERON', None,
3628
        'VERSEN^', 'FERSN', 'FAZN',
3629
        'VERSIERT--^', 'WERSI', None,
3630
        'VERSIO--^', 'WERS', None,
3631
        'VERSUS', 'WERSUS', None,
3632
        'VERTI(GK)-', 'WERTI', None,
3633
        'VER^^', 'FER', 'FA',
3634
        'VERSPRECHE-------', ' FER', ' FA',
3635
        'VER$', 'WA', None,
3636
        'VER', 'FA', 'FA',
3637
        'VET(HT)-^', 'FET', 'FET',
3638
        'VETTE$', 'WET', 'FET',
3639
        'VE^', 'WE', None,
3640
        'VIC$', 'WIZ', 'FIZ',
3641
        'VIELSAGE----', 'FIL ', 'FIL ',
3642
        'VIEL', 'FIL', 'FIL',
3643
        'VIEW', 'WIU', 'FIU',
3644
        'VILL(AE)-', 'WIL', None,
3645
        'VIS(ACEIKUVWZ)-<^', 'WIS', None,
3646
        'VI(ELS)--^', 'F', None,
3647
        'VILLON--', 'WILI', 'FILI',
3648
        'VIZE^^', 'FIZE', 'FIZE',
3649
        'VLIE--^', 'FL', None,
3650
        'VL(AEIOU)--', 'W', None,
3651
        'VOKA-^', 'WOK', None,
3652
        'VOL(ATUVW)--^', 'WO', None,
3653
        'VOR^^', 'FOR', 'FUR',
3654
        'VR(AEIOU)--', 'W', None,
3655
        'VV9', 'W', None,
3656
        'VY9^', 'WÜ', 'FI',
3657
        'V(ÜY)-', 'W', None,
3658
        'V(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'W', None,
3659
        'V(AEIJLRU)-<', 'W', None,
3660
        'V.^', 'V.', None,
3661
        'V<', 'F', 'F',
3662
        'WEITERENTWI-----^', 'WEITA ', 'FEITA ',
3663
        'WEITREICH-----^', 'WEIT ', 'FEIT ',
3664
        'WEITVER^', 'WEIT FER', 'FEIT FA',
3665
        'WE(LMNRST)-3^', 'WE', 'FE',
3666
        'WER(DST)-', 'WER', None,
3667
        'WIC$', 'WIZ', 'FIZ',
3668
        'WIEDERU--', 'WIDE', 'FITE',
3669
        'WIEDER^$', 'WIDA', 'FITA',
3670
        'WIEDER^^', 'WIDA ', 'FITA ',
3671
        'WIEVIEL', 'WI FIL', 'FI FIL',
3672
        'WISUEL', 'WISUEL', None,
3673
        'WR-^', 'W', None,
3674
        'WY9^', 'WÜ', 'FI',
3675
        'W(BDFGJKLMNPQRSTZ)-', 'F', None,
3676
        'W$', 'F', None,
3677
        'W', None, 'F',
3678
        'X<^', 'Z', 'Z',
3679
        'XHAVEN$', 'XAFN', None,
3680
        'X(CSZ)', 'X', 'X',
3681
        'XTS(CH)--', 'XT', 'XT',
3682
        'XT(SZ)', 'Z', 'Z',
3683
        'YE(LMNRST)-3^', 'IE', 'IE',
3684
        'YE-3', 'I', 'I',
3685
        'YOR(GK)^$', 'IÖRK', 'IÖRK',
3686
        'Y(AOU)-<7', 'I', 'I',
3687
        'Y(BKLMNPRSTX)-1', 'Ü', None,
3688
        'YVES^$', 'IF', 'IF',
3689
        'YVONNE^$', 'IWON', 'IFUN',
3690
        'Y.^', 'Y.', None,
3691
        'Y', 'I', 'I',
3692
        'ZC(AOU)-', 'SK', 'ZK',
3693
        'ZE(LMNRST)-3^', 'ZE', 'ZE',
3694
        'ZIEJ$', 'ZI', 'ZI',
3695
        'ZIGERJA(HR)-3', 'ZIGA IA', 'ZIKA IA',
3696
        'ZL(AEIOU)-', 'SL', None,
3697
        'ZS(CHT)--', '', '',
3698
        'ZS', 'SH', 'Z',
3699
        'ZUERST', 'ZUERST', 'ZUERST',
3700
        'ZUGRUNDE^$', 'ZU GRUNDE', 'ZU KRUNTE',
3701
        'ZUGRUNDE', 'ZU GRUNDE ', 'ZU KRUNTE ',
3702
        'ZUGUNSTEN', 'ZU GUNSTN', 'ZU KUNZTN',
3703
        'ZUHAUSE-', 'ZU HAUS', 'ZU AUZ',
3704
        'ZULASTEN^$', 'ZU LASTN', 'ZU LAZTN',
3705
        'ZURUECK^^', 'ZURÜK', 'ZURIK',
3706
        'ZURZEIT', 'ZUR ZEIT', 'ZUR ZEIT',
3707
        'ZURÜCK^^', 'ZURÜK', 'ZURIK',
3708
        'ZUSTANDE', 'ZU STANDE', 'ZU ZTANTE',
3709
        'ZUTAGE', 'ZU TAGE', 'ZU TAKE',
3710
        'ZUVER^^', 'ZUFA', 'ZUFA',
3711
        'ZUVIEL', 'ZU FIL', 'ZU FIL',
3712
        'ZUWENIG', 'ZU WENIK', 'ZU FENIK',
3713
        'ZY9^', 'ZÜ', None,
3714
        'ZYK3$', 'ZIK', None,
3715
        'Z(VW)7^', 'SW', None,
3716
        None, None, None)

    phonet_hash = Counter()
    alpha_pos = Counter()

    phonet_hash_1 = Counter()
    phonet_hash_2 = Counter()

    _phonet_upper_translation = dict(zip((ord(_) for _ in
                                          'abcdefghijklmnopqrstuvwxyzàáâãåäæ' +
                                          'çðèéêëìíîïñòóôõöøœšßþùúûüýÿ'),
                                         'ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÅÄÆ' +
                                         'ÇÐÈÉÊËÌÍÎÏÑÒÓÔÕÖØŒŠßÞÙÚÛÜÝŸ'))
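    # Note: a minimal sketch of one plausible reason for using an explicit
    # translation table here rather than str.upper(). The table maps every
    # character to exactly one character, so 'ß' stays 'ß' and the string
    # length is preserved. LOWER, UPPER, UPPER_TABLE and to_upper are
    # hypothetical names used only for illustration; they are not part of
    # this module.

```python
# Hypothetical, shortened table in the same style as _phonet_upper_translation.
LOWER = 'abcdefghijklmnopqrstuvwxyzäöüßÿ'
UPPER = 'ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜßŸ'
UPPER_TABLE = dict(zip((ord(c) for c in LOWER), UPPER))


def to_upper(term):
    """Upper-case term one character at a time using UPPER_TABLE."""
    return term.translate(UPPER_TABLE)


if __name__ == '__main__':
    print(to_upper('straße'))  # STRAßE (six characters, 'ß' kept)
    print('straße'.upper())    # STRASSE (seven characters)
```

    # Because the rules index into the source string by position, a
    # length-preserving conversion keeps those positions aligned with the
    # original term.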

    def _trinfo(text, rule, err_text, lang):
        """Output debug information."""
        if lang == 'none':
            _phonet_rules = _phonet_rules_no_lang
        else:
            _phonet_rules = _phonet_rules_german

        from_rule = ('(NULL)' if _phonet_rules[rule] is None else
                     _phonet_rules[rule])
        to_rule1 = ('(NULL)' if (_phonet_rules[rule + 1] is None) else
                    _phonet_rules[rule + 1])
        to_rule2 = ('(NULL)' if (_phonet_rules[rule + 2] is None) else
                    _phonet_rules[rule + 2])
        print('"{} {}:  "{}"{}"{}" {}'.format(text, ((rule // 3) + 1),
                                              from_rule, to_rule1, to_rule2,
                                              err_text))

    def _initialize_phonet(lang):
        """Initialize phonet variables."""
        if lang == 'none':
            _phonet_rules = _phonet_rules_no_lang
        else:
            _phonet_rules = _phonet_rules_german

        phonet_hash[''] = -1

        # German and international umlauts
        for j in {'À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë',
                  'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø',
                  'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'Þ', 'ß', 'Œ', 'Š', 'Ÿ'}:
            alpha_pos[j] = 1
            phonet_hash[j] = -1

        # "normal" letters ('A'-'Z')
        for i, j in enumerate('ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
            alpha_pos[j] = i + 2
            phonet_hash[j] = -1

        for i in range(26):
            for j in range(28):
                phonet_hash_1[i, j] = -1
                phonet_hash_2[i, j] = -1

        # for each phonetic rule
        for i in range(len(_phonet_rules)):
            rule = _phonet_rules[i]

            if rule and i % 3 == 0:
                # calculate first hash value
                k = _phonet_rules[i][0]

                if phonet_hash[k] < 0 and (_phonet_rules[i+1] or
                                           _phonet_rules[i+2]):
                    phonet_hash[k] = i

                # calculate second hash values
                if k and alpha_pos[k] >= 2:
                    k = alpha_pos[k]

                    j = k-2
                    rule = rule[1:]

                    if not rule:
                        rule = ' '
                    elif rule[0] == '(':
                        rule = rule[1:]
                    else:
                        rule = rule[0]

                    while rule and (rule[0] != ')'):
                        k = alpha_pos[rule[0]]

                        if k > 0:
                            # add hash value for this letter
                            if phonet_hash_1[j, k] < 0:
                                phonet_hash_1[j, k] = i
                                phonet_hash_2[j, k] = i

                            if phonet_hash_2[j, k] >= (i-30):
                                phonet_hash_2[j, k] = i
                            else:
                                k = -1

                        if k <= 0:
                            # add hash value for all letters
                            if phonet_hash_1[j, 0] < 0:
                                phonet_hash_1[j, 0] = i

                            phonet_hash_2[j, 0] = i

                        rule = rule[1:]
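
    # Note: a minimal sketch of the "first hash value" step above, assuming
    # the same flat rule layout as the tables in this function: triples of
    # (pattern, first-mode replacement, second-mode replacement), ended by a
    # (None, None, None) terminator. RULES and first_rule_index are
    # hypothetical names for illustration only; the real code stores the
    # result in the phonet_hash Counter and also builds two-letter ranges in
    # phonet_hash_1 and phonet_hash_2, which this sketch leaves out.

```python
# Hypothetical four-rule table in the same flat layout as the rule tuples.
RULES = ('EI', 'EI', 'EI',
         'EM-1', None, 'N',
         'G^', 'G', None,
         'G', 'K', 'K',
         None, None, None)


def first_rule_index(rules):
    """Map each leading pattern letter to the index of its first rule."""
    index = {}
    for i in range(0, len(rules), 3):
        pattern = rules[i]
        if not pattern:
            continue  # skip the (None, None, None) terminator
        # Only rules with at least one non-empty replacement are indexed.
        if pattern[0] not in index and (rules[i + 1] or rules[i + 2]):
            index[pattern[0]] = i
    return index


if __name__ == '__main__':
    print(first_rule_index(RULES))  # {'E': 0, 'G': 6}
```

    # With such an index, the lookup for a source character can start at the
    # first rule for that letter instead of scanning the whole table.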

    def _phonet(term, mode, lang, trace):
        """Return the phonet coded form of a term."""
        if lang == 'none':
            _phonet_rules = _phonet_rules_no_lang
        else:
            _phonet_rules = _phonet_rules_german

        char0 = ''
        dest = term

        if not term:
            return ''

        term_length = len(term)

        # convert input string to upper-case
        src = term.translate(_phonet_upper_translation)

        # check "src"
        i = 0
        j = 0
        zeta = 0

        while i < len(src):
            char = src[i]

            if trace:
                print('\ncheck position {}:  src = "{}",  dest = "{}"'.format
                      (j, src[i:], dest[:j]))

            pos = alpha_pos[char]

            if pos >= 2:
                xpos = pos-2

                if i+1 == len(src):
                    pos = alpha_pos['']
                else:
                    pos = alpha_pos[src[i+1]]

                start1 = phonet_hash_1[xpos, pos]
                start2 = phonet_hash_1[xpos, 0]
                end1 = phonet_hash_2[xpos, pos]
                end2 = phonet_hash_2[xpos, 0]

                # preserve rule priorities
                if (start2 >= 0) and ((start1 < 0) or (start2 < start1)):
                    pos = start1
                    start1 = start2
                    start2 = pos
                    pos = end1
                    end1 = end2
                    end2 = pos

                if (end1 >= start2) and (start2 >= 0):
                    if end2 > end1:
                        end1 = end2

                    start2 = -1
                    end2 = -1
            else:
                pos = phonet_hash[char]
                start1 = pos
                end1 = 10000
                start2 = -1
                end2 = -1
3889
            pos = start1
3890
            zeta0 = 0
3891
3892
            if pos >= 0:
3893
                # check rules for this char
3894
                while ((_phonet_rules[pos] is None) or
3895
                       (_phonet_rules[pos][0] == char)):
3896
                    if pos > end1:
3897
                        if start2 > 0:
3898
                            pos = start2
3899
                            start1 = start2
3900
                            start2 = -1
3901
                            end1 = end2
3902
                            end2 = -1
3903
                            continue
3904
3905
                        break
3906
3907
                    if (((_phonet_rules[pos] is None) or
3908
                         (_phonet_rules[pos + mode] is None))):
3909
                        # no conversion rule available
3910
                        pos += 3
3911
                        continue
3912
3913
                    if trace:
3914
                        _trinfo('> rule no.', pos, 'is being checked', lang)
3915
3916
                    # check whole string
3917
                    matches = 1  # number of matching letters
3918
                    priority = 5  # default priority
3919
                    rule = _phonet_rules[pos]
3920
                    rule = rule[1:]
3921
3922
                    while (rule and
3923
                           (len(src) > (i + matches)) and
3924
                           (src[i + matches] == rule[0]) and
3925
                           not rule[0].isdigit() and
                           (rule[0] not in '(-<^$')):
                        matches += 1
                        rule = rule[1:]

                    if rule and (rule[0] == '('):
                        # check an array of letters
                        if (((len(src) > (i + matches)) and
                             src[i + matches].isalpha() and
                             (src[i + matches] in rule[1:]))):
                            matches += 1

                            while rule and rule[0] != ')':
                                rule = rule[1:]

                            # if rule[0] == ')':
                            rule = rule[1:]

                    if rule:
                        priority0 = ord(rule[0])
                    else:
                        priority0 = 0

                    matches0 = matches

                    while rule and rule[0] == '-' and matches > 1:
                        matches -= 1
                        rule = rule[1:]

                    if rule and rule[0] == '<':
                        rule = rule[1:]

                    if rule and rule[0].isdigit():
                        # read priority
                        priority = int(rule[0])
                        rule = rule[1:]

                    if rule and rule[0:2] == '^^':
                        rule = rule[1:]

                    if (not rule or
                            ((rule[0] == '^') and
                             ((i == 0) or not src[i-1].isalpha()) and
                             ((rule[1:2] != '$') or
                              (not (src[i+matches0:i+matches0+1].isalpha()) and
                               (src[i+matches0:i+matches0+1] != '.')))) or
                            ((rule[0] == '$') and (i > 0) and
                             src[i-1].isalpha() and
                             ((not src[i+matches0:i+matches0+1].isalpha()) and
                              (src[i+matches0:i+matches0+1] != '.')))):
                        # look for continuation, if:
                        # matches > 1 and NO '-' in first string
                        pos0 = -1

                        start3 = 0
                        start4 = 0
                        end3 = 0
                        end4 = 0

                        if (((matches > 1) and
                             src[i+matches:i+matches+1] and
                             (priority0 != ord('-')))):
                            char0 = src[i+matches-1]
                            pos0 = alpha_pos[char0]

                            if pos0 >= 2 and src[i+matches]:
                                xpos = pos0 - 2
                                pos0 = alpha_pos[src[i+matches]]
                                start3 = phonet_hash_1[xpos, pos0]
                                start4 = phonet_hash_1[xpos, 0]
                                end3 = phonet_hash_2[xpos, pos0]
                                end4 = phonet_hash_2[xpos, 0]

                                # preserve rule priorities
                                if (((start4 >= 0) and
                                     ((start3 < 0) or (start4 < start3)))):
                                    pos0 = start3
                                    start3 = start4
                                    start4 = pos0
                                    pos0 = end3
                                    end3 = end4
                                    end4 = pos0

                                if (end3 >= start4) and (start4 >= 0):
                                    if end4 > end3:
                                        end3 = end4

                                    start4 = -1
                                    end4 = -1
                            else:
                                pos0 = phonet_hash[char0]
                                start3 = pos0
                                end3 = 10000
                                start4 = -1
                                end4 = -1

                            pos0 = start3

                        # check continuation rules for src[i+matches]
                        if pos0 >= 0:
                            while ((_phonet_rules[pos0] is None) or
                                   (_phonet_rules[pos0][0] == char0)):
                                if pos0 > end3:
                                    if start4 > 0:
                                        pos0 = start4
                                        start3 = start4
                                        start4 = -1
                                        end3 = end4
                                        end4 = -1
                                        continue

                                    priority0 = -1

                                    # important
                                    break

                                if (((_phonet_rules[pos0] is None) or
                                     (_phonet_rules[pos0 + mode] is None))):
                                    # no conversion rule available
                                    pos0 += 3
                                    continue

                                if trace:
                                    _trinfo('> > continuation rule no.', pos0,
                                            'is being checked', lang)

                                # check whole string
                                matches0 = matches
                                priority0 = 5
                                rule = _phonet_rules[pos0]
                                rule = rule[1:]

                                while (rule and
                                       (src[i+matches0:i+matches0+1] ==
                                        rule[0]) and
                                       not rule[0].isdigit() and
                                       (rule[0] not in '(-<^$')):
                                    matches0 += 1
                                    rule = rule[1:]

                                if rule and rule[0] == '(':
                                    # check an array of letters
                                    if ((src[i+matches0:i+matches0+1]
                                         .isalpha() and
                                         (src[i+matches0] in rule[1:]))):
                                        matches0 += 1

                                        while rule and rule[0] != ')':
                                            rule = rule[1:]

                                        # if rule[0] == ')':
                                        rule = rule[1:]

                                while rule and rule[0] == '-':
                                    # "matches0" is NOT decremented
                                    # because of  "if (matches0 == matches)"
                                    rule = rule[1:]

                                if rule and rule[0] == '<':
                                    rule = rule[1:]

                                if rule and rule[0].isdigit():
                                    priority0 = int(rule[0])
                                    rule = rule[1:]

                                if (not rule or
                                        # rule == '^' is not possible here
                                        ((rule[0] == '$') and not
                                         src[i+matches0:i+matches0+1]
                                         .isalpha() and
                                         (src[i+matches0:i+matches0+1]
                                          != '.'))):
                                    if matches0 == matches:
                                        # this is only a partial string
                                        if trace:
                                            _trinfo('> > continuation ' +
                                                    'rule no.',
                                                    pos0,
                                                    'not used (too short)',
                                                    lang)

                                        pos0 += 3
                                        continue

                                    if priority0 < priority:
                                        # priority is too low
                                        if trace:
                                            _trinfo('> > continuation ' +
                                                    'rule no.',
                                                    pos0,
                                                    'not used (priority)',
                                                    lang)

                                        pos0 += 3
                                        continue

                                    # continuation rule found
                                    break

                                if trace:
                                    _trinfo('> > continuation rule no.', pos0,
                                            'not used', lang)

                                pos0 += 3

                            # end of "while"
                            if ((priority0 >= priority) and
                                    ((_phonet_rules[pos0] is not None) and
                                     (_phonet_rules[pos0][0] == char0))):

                                if trace:
                                    _trinfo('> rule no.', pos, '', lang)
                                    _trinfo('> not used because of ' +
                                            'continuation', pos0, '', lang)

                                pos += 3
                                continue

                        # replace string
                        if trace:
                            _trinfo('Rule no.', pos, 'is applied', lang)

                        if ((_phonet_rules[pos] and
                             ('<' in _phonet_rules[pos][1:]))):
                            priority0 = 1
                        else:
                            priority0 = 0

                        rule = _phonet_rules[pos + mode]

                        if (priority0 == 1) and (zeta == 0):
                            # rule with '<' is applied
                            if ((j > 0) and rule and
                                    ((dest[j-1] == char) or
                                     (dest[j-1] == rule[0]))):
                                j -= 1

                            zeta0 = 1
                            zeta += 1
                            matches0 = 0

                            while rule and src[i+matches0]:
                                src = (src[0:i+matches0] + rule[0] +
                                       src[i+matches0+1:])
                                matches0 += 1
                                rule = rule[1:]

                            if matches0 < matches:
                                src = (src[0:i+matches0] +
                                       src[i+matches:])

                            char = src[i]
                        else:
                            i = i + matches - 1
                            zeta = 0

                            while len(rule) > 1:
                                if (j == 0) or (dest[j - 1] != rule[0]):
                                    dest = (dest[0:j] + rule[0] +
                                            dest[min(len(dest), j+1):])
                                    j += 1

                                rule = rule[1:]

                            # new "current char"
                            if not rule:
                                rule = ''
                                char = ''
                            else:
                                char = rule[0]

                            if ((_phonet_rules[pos] and
                                 '^^' in _phonet_rules[pos][1:])):
                                if char:  # pragma: no branch
                                    dest = (dest[0:j] + char +
                                            dest[min(len(dest), j + 1):])
                                    j += 1

                                src = src[i + 1:]
                                i = 0
                                zeta0 = 1

                        break

                    pos += 3

                    if pos > end1 and start2 > 0:
                        pos = start2
                        start1 = start2
                        end1 = end2
                        start2 = -1
                        end2 = -1

            if zeta0 == 0:
                if char and ((j == 0) or (dest[j-1] != char)):
                    # delete multiple letters only
                    dest = dest[0:j] + char + dest[min(j+1, term_length):]
                    j += 1

                i += 1
                zeta = 0

        dest = dest[0:j]

        return dest

    _initialize_phonet(lang)

    word = unicodedata.normalize('NFKC', text_type(word))
    return _phonet(word, mode, lang, trace)
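

# Illustrative sketch, not part of the original module: how the tail of a
# phonet rule pattern is read once its literal letters have been matched,
# following the order used in _phonet() above: '-' markers, an optional '<',
# then an optional one-digit priority. The name _parse_rule_suffix_sketch is
# hypothetical, and the sketch is simplified: it counts every '-' and ignores
# the "matches > 1" guard that the real loop applies.
def _parse_rule_suffix_sketch(rule, default_priority=5):
    """Return (minus_count, has_lt, priority, remainder) for a rule tail."""
    minus_count = 0
    while rule and rule[0] == '-':
        minus_count += 1
        rule = rule[1:]

    has_lt = bool(rule) and rule[0] == '<'
    if has_lt:
        rule = rule[1:]

    priority = default_priority
    if rule and rule[0].isdigit():
        priority = int(rule[0])
        rule = rule[1:]

    # Whatever is left holds the '^' / '$' anchors, if any.
    return minus_count, has_lt, priority, rule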


def spfc(word):
    """Return the Standardized Phonetic Frequency Code (SPFC) of a word.

    Standardized Phonetic Frequency Code is roughly Soundex-like.
    This implementation is based on pages 19-21 of
    https://archive.org/stream/accessingindivid00moor#page/19/mode/1up

    :param str word: the word to transform
    :returns: the SPFC value
    :rtype: str

    >>> spfc('Christopher Smith')
    '01160'
    >>> spfc('Christopher Schmidt')
    '01160'
    >>> spfc('Niall Smith')
    '01660'
    >>> spfc('Niall Schmidt')
    '01660'

    >>> spfc('L.Smith')
    '01960'
    >>> spfc('R.Miller')
    '65490'

    >>> spfc(('L', 'Smith'))
    '01960'
    >>> spfc(('R', 'Miller'))
    '65490'
    """
    _pf1 = dict(zip((ord(_) for _ in 'SZCKQVFPUWABLORDHIEMNXGJT'),
                    '0011112222334445556666777'))
    _pf2 = dict(zip((ord(_) for _ in
                     'SZCKQFPXABORDHIMNGJTUVWEL'),
                    '0011122233445556677788899'))
    _pf3 = dict(zip((ord(_) for _ in
                     'BCKQVDTFLPGJXMNRSZAEHIOUWY'),
                    '00000112223334456677777777'))

    _substitutions = (('DK', 'K'), ('DT', 'T'), ('SC', 'S'), ('KN', 'N'),
                      ('MN', 'N'))

    def _raise_word_ex():
        """Raise an AttributeError."""
        raise AttributeError('word attribute must be a string with a space ' +
                             'or period dividing the first and last names ' +
                             'or a tuple/list consisting of the first and ' +
                             'last names')

    if not word:
        return ''

    if isinstance(word, (str, text_type)):
        names = word.split('.', 1)
        if len(names) != 2:
            names = word.split(' ', 1)
            if len(names) != 2:
                _raise_word_ex()
    elif hasattr(word, '__iter__'):
        if len(word) != 2:
            _raise_word_ex()
        names = word
    else:
        _raise_word_ex()

    names = [unicodedata.normalize('NFKD', text_type(_.strip()
                                                     .replace('ß', 'SS')
                                                     .upper()))
             for _ in names]
    code = ''

    def steps_one_to_three(name):
        """Perform the first three steps of SPFC."""
        # filter out non A-Z
        name = ''.join(_ for _ in name if _ in
                       {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K',
                        'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V',
                        'W', 'X', 'Y', 'Z'})

        # 1. In the field, convert DK to K, DT to T, SC to S, KN to N,
        # and MN to N
        for subst in _substitutions:
            name = name.replace(subst[0], subst[1])

        # 2. In the name field, replace multiple letters with a single letter
        name = _delete_consecutive_repeats(name)

        # 3. Remove vowels, W, H, and Y, but keep the first letter in the name
        # field.
        if name:
            name = name[0] + ''.join(_ for _ in name[1:] if _ not in
                                     {'A', 'E', 'H', 'I', 'O', 'U', 'W', 'Y'})
        return name

    names = [steps_one_to_three(_) for _ in names]

    # 4. The first digit of the code is obtained using PF1 and the first letter
    # of the name field. Remove this letter after coding.
    if names[1]:
        code += names[1][0].translate(_pf1)
        names[1] = names[1][1:]

    # 5. Using the last letters of the name, use Table PF3 to obtain the
    # second digit of the code. Use as many letters as possible and remove
    # after coding.
    if names[1]:
        if names[1][-3:] == 'STN' or names[1][-3:] == 'PRS':
            code += '8'
            names[1] = names[1][:-3]
        elif names[1][-2:] == 'SN':
            code += '8'
            names[1] = names[1][:-2]
        elif names[1][-3:] == 'STR':
            code += '9'
            names[1] = names[1][:-3]
        elif names[1][-2:] in {'SR', 'TN', 'TD'}:
            code += '9'
            names[1] = names[1][:-2]
        elif names[1][-3:] == 'DRS':
            code += '7'
            names[1] = names[1][:-3]
        elif names[1][-2:] in {'TR', 'MN'}:
            code += '7'
            names[1] = names[1][:-2]
        else:
            code += names[1][-1].translate(_pf3)
            names[1] = names[1][:-1]

    # 6. The third digit is found using Table PF2 and the first character of
    # the first name. Remove after coding.
    if names[0]:
        code += names[0][0].translate(_pf2)
        names[0] = names[0][1:]

    # 7. The fourth digit is found using Table PF2 and the first character of
    # the name field. If no letters remain use zero. After coding remove the
    # letter.
    # 8. The fifth digit is found in the same manner as the fourth using the
    # remaining characters of the name field if any.
    for _ in range(2):
        if names[1]:
            code += names[1][0].translate(_pf2)
            names[1] = names[1][1:]
        else:
            code += '0'

    return code
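

# Illustrative sketch, not part of the original module: the name-splitting
# step that spfc() performs inline, pulled out so it can be read on its own.
# The helper name _split_spfc_names_sketch is hypothetical, and the sketch
# assumes Python 3 (it checks only for str, whereas spfc() also accepts
# text_type).
def _split_spfc_names_sketch(word):
    """Split a name into (first, last) as spfc() appears to."""
    if isinstance(word, str):
        # Prefer a period separator; fall back to the first space.
        names = word.split('.', 1)
        if len(names) != 2:
            names = word.split(' ', 1)
        if len(names) != 2:
            raise AttributeError('expected a first and a last name')
        return tuple(names)
    if hasattr(word, '__iter__') and len(word) == 2:
        # Already a (first, last) pair such as ('L', 'Smith').
        return tuple(word)
    raise AttributeError('expected a first and a last name')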


def statistics_canada(word, maxlength=4):
    """Return the Statistics Canada code for a word.

    The original description of this algorithm could not be located, and
    may only have been specified in an unpublished TR. The coding does not
    appear to be in use by Statistics Canada any longer. In its place, this is
    an implementation of the "Census modified Statistics Canada name coding
    procedure".

    The modified version of this algorithm is described in Appendix B of
    Lynch, Billy T. and William L. Arends. `Selection of a Surname Coding
    Procedure for the SRS Record Linkage System.` Statistical Reporting
    Service, U.S. Department of Agriculture, Washington, D.C. February 1977.
    https://naldc.nal.usda.gov/download/27833/PDF

    :param str word: the word to transform
    :param int maxlength: the maximum length (default 4) of the code to return
    :returns: the Statistics Canada name code value
    :rtype: str

    >>> statistics_canada('Christopher')
    'CHRS'
    >>> statistics_canada('Niall')
    'NL'
    >>> statistics_canada('Smith')
    'SMTH'
    >>> statistics_canada('Schmidt')
    'SCHM'
    """
    # uppercase, normalize, decompose, and filter non-A-Z out
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
    word = word.replace('ß', 'SS')
    word = ''.join(c for c in word if c in
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
                    'Y', 'Z'})
    if not word:
        return ''

    code = word[1:]
    for vowel in {'A', 'E', 'I', 'O', 'U', 'Y'}:
        code = code.replace(vowel, '')
    code = word[0]+code
    code = _delete_consecutive_repeats(code)
    code = code.replace(' ', '')

    return code[:maxlength]
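

# Illustrative sketch, not part of the original module: statistics_canada()
# relies on _delete_consecutive_repeats(), which is defined elsewhere in this
# file. Assuming that helper collapses each run of identical characters to a
# single character, its effect would look like the loop below. The name
# _collapse_runs_sketch is hypothetical.
def _collapse_runs_sketch(text):
    """Collapse consecutive repeated characters, e.g. 'NLL' -> 'NL'."""
    result = []
    for char in text:
        # Keep a character only when it differs from the one just kept.
        if not result or result[-1] != char:
            result.append(char)
    return ''.join(result)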


def lein(word, maxlength=4, zero_pad=True):
    """Return the Lein code for a word.

    This is Lein name coding, based on
    https://naldc-legacy.nal.usda.gov/naldc/download.xhtml?id=27833&content=PDF

    :param str word: the word to transform
    :param int maxlength: the maximum length (default 4) of the code to return
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
        maxlength string
    :returns: the Lein code
    :rtype: str

    >>> lein('Christopher')
    'C351'
    >>> lein('Niall')
    'N300'
    >>> lein('Smith')
    'S210'
    >>> lein('Schmidt')
    'S521'
    """
    _lein_translation = dict(zip((ord(_) for _ in
                                  'BCDFGJKLMNPQRSTVXZ'),
                                 '451455532245351455'))

    # uppercase, normalize, decompose, and filter non-A-Z out
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
    word = word.replace('ß', 'SS')
    word = ''.join(c for c in word if c in
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
                    'Y', 'Z'})

    if not word:
        return ''

    code = word[0]  # Rule 1
    word = word[1:].translate(str.maketrans('', '', 'AEIOUYWH '))  # Rule 2
    word = _delete_consecutive_repeats(word)  # Rule 3
    code += word.translate(_lein_translation)  # Rule 4

    if zero_pad:
        code += ('0'*maxlength)  # Rule 4

    return code[:maxlength]
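

# Illustrative sketch, not part of the original module: how an ord()-keyed
# table such as _lein_translation drives str.translate(). Characters with an
# entry are replaced; characters without one pass through unchanged. The name
# _translate_sketch is hypothetical; the letter and digit strings are copied
# from lein() above. Assumes Python 3 str semantics.
def _translate_sketch(text):
    """Map consonants to Lein digits via an ordinal-keyed dict."""
    table = dict(zip((ord(c) for c in 'BCDFGJKLMNPQRSTVXZ'),
                     '451455532245351455'))
    return text.translate(table)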


def roger_root(word, maxlength=5, zero_pad=True):
    """Return the Roger Root code for a word.

    This is Roger Root name coding, based on
    https://naldc-legacy.nal.usda.gov/naldc/download.xhtml?id=27833&content=PDF

    :param str word: the word to transform
    :param int maxlength: the maximum length (default 5) of the code to return
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
        maxlength string
    :returns: the Roger Root code
    :rtype: str

    >>> roger_root('Christopher')
    '06401'
    >>> roger_root('Niall')
    '02500'
    >>> roger_root('Smith')
    '00310'
    >>> roger_root('Schmidt')
    '06310'
    """
    # uppercase, normalize, decompose, and filter non-A-Z out
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
    word = word.replace('ß', 'SS')
    word = ''.join(c for c in word if c in
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
                    'Y', 'Z'})

    if not word:
        return ''

    # '*' is used to prevent combining by _delete_consecutive_repeats()
    _init_patterns = {4: {'TSCH': '06'},
                      3: {'TSH': '06', 'SCH': '06'},
                      2: {'CE': '0*0', 'CH': '06', 'CI': '0*0', 'CY': '0*0',
                          'DG': '07', 'GF': '08', 'GM': '03', 'GN': '02',
                          'KN': '02', 'PF': '08', 'PH': '08', 'PN': '02',
                          'SH': '06', 'TS': '0*0', 'WR': '04'},
                      1: {'A': '1', 'B': '09', 'C': '07', 'D': '01', 'E': '1',
                          'F': '08', 'G': '07', 'H': '2', 'I': '1', 'J': '3',
                          'K': '07', 'L': '05', 'M': '03', 'N': '02', 'O': '1',
                          'P': '09', 'Q': '07', 'R': '04', 'S': '0*0',
                          'T': '01', 'U': '1', 'V': '08', 'W': '4', 'X': '07',
                          'Y': '5', 'Z': '0*0'}}

    _med_patterns = {4: {'TSCH': '6'},
                     3: {'TSH': '6', 'SCH': '6'},
                     2: {'CE': '0', 'CH': '6', 'CI': '0', 'CY': '0', 'DG': '7',
                         'PH': '8', 'SH': '6', 'TS': '0'},
                     1: {'B': '9', 'C': '7', 'D': '1', 'F': '8', 'G': '7',
                         'J': '6', 'K': '7', 'L': '5', 'M': '3', 'N': '2',
                         'P': '9', 'Q': '7', 'R': '4', 'S': '0', 'T': '1',
                         'V': '8', 'X': '7', 'Z': '0',
                         'A': '*', 'E': '*', 'H': '*', 'I': '*', 'O': '*',
                         'U': '*', 'W': '*', 'Y': '*'}}

    code = ''
    pos = 0

    # Do first digit(s) first
    for num in range(4, 0, -1):
        if word[:num] in _init_patterns[num]:
            code = _init_patterns[num][word[:num]]
            pos += num
            break
    else:
        pos += 1  # Advance if nothing is recognized

    # Then code subsequent digits
    while pos < len(word):
        for num in range(4, 0, -1):
            if word[pos:pos+num] in _med_patterns[num]:
                code += _med_patterns[num][word[pos:pos+num]]
                pos += num
                break
        else:
            pos += 1  # Advance if nothing is recognized

    code = _delete_consecutive_repeats(code)
    code = code.replace('*', '')

    if zero_pad:
        code += '0'*maxlength

    return code[:maxlength]
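

# Illustrative sketch, not part of the original module: the longest-match
# scan that roger_root() uses for medial letters, with for/else advancing
# one character when no pattern of any length matches. The function name
# _longest_match_scan_sketch and the small pattern table in the usage note
# are hypothetical.
def _longest_match_scan_sketch(word, patterns, max_len):
    """Concatenate codes for the longest pattern found at each position."""
    code = ''
    pos = 0
    while pos < len(word):
        # Try the longest candidate first so 'SCH' beats 'S' then 'CH'.
        for num in range(max_len, 0, -1):
            chunk = word[pos:pos + num]
            if chunk in patterns.get(num, {}):
                code += patterns[num][chunk]
                pos += num
                break
        else:
            pos += 1  # nothing recognized at this position
    return code


# Usage note (hypothetical table): calling
# _longest_match_scan_sketch('SCHMIDT', table, 3) with
# table = {3: {'SCH': '6'}, 2: {'CH': '6'},
#          1: {'S': '0', 'M': '3', 'D': '1', 'T': '1'}}
# consumes 'SCH' as one unit and skips the unlisted 'I'.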


def onca(word, maxlength=4, zero_pad=True):
    """Return the Oxford Name Compression Algorithm (ONCA) code for a word.

    This is the Oxford Name Compression Algorithm, based on:
    Gill, Leicester E. 1997. "OX-LINK: The Oxford Medical Record Linkage
    System." In ``Record Linkage Techniques -- 1997``. Arlington, VA. March
    20--21, 1997.
    https://nces.ed.gov/FCSM/pdf/RLT97.pdf

    I can find no complete description of the "anglicised version of the NYSIIS
    method" identified as the first step in this algorithm, so this is likely
    not a correct implementation, in that it employs the standard NYSIIS
    algorithm.

    :param str word: the word to transform
    :param int maxlength: the maximum length (default 4) of the code to return
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
        maxlength string
    :returns: the ONCA code
    :rtype: str

    >>> onca('Christopher')
    'C623'
    >>> onca('Niall')
    'N400'
    >>> onca('Smith')
    'S530'
    >>> onca('Schmidt')
    'S530'
    """
    # In the most extreme case, 3 characters of NYSIIS input can be compressed
    # to one character of output, so give it triple the maxlength.
    return soundex(nysiis(word, maxlength=maxlength*3), maxlength,
                   zero_pad=zero_pad)
4606
4607
4608
def eudex(word, maxlength=8):
4609
    """Return the eudex phonetic hash of a word.
4610
4611
    This implementation of eudex phonetic hashing is based on the specification
4612
    (not the reference implementation) at:
4613
    Ticki. 2017. "Eudex: A blazingly fast phonetic reduction/hashing
4614
    algorithm." https://docs.rs/crate/eudex
4615
4616
    Further details can be found at
4617
    http://ticki.github.io/blog/the-eudex-algorithm/
4618
4619
    :param str word: the word to transform
4620
    :param int maxlength: the length of the code returned (defaults to 8)
4621
    :returns: the eudex hash
4622
    :rtype: str
4623
    """
    _trailing_phones = {
        'a': 0,  # a
        'b': 0b01001000,  # b
        'c': 0b00001100,  # c
        'd': 0b00011000,  # d
        'e': 0,  # e
        'f': 0b01000100,  # f
        'g': 0b00001000,  # g
        'h': 0b00000100,  # h
        'i': 1,  # i
        'j': 0b00000101,  # j
        'k': 0b00001001,  # k
        'l': 0b10100000,  # l
        'm': 0b00000010,  # m
        'n': 0b00010010,  # n
        'o': 0,  # o
        'p': 0b01001001,  # p
        'q': 0b10101000,  # q
        'r': 0b10100001,  # r
        's': 0b00010100,  # s
        't': 0b00011101,  # t
        'u': 1,  # u
        'v': 0b01000101,  # v
        'w': 0b00000000,  # w
        'x': 0b10000100,  # x
        'y': 1,  # y
        'z': 0b10010100,  # z

        'ß': 0b00010101,  # ß
        'à': 0,  # à
        'á': 0,  # á
        'â': 0,  # â
        'ã': 0,  # ã
        'ä': 0,  # ä [æ]
        'å': 1,  # å [oː]
        'æ': 0,  # æ [æ]
        'ç': 0b10010101,  # ç [t͡ʃ]
        'è': 1,  # è
        'é': 1,  # é
        'ê': 1,  # ê
        'ë': 1,  # ë
        'ì': 1,  # ì
        'í': 1,  # í
        'î': 1,  # î
        'ï': 1,  # ï
        'ð': 0b00010101,  # ð [ð̠] (represented as a non-plosive T)
        'ñ': 0b00010111,  # ñ [nj] (represented as a combination of n and j)
        'ò': 0,  # ò
        'ó': 0,  # ó
        'ô': 0,  # ô
        'õ': 0,  # õ
        'ö': 1,  # ö [ø]
        '÷': 0b11111111,  # ÷
        'ø': 1,  # ø [ø]
        'ù': 1,  # ù
        'ú': 1,  # ú
        'û': 1,  # û
        'ü': 1,  # ü
        'ý': 1,  # ý
        'þ': 0b00010101,  # þ [ð̠] (represented as a non-plosive T)
        'ÿ': 1,  # ÿ
    }

    _initial_phones = {
        'a': 0b10000100,  # a*
        'b': 0b00100100,  # b
        'c': 0b00000110,  # c
        'd': 0b00001100,  # d
        'e': 0b11011000,  # e*
        'f': 0b00100010,  # f
        'g': 0b00000100,  # g
        'h': 0b00000010,  # h
        'i': 0b11111000,  # i*
        'j': 0b00000011,  # j
        'k': 0b00000101,  # k
        'l': 0b01010000,  # l
        'm': 0b00000001,  # m
        'n': 0b00001001,  # n
        'o': 0b10010100,  # o*
        'p': 0b00100101,  # p
        'q': 0b01010100,  # q
        'r': 0b01010001,  # r
        's': 0b00001010,  # s
        't': 0b00001110,  # t
        'u': 0b11100000,  # u*
        'v': 0b00100011,  # v
        'w': 0b00000000,  # w
        'x': 0b01000010,  # x
        'y': 0b11100100,  # y*
        'z': 0b01001010,  # z

        'ß': 0b00001011,  # ß
        'à': 0b10000101,  # à
        'á': 0b10000101,  # á
        'â': 0b10000000,  # â
        'ã': 0b10000110,  # ã
        'ä': 0b10100110,  # ä [æ]
        'å': 0b11000010,  # å [oː]
        'æ': 0b10100111,  # æ [æ]
        'ç': 0b01010100,  # ç [t͡ʃ]
        'è': 0b11011001,  # è
        'é': 0b11011001,  # é
        'ê': 0b11011001,  # ê
        'ë': 0b11000110,  # ë [ə] or [œ]
        'ì': 0b11111001,  # ì
        'í': 0b11111001,  # í
        'î': 0b11111001,  # î
        'ï': 0b11111001,  # ï
        'ð': 0b00001011,  # ð [ð̠] (represented as a non-plosive T)
        'ñ': 0b00001011,  # ñ [nj] (represented as a combination of n and j)
        'ò': 0b10010101,  # ò
        'ó': 0b10010101,  # ó
        'ô': 0b10010101,  # ô
        'õ': 0b10010101,  # õ
        'ö': 0b11011100,  # ö [œ] or [ø]
        '÷': 0b11111111,  # ÷
        'ø': 0b11011101,  # ø [œ] or [ø]
        'ù': 0b11100001,  # ù
        'ú': 0b11100001,  # ú
        'û': 0b11100001,  # û
        'ü': 0b11100101,  # ü
        'ý': 0b11100101,  # ý
        'þ': 0b00001011,  # þ [ð̠] (represented as a non-plosive T)
        'ÿ': 0b11100101,  # ÿ
    }
    # Lowercase input & filter unknown characters
    word = ''.join(char for char in word.lower() if char in _initial_phones)

    # Fall back to the '÷' sentinel if no codable characters remain, so that
    # word[0] below cannot raise an IndexError
    if not word:
        word = '÷'

    # Perform initial eudex coding of each character
    values = [_initial_phones[word[0]]]
    values += [_trailing_phones[char] for char in word[1:]]

    # Right-shift by one to determine if second instance should be skipped
    shifted_values = [val >> 1 for val in values]
    condensed_values = [values[0]]
    for n in range(1, len(shifted_values)):
        if shifted_values[n] != shifted_values[n-1]:
            condensed_values.append(values[n])

    # Add padding after first character & trim beyond maxlength
    values = ([condensed_values[0]] +
              [0]*max(0, maxlength - len(condensed_values)) +
              condensed_values[1:maxlength])

    # Combine individual character values into eudex hash
    hash_value = 0
    for val in values:
        hash_value = (hash_value << 8) | val

    return hash_value
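
Because the hash is a plain integer built from 8-bit values, two hashes can be compared bit by bit. The sketch below assumes a simple unweighted Hamming distance over the XOR of two hashes; `pack_bytes` and `hamming_distance` are hypothetical helpers written for illustration and are not part of Abydos, and the byte values are made up rather than taken from real words.

```python
def pack_bytes(values):
    """Pack a list of 8-bit values into one integer, most significant first."""
    packed = 0
    for val in values:
        packed = (packed << 8) | val
    return packed


def hamming_distance(hash_a, hash_b):
    """Count the bits that differ between two integer hashes."""
    return bin(hash_a ^ hash_b).count('1')


# Two illustrative hashes that differ only in one bit of the last byte
first = pack_bytes([0b00001010, 0, 0b00000010])
second = pack_bytes([0b00001010, 0, 0b00010010])

print(hamming_distance(first, second))  # 1
```

A weighted comparison that treats a difference in the first byte as more significant than one in later bytes would be a natural refinement, but it is left out of this sketch.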


def haase_phonetik(word):
    """Return the Haase Phonetik (numeric output) code for a word.

    Based on the algorithm described at
    https://github.com/elastic/elasticsearch/blob/master/plugins/analysis-phonetic/src/main/java/org/elasticsearch/index/analysis/phonetic/HaasePhonetik.java

    While the output code is numeric, it is still a str.

    :param str word: the word to transform
    :returns: the Haase Phonetik value as a numeric string
    :rtype: str
    """
    def _after(word, i, letters):
        """Return True if word[i] follows one of the supplied letters."""
        return i > 0 and word[i-1] in letters

    def _before(word, i, letters):
        """Return True if word[i] precedes one of the supplied letters."""
        return i+1 < len(word) and word[i+1] in letters

    _vowels = {'A', 'E', 'I', 'J', 'O', 'U', 'Y'}

    sdx = ''

    # Expand eszett & umlauts before NFKD normalization: NFKD splits 'Ä' into
    # 'A' plus a combining diaeresis, after which the precomposed characters
    # below would no longer match
    word = text_type(word.upper())
    word = word.replace('ß', 'SS')

    word = word.replace('Ä', 'AE')
    word = word.replace('Ö', 'OE')
    word = word.replace('Ü', 'UE')
    word = unicodedata.normalize('NFKD', word)
    word = ''.join(c for c in word if c in
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
                    'Y', 'Z'})

    # Nothing to convert, return base case
    if not word:
        return sdx

    word = word.replace('AUN', 'OWN')
    word = word.replace('RB', 'RW')
    word = word.replace('WSK', 'RSK')
    if word[-1] == 'A':
        word = word[:-1]+'AR'
    if word[-1] == 'O':
        word = word[:-1]+'OW'
    word = word.replace('SCH', 'CH')
    word = word.replace('GLI', 'LI')
    if word[-3:] == 'EAU':
        word = word[:-3]+'O'
    if word[:2] == 'CH':
        word = 'SCH'+word[2:]
    word = word.replace('AUX', 'O')
    word = word.replace('EUX', 'O')
    word = word.replace('ILLE', 'I')
4836
    for i in range(len(word)):
4837 View Code Duplication
        if word[i] in _vowels:
0 ignored issues
show
Duplication introduced by
This code seems to be duplicated in your project.
Loading history...
4838
            sdx += '9'
4839
        elif word[i] == 'B':
4840
            sdx += '1'
4841
        elif word[i] == 'P':
4842
            if _before(word, i, {'H'}):
4843
                sdx += '3'
4844
            else:
4845
                sdx += '1'
4846
        elif word[i] in {'D', 'T'}:
4847
            if _before(word, i, {'C', 'S', 'Z'}):
4848
                sdx += '8'
4849
            else:
4850
                sdx += '2'
4851
        elif word[i] in {'F', 'V', 'W'}:
4852
            sdx += '3'
4853
        elif word[i] in {'G', 'K', 'Q'}:
4854
            sdx += '4'
4855
        elif word[i] == 'C':
4856
            if _after(word, i, {'S', 'Z'}):
4857
                sdx += '8'
4858
            elif i == 0:
4859
                if _before(word, i, {'A', 'H', 'K', 'L', 'O', 'Q', 'R', 'U',
4860
                                     'X'}):
4861
                    sdx += '4'
4862
                else:
4863
                    sdx += '8'
4864
            elif _before(word, i, {'A', 'H', 'K', 'O', 'Q', 'U', 'X'}):
4865
                sdx += '4'
4866
            else:
4867
                sdx += '8'
4868
        elif word[i] == 'X':
4869
            if _after(word, i, {'C', 'K', 'Q'}):
4870
                sdx += '8'
4871
            else:
4872
                sdx += '48'
4873
        elif word[i] == 'L':
4874
            sdx += '5'
4875
        elif word[i] in {'M', 'N'}:
4876
            sdx += '6'
4877
        elif word[i] == 'R':
4878
            sdx += '7'
4879
        elif word[i] in {'S', 'Z'}:
4880
            sdx += '8'
4881
4882
    sdx = _delete_consecutive_repeats(sdx)
4883
4884
    if sdx:
4885
        sdx = sdx[0] + sdx[1:].replace('9', '')
4886
4887
    return sdx
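
The last two steps of `haase_phonetik` collapse runs of identical digits and then drop every vowel marker (`9`) except a leading one. The sketch below isolates that post-processing. `collapse_repeats` is a hypothetical stand-in, built on `itertools.groupby`, for the module's `_delete_consecutive_repeats` helper, whose actual implementation is not shown in this section.

```python
from itertools import groupby


def collapse_repeats(code):
    """Reduce each run of identical characters to a single character."""
    return ''.join(char for char, _ in groupby(code))


def strip_vowel_markers(code):
    """Remove every '9' except one in the first position."""
    return code[:1] + code[1:].replace('9', '')


raw = '9966799'                         # illustrative digit string
collapsed = collapse_repeats(raw)       # '9679'
print(strip_vowel_markers(collapsed))   # 967
```

Slicing with `code[:1]` rather than indexing `code[0]` lets the helper accept an empty string without a separate guard.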


def reth_schek_phonetik(word):
    """Return Reth-Schek Phonetik code for a word.

    This algorithm is proposed in:
    von Reth, Hans-Peter and Schek, Hans-Jörg. 1977. "Eine Zugriffsmethode für
    die phonetische Ähnlichkeitssuche." Heidelberg Scientific Center technical
    reports 77.03.002. IBM Deutschland GmbH.

    Since I couldn't secure a copy of that document (maybe I'll look for it
    next time I'm in Germany), this implementation is based on what I could
    glean from the implementations published by the German Record Linkage
    Center (www.record-linkage.de):
    - Privacy-preserving Record Linkage (PPRL) (in R)
    - Merge ToolBox (in Java)

    Rules that are unclear:
    - Should 'C' become 'G' or 'Z'? (PPRL has both, 'Z' rule blocked)
    - Should 'CC' become 'G'? (PPRL has blocked 'CK' that may be a typo)
    - Should 'TUI' -> 'ZUI' rule exist? (PPRL has rule, but I can't
        think of a German word with '-tui-' in it.)
    - Should we really change 'SCH' -> 'CH' and then 'CH' -> 'SCH'?

    :param str word: the word to transform
    :returns: the Reth-Schek Phonetik code
    :rtype: str
    """
4915
    replacements = {3: {'AEH': 'E', 'IEH': 'I', 'OEH': 'OE', 'UEH': 'UE',
4916
                        'SCH': 'CH', 'ZIO': 'TIO', 'TIU': 'TIO', 'ZIU': 'TIO',
4917
                        'CHS': 'X', 'CKS': 'X', 'AEU': 'OI'},
4918
                    2: {'LL': 'L', 'AA': 'A', 'AH': 'A', 'BB': 'B', 'PP': 'B',
4919
                        'BP': 'B', 'PB': 'B', 'DD': 'D', 'DT': 'D', 'TT': 'D',
4920
                        'TH': 'D', 'EE': 'E', 'EH': 'E', 'AE': 'E', 'FF': 'F',
4921
                        'PH': 'F', 'KK': 'K', 'GG': 'G', 'GK': 'G', 'KG': 'G',
4922
                        'CK': 'G', 'CC': 'C', 'IE': 'I', 'IH': 'I', 'MM': 'M',
4923
                        'NN': 'N', 'OO': 'O', 'OH': 'O', 'SZ': 'S', 'UH': 'U',
4924
                        'GS': 'X', 'KS': 'X', 'TZ': 'Z', 'AY': 'AI',
4925
                        'EI': 'AI', 'EY': 'AI', 'EU': 'OI', 'RR': 'R',
4926
                        'SS': 'S', 'KW': 'QU'},
4927
                    1: {'P': 'B', 'T': 'D', 'V': 'F', 'W': 'F', 'C': 'G',
4928
                        'K': 'G', 'Y': 'I'}}
4929
4930
    # Uppercase
4931
    word = word.upper()
4932
4933
    # Replace umlauts/eszett
4934
    word = word.replace('Ä', 'AE')
4935
    word = word.replace('Ö', 'OE')
4936
    word = word.replace('Ü', 'UE')
4937
    word = word.replace('ß', 'SS')
4938
4939
    # Main loop, using above replacements table
4940
    pos = 0
4941
    while pos < len(word):
4942
        for num in range(3, 0, -1):
4943
            if word[pos:pos+num] in replacements[num]:
4944
                word = (word[:pos] + replacements[num][word[pos:pos+num]]
4945
                        + word[pos+num:])
4946
                pos += 1
4947
                break
4948
        else:
4949
            pos += 1  # Advance if nothing is recognized
4950
4951
    # Change 'CH' back(?) to 'SCH'
4952
    word = word.replace('CH', 'SCH')
4953
4954
    # Replace final sequences
4955
    if word[-2:] == 'ER':
4956
        word = word[:-2]+'R'
4957
    elif word[-2:] == 'EL':
4958
        word = word[:-2]+'L'
4959
    elif word[-1] == 'H':
4960
        word = word[:-1]
4961
4962
    return word
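
The main loop of `reth_schek_phonetik` is a longest-match-first rewrite: at each position it tries the longest key length first, substitutes on the first hit, and then advances by one character. The sketch below shows the same scan on a toy table. `apply_replacements` and `toy_table` are hypothetical names used only for this illustration, and the toy rules are a small subset chosen to keep the trace short.

```python
def apply_replacements(word, table):
    """Rewrite word left to right, preferring the longest matching key."""
    pos = 0
    while pos < len(word):
        # Try key lengths from longest to shortest at the current position
        for num in range(max(table), 0, -1):
            chunk = word[pos:pos + num]
            if chunk in table.get(num, {}):
                word = word[:pos] + table[num][chunk] + word[pos + num:]
                break
        # Advance one character whether or not a rule fired
        pos += 1
    return word


toy_table = {2: {'PH': 'F', 'TT': 'D'}, 1: {'T': 'D'}}

# 'PH' -> 'F' at position 0, then 'TT' -> 'D' at position 2
print(apply_replacements('PHOTT', toy_table))  # FOD
```

Since the position advances by one after a substitution, the text that a rule has just produced is not rescanned from its first character, so the result of a rule is only partly exposed to later rules.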


def bmpm(word, language_arg=0, name_mode='gen', match_mode='approx',
         concat=False, filter_langs=False):
    """Return the Beider-Morse Phonetic Matching algorithm code for a word.

    The Beider-Morse Phonetic Matching algorithm is described at:
    http://stevemorse.org/phonetics/bmpm.htm
    The reference implementation is licensed under GPLv3 and available at:
    http://stevemorse.org/phoneticinfo.htm

    :param str word: the word to transform
    :param str language_arg: the language of the term; supported values
        include:

            - 'any'
            - 'arabic'
            - 'cyrillic'
            - 'czech'
            - 'dutch'
            - 'english'
            - 'french'
            - 'german'
            - 'greek'
            - 'greeklatin'
            - 'hebrew'
            - 'hungarian'
            - 'italian'
            - 'polish'
            - 'portuguese'
            - 'romanian'
            - 'russian'
            - 'spanish'
            - 'turkish'
            - 'germandjsg'
            - 'polishdjskp'
            - 'russiandjsre'

    :param str name_mode: the name mode of the algorithm:

            - 'gen' -- general (default)
            - 'ash' -- Ashkenazi
            - 'sep' -- Sephardic

    :param str match_mode: matching mode: 'approx' or 'exact'
    :param bool concat: concatenation mode
    :param bool filter_langs: filter out incompatible languages
    :returns: the BMPM value(s), separated by spaces
    :rtype: str

    >>> bmpm('Christopher')
    'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
    xristYfir xristopi xritopir xritopi xristofi xritofir xritofi tzristopir
    tzristofir zristopir zristopi zritopir zritopi zristofir zristofi zritofir
    zritofi'
    >>> bmpm('Niall')
    'nial niol'
    >>> bmpm('Smith')
    'zmit'
    >>> bmpm('Schmidt')
    'zmit stzmit'

    >>> bmpm('Christopher', language_arg='German')
    'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
    xristYfir'
    >>> bmpm('Christopher', language_arg='English')
    'tzristofir tzrQstofir tzristafir tzrQstafir xristofir xrQstofir xristafir
    xrQstafir'
    >>> bmpm('Christopher', language_arg='German', name_mode='ash')
    'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
    xristYfir'

    >>> bmpm('Christopher', language_arg='German', match_mode='exact')
    'xriStopher xriStofer xristopher xristofer'
    """
    return _bmpm(word, language_arg, name_mode, match_mode,
                 concat, filter_langs)


if __name__ == '__main__':
    import doctest
    doctest.testmod()