Test Failed
Push to master (1b1e60...2d4fd8) by Chris, created at 11:32

abydos.phonetic   Rating: F

Complexity

Total Complexity 833

Size/Duplication

Total Lines 5224
Duplicated Lines 1.68 %

Importance

Changes 0
Metric Value
eloc 3492
dl 88
loc 5224
rs 0.8
c 0
b 0
f 0
wmc 833

33 Functions

Rating   Name   Duplication   Size   Complexity  
F haase_phonetik() 44 112 31
B reth_schek_phonetik() 0 73 8
B eudex() 0 166 4
A onca() 0 34 1
C roger_root() 0 87 10
A refined_soundex() 0 50 4
A koelner_phonetik_alpha() 0 17 1
F nysiis() 0 168 68
F fuzzy_soundex() 0 103 15
F alpha_sis() 0 105 14
F spfc() 0 146 21
C soundex() 0 122 11
A russell_index_num_to_alpha() 0 24 2
F metaphone() 0 170 80
F double_metaphone() 0 720 220
F dm_soundex() 0 173 13
A koelner_phonetik_num_to_alpha() 0 19 1
A russell_index() 0 44 3
A statistics_canada() 0 48 3
F caverphone() 0 149 32
F phonet() 0 1570 154
B fonem() 0 120 3
F phonex() 0 107 33
A russell_index_alpha() 0 22 2
F sfinxbis() 0 173 33
A lein() 0 48 3
F koelner_phonetik() 44 110 27
A _delete_consecutive_repeats() 0 9 1
F phonix() 0 209 24
A mra() 0 30 3
A phonem() 0 45 2
A bmpm() 0 75 1
B parmar_kumbharana() 0 41 5

How to fix

Duplicated Code

Duplicate code is one of the most pungent code smells. A commonly used rule of thumb is to restructure code once it is duplicated in three or more places.

Common duplication problems, and their corresponding solutions, generally come down to the same move: extract the repeated logic into a single shared helper that every call site uses, as sketched below.
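
For instance, the table above reports 44 duplicated lines in both koelner_phonetik() and haase_phonetik(). The sketch below shows the usual shape of the fix, assuming the duplicated span is the letter-classification loop flagged at line 583; the helper name _koelner_code and its signature are hypothetical and not part of abydos.

def _koelner_code(word, before, after):
    """Translate an uppercased, pre-filtered word into Kölner Phonetik digits.

    Hypothetical shared helper: before(word, i, letters) and
    after(word, i, letters) are the same context predicates the callers
    already define locally (after is needed by the elided C and X branches).
    """
    _vowels = {'A', 'E', 'I', 'J', 'O', 'U', 'Y'}
    sdx = ''
    for i, char in enumerate(word):
        if char in _vowels:
            sdx += '0'
        elif char == 'B':
            sdx += '1'
        elif char == 'P':
            sdx += '3' if before(word, i, {'H'}) else '1'
        elif char in {'D', 'T'}:
            sdx += '8' if before(word, i, {'C', 'S', 'Z'}) else '2'
        # ... the remaining branches stay exactly as in the duplicated loop ...
        elif char in {'S', 'Z'}:
            sdx += '8'
    return sdx

Each caller would then keep only its own normalization and post-processing, which removes the duplicated span the report flags.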

Complexity

 Tip: Before tackling complexity, make sure to eliminate any duplication first. This can often reduce the size of a class or module significantly.

Complex modules like abydos.phonetic often do a lot of different things. To break such a module down, we need to identify a cohesive component within it. A common approach to finding such a component is to look for functions and fields that share the same prefixes or suffixes.

Once you have determined the members that belong together, you can apply the Extract Class refactoring. If the component makes sense as a subclass, Extract Subclass is also a candidate, and is often faster.
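
For instance, several groups of functions in the table above share a prefix (russell_index*, koelner_phonetik*, the various *_soundex coders). A minimal, purely illustrative sketch of the Extract Class direction for one such group follows; the RussellIndex class is hypothetical and merely delegates to the existing module-level functions shown later in this file.

class RussellIndex:
    """Hypothetical Extract Class sketch: one cohesive home for the
    russell_index* functions listed above. Not part of abydos; each
    method simply delegates to the module-level function defined
    further down in this file."""

    def encode(self, word):
        """Return the numeric Russell Index of word."""
        return russell_index(word)

    def encode_alpha(self, word):
        """Return the alphabetic Russell Index of word."""
        return russell_index_alpha(word)

    @staticmethod
    def num_to_alpha(num):
        """Convert a numeric Russell Index to its alphabetic form."""
        return russell_index_num_to_alpha(num)

# Expected behaviour, taken from the doctests later in this module:
# RussellIndex().encode('Christopher')       -> 3813428
# RussellIndex().encode_alpha('Christopher') -> 'CRACDBR'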

1
# -*- coding: utf-8 -*-
Issue (coding-style): Too many lines in module (5223/1000)
2
3
# Copyright 2014-2018 by Christopher C. Little.
4
# This file is part of Abydos.
5
#
6
# Abydos is free software: you can redistribute it and/or modify
7
# it under the terms of the GNU General Public License as published by
8
# the Free Software Foundation, either version 3 of the License, or
9
# (at your option) any later version.
10
#
11
# Abydos is distributed in the hope that it will be useful,
12
# but WITHOUT ANY WARRANTY; without even the implied warranty of
13
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
# GNU General Public License for more details.
15
#
16
# You should have received a copy of the GNU General Public License
17
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.
18
19
"""abydos.phonetic.
20
21
The phonetic module implements phonetic algorithms including:
22
23
    - Robert C. Russell's Index
24
    - American Soundex
25
    - Refined Soundex
26
    - Daitch-Mokotoff Soundex
27
    - Kölner Phonetik
28
    - NYSIIS
29
    - Match Rating Algorithm
30
    - Metaphone
31
    - Double Metaphone
32
    - Caverphone
33
    - Alpha Search Inquiry System
34
    - Fuzzy Soundex
35
    - Phonex
36
    - Phonem
37
    - Phonix
38
    - SfinxBis
39
    - phonet
40
    - Standardized Phonetic Frequency Code
41
    - Statistics Canada
42
    - Lein
43
    - Roger Root
44
    - Oxford Name Compression Algorithm (ONCA)
45
    - Eudex phonetic hash
46
    - Haase Phonetik
47
    - Reth-Schek Phonetik
48
    - Beider-Morse Phonetic Matching
49
"""
50
51
from __future__ import division, unicode_literals
52
53
import re
54
import unicodedata
55
from collections import Counter
56
from itertools import groupby
57
58
from six import text_type
59
from six.moves import range
60
61
from ._bm import _bmpm
62
63
_INFINITY = float('inf')
64
65
66
def _delete_consecutive_repeats(word):
67
    """Delete consecutive repeated characters in a word.
68
69
    :param str word: the word to transform
70
    :returns: word with consecutive repeating characters collapsed to
71
        a single instance
72
    :rtype: str
73
    """
74
    return ''.join(char for char, _ in groupby(word))
75
76
77
def russell_index(word):
78
    """Return the Russell Index (integer output) of a word.
79
80
    This follows Robert C. Russell's Index algorithm, as described in
81
    US Patent 1,261,167 (1917)
82
83
    :param str word: the word to transform
84
    :returns: the Russell Index value
85
    :rtype: int
86
87
    >>> russell_index('Christopher')
88
    3813428
89
    >>> russell_index('Niall')
90
    715
91
    >>> russell_index('Smith')
92
    3614
93
    >>> russell_index('Schmidt')
94
    3614
95
    """
96
    _russell_translation = dict(zip((ord(_) for _ in
Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
97
                                     'ABCDEFGIKLMNOPQRSTUVXYZ'),
98
                                    '12341231356712383412313'))
99
100
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
101
    word = word.replace('ß', 'SS')
102
    word = word.replace('GH', '')  # discard gh (rule 3)
103
    word = word.rstrip('SZ')  # discard /[sz]$/ (rule 3)
104
105
    # translate according to Russell's mapping
106
    word = ''.join(c for c in word if c in
107
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'I', 'K', 'L', 'M', 'N',
108
                    'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Y', 'Z'})
109
    sdx = word.translate(_russell_translation)
110
111
    # remove any 1s after the first occurrence
112
    one = sdx.find('1')+1
113
    if one:
114
        sdx = sdx[:one] + ''.join(c for c in sdx[one:] if c != '1')
115
116
    # remove repeating characters
117
    sdx = _delete_consecutive_repeats(sdx)
118
119
    # return as an int
120
    return int(sdx) if sdx else float('NaN')
121
122
123
def russell_index_num_to_alpha(num):
124
    """Convert the Russell Index integer to an alphabetic string.
125
126
    This follows Robert C. Russell's Index algorithm, as described in
127
    US Patent 1,261,167 (1917)
128
129
    :param int num: a Russell Index integer value
130
    :returns: the Russell Index as an alphabetic string
131
    :rtype: str
132
133
    >>> russell_index_num_to_alpha(3813428)
134
    'CRACDBR'
135
    >>> russell_index_num_to_alpha(715)
136
    'NAL'
137
    >>> russell_index_num_to_alpha(3614)
138
    'CMAD'
139
    """
140
    _russell_num_translation = dict(zip((ord(_) for _ in '12345678'),
Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
141
                                        'ABCDLMNR'))
142
    num = ''.join(c for c in text_type(num) if c in {'1', '2', '3', '4', '5',
143
                                                     '6', '7', '8'})
144
    if num:
145
        return num.translate(_russell_num_translation)
146
    return ''
147
148
149
def russell_index_alpha(word):
150
    """Return the Russell Index (alphabetic output) for the word.
151
152
    This follows Robert C. Russell's Index algorithm, as described in
153
    US Patent 1,261,167 (1917)
154
155
    :param str word: the word to transform
156
    :returns: the Russell Index value as an alphabetic string
157
    :rtype: str
158
159
    >>> russell_index_alpha('Christopher')
160
    'CRACDBR'
161
    >>> russell_index_alpha('Niall')
162
    'NAL'
163
    >>> russell_index_alpha('Smith')
164
    'CMAD'
165
    >>> russell_index_alpha('Schmidt')
166
    'CMAD'
167
    """
168
    if word:
169
        return russell_index_num_to_alpha(russell_index(word))
170
    return ''
171
172
173
def soundex(word, maxlength=4, var='American', reverse=False, zero_pad=True):
174
    """Return the Soundex code for a word.
175
176
    :param str word: the word to transform
177
    :param int maxlength: the length of the code returned (defaults to 4)
178
    :param str var: the variant of the algorithm to employ (defaults to
179
        'American'):
180
181
        - 'American' follows the American Soundex algorithm, as described at
182
          http://www.archives.gov/publications/general-info-leaflets/55-census.html
183
          and in Knuth(1998:394); this is also called Miracode
184
        - 'special' follows the rules from the 1880-1910 US Census
185
          retrospective re-analysis, in which h & w are not treated as blocking
186
          consonants but as vowels.
187
          Cf. http://creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm
188
        - 'dm' computes the Daitch-Mokotoff Soundex
189
        - 'German' applies German rules, as shown at
190
           http://www.nausa.uni-oldenburg.de/soundex.htm
191
192
    :param bool reverse: reverse the word before computing the selected Soundex
193
        (defaults to False); This results in "Reverse Soundex"
194
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
195
        maxlength string
196
    :returns: the Soundex value
197
    :rtype: str
198
199
    >>> soundex("Christopher")
200
    'C623'
201
    >>> soundex("Niall")
202
    'N400'
203
    >>> soundex('Smith')
204
    'S530'
205
    >>> soundex('Schmidt')
206
    'S530'
207
208
209
    >>> soundex('Christopher', maxlength=_INFINITY)
210
    'C623160000000000000000000000000000000000000000000000000000000000'
211
    >>> soundex('Christopher', maxlength=_INFINITY, zero_pad=False)
212
    'C62316'
213
214
    >>> soundex('Christopher', reverse=True)
215
    'R132'
216
217
    >>> soundex('Ashcroft')
218
    'A261'
219
    >>> soundex('Asicroft')
220
    'A226'
221
    >>> soundex('Ashcroft', var='special')
222
    'A226'
223
    >>> soundex('Asicroft', var='special')
224
    'A226'
225
226
    >>> soundex('Christopher', var='dm')
227
    {'494379', '594379'}
228
    >>> soundex('Niall', var='dm')
229
    {'680000'}
230
    >>> soundex('Smith', var='dm')
231
    {'463000'}
232
    >>> soundex('Schmidt', var='dm')
233
    {'463000'}
234
    """
235
    _soundex_translation = dict(zip((ord(_) for _ in
Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
236
                                     'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
237
                                    '01230129022455012623019202'))
238
239
    # Call the D-M Soundex function itself if requested
240
    if var == 'dm':
Issue (unused-code): Unnecessary "elif" after "return"
241
        return dm_soundex(word, maxlength, reverse, zero_pad)
242
    elif var == 'refined':
243
        return refined_soundex(word, maxlength, reverse, zero_pad)
244
    elif var == 'German':
245
        _soundex_translation['H'] = 0
246
        _soundex_translation['W'] = 0
247
        # Although http://www.nausa.uni-oldenburg.de/soundex.htm
248
        # also indicates that umlauted vowels should be decomposed
249
        # to VE, this has no effect below since the umlauts are
250
        # disposed of and all vowels become 0s.
251
252
    # Require a maxlength of at least 4 and not more than 64
253
    if maxlength is not None:
254
        maxlength = min(max(4, maxlength), 64)
255
    else:
256
        maxlength = 64
257
258
    # uppercase, normalize, decompose, and filter non-A-Z out
259
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
260
    word = word.replace('ß', 'SS')
261
    word = ''.join(c for c in word if c in
262
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
263
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
264
                    'Y', 'Z'})
265
266
    # Nothing to convert, return base case
267
    if not word:
268
        if zero_pad:
269
            return '0'*maxlength
270
        return '0'
271
272
    # Reverse word if computing Reverse Soundex
273
    if reverse:
274
        word = word[::-1]
275
276
    # apply the Soundex algorithm
277
    sdx = word.translate(_soundex_translation)
278
279
    if var == 'special':
280
        sdx = sdx.replace('9', '0')  # special rule for 1880-1910 census
281
    else:
282
        sdx = sdx.replace('9', '')  # rule 1
283
    sdx = _delete_consecutive_repeats(sdx)  # rule 3
284
285
    if word[0] in 'HW':
286
        sdx = word[0] + sdx
287
    else:
288
        sdx = word[0] + sdx[1:]
289
    sdx = sdx.replace('0', '')  # rule 1
290
291
    if zero_pad:
292
        sdx += ('0'*maxlength)  # rule 4
293
294
    return sdx[:maxlength]
295
296
297
def refined_soundex(word, maxlength=_INFINITY, reverse=False, zero_pad=False):
Issue (Unused Code): The argument zero_pad seems to be unused.
298
    """Return the Refined Soundex code for a word.
299
300
    This is Soundex, but with more character classes. It appears to have been
301
    defined by the Apache Commons:
302
    https://commons.apache.org/proper/commons-codec/apidocs/src-html/org/apache/commons/codec/language/RefinedSoundex.html
303
304
    :param word: the word to transform
305
    :param maxlength: the length of the code returned (defaults to unlimited)
306
    :param reverse: reverse the word before computing the selected Soundex
307
        (defaults to False); This results in "Reverse Soundex"
308
    :param zero_pad: pad the end of the return value with 0s to achieve a
309
        maxlength string
310
    :returns: the Refined Soundex value
311
    :rtype: str
312
313
    >>> refined_soundex('Christopher')
314
    'C3090360109'
315
    >>> refined_soundex('Niall')
316
    'N807'
317
    >>> refined_soundex('Smith')
318
    'S38060'
319
    >>> refined_soundex('Schmidt')
320
    'S30806'
321
    """
322
    _ref_soundex_translation = dict(zip((ord(_) for _ in
Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
323
                                         'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
324
                                        '01360240043788015936020505'))
325
326
    # uppercase, normalize, decompose, and filter non-A-Z out
327
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
328
    word = word.replace('ß', 'SS')
329
    word = ''.join(c for c in word if c in
330
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
331
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
332
                    'Y', 'Z'})
333
334
    # Reverse word if computing Reverse Soundex
335
    if reverse:
336
        word = word[::-1]
337
338
    # apply the Soundex algorithm
339
    sdx = word[0] + word.translate(_ref_soundex_translation)
340
    sdx = _delete_consecutive_repeats(sdx)
341
342
    if maxlength and maxlength < _INFINITY:
343
        sdx = sdx[:maxlength]
344
        sdx += ('0' * maxlength)  # rule 4
345
346
    return sdx
347
348
349
def dm_soundex(word, maxlength=6, reverse=False, zero_pad=True):
350
    """Return the Daitch-Mokotoff Soundex code for a word.
351
352
    Returns values of a word as a set. A collection is necessary since there
353
    can be multiple values for a single word.
354
355
    :param word: the word to transform
356
    :param maxlength: the length of the code returned (defaults to 6)
357
    :param reverse: reverse the word before computing the selected Soundex
358
        (defaults to False); This results in "Reverse Soundex"
359
    :param zero_pad: pad the end of the return value with 0s to achieve a
360
        maxlength string
361
    :returns: the Daitch-Mokotoff Soundex value
362
    :rtype: str
363
364
    >>> dm_soundex('Christopher')
365
    {'494379', '594379'}
366
    >>> dm_soundex('Niall')
367
    {'680000'}
368
    >>> dm_soundex('Smith')
369
    {'463000'}
370
    >>> dm_soundex('Schmidt')
371
    {'463000'}
372
373
    >>> dm_soundex('The quick brown fox', maxlength=20, zero_pad=False)
374
    {'35457976754', '3557976754'}
375
    """
376
    _dms_table = {'STCH': (2, 4, 4), 'DRZ': (4, 4, 4), 'ZH': (4, 4, 4),
377
                  'ZHDZH': (2, 4, 4), 'DZH': (4, 4, 4), 'DRS': (4, 4, 4),
378
                  'DZS': (4, 4, 4), 'SCHTCH': (2, 4, 4), 'SHTSH': (2, 4, 4),
379
                  'SZCZ': (2, 4, 4), 'TZS': (4, 4, 4), 'SZCS': (2, 4, 4),
380
                  'STSH': (2, 4, 4), 'SHCH': (2, 4, 4), 'D': (3, 3, 3),
381
                  'H': (5, 5, '_'), 'TTSCH': (4, 4, 4), 'THS': (4, 4, 4),
382
                  'L': (8, 8, 8), 'P': (7, 7, 7), 'CHS': (5, 54, 54),
383
                  'T': (3, 3, 3), 'X': (5, 54, 54), 'OJ': (0, 1, '_'),
384
                  'OI': (0, 1, '_'), 'SCHTSH': (2, 4, 4), 'OY': (0, 1, '_'),
385
                  'Y': (1, '_', '_'), 'TSH': (4, 4, 4), 'ZDZ': (2, 4, 4),
386
                  'TSZ': (4, 4, 4), 'SHT': (2, 43, 43), 'SCHTSCH': (2, 4, 4),
387
                  'TTSZ': (4, 4, 4), 'TTZ': (4, 4, 4), 'SCH': (4, 4, 4),
388
                  'TTS': (4, 4, 4), 'SZD': (2, 43, 43), 'AI': (0, 1, '_'),
389
                  'PF': (7, 7, 7), 'TCH': (4, 4, 4), 'PH': (7, 7, 7),
390
                  'TTCH': (4, 4, 4), 'SZT': (2, 43, 43), 'ZDZH': (2, 4, 4),
391
                  'EI': (0, 1, '_'), 'G': (5, 5, 5), 'EJ': (0, 1, '_'),
392
                  'ZD': (2, 43, 43), 'IU': (1, '_', '_'), 'K': (5, 5, 5),
393
                  'O': (0, '_', '_'), 'SHTCH': (2, 4, 4), 'S': (4, 4, 4),
394
                  'TRZ': (4, 4, 4), 'SHD': (2, 43, 43), 'DSH': (4, 4, 4),
395
                  'CSZ': (4, 4, 4), 'EU': (1, 1, '_'), 'TRS': (4, 4, 4),
396
                  'ZS': (4, 4, 4), 'STRZ': (2, 4, 4), 'UY': (0, 1, '_'),
397
                  'STRS': (2, 4, 4), 'CZS': (4, 4, 4),
398
                  'MN': ('6_6', '6_6', '6_6'), 'UI': (0, 1, '_'),
399
                  'UJ': (0, 1, '_'), 'UE': (0, '_', '_'), 'EY': (0, 1, '_'),
400
                  'W': (7, 7, 7), 'IA': (1, '_', '_'), 'FB': (7, 7, 7),
401
                  'STSCH': (2, 4, 4), 'SCHT': (2, 43, 43),
402
                  'NM': ('6_6', '6_6', '6_6'), 'SCHD': (2, 43, 43),
403
                  'B': (7, 7, 7), 'DSZ': (4, 4, 4), 'F': (7, 7, 7),
404
                  'N': (6, 6, 6), 'CZ': (4, 4, 4), 'R': (9, 9, 9),
405
                  'U': (0, '_', '_'), 'V': (7, 7, 7), 'CS': (4, 4, 4),
406
                  'Z': (4, 4, 4), 'SZ': (4, 4, 4), 'TSCH': (4, 4, 4),
407
                  'KH': (5, 5, 5), 'ST': (2, 43, 43), 'KS': (5, 54, 54),
408
                  'SH': (4, 4, 4), 'SC': (2, 4, 4), 'SD': (2, 43, 43),
409
                  'DZ': (4, 4, 4), 'ZHD': (2, 43, 43), 'DT': (3, 3, 3),
410
                  'ZSH': (4, 4, 4), 'DS': (4, 4, 4), 'TZ': (4, 4, 4),
411
                  'TS': (4, 4, 4), 'TH': (3, 3, 3), 'TC': (4, 4, 4),
412
                  'A': (0, '_', '_'), 'E': (0, '_', '_'), 'I': (0, '_', '_'),
413
                  'AJ': (0, 1, '_'), 'M': (6, 6, 6), 'Q': (5, 5, 5),
414
                  'AU': (0, 7, '_'), 'IO': (1, '_', '_'), 'AY': (0, 1, '_'),
415
                  'IE': (1, '_', '_'), 'ZSCH': (4, 4, 4),
416
                  'CH': ((5, 4), (5, 4), (5, 4)),
417
                  'CK': ((5, 45), (5, 45), (5, 45)),
418
                  'C': ((5, 4), (5, 4), (5, 4)),
419
                  'J': ((1, 4), ('_', 4), ('_', 4)),
420
                  'RZ': ((94, 4), (94, 4), (94, 4)),
421
                  'RS': ((94, 4), (94, 4), (94, 4))}
422
423
    _dms_order = {'A': ('AI', 'AJ', 'AU', 'AY', 'A'),
424
                  'B': ('B'),
425
                  'C': ('CHS', 'CSZ', 'CZS', 'CH', 'CK', 'CS', 'CZ', 'C'),
426
                  'D': ('DRS', 'DRZ', 'DSH', 'DSZ', 'DZH', 'DZS', 'DS', 'DT',
427
                        'DZ', 'D'),
428
                  'E': ('EI', 'EJ', 'EU', 'EY', 'E'),
429
                  'F': ('FB', 'F'),
430
                  'G': ('G'),
431
                  'H': ('H'),
432
                  'I': ('IA', 'IE', 'IO', 'IU', 'I'),
433
                  'J': ('J'),
434
                  'K': ('KH', 'KS', 'K'),
435
                  'L': ('L'),
436
                  'M': ('MN', 'M'),
437
                  'N': ('NM', 'N'),
438
                  'O': ('OI', 'OJ', 'OY', 'O'),
439
                  'P': ('PF', 'PH', 'P'),
440
                  'Q': ('Q'),
441
                  'R': ('RS', 'RZ', 'R'),
442
                  'S': ('SCHTSCH', 'SCHTCH', 'SCHTSH', 'SHTCH', 'SHTSH',
443
                        'STSCH', 'SCHD', 'SCHT', 'SHCH', 'STCH', 'STRS',
444
                        'STRZ', 'STSH', 'SZCS', 'SZCZ', 'SCH', 'SHD', 'SHT',
445
                        'SZD', 'SZT', 'SC', 'SD', 'SH', 'ST', 'SZ', 'S'),
446
                  'T': ('TTSCH', 'TSCH', 'TTCH', 'TTSZ', 'TCH', 'THS', 'TRS',
447
                        'TRZ', 'TSH', 'TSZ', 'TTS', 'TTZ', 'TZS', 'TC', 'TH',
448
                        'TS', 'TZ', 'T'),
449
                  'U': ('UE', 'UI', 'UJ', 'UY', 'U'),
450
                  'V': ('V'),
451
                  'W': ('W'),
452
                  'X': ('X'),
453
                  'Y': ('Y'),
454
                  'Z': ('ZHDZH', 'ZDZH', 'ZSCH', 'ZDZ', 'ZHD', 'ZSH', 'ZD',
455
                        'ZH', 'ZS', 'Z')}
456
457
    _vowels = {'A', 'E', 'I', 'J', 'O', 'U', 'Y'}
458
    dms = ['']  # initialize empty code list
459
460
    # Require a maxlength of at least 6 and not more than 64
461
    if maxlength is not None:
462
        maxlength = min(max(6, maxlength), 64)
463
    else:
464
        maxlength = 64
465
466
    # uppercase, normalize, decompose, and filter non-A-Z
467
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
468
    word = word.replace('ß', 'SS')
469
    word = ''.join(c for c in word if c in
470
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
471
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
472
                    'Y', 'Z'})
473
474
    # Nothing to convert, return base case
475
    if not word:
476
        if zero_pad:
477
            return {'0'*maxlength}
478
        return {'0'}
479
480
    # Reverse word if computing Reverse Soundex
481
    if reverse:
482
        word = word[::-1]
483
484
    pos = 0
485
    while pos < len(word):
486
        # Iterate through _dms_order, which specifies the possible substrings
487
        # for which codes exist in the Daitch-Mokotoff coding
488
        for sstr in _dms_order[word[pos]]:  # pragma: no branch
489
            if word[pos:].startswith(sstr):
490
                # Having determined a valid substring start, retrieve the code
491
                dm_val = _dms_table[sstr]
492
493
                # Having retried the code (triple), determine the correct
494
                # positional variant (first, pre-vocalic, elsewhere)
495
                if pos == 0:
496
                    dm_val = dm_val[0]
497
                elif (pos+len(sstr) < len(word) and
498
                      word[pos+len(sstr)] in _vowels):
499
                    dm_val = dm_val[1]
500
                else:
501
                    dm_val = dm_val[2]
502
503
                # Build the code strings
504
                if isinstance(dm_val, tuple):
505
                    dms = [_ + text_type(dm_val[0]) for _ in dms] \
506
                            + [_ + text_type(dm_val[1]) for _ in dms]
507
                else:
508
                    dms = [_ + text_type(dm_val) for _ in dms]
509
                pos += len(sstr)
510
                break
511
512
    # Filter out double letters and _ placeholders
513
    dms = (''.join(c for c in _delete_consecutive_repeats(_) if c != '_')
Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
514
           for _ in dms)
515
516
    # Trim codes and return set
517
    if zero_pad:
518
        dms = ((_ + ('0'*maxlength))[:maxlength] for _ in dms)
519
    else:
520
        dms = (_[:maxlength] for _ in dms)
521
    return set(dms)
522
523
524
def koelner_phonetik(word):
525
    """Return the Kölner Phonetik (numeric output) code for a word.
526
527
    Based on the algorithm described at
528
    https://de.wikipedia.org/wiki/Kölner_Phonetik
529
530
    While the output code is numeric, it is still a str because 0s can lead
531
    the code.
532
533
    :param str word: the word to transform
534
    :returns: the Kölner Phonetik value as a numeric string
535
    :rtype: str
536
537
    >>> koelner_phonetik('Christopher')
538
    '478237'
539
    >>> koelner_phonetik('Niall')
540
    '65'
541
    >>> koelner_phonetik('Smith')
542
    '862'
543
    >>> koelner_phonetik('Schmidt')
544
    '862'
545
    >>> koelner_phonetik('Müller')
546
    '657'
547
    >>> koelner_phonetik('Zimmermann')
548
    '86766'
549
    """
550
    # pylint: disable=too-many-branches
551
    def _after(word, i, letters):
552
        """Return True if word[i] follows one of the supplied letters."""
553
        if i > 0 and word[i-1] in letters:
554
            return True
555
        return False
556
557
    def _before(word, i, letters):
558
        """Return True if word[i] precedes one of the supplied letters."""
559
        if i+1 < len(word) and word[i+1] in letters:
560
            return True
561
        return False
562
563
    _vowels = {'A', 'E', 'I', 'J', 'O', 'U', 'Y'}
564
565
    sdx = ''
566
567
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
568
    word = word.replace('ß', 'SS')
569
570
    word = word.replace('Ä', 'AE')
571
    word = word.replace('Ö', 'OE')
572
    word = word.replace('Ü', 'UE')
573
    word = ''.join(c for c in word if c in
574
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
575
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
576
                    'Y', 'Z'})
577
578
    # Nothing to convert, return base case
579
    if not word:
580
        return sdx
581
582
    for i in range(len(word)):
Issue (unused-code): Consider using enumerate instead of iterating with range and len
583
        if word[i] in _vowels:
Issue (Duplication): This code seems to be duplicated in your project.
584
            sdx += '0'
585
        elif word[i] == 'B':
586
            sdx += '1'
587
        elif word[i] == 'P':
588
            if _before(word, i, {'H'}):
589
                sdx += '3'
590
            else:
591
                sdx += '1'
592
        elif word[i] in {'D', 'T'}:
593
            if _before(word, i, {'C', 'S', 'Z'}):
594
                sdx += '8'
595
            else:
596
                sdx += '2'
597
        elif word[i] in {'F', 'V', 'W'}:
598
            sdx += '3'
599
        elif word[i] in {'G', 'K', 'Q'}:
600
            sdx += '4'
601
        elif word[i] == 'C':
602
            if _after(word, i, {'S', 'Z'}):
603
                sdx += '8'
604
            elif i == 0:
605
                if _before(word, i, {'A', 'H', 'K', 'L', 'O', 'Q', 'R', 'U',
606
                                     'X'}):
607
                    sdx += '4'
608
                else:
609
                    sdx += '8'
610
            elif _before(word, i, {'A', 'H', 'K', 'O', 'Q', 'U', 'X'}):
611
                sdx += '4'
612
            else:
613
                sdx += '8'
614
        elif word[i] == 'X':
615
            if _after(word, i, {'C', 'K', 'Q'}):
616
                sdx += '8'
617
            else:
618
                sdx += '48'
619
        elif word[i] == 'L':
620
            sdx += '5'
621
        elif word[i] in {'M', 'N'}:
622
            sdx += '6'
623
        elif word[i] == 'R':
624
            sdx += '7'
625
        elif word[i] in {'S', 'Z'}:
626
            sdx += '8'
627
628
    sdx = _delete_consecutive_repeats(sdx)
629
630
    if sdx:
631
        sdx = sdx[0] + sdx[1:].replace('0', '')
632
633
    return sdx
634
635
636
def koelner_phonetik_num_to_alpha(num):
637
    """Convert a Kölner Phonetik code from numeric to alphabetic.
638
639
    :param str num: a numeric Kölner Phonetik representation
640
    :returns: an alphabetic representation of the same word
641
    :rtype: str
642
643
    >>> koelner_phonetik_num_to_alpha(862)
644
    'SNT'
645
    >>> koelner_phonetik_num_to_alpha(657)
646
    'NLR'
647
    >>> koelner_phonetik_num_to_alpha(86766)
648
    'SNRNN'
649
    """
650
    _koelner_num_translation = dict(zip((ord(_) for _ in '012345678'),
Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
651
                                        'APTFKLNRS'))
652
    num = ''.join(c for c in text_type(num) if c in {'0', '1', '2', '3', '4',
653
                                                     '5', '6', '7', '8'})
654
    return num.translate(_koelner_num_translation)
655
656
657
def koelner_phonetik_alpha(word):
658
    """Return the Kölner Phonetik (alphabetic output) code for a word.
659
660
    :param str word: the word to transform
661
    :returns: the Kölner Phonetik value as an alphabetic string
662
    :rtype: str
663
664
    >>> koelner_phonetik_alpha('Smith')
665
    'SNT'
666
    >>> koelner_phonetik_alpha('Schmidt')
667
    'SNT'
668
    >>> koelner_phonetik_alpha('Müller')
669
    'NLR'
670
    >>> koelner_phonetik_alpha('Zimmermann')
671
    'SNRNN'
672
    """
673
    return koelner_phonetik_num_to_alpha(koelner_phonetik(word))
674
675
676
def nysiis(word, maxlength=6, modified=False):
677
    """Return the NYSIIS code for a word.
678
679
    A description of the New York State Identification and Intelligence System
680
    algorithm can be found at
681
    https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System
682
683
    The modified version of this algorithm is described in Appendix B of
684
    Lynch, Billy T. and William L. Arends. `Selection of a Surname Coding
685
    Procedure for the SRS Record Linkage System.` Statistical Reporting
686
    Service, U.S. Department of Agriculture, Washington, D.C. February 1977.
687
    https://naldc.nal.usda.gov/download/27833/PDF
688
689
    :param str word: the word to transform
690
    :param int maxlength: the maximum length (default 6) of the code to return
691
    :param bool modified: indicates whether to use USDA modified NYSIIS
692
    :returns: the NYSIIS value
693
    :rtype: str
694
695
    >>> nysiis('Christopher')
696
    'CRASTA'
697
    >>> nysiis('Niall')
698
    'NAL'
699
    >>> nysiis('Smith')
700
    'SNAT'
701
    >>> nysiis('Schmidt')
702
    'SNAD'
703
704
    >>> nysiis('Christopher', maxlength=_INFINITY)
705
    'CRASTAFAR'
706
707
    >>> nysiis('Christopher', maxlength=8, modified=True)
708
    'CRASTAFA'
709
    >>> nysiis('Niall', maxlength=8, modified=True)
710
    'NAL'
711
    >>> nysiis('Smith', maxlength=8, modified=True)
712
    'SNAT'
713
    >>> nysiis('Schmidt', maxlength=8, modified=True)
714
    'SNAD'
715
    """
716
    # Require a maxlength of at least 6
717
    if maxlength:
718
        maxlength = max(6, maxlength)
719
720
    _vowels = {'A', 'E', 'I', 'O', 'U'}
721
722
    word = ''.join(c for c in word.upper() if c.isalpha())
723
    word = word.replace('ß', 'SS')
724
725
    # exit early if there are no alphas
726
    if not word:
727
        return ''
728
729
    if modified:
730
        original_first_char = word[0]
731
732
    if word[:3] == 'MAC':
733
        word = 'MCC'+word[3:]
734
    elif word[:2] == 'KN':
735
        word = 'NN'+word[2:]
736
    elif word[:1] == 'K':
737
        word = 'C'+word[1:]
738
    elif word[:2] in {'PH', 'PF'}:
739
        word = 'FF'+word[2:]
740
    elif word[:3] == 'SCH':
741
        word = 'SSS'+word[3:]
742
    elif modified:
743
        if word[:2] == 'WR':
744
            word = 'RR'+word[2:]
745
        elif word[:2] == 'RH':
746
            word = 'RR'+word[2:]
747
        elif word[:2] == 'DG':
748
            word = 'GG'+word[2:]
749
        elif word[:1] in _vowels:
750
            word = 'A'+word[1:]
751
752
    if modified and word[-1] in {'S', 'Z'}:
753
        word = word[:-1]
754
755
    if word[-2:] == 'EE' or word[-2:] == 'IE' or (modified and
756
                                                  word[-2:] == 'YE'):
757
        word = word[:-2]+'Y'
758
    elif word[-2:] in {'DT', 'RT', 'RD'}:
759
        word = word[:-2]+'D'
760
    elif word[-2:] in {'NT', 'ND'}:
761
        word = word[:-2]+('N' if modified else 'D')
762
    elif modified:
763
        if word[-2:] == 'IX':
764
            word = word[:-2]+'ICK'
765
        elif word[-2:] == 'EX':
766
            word = word[:-2]+'ECK'
767
        elif word[-2:] in {'JR', 'SR'}:
768
            return 'ERROR'  # TODO: decide how best to return an error
Issue (Coding Style): TODO and FIXME comments should generally be avoided.
769
770
    key = word[0]
771
772
    skip = 0
773
    for i in range(1, len(word)):
774
        if i >= len(word):
775
            continue
776
        elif skip:
777
            skip -= 1
778
            continue
779
        elif word[i:i+2] == 'EV':
780
            word = word[:i] + 'AF' + word[i+2:]
781
            skip = 1
782
        elif word[i] in _vowels:
783
            word = word[:i] + 'A' + word[i+1:]
784
        elif modified and i != len(word)-1 and word[i] == 'Y':
785
            word = word[:i] + 'A' + word[i+1:]
786
        elif word[i] == 'Q':
787
            word = word[:i] + 'G' + word[i+1:]
788
        elif word[i] == 'Z':
789
            word = word[:i] + 'S' + word[i+1:]
790
        elif word[i] == 'M':
791
            word = word[:i] + 'N' + word[i+1:]
792
        elif word[i:i+2] == 'KN':
793
            word = word[:i] + 'N' + word[i+2:]
794
        elif word[i] == 'K':
795
            word = word[:i] + 'C' + word[i+1:]
796
        elif modified and i == len(word)-3 and word[i:i+3] == 'SCH':
797
            word = word[:i] + 'SSA'
798
            skip = 2
799
        elif word[i:i+3] == 'SCH':
800
            word = word[:i] + 'SSS' + word[i+3:]
801
            skip = 2
802
        elif modified and i == len(word)-2 and word[i:i+2] == 'SH':
803
            word = word[:i] + 'SA'
804
            skip = 1
805
        elif word[i:i+2] == 'SH':
806
            word = word[:i] + 'SS' + word[i+2:]
807
            skip = 1
808
        elif word[i:i+2] == 'PH':
809
            word = word[:i] + 'FF' + word[i+2:]
810
            skip = 1
811
        elif modified and word[i:i+3] == 'GHT':
812
            word = word[:i] + 'TTT' + word[i+3:]
813
            skip = 2
814
        elif modified and word[i:i+2] == 'DG':
815
            word = word[:i] + 'GG' + word[i+2:]
816
            skip = 1
817
        elif modified and word[i:i+2] == 'WR':
818
            word = word[:i] + 'RR' + word[i+2:]
819
            skip = 1
820
        elif word[i] == 'H' and (word[i-1] not in _vowels or
821
                                 word[i+1:i+2] not in _vowels):
822
            word = word[:i] + word[i-1] + word[i+1:]
823
        elif word[i] == 'W' and word[i-1] in _vowels:
824
            word = word[:i] + word[i-1] + word[i+1:]
825
826
        if word[i:i+skip+1] != key[-1:]:
827
            key += word[i:i+skip+1]
828
829
    key = _delete_consecutive_repeats(key)
830
831
    if key[-1] == 'S':
832
        key = key[:-1]
833
    if key[-2:] == 'AY':
834
        key = key[:-2] + 'Y'
835
    if key[-1:] == 'A':
836
        key = key[:-1]
837
    if modified and key[0] == 'A':
838
        key = original_first_char + key[1:]
Issue: The variable original_first_char does not seem to be defined in case modified on line 729 is False. Are you sure this can never be the case?
839
840
    if maxlength and maxlength < _INFINITY:
841
        key = key[:maxlength]
842
843
    return key
844
845
846
def mra(word):
847
    """Return the MRA personal numeric identifier (PNI) for a word.
848
849
    A description of the Western Airlines Surname Match Rating Algorithm can
850
    be found on page 18 of
851
    https://archive.org/details/accessingindivid00moor
852
853
    :param str word: the word to transform
854
    :returns: the MRA PNI
855
    :rtype: str
856
857
    >>> mra('Christopher')
858
    'CHRPHR'
859
    >>> mra('Niall')
860
    'NL'
861
    >>> mra('Smith')
862
    'SMTH'
863
    >>> mra('Schmidt')
864
    'SCHMDT'
865
    """
866
    if not word:
867
        return word
868
    word = word.upper()
869
    word = word.replace('ß', 'SS')
870
    word = word[0]+''.join(c for c in word[1:] if
871
                           c not in {'A', 'E', 'I', 'O', 'U'})
872
    word = _delete_consecutive_repeats(word)
873
    if len(word) > 6:
874
        word = word[:3]+word[-3:]
875
    return word
876
877
878
def metaphone(word, maxlength=_INFINITY):
879
    """Return the Metaphone code for a word.
880
881
    Based on Lawrence Philips' Pick BASIC code from 1990:
882
    http://aspell.net/metaphone/metaphone.basic
883
    This incorporates some corrections to the above code, particularly
884
    some of those suggested by Michael Kuhn in:
885
    http://aspell.net/metaphone/metaphone-kuhn.txt
886
887
    :param str word: the word to transform
888
    :param int maxlength: the maximum length of the returned Metaphone code
889
        (defaults to unlimited, but in Philips' original implementation
890
        this was 4)
891
    :returns: the Metaphone value
892
    :rtype: str
893
894
895
    >>> metaphone('Christopher')
896
    'KRSTFR'
897
    >>> metaphone('Niall')
898
    'NL'
899
    >>> metaphone('Smith')
900
    'SM0'
901
    >>> metaphone('Schmidt')
902
    'SKMTT'
903
    """
904
    # pylint: disable=too-many-branches
905
    _vowels = {'A', 'E', 'I', 'O', 'U'}
906
    _frontv = {'E', 'I', 'Y'}
907
    _varson = {'C', 'G', 'P', 'S', 'T'}
908
909
    # Require a maxlength of at least 4
910
    if maxlength is not None:
911
        maxlength = max(4, maxlength)
912
    else:
913
        maxlength = 64
914
915
    # As in variable sound--those modified by adding an "h"
916
    ename = ''.join(c for c in word.upper() if c.isalnum())
917
    ename = ename.replace('ß', 'SS')
918
919
    # Delete nonalphanumeric characters and make all caps
920
    if not ename:
921
        return ''
922
    if ename[0:2] in {'PN', 'AE', 'KN', 'GN', 'WR'}:
923
        ename = ename[1:]
924
    elif ename[0] == 'X':
925
        ename = 'S' + ename[1:]
926
    elif ename[0:2] == 'WH':
927
        ename = 'W' + ename[2:]
928
929
    # Convert to metaph
930
    elen = len(ename)-1
931
    metaph = ''
932
    for i in range(len(ename)):
Issue (unused-code): Consider using enumerate instead of iterating with range and len
933
        if len(metaph) >= maxlength:
934
            break
935
        if ((ename[i] not in {'G', 'T'} and
936
             i > 0 and ename[i-1] == ename[i])):
937
            continue
938
939
        if ename[i] in _vowels and i == 0:
940
            metaph = ename[i]
941
942
        elif ename[i] == 'B':
943
            if i != elen or ename[i-1] != 'M':
944
                metaph += ename[i]
945
946
        elif ename[i] == 'C':
947
            if not (i > 0 and ename[i-1] == 'S' and ename[i+1:i+2] in _frontv):
948
                if ename[i+1:i+3] == 'IA':
949
                    metaph += 'X'
950
                elif ename[i+1:i+2] in _frontv:
951
                    metaph += 'S'
952
                elif i > 0 and ename[i-1:i+2] == 'SCH':
953
                    metaph += 'K'
954
                elif ename[i+1:i+2] == 'H':
955
                    if i == 0 and i+1 < elen and ename[i+2:i+3] not in _vowels:
956
                        metaph += 'K'
957
                    else:
958
                        metaph += 'X'
959
                else:
960
                    metaph += 'K'
961
962
        elif ename[i] == 'D':
963
            if ename[i+1:i+2] == 'G' and ename[i+2:i+3] in _frontv:
964
                metaph += 'J'
965
            else:
966
                metaph += 'T'
967
968
        elif ename[i] == 'G':
969
            if ename[i+1:i+2] == 'H' and not (i+1 == elen or
970
                                              ename[i+2:i+3] not in _vowels):
971
                continue
972
            elif i > 0 and ((i+1 == elen and ename[i+1] == 'N') or
973
                            (i+3 == elen and ename[i+1:i+4] == 'NED')):
974
                continue
975
            elif (i-1 > 0 and i+1 <= elen and ename[i-1] == 'D' and
976
                  ename[i+1] in _frontv):
977
                continue
978
            elif ename[i+1:i+2] == 'G':
979
                continue
980
            elif ename[i+1:i+2] in _frontv:
981
                if i == 0 or ename[i-1] != 'G':
982
                    metaph += 'J'
983
                else:
984
                    metaph += 'K'
985
            else:
986
                metaph += 'K'
987
988
        elif ename[i] == 'H':
989
            if ((i > 0 and ename[i-1] in _vowels and
990
                 ename[i+1:i+2] not in _vowels)):
991
                continue
992
            elif i > 0 and ename[i-1] in _varson:
993
                continue
994
            else:
995
                metaph += 'H'
996
997
        elif ename[i] in {'F', 'J', 'L', 'M', 'N', 'R'}:
998
            metaph += ename[i]
999
1000
        elif ename[i] == 'K':
1001
            if i > 0 and ename[i-1] == 'C':
1002
                continue
1003
            else:
1004
                metaph += 'K'
1005
1006
        elif ename[i] == 'P':
1007
            if ename[i+1:i+2] == 'H':
1008
                metaph += 'F'
1009
            else:
1010
                metaph += 'P'
1011
1012
        elif ename[i] == 'Q':
1013
            metaph += 'K'
1014
1015
        elif ename[i] == 'S':
1016
            if ((i > 0 and i+2 <= elen and ename[i+1] == 'I' and
1017
                 ename[i+2] in 'OA')):
1018
                metaph += 'X'
1019
            elif ename[i+1:i+2] == 'H':
1020
                metaph += 'X'
1021
            else:
1022
                metaph += 'S'
1023
1024
        elif ename[i] == 'T':
1025
            if ((i > 0 and i+2 <= elen and ename[i+1] == 'I' and
1026
                 ename[i+2] in {'A', 'O'})):
1027
                metaph += 'X'
1028
            elif ename[i+1:i+2] == 'H':
1029
                metaph += '0'
1030
            elif ename[i+1:i+3] != 'CH':
1031
                if ename[i-1:i] != 'T':
1032
                    metaph += 'T'
1033
1034
        elif ename[i] == 'V':
1035
            metaph += 'F'
1036
1037
        elif ename[i] in 'WY':
1038
            if ename[i+1:i+2] in _vowels:
1039
                metaph += ename[i]
1040
1041
        elif ename[i] == 'X':
1042
            metaph += 'KS'
1043
1044
        elif ename[i] == 'Z':
1045
            metaph += 'S'
1046
1047
    return metaph
1048
1049
1050
def double_metaphone(word, maxlength=_INFINITY):
1051
    """Return the Double Metaphone code for a word.
1052
1053
    Based on Lawrence Philips' (Visual) C++ code from 1999:
1054
    http://aspell.net/metaphone/dmetaph.cpp
1055
1056
    :param word: the word to transform
1057
    :param maxlength: the maximum length of the returned Double Metaphone codes
1058
        (defaults to unlimited, but in Philips' original implementation this
1059
        was 4)
1060
    :returns: the Double Metaphone value(s)
1061
    :rtype: tuple
1062
1063
    >>> double_metaphone('Christopher')
1064
    ('KRSTFR', '')
1065
    >>> double_metaphone('Niall')
1066
    ('NL', '')
1067
    >>> double_metaphone('Smith')
1068
    ('SM0', 'XMT')
1069
    >>> double_metaphone('Schmidt')
1070
    ('XMT', 'SMT')
1071
    """
1072
    # pylint: disable=too-many-branches
1073
    # Require a maxlength of at least 4
1074
    if maxlength is not None:
1075
        maxlength = max(4, maxlength)
1076
    else:
1077
        maxlength = 64
1078
1079
    primary = ''
1080
    secondary = ''
1081
1082
    def _slavo_germanic():
1083
        """Return True if the word appears to be Slavic or Germanic."""
1084
        if 'W' in word or 'K' in word or 'CZ' in word:
1085
            return True
1086
        return False
1087
1088
    def _metaph_add(pri, sec=''):
1089
        """Return a new metaphone tuple with the supplied elements."""
1090
        newpri = primary
1091
        newsec = secondary
1092
        if pri:
1093
            newpri += pri
1094
        if sec:
1095
            if sec != ' ':
1096
                newsec += sec
1097
        else:
1098
            newsec += pri
1099
        return (newpri, newsec)
1100
1101
    def _is_vowel(pos):
1102
        """Return True if the character at word[pos] is a vowel."""
1103
        if pos >= 0 and word[pos] in {'A', 'E', 'I', 'O', 'U', 'Y'}:
1104
            return True
1105
        return False
1106
1107
    def _get_at(pos):
1108
        """Return the character at word[pos]."""
1109
        return word[pos]
1110
1111
    def _string_at(pos, slen, substrings):
1112
        """Return True if word[pos:pos+slen] is in substrings."""
1113
        if pos < 0:
1114
            return False
1115
        return word[pos:pos+slen] in substrings
1116
1117
    current = 0
1118
    length = len(word)
1119
    if length < 1:
1120
        return ('', '')
1121
    last = length - 1
1122
1123
    word = word.upper()
1124
    word = word.replace('ß', 'SS')
1125
1126
    # Pad the original string so that we can index beyond the edge of the world
1127
    word += '     '
1128
1129
    # Skip these when at start of word
1130
    if word[0:2] in {'GN', 'KN', 'PN', 'WR', 'PS'}:
1131
        current += 1
1132
1133
    # Initial 'X' is pronounced 'Z' e.g. 'Xavier'
1134
    if _get_at(0) == 'X':
1135
        (primary, secondary) = _metaph_add('S')  # 'Z' maps to 'S'
1136
        current += 1
1137
1138
    # Main loop
1139
    while True:
Issue (unused-code): Too many nested blocks (6/5)
1140
        if current >= length:
1141
            break
1142
1143
        if _get_at(current) in {'A', 'E', 'I', 'O', 'U', 'Y'}:
1144
            if current == 0:
1145
                # All init vowels now map to 'A'
1146
                (primary, secondary) = _metaph_add('A')
1147
            current += 1
1148
            continue
1149
1150
        elif _get_at(current) == 'B':
1151
            # "-mb", e.g", "dumb", already skipped over...
1152
            (primary, secondary) = _metaph_add('P')
1153
            if _get_at(current + 1) == 'B':
1154
                current += 2
1155
            else:
1156
                current += 1
1157
            continue
1158
1159
        elif _get_at(current) == 'Ç':
1160
            (primary, secondary) = _metaph_add('S')
1161
            current += 1
1162
            continue
1163
1164
        elif _get_at(current) == 'C':
1165
            # Various Germanic
1166
            if (current > 1 and not _is_vowel(current - 2) and
Issue (best-practice): Too many boolean expressions in if statement (6/5)
1167
                    _string_at((current - 1), 3, {'ACH'}) and
1168
                    ((_get_at(current + 2) != 'I') and
1169
                     ((_get_at(current + 2) != 'E') or
1170
                      _string_at((current - 2), 6,
1171
                                 {'BACHER', 'MACHER'})))):
1172
                (primary, secondary) = _metaph_add('K')
1173
                current += 2
1174
                continue
1175
1176
            # Special case 'caesar'
1177
            elif current == 0 and _string_at(current, 6, {'CAESAR'}):
1178
                (primary, secondary) = _metaph_add('S')
1179
                current += 2
1180
                continue
1181
1182
            # Italian 'chianti'
1183
            elif _string_at(current, 4, {'CHIA'}):
1184
                (primary, secondary) = _metaph_add('K')
1185
                current += 2
1186
                continue
1187
1188
            elif _string_at(current, 2, {'CH'}):
1189
                # Find 'Michael'
1190
                if current > 0 and _string_at(current, 4, {'CHAE'}):
1191
                    (primary, secondary) = _metaph_add('K', 'X')
1192
                    current += 2
1193
                    continue
1194
1195
                # Greek roots e.g. 'chemistry', 'chorus'
1196
                elif (current == 0 and
1197
                      (_string_at((current + 1), 5,
1198
                                  {'HARAC', 'HARIS'}) or
1199
                       _string_at((current + 1), 3,
1200
                                  {'HOR', 'HYM', 'HIA', 'HEM'})) and
1201
                      not _string_at(0, 5, {'CHORE'})):
1202
                    (primary, secondary) = _metaph_add('K')
1203
                    current += 2
1204
                    continue
1205
1206
                # Germanic, Greek, or otherwise 'ch' for 'kh' sound
1207
                elif ((_string_at(0, 4, {'VAN ', 'VON '}) or
Issue (best-practice): Too many boolean expressions in if statement (7/5)
1208
                       _string_at(0, 3, {'SCH'})) or
1209
                      # 'architect but not 'arch', 'orchestra', 'orchid'
1210
                      _string_at((current - 2), 6,
1211
                                 {'ORCHES', 'ARCHIT', 'ORCHID'}) or
1212
                      _string_at((current + 2), 1, {'T', 'S'}) or
1213
                      ((_string_at((current - 1), 1,
1214
                                   {'A', 'O', 'U', 'E'}) or
1215
                        (current == 0)) and
1216
                       # e.g., 'wachtler', 'wechsler', but not 'tichner'
1217
                       _string_at((current + 2), 1,
1218
                                  {'L', 'R', 'N', 'M', 'B', 'H', 'F', 'V', 'W',
1219
                                   ' '}))):
1220
                    (primary, secondary) = _metaph_add('K')
1221
1222
                else:
1223
                    if current > 0:
1224
                        if _string_at(0, 2, {'MC'}):
1225
                            # e.g., "McHugh"
1226
                            (primary, secondary) = _metaph_add('K')
1227
                        else:
1228
                            (primary, secondary) = _metaph_add('X', 'K')
1229
                    else:
1230
                        (primary, secondary) = _metaph_add('X')
1231
1232
                current += 2
1233
                continue
1234
1235
            # e.g, 'czerny'
1236
            elif (_string_at(current, 2, {'CZ'}) and
1237
                  not _string_at((current - 2), 4, {'WICZ'})):
1238
                (primary, secondary) = _metaph_add('S', 'X')
1239
                current += 2
1240
                continue
1241
1242
            # e.g., 'focaccia'
1243
            elif _string_at((current + 1), 3, {'CIA'}):
1244
                (primary, secondary) = _metaph_add('X')
1245
                current += 3
1246
1247
            # double 'C', but not if e.g. 'McClellan'
1248
            elif (_string_at(current, 2, {'CC'}) and
1249
                  not ((current == 1) and (_get_at(0) == 'M'))):
1250
                # 'bellocchio' but not 'bacchus'
1251
                if ((_string_at((current + 2), 1,
1252
                                {'I', 'E', 'H'}) and
1253
                     not _string_at((current + 2), 2, ['HU']))):
1254
                    # 'accident', 'accede' 'succeed'
1255
                    if ((((current == 1) and _get_at(current - 1) == 'A') or
1256
                         _string_at((current - 1), 5,
1257
                                    {'UCCEE', 'UCCES'}))):
1258
                        (primary, secondary) = _metaph_add('KS')
1259
                    # 'bacci', 'bertucci', other italian
1260
                    else:
1261
                        (primary, secondary) = _metaph_add('X')
1262
                    current += 3
1263
                    continue
1264
                else:  # Pierce's rule
1265
                    (primary, secondary) = _metaph_add('K')
1266
                    current += 2
1267
                    continue
1268
1269
            elif _string_at(current, 2, {'CK', 'CG', 'CQ'}):
1270
                (primary, secondary) = _metaph_add('K')
1271
                current += 2
1272
                continue
1273
1274
            elif _string_at(current, 2, {'CI', 'CE', 'CY'}):
1275
                # Italian vs. English
1276
                if _string_at(current, 3, {'CIO', 'CIE', 'CIA'}):
1277
                    (primary, secondary) = _metaph_add('S', 'X')
1278
                else:
1279
                    (primary, secondary) = _metaph_add('S')
1280
                current += 2
1281
                continue
1282
1283
            # else
1284
            else:
1285
                (primary, secondary) = _metaph_add('K')
1286
1287
                # name sent in 'mac caffrey', 'mac gregor
1288
                if _string_at((current + 1), 2, {' C', ' Q', ' G'}):
1289
                    current += 3
1290
                elif (_string_at((current + 1), 1,
1291
                                 {'C', 'K', 'Q'}) and
1292
                      not _string_at((current + 1), 2, {'CE', 'CI'})):
1293
                    current += 2
1294
                else:
1295
                    current += 1
1296
                continue
1297
1298
        elif _get_at(current) == 'D':
1299
            if _string_at(current, 2, {'DG'}):
1300
                if _string_at((current + 2), 1, {'I', 'E', 'Y'}):
1301
                    # e.g. 'edge'
1302
                    (primary, secondary) = _metaph_add('J')
1303
                    current += 3
1304
                    continue
1305
                else:
1306
                    # e.g. 'edgar'
1307
                    (primary, secondary) = _metaph_add('TK')
1308
                    current += 2
1309
                    continue
1310
1311
            elif _string_at(current, 2, {'DT', 'DD'}):
1312
                (primary, secondary) = _metaph_add('T')
1313
                current += 2
1314
                continue
1315
1316
            # else
1317
            else:
1318
                (primary, secondary) = _metaph_add('T')
1319
                current += 1
1320
                continue
1321
1322
        elif _get_at(current) == 'F':
1323
            if _get_at(current + 1) == 'F':
1324
                current += 2
1325
            else:
1326
                current += 1
1327
            (primary, secondary) = _metaph_add('F')
1328
            continue
1329
1330
        elif _get_at(current) == 'G':
1331
            if _get_at(current + 1) == 'H':
1332
                if (current > 0) and not _is_vowel(current - 1):
1333
                    (primary, secondary) = _metaph_add('K')
1334
                    current += 2
1335
                    continue
1336
1337
                # 'ghislane', ghiradelli
1338
                elif current == 0:
1339
                    if _get_at(current + 2) == 'I':
1340
                        (primary, secondary) = _metaph_add('J')
1341
                    else:
1342
                        (primary, secondary) = _metaph_add('K')
1343
                    current += 2
1344
                    continue
1345
1346
                # Parker's rule (with some further refinements) - e.g., 'hugh'
1347
                elif (((current > 1) and
Issue (best-practice): Too many boolean expressions in if statement (6/5)
1348
                       _string_at((current - 2), 1, {'B', 'H', 'D'})) or
1349
                      # e.g., 'bough'
1350
                      ((current > 2) and
1351
                       _string_at((current - 3), 1, {'B', 'H', 'D'})) or
1352
                      # e.g., 'broughton'
1353
                      ((current > 3) and
1354
                       _string_at((current - 4), 1, {'B', 'H'}))):
1355
                    current += 2
1356
                    continue
1357
                else:
1358
                    # e.g. 'laugh', 'McLaughlin', 'cough',
1359
                    #      'gough', 'rough', 'tough'
1360
                    if ((current > 2) and
1361
                            (_get_at(current - 1) == 'U') and
1362
                            (_string_at((current - 3), 1,
1363
                                        {'C', 'G', 'L', 'R', 'T'}))):
1364
                        (primary, secondary) = _metaph_add('F')
1365
                    elif (current > 0) and _get_at(current - 1) != 'I':
1366
                        (primary, secondary) = _metaph_add('K')
1367
                    current += 2
1368
                    continue
1369
1370
            elif _get_at(current + 1) == 'N':
1371
                if (current == 1) and _is_vowel(0) and not _slavo_germanic():
1372
                    (primary, secondary) = _metaph_add('KN', 'N')
1373
                # not e.g. 'cagney'
1374
                elif (not _string_at((current + 2), 2, {'EY'}) and
1375
                      (_get_at(current + 1) != 'Y') and
1376
                      not _slavo_germanic()):
1377
                    (primary, secondary) = _metaph_add('N', 'KN')
1378
                else:
1379
                    (primary, secondary) = _metaph_add('KN')
1380
                current += 2
1381
                continue
1382
1383
            # 'tagliaro'
1384
            elif (_string_at((current + 1), 2, {'LI'}) and
1385
                  not _slavo_germanic()):
1386
                (primary, secondary) = _metaph_add('KL', 'L')
1387
                current += 2
1388
                continue
1389
1390
            # -ges-, -gep-, -gel-, -gie- at beginning
1391
            elif ((current == 0) and
1392
                  ((_get_at(current + 1) == 'Y') or
1393
                   _string_at((current + 1), 2, {'ES', 'EP', 'EB', 'EL', 'EY',
1394
                                                 'IB', 'IL', 'IN', 'IE', 'EI',
1395
                                                 'ER'}))):
1396
                (primary, secondary) = _metaph_add('K', 'J')
1397
                current += 2
1398
                continue
1399
1400
            #  -ger-,  -gy-
1401
            elif ((_string_at((current + 1), 2, {'ER'}) or
1402
                   (_get_at(current + 1) == 'Y')) and not
1403
                  _string_at(0, 6, {'DANGER', 'RANGER', 'MANGER'}) and not
1404
                  _string_at((current - 1), 1, {'E', 'I'}) and not
1405
                  _string_at((current - 1), 3, {'RGY', 'OGY'})):
1406
                (primary, secondary) = _metaph_add('K', 'J')
1407
                current += 2
1408
                continue
1409
1410
            #  italian e.g, 'biaggi'
1411
            elif (_string_at((current + 1), 1, {'E', 'I', 'Y'}) or
1412
                  _string_at((current - 1), 4, {'AGGI', 'OGGI'})):
1413
                # obvious germanic
1414
                if (((_string_at(0, 4, {'VAN ', 'VON '}) or
1415
                      _string_at(0, 3, {'SCH'})) or
1416
                     _string_at((current + 1), 2, {'ET'}))):
1417
                    (primary, secondary) = _metaph_add('K')
1418
                elif _string_at((current + 1), 4, {'IER '}):
1419
                    (primary, secondary) = _metaph_add('J')
1420
                else:
1421
                    (primary, secondary) = _metaph_add('J', 'K')
1422
                current += 2
1423
                continue
1424
1425
            else:
1426
                if _get_at(current + 1) == 'G':
1427
                    current += 2
1428
                else:
1429
                    current += 1
1430
                (primary, secondary) = _metaph_add('K')
1431
                continue
1432
1433
        elif _get_at(current) == 'H':
1434
            # only keep if first & before vowel or btw. 2 vowels
1435
            if ((((current == 0) or _is_vowel(current - 1)) and
1436
                 _is_vowel(current + 1))):
1437
                (primary, secondary) = _metaph_add('H')
1438
                current += 2
1439
            else:  # also takes care of 'HH'
1440
                current += 1
1441
            continue
1442
1443
        elif _get_at(current) == 'J':
1444
            # obvious spanish, 'jose', 'san jacinto'
1445
            if _string_at(current, 4, {'JOSE'}) or _string_at(0, 4, {'SAN '}):
1446
                if ((((current == 0) and (_get_at(current + 4) == ' ')) or
1447
                     _string_at(0, 4, {'SAN '}))):
1448
                    (primary, secondary) = _metaph_add('H')
1449
                else:
1450
                    (primary, secondary) = _metaph_add('J', 'H')
1451
                current += 1
1452
                continue
1453
1454
            elif (current == 0) and not _string_at(current, 4, {'JOSE'}):
1455
                # Yankelovich/Jankelowicz
1456
                (primary, secondary) = _metaph_add('J', 'A')
1457
            # Spanish pron. of e.g. 'bajador'
1458
            elif (_is_vowel(current - 1) and
1459
                  not _slavo_germanic() and
1460
                  ((_get_at(current + 1) == 'A') or
1461
                   (_get_at(current + 1) == 'O'))):
1462
                (primary, secondary) = _metaph_add('J', 'H')
1463
            elif current == last:
1464
                (primary, secondary) = _metaph_add('J', ' ')
1465
            elif (not _string_at((current + 1), 1,
1466
                                 {'L', 'T', 'K', 'S', 'N', 'M', 'B', 'Z'}) and
1467
                  not _string_at((current - 1), 1, {'S', 'K', 'L'})):
1468
                (primary, secondary) = _metaph_add('J')
1469
1470
            if _get_at(current + 1) == 'J':  # it could happen!
1471
                current += 2
1472
            else:
1473
                current += 1
1474
            continue
1475
1476
        elif _get_at(current) == 'K':
1477
            if _get_at(current + 1) == 'K':
1478
                current += 2
1479
            else:
1480
                current += 1
1481
            (primary, secondary) = _metaph_add('K')
1482
            continue
1483
1484
        elif _get_at(current) == 'L':
1485
            if _get_at(current + 1) == 'L':
1486
                # Spanish e.g. 'cabrillo', 'gallegos'
1487
                if (((current == (length - 3)) and
1488
                     _string_at((current - 1), 4, {'ILLO', 'ILLA', 'ALLE'})) or
1489
                        ((_string_at((last - 1), 2, {'AS', 'OS'}) or
1490
                          _string_at(last, 1, {'A', 'O'})) and
1491
                         _string_at((current - 1), 4, {'ALLE'}))):
1492
                    (primary, secondary) = _metaph_add('L', ' ')
1493
                    current += 2
1494
                    continue
1495
                current += 2
1496
            else:
1497
                current += 1
1498
            (primary, secondary) = _metaph_add('L')
1499
            continue
1500
1501
        elif _get_at(current) == 'M':
1502
            if (((_string_at((current - 1), 3, {'UMB'}) and
1503
                  (((current + 1) == last) or
1504
                   _string_at((current + 2), 2, {'ER'}))) or
1505
                 # 'dumb', 'thumb'
1506
                 (_get_at(current + 1) == 'M'))):
1507
                current += 2
1508
            else:
1509
                current += 1
1510
            (primary, secondary) = _metaph_add('M')
1511
            continue
1512
1513
        elif _get_at(current) == 'N':
1514
            if _get_at(current + 1) == 'N':
1515
                current += 2
1516
            else:
1517
                current += 1
1518
            (primary, secondary) = _metaph_add('N')
1519
            continue
1520
1521
        elif _get_at(current) == 'Ñ':
1522
            current += 1
1523
            (primary, secondary) = _metaph_add('N')
1524
            continue
1525
1526
        elif _get_at(current) == 'P':
1527
            if _get_at(current + 1) == 'H':
1528
                (primary, secondary) = _metaph_add('F')
1529
                current += 2
1530
                continue
1531
1532
            # also account for "campbell", "raspberry"
1533
            elif _string_at((current + 1), 1, {'P', 'B'}):
1534
                current += 2
1535
            else:
1536
                current += 1
1537
            (primary, secondary) = _metaph_add('P')
1538
            continue
1539
1540
        elif _get_at(current) == 'Q':
1541
            if _get_at(current + 1) == 'Q':
1542
                current += 2
1543
            else:
1544
                current += 1
1545
            (primary, secondary) = _metaph_add('K')
1546
            continue
1547
1548
        elif _get_at(current) == 'R':
1549
            # french e.g. 'rogier', but exclude 'hochmeier'
1550
            if (((current == last) and
1551
                 not _slavo_germanic() and
1552
                 _string_at((current - 2), 2, {'IE'}) and
1553
                 not _string_at((current - 4), 2, {'ME', 'MA'}))):
1554
                (primary, secondary) = _metaph_add('', 'R')
1555
            else:
1556
                (primary, secondary) = _metaph_add('R')
1557
1558
            if _get_at(current + 1) == 'R':
1559
                current += 2
1560
            else:
1561
                current += 1
1562
            continue
1563
1564
        elif _get_at(current) == 'S':
1565
            # special cases 'island', 'isle', 'carlisle', 'carlysle'
1566
            if _string_at((current - 1), 3, {'ISL', 'YSL'}):
1567
                current += 1
1568
                continue
1569
1570
            # special case 'sugar-'
1571
            elif (current == 0) and _string_at(current, 5, {'SUGAR'}):
1572
                (primary, secondary) = _metaph_add('X', 'S')
1573
                current += 1
1574
                continue
1575
1576
            elif _string_at(current, 2, {'SH'}):
1577
                # Germanic
1578
                if _string_at((current + 1), 4,
1579
                              {'HEIM', 'HOEK', 'HOLM', 'HOLZ'}):
1580
                    (primary, secondary) = _metaph_add('S')
1581
                else:
1582
                    (primary, secondary) = _metaph_add('X')
1583
                current += 2
1584
                continue
1585
1586
            # Italian & Armenian
1587
            elif (_string_at(current, 3, {'SIO', 'SIA'}) or
1588
                  _string_at(current, 4, {'SIAN'})):
1589
                if not _slavo_germanic():
1590
                    (primary, secondary) = _metaph_add('S', 'X')
1591
                else:
1592
                    (primary, secondary) = _metaph_add('S')
1593
                current += 3
1594
                continue
1595
1596
            # German & anglicisations, e.g. 'smith' match 'schmidt',
1597
            #                               'snider' match 'schneider'
1598
            # also, -sz- in Slavic language although in Hungarian it is
1599
            #       pronounced 's'
1600
            elif (((current == 0) and
1601
                   _string_at((current + 1), 1, {'M', 'N', 'L', 'W'})) or
1602
                  _string_at((current + 1), 1, {'Z'})):
1603
                (primary, secondary) = _metaph_add('S', 'X')
1604
                if _string_at((current + 1), 1, {'Z'}):
1605
                    current += 2
1606
                else:
1607
                    current += 1
1608
                continue
1609
1610
            elif _string_at(current, 2, {'SC'}):
1611
                # Schlesinger's rule
1612
                if _get_at(current + 2) == 'H':
1613
                    # dutch origin, e.g. 'school', 'schooner'
1614
                    if _string_at((current + 3), 2,
1615
                                  {'OO', 'ER', 'EN', 'UY', 'ED', 'EM'}):
1616
                        # 'schermerhorn', 'schenker'
1617
                        if _string_at((current + 3), 2, {'ER', 'EN'}):
1618
                            (primary, secondary) = _metaph_add('X', 'SK')
1619
                        else:
1620
                            (primary, secondary) = _metaph_add('SK')
1621
                        current += 3
1622
                        continue
1623
                    else:
1624
                        if (((current == 0) and not _is_vowel(3) and
1625
                             (_get_at(3) != 'W'))):
1626
                            (primary, secondary) = _metaph_add('X', 'S')
1627
                        else:
1628
                            (primary, secondary) = _metaph_add('X')
1629
                        current += 3
1630
                        continue
1631
1632
                elif _string_at((current + 2), 1, {'I', 'E', 'Y'}):
1633
                    (primary, secondary) = _metaph_add('S')
1634
                    current += 3
1635
                    continue
1636
1637
                # else
1638
                else:
1639
                    (primary, secondary) = _metaph_add('SK')
1640
                    current += 3
1641
                    continue
1642
1643
            else:
1644
                # french e.g. 'resnais', 'artois'
1645
                if (current == last) and _string_at((current - 2), 2,
1646
                                                    {'AI', 'OI'}):
1647
                    (primary, secondary) = _metaph_add('', 'S')
1648
                else:
1649
                    (primary, secondary) = _metaph_add('S')
1650
1651
                if _string_at((current + 1), 1, {'S', 'Z'}):
1652
                    current += 2
1653
                else:
1654
                    current += 1
1655
                continue
1656
1657
        elif _get_at(current) == 'T':
1658
            if _string_at(current, 4, {'TION'}):
1659
                (primary, secondary) = _metaph_add('X')
1660
                current += 3
1661
                continue
1662
1663
            elif _string_at(current, 3, {'TIA', 'TCH'}):
1664
                (primary, secondary) = _metaph_add('X')
1665
                current += 3
1666
                continue
1667
1668
            elif (_string_at(current, 2, {'TH'}) or
1669
                  _string_at(current, 3, {'TTH'})):
1670
                # special case 'thomas', 'thames' or germanic
1671
                if ((_string_at((current + 2), 2, {'OM', 'AM'}) or
1672
                     _string_at(0, 4, {'VAN ', 'VON '}) or
1673
                     _string_at(0, 3, {'SCH'}))):
1674
                    (primary, secondary) = _metaph_add('T')
1675
                else:
1676
                    (primary, secondary) = _metaph_add('0', 'T')
1677
                current += 2
1678
                continue
1679
1680
            elif _string_at((current + 1), 1, {'T', 'D'}):
1681
                current += 2
1682
            else:
1683
                current += 1
1684
            (primary, secondary) = _metaph_add('T')
1685
            continue
1686
1687
        elif _get_at(current) == 'V':
1688
            if _get_at(current + 1) == 'V':
1689
                current += 2
1690
            else:
1691
                current += 1
1692
            (primary, secondary) = _metaph_add('F')
1693
            continue
1694
1695
        elif _get_at(current) == 'W':
1696
            # can also be in middle of word
1697
            if _string_at(current, 2, {'WR'}):
1698
                (primary, secondary) = _metaph_add('R')
1699
                current += 2
1700
                continue
1701
            elif ((current == 0) and
1702
                  (_is_vowel(current + 1) or _string_at(current, 2, {'WH'}))):
1703
                # Wasserman should match Vasserman
1704
                if _is_vowel(current + 1):
1705
                    (primary, secondary) = _metaph_add('A', 'F')
1706
                else:
1707
                    # need Uomo to match Womo
1708
                    (primary, secondary) = _metaph_add('A')
1709
1710
            # Arnow should match Arnoff
1711
            if ((((current == last) and _is_vowel(current - 1)) or
1712
                 _string_at((current - 1), 5,
1713
                            {'EWSKI', 'EWSKY', 'OWSKI', 'OWSKY'}) or
1714
                 _string_at(0, 3, {'SCH'}))):
1715
                (primary, secondary) = _metaph_add('', 'F')
1716
                current += 1
1717
                continue
1718
            # Polish e.g. 'filipowicz'
1719
            elif _string_at(current, 4, {'WICZ', 'WITZ'}):
1720
                (primary, secondary) = _metaph_add('TS', 'FX')
1721
                current += 4
1722
                continue
1723
            # else skip it
1724
            else:
1725
                current += 1
1726
                continue
1727
1728
        elif _get_at(current) == 'X':
1729
            # French e.g. breaux
1730
            if (not ((current == last) and
1731
                     (_string_at((current - 3), 3, {'IAU', 'EAU'}) or
1732
                      _string_at((current - 2), 2, {'AU', 'OU'})))):
1733
                (primary, secondary) = _metaph_add('KS')
1734
1735
            if _string_at((current + 1), 1, {'C', 'X'}):
1736
                current += 2
1737
            else:
1738
                current += 1
1739
            continue
1740
1741
        elif _get_at(current) == 'Z':
1742
            # Chinese Pinyin e.g. 'zhao'
1743
            if _get_at(current + 1) == 'H':
1744
                (primary, secondary) = _metaph_add('J')
1745
                current += 2
1746
                continue
1747
            elif (_string_at((current + 1), 2, {'ZO', 'ZI', 'ZA'}) or
1748
                  (_slavo_germanic() and ((current > 0) and
1749
                                          _get_at(current - 1) != 'T'))):
1750
                (primary, secondary) = _metaph_add('S', 'TS')
1751
            else:
1752
                (primary, secondary) = _metaph_add('S')
1753
1754
            if _get_at(current + 1) == 'Z':
1755
                current += 2
1756
            else:
1757
                current += 1
1758
            continue
1759
1760
        else:
1761
            current += 1
1762
1763
    if maxlength and maxlength < _INFINITY:
1764
        primary = primary[:maxlength]
1765
        secondary = secondary[:maxlength]
1766
    if primary == secondary:
1767
        secondary = ''
1768
1769
    return (primary, secondary)
1770
1771
1772
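# A minimal usage sketch (an illustrative addition, not part of the
# Double Metaphone algorithm itself): the (primary, secondary) pair
# returned above is usually consumed by treating two words as a likely
# phonetic match when any of their non-empty codes coincide. The helper
# name below is hypothetical.
def _double_metaphone_match(word1, word2):
    """Return True if two words share a Double Metaphone encoding."""
    primary1, secondary1 = double_metaphone(word1)
    primary2, secondary2 = double_metaphone(word2)
    # secondary is '' whenever it equals primary, so drop empty codes
    codes1 = {primary1, secondary1} - {''}
    codes2 = {primary2, secondary2} - {''}
    return bool(codes1 & codes2)

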
def caverphone(word, version=2):
1773
    """Return the Caverphone code for a word.
1774
1775
    A description of version 1 of the algorithm can be found at:
1776
    http://caversham.otago.ac.nz/files/working/ctp060902.pdf
1777
1778
    A description of version 2 of the algorithm can be found at:
1779
    http://caversham.otago.ac.nz/files/working/ctp150804.pdf
1780
1781
    :param str word: the word to transform
1782
    :param int version: the version of Caverphone to employ for encoding
1783
        (defaults to 2)
1784
    :returns: the Caverphone value
1785
    :rtype: str
1786
1787
    >>> caverphone('Christopher')
1788
    'KRSTFA1111'
1789
    >>> caverphone('Niall')
1790
    'NA11111111'
1791
    >>> caverphone('Smith')
1792
    'SMT1111111'
1793
    >>> caverphone('Schmidt')
1794
    'SKMT111111'
1795
1796
    >>> caverphone('Christopher', 1)
1797
    'KRSTF1'
1798
    >>> caverphone('Niall', 1)
1799
    'N11111'
1800
    >>> caverphone('Smith', 1)
1801
    'SMT111'
1802
    >>> caverphone('Schmidt', 1)
1803
    'SKMT11'
1804
    """
1805
    _vowels = {'a', 'e', 'i', 'o', 'u'}
1806
1807
    word = word.lower()
1808
    word = ''.join(c for c in word if c in
1809
                   {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
1810
                    'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x',
1811
                    'y', 'z'})
1812
1813
    def _squeeze_replace(word, char, new_char):
1814
        """Convert strings of char in word to one instance of new_char."""
1815
        while char * 2 in word:
1816
            word = word.replace(char * 2, char)
1817
        return word.replace(char, new_char)
1818
1819
    # the main replacement algorithm
1820
    if version != 1 and word[-1:] == 'e':
1821
        word = word[:-1]
1822
    if word:
1823
        if word[:5] == 'cough':
1824
            word = 'cou2f'+word[5:]
1825
        if word[:5] == 'rough':
1826
            word = 'rou2f'+word[5:]
1827
        if word[:5] == 'tough':
1828
            word = 'tou2f'+word[5:]
1829
        if word[:6] == 'enough':
1830
            word = 'enou2f'+word[6:]
1831
        if version != 1 and word[:6] == 'trough':
1832
            word = 'trou2f'+word[6:]
1833
        if word[:2] == 'gn':
1834
            word = '2n'+word[2:]
1835
        if word[-2:] == 'mb':
1836
            word = word[:-1]+'2'
1837
        word = word.replace('cq', '2q')
1838
        word = word.replace('ci', 'si')
1839
        word = word.replace('ce', 'se')
1840
        word = word.replace('cy', 'sy')
1841
        word = word.replace('tch', '2ch')
1842
        word = word.replace('c', 'k')
1843
        word = word.replace('q', 'k')
1844
        word = word.replace('x', 'k')
1845
        word = word.replace('v', 'f')
1846
        word = word.replace('dg', '2g')
1847
        word = word.replace('tio', 'sio')
1848
        word = word.replace('tia', 'sia')
1849
        word = word.replace('d', 't')
1850
        word = word.replace('ph', 'fh')
1851
        word = word.replace('b', 'p')
1852
        word = word.replace('sh', 's2')
1853
        word = word.replace('z', 's')
1854
        if word[0] in _vowels:
1855
            word = 'A'+word[1:]
1856
        word = word.replace('a', '3')
1857
        word = word.replace('e', '3')
1858
        word = word.replace('i', '3')
1859
        word = word.replace('o', '3')
1860
        word = word.replace('u', '3')
1861
        if version != 1:
1862
            word = word.replace('j', 'y')
1863
            if word[:2] == 'y3':
1864
                word = 'Y3'+word[2:]
1865
            if word[:1] == 'y':
1866
                word = 'A'+word[1:]
1867
            word = word.replace('y', '3')
1868
        word = word.replace('3gh3', '3kh3')
1869
        word = word.replace('gh', '22')
1870
        word = word.replace('g', 'k')
1871
1872
        word = _squeeze_replace(word, 's', 'S')
1873
        word = _squeeze_replace(word, 't', 'T')
1874
        word = _squeeze_replace(word, 'p', 'P')
1875
        word = _squeeze_replace(word, 'k', 'K')
1876
        word = _squeeze_replace(word, 'f', 'F')
1877
        word = _squeeze_replace(word, 'm', 'M')
1878
        word = _squeeze_replace(word, 'n', 'N')
1879
1880
        word = word.replace('w3', 'W3')
1881
        if version == 1:
1882
            word = word.replace('wy', 'Wy')
1883
        word = word.replace('wh3', 'Wh3')
1884
        if version == 1:
1885
            word = word.replace('why', 'Why')
1886
        if version != 1 and word[-1:] == 'w':
1887
            word = word[:-1]+'3'
1888
        word = word.replace('w', '2')
1889
        if word[:1] == 'h':
1890
            word = 'A'+word[1:]
1891
        word = word.replace('h', '2')
1892
        word = word.replace('r3', 'R3')
1893
        if version == 1:
1894
            word = word.replace('ry', 'Ry')
1895
        if version != 1 and word[-1:] == 'r':
1896
            word = word[:-1]+'3'
1897
        word = word.replace('r', '2')
1898
        word = word.replace('l3', 'L3')
1899
        if version == 1:
1900
            word = word.replace('ly', 'Ly')
1901
        if version != 1 and word[-1:] == 'l':
1902
            word = word[:-1]+'3'
1903
        word = word.replace('l', '2')
1904
        if version == 1:
1905
            word = word.replace('j', 'y')
1906
            word = word.replace('y3', 'Y3')
1907
            word = word.replace('y', '2')
1908
        word = word.replace('2', '')
1909
        if version != 1 and word[-1:] == '3':
1910
            word = word[:-1]+'A'
1911
        word = word.replace('3', '')
1912
1913
    # pad with 1s, then extract the necessary length of code
1914
    word = word+'1'*10
1915
    if version != 1:
1916
        word = word[:10]
1917
    else:
1918
        word = word[:6]
1919
1920
    return word
1921
1922
1923
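# A minimal usage sketch (illustrative only, not part of the Caverphone
# specification): because caverphone() pads every code with '1's to a
# fixed length (10 characters for version 2, 6 for version 1), plain
# string equality of two codes is the natural match test. The helper
# name below is hypothetical.
def _caverphone_match(word1, word2, version=2):
    """Return True if two words share a Caverphone code."""
    return caverphone(word1, version) == caverphone(word2, version)

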
def alpha_sis(word, maxlength=14):
1924
    """Return the IBM Alpha Search Inquiry System code for a word.
1925
1926
    Based on the algorithm described in "Accessing individual records from
1927
    personal data files using non-unique identifiers" / Gwendolyn B. Moore,
1928
    et al.; prepared for the Institute for Computer Sciences and Technology,
1929
    National Bureau of Standards, Washington, D.C (1977):
1930
    https://archive.org/stream/accessingindivid00moor#page/15/mode/1up
1931
1932
    A collection is necessary since there can be multiple values for a
1933
    single word. But the collection must be ordered since the first value
1934
    is the primary coding.
1935
1936
    :param str word: the word to transform
1937
    :param int maxlength: the length of the code returned (defaults to 14)
1938
    :returns: the Alpha SIS value
1939
    :rtype: tuple
1940
1941
    >>> alpha_sis('Christopher')
1942
    ('06401840000000', '07040184000000', '04018400000000')
1943
    >>> alpha_sis('Niall')
1944
    ('02500000000000',)
1945
    >>> alpha_sis('Smith')
1946
    ('03100000000000',)
1947
    >>> alpha_sis('Schmidt')
1948
    ('06310000000000',)
1949
    """
1950
    _alpha_sis_initials = {'GF': '08', 'GM': '03', 'GN': '02', 'KN': '02',
1951
                           'PF': '08', 'PN': '02', 'PS': '00', 'WR': '04',
1952
                           'A': '1', 'E': '1', 'H': '2', 'I': '1', 'J': '3',
1953
                           'O': '1', 'U': '1', 'W': '4', 'Y': '5'}
1954
    _alpha_sis_initials_order = ('GF', 'GM', 'GN', 'KN', 'PF', 'PN', 'PS',
1955
                                 'WR', 'A', 'E', 'H', 'I', 'J', 'O', 'U', 'W',
1956
                                 'Y')
1957
    _alpha_sis_basic = {'SCH': '6', 'CZ': ('70', '6', '0'),
1958
                        'CH': ('6', '70', '0'), 'CK': ('7', '6'),
1959
                        'DS': ('0', '10'), 'DZ': ('0', '10'),
1960
                        'TS': ('0', '10'), 'TZ': ('0', '10'), 'CI': '0',
1961
                        'CY': '0', 'CE': '0', 'SH': '6', 'DG': '7', 'PH': '8',
1962
                        'C': ('7', '6'), 'K': ('7', '6'), 'Z': '0', 'S': '0',
1963
                        'D': '1', 'T': '1', 'N': '2', 'M': '3', 'R': '4',
1964
                        'L': '5', 'J': '6', 'G': '7', 'Q': '7', 'X': '7',
1965
                        'F': '8', 'V': '8', 'B': '9', 'P': '9'}
1966
    _alpha_sis_basic_order = ('SCH', 'CZ', 'CH', 'CK', 'DS', 'DZ', 'TS', 'TZ',
1967
                              'CI', 'CY', 'CE', 'SH', 'DG', 'PH', 'C', 'K',
1968
                              'Z', 'S', 'D', 'T', 'N', 'M', 'R', 'L', 'J', 'C',
1969
                              'G', 'K', 'Q', 'X', 'F', 'V', 'B', 'P')
1970
1971
    alpha = ['']
1972
    pos = 0
1973
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
1974
    word = word.replace('ß', 'SS')
1975
    word = ''.join(c for c in word if c in
1976
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
1977
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
1978
                    'Y', 'Z'})
1979
1980
    # Clamp maxlength to [4, 64]
1981
    if maxlength is not None:
1982
        maxlength = min(max(4, maxlength), 64)
1983
    else:
1984
        maxlength = 64
1985
1986
    # Do special processing for initial substrings
1987
    for k in _alpha_sis_initials_order:
1988
        if word.startswith(k):
1989
            alpha[0] += _alpha_sis_initials[k]
1990
            pos += len(k)
1991
            break
1992
1993
    # Add a '0' if alpha is still empty
1994
    if not alpha[0]:
1995
        alpha[0] += '0'
1996
1997
    # Whether or not any special initial codes were encoded, iterate
1998
    # through the length of the word in the main encoding loop
1999
    while pos < len(word):
2000
        origpos = pos
2001
        for k in _alpha_sis_basic_order:
2002
            if word[pos:].startswith(k):
2003
                if isinstance(_alpha_sis_basic[k], tuple):
2004
                    newalpha = []
2005
                    for i in range(len(_alpha_sis_basic[k])):
2006
                        newalpha += [_ + _alpha_sis_basic[k][i] for _ in alpha]
2007
                    alpha = newalpha
2008
                else:
2009
                    alpha = [_ + _alpha_sis_basic[k] for _ in alpha]
2010
                pos += len(k)
2011
                break
2012
        if pos == origpos:
2013
            alpha = [_ + '_' for _ in alpha]
2014
            pos += 1
2015
2016
    # Trim doublets and placeholders
2017
    for i in range(len(alpha)):
2018
        pos = 1
2019
        while pos < len(alpha[i]):
2020
            if alpha[i][pos] == alpha[i][pos-1]:
2021
                alpha[i] = alpha[i][:pos]+alpha[i][pos+1:]
2022
            pos += 1
2023
    alpha = (_.replace('_', '') for _ in alpha)
2024
2025
    # Trim codes and return tuple
2026
    alpha = ((_ + ('0'*maxlength))[:maxlength] for _ in alpha)
2027
    return tuple(alpha)
2028
2029
2030
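# A minimal usage sketch (illustrative only, not drawn from the Moore
# et al. report): alpha_sis() returns an ordered tuple of candidate
# codes with the primary coding first, so a strict comparison looks
# only at the first elements while a looser one accepts any overlap.
# The helper name below is hypothetical.
def _alpha_sis_match(word1, word2, primary_only=False):
    """Return True if two words share an Alpha SIS encoding."""
    codes1 = alpha_sis(word1)
    codes2 = alpha_sis(word2)
    if primary_only:
        return codes1[0] == codes2[0]
    return bool(set(codes1) & set(codes2))

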
def fuzzy_soundex(word, maxlength=5, zero_pad=True):
2031
    """Return the Fuzzy Soundex code for a word.
2032
2033
    Fuzzy Soundex is an algorithm derived from Soundex, defined in:
2034
    Holmes, David and M. Catherine McCabe. "Improving Precision and Recall for
2035
    Soundex Retrieval."
2036
    http://wayback.archive.org/web/20100629121128/http://www.ir.iit.edu/publications/downloads/IEEESoundexV5.pdf
2037
2038
    :param str word: the word to transform
2039
    :param int maxlength: the length of the code returned (defaults to 5)
2040
    :param bool zero_pad: pad the end of the return value with 0s to achieve
2041
        a maxlength string
2042
    :returns: the Fuzzy Soundex value
2043
    :rtype: str
2044
2045
    >>> fuzzy_soundex('Christopher')
2046
    'K6931'
2047
    >>> fuzzy_soundex('Niall')
2048
    'N4000'
2049
    >>> fuzzy_soundex('Smith')
2050
    'S5300'
2051
    >>> fuzzy_soundex('Schmidt')
2052
    'S5300'
2053
    """
2054
    _fuzzy_soundex_translation = dict(zip((ord(_) for _ in
2055
                                           'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
2056
                                          '0193017-07745501769301-7-9'))
2057
2058
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
2059
    word = word.replace('ß', 'SS')
2060
2061
    # Clamp maxlength to [4, 64]
2062
    if maxlength is not None:
2063
        maxlength = min(max(4, maxlength), 64)
2064
    else:
2065
        maxlength = 64
2066
2067
    if not word:
2068
        if zero_pad:
2069
            return '0' * maxlength
2070
        return '0'
2071
2072
    if word[:2] in {'CS', 'CZ', 'TS', 'TZ'}:
2073
        word = 'SS' + word[2:]
2074
    elif word[:2] == 'GN':
2075
        word = 'NN' + word[2:]
2076
    elif word[:2] in {'HR', 'WR'}:
2077
        word = 'RR' + word[2:]
2078
    elif word[:2] == 'HW':
2079
        word = 'WW' + word[2:]
2080
    elif word[:2] in {'KN', 'NG'}:
2081
        word = 'NN' + word[2:]
2082
2083
    if word[-2:] == 'CH':
2084
        word = word[:-2] + 'KK'
2085
    elif word[-2:] == 'NT':
2086
        word = word[:-2] + 'TT'
2087
    elif word[-2:] == 'RT':
2088
        word = word[:-2] + 'RR'
2089
    elif word[-3:] == 'RDT':
2090
        word = word[:-3] + 'RR'
2091
2092
    word = word.replace('CA', 'KA')
2093
    word = word.replace('CC', 'KK')
2094
    word = word.replace('CK', 'KK')
2095
    word = word.replace('CE', 'SE')
2096
    word = word.replace('CHL', 'KL')
2097
    word = word.replace('CL', 'KL')
2098
    word = word.replace('CHR', 'KR')
2099
    word = word.replace('CR', 'KR')
2100
    word = word.replace('CI', 'SI')
2101
    word = word.replace('CO', 'KO')
2102
    word = word.replace('CU', 'KU')
2103
    word = word.replace('CY', 'SY')
2104
    word = word.replace('DG', 'GG')
2105
    word = word.replace('GH', 'HH')
2106
    word = word.replace('MAC', 'MK')
2107
    word = word.replace('MC', 'MK')
2108
    word = word.replace('NST', 'NSS')
2109
    word = word.replace('PF', 'FF')
2110
    word = word.replace('PH', 'FF')
2111
    word = word.replace('SCH', 'SSS')
2112
    word = word.replace('TIO', 'SIO')
2113
    word = word.replace('TIA', 'SIO')
2114
    word = word.replace('TCH', 'CHH')
2115
2116
    sdx = word.translate(_fuzzy_soundex_translation)
2117
    sdx = sdx.replace('-', '')
2118
2119
    # remove repeating characters
2120
    sdx = _delete_consecutive_repeats(sdx)
2121
2122
    if word[0] in {'H', 'W', 'Y'}:
2123
        sdx = word[0] + sdx
2124
    else:
2125
        sdx = word[0] + sdx[1:]
2126
2127
    sdx = sdx.replace('0', '')
2128
2129
    if zero_pad:
2130
        sdx += ('0'*maxlength)
2131
2132
    return sdx[:maxlength]
2133
2134
2135
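# A refactoring sketch (illustrative only): the "Clamp maxlength to
# [4, 64]" block repeated in alpha_sis(), fuzzy_soundex(), phonex(),
# and phonix() reduces to the hypothetical helper below, with None
# meaning "use the upper bound".
def _clamp_maxlength(maxlength, lower=4, upper=64):
    """Clamp a maxlength argument to the range [lower, upper]."""
    if maxlength is None:
        return upper
    return min(max(lower, maxlength), upper)

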
def phonex(word, maxlength=4, zero_pad=True):
2136
    """Return the Phonex code for a word.
2137
2138
    Phonex is an algorithm derived from Soundex, defined in:
2139
    Lait, A. J. and B. Randell. "An Assessment of Name Matching Algorithms".
2140
    http://homepages.cs.ncl.ac.uk/brian.randell/Genealogy/NameMatching.pdf
2141
2142
    :param str word: the word to transform
2143
    :param int maxlength: the length of the code returned (defaults to 4)
2144
    :param bool zero_pad: pad the end of the return value with 0s to achieve
2145
        a maxlength string
2146
    :returns: the Phonex value
2147
    :rtype: str
2148
2149
    >>> phonex('Christopher')
2150
    'C623'
2151
    >>> phonex('Niall')
2152
    'N400'
2153
    >>> phonex('Schmidt')
2154
    'S253'
2155
    >>> phonex('Smith')
2156
    'S530'
2157
    """
2158
    name = unicodedata.normalize('NFKD', text_type(word.upper()))
2159
    name = name.replace('ß', 'SS')
2160
2161
    # Clamp maxlength to [4, 64]
2162
    if maxlength is not None:
2163
        maxlength = min(max(4, maxlength), 64)
2164
    else:
2165
        maxlength = 64
2166
2167
    name_code = last = ''
2168
2169
    # Deletions effected by replacing with next letter which
2170
    # will be ignored due to duplicate handling of Soundex code.
2171
    # This is faster than 'moving' all subsequent letters.
2172
2173
    # Remove any trailing Ss
2174
    while name[-1:] == 'S':
2175
        name = name[:-1]
2176
2177
    # Phonetic equivalents of first 2 characters
2178
    # Works since duplicate letters are ignored
2179
    if name[:2] == 'KN':
2180
        name = 'N' + name[2:]  # KN.. == N..
2181
    elif name[:2] == 'PH':
2182
        name = 'F' + name[2:]  # PH.. == F.. (H ignored anyway)
2183
    elif name[:2] == 'WR':
2184
        name = 'R' + name[2:]  # WR.. == R..
2185
2186
    if name:
2187
        # Special case, ignore H first letter (subsequent Hs ignored anyway)
2188
        # Works since duplicate letters are ignored
2189
        if name[0] == 'H':
2190
            name = name[1:]
2191
2192
    if name:
2193
        # Phonetic equivalents of first character
2194
        if name[0] in {'A', 'E', 'I', 'O', 'U', 'Y'}:
2195
            name = 'A' + name[1:]
2196
        elif name[0] in {'B', 'P'}:
2197
            name = 'B' + name[1:]
2198
        elif name[0] in {'V', 'F'}:
2199
            name = 'F' + name[1:]
2200
        elif name[0] in {'C', 'K', 'Q'}:
2201
            name = 'C' + name[1:]
2202
        elif name[0] in {'G', 'J'}:
2203
            name = 'G' + name[1:]
2204
        elif name[0] in {'S', 'Z'}:
2205
            name = 'S' + name[1:]
2206
2207
        name_code = last = name[0]
2208
2209
    # MODIFIED SOUNDEX CODE
2210
    for i in range(1, len(name)):
2211
        code = '0'
2212
        if name[i] in {'B', 'F', 'P', 'V'}:
2213
            code = '1'
2214
        elif name[i] in {'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'}:
2215
            code = '2'
2216
        elif name[i] in {'D', 'T'}:
2217
            if name[i+1:i+2] != 'C':
2218
                code = '3'
2219
        elif name[i] == 'L':
2220
            if (name[i+1:i+2] in {'A', 'E', 'I', 'O', 'U', 'Y'} or
2221
                    i+1 == len(name)):
2222
                code = '4'
2223
        elif name[i] in {'M', 'N'}:
2224
            if name[i+1:i+2] in {'D', 'G'}:
2225
                name = name[:i+1] + name[i] + name[i+2:]
2226
            code = '5'
2227
        elif name[i] == 'R':
2228
            if (name[i+1:i+2] in {'A', 'E', 'I', 'O', 'U', 'Y'} or
2229
                    i+1 == len(name)):
2230
                code = '6'
2231
2232
        if code != last and code != '0' and i != 0:
2233
            name_code += code
2234
2235
        last = name_code[-1]
2236
2237
    if zero_pad:
2238
        name_code += '0' * maxlength
2239
    if not name_code:
2240
        name_code = '0'
2241
    return name_code[:maxlength]
2242
2243
2244
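# An illustrative sketch of the letter-to-digit translation pattern
# used throughout this module (e.g. _fuzzy_soundex_translation above
# and _phonix_translation below): each letter maps, by ordinal, to the
# character at the same position of a code string, producing a mapping
# suitable for str.translate(). The helper name below is hypothetical.
def _make_translation(letters, codes):
    """Build a str.translate() mapping from two parallel strings.

    For example, with a toy three-letter alphabet:
    _make_translation('ABC', '012') == {65: '0', 66: '1', 67: '2'}
    """
    return dict(zip((ord(c) for c in letters), codes))

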
def phonem(word):
2245
    """Return the Phonem code for a word.
2246
2247
    Phonem is defined in Wilde, Georg and Carsten Meyer. 1999. "Doppelgaenger
2248
    gesucht - Ein Programm fuer kontextsensitive phonetische Textumwandlung."
2249
    ct Magazin fuer Computer & Technik 25/1999.
2250
2251
    This version is based on the Perl implementation documented at:
2252
    http://phonetik.phil-fak.uni-koeln.de/fileadmin/home/ritters/Allgemeine_Dateien/Martin_Wilz.pdf
2253
    It includes some enhancements presented in the Java port at:
2254
    https://github.com/dcm4che/dcm4che/blob/master/dcm4che-soundex/src/main/java/org/dcm4che3/soundex/Phonem.java
2255
2256
    Phonem is intended chiefly for German names/words.
2257
2258
    :param str word: the word to transform
2259
    :returns: the Phonem value
2260
    :rtype: str
2261
2262
    >>> phonem('Christopher')
2263
    'CRYSDOVR'
2264
    >>> phonem('Niall')
2265
    'NYAL'
2266
    >>> phonem('Smith')
2267
    'SMYD'
2268
    >>> phonem('Schmidt')
2269
    'CMYD'
2270
    """
2271
    _phonem_substitutions = (('SC', 'C'), ('SZ', 'C'), ('CZ', 'C'),
2272
                             ('TZ', 'C'), ('TS', 'C'), ('KS', 'X'),
2273
                             ('PF', 'V'), ('QU', 'KW'), ('PH', 'V'),
2274
                             ('UE', 'Y'), ('AE', 'E'), ('OE', 'Ö'),
2275
                             ('EI', 'AY'), ('EY', 'AY'), ('EU', 'OY'),
2276
                             ('AU', 'A§'), ('OU', '§'))
2277
    _phonem_translation = dict(zip((ord(_) for _ in
2278
                                    'ZKGQÇÑßFWPTÁÀÂÃÅÄÆÉÈÊËIJÌÍÎÏÜݧÚÙÛÔÒÓÕØ'),
2279
                                   'CCCCCNSVVBDAAAAAEEEEEEYYYYYYYYUUUUOOOOÖ'))
2280
2281
    word = unicodedata.normalize('NFC', text_type(word.upper()))
2282
    for i, j in _phonem_substitutions:
2283
        word = word.replace(i, j)
2284
    word = word.translate(_phonem_translation)
2285
2286
    return ''.join(c for c in _delete_consecutive_repeats(word)
2287
                   if c in {'A', 'B', 'C', 'D', 'L', 'M', 'N', 'O', 'R', 'S',
2288
                            'U', 'V', 'W', 'X', 'Y', 'Ö'})
2289
2290
2291
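# An illustrative sketch (hypothetical helper, not part of the Phonem
# definition): phonem() applies its ordered (src, tar) substitution
# pairs with exactly this kind of loop; the ordering matters because
# later pairs see the output of earlier ones (e.g. 'QU' -> 'KW' is
# applied before the single-character translation step).
def _apply_substitutions(word, substitutions):
    """Apply an ordered sequence of (src, tar) replacements to word."""
    for src, tar in substitutions:
        word = word.replace(src, tar)
    return word

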
def phonix(word, maxlength=4, zero_pad=True):
2292
    """Return the Phonix code for a word.
2293
2294
    Phonix is a Soundex-like algorithm defined in:
2295
    T.N. Gadd: PHONIX --- The Algorithm, Program 24/4, 1990, p.363-366.
2296
2297
    This implementation is based on
2298
    http://cpansearch.perl.org/src/ULPFR/WAIT-1.800/soundex.c
2299
    http://cs.anu.edu.au/people/Peter.Christen/Febrl/febrl-0.4.01/encode.py
2300
    and
2301
    https://metacpan.org/pod/Text::Phonetic::Phonix
2302
2303
    :param str word: the word to transform
2304
    :param int maxlength: the length of the code returned (defaults to 4)
2305
    :param bool zero_pad: pad the end of the return value with 0s to achieve
2306
        a maxlength string
2307
    :returns: the Phonix value
2308
    :rtype: str
2309
2310
    >>> phonix('Christopher')
2311
    'K683'
2312
    >>> phonix('Niall')
2313
    'N400'
2314
    >>> phonix('Smith')
2315
    'S530'
2316
    >>> phonix('Schmidt')
2317
    'S530'
2318
    """
2319
    # pylint: disable=too-many-branches
2320
    def _start_repl(word, src, tar, post=None):
2321
        r"""Replace src with tar at the start of word."""
2322
        if post:
2323
            for i in post:
2324
                if word.startswith(src+i):
2325
                    return tar + word[len(src):]
2326
        elif word.startswith(src):
2327
            return tar + word[len(src):]
2328
        return word
2329
2330
    def _end_repl(word, src, tar, pre=None):
2331
        r"""Replace src with tar at the end of word."""
2332
        if pre:
2333
            for i in pre:
2334
                if word.endswith(i+src):
2335
                    return word[:-len(src)] + tar
2336
        elif word.endswith(src):
2337
            return word[:-len(src)] + tar
2338
        return word
2339
2340
    def _mid_repl(word, src, tar, pre=None, post=None):
2341
        r"""Replace src with tar in the middle of word."""
2342
        if pre or post:
2343
            if not pre:
2344
                return word[0] + _all_repl(word[1:], src, tar, pre, post)
2345
            elif not post:
2346
                return _all_repl(word[:-1], src, tar, pre, post) + word[-1]
2347
            return _all_repl(word, src, tar, pre, post)
2348
        return (word[0] + _all_repl(word[1:-1], src, tar, pre, post) +
2349
                word[-1])
2350
2351
    def _all_repl(word, src, tar, pre=None, post=None):
2352
        r"""Replace src with tar anywhere in word."""
2353
        if pre or post:
2354
            if not post:
                post = frozenset(('',))
            if not pre:
                pre = frozenset(('',))
2362
2363
            for i, j in ((i, j) for i in pre for j in post):
2364
                word = word.replace(i+src+j, i+tar+j)
2365
            return word
2366
        else:
2367
            return word.replace(src, tar)
2368
2369
    _vow = {'A', 'E', 'I', 'O', 'U'}
2370
    _con = {'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N', 'P', 'Q',
2371
            'R', 'S', 'T', 'V', 'W', 'X', 'Y', 'Z'}
2372
2373
    _phonix_substitutions = ((_all_repl, 'DG', 'G'),
2374
                             (_all_repl, 'CO', 'KO'),
2375
                             (_all_repl, 'CA', 'KA'),
2376
                             (_all_repl, 'CU', 'KU'),
2377
                             (_all_repl, 'CY', 'SI'),
2378
                             (_all_repl, 'CI', 'SI'),
2379
                             (_all_repl, 'CE', 'SE'),
2380
                             (_start_repl, 'CL', 'KL', _vow),
2381
                             (_all_repl, 'CK', 'K'),
2382
                             (_end_repl, 'GC', 'K'),
2383
                             (_end_repl, 'JC', 'K'),
2384
                             (_start_repl, 'CHR', 'KR', _vow),
2385
                             (_start_repl, 'CR', 'KR', _vow),
2386
                             (_start_repl, 'WR', 'R'),
2387
                             (_all_repl, 'NC', 'NK'),
2388
                             (_all_repl, 'CT', 'KT'),
2389
                             (_all_repl, 'PH', 'F'),
2390
                             (_all_repl, 'AA', 'AR'),
2391
                             (_all_repl, 'SCH', 'SH'),
2392
                             (_all_repl, 'BTL', 'TL'),
2393
                             (_all_repl, 'GHT', 'T'),
2394
                             (_all_repl, 'AUGH', 'ARF'),
2395
                             (_mid_repl, 'LJ', 'LD', _vow, _vow),
2396
                             (_all_repl, 'LOUGH', 'LOW'),
2397
                             (_start_repl, 'Q', 'KW'),
2398
                             (_start_repl, 'KN', 'N'),
2399
                             (_end_repl, 'GN', 'N'),
2400
                             (_all_repl, 'GHN', 'N'),
2401
                             (_end_repl, 'GNE', 'N'),
2402
                             (_all_repl, 'GHNE', 'NE'),
2403
                             (_end_repl, 'GNES', 'NS'),
2404
                             (_start_repl, 'GN', 'N'),
2405
                             (_mid_repl, 'GN', 'N', None, _con),
2406
                             (_end_repl, 'GN', 'N'),
2407
                             (_start_repl, 'PS', 'S'),
2408
                             (_start_repl, 'PT', 'T'),
2409
                             (_start_repl, 'CZ', 'C'),
2410
                             (_mid_repl, 'WZ', 'Z', _vow),
2411
                             (_mid_repl, 'CZ', 'CH'),
2412
                             (_all_repl, 'LZ', 'LSH'),
2413
                             (_all_repl, 'RZ', 'RSH'),
2414
                             (_mid_repl, 'Z', 'S', None, _vow),
2415
                             (_all_repl, 'ZZ', 'TS'),
2416
                             (_mid_repl, 'Z', 'TS', _con),
2417
                             (_all_repl, 'HROUG', 'REW'),
2418
                             (_all_repl, 'OUGH', 'OF'),
2419
                             (_mid_repl, 'Q', 'KW', _vow, _vow),
2420
                             (_mid_repl, 'J', 'Y', _vow, _vow),
2421
                             (_start_repl, 'YJ', 'Y', _vow),
2422
                             (_start_repl, 'GH', 'G'),
2423
                             (_end_repl, 'GH', 'E', _vow),
2424
                             (_start_repl, 'CY', 'S'),
2425
                             (_all_repl, 'NX', 'NKS'),
2426
                             (_start_repl, 'PF', 'F'),
2427
                             (_end_repl, 'DT', 'T'),
2428
                             (_end_repl, 'TL', 'TIL'),
2429
                             (_end_repl, 'DL', 'DIL'),
2430
                             (_all_repl, 'YTH', 'ITH'),
2431
                             (_start_repl, 'TJ', 'CH', _vow),
2432
                             (_start_repl, 'TSJ', 'CH', _vow),
2433
                             (_start_repl, 'TS', 'T', _vow),
2434
                             (_all_repl, 'TCH', 'CH'),
2435
                             (_mid_repl, 'WSK', 'VSKIE', _vow),
2436
                             (_end_repl, 'WSK', 'VSKIE', _vow),
2437
                             (_start_repl, 'MN', 'N', _vow),
2438
                             (_start_repl, 'PN', 'N', _vow),
2439
                             (_mid_repl, 'STL', 'SL', _vow),
2440
                             (_end_repl, 'STL', 'SL', _vow),
2441
                             (_end_repl, 'TNT', 'ENT'),
2442
                             (_end_repl, 'EAUX', 'OH'),
2443
                             (_all_repl, 'EXCI', 'ECS'),
2444
                             (_all_repl, 'X', 'ECS'),
2445
                             (_end_repl, 'NED', 'ND'),
2446
                             (_all_repl, 'JR', 'DR'),
2447
                             (_end_repl, 'EE', 'EA'),
2448
                             (_all_repl, 'ZS', 'S'),
2449
                             (_mid_repl, 'R', 'AH', _vow, _con),
2450
                             (_end_repl, 'R', 'AH', _vow),
2451
                             (_mid_repl, 'HR', 'AH', _vow, _con),
2452
                             (_end_repl, 'HR', 'AH', _vow),
2453
                             (_end_repl, 'HR', 'AH', _vow),
2454
                             (_end_repl, 'RE', 'AR'),
2455
                             (_end_repl, 'R', 'AH', _vow),
2456
                             (_all_repl, 'LLE', 'LE'),
2457
                             (_end_repl, 'LE', 'ILE', _con),
2458
                             (_end_repl, 'LES', 'ILES', _con),
2459
                             (_end_repl, 'E', ''),
2460
                             (_end_repl, 'ES', 'S'),
2461
                             (_end_repl, 'SS', 'AS', _vow),
2462
                             (_end_repl, 'MB', 'M', _vow),
2463
                             (_all_repl, 'MPTS', 'MPS'),
2464
                             (_all_repl, 'MPS', 'MS'),
2465
                             (_all_repl, 'MPT', 'MT'))
2466
2467
    _phonix_translation = dict(zip((ord(_) for _ in
2468
                                    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
2469
                                   '01230720022455012683070808'))
2470
2471
    sdx = ''
2472
2473
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
2474
    word = word.replace('ß', 'SS')
2475
    word = ''.join(c for c in word if c in
2476
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
2477
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
2478
                    'Y', 'Z'})
2479
    if word:
2480
        for trans in _phonix_substitutions:
2481
            word = trans[0](word, *trans[1:])
2482
        if word[0] in {'A', 'E', 'I', 'O', 'U', 'Y'}:
2483
            sdx = 'v' + word[1:].translate(_phonix_translation)
2484
        else:
2485
            sdx = word[0] + word[1:].translate(_phonix_translation)
2486
        sdx = _delete_consecutive_repeats(sdx)
2487
        sdx = sdx.replace('0', '')
2488
2489
    # Clamp maxlength to [4, 64]
2490
    if maxlength is not None:
2491
        maxlength = min(max(4, maxlength), 64)
2492
    else:
2493
        maxlength = 64
2494
2495
    if zero_pad:
2496
        sdx += '0' * maxlength
2497
    if not sdx:
2498
        sdx = '0'
2499
    return sdx[:maxlength]
2500
2501
2502
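# An illustrative sketch (hypothetical helper): each entry of the
# _phonix_substitutions tuple above bundles one of the positional
# replacement helpers (_start_repl, _mid_repl, _end_repl, _all_repl)
# with its arguments, and phonix() applies it as
# trans[0](word, *trans[1:]); the same dispatch written as a standalone
# function looks like this.
def _apply_phonix_rule(word, rule):
    """Apply a single (func, src, tar, ...) rule tuple to word."""
    func = rule[0]
    return func(word, *rule[1:])

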
def sfinxbis(word, maxlength=None):
2503
    """Return the SfinxBis code for a word.
2504
2505
    SfinxBis is a Soundex-like algorithm defined in:
2506
    http://www.swami.se/download/18.248ad5af12aa8136533800091/SfinxBis.pdf
2507
2508
    This implementation follows the reference implementation:
2509
    http://www.swami.se/download/18.248ad5af12aa8136533800093/swamiSfinxBis.java.txt
2510
2511
    SfinxBis is intended chiefly for Swedish names.
2512
2513
    :param str word: the word to transform
2514
    :param int maxlength: the length of the code returned (defaults to
2515
        unlimited)
2516
    :returns: the SfinxBis value
2517
    :rtype: tuple
2518
2519
    >>> sfinxbis('Christopher')
2520
    ('K68376',)
2521
    >>> sfinxbis('Niall')
2522
    ('N4',)
2523
    >>> sfinxbis('Smith')
2524
    ('S53',)
2525
    >>> sfinxbis('Schmidt')
2526
    ('S53',)
2527
2528
    >>> sfinxbis('Johansson')
2529
    ('J585',)
2530
    >>> sfinxbis('Sjöberg')
2531
    ('#162',)
2532
    """
2533
    adelstitler = (' DE LA ', ' DE LAS ', ' DE LOS ', ' VAN DE ', ' VAN DEN ',
2534
                   ' VAN DER ', ' VON DEM ', ' VON DER ',
2535
                   ' AF ', ' AV ', ' DA ', ' DE ', ' DEL ', ' DEN ', ' DES ',
2536
                   ' DI ', ' DO ', ' DON ', ' DOS ', ' DU ', ' E ', ' IN ',
2537
                   ' LA ', ' LE ', ' MAC ', ' MC ', ' VAN ', ' VON ', ' Y ',
2538
                   ' S:T ')
2539
2540
    _harde_vokaler = {'A', 'O', 'U', 'Å'}
2541
    _mjuka_vokaler = {'E', 'I', 'Y', 'Ä', 'Ö'}
2542
    _konsonanter = {'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N', 'P',
2543
                    'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Z'}
2544
    _alfabet = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
2545
                'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
2546
                'Y', 'Z', 'Ä', 'Å', 'Ö'}
2547
2548
    _sfinxbis_translation = dict(zip((ord(_) for _ in
2549
                                      'BCDFGHJKLMNPQRSTVZAOUÅEIYÄÖ'),
2550
                                     '123729224551268378999999999'))
2551
2552
    _sfinxbis_substitutions = dict(zip((ord(_) for _ in
2553
                                        'WZÀÁÂÃÆÇÈÉÊËÌÍÎÏÑÒÓÔÕØÙÚÛÜÝ'),
2554
                                       'VSAAAAÄCEEEEIIIINOOOOÖUUUYY'))
2555
2556
    def _foersvensker(ordet):
2557
        """Return the Swedish-ized form of the word."""
2558
        ordet = ordet.replace('STIERN', 'STJÄRN')
2559
        ordet = ordet.replace('HIE', 'HJ')
2560
        ordet = ordet.replace('SIÖ', 'SJÖ')
2561
        ordet = ordet.replace('SCH', 'SH')
2562
        ordet = ordet.replace('QU', 'KV')
2563
        ordet = ordet.replace('IO', 'JO')
2564
        ordet = ordet.replace('PH', 'F')
2565
2566
        for i in _harde_vokaler:
2567
            ordet = ordet.replace(i+'Ü', i+'J')
2568
            ordet = ordet.replace(i+'Y', i+'J')
2569
            ordet = ordet.replace(i+'I', i+'J')
2570
        for i in _mjuka_vokaler:
2571
            ordet = ordet.replace(i+'Ü', i+'J')
2572
            ordet = ordet.replace(i+'Y', i+'J')
2573
            ordet = ordet.replace(i+'I', i+'J')
2574
2575
        if 'H' in ordet:
2576
            for i in _konsonanter:
2577
                ordet = ordet.replace('H'+i, i)
2578
2579
        ordet = ordet.translate(_sfinxbis_substitutions)
2580
2581
        ordet = ordet.replace('Ð', 'ETH')
2582
        ordet = ordet.replace('Þ', 'TH')
2583
        ordet = ordet.replace('ß', 'SS')
2584
2585
        return ordet
2586
2587
    def _koda_foersta_ljudet(ordet):
2588
        """Return the word with the first sound coded."""
2589
        if ordet[0:1] in _mjuka_vokaler or ordet[0:1] in _harde_vokaler:
2590
            ordet = '$' + ordet[1:]
2591
        elif ordet[0:2] in ('DJ', 'GJ', 'HJ', 'LJ'):
2592
            ordet = 'J' + ordet[2:]
2593
        elif ordet[0:1] == 'G' and ordet[1:2] in _mjuka_vokaler:
2594
            ordet = 'J' + ordet[1:]
2595
        elif ordet[0:1] == 'Q':
2596
            ordet = 'K' + ordet[1:]
2597
        elif (ordet[0:2] == 'CH' and
2598
              ordet[2:3] in frozenset(_mjuka_vokaler | _harde_vokaler)):
2599
            ordet = '#' + ordet[2:]
2600
        elif ordet[0:1] == 'C' and ordet[1:2] in _harde_vokaler:
2601
            ordet = 'K' + ordet[1:]
2602
        elif ordet[0:1] == 'C' and ordet[1:2] in _konsonanter:
2603
            ordet = 'K' + ordet[1:]
2604
        elif ordet[0:1] == 'X':
2605
            ordet = 'S' + ordet[1:]
2606
        elif ordet[0:1] == 'C' and ordet[1:2] in _mjuka_vokaler:
2607
            ordet = 'S' + ordet[1:]
2608
        elif ordet[0:3] in ('SKJ', 'STJ', 'SCH'):
2609
            ordet = '#' + ordet[3:]
2610
        elif ordet[0:2] in ('SH', 'KJ', 'TJ', 'SJ'):
2611
            ordet = '#' + ordet[2:]
2612
        elif ordet[0:2] == 'SK' and ordet[2:3] in _mjuka_vokaler:
2613
            ordet = '#' + ordet[2:]
2614
        elif ordet[0:1] == 'K' and ordet[1:2] in _mjuka_vokaler:
2615
            ordet = '#' + ordet[1:]
2616
        return ordet
2617
2618
    # Steg 1, Versaler
2619
    word = unicodedata.normalize('NFC', text_type(word.upper()))
2620
    word = word.replace('ß', 'SS')
2621
    word = word.replace('-', ' ')
2622
2623
    # Steg 2, Ta bort adelsprefix
2624
    for adelstitel in adelstitler:
2625
        while adelstitel in word:
2626
            word = word.replace(adelstitel, ' ')
2627
        if word.startswith(adelstitel[1:]):
2628
            word = word[len(adelstitel)-1:]
2629
2630
    # Split word into tokens
2631
    ordlista = word.split()
2632
2633
    # Steg 3, Ta bort dubbelteckning i början på namnet
2634
    ordlista = [_delete_consecutive_repeats(ordet) for ordet in ordlista]
2635
    if not ordlista:
2636
        return ('',)
2637
2638
    # Steg 4, Försvenskning
2639
    ordlista = [_foersvensker(ordet) for ordet in ordlista]
2640
2641
    # Steg 5, Ta bort alla tecken som inte är A-Ö (65-90,196,197,214)
2642
    ordlista = [''.join(c for c in ordet if c in _alfabet)
2643
                for ordet in ordlista]
2644
2645
    # Steg 6, Koda första ljudet
2646
    ordlista = [_koda_foersta_ljudet(ordet) for ordet in ordlista]
2647
2648
    # Steg 7, Dela upp namnet i två delar
2649
    rest = [ordet[1:] for ordet in ordlista]
2650
2651
    # Steg 8, Utför fonetisk transformation i resten
2652
    rest = [ordet.replace('DT', 'T') for ordet in rest]
2653
    rest = [ordet.replace('X', 'KS') for ordet in rest]
2654
2655
    # Steg 9, Koda resten till en sifferkod
2656
    for vokal in _mjuka_vokaler:
2657
        rest = [ordet.replace('C'+vokal, '8'+vokal) for ordet in rest]
2658
    rest = [ordet.translate(_sfinxbis_translation) for ordet in rest]
2659
2660
    # Steg 10, Ta bort intilliggande dubbletter
2661
    rest = [_delete_consecutive_repeats(ordet) for ordet in rest]
2662
2663
    # Steg 11, Ta bort alla "9"
2664
    rest = [ordet.replace('9', '') for ordet in rest]
2665
2666
    # Steg 12, Sätt ihop delarna igen
2667
    ordlista = [''.join(ordet) for ordet in
2668
                zip((_[0:1] for _ in ordlista), rest)]
2669
2670
    # truncate, if maxlength is set
2671
    if maxlength and maxlength < _INFINITY:
2672
        ordlista = [ordet[:maxlength] for ordet in ordlista]
2673
2674
    return tuple(ordlista)
2675
2676
2677
def phonet(word, mode=1, lang='de', trace=False):
    """Return the phonet code for a word.

    phonet ("Hannoveraner Phonetik") was developed by Jörg Michael and
    documented in c't magazine vol. 25/1999, p. 252. It is a phonetic
    algorithm designed primarily for German.
    Cf. http://www.heise.de/ct/ftp/99/25/252/

    This is a port of Jesper Zedlitz's code, which is licensed LGPL:
    https://code.google.com/p/phonet4java/source/browse/trunk/src/main/java/com/googlecode/phonet4java/Phonet.java

    That is, in turn, based on Michael's C code, which is also licensed LGPL:
    ftp://ftp.heise.de/pub/ct/listings/phonet.zip

    :param str word: the word to transform
    :param int mode: the phonet variant to employ (1 or 2)
    :param str lang: 'de' (default) for German
            'none' for no language
    :param bool trace: prints debugging info if True
    :returns: the phonet value
    :rtype: str

    >>> phonet('Christopher')
    'KRISTOFA'
    >>> phonet('Niall')
    'NIAL'
    >>> phonet('Smith')
    'SMIT'
    >>> phonet('Schmidt')
    'SHMIT'

    >>> phonet('Christopher', mode=2)
    'KRIZTUFA'
    >>> phonet('Niall', mode=2)
    'NIAL'
    >>> phonet('Smith', mode=2)
    'ZNIT'
    >>> phonet('Schmidt', mode=2)
    'ZNIT'

    >>> phonet('Christopher', lang='none')
    'CHRISTOPHER'
    >>> phonet('Niall', lang='none')
    'NIAL'
    >>> phonet('Smith', lang='none')
    'SMITH'
    >>> phonet('Schmidt', lang='none')
    'SCHMIDT'
    """
    # pylint: disable=too-many-branches
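    # Reading aid for the rule tables below (inferred from the matching
    # logic in _phonet() further down): each table is a flat tuple read
    # three entries at a time.  Entry 0 is the match pattern, written in
    # upper case and optionally followed by context markers -- '(...)' a
    # class of letters that may follow, '-' trailing letters that are
    # matched but not consumed, '<' rewrite the source string in place,
    # '^'/'$' word start/end, and a digit giving the rule's priority.
    # Entry 1 is the replacement used for mode 1 and entry 2 the
    # replacement for mode 2; None means the rule is skipped in that mode.
    # For example, the leading 'SCH' of the doctest word 'Schmidt' is
    # covered by a rule like 'SC(HZ)<' below, yielding 'SH' in mode 1 and
    # 'Z' in mode 2.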
2727
2728
    _phonet_rules_no_lang = (  # separator chars
2729
        '´', ' ', ' ',
2730
        '"', ' ', ' ',
2731
        '`$', '', '',
2732
        '\'', ' ', ' ',
2733
        ',', ',', ',',
2734
        ';', ',', ',',
2735
        '-', ' ', ' ',
2736
        ' ', ' ', ' ',
2737
        '.', '.', '.',
2738
        ':', '.', '.',
2739
        # German umlauts
2740
        'Ä', 'AE', 'AE',
2741
        'Ö', 'OE', 'OE',
2742
        'Ü', 'UE', 'UE',
2743
        'ß', 'S', 'S',
2744
        # international umlauts
2745
        'À', 'A', 'A',
2746
        'Á', 'A', 'A',
2747
        'Â', 'A', 'A',
2748
        'Ã', 'A', 'A',
2749
        'Å', 'A', 'A',
2750
        'Æ', 'AE', 'AE',
2751
        'Ç', 'C', 'C',
2752
        'Ð', 'DJ', 'DJ',
2753
        'È', 'E', 'E',
2754
        'É', 'E', 'E',
2755
        'Ê', 'E', 'E',
2756
        'Ë', 'E', 'E',
2757
        'Ì', 'I', 'I',
2758
        'Í', 'I', 'I',
2759
        'Î', 'I', 'I',
2760
        'Ï', 'I', 'I',
2761
        'Ñ', 'NH', 'NH',
2762
        'Ò', 'O', 'O',
2763
        'Ó', 'O', 'O',
2764
        'Ô', 'O', 'O',
2765
        'Õ', 'O', 'O',
2766
        'Œ', 'OE', 'OE',
2767
        'Ø', 'OE', 'OE',
2768
        'Š', 'SH', 'SH',
2769
        'Þ', 'TH', 'TH',
2770
        'Ù', 'U', 'U',
2771
        'Ú', 'U', 'U',
2772
        'Û', 'U', 'U',
2773
        'Ý', 'Y', 'Y',
2774
        'Ÿ', 'Y', 'Y',
2775
        # 'normal' letters (A-Z)
2776
        'MC^', 'MAC', 'MAC',
2777
        'MC^', 'MAC', 'MAC',
2778
        'M´^', 'MAC', 'MAC',
2779
        'M\'^', 'MAC', 'MAC',
2780
        'O´^', 'O', 'O',
2781
        'O\'^', 'O', 'O',
2782
        'VAN DEN ^', 'VANDEN', 'VANDEN',
2783
        None, None, None)
2784
2785
    _phonet_rules_german = (  # separator chars
2786
        '´', ' ', ' ',
2787
        '"', ' ', ' ',
2788
        '`$', '', '',
2789
        '\'', ' ', ' ',
2790
        ',', ' ', ' ',
2791
        ';', ' ', ' ',
2792
        '-', ' ', ' ',
2793
        ' ', ' ', ' ',
2794
        '.', '.', '.',
2795
        ':', '.', '.',
2796
        # German umlauts
2797
        'ÄE', 'E', 'E',
2798
        'ÄU<', 'EU', 'EU',
2799
        'ÄV(AEOU)-<', 'EW', None,
2800
        'Ä$', 'Ä', None,
2801
        'Ä<', None, 'E',
2802
        'Ä', 'E', None,
2803
        'ÖE', 'Ö', 'Ö',
2804
        'ÖU', 'Ö', 'Ö',
2805
        'ÖVER--<', 'ÖW', None,
2806
        'ÖV(AOU)-', 'ÖW', None,
2807
        'ÜBEL(GNRW)-^^', 'ÜBL ', 'IBL ',
2808
        'ÜBER^^', 'ÜBA', 'IBA',
2809
        'ÜE', 'Ü', 'I',
2810
        'ÜVER--<', 'ÜW', None,
2811
        'ÜV(AOU)-', 'ÜW', None,
2812
        'Ü', None, 'I',
2813
        'ßCH<', None, 'Z',
2814
        'ß<', 'S', 'Z',
2815
        # international umlauts
2816
        'À<', 'A', 'A',
2817
        'Á<', 'A', 'A',
2818
        'Â<', 'A', 'A',
2819
        'Ã<', 'A', 'A',
2820
        'Å<', 'A', 'A',
2821
        'ÆER-', 'E', 'E',
2822
        'ÆU<', 'EU', 'EU',
2823
        'ÆV(AEOU)-<', 'EW', None,
2824
        'Æ$', 'Ä', None,
2825
        'Æ<', None, 'E',
2826
        'Æ', 'E', None,
2827
        'Ç', 'Z', 'Z',
2828
        'ÐÐ-', '', '',
2829
        'Ð', 'DI', 'TI',
2830
        'È<', 'E', 'E',
2831
        'É<', 'E', 'E',
2832
        'Ê<', 'E', 'E',
2833
        'Ë', 'E', 'E',
2834
        'Ì<', 'I', 'I',
2835
        'Í<', 'I', 'I',
2836
        'Î<', 'I', 'I',
2837
        'Ï', 'I', 'I',
2838
        'ÑÑ-', '', '',
2839
        'Ñ', 'NI', 'NI',
2840
        'Ò<', 'O', 'U',
2841
        'Ó<', 'O', 'U',
2842
        'Ô<', 'O', 'U',
2843
        'Õ<', 'O', 'U',
2844
        'Œ<', 'Ö', 'Ö',
2845
        'Ø(IJY)-<', 'E', 'E',
2846
        'Ø<', 'Ö', 'Ö',
2847
        'Š', 'SH', 'Z',
2848
        'Þ', 'T', 'T',
2849
        'Ù<', 'U', 'U',
2850
        'Ú<', 'U', 'U',
2851
        'Û<', 'U', 'U',
2852
        'Ý<', 'I', 'I',
2853
        'Ÿ<', 'I', 'I',
2854
        # 'normal' letters (A-Z)
2855
        'ABELLE$', 'ABL', 'ABL',
2856
        'ABELL$', 'ABL', 'ABL',
2857
        'ABIENNE$', 'ABIN', 'ABIN',
2858
        'ACHME---^', 'ACH', 'AK',
2859
        'ACEY$', 'AZI', 'AZI',
2860
        'ADV', 'ATW', None,
2861
        'AEGL-', 'EK', None,
2862
        'AEU<', 'EU', 'EU',
2863
        'AE2', 'E', 'E',
2864
        'AFTRAUBEN------', 'AFT ', 'AFT ',
2865
        'AGL-1', 'AK', None,
2866
        'AGNI-^', 'AKN', 'AKN',
2867
        'AGNIE-', 'ANI', 'ANI',
2868
        'AGN(AEOU)-$', 'ANI', 'ANI',
2869
        'AH(AIOÖUÜY)-', 'AH', None,
2870
        'AIA2', 'AIA', 'AIA',
2871
        'AIE$', 'E', 'E',
2872
        'AILL(EOU)-', 'ALI', 'ALI',
2873
        'AINE$', 'EN', 'EN',
2874
        'AIRE$', 'ER', 'ER',
2875
        'AIR-', 'E', 'E',
2876
        'AISE$', 'ES', 'EZ',
2877
        'AISSANCE$', 'ESANS', 'EZANZ',
2878
        'AISSE$', 'ES', 'EZ',
2879
        'AIX$', 'EX', 'EX',
2880
        'AJ(AÄEÈÉÊIOÖUÜ)--', 'A', 'A',
2881
        'AKTIE', 'AXIE', 'AXIE',
2882
        'AKTUEL', 'AKTUEL', None,
2883
        'ALOI^', 'ALOI', 'ALUI',  # Don't merge these rules
2884
        'ALOY^', 'ALOI', 'ALUI',  # needed by 'check_rules'
2885
        'AMATEU(RS)-', 'AMATÖ', 'ANATÖ',
2886
        'ANCH(OEI)-', 'ANSH', 'ANZ',
2887
        'ANDERGEGANG----', 'ANDA GE', 'ANTA KE',
2888
        'ANDERGEHE----', 'ANDA ', 'ANTA ',
2889
        'ANDERGESETZ----', 'ANDA GE', 'ANTA KE',
2890
        'ANDERGING----', 'ANDA ', 'ANTA ',
2891
        'ANDERSETZ(ET)-----', 'ANDA ', 'ANTA ',
2892
        'ANDERZUGEHE----', 'ANDA ZU ', 'ANTA ZU ',
2893
        'ANDERZUSETZE-----', 'ANDA ZU ', 'ANTA ZU ',
2894
        'ANER(BKO)---^^', 'AN', None,
2895
        'ANHAND---^$', 'AN H', 'AN ',
2896
        'ANH(AÄEIOÖUÜY)--^^', 'AN', None,
2897
        'ANIELLE$', 'ANIEL', 'ANIL',
2898
        'ANIEL', 'ANIEL', None,
2899
        'ANSTELLE----^$', 'AN ST', 'AN ZT',
2900
        'ANTI^^', 'ANTI', 'ANTI',
2901
        'ANVER^^', 'ANFA', 'ANFA',
2902
        'ATIA$', 'ATIA', 'ATIA',
2903
        'ATIA(NS)--', 'ATI', 'ATI',
2904
        'ATI(AÄOÖUÜ)-', 'AZI', 'AZI',
2905
        'AUAU--', '', '',
2906
        'AUERE$', 'AUERE', None,
2907
        'AUERE(NS)-$', 'AUERE', None,
2908
        'AUERE(AIOUY)--', 'AUER', None,
2909
        'AUER(AÄIOÖUÜY)-', 'AUER', None,
2910
        'AUER<', 'AUA', 'AUA',
2911
        'AUF^^', 'AUF', 'AUF',
2912
        'AULT$', 'O', 'U',
2913
        'AUR(BCDFGKLMNQSTVWZ)-', 'AUA', 'AUA',
2914
        'AUR$', 'AUA', 'AUA',
2915
        'AUSSE$', 'OS', 'UZ',
2916
        'AUS(ST)-^', 'AUS', 'AUS',
2917
        'AUS^^', 'AUS', 'AUS',
2918
        'AUTOFAHR----', 'AUTO ', 'AUTU ',
2919
        'AUTO^^', 'AUTO', 'AUTU',
2920
        'AUX(IY)-', 'AUX', 'AUX',
2921
        'AUX', 'O', 'U',
2922
        'AU', 'AU', 'AU',
2923
        'AVER--<', 'AW', None,
2924
        'AVIER$', 'AWIE', 'AFIE',
2925
        'AV(EÈÉÊI)-^', 'AW', None,
2926
        'AV(AOU)-', 'AW', None,
2927
        'AYRE$', 'EIRE', 'EIRE',
2928
        'AYRE(NS)-$', 'EIRE', 'EIRE',
2929
        'AYRE(AIOUY)--', 'EIR', 'EIR',
2930
        'AYR(AÄIOÖUÜY)-', 'EIR', 'EIR',
2931
        'AYR<', 'EIA', 'EIA',
2932
        'AYER--<', 'EI', 'EI',
2933
        'AY(AÄEIOÖUÜY)--', 'A', 'A',
2934
        'AË', 'E', 'E',
2935
        'A(IJY)<', 'EI', 'EI',
2936
        'BABY^$', 'BEBI', 'BEBI',
2937
        'BAB(IY)^', 'BEBI', 'BEBI',
2938
        'BEAU^$', 'BO', None,
2939
        'BEA(BCMNRU)-^', 'BEA', 'BEA',
2940
        'BEAT(AEIMORU)-^', 'BEAT', 'BEAT',
2941
        'BEE$', 'BI', 'BI',
2942
        'BEIGE^$', 'BESH', 'BEZ',
2943
        'BENOIT--', 'BENO', 'BENU',
2944
        'BER(DT)-', 'BER', None,
2945
        'BERN(DT)-', 'BERN', None,
2946
        'BE(LMNRST)-^', 'BE', 'BE',
2947
        'BETTE$', 'BET', 'BET',
2948
        'BEVOR^$', 'BEFOR', None,
2949
        'BIC$', 'BIZ', 'BIZ',
2950
        'BOWL(EI)-', 'BOL', 'BUL',
2951
        'BP(AÄEÈÉÊIÌÍÎOÖRUÜY)-', 'B', 'B',
2952
        'BRINGEND-----^', 'BRI', 'BRI',
2953
        'BRINGEND-----', ' BRI', ' BRI',
2954
        'BROW(NS)-', 'BRAU', 'BRAU',
2955
        'BUDGET7', 'BÜGE', 'BIKE',
2956
        'BUFFET7', 'BÜFE', 'BIFE',
2957
        'BYLLE$', 'BILE', 'BILE',
2958
        'BYLL$', 'BIL', 'BIL',
2959
        'BYPA--^', 'BEI', 'BEI',
2960
        'BYTE<', 'BEIT', 'BEIT',
2961
        'BY9^', 'BÜ', None,
2962
        'B(SßZ)$', 'BS', None,
2963
        'CACH(EI)-^', 'KESH', 'KEZ',
2964
        'CAE--', 'Z', 'Z',
2965
        'CA(IY)$', 'ZEI', 'ZEI',
2966
        'CE(EIJUY)--', 'Z', 'Z',
2967
        'CENT<', 'ZENT', 'ZENT',
2968
        'CERST(EI)----^', 'KE', 'KE',
2969
        'CER$', 'ZA', 'ZA',
2970
        'CE3', 'ZE', 'ZE',
2971
        'CH\'S$', 'X', 'X',
2972
        'CH´S$', 'X', 'X',
2973
        'CHAO(ST)-', 'KAO', 'KAU',
2974
        'CHAMPIO-^', 'SHEMPI', 'ZENBI',
2975
        'CHAR(AI)-^', 'KAR', 'KAR',
2976
        'CHAU(CDFSVWXZ)-', 'SHO', 'ZU',
2977
        'CHÄ(CF)-', 'SHE', 'ZE',
2978
        'CHE(CF)-', 'SHE', 'ZE',
2979
        'CHEM-^', 'KE', 'KE',  # or: 'CHE', 'KE'
2980
        'CHEQUE<', 'SHEK', 'ZEK',
2981
        'CHI(CFGPVW)-', 'SHI', 'ZI',
2982
        'CH(AEUY)-<^', 'SH', 'Z',
2983
        'CHK-', '', '',
2984
        'CHO(CKPS)-^', 'SHO', 'ZU',
2985
        'CHRIS-', 'KRI', None,
2986
        'CHRO-', 'KR', None,
2987
        'CH(LOR)-<^', 'K', 'K',
2988
        'CHST-', 'X', 'X',
2989
        'CH(SßXZ)3', 'X', 'X',
2990
        'CHTNI-3', 'CHN', 'KN',
2991
        'CH^', 'K', 'K',  # or: 'CH', 'K'
2992
        'CH', 'CH', 'K',
2993
        'CIC$', 'ZIZ', 'ZIZ',
2994
        'CIENCEFICT----', 'EIENS ', 'EIENZ ',
2995
        'CIENCE$', 'EIENS', 'EIENZ',
2996
        'CIER$', 'ZIE', 'ZIE',
2997
        'CYB-^', 'ZEI', 'ZEI',
2998
        'CY9^', 'ZÜ', 'ZI',
2999
        'C(IJY)-<3', 'Z', 'Z',
3000
        'CLOWN-', 'KLAU', 'KLAU',
3001
        'CCH', 'Z', 'Z',
3002
        'CCE-', 'X', 'X',
3003
        'C(CK)-', '', '',
3004
        'CLAUDET---', 'KLO', 'KLU',
3005
        'CLAUDINE^$', 'KLODIN', 'KLUTIN',
3006
        'COACH', 'KOSH', 'KUZ',
3007
        'COLE$', 'KOL', 'KUL',
3008
        'COUCH', 'KAUSH', 'KAUZ',
3009
        'COW', 'KAU', 'KAU',
3010
        'CQUES$', 'K', 'K',
3011
        'CQUE', 'K', 'K',
3012
        'CRASH--9', 'KRE', 'KRE',
3013
        'CREAT-^', 'KREA', 'KREA',
3014
        'CST', 'XT', 'XT',
3015
        'CS<^', 'Z', 'Z',
3016
        'C(SßX)', 'X', 'X',
3017
        'CT\'S$', 'X', 'X',
3018
        'CT(SßXZ)', 'X', 'X',
3019
        'CZ<', 'Z', 'Z',
3020
        'C(ÈÉÊÌÍÎÝ)3', 'Z', 'Z',
3021
        'C.^', 'C.', 'C.',
3022
        'CÄ-', 'Z', 'Z',
3023
        'CÜ$', 'ZÜ', 'ZI',
3024
        'C\'S$', 'X', 'X',
3025
        'C<', 'K', 'K',
3026
        'DAHER^$', 'DAHER', None,
3027
        'DARAUFFOLGE-----', 'DARAUF ', 'TARAUF ',
3028
        'DAVO(NR)-^$', 'DAFO', 'TAFU',
3029
        'DD(SZ)--<', '', '',
3030
        'DD9', 'D', None,
3031
        'DEPOT7', 'DEPO', 'TEBU',
3032
        'DESIGN', 'DISEIN', 'TIZEIN',
3033
        'DE(LMNRST)-3^', 'DE', 'TE',
3034
        'DETTE$', 'DET', 'TET',
3035
        'DH$', 'T', None,
3036
        'DIC$', 'DIZ', 'TIZ',
3037
        'DIDR-^', 'DIT', None,
3038
        'DIEDR-^', 'DIT', None,
3039
        'DJ(AEIOU)-^', 'I', 'I',
3040
        'DMITR-^', 'DIMIT', 'TINIT',
3041
        'DRY9^', 'DRÜ', None,
3042
        'DT-', '', '',
3043
        'DUIS-^', 'DÜ', 'TI',
3044
        'DURCH^^', 'DURCH', 'TURK',
3045
        'DVA$', 'TWA', None,
3046
        'DY9^', 'DÜ', None,
3047
        'DYS$', 'DIS', None,
3048
        'DS(CH)--<', 'T', 'T',
3049
        'DST', 'ZT', 'ZT',
3050
        'DZS(CH)--', 'T', 'T',
3051
        'D(SßZ)', 'Z', 'Z',
3052
        'D(AÄEIOÖRUÜY)-', 'D', None,
3053
        'D(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'D', None,
3054
        'D\'H^', 'D', 'T',
3055
        'D´H^', 'D', 'T',
3056
        'D`H^', 'D', 'T',
3057
        'D\'S3$', 'Z', 'Z',
3058
        'D´S3$', 'Z', 'Z',
3059
        'D^', 'D', None,
3060
        'D', 'T', 'T',
3061
        'EAULT$', 'O', 'U',
3062
        'EAUX$', 'O', 'U',
3063
        'EAU', 'O', 'U',
3064
        'EAV', 'IW', 'IF',
3065
        'EAS3$', 'EAS', None,
3066
        'EA(AÄEIOÖÜY)-3', 'EA', 'EA',
3067
        'EA3$', 'EA', 'EA',
3068
        'EA3', 'I', 'I',
3069
        'EBENSO^$', 'EBNSO', 'EBNZU',
3070
        'EBENSO^^', 'EBNSO ', 'EBNZU ',
3071
        'EBEN^^', 'EBN', 'EBN',
3072
        'EE9', 'E', 'E',
3073
        'EGL-1', 'EK', None,
3074
        'EHE(IUY)--1', 'EH', None,
3075
        'EHUNG---1', 'E', None,
3076
        'EH(AÄIOÖUÜY)-1', 'EH', None,
3077
        'EIEI--', '', '',
3078
        'EIERE^$', 'EIERE', None,
3079
        'EIERE$', 'EIERE', None,
3080
        'EIERE(NS)-$', 'EIERE', None,
3081
        'EIERE(AIOUY)--', 'EIER', None,
3082
        'EIER(AÄIOÖUÜY)-', 'EIER', None,
3083
        'EIER<', 'EIA', None,
3084
        'EIGL-1', 'EIK', None,
3085
        'EIGH$', 'EI', 'EI',
3086
        'EIH--', 'E', 'E',
3087
        'EILLE$', 'EI', 'EI',
3088
        'EIR(BCDFGKLMNQSTVWZ)-', 'EIA', 'EIA',
3089
        'EIR$', 'EIA', 'EIA',
3090
        'EITRAUBEN------', 'EIT ', 'EIT ',
3091
        'EI', 'EI', 'EI',
3092
        'EJ$', 'EI', 'EI',
3093
        'ELIZ^', 'ELIS', None,
3094
        'ELZ^', 'ELS', None,
3095
        'EL-^', 'E', 'E',
3096
        'ELANG----1', 'E', 'E',
3097
        'EL(DKL)--1', 'E', 'E',
3098
        'EL(MNT)--1$', 'E', 'E',
3099
        'ELYNE$', 'ELINE', 'ELINE',
3100
        'ELYN$', 'ELIN', 'ELIN',
3101
        'EL(AÄEÈÉÊIÌÍÎOÖUÜY)-1', 'EL', 'EL',
3102
        'EL-1', 'L', 'L',
3103
        'EM-^', None, 'E',
3104
        'EM(DFKMPQT)--1', None, 'E',
3105
        'EM(AÄEÈÉÊIÌÍÎOÖUÜY)--1', None, 'E',
3106
        'EM-1', None, 'N',
3107
        'ENGAG-^', 'ANGA', 'ANKA',
3108
        'EN-^', 'E', 'E',
3109
        'ENTUEL', 'ENTUEL', None,
3110
        'EN(CDGKQSTZ)--1', 'E', 'E',
3111
        'EN(AÄEÈÉÊIÌÍÎNOÖUÜY)-1', 'EN', 'EN',
3112
        'EN-1', '', '',
3113
        'ERH(AÄEIOÖUÜ)-^', 'ERH', 'ER',
3114
        'ER-^', 'E', 'E',
3115
        'ERREGEND-----', ' ER', ' ER',
3116
        'ERT1$', 'AT', None,
3117
        'ER(DGLKMNRQTZß)-1', 'ER', None,
3118
        'ER(AÄEÈÉÊIÌÍÎOÖUÜY)-1', 'ER', 'A',
3119
        'ER1$', 'A', 'A',
3120
        'ER<1', 'A', 'A',
3121
        'ETAT7', 'ETA', 'ETA',
3122
        'ETI(AÄOÖÜU)-', 'EZI', 'EZI',
3123
        'EUERE$', 'EUERE', None,
3124
        'EUERE(NS)-$', 'EUERE', None,
3125
        'EUERE(AIOUY)--', 'EUER', None,
3126
        'EUER(AÄIOÖUÜY)-', 'EUER', None,
3127
        'EUER<', 'EUA', None,
3128
        'EUEU--', '', '',
3129
        'EUILLE$', 'Ö', 'Ö',
3130
        'EUR$', 'ÖR', 'ÖR',
3131
        'EUX', 'Ö', 'Ö',
3132
        'EUSZ$', 'EUS', None,
3133
        'EUTZ$', 'EUS', None,
3134
        'EUYS$', 'EUS', 'EUZ',
3135
        'EUZ$', 'EUS', None,
3136
        'EU', 'EU', 'EU',
3137
        'EVER--<1', 'EW', None,
3138
        'EV(ÄOÖUÜ)-1', 'EW', None,
3139
        'EYER<', 'EIA', 'EIA',
3140
        'EY<', 'EI', 'EI',
3141
        'FACETTE', 'FASET', 'FAZET',
3142
        'FANS--^$', 'FE', 'FE',
3143
        'FAN-^$', 'FE', 'FE',
3144
        'FAULT-', 'FOL', 'FUL',
3145
        'FEE(DL)-', 'FI', 'FI',
3146
        'FEHLER', 'FELA', 'FELA',
3147
        'FE(LMNRST)-3^', 'FE', 'FE',
3148
        'FOERDERN---^', 'FÖRD', 'FÖRT',
3149
        'FOERDERN---', ' FÖRD', ' FÖRT',
3150
        'FOND7', 'FON', 'FUN',
3151
        'FRAIN$', 'FRA', 'FRA',
3152
        'FRISEU(RS)-', 'FRISÖ', 'FRIZÖ',
3153
        'FY9^', 'FÜ', None,
3154
        'FÖRDERN---^', 'FÖRD', 'FÖRT',
3155
        'FÖRDERN---', ' FÖRD', ' FÖRT',
3156
        'GAGS^$', 'GEX', 'KEX',
3157
        'GAG^$', 'GEK', 'KEK',
3158
        'GD', 'KT', 'KT',
3159
        'GEGEN^^', 'GEGN', 'KEKN',
3160
        'GEGENGEKOM-----', 'GEGN ', 'KEKN ',
3161
        'GEGENGESET-----', 'GEGN ', 'KEKN ',
3162
        'GEGENKOMME-----', 'GEGN ', 'KEKN ',
3163
        'GEGENZUKOM---', 'GEGN ZU ', 'KEKN ZU ',
3164
        'GENDETWAS-----$', 'GENT ', 'KENT ',
3165
        'GENRE', 'IORE', 'IURE',
3166
        'GE(LMNRST)-3^', 'GE', 'KE',
3167
        'GER(DKT)-', 'GER', None,
3168
        'GETTE$', 'GET', 'KET',
3169
        'GGF.', 'GF.', None,
3170
        'GG-', '', '',
3171
        'GH', 'G', None,
3172
        'GI(AOU)-^', 'I', 'I',
3173
        'GION-3', 'KIO', 'KIU',
3174
        'G(CK)-', '', '',
3175
        'GJ(AEIOU)-^', 'I', 'I',
3176
        'GMBH^$', 'GMBH', 'GMBH',
3177
        'GNAC$', 'NIAK', 'NIAK',
3178
        'GNON$', 'NION', 'NIUN',
3179
        'GN$', 'N', 'N',
3180
        'GONCAL-^', 'GONZA', 'KUNZA',
3181
        'GRY9^', 'GRÜ', None,
3182
        'G(SßXZ)-<', 'K', 'K',
3183
        'GUCK-', 'KU', 'KU',
3184
        'GUISEP-^', 'IUSE', 'IUZE',
3185
        'GUI-^', 'G', 'K',
3186
        'GUTAUSSEH------^', 'GUT ', 'KUT ',
3187
        'GUTGEHEND------^', 'GUT ', 'KUT ',
3188
        'GY9^', 'GÜ', None,
3189
        'G(AÄEILOÖRUÜY)-', 'G', None,
3190
        'G(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'G', None,
3191
        'G\'S$', 'X', 'X',
3192
        'G´S$', 'X', 'X',
3193
        'G^', 'G', None,
3194
        'G', 'K', 'K',
3195
        'HA(HIUY)--1', 'H', None,
3196
        'HANDVOL---^', 'HANT ', 'ANT ',
3197
        'HANNOVE-^', 'HANOF', None,
3198
        'HAVEN7$', 'HAFN', None,
3199
        'HEAD-', 'HE', 'E',
3200
        'HELIEGEN------', 'E ', 'E ',
3201
        'HESTEHEN------', 'E ', 'E ',
3202
        'HE(LMNRST)-3^', 'HE', 'E',
3203
        'HE(LMN)-1', 'E', 'E',
3204
        'HEUR1$', 'ÖR', 'ÖR',
3205
        'HE(HIUY)--1', 'H', None,
3206
        'HIH(AÄEIOÖUÜY)-1', 'IH', None,
3207
        'HLH(AÄEIOÖUÜY)-1', 'LH', None,
3208
        'HMH(AÄEIOÖUÜY)-1', 'MH', None,
3209
        'HNH(AÄEIOÖUÜY)-1', 'NH', None,
3210
        'HOBBY9^', 'HOBI', None,
3211
        'HOCHBEGAB-----^', 'HOCH ', 'UK ',
3212
        'HOCHTALEN-----^', 'HOCH ', 'UK ',
3213
        'HOCHZUFRI-----^', 'HOCH ', 'UK ',
3214
        'HO(HIY)--1', 'H', None,
3215
        'HRH(AÄEIOÖUÜY)-1', 'RH', None,
3216
        'HUH(AÄEIOÖUÜY)-1', 'UH', None,
3217
        'HUIS^^', 'HÜS', 'IZ',
3218
        'HUIS$', 'ÜS', 'IZ',
3219
        'HUI--1', 'H', None,
3220
        'HYGIEN^', 'HÜKIEN', None,
3221
        'HY9^', 'HÜ', None,
3222
        'HY(BDGMNPST)-', 'Ü', None,
3223
        'H.^', None, 'H.',
3224
        'HÄU--1', 'H', None,
3225
        'H^', 'H', '',
3226
        'H', '', '',
3227
        'ICHELL---', 'ISH', 'IZ',
3228
        'ICHI$', 'ISHI', 'IZI',
3229
        'IEC$', 'IZ', 'IZ',
3230
        'IEDENSTELLE------', 'IDN ', 'ITN ',
3231
        'IEI-3', '', '',
3232
        'IELL3', 'IEL', 'IEL',
3233
        'IENNE$', 'IN', 'IN',
3234
        'IERRE$', 'IER', 'IER',
3235
        'IERZULAN---', 'IR ZU ', 'IR ZU ',
3236
        'IETTE$', 'IT', 'IT',
3237
        'IEU', 'IÖ', 'IÖ',
3238
        'IE<4', 'I', 'I',
3239
        'IGL-1', 'IK', None,
3240
        'IGHT3$', 'EIT', 'EIT',
3241
        'IGNI(EO)-', 'INI', 'INI',
3242
        'IGN(AEOU)-$', 'INI', 'INI',
3243
        'IHER(DGLKRT)--1', 'IHE', None,
3244
        'IHE(IUY)--', 'IH', None,
3245
        'IH(AIOÖUÜY)-', 'IH', None,
3246
        'IJ(AOU)-', 'I', 'I',
3247
        'IJ$', 'I', 'I',
3248
        'IJ<', 'EI', 'EI',
3249
        'IKOLE$', 'IKOL', 'IKUL',
3250
        'ILLAN(STZ)--4', 'ILIA', 'ILIA',
3251
        'ILLAR(DT)--4', 'ILIA', 'ILIA',
3252
        'IMSTAN----^', 'IM ', 'IN ',
3253
        'INDELERREGE------', 'INDL ', 'INTL ',
3254
        'INFRAGE-----^$', 'IN ', 'IN ',
3255
        'INTERN(AOU)-^', 'INTAN', 'INTAN',
3256
        'INVER-', 'INWE', 'INFE',
3257
        'ITI(AÄIOÖUÜ)-', 'IZI', 'IZI',
3258
        'IUSZ$', 'IUS', None,
3259
        'IUTZ$', 'IUS', None,
3260
        'IUZ$', 'IUS', None,
3261
        'IVER--<', 'IW', None,
3262
        'IVIER$', 'IWIE', 'IFIE',
3263
        'IV(ÄOÖUÜ)-', 'IW', None,
3264
        'IV<3', 'IW', None,
3265
        'IY2', 'I', None,
3266
        'I(ÈÉÊ)<4', 'I', 'I',
3267
        'JAVIE---<^', 'ZA', 'ZA',
3268
        'JEANS^$', 'JINS', 'INZ',
3269
        'JEANNE^$', 'IAN', 'IAN',
3270
        'JEAN-^', 'IA', 'IA',
3271
        'JER-^', 'IE', 'IE',
3272
        'JE(LMNST)-', 'IE', 'IE',
3273
        'JI^', 'JI', None,
3274
        'JOR(GK)^$', 'IÖRK', 'IÖRK',
3275
        'J', 'I', 'I',
3276
        'KC(ÄEIJ)-', 'X', 'X',
3277
        'KD', 'KT', None,
3278
        'KE(LMNRST)-3^', 'KE', 'KE',
3279
        'KG(AÄEILOÖRUÜY)-', 'K', None,
3280
        'KH<^', 'K', 'K',
3281
        'KIC$', 'KIZ', 'KIZ',
3282
        'KLE(LMNRST)-3^', 'KLE', 'KLE',
3283
        'KOTELE-^', 'KOTL', 'KUTL',
3284
        'KREAT-^', 'KREA', 'KREA',
3285
        'KRÜS(TZ)--^', 'KRI', None,
3286
        'KRYS(TZ)--^', 'KRI', None,
3287
        'KRY9^', 'KRÜ', None,
3288
        'KSCH---', 'K', 'K',
3289
        'KSH--', 'K', 'K',
3290
        'K(SßXZ)7', 'X', 'X',  # implies 'KST' -> 'XT'
3291
        'KT\'S$', 'X', 'X',
3292
        'KTI(AIOU)-3', 'XI', 'XI',
3293
        'KT(SßXZ)', 'X', 'X',
3294
        'KY9^', 'KÜ', None,
3295
        'K\'S$', 'X', 'X',
3296
        'K´S$', 'X', 'X',
3297
        'LANGES$', ' LANGES', ' LANKEZ',
3298
        'LANGE$', ' LANGE', ' LANKE',
3299
        'LANG$', ' LANK', ' LANK',
3300
        'LARVE-', 'LARF', 'LARF',
3301
        'LD(SßZ)$', 'LS', 'LZ',
3302
        'LD\'S$', 'LS', 'LZ',
3303
        'LD´S$', 'LS', 'LZ',
3304
        'LEAND-^', 'LEAN', 'LEAN',
3305
        'LEERSTEHE-----^', 'LER ', 'LER ',
3306
        'LEICHBLEIB-----', 'LEICH ', 'LEIK ',
3307
        'LEICHLAUTE-----', 'LEICH ', 'LEIK ',
3308
        'LEIDERREGE------', 'LEIT ', 'LEIT ',
3309
        'LEIDGEPR----^', 'LEIT ', 'LEIT ',
3310
        'LEINSTEHE-----', 'LEIN ', 'LEIN ',
3311
        'LEL-', 'LE', 'LE',
3312
        'LE(MNRST)-3^', 'LE', 'LE',
3313
        'LETTE$', 'LET', 'LET',
3314
        'LFGNAG-', 'LFGAN', 'LFKAN',
3315
        'LICHERWEIS----', 'LICHA ', 'LIKA ',
3316
        'LIC$', 'LIZ', 'LIZ',
3317
        'LIVE^$', 'LEIF', 'LEIF',
3318
        'LT(SßZ)$', 'LS', 'LZ',
3319
        'LT\'S$', 'LS', 'LZ',
3320
        'LT´S$', 'LS', 'LZ',
3321
        'LUI(GS)--', 'LU', 'LU',
3322
        'LV(AIO)-', 'LW', None,
3323
        'LY9^', 'LÜ', None,
3324
        'LSTS$', 'LS', 'LZ',
3325
        'LZ(BDFGKLMNPQRSTVWX)-', 'LS', None,
3326
        'L(SßZ)$', 'LS', None,
3327
        'MAIR-<', 'MEI', 'NEI',
3328
        'MANAG-', 'MENE', 'NENE',
3329
        'MANUEL', 'MANUEL', None,
3330
        'MASSEU(RS)-', 'MASÖ', 'NAZÖ',
3331
        'MATCH', 'MESH', 'NEZ',
3332
        'MAURICE', 'MORIS', 'NURIZ',
3333
        'MBH^$', 'MBH', 'MBH',
3334
        'MB(ßZ)$', 'MS', None,
3335
        'MB(SßTZ)-', 'M', 'N',
3336
        'MCG9^', 'MAK', 'NAK',
3337
        'MC9^', 'MAK', 'NAK',
3338
        'MEMOIR-^', 'MEMOA', 'NENUA',
3339
        'MERHAVEN$', 'MAHAFN', None,
3340
        'ME(LMNRST)-3^', 'ME', 'NE',
3341
        'MEN(STZ)--3', 'ME', None,
3342
        'MEN$', 'MEN', None,
3343
        'MIGUEL-', 'MIGE', 'NIKE',
3344
        'MIKE^$', 'MEIK', 'NEIK',
3345
        'MITHILFE----^$', 'MIT H', 'NIT ',
3346
        'MN$', 'M', None,
3347
        'MN', 'N', 'N',
3348
        'MPJUTE-', 'MPUT', 'NBUT',
3349
        'MP(ßZ)$', 'MS', None,
3350
        'MP(SßTZ)-', 'M', 'N',
3351
        'MP(BDJLMNPQVW)-', 'MB', 'NB',
3352
        'MY9^', 'MÜ', None,
3353
        'M(ßZ)$', 'MS', None,
3354
        'M´G7^', 'MAK', 'NAK',
3355
        'M\'G7^', 'MAK', 'NAK',
3356
        'M´^', 'MAK', 'NAK',
3357
        'M\'^', 'MAK', 'NAK',
3358
        'M', None, 'N',
3359
        'NACH^^', 'NACH', 'NAK',
3360
        'NADINE', 'NADIN', 'NATIN',
3361
        'NAIV--', 'NA', 'NA',
3362
        'NAISE$', 'NESE', 'NEZE',
3363
        'NAUGENOMM------', 'NAU ', 'NAU ',
3364
        'NAUSOGUT$', 'NAUSO GUT', 'NAUZU KUT',
3365
        'NCH$', 'NSH', 'NZ',
3366
        'NCOISE$', 'SOA', 'ZUA',
3367
        'NCOIS$', 'SOA', 'ZUA',
3368
        'NDAR$', 'NDA', 'NTA',
3369
        'NDERINGEN------', 'NDE ', 'NTE ',
3370
        'NDRO(CDKTZ)-', 'NTRO', None,
3371
        'ND(BFGJLMNPQVW)-', 'NT', None,
3372
        'ND(SßZ)$', 'NS', 'NZ',
3373
        'ND\'S$', 'NS', 'NZ',
3374
        'ND´S$', 'NS', 'NZ',
3375
        'NEBEN^^', 'NEBN', 'NEBN',
3376
        'NENGELERN------', 'NEN ', 'NEN ',
3377
        'NENLERN(ET)---', 'NEN LE', 'NEN LE',
3378
        'NENZULERNE---', 'NEN ZU LE', 'NEN ZU LE',
3379
        'NE(LMNRST)-3^', 'NE', 'NE',
3380
        'NEN-3', 'NE', 'NE',
3381
        'NETTE$', 'NET', 'NET',
3382
        'NGU^^', 'NU', 'NU',
3383
        'NG(BDFJLMNPQRTVW)-', 'NK', 'NK',
3384
        'NH(AUO)-$', 'NI', 'NI',
3385
        'NICHTSAHNEN-----', 'NIX ', 'NIX ',
3386
        'NICHTSSAGE----', 'NIX ', 'NIX ',
3387
        'NICHTS^^', 'NIX', 'NIX',
3388
        'NICHT^^', 'NICHT', 'NIKT',
3389
        'NINE$', 'NIN', 'NIN',
3390
        'NON^^', 'NON', 'NUN',
3391
        'NOTLEIDE-----^', 'NOT ', 'NUT ',
3392
        'NOT^^', 'NOT', 'NUT',
3393
        'NTI(AIOU)-3', 'NZI', 'NZI',
3394
        'NTIEL--3', 'NZI', 'NZI',
3395
        'NT(SßZ)$', 'NS', 'NZ',
3396
        'NT\'S$', 'NS', 'NZ',
3397
        'NT´S$', 'NS', 'NZ',
3398
        'NYLON', 'NEILON', 'NEILUN',
3399
        'NY9^', 'NÜ', None,
3400
        'NSTZUNEH---', 'NST ZU ', 'NZT ZU ',
3401
        'NSZ-', 'NS', None,
3402
        'NSTS$', 'NS', 'NZ',
3403
        'NZ(BDFGKLMNPQRSTVWX)-', 'NS', None,
3404
        'N(SßZ)$', 'NS', None,
3405
        'OBERE-', 'OBER', None,
3406
        'OBER^^', 'OBA', 'UBA',
3407
        'OEU2', 'Ö', 'Ö',
3408
        'OE<2', 'Ö', 'Ö',
3409
        'OGL-', 'OK', None,
3410
        'OGNIE-', 'ONI', 'UNI',
3411
        'OGN(AEOU)-$', 'ONI', 'UNI',
3412
        'OH(AIOÖUÜY)-', 'OH', None,
3413
        'OIE$', 'Ö', 'Ö',
3414
        'OIRE$', 'OA', 'UA',
3415
        'OIR$', 'OA', 'UA',
3416
        'OIX', 'OA', 'UA',
3417
        'OI<3', 'EU', 'EU',
3418
        'OKAY^$', 'OKE', 'UKE',
3419
        'OLYN$', 'OLIN', 'ULIN',
3420
        'OO(DLMZ)-', 'U', None,
3421
        'OO$', 'U', None,
3422
        'OO-', '', '',
3423
        'ORGINAL-----', 'ORI', 'URI',
3424
        'OTI(AÄOÖUÜ)-', 'OZI', 'UZI',
3425
        'OUI^', 'WI', 'FI',
3426
        'OUILLE$', 'ULIE', 'ULIE',
3427
        'OU(DT)-^', 'AU', 'AU',
3428
        'OUSE$', 'AUS', 'AUZ',
3429
        'OUT-', 'AU', 'AU',
3430
        'OU', 'U', 'U',
3431
        'O(FV)$', 'AU', 'AU',  # due to 'OW$' -> 'AU'
3432
        'OVER--<', 'OW', None,
3433
        'OV(AOU)-', 'OW', None,
3434
        'OW$', 'AU', 'AU',
3435
        'OWS$', 'OS', 'UZ',
3436
        'OJ(AÄEIOÖUÜ)--', 'O', 'U',
3437
        'OYER', 'OIA', None,
3438
        'OY(AÄEIOÖUÜ)--', 'O', 'U',
3439
        'O(JY)<', 'EU', 'EU',
3440
        'OZ$', 'OS', None,
3441
        'O´^', 'O', 'U',
3442
        'O\'^', 'O', 'U',
3443
        'O', None, 'U',
3444
        'PATIEN--^', 'PAZI', 'PAZI',
3445
        'PENSIO-^', 'PANSI', 'PANZI',
3446
        'PE(LMNRST)-3^', 'PE', 'PE',
3447
        'PFER-^', 'FE', 'FE',
3448
        'P(FH)<', 'F', 'F',
3449
        'PIC^$', 'PIK', 'PIK',
3450
        'PIC$', 'PIZ', 'PIZ',
3451
        'PIPELINE', 'PEIBLEIN', 'PEIBLEIN',
3452
        'POLYP-', 'POLÜ', None,
3453
        'POLY^^', 'POLI', 'PULI',
3454
        'PORTRAIT7', 'PORTRE', 'PURTRE',
3455
        'POWER7', 'PAUA', 'PAUA',
3456
        'PP(FH)--<', 'B', 'B',
3457
        'PP-', '', '',
3458
        'PRODUZ-^', 'PRODU', 'BRUTU',
3459
        'PRODUZI--', ' PRODU', ' BRUTU',
3460
        'PRIX^$', 'PRI', 'PRI',
3461
        'PS-^^', 'P', None,
3462
        'P(SßZ)^', None, 'Z',
3463
        'P(SßZ)$', 'BS', None,
3464
        'PT-^', '', '',
3465
        'PTI(AÄOÖUÜ)-3', 'BZI', 'BZI',
3466
        'PY9^', 'PÜ', None,
3467
        'P(AÄEIOÖRUÜY)-', 'P', 'P',
3468
        'P(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'P', None,
3469
        'P.^', None, 'P.',
3470
        'P^', 'P', None,
3471
        'P', 'B', 'B',
3472
        'QI-', 'Z', 'Z',
3473
        'QUARANT--', 'KARA', 'KARA',
3474
        'QUE(LMNRST)-3', 'KWE', 'KFE',
3475
        'QUE$', 'K', 'K',
3476
        'QUI(NS)$', 'KI', 'KI',
3477
        'QUIZ7', 'KWIS', None,
3478
        'Q(UV)7', 'KW', 'KF',
3479
        'Q<', 'K', 'K',
3480
        'RADFAHR----', 'RAT ', 'RAT ',
3481
        'RAEFTEZEHRE-----', 'REFTE ', 'REFTE ',
3482
        'RCH', 'RCH', 'RK',
3483
        'REA(DU)---3^', 'R', None,
3484
        'REBSERZEUG------', 'REBS ', 'REBZ ',
3485
        'RECHERCH^', 'RESHASH', 'REZAZ',
3486
        'RECYCL--', 'RIZEI', 'RIZEI',
3487
        'RE(ALST)-3^', 'RE', None,
3488
        'REE$', 'RI', 'RI',
3489
        'RER$', 'RA', 'RA',
3490
        'RE(MNR)-4', 'RE', 'RE',
3491
        'RETTE$', 'RET', 'RET',
3492
        'REUZ$', 'REUZ', None,
3493
        'REW$', 'RU', 'RU',
3494
        'RH<^', 'R', 'R',
3495
        'RJA(MN)--', 'RI', 'RI',
3496
        'ROWD-^', 'RAU', 'RAU',
3497
        'RTEMONNAIE-', 'RTMON', 'RTNUN',
3498
        'RTI(AÄOÖUÜ)-3', 'RZI', 'RZI',
3499
        'RTIEL--3', 'RZI', 'RZI',
3500
        'RV(AEOU)-3', 'RW', None,
3501
        'RY(KN)-$', 'RI', 'RI',
3502
        'RY9^', 'RÜ', None,
3503
        'RÄFTEZEHRE-----', 'REFTE ', 'REFTE ',
3504
        'SAISO-^', 'SES', 'ZEZ',
3505
        'SAFE^$', 'SEIF', 'ZEIF',
3506
        'SAUCE-^', 'SOS', 'ZUZ',
3507
        'SCHLAGGEBEN-----<', 'SHLAK ', 'ZLAK ',
3508
        'SCHSCH---7', '', '',
3509
        'SCHTSCH', 'SH', 'Z',
3510
        'SC(HZ)<', 'SH', 'Z',
3511
        'SC', 'SK', 'ZK',
3512
        'SELBSTST--7^^', 'SELB', 'ZELB',
3513
        'SELBST7^^', 'SELBST', 'ZELBZT',
3514
        'SERVICE7^', 'SÖRWIS', 'ZÖRFIZ',
3515
        'SERVI-^', 'SERW', None,
3516
        'SE(LMNRST)-3^', 'SE', 'ZE',
3517
        'SETTE$', 'SET', 'ZET',
3518
        'SHP-^', 'S', 'Z',
3519
        'SHST', 'SHT', 'ZT',
3520
        'SHTSH', 'SH', 'Z',
3521
        'SHT', 'ST', 'Z',
3522
        'SHY9^', 'SHÜ', None,
3523
        'SH^^', 'SH', None,
3524
        'SH3', 'SH', 'Z',
3525
        'SICHERGEGAN-----^', 'SICHA ', 'ZIKA ',
3526
        'SICHERGEHE----^', 'SICHA ', 'ZIKA ',
3527
        'SICHERGESTEL------^', 'SICHA ', 'ZIKA ',
3528
        'SICHERSTELL-----^', 'SICHA ', 'ZIKA ',
3529
        'SICHERZU(GS)--^', 'SICHA ZU ', 'ZIKA ZU ',
3530
        'SIEGLI-^', 'SIKL', 'ZIKL',
3531
        'SIGLI-^', 'SIKL', 'ZIKL',
3532
        'SIGHT', 'SEIT', 'ZEIT',
3533
        'SIGN', 'SEIN', 'ZEIN',
3534
        'SKI(NPZ)-', 'SKI', 'ZKI',
3535
        'SKI<^', 'SHI', 'ZI',
3536
        'SODASS^$', 'SO DAS', 'ZU TAZ',
3537
        'SODAß^$', 'SO DAS', 'ZU TAZ',
3538
        'SOGENAN--^', 'SO GEN', 'ZU KEN',
3539
        'SOUND-', 'SAUN', 'ZAUN',
3540
        'STAATS^^', 'STAZ', 'ZTAZ',
3541
        'STADT^^', 'STAT', 'ZTAT',
3542
        'STANDE$', ' STANDE', ' ZTANTE',
3543
        'START^^', 'START', 'ZTART',
3544
        'STAURANT7', 'STORAN', 'ZTURAN',
3545
        'STEAK-', 'STE', 'ZTE',
3546
        'STEPHEN-^$', 'STEW', None,
3547
        'STERN', 'STERN', None,
3548
        'STRAF^^', 'STRAF', 'ZTRAF',
3549
        'ST\'S$', 'Z', 'Z',
3550
        'ST´S$', 'Z', 'Z',
3551
        'STST--', '', '',
3552
        'STS(ACEÈÉÊHIÌÍÎOUÄÜÖ)--', 'ST', 'ZT',
3553
        'ST(SZ)', 'Z', 'Z',
3554
        'SPAREN---^', 'SPA', 'ZPA',
3555
        'SPAREND----', ' SPA', ' ZPA',
3556
        'S(PTW)-^^', 'S', None,
3557
        'SP', 'SP', None,
3558
        'STYN(AE)-$', 'STIN', 'ZTIN',
3559
        'ST', 'ST', 'ZT',
3560
        'SUITE<', 'SIUT', 'ZIUT',
3561
        'SUKE--$', 'S', 'Z',
3562
        'SURF(EI)-', 'SÖRF', 'ZÖRF',
3563
        'SV(AEÈÉÊIÌÍÎOU)-<^', 'SW', None,
3564
        'SYB(IY)--^', 'SIB', None,
3565
        'SYL(KVW)--^', 'SI', None,
3566
        'SY9^', 'SÜ', None,
3567
        'SZE(NPT)-^', 'ZE', 'ZE',
3568
        'SZI(ELN)-^', 'ZI', 'ZI',
3569
        'SZCZ<', 'SH', 'Z',
3570
        'SZT<', 'ST', 'ZT',
3571
        'SZ<3', 'SH', 'Z',
3572
        'SÜL(KVW)--^', 'SI', None,
3573
        'S', None, 'Z',
3574
        'TCH', 'SH', 'Z',
3575
        'TD(AÄEIOÖRUÜY)-', 'T', None,
3576
        'TD(ÀÁÂÃÅÈÉÊËÌÍÎÏÒÓÔÕØÙÚÛÝŸ)-', 'T', None,
3577
        'TEAT-^', 'TEA', 'TEA',
3578
        'TERRAI7^', 'TERA', 'TERA',
3579
        'TE(LMNRST)-3^', 'TE', 'TE',
3580
        'TH<', 'T', 'T',
3581
        'TICHT-', 'TIK', 'TIK',
3582
        'TICH$', 'TIK', 'TIK',
3583
        'TIC$', 'TIZ', 'TIZ',
3584
        'TIGGESTELL-------', 'TIK ', 'TIK ',
3585
        'TIGSTELL-----', 'TIK ', 'TIK ',
3586
        'TOAS-^', 'TO', 'TU',
3587
        'TOILET-', 'TOLE', 'TULE',
3588
        'TOIN-', 'TOA', 'TUA',
3589
        'TRAECHTI-^', 'TRECHT', 'TREKT',
3590
        'TRAECHTIG--', ' TRECHT', ' TREKT',
3591
        'TRAINI-', 'TREN', 'TREN',
3592
        'TRÄCHTI-^', 'TRECHT', 'TREKT',
3593
        'TRÄCHTIG--', ' TRECHT', ' TREKT',
3594
        'TSCH', 'SH', 'Z',
3595
        'TSH', 'SH', 'Z',
3596
        'TST', 'ZT', 'ZT',
3597
        'T(Sß)', 'Z', 'Z',
3598
        'TT(SZ)--<', '', '',
3599
        'TT9', 'T', 'T',
3600
        'TV^$', 'TV', 'TV',
3601
        'TX(AEIOU)-3', 'SH', 'Z',
3602
        'TY9^', 'TÜ', None,
3603
        'TZ-', '', '',
3604
        'T\'S3$', 'Z', 'Z',
3605
        'T´S3$', 'Z', 'Z',
3606
        'UEBEL(GNRW)-^^', 'ÜBL ', 'IBL ',
3607
        'UEBER^^', 'ÜBA', 'IBA',
3608
        'UE2', 'Ü', 'I',
3609
        'UGL-', 'UK', None,
3610
        'UH(AOÖUÜY)-', 'UH', None,
3611
        'UIE$', 'Ü', 'I',
3612
        'UM^^', 'UM', 'UN',
3613
        'UNTERE--3', 'UNTE', 'UNTE',
3614
        'UNTER^^', 'UNTA', 'UNTA',
3615
        'UNVER^^', 'UNFA', 'UNFA',
3616
        'UN^^', 'UN', 'UN',
3617
        'UTI(AÄOÖUÜ)-', 'UZI', 'UZI',
3618
        'UVE-4', 'UW', None,
3619
        'UY2', 'UI', None,
3620
        'UZZ', 'AS', 'AZ',
3621
        'VACL-^', 'WAZ', 'FAZ',
3622
        'VAC$', 'WAZ', 'FAZ',
3623
        'VAN DEN ^', 'FANDN', 'FANTN',
3624
        'VANES-^', 'WANE', None,
3625
        'VATRO-', 'WATR', None,
3626
        'VA(DHJNT)--^', 'F', None,
3627
        'VEDD-^', 'FE', 'FE',
3628
        'VE(BEHIU)--^', 'F', None,
3629
        'VEL(BDLMNT)-^', 'FEL', None,
3630
        'VENTZ-^', 'FEN', None,
3631
        'VEN(NRSZ)-^', 'FEN', None,
3632
        'VER(AB)-^$', 'WER', None,
3633
        'VERBAL^$', 'WERBAL', None,
3634
        'VERBAL(EINS)-^', 'WERBAL', None,
3635
        'VERTEBR--', 'WERTE', None,
3636
        'VEREIN-----', 'F', None,
3637
        'VEREN(AEIOU)-^', 'WEREN', None,
3638
        'VERIFI', 'WERIFI', None,
3639
        'VERON(AEIOU)-^', 'WERON', None,
3640
        'VERSEN^', 'FERSN', 'FAZN',
3641
        'VERSIERT--^', 'WERSI', None,
3642
        'VERSIO--^', 'WERS', None,
3643
        'VERSUS', 'WERSUS', None,
3644
        'VERTI(GK)-', 'WERTI', None,
3645
        'VER^^', 'FER', 'FA',
3646
        'VERSPRECHE-------', ' FER', ' FA',
3647
        'VER$', 'WA', None,
3648
        'VER', 'FA', 'FA',
3649
        'VET(HT)-^', 'FET', 'FET',
3650
        'VETTE$', 'WET', 'FET',
3651
        'VE^', 'WE', None,
3652
        'VIC$', 'WIZ', 'FIZ',
3653
        'VIELSAGE----', 'FIL ', 'FIL ',
3654
        'VIEL', 'FIL', 'FIL',
3655
        'VIEW', 'WIU', 'FIU',
3656
        'VILL(AE)-', 'WIL', None,
3657
        'VIS(ACEIKUVWZ)-<^', 'WIS', None,
3658
        'VI(ELS)--^', 'F', None,
3659
        'VILLON--', 'WILI', 'FILI',
3660
        'VIZE^^', 'FIZE', 'FIZE',
3661
        'VLIE--^', 'FL', None,
3662
        'VL(AEIOU)--', 'W', None,
3663
        'VOKA-^', 'WOK', None,
3664
        'VOL(ATUVW)--^', 'WO', None,
3665
        'VOR^^', 'FOR', 'FUR',
3666
        'VR(AEIOU)--', 'W', None,
3667
        'VV9', 'W', None,
3668
        'VY9^', 'WÜ', 'FI',
3669
        'V(ÜY)-', 'W', None,
3670
        'V(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'W', None,
3671
        'V(AEIJLRU)-<', 'W', None,
3672
        'V.^', 'V.', None,
3673
        'V<', 'F', 'F',
3674
        'WEITERENTWI-----^', 'WEITA ', 'FEITA ',
3675
        'WEITREICH-----^', 'WEIT ', 'FEIT ',
3676
        'WEITVER^', 'WEIT FER', 'FEIT FA',
3677
        'WE(LMNRST)-3^', 'WE', 'FE',
3678
        'WER(DST)-', 'WER', None,
3679
        'WIC$', 'WIZ', 'FIZ',
3680
        'WIEDERU--', 'WIDE', 'FITE',
3681
        'WIEDER^$', 'WIDA', 'FITA',
3682
        'WIEDER^^', 'WIDA ', 'FITA ',
3683
        'WIEVIEL', 'WI FIL', 'FI FIL',
3684
        'WISUEL', 'WISUEL', None,
3685
        'WR-^', 'W', None,
3686
        'WY9^', 'WÜ', 'FI',
3687
        'W(BDFGJKLMNPQRSTZ)-', 'F', None,
3688
        'W$', 'F', None,
3689
        'W', None, 'F',
3690
        'X<^', 'Z', 'Z',
3691
        'XHAVEN$', 'XAFN', None,
3692
        'X(CSZ)', 'X', 'X',
3693
        'XTS(CH)--', 'XT', 'XT',
3694
        'XT(SZ)', 'Z', 'Z',
3695
        'YE(LMNRST)-3^', 'IE', 'IE',
3696
        'YE-3', 'I', 'I',
3697
        'YOR(GK)^$', 'IÖRK', 'IÖRK',
3698
        'Y(AOU)-<7', 'I', 'I',
3699
        'Y(BKLMNPRSTX)-1', 'Ü', None,
3700
        'YVES^$', 'IF', 'IF',
3701
        'YVONNE^$', 'IWON', 'IFUN',
3702
        'Y.^', 'Y.', None,
3703
        'Y', 'I', 'I',
3704
        'ZC(AOU)-', 'SK', 'ZK',
3705
        'ZE(LMNRST)-3^', 'ZE', 'ZE',
3706
        'ZIEJ$', 'ZI', 'ZI',
3707
        'ZIGERJA(HR)-3', 'ZIGA IA', 'ZIKA IA',
3708
        'ZL(AEIOU)-', 'SL', None,
3709
        'ZS(CHT)--', '', '',
3710
        'ZS', 'SH', 'Z',
3711
        'ZUERST', 'ZUERST', 'ZUERST',
3712
        'ZUGRUNDE^$', 'ZU GRUNDE', 'ZU KRUNTE',
3713
        'ZUGRUNDE', 'ZU GRUNDE ', 'ZU KRUNTE ',
3714
        'ZUGUNSTEN', 'ZU GUNSTN', 'ZU KUNZTN',
3715
        'ZUHAUSE-', 'ZU HAUS', 'ZU AUZ',
3716
        'ZULASTEN^$', 'ZU LASTN', 'ZU LAZTN',
3717
        'ZURUECK^^', 'ZURÜK', 'ZURIK',
3718
        'ZURZEIT', 'ZUR ZEIT', 'ZUR ZEIT',
3719
        'ZURÜCK^^', 'ZURÜK', 'ZURIK',
3720
        'ZUSTANDE', 'ZU STANDE', 'ZU ZTANTE',
3721
        'ZUTAGE', 'ZU TAGE', 'ZU TAKE',
3722
        'ZUVER^^', 'ZUFA', 'ZUFA',
3723
        'ZUVIEL', 'ZU FIL', 'ZU FIL',
3724
        'ZUWENIG', 'ZU WENIK', 'ZU FENIK',
3725
        'ZY9^', 'ZÜ', None,
3726
        'ZYK3$', 'ZIK', None,
3727
        'Z(VW)7^', 'SW', None,
3728
        None, None, None)
3729
3730
    phonet_hash = Counter()
3731
    alpha_pos = Counter()
3732
3733
    phonet_hash_1 = Counter()
3734
    phonet_hash_2 = Counter()
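    # These Counters serve as sparse integer tables (missing keys read as
    # 0): alpha_pos maps a letter to a position index (special letters and
    # umlauts get 1, 'A'-'Z' get 2-27); phonet_hash maps a first letter to
    # the index of its first rule; phonet_hash_1/phonet_hash_2 map a
    # (first letter, following letter) pair to roughly the first and last
    # candidate rule index.  They are filled in by _initialize_phonet().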
3735
3736
    _phonet_upper_translation = dict(zip((ord(_) for _ in
                                          'abcdefghijklmnopqrstuvwxyzàáâãåäæ' +
3738
                                          'çðèéêëìíîïñòóôõöøœšßþùúûüýÿ'),
3739
                                         'ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÅÄÆ' +
3740
                                         'ÇÐÈÉÊËÌÍÎÏÑÒÓÔÕÖØŒŠßÞÙÚÛÜÝŸ'))
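    # _phonet_upper_translation maps lower-case code points (including the
    # accented letters used in the rule tables) to their upper-case
    # counterparts; _phonet() applies it via str.translate() to normalize
    # the input before rule matching.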
3741
3742
    def _trinfo(text, rule, err_text, lang):
3743
        """Output debug information."""
3744
        if lang == 'none':
3745
            _phonet_rules = _phonet_rules_no_lang
3746
        else:
3747
            _phonet_rules = _phonet_rules_german
3748
3749
        from_rule = ('(NULL)' if _phonet_rules[rule] is None else
3750
                     _phonet_rules[rule])
3751
        to_rule1 = ('(NULL)' if (_phonet_rules[rule + 1] is None) else
3752
                    _phonet_rules[rule + 1])
3753
        to_rule2 = ('(NULL)' if (_phonet_rules[rule + 2] is None) else
3754
                    _phonet_rules[rule + 2])
3755
        print('"{} {}:  "{}"{}"{}" {}'.format(text, ((rule / 3) + 1),
3756
                                              from_rule, to_rule1, to_rule2,
3757
                                              err_text))
3758
3759
    def _initialize_phonet(lang):
3760
        """Initialize phonet variables."""
3761
        if lang == 'none':
3762
            _phonet_rules = _phonet_rules_no_lang
3763
        else:
3764
            _phonet_rules = _phonet_rules_german
3765
3766
        phonet_hash[''] = -1
3767
3768
        # German and international umlauts
3769
        for j in {'À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë',
3770
                  'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø',
3771
                  'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'Þ', 'ß', 'Œ', 'Š', 'Ÿ'}:
3772
            alpha_pos[j] = 1
3773
            phonet_hash[j] = -1
3774
3775
        # "normal" letters ('A'-'Z')
3776
        for i, j in enumerate('ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
3777
            alpha_pos[j] = i + 2
3778
            phonet_hash[j] = -1
3779
3780
        for i in range(26):
3781
            for j in range(28):
3782
                phonet_hash_1[i, j] = -1
3783
                phonet_hash_2[i, j] = -1
3784
3785
        # for each phonetic rule
3786
        for i in range(len(_phonet_rules)):
3787
            rule = _phonet_rules[i]
3788
3789
            if rule and i % 3 == 0:
3790
                # calculate first hash value
3791
                k = _phonet_rules[i][0]
3792
3793
                if phonet_hash[k] < 0 and (_phonet_rules[i+1] or
                                           _phonet_rules[i+2]):
3795
                    phonet_hash[k] = i
3796
3797
                # calculate second hash values
3798
                if k and alpha_pos[k] >= 2:
                    k = alpha_pos[k]
3800
3801
                    j = k-2
3802
                    rule = rule[1:]
3803
3804
                    if not rule:
3805
                        rule = ' '
3806
                    elif rule[0] == '(':
3807
                        rule = rule[1:]
3808
                    else:
3809
                        rule = rule[0]
3810
3811
                    while rule and (rule[0] != ')'):
3812
                        k = alpha_pos[rule[0]]
3813
3814
                        if k > 0:
3815
                            # add hash value for this letter
3816
                            if phonet_hash_1[j, k] < 0:
                                phonet_hash_1[j, k] = i
3818
                                phonet_hash_2[j, k] = i
3819
3820
                            if phonet_hash_2[j, k] >= (i-30):
                                phonet_hash_2[j, k] = i
3822
                            else:
3823
                                k = -1
3824
3825
                        if k <= 0:
3826
                            # add hash value for all letters
3827
                            if phonet_hash_1[j, 0] < 0:
3828
                                phonet_hash_1[j, 0] = i
3829
3830
                            phonet_hash_2[j, 0] = i
3831
3832
                        rule = rule[1:]
3833
3834
    def _phonet(term, mode, lang, trace):
3835
        """Return the phonet coded form of a term."""
3836
        if lang == 'none':
3837
            _phonet_rules = _phonet_rules_no_lang
3838
        else:
3839
            _phonet_rules = _phonet_rules_german
3840
3841
        char0 = ''
3842
        dest = term
3843
3844
        if not term:
3845
            return ''
3846
3847
        term_length = len(term)
3848
3849
        # convert input string to upper-case
3850
        src = term.translate(_phonet_upper_translation)
3851
3852
        # check "src"
3853
        i = 0
3854
        j = 0
3855
        zeta = 0
3856
3857
        while i < len(src):
3858
            char = src[i]
3859
3860
            if trace:
3861
                print('\ncheck position {}:  src = "{}",  dest = "{}"'.format
3862
                      (j, src[i:], dest[:j]))
3863
3864
            pos = alpha_pos[char]
3865
3866
            if pos >= 2:
3867
                xpos = pos-2
3868
3869
                if i+1 == len(src):
3870
                    pos = alpha_pos['']
3871
                else:
3872
                    pos = alpha_pos[src[i+1]]
3873
3874
                start1 = phonet_hash_1[xpos, pos]
3875
                start2 = phonet_hash_1[xpos, 0]
3876
                end1 = phonet_hash_2[xpos, pos]
3877
                end2 = phonet_hash_2[xpos, 0]
3878
3879
                # preserve rule priorities
3880
                if (start2 >= 0) and ((start1 < 0) or (start2 < start1)):
3881
                    pos = start1
3882
                    start1 = start2
3883
                    start2 = pos
3884
                    pos = end1
3885
                    end1 = end2
3886
                    end2 = pos
3887
3888
                if (end1 >= start2) and (start2 >= 0):
3889
                    if end2 > end1:
3890
                        end1 = end2
3891
3892
                    start2 = -1
3893
                    end2 = -1
3894
            else:
3895
                pos = phonet_hash[char]
3896
                start1 = pos
3897
                end1 = 10000
3898
                start2 = -1
3899
                end2 = -1
3900
3901
            pos = start1
3902
            zeta0 = 0
3903
3904
            if pos >= 0:
3905
                # check rules for this char
3906
                while ((_phonet_rules[pos] is None) or
3907
                       (_phonet_rules[pos][0] == char)):
3908
                    if pos > end1:
3909
                        if start2 > 0:
3910
                            pos = start2
3911
                            start1 = start2
3912
                            start2 = -1
3913
                            end1 = end2
3914
                            end2 = -1
3915
                            continue
3916
3917
                        break
3918
3919
                    if (((_phonet_rules[pos] is None) or
3920
                         (_phonet_rules[pos + mode] is None))):
3921
                        # no conversion rule available
3922
                        pos += 3
3923
                        continue
3924
3925
                    if trace:
3926
                        _trinfo('> rule no.', pos, 'is being checked', lang)
3927
3928
                    # check whole string
3929
                    matches = 1  # number of matching letters
3930
                    priority = 5  # default priority
3931
                    rule = _phonet_rules[pos]
3932
                    rule = rule[1:]
3933
3934
                    while (rule and
3935
                           (len(src) > (i + matches)) and
3936
                           (src[i + matches] == rule[0]) and
3937
                           not rule[0].isdigit() and
3938
                           (rule not in '(-<^$')):
3939
                        matches += 1
3940
                        rule = rule[1:]
3941
3942
                    if rule and (rule[0] == '('):
3943
                        # check an array of letters
3944
                        if (((len(src) > (i + matches)) and
3945
                             src[i + matches].isalpha() and
3946
                             (src[i + matches] in rule[1:]))):
3947
                            matches += 1
3948
3949
                            while rule and rule[0] != ')':
3950
                                rule = rule[1:]
3951
3952
                            # if rule[0] == ')':
3953
                            rule = rule[1:]
3954
3955
                    if rule:
3956
                        priority0 = ord(rule[0])
3957
                    else:
3958
                        priority0 = 0
3959
3960
                    matches0 = matches
3961
3962
                    while rule and rule[0] == '-' and matches > 1:
3963
                        matches -= 1
3964
                        rule = rule[1:]
3965
3966
                    if rule and rule[0] == '<':
3967
                        rule = rule[1:]
3968
3969
                    if rule and rule[0].isdigit():
3970
                        # read priority
3971
                        priority = int(rule[0])
3972
                        rule = rule[1:]
3973
3974
                    if rule and rule[0:2] == '^^':
3975
                        rule = rule[1:]
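                    # At this point 'rule' holds only the remaining context
                    # markers: '^' requires the match to begin at a word
                    # boundary, '$' requires it to end at one ('^$' needs
                    # both).  If the test below passes, the replacement at
                    # _phonet_rules[pos + mode] is applied.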
3976
3977
                    if (not rule or
3978
                            ((rule[0] == '^') and
3979
                             ((i == 0) or not src[i-1].isalpha()) and
3980
                             ((rule[1:2] != '$') or
3981
                              (not (src[i+matches0:i+matches0+1].isalpha()) and
3982
                               (src[i+matches0:i+matches0+1] != '.')))) or
3983
                            ((rule[0] == '$') and (i > 0) and
3984
                             src[i-1].isalpha() and
3985
                             ((not src[i+matches0:i+matches0+1].isalpha()) and
3986
                              (src[i+matches0:i+matches0+1] != '.')))):
3987
                        # look for continuation, if:
3988
                        # matches > 1 und NO '-' in first string */
3989
                        pos0 = -1
3990
3991
                        start3 = 0
3992
                        start4 = 0
3993
                        end3 = 0
3994
                        end4 = 0
3995
3996
                        if (((matches > 1) and
3997
                             src[i+matches:i+matches+1] and
3998
                             (priority0 != ord('-')))):
3999
                            char0 = src[i+matches-1]
4000
                            pos0 = alpha_pos[char0]
4001
4002
                            if pos0 >= 2 and src[i+matches]:
4003
                                xpos = pos0 - 2
4004
                                pos0 = alpha_pos[src[i+matches]]
4005
                                start3 = phonet_hash_1[xpos, pos0]
4006
                                start4 = phonet_hash_1[xpos, 0]
4007
                                end3 = phonet_hash_2[xpos, pos0]
4008
                                end4 = phonet_hash_2[xpos, 0]
4009
4010
                                # preserve rule priorities
4011
                                if (((start4 >= 0) and
4012
                                     ((start3 < 0) or (start4 < start3)))):
4013
                                    pos0 = start3
4014
                                    start3 = start4
4015
                                    start4 = pos0
4016
                                    pos0 = end3
4017
                                    end3 = end4
4018
                                    end4 = pos0
4019
4020
                                if (end3 >= start4) and (start4 >= 0):
4021
                                    if end4 > end3:
4022
                                        end3 = end4
4023
4024
                                    start4 = -1
4025
                                    end4 = -1
4026
                            else:
4027
                                pos0 = phonet_hash[char0]
4028
                                start3 = pos0
4029
                                end3 = 10000
4030
                                start4 = -1
4031
                                end4 = -1
4032
4033
                            pos0 = start3
4034
4035
                        # check continuation rules for src[i+matches]
4036
                        if pos0 >= 0:
4037
                            while ((_phonet_rules[pos0] is None) or
4038
                                   (_phonet_rules[pos0][0] == char0)):
4039
                                if pos0 > end3:
4040
                                    if start4 > 0:
4041
                                        pos0 = start4
4042
                                        start3 = start4
4043
                                        start4 = -1
4044
                                        end3 = end4
4045
                                        end4 = -1
4046
                                        continue
4047
4048
                                    priority0 = -1
4049
4050
                                    # important
4051
                                    break
4052
4053
                                if (((_phonet_rules[pos0] is None) or
4054
                                     (_phonet_rules[pos0 + mode] is None))):
4055
                                    # no conversion rule available
4056
                                    pos0 += 3
4057
                                    continue
4058
4059
                                if trace:
4060
                                    _trinfo('> > continuation rule no.', pos0,
4061
                                            'is being checked', lang)
4062
4063
                                # check whole string
4064
                                matches0 = matches
4065
                                priority0 = 5
4066
                                rule = _phonet_rules[pos0]
4067
                                rule = rule[1:]
4068
4069
                                while (rule and
4070
                                       (src[i+matches0:i+matches0+1] ==
4071
                                        rule[0]) and
4072
                                       (not rule[0].isdigit() or
4073
                                        (rule in '(-<^$'))):
4074
                                    matches0 += 1
4075
                                    rule = rule[1:]
4076
4077
                                if rule and rule[0] == '(':
4078
                                    # check an array of letters
4079
                                    if ((src[i+matches0:i+matches0+1]
4080
                                         .isalpha() and
4081
                                         (src[i+matches0] in rule[1:]))):
4082
                                        matches0 += 1
4083
4084
                                        while rule and rule[0] != ')':
4085
                                            rule = rule[1:]
4086
4087
                                        # if rule[0] == ')':
4088
                                        rule = rule[1:]
4089
4090
                                while rule and rule[0] == '-':
4091
                                    # "matches0" is NOT decremented
4092
                                    # because of  "if (matches0 == matches)"
4093
                                    rule = rule[1:]
4094
4095
                                if rule and rule[0] == '<':
4096
                                    rule = rule[1:]
4097
4098
                                if rule and rule[0].isdigit():
4099
                                    priority0 = int(rule[0])
4100
                                    rule = rule[1:]
4101
4102
                                if (not rule or
4103
                                        # rule == '^' is not possible here
4104
                                        ((rule[0] == '$') and not
4105
                                         src[i+matches0:i+matches0+1]
4106
                                         .isalpha() and
4107
                                         (src[i+matches0:i+matches0+1]
4108
                                          != '.'))):
4109
                                    if matches0 == matches:
4110
                                        # this is only a partial string
4111
                                        if trace:
4112
                                            _trinfo('> > continuation ' +
4113
                                                    'rule no.',
4114
                                                    pos0,
4115
                                                    'not used (too short)',
4116
                                                    lang)
4117
4118
                                        pos0 += 3
4119
                                        continue
4120
4121
                                    if priority0 < priority:
4122
                                        # priority is too low
4123
                                        if trace:
4124
                                            _trinfo('> > continuation ' +
4125
                                                    'rule no.',
4126
                                                    pos0,
4127
                                                    'not used (priority)',
4128
                                                    lang)
4129
4130
                                        pos0 += 3
4131
                                        continue
4132
4133
                                    # continuation rule found
4134
                                    break
4135
4136
                                if trace:
4137
                                    _trinfo('> > continuation rule no.', pos0,
4138
                                            'not used', lang)
4139
4140
                                pos0 += 3
4141
4142
                            # end of "while"
4143
                            if ((priority0 >= priority) and
4144
                                    ((_phonet_rules[pos0] is not None) and
4145
                                     (_phonet_rules[pos0][0] == char0))):
4146
4147
                                if trace:
4148
                                    _trinfo('> rule no.', pos, '', lang)
4149
                                    _trinfo('> not used because of ' +
4150
                                            'continuation', pos0, '', lang)
4151
4152
                                pos += 3
4153
                                continue
4154
4155
                        # replace string
4156
                        if trace:
4157
                            _trinfo('Rule no.', pos, 'is applied', lang)
4158
4159
                        if ((_phonet_rules[pos] and
4160
                             ('<' in _phonet_rules[pos][1:]))):
4161
                            priority0 = 1
4162
                        else:
4163
                            priority0 = 0
4164
4165
                        rule = _phonet_rules[pos + mode]
4166
4167
                        if (priority0 == 1) and (zeta == 0):
4168
                            # rule with '<' is applied
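                            # (the replacement appears to be written back into
                            # ``src`` at the current position, so that it is
                            # re-scanned, rather than emitted directly to
                            # ``dest`` as in the branch below)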
4169
                            if ((j > 0) and rule and
4170
                                    ((dest[j-1] == char) or
4171
                                     (dest[j-1] == rule[0]))):
4172
                                j -= 1
4173
4174
                            zeta0 = 1
4175
                            zeta += 1
4176
                            matches0 = 0
4177
4178
                            while rule and src[i+matches0]:
4179
                                src = (src[0:i+matches0] + rule[0] +
4180
                                       src[i+matches0+1:])
4181
                                matches0 += 1
4182
                                rule = rule[1:]
4183
4184
                            if matches0 < matches:
4185
                                src = (src[0:i+matches0] +
4186
                                       src[i+matches:])
4187
4188
                            char = src[i]
4189
                        else:
4190
                            i = i + matches - 1
4191
                            zeta = 0
4192
4193
                            while len(rule) > 1:
4194
                                if (j == 0) or (dest[j - 1] != rule[0]):
4195
                                    dest = (dest[0:j] + rule[0] +
4196
                                            dest[min(len(dest), j+1):])
4197
                                    j += 1
4198
4199
                                rule = rule[1:]
4200
4201
                            # new "current char"
4202
                            if not rule:
4203
                                rule = ''
4204
                                char = ''
4205
                            else:
4206
                                char = rule[0]
4207
4208
                            if ((_phonet_rules[pos] and
4209
                                 '^^' in _phonet_rules[pos][1:])):
4210
                                if char:  # pragma: no branch
4211
                                    dest = (dest[0:j] + char +
4212
                                            dest[min(len(dest), j + 1):])
4213
                                    j += 1
4214
4215
                                src = src[i + 1:]
4216
                                i = 0
4217
                                zeta0 = 1
4218
4219
                        break
4220
4221
                    pos += 3
4222
4223
                    if pos > end1 and start2 > 0:
4224
                        pos = start2
4225
                        start1 = start2
4226
                        end1 = end2
4227
                        start2 = -1
4228
                        end2 = -1
4229
4230
            if zeta0 == 0:
4231
                if char and ((j == 0) or (dest[j-1] != char)):
4232
                    # delete multiple letters only
4233
                    dest = dest[0:j] + char + dest[min(j+1, term_length):]
4234
                    j += 1
4235
4236
                i += 1
4237
                zeta = 0
4238
4239
        dest = dest[0:j]
4240
4241
        return dest
4242
4243
    _initialize_phonet(lang)
4244
4245
    word = unicodedata.normalize('NFKC', text_type(word))
4246
    return _phonet(word, mode, lang, trace)
4247
4248
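
# Illustrative sketch (not part of the module): phonet() selects one of two
# rule sets via ``mode``; the parameter order below is assumed from the
# ``_phonet(word, mode, lang, trace)`` call at the end of phonet() above.
def _phonet_both_modes_example(word, lang='de'):
    """Return the first- and second-pass phonet codes as a tuple."""
    return phonet(word, 1, lang), phonet(word, 2, lang)
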
4249
def spfc(word):
4250
    """Return the Standardized Phonetic Frequency Code (SPFC) of a word.
4251
4252
    Standardized Phonetic Frequency Code is roughly Soundex-like.
4253
    This implementation is based on page 19-21 of
4254
    https://archive.org/stream/accessingindivid00moor#page/19/mode/1up
4255
4256
    :param str word: the word to transform
4257
    :returns: the SPFC value
4258
    :rtype: str
4259
4260
    >>> spfc('Christopher Smith')
4261
    '01160'
4262
    >>> spfc('Christopher Schmidt')
4263
    '01160'
4264
    >>> spfc('Niall Smith')
4265
    '01660'
4266
    >>> spfc('Niall Schmidt')
    '01660'
4267
4268
    >>> spfc('L.Smith')
4269
    '01960'
4270
    >>> spfc('R.Miller')
4271
    '65490'
4272
4273
    >>> spfc(('L', 'Smith'))
4274
    '01960'
4275
    >>> spfc(('R', 'Miller'))
4276
    '65490'
4277
    """
4278
    _pf1 = dict(zip((ord(_) for _ in 'SZCKQVFPUWABLORDHIEMNXGJT'),
4279
                    '0011112222334445556666777'))
4280
    _pf2 = dict(zip((ord(_) for _ in
4281
                     'SZCKQFPXABORDHIMNGJTUVWEL'),
4282
                    '0011122233445556677788899'))
4283
    _pf3 = dict(zip((ord(_) for _ in
4284
                     'BCKQVDTFLPGJXMNRSZAEHIOUWY'),
4285
                    '00000112223334456677777777'))
4286
4287
    _substitutions = (('DK', 'K'), ('DT', 'T'), ('SC', 'S'), ('KN', 'N'),
4288
                      ('MN', 'N'))
4289
4290
    def _raise_word_ex():
4291
        """Raise an AttributeError."""
4292
        raise AttributeError('word attribute must be a string with a space ' +
4293
                             'or period dividing the first and last names ' +
4294
                             'or a tuple/list consisting of the first and ' +
4295
                             'last names')
4296
4297
    if not word:
4298
        return ''
4299
4300
    if isinstance(word, (str, text_type)):
4301
        names = word.split('.', 1)
4302
        if len(names) != 2:
4303
            names = word.split(' ', 1)
4304
            if len(names) != 2:
4305
                _raise_word_ex()
4306
    elif hasattr(word, '__iter__'):
4307
        if len(word) != 2:
4308
            _raise_word_ex()
4309
        names = word
4310
    else:
4311
        _raise_word_ex()
4312
4313
    names = [unicodedata.normalize('NFKD', text_type(_.strip()
4314
                                                     .replace('ß', 'SS')
4315
                                                     .upper()))
4316
             for _ in names]
0 ignored issues
show
introduced by
The variable names does not seem to be defined for all execution paths.
Loading history...
4317
    code = ''
4318
4319
    def steps_one_to_three(name):
4320
        """Perform the first three steps of SPFC."""
4321
        # filter out non A-Z
4322
        name = ''.join(_ for _ in name if _ in
4323
                       {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K',
4324
                        'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V',
4325
                        'W', 'X', 'Y', 'Z'})
4326
4327
        # 1. In the field, convert DK to K, DT to T, SC to S, KN to N,
4328
        # and MN to N
4329
        for subst in _substitutions:
4330
            name = name.replace(subst[0], subst[1])
4331
4332
        # 2. In the name field, replace multiple letters with a single letter
4333
        name = _delete_consecutive_repeats(name)
4334
4335
        # 3. Remove vowels, W, H, and Y, but keep the first letter in the name
4336
        # field.
4337
        if name:
4338
            name = name[0] + ''.join(_ for _ in name[1:] if _ not in
4339
                                     {'A', 'E', 'H', 'I', 'O', 'U', 'W', 'Y'})
4340
        return name
4341
4342
    names = [steps_one_to_three(_) for _ in names]
4343
4344
    # 4. The first digit of the code is obtained using PF1 and the first letter
4345
    # of the name field. Remove this letter after coding.
4346
    if names[1]:
4347
        code += names[1][0].translate(_pf1)
4348
        names[1] = names[1][1:]
4349
4350
    # 5. Using the last letters of the name, use Table PF3 to obtain the
4351
    # second digit of the code. Use as many letters as possible and remove
4352
    # after coding.
4353
    if names[1]:
4354
        if names[1][-3:] == 'STN' or names[1][-3:] == 'PRS':
4355
            code += '8'
4356
            names[1] = names[1][:-3]
4357
        elif names[1][-2:] == 'SN':
4358
            code += '8'
4359
            names[1] = names[1][:-2]
4360
        elif names[1][-3:] == 'STR':
4361
            code += '9'
4362
            names[1] = names[1][:-3]
4363
        elif names[1][-2:] in {'SR', 'TN', 'TD'}:
4364
            code += '9'
4365
            names[1] = names[1][:-2]
4366
        elif names[1][-3:] == 'DRS':
4367
            code += '7'
4368
            names[1] = names[1][:-3]
4369
        elif names[1][-2:] in {'TR', 'MN'}:
4370
            code += '7'
4371
            names[1] = names[1][:-2]
4372
        else:
4373
            code += names[1][-1].translate(_pf3)
4374
            names[1] = names[1][:-1]
4375
4376
    # 6. The third digit is found using Table PF2 and the first character of
4377
    # the first name. Remove after coding.
4378
    if names[0]:
4379
        code += names[0][0].translate(_pf2)
4380
        names[0] = names[0][1:]
4381
4382
    # 7. The fourth digit is found using Table PF2 and the first character of
4383
    # the name field. If no letters remain use zero. After coding remove the
4384
    # letter.
4385
    # 8. The fifth digit is found in the same manner as the fourth using the
4386
    # remaining characters of the name field if any.
4387
    for _ in range(2):
4388
        if names[1]:
4389
            code += names[1][0].translate(_pf2)
4390
            names[1] = names[1][1:]
4391
        else:
4392
            code += '0'
4393
4394
    return code
4395
4396
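
# Illustrative sketch (not part of the module): SPFC codes a two-part personal
# name, so variant spellings can be blocked on equal codes.  The helper name
# is hypothetical; the doctest reuses values from the spfc() docstring above.
def _spfc_same_code_example(name1, name2):
    """Return True if two names receive the same SPFC code.

    >>> _spfc_same_code_example('Christopher Smith', 'Christopher Schmidt')
    True
    """
    return spfc(name1) == spfc(name2)
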
4397
def statistics_canada(word, maxlength=4):
4398
    """Return the Statistics Canada code for a word.
4399
4400
    The original description of this algorithm could not be located, and
4401
    may only have been specified in an unpublished TR. The coding does not
4402
    appear to be in use by Statistics Canada any longer. In its place, this is
4403
    an implementation of the "Census modified Statistics Canada name coding
4404
    procedure".
4405
4406
    The modified version of this algorithm is described in Appendix B of
4407
    Lynch, Billy T. and William L. Arends. `Selection of a Surname Coding
4408
    Procedure for the SRS Record Linkage System.` Statistical Reporting
4409
    Service, U.S. Department of Agriculture, Washington, D.C. February 1977.
4410
    https://naldc.nal.usda.gov/download/27833/PDF
4411
4412
    :param str word: the word to transform
4413
    :param int maxlength: the maximum length (default 4) of the code to return
4415
    :returns: the Statistics Canada name code value
4416
    :rtype: str
4417
4418
    >>> statistics_canada('Christopher')
4419
    'CHRS'
4420
    >>> statistics_canada('Niall')
4421
    'NL'
4422
    >>> statistics_canada('Smith')
4423
    'SMTH'
4424
    >>> statistics_canada('Schmidt')
4425
    'SCHM'
4426
    """
4427
    # uppercase, normalize, decompose, and filter non-A-Z out
4428
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
4429
    word = word.replace('ß', 'SS')
4430
    word = ''.join(c for c in word if c in
4431
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
4432
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
4433
                    'Y', 'Z'})
4434
    if not word:
4435
        return ''
4436
4437
    code = word[1:]
4438
    for vowel in {'A', 'E', 'I', 'O', 'U', 'Y'}:
4439
        code = code.replace(vowel, '')
4440
    code = word[0]+code
4441
    code = _delete_consecutive_repeats(code)
4442
    code = code.replace(' ', '')
4443
4444
    return code[:maxlength]
4445
4446
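
# Illustrative sketch (not part of the module): because the Statistics Canada
# code keeps consonant letters rather than Soundex-style digits, 'Smith'
# ('SMTH') and 'Schmidt' ('SCHM') stay distinct, as the docstring above shows.
# The helper name is hypothetical.
def _statistics_canada_same_code_example(name1, name2):
    """Return True if two names receive the same Statistics Canada code.

    >>> _statistics_canada_same_code_example('Smith', 'Schmidt')
    False
    """
    return statistics_canada(name1) == statistics_canada(name2)
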
4447
def lein(word, maxlength=4, zero_pad=True):
4448
    """Return the Lein code for a word.
4449
4450
    This is Lein name coding, based on
4451
    https://naldc.nal.usda.gov/download/27833/PDF
4452
4453
    :param str word: the word to transform
4454
    :param int maxlength: the maximum length (default 4) of the code to return
4455
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
4456
        maxlength string
4457
    :returns: the Lein code
4458
    :rtype: str
4459
4460
    >>> lein('Christopher')
4461
    'C351'
4462
    >>> lein('Niall')
4463
    'N300'
4464
    >>> lein('Smith')
4465
    'S210'
4466
    >>> lein('Schmidt')
4467
    'S521'
4468
    """
4469
    _lein_translation = dict(zip((ord(_) for _ in
4470
                                  'BCDFGJKLMNPQRSTVXZ'),
4471
                                 '451455532245351455'))
4472
4473
    # uppercase, normalize, decompose, and filter non-A-Z out
4474
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
4475
    word = word.replace('ß', 'SS')
4476
    word = ''.join(c for c in word if c in
4477
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
4478
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
4479
                    'Y', 'Z'})
4480
4481
    if not word:
4482
        return ''
4483
4484
    code = word[0]  # Rule 1
4485
    word = word[1:].translate({32: None, 65: None, 69: None, 72: None,
4486
                               73: None, 79: None, 85: None, 87: None,
4487
                               89: None})  # Rule 2
4488
    word = _delete_consecutive_repeats(word)  # Rule 3
4489
    code += word.translate(_lein_translation)  # Rule 4
4490
4491
    if zero_pad:
4492
        code += ('0'*maxlength)  # Rule 4
4493
4494
    return code[:maxlength]
4495
4496
4497
def roger_root(word, maxlength=5, zero_pad=True):
4498
    """Return the Roger Root code for a word.
4499
4500
    This is Roger Root name coding, based on
4501
    https://naldc.nal.usda.gov/download/27833/PDF
4502
4503
    :param str word: the word to transform
4504
    :param int maxlength: the maximum length (default 5) of the code to return
4505
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
4506
        maxlength string
4507
    :returns: the Roger Root code
4508
    :rtype: str
4509
4510
    >>> roger_root('Christopher')
4511
    '06401'
4512
    >>> roger_root('Niall')
4513
    '02500'
4514
    >>> roger_root('Smith')
4515
    '00310'
4516
    >>> roger_root('Schmidt')
4517
    '06310'
4518
    """
4519
    # uppercase, normalize, decompose, and filter non-A-Z out
4520
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
4521
    word = word.replace('ß', 'SS')
4522
    word = ''.join(c for c in word if c in
4523
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
4524
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
4525
                    'Y', 'Z'})
4526
4527
    if not word:
4528
        return ''
4529
4530
    # '*' is used to prevent combining by _delete_consecutive_repeats()
4531
    _init_patterns = {4: {'TSCH': '06'},
4532
                      3: {'TSH': '06', 'SCH': '06'},
4533
                      2: {'CE': '0*0', 'CH': '06', 'CI': '0*0', 'CY': '0*0',
4534
                          'DG': '07', 'GF': '08', 'GM': '03', 'GN': '02',
4535
                          'KN': '02', 'PF': '08', 'PH': '08', 'PN': '02',
4536
                          'SH': '06', 'TS': '0*0', 'WR': '04'},
4537
                      1: {'A': '1', 'B': '09', 'C': '07', 'D': '01', 'E': '1',
4538
                          'F': '08', 'G': '07', 'H': '2', 'I': '1', 'J': '3',
4539
                          'K': '07', 'L': '05', 'M': '03', 'N': '02', 'O': '1',
4540
                          'P': '09', 'Q': '07', 'R': '04', 'S': '0*0',
4541
                          'T': '01', 'U': '1', 'V': '08', 'W': '4', 'X': '07',
4542
                          'Y': '5', 'Z': '0*0'}}
4543
4544
    _med_patterns = {4: {'TSCH': '6'},
4545
                     3: {'TSH': '6', 'SCH': '6'},
4546
                     2: {'CE': '0', 'CH': '6', 'CI': '0', 'CY': '0', 'DG': '7',
4547
                         'PH': '8', 'SH': '6', 'TS': '0'},
4548
                     1: {'B': '9', 'C': '7', 'D': '1', 'F': '8', 'G': '7',
4549
                         'J': '6', 'K': '7', 'L': '5', 'M': '3', 'N': '2',
4550
                         'P': '9', 'Q': '7', 'R': '4', 'S': '0', 'T': '1',
4551
                         'V': '8', 'X': '7', 'Z': '0',
4552
                         'A': '*', 'E': '*', 'H': '*', 'I': '*', 'O': '*',
4553
                         'U': '*', 'W': '*', 'Y': '*'}}
4554
4555
    code = ''
4556
    pos = 0
4557
4558
    # Do first digit(s) first
4559
    for num in range(4, 0, -1):
4560
        if word[:num] in _init_patterns[num]:
4561
            code = _init_patterns[num][word[:num]]
4562
            pos += num
4563
            break
4564
    else:
4565
        pos += 1  # Advance if nothing is recognized
4566
4567
    # Then code subsequent digits
4568
    while pos < len(word):
4569
        for num in range(4, 0, -1):
4570
            if word[pos:pos+num] in _med_patterns[num]:
4571
                code += _med_patterns[num][word[pos:pos+num]]
4572
                pos += num
4573
                break
4574
        else:
4575
            pos += 1  # Advance if nothing is recognized
4576
4577
    code = _delete_consecutive_repeats(code)
4578
    code = code.replace('*', '')
4579
4580
    if zero_pad:
4581
        code += '0'*maxlength
4582
4583
    return code[:maxlength]
4584
4585
4586
def onca(word, maxlength=4, zero_pad=True):
4587
    """Return the Oxford Name Compression Algorithm (ONCA) code for a word.
4588
4589
    This is the Oxford Name Compression Algorithm, based on:
4590
    Gill, Leicester E. 1997. "OX-LINK: The Oxford Medical Record Linkage
4591
    System." In ``Record Linkage Techniques -- 1997``. Arlington, VA. March
4592
    20--21, 1997.
4593
    https://nces.ed.gov/FCSM/pdf/RLT97.pdf
4594
4595
    I can find no complete description of the "anglicised version of the NYSIIS
4596
    method" identified as the first step in this algorithm, so this is likely
4597
    not a correct implementation, in that it employs the standard NYSIIS
4598
    algorithm.
4599
4600
    :param str word: the word to transform
4601
    :param int maxlength: the maximum length (default 4) of the code to return
4602
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
4603
        maxlength string
4604
    :returns: the ONCA code
4605
    :rtype: str
4606
4607
    >>> onca('Christopher')
4608
    'C623'
4609
    >>> onca('Niall')
4610
    'N400'
4611
    >>> onca('Smith')
4612
    'S530'
4613
    >>> onca('Schmidt')
4614
    'S530'
4615
    """
4616
    # In the most extreme case, 3 characters of NYSIIS input can be compressed
4617
    # to one character of output, so give it triple the maxlength.
4618
    return soundex(nysiis(word, maxlength=maxlength*3), maxlength,
4619
                   zero_pad=zero_pad)
4620
4621
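
# Illustrative sketch (not part of the module): ONCA is simply NYSIIS followed
# by Soundex, so the same result can be obtained by chaining the two existing
# functions by hand with onca()'s default arguments.  The helper name is
# hypothetical; the doctest reuses a value from the onca() docstring above.
def _onca_pipeline_example(word):
    """Return soundex(nysiis(word)) exactly as onca() computes it.

    >>> _onca_pipeline_example('Christopher')
    'C623'
    """
    return soundex(nysiis(word, maxlength=12), 4, zero_pad=True)
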
4622
def eudex(word, maxlength=8):
4623
    """Return the eudex phonetic hash of a word.
4624
4625
    This implementation of eudex phonetic hashing is based on the specification
4626
    (not the reference implementation) at:
4627
    Ticki. 2017. "Eudex: A blazingly fast phonetic reduction/hashing
4628
    algorithm." https://docs.rs/crate/eudex
4629
4630
    Further details can be found at
4631
    http://ticki.github.io/blog/the-eudex-algorithm/
4632
4633
    :param str word: the word to transform
4634
    :param int maxlength: the length of the code returned (defaults to 8)
4635
    :returns: the eudex hash
4636
    :rtype: int
4637
    """
4638
    _trailing_phones = {
4639
        'a': 0,  # a
4640
        'b': 0b01001000,  # b
4641
        'c': 0b00001100,  # c
4642
        'd': 0b00011000,  # d
4643
        'e': 0,  # e
4644
        'f': 0b01000100,  # f
4645
        'g': 0b00001000,  # g
4646
        'h': 0b00000100,  # h
4647
        'i': 1,  # i
4648
        'j': 0b00000101,  # j
4649
        'k': 0b00001001,  # k
4650
        'l': 0b10100000,  # l
4651
        'm': 0b00000010,  # m
4652
        'n': 0b00010010,  # n
4653
        'o': 0,  # o
4654
        'p': 0b01001001,  # p
4655
        'q': 0b10101000,  # q
4656
        'r': 0b10100001,  # r
4657
        's': 0b00010100,  # s
4658
        't': 0b00011101,  # t
4659
        'u': 1,  # u
4660
        'v': 0b01000101,  # v
4661
        'w': 0b00000000,  # w
4662
        'x': 0b10000100,  # x
4663
        'y': 1,  # y
4664
        'z': 0b10010100,  # z
4665
4666
        'ß': 0b00010101,  # ß
4667
        'à': 0,  # à
4668
        'á': 0,  # á
4669
        'â': 0,  # â
4670
        'ã': 0,  # ã
4671
        'ä': 0,  # ä[æ]
4672
        'å': 1,  # å[oː]
4673
        'æ': 0,  # æ[æ]
4674
        'ç': 0b10010101,  # ç[t͡ʃ]
4675
        'è': 1,  # è
4676
        'é': 1,  # é
4677
        'ê': 1,  # ê
4678
        'ë': 1,  # ë
4679
        'ì': 1,  # ì
4680
        'í': 1,  # í
4681
        'î': 1,  # î
4682
        'ï': 1,  # ï
4683
        'ð': 0b00010101,  # ð[ð̠](represented as a non-plosive T)
4684
        'ñ': 0b00010111,  # ñ[nj](represented as a combination of n and j)
4685
        'ò': 0,  # ò
4686
        'ó': 0,  # ó
4687
        'ô': 0,  # ô
4688
        'õ': 0,  # õ
4689
        'ö': 1,  # ö[ø]
4690
        '÷': 0b11111111,  # ÷
4691
        'ø': 1,  # ø[ø]
4692
        'ù': 1,  # ù
4693
        'ú': 1,  # ú
4694
        'û': 1,  # û
4695
        'ü': 1,  # ü
4696
        'ý': 1,  # ý
4697
        'þ': 0b00010101,  # þ[ð̠](represented as a non-plosive T)
4698
        'ÿ': 1,  # ÿ
4699
    }
4700
4701
    _initial_phones = {
4702
        'a': 0b10000100,  # a*
4703
        'b': 0b00100100,  # b
4704
        'c': 0b00000110,  # c
4705
        'd': 0b00001100,  # d
4706
        'e': 0b11011000,  # e*
4707
        'f': 0b00100010,  # f
4708
        'g': 0b00000100,  # g
4709
        'h': 0b00000010,  # h
4710
        'i': 0b11111000,  # i*
4711
        'j': 0b00000011,  # j
4712
        'k': 0b00000101,  # k
4713
        'l': 0b01010000,  # l
4714
        'm': 0b00000001,  # m
4715
        'n': 0b00001001,  # n
4716
        'o': 0b10010100,  # o*
4717
        'p': 0b00100101,  # p
4718
        'q': 0b01010100,  # q
4719
        'r': 0b01010001,  # r
4720
        's': 0b00001010,  # s
4721
        't': 0b00001110,  # t
4722
        'u': 0b11100000,  # u*
4723
        'v': 0b00100011,  # v
4724
        'w': 0b00000000,  # w
4725
        'x': 0b01000010,  # x
4726
        'y': 0b11100100,  # y*
4727
        'z': 0b01001010,  # z
4728
4729
        'ß': 0b00001011,  # ß
4730
        'à': 0b10000101,  # à
4731
        'á': 0b10000101,  # á
4732
        'â': 0b10000000,  # â
4733
        'ã': 0b10000110,  # ã
4734
        'ä': 0b10100110,  # ä [æ]
4735
        'å': 0b11000010,  # å [oː]
4736
        'æ': 0b10100111,  # æ [æ]
4737
        'ç': 0b01010100,  # ç [t͡ʃ]
4738
        'è': 0b11011001,  # è
4739
        'é': 0b11011001,  # é
4740
        'ê': 0b11011001,  # ê
4741
        'ë': 0b11000110,  # ë [ə] or [œ]
4742
        'ì': 0b11111001,  # ì
4743
        'í': 0b11111001,  # í
4744
        'î': 0b11111001,  # î
4745
        'ï': 0b11111001,  # ï
4746
        'ð': 0b00001011,  # ð [ð̠] (represented as a non-plosive T)
4747
        'ñ': 0b00001011,  # ñ [nj] (represented as a combination of n and j)
4748
        'ò': 0b10010101,  # ò
4749
        'ó': 0b10010101,  # ó
4750
        'ô': 0b10010101,  # ô
4751
        'õ': 0b10010101,  # õ
4752
        'ö': 0b11011100,  # ö [œ] or [ø]
4753
        '÷': 0b11111111,  # ÷
4754
        'ø': 0b11011101,  # ø [œ] or [ø]
4755
        'ù': 0b11100001,  # ù
4756
        'ú': 0b11100001,  # ú
4757
        'û': 0b11100001,  # û
4758
        'ü': 0b11100101,  # ü
4759
        'ý': 0b11100101,  # ý
4760
        'þ': 0b00001011,  # þ [ð̠] (represented as a non-plosive T)
4761
        'ÿ': 0b11100101,  # ÿ
4762
    }
4763
    # Lowercase input & filter unknown characters
4764
    word = ''.join(char for char in word.lower() if char in _initial_phones)
4765
4766
    # Perform initial eudex coding of each character
4767
    values = [_initial_phones[word[0]]]
4768
    values += [_trailing_phones[char] for char in word[1:]]
4769
4770
    # Right-shift by one to determine if second instance should be skipped
4771
    shifted_values = [_ >> 1 for _ in values]
4772
    condensed_values = [values[0]]
4773
    for n in range(1, len(shifted_values)):
4774
        if shifted_values[n] != shifted_values[n-1]:
4775
            condensed_values.append(values[n])
4776
4777
    # Add padding after first character & trim beyond maxlength
4778
    values = ([condensed_values[0]] +
4779
              [0]*max(0, maxlength - len(condensed_values)) +
4780
              condensed_values[1:maxlength])
4781
4782
    # Combine individual character values into eudex hash
4783
    hash_value = 0
4784
    for val in values:
4785
        hash_value = (hash_value << 8) | val
4786
4787
    return hash_value
4788
4789
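
# Illustrative sketch (not part of the module): eudex hashes are meant to be
# compared by XOR-ing them, with fewer differing bits indicating greater
# phonetic similarity; an unweighted distance is the popcount of the XOR.
# The helper name is hypothetical.
def _eudex_hamming_example(word1, word2, maxlength=8):
    """Return the bitwise Hamming distance between two eudex hashes.

    >>> _eudex_hamming_example('Colin', 'Colin')
    0
    """
    return bin(eudex(word1, maxlength) ^ eudex(word2, maxlength)).count('1')
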
4790
def haase_phonetik(word):
4791
    """Return the Haase Phonetik (numeric output) code for a word.
4792
4793
    Based on the algorithm described at
4794
    https://github.com/elastic/elasticsearch/blob/master/plugins/analysis-phonetic/src/main/java/org/elasticsearch/index/analysis/phonetic/HaasePhonetik.java
4795
4796
    While the output code is numeric, it is still a str.
4797
4798
    :param str word: the word to transform
4799
    :returns: the Haase Phonetik value as a numeric string
4800
    :rtype: str
4801
    """
4802
    def _after(word, i, letters):
4803
        """Return True if word[i] follows one of the supplied letters."""
4804
        if i > 0 and word[i-1] in letters:
4805
            return True
4806
        return False
4807
4808
    def _before(word, i, letters):
4809
        """Return True if word[i] precedes one of the supplied letters."""
4810
        if i+1 < len(word) and word[i+1] in letters:
4811
            return True
4812
        return False
4813
4814
    _vowels = {'A', 'E', 'I', 'J', 'O', 'U', 'Y'}
4815
4816
    sdx = ''
4817
4818
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
4819
    word = word.replace('ß', 'SS')
4820
4821
    word = word.replace('Ä', 'AE')
4822
    word = word.replace('Ö', 'OE')
4823
    word = word.replace('Ü', 'UE')
4824
    word = ''.join(c for c in word if c in
4825
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
4826
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
4827
                    'Y', 'Z'})
4828
4829
    # Nothing to convert, return base case
4830
    if not word:
4831
        return sdx
4832
4833
    word = word.replace('AUN', 'OWN')
4834
    word = word.replace('RB', 'RW')
4835
    word = word.replace('WSK', 'RSK')
4836
    if word[-1] == 'A':
4837
        word = word[:-1]+'AR'
4838
    if word[-1] == 'O':
4839
        word = word[:-1]+'OW'
4840
    word = word.replace('SCH', 'CH')
4841
    word = word.replace('GLI', 'LI')
4842
    if word[-3:] == 'EAU':
4843
        word = word[:-3]+'O'
4844
    if word[:2] == 'CH':
4845
        word = 'SCH'+word[2:]
4846
    word = word.replace('AUX', 'O')
4847
    word = word.replace('EUX', 'O')
4848
    word = word.replace('ILLE', 'I')
4849
4850
    for i in range(len(word)):
4851
        if word[i] in _vowels:
4852
            sdx += '9'
4853
        elif word[i] == 'B':
4854
            sdx += '1'
4855
        elif word[i] == 'P':
4856
            if _before(word, i, {'H'}):
4857
                sdx += '3'
4858
            else:
4859
                sdx += '1'
4860
        elif word[i] in {'D', 'T'}:
4861
            if _before(word, i, {'C', 'S', 'Z'}):
4862
                sdx += '8'
4863
            else:
4864
                sdx += '2'
4865
        elif word[i] in {'F', 'V', 'W'}:
4866
            sdx += '3'
4867
        elif word[i] in {'G', 'K', 'Q'}:
4868
            sdx += '4'
4869
        elif word[i] == 'C':
4870
            if _after(word, i, {'S', 'Z'}):
4871
                sdx += '8'
4872
            elif i == 0:
4873
                if _before(word, i, {'A', 'H', 'K', 'L', 'O', 'Q', 'R', 'U',
4874
                                     'X'}):
4875
                    sdx += '4'
4876
                else:
4877
                    sdx += '8'
4878
            elif _before(word, i, {'A', 'H', 'K', 'O', 'Q', 'U', 'X'}):
4879
                sdx += '4'
4880
            else:
4881
                sdx += '8'
4882
        elif word[i] == 'X':
4883
            if _after(word, i, {'C', 'K', 'Q'}):
4884
                sdx += '8'
4885
            else:
4886
                sdx += '48'
4887
        elif word[i] == 'L':
4888
            sdx += '5'
4889
        elif word[i] in {'M', 'N'}:
4890
            sdx += '6'
4891
        elif word[i] == 'R':
4892
            sdx += '7'
4893
        elif word[i] in {'S', 'Z'}:
4894
            sdx += '8'
4895
4896
    sdx = _delete_consecutive_repeats(sdx)
4897
4898
    if sdx:
4899
        sdx = sdx[0] + sdx[1:].replace('9', '')
4900
4901
    return sdx
4902
4903
4904
def reth_schek_phonetik(word):
4905
    """Return Reth-Schek Phonetik code for a word.
4906
4907
    This algorithm is proposed in:
4908
    von Reth, Hans-Peter and Schek, Hans-Jörg. 1977. "Eine Zugriffsmethode für
4909
    die phonetische Ähnlichkeitssuche." Heidelberg Scientific Center technical
4910
    reports 77.03.002. IBM Deutschland GmbH.
4911
4912
    Since I couldn't secure a copy of that document (maybe I'll look for it
4913
    next time I'm in Germany), this implementation is based on what I could
4914
    glean from the implementations published by German Record Linkage
4915
    Center (www.record-linkage.de):
4916
    - Privacy-preserving Record Linkage (PPRL) (in R)
4917
    - Merge ToolBox (in Java)
4918
4919
    Rules that are unclear:
4920
    - Should 'C' become 'G' or 'Z'? (PPRL has both, 'Z' rule blocked)
4921
    - Should 'CC' become 'G'? (PPRL has blocked 'CK' that may be typo)
4922
    - Should 'TUI' -> 'ZUI' rule exist? (PPRL has rule, but I can't
4923
        think of a German word with '-tui-' in it.)
4924
    - Should we really change 'SCH' -> 'CH' and then 'CH' -> 'SCH'?
4925
4926
    :param str word: the word to transform
    :returns: the Reth-Schek Phonetik code
    :rtype: str
4928
    """
4929
    replacements = {3: {'AEH': 'E', 'IEH': 'I', 'OEH': 'OE', 'UEH': 'UE',
4930
                        'SCH': 'CH', 'ZIO': 'TIO', 'TIU': 'TIO', 'ZIU': 'TIO',
4931
                        'CHS': 'X', 'CKS': 'X', 'AEU': 'OI'},
4932
                    2: {'LL': 'L', 'AA': 'A', 'AH': 'A', 'BB': 'B', 'PP': 'B',
4933
                        'BP': 'B', 'PB': 'B', 'DD': 'D', 'DT': 'D', 'TT': 'D',
4934
                        'TH': 'D', 'EE': 'E', 'EH': 'E', 'AE': 'E', 'FF': 'F',
4935
                        'PH': 'F', 'KK': 'K', 'GG': 'G', 'GK': 'G', 'KG': 'G',
4936
                        'CK': 'G', 'CC': 'C', 'IE': 'I', 'IH': 'I', 'MM': 'M',
4937
                        'NN': 'N', 'OO': 'O', 'OH': 'O', 'SZ': 'S', 'UH': 'U',
4938
                        'GS': 'X', 'KS': 'X', 'TZ': 'Z', 'AY': 'AI',
4939
                        'EI': 'AI', 'EY': 'AI', 'EU': 'OI', 'RR': 'R',
4940
                        'SS': 'S', 'KW': 'QU'},
4941
                    1: {'P': 'B', 'T': 'D', 'V': 'F', 'W': 'F', 'C': 'G',
4942
                        'K': 'G', 'Y': 'I'}}
4943
4944
    # Uppercase
4945
    word = word.upper()
4946
4947
    # Replace umlauts/eszett
4948
    word = word.replace('Ä', 'AE')
4949
    word = word.replace('Ö', 'OE')
4950
    word = word.replace('Ü', 'UE')
4951
    word = word.replace('ß', 'SS')
4952
4953
    # Main loop, using above replacements table
4954
    pos = 0
4955
    while pos < len(word):
4956
        for num in range(3, 0, -1):
4957
            if word[pos:pos+num] in replacements[num]:
4958
                word = (word[:pos] + replacements[num][word[pos:pos+num]]
4959
                        + word[pos+num:])
4960
                pos += 1
4961
                break
4962
        else:
4963
            pos += 1  # Advance if nothing is recognized
4964
4965
    # Change 'CH' back(?) to 'SCH'
4966
    word = word.replace('CH', 'SCH')
4967
4968
    # Replace final sequences
4969
    if word[-2:] == 'ER':
4970
        word = word[:-2]+'R'
4971
    elif word[-2:] == 'EL':
4972
        word = word[:-2]+'L'
4973
    elif word[-1] == 'H':
4974
        word = word[:-1]
4975
4976
    return word
4977
4978
4979
def fonem(word):
4980
    """Return the FONEM code of a word.
4981
4982
    FONEM is a phonetic algorithm designed for French (particularly surnames in
4983
    Saguenay, Canada), defined in:
4984
    Bouchard, Gérard, Patrick Brard, and Yolande Lavoie. 1981. "FONEM: Un code
4985
    de transcription phonétique pour la reconstitution automatique des
4986
    familles saguenayennes." Population. 36(6). 1085--1103.
4987
    https://doi.org/10.2307/1532326
4988
    http://www.persee.fr/doc/pop_0032-4663_1981_num_36_6_17248
4989
4990
    Guillaume Plique's Javascript implementation at
4991
    https://github.com/Yomguithereal/talisman/blob/master/src/phonetics/french/fonem.js
4992
    was also consulted for this implementation.
4993
4994
    :param str word: the word to transform
4995
    :returns: the FONEM code
4996
    :rtype: str
4997
    """
4998
    # I don't see a sane way of doing this without regexps :(
4999
    rule_table = {
5000
        # Vowels & groups of vowels
5001
        'V-1':     (re.compile('E?AU'), 'O'),
5002
        'V-2,5':   (re.compile('(E?AU|O)L[TX]$'), 'O'),
5003
        'V-3,4':   (re.compile('E?AU[TX]$'), 'O'),
5004
        'V-6':     (re.compile('E?AUL?D$'), 'O'),
5005
        'V-7':     (re.compile(r'(?<!G)AY$'), 'E'),
5006
        'V-8':     (re.compile('EUX$'), 'EU'),
5007
        'V-9':     (re.compile('EY(?=$|[BCDFGHJKLMNPQRSTVWXZ])'), 'E'),
5008
        'V-10':    ('Y', 'I'),
5009
        'V-11':    (re.compile('(?<=[AEIOUY])I(?=[AEIOUY])'), 'Y'),
5010
        'V-12':    (re.compile('(?<=[AEIOUY])ILL'), 'Y'),
5011
        'V-13':    (re.compile('OU(?=[AEOU]|I(?!LL))'), 'W'),
5012
        'V-14':    (re.compile(r'([AEIOUY])(?=\1)'), ''),
5013
        # Nasal vowels
5014
        'V-15':    (re.compile('[AE]M(?=[BCDFGHJKLMPQRSTVWXZ])(?!$)'), 'EN'),
5015
        'V-16':    (re.compile('OM(?=[BCDFGHJKLMPQRSTVWXZ])'), 'ON'),
5016
        'V-17':    (re.compile('AN(?=[BCDFGHJKLMNPQRSTVWXZ])'), 'EN'),
5017
        'V-18':    (re.compile('(AI[MN]|EIN)(?=[BCDFGHJKLMNPQRSTVWXZ]|$)'),
5018
                    'IN'),
5019
        'V-19':    (re.compile('B(O|U|OU)RNE?$'), 'BURN'),
5020
        'V-20':    (re.compile('(^IM|(?<=[BCDFGHJKLMNPQRSTVWXZ])IM(?=[BCDFGHJKLMPQRSTVWXZ]))'),
5021
                    'IN'),
5022
        # Consonants and groups of consonants
5023
        'C-1':     ('BV', 'V'),
5024
        'C-2':     (re.compile('(?<=[AEIOUY])C(?=[EIY])'), 'SS'),
5025
        'C-3':     (re.compile('(?<=[BDFGHJKLMNPQRSTVWZ])C(?=[EIY])'), 'S'),
5026
        'C-4':     (re.compile('^C(?=[EIY])'), 'S'),
5027
        'C-5':     (re.compile('^C(?=[OUA])'), 'K'),
5028
        'C-6':     (re.compile('(?<=[AEIOUY])C$'), 'K'),
5029
        'C-7':     (re.compile('C(?=[BDFGJKLMNPQRSTVWXZ])'), 'K'),
5030
        'C-8':     (re.compile('CC(?=[AOU])'), 'K'),
5031
        'C-9':     (re.compile('CC(?=[EIY])'), 'X'),
5032
        'C-10':    (re.compile('G(?=[EIY])'), 'J'),
5033
        'C-11':    (re.compile('GA(?=I?[MN])'), 'G#'),
5034
        'C-12':    (re.compile('GE(O|AU)'), 'JO'),
5035
        'C-13':    (re.compile('GNI(?=[AEIOUY])'), 'GN'),
5036
        'C-14':    (re.compile('(?<![PCS])H'), ''),
5037
        'C-15':    ('JEA', 'JA'),
5038
        'C-16':    (re.compile('^MAC(?=[BCDFGHJKLMNPQRSTVWXZ])'), 'MA#'),
5039
        'C-17':    (re.compile('^MC'), 'MA#'),
5040
        'C-18':    ('PH', 'F'),
5041
        'C-19':    ('QU', 'K'),
5042
        'C-20':    (re.compile('^SC(?=[EIY])'), 'S'),
5043
        'C-21':    (re.compile('(?<=.)SC(?=[EIY])'), 'SS'),
5044
        'C-22':    (re.compile('(?<=.)SC(?=[AOU])'), 'SK'),
5045
        'C-23':    ('SH', 'CH'),
5046
        'C-24':    (re.compile('TIA$'), 'SSIA'),
5047
        'C-25':    (re.compile('(?<=[AIOUY])W'), ''),
5048
        'C-26':    (re.compile('X[CSZ]'), 'X'),
5049
        'C-27':    (re.compile('(?<=[AEIOUY])Z|(?<=[BCDFGHJKLMNPQRSTVWXZ])Z' +
                               '(?=[BCDFGHJKLMNPQRSTVWXZ])'), 'S'),
5050
        'C-28':    (re.compile(r'([BDFGHJKMNPQRTVWXZ])\1'), r'\1'),
5051
        'C-28a':   (re.compile('CC(?=[BCDFGHJKLMNPQRSTVWXZ]|$)'), 'C'),
5052
        'C-28b':   (re.compile('((?<=[BCDFGHJKLMNPQRSTVWXZ])|^)SS'), 'S'),
5053
        'C-28bb':  (re.compile('SS(?=[BCDFGHJKLMNPQRSTVWXZ]|$)'), 'S'),
5054
        'C-28c':   (re.compile('((?<=[^I])|^)LL'), 'L'),
5055
        'C-28d':   (re.compile('ILE$'), 'ILLE'),
5056
        'C-29':    (re.compile('(ILS|[CS]H|[MN]P|R[CFKLNSX])$|' +
                               '([BCDFGHJKLMNPQRSTVWXZ])' +
                               '[BCDFGHJKLMNPQRSTVWXZ]$'), r'\1\2'),
5057
        'C-30,32': (re.compile('^(SA?INT?|SEI[NM]|CINQ?|ST)(?!E)-?'), 'ST-'),
5058
        'C-31,33': (re.compile('^(SAINTE|STE)-?'), 'STE-'),
5059
        # Rules to undo rule bleeding prevention in C-11, C-16, C-17
5060
        'C-34':    ('G#', 'GA'),
5061
        'C-35':    ('MA#', 'MAC')
5062
    }
5063
    rule_order = [
5064
        'V-14', 'C-28', 'C-28a', 'C-28b', 'C-28bb', 'C-28c', 'C-28d',
5065
        'C-12',
5066
        'C-8', 'C-9', 'C-10',
5067
        'C-16', 'C-17', 'C-2', 'C-3', 'C-7',
5068
        'V-2,5', 'V-3,4', 'V-6',
5069
        'V-1', 'C-14',
5070
        'C-31,33', 'C-30,32',
5071
        'C-11', 'V-15', 'V-17', 'V-18',
5072
        'V-7', 'V-8', 'V-9', 'V-10', 'V-11', 'V-12', 'V-13', 'V-16',
5073
        'V-19', 'V-20',
5074
        'C-1', 'C-4', 'C-5', 'C-6', 'C-13', 'C-15',
5075
        'C-18', 'C-19', 'C-20', 'C-21', 'C-22', 'C-23', 'C-24',
5076
        'C-25', 'C-26', 'C-27',
5077
        'C-29',
5078
        'V-14', 'C-28', 'C-28a', 'C-28b', 'C-28bb', 'C-28c', 'C-28d',
5079
        'C-34', 'C-35'
5080
    ]
5081
5082
    # normalize, upper-case, and filter non-French letters
5083
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
5084
    word = word.translate({198: 'AE', 338: 'OE'})
5085
    word = ''.join(c for c in word if c in
5086
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
5087
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
5088
                    'Y', 'Z', '-'})
5089
5090
    for rule in rule_order:
5091
        regex, repl = rule_table[rule]
5092
        if isinstance(regex, text_type):
5093
            word = word.replace(regex, repl)
5094
        else:
5095
            word = regex.sub(repl, word)
5096
        # print(rule, word)
5097
5098
    return word
5099
5100
5101
def parmar_kumbharana(word):
5102
    """Return the Parmar-Kumbharana encoding of a word.
5103
5104
    This is based on the phonetic algorithm proposed in
5105
    Parmar, Vimal P. and CK Kumbharana. 2014. "Study Existing Various Phonetic
5106
    Algorithms and Designing and Development of a working model for the New
5107
    Developed Algorithm and Comparison by implementing it with Existing
5108
    Algorithm(s)." International Journal of Computer Applications. 98(19).
5109
    https://doi.org/10.5120/17295-7795
5110
5111
    :param str word: the word to transform
    :returns: the Parmar-Kumbharana encoding
    :rtype: str
5113
    """
5114
    rule_table = {4: {'OUGH': 'F'},
5115
                  3: {'DGE': 'J',
5116
                      'OUL': 'U',
5117
                      'GHT': 'T'},
5118
                  2: {'CE': 'S', 'CI': 'S', 'CY': 'S',
5119
                      'GE': 'J', 'GI': 'J', 'GY': 'J',
5120
                      'WR': 'R',
5121
                      'GN': 'N', 'KN': 'N', 'PN': 'N',
5122
                      'CK': 'K',
5123
                      'SH': 'S'}}
5124
    vowel_trans = {65: '', 69: '', 73: '', 79: '', 85: '', 89: ''}
5125
5126
    word = word.upper()  # Rule 3
5127
    word = _delete_consecutive_repeats(word)  # Rule 4
5128
5129
    # Rule 5
5130
    i = 0
5131
    while i < len(word):
5132
        for match_len in range(4, 1, -1):
5133
            if word[i:i+match_len] in rule_table[match_len]:
5134
                repl = rule_table[match_len][word[i:i+match_len]]
5135
                word = (word[:i] + repl + word[i+match_len:])
5136
                i += len(repl)
                break
5137
        else:
5138
            i += 1
5139
5140
    word = word[0]+word[1:].translate(vowel_trans)  # Rule 6
5141
    return word
5142
5143
5144
def bmpm(word, language_arg=0, name_mode='gen', match_mode='approx',
5145
         concat=False, filter_langs=False):
5146
    """Return the Beider-Morse Phonetic Matching algorithm code for a word.
5147
5148
    The Beider-Morse Phonetic Matching algorithm is described at:
5149
    http://stevemorse.org/phonetics/bmpm.htm
5150
    The reference implementation is licensed under GPLv3 and available at:
5151
    http://stevemorse.org/phoneticinfo.htm
5152
5153
    :param str word: the word to transform
5154
    :param str language_arg: the language of the term; supported values
5155
        include:
5156
5157
            - 'any'
5158
            - 'arabic'
5159
            - 'cyrillic'
5160
            - 'czech'
5161
            - 'dutch'
5162
            - 'english'
5163
            - 'french'
5164
            - 'german'
5165
            - 'greek'
5166
            - 'greeklatin'
5167
            - 'hebrew'
5168
            - 'hungarian'
5169
            - 'italian'
5170
            - 'polish'
5171
            - 'portuguese'
5172
            - 'romanian'
5173
            - 'russian'
5174
            - 'spanish'
5175
            - 'turkish'
5176
            - 'germandjsg'
5177
            - 'polishdjskp'
5178
            - 'russiandjsre'
5179
5180
    :param str name_mode: the name mode of the algorithm:
5181
5182
            - 'gen' -- general (default)
5183
            - 'ash' -- Ashkenazi
5184
            - 'sep' -- Sephardic
5185
5186
    :param str match_mode: matching mode: 'approx' or 'exact'
5187
    :param bool concat: concatenation mode
5188
    :param bool filter_langs: filter out incompatible languages
5189
    :returns: the BMPM value(s)
5190
    :rtype: str
5191
5192
    >>> bmpm('Christopher')
5193
    'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
5194
    xristYfir xristopi xritopir xritopi xristofi xritofir xritofi tzristopir
5195
    tzristofir zristopir zristopi zritopir zritopi zristofir zristofi zritofir
5196
    zritofi'
5197
    >>> bmpm('Niall')
5198
    'nial niol'
5199
    >>> bmpm('Smith')
5200
    'zmit'
5201
    >>> bmpm('Schmidt')
5202
    'zmit stzmit'
5203
5204
    >>> bmpm('Christopher', language_arg='German')
5205
    'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
5206
    xristYfir'
5207
    >>> bmpm('Christopher', language_arg='English')
5208
    'tzristofir tzrQstofir tzristafir tzrQstafir xristofir xrQstofir xristafir
5209
    xrQstafir'
5210
    >>> bmpm('Christopher', language_arg='German', name_mode='ash')
5211
    'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
5212
    xristYfir'
5213
5214
    >>> bmpm('Christopher', language_arg='German', match_mode='exact')
5215
    'xriStopher xriStofer xristopher xristofer'
5216
    """
5217
    return _bmpm(word, language_arg, name_mode, match_mode,
5218
                 concat, filter_langs)
5219
5220
5221
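
# Illustrative sketch (not part of the module): since bmpm() returns its codes
# as a space-delimited string, two names can be treated as candidate matches
# when their code sets intersect.  The helper name is hypothetical; the
# doctest reuses values from the bmpm() docstring above.
def _bmpm_match_example(name1, name2):
    """Return True if two names share at least one BMPM code.

    >>> _bmpm_match_example('Smith', 'Schmidt')
    True
    """
    return bool(set(bmpm(name1).split()) & set(bmpm(name2).split()))
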
if __name__ == '__main__':
5222
    import doctest
5223
    doctest.testmod()
5224