Completed
Push — master ( b86da4...472d2c )
created by Chris at 11:30

abydos.phonetic.phonix() — rated F

Complexity

Conditions 24

Size

Total Lines 209
Code Lines 160

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric    Value
cc        24
eloc      160
nop       3
dl        0
loc       209
rs        0
c         0
b         0
f         0

How to fix

Long Method

Small methods make your code easier to understand, particularly when combined with a good name. Moreover, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, that is usually a good sign that the commented part should be extracted into a new method, with the comment serving as a starting point for its name.

The most commonly applied refactoring here is Extract Method, sketched below.
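As a minimal sketch of this idea (hypothetical code written for illustration, not taken from the abydos module), each comment in a long method becomes the name of an extracted helper:

    # Before: one long function whose steps are only explained by comments
    def encode(word):
        # normalize: uppercase and keep only letters
        word = ''.join(c for c in word.upper() if c.isalpha())
        # collapse runs of repeated characters
        out = ''
        for ch in word:
            if not out or out[-1] != ch:
                out += ch
        return out

    # After (Extract Method): the comments turn into method names
    def _normalize(word):
        """Uppercase the word and keep only letters."""
        return ''.join(c for c in word.upper() if c.isalpha())

    def _collapse_repeats(word):
        """Collapse runs of repeated characters to a single instance."""
        out = ''
        for ch in word:
            if not out or out[-1] != ch:
                out += ch
        return out

    def encode(word):
        return _collapse_repeats(_normalize(word))

The call site now reads as a pipeline of named steps, and each helper is short enough to name precisely.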

Complexity

Complex classes and functions like abydos.phonetic.phonix() often do a lot of different things. To break such a unit down, we need to identify a cohesive component within it. A common approach to finding such a component is to look for fields and methods that share the same prefixes or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring, as in the sketch below. If the component makes sense as a subclass, Extract Subclass is also a candidate, and is often faster.
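For illustration (again a hypothetical sketch, not code from abydos), members sharing a cache_ prefix hint that a separate class is hiding inside:

    # Before: one class mixing matching logic with caching concerns
    class NameMatcher:
        def __init__(self):
            self.cache_hits = 0
            self.cache_store = {}

        def cache_get(self, key):
            if key in self.cache_store:
                self.cache_hits += 1
            return self.cache_store.get(key)

        def match(self, a, b):
            return a.upper() == b.upper()

    # After (Extract Class): the cache_* members move into their own class
    class Cache:
        def __init__(self):
            self.hits = 0
            self.store = {}

        def get(self, key):
            if key in self.store:
                self.hits += 1
            return self.store.get(key)

    class NameMatcher:
        def __init__(self):
            self.cache = Cache()

        def match(self, a, b):
            return a.upper() == b.upper()

NameMatcher keeps only its matching responsibility, while the extracted Cache can be tested and reused on its own.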

1
# -*- coding: utf-8 -*-
Issue (coding-style): Too many lines in module (5537/1000)
2
3
# Copyright 2014-2018 by Christopher C. Little.
4
# This file is part of Abydos.
5
#
6
# Abydos is free software: you can redistribute it and/or modify
7
# it under the terms of the GNU General Public License as published by
8
# the Free Software Foundation, either version 3 of the License, or
9
# (at your option) any later version.
10
#
11
# Abydos is distributed in the hope that it will be useful,
12
# but WITHOUT ANY WARRANTY; without even the implied warranty of
13
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
# GNU General Public License for more details.
15
#
16
# You should have received a copy of the GNU General Public License
17
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.
18
19
"""abydos.phonetic.
20
21
The phonetic module implements phonetic algorithms including:
22
23
    - Robert C. Russell's Index
24
    - American Soundex
25
    - Refined Soundex
26
    - Daitch-Mokotoff Soundex
27
    - Kölner Phonetik
28
    - NYSIIS
29
    - Match Rating Algorithm
30
    - Metaphone
31
    - Double Metaphone
32
    - Caverphone
33
    - Alpha Search Inquiry System
34
    - Fuzzy Soundex
35
    - Phonex
36
    - Phonem
37
    - Phonix
38
    - SfinxBis
39
    - phonet
40
    - Standardized Phonetic Frequency Code
41
    - Statistics Canada
42
    - Lein
43
    - Roger Root
44
    - Oxford Name Compression Algorithm (ONCA)
45
    - Eudex phonetic hash
46
    - Haase Phonetik
47
    - Reth-Schek Phonetik
48
    - FONEM
49
    - Parmar-Kumbharana
50
    - Davidson's Consonant Code
51
    - SoundD
52
    - Beider-Morse Phonetic Matching
53
"""
54
55
from __future__ import division, unicode_literals
56
57
import re
58
import unicodedata
59
from collections import Counter
60
from itertools import groupby, product
61
62
from six import text_type
63
from six.moves import range
64
65
from ._bm import _bmpm
66
67
_INFINITY = float('inf')
68
69
70
def _delete_consecutive_repeats(word):
71
    """Delete consecutive repeated characters in a word.
72
73
    :param str word: the word to transform
74
    :returns: word with consecutive repeating characters collapsed to
75
        a single instance
76
    :rtype: str
77
    """
78
    return ''.join(char for char, _ in groupby(word))
79
80
81
def russell_index(word):
82
    """Return the Russell Index (integer output) of a word.
83
84
    This follows Robert C. Russell's Index algorithm, as described in
85
    US Patent 1,261,167 (1917)
86
87
    :param str word: the word to transform
88
    :returns: the Russell Index value
89
    :rtype: int
90
91
    >>> russell_index('Christopher')
92
    3813428
93
    >>> russell_index('Niall')
94
    715
95
    >>> russell_index('Smith')
96
    3614
97
    >>> russell_index('Schmidt')
98
    3614
99
    """
100
    _russell_translation = dict(zip((ord(_) for _ in
Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
101
                                     'ABCDEFGIKLMNOPQRSTUVXYZ'),
102
                                    '12341231356712383412313'))
103
104
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
105
    word = word.replace('ß', 'SS')
106
    word = word.replace('GH', '')  # discard gh (rule 3)
107
    word = word.rstrip('SZ')  # discard /[sz]$/ (rule 3)
108
109
    # translate according to Russell's mapping
110
    word = ''.join(c for c in word if c in
111
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'I', 'K', 'L', 'M', 'N',
112
                    'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Y', 'Z'})
113
    sdx = word.translate(_russell_translation)
114
115
    # remove any 1s after the first occurrence
116
    one = sdx.find('1')+1
117
    if one:
118
        sdx = sdx[:one] + ''.join(c for c in sdx[one:] if c != '1')
119
120
    # remove repeating characters
121
    sdx = _delete_consecutive_repeats(sdx)
122
123
    # return as an int
124
    return int(sdx) if sdx else float('NaN')
125
126
127
def russell_index_num_to_alpha(num):
128
    """Convert the Russell Index integer to an alphabetic string.
129
130
    This follows Robert C. Russell's Index algorithm, as described in
131
    US Patent 1,261,167 (1917)
132
133
    :param int num: a Russell Index integer value
134
    :returns: the Russell Index as an alphabetic string
135
    :rtype: str
136
137
    >>> russell_index_num_to_alpha(3813428)
138
    'CRACDBR'
139
    >>> russell_index_num_to_alpha(715)
140
    'NAL'
141
    >>> russell_index_num_to_alpha(3614)
142
    'CMAD'
143
    """
144
    _russell_num_translation = dict(zip((ord(_) for _ in '12345678'),
Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
145
                                        'ABCDLMNR'))
146
    num = ''.join(c for c in text_type(num) if c in {'1', '2', '3', '4', '5',
147
                                                     '6', '7', '8'})
148
    if num:
149
        return num.translate(_russell_num_translation)
150
    return ''
151
152
153
def russell_index_alpha(word):
154
    """Return the Russell Index (alphabetic output) for the word.
155
156
    This follows Robert C. Russell's Index algorithm, as described in
157
    US Patent 1,261,167 (1917)
158
159
    :param str word: the word to transform
160
    :returns: the Russell Index value as an alphabetic string
161
    :rtype: str
162
163
    >>> russell_index_alpha('Christopher')
164
    'CRACDBR'
165
    >>> russell_index_alpha('Niall')
166
    'NAL'
167
    >>> russell_index_alpha('Smith')
168
    'CMAD'
169
    >>> russell_index_alpha('Schmidt')
170
    'CMAD'
171
    """
172
    if word:
173
        return russell_index_num_to_alpha(russell_index(word))
174
    return ''
175
176
177
def soundex(word, maxlength=4, var='American', reverse=False, zero_pad=True):
178
    """Return the Soundex code for a word.
179
180
    :param str word: the word to transform
181
    :param int maxlength: the length of the code returned (defaults to 4)
182
    :param str var: the variant of the algorithm to employ (defaults to
183
        'American'):
184
185
        - 'American' follows the American Soundex algorithm, as described at
186
          http://www.archives.gov/publications/general-info-leaflets/55-census.html
187
          and in Knuth(1998:394); this is also called Miracode
188
        - 'special' follows the rules from the 1880-1910 US Census
189
          retrospective re-analysis, in which h & w are not treated as blocking
190
          consonants but as vowels.
191
          Cf. http://creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm
192
        - 'Census' follows the rules laid out in GIL 55 by the US Census,
193
          including coding prefixed and unprefixed versions of some names
194
195
    :param bool reverse: reverse the word before computing the selected Soundex
196
        (defaults to False); This results in "Reverse Soundex"
197
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
198
        maxlength string
199
    :returns: the Soundex value
200
    :rtype: str
201
202
    >>> soundex("Christopher")
203
    'C623'
204
    >>> soundex("Niall")
205
    'N400'
206
    >>> soundex('Smith')
207
    'S530'
208
    >>> soundex('Schmidt')
209
    'S530'
210
211
212
    >>> soundex('Christopher', maxlength=_INFINITY)
213
    'C623160000000000000000000000000000000000000000000000000000000000'
214
    >>> soundex('Christopher', maxlength=_INFINITY, zero_pad=False)
215
    'C62316'
216
217
    >>> soundex('Christopher', reverse=True)
218
    'R132'
219
220
    >>> soundex('Ashcroft')
221
    'A261'
222
    >>> soundex('Asicroft')
223
    'A226'
224
    >>> soundex('Ashcroft', var='special')
225
    'A226'
226
    >>> soundex('Asicroft', var='special')
227
    'A226'
228
    """
229
    _soundex_translation = dict(zip((ord(_) for _ in
Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
230
                                     'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
231
                                    '01230129022455012623019202'))
232
233
    # Require a maxlength of at least 4 and not more than 64
234
    if maxlength is not None:
235
        maxlength = min(max(4, maxlength), 64)
236
    else:
237
        maxlength = 64
238
239
    # uppercase, normalize, decompose, and filter non-A-Z out
240
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
241
    word = word.replace('ß', 'SS')
242
243
    if var == 'Census':
244
        # Should these prefixes be supplemented? (VANDE, DELA, VON)
245
        if word[:3] in {'VAN', 'CON'} and len(word) > 4:
246
            return (soundex(word, maxlength, 'American', reverse, zero_pad),
247
                    soundex(word[3:], maxlength, 'American', reverse,
248
                            zero_pad))
249
        if word[:2] in {'DE', 'DI', 'LA', 'LE'} and len(word) > 3:
250
            return (soundex(word, maxlength, 'American', reverse, zero_pad),
251
                    soundex(word[2:], maxlength, 'American', reverse,
252
                            zero_pad))
253
        # Otherwise, proceed as usual (var='American' mode, ostensibly)
254
255
    word = ''.join(c for c in word if c in
256
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
257
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
258
                    'Y', 'Z'})
259
260
    # Nothing to convert, return base case
261
    if not word:
262
        if zero_pad:
263
            return '0'*maxlength
264
        return '0'
265
266
    # Reverse word if computing Reverse Soundex
267
    if reverse:
268
        word = word[::-1]
269
270
    # apply the Soundex algorithm
271
    sdx = word.translate(_soundex_translation)
272
273
    if var == 'special':
274
        sdx = sdx.replace('9', '0')  # special rule for 1880-1910 census
275
    else:
276
        sdx = sdx.replace('9', '')  # rule 1
277
    sdx = _delete_consecutive_repeats(sdx)  # rule 3
278
279
    if word[0] in 'HW':
280
        sdx = word[0] + sdx
281
    else:
282
        sdx = word[0] + sdx[1:]
283
    sdx = sdx.replace('0', '')  # rule 1
284
285
    if zero_pad:
286
        sdx += ('0'*maxlength)  # rule 4
287
288
    return sdx[:maxlength]
289
290
291
def refined_soundex(word, maxlength=_INFINITY, reverse=False, zero_pad=False,
292
                    retain_vowels=False):
293
    """Return the Refined Soundex code for a word.
294
295
    This is Soundex, but with more character classes. It was defined by
296
    Carolyn B. Boyce:
297
    https://web.archive.org/web/20010513121003/http://www.bluepoof.com:80/Soundex/info2.html
298
299
    :param word: the word to transform
300
    :param maxlength: the length of the code returned (defaults to unlimited)
301
    :param reverse: reverse the word before computing the selected Soundex
302
        (defaults to False); This results in "Reverse Soundex"
303
    :param zero_pad: pad the end of the return value with 0s to achieve a
304
        maxlength string
305
    :param retain_vowels: retain vowels (as 0) in the resulting code
306
    :returns: the Refined Soundex value
307
    :rtype: str
308
309
    >>> refined_soundex('Christopher')
310
    'C3090360109'
311
    >>> refined_soundex('Niall')
312
    'N807'
313
    >>> refined_soundex('Smith')
314
    'S38060'
315
    >>> refined_soundex('Schmidt')
316
    'S30806'
317
    """
318
    _ref_soundex_translation = dict(zip((ord(_) for _ in
Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
319
                                         'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
320
                                        '01360240043788015936020505'))
321
322
    # uppercase, normalize, decompose, and filter non-A-Z out
323
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
324
    word = word.replace('ß', 'SS')
325
    word = ''.join(c for c in word if c in
326
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
327
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
328
                    'Y', 'Z'})
329
330
    # Reverse word if computing Reverse Soundex
331
    if reverse:
332
        word = word[::-1]
333
334
    # apply the Soundex algorithm
335
    sdx = word[0] + word.translate(_ref_soundex_translation)
336
    sdx = _delete_consecutive_repeats(sdx)
337
    if not retain_vowels:
338
        sdx = sdx.replace('0', '')  # Delete vowels, H, W, Y
339
340
    if maxlength < _INFINITY:
341
        if zero_pad:
342
            sdx += ('0' * maxlength)
343
        if maxlength:
344
            sdx = sdx[:maxlength]
345
346
    return sdx
347
348
349
def dm_soundex(word, maxlength=6, reverse=False, zero_pad=True):
350
    """Return the Daitch-Mokotoff Soundex code for a word.
351
352
    Returns values of a word as a set. A collection is necessary since there
353
    can be multiple values for a single word.
354
355
    :param word: the word to transform
356
    :param maxlength: the length of the code returned (defaults to 6)
357
    :param reverse: reverse the word before computing the selected Soundex
358
        (defaults to False); This results in "Reverse Soundex"
359
    :param zero_pad: pad the end of the return value with 0s to achieve a
360
        maxlength string
361
    :returns: the Daitch-Mokotoff Soundex value
362
    :rtype: str
363
364
    >>> dm_soundex('Christopher')
365
    {'494379', '594379'}
366
    >>> dm_soundex('Niall')
367
    {'680000'}
368
    >>> dm_soundex('Smith')
369
    {'463000'}
370
    >>> dm_soundex('Schmidt')
371
    {'463000'}
372
373
    >>> dm_soundex('The quick brown fox', maxlength=20, zero_pad=False)
374
    {'35457976754', '3557976754'}
375
    """
376
    _dms_table = {'STCH': (2, 4, 4), 'DRZ': (4, 4, 4), 'ZH': (4, 4, 4),
377
                  'ZHDZH': (2, 4, 4), 'DZH': (4, 4, 4), 'DRS': (4, 4, 4),
378
                  'DZS': (4, 4, 4), 'SCHTCH': (2, 4, 4), 'SHTSH': (2, 4, 4),
379
                  'SZCZ': (2, 4, 4), 'TZS': (4, 4, 4), 'SZCS': (2, 4, 4),
380
                  'STSH': (2, 4, 4), 'SHCH': (2, 4, 4), 'D': (3, 3, 3),
381
                  'H': (5, 5, '_'), 'TTSCH': (4, 4, 4), 'THS': (4, 4, 4),
382
                  'L': (8, 8, 8), 'P': (7, 7, 7), 'CHS': (5, 54, 54),
383
                  'T': (3, 3, 3), 'X': (5, 54, 54), 'OJ': (0, 1, '_'),
384
                  'OI': (0, 1, '_'), 'SCHTSH': (2, 4, 4), 'OY': (0, 1, '_'),
385
                  'Y': (1, '_', '_'), 'TSH': (4, 4, 4), 'ZDZ': (2, 4, 4),
386
                  'TSZ': (4, 4, 4), 'SHT': (2, 43, 43), 'SCHTSCH': (2, 4, 4),
387
                  'TTSZ': (4, 4, 4), 'TTZ': (4, 4, 4), 'SCH': (4, 4, 4),
388
                  'TTS': (4, 4, 4), 'SZD': (2, 43, 43), 'AI': (0, 1, '_'),
389
                  'PF': (7, 7, 7), 'TCH': (4, 4, 4), 'PH': (7, 7, 7),
390
                  'TTCH': (4, 4, 4), 'SZT': (2, 43, 43), 'ZDZH': (2, 4, 4),
391
                  'EI': (0, 1, '_'), 'G': (5, 5, 5), 'EJ': (0, 1, '_'),
392
                  'ZD': (2, 43, 43), 'IU': (1, '_', '_'), 'K': (5, 5, 5),
393
                  'O': (0, '_', '_'), 'SHTCH': (2, 4, 4), 'S': (4, 4, 4),
394
                  'TRZ': (4, 4, 4), 'SHD': (2, 43, 43), 'DSH': (4, 4, 4),
395
                  'CSZ': (4, 4, 4), 'EU': (1, 1, '_'), 'TRS': (4, 4, 4),
396
                  'ZS': (4, 4, 4), 'STRZ': (2, 4, 4), 'UY': (0, 1, '_'),
397
                  'STRS': (2, 4, 4), 'CZS': (4, 4, 4),
398
                  'MN': ('6_6', '6_6', '6_6'), 'UI': (0, 1, '_'),
399
                  'UJ': (0, 1, '_'), 'UE': (0, '_', '_'), 'EY': (0, 1, '_'),
400
                  'W': (7, 7, 7), 'IA': (1, '_', '_'), 'FB': (7, 7, 7),
401
                  'STSCH': (2, 4, 4), 'SCHT': (2, 43, 43),
402
                  'NM': ('6_6', '6_6', '6_6'), 'SCHD': (2, 43, 43),
403
                  'B': (7, 7, 7), 'DSZ': (4, 4, 4), 'F': (7, 7, 7),
404
                  'N': (6, 6, 6), 'CZ': (4, 4, 4), 'R': (9, 9, 9),
405
                  'U': (0, '_', '_'), 'V': (7, 7, 7), 'CS': (4, 4, 4),
406
                  'Z': (4, 4, 4), 'SZ': (4, 4, 4), 'TSCH': (4, 4, 4),
407
                  'KH': (5, 5, 5), 'ST': (2, 43, 43), 'KS': (5, 54, 54),
408
                  'SH': (4, 4, 4), 'SC': (2, 4, 4), 'SD': (2, 43, 43),
409
                  'DZ': (4, 4, 4), 'ZHD': (2, 43, 43), 'DT': (3, 3, 3),
410
                  'ZSH': (4, 4, 4), 'DS': (4, 4, 4), 'TZ': (4, 4, 4),
411
                  'TS': (4, 4, 4), 'TH': (3, 3, 3), 'TC': (4, 4, 4),
412
                  'A': (0, '_', '_'), 'E': (0, '_', '_'), 'I': (0, '_', '_'),
413
                  'AJ': (0, 1, '_'), 'M': (6, 6, 6), 'Q': (5, 5, 5),
414
                  'AU': (0, 7, '_'), 'IO': (1, '_', '_'), 'AY': (0, 1, '_'),
415
                  'IE': (1, '_', '_'), 'ZSCH': (4, 4, 4),
416
                  'CH': ((5, 4), (5, 4), (5, 4)),
417
                  'CK': ((5, 45), (5, 45), (5, 45)),
418
                  'C': ((5, 4), (5, 4), (5, 4)),
419
                  'J': ((1, 4), ('_', 4), ('_', 4)),
420
                  'RZ': ((94, 4), (94, 4), (94, 4)),
421
                  'RS': ((94, 4), (94, 4), (94, 4))}
422
423
    _dms_order = {'A': ('AI', 'AJ', 'AU', 'AY', 'A'),
424
                  'B': ('B'),
425
                  'C': ('CHS', 'CSZ', 'CZS', 'CH', 'CK', 'CS', 'CZ', 'C'),
426
                  'D': ('DRS', 'DRZ', 'DSH', 'DSZ', 'DZH', 'DZS', 'DS', 'DT',
427
                        'DZ', 'D'),
428
                  'E': ('EI', 'EJ', 'EU', 'EY', 'E'),
429
                  'F': ('FB', 'F'),
430
                  'G': ('G'),
431
                  'H': ('H'),
432
                  'I': ('IA', 'IE', 'IO', 'IU', 'I'),
433
                  'J': ('J'),
434
                  'K': ('KH', 'KS', 'K'),
435
                  'L': ('L'),
436
                  'M': ('MN', 'M'),
437
                  'N': ('NM', 'N'),
438
                  'O': ('OI', 'OJ', 'OY', 'O'),
439
                  'P': ('PF', 'PH', 'P'),
440
                  'Q': ('Q'),
441
                  'R': ('RS', 'RZ', 'R'),
442
                  'S': ('SCHTSCH', 'SCHTCH', 'SCHTSH', 'SHTCH', 'SHTSH',
443
                        'STSCH', 'SCHD', 'SCHT', 'SHCH', 'STCH', 'STRS',
444
                        'STRZ', 'STSH', 'SZCS', 'SZCZ', 'SCH', 'SHD', 'SHT',
445
                        'SZD', 'SZT', 'SC', 'SD', 'SH', 'ST', 'SZ', 'S'),
446
                  'T': ('TTSCH', 'TSCH', 'TTCH', 'TTSZ', 'TCH', 'THS', 'TRS',
447
                        'TRZ', 'TSH', 'TSZ', 'TTS', 'TTZ', 'TZS', 'TC', 'TH',
448
                        'TS', 'TZ', 'T'),
449
                  'U': ('UE', 'UI', 'UJ', 'UY', 'U'),
450
                  'V': ('V'),
451
                  'W': ('W'),
452
                  'X': ('X'),
453
                  'Y': ('Y'),
454
                  'Z': ('ZHDZH', 'ZDZH', 'ZSCH', 'ZDZ', 'ZHD', 'ZSH', 'ZD',
455
                        'ZH', 'ZS', 'Z')}
456
457
    _vowels = {'A', 'E', 'I', 'J', 'O', 'U', 'Y'}
458
    dms = ['']  # initialize empty code list
459
460
    # Require a maxlength of at least 6 and not more than 64
461
    if maxlength is not None:
462
        maxlength = min(max(6, maxlength), 64)
463
    else:
464
        maxlength = 64
465
466
    # uppercase, normalize, decompose, and filter non-A-Z
467
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
468
    word = word.replace('ß', 'SS')
469
    word = ''.join(c for c in word if c in
470
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
471
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
472
                    'Y', 'Z'})
473
474
    # Nothing to convert, return base case
475
    if not word:
476
        if zero_pad:
477
            return {'0'*maxlength}
478
        return {'0'}
479
480
    # Reverse word if computing Reverse Soundex
481
    if reverse:
482
        word = word[::-1]
483
484
    pos = 0
485
    while pos < len(word):
486
        # Iterate through _dms_order, which specifies the possible substrings
487
        # for which codes exist in the Daitch-Mokotoff coding
488
        for sstr in _dms_order[word[pos]]:  # pragma: no branch
489
            if word[pos:].startswith(sstr):
490
                # Having determined a valid substring start, retrieve the code
491
                dm_val = _dms_table[sstr]
492
493
                # Having retried the code (triple), determine the correct
494
                # positional variant (first, pre-vocalic, elsewhere)
495
                if pos == 0:
496
                    dm_val = dm_val[0]
497
                elif (pos+len(sstr) < len(word) and
498
                      word[pos+len(sstr)] in _vowels):
499
                    dm_val = dm_val[1]
500
                else:
501
                    dm_val = dm_val[2]
502
503
                # Build the code strings
504
                if isinstance(dm_val, tuple):
505
                    dms = [_ + text_type(dm_val[0]) for _ in dms] \
506
                            + [_ + text_type(dm_val[1]) for _ in dms]
507
                else:
508
                    dms = [_ + text_type(dm_val) for _ in dms]
509
                pos += len(sstr)
510
                break
511
512
    # Filter out double letters and _ placeholders
513
    dms = (''.join(c for c in _delete_consecutive_repeats(_) if c != '_')
Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
514
           for _ in dms)
515
516
    # Trim codes and return set
517
    if zero_pad:
518
        dms = ((_ + ('0'*maxlength))[:maxlength] for _ in dms)
519
    else:
520
        dms = (_[:maxlength] for _ in dms)
521
    return set(dms)
522
523
524
def koelner_phonetik(word):
525
    """Return the Kölner Phonetik (numeric output) code for a word.
526
527
    Based on the algorithm described at
528
    https://de.wikipedia.org/wiki/Kölner_Phonetik
529
530
    While the output code is numeric, it is still a str because 0s can lead
531
    the code.
532
533
    :param str word: the word to transform
534
    :returns: the Kölner Phonetik value as a numeric string
535
    :rtype: str
536
537
    >>> koelner_phonetik('Christopher')
538
    '478237'
539
    >>> koelner_phonetik('Niall')
540
    '65'
541
    >>> koelner_phonetik('Smith')
542
    '862'
543
    >>> koelner_phonetik('Schmidt')
544
    '862'
545
    >>> koelner_phonetik('Müller')
546
    '657'
547
    >>> koelner_phonetik('Zimmermann')
548
    '86766'
549
    """
550
    # pylint: disable=too-many-branches
551
    def _after(word, i, letters):
552
        """Return True if word[i] follows one of the supplied letters."""
553
        if i > 0 and word[i-1] in letters:
554
            return True
555
        return False
556
557
    def _before(word, i, letters):
558
        """Return True if word[i] precedes one of the supplied letters."""
559
        if i+1 < len(word) and word[i+1] in letters:
560
            return True
561
        return False
562
563
    _vowels = {'A', 'E', 'I', 'J', 'O', 'U', 'Y'}
564
565
    sdx = ''
566
567
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
568
    word = word.replace('ß', 'SS')
569
570
    word = word.replace('Ä', 'AE')
571
    word = word.replace('Ö', 'OE')
572
    word = word.replace('Ü', 'UE')
573
    word = ''.join(c for c in word if c in
574
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
575
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
576
                    'Y', 'Z'})
577
578
    # Nothing to convert, return base case
579
    if not word:
580
        return sdx
581
582
    for i in range(len(word)):
Issue (unused-code): Consider using enumerate instead of iterating with range and len
583
        if word[i] in _vowels:
Issue (Duplication): This code seems to be duplicated in your project.
584
            sdx += '0'
585
        elif word[i] == 'B':
586
            sdx += '1'
587
        elif word[i] == 'P':
588
            if _before(word, i, {'H'}):
589
                sdx += '3'
590
            else:
591
                sdx += '1'
592
        elif word[i] in {'D', 'T'}:
593
            if _before(word, i, {'C', 'S', 'Z'}):
594
                sdx += '8'
595
            else:
596
                sdx += '2'
597
        elif word[i] in {'F', 'V', 'W'}:
598
            sdx += '3'
599
        elif word[i] in {'G', 'K', 'Q'}:
600
            sdx += '4'
601
        elif word[i] == 'C':
602
            if _after(word, i, {'S', 'Z'}):
603
                sdx += '8'
604
            elif i == 0:
605
                if _before(word, i, {'A', 'H', 'K', 'L', 'O', 'Q', 'R', 'U',
606
                                     'X'}):
607
                    sdx += '4'
608
                else:
609
                    sdx += '8'
610
            elif _before(word, i, {'A', 'H', 'K', 'O', 'Q', 'U', 'X'}):
611
                sdx += '4'
612
            else:
613
                sdx += '8'
614
        elif word[i] == 'X':
615
            if _after(word, i, {'C', 'K', 'Q'}):
616
                sdx += '8'
617
            else:
618
                sdx += '48'
619
        elif word[i] == 'L':
620
            sdx += '5'
621
        elif word[i] in {'M', 'N'}:
622
            sdx += '6'
623
        elif word[i] == 'R':
624
            sdx += '7'
625
        elif word[i] in {'S', 'Z'}:
626
            sdx += '8'
627
628
    sdx = _delete_consecutive_repeats(sdx)
629
630
    if sdx:
631
        sdx = sdx[0] + sdx[1:].replace('0', '')
632
633
    return sdx
634
635
636
def koelner_phonetik_num_to_alpha(num):
637
    """Convert a Kölner Phonetik code from numeric to alphabetic.
638
639
    :param str num: a numeric Kölner Phonetik representation
640
    :returns: an alphabetic representation of the same word
641
    :rtype: str
642
643
    >>> koelner_phonetik_num_to_alpha(862)
644
    'SNT'
645
    >>> koelner_phonetik_num_to_alpha(657)
646
    'NLR'
647
    >>> koelner_phonetik_num_to_alpha(86766)
648
    'SNRNN'
649
    """
650
    _koelner_num_translation = dict(zip((ord(_) for _ in '012345678'),
Issue (Comprehensibility Best Practice): The variable _ does not seem to be defined.
651
                                        'APTFKLNRS'))
652
    num = ''.join(c for c in text_type(num) if c in {'0', '1', '2', '3', '4',
653
                                                     '5', '6', '7', '8'})
654
    return num.translate(_koelner_num_translation)
655
656
657
def koelner_phonetik_alpha(word):
658
    """Return the Kölner Phonetik (alphabetic output) code for a word.
659
660
    :param str word: the word to transform
661
    :returns: the Kölner Phonetik value as an alphabetic string
662
    :rtype: str
663
664
    >>> koelner_phonetik_alpha('Smith')
665
    'SNT'
666
    >>> koelner_phonetik_alpha('Schmidt')
667
    'SNT'
668
    >>> koelner_phonetik_alpha('Müller')
669
    'NLR'
670
    >>> koelner_phonetik_alpha('Zimmermann')
671
    'SNRNN'
672
    """
673
    return koelner_phonetik_num_to_alpha(koelner_phonetik(word))
674
675
676
def nysiis(word, maxlength=6, modified=False):
677
    """Return the NYSIIS code for a word.
678
679
    A description of the New York State Identification and Intelligence System
680
    algorithm can be found at
681
    https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System
682
683
    The modified version of this algorithm is described in Appendix B of
684
    Lynch, Billy T. and William L. Arends. `Selection of a Surname Coding
685
    Procedure for the SRS Record Linkage System.` Statistical Reporting
686
    Service, U.S. Department of Agriculture, Washington, D.C. February 1977.
687
    https://naldc.nal.usda.gov/download/27833/PDF
688
689
    :param str word: the word to transform
690
    :param int maxlength: the maximum length (default 6) of the code to return
691
    :param bool modified: indicates whether to use USDA modified NYSIIS
692
    :returns: the NYSIIS value
693
    :rtype: str
694
695
    >>> nysiis('Christopher')
696
    'CRASTA'
697
    >>> nysiis('Niall')
698
    'NAL'
699
    >>> nysiis('Smith')
700
    'SNAT'
701
    >>> nysiis('Schmidt')
702
    'SNAD'
703
704
    >>> nysiis('Christopher', maxlength=_INFINITY)
705
    'CRASTAFAR'
706
707
    >>> nysiis('Christopher', maxlength=8, modified=True)
708
    'CRASTAFA'
709
    >>> nysiis('Niall', maxlength=8, modified=True)
710
    'NAL'
711
    >>> nysiis('Smith', maxlength=8, modified=True)
712
    'SNAT'
713
    >>> nysiis('Schmidt', maxlength=8, modified=True)
714
    'SNAD'
715
    """
716
    # Require a maxlength of at least 6
717
    if maxlength:
718
        maxlength = max(6, maxlength)
719
720
    _vowels = {'A', 'E', 'I', 'O', 'U'}
721
722
    word = ''.join(c for c in word.upper() if c.isalpha())
723
    word = word.replace('ß', 'SS')
724
725
    # exit early if there are no alphas
726
    if not word:
727
        return ''
728
729
    if modified:
730
        original_first_char = word[0]
731
732
    if word[:3] == 'MAC':
733
        word = 'MCC'+word[3:]
734
    elif word[:2] == 'KN':
735
        word = 'NN'+word[2:]
736
    elif word[:1] == 'K':
737
        word = 'C'+word[1:]
738
    elif word[:2] in {'PH', 'PF'}:
739
        word = 'FF'+word[2:]
740
    elif word[:3] == 'SCH':
741
        word = 'SSS'+word[3:]
742
    elif modified:
743
        if word[:2] == 'WR':
744
            word = 'RR'+word[2:]
745
        elif word[:2] == 'RH':
746
            word = 'RR'+word[2:]
747
        elif word[:2] == 'DG':
748
            word = 'GG'+word[2:]
749
        elif word[:1] in _vowels:
750
            word = 'A'+word[1:]
751
752
    if modified and word[-1] in {'S', 'Z'}:
753
        word = word[:-1]
754
755
    if word[-2:] == 'EE' or word[-2:] == 'IE' or (modified and
756
                                                  word[-2:] == 'YE'):
757
        word = word[:-2]+'Y'
758
    elif word[-2:] in {'DT', 'RT', 'RD'}:
759
        word = word[:-2]+'D'
760
    elif word[-2:] in {'NT', 'ND'}:
761
        word = word[:-2]+('N' if modified else 'D')
762
    elif modified:
763
        if word[-2:] == 'IX':
764
            word = word[:-2]+'ICK'
765
        elif word[-2:] == 'EX':
766
            word = word[:-2]+'ECK'
767
        elif word[-2:] in {'JR', 'SR'}:
768
            return 'ERROR'  # TODO: decide how best to return an error
Issue (Coding Style): TODO and FIXME comments should generally be avoided.
769
770
    key = word[0]
771
772
    skip = 0
773
    for i in range(1, len(word)):
774
        if i >= len(word):
775
            continue
776
        elif skip:
777
            skip -= 1
778
            continue
779
        elif word[i:i+2] == 'EV':
780
            word = word[:i] + 'AF' + word[i+2:]
781
            skip = 1
782
        elif word[i] in _vowels:
783
            word = word[:i] + 'A' + word[i+1:]
784
        elif modified and i != len(word)-1 and word[i] == 'Y':
785
            word = word[:i] + 'A' + word[i+1:]
786
        elif word[i] == 'Q':
787
            word = word[:i] + 'G' + word[i+1:]
788
        elif word[i] == 'Z':
789
            word = word[:i] + 'S' + word[i+1:]
790
        elif word[i] == 'M':
791
            word = word[:i] + 'N' + word[i+1:]
792
        elif word[i:i+2] == 'KN':
793
            word = word[:i] + 'N' + word[i+2:]
794
        elif word[i] == 'K':
795
            word = word[:i] + 'C' + word[i+1:]
796
        elif modified and i == len(word)-3 and word[i:i+3] == 'SCH':
797
            word = word[:i] + 'SSA'
798
            skip = 2
799
        elif word[i:i+3] == 'SCH':
800
            word = word[:i] + 'SSS' + word[i+3:]
801
            skip = 2
802
        elif modified and i == len(word)-2 and word[i:i+2] == 'SH':
803
            word = word[:i] + 'SA'
804
            skip = 1
805
        elif word[i:i+2] == 'SH':
806
            word = word[:i] + 'SS' + word[i+2:]
807
            skip = 1
808
        elif word[i:i+2] == 'PH':
809
            word = word[:i] + 'FF' + word[i+2:]
810
            skip = 1
811
        elif modified and word[i:i+3] == 'GHT':
812
            word = word[:i] + 'TTT' + word[i+3:]
813
            skip = 2
814
        elif modified and word[i:i+2] == 'DG':
815
            word = word[:i] + 'GG' + word[i+2:]
816
            skip = 1
817
        elif modified and word[i:i+2] == 'WR':
818
            word = word[:i] + 'RR' + word[i+2:]
819
            skip = 1
820
        elif word[i] == 'H' and (word[i-1] not in _vowels or
821
                                 word[i+1:i+2] not in _vowels):
822
            word = word[:i] + word[i-1] + word[i+1:]
823
        elif word[i] == 'W' and word[i-1] in _vowels:
824
            word = word[:i] + word[i-1] + word[i+1:]
825
826
        if word[i:i+skip+1] != key[-1:]:
827
            key += word[i:i+skip+1]
828
829
    key = _delete_consecutive_repeats(key)
830
831
    if key[-1] == 'S':
832
        key = key[:-1]
833
    if key[-2:] == 'AY':
834
        key = key[:-2] + 'Y'
835
    if key[-1:] == 'A':
836
        key = key[:-1]
837
    if modified and key[0] == 'A':
838
        key = original_first_char + key[1:]
Issue: The variable original_first_char does not seem to be defined when modified (line 729) is False. Are you sure this can never be the case?
839
840
    if maxlength and maxlength < _INFINITY:
841
        key = key[:maxlength]
842
843
    return key
844
845
846
def mra(word):
847
    """Return the MRA personal numeric identifier (PNI) for a word.
848
849
    A description of the Western Airlines Surname Match Rating Algorithm can
850
    be found on page 18 of
851
    https://archive.org/details/accessingindivid00moor
852
853
    :param str word: the word to transform
854
    :returns: the MRA PNI
855
    :rtype: str
856
857
    >>> mra('Christopher')
858
    'CHRPHR'
859
    >>> mra('Niall')
860
    'NL'
861
    >>> mra('Smith')
862
    'SMTH'
863
    >>> mra('Schmidt')
864
    'SCHMDT'
865
    """
866
    if not word:
867
        return word
868
    word = word.upper()
869
    word = word.replace('ß', 'SS')
870
    word = word[0]+''.join(c for c in word[1:] if
871
                           c not in {'A', 'E', 'I', 'O', 'U'})
872
    word = _delete_consecutive_repeats(word)
873
    if len(word) > 6:
874
        word = word[:3]+word[-3:]
875
    return word
876
877
878
def metaphone(word, maxlength=_INFINITY):
879
    """Return the Metaphone code for a word.
880
881
    Based on Lawrence Philips' Pick BASIC code from 1990:
882
    http://aspell.net/metaphone/metaphone.basic
883
    This incorporates some corrections to the above code, particularly
884
    some of those suggested by Michael Kuhn in:
885
    http://aspell.net/metaphone/metaphone-kuhn.txt
886
887
    :param str word: the word to transform
888
    :param int maxlength: the maximum length of the returned Metaphone code
889
        (defaults to unlimited, but in Philips' original implementation
890
        this was 4)
891
    :returns: the Metaphone value
892
    :rtype: str
893
894
895
    >>> metaphone('Christopher')
896
    'KRSTFR'
897
    >>> metaphone('Niall')
898
    'NL'
899
    >>> metaphone('Smith')
900
    'SM0'
901
    >>> metaphone('Schmidt')
902
    'SKMTT'
903
    """
904
    # pylint: disable=too-many-branches
905
    _vowels = {'A', 'E', 'I', 'O', 'U'}
906
    _frontv = {'E', 'I', 'Y'}
907
    _varson = {'C', 'G', 'P', 'S', 'T'}
908
909
    # Require a maxlength of at least 4
910
    if maxlength is not None:
911
        maxlength = max(4, maxlength)
912
    else:
913
        maxlength = 64
914
915
    # As in variable sound--those modified by adding an "h"
916
    ename = ''.join(c for c in word.upper() if c.isalnum())
917
    ename = ename.replace('ß', 'SS')
918
919
    # Delete nonalphanumeric characters and make all caps
920
    if not ename:
921
        return ''
922
    if ename[0:2] in {'PN', 'AE', 'KN', 'GN', 'WR'}:
923
        ename = ename[1:]
924
    elif ename[0] == 'X':
925
        ename = 'S' + ename[1:]
926
    elif ename[0:2] == 'WH':
927
        ename = 'W' + ename[2:]
928
929
    # Convert to metaph
930
    elen = len(ename)-1
931
    metaph = ''
932
    for i in range(len(ename)):
Issue (unused-code): Consider using enumerate instead of iterating with range and len
933
        if len(metaph) >= maxlength:
934
            break
935
        if ((ename[i] not in {'G', 'T'} and
936
             i > 0 and ename[i-1] == ename[i])):
937
            continue
938
939
        if ename[i] in _vowels and i == 0:
940
            metaph = ename[i]
941
942
        elif ename[i] == 'B':
943
            if i != elen or ename[i-1] != 'M':
944
                metaph += ename[i]
945
946
        elif ename[i] == 'C':
947
            if not (i > 0 and ename[i-1] == 'S' and ename[i+1:i+2] in _frontv):
948
                if ename[i+1:i+3] == 'IA':
949
                    metaph += 'X'
950
                elif ename[i+1:i+2] in _frontv:
951
                    metaph += 'S'
952
                elif i > 0 and ename[i-1:i+2] == 'SCH':
953
                    metaph += 'K'
954
                elif ename[i+1:i+2] == 'H':
955
                    if i == 0 and i+1 < elen and ename[i+2:i+3] not in _vowels:
956
                        metaph += 'K'
957
                    else:
958
                        metaph += 'X'
959
                else:
960
                    metaph += 'K'
961
962
        elif ename[i] == 'D':
963
            if ename[i+1:i+2] == 'G' and ename[i+2:i+3] in _frontv:
964
                metaph += 'J'
965
            else:
966
                metaph += 'T'
967
968
        elif ename[i] == 'G':
969
            if ename[i+1:i+2] == 'H' and not (i+1 == elen or
970
                                              ename[i+2:i+3] not in _vowels):
971
                continue
972
            elif i > 0 and ((i+1 == elen and ename[i+1] == 'N') or
973
                            (i+3 == elen and ename[i+1:i+4] == 'NED')):
974
                continue
975
            elif (i-1 > 0 and i+1 <= elen and ename[i-1] == 'D' and
976
                  ename[i+1] in _frontv):
977
                continue
978
            elif ename[i+1:i+2] == 'G':
979
                continue
980
            elif ename[i+1:i+2] in _frontv:
981
                if i == 0 or ename[i-1] != 'G':
982
                    metaph += 'J'
983
                else:
984
                    metaph += 'K'
985
            else:
986
                metaph += 'K'
987
988
        elif ename[i] == 'H':
989
            if ((i > 0 and ename[i-1] in _vowels and
990
                 ename[i+1:i+2] not in _vowels)):
991
                continue
992
            elif i > 0 and ename[i-1] in _varson:
993
                continue
994
            else:
995
                metaph += 'H'
996
997
        elif ename[i] in {'F', 'J', 'L', 'M', 'N', 'R'}:
998
            metaph += ename[i]
999
1000
        elif ename[i] == 'K':
1001
            if i > 0 and ename[i-1] == 'C':
1002
                continue
1003
            else:
1004
                metaph += 'K'
1005
1006
        elif ename[i] == 'P':
1007
            if ename[i+1:i+2] == 'H':
1008
                metaph += 'F'
1009
            else:
1010
                metaph += 'P'
1011
1012
        elif ename[i] == 'Q':
1013
            metaph += 'K'
1014
1015
        elif ename[i] == 'S':
1016
            if ((i > 0 and i+2 <= elen and ename[i+1] == 'I' and
1017
                 ename[i+2] in 'OA')):
1018
                metaph += 'X'
1019
            elif ename[i+1:i+2] == 'H':
1020
                metaph += 'X'
1021
            else:
1022
                metaph += 'S'
1023
1024
        elif ename[i] == 'T':
1025
            if ((i > 0 and i+2 <= elen and ename[i+1] == 'I' and
1026
                 ename[i+2] in {'A', 'O'})):
1027
                metaph += 'X'
1028
            elif ename[i+1:i+2] == 'H':
1029
                metaph += '0'
1030
            elif ename[i+1:i+3] != 'CH':
1031
                if ename[i-1:i] != 'T':
1032
                    metaph += 'T'
1033
1034
        elif ename[i] == 'V':
1035
            metaph += 'F'
1036
1037
        elif ename[i] in 'WY':
1038
            if ename[i+1:i+2] in _vowels:
1039
                metaph += ename[i]
1040
1041
        elif ename[i] == 'X':
1042
            metaph += 'KS'
1043
1044
        elif ename[i] == 'Z':
1045
            metaph += 'S'
1046
1047
    return metaph
1048
1049
1050
def double_metaphone(word, maxlength=_INFINITY):
1051
    """Return the Double Metaphone code for a word.
1052
1053
    Based on Lawrence Philips' (Visual) C++ code from 1999:
1054
    http://aspell.net/metaphone/dmetaph.cpp
1055
1056
    :param word: the word to transform
1057
    :param maxlength: the maximum length of the returned Double Metaphone codes
1058
        (defaults to unlimited, but in Philips' original implementation this
1059
        was 4)
1060
    :returns: the Double Metaphone value(s)
1061
    :rtype: tuple
1062
1063
    >>> double_metaphone('Christopher')
1064
    ('KRSTFR', '')
1065
    >>> double_metaphone('Niall')
1066
    ('NL', '')
1067
    >>> double_metaphone('Smith')
1068
    ('SM0', 'XMT')
1069
    >>> double_metaphone('Schmidt')
1070
    ('XMT', 'SMT')
1071
    """
1072
    # pylint: disable=too-many-branches
1073
    # Require a maxlength of at least 4
1074
    if maxlength is not None:
1075
        maxlength = max(4, maxlength)
1076
    else:
1077
        maxlength = 64
1078
1079
    primary = ''
1080
    secondary = ''
1081
1082
    def _slavo_germanic():
1083
        """Return True if the word appears to be Slavic or Germanic."""
1084
        if 'W' in word or 'K' in word or 'CZ' in word:
1085
            return True
1086
        return False
1087
1088
    def _metaph_add(pri, sec=''):
1089
        """Return a new metaphone tuple with the supplied elements."""
1090
        newpri = primary
1091
        newsec = secondary
1092
        if pri:
1093
            newpri += pri
1094
        if sec:
1095
            if sec != ' ':
1096
                newsec += sec
1097
        else:
1098
            newsec += pri
1099
        return (newpri, newsec)
1100
1101
    def _is_vowel(pos):
1102
        """Return True if the character at word[pos] is a vowel."""
1103
        if pos >= 0 and word[pos] in {'A', 'E', 'I', 'O', 'U', 'Y'}:
1104
            return True
1105
        return False
1106
1107
    def _get_at(pos):
1108
        """Return the character at word[pos]."""
1109
        return word[pos]
1110
1111
    def _string_at(pos, slen, substrings):
1112
        """Return True if word[pos:pos+slen] is in substrings."""
1113
        if pos < 0:
1114
            return False
1115
        return word[pos:pos+slen] in substrings
1116
1117
    current = 0
1118
    length = len(word)
1119
    if length < 1:
1120
        return ('', '')
1121
    last = length - 1
1122
1123
    word = word.upper()
1124
    word = word.replace('ß', 'SS')
1125
1126
    # Pad the original string so that we can index beyond the edge of the world
1127
    word += '     '
1128
1129
    # Skip these when at start of word
1130
    if word[0:2] in {'GN', 'KN', 'PN', 'WR', 'PS'}:
1131
        current += 1
1132
1133
    # Initial 'X' is pronounced 'Z' e.g. 'Xavier'
1134
    if _get_at(0) == 'X':
1135
        (primary, secondary) = _metaph_add('S')  # 'Z' maps to 'S'
1136
        current += 1
1137
1138
    # Main loop
1139
    while True:
Issue (unused-code): Too many nested blocks (6/5)
1140
        if current >= length:
1141
            break
1142
1143
        if _get_at(current) in {'A', 'E', 'I', 'O', 'U', 'Y'}:
1144
            if current == 0:
1145
                # All init vowels now map to 'A'
1146
                (primary, secondary) = _metaph_add('A')
1147
            current += 1
1148
            continue
1149
1150
        elif _get_at(current) == 'B':
1151
            # "-mb", e.g", "dumb", already skipped over...
1152
            (primary, secondary) = _metaph_add('P')
1153
            if _get_at(current + 1) == 'B':
1154
                current += 2
1155
            else:
1156
                current += 1
1157
            continue
1158
1159
        elif _get_at(current) == 'Ç':
1160
            (primary, secondary) = _metaph_add('S')
1161
            current += 1
1162
            continue
1163
1164
        elif _get_at(current) == 'C':
1165
            # Various Germanic
1166
            if (current > 1 and not _is_vowel(current - 2) and
Issue (best-practice): Too many boolean expressions in if statement (6/5)
1167
                    _string_at((current - 1), 3, {'ACH'}) and
1168
                    ((_get_at(current + 2) != 'I') and
1169
                     ((_get_at(current + 2) != 'E') or
1170
                      _string_at((current - 2), 6,
1171
                                 {'BACHER', 'MACHER'})))):
1172
                (primary, secondary) = _metaph_add('K')
1173
                current += 2
1174
                continue
1175
1176
            # Special case 'caesar'
1177
            elif current == 0 and _string_at(current, 6, {'CAESAR'}):
1178
                (primary, secondary) = _metaph_add('S')
1179
                current += 2
1180
                continue
1181
1182
            # Italian 'chianti'
1183
            elif _string_at(current, 4, {'CHIA'}):
1184
                (primary, secondary) = _metaph_add('K')
1185
                current += 2
1186
                continue
1187
1188
            elif _string_at(current, 2, {'CH'}):
1189
                # Find 'Michael'
1190
                if current > 0 and _string_at(current, 4, {'CHAE'}):
1191
                    (primary, secondary) = _metaph_add('K', 'X')
1192
                    current += 2
1193
                    continue
1194
1195
                # Greek roots e.g. 'chemistry', 'chorus'
1196
                elif (current == 0 and
1197
                      (_string_at((current + 1), 5,
1198
                                  {'HARAC', 'HARIS'}) or
1199
                       _string_at((current + 1), 3,
1200
                                  {'HOR', 'HYM', 'HIA', 'HEM'})) and
1201
                      not _string_at(0, 5, {'CHORE'})):
1202
                    (primary, secondary) = _metaph_add('K')
1203
                    current += 2
1204
                    continue
1205
1206
                # Germanic, Greek, or otherwise 'ch' for 'kh' sound
1207
                elif ((_string_at(0, 4, {'VAN ', 'VON '}) or
Issue (best-practice): Too many boolean expressions in if statement (7/5)
1208
                       _string_at(0, 3, {'SCH'})) or
1209
                      # 'architect but not 'arch', 'orchestra', 'orchid'
1210
                      _string_at((current - 2), 6,
1211
                                 {'ORCHES', 'ARCHIT', 'ORCHID'}) or
1212
                      _string_at((current + 2), 1, {'T', 'S'}) or
1213
                      ((_string_at((current - 1), 1,
1214
                                   {'A', 'O', 'U', 'E'}) or
1215
                        (current == 0)) and
1216
                       # e.g., 'wachtler', 'wechsler', but not 'tichner'
1217
                       _string_at((current + 2), 1,
1218
                                  {'L', 'R', 'N', 'M', 'B', 'H', 'F', 'V', 'W',
1219
                                   ' '}))):
1220
                    (primary, secondary) = _metaph_add('K')
1221
1222
                else:
1223
                    if current > 0:
1224
                        if _string_at(0, 2, {'MC'}):
1225
                            # e.g., "McHugh"
1226
                            (primary, secondary) = _metaph_add('K')
1227
                        else:
1228
                            (primary, secondary) = _metaph_add('X', 'K')
1229
                    else:
1230
                        (primary, secondary) = _metaph_add('X')
1231
1232
                current += 2
1233
                continue
1234
1235
            # e.g, 'czerny'
1236
            elif (_string_at(current, 2, {'CZ'}) and
1237
                  not _string_at((current - 2), 4, {'WICZ'})):
1238
                (primary, secondary) = _metaph_add('S', 'X')
1239
                current += 2
1240
                continue
1241
1242
            # e.g., 'focaccia'
1243
            elif _string_at((current + 1), 3, {'CIA'}):
1244
                (primary, secondary) = _metaph_add('X')
1245
                current += 3
1246
1247
            # double 'C', but not if e.g. 'McClellan'
1248
            elif (_string_at(current, 2, {'CC'}) and
1249
                  not ((current == 1) and (_get_at(0) == 'M'))):
1250
                # 'bellocchio' but not 'bacchus'
1251
                if ((_string_at((current + 2), 1,
1252
                                {'I', 'E', 'H'}) and
1253
                     not _string_at((current + 2), 2, ['HU']))):
1254
                    # 'accident', 'accede' 'succeed'
1255
                    if ((((current == 1) and _get_at(current - 1) == 'A') or
1256
                         _string_at((current - 1), 5,
1257
                                    {'UCCEE', 'UCCES'}))):
1258
                        (primary, secondary) = _metaph_add('KS')
1259
                    # 'bacci', 'bertucci', other italian
1260
                    else:
1261
                        (primary, secondary) = _metaph_add('X')
1262
                    current += 3
1263
                    continue
1264
                else:  # Pierce's rule
1265
                    (primary, secondary) = _metaph_add('K')
1266
                    current += 2
1267
                    continue
1268
1269
            elif _string_at(current, 2, {'CK', 'CG', 'CQ'}):
1270
                (primary, secondary) = _metaph_add('K')
1271
                current += 2
1272
                continue
1273
1274
            elif _string_at(current, 2, {'CI', 'CE', 'CY'}):
1275
                # Italian vs. English
1276
                if _string_at(current, 3, {'CIO', 'CIE', 'CIA'}):
1277
                    (primary, secondary) = _metaph_add('S', 'X')
1278
                else:
1279
                    (primary, secondary) = _metaph_add('S')
1280
                current += 2
1281
                continue
1282
1283
            # else
1284
            else:
1285
                (primary, secondary) = _metaph_add('K')
1286
1287
                # name sent in 'mac caffrey', 'mac gregor
1288
                if _string_at((current + 1), 2, {' C', ' Q', ' G'}):
1289
                    current += 3
1290
                elif (_string_at((current + 1), 1,
1291
                                 {'C', 'K', 'Q'}) and
1292
                      not _string_at((current + 1), 2, {'CE', 'CI'})):
1293
                    current += 2
1294
                else:
1295
                    current += 1
1296
                continue
1297
1298
        elif _get_at(current) == 'D':
1299
            if _string_at(current, 2, {'DG'}):
1300
                if _string_at((current + 2), 1, {'I', 'E', 'Y'}):
1301
                    # e.g. 'edge'
1302
                    (primary, secondary) = _metaph_add('J')
1303
                    current += 3
1304
                    continue
1305
                else:
1306
                    # e.g. 'edgar'
1307
                    (primary, secondary) = _metaph_add('TK')
1308
                    current += 2
1309
                    continue
1310
1311
            elif _string_at(current, 2, {'DT', 'DD'}):
1312
                (primary, secondary) = _metaph_add('T')
1313
                current += 2
1314
                continue
1315
1316
            # else
1317
            else:
1318
                (primary, secondary) = _metaph_add('T')
1319
                current += 1
1320
                continue
1321
1322
        elif _get_at(current) == 'F':
1323
            if _get_at(current + 1) == 'F':
1324
                current += 2
1325
            else:
1326
                current += 1
1327
            (primary, secondary) = _metaph_add('F')
1328
            continue
1329
1330
        elif _get_at(current) == 'G':
1331
            if _get_at(current + 1) == 'H':
1332
                if (current > 0) and not _is_vowel(current - 1):
1333
                    (primary, secondary) = _metaph_add('K')
1334
                    current += 2
1335
                    continue
1336
1337
                # 'ghislane', ghiradelli
1338
                elif current == 0:
1339
                    if _get_at(current + 2) == 'I':
1340
                        (primary, secondary) = _metaph_add('J')
1341
                    else:
1342
                        (primary, secondary) = _metaph_add('K')
1343
                    current += 2
1344
                    continue
1345
1346
                # Parker's rule (with some further refinements) - e.g., 'hugh'
1347
                elif (((current > 1) and
Issue (best-practice): Too many boolean expressions in if statement (6/5)
1348
                       _string_at((current - 2), 1, {'B', 'H', 'D'})) or
1349
                      # e.g., 'bough'
1350
                      ((current > 2) and
1351
                       _string_at((current - 3), 1, {'B', 'H', 'D'})) or
1352
                      # e.g., 'broughton'
1353
                      ((current > 3) and
1354
                       _string_at((current - 4), 1, {'B', 'H'}))):
1355
                    current += 2
1356
                    continue
1357
                else:
1358
                    # e.g. 'laugh', 'McLaughlin', 'cough',
1359
                    #      'gough', 'rough', 'tough'
1360
                    if ((current > 2) and
1361
                            (_get_at(current - 1) == 'U') and
1362
                            (_string_at((current - 3), 1,
1363
                                        {'C', 'G', 'L', 'R', 'T'}))):
1364
                        (primary, secondary) = _metaph_add('F')
1365
                    elif (current > 0) and _get_at(current - 1) != 'I':
1366
                        (primary, secondary) = _metaph_add('K')
1367
                    current += 2
1368
                    continue
1369
1370
            elif _get_at(current + 1) == 'N':
1371
                if (current == 1) and _is_vowel(0) and not _slavo_germanic():
1372
                    (primary, secondary) = _metaph_add('KN', 'N')
1373
                # not e.g. 'cagney'
1374
                elif (not _string_at((current + 2), 2, {'EY'}) and
1375
                      (_get_at(current + 1) != 'Y') and
1376
                      not _slavo_germanic()):
1377
                    (primary, secondary) = _metaph_add('N', 'KN')
1378
                else:
1379
                    (primary, secondary) = _metaph_add('KN')
1380
                current += 2
1381
                continue
1382
1383
            # 'tagliaro'
1384
            elif (_string_at((current + 1), 2, {'LI'}) and
1385
                  not _slavo_germanic()):
1386
                (primary, secondary) = _metaph_add('KL', 'L')
1387
                current += 2
1388
                continue
1389
1390
            # -ges-, -gep-, -gel-, -gie- at beginning
1391
            elif ((current == 0) and
1392
                  ((_get_at(current + 1) == 'Y') or
1393
                   _string_at((current + 1), 2, {'ES', 'EP', 'EB', 'EL', 'EY',
1394
                                                 'IB', 'IL', 'IN', 'IE', 'EI',
1395
                                                 'ER'}))):
1396
                (primary, secondary) = _metaph_add('K', 'J')
1397
                current += 2
1398
                continue
1399
1400
            #  -ger-,  -gy-
1401
            elif ((_string_at((current + 1), 2, {'ER'}) or
1402
                   (_get_at(current + 1) == 'Y')) and not
1403
                  _string_at(0, 6, {'DANGER', 'RANGER', 'MANGER'}) and not
1404
                  _string_at((current - 1), 1, {'E', 'I'}) and not
1405
                  _string_at((current - 1), 3, {'RGY', 'OGY'})):
1406
                (primary, secondary) = _metaph_add('K', 'J')
1407
                current += 2
1408
                continue
1409
1410
            #  italian e.g, 'biaggi'
1411
            elif (_string_at((current + 1), 1, {'E', 'I', 'Y'}) or
1412
                  _string_at((current - 1), 4, {'AGGI', 'OGGI'})):
1413
                # obvious germanic
1414
                if (((_string_at(0, 4, {'VAN ', 'VON '}) or
1415
                      _string_at(0, 3, {'SCH'})) or
1416
                     _string_at((current + 1), 2, {'ET'}))):
1417
                    (primary, secondary) = _metaph_add('K')
1418
                elif _string_at((current + 1), 4, {'IER '}):
1419
                    (primary, secondary) = _metaph_add('J')
1420
                else:
1421
                    (primary, secondary) = _metaph_add('J', 'K')
1422
                current += 2
1423
                continue
1424
1425
            else:
1426
                if _get_at(current + 1) == 'G':
1427
                    current += 2
1428
                else:
1429
                    current += 1
1430
                (primary, secondary) = _metaph_add('K')
1431
                continue
1432
1433
        elif _get_at(current) == 'H':
1434
            # only keep if first & before vowel or btw. 2 vowels
1435
            if ((((current == 0) or _is_vowel(current - 1)) and
1436
                 _is_vowel(current + 1))):
1437
                (primary, secondary) = _metaph_add('H')
1438
                current += 2
1439
            else:  # also takes care of 'HH'
1440
                current += 1
1441
            continue
1442
1443
        elif _get_at(current) == 'J':
1444
            # obvious spanish, 'jose', 'san jacinto'
1445
            if _string_at(current, 4, {'JOSE'}) or _string_at(0, 4, {'SAN '}):
1446
                if ((((current == 0) and (_get_at(current + 4) == ' ')) or
1447
                     _string_at(0, 4, {'SAN '}))):
1448
                    (primary, secondary) = _metaph_add('H')
1449
                else:
1450
                    (primary, secondary) = _metaph_add('J', 'H')
1451
                current += 1
1452
                continue
1453
1454
            elif (current == 0) and not _string_at(current, 4, {'JOSE'}):
1455
                # Yankelovich/Jankelowicz
1456
                (primary, secondary) = _metaph_add('J', 'A')
1457
            # Spanish pron. of e.g. 'bajador'
1458
            elif (_is_vowel(current - 1) and
1459
                  not _slavo_germanic() and
1460
                  ((_get_at(current + 1) == 'A') or
1461
                   (_get_at(current + 1) == 'O'))):
1462
                (primary, secondary) = _metaph_add('J', 'H')
1463
            elif current == last:
1464
                (primary, secondary) = _metaph_add('J', ' ')
1465
            elif (not _string_at((current + 1), 1,
1466
                                 {'L', 'T', 'K', 'S', 'N', 'M', 'B', 'Z'}) and
1467
                  not _string_at((current - 1), 1, {'S', 'K', 'L'})):
1468
                (primary, secondary) = _metaph_add('J')
1469
1470
            if _get_at(current + 1) == 'J':  # it could happen!
1471
                current += 2
1472
            else:
1473
                current += 1
1474
            continue
1475
1476
        elif _get_at(current) == 'K':
1477
            if _get_at(current + 1) == 'K':
1478
                current += 2
1479
            else:
1480
                current += 1
1481
            (primary, secondary) = _metaph_add('K')
1482
            continue
1483
1484
        elif _get_at(current) == 'L':
1485
            if _get_at(current + 1) == 'L':
1486
                # Spanish e.g. 'cabrillo', 'gallegos'
1487
                if (((current == (length - 3)) and
1488
                     _string_at((current - 1), 4, {'ILLO', 'ILLA', 'ALLE'})) or
1489
                        ((_string_at((last - 1), 2, {'AS', 'OS'}) or
1490
                          _string_at(last, 1, {'A', 'O'})) and
1491
                         _string_at((current - 1), 4, {'ALLE'}))):
1492
                    (primary, secondary) = _metaph_add('L', ' ')
1493
                    current += 2
1494
                    continue
1495
                current += 2
1496
            else:
1497
                current += 1
1498
            (primary, secondary) = _metaph_add('L')
1499
            continue
1500
1501
        elif _get_at(current) == 'M':
1502
            if (((_string_at((current - 1), 3, {'UMB'}) and
1503
                  (((current + 1) == last) or
1504
                   _string_at((current + 2), 2, {'ER'}))) or
1505
                 # 'dumb', 'thumb'
1506
                 (_get_at(current + 1) == 'M'))):
1507
                current += 2
1508
            else:
1509
                current += 1
1510
            (primary, secondary) = _metaph_add('M')
1511
            continue
1512
1513
        elif _get_at(current) == 'N':
1514
            if _get_at(current + 1) == 'N':
1515
                current += 2
1516
            else:
1517
                current += 1
1518
            (primary, secondary) = _metaph_add('N')
1519
            continue
1520
1521
        elif _get_at(current) == 'Ñ':
1522
            current += 1
1523
            (primary, secondary) = _metaph_add('N')
1524
            continue
1525
1526
        elif _get_at(current) == 'P':
1527
            if _get_at(current + 1) == 'H':
1528
                (primary, secondary) = _metaph_add('F')
1529
                current += 2
1530
                continue
1531
1532
            # also account for "campbell", "raspberry"
1533
            elif _string_at((current + 1), 1, {'P', 'B'}):
1534
                current += 2
1535
            else:
1536
                current += 1
1537
            (primary, secondary) = _metaph_add('P')
1538
            continue
1539
1540
        elif _get_at(current) == 'Q':
1541
            if _get_at(current + 1) == 'Q':
1542
                current += 2
1543
            else:
1544
                current += 1
1545
            (primary, secondary) = _metaph_add('K')
1546
            continue
1547
1548
        elif _get_at(current) == 'R':
1549
            # french e.g. 'rogier', but exclude 'hochmeier'
1550
            if (((current == last) and
1551
                 not _slavo_germanic() and
1552
                 _string_at((current - 2), 2, {'IE'}) and
1553
                 not _string_at((current - 4), 2, {'ME', 'MA'}))):
1554
                (primary, secondary) = _metaph_add('', 'R')
1555
            else:
1556
                (primary, secondary) = _metaph_add('R')
1557
1558
            if _get_at(current + 1) == 'R':
1559
                current += 2
1560
            else:
1561
                current += 1
1562
            continue
1563
1564
        elif _get_at(current) == 'S':
1565
            # special cases 'island', 'isle', 'carlisle', 'carlysle'
1566
            if _string_at((current - 1), 3, {'ISL', 'YSL'}):
1567
                current += 1
1568
                continue
1569
1570
            # special case 'sugar-'
1571
            elif (current == 0) and _string_at(current, 5, {'SUGAR'}):
1572
                (primary, secondary) = _metaph_add('X', 'S')
1573
                current += 1
1574
                continue
1575
1576
            elif _string_at(current, 2, {'SH'}):
1577
                # Germanic
1578
                if _string_at((current + 1), 4,
1579
                              {'HEIM', 'HOEK', 'HOLM', 'HOLZ'}):
1580
                    (primary, secondary) = _metaph_add('S')
1581
                else:
1582
                    (primary, secondary) = _metaph_add('X')
1583
                current += 2
1584
                continue
1585
1586
            # Italian & Armenian
1587
            elif (_string_at(current, 3, {'SIO', 'SIA'}) or
1588
                  _string_at(current, 4, {'SIAN'})):
1589
                if not _slavo_germanic():
1590
                    (primary, secondary) = _metaph_add('S', 'X')
1591
                else:
1592
                    (primary, secondary) = _metaph_add('S')
1593
                current += 3
1594
                continue
1595
1596
            # German & anglicisations, e.g. 'smith' match 'schmidt',
1597
            #                               'snider' match 'schneider'
1598
            # also, -sz- in Slavic language although in Hungarian it is
1599
            #       pronounced 's'
1600
            elif (((current == 0) and
1601
                   _string_at((current + 1), 1, {'M', 'N', 'L', 'W'})) or
1602
                  _string_at((current + 1), 1, {'Z'})):
1603
                (primary, secondary) = _metaph_add('S', 'X')
1604
                if _string_at((current + 1), 1, {'Z'}):
1605
                    current += 2
1606
                else:
1607
                    current += 1
1608
                continue
1609
1610
            elif _string_at(current, 2, {'SC'}):
1611
                # Schlesinger's rule
1612
                if _get_at(current + 2) == 'H':
1613
                    # dutch origin, e.g. 'school', 'schooner'
1614
                    if _string_at((current + 3), 2,
1615
                                  {'OO', 'ER', 'EN', 'UY', 'ED', 'EM'}):
1616
                        # 'schermerhorn', 'schenker'
1617
                        if _string_at((current + 3), 2, {'ER', 'EN'}):
1618
                            (primary, secondary) = _metaph_add('X', 'SK')
1619
                        else:
1620
                            (primary, secondary) = _metaph_add('SK')
1621
                        current += 3
1622
                        continue
1623
                    else:
1624
                        if (((current == 0) and not _is_vowel(3) and
1625
                             (_get_at(3) != 'W'))):
1626
                            (primary, secondary) = _metaph_add('X', 'S')
1627
                        else:
1628
                            (primary, secondary) = _metaph_add('X')
1629
                        current += 3
1630
                        continue
1631
1632
                elif _string_at((current + 2), 1, {'I', 'E', 'Y'}):
1633
                    (primary, secondary) = _metaph_add('S')
1634
                    current += 3
1635
                    continue
1636
1637
                # else
1638
                else:
1639
                    (primary, secondary) = _metaph_add('SK')
1640
                    current += 3
1641
                    continue
1642
1643
            else:
1644
                # french e.g. 'resnais', 'artois'
1645
                if (current == last) and _string_at((current - 2), 2,
1646
                                                    {'AI', 'OI'}):
1647
                    (primary, secondary) = _metaph_add('', 'S')
1648
                else:
1649
                    (primary, secondary) = _metaph_add('S')
1650
1651
                if _string_at((current + 1), 1, {'S', 'Z'}):
1652
                    current += 2
1653
                else:
1654
                    current += 1
1655
                continue
1656
1657
        elif _get_at(current) == 'T':
1658
            if _string_at(current, 4, {'TION'}):
1659
                (primary, secondary) = _metaph_add('X')
1660
                current += 3
1661
                continue
1662
1663
            elif _string_at(current, 3, {'TIA', 'TCH'}):
1664
                (primary, secondary) = _metaph_add('X')
1665
                current += 3
1666
                continue
1667
1668
            elif (_string_at(current, 2, {'TH'}) or
1669
                  _string_at(current, 3, {'TTH'})):
1670
                # special case 'thomas', 'thames' or germanic
1671
                if ((_string_at((current + 2), 2, {'OM', 'AM'}) or
1672
                     _string_at(0, 4, {'VAN ', 'VON '}) or
1673
                     _string_at(0, 3, {'SCH'}))):
1674
                    (primary, secondary) = _metaph_add('T')
1675
                else:
1676
                    (primary, secondary) = _metaph_add('0', 'T')
1677
                current += 2
1678
                continue
1679
1680
            elif _string_at((current + 1), 1, {'T', 'D'}):
1681
                current += 2
1682
            else:
1683
                current += 1
1684
            (primary, secondary) = _metaph_add('T')
1685
            continue
1686
1687
        elif _get_at(current) == 'V':
1688
            if _get_at(current + 1) == 'V':
1689
                current += 2
1690
            else:
1691
                current += 1
1692
            (primary, secondary) = _metaph_add('F')
1693
            continue
1694
1695
        elif _get_at(current) == 'W':
1696
            # can also be in middle of word
1697
            if _string_at(current, 2, {'WR'}):
1698
                (primary, secondary) = _metaph_add('R')
1699
                current += 2
1700
                continue
1701
            elif ((current == 0) and
1702
                  (_is_vowel(current + 1) or _string_at(current, 2, {'WH'}))):
1703
                # Wasserman should match Vasserman
1704
                if _is_vowel(current + 1):
1705
                    (primary, secondary) = _metaph_add('A', 'F')
1706
                else:
1707
                    # need Uomo to match Womo
1708
                    (primary, secondary) = _metaph_add('A')
1709
1710
            # Arnow should match Arnoff
1711
            if ((((current == last) and _is_vowel(current - 1)) or
1712
                 _string_at((current - 1), 5,
1713
                            {'EWSKI', 'EWSKY', 'OWSKI', 'OWSKY'}) or
1714
                 _string_at(0, 3, {'SCH'}))):
1715
                (primary, secondary) = _metaph_add('', 'F')
1716
                current += 1
1717
                continue
1718
            # Polish e.g. 'filipowicz'
1719
            elif _string_at(current, 4, {'WICZ', 'WITZ'}):
1720
                (primary, secondary) = _metaph_add('TS', 'FX')
1721
                current += 4
1722
                continue
1723
            # else skip it
1724
            else:
1725
                current += 1
1726
                continue
1727
1728
        elif _get_at(current) == 'X':
1729
            # French e.g. breaux
1730
            if (not ((current == last) and
1731
                     (_string_at((current - 3), 3, {'IAU', 'EAU'}) or
1732
                      _string_at((current - 2), 2, {'AU', 'OU'})))):
1733
                (primary, secondary) = _metaph_add('KS')
1734
1735
            if _string_at((current + 1), 1, {'C', 'X'}):
1736
                current += 2
1737
            else:
1738
                current += 1
1739
            continue
1740
1741
        elif _get_at(current) == 'Z':
1742
            # Chinese Pinyin e.g. 'zhao'
1743
            if _get_at(current + 1) == 'H':
1744
                (primary, secondary) = _metaph_add('J')
1745
                current += 2
1746
                continue
1747
            elif (_string_at((current + 1), 2, {'ZO', 'ZI', 'ZA'}) or
1748
                  (_slavo_germanic() and ((current > 0) and
1749
                                          _get_at(current - 1) != 'T'))):
1750
                (primary, secondary) = _metaph_add('S', 'TS')
1751
            else:
1752
                (primary, secondary) = _metaph_add('S')
1753
1754
            if _get_at(current + 1) == 'Z':
1755
                current += 2
1756
            else:
1757
                current += 1
1758
            continue
1759
1760
        else:
1761
            current += 1
1762
1763
    if maxlength and maxlength < _INFINITY:
1764
        primary = primary[:maxlength]
1765
        secondary = secondary[:maxlength]
1766
    if primary == secondary:
1767
        secondary = ''
1768
1769
    return (primary, secondary)
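# Example (added sketch, not part of the original source): assuming the
# enclosing function above is this module's double_metaphone(), a common way
# to use the (primary, secondary) pair is to treat two names as equivalent
# whenever their non-empty codes overlap:
#
#     def _dm_equivalent(name1, name2):
#         codes1 = set(double_metaphone(name1)) - {''}
#         codes2 = set(double_metaphone(name2)) - {''}
#         return bool(codes1 & codes2)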
1770
1771
1772
def caverphone(word, version=2):
1773
    """Return the Caverphone code for a word.
1774
1775
    A description of version 1 of the algorithm can be found at:
1776
    http://caversham.otago.ac.nz/files/working/ctp060902.pdf
1777
1778
    A description of version 2 of the algorithm can be found at:
1779
    http://caversham.otago.ac.nz/files/working/ctp150804.pdf
1780
1781
    :param str word: the word to transform
1782
    :param int version: the version of Caverphone to employ for encoding
1783
        (defaults to 2)
1784
    :returns: the Caverphone value
1785
    :rtype: str
1786
1787
    >>> caverphone('Christopher')
1788
    'KRSTFA1111'
1789
    >>> caverphone('Niall')
1790
    'NA11111111'
1791
    >>> caverphone('Smith')
1792
    'SMT1111111'
1793
    >>> caverphone('Schmidt')
1794
    'SKMT111111'
1795
1796
    >>> caverphone('Christopher', 1)
1797
    'KRSTF1'
1798
    >>> caverphone('Niall', 1)
1799
    'N11111'
1800
    >>> caverphone('Smith', 1)
1801
    'SMT111'
1802
    >>> caverphone('Schmidt', 1)
1803
    'SKMT11'
1804
    """
1805
    _vowels = {'a', 'e', 'i', 'o', 'u'}
1806
1807
    word = word.lower()
1808
    word = ''.join(c for c in word if c in
1809
                   {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
1810
                    'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x',
1811
                    'y', 'z'})
1812
1813
    def _squeeze_replace(word, char, new_char):
1814
        """Convert strings of char in word to one instance of new_char."""
1815
        while char * 2 in word:
1816
            word = word.replace(char * 2, char)
1817
        return word.replace(char, new_char)
1818
1819
    # the main replacement algorithm
1820
    if version != 1 and word[-1:] == 'e':
1821
        word = word[:-1]
1822
    if word:
1823
        if word[:5] == 'cough':
1824
            word = 'cou2f'+word[5:]
1825
        if word[:5] == 'rough':
1826
            word = 'rou2f'+word[5:]
1827
        if word[:5] == 'tough':
1828
            word = 'tou2f'+word[5:]
1829
        if word[:6] == 'enough':
1830
            word = 'enou2f'+word[6:]
1831
        if version != 1 and word[:6] == 'trough':
1832
            word = 'trou2f'+word[6:]
1833
        if word[:2] == 'gn':
1834
            word = '2n'+word[2:]
1835
        if word[-2:] == 'mb':
1836
            word = word[:-1]+'2'
1837
        word = word.replace('cq', '2q')
1838
        word = word.replace('ci', 'si')
1839
        word = word.replace('ce', 'se')
1840
        word = word.replace('cy', 'sy')
1841
        word = word.replace('tch', '2ch')
1842
        word = word.replace('c', 'k')
1843
        word = word.replace('q', 'k')
1844
        word = word.replace('x', 'k')
1845
        word = word.replace('v', 'f')
1846
        word = word.replace('dg', '2g')
1847
        word = word.replace('tio', 'sio')
1848
        word = word.replace('tia', 'sia')
1849
        word = word.replace('d', 't')
1850
        word = word.replace('ph', 'fh')
1851
        word = word.replace('b', 'p')
1852
        word = word.replace('sh', 's2')
1853
        word = word.replace('z', 's')
1854
        if word[0] in _vowels:
1855
            word = 'A'+word[1:]
1856
        word = word.replace('a', '3')
1857
        word = word.replace('e', '3')
1858
        word = word.replace('i', '3')
1859
        word = word.replace('o', '3')
1860
        word = word.replace('u', '3')
1861
        if version != 1:
1862
            word = word.replace('j', 'y')
1863
            if word[:2] == 'y3':
1864
                word = 'Y3'+word[2:]
1865
            if word[:1] == 'y':
1866
                word = 'A'+word[1:]
1867
            word = word.replace('y', '3')
1868
        word = word.replace('3gh3', '3kh3')
1869
        word = word.replace('gh', '22')
1870
        word = word.replace('g', 'k')
1871
1872
        word = _squeeze_replace(word, 's', 'S')
1873
        word = _squeeze_replace(word, 't', 'T')
1874
        word = _squeeze_replace(word, 'p', 'P')
1875
        word = _squeeze_replace(word, 'k', 'K')
1876
        word = _squeeze_replace(word, 'f', 'F')
1877
        word = _squeeze_replace(word, 'm', 'M')
1878
        word = _squeeze_replace(word, 'n', 'N')
1879
1880
        word = word.replace('w3', 'W3')
1881
        if version == 1:
1882
            word = word.replace('wy', 'Wy')
1883
        word = word.replace('wh3', 'Wh3')
1884
        if version == 1:
1885
            word = word.replace('why', 'Why')
1886
        if version != 1 and word[-1:] == 'w':
1887
            word = word[:-1]+'3'
1888
        word = word.replace('w', '2')
1889
        if word[:1] == 'h':
1890
            word = 'A'+word[1:]
1891
        word = word.replace('h', '2')
1892
        word = word.replace('r3', 'R3')
1893
        if version == 1:
1894
            word = word.replace('ry', 'Ry')
1895
        if version != 1 and word[-1:] == 'r':
1896
            word = word[:-1]+'3'
1897
        word = word.replace('r', '2')
1898
        word = word.replace('l3', 'L3')
1899
        if version == 1:
1900
            word = word.replace('ly', 'Ly')
1901
        if version != 1 and word[-1:] == 'l':
1902
            word = word[:-1]+'3'
1903
        word = word.replace('l', '2')
1904
        if version == 1:
1905
            word = word.replace('j', 'y')
1906
            word = word.replace('y3', 'Y3')
1907
            word = word.replace('y', '2')
1908
        word = word.replace('2', '')
1909
        if version != 1 and word[-1:] == '3':
1910
            word = word[:-1]+'A'
1911
        word = word.replace('3', '')
1912
1913
    # pad with 1s, then extract the necessary length of code
1914
    word = word+'1'*10
1915
    if version != 1:
1916
        word = word[:10]
1917
    else:
1918
        word = word[:6]
1919
1920
    return word
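# Example (added sketch, not part of the original source): Caverphone codes
# are padded to a fixed length with '1's, so two spellings are compared by
# simple equality of their codes; no particular result is asserted here:
#
#     caverphone('Stevenson') == caverphone('Stephenson')
#     caverphone('Stevenson', 1) == caverphone('Stephenson', 1)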
1921
1922
1923
def alpha_sis(word, maxlength=14):
1924
    """Return the IBM Alpha Search Inquiry System code for a word.
1925
1926
    Based on the algorithm described in "Accessing individual records from
1927
    personal data files using non-unique identifiers" / Gwendolyn B. Moore,
1928
    et al.; prepared for the Institute for Computer Sciences and Technology,
1929
    National Bureau of Standards, Washington, D.C (1977):
1930
    https://archive.org/stream/accessingindivid00moor#page/15/mode/1up
1931
1932
    A collection is necessary since there can be multiple values for a
1933
    single word. But the collection must be ordered since the first value
1934
    is the primary coding.
1935
1936
    :param str word: the word to transform
1937
    :param int maxlength: the length of the code returned (defaults to 14)
1938
    :returns: the Alpha SIS value
1939
    :rtype: tuple
1940
1941
    >>> alpha_sis('Christopher')
1942
    ('06401840000000', '07040184000000', '04018400000000')
1943
    >>> alpha_sis('Niall')
1944
    ('02500000000000',)
1945
    >>> alpha_sis('Smith')
1946
    ('03100000000000',)
1947
    >>> alpha_sis('Schmidt')
1948
    ('06310000000000',)
1949
    """
1950
    _alpha_sis_initials = {'GF': '08', 'GM': '03', 'GN': '02', 'KN': '02',
1951
                           'PF': '08', 'PN': '02', 'PS': '00', 'WR': '04',
1952
                           'A': '1', 'E': '1', 'H': '2', 'I': '1', 'J': '3',
1953
                           'O': '1', 'U': '1', 'W': '4', 'Y': '5'}
1954
    _alpha_sis_initials_order = ('GF', 'GM', 'GN', 'KN', 'PF', 'PN', 'PS',
1955
                                 'WR', 'A', 'E', 'H', 'I', 'J', 'O', 'U', 'W',
1956
                                 'Y')
1957
    _alpha_sis_basic = {'SCH': '6', 'CZ': ('70', '6', '0'),
1958
                        'CH': ('6', '70', '0'), 'CK': ('7', '6'),
1959
                        'DS': ('0', '10'), 'DZ': ('0', '10'),
1960
                        'TS': ('0', '10'), 'TZ': ('0', '10'), 'CI': '0',
1961
                        'CY': '0', 'CE': '0', 'SH': '6', 'DG': '7', 'PH': '8',
1962
                        'C': ('7', '6'), 'K': ('7', '6'), 'Z': '0', 'S': '0',
1963
                        'D': '1', 'T': '1', 'N': '2', 'M': '3', 'R': '4',
1964
                        'L': '5', 'J': '6', 'G': '7', 'Q': '7', 'X': '7',
1965
                        'F': '8', 'V': '8', 'B': '9', 'P': '9'}
1966
    _alpha_sis_basic_order = ('SCH', 'CZ', 'CH', 'CK', 'DS', 'DZ', 'TS', 'TZ',
1967
                              'CI', 'CY', 'CE', 'SH', 'DG', 'PH', 'C', 'K',
1968
                              'Z', 'S', 'D', 'T', 'N', 'M', 'R', 'L', 'J', 'C',
1969
                              'G', 'K', 'Q', 'X', 'F', 'V', 'B', 'P')
1970
1971
    alpha = ['']
1972
    pos = 0
1973
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
1974
    word = word.replace('ß', 'SS')
1975
    word = ''.join(c for c in word if c in
1976
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
1977
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
1978
                    'Y', 'Z'})
1979
1980
    # Clamp maxlength to [4, 64]
1981
    if maxlength is not None:
1982
        maxlength = min(max(4, maxlength), 64)
1983
    else:
1984
        maxlength = 64
1985
1986
    # Do special processing for initial substrings
1987
    for k in _alpha_sis_initials_order:
1988
        if word.startswith(k):
1989
            alpha[0] += _alpha_sis_initials[k]
1990
            pos += len(k)
1991
            break
1992
1993
    # Add a '0' if alpha is still empty
1994
    if not alpha[0]:
1995
        alpha[0] += '0'
1996
1997
    # Whether or not any special initial codes were encoded, iterate
1998
    # through the length of the word in the main encoding loop
1999
    while pos < len(word):
2000
        origpos = pos
2001
        for k in _alpha_sis_basic_order:
2002
            if word[pos:].startswith(k):
2003
                if isinstance(_alpha_sis_basic[k], tuple):
2004
                    newalpha = []
2005
                    for i in range(len(_alpha_sis_basic[k])):
2006
                        newalpha += [_ + _alpha_sis_basic[k][i] for _ in alpha]
2007
                    alpha = newalpha
2008
                else:
2009
                    alpha = [_ + _alpha_sis_basic[k] for _ in alpha]
2010
                pos += len(k)
2011
                break
2012
        if pos == origpos:
2013
            alpha = [_ + '_' for _ in alpha]
2014
            pos += 1
2015
2016
    # Trim doublets and placeholders
2017
    for i in range(len(alpha)):
2018
        pos = 1
2019
        while pos < len(alpha[i]):
2020
            if alpha[i][pos] == alpha[i][pos-1]:
2021
                alpha[i] = alpha[i][:pos]+alpha[i][pos+1:]
2022
            pos += 1
2023
    alpha = (_.replace('_', '') for _ in alpha)
2024
2025
    # Trim codes and return tuple
2026
    alpha = ((_ + ('0'*maxlength))[:maxlength] for _ in alpha)
2027
    return tuple(alpha)
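# Example (added sketch, not part of the original source): alpha_sis() returns
# a tuple because some letter groups have alternate encodings; as noted in the
# docstring, the first element is the primary coding, so a single lookup key
# can be taken as:
#
#     primary = alpha_sis('Christopher')[0]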
2028
2029
2030
def fuzzy_soundex(word, maxlength=5, zero_pad=True):
2031
    """Return the Fuzzy Soundex code for a word.
2032
2033
    Fuzzy Soundex is an algorithm derived from Soundex, defined in:
2034
    Holmes, David and M. Catherine McCabe. "Improving Precision and Recall for
2035
    Soundex Retrieval."
2036
    http://wayback.archive.org/web/20100629121128/http://www.ir.iit.edu/publications/downloads/IEEESoundexV5.pdf
2037
2038
    :param str word: the word to transform
2039
    :param int maxlength: the length of the code returned (defaults to 5)
2040
    :param bool zero_pad: pad the end of the return value with 0s to achieve
2041
        a maxlength string
2042
    :returns: the Fuzzy Soundex value
2043
    :rtype: str
2044
2045
    >>> fuzzy_soundex('Christopher')
2046
    'K6931'
2047
    >>> fuzzy_soundex('Niall')
2048
    'N4000'
2049
    >>> fuzzy_soundex('Smith')
2050
    'S5300'
2051
2053
    """
2054
    _fuzzy_soundex_translation = dict(zip((ord(_) for _ in
2055
                                           'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
2056
                                          '0193017-07745501769301-7-9'))
2057
2058
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
2059
    word = word.replace('ß', 'SS')
2060
2061
    # Clamp maxlength to [4, 64]
2062
    if maxlength is not None:
2063
        maxlength = min(max(4, maxlength), 64)
2064
    else:
2065
        maxlength = 64
2066
2067
    if not word:
2068
        if zero_pad:
2069
            return '0' * maxlength
2070
        return '0'
2071
2072
    if word[:2] in {'CS', 'CZ', 'TS', 'TZ'}:
2073
        word = 'SS' + word[2:]
2074
    elif word[:2] == 'GN':
2075
        word = 'NN' + word[2:]
2076
    elif word[:2] in {'HR', 'WR'}:
2077
        word = 'RR' + word[2:]
2078
    elif word[:2] == 'HW':
2079
        word = 'WW' + word[2:]
2080
    elif word[:2] in {'KN', 'NG'}:
2081
        word = 'NN' + word[2:]
2082
2083
    if word[-2:] == 'CH':
2084
        word = word[:-2] + 'KK'
2085
    elif word[-2:] == 'NT':
2086
        word = word[:-2] + 'TT'
2087
    elif word[-2:] == 'RT':
2088
        word = word[:-2] + 'RR'
2089
    elif word[-3:] == 'RDT':
2090
        word = word[:-3] + 'RR'
2091
2092
    word = word.replace('CA', 'KA')
2093
    word = word.replace('CC', 'KK')
2094
    word = word.replace('CK', 'KK')
2095
    word = word.replace('CE', 'SE')
2096
    word = word.replace('CHL', 'KL')
2097
    word = word.replace('CL', 'KL')
2098
    word = word.replace('CHR', 'KR')
2099
    word = word.replace('CR', 'KR')
2100
    word = word.replace('CI', 'SI')
2101
    word = word.replace('CO', 'KO')
2102
    word = word.replace('CU', 'KU')
2103
    word = word.replace('CY', 'SY')
2104
    word = word.replace('DG', 'GG')
2105
    word = word.replace('GH', 'HH')
2106
    word = word.replace('MAC', 'MK')
2107
    word = word.replace('MC', 'MK')
2108
    word = word.replace('NST', 'NSS')
2109
    word = word.replace('PF', 'FF')
2110
    word = word.replace('PH', 'FF')
2111
    word = word.replace('SCH', 'SSS')
2112
    word = word.replace('TIO', 'SIO')
2113
    word = word.replace('TIA', 'SIO')
2114
    word = word.replace('TCH', 'CHH')
2115
2116
    sdx = word.translate(_fuzzy_soundex_translation)
2117
    sdx = sdx.replace('-', '')
2118
2119
    # remove repeating characters
2120
    sdx = _delete_consecutive_repeats(sdx)
2121
2122
    if word[0] in {'H', 'W', 'Y'}:
2123
        sdx = word[0] + sdx
2124
    else:
2125
        sdx = word[0] + sdx[1:]
2126
2127
    sdx = sdx.replace('0', '')
2128
2129
    if zero_pad:
2130
        sdx += ('0'*maxlength)
2131
2132
    return sdx[:maxlength]
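# Example (added sketch, not part of the original source): Fuzzy Soundex keys
# are compared for equality like ordinary Soundex keys; maxlength and zero_pad
# control the key's length and padding, e.g.:
#
#     fuzzy_soundex('Christopher', maxlength=6)     # longer, zero-padded key
#     fuzzy_soundex('Christopher', zero_pad=False)  # key without trailing 0s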
2133
2134
2135
def phonex(word, maxlength=4, zero_pad=True):
2136
    """Return the Phonex code for a word.
2137
2138
    Phonex is an algorithm derived from Soundex, defined in:
2139
    Lait, A. J. and B. Randell. "An Assessment of Name Matching Algorithms".
2140
    http://homepages.cs.ncl.ac.uk/brian.randell/Genealogy/NameMatching.pdf
2141
2142
    :param str word: the word to transform
2143
    :param int maxlength: the length of the code returned (defaults to 4)
2144
    :param bool zero_pad: pad the end of the return value with 0s to achieve
2145
        a maxlength string
2146
    :returns: the Phonex value
2147
    :rtype: str
2148
2149
    >>> phonex('Christopher')
2150
    'C623'
2151
    >>> phonex('Niall')
2152
    'N400'
2153
    >>> phonex('Schmidt')
2154
    'S253'
2155
    >>> phonex('Smith')
2156
    'S530'
2157
    """
2158
    name = unicodedata.normalize('NFKD', text_type(word.upper()))
2159
    name = name.replace('ß', 'SS')
2160
2161
    # Clamp maxlength to [4, 64]
2162
    if maxlength is not None:
2163
        maxlength = min(max(4, maxlength), 64)
2164
    else:
2165
        maxlength = 64
2166
2167
    name_code = last = ''
2168
2169
    # Deletions effected by replacing with next letter which
2170
    # will be ignored due to duplicate handling of Soundex code.
2171
    # This is faster than 'moving' all subsequent letters.
2172
2173
    # Remove any trailing Ss
2174
    while name[-1:] == 'S':
2175
        name = name[:-1]
2176
2177
    # Phonetic equivalents of first 2 characters
2178
    # Works since duplicate letters are ignored
2179
    if name[:2] == 'KN':
2180
        name = 'N' + name[2:]  # KN.. == N..
2181
    elif name[:2] == 'PH':
2182
        name = 'F' + name[2:]  # PH.. == F.. (H ignored anyway)
2183
    elif name[:2] == 'WR':
2184
        name = 'R' + name[2:]  # WR.. == R..
2185
2186
    if name:
2187
        # Special case, ignore H first letter (subsequent Hs ignored anyway)
2188
        # Works since duplicate letters are ignored
2189
        if name[0] == 'H':
2190
            name = name[1:]
2191
2192
    if name:
2193
        # Phonetic equivalents of first character
2194
        if name[0] in {'A', 'E', 'I', 'O', 'U', 'Y'}:
2195
            name = 'A' + name[1:]
2196
        elif name[0] in {'B', 'P'}:
2197
            name = 'B' + name[1:]
2198
        elif name[0] in {'V', 'F'}:
2199
            name = 'F' + name[1:]
2200
        elif name[0] in {'C', 'K', 'Q'}:
2201
            name = 'C' + name[1:]
2202
        elif name[0] in {'G', 'J'}:
2203
            name = 'G' + name[1:]
2204
        elif name[0] in {'S', 'Z'}:
2205
            name = 'S' + name[1:]
2206
2207
        name_code = last = name[0]
2208
2209
    # MODIFIED SOUNDEX CODE
2210
    for i in range(1, len(name)):
2211
        code = '0'
2212
        if name[i] in {'B', 'F', 'P', 'V'}:
2213
            code = '1'
2214
        elif name[i] in {'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'}:
2215
            code = '2'
2216
        elif name[i] in {'D', 'T'}:
2217
            if name[i+1:i+2] != 'C':
2218
                code = '3'
2219
        elif name[i] == 'L':
2220
            if (name[i+1:i+2] in {'A', 'E', 'I', 'O', 'U', 'Y'} or
2221
                    i+1 == len(name)):
2222
                code = '4'
2223
        elif name[i] in {'M', 'N'}:
2224
            if name[i+1:i+2] in {'D', 'G'}:
2225
                name = name[:i+1] + name[i] + name[i+2:]
2226
            code = '5'
2227
        elif name[i] == 'R':
2228
            if (name[i+1:i+2] in {'A', 'E', 'I', 'O', 'U', 'Y'} or
2229
                    i+1 == len(name)):
2230
                code = '6'
2231
2232
        if code != last and code != '0' and i != 0:
2233
            name_code += code
2234
2235
        last = name_code[-1]
2236
2237
    if zero_pad:
2238
        name_code += '0' * maxlength
2239
    if not name_code:
2240
        name_code = '0'
2241
    return name_code[:maxlength]
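# Example (added sketch, not part of the original source): phonex() keys can
# be used wherever Soundex keys are used, e.g. for blocking candidate matches
# by key (the names below are hypothetical):
#
#     from collections import defaultdict
#     blocks = defaultdict(list)
#     for name in ('Christopher', 'Kristofer', 'Niall'):
#         blocks[phonex(name)].append(name)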
2242
2243
2244
def phonem(word):
2245
    """Return the Phonem code for a word.
2246
2247
    Phonem is defined in:
2248
    Wilde, Georg and Carsten Meyer. 1988. "Nicht wörtlich genommen,
2249
    'Schreibweisentolerante' Suchroutine in dBASE implementiert." c't Magazin
2250
    für Computer Technik. Oct. 1988. 126--131.
2251
2252
    This version is based on the Perl implementation documented at:
2253
    http://ifl.phil-fak.uni-koeln.de/sites/linguistik/Phonetik/import/Phonetik_Files/Allgemeine_Dateien/Martin_Wilz.pdf
2254
    It includes some enhancements presented in the Java port at:
2255
    https://github.com/dcm4che/dcm4che/blob/master/dcm4che-soundex/src/main/java/org/dcm4che3/soundex/Phonem.java
2256
2257
    Phonem is intended chiefly for German names/words.
2258
2259
    :param str word: the word to transform
2260
    :returns: the Phonem value
2261
    :rtype: str
2262
2263
    >>> phonem('Christopher')
2264
    'CRYSDOVR'
2265
    >>> phonem('Niall')
2266
    'NYAL'
2267
    >>> phonem('Smith')
2268
    'SMYD'
2269
    >>> phonem('Schmidt')
2270
    'CMYD'
2271
    """
2272
    _phonem_substitutions = (('SC', 'C'), ('SZ', 'C'), ('CZ', 'C'),
2273
                             ('TZ', 'C'), ('TS', 'C'), ('KS', 'X'),
2274
                             ('PF', 'V'), ('QU', 'KW'), ('PH', 'V'),
2275
                             ('UE', 'Y'), ('AE', 'E'), ('OE', 'Ö'),
2276
                             ('EI', 'AY'), ('EY', 'AY'), ('EU', 'OY'),
2277
                             ('AU', 'A§'), ('OU', '§'))
2278
    _phonem_translation = dict(zip((ord(_) for _ in
2279
                                    'ZKGQÇÑßFWPTÁÀÂÃÅÄÆÉÈÊËIJÌÍÎÏÜݧÚÙÛÔÒÓÕØ'),
2280
                                   'CCCCCNSVVBDAAAAAEEEEEEYYYYYYYYUUUUOOOOÖ'))
2281
2282
    word = unicodedata.normalize('NFC', text_type(word.upper()))
2283
    for i, j in _phonem_substitutions:
2284
        word = word.replace(i, j)
2285
    word = word.translate(_phonem_translation)
2286
2287
    return ''.join(c for c in _delete_consecutive_repeats(word)
2288
                   if c in {'A', 'B', 'C', 'D', 'L', 'M', 'N', 'O', 'R', 'S',
2289
                            'U', 'V', 'W', 'X', 'Y', 'Ö'})
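# Example (added sketch, not part of the original source): phonem() takes no
# length or padding parameters and returns a variable-length code, so two
# spellings are again compared by equality of the returned strings; no
# particular result is asserted here:
#
#     phonem('Meyer') == phonem('Meier')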
2290
2291
2292
def phonix(word, maxlength=4, zero_pad=True):
2293
    """Return the Phonix code for a word.
2294
2295
    Phonix is a Soundex-like algorithm defined in:
2296
    T.N. Gadd: PHONIX --- The Algorithm, Program 24/4, 1990, p.363-366.
2297
2298
    This implementation is based on
2299
    http://cpansearch.perl.org/src/ULPFR/WAIT-1.800/soundex.c
2300
    http://cs.anu.edu.au/people/Peter.Christen/Febrl/febrl-0.4.01/encode.py
2301
    and
2302
    https://metacpan.org/pod/Text::Phonetic::Phonix
2303
2304
    :param str word: the word to transform
2305
    :param int maxlength: the length of the code returned (defaults to 4)
2306
    :param bool zero_pad: pad the end of the return value with 0s to achieve
2307
        a maxlength string
2308
    :returns: the Phonix value
2309
    :rtype: str
2310
2311
    >>> phonix('Christopher')
2312
    'K683'
2313
    >>> phonix('Niall')
2314
    'N400'
2315
    >>> phonix('Smith')
2316
    'S530'
2317
    >>> phonix('Schmidt')
2318
    'S530'
2319
    """
2320
    # pylint: disable=too-many-branches
2321
    def _start_repl(word, src, tar, post=None):
2322
        r"""Replace src with tar at the start of word."""
2323
        if post:
2324
            for i in post:
2325
                if word.startswith(src+i):
2326
                    return tar + word[len(src):]
2327
        elif word.startswith(src):
2328
            return tar + word[len(src):]
2329
        return word
2330
2331
    def _end_repl(word, src, tar, pre=None):
2332
        r"""Replace src with tar at the end of word."""
2333
        if pre:
2334
            for i in pre:
2335
                if word.endswith(i+src):
2336
                    return word[:-len(src)] + tar
2337
        elif word.endswith(src):
2338
            return word[:-len(src)] + tar
2339
        return word
2340
2341
    def _mid_repl(word, src, tar, pre=None, post=None):
2342
        r"""Replace src with tar in the middle of word."""
2343
        if pre or post:
2344
            if not pre:
2345
                return word[0] + _all_repl(word[1:], src, tar, pre, post)
2346
            elif not post:
2347
                return _all_repl(word[:-1], src, tar, pre, post) + word[-1]
2348
            return _all_repl(word, src, tar, pre, post)
2349
        return (word[0] + _all_repl(word[1:-1], src, tar, pre, post) +
2350
                word[-1])
2351
2352
    def _all_repl(word, src, tar, pre=None, post=None):
2353
        r"""Replace src with tar anywhere in word."""
2354
        if pre or post:
2355
            if post:
2356
                post = post
2357
            else:
2358
                post = frozenset(('',))
2359
            if pre:
2360
                pre = pre
2361
            else:
2362
                pre = frozenset(('',))
2363
2364
            for i, j in ((i, j) for i in pre for j in post):
2365
                word = word.replace(i+src+j, i+tar+j)
2366
            return word
2367
        else:
2368
            return word.replace(src, tar)
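    # Note (added, not part of the original source): each entry of
    # _phonix_substitutions below pairs one of the helpers above with its
    # arguments and is applied in order as trans[0](word, *trans[1:]).
    # For instance, (_start_repl, 'KN', 'N') rewrites a leading 'KN' to 'N',
    # so 'KNIGHT' would become 'NIGHT' before the numeric translation step.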
2369
2370
    _vow = {'A', 'E', 'I', 'O', 'U'}
2371
    _con = {'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N', 'P', 'Q',
2372
            'R', 'S', 'T', 'V', 'W', 'X', 'Y', 'Z'}
2373
2374
    _phonix_substitutions = ((_all_repl, 'DG', 'G'),
2375
                             (_all_repl, 'CO', 'KO'),
2376
                             (_all_repl, 'CA', 'KA'),
2377
                             (_all_repl, 'CU', 'KU'),
2378
                             (_all_repl, 'CY', 'SI'),
2379
                             (_all_repl, 'CI', 'SI'),
2380
                             (_all_repl, 'CE', 'SE'),
2381
                             (_start_repl, 'CL', 'KL', _vow),
2382
                             (_all_repl, 'CK', 'K'),
2383
                             (_end_repl, 'GC', 'K'),
2384
                             (_end_repl, 'JC', 'K'),
2385
                             (_start_repl, 'CHR', 'KR', _vow),
2386
                             (_start_repl, 'CR', 'KR', _vow),
2387
                             (_start_repl, 'WR', 'R'),
2388
                             (_all_repl, 'NC', 'NK'),
2389
                             (_all_repl, 'CT', 'KT'),
2390
                             (_all_repl, 'PH', 'F'),
2391
                             (_all_repl, 'AA', 'AR'),
2392
                             (_all_repl, 'SCH', 'SH'),
2393
                             (_all_repl, 'BTL', 'TL'),
2394
                             (_all_repl, 'GHT', 'T'),
2395
                             (_all_repl, 'AUGH', 'ARF'),
2396
                             (_mid_repl, 'LJ', 'LD', _vow, _vow),
2397
                             (_all_repl, 'LOUGH', 'LOW'),
2398
                             (_start_repl, 'Q', 'KW'),
2399
                             (_start_repl, 'KN', 'N'),
2400
                             (_end_repl, 'GN', 'N'),
2401
                             (_all_repl, 'GHN', 'N'),
2402
                             (_end_repl, 'GNE', 'N'),
2403
                             (_all_repl, 'GHNE', 'NE'),
2404
                             (_end_repl, 'GNES', 'NS'),
2405
                             (_start_repl, 'GN', 'N'),
2406
                             (_mid_repl, 'GN', 'N', None, _con),
2407
                             (_end_repl, 'GN', 'N'),
2408
                             (_start_repl, 'PS', 'S'),
2409
                             (_start_repl, 'PT', 'T'),
2410
                             (_start_repl, 'CZ', 'C'),
2411
                             (_mid_repl, 'WZ', 'Z', _vow),
2412
                             (_mid_repl, 'CZ', 'CH'),
2413
                             (_all_repl, 'LZ', 'LSH'),
2414
                             (_all_repl, 'RZ', 'RSH'),
2415
                             (_mid_repl, 'Z', 'S', None, _vow),
2416
                             (_all_repl, 'ZZ', 'TS'),
2417
                             (_mid_repl, 'Z', 'TS', _con),
2418
                             (_all_repl, 'HROUG', 'REW'),
2419
                             (_all_repl, 'OUGH', 'OF'),
2420
                             (_mid_repl, 'Q', 'KW', _vow, _vow),
2421
                             (_mid_repl, 'J', 'Y', _vow, _vow),
2422
                             (_start_repl, 'YJ', 'Y', _vow),
2423
                             (_start_repl, 'GH', 'G'),
2424
                             (_end_repl, 'GH', 'E', _vow),
2425
                             (_start_repl, 'CY', 'S'),
2426
                             (_all_repl, 'NX', 'NKS'),
2427
                             (_start_repl, 'PF', 'F'),
2428
                             (_end_repl, 'DT', 'T'),
2429
                             (_end_repl, 'TL', 'TIL'),
2430
                             (_end_repl, 'DL', 'DIL'),
2431
                             (_all_repl, 'YTH', 'ITH'),
2432
                             (_start_repl, 'TJ', 'CH', _vow),
2433
                             (_start_repl, 'TSJ', 'CH', _vow),
2434
                             (_start_repl, 'TS', 'T', _vow),
2435
                             (_all_repl, 'TCH', 'CH'),
2436
                             (_mid_repl, 'WSK', 'VSKIE', _vow),
2437
                             (_end_repl, 'WSK', 'VSKIE', _vow),
2438
                             (_start_repl, 'MN', 'N', _vow),
2439
                             (_start_repl, 'PN', 'N', _vow),
2440
                             (_mid_repl, 'STL', 'SL', _vow),
2441
                             (_end_repl, 'STL', 'SL', _vow),
2442
                             (_end_repl, 'TNT', 'ENT'),
2443
                             (_end_repl, 'EAUX', 'OH'),
2444
                             (_all_repl, 'EXCI', 'ECS'),
2445
                             (_all_repl, 'X', 'ECS'),
2446
                             (_end_repl, 'NED', 'ND'),
2447
                             (_all_repl, 'JR', 'DR'),
2448
                             (_end_repl, 'EE', 'EA'),
2449
                             (_all_repl, 'ZS', 'S'),
2450
                             (_mid_repl, 'R', 'AH', _vow, _con),
2451
                             (_end_repl, 'R', 'AH', _vow),
2452
                             (_mid_repl, 'HR', 'AH', _vow, _con),
2453
                             (_end_repl, 'HR', 'AH', _vow),
2454
                             (_end_repl, 'HR', 'AH', _vow),
2455
                             (_end_repl, 'RE', 'AR'),
2456
                             (_end_repl, 'R', 'AH', _vow),
2457
                             (_all_repl, 'LLE', 'LE'),
2458
                             (_end_repl, 'LE', 'ILE', _con),
2459
                             (_end_repl, 'LES', 'ILES', _con),
2460
                             (_end_repl, 'E', ''),
2461
                             (_end_repl, 'ES', 'S'),
2462
                             (_end_repl, 'SS', 'AS', _vow),
2463
                             (_end_repl, 'MB', 'M', _vow),
2464
                             (_all_repl, 'MPTS', 'MPS'),
2465
                             (_all_repl, 'MPS', 'MS'),
2466
                             (_all_repl, 'MPT', 'MT'))
2467
2468
    _phonix_translation = dict(zip((ord(_) for _ in
2469
                                    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
2470
                                   '01230720022455012683070808'))
2471
2472
    sdx = ''
2473
2474
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
2475
    word = word.replace('ß', 'SS')
2476
    word = ''.join(c for c in word if c in
2477
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
2478
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
2479
                    'Y', 'Z'})
2480
    if word:
2481
        for trans in _phonix_substitutions:
2482
            word = trans[0](word, *trans[1:])
2483
        if word[0] in {'A', 'E', 'I', 'O', 'U', 'Y'}:
2484
            sdx = 'v' + word[1:].translate(_phonix_translation)
2485
        else:
2486
            sdx = word[0] + word[1:].translate(_phonix_translation)
2487
        sdx = _delete_consecutive_repeats(sdx)
2488
        sdx = sdx.replace('0', '')
2489
2490
    # Clamp maxlength to [4, 64]
2491
    if maxlength is not None:
2492
        maxlength = min(max(4, maxlength), 64)
2493
    else:
2494
        maxlength = 64
2495
2496
    if zero_pad:
2497
        sdx += '0' * maxlength
2498
    if not sdx:
2499
        sdx = '0'
2500
    return sdx[:maxlength]
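# Example (added sketch, not part of the original source): phonix() keys are
# compared for equality like the other Soundex-style codes here; note that a
# word-initial vowel (or Y) is coded as a literal 'v' in the branch above
# rather than being kept, so for a vowel-initial name such as
#
#     phonix('Anderson')   # hypothetical input
#
# the resulting key would begin with 'v'.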
2501
2502
2503
def sfinxbis(word, maxlength=None):
2504
    """Return the SfinxBis code for a word.
2505
2506
    SfinxBis is a Soundex-like algorithm defined in:
2507
    http://www.swami.se/download/18.248ad5af12aa8136533800091/SfinxBis.pdf
2508
2509
    This implementation follows the reference implementation:
2510
    http://www.swami.se/download/18.248ad5af12aa8136533800093/swamiSfinxBis.java.txt
2511
2512
    SfinxBis is intended chiefly for Swedish names.
2513
2514
    :param str word: the word to transform
2515
    :param int maxlength: the length of the code returned (defaults to
2516
        unlimited)
2517
    :returns: the SfinxBis value
2518
    :rtype: tuple
2519
2520
    >>> sfinxbis('Christopher')
2521
    ('K68376',)
2522
    >>> sfinxbis('Niall')
2523
    ('N4',)
2524
    >>> sfinxbis('Smith')
2525
    ('S53',)
2526
    >>> sfinxbis('Schmidt')
2527
    ('S53',)
2528
2529
    >>> sfinxbis('Johansson')
2530
    ('J585',)
2531
    >>> sfinxbis('Sjöberg')
2532
    ('#162',)
2533
    """
2534
    adelstitler = (' DE LA ', ' DE LAS ', ' DE LOS ', ' VAN DE ', ' VAN DEN ',
2535
                   ' VAN DER ', ' VON DEM ', ' VON DER ',
2536
                   ' AF ', ' AV ', ' DA ', ' DE ', ' DEL ', ' DEN ', ' DES ',
2537
                   ' DI ', ' DO ', ' DON ', ' DOS ', ' DU ', ' E ', ' IN ',
2538
                   ' LA ', ' LE ', ' MAC ', ' MC ', ' VAN ', ' VON ', ' Y ',
2539
                   ' S:T ')
2540
2541
    _harde_vokaler = {'A', 'O', 'U', 'Å'}
2542
    _mjuka_vokaler = {'E', 'I', 'Y', 'Ä', 'Ö'}
2543
    _konsonanter = {'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N', 'P',
2544
                    'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Z'}
2545
    _alfabet = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
2546
                'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
2547
                'Y', 'Z', 'Ä', 'Å', 'Ö'}
2548
2549
    _sfinxbis_translation = dict(zip((ord(_) for _ in
2550
                                      'BCDFGHJKLMNPQRSTVZAOUÅEIYÄÖ'),
2551
                                     '123729224551268378999999999'))
2552
2553
    _sfinxbis_substitutions = dict(zip((ord(_) for _ in
2554
                                        'WZÀÁÂÃÆÇÈÉÊËÌÍÎÏÑÒÓÔÕØÙÚÛÜÝ'),
2555
                                       'VSAAAAÄCEEEEIIIINOOOOÖUUUYY'))
2556
2557
    def _foersvensker(ordet):
2558
        """Return the Swedish-ized form of the word."""
2559
        ordet = ordet.replace('STIERN', 'STJÄRN')
2560
        ordet = ordet.replace('HIE', 'HJ')
2561
        ordet = ordet.replace('SIÖ', 'SJÖ')
2562
        ordet = ordet.replace('SCH', 'SH')
2563
        ordet = ordet.replace('QU', 'KV')
2564
        ordet = ordet.replace('IO', 'JO')
2565
        ordet = ordet.replace('PH', 'F')
2566
2567
        for i in _harde_vokaler:
2568
            ordet = ordet.replace(i+'Ü', i+'J')
2569
            ordet = ordet.replace(i+'Y', i+'J')
2570
            ordet = ordet.replace(i+'I', i+'J')
2571
        for i in _mjuka_vokaler:
2572
            ordet = ordet.replace(i+'Ü', i+'J')
2573
            ordet = ordet.replace(i+'Y', i+'J')
2574
            ordet = ordet.replace(i+'I', i+'J')
2575
2576
        if 'H' in ordet:
2577
            for i in _konsonanter:
2578
                ordet = ordet.replace('H'+i, i)
2579
2580
        ordet = ordet.translate(_sfinxbis_substitutions)
2581
2582
        ordet = ordet.replace('Ð', 'ETH')
2583
        ordet = ordet.replace('Þ', 'TH')
2584
        ordet = ordet.replace('ß', 'SS')
2585
2586
        return ordet
2587
2588
    def _koda_foersta_ljudet(ordet):
2589
        """Return the word with the first sound coded."""
2590
        if ordet[0:1] in _mjuka_vokaler or ordet[0:1] in _harde_vokaler:
2591
            ordet = '$' + ordet[1:]
2592
        elif ordet[0:2] in ('DJ', 'GJ', 'HJ', 'LJ'):
2593
            ordet = 'J' + ordet[2:]
2594
        elif ordet[0:1] == 'G' and ordet[1:2] in _mjuka_vokaler:
2595
            ordet = 'J' + ordet[1:]
2596
        elif ordet[0:1] == 'Q':
2597
            ordet = 'K' + ordet[1:]
2598
        elif (ordet[0:2] == 'CH' and
2599
              ordet[2:3] in frozenset(_mjuka_vokaler | _harde_vokaler)):
2600
            ordet = '#' + ordet[2:]
2601
        elif ordet[0:1] == 'C' and ordet[1:2] in _harde_vokaler:
2602
            ordet = 'K' + ordet[1:]
2603
        elif ordet[0:1] == 'C' and ordet[1:2] in _konsonanter:
2604
            ordet = 'K' + ordet[1:]
2605
        elif ordet[0:1] == 'X':
2606
            ordet = 'S' + ordet[1:]
2607
        elif ordet[0:1] == 'C' and ordet[1:2] in _mjuka_vokaler:
2608
            ordet = 'S' + ordet[1:]
2609
        elif ordet[0:3] in ('SKJ', 'STJ', 'SCH'):
2610
            ordet = '#' + ordet[3:]
2611
        elif ordet[0:2] in ('SH', 'KJ', 'TJ', 'SJ'):
2612
            ordet = '#' + ordet[2:]
2613
        elif ordet[0:2] == 'SK' and ordet[2:3] in _mjuka_vokaler:
2614
            ordet = '#' + ordet[2:]
2615
        elif ordet[0:1] == 'K' and ordet[1:2] in _mjuka_vokaler:
2616
            ordet = '#' + ordet[1:]
2617
        return ordet
2618
2619
    # Step 1: convert to upper case
2620
    word = unicodedata.normalize('NFC', text_type(word.upper()))
2621
    word = word.replace('ß', 'SS')
2622
    word = word.replace('-', ' ')
2623
2624
    # Step 2: remove nobility prefixes
2625
    for adelstitel in adelstitler:
2626
        while adelstitel in word:
2627
            word = word.replace(adelstitel, ' ')
2628
        if word.startswith(adelstitel[1:]):
2629
            word = word[len(adelstitel)-1:]
2630
2631
    # Split word into tokens
2632
    ordlista = word.split()
2633
2634
    # Step 3: remove doubled letters at the beginning of the name
2635
    ordlista = [_delete_consecutive_repeats(ordet) for ordet in ordlista]
2636
    if not ordlista:
2637
        return ('',)
2638
2639
    # Step 4: Swedish-ize the spelling
2640
    ordlista = [_foersvensker(ordet) for ordet in ordlista]
2641
2642
    # Step 5: remove all characters that are not A-Ö (65-90, 196, 197, 214)
2643
    ordlista = [''.join(c for c in ordet if c in _alfabet)
2644
                for ordet in ordlista]
2645
2646
    # Step 6: code the first sound
2647
    ordlista = [_koda_foersta_ljudet(ordet) for ordet in ordlista]
2648
2649
    # Step 7: split the name into two parts
2650
    rest = [ordet[1:] for ordet in ordlista]
2651
2652
    # Step 8: apply phonetic transformations to the remainder
2653
    rest = [ordet.replace('DT', 'T') for ordet in rest]
2654
    rest = [ordet.replace('X', 'KS') for ordet in rest]
2655
2656
    # Step 9: code the remainder as digits
2657
    for vokal in _mjuka_vokaler:
2658
        rest = [ordet.replace('C'+vokal, '8'+vokal) for ordet in rest]
2659
    rest = [ordet.translate(_sfinxbis_translation) for ordet in rest]
2660
2661
    # Step 10: remove adjacent duplicates
2662
    rest = [_delete_consecutive_repeats(ordet) for ordet in rest]
2663
2664
    # Step 11: remove all "9"s
2665
    rest = [ordet.replace('9', '') for ordet in rest]
2666
2667
    # Step 12: rejoin the parts
2668
    ordlista = [''.join(ordet) for ordet in
2669
                zip((_[0:1] for _ in ordlista), rest)]
2670
2671
    # truncate, if maxlength is set
2672
    if maxlength and maxlength < _INFINITY:
2673
        ordlista = [ordet[:maxlength] for ordet in ordlista]
2674
2675
    return tuple(ordlista)
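# Example (added sketch, not part of the original source): sfinxbis() splits
# the input on spaces and hyphens and strips noble prefixes such as 'von'
# (Step 2 above), so a multi-part name yields one code per remaining token:
#
#     sfinxbis('Karl von Linné')   # hypothetical name -> a tuple of codes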
2676
2677
2678
def phonet(word, mode=1, lang='de', trace=False):
2679
    """Return the phonet code for a word.
2680
2681
    phonet ("Hannoveraner Phonetik") was developed by Jörg Michael and
2682
    documented in c't magazine vol. 25/1999, p. 252. It is a phonetic
2683
    algorithm designed primarily for German.
2684
    Cf. http://www.heise.de/ct/ftp/99/25/252/
2685
2686
    This is a port of Jesper Zedlitz's code, which is licensed LGPL:
2687
    https://github.com/jze/phonet4java/blob/master/src/main/java/de/zedlitz/phonet4java/Phonet.java
2688
2689
    That is, in turn, based on Michael's C code, which is also licensed LGPL:
2690
    ftp://ftp.heise.de/pub/ct/listings/phonet.zip
2691
2692
    :param str word: the word to transform
2693
    :param int mode: the phonet variant to employ (1 or 2)
2694
    :param str lang: 'de' (default) for German
2695
            'none' for no language
2696
    :param bool trace: prints debugging info if True
2697
    :returns: the phonet value
2698
    :rtype: str
2699
2700
    >>> phonet('Christopher')
2701
    'KRISTOFA'
2702
    >>> phonet('Niall')
2703
    'NIAL'
2704
    >>> phonet('Smith')
2705
    'SMIT'
2706
    >>> phonet('Schmidt')
2707
    'SHMIT'
2708
2709
    >>> phonet('Christopher', mode=2)
2710
    'KRIZTUFA'
2711
    >>> phonet('Niall', mode=2)
2712
    'NIAL'
2713
    >>> phonet('Smith', mode=2)
2714
    'ZNIT'
2715
    >>> phonet('Schmidt', mode=2)
2716
    'ZNIT'
2717
2718
    >>> phonet('Christopher', lang='none')
2719
    'CHRISTOPHER'
2720
    >>> phonet('Niall', lang='none')
2721
    'NIAL'
2722
    >>> phonet('Smith', lang='none')
2723
    'SMITH'
2724
    >>> phonet('Schmidt', lang='none')
2725
    'SCHMIDT'
2726
    """
2727
    # pylint: disable=too-many-branches
2728
2729
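    # A rough sketch of the rule-table layout, inferred from the matching
    # code in _phonet() below rather than from a formal specification: the
    # table is a flat sequence of triples (pattern, replacement for mode 1,
    # replacement for mode 2) terminated by (None, None, None), and a None
    # replacement means the rule is skipped in that mode.  Within a pattern,
    # '(XYZ)' matches any one of the listed letters, '^' anchors the rule to
    # the start of a word and '$' to the end, trailing '-' marks context
    # letters that are matched but not consumed, a digit sets the rule's
    # priority, and '<' causes the replacement to be written back into the
    # source string and re-processed.  For example, the German rule
    # ('AULT$', 'O', 'U') rewrites a word-final 'AULT' to 'O' in mode 1 and
    # to 'U' in mode 2.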
    _phonet_rules_no_lang = (  # separator chars
2730
        '´', ' ', ' ',
2731
        '"', ' ', ' ',
2732
        '`$', '', '',
2733
        '\'', ' ', ' ',
2734
        ',', ',', ',',
2735
        ';', ',', ',',
2736
        '-', ' ', ' ',
2737
        ' ', ' ', ' ',
2738
        '.', '.', '.',
2739
        ':', '.', '.',
2740
        # German umlauts
2741
        'Ä', 'AE', 'AE',
2742
        'Ö', 'OE', 'OE',
2743
        'Ü', 'UE', 'UE',
2744
        'ß', 'S', 'S',
2745
        # international umlauts
2746
        'À', 'A', 'A',
2747
        'Á', 'A', 'A',
2748
        'Â', 'A', 'A',
2749
        'Ã', 'A', 'A',
2750
        'Å', 'A', 'A',
2751
        'Æ', 'AE', 'AE',
2752
        'Ç', 'C', 'C',
2753
        'Ð', 'DJ', 'DJ',
2754
        'È', 'E', 'E',
2755
        'É', 'E', 'E',
2756
        'Ê', 'E', 'E',
2757
        'Ë', 'E', 'E',
2758
        'Ì', 'I', 'I',
2759
        'Í', 'I', 'I',
2760
        'Î', 'I', 'I',
2761
        'Ï', 'I', 'I',
2762
        'Ñ', 'NH', 'NH',
2763
        'Ò', 'O', 'O',
2764
        'Ó', 'O', 'O',
2765
        'Ô', 'O', 'O',
2766
        'Õ', 'O', 'O',
2767
        'Œ', 'OE', 'OE',
2768
        'Ø', 'OE', 'OE',
2769
        'Š', 'SH', 'SH',
2770
        'Þ', 'TH', 'TH',
2771
        'Ù', 'U', 'U',
2772
        'Ú', 'U', 'U',
2773
        'Û', 'U', 'U',
2774
        'Ý', 'Y', 'Y',
2775
        'Ÿ', 'Y', 'Y',
2776
        # 'normal' letters (A-Z)
2777
        'MC^', 'MAC', 'MAC',
2778
        'MC^', 'MAC', 'MAC',
2779
        'M´^', 'MAC', 'MAC',
2780
        'M\'^', 'MAC', 'MAC',
2781
        'O´^', 'O', 'O',
2782
        'O\'^', 'O', 'O',
2783
        'VAN DEN ^', 'VANDEN', 'VANDEN',
2784
        None, None, None)
2785
2786
    _phonet_rules_german = (  # separator chars
2787
        '´', ' ', ' ',
2788
        '"', ' ', ' ',
2789
        '`$', '', '',
2790
        '\'', ' ', ' ',
2791
        ',', ' ', ' ',
2792
        ';', ' ', ' ',
2793
        '-', ' ', ' ',
2794
        ' ', ' ', ' ',
2795
        '.', '.', '.',
2796
        ':', '.', '.',
2797
        # German umlauts
2798
        'ÄE', 'E', 'E',
2799
        'ÄU<', 'EU', 'EU',
2800
        'ÄV(AEOU)-<', 'EW', None,
2801
        'Ä$', 'Ä', None,
2802
        'Ä<', None, 'E',
2803
        'Ä', 'E', None,
2804
        'ÖE', 'Ö', 'Ö',
2805
        'ÖU', 'Ö', 'Ö',
2806
        'ÖVER--<', 'ÖW', None,
2807
        'ÖV(AOU)-', 'ÖW', None,
2808
        'ÜBEL(GNRW)-^^', 'ÜBL ', 'IBL ',
2809
        'ÜBER^^', 'ÜBA', 'IBA',
2810
        'ÜE', 'Ü', 'I',
2811
        'ÜVER--<', 'ÜW', None,
2812
        'ÜV(AOU)-', 'ÜW', None,
2813
        'Ü', None, 'I',
2814
        'ßCH<', None, 'Z',
2815
        'ß<', 'S', 'Z',
2816
        # international umlauts
2817
        'À<', 'A', 'A',
2818
        'Á<', 'A', 'A',
2819
        'Â<', 'A', 'A',
2820
        'Ã<', 'A', 'A',
2821
        'Å<', 'A', 'A',
2822
        'ÆER-', 'E', 'E',
2823
        'ÆU<', 'EU', 'EU',
2824
        'ÆV(AEOU)-<', 'EW', None,
2825
        'Æ$', 'Ä', None,
2826
        'Æ<', None, 'E',
2827
        'Æ', 'E', None,
2828
        'Ç', 'Z', 'Z',
2829
        'ÐÐ-', '', '',
2830
        'Ð', 'DI', 'TI',
2831
        'È<', 'E', 'E',
2832
        'É<', 'E', 'E',
2833
        'Ê<', 'E', 'E',
2834
        'Ë', 'E', 'E',
2835
        'Ì<', 'I', 'I',
2836
        'Í<', 'I', 'I',
2837
        'Î<', 'I', 'I',
2838
        'Ï', 'I', 'I',
2839
        'ÑÑ-', '', '',
2840
        'Ñ', 'NI', 'NI',
2841
        'Ò<', 'O', 'U',
2842
        'Ó<', 'O', 'U',
2843
        'Ô<', 'O', 'U',
2844
        'Õ<', 'O', 'U',
2845
        'Œ<', 'Ö', 'Ö',
2846
        'Ø(IJY)-<', 'E', 'E',
2847
        'Ø<', 'Ö', 'Ö',
2848
        'Š', 'SH', 'Z',
2849
        'Þ', 'T', 'T',
2850
        'Ù<', 'U', 'U',
2851
        'Ú<', 'U', 'U',
2852
        'Û<', 'U', 'U',
2853
        'Ý<', 'I', 'I',
2854
        'Ÿ<', 'I', 'I',
2855
        # 'normal' letters (A-Z)
2856
        'ABELLE$', 'ABL', 'ABL',
2857
        'ABELL$', 'ABL', 'ABL',
2858
        'ABIENNE$', 'ABIN', 'ABIN',
2859
        'ACHME---^', 'ACH', 'AK',
2860
        'ACEY$', 'AZI', 'AZI',
2861
        'ADV', 'ATW', None,
2862
        'AEGL-', 'EK', None,
2863
        'AEU<', 'EU', 'EU',
2864
        'AE2', 'E', 'E',
2865
        'AFTRAUBEN------', 'AFT ', 'AFT ',
2866
        'AGL-1', 'AK', None,
2867
        'AGNI-^', 'AKN', 'AKN',
2868
        'AGNIE-', 'ANI', 'ANI',
2869
        'AGN(AEOU)-$', 'ANI', 'ANI',
2870
        'AH(AIOÖUÜY)-', 'AH', None,
2871
        'AIA2', 'AIA', 'AIA',
2872
        'AIE$', 'E', 'E',
2873
        'AILL(EOU)-', 'ALI', 'ALI',
2874
        'AINE$', 'EN', 'EN',
2875
        'AIRE$', 'ER', 'ER',
2876
        'AIR-', 'E', 'E',
2877
        'AISE$', 'ES', 'EZ',
2878
        'AISSANCE$', 'ESANS', 'EZANZ',
2879
        'AISSE$', 'ES', 'EZ',
2880
        'AIX$', 'EX', 'EX',
2881
        'AJ(AÄEÈÉÊIOÖUÜ)--', 'A', 'A',
2882
        'AKTIE', 'AXIE', 'AXIE',
2883
        'AKTUEL', 'AKTUEL', None,
2884
        'ALOI^', 'ALOI', 'ALUI',  # Don't merge these rules
2885
        'ALOY^', 'ALOI', 'ALUI',  # needed by 'check_rules'
2886
        'AMATEU(RS)-', 'AMATÖ', 'ANATÖ',
2887
        'ANCH(OEI)-', 'ANSH', 'ANZ',
2888
        'ANDERGEGANG----', 'ANDA GE', 'ANTA KE',
2889
        'ANDERGEHE----', 'ANDA ', 'ANTA ',
2890
        'ANDERGESETZ----', 'ANDA GE', 'ANTA KE',
2891
        'ANDERGING----', 'ANDA ', 'ANTA ',
2892
        'ANDERSETZ(ET)-----', 'ANDA ', 'ANTA ',
2893
        'ANDERZUGEHE----', 'ANDA ZU ', 'ANTA ZU ',
2894
        'ANDERZUSETZE-----', 'ANDA ZU ', 'ANTA ZU ',
2895
        'ANER(BKO)---^^', 'AN', None,
2896
        'ANHAND---^$', 'AN H', 'AN ',
2897
        'ANH(AÄEIOÖUÜY)--^^', 'AN', None,
2898
        'ANIELLE$', 'ANIEL', 'ANIL',
2899
        'ANIEL', 'ANIEL', None,
2900
        'ANSTELLE----^$', 'AN ST', 'AN ZT',
2901
        'ANTI^^', 'ANTI', 'ANTI',
2902
        'ANVER^^', 'ANFA', 'ANFA',
2903
        'ATIA$', 'ATIA', 'ATIA',
2904
        'ATIA(NS)--', 'ATI', 'ATI',
2905
        'ATI(AÄOÖUÜ)-', 'AZI', 'AZI',
2906
        'AUAU--', '', '',
2907
        'AUERE$', 'AUERE', None,
2908
        'AUERE(NS)-$', 'AUERE', None,
2909
        'AUERE(AIOUY)--', 'AUER', None,
2910
        'AUER(AÄIOÖUÜY)-', 'AUER', None,
2911
        'AUER<', 'AUA', 'AUA',
2912
        'AUF^^', 'AUF', 'AUF',
2913
        'AULT$', 'O', 'U',
2914
        'AUR(BCDFGKLMNQSTVWZ)-', 'AUA', 'AUA',
2915
        'AUR$', 'AUA', 'AUA',
2916
        'AUSSE$', 'OS', 'UZ',
2917
        'AUS(ST)-^', 'AUS', 'AUS',
2918
        'AUS^^', 'AUS', 'AUS',
2919
        'AUTOFAHR----', 'AUTO ', 'AUTU ',
2920
        'AUTO^^', 'AUTO', 'AUTU',
2921
        'AUX(IY)-', 'AUX', 'AUX',
2922
        'AUX', 'O', 'U',
2923
        'AU', 'AU', 'AU',
2924
        'AVER--<', 'AW', None,
2925
        'AVIER$', 'AWIE', 'AFIE',
2926
        'AV(EÈÉÊI)-^', 'AW', None,
2927
        'AV(AOU)-', 'AW', None,
2928
        'AYRE$', 'EIRE', 'EIRE',
2929
        'AYRE(NS)-$', 'EIRE', 'EIRE',
2930
        'AYRE(AIOUY)--', 'EIR', 'EIR',
2931
        'AYR(AÄIOÖUÜY)-', 'EIR', 'EIR',
2932
        'AYR<', 'EIA', 'EIA',
2933
        'AYER--<', 'EI', 'EI',
2934
        'AY(AÄEIOÖUÜY)--', 'A', 'A',
2935
        'AË', 'E', 'E',
2936
        'A(IJY)<', 'EI', 'EI',
2937
        'BABY^$', 'BEBI', 'BEBI',
2938
        'BAB(IY)^', 'BEBI', 'BEBI',
2939
        'BEAU^$', 'BO', None,
2940
        'BEA(BCMNRU)-^', 'BEA', 'BEA',
2941
        'BEAT(AEIMORU)-^', 'BEAT', 'BEAT',
2942
        'BEE$', 'BI', 'BI',
2943
        'BEIGE^$', 'BESH', 'BEZ',
2944
        'BENOIT--', 'BENO', 'BENU',
2945
        'BER(DT)-', 'BER', None,
2946
        'BERN(DT)-', 'BERN', None,
2947
        'BE(LMNRST)-^', 'BE', 'BE',
2948
        'BETTE$', 'BET', 'BET',
2949
        'BEVOR^$', 'BEFOR', None,
2950
        'BIC$', 'BIZ', 'BIZ',
2951
        'BOWL(EI)-', 'BOL', 'BUL',
2952
        'BP(AÄEÈÉÊIÌÍÎOÖRUÜY)-', 'B', 'B',
2953
        'BRINGEND-----^', 'BRI', 'BRI',
2954
        'BRINGEND-----', ' BRI', ' BRI',
2955
        'BROW(NS)-', 'BRAU', 'BRAU',
2956
        'BUDGET7', 'BÜGE', 'BIKE',
2957
        'BUFFET7', 'BÜFE', 'BIFE',
2958
        'BYLLE$', 'BILE', 'BILE',
2959
        'BYLL$', 'BIL', 'BIL',
2960
        'BYPA--^', 'BEI', 'BEI',
2961
        'BYTE<', 'BEIT', 'BEIT',
2962
        'BY9^', 'BÜ', None,
2963
        'B(SßZ)$', 'BS', None,
2964
        'CACH(EI)-^', 'KESH', 'KEZ',
2965
        'CAE--', 'Z', 'Z',
2966
        'CA(IY)$', 'ZEI', 'ZEI',
2967
        'CE(EIJUY)--', 'Z', 'Z',
2968
        'CENT<', 'ZENT', 'ZENT',
2969
        'CERST(EI)----^', 'KE', 'KE',
2970
        'CER$', 'ZA', 'ZA',
2971
        'CE3', 'ZE', 'ZE',
2972
        'CH\'S$', 'X', 'X',
2973
        'CH´S$', 'X', 'X',
2974
        'CHAO(ST)-', 'KAO', 'KAU',
2975
        'CHAMPIO-^', 'SHEMPI', 'ZENBI',
2976
        'CHAR(AI)-^', 'KAR', 'KAR',
2977
        'CHAU(CDFSVWXZ)-', 'SHO', 'ZU',
2978
        'CHÄ(CF)-', 'SHE', 'ZE',
2979
        'CHE(CF)-', 'SHE', 'ZE',
2980
        'CHEM-^', 'KE', 'KE',  # or: 'CHE', 'KE'
2981
        'CHEQUE<', 'SHEK', 'ZEK',
2982
        'CHI(CFGPVW)-', 'SHI', 'ZI',
2983
        'CH(AEUY)-<^', 'SH', 'Z',
2984
        'CHK-', '', '',
2985
        'CHO(CKPS)-^', 'SHO', 'ZU',
2986
        'CHRIS-', 'KRI', None,
2987
        'CHRO-', 'KR', None,
2988
        'CH(LOR)-<^', 'K', 'K',
2989
        'CHST-', 'X', 'X',
2990
        'CH(SßXZ)3', 'X', 'X',
2991
        'CHTNI-3', 'CHN', 'KN',
2992
        'CH^', 'K', 'K',  # or: 'CH', 'K'
2993
        'CH', 'CH', 'K',
2994
        'CIC$', 'ZIZ', 'ZIZ',
2995
        'CIENCEFICT----', 'EIENS ', 'EIENZ ',
2996
        'CIENCE$', 'EIENS', 'EIENZ',
2997
        'CIER$', 'ZIE', 'ZIE',
2998
        'CYB-^', 'ZEI', 'ZEI',
2999
        'CY9^', 'ZÜ', 'ZI',
3000
        'C(IJY)-<3', 'Z', 'Z',
3001
        'CLOWN-', 'KLAU', 'KLAU',
3002
        'CCH', 'Z', 'Z',
3003
        'CCE-', 'X', 'X',
3004
        'C(CK)-', '', '',
3005
        'CLAUDET---', 'KLO', 'KLU',
3006
        'CLAUDINE^$', 'KLODIN', 'KLUTIN',
3007
        'COACH', 'KOSH', 'KUZ',
3008
        'COLE$', 'KOL', 'KUL',
3009
        'COUCH', 'KAUSH', 'KAUZ',
3010
        'COW', 'KAU', 'KAU',
3011
        'CQUES$', 'K', 'K',
3012
        'CQUE', 'K', 'K',
3013
        'CRASH--9', 'KRE', 'KRE',
3014
        'CREAT-^', 'KREA', 'KREA',
3015
        'CST', 'XT', 'XT',
3016
        'CS<^', 'Z', 'Z',
3017
        'C(SßX)', 'X', 'X',
3018
        'CT\'S$', 'X', 'X',
3019
        'CT(SßXZ)', 'X', 'X',
3020
        'CZ<', 'Z', 'Z',
3021
        'C(ÈÉÊÌÍÎÝ)3', 'Z', 'Z',
3022
        'C.^', 'C.', 'C.',
3023
        'CÄ-', 'Z', 'Z',
3024
        'CÜ$', 'ZÜ', 'ZI',
3025
        'C\'S$', 'X', 'X',
3026
        'C<', 'K', 'K',
3027
        'DAHER^$', 'DAHER', None,
3028
        'DARAUFFOLGE-----', 'DARAUF ', 'TARAUF ',
3029
        'DAVO(NR)-^$', 'DAFO', 'TAFU',
3030
        'DD(SZ)--<', '', '',
3031
        'DD9', 'D', None,
3032
        'DEPOT7', 'DEPO', 'TEBU',
3033
        'DESIGN', 'DISEIN', 'TIZEIN',
3034
        'DE(LMNRST)-3^', 'DE', 'TE',
3035
        'DETTE$', 'DET', 'TET',
3036
        'DH$', 'T', None,
3037
        'DIC$', 'DIZ', 'TIZ',
3038
        'DIDR-^', 'DIT', None,
3039
        'DIEDR-^', 'DIT', None,
3040
        'DJ(AEIOU)-^', 'I', 'I',
3041
        'DMITR-^', 'DIMIT', 'TINIT',
3042
        'DRY9^', 'DRÜ', None,
3043
        'DT-', '', '',
3044
        'DUIS-^', 'DÜ', 'TI',
3045
        'DURCH^^', 'DURCH', 'TURK',
3046
        'DVA$', 'TWA', None,
3047
        'DY9^', 'DÜ', None,
3048
        'DYS$', 'DIS', None,
3049
        'DS(CH)--<', 'T', 'T',
3050
        'DST', 'ZT', 'ZT',
3051
        'DZS(CH)--', 'T', 'T',
3052
        'D(SßZ)', 'Z', 'Z',
3053
        'D(AÄEIOÖRUÜY)-', 'D', None,
3054
        'D(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'D', None,
3055
        'D\'H^', 'D', 'T',
3056
        'D´H^', 'D', 'T',
3057
        'D`H^', 'D', 'T',
3058
        'D\'S3$', 'Z', 'Z',
3059
        'D´S3$', 'Z', 'Z',
3060
        'D^', 'D', None,
3061
        'D', 'T', 'T',
3062
        'EAULT$', 'O', 'U',
3063
        'EAUX$', 'O', 'U',
3064
        'EAU', 'O', 'U',
3065
        'EAV', 'IW', 'IF',
3066
        'EAS3$', 'EAS', None,
3067
        'EA(AÄEIOÖÜY)-3', 'EA', 'EA',
3068
        'EA3$', 'EA', 'EA',
3069
        'EA3', 'I', 'I',
3070
        'EBENSO^$', 'EBNSO', 'EBNZU',
3071
        'EBENSO^^', 'EBNSO ', 'EBNZU ',
3072
        'EBEN^^', 'EBN', 'EBN',
3073
        'EE9', 'E', 'E',
3074
        'EGL-1', 'EK', None,
3075
        'EHE(IUY)--1', 'EH', None,
3076
        'EHUNG---1', 'E', None,
3077
        'EH(AÄIOÖUÜY)-1', 'EH', None,
3078
        'EIEI--', '', '',
3079
        'EIERE^$', 'EIERE', None,
3080
        'EIERE$', 'EIERE', None,
3081
        'EIERE(NS)-$', 'EIERE', None,
3082
        'EIERE(AIOUY)--', 'EIER', None,
3083
        'EIER(AÄIOÖUÜY)-', 'EIER', None,
3084
        'EIER<', 'EIA', None,
3085
        'EIGL-1', 'EIK', None,
3086
        'EIGH$', 'EI', 'EI',
3087
        'EIH--', 'E', 'E',
3088
        'EILLE$', 'EI', 'EI',
3089
        'EIR(BCDFGKLMNQSTVWZ)-', 'EIA', 'EIA',
3090
        'EIR$', 'EIA', 'EIA',
3091
        'EITRAUBEN------', 'EIT ', 'EIT ',
3092
        'EI', 'EI', 'EI',
3093
        'EJ$', 'EI', 'EI',
3094
        'ELIZ^', 'ELIS', None,
3095
        'ELZ^', 'ELS', None,
3096
        'EL-^', 'E', 'E',
3097
        'ELANG----1', 'E', 'E',
3098
        'EL(DKL)--1', 'E', 'E',
3099
        'EL(MNT)--1$', 'E', 'E',
3100
        'ELYNE$', 'ELINE', 'ELINE',
3101
        'ELYN$', 'ELIN', 'ELIN',
3102
        'EL(AÄEÈÉÊIÌÍÎOÖUÜY)-1', 'EL', 'EL',
3103
        'EL-1', 'L', 'L',
3104
        'EM-^', None, 'E',
3105
        'EM(DFKMPQT)--1', None, 'E',
3106
        'EM(AÄEÈÉÊIÌÍÎOÖUÜY)--1', None, 'E',
3107
        'EM-1', None, 'N',
3108
        'ENGAG-^', 'ANGA', 'ANKA',
3109
        'EN-^', 'E', 'E',
3110
        'ENTUEL', 'ENTUEL', None,
3111
        'EN(CDGKQSTZ)--1', 'E', 'E',
3112
        'EN(AÄEÈÉÊIÌÍÎNOÖUÜY)-1', 'EN', 'EN',
3113
        'EN-1', '', '',
3114
        'ERH(AÄEIOÖUÜ)-^', 'ERH', 'ER',
3115
        'ER-^', 'E', 'E',
3116
        'ERREGEND-----', ' ER', ' ER',
3117
        'ERT1$', 'AT', None,
3118
        'ER(DGLKMNRQTZß)-1', 'ER', None,
3119
        'ER(AÄEÈÉÊIÌÍÎOÖUÜY)-1', 'ER', 'A',
3120
        'ER1$', 'A', 'A',
3121
        'ER<1', 'A', 'A',
3122
        'ETAT7', 'ETA', 'ETA',
3123
        'ETI(AÄOÖÜU)-', 'EZI', 'EZI',
3124
        'EUERE$', 'EUERE', None,
3125
        'EUERE(NS)-$', 'EUERE', None,
3126
        'EUERE(AIOUY)--', 'EUER', None,
3127
        'EUER(AÄIOÖUÜY)-', 'EUER', None,
3128
        'EUER<', 'EUA', None,
3129
        'EUEU--', '', '',
3130
        'EUILLE$', 'Ö', 'Ö',
3131
        'EUR$', 'ÖR', 'ÖR',
3132
        'EUX', 'Ö', 'Ö',
3133
        'EUSZ$', 'EUS', None,
3134
        'EUTZ$', 'EUS', None,
3135
        'EUYS$', 'EUS', 'EUZ',
3136
        'EUZ$', 'EUS', None,
3137
        'EU', 'EU', 'EU',
3138
        'EVER--<1', 'EW', None,
3139
        'EV(ÄOÖUÜ)-1', 'EW', None,
3140
        'EYER<', 'EIA', 'EIA',
3141
        'EY<', 'EI', 'EI',
3142
        'FACETTE', 'FASET', 'FAZET',
3143
        'FANS--^$', 'FE', 'FE',
3144
        'FAN-^$', 'FE', 'FE',
3145
        'FAULT-', 'FOL', 'FUL',
3146
        'FEE(DL)-', 'FI', 'FI',
3147
        'FEHLER', 'FELA', 'FELA',
3148
        'FE(LMNRST)-3^', 'FE', 'FE',
3149
        'FOERDERN---^', 'FÖRD', 'FÖRT',
3150
        'FOERDERN---', ' FÖRD', ' FÖRT',
3151
        'FOND7', 'FON', 'FUN',
3152
        'FRAIN$', 'FRA', 'FRA',
3153
        'FRISEU(RS)-', 'FRISÖ', 'FRIZÖ',
3154
        'FY9^', 'FÜ', None,
3155
        'FÖRDERN---^', 'FÖRD', 'FÖRT',
3156
        'FÖRDERN---', ' FÖRD', ' FÖRT',
3157
        'GAGS^$', 'GEX', 'KEX',
3158
        'GAG^$', 'GEK', 'KEK',
3159
        'GD', 'KT', 'KT',
3160
        'GEGEN^^', 'GEGN', 'KEKN',
3161
        'GEGENGEKOM-----', 'GEGN ', 'KEKN ',
3162
        'GEGENGESET-----', 'GEGN ', 'KEKN ',
3163
        'GEGENKOMME-----', 'GEGN ', 'KEKN ',
3164
        'GEGENZUKOM---', 'GEGN ZU ', 'KEKN ZU ',
3165
        'GENDETWAS-----$', 'GENT ', 'KENT ',
3166
        'GENRE', 'IORE', 'IURE',
3167
        'GE(LMNRST)-3^', 'GE', 'KE',
3168
        'GER(DKT)-', 'GER', None,
3169
        'GETTE$', 'GET', 'KET',
3170
        'GGF.', 'GF.', None,
3171
        'GG-', '', '',
3172
        'GH', 'G', None,
3173
        'GI(AOU)-^', 'I', 'I',
3174
        'GION-3', 'KIO', 'KIU',
3175
        'G(CK)-', '', '',
3176
        'GJ(AEIOU)-^', 'I', 'I',
3177
        'GMBH^$', 'GMBH', 'GMBH',
3178
        'GNAC$', 'NIAK', 'NIAK',
3179
        'GNON$', 'NION', 'NIUN',
3180
        'GN$', 'N', 'N',
3181
        'GONCAL-^', 'GONZA', 'KUNZA',
3182
        'GRY9^', 'GRÜ', None,
3183
        'G(SßXZ)-<', 'K', 'K',
3184
        'GUCK-', 'KU', 'KU',
3185
        'GUISEP-^', 'IUSE', 'IUZE',
3186
        'GUI-^', 'G', 'K',
3187
        'GUTAUSSEH------^', 'GUT ', 'KUT ',
3188
        'GUTGEHEND------^', 'GUT ', 'KUT ',
3189
        'GY9^', 'GÜ', None,
3190
        'G(AÄEILOÖRUÜY)-', 'G', None,
3191
        'G(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'G', None,
3192
        'G\'S$', 'X', 'X',
3193
        'G´S$', 'X', 'X',
3194
        'G^', 'G', None,
3195
        'G', 'K', 'K',
3196
        'HA(HIUY)--1', 'H', None,
3197
        'HANDVOL---^', 'HANT ', 'ANT ',
3198
        'HANNOVE-^', 'HANOF', None,
3199
        'HAVEN7$', 'HAFN', None,
3200
        'HEAD-', 'HE', 'E',
3201
        'HELIEGEN------', 'E ', 'E ',
3202
        'HESTEHEN------', 'E ', 'E ',
3203
        'HE(LMNRST)-3^', 'HE', 'E',
3204
        'HE(LMN)-1', 'E', 'E',
3205
        'HEUR1$', 'ÖR', 'ÖR',
3206
        'HE(HIUY)--1', 'H', None,
3207
        'HIH(AÄEIOÖUÜY)-1', 'IH', None,
3208
        'HLH(AÄEIOÖUÜY)-1', 'LH', None,
3209
        'HMH(AÄEIOÖUÜY)-1', 'MH', None,
3210
        'HNH(AÄEIOÖUÜY)-1', 'NH', None,
3211
        'HOBBY9^', 'HOBI', None,
3212
        'HOCHBEGAB-----^', 'HOCH ', 'UK ',
3213
        'HOCHTALEN-----^', 'HOCH ', 'UK ',
3214
        'HOCHZUFRI-----^', 'HOCH ', 'UK ',
3215
        'HO(HIY)--1', 'H', None,
3216
        'HRH(AÄEIOÖUÜY)-1', 'RH', None,
3217
        'HUH(AÄEIOÖUÜY)-1', 'UH', None,
3218
        'HUIS^^', 'HÜS', 'IZ',
3219
        'HUIS$', 'ÜS', 'IZ',
3220
        'HUI--1', 'H', None,
3221
        'HYGIEN^', 'HÜKIEN', None,
3222
        'HY9^', 'HÜ', None,
3223
        'HY(BDGMNPST)-', 'Ü', None,
3224
        'H.^', None, 'H.',
3225
        'HÄU--1', 'H', None,
3226
        'H^', 'H', '',
3227
        'H', '', '',
3228
        'ICHELL---', 'ISH', 'IZ',
3229
        'ICHI$', 'ISHI', 'IZI',
3230
        'IEC$', 'IZ', 'IZ',
3231
        'IEDENSTELLE------', 'IDN ', 'ITN ',
3232
        'IEI-3', '', '',
3233
        'IELL3', 'IEL', 'IEL',
3234
        'IENNE$', 'IN', 'IN',
3235
        'IERRE$', 'IER', 'IER',
3236
        'IERZULAN---', 'IR ZU ', 'IR ZU ',
3237
        'IETTE$', 'IT', 'IT',
3238
        'IEU', 'IÖ', 'IÖ',
3239
        'IE<4', 'I', 'I',
3240
        'IGL-1', 'IK', None,
3241
        'IGHT3$', 'EIT', 'EIT',
3242
        'IGNI(EO)-', 'INI', 'INI',
3243
        'IGN(AEOU)-$', 'INI', 'INI',
3244
        'IHER(DGLKRT)--1', 'IHE', None,
3245
        'IHE(IUY)--', 'IH', None,
3246
        'IH(AIOÖUÜY)-', 'IH', None,
3247
        'IJ(AOU)-', 'I', 'I',
3248
        'IJ$', 'I', 'I',
3249
        'IJ<', 'EI', 'EI',
3250
        'IKOLE$', 'IKOL', 'IKUL',
3251
        'ILLAN(STZ)--4', 'ILIA', 'ILIA',
3252
        'ILLAR(DT)--4', 'ILIA', 'ILIA',
3253
        'IMSTAN----^', 'IM ', 'IN ',
3254
        'INDELERREGE------', 'INDL ', 'INTL ',
3255
        'INFRAGE-----^$', 'IN ', 'IN ',
3256
        'INTERN(AOU)-^', 'INTAN', 'INTAN',
3257
        'INVER-', 'INWE', 'INFE',
3258
        'ITI(AÄIOÖUÜ)-', 'IZI', 'IZI',
3259
        'IUSZ$', 'IUS', None,
3260
        'IUTZ$', 'IUS', None,
3261
        'IUZ$', 'IUS', None,
3262
        'IVER--<', 'IW', None,
3263
        'IVIER$', 'IWIE', 'IFIE',
3264
        'IV(ÄOÖUÜ)-', 'IW', None,
3265
        'IV<3', 'IW', None,
3266
        'IY2', 'I', None,
3267
        'I(ÈÉÊ)<4', 'I', 'I',
3268
        'JAVIE---<^', 'ZA', 'ZA',
3269
        'JEANS^$', 'JINS', 'INZ',
3270
        'JEANNE^$', 'IAN', 'IAN',
3271
        'JEAN-^', 'IA', 'IA',
3272
        'JER-^', 'IE', 'IE',
3273
        'JE(LMNST)-', 'IE', 'IE',
3274
        'JI^', 'JI', None,
3275
        'JOR(GK)^$', 'IÖRK', 'IÖRK',
3276
        'J', 'I', 'I',
3277
        'KC(ÄEIJ)-', 'X', 'X',
3278
        'KD', 'KT', None,
3279
        'KE(LMNRST)-3^', 'KE', 'KE',
3280
        'KG(AÄEILOÖRUÜY)-', 'K', None,
3281
        'KH<^', 'K', 'K',
3282
        'KIC$', 'KIZ', 'KIZ',
3283
        'KLE(LMNRST)-3^', 'KLE', 'KLE',
3284
        'KOTELE-^', 'KOTL', 'KUTL',
3285
        'KREAT-^', 'KREA', 'KREA',
3286
        'KRÜS(TZ)--^', 'KRI', None,
3287
        'KRYS(TZ)--^', 'KRI', None,
3288
        'KRY9^', 'KRÜ', None,
3289
        'KSCH---', 'K', 'K',
3290
        'KSH--', 'K', 'K',
3291
        'K(SßXZ)7', 'X', 'X',  # implies 'KST' -> 'XT'
3292
        'KT\'S$', 'X', 'X',
3293
        'KTI(AIOU)-3', 'XI', 'XI',
3294
        'KT(SßXZ)', 'X', 'X',
3295
        'KY9^', 'KÜ', None,
3296
        'K\'S$', 'X', 'X',
3297
        'K´S$', 'X', 'X',
3298
        'LANGES$', ' LANGES', ' LANKEZ',
3299
        'LANGE$', ' LANGE', ' LANKE',
3300
        'LANG$', ' LANK', ' LANK',
3301
        'LARVE-', 'LARF', 'LARF',
3302
        'LD(SßZ)$', 'LS', 'LZ',
3303
        'LD\'S$', 'LS', 'LZ',
3304
        'LD´S$', 'LS', 'LZ',
3305
        'LEAND-^', 'LEAN', 'LEAN',
3306
        'LEERSTEHE-----^', 'LER ', 'LER ',
3307
        'LEICHBLEIB-----', 'LEICH ', 'LEIK ',
3308
        'LEICHLAUTE-----', 'LEICH ', 'LEIK ',
3309
        'LEIDERREGE------', 'LEIT ', 'LEIT ',
3310
        'LEIDGEPR----^', 'LEIT ', 'LEIT ',
3311
        'LEINSTEHE-----', 'LEIN ', 'LEIN ',
3312
        'LEL-', 'LE', 'LE',
3313
        'LE(MNRST)-3^', 'LE', 'LE',
3314
        'LETTE$', 'LET', 'LET',
3315
        'LFGNAG-', 'LFGAN', 'LFKAN',
3316
        'LICHERWEIS----', 'LICHA ', 'LIKA ',
3317
        'LIC$', 'LIZ', 'LIZ',
3318
        'LIVE^$', 'LEIF', 'LEIF',
3319
        'LT(SßZ)$', 'LS', 'LZ',
3320
        'LT\'S$', 'LS', 'LZ',
3321
        'LT´S$', 'LS', 'LZ',
3322
        'LUI(GS)--', 'LU', 'LU',
3323
        'LV(AIO)-', 'LW', None,
3324
        'LY9^', 'LÜ', None,
3325
        'LSTS$', 'LS', 'LZ',
3326
        'LZ(BDFGKLMNPQRSTVWX)-', 'LS', None,
3327
        'L(SßZ)$', 'LS', None,
3328
        'MAIR-<', 'MEI', 'NEI',
3329
        'MANAG-', 'MENE', 'NENE',
3330
        'MANUEL', 'MANUEL', None,
3331
        'MASSEU(RS)-', 'MASÖ', 'NAZÖ',
3332
        'MATCH', 'MESH', 'NEZ',
3333
        'MAURICE', 'MORIS', 'NURIZ',
3334
        'MBH^$', 'MBH', 'MBH',
3335
        'MB(ßZ)$', 'MS', None,
3336
        'MB(SßTZ)-', 'M', 'N',
3337
        'MCG9^', 'MAK', 'NAK',
3338
        'MC9^', 'MAK', 'NAK',
3339
        'MEMOIR-^', 'MEMOA', 'NENUA',
3340
        'MERHAVEN$', 'MAHAFN', None,
3341
        'ME(LMNRST)-3^', 'ME', 'NE',
3342
        'MEN(STZ)--3', 'ME', None,
3343
        'MEN$', 'MEN', None,
3344
        'MIGUEL-', 'MIGE', 'NIKE',
3345
        'MIKE^$', 'MEIK', 'NEIK',
3346
        'MITHILFE----^$', 'MIT H', 'NIT ',
3347
        'MN$', 'M', None,
3348
        'MN', 'N', 'N',
3349
        'MPJUTE-', 'MPUT', 'NBUT',
3350
        'MP(ßZ)$', 'MS', None,
3351
        'MP(SßTZ)-', 'M', 'N',
3352
        'MP(BDJLMNPQVW)-', 'MB', 'NB',
3353
        'MY9^', 'MÜ', None,
3354
        'M(ßZ)$', 'MS', None,
3355
        'M´G7^', 'MAK', 'NAK',
3356
        'M\'G7^', 'MAK', 'NAK',
3357
        'M´^', 'MAK', 'NAK',
3358
        'M\'^', 'MAK', 'NAK',
3359
        'M', None, 'N',
3360
        'NACH^^', 'NACH', 'NAK',
3361
        'NADINE', 'NADIN', 'NATIN',
3362
        'NAIV--', 'NA', 'NA',
3363
        'NAISE$', 'NESE', 'NEZE',
3364
        'NAUGENOMM------', 'NAU ', 'NAU ',
3365
        'NAUSOGUT$', 'NAUSO GUT', 'NAUZU KUT',
3366
        'NCH$', 'NSH', 'NZ',
3367
        'NCOISE$', 'SOA', 'ZUA',
3368
        'NCOIS$', 'SOA', 'ZUA',
3369
        'NDAR$', 'NDA', 'NTA',
3370
        'NDERINGEN------', 'NDE ', 'NTE ',
3371
        'NDRO(CDKTZ)-', 'NTRO', None,
3372
        'ND(BFGJLMNPQVW)-', 'NT', None,
3373
        'ND(SßZ)$', 'NS', 'NZ',
3374
        'ND\'S$', 'NS', 'NZ',
3375
        'ND´S$', 'NS', 'NZ',
3376
        'NEBEN^^', 'NEBN', 'NEBN',
3377
        'NENGELERN------', 'NEN ', 'NEN ',
3378
        'NENLERN(ET)---', 'NEN LE', 'NEN LE',
3379
        'NENZULERNE---', 'NEN ZU LE', 'NEN ZU LE',
3380
        'NE(LMNRST)-3^', 'NE', 'NE',
3381
        'NEN-3', 'NE', 'NE',
3382
        'NETTE$', 'NET', 'NET',
3383
        'NGU^^', 'NU', 'NU',
3384
        'NG(BDFJLMNPQRTVW)-', 'NK', 'NK',
3385
        'NH(AUO)-$', 'NI', 'NI',
3386
        'NICHTSAHNEN-----', 'NIX ', 'NIX ',
3387
        'NICHTSSAGE----', 'NIX ', 'NIX ',
3388
        'NICHTS^^', 'NIX', 'NIX',
3389
        'NICHT^^', 'NICHT', 'NIKT',
3390
        'NINE$', 'NIN', 'NIN',
3391
        'NON^^', 'NON', 'NUN',
3392
        'NOTLEIDE-----^', 'NOT ', 'NUT ',
3393
        'NOT^^', 'NOT', 'NUT',
3394
        'NTI(AIOU)-3', 'NZI', 'NZI',
3395
        'NTIEL--3', 'NZI', 'NZI',
3396
        'NT(SßZ)$', 'NS', 'NZ',
3397
        'NT\'S$', 'NS', 'NZ',
3398
        'NT´S$', 'NS', 'NZ',
3399
        'NYLON', 'NEILON', 'NEILUN',
3400
        'NY9^', 'NÜ', None,
3401
        'NSTZUNEH---', 'NST ZU ', 'NZT ZU ',
3402
        'NSZ-', 'NS', None,
3403
        'NSTS$', 'NS', 'NZ',
3404
        'NZ(BDFGKLMNPQRSTVWX)-', 'NS', None,
3405
        'N(SßZ)$', 'NS', None,
3406
        'OBERE-', 'OBER', None,
3407
        'OBER^^', 'OBA', 'UBA',
3408
        'OEU2', 'Ö', 'Ö',
3409
        'OE<2', 'Ö', 'Ö',
3410
        'OGL-', 'OK', None,
3411
        'OGNIE-', 'ONI', 'UNI',
3412
        'OGN(AEOU)-$', 'ONI', 'UNI',
3413
        'OH(AIOÖUÜY)-', 'OH', None,
3414
        'OIE$', 'Ö', 'Ö',
3415
        'OIRE$', 'OA', 'UA',
3416
        'OIR$', 'OA', 'UA',
3417
        'OIX', 'OA', 'UA',
3418
        'OI<3', 'EU', 'EU',
3419
        'OKAY^$', 'OKE', 'UKE',
3420
        'OLYN$', 'OLIN', 'ULIN',
3421
        'OO(DLMZ)-', 'U', None,
3422
        'OO$', 'U', None,
3423
        'OO-', '', '',
3424
        'ORGINAL-----', 'ORI', 'URI',
3425
        'OTI(AÄOÖUÜ)-', 'OZI', 'UZI',
3426
        'OUI^', 'WI', 'FI',
3427
        'OUILLE$', 'ULIE', 'ULIE',
3428
        'OU(DT)-^', 'AU', 'AU',
3429
        'OUSE$', 'AUS', 'AUZ',
3430
        'OUT-', 'AU', 'AU',
3431
        'OU', 'U', 'U',
3432
        'O(FV)$', 'AU', 'AU',  # due to 'OW$' -> 'AU'
3433
        'OVER--<', 'OW', None,
3434
        'OV(AOU)-', 'OW', None,
3435
        'OW$', 'AU', 'AU',
3436
        'OWS$', 'OS', 'UZ',
3437
        'OJ(AÄEIOÖUÜ)--', 'O', 'U',
3438
        'OYER', 'OIA', None,
3439
        'OY(AÄEIOÖUÜ)--', 'O', 'U',
3440
        'O(JY)<', 'EU', 'EU',
3441
        'OZ$', 'OS', None,
3442
        'O´^', 'O', 'U',
3443
        'O\'^', 'O', 'U',
3444
        'O', None, 'U',
3445
        'PATIEN--^', 'PAZI', 'PAZI',
3446
        'PENSIO-^', 'PANSI', 'PANZI',
3447
        'PE(LMNRST)-3^', 'PE', 'PE',
3448
        'PFER-^', 'FE', 'FE',
3449
        'P(FH)<', 'F', 'F',
3450
        'PIC^$', 'PIK', 'PIK',
3451
        'PIC$', 'PIZ', 'PIZ',
3452
        'PIPELINE', 'PEIBLEIN', 'PEIBLEIN',
3453
        'POLYP-', 'POLÜ', None,
3454
        'POLY^^', 'POLI', 'PULI',
3455
        'PORTRAIT7', 'PORTRE', 'PURTRE',
3456
        'POWER7', 'PAUA', 'PAUA',
3457
        'PP(FH)--<', 'B', 'B',
3458
        'PP-', '', '',
3459
        'PRODUZ-^', 'PRODU', 'BRUTU',
3460
        'PRODUZI--', ' PRODU', ' BRUTU',
3461
        'PRIX^$', 'PRI', 'PRI',
3462
        'PS-^^', 'P', None,
3463
        'P(SßZ)^', None, 'Z',
3464
        'P(SßZ)$', 'BS', None,
3465
        'PT-^', '', '',
3466
        'PTI(AÄOÖUÜ)-3', 'BZI', 'BZI',
3467
        'PY9^', 'PÜ', None,
3468
        'P(AÄEIOÖRUÜY)-', 'P', 'P',
3469
        'P(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'P', None,
3470
        'P.^', None, 'P.',
3471
        'P^', 'P', None,
3472
        'P', 'B', 'B',
3473
        'QI-', 'Z', 'Z',
3474
        'QUARANT--', 'KARA', 'KARA',
3475
        'QUE(LMNRST)-3', 'KWE', 'KFE',
3476
        'QUE$', 'K', 'K',
3477
        'QUI(NS)$', 'KI', 'KI',
3478
        'QUIZ7', 'KWIS', None,
3479
        'Q(UV)7', 'KW', 'KF',
3480
        'Q<', 'K', 'K',
3481
        'RADFAHR----', 'RAT ', 'RAT ',
3482
        'RAEFTEZEHRE-----', 'REFTE ', 'REFTE ',
3483
        'RCH', 'RCH', 'RK',
3484
        'REA(DU)---3^', 'R', None,
3485
        'REBSERZEUG------', 'REBS ', 'REBZ ',
3486
        'RECHERCH^', 'RESHASH', 'REZAZ',
3487
        'RECYCL--', 'RIZEI', 'RIZEI',
3488
        'RE(ALST)-3^', 'RE', None,
3489
        'REE$', 'RI', 'RI',
3490
        'RER$', 'RA', 'RA',
3491
        'RE(MNR)-4', 'RE', 'RE',
3492
        'RETTE$', 'RET', 'RET',
3493
        'REUZ$', 'REUZ', None,
3494
        'REW$', 'RU', 'RU',
3495
        'RH<^', 'R', 'R',
3496
        'RJA(MN)--', 'RI', 'RI',
3497
        'ROWD-^', 'RAU', 'RAU',
3498
        'RTEMONNAIE-', 'RTMON', 'RTNUN',
3499
        'RTI(AÄOÖUÜ)-3', 'RZI', 'RZI',
3500
        'RTIEL--3', 'RZI', 'RZI',
3501
        'RV(AEOU)-3', 'RW', None,
3502
        'RY(KN)-$', 'RI', 'RI',
3503
        'RY9^', 'RÜ', None,
3504
        'RÄFTEZEHRE-----', 'REFTE ', 'REFTE ',
3505
        'SAISO-^', 'SES', 'ZEZ',
3506
        'SAFE^$', 'SEIF', 'ZEIF',
3507
        'SAUCE-^', 'SOS', 'ZUZ',
3508
        'SCHLAGGEBEN-----<', 'SHLAK ', 'ZLAK ',
3509
        'SCHSCH---7', '', '',
3510
        'SCHTSCH', 'SH', 'Z',
3511
        'SC(HZ)<', 'SH', 'Z',
3512
        'SC', 'SK', 'ZK',
3513
        'SELBSTST--7^^', 'SELB', 'ZELB',
3514
        'SELBST7^^', 'SELBST', 'ZELBZT',
3515
        'SERVICE7^', 'SÖRWIS', 'ZÖRFIZ',
3516
        'SERVI-^', 'SERW', None,
3517
        'SE(LMNRST)-3^', 'SE', 'ZE',
3518
        'SETTE$', 'SET', 'ZET',
3519
        'SHP-^', 'S', 'Z',
3520
        'SHST', 'SHT', 'ZT',
3521
        'SHTSH', 'SH', 'Z',
3522
        'SHT', 'ST', 'Z',
3523
        'SHY9^', 'SHÜ', None,
3524
        'SH^^', 'SH', None,
3525
        'SH3', 'SH', 'Z',
3526
        'SICHERGEGAN-----^', 'SICHA ', 'ZIKA ',
3527
        'SICHERGEHE----^', 'SICHA ', 'ZIKA ',
3528
        'SICHERGESTEL------^', 'SICHA ', 'ZIKA ',
3529
        'SICHERSTELL-----^', 'SICHA ', 'ZIKA ',
3530
        'SICHERZU(GS)--^', 'SICHA ZU ', 'ZIKA ZU ',
3531
        'SIEGLI-^', 'SIKL', 'ZIKL',
3532
        'SIGLI-^', 'SIKL', 'ZIKL',
3533
        'SIGHT', 'SEIT', 'ZEIT',
3534
        'SIGN', 'SEIN', 'ZEIN',
3535
        'SKI(NPZ)-', 'SKI', 'ZKI',
3536
        'SKI<^', 'SHI', 'ZI',
3537
        'SODASS^$', 'SO DAS', 'ZU TAZ',
3538
        'SODAß^$', 'SO DAS', 'ZU TAZ',
3539
        'SOGENAN--^', 'SO GEN', 'ZU KEN',
3540
        'SOUND-', 'SAUN', 'ZAUN',
3541
        'STAATS^^', 'STAZ', 'ZTAZ',
3542
        'STADT^^', 'STAT', 'ZTAT',
3543
        'STANDE$', ' STANDE', ' ZTANTE',
3544
        'START^^', 'START', 'ZTART',
3545
        'STAURANT7', 'STORAN', 'ZTURAN',
3546
        'STEAK-', 'STE', 'ZTE',
3547
        'STEPHEN-^$', 'STEW', None,
3548
        'STERN', 'STERN', None,
3549
        'STRAF^^', 'STRAF', 'ZTRAF',
3550
        'ST\'S$', 'Z', 'Z',
3551
        'ST´S$', 'Z', 'Z',
3552
        'STST--', '', '',
3553
        'STS(ACEÈÉÊHIÌÍÎOUÄÜÖ)--', 'ST', 'ZT',
3554
        'ST(SZ)', 'Z', 'Z',
3555
        'SPAREN---^', 'SPA', 'ZPA',
3556
        'SPAREND----', ' SPA', ' ZPA',
3557
        'S(PTW)-^^', 'S', None,
3558
        'SP', 'SP', None,
3559
        'STYN(AE)-$', 'STIN', 'ZTIN',
3560
        'ST', 'ST', 'ZT',
3561
        'SUITE<', 'SIUT', 'ZIUT',
3562
        'SUKE--$', 'S', 'Z',
3563
        'SURF(EI)-', 'SÖRF', 'ZÖRF',
3564
        'SV(AEÈÉÊIÌÍÎOU)-<^', 'SW', None,
3565
        'SYB(IY)--^', 'SIB', None,
3566
        'SYL(KVW)--^', 'SI', None,
3567
        'SY9^', 'SÜ', None,
3568
        'SZE(NPT)-^', 'ZE', 'ZE',
3569
        'SZI(ELN)-^', 'ZI', 'ZI',
3570
        'SZCZ<', 'SH', 'Z',
3571
        'SZT<', 'ST', 'ZT',
3572
        'SZ<3', 'SH', 'Z',
3573
        'SÜL(KVW)--^', 'SI', None,
3574
        'S', None, 'Z',
3575
        'TCH', 'SH', 'Z',
3576
        'TD(AÄEIOÖRUÜY)-', 'T', None,
3577
        'TD(ÀÁÂÃÅÈÉÊËÌÍÎÏÒÓÔÕØÙÚÛÝŸ)-', 'T', None,
3578
        'TEAT-^', 'TEA', 'TEA',
3579
        'TERRAI7^', 'TERA', 'TERA',
3580
        'TE(LMNRST)-3^', 'TE', 'TE',
3581
        'TH<', 'T', 'T',
3582
        'TICHT-', 'TIK', 'TIK',
3583
        'TICH$', 'TIK', 'TIK',
3584
        'TIC$', 'TIZ', 'TIZ',
3585
        'TIGGESTELL-------', 'TIK ', 'TIK ',
3586
        'TIGSTELL-----', 'TIK ', 'TIK ',
3587
        'TOAS-^', 'TO', 'TU',
3588
        'TOILET-', 'TOLE', 'TULE',
3589
        'TOIN-', 'TOA', 'TUA',
3590
        'TRAECHTI-^', 'TRECHT', 'TREKT',
3591
        'TRAECHTIG--', ' TRECHT', ' TREKT',
3592
        'TRAINI-', 'TREN', 'TREN',
3593
        'TRÄCHTI-^', 'TRECHT', 'TREKT',
3594
        'TRÄCHTIG--', ' TRECHT', ' TREKT',
3595
        'TSCH', 'SH', 'Z',
3596
        'TSH', 'SH', 'Z',
3597
        'TST', 'ZT', 'ZT',
3598
        'T(Sß)', 'Z', 'Z',
3599
        'TT(SZ)--<', '', '',
3600
        'TT9', 'T', 'T',
3601
        'TV^$', 'TV', 'TV',
3602
        'TX(AEIOU)-3', 'SH', 'Z',
3603
        'TY9^', 'TÜ', None,
3604
        'TZ-', '', '',
3605
        'T\'S3$', 'Z', 'Z',
3606
        'T´S3$', 'Z', 'Z',
3607
        'UEBEL(GNRW)-^^', 'ÜBL ', 'IBL ',
3608
        'UEBER^^', 'ÜBA', 'IBA',
3609
        'UE2', 'Ü', 'I',
3610
        'UGL-', 'UK', None,
3611
        'UH(AOÖUÜY)-', 'UH', None,
3612
        'UIE$', 'Ü', 'I',
3613
        'UM^^', 'UM', 'UN',
3614
        'UNTERE--3', 'UNTE', 'UNTE',
3615
        'UNTER^^', 'UNTA', 'UNTA',
3616
        'UNVER^^', 'UNFA', 'UNFA',
3617
        'UN^^', 'UN', 'UN',
3618
        'UTI(AÄOÖUÜ)-', 'UZI', 'UZI',
3619
        'UVE-4', 'UW', None,
3620
        'UY2', 'UI', None,
3621
        'UZZ', 'AS', 'AZ',
3622
        'VACL-^', 'WAZ', 'FAZ',
3623
        'VAC$', 'WAZ', 'FAZ',
3624
        'VAN DEN ^', 'FANDN', 'FANTN',
3625
        'VANES-^', 'WANE', None,
3626
        'VATRO-', 'WATR', None,
3627
        'VA(DHJNT)--^', 'F', None,
3628
        'VEDD-^', 'FE', 'FE',
3629
        'VE(BEHIU)--^', 'F', None,
3630
        'VEL(BDLMNT)-^', 'FEL', None,
3631
        'VENTZ-^', 'FEN', None,
3632
        'VEN(NRSZ)-^', 'FEN', None,
3633
        'VER(AB)-^$', 'WER', None,
3634
        'VERBAL^$', 'WERBAL', None,
3635
        'VERBAL(EINS)-^', 'WERBAL', None,
3636
        'VERTEBR--', 'WERTE', None,
3637
        'VEREIN-----', 'F', None,
3638
        'VEREN(AEIOU)-^', 'WEREN', None,
3639
        'VERIFI', 'WERIFI', None,
3640
        'VERON(AEIOU)-^', 'WERON', None,
3641
        'VERSEN^', 'FERSN', 'FAZN',
3642
        'VERSIERT--^', 'WERSI', None,
3643
        'VERSIO--^', 'WERS', None,
3644
        'VERSUS', 'WERSUS', None,
3645
        'VERTI(GK)-', 'WERTI', None,
3646
        'VER^^', 'FER', 'FA',
3647
        'VERSPRECHE-------', ' FER', ' FA',
3648
        'VER$', 'WA', None,
3649
        'VER', 'FA', 'FA',
3650
        'VET(HT)-^', 'FET', 'FET',
3651
        'VETTE$', 'WET', 'FET',
3652
        'VE^', 'WE', None,
3653
        'VIC$', 'WIZ', 'FIZ',
3654
        'VIELSAGE----', 'FIL ', 'FIL ',
3655
        'VIEL', 'FIL', 'FIL',
3656
        'VIEW', 'WIU', 'FIU',
3657
        'VILL(AE)-', 'WIL', None,
3658
        'VIS(ACEIKUVWZ)-<^', 'WIS', None,
3659
        'VI(ELS)--^', 'F', None,
3660
        'VILLON--', 'WILI', 'FILI',
3661
        'VIZE^^', 'FIZE', 'FIZE',
3662
        'VLIE--^', 'FL', None,
3663
        'VL(AEIOU)--', 'W', None,
3664
        'VOKA-^', 'WOK', None,
3665
        'VOL(ATUVW)--^', 'WO', None,
3666
        'VOR^^', 'FOR', 'FUR',
3667
        'VR(AEIOU)--', 'W', None,
3668
        'VV9', 'W', None,
3669
        'VY9^', 'WÜ', 'FI',
3670
        'V(ÜY)-', 'W', None,
3671
        'V(ÀÁÂÃÅÈÉÊÌÍÎÙÚÛ)-', 'W', None,
3672
        'V(AEIJLRU)-<', 'W', None,
3673
        'V.^', 'V.', None,
3674
        'V<', 'F', 'F',
3675
        'WEITERENTWI-----^', 'WEITA ', 'FEITA ',
3676
        'WEITREICH-----^', 'WEIT ', 'FEIT ',
3677
        'WEITVER^', 'WEIT FER', 'FEIT FA',
3678
        'WE(LMNRST)-3^', 'WE', 'FE',
3679
        'WER(DST)-', 'WER', None,
3680
        'WIC$', 'WIZ', 'FIZ',
3681
        'WIEDERU--', 'WIDE', 'FITE',
3682
        'WIEDER^$', 'WIDA', 'FITA',
3683
        'WIEDER^^', 'WIDA ', 'FITA ',
3684
        'WIEVIEL', 'WI FIL', 'FI FIL',
3685
        'WISUEL', 'WISUEL', None,
3686
        'WR-^', 'W', None,
3687
        'WY9^', 'WÜ', 'FI',
3688
        'W(BDFGJKLMNPQRSTZ)-', 'F', None,
3689
        'W$', 'F', None,
3690
        'W', None, 'F',
3691
        'X<^', 'Z', 'Z',
3692
        'XHAVEN$', 'XAFN', None,
3693
        'X(CSZ)', 'X', 'X',
3694
        'XTS(CH)--', 'XT', 'XT',
3695
        'XT(SZ)', 'Z', 'Z',
3696
        'YE(LMNRST)-3^', 'IE', 'IE',
3697
        'YE-3', 'I', 'I',
3698
        'YOR(GK)^$', 'IÖRK', 'IÖRK',
3699
        'Y(AOU)-<7', 'I', 'I',
3700
        'Y(BKLMNPRSTX)-1', 'Ü', None,
3701
        'YVES^$', 'IF', 'IF',
3702
        'YVONNE^$', 'IWON', 'IFUN',
3703
        'Y.^', 'Y.', None,
3704
        'Y', 'I', 'I',
3705
        'ZC(AOU)-', 'SK', 'ZK',
3706
        'ZE(LMNRST)-3^', 'ZE', 'ZE',
3707
        'ZIEJ$', 'ZI', 'ZI',
3708
        'ZIGERJA(HR)-3', 'ZIGA IA', 'ZIKA IA',
3709
        'ZL(AEIOU)-', 'SL', None,
3710
        'ZS(CHT)--', '', '',
3711
        'ZS', 'SH', 'Z',
3712
        'ZUERST', 'ZUERST', 'ZUERST',
3713
        'ZUGRUNDE^$', 'ZU GRUNDE', 'ZU KRUNTE',
3714
        'ZUGRUNDE', 'ZU GRUNDE ', 'ZU KRUNTE ',
3715
        'ZUGUNSTEN', 'ZU GUNSTN', 'ZU KUNZTN',
3716
        'ZUHAUSE-', 'ZU HAUS', 'ZU AUZ',
3717
        'ZULASTEN^$', 'ZU LASTN', 'ZU LAZTN',
3718
        'ZURUECK^^', 'ZURÜK', 'ZURIK',
3719
        'ZURZEIT', 'ZUR ZEIT', 'ZUR ZEIT',
3720
        'ZURÜCK^^', 'ZURÜK', 'ZURIK',
3721
        'ZUSTANDE', 'ZU STANDE', 'ZU ZTANTE',
3722
        'ZUTAGE', 'ZU TAGE', 'ZU TAKE',
3723
        'ZUVER^^', 'ZUFA', 'ZUFA',
3724
        'ZUVIEL', 'ZU FIL', 'ZU FIL',
3725
        'ZUWENIG', 'ZU WENIK', 'ZU FENIK',
3726
        'ZY9^', 'ZÜ', None,
3727
        'ZYK3$', 'ZIK', None,
3728
        'Z(VW)7^', 'SW', None,
3729
        None, None, None)
3730
3731
    phonet_hash = Counter()
3732
    alpha_pos = Counter()
3733
3734
    phonet_hash_1 = Counter()
3735
    phonet_hash_2 = Counter()
3736
3737
    _phonet_upper_translation = dict(zip((ord(_) for _ in
3738
                                          'abcdefghijklmnopqrstuvwxyzàáâãåäæ' +
3739
                                          'çðèéêëìíîïñòóôõöøœšßþùúûüýÿ'),
3740
                                         'ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÅÄÆ' +
3741
                                         'ÇÐÈÉÊËÌÍÎÏÑÒÓÔÕÖØŒŠßÞÙÚÛÜÝŸ'))
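    # (maps lower-case letters, including accented ones, to their upper-case
    # forms for the str.translate() call in _phonet() below)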
3742
3743
    def _trinfo(text, rule, err_text, lang):
3744
        """Output debug information."""
3745
        if lang == 'none':
3746
            _phonet_rules = _phonet_rules_no_lang
3747
        else:
3748
            _phonet_rules = _phonet_rules_german
3749
3750
        from_rule = ('(NULL)' if _phonet_rules[rule] is None else
3751
                     _phonet_rules[rule])
3752
        to_rule1 = ('(NULL)' if (_phonet_rules[rule + 1] is None) else
3753
                    _phonet_rules[rule + 1])
3754
        to_rule2 = ('(NULL)' if (_phonet_rules[rule + 2] is None) else
3755
                    _phonet_rules[rule + 2])
3756
        print('"{} {}:  "{}"{}"{}" {}'.format(text, ((rule / 3) + 1),
3757
                                              from_rule, to_rule1, to_rule2,
3758
                                              err_text))
3759
3760
    def _initialize_phonet(lang):
3761
        """Initialize phonet variables."""
3762
        if lang == 'none':
3763
            _phonet_rules = _phonet_rules_no_lang
3764
        else:
3765
            _phonet_rules = _phonet_rules_german
3766
3767
        phonet_hash[''] = -1
3768
3769
        # German and international umlauts
3770
        for j in {'À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë',
3771
                  'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø',
3772
                  'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'Þ', 'ß', 'Œ', 'Š', 'Ÿ'}:
3773
            alpha_pos[j] = 1
3774
            phonet_hash[j] = -1
3775
3776
        # "normal" letters ('A'-'Z')
3777
        for i, j in enumerate('ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
3778
            alpha_pos[j] = i + 2
3779
            phonet_hash[j] = -1
3780
3781
        for i in range(26):
3782
            for j in range(28):
3783
                phonet_hash_1[i, j] = -1
3784
                phonet_hash_2[i, j] = -1
3785
3786
        # for each phonetic rule
3787
        for i in range(len(_phonet_rules)):
3788
            rule = _phonet_rules[i]
3789
3790
            if rule and i % 3 == 0:
3791
                # calculate first hash value
3792
                k = _phonet_rules[i][0]
3793
3794
                if phonet_hash[k] < 0 and (_phonet_rules[i+1] or
3795
                                           _phonet_rules[i+2]):
3796
                    phonet_hash[k] = i
3797
3798
                # calculate second hash values
3799
                if k and alpha_pos[k] >= 2:
3800
                    k = alpha_pos[k]
3801
3802
                    j = k-2
3803
                    rule = rule[1:]
3804
3805
                    if not rule:
3806
                        rule = ' '
3807
                    elif rule[0] == '(':
3808
                        rule = rule[1:]
3809
                    else:
3810
                        rule = rule[0]
3811
3812
                    while rule and (rule[0] != ')'):
3813
                        k = alpha_pos[rule[0]]
3814
3815
                        if k > 0:
3816
                            # add hash value for this letter
3817
                            if phonet_hash_1[j, k] < 0:
3818
                                phonet_hash_1[j, k] = i
3819
                                phonet_hash_2[j, k] = i
3820
3821
                            if phonet_hash_2[j, k] >= (i-30):
3822
                                phonet_hash_2[j, k] = i
3823
                            else:
3824
                                k = -1
3825
3826
                        if k <= 0:
3827
                            # add hash value for all letters
3828
                            if phonet_hash_1[j, 0] < 0:
3829
                                phonet_hash_1[j, 0] = i
3830
3831
                            phonet_hash_2[j, 0] = i
3832
3833
                        rule = rule[1:]
3834
3835
    def _phonet(term, mode, lang, trace):
3836
        """Return the phonet coded form of a term."""
3837
        if lang == 'none':
3838
            _phonet_rules = _phonet_rules_no_lang
3839
        else:
3840
            _phonet_rules = _phonet_rules_german
3841
3842
        char0 = ''
3843
        dest = term
3844
3845
        if not term:
3846
            return ''
3847
3848
        term_length = len(term)
3849
3850
        # convert input string to upper-case
3851
        src = term.translate(_phonet_upper_translation)
3852
3853
        # check "src"
3854
        i = 0
3855
        j = 0
3856
        zeta = 0
3857
3858
        while i < len(src):
3859
            char = src[i]
3860
3861
            if trace:
3862
                print('\ncheck position {}:  src = "{}",  dest = "{}"'.format
3863
                      (j, src[i:], dest[:j]))
3864
3865
            pos = alpha_pos[char]
3866
3867
            if pos >= 2:
3868
                xpos = pos-2
3869
3870
                if i+1 == len(src):
3871
                    pos = alpha_pos['']
3872
                else:
3873
                    pos = alpha_pos[src[i+1]]
3874
3875
                start1 = phonet_hash_1[xpos, pos]
3876
                start2 = phonet_hash_1[xpos, 0]
3877
                end1 = phonet_hash_2[xpos, pos]
3878
                end2 = phonet_hash_2[xpos, 0]
3879
3880
                # preserve rule priorities
3881
                if (start2 >= 0) and ((start1 < 0) or (start2 < start1)):
3882
                    pos = start1
3883
                    start1 = start2
3884
                    start2 = pos
3885
                    pos = end1
3886
                    end1 = end2
3887
                    end2 = pos
3888
3889
                if (end1 >= start2) and (start2 >= 0):
3890
                    if end2 > end1:
3891
                        end1 = end2
3892
3893
                    start2 = -1
3894
                    end2 = -1
3895
            else:
3896
                pos = phonet_hash[char]
3897
                start1 = pos
3898
                end1 = 10000
3899
                start2 = -1
3900
                end2 = -1
3901
3902
            pos = start1
3903
            zeta0 = 0
3904
3905
            if pos >= 0:
3906
                # check rules for this char
3907
                while ((_phonet_rules[pos] is None) or
3908
                       (_phonet_rules[pos][0] == char)):
3909
                    if pos > end1:
3910
                        if start2 > 0:
3911
                            pos = start2
3912
                            start1 = start2
3913
                            start2 = -1
3914
                            end1 = end2
3915
                            end2 = -1
3916
                            continue
3917
3918
                        break
3919
3920
                    if (((_phonet_rules[pos] is None) or
3921
                         (_phonet_rules[pos + mode] is None))):
3922
                        # no conversion rule available
3923
                        pos += 3
3924
                        continue
3925
3926
                    if trace:
3927
                        _trinfo('> rule no.', pos, 'is being checked', lang)
3928
3929
                    # check whole string
3930
                    matches = 1  # number of matching letters
3931
                    priority = 5  # default priority
3932
                    rule = _phonet_rules[pos]
3933
                    rule = rule[1:]
3934
3935
                    while (rule and
3936
                           (len(src) > (i + matches)) and
3937
                           (src[i + matches] == rule[0]) and
3938
                           not rule[0].isdigit() and
3939
                           (rule not in '(-<^$')):
3940
                        matches += 1
3941
                        rule = rule[1:]
3942
3943
                    if rule and (rule[0] == '('):
3944
                        # check an array of letters
3945
                        if (((len(src) > (i + matches)) and
3946
                             src[i + matches].isalpha() and
3947
                             (src[i + matches] in rule[1:]))):
3948
                            matches += 1
3949
3950
                            while rule and rule[0] != ')':
3951
                                rule = rule[1:]
3952
3953
                            # if rule[0] == ')':
3954
                            rule = rule[1:]
3955
3956
                    if rule:
3957
                        priority0 = ord(rule[0])
3958
                    else:
3959
                        priority0 = 0
3960
3961
                    matches0 = matches
3962
3963
                    while rule and rule[0] == '-' and matches > 1:
3964
                        matches -= 1
3965
                        rule = rule[1:]
3966
3967
                    if rule and rule[0] == '<':
3968
                        rule = rule[1:]
3969
3970
                    if rule and rule[0].isdigit():
3971
                        # read priority
3972
                        priority = int(rule[0])
3973
                        rule = rule[1:]
3974
3975
                    if rule and rule[0:2] == '^^':
3976
                        rule = rule[1:]
3977
3978
                    if (not rule or
3979
                            ((rule[0] == '^') and
3980
                             ((i == 0) or not src[i-1].isalpha()) and
3981
                             ((rule[1:2] != '$') or
3982
                              (not (src[i+matches0:i+matches0+1].isalpha()) and
3983
                               (src[i+matches0:i+matches0+1] != '.')))) or
3984
                            ((rule[0] == '$') and (i > 0) and
3985
                             src[i-1].isalpha() and
3986
                             ((not src[i+matches0:i+matches0+1].isalpha()) and
3987
                              (src[i+matches0:i+matches0+1] != '.')))):
3988
                        # look for continuation, if:
3989
                        # matches > 1 and NO '-' in the first string
3990
                        pos0 = -1
3991
3992
                        start3 = 0
3993
                        start4 = 0
3994
                        end3 = 0
3995
                        end4 = 0
3996
3997
                        if (((matches > 1) and
3998
                             src[i+matches:i+matches+1] and
3999
                             (priority0 != ord('-')))):
4000
                            char0 = src[i+matches-1]
4001
                            pos0 = alpha_pos[char0]
4002
4003
                            if pos0 >= 2 and src[i+matches]:
4004
                                xpos = pos0 - 2
4005
                                pos0 = alpha_pos[src[i+matches]]
4006
                                start3 = phonet_hash_1[xpos, pos0]
4007
                                start4 = phonet_hash_1[xpos, 0]
4008
                                end3 = phonet_hash_2[xpos, pos0]
4009
                                end4 = phonet_hash_2[xpos, 0]
4010
4011
                                # preserve rule priorities
4012
                                if (((start4 >= 0) and
4013
                                     ((start3 < 0) or (start4 < start3)))):
4014
                                    pos0 = start3
4015
                                    start3 = start4
4016
                                    start4 = pos0
4017
                                    pos0 = end3
4018
                                    end3 = end4
4019
                                    end4 = pos0
4020
4021
                                if (end3 >= start4) and (start4 >= 0):
4022
                                    if end4 > end3:
4023
                                        end3 = end4
4024
4025
                                    start4 = -1
4026
                                    end4 = -1
4027
                            else:
4028
                                pos0 = phonet_hash[char0]
4029
                                start3 = pos0
4030
                                end3 = 10000
4031
                                start4 = -1
4032
                                end4 = -1
4033
4034
                            pos0 = start3
4035
4036
                        # check continuation rules for src[i+matches]
4037
                        if pos0 >= 0:
4038
                            while ((_phonet_rules[pos0] is None) or
4039
                                   (_phonet_rules[pos0][0] == char0)):
4040
                                if pos0 > end3:
4041
                                    if start4 > 0:
4042
                                        pos0 = start4
4043
                                        start3 = start4
4044
                                        start4 = -1
4045
                                        end3 = end4
4046
                                        end4 = -1
4047
                                        continue
4048
4049
                                    priority0 = -1
4050
4051
                                    # important
4052
                                    break
4053
4054
                                if (((_phonet_rules[pos0] is None) or
4055
                                     (_phonet_rules[pos0 + mode] is None))):
4056
                                    # no conversion rule available
4057
                                    pos0 += 3
4058
                                    continue
4059
4060
                                if trace:
4061
                                    _trinfo('> > continuation rule no.', pos0,
4062
                                            'is being checked', lang)
4063
4064
                                # check whole string
4065
                                matches0 = matches
4066
                                priority0 = 5
4067
                                rule = _phonet_rules[pos0]
4068
                                rule = rule[1:]
4069
4070
                                while (rule and
4071
                                       (src[i+matches0:i+matches0+1] ==
4072
                                        rule[0]) and
4073
                                       (not rule[0].isdigit() or
4074
                                        (rule in '(-<^$'))):
4075
                                    matches0 += 1
4076
                                    rule = rule[1:]
4077
4078
                                if rule and rule[0] == '(':
4079
                                    # check an array of letters
4080
                                    if ((src[i+matches0:i+matches0+1]
4081
                                         .isalpha() and
4082
                                         (src[i+matches0] in rule[1:]))):
4083
                                        matches0 += 1
4084
4085
                                        while rule and rule[0] != ')':
4086
                                            rule = rule[1:]
4087
4088
                                        # if rule[0] == ')':
4089
                                        rule = rule[1:]
4090
4091
                                while rule and rule[0] == '-':
4092
                                    # "matches0" is NOT decremented
4093
                                    # because of  "if (matches0 == matches)"
4094
                                    rule = rule[1:]
4095
4096
                                if rule and rule[0] == '<':
4097
                                    rule = rule[1:]
4098
4099
                                if rule and rule[0].isdigit():
4100
                                    priority0 = int(rule[0])
4101
                                    rule = rule[1:]
4102
4103
                                if (not rule or
4104
                                        # rule == '^' is not possible here
4105
                                        ((rule[0] == '$') and not
4106
                                         src[i+matches0:i+matches0+1]
4107
                                         .isalpha() and
4108
                                         (src[i+matches0:i+matches0+1]
4109
                                          != '.'))):
4110
                                    if matches0 == matches:
4111
                                        # this is only a partial string
4112
                                        if trace:
4113
                                            _trinfo('> > continuation ' +
4114
                                                    'rule no.',
4115
                                                    pos0,
4116
                                                    'not used (too short)',
4117
                                                    lang)
4118
4119
                                        pos0 += 3
4120
                                        continue
4121
4122
                                    if priority0 < priority:
4123
                                        # priority is too low
4124
                                        if trace:
4125
                                            _trinfo('> > continuation ' +
4126
                                                    'rule no.',
4127
                                                    pos0,
4128
                                                    'not used (priority)',
4129
                                                    lang)
4130
4131
                                        pos0 += 3
4132
                                        continue
4133
4134
                                    # continuation rule found
4135
                                    break
4136
4137
                                if trace:
4138
                                    _trinfo('> > continuation rule no.', pos0,
4139
                                            'not used', lang)
4140
4141
                                pos0 += 3
4142
4143
                            # end of "while"
4144
                            if ((priority0 >= priority) and
4145
                                    ((_phonet_rules[pos0] is not None) and
4146
                                     (_phonet_rules[pos0][0] == char0))):
4147
4148
                                if trace:
4149
                                    _trinfo('> rule no.', pos, '', lang)
4150
                                    _trinfo('> not used because of ' +
4151
                                            'continuation', pos0, '', lang)
4152
4153
                                pos += 3
4154
                                continue
4155
4156
                        # replace string
4157
                        if trace:
4158
                            _trinfo('Rule no.', pos, 'is applied', lang)
4159
4160
                        if ((_phonet_rules[pos] and
4161
                             ('<' in _phonet_rules[pos][1:]))):
4162
                            priority0 = 1
4163
                        else:
4164
                            priority0 = 0
4165
4166
                        rule = _phonet_rules[pos + mode]
4167
4168
                        if (priority0 == 1) and (zeta == 0):
4169
                            # rule with '<' is applied
4170
                            if ((j > 0) and rule and
4171
                                    ((dest[j-1] == char) or
4172
                                     (dest[j-1] == rule[0]))):
4173
                                j -= 1
4174
4175
                            zeta0 = 1
4176
                            zeta += 1
4177
                            matches0 = 0
4178
4179
                            while rule and src[i+matches0]:
4180
                                src = (src[0:i+matches0] + rule[0] +
4181
                                       src[i+matches0+1:])
4182
                                matches0 += 1
4183
                                rule = rule[1:]
4184
4185
                            if matches0 < matches:
4186
                                src = (src[0:i+matches0] +
4187
                                       src[i+matches:])
4188
4189
                            char = src[i]
4190
                        else:
4191
                            i = i + matches - 1
4192
                            zeta = 0
4193
4194
                            while len(rule) > 1:
4195
                                if (j == 0) or (dest[j - 1] != rule[0]):
4196
                                    dest = (dest[0:j] + rule[0] +
4197
                                            dest[min(len(dest), j+1):])
4198
                                    j += 1
4199
4200
                                rule = rule[1:]
4201
4202
                            # new "current char"
4203
                            if not rule:
4204
                                rule = ''
4205
                                char = ''
4206
                            else:
4207
                                char = rule[0]
4208
4209
                            if ((_phonet_rules[pos] and
4210
                                 '^^' in _phonet_rules[pos][1:])):
4211
                                if char:  # pragma: no branch
4212
                                    dest = (dest[0:j] + char +
4213
                                            dest[min(len(dest), j + 1):])
4214
                                    j += 1
4215
4216
                                src = src[i + 1:]
4217
                                i = 0
4218
                                zeta0 = 1
4219
4220
                        break
4221
4222
                    pos += 3
4223
4224
                    if pos > end1 and start2 > 0:
4225
                        pos = start2
4226
                        start1 = start2
4227
                        end1 = end2
4228
                        start2 = -1
4229
                        end2 = -1
4230
4231
            if zeta0 == 0:
4232
                if char and ((j == 0) or (dest[j-1] != char)):
4233
                    # delete multiple letters only
4234
                    dest = dest[0:j] + char + dest[min(j+1, term_length):]
4235
                    j += 1
4236
4237
                i += 1
4238
                zeta = 0
4239
4240
        dest = dest[0:j]
4241
4242
        return dest
4243
4244
    _initialize_phonet(lang)
4245
4246
    word = unicodedata.normalize('NFKC', text_type(word))
4247
    return _phonet(word, mode, lang, trace)
4248
4249
4250
def spfc(word):
4251
    """Return the Standardized Phonetic Frequency Code (SPFC) of a word.
4252
4253
    Standardized Phonetic Frequency Code is roughly Soundex-like.
4254
    This implementation is based on pages 19-21 of
4255
    https://archive.org/stream/accessingindivid00moor#page/19/mode/1up
4256
4257
    :param str word: the word to transform
4258
    :returns: the SPFC value
4259
    :rtype: str
4260
4261
    >>> spfc('Christopher Smith')
4262
    '01160'
4263
    >>> spfc('Christopher Schmidt')
4264
    '01160'
4265
    >>> spfc('Niall Smith')
4266
    '01660'
4267
    >>> spfc('Niall Schmidt')
    '01660'

    >>> spfc('L.Smith')
4270
    '01960'
4271
    >>> spfc('R.Miller')
4272
    '65490'
4273
4274
    >>> spfc(('L', 'Smith'))
4275
    '01960'
4276
    >>> spfc(('R', 'Miller'))
4277
    '65490'
4278
    """
4279
    _pf1 = dict(zip((ord(_) for _ in 'SZCKQVFPUWABLORDHIEMNXGJT'),
                    '0011112222334445556666777'))
4281
    _pf2 = dict(zip((ord(_) for _ in
4282
                     'SZCKQFPXABORDHIMNGJTUVWEL'),
4283
                    '0011122233445556677788899'))
4284
    _pf3 = dict(zip((ord(_) for _ in
4285
                     'BCKQVDTFLPGJXMNRSZAEHIOUWY'),
4286
                    '00000112223334456677777777'))
4287
4288
    _substitutions = (('DK', 'K'), ('DT', 'T'), ('SC', 'S'), ('KN', 'N'),
4289
                      ('MN', 'N'))
4290
4291
    def _raise_word_ex():
4292
        """Raise an AttributeError."""
4293
        raise AttributeError('word attribute must be a string with a space ' +
4294
                             'or period dividing the first and last names ' +
4295
                             'or a tuple/list consisting of the first and ' +
4296
                             'last names')
4297
4298
    if not word:
4299
        return ''
4300
4301
    if isinstance(word, (str, text_type)):
4302
        names = word.split('.', 1)
4303
        if len(names) != 2:
4304
            names = word.split(' ', 1)
4305
            if len(names) != 2:
4306
                _raise_word_ex()
4307
    elif hasattr(word, '__iter__'):
4308
        if len(word) != 2:
4309
            _raise_word_ex()
4310
        names = word
4311
    else:
4312
        _raise_word_ex()
4313
4314
    names = [unicodedata.normalize('NFKD', text_type(_.strip()
4315
                                                     .replace('ß', 'SS')
4316
                                                     .upper()))
4317
             for _ in names]
    code = ''
4319
4320
    def steps_one_to_three(name):
4321
        """Perform the first three steps of SPFC."""
4322
        # filter out non A-Z
4323
        name = ''.join(_ for _ in name if _ in
4324
                       {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K',
4325
                        'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V',
4326
                        'W', 'X', 'Y', 'Z'})
4327
4328
        # 1. In the field, convert DK to K, DT to T, SC to S, KN to N,
4329
        # and MN to N
4330
        for subst in _substitutions:
4331
            name = name.replace(subst[0], subst[1])
4332
4333
        # 2. In the name field, replace multiple letters with a single letter
4334
        name = _delete_consecutive_repeats(name)
4335
4336
        # 3. Remove vowels, W, H, and Y, but keep the first letter in the name
4337
        # field.
4338
        if name:
4339
            name = name[0] + ''.join(_ for _ in name[1:] if _ not in
4340
                                     {'A', 'E', 'H', 'I', 'O', 'U', 'W', 'Y'})
4341
        return name
4342
4343
    names = [steps_one_to_three(_) for _ in names]
4344
4345
    # 4. The first digit of the code is obtained using PF1 and the first letter
4346
    # of the name field. Remove this letter after coding.
4347
    if names[1]:
4348
        code += names[1][0].translate(_pf1)
4349
        names[1] = names[1][1:]
4350
4351
    # 5. Using the last letters of the name, use Table PF3 to obtain the
4352
    # second digit of the code. Use as many letters as possible and remove
4353
    # after coding.
4354
    if names[1]:
4355
        if names[1][-3:] == 'STN' or names[1][-3:] == 'PRS':
4356
            code += '8'
4357
            names[1] = names[1][:-3]
4358
        elif names[1][-2:] == 'SN':
4359
            code += '8'
4360
            names[1] = names[1][:-2]
4361
        elif names[1][-3:] == 'STR':
4362
            code += '9'
4363
            names[1] = names[1][:-3]
4364
        elif names[1][-2:] in {'SR', 'TN', 'TD'}:
4365
            code += '9'
4366
            names[1] = names[1][:-2]
4367
        elif names[1][-3:] == 'DRS':
4368
            code += '7'
4369
            names[1] = names[1][:-3]
4370
        elif names[1][-2:] in {'TR', 'MN'}:
4371
            code += '7'
4372
            names[1] = names[1][:-2]
4373
        else:
4374
            code += names[1][-1].translate(_pf3)
4375
            names[1] = names[1][:-1]
4376
4377
    # 6. The third digit is found using Table PF2 and the first character of
4378
    # the first name. Remove after coding.
4379
    if names[0]:
4380
        code += names[0][0].translate(_pf2)
4381
        names[0] = names[0][1:]
4382
4383
    # 7. The fourth digit is found using Table PF2 and the first character of
4384
    # the name field. If no letters remain use zero. After coding remove the
4385
    # letter.
4386
    # 8. The fifth digit is found in the same manner as the fourth using the
4387
    # remaining characters of the name field if any.
4388
    for _ in range(2):
4389
        if names[1]:
4390
            code += names[1][0].translate(_pf2)
4391
            names[1] = names[1][1:]
4392
        else:
4393
            code += '0'
4394
4395
    return code
4396
4397
4398
def statistics_canada(word, maxlength=4):
4399
    """Return the Statistics Canada code for a word.
4400
4401
    The original description of this algorithm could not be located, and
4402
    may only have been specified in an unpublished TR. The coding does not
4403
    appear to be in use by Statistics Canada any longer. In its place, this is
4404
    an implementation of the "Census modified Statistics Canada name coding
4405
    procedure".
4406
4407
    The modified version of this algorithm is described in Appendix B of
4408
    Lynch, Billy T. and William L. Arends. `Selection of a Surname Coding
4409
    Procedure for the SRS Record Linkage System.` Statistical Reporting
4410
    Service, U.S. Department of Agriculture, Washington, D.C. February 1977.
4411
    https://naldc.nal.usda.gov/download/27833/PDF
4412
4413
    :param str word: the word to transform
4414
    :param int maxlength: the maximum length (default 4) of the code to return
    :returns: the Statistics Canada name code value
4417
    :rtype: str
4418
4419
    >>> statistics_canada('Christopher')
4420
    'CHRS'
4421
    >>> statistics_canada('Niall')
4422
    'NL'
4423
    >>> statistics_canada('Smith')
4424
    'SMTH'
4425
    >>> statistics_canada('Schmidt')
4426
    'SCHM'
4427
    """
4428
    # uppercase, normalize, decompose, and filter non-A-Z out
4429
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
4430
    word = word.replace('ß', 'SS')
4431
    word = ''.join(c for c in word if c in
4432
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
4433
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
4434
                    'Y', 'Z'})
4435
    if not word:
4436
        return ''
4437
4438
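    # Keep the first letter as-is; remove vowels and Y from the remainder,
    # collapse consecutive repeats, then truncate to maxlength.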
    code = word[1:]
4439
    for vowel in {'A', 'E', 'I', 'O', 'U', 'Y'}:
4440
        code = code.replace(vowel, '')
4441
    code = word[0]+code
4442
    code = _delete_consecutive_repeats(code)
4443
    code = code.replace(' ', '')
4444
4445
    return code[:maxlength]
4446
4447
4448
def lein(word, maxlength=4, zero_pad=True):
4449
    """Return the Lein code for a word.
4450
4451
    This is Lein name coding, based on
4452
    https://naldc.nal.usda.gov/download/27833/PDF
4453
4454
    :param str word: the word to transform
4455
    :param int maxlength: the maximum length (default 4) of the code to return
4456
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
4457
        maxlength string
4458
    :returns: the Lein code
4459
    :rtype: str
4460
4461
    >>> lein('Christopher')
4462
    'C351'
4463
    >>> lein('Niall')
4464
    'N300'
4465
    >>> lein('Smith')
4466
    'S210'
4467
    >>> lein('Schmidt')
4468
    'S521'
4469
    """
4470
    _lein_translation = dict(zip((ord(_) for _ in
                                  'BCDFGJKLMNPQRSTVXZ'),
4472
                                 '451455532245351455'))
4473
4474
    # uppercase, normalize, decompose, and filter non-A-Z out
4475
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
4476
    word = word.replace('ß', 'SS')
4477
    word = ''.join(c for c in word if c in
4478
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
4479
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
4480
                    'Y', 'Z'})
4481
4482
    if not word:
4483
        return ''
4484
4485
    code = word[0]  # Rule 1
4486
    word = word[1:].translate({32: None, 65: None, 69: None, 72: None,
4487
                               73: None, 79: None, 85: None, 87: None,
4488
                               89: None})  # Rule 2
4489
    word = _delete_consecutive_repeats(word)  # Rule 3
4490
    code += word.translate(_lein_translation)  # Rule 4
4491
4492
    if zero_pad:
4493
        code += ('0'*maxlength)  # Rule 4
4494
4495
    return code[:maxlength]
4496
4497
4498
def roger_root(word, maxlength=5, zero_pad=True):
4499
    """Return the Roger Root code for a word.
4500
4501
    This is Roger Root name coding, based on
4502
    https://naldc.nal.usda.gov/download/27833/PDF
4503
4504
    :param str word: the word to transform
4505
    :param int maxlength: the maximum length (default 5) of the code to return
4506
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
4507
        maxlength string
4508
    :returns: the Roger Root code
4509
    :rtype: str
4510
4511
    >>> roger_root('Christopher')
4512
    '06401'
4513
    >>> roger_root('Niall')
4514
    '02500'
4515
    >>> roger_root('Smith')
4516
    '00310'
4517
    >>> roger_root('Schmidt')
4518
    '06310'
4519
    """
4520
    # uppercase, normalize, decompose, and filter non-A-Z out
4521
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
4522
    word = word.replace('ß', 'SS')
4523
    word = ''.join(c for c in word if c in
4524
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
4525
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
4526
                    'Y', 'Z'})
4527
4528
    if not word:
4529
        return ''
4530
4531
    # '*' is used to prevent combining by _delete_consecutive_repeats()
4532
    _init_patterns = {4: {'TSCH': '06'},
4533
                      3: {'TSH': '06', 'SCH': '06'},
4534
                      2: {'CE': '0*0', 'CH': '06', 'CI': '0*0', 'CY': '0*0',
4535
                          'DG': '07', 'GF': '08', 'GM': '03', 'GN': '02',
4536
                          'KN': '02', 'PF': '08', 'PH': '08', 'PN': '02',
4537
                          'SH': '06', 'TS': '0*0', 'WR': '04'},
4538
                      1: {'A': '1', 'B': '09', 'C': '07', 'D': '01', 'E': '1',
4539
                          'F': '08', 'G': '07', 'H': '2', 'I': '1', 'J': '3',
4540
                          'K': '07', 'L': '05', 'M': '03', 'N': '02', 'O': '1',
4541
                          'P': '09', 'Q': '07', 'R': '04', 'S': '0*0',
4542
                          'T': '01', 'U': '1', 'V': '08', 'W': '4', 'X': '07',
4543
                          'Y': '5', 'Z': '0*0'}}
4544
4545
    _med_patterns = {4: {'TSCH': '6'},
4546
                     3: {'TSH': '6', 'SCH': '6'},
4547
                     2: {'CE': '0', 'CH': '6', 'CI': '0', 'CY': '0', 'DG': '7',
4548
                         'PH': '8', 'SH': '6', 'TS': '0'},
4549
                     1: {'B': '9', 'C': '7', 'D': '1', 'F': '8', 'G': '7',
4550
                         'J': '6', 'K': '7', 'L': '5', 'M': '3', 'N': '2',
4551
                         'P': '9', 'Q': '7', 'R': '4', 'S': '0', 'T': '1',
4552
                         'V': '8', 'X': '7', 'Z': '0',
4553
                         'A': '*', 'E': '*', 'H': '*', 'I': '*', 'O': '*',
4554
                         'U': '*', 'W': '*', 'Y': '*'}}
4555
4556
    code = ''
4557
    pos = 0
4558
4559
    # Do first digit(s) first
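    # (longest patterns are tried first, so e.g. 'TSCH' takes precedence
    # over 'TS')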
4560
    for num in range(4, 0, -1):
4561
        if word[:num] in _init_patterns[num]:
4562
            code = _init_patterns[num][word[:num]]
4563
            pos += num
4564
            break
4565
    else:
4566
        pos += 1  # Advance if nothing is recognized
4567
4568
    # Then code subsequent digits
4569
    while pos < len(word):
4570
        for num in range(4, 0, -1):
4571
            if word[pos:pos+num] in _med_patterns[num]:
4572
                code += _med_patterns[num][word[pos:pos+num]]
4573
                pos += num
4574
                break
4575
        else:
4576
            pos += 1  # Advance if nothing is recognized
4577
4578
    code = _delete_consecutive_repeats(code)
4579
    code = code.replace('*', '')
4580
4581
    if zero_pad:
4582
        code += '0'*maxlength
4583
4584
    return code[:maxlength]
4585
4586
4587
def onca(word, maxlength=4, zero_pad=True):
4588
    """Return the Oxford Name Compression Algorithm (ONCA) code for a word.
4589
4590
    This is the Oxford Name Compression Algorithm, based on:
4591
    Gill, Leicester E. 1997. "OX-LINK: The Oxford Medical Record Linkage
4592
    System." In ``Record Linkage Techniques -- 1997``. Arlington, VA. March
4593
    20--21, 1997.
4594
    https://nces.ed.gov/FCSM/pdf/RLT97.pdf
4595
4596
    I can find no complete description of the "anglicised version of the NYSIIS
4597
    method" identified as the first step in this algorithm, so this is likely
4598
    not a correct implementation, in that it employs the standard NYSIIS
4599
    algorithm.
4600
4601
    :param str word: the word to transform
4602
    :param int maxlength: the maximum length (default 4) of the code to return
4603
    :param bool zero_pad: pad the end of the return value with 0s to achieve a
4604
        maxlength string
4605
    :returns: the ONCA code
4606
    :rtype: str
4607
4608
    >>> onca('Christopher')
4609
    'C623'
4610
    >>> onca('Niall')
4611
    'N400'
4612
    >>> onca('Smith')
4613
    'S530'
4614
    >>> onca('Schmidt')
4615
    'S530'
4616
    """
4617
    # In the most extreme case, 3 characters of NYSIIS input can be compressed
4618
    # to one character of output, so give it triple the maxlength.
4619
    return soundex(nysiis(word, maxlength=maxlength*3), maxlength,
4620
                   zero_pad=zero_pad)
4621
4622
4623
def eudex(word, maxlength=8):
4624
    """Return the eudex phonetic hash of a word.
4625
4626
    This implementation of eudex phonetic hashing is based on the specification
4627
    (not the reference implementation) at:
4628
    Ticki. 2017. "Eudex: A blazingly fast phonetic reduction/hashing
4629
    algorithm." https://docs.rs/crate/eudex
4630
4631
    Further details can be found at
4632
    http://ticki.github.io/blog/the-eudex-algorithm/
4633
4634
    :param str word: the word to transform
4635
    :param int maxlength: the length of the code returned (defaults to 8)
4636
    :returns: the eudex hash
4637
    :rtype: int
4638
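
    Similar-sounding words are intended to produce hashes that differ in only
    a few bits, so eudex hashes are typically compared bitwise (e.g. via XOR)
    rather than for exact equality.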
    """
4639
    _trailing_phones = {
4640
        'a': 0,  # a
4641
        'b': 0b01001000,  # b
4642
        'c': 0b00001100,  # c
4643
        'd': 0b00011000,  # d
4644
        'e': 0,  # e
4645
        'f': 0b01000100,  # f
4646
        'g': 0b00001000,  # g
4647
        'h': 0b00000100,  # h
4648
        'i': 1,  # i
4649
        'j': 0b00000101,  # j
4650
        'k': 0b00001001,  # k
4651
        'l': 0b10100000,  # l
4652
        'm': 0b00000010,  # m
4653
        'n': 0b00010010,  # n
4654
        'o': 0,  # o
4655
        'p': 0b01001001,  # p
4656
        'q': 0b10101000,  # q
4657
        'r': 0b10100001,  # r
4658
        's': 0b00010100,  # s
4659
        't': 0b00011101,  # t
4660
        'u': 1,  # u
4661
        'v': 0b01000101,  # v
4662
        'w': 0b00000000,  # w
4663
        'x': 0b10000100,  # x
4664
        'y': 1,  # y
4665
        'z': 0b10010100,  # z
4666
4667
        'ß': 0b00010101,  # ß
4668
        'à': 0,  # à
4669
        'á': 0,  # á
4670
        'â': 0,  # â
4671
        'ã': 0,  # ã
4672
        'ä': 0,  # ä[æ]
4673
        'å': 1,  # å[oː]
4674
        'æ': 0,  # æ[æ]
4675
        'ç': 0b10010101,  # ç[t͡ʃ]
4676
        'è': 1,  # è
4677
        'é': 1,  # é
4678
        'ê': 1,  # ê
4679
        'ë': 1,  # ë
4680
        'ì': 1,  # ì
4681
        'í': 1,  # í
4682
        'î': 1,  # î
4683
        'ï': 1,  # ï
4684
        'ð': 0b00010101,  # ð[ð̠](represented as a non-plosive T)
4685
        'ñ': 0b00010111,  # ñ[nj](represented as a combination of n and j)
4686
        'ò': 0,  # ò
4687
        'ó': 0,  # ó
4688
        'ô': 0,  # ô
4689
        'õ': 0,  # õ
4690
        'ö': 1,  # ö[ø]
4691
        '÷': 0b11111111,  # ÷
4692
        'ø': 1,  # ø[ø]
4693
        'ù': 1,  # ù
4694
        'ú': 1,  # ú
4695
        'û': 1,  # û
4696
        'ü': 1,  # ü
4697
        'ý': 1,  # ý
4698
        'þ': 0b00010101,  # þ[ð̠](represented as a non-plosive T)
4699
        'ÿ': 1,  # ÿ
4700
    }
4701
4702
    _initial_phones = {
4703
        'a': 0b10000100,  # a*
4704
        'b': 0b00100100,  # b
4705
        'c': 0b00000110,  # c
4706
        'd': 0b00001100,  # d
4707
        'e': 0b11011000,  # e*
4708
        'f': 0b00100010,  # f
4709
        'g': 0b00000100,  # g
4710
        'h': 0b00000010,  # h
4711
        'i': 0b11111000,  # i*
4712
        'j': 0b00000011,  # j
4713
        'k': 0b00000101,  # k
4714
        'l': 0b01010000,  # l
4715
        'm': 0b00000001,  # m
4716
        'n': 0b00001001,  # n
4717
        'o': 0b10010100,  # o*
4718
        'p': 0b00100101,  # p
4719
        'q': 0b01010100,  # q
4720
        'r': 0b01010001,  # r
4721
        's': 0b00001010,  # s
4722
        't': 0b00001110,  # t
4723
        'u': 0b11100000,  # u*
4724
        'v': 0b00100011,  # v
4725
        'w': 0b00000000,  # w
4726
        'x': 0b01000010,  # x
4727
        'y': 0b11100100,  # y*
4728
        'z': 0b01001010,  # z
4729
4730
        'ß': 0b00001011,  # ß
4731
        'à': 0b10000101,  # à
4732
        'á': 0b10000101,  # á
4733
        'â': 0b10000000,  # â
4734
        'ã': 0b10000110,  # ã
4735
        'ä': 0b10100110,  # ä [æ]
4736
        'å': 0b11000010,  # å [oː]
4737
        'æ': 0b10100111,  # æ [æ]
4738
        'ç': 0b01010100,  # ç [t͡ʃ]
4739
        'è': 0b11011001,  # è
4740
        'é': 0b11011001,  # é
4741
        'ê': 0b11011001,  # ê
4742
        'ë': 0b11000110,  # ë [ə] or [œ]
4743
        'ì': 0b11111001,  # ì
4744
        'í': 0b11111001,  # í
4745
        'î': 0b11111001,  # î
4746
        'ï': 0b11111001,  # ï
4747
        'ð': 0b00001011,  # ð [ð̠] (represented as a non-plosive T)
4748
        'ñ': 0b00001011,  # ñ [nj] (represented as a combination of n and j)
4749
        'ò': 0b10010101,  # ò
4750
        'ó': 0b10010101,  # ó
4751
        'ô': 0b10010101,  # ô
4752
        'õ': 0b10010101,  # õ
4753
        'ö': 0b11011100,  # ö [œ] or [ø]
4754
        '÷': 0b11111111,  # ÷
4755
        'ø': 0b11011101,  # ø [œ] or [ø]
4756
        'ù': 0b11100001,  # ù
4757
        'ú': 0b11100001,  # ú
4758
        'û': 0b11100001,  # û
4759
        'ü': 0b11100101,  # ü
4760
        'ý': 0b11100101,  # ý
4761
        'þ': 0b00001011,  # þ [ð̠] (represented as a non-plosive T)
4762
        'ÿ': 0b11100101,  # ÿ
4763
    }
4764
    # Lowercase input & filter unknown characters
4765
    word = ''.join(char for char in word.lower() if char in _initial_phones)
4766
4767
    # Perform initial eudex coding of each character
4768
    values = [_initial_phones[word[0]]]
4769
    values += [_trailing_phones[char] for char in word[1:]]
4770
4771
    # Right-shift by one to determine if second instance should be skipped
4772
    shifted_values = [_ >> 1 for _ in values]
4773
    condensed_values = [values[0]]
4774
    for n in range(1, len(shifted_values)):
4775
        if shifted_values[n] != shifted_values[n-1]:
4776
            condensed_values.append(values[n])
4777
4778
    # Add padding after first character & trim beyond maxlength
4779
    values = ([condensed_values[0]] +
4780
              [0]*max(0, maxlength - len(condensed_values)) +
4781
              condensed_values[1:maxlength])
4782
4783
    # Combine individual character values into eudex hash
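    # Each coded character contributes one byte; with the default maxlength
    # of 8, the resulting hash fits in 64 bits.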
4784
    hash_value = 0
4785
    for val in values:
4786
        hash_value = (hash_value << 8) | val
4787
4788
    return hash_value
4789
4790
4791
def haase_phonetik(word, primary_only=False):
4792
    """Return the Haase Phonetik (numeric output) code for a word.
4793
4794
    Based on the algorithm described at
4795
    https://github.com/elastic/elasticsearch/blob/master/plugins/analysis-phonetic/src/main/java/org/elasticsearch/index/analysis/phonetic/HaasePhonetik.java
4796
4797
    Based on the original
4798
    Haase, Martin and Kai Heitmann. 2000. Die Erweiterte Kölner Phonetik.
4799
4800
    While the output codes are numeric, they are still strs.
4801
4802
    :param str word: the word to transform
4803
    :param bool primary_only: if True, only the primary code is returned
    :returns: the Haase Phonetik codes as a tuple of numeric strings
    :rtype: tuple
    """
4806
    def _after(word, i, letters):
4807
        """Return True if word[i] follows one of the supplied letters."""
4808
        if i > 0 and word[i-1] in letters:
4809
            return True
4810
        return False
4811
4812
    def _before(word, i, letters):
4813
        """Return True if word[i] precedes one of the supplied letters."""
4814
        if i+1 < len(word) and word[i+1] in letters:
4815
            return True
4816
        return False
4817
4818
    _vowels = {'A', 'E', 'I', 'J', 'O', 'U', 'Y'}
4819
4820
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
4821
    word = word.replace('ß', 'SS')
4822
4823
    word = word.replace('Ä', 'AE')
4824
    word = word.replace('Ö', 'OE')
4825
    word = word.replace('Ü', 'UE')
4826
    word = ''.join(c for c in word if c in
4827
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
4828
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
4829
                    'Y', 'Z'})
4830
4831
    # Nothing to convert, return base case
4832
    if not word:
4833
        return ''
4834
4835
    variants = []
4836
    if primary_only:
4837
        variants = [word]
4838
    else:
4839
        pos = 0
4840
        if word[:2] == 'CH':
4841
            variants.append(('CH', 'SCH'))
4842
            pos += 2
4843
        len_3_vars = {'OWN': 'AUN', 'WSK': 'RSK', 'SCH': 'CH', 'GLI': 'LI',
4844
                      'AUX': 'O', 'EUX': 'O'}
4845
        while pos < len(word):
4846
            if word[pos:pos+4] == 'ILLE':
4847
                variants.append(('ILLE', 'I'))
4848
                pos += 4
4849
            elif word[pos:pos+3] in len_3_vars:
4850
                variants.append((word[pos:pos+3], len_3_vars[word[pos:pos+3]]))
4851
                pos += 3
4852
            elif word[pos:pos+2] == 'RB':
4853
                variants.append(('RB', 'RW'))
4854
                pos += 2
4855
            elif len(word[pos:]) == 3 and word[pos:] == 'EAU':
4856
                variants.append(('EAU', 'O'))
4857
                pos += 3
4858
            elif len(word[pos:]) == 1 and word[pos:] in {'A', 'O'}:
4859
                if word[pos:] == 'O':
4860
                    variants.append(('O', 'OW'))
4861
                else:
4862
                    variants.append(('A', 'AR'))
4863
                pos += 1
4864
            else:
4865
                variants.append((word[pos],))
4866
                pos += 1
4867
4868
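        # Each element of `variants` is a tuple of alternative substrings;
        # itertools.product() expands them into every possible spelling.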
        variants = [''.join(letters) for letters in product(*variants)]
4869
4870
    def _haase_code(word):
4871
        sdx = ''
4872
        for i in range(len(word)):
            if word[i] in _vowels:
                sdx += '9'
4875
            elif word[i] == 'B':
4876
                sdx += '1'
4877
            elif word[i] == 'P':
4878
                if _before(word, i, {'H'}):
4879
                    sdx += '3'
4880
                else:
4881
                    sdx += '1'
4882
            elif word[i] in {'D', 'T'}:
4883
                if _before(word, i, {'C', 'S', 'Z'}):
4884
                    sdx += '8'
4885
                else:
4886
                    sdx += '2'
4887
            elif word[i] in {'F', 'V', 'W'}:
4888
                sdx += '3'
4889
            elif word[i] in {'G', 'K', 'Q'}:
4890
                sdx += '4'
4891
            elif word[i] == 'C':
4892
                if _after(word, i, {'S', 'Z'}):
4893
                    sdx += '8'
4894
                elif i == 0:
4895
                    if _before(word, i, {'A', 'H', 'K', 'L', 'O', 'Q', 'R',
4896
                                         'U', 'X'}):
4897
                        sdx += '4'
4898
                    else:
4899
                        sdx += '8'
4900
                elif _before(word, i, {'A', 'H', 'K', 'O', 'Q', 'U', 'X'}):
4901
                    sdx += '4'
4902
                else:
4903
                    sdx += '8'
4904
            elif word[i] == 'X':
4905
                if _after(word, i, {'C', 'K', 'Q'}):
4906
                    sdx += '8'
4907
                else:
4908
                    sdx += '48'
4909
            elif word[i] == 'L':
4910
                sdx += '5'
4911
            elif word[i] in {'M', 'N'}:
4912
                sdx += '6'
4913
            elif word[i] == 'R':
4914
                sdx += '7'
4915
            elif word[i] in {'S', 'Z'}:
4916
                sdx += '8'
4917
4918
        sdx = _delete_consecutive_repeats(sdx)
4919
4920
        # if sdx:
4921
        #     sdx = sdx[0] + sdx[1:].replace('9', '')
4922
4923
        return sdx
4924
4925
    return tuple(_haase_code(word) for word in variants)
4926
4927
4928
def reth_schek_phonetik(word):
4929
    """Return Reth-Schek Phonetik code for a word.
4930
4931
    This algorithm is proposed in:
4932
    von Reth, Hans-Peter and Schek, Hans-Jörg. 1977. "Eine Zugriffsmethode für
4933
    die phonetische Ähnlichkeitssuche." Heidelberg Scientific Center technical
4934
    reports 77.03.002. IBM Deutschland GmbH.
4935
4936
    Since I couldn't secure a copy of that document (maybe I'll look for it
4937
    next time I'm in Germany), this implementation is based on what I could
4938
    glean from the implementations published by German Record Linkage
4939
    Center (www.record-linkage.de):
4940
    - Privacy-preserving Record Linkage (PPRL) (in R)
4941
    - Merge ToolBox (in Java)
4942
4943
    Rules that are unclear:
4944
    - Should 'C' become 'G' or 'Z'? (PPRL has both, 'Z' rule blocked)
4945
    - Should 'CC' become 'G'? (PPRL has blocked 'CK' that may be typo)
4946
    - Should 'TUI' -> 'ZUI' rule exist? (PPRL has rule, but I can't
4947
        think of a German word with '-tui-' in it.)
4948
    - Should we really change 'SCH' -> 'CH' and then 'CH' -> 'SCH'?
4949
4950
    :param str word: the word to transform
    :returns: the Reth-Schek Phonetik code
    :rtype: str
    """
4953
    replacements = {3: {'AEH': 'E', 'IEH': 'I', 'OEH': 'OE', 'UEH': 'UE',
4954
                        'SCH': 'CH', 'ZIO': 'TIO', 'TIU': 'TIO', 'ZIU': 'TIO',
4955
                        'CHS': 'X', 'CKS': 'X', 'AEU': 'OI'},
4956
                    2: {'LL': 'L', 'AA': 'A', 'AH': 'A', 'BB': 'B', 'PP': 'B',
4957
                        'BP': 'B', 'PB': 'B', 'DD': 'D', 'DT': 'D', 'TT': 'D',
4958
                        'TH': 'D', 'EE': 'E', 'EH': 'E', 'AE': 'E', 'FF': 'F',
4959
                        'PH': 'F', 'KK': 'K', 'GG': 'G', 'GK': 'G', 'KG': 'G',
4960
                        'CK': 'G', 'CC': 'C', 'IE': 'I', 'IH': 'I', 'MM': 'M',
4961
                        'NN': 'N', 'OO': 'O', 'OH': 'O', 'SZ': 'S', 'UH': 'U',
4962
                        'GS': 'X', 'KS': 'X', 'TZ': 'Z', 'AY': 'AI',
4963
                        'EI': 'AI', 'EY': 'AI', 'EU': 'OI', 'RR': 'R',
4964
                        'SS': 'S', 'KW': 'QU'},
4965
                    1: {'P': 'B', 'T': 'D', 'V': 'F', 'W': 'F', 'C': 'G',
4966
                        'K': 'G', 'Y': 'I'}}
4967
4968
    # Uppercase
4969
    word = word.upper()
4970
4971
    # Replace umlauts/eszett
4972
    word = word.replace('Ä', 'AE')
4973
    word = word.replace('Ö', 'OE')
4974
    word = word.replace('Ü', 'UE')
4975
    word = word.replace('ß', 'SS')
4976
4977
    # Main loop, using above replacements table
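    # Longer matches take precedence; pos advances by only one position even
    # after a replacement, so replacement output may itself be rewritten.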
4978
    pos = 0
4979
    while pos < len(word):
4980
        for num in range(3, 0, -1):
4981
            if word[pos:pos+num] in replacements[num]:
4982
                word = (word[:pos] + replacements[num][word[pos:pos+num]]
4983
                        + word[pos+num:])
4984
                pos += 1
4985
                break
4986
        else:
4987
            pos += 1  # Advance if nothing is recognized
4988
4989
    # Change 'CH' back(?) to 'SCH'
4990
    word = word.replace('CH', 'SCH')
4991
4992
    # Replace final sequences
4993
    if word[-2:] == 'ER':
4994
        word = word[:-2]+'R'
4995
    elif word[-2:] == 'EL':
4996
        word = word[:-2]+'L'
4997
    elif word[-1] == 'H':
4998
        word = word[:-1]
4999
5000
    return word
5001
5002
5003
def fonem(word):
5004
    """Return the FONEM code of a word.
5005
5006
    FONEM is a phonetic algorithm designed for French (particularly surnames in
5007
    Saguenay, Canada), defined in:
5008
    Bouchard, Gérard, Patrick Brard, and Yolande Lavoie. 1981. "FONEM: Un code
5009
    de transcription phonétique pour la reconstitution automatique des
5010
    familles saguenayennes." Population. 36(6). 1085--1103.
5011
    https://doi.org/10.2307/1532326
5012
    http://www.persee.fr/doc/pop_0032-4663_1981_num_36_6_17248
5013
5014
    Guillaume Plique's Javascript implementation at
5015
    https://github.com/Yomguithereal/talisman/blob/master/src/phonetics/french/fonem.js
5016
    was also consulted for this implementation.
5017
5018
    :param str word: the word to transform
5019
    :returns: the FONEM code
5020
    :rtype: str
5021
    """
5022
    # I don't see a sane way of doing this without regexps :(
5023
    rule_table = {
5024
        # Vowels & groups of vowels
5025
        'V-1':     (re.compile('E?AU'), 'O'),
5026
        'V-2,5':   (re.compile('(E?AU|O)L[TX]$'), 'O'),
5027
        'V-3,4':   (re.compile('E?AU[TX]$'), 'O'),
5028
        'V-6':     (re.compile('E?AUL?D$'), 'O'),
5029
        'V-7':     (re.compile(r'(?<!G)AY$'), 'E'),
5030
        'V-8':     (re.compile('EUX$'), 'EU'),
5031
        'V-9':     (re.compile('EY(?=$|[BCDFGHJKLMNPQRSTVWXZ])'), 'E'),
5032
        'V-10':    ('Y', 'I'),
5033
        'V-11':    (re.compile('(?<=[AEIOUY])I(?=[AEIOUY])'), 'Y'),
5034
        'V-12':    (re.compile('(?<=[AEIOUY])ILL'), 'Y'),
5035
        'V-13':    (re.compile('OU(?=[AEOU]|I(?!LL))'), 'W'),
5036
        'V-14':    (re.compile(r'([AEIOUY])(?=\1)'), ''),
5037
        # Nasal vowels
5038
        'V-15':    (re.compile('[AE]M(?=[BCDFGHJKLMPQRSTVWXZ])(?!$)'), 'EN'),
5039
        'V-16':    (re.compile('OM(?=[BCDFGHJKLMPQRSTVWXZ])'), 'ON'),
5040
        'V-17':    (re.compile('AN(?=[BCDFGHJKLMNPQRSTVWXZ])'), 'EN'),
5041
        'V-18':    (re.compile('(AI[MN]|EIN)(?=[BCDFGHJKLMNPQRSTVWXZ]|$)'),
5042
                    'IN'),
5043
        'V-19':    (re.compile('B(O|U|OU)RNE?$'), 'BURN'),
5044
        'V-20':    (re.compile('(^IM|(?<=[BCDFGHJKLMNPQRSTVWXZ])IM(?=[BCDFGHJKLMPQRSTVWXZ]))'),
5045
                    'IN'),
5046
        # Consonants and groups of consonants
5047
        'C-1':     ('BV', 'V'),
5048
        'C-2':     (re.compile('(?<=[AEIOUY])C(?=[EIY])'), 'SS'),
5049
        'C-3':     (re.compile('(?<=[BDFGHJKLMNPQRSTVWZ])C(?=[EIY])'), 'S'),
5050
        'C-4':     (re.compile('^C(?=[EIY])'), 'S'),
5051
        'C-5':     (re.compile('^C(?=[OUA])'), 'K'),
5052
        'C-6':     (re.compile('(?<=[AEIOUY])C$'), 'K'),
5053
        'C-7':     (re.compile('C(?=[BDFGJKLMNPQRSTVWXZ])'), 'K'),
5054
        'C-8':     (re.compile('CC(?=[AOU])'), 'K'),
5055
        'C-9':     (re.compile('CC(?=[EIY])'), 'X'),
5056
        'C-10':    (re.compile('G(?=[EIY])'), 'J'),
5057
        'C-11':    (re.compile('GA(?=I?[MN])'), 'G#'),
5058
        'C-12':    (re.compile('GE(O|AU)'), 'JO'),
5059
        'C-13':    (re.compile('GNI(?=[AEIOUY])'), 'GN'),
5060
        'C-14':    (re.compile('(?<![PCS])H'), ''),
5061
        'C-15':    ('JEA', 'JA'),
5062
        'C-16':    (re.compile('^MAC(?=[BCDFGHJKLMNPQRSTVWXZ])'), 'MA#'),
5063
        'C-17':    (re.compile('^MC'), 'MA#'),
5064
        'C-18':    ('PH', 'F'),
5065
        'C-19':    ('QU', 'K'),
5066
        'C-20':    (re.compile('^SC(?=[EIY])'), 'S'),
5067
        'C-21':    (re.compile('(?<=.)SC(?=[EIY])'), 'SS'),
5068
        'C-22':    (re.compile('(?<=.)SC(?=[AOU])'), 'SK'),
5069
        'C-23':    ('SH', 'CH'),
5070
        'C-24':    (re.compile('TIA$'), 'SSIA'),
5071
        'C-25':    (re.compile('(?<=[AIOUY])W'), ''),
5072
        'C-26':    (re.compile('X[CSZ]'), 'X'),
5073
        'C-27':    (re.compile('(?<=[AEIOUY])Z|(?<=[BCDFGHJKLMNPQRSTVWXZ])Z(?=[BCDFGHJKLMNPQRSTVWXZ])'), 'S'),
        'C-28':    (re.compile(r'([BDFGHJKMNPQRTVWXZ])\1'), r'\1'),
5075
        'C-28a':   (re.compile('CC(?=[BCDFGHJKLMNPQRSTVWXZ]|$)'), 'C'),
5076
        'C-28b':   (re.compile('((?<=[BCDFGHJKLMNPQRSTVWXZ])|^)SS'), 'S'),
5077
        'C-28bb':  (re.compile('SS(?=[BCDFGHJKLMNPQRSTVWXZ]|$)'), 'S'),
5078
        'C-28c':   (re.compile('((?<=[^I])|^)LL'), 'L'),
5079
        'C-28d':   (re.compile('ILE$'), 'ILLE'),
5080
        'C-29':    (re.compile('(ILS|[CS]H|[MN]P|R[CFKLNSX])$|([BCDFGHJKLMNPQRSTVWXZ])[BCDFGHJKLMNPQRSTVWXZ]$'), r'\1\2'),
        'C-30,32': (re.compile('^(SA?INT?|SEI[NM]|CINQ?|ST)(?!E)-?'), 'ST-'),
5082
        'C-31,33': (re.compile('^(SAINTE|STE)-?'), 'STE-'),
5083
        # Rules to undo rule bleeding prevention in C-11, C-16, C-17
5084
        'C-34':    ('G#', 'GA'),
5085
        'C-35':    ('MA#', 'MAC')
5086
    }
5087
    rule_order = [
5088
        'V-14', 'C-28', 'C-28a', 'C-28b', 'C-28bb', 'C-28c', 'C-28d',
5089
        'C-12',
5090
        'C-8', 'C-9', 'C-10',
5091
        'C-16', 'C-17', 'C-2', 'C-3', 'C-7',
5092
        'V-2,5', 'V-3,4', 'V-6',
5093
        'V-1', 'C-14',
5094
        'C-31,33', 'C-30,32',
5095
        'C-11', 'V-15', 'V-17', 'V-18',
5096
        'V-7', 'V-8', 'V-9', 'V-10', 'V-11', 'V-12', 'V-13', 'V-16',
5097
        'V-19', 'V-20',
5098
        'C-1', 'C-4', 'C-5', 'C-6', 'C-13', 'C-15',
5099
        'C-18', 'C-19', 'C-20', 'C-21', 'C-22', 'C-23', 'C-24',
5100
        'C-25', 'C-26', 'C-27',
5101
        'C-29',
5102
        'V-14', 'C-28', 'C-28a', 'C-28b', 'C-28bb', 'C-28c', 'C-28d',
5103
        'C-34', 'C-35'
5104
    ]
5105
5106
    # normalize, upper-case, and filter non-French letters
5107
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
5108
    word = word.translate({198: 'AE', 338: 'OE'})
5109
    word = ''.join(c for c in word if c in
5110
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
5111
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
5112
                    'Y', 'Z', '-'})
5113
5114
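    # Apply the rules in the order given by rule_order; each entry is either
    # a plain string replacement or a precompiled regex substitution.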
    for rule in rule_order:
5115
        regex, repl = rule_table[rule]
5116
        if isinstance(regex, text_type):
5117
            word = word.replace(regex, repl)
5118
        else:
5119
            word = regex.sub(repl, word)
5120
        # print(rule, word)
5121
5122
    return word
5123
5124
5125
def parmar_kumbharana(word):
5126
    """Return the Parmar-Kumbharana encoding of a word.
5127
5128
    This is based on the phonetic algorithm proposed in
5129
    Parmar, Vimal P. and CK Kumbharana. 2014. "Study Existing Various Phonetic
5130
    Algorithms and Designing and Development of a working model for the New
5131
    Developed Algorithm and Comparison by implementing ti with Existing
5132
    Algorithm(s)." International Journal of Computer Applications. 98(19).
5133
    https://doi.org/10.5120/17295-7795
5134
5135
    :param str word: the word to transform
    :returns: the Parmar-Kumbharana encoding
    :rtype: str
    """
5138
    rule_table = {4: {'OUGH': 'F'},
5139
                  3: {'DGE': 'J',
5140
                      'OUL': 'U',
5141
                      'GHT': 'T'},
5142
                  2: {'CE': 'S', 'CI': 'S', 'CY': 'S',
5143
                      'GE': 'J', 'GI': 'J', 'GY': 'J',
5144
                      'WR': 'R',
5145
                      'GN': 'N', 'KN': 'N', 'PN': 'N',
5146
                      'CK': 'K',
5147
                      'SH': 'S'}}
5148
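    # Ordinals of A, E, I, O, U, and Y, removed by Rule 6 (except
    # word-initially)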
    vowel_trans = {65: '', 69: '', 73: '', 79: '', 85: '', 89: ''}
5149
5150
    word = word.upper()  # Rule 3
5151
    word = _delete_consecutive_repeats(word)  # Rule 4
5152
5153
    # Rule 5
5154
    i = 0
5155
    while i < len(word):
5156
        for match_len in range(4, 1, -1):
5157
            if word[i:i+match_len] in rule_table[match_len]:
5158
                repl = rule_table[match_len][word[i:i+match_len]]
5159
                word = (word[:i] + repl + word[i+match_len:])
5160
                word = (word[:i] + repl + word[i+match_len:])
                i += len(repl)
                break
        else:
5162
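            # no rule matched at this position; advance one character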
            i += 1
5163
5164
    word = word[0]+word[1:].translate(vowel_trans)  # Rule 6
5165
    return word
5166
5167
5168
def davidson(lname, fname='.', omit_fname=False):
5169
    """Return Davidson's Consonant Code.
5170
5171
    This is based on the name compression system described in:
5172
    Davidson, Leon. 1962. "Retrieval of Misspelled Names in an Airline
5173
    Passenger Record System." Communications of the ACM. 5(3). 169--171.
5174
    https://dl.acm.org/citation.cfm?id=366913
5175
5176
    :param str lname: Last name (or word) to be encoded
5177
    :param str fname: First name (optional), of which the first character is
5178
        included in the code.
5179
    :param bool omit_fname: set to True to completely omit the first character
        of the first name
    :returns: Davidson's Consonant Code
    :rtype: str
    """
5183
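    # Ordinals of the letters A, E, I, O, U, H, W, and Y, which are stripped
    # from all but the first position of the last name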
    trans = {65: '', 69: '', 73: '', 79: '', 85: '', 72: '', 87: '', 89: ''}
5184
5185
    lname = lname.upper()
5186
    code = _delete_consecutive_repeats(lname[:1] + lname[1:].translate(trans))
5187
    code = code[:4] + (4-len(code))*' '
5188
5189
    if not omit_fname:
5190
        code += fname[:1].upper()
5191
5192
    return code
5193
5194
5195
def sound_d(word, maxlength=4):
5196
    """Return the SoundD code.
5197
5198
    SoundD is defined in
5199
    Varol, Cihan and Coskun Bayrak. 2012. "Hybrid Matching Algorithm for
5200
    Personal Names." Journal of Data and Information Quality, 3(4).
5201
    doi:10.1145/2348828.2348830
5202
5203
    :param str word: the word to transform
5204
    :param int maxlength: the length of the code returned (defaults to 4)
5205
    :returns: the SoundD code
    :rtype: str
    """
5207
    _ref_soundd_translation = dict(zip((ord(_) for _ in
                                        'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
5209
                                       '01230120022455012623010202'))
5210
5211
    word = unicodedata.normalize('NFKD', text_type(word.upper()))
5212
    word = word.replace('ß', 'SS')
5213
    word = ''.join(c for c in word if c in
5214
                   {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
5215
                    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
5216
                    'Y', 'Z'})
5217
5218
    if word[:2] in {'KN', 'GN', 'PN', 'AC', 'WR'}:
5219
        word = word[1:]
5220
    elif word[:1] == 'X':
5221
        word = 'S'+word[1:]
5222
    elif word[:2] == 'WH':
5223
        word = 'W'+word[2:]
5224
5225
    word = word.replace('DGE', '20').replace('DGI', '20').replace('GH', '0')
5226
5227
    word = word.translate(_ref_soundd_translation)
5228
    word = _delete_consecutive_repeats(word)
5229
    word = word.replace('0', '')
5230
5231
    if maxlength is not None:
5232
        if len(word) < maxlength:
5233
            word += '0' * (maxlength-len(word))
5234
        else:
5235
            word = word[:maxlength]
5236
5237
    return word
5238
5239
5240
def pshp_soundex_last(lname, maxlength=4, german=False):
5241
    """Calculate the PSHP Soundex/Viewex Coding of a last name.
5242
5243
    This coding is based on Hershberg, Theodore, Alan Burstein, and Robert
5244
    Dockhorn. 1976. "Record Linkage." Historical Methods Newsletter.
5245
    9(2-3). 137--163. doi:10.1080/00182494.1976.10112639
5246
5247
    Reference was also made to the German version of the same:
5248
    Hershberg, Theodore, Alan Burstein, and Robert Dockhorn. 1976. "Verkettung
5249
    von Daten: Record Linkage am Beispiel des Philadelphia Social History
5250
    Project." Moderne Stadtgeschichte. Stuttgart: Klett-Cotta, 1979.
5251
    http://nbn-resolving.de/urn:nbn:de:0168-ssoar-327824
5252
5253
    A separate function, pshp_soundex_first() is used for first names.
5254
5255
    :param str lname: the last name to encode
    :param int maxlength: the maximum length (default 4) of the code to return
    :param bool german: set to True if the name is German (different rules
        apply)
    :returns: the PSHP Soundex/Viewex last-name code
    :rtype: str
    """
5259
    lname = unicodedata.normalize('NFKD', text_type(lname.upper()))
5260
    lname = lname.replace('ß', 'SS')
5261
    lname = ''.join(c for c in lname if c in
5262
                    {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K',
5263
                     'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V',
5264
                     'W', 'X', 'Y', 'Z'})
5265
5266
    # A. Prefix treatment
5267
    if lname[:3] == 'VON' or lname[:3] == 'VAN':
5268
        lname = lname[3:].strip()
5269
5270
    # The rule implemented below says "MC, MAC become 1". I believe it meant to
5271
    # say they become M except in German data (where superscripted 1 indicates
5272
    # "except in German data"). It doesn't make sense for them to become 1
5273
    # (BPFV -> 1) or to apply outside German. Unfortunately, both articles have
5274
    # this error(?).
5275
    if not german:
5276
        if lname[:3] == 'MAC':
5277
            lname = 'M'+lname[3:]
5278
        elif lname[:2] == 'MC':
5279
            lname = 'M'+lname[2:]
5280
5281
    # The non-German-only rule to strip ' is unnecessary due to filtering
5282
5283
    if lname[:1] in {'E', 'I', 'O', 'U'}:
5284
        lname = 'A' + lname[1:]
5285
    elif lname[:2] in {'GE', 'GI', 'GY'}:
5286
        lname = 'J' + lname[1:]
5287
    elif lname[:2] in {'CE', 'CI', 'CY'}:
5288
        lname = 'S' + lname[1:]
5289
    elif lname[:3] == 'CHR':
5290
        lname = 'K' + lname[1:]
5291
    elif lname[:1] == 'C' and lname[:2] != 'CH':
5292
        lname = 'K' + lname[1:]
5293
5294
    if lname[:2] == 'KN':
5295
        lname = 'N' + lname[1:]
5296
    elif lname[:2] == 'PH':
5297
        lname = 'F' + lname[1:]
5298
    elif lname[:3] in {'WIE', 'WEI'}:
5299
        lname = 'V' + lname[1:]
5300
5301
    if german and lname[:1] in {'W', 'M', 'Y', 'Z'}:
5302
        lname = {'W': 'V', 'M': 'N', 'Y': 'J', 'Z': 'S'}[lname[0]]+lname[1:]
5303
5304
    code = lname[:1]
5305
5306
    # B. Postfix treatment
5307
    if lname[-1:] == 'R':
5308
        lname = lname[:-1] + 'N'
5309
    elif lname[-2:] in {'SE', 'CE'}:
5310
        lname = lname[:-2]
5311
    if lname[-2:] == 'SS':
5312
        lname = lname[:-2]
5313
    elif lname[-1:] == 'S':
5314
        lname = lname[:-1]
5315
5316
    if not german:
5317
        l5_repl = {'STOWN': 'SAWON', 'MPSON': 'MASON'}
5318
        l4_repl = {'NSEN': 'ASEN', 'MSON': 'ASON', 'STEN': 'SAEN',
5319
                   'STON': 'SAON'}
5320
        if lname[-5:] in l5_repl:
5321
            lname = lname[:-5] + l5_repl[lname[-5:]]
5322
        elif lname[-4:] in l4_repl:
5323
            lname = lname[:-4] + l4_repl[lname[-4:]]
5324
5325
    if lname[-2:] in {'NG', 'ND'}:
5326
        lname = lname[:-1]
5327
    if not german and lname[-3:] in {'GAN', 'GEN'}:
5328
        lname = lname[:-3]+'A'+lname[-2:]
5329
5330
    if german:
5331
        if lname[-3:] == 'TES':
5332
            lname = lname[:-3]
5333
        elif lname[-2:] == 'TS':
5334
            lname = lname[:-2]
5335
        if lname[-3:] == 'TZE':
5336
            lname = lname[:-3]
5337
        elif lname[-2:] == 'ZE':
5338
            lname = lname[:-2]
5339
        if lname[-1:] == 'Z':
5340
            lname = lname[:-1]
5341
        elif lname[-2:] == 'TE':
5342
            lname = lname[:-2]
5343
5344
    # C. Infix Treatment
5345
    lname = lname.replace('CK', 'C')
5346
    lname = lname.replace('SCH', 'S')
5347
    lname = lname.replace('DT', 'T')
5348
    lname = lname.replace('ND', 'N')
5349
    lname = lname.replace('NG', 'N')
5350
    lname = lname.replace('LM', 'M')
5351
    lname = lname.replace('MN', 'M')
5352
    lname = lname.replace('WIE', 'VIE')
5353
    lname = lname.replace('WEI', 'VEI')
5354
5355
    # D. Soundexing
5356
    # code for X & Y are unspecified, but presumably are 2 & 0
5357
    _pshp_translation = dict(zip((ord(_) for _ in
                                  'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
5359
                                 '01230120022455012523010202'))
5360
5361
    lname = lname.translate(_pshp_translation)
5362
    lname = _delete_consecutive_repeats(lname)
5363
5364
    code += lname[1:]
5365
    code = code.replace('0', '')  # rule 1
5366
5367
    if maxlength is not None:
5368
        if len(code) < maxlength:
5369
            code += '0' * (maxlength-len(code))
5370
        else:
5371
            code = code[:maxlength]
5372
5373
    return code
5374
5375
5376
def pshp_soundex_first(fname, maxlength=4, german=False):
5377
    """Calculate the PSHP Soundex/Viewex Coding of a first name.
5378
5379
    This coding is based on Hershberg, Theodore, Alan Burstein, and Robert
5380
    Dockhorn. 1976. "Record Linkage." Historical Methods Newsletter.
5381
    9(2-3). 137--163. doi:10.1080/00182494.1976.10112639
5382
5383
    Reference was also made to the German version of the same:
5384
    Hershberg, Theodore, Alan Burstein, and Robert Dockhorn. 1976. "Verkettung
5385
    von Daten: Record Linkage am Beispiel des Philadelphia Social History
5386
    Project." Moderne Stadtgeschichte. Stuttgart: Klett-Cotta, 1979.
5387
    http://nbn-resolving.de/urn:nbn:de:0168-ssoar-327824
5388
5389
    A separate function, pshp_soundex_last() is used for last names.
5390
5391
    :param str fname: the first name to encode
    :param int maxlength: the maximum length (default 4) of the code to return
    :param bool german: set to True if the name is German (different rules
        apply)
    :returns: the PSHP Soundex/Viewex first-name code
    :rtype: str
    """
5395
    fname = unicodedata.normalize('NFKD', text_type(fname.upper()))
5396
    fname = fname.replace('ß', 'SS')
5397
    fname = ''.join(c for c in fname if c in
5398
                    {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K',
5399
                     'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V',
5400
                     'W', 'X', 'Y', 'Z'})
5401
5402
    # special rules
5403
    if fname == 'JAMES':
5404
        code = 'J7'
5405
    elif fname == 'PAT':
5406
        code = 'P7'
5407
5408
    else:
5409
        # A. Prefix treatment
5410
        if fname[:2] in {'GE', 'GI', 'GY'}:
5411
            fname = 'J' + fname[1:]
5412
        elif fname[:2] in {'CE', 'CI', 'CY'}:
5413
            fname = 'S' + fname[1:]
5414
        elif fname[:3] == 'CHR':
5415
            fname = 'K' + fname[1:]
5416
        elif fname[:1] == 'C' and fname[:2] != 'CH':
5417
            fname = 'K' + fname[1:]
5418
5419
        if fname[:2] == 'KN':
5420
            fname = 'N' + fname[1:]
5421
        elif fname[:2] == 'PH':
5422
            fname = 'F' + fname[1:]
5423
        elif fname[:3] in {'WIE', 'WEI'}:
5424
            fname = 'V' + fname[1:]
5425
5426
        if german and fname[:1] in {'W', 'M', 'Y', 'Z'}:
5427
            fname = ({'W': 'V', 'M': 'N', 'Y': 'J', 'Z': 'S'}[fname[0]] +
5428
                     fname[1:])
5429
5430
        code = fname[:1]
5431
5432
        # B. Soundex coding
5433
        # code for Y unspecified, but presumably is 0
5434
        _pshp_translation = dict(zip((ord(_) for _ in
                                      'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
5436
                                     '01230120022455012523010202'))
5437
5438
        fname = fname.translate(_pshp_translation)
5439
        fname = _delete_consecutive_repeats(fname)
5440
        code += fname[1:]
5442
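        # If a second '0' (vowel marker) occurs, truncate the code one
        # character past the first '0'.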
        syl_ptr = code.find('0')
5443
        syl2_ptr = code[syl_ptr + 1:].find('0')
5444
        if syl_ptr != -1 and syl2_ptr != -1 and syl2_ptr - syl_ptr > -1:
5445
            code = code[:syl_ptr + 2]
5446
5447
        code = code.replace('0', '')  # rule 1
5448
5449
    if maxlength is not None:
5450
        if len(code) < maxlength:
5451
            code += '0' * (maxlength-len(code))
5452
        else:
5453
            code = code[:maxlength]
5454
5455
    return code
5456
5457
5458
def bmpm(word, language_arg=0, name_mode='gen', match_mode='approx',
5459
         concat=False, filter_langs=False):
5460
    """Return the Beider-Morse Phonetic Matching algorithm code for a word.
5461
5462
    The Beider-Morse Phonetic Matching algorithm is described at:
5463
    http://stevemorse.org/phonetics/bmpm.htm
5464
    The reference implementation is licensed under GPLv3 and available at:
5465
    http://stevemorse.org/phoneticinfo.htm
5466
5467
    :param str word: the word to transform
5468
    :param str language_arg: the language of the term; supported values
5469
        include:
5470
5471
            - 'any'
5472
            - 'arabic'
5473
            - 'cyrillic'
5474
            - 'czech'
5475
            - 'dutch'
5476
            - 'english'
5477
            - 'french'
5478
            - 'german'
5479
            - 'greek'
5480
            - 'greeklatin'
5481
            - 'hebrew'
5482
            - 'hungarian'
5483
            - 'italian'
5484
            - 'polish'
5485
            - 'portuguese'
5486
            - 'romanian'
5487
            - 'russian'
5488
            - 'spanish'
5489
            - 'turkish'
5490
            - 'germandjsg'
5491
            - 'polishdjskp'
5492
            - 'russiandjsre'
5493
5494
    :param str name_mode: the name mode of the algorithm:
5495
5496
            - 'gen' -- general (default)
5497
            - 'ash' -- Ashkenazi
5498
            - 'sep' -- Sephardic
5499
5500
    :param str match_mode: matching mode: 'approx' or 'exact'
5501
    :param bool concat: concatenation mode
5502
    :param bool filter_langs: filter out incompatible languages
5503
    :returns: the BMPM value(s)
5504
    :rtype: tuple
5505
5506
    >>> bmpm('Christopher')
5507
    'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
5508
    xristYfir xristopi xritopir xritopi xristofi xritofir xritofi tzristopir
5509
    tzristofir zristopir zristopi zritopir zritopi zristofir zristofi zritofir
5510
    zritofi'
5511
    >>> bmpm('Niall')
5512
    'nial niol'
5513
    >>> bmpm('Smith')
5514
    'zmit'
5515
    >>> bmpm('Schmidt')
5516
    'zmit stzmit'
5517
5518
    >>> bmpm('Christopher', language_arg='German')
5519
    'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
5520
    xristYfir'
5521
    >>> bmpm('Christopher', language_arg='English')
5522
    'tzristofir tzrQstofir tzristafir tzrQstafir xristofir xrQstofir xristafir
5523
    xrQstafir'
5524
    >>> bmpm('Christopher', language_arg='German', name_mode='ash')
5525
    'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
5526
    xristYfir'
5527
5528
    >>> bmpm('Christopher', language_arg='German', match_mode='exact')
5529
    'xriStopher xriStofer xristopher xristofer'
5530
    """
5531
    return _bmpm(word, language_arg, name_mode, match_mode,
5532
                 concat, filter_langs)
5533
5534
5535
if __name__ == '__main__':
5536
    import doctest
5537
    doctest.testmod()
5538