Completed
Pull Request — master (#141)
by Chris
13:03
created

abydos.phonetic._bmpm.BeiderMorse._phonetic()   F

Complexity

Conditions 23

Size

Total Lines 169
Code Lines 94

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 61
CRAP Score 23

Importance

Changes 0
Metric Value
eloc 94
dl 0
loc 169
ccs 61
cts 61
cp 1
rs 0
c 0
b 0
f 0
cc 23
nop 8
crap 23

How to fix: Long Method, Complexity, Many Parameters

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
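As a sketch of this idea applied to the code under review: the commented steps in `_phonetic()` that check a rule's pattern and its left and right contexts could move into one helper named after those comments. The name `rule_matches` and the sample rules below are assumptions made for illustration; they are not part of Abydos.

```python
from re import search

# Positions within a rule tuple, mirroring the constants in the reviewed file.
_PATTERN_POS = 0
_LCONTEXT_POS = 1
_RCONTEXT_POS = 2


def rule_matches(rule, term, i):
    """Return True if the rule's pattern and contexts match term at index i.

    Hypothetical helper: it gathers the three checks that the reviewed
    method performs inline.
    """
    pattern = rule[_PATTERN_POS]
    lcontext = rule[_LCONTEXT_POS]
    rcontext = rule[_RCONTEXT_POS]
    end = i + len(pattern)

    # the pattern itself must appear at position i
    if term[i:end] != pattern:
        return False
    # the right context, if any, must match directly after the pattern
    if rcontext and not search('^' + rcontext, term[end:]):
        return False
    # the left context, if any, must match directly before the pattern
    if lcontext and not search(lcontext + '$', term[:i]):
        return False
    return True


# Made-up rules in (pattern, left context, right context, phonetic) form.
print(rule_matches(('sch', '', '', 'S'), 'schmidt', 0))
print(rule_matches(('c', '^', '', 'k'), 'schmidt', 1))
```

With a helper like this, the inner loop of the long method would shrink to a single `if not rule_matches(...): continue` test, and the comment text becomes the function's docstring.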

Commonly applied refactorings include:

Complexity

Complex code units like abydos.phonetic._bmpm.BeiderMorse._phonetic() often do a lot of different things. To break the containing class down, we need to identify a cohesive component within it. A common approach to finding such a component is to look for fields or methods that share the same prefix or suffix.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
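A minimal sketch of Extract Class, assuming that the helpers working on `|`-separated alternates form one such cohesive component. The class name `AlternateList` is hypothetical and does not exist in Abydos; the deduplication is meant to mirror what `_remove_dupes()` does in the reviewed file.

```python
class AlternateList(object):
    """Hypothetical holder for the '|'-separated alternates of an encoding."""

    def __init__(self, phonetic):
        # empty alternates are dropped, as the reviewed helper also skips them
        self.alternates = [alt for alt in phonetic.split('|') if alt]

    def deduplicated(self):
        """Return a new AlternateList with repeats removed, order preserved."""
        seen = []
        for alt in self.alternates:
            if alt not in seen:
                seen.append(alt)
        return AlternateList('|'.join(seen))

    def __str__(self):
        return '|'.join(self.alternates)


print(AlternateList('a|b|a||c').deduplicated())
```

Grouping the string helpers this way would leave the encoder class with the rule-application logic only.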

Many Parameters

Methods with many parameters are not only hard to understand, but their parameters also often become inconsistent when you need more, or different data.

There are several approaches to avoid long parameter lists:
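One commonly used approach is to bundle arguments that always travel together into a single parameter object. The sketch below assumes that `rules`, `final_rules1` and `final_rules2` are such a group; `RuleSet` and `describe_call` are hypothetical names used only for illustration, not Abydos APIs.

```python
from collections import namedtuple

# Hypothetical parameter object; this name does not exist in Abydos.
RuleSet = namedtuple('RuleSet', ['rules', 'final_rules1', 'final_rules2'])


def describe_call(term, name_mode, rule_set, language_arg=0, concat=False):
    """Stand-in for a slimmer signature: one rule_set replaces three tuples."""
    template = '{0} ({1}): {2} initial, {3} common final, {4} specific final'
    return template.format(
        term,
        name_mode,
        len(rule_set.rules),
        len(rule_set.final_rules1),
        len(rule_set.final_rules2),
    )


# Made-up rule data, only to exercise the signature.
rule_set = RuleSet(rules=(('a', '', '', 'a'),), final_rules1=(), final_rules2=())
print(describe_call('avila', 'sep', rule_set))
```

The three rule tuples are always passed along unchanged between `_phonetic()` and `_redo_language()`, which is why they are a plausible candidate for grouping.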

# -*- coding: utf-8 -*-

# Copyright 2014-2018 by Christopher C. Little.
# This file is part of Abydos.
#
# This file is based on Alexander Beider and Stephen P. Morse's implementation
# of the Beider-Morse Phonetic Matching (BMPM) System, available at
# http://stevemorse.org/phonetics/bmpm.htm.
#
# Abydos is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Abydos is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.

"""abydos.phonetic._bmpm.

The phonetic._bmpm module implements the Beider-Morse Phonetic Matching (BMPM)
algorithm.
"""

from __future__ import unicode_literals

from re import search
from unicodedata import normalize

from six import PY3, text_type
from six.moves import range

from ._bmdata import (
    BMDATA,
    L_ANY,
    L_ARABIC,
    L_CYRILLIC,
    L_CZECH,
    L_DUTCH,
    L_ENGLISH,
    L_FRENCH,
    L_GERMAN,
    L_GREEK,
    L_GREEKLATIN,
    L_HEBREW,
    L_HUNGARIAN,
    L_ITALIAN,
    L_LATVIAN,
    L_NONE,
    L_POLISH,
    L_PORTUGUESE,
    L_ROMANIAN,
    L_RUSSIAN,
    L_SPANISH,
    L_TURKISH,
)
from ._phonetic import Phonetic

__all__ = ['BeiderMorse', 'bmpm']

if PY3:
    long = int

Coding Style (Naming): The name long does not conform to the class naming conventions ([A-Z_][a-zA-Z0-9]+$).

This check looks for invalid names for a range of different identifiers.

You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements.

If your project includes a Pylint configuration file, the settings contained in that file take precedence.

To find out more, refer to the Pylint documentation.


_LANG_DICT = {
    'any': L_ANY,
    'arabic': L_ARABIC,
    'cyrillic': L_CYRILLIC,
    'czech': L_CZECH,
    'dutch': L_DUTCH,
    'english': L_ENGLISH,
    'french': L_FRENCH,
    'german': L_GERMAN,
    'greek': L_GREEK,
    'greeklatin': L_GREEKLATIN,
    'hebrew': L_HEBREW,
    'hungarian': L_HUNGARIAN,
    'italian': L_ITALIAN,
    'latvian': L_LATVIAN,
    'polish': L_POLISH,
    'portuguese': L_PORTUGUESE,
    'romanian': L_ROMANIAN,
    'russian': L_RUSSIAN,
    'spanish': L_SPANISH,
    'turkish': L_TURKISH,
}

BMDATA['gen']['discards'] = {
    'da ',
    'dal ',
    'de ',
    'del ',
    'dela ',
    'de la ',
    'della ',
    'des ',
    'di ',
    'do ',
    'dos ',
    'du ',
    'van ',
    'von ',
    'd\'',
}
BMDATA['sep']['discards'] = {
    'al',
    'el',
    'da',
    'dal',
    'de',
    'del',
    'dela',
    'de la',
    'della',
    'des',
    'di',
    'do',
    'dos',
    'du',
    'van',
    'von',
}
BMDATA['ash']['discards'] = {'bar', 'ben', 'da', 'de', 'van', 'von'}

# format of rules array
_PATTERN_POS = 0
_LCONTEXT_POS = 1
_RCONTEXT_POS = 2
_PHONETIC_POS = 3


class BeiderMorse(Phonetic):

Unused Code: The variable __class__ seems to be unused.

    """Beider-Morse Phonetic Matching.

    The Beider-Morse Phonetic Matching algorithm is described in
    :cite:`Beider:2008`.
    The reference implementation is licensed under GPLv3.
    """

    def _language(self, name, name_mode):

Coding Style: This method could be written as a function/class method.

If a method does not access any attributes of the class, it could also be implemented as a function or static method. This can help improve readability. For example

class Foo:
    def some_method(self, x, y):
        return x + y

could be written as

class Foo:
    @classmethod
    def some_method(cls, x, y):
        return x + y

        """Return the best guess language ID for the word and language choices.

        Args:
            name (str): The term to guess the language of
            name_mode (str): the name mode of the algorithm: 'gen' (default),
                    'ash' (Ashkenazi), or 'sep' (Sephardic)

        Returns:
            int: Language ID

        """
        name = name.strip().lower()
        rules = BMDATA[name_mode]['language_rules']
        all_langs = (
            sum(_LANG_DICT[_] for _ in BMDATA[name_mode]['languages']) - 1
        )
        choices_remaining = all_langs
        for rule in rules:
            letters, languages, accept = rule
            if search(letters, name) is not None:
                if accept:
                    choices_remaining &= languages
                else:
                    choices_remaining &= (~languages) % (all_langs + 1)
        if choices_remaining == L_NONE:
            choices_remaining = L_ANY
        return choices_remaining
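
The narrowing logic in `_language()` may be easier to follow on a toy example. The sketch below assumes made-up flag values and rules (the real ones live in `_bmdata`) and uses `all_langs & ~languages` in place of the original modulo expression, a simplification that should agree with it whenever `languages` only contains bits that are also set in `all_langs`.

```python
from re import search

# Hypothetical bit flags for illustration; the real values come from _bmdata.
L_NONE = 0
L_ANY = 1
L_ENGLISH = 2
L_GERMAN = 4
L_POLISH = 8

ALL_LANGS = L_ENGLISH | L_GERMAN | L_POLISH

# Made-up (regex, language mask, accept) rules.
RULES = [
    ('sch', L_GERMAN, True),   # "sch" present: keep only German
    ('w', L_ENGLISH, False),   # "w" present: rule out English
]


def guess_language(name, rules, all_langs):
    """Narrow a bitmask of candidate languages rule by rule."""
    name = name.strip().lower()
    remaining = all_langs
    for letters, languages, accept in rules:
        if search(letters, name) is not None:
            if accept:
                remaining &= languages
            else:
                remaining &= all_langs & ~languages
    # no candidate left: fall back to "any language"
    return remaining if remaining != L_NONE else L_ANY


print(guess_language('Schmidt', RULES, ALL_LANGS))   # German only
print(guess_language('Kowalski', RULES, ALL_LANGS))  # German or Polish
```

Each set bit in the result stands for one language that is still possible, so a result with several bits set means the guess is ambiguous.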

    def _redo_language(

best-practice: Too many arguments (7/5)

        self, term, name_mode, rules, final_rules1, final_rules2, concat

Coding Style: Wrong hanging indentation before block (add 4 spaces).

    ):
        """Reassess the language of the terms and call the phonetic encoder.

        Uses a split multi-word term.

        Args:
            term (str): The term to encode via Beider-Morse
            name_mode (str): The name mode of the algorithm: ``gen`` (default),
                ``ash`` (Ashkenazi), or ``sep`` (Sephardic)
            rules (tuple): The set of initial phonetic transform regexps
            final_rules1 (tuple): The common set of final phonetic transform
                regexps
            final_rules2 (tuple): The specific set of final phonetic transform
                regexps
            concat (bool): A flag to indicate concatenation

        Returns:
            str: A BMPM code

        """
        language_arg = self._language(term, name_mode)
        return self._phonetic(
            term,
            name_mode,
            rules,
            final_rules1,
            final_rules2,
            language_arg,
            concat,
        )

    def _phonetic(

best-practice: Too many arguments (8/5)
Comprehensibility: This function exceeds the maximum number of variables (29/15).

        self,
        term,
        name_mode,
        rules,
        final_rules1,
        final_rules2,
        language_arg=0,
        concat=False,

Coding Style: Wrong hanging indentation before block (add 4 spaces). Reported on each of the eight parameter lines above.

    ):
        """Return the Beider-Morse encoding(s) of a term.

        Args:
            term (str): The term to encode via Beider-Morse
            name_mode (str): The name mode of the algorithm: ``gen`` (default),
                ``ash`` (Ashkenazi), or ``sep`` (Sephardic)
            rules (tuple): The set of initial phonetic transform regexps
            final_rules1 (tuple): The common set of final phonetic transform
                regexps
            final_rules2 (tuple): The specific set of final phonetic transform
                regexps
            language_arg (int): The language of the term
            concat (bool): A flag to indicate concatenation

        Returns:
            str: A BMPM code

        """
        term = term.replace('-', ' ').strip()

        if name_mode == 'gen':  # generic case
            # discard and concatenate certain words if at the start of the name
            for pfx in BMDATA['gen']['discards']:
                if term.startswith(pfx):
                    remainder = term[len(pfx) :]
                    combined = pfx[:-1] + remainder
                    result = (
                        self._redo_language(
                            remainder,
                            name_mode,
                            rules,
                            final_rules1,
                            final_rules2,
                            concat,
                        )
                        + '-'
                        + self._redo_language(
                            combined,
                            name_mode,
                            rules,
                            final_rules1,
                            final_rules2,
                            concat,
                        )
                    )
                    return result

        words = term.split()  # the individual words in the name
        words2 = []

        if name_mode == 'sep':  # Sephardic case
            # for each word in the name, delete portions of word preceding
            # apostrophe
            # ex: d'avila d'aguilar --> avila aguilar
            # also discard certain words in the name

            # note that we can never get a match on "de la" because we are
            # checking single words below
            # this is a bug, but I won't try to fix it now

            for word in words:
                word = word[word.rfind('\'') + 1 :]
                if word not in BMDATA['sep']['discards']:
                    words2.append(word)

        elif name_mode == 'ash':  # Ashkenazic case
            # discard certain words if at the start of the name
            if len(words) > 1 and words[0] in BMDATA['ash']['discards']:
                words2 = words[1:]
            else:
                words2 = list(words)
        else:
            words2 = list(words)

        if concat:
            # concatenate the separate words of a multi-word name
            # (normally used for exact matches)
            term = ' '.join(words2)
        elif len(words2) == 1:  # not a multi-word name
            term = words2[0]
        else:
            # encode each word in a multi-word name separately
            # (normally used for approx matches)
            result = '-'.join(
                [
                    self._redo_language(
                        w, name_mode, rules, final_rules1, final_rules2, concat
                    )
                    for w in words2
                ]
            )
            return result

        term_length = len(term)

        # apply language rules to map to phonetic alphabet
        phonetic = ''
        skip = 0
        for i in range(term_length):
            if skip:
                skip -= 1
                continue
            found = False
            for rule in rules:
                pattern = rule[_PATTERN_POS]
                pattern_length = len(pattern)
                lcontext = rule[_LCONTEXT_POS]
                rcontext = rule[_RCONTEXT_POS]

                # check to see if next sequence in input matches the string in
                # the rule
                if (pattern_length > term_length - i) or (
                    term[i : i + pattern_length] != pattern

Coding Style: Wrong hanging indentation before block (add 4 spaces).

                ):  # no match
                    continue

                right = '^' + rcontext
                left = lcontext + '$'

                # check that right context is satisfied
                if rcontext != '':
                    if not search(right, term[i + pattern_length :]):
                        continue

                # check that left context is satisfied
                if lcontext != '':
                    if not search(left, term[:i]):
                        continue

                # check for incompatible attributes
                candidate = self._apply_rule_if_compat(
                    phonetic, rule[_PHONETIC_POS], language_arg
                )
                # The below condition shouldn't ever be false
                if candidate is not None:  # pragma: no branch
                    phonetic = candidate
                    found = True
                    break

            if (
                not found

Coding Style: Wrong hanging indentation before block (add 4 spaces).

            ):  # character in name that is not in table -- e.g., space
                pattern_length = 1
            skip = pattern_length - 1

The variable pattern_length does not seem to be defined for all execution paths.

        # apply final rules on phonetic-alphabet,
        # doing a substitution of certain characters
        phonetic = self._apply_final_rules(
            phonetic, final_rules1, language_arg, False
        )  # apply common rules
        # final_rules1 are the common approx rules,
        # final_rules2 are approx rules for specific language
        phonetic = self._apply_final_rules(
            phonetic, final_rules2, language_arg, True
        )  # apply lang specific rules

        return phonetic
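
The warning about `pattern_length` above comes from a variable that is assigned inside the inner loop and then patched up afterwards through the `found` flag. One way to make the assignment visibly unconditional is Python's `for ... else`. The sketch below is a simplified stand-in under that assumption (plain string patterns, no contexts, no language attributes); `match_lengths` is a hypothetical name, not an Abydos function.

```python
def match_lengths(term, patterns):
    """Return how many characters each step of a left-to-right scan consumes."""
    consumed = []
    i = 0
    while i < len(term):
        for pattern in patterns:
            if pattern and term.startswith(pattern, i):
                step = len(pattern)
                break
        else:
            # no pattern matched here: consume a single character
            step = 1
        consumed.append(step)
        i += step
    return consumed


print(match_lengths('schmidt', ['sch', 'dt']))
```

Because the `else` branch runs exactly when the loop finishes without `break`, `step` is bound on every path and no separate `found` flag is needed.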

    def _apply_final_rules(self, phonetic, final_rules, language_arg, strip):

Comprehensibility: This function exceeds the maximum number of variables (21/15).

        """Apply a set of final rules to the phonetic encoding.

        Args:
            phonetic (str): The term to which to apply the final rules
            final_rules (tuple): The set of final phonetic transform regexps
            language_arg (int): An integer representing the target language of
                the phonetic encoding
            strip (bool): Flag to indicate whether to normalize the language
                attributes

        Returns:
            str: A BMPM code

        """
        # optimization to save time
        if not final_rules:
            return phonetic

        # expand the result
        phonetic = self._expand_alternates(phonetic)
        phonetic_array = phonetic.split('|')

        for k in range(len(phonetic_array)):

unused-code: Consider using enumerate instead of iterating with range and len

            phonetic = phonetic_array[k]
            phonetic2 = ''
            phoneticx = self._normalize_lang_attrs(phonetic, True)

            i = 0
            while i < len(phonetic):
                found = False

                if phonetic[i] == '[':  # skip over language attribute
                    attrib_start = i
                    i += 1
                    while True:
                        if phonetic[i] == ']':
                            i += 1
                            phonetic2 += phonetic[attrib_start:i]
                            break
                        i += 1
                    continue

                for rule in final_rules:
                    pattern = rule[_PATTERN_POS]
                    pattern_length = len(pattern)
                    lcontext = rule[_LCONTEXT_POS]
                    rcontext = rule[_RCONTEXT_POS]

                    right = '^' + rcontext
                    left = lcontext + '$'

                    # check to see if next sequence in phonetic matches the
                    # string in the rule
                    if (pattern_length > len(phoneticx) - i) or phoneticx[
                        i : i + pattern_length

Coding Style: Wrong hanging indentation before block (add 4 spaces).

                    ] != pattern:
                        continue

                    # check that right context is satisfied
                    if rcontext != '':
                        if not search(right, phoneticx[i + pattern_length :]):
                            continue

                    # check that left context is satisfied
                    if lcontext != '':
                        if not search(left, phoneticx[:i]):
                            continue

                    # check for incompatible attributes
                    candidate = self._apply_rule_if_compat(
                        phonetic2, rule[_PHONETIC_POS], language_arg
                    )
                    # The below condition shouldn't ever be false
                    if candidate is not None:  # pragma: no branch
                        phonetic2 = candidate
                        found = True
                        break

                if not found:
                    # character in name for which there is no substitution in
                    # the table
                    phonetic2 += phonetic[i]
                    pattern_length = 1

                i += pattern_length

The variable pattern_length does not seem to be defined for all execution paths.

            phonetic_array[k] = self._expand_alternates(phonetic2)

        phonetic = '|'.join(phonetic_array)
        if strip:
            phonetic = self._normalize_lang_attrs(phonetic, True)

        if '|' in phonetic:
            phonetic = '(' + self._remove_dupes(phonetic) + ')'

        return phonetic

    def _phonetic_number(self, phonetic):

Coding Style: This method could be written as a function/class method.

        """Remove bracketed text from the end of a string.

        Args:
            phonetic (str): A Beider-Morse phonetic encoding

        Returns:
            str: A BMPM code

        """
        if '[' in phonetic:
            return phonetic[: phonetic.find('[')]

        return phonetic  # experimental !!!!

    def _expand_alternates(self, phonetic):
        """Expand phonetic alternates separated by |s.

        Args:
            phonetic (str): A Beider-Morse phonetic encoding

        Returns:
            str: A BMPM code

        """
        alt_start = phonetic.find('(')
        if alt_start == -1:
            return self._normalize_lang_attrs(phonetic, False)

        prefix = phonetic[:alt_start]
        alt_start += 1  # get past the (
        alt_end = phonetic.find(')', alt_start)
        alt_string = phonetic[alt_start:alt_end]
        alt_end += 1  # get past the )
        suffix = phonetic[alt_end:]
        alt_array = alt_string.split('|')
        result = ''

        for i in range(len(alt_array)):

unused-code: Consider using enumerate instead of iterating with range and len

            alt = alt_array[i]
            alternate = self._expand_alternates(prefix + alt + suffix)
            if alternate != '' and alternate != '[0]':
                if result != '':
                    result += '|'
                result += alternate

        return result

    def _pnums_with_leading_space(self, phonetic):
        """Join prefixes & suffixes in cases of alternate phonetic values.

        Args:
            phonetic (str): A Beider-Morse phonetic encoding

        Returns:
            str: A BMPM code

        """
        alt_start = phonetic.find('(')
        if alt_start == -1:
            return ' ' + self._phonetic_number(phonetic)

        prefix = phonetic[:alt_start]
        alt_start += 1  # get past the (
        alt_end = phonetic.find(')', alt_start)
        alt_string = phonetic[alt_start:alt_end]
        alt_end += 1  # get past the )
        suffix = phonetic[alt_end:]
        alt_array = alt_string.split('|')
        result = ''
        for alt in alt_array:
            result += self._pnums_with_leading_space(prefix + alt + suffix)

        return result

    def _phonetic_numbers(self, phonetic):
        """Prepare & join phonetic numbers.

        Split phonetic value on '-', run through _pnums_with_leading_space,
        and join with ' '

        Args:
            phonetic (str): A Beider-Morse phonetic encoding

        Returns:
            str: A BMPM code

        """
        phonetic_array = phonetic.split('-')  # for names with spaces in them
        result = ' '.join(
            [self._pnums_with_leading_space(i)[1:] for i in phonetic_array]
        )
        return result

    def _remove_dupes(self, phonetic):

Coding Style: This method could be written as a function/class method.

        """Remove duplicates from a phonetic encoding list.

        Args:
            phonetic (str): A Beider-Morse phonetic encoding

        Returns:
            str: A BMPM code

        """
        alt_string = phonetic
        alt_array = alt_string.split('|')

        result = '|'
        for i in range(len(alt_array)):

unused-code: Consider using enumerate instead of iterating with range and len

            alt = alt_array[i]
            if alt and '|' + alt + '|' not in result:
                result += alt + '|'

        return result[1:-1]  # remove leading and trailing |

    def _normalize_lang_attrs(self, text, strip):

Coding Style: This method could be written as a function/class method.

        """Remove embedded bracketed attributes.

        This (potentially) bitwise-ands bracketed attributes together and adds
        to the end.
        This is applied to a single alternative at a time -- not to a
        parenthesized list.
        It removes all embedded bracketed attributes, logically-ands them
        together, and places them at the end.
        However if strip is true, this can indeed remove embedded bracketed
        attributes from a parenthesized list.

        Args:
            text (str): A Beider-Morse phonetic encoding (in progress)
            strip (bool): Remove the bracketed attributes (and throw away)

        Returns:
            str: A BMPM code

        Raises:
            ValueError: No closing square bracket

        """
        uninitialized = -1  # all 1's
        attrib = uninitialized
        while '[' in text:
            bracket_start = text.find('[')
            bracket_end = text.find(']', bracket_start)
            if bracket_end == -1:
                raise ValueError(
                    'No closing square bracket: text=('
                    + text
                    + ') strip=('
                    + text_type(strip)
                    + ')'
                )
            attrib &= int(text[bracket_start + 1 : bracket_end])
            text = text[:bracket_start] + text[bracket_end + 1 :]

        if attrib == uninitialized or strip:
            return text
        elif attrib == 0:
            # means that the attributes were incompatible and there is no
            # alternative here
            return '[0]'
        return text + '[' + str(attrib) + ']'

    def _apply_rule_if_compat(self, phonetic, target, language_arg):
        """Apply a phonetic regex if compatible.

        tests for compatible language rules

        to do so, apply the rule, expand the results, and detect alternatives
            with incompatible attributes

        then drop each alternative that has incompatible attributes and keep
            those that are compatible

        if there are no compatible alternatives left, return None

        otherwise return the compatible alternatives

        apply the rule

        Args:
            phonetic (str): The Beider-Morse phonetic encoding (so far)
            target (str): A proposed addition to the phonetic encoding
            language_arg (int): An integer representing the target language of
                the phonetic encoding

        Returns:
            str: A candidate encoding

        """
        candidate = phonetic + target
        if '[' not in candidate:  # no attributes so we need test no further
            return candidate

        # expand the result, converting incompatible attributes to [0]
        candidate = self._expand_alternates(candidate)
        candidate_array = candidate.split('|')

        # drop each alternative that has incompatible attributes
        candidate = ''
        found = False

        for i in range(len(candidate_array)):

unused-code: Consider using enumerate instead of iterating with range and len

            this_candidate = candidate_array[i]
            if language_arg != 1:
                this_candidate = self._normalize_lang_attrs(
                    this_candidate + '[' + str(language_arg) + ']', False
                )
            if this_candidate != '[0]':
                found = True
                if candidate:
                    candidate += '|'
                candidate += this_candidate

        # return None if no compatible alternatives remain
        if not found:
            return None

        # return the result of applying the rule
        if '|' in candidate:
            candidate = '(' + candidate + ')'
        return candidate

    def _language_index_from_code(self, code, name_mode):
Coding Style: This method could be written as a function/class method.

If a method does not access any attributes of the class, it could also be implemented as a function or static method. This can help improve readability. For example:

class Foo:
    def some_method(self, x, y):
        return x + y

could be written as

class Foo:
    @classmethod
    def some_method(cls, x, y):
        return x + y
        """Return the index value for a language code.

        This returns L_ANY if more than one code is specified or the code is
        out of bounds.

        Args:
            code (int): The language code to interpret
            name_mode (str): The name mode of the algorithm: ``gen`` (default),
                    ``ash`` (Ashkenazi), or ``sep`` (Sephardic)

        Returns:
            int: Language code index

        """
        if code < 1 or code > sum(
            _LANG_DICT[_] for _ in BMDATA[name_mode]['languages']
Coding Style: Wrong hanging indentation before block (add 4 spaces).
        ):  # code out of range
            return L_ANY
        if (
            code & (code - 1)
Coding Style: Wrong hanging indentation before block (add 4 spaces).
        ) != 0:  # choice was more than one language; use any
            return L_ANY
        return code
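The `code & (code - 1)` test above relies on a standard bit trick: the expression is zero exactly when `code` has a single bit set, that is, when exactly one language flag was chosen. A minimal sketch of the same logic as a plain function follows, which is also the shape the coding-style remark suggests. The names `language_index` and `max_code` are hypothetical, and the value `L_ANY = 1` is an assumption made for the example.

```python
L_ANY = 1  # assumed value of the "any language" flag


def language_index(code, max_code):
    """Return code if it names exactly one in-range language, else L_ANY.

    `max_code` is a hypothetical stand-in for the sum of all language
    flags available in the chosen name mode.
    """
    if code < 1 or code > max_code:  # code out of range
        return L_ANY
    if (code & (code - 1)) != 0:  # more than one bit set; use any
        return L_ANY
    return code


for code in (4, 6, 0, 16):
    print(code, language_index(code, 15))
```

With four flags (1, 2, 4, 8), the code 6 combines two languages and falls back to `L_ANY`, while 4 is returned unchanged.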

    def encode(
        self,
        word,
        language_arg=0,
        name_mode='gen',
        match_mode='approx',
        concat=False,
        filter_langs=False,
    ):
best-practice: Too many arguments (7/5)
Comprehensibility: This function exceeds the maximum number of variables (16/15).
Bug: Parameters differ from overridden 'encode' method
Coding Style: Wrong hanging indentation before block (add 4 spaces). Reported for each of the seven parameter lines, from self through filter_langs.
        """Return the Beider-Morse Phonetic Matching encoding(s) of a term.

        Args:
            word (str): The word to transform
            language_arg (str or int): The language of the term, given as a
                comma-separated string of language names or as an integer
                representing a set of languages; supported names include:
                    - ``any``
                    - ``arabic``
                    - ``cyrillic``
                    - ``czech``
                    - ``dutch``
                    - ``english``
                    - ``french``
                    - ``german``
                    - ``greek``
                    - ``greeklatin``
                    - ``hebrew``
                    - ``hungarian``
                    - ``italian``
                    - ``latvian``
                    - ``polish``
                    - ``portuguese``
                    - ``romanian``
                    - ``russian``
                    - ``spanish``
                    - ``turkish``
            name_mode (str): The name mode of the algorithm:
                - ``gen`` -- general (default)
                - ``ash`` -- Ashkenazi
                - ``sep`` -- Sephardic
            match_mode (str): Matching mode: ``approx`` or ``exact``
            concat (bool): Concatenation mode
            filter_langs (bool): Filter out incompatible languages

        Returns:
            str: The BMPM value(s), separated by spaces

        Raises:
            ValueError: Unknown language

        Examples:
            >>> pe = BeiderMorse()
            >>> pe.encode('Christopher')
            'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir
            xristofir xristYfir xristopi xritopir xritopi xristofi xritofir
            xritofi tzristopir tzristofir zristopir zristopi zritopir zritopi
            zristofir zristofi zritofir zritofi'
            >>> pe.encode('Niall')
            'nial niol'
            >>> pe.encode('Smith')
            'zmit'
            >>> pe.encode('Schmidt')
            'zmit stzmit'

            >>> pe.encode('Christopher', language_arg='German')
            'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir
            xristofir xristYfir'
            >>> pe.encode('Christopher', language_arg='English')
            'tzristofir tzrQstofir tzristafir tzrQstafir xristofir xrQstofir
            xristafir xrQstafir'
            >>> pe.encode('Christopher', language_arg='German',
            ... name_mode='ash')
            'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir
            xristofir xristYfir'

            >>> pe.encode('Christopher', language_arg='German',
            ... match_mode='exact')
            'xriStopher xriStofer xristopher xristofer'

        """
        word = normalize('NFC', text_type(word.strip().lower()))

        name_mode = name_mode.strip().lower()[:3]
        if name_mode not in {'ash', 'sep', 'gen'}:
            name_mode = 'gen'

        if match_mode != 'exact':
            match_mode = 'approx'

        # Translate the supplied language_arg value into an integer
        # representing a set of languages
        all_langs = (
            sum(_LANG_DICT[_] for _ in BMDATA[name_mode]['languages']) - 1
        )
        lang_choices = 0
        if isinstance(language_arg, (int, float, long)):
The variable long does not seem to be defined in case PY3 on line 65 is False. Are you sure this can never be the case?
            lang_choices = int(language_arg)
        elif language_arg != '' and isinstance(language_arg, (text_type, str)):
            for lang in text_type(language_arg).lower().split(','):
                if lang in _LANG_DICT and (_LANG_DICT[lang] & all_langs):
                    lang_choices += _LANG_DICT[lang]
                elif not filter_langs:
                    raise ValueError(
                        'Unknown \''
                        + name_mode
                        + '\' language: \''
                        + lang
                        + '\''
                    )

        # Language choices are either all incompatible with the name mode or
        # no choices were given, so try to autodetect
        if lang_choices == 0:
            language_arg = self._language(word, name_mode)
        else:
            language_arg = lang_choices
        language_arg2 = self._language_index_from_code(language_arg, name_mode)

        rules = BMDATA[name_mode]['rules'][language_arg2]
        final_rules1 = BMDATA[name_mode][match_mode]['common']
        final_rules2 = BMDATA[name_mode][match_mode][language_arg2]

        result = self._phonetic(
            word,
            name_mode,
            rules,
            final_rules1,
            final_rules2,
            language_arg,
            concat,
        )
        result = self._phonetic_numbers(result)

        return result
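The translation of `language_arg` into an integer set of languages can be isolated into a small helper, which would also reduce the local variable count flagged above. The sketch below is a Python 3 only illustration under stated assumptions: `LANG_FLAGS` is a toy table standing in for `_LANG_DICT`, and `language_choices` and `allowed_mask` are hypothetical names. Unlike the original, it combines flags with `|=` rather than `+=`, so a language listed twice is counted once.

```python
# Toy flag table for illustration; the real values live in _LANG_DICT.
LANG_FLAGS = {'any': 1, 'english': 2, 'german': 4, 'french': 8}


def language_choices(language_arg, allowed_mask, filter_langs=False):
    """Translate language_arg into an integer set of language flags.

    `allowed_mask` is a hypothetical stand-in for all_langs, the flags
    that the chosen name mode supports.
    """
    if isinstance(language_arg, (int, float)):
        return int(language_arg)

    choices = 0
    if isinstance(language_arg, str) and language_arg:
        for lang in language_arg.lower().split(','):
            flag = LANG_FLAGS.get(lang, 0)
            if flag & allowed_mask:
                choices |= flag  # set the bit for this language
            elif not filter_langs:
                raise ValueError("Unknown language: '" + lang + "'")
    return choices


# Every flag except 'any', mirroring the "- 1" in all_langs.
ALLOWED = sum(LANG_FLAGS.values()) - 1
print(language_choices('German,English', ALLOWED))
```

A result of 0 means that no usable choice was given, which is the case in which `encode` falls back to autodetection.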


def bmpm(
    word,
    language_arg=0,
    name_mode='gen',
    match_mode='approx',
    concat=False,
    filter_langs=False,
):
best-practice: Too many arguments (6/5)
Coding Style: Wrong hanging indentation before block (add 4 spaces). Reported for each of the six parameter lines, from word through filter_langs.
    """Return the Beider-Morse Phonetic Matching encoding(s) of a term.

    This is a wrapper for :py:meth:`BeiderMorse.encode`.

    Args:
        word (str): The word to transform
        language_arg (str or int): The language of the term, given as a
            comma-separated string of language names or as an integer
            representing a set of languages; supported names include:
                - ``any``
                - ``arabic``
                - ``cyrillic``
                - ``czech``
                - ``dutch``
                - ``english``
                - ``french``
                - ``german``
                - ``greek``
                - ``greeklatin``
                - ``hebrew``
                - ``hungarian``
                - ``italian``
                - ``latvian``
                - ``polish``
                - ``portuguese``
                - ``romanian``
                - ``russian``
                - ``spanish``
                - ``turkish``
        name_mode (str): The name mode of the algorithm:
            - ``gen`` -- general (default)
            - ``ash`` -- Ashkenazi
            - ``sep`` -- Sephardic
        match_mode (str): Matching mode: ``approx`` or ``exact``
        concat (bool): Concatenation mode
        filter_langs (bool): Filter out incompatible languages

    Returns:
        str: The BMPM value(s), separated by spaces

    Examples:
        >>> bmpm('Christopher')
        'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
        xristYfir xristopi xritopir xritopi xristofi xritofir xritofi
        tzristopir tzristofir zristopir zristopi zritopir zritopi zristofir
        zristofi zritofir zritofi'
        >>> bmpm('Niall')
        'nial niol'
        >>> bmpm('Smith')
        'zmit'
        >>> bmpm('Schmidt')
        'zmit stzmit'

        >>> bmpm('Christopher', language_arg='German')
        'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
        xristYfir'
        >>> bmpm('Christopher', language_arg='English')
        'tzristofir tzrQstofir tzristafir tzrQstafir xristofir xrQstofir
        xristafir xrQstafir'
        >>> bmpm('Christopher', language_arg='German', name_mode='ash')
        'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
        xristYfir'

        >>> bmpm('Christopher', language_arg='German', match_mode='exact')
        'xriStopher xriStofer xristopher xristofer'

    """
    return BeiderMorse().encode(
        word, language_arg, name_mode, match_mode, concat, filter_langs
    )
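One possible way to address the "Too many arguments" remarks is to bundle the optional settings into a single parameter object. The sketch below only illustrates that idea and is not part of the reviewed code: `EncodeOptions`, `encode_with`, and `fake_encode` are hypothetical names, and the use of `dataclasses` assumes Python 3.7 or later, whereas the reviewed module still supports Python 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodeOptions:
    """Hypothetical bundle of the optional encode() settings."""

    language_arg: object = 0  # language name(s) or an integer flag set
    name_mode: str = 'gen'
    match_mode: str = 'approx'
    concat: bool = False
    filter_langs: bool = False


def encode_with(encode, word, options=None):
    """Call an encode-style function with the options unpacked in order."""
    if options is None:
        options = EncodeOptions()
    return encode(
        word,
        options.language_arg,
        options.name_mode,
        options.match_mode,
        options.concat,
        options.filter_langs,
    )


def fake_encode(word, language_arg, name_mode, match_mode, concat, filter_langs):
    """Stand-in encoder that simply echoes what it was given."""
    return (word, language_arg, name_mode, match_mode, concat, filter_langs)


print(encode_with(fake_encode, 'Smith', EncodeOptions(match_mode='exact')))
```

The wrapper then takes two parameters besides the callable, and new settings can be added to the options class without changing any call signature.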


if __name__ == '__main__':
    import doctest

    doctest.testmod()
938