Completed
Push — master ( f43547...71985b )
by Chris
12:00 queued 10s
created

abydos.phonetic._soundex.Soundex.encode()   D

Complexity

Conditions 13

Size

Total Lines 125
Code Lines 37

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 30
CRAP Score 13

Importance

Changes 0
Metric Value
cc 13
eloc 37
nop 6
dl 0
loc 125
ccs 30
cts 30
cp 1
crap 13
rs 4.2
c 0
b 0
f 0

How to fix

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
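As a rough illustration of that advice, two of the commented steps at the top of encode() could each become a small named helper. This is only a sketch and not part of Abydos: the names clamp_max_length and empty_code are hypothetical, derived from the comments they would replace.

```python
def clamp_max_length(max_length, lower=4, upper=64):
    """Require a max_length of at least `lower` and not more than `upper`.

    A value of -1 means "use the upper bound", mirroring encode().
    """
    if max_length == -1:
        return upper
    return min(max(lower, max_length), upper)


def empty_code(max_length, zero_pad=True):
    """Return the base-case code for a word with nothing to convert."""
    return '0' * max_length if zero_pad else '0'
```

With helpers like these, the opening of encode() would shrink to two calls whose names carry what the comments used to say.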

Commonly applied refactorings include Extract Method.

Complexity

Complex classes like abydos.phonetic._soundex.Soundex.encode() often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
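At method scale, the same idea applies to the two near-identical Census prefix branches in encode(): they form one cohesive piece that could be pulled out and driven by a table. The sketch below is an assumption about how that might look, not the library's code; CENSUS_PREFIX_RULES and strip_census_prefix are hypothetical names, and the word is assumed to be uppercased already, as it is in encode().

```python
# Each rule: (set of prefixes, prefix length). A rule applies only when the
# word is more than one character longer than the prefix, which matches the
# original checks len(word) > 4 and len(word) > 3.
CENSUS_PREFIX_RULES = (
    (frozenset({'VAN', 'CON'}), 3),
    (frozenset({'DE', 'DI', 'LA', 'LE'}), 2),
)


def strip_census_prefix(word):
    """Return the word without its Census prefix, or None if no rule applies."""
    for prefixes, size in CENSUS_PREFIX_RULES:
        if word[:size] in prefixes and len(word) > size + 1:
            return word[size:]
    return None
```

The caller would then need a single condition: if strip_census_prefix(word) returns a remainder, encode both the full word and the remainder; otherwise fall through to the usual path.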

# -*- coding: utf-8 -*-

# Copyright 2014-2018 by Christopher C. Little.
# This file is part of Abydos.
#
# Abydos is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Abydos is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.

"""abydos.phonetic._soundex.

American Soundex
"""

from __future__ import (
    absolute_import,
    division,
    print_function,
    unicode_literals,
)

from unicodedata import normalize as unicode_normalize

from six import text_type

from ._phonetic import _Phonetic

__all__ = ['Soundex', 'soundex']


class Soundex(_Phonetic):
    # Reported issue (Unused Code): The variable __class__ seems to be unused.
    """Soundex.

    Three variants of Soundex are implemented:

    - 'American' follows the American Soundex algorithm, as described at
      :cite:`US:2007` and in :cite:`Knuth:1998`; this is also called
      Miracode
    - 'special' follows the rules from the 1880-1910 US Census
      retrospective re-analysis, in which h & w are not treated as blocking
      consonants but as vowels. Cf. :cite:`Repici:2013`.
    - 'Census' follows the rules laid out in GIL 55 :cite:`US:1997` by the
      US Census, including coding prefixed and unprefixed versions of some
      names
    """

    _trans = dict(
        zip(
            (ord(_) for _ in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
            # Reported issue (Comprehensibility Best Practice):
            # The variable _ does not seem to be defined.
            '01230129022455012623019202',
        )
    )

    def encode(
        # Reported issue (best-practice): Too many arguments (6/5)
        # Reported issue (Bug): Parameters differ from overridden 'encode'
        # method
        self, word, max_length=4, var='American', reverse=False, zero_pad=True
        # Reported issue (Coding Style): Wrong hanging indentation before
        # block (add 4 spaces).
    ):
        """Return the Soundex code for a word.

        Parameters
        ----------
        word : str
            The word to transform
        max_length : int
            The length of the code returned (defaults to 4)
        var : str
            The variant of the algorithm to employ (defaults to ``American``):

                - ``American`` follows the American Soundex algorithm, as
                  described at :cite:`US:2007` and in :cite:`Knuth:1998`; this
                  is also called Miracode
                - ``special`` follows the rules from the 1880-1910 US Census
                  retrospective re-analysis, in which h & w are not treated as
                  blocking consonants but as vowels. Cf. :cite:`Repici:2013`.
                - ``Census`` follows the rules laid out in GIL 55
                  :cite:`US:1997` by the US Census, including coding prefixed
                  and unprefixed versions of some names

        reverse : bool
            Reverse the word before computing the selected Soundex (defaults to
            False); This results in "Reverse Soundex", which is useful for
            blocking in cases where the initial elements may be in error.
        zero_pad : bool
            Pad the end of the return value with 0s to achieve a max_length
            string

        Returns
        -------
        str or tuple of str
            The Soundex value; with ``var='Census'``, a prefixed name yields a
            pair of codes (for the full name and for the name without its
            prefix)

        Examples
        --------
        >>> pe = Soundex()
        >>> pe.encode("Christopher")
        'C623'
        >>> pe.encode("Niall")
        'N400'
        >>> pe.encode('Smith')
        'S530'
        >>> pe.encode('Schmidt')
        'S530'

        >>> pe.encode('Christopher', max_length=-1)
        'C623160000000000000000000000000000000000000000000000000000000000'
        >>> pe.encode('Christopher', max_length=-1, zero_pad=False)
        'C62316'

        >>> pe.encode('Christopher', reverse=True)
        'R132'

        >>> pe.encode('Ashcroft')
        'A261'
        >>> pe.encode('Asicroft')
        'A226'
        >>> pe.encode('Ashcroft', var='special')
        'A226'
        >>> pe.encode('Asicroft', var='special')
        'A226'

        """
        # Require a max_length of at least 4 and not more than 64
        if max_length != -1:
            max_length = min(max(4, max_length), 64)
        else:
            max_length = 64

        # uppercase, normalize, decompose, and filter non-A-Z out
        word = unicode_normalize('NFKD', text_type(word.upper()))
        word = word.replace('ß', 'SS')

        if var == 'Census':
            if word[:3] in {'VAN', 'CON'} and len(word) > 4:
                return (
                    soundex(word, max_length, 'American', reverse, zero_pad),
                    soundex(
                        word[3:], max_length, 'American', reverse, zero_pad
                    ),
                )
            if word[:2] in {'DE', 'DI', 'LA', 'LE'} and len(word) > 3:
                return (
                    soundex(word, max_length, 'American', reverse, zero_pad),
                    soundex(
                        word[2:], max_length, 'American', reverse, zero_pad
                    ),
                )
            # Otherwise, proceed as usual (var='American' mode, ostensibly)

        word = ''.join(c for c in word if c in self._uc_set)

        # Nothing to convert, return base case
        if not word:
            if zero_pad:
                return '0' * max_length
            return '0'

        # Reverse word if computing Reverse Soundex
        if reverse:
            word = word[::-1]

        # apply the Soundex algorithm
        sdx = word.translate(self._trans)

        if var == 'special':
            sdx = sdx.replace('9', '0')  # special rule for 1880-1910 census
        else:
            sdx = sdx.replace('9', '')  # rule 1
        sdx = self._delete_consecutive_repeats(sdx)  # rule 3

        if word[0] in 'HW':
            sdx = word[0] + sdx
        else:
            sdx = word[0] + sdx[1:]
        sdx = sdx.replace('0', '')  # rule 1

        if zero_pad:
            sdx += '0' * max_length  # rule 4

        return sdx[:max_length]


190 1
def soundex(word, max_length=4, var='American', reverse=False, zero_pad=True):
191
    """Return the Soundex code for a word.
192
193
    This is a wrapper for :py:meth:`Soundex.encode`.
194
195
    Parameters
196
    ----------
197
    word : str
198
        The word to transform
199
    max_length : int
200
        The length of the code returned (defaults to 4)
201
    var : str
202
        The variant of the algorithm to employ (defaults to ``American``):
203
204
            - ``American`` follows the American Soundex algorithm, as described
205
              at :cite:`US:2007` and in :cite:`Knuth:1998`; this is also called
206
              Miracode
207
            - ``special`` follows the rules from the 1880-1910 US Census
208
              retrospective re-analysis, in which h & w are not treated as
209
              blocking consonants but as vowels. Cf. :cite:`Repici:2013`.
210
            - ``Census`` follows the rules laid out in GIL 55 :cite:`US:1997`
211
              by the US Census, including coding prefixed and unprefixed
212
              versions of some names
213
214
    reverse : bool
215
        Reverse the word before computing the selected Soundex (defaults to
216
        False); This results in "Reverse Soundex", which is useful for blocking
217
        in cases where the initial elements may be in error.
218
    zero_pad : bool
219
        Pad the end of the return value with 0s to achieve a max_length string
220
221
    Returns
222
    -------
223
    str
224
        The Soundex value
225
226
    Examples
227
    --------
228
    >>> soundex("Christopher")
229
    'C623'
230
    >>> soundex("Niall")
231
    'N400'
232
    >>> soundex('Smith')
233
    'S530'
234
    >>> soundex('Schmidt')
235
    'S530'
236
237
    >>> soundex('Christopher', max_length=-1)
238
    'C623160000000000000000000000000000000000000000000000000000000000'
239
    >>> soundex('Christopher', max_length=-1, zero_pad=False)
240
    'C62316'
241
242
    >>> soundex('Christopher', reverse=True)
243
    'R132'
244
245
    >>> soundex('Ashcroft')
246
    'A261'
247
    >>> soundex('Asicroft')
248
    'A226'
249
    >>> soundex('Ashcroft', var='special')
250
    'A226'
251
    >>> soundex('Asicroft', var='special')
252
    'A226'
253
254
    """
255 1
    return Soundex().encode(word, max_length, var, reverse, zero_pad)
256
257
258
if __name__ == '__main__':
259
    import doctest
260
261
    doctest.testmod()
262