abydos.phonetic._soundex - Code Metrics - Inspection of "Merge pull request #149 from chrislit/0.3.6" - chrislit/abydos

Completed

Push — master ( f43547...71985b )

by Chris

created 2018-11-17 08:52 UTC

abydos.phonetic._soundex A


Complexity

Total Complexity

Size/Duplication

Total Lines	262
Duplicated Lines	0 %

Test Coverage

Coverage	100%

Importance

Changes

Metric	Value
eloc	57
dl	0
loc	262
ccs	40
cts	40
cp	1
rs	10
c	0
b	0
f	0
wmc	14

1 Function

Rating	Name	Duplication	Size	Complexity
A	soundex()	0	66	1

1 Method

Rating	Name	Duplication	Size	Complexity
D	Soundex.encode()	0	125	13
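
The summary figures above appear to be internally consistent, which a few lines of Python can confirm. The reading of the abbreviations is an assumption, not something this page states: `ccs` and `cts` are taken to be covered and total statement counts, `cp` their ratio, and `wmc` the sum of the per-function complexities.

```python
# Values copied from the metric tables above. The meaning attached to each
# abbreviation is an assumption, not taken from Scrutinizer documentation.
metrics = {'eloc': 57, 'loc': 262, 'ccs': 40, 'cts': 40, 'cp': 1, 'wmc': 14}
complexity = {'soundex()': 1, 'Soundex.encode()': 13}

# Assumed: coverage proportion = covered statements / total statements.
coverage = metrics['ccs'] / metrics['cts']

# Assumed: wmc aggregates the complexity of every function and method.
total_complexity = sum(complexity.values())

print(coverage == metrics['cp'])           # True
print(total_complexity == metrics['wmc'])  # True
```

Under that reading, 40 of 40 statements are covered (matching the 100% coverage figure), and the module's complexity of 14 comes almost entirely from `Soundex.encode()`.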

# -*- coding: utf-8 -*-

# Copyright 2014-2018 by Christopher C. Little.
# This file is part of Abydos.
#
# Abydos is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Abydos is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.

"""abydos.phonetic._soundex.

American Soundex
"""

from __future__ import (
    absolute_import,
    division,
    print_function,
    unicode_literals,
)

from unicodedata import normalize as unicode_normalize

from six import text_type

from ._phonetic import _Phonetic

__all__ = ['Soundex', 'soundex']


class Soundex(_Phonetic):

    """Soundex.

    Three variants of Soundex are implemented:

    - 'American' follows the American Soundex algorithm, as described at
      :cite:`US:2007` and in :cite:`Knuth:1998`; this is also called
      Miracode
    - 'special' follows the rules from the 1880-1910 US Census
      retrospective re-analysis, in which h & w are not treated as blocking
      consonants but as vowels. Cf. :cite:`Repici:2013`.
    - 'Census' follows the rules laid out in GIL 55 :cite:`US:1997` by the
      US Census, including coding prefixed and unprefixed versions of some
      names
    """

    _trans = dict(
        zip(
            (ord(_) for _ in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
            '01230129022455012623019202',
        )
    )

    def encode(
        self, word, max_length=4, var='American', reverse=False, zero_pad=True
    ):
        """Return the Soundex code for a word.

        Parameters
        ----------
        word : str
            The word to transform
        max_length : int
            The length of the code returned (defaults to 4); values are
            clamped to the range 4-64, and -1 selects the maximum of 64
        var : str
            The variant of the algorithm to employ (defaults to ``American``):

                - ``American`` follows the American Soundex algorithm, as
                  described at :cite:`US:2007` and in :cite:`Knuth:1998`; this
                  is also called Miracode
                - ``special`` follows the rules from the 1880-1910 US Census
                  retrospective re-analysis, in which h & w are not treated as
                  blocking consonants but as vowels. Cf. :cite:`Repici:2013`.
                - ``Census`` follows the rules laid out in GIL 55
                  :cite:`US:1997` by the US Census, including coding prefixed
                  and unprefixed versions of some names

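        reverse : bool
            Reverse the word before computing the selected Soundex (defaults to
            False); this results in "Reverse Soundex", which is useful for
            blocking in cases where the initial elements may be in error.
        zero_pad : bool
            Pad the end of the return value with 0s to achieve a max_length
            string (defaults to True)

        Returns
        -------
        str or tuple of str
            The Soundex value; with ``var='Census'`` and a sufficiently long
            name starting with VAN, CON, DE, DI, LA, or LE, a pair of codes
            for the prefixed and unprefixed name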

        Examples
        --------
        >>> pe = Soundex()
        >>> pe.encode("Christopher")
        'C623'
        >>> pe.encode("Niall")
        'N400'
        >>> pe.encode('Smith')
        'S530'
        >>> pe.encode('Schmidt')
        'S530'

        >>> pe.encode('Christopher', max_length=-1)
        'C623160000000000000000000000000000000000000000000000000000000000'
        >>> pe.encode('Christopher', max_length=-1, zero_pad=False)
        'C62316'

        >>> pe.encode('Christopher', reverse=True)
        'R132'

        >>> pe.encode('Ashcroft')
        'A261'
        >>> pe.encode('Asicroft')
        'A226'
        >>> pe.encode('Ashcroft', var='special')
        'A226'
        >>> pe.encode('Asicroft', var='special')
        'A226'

        """
        # Require a max_length of at least 4 and not more than 64
        if max_length != -1:
            max_length = min(max(4, max_length), 64)
        else:
            max_length = 64

        # uppercase, normalize, and decompose (non-A-Z is filtered out below)
        word = unicode_normalize('NFKD', text_type(word.upper()))
        word = word.replace('ß', 'SS')

        if var == 'Census':
            if word[:3] in {'VAN', 'CON'} and len(word) > 4:
                return (
                    soundex(word, max_length, 'American', reverse, zero_pad),
                    soundex(
                        word[3:], max_length, 'American', reverse, zero_pad
                    ),
                )
            if word[:2] in {'DE', 'DI', 'LA', 'LE'} and len(word) > 3:
                return (
                    soundex(word, max_length, 'American', reverse, zero_pad),
                    soundex(
                        word[2:], max_length, 'American', reverse, zero_pad
                    ),
                )
            # Otherwise, proceed as usual (var='American' mode, ostensibly)

        word = ''.join(c for c in word if c in self._uc_set)

        # Nothing to convert, return base case
        if not word:
            if zero_pad:
                return '0' * max_length
            return '0'

        # Reverse word if computing Reverse Soundex
        if reverse:
            word = word[::-1]

        # apply the Soundex algorithm
        sdx = word.translate(self._trans)

        if var == 'special':
            sdx = sdx.replace('9', '0')  # special rule for 1880-1910 census
        else:
            sdx = sdx.replace('9', '')  # rule 1
        sdx = self._delete_consecutive_repeats(sdx)  # rule 3

        if word[0] in 'HW':
            sdx = word[0] + sdx
        else:
            sdx = word[0] + sdx[1:]
        sdx = sdx.replace('0', '')  # rule 1

        if zero_pad:
            sdx += '0' * max_length  # rule 4

        return sdx[:max_length]


def soundex(word, max_length=4, var='American', reverse=False, zero_pad=True):
    """Return the Soundex code for a word.

    This is a wrapper for :py:meth:`Soundex.encode`.

    Parameters
    ----------
    word : str
        The word to transform
    max_length : int
        The length of the code returned (defaults to 4); values are clamped to
        the range 4-64, and -1 selects the maximum of 64
    var : str
        The variant of the algorithm to employ (defaults to ``American``):

            - ``American`` follows the American Soundex algorithm, as described
              at :cite:`US:2007` and in :cite:`Knuth:1998`; this is also called
              Miracode
            - ``special`` follows the rules from the 1880-1910 US Census
              retrospective re-analysis, in which h & w are not treated as
              blocking consonants but as vowels. Cf. :cite:`Repici:2013`.
            - ``Census`` follows the rules laid out in GIL 55 :cite:`US:1997`
              by the US Census, including coding prefixed and unprefixed
              versions of some names

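    reverse : bool
        Reverse the word before computing the selected Soundex (defaults to
        False); this results in "Reverse Soundex", which is useful for blocking
        in cases where the initial elements may be in error.
    zero_pad : bool
        Pad the end of the return value with 0s to achieve a max_length string
        (defaults to True)

    Returns
    -------
    str or tuple of str
        The Soundex value; with ``var='Census'`` and a sufficiently long name
        starting with VAN, CON, DE, DI, LA, or LE, a pair of codes for the
        prefixed and unprefixed name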

    Examples
    --------
    >>> soundex("Christopher")
    'C623'
    >>> soundex("Niall")
    'N400'
    >>> soundex('Smith')
    'S530'
    >>> soundex('Schmidt')
    'S530'

    >>> soundex('Christopher', max_length=-1)
    'C623160000000000000000000000000000000000000000000000000000000000'
    >>> soundex('Christopher', max_length=-1, zero_pad=False)
    'C62316'

    >>> soundex('Christopher', reverse=True)
    'R132'

    >>> soundex('Ashcroft')
    'A261'
    >>> soundex('Asicroft')
    'A226'
    >>> soundex('Ashcroft', var='special')
    'A226'
    >>> soundex('Asicroft', var='special')
    'A226'

    """
    return Soundex().encode(word, max_length, var, reverse, zero_pad)


if __name__ == '__main__':
    import doctest

    doctest.testmod()
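
The steps in `Soundex.encode` can be followed without the rest of Abydos. The block below is a minimal standalone sketch, not the library's implementation: `soundex_sketch`, `UPPER`, `CENSUS_PREFIXES`, and `delete_consecutive_repeats` are hypothetical names, and the behaviour assumed for the inherited `_uc_set` and `_delete_consecutive_repeats` members (the set of A-Z, and collapsing runs of a repeated character) is inferred from how the listing uses them.

```python
import unicodedata
from itertools import groupby

# Letter-to-digit table copied from the listing: vowels and Y map to 0,
# H and W to 9, and the consonant groups to 1-6.
TRANS = dict(zip((ord(c) for c in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
                 '01230129022455012623019202'))
UPPER = set('ABCDEFGHIJKLMNOPQRSTUVWXYZ')  # assumed contents of _uc_set
# Prefix length -> prefixes that trigger a second code in the Census variant.
CENSUS_PREFIXES = ((3, {'VAN', 'CON'}), (2, {'DE', 'DI', 'LA', 'LE'}))


def delete_consecutive_repeats(text):
    """Collapse runs of a repeated character (assumed helper behaviour)."""
    return ''.join(char for char, _ in groupby(text))


def soundex_sketch(word, max_length=4, var='American', reverse=False,
                   zero_pad=True):
    """Follow the same steps as Soundex.encode in the listing."""
    # Clamp the length to 4..64; -1 selects the maximum.
    max_length = 64 if max_length == -1 else min(max(4, max_length), 64)

    word = unicodedata.normalize('NFKD', word.upper()).replace('ß', 'SS')

    if var == 'Census':
        # Prefixed names yield two codes: with and without the prefix.
        for size, prefixes in CENSUS_PREFIXES:
            if word[:size] in prefixes and len(word) > size + 1:
                return tuple(
                    soundex_sketch(w, max_length, 'American', reverse,
                                   zero_pad)
                    for w in (word, word[size:])
                )

    word = ''.join(c for c in word if c in UPPER)
    if not word:
        return '0' * max_length if zero_pad else '0'
    if reverse:
        word = word[::-1]

    digits = word.translate(TRANS)
    # H and W (coded 9) vanish in the American variant, so equal codes on
    # either side of them merge; the 'special' variant treats them as vowels.
    digits = digits.replace('9', '0' if var == 'special' else '')
    digits = delete_consecutive_repeats(digits)

    # Keep the first letter itself. A leading H or W has no digit to strip:
    # its 9 was deleted above, or became a 0 that is removed below.
    if word[0] in 'HW':
        code = word[0] + digits
    else:
        code = word[0] + digits[1:]
    code = code.replace('0', '')
    if zero_pad:
        code += '0' * max_length
    return code[:max_length]


print(soundex_sketch('Christopher'))              # C623
print(soundex_sketch('Ashcroft'))                 # A261
print(soundex_sketch('Ashcroft', var='special'))  # A226
print(soundex_sketch('VanDeusen', var='Census'))  # ('V532', 'D250')
```

The first three calls reproduce doctest values from the listing. The `VanDeusen` call is an added illustration of the Census variant, which returns a pair of codes rather than a single string when the name carries one of the listed prefixes.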


Issues

Type	Introduced	Code	Message
Unused Code	2018-11-10 01:42 UTC	class Soundex(_Phonetic):	The variable `__class__` seems to be unused.
Comprehensibility Best Practice	2018-10-24 06:00 UTC	(ord(_) for _ in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),	The variable `_` does not seem to be defined.
best-practice	2018-11-04 08:02 UTC	def encode(	Too many arguments (6/5)
Bug	2018-11-04 08:02 UTC	def encode(	Parameters differ from overridden 'encode' method
Coding Style	2018-11-04 08:02 UTC	self, word, max_length=4, var='American', reverse=False, zero_pad=True	Wrong hanging indentation before block (add 4 spaces).
