Completed
Push — master (14a933...449757) by Chris, created 09:19

abydos.stemmer.paice_husk()   F

Complexity

Conditions 19

Size

Total Lines 180
Code Lines 154

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric Value
cc 19
eloc 154
nop 1
dl 0
loc 180
rs 0.4199
c 0
b 0
f 0

How to fix

Long Method

Small methods make your code easier to understand, particularly when combined with a good name. And if a method is small, finding a good name for it is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
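As a hypothetical sketch of that advice (all names here are invented for illustration), the comment above a block becomes the name of the extracted method:

```python
# Before: one long method with a commented sub-step.
def stem_words_long(words):
    # normalize case and strip surrounding whitespace
    cleaned = [w.strip().lower() for w in words]
    return sorted(set(cleaned))


# After: the commented part is extracted into its own method,
# and the comment suggests the method's name.
def normalize_words(words):
    """Normalize case and strip surrounding whitespace."""
    return [w.strip().lower() for w in words]


def stem_words(words):
    return sorted(set(normalize_words(words)))
```

Both versions behave identically; the extracted version simply gives the sub-step a name, so the caller reads as a sentence.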

Commonly applied refactorings include Extract Method.

Complexity

Complex classes like abydos.stemmer.paice_husk() often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to finding such a component is to look for fields/methods that share the same prefixes or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
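As a hypothetical sketch of Extract Class applied to this module (class and method placement are invented for illustration): the `cond_*` predicates nested inside `lovins()` share a common prefix, so they form exactly the kind of cohesive component described above and could be grouped into their own class:

```python
# Hypothetical Extract Class: the cond_* predicates inside lovins()
# share the 'cond_' prefix, marking them as one cohesive component.
class LovinsConditions:
    """Lovins' ending-removal conditions, extracted as one component."""

    def cond_b(self, word, suffix_len):
        """Condition B: the remaining stem must be at least 3 letters."""
        return len(word) - suffix_len >= 3

    def cond_c(self, word, suffix_len):
        """Condition C: the remaining stem must be at least 4 letters."""
        return len(word) - suffix_len >= 4


# The stemmer would then hold one instance instead of two dozen
# nested functions.
conds = LovinsConditions()
```

This shrinks the original function and gives the condition table a home that can be unit-tested in isolation.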

# -*- coding: utf-8 -*-

# Copyright 2014-2018 by Christopher C. Little.
# This file is part of Abydos.
#
# Abydos is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Abydos is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.

"""abydos.stemmer.

The stemmer module defines word stemmers including:

    - the Lovins stemmer
    - the Porter and Porter2 (Snowball English) stemmers
    - Snowball stemmers for German, Dutch, Norwegian, Swedish, and Danish
    - CLEF German, German plus, and Swedish stemmers
    - Caumann's German stemmer
    - UEA-Lite stemmer
"""

from __future__ import unicode_literals

import re
import unicodedata

from six import text_type
from six.moves import range


def lovins(word):
    """Return Lovins stem.

    Lovins stemmer

    The Lovins stemmer is described in Julie Beth Lovins's article at:
    http://www.mt-archive.info/MT-1968-Lovins.pdf

    :param word: the word to stem
    :returns: word stem
    :rtype: string

    >>> lovins('reading')
    'read'
    >>> lovins('suspension')
    'suspens'
    >>> lovins('elusiveness')
    'elus'
    """
    # pylint: disable=too-many-branches, too-many-locals

    # lowercase, normalize, and compose
    word = unicodedata.normalize('NFC', text_type(word.lower()))

    def cond_b(word, suffix_len):
        """Return Lovins' condition B."""
        return len(word)-suffix_len >= 3

    def cond_c(word, suffix_len):
        """Return Lovins' condition C."""
        return len(word)-suffix_len >= 4

    def cond_d(word, suffix_len):
        """Return Lovins' condition D."""
        return len(word)-suffix_len >= 5

    def cond_e(word, suffix_len):
        """Return Lovins' condition E."""
        return word[-suffix_len-1] != 'e'

    def cond_f(word, suffix_len):
        """Return Lovins' condition F."""
        return (len(word)-suffix_len >= 3 and
                word[-suffix_len-1] != 'e')

    def cond_g(word, suffix_len):
        """Return Lovins' condition G."""
        return (len(word)-suffix_len >= 3 and
                word[-suffix_len-1] == 'f')

    def cond_h(word, suffix_len):
        """Return Lovins' condition H."""
        return (word[-suffix_len-1] == 't' or
                word[-suffix_len-2:-suffix_len] == 'll')

    def cond_i(word, suffix_len):
        """Return Lovins' condition I."""
        return word[-suffix_len-1] not in {'e', 'o'}

    def cond_j(word, suffix_len):
        """Return Lovins' condition J."""
        return word[-suffix_len-1] not in {'a', 'e'}

    def cond_k(word, suffix_len):
        """Return Lovins' condition K."""
        return (len(word)-suffix_len >= 3 and
                (word[-suffix_len-1] in {'i', 'l'} or
                 (word[-suffix_len-3] == 'u' and word[-suffix_len-1] == 'e')))

    def cond_l(word, suffix_len):
        """Return Lovins' condition L."""
        return (word[-suffix_len-1] not in {'s', 'u', 'x'} or
                word[-suffix_len-2:-suffix_len] == 'os')

    def cond_m(word, suffix_len):
        """Return Lovins' condition M."""
        return word[-suffix_len-1] not in {'a', 'c', 'e', 'm'}

    def cond_n(word, suffix_len):
        """Return Lovins' condition N."""
        if len(word)-suffix_len >= 3:
            if word[-suffix_len-3] == 's':
                if len(word)-suffix_len >= 4:
                    return True
            else:
                return True
        return False

    def cond_o(word, suffix_len):
        """Return Lovins' condition O."""
        return word[-suffix_len-1] in {'i', 'l'}

    def cond_p(word, suffix_len):
        """Return Lovins' condition P."""
        return word[-suffix_len-1] != 'c'

    def cond_q(word, suffix_len):
        """Return Lovins' condition Q."""
        return (len(word)-suffix_len >= 3 and
                word[-suffix_len-1] not in {'l', 'n'})

    def cond_r(word, suffix_len):
        """Return Lovins' condition R."""
        return word[-suffix_len-1] in {'n', 'r'}

    def cond_s(word, suffix_len):
        """Return Lovins' condition S."""
        return (word[-suffix_len-2:-suffix_len] == 'dr' or
                (word[-suffix_len-1] == 't' and
                 word[-suffix_len-2:-suffix_len] != 'tt'))

    def cond_t(word, suffix_len):
        """Return Lovins' condition T."""
        return (word[-suffix_len-1] in {'s', 't'} and
                word[-suffix_len-2:-suffix_len] != 'ot')

    def cond_u(word, suffix_len):
        """Return Lovins' condition U."""
        return word[-suffix_len-1] in {'l', 'm', 'n', 'r'}

    def cond_v(word, suffix_len):
        """Return Lovins' condition V."""
        return word[-suffix_len-1] == 'c'

    def cond_w(word, suffix_len):
        """Return Lovins' condition W."""
        return word[-suffix_len-1] not in {'s', 'u'}

    def cond_x(word, suffix_len):
        """Return Lovins' condition X."""
        return (word[-suffix_len-1] in {'i', 'l'} or
                (word[-suffix_len-3] == 'u' and
                 word[-suffix_len-1] == 'e'))

    def cond_y(word, suffix_len):
        """Return Lovins' condition Y."""
        return word[-suffix_len-2:-suffix_len] == 'in'

    def cond_z(word, suffix_len):
        """Return Lovins' condition Z."""
        return word[-suffix_len-1] != 'f'

    def cond_aa(word, suffix_len):
        """Return Lovins' condition AA."""
        return (word[-suffix_len-1] in {'d', 'f', 'l', 't'} or
                word[-suffix_len-2:-suffix_len] in {'ph', 'th', 'er', 'or',
                                                    'es'})

    def cond_bb(word, suffix_len):
        """Return Lovins' condition BB."""
        return (len(word)-suffix_len >= 3 and
                word[-suffix_len-3:-suffix_len] != 'met' and
                word[-suffix_len-4:-suffix_len] != 'ryst')

    def cond_cc(word, suffix_len):
        """Return Lovins' condition CC."""
        return word[-suffix_len-1] == 'l'

    suffix = {'alistically': cond_b, 'arizability': None,
              'izationally': cond_b, 'antialness': None,
              'arisations': None, 'arizations': None, 'entialness': None,
              'allically': cond_c, 'antaneous': None, 'antiality': None,
              'arisation': None, 'arization': None, 'ationally': cond_b,
              'ativeness': None, 'eableness': cond_e, 'entations': None,
              'entiality': None, 'entialize': None, 'entiation': None,
              'ionalness': None, 'istically': None, 'itousness': None,
              'izability': None, 'izational': None, 'ableness': None,
              'arizable': None, 'entation': None, 'entially': None,
              'eousness': None, 'ibleness': None, 'icalness': None,
              'ionalism': None, 'ionality': None, 'ionalize': None,
              'iousness': None, 'izations': None, 'lessness': None,
              'ability': None, 'aically': None, 'alistic': cond_b,
              'alities': None, 'ariness': cond_e, 'aristic': None,
              'arizing': None, 'ateness': None, 'atingly': None,
              'ational': cond_b, 'atively': None, 'ativism': None,
              'elihood': cond_e, 'encible': None, 'entally': None,
              'entials': None, 'entiate': None, 'entness': None,
              'fulness': None, 'ibility': None, 'icalism': None,
              'icalist': None, 'icality': None, 'icalize': None,
              'ication': cond_g, 'icianry': None, 'ination': None,
              'ingness': None, 'ionally': None, 'isation': None,
              'ishness': None, 'istical': None, 'iteness': None,
              'iveness': None, 'ivistic': None, 'ivities': None,
              'ization': cond_f, 'izement': None, 'oidally': None,
              'ousness': None, 'aceous': None, 'acious': cond_b,
              'action': cond_g, 'alness': None, 'ancial': None,
              'ancies': None, 'ancing': cond_b, 'ariser': None,
              'arized': None, 'arizer': None, 'atable': None,
              'ations': cond_b, 'atives': None, 'eature': cond_z,
              'efully': None, 'encies': None, 'encing': None,
              'ential': None, 'enting': cond_c, 'entist': None,
              'eously': None, 'ialist': None, 'iality': None,
              'ialize': None, 'ically': None, 'icance': None,
              'icians': None, 'icists': None, 'ifully': None,
              'ionals': None, 'ionate': cond_d, 'ioning': None,
              'ionist': None, 'iously': None, 'istics': None,
              'izable': cond_e, 'lessly': None, 'nesses': None,
              'oidism': None, 'acies': None, 'acity': None,
              'aging': cond_b, 'aical': None, 'alist': None,
              'alism': cond_b, 'ality': None, 'alize': None,
              'allic': cond_bb, 'anced': cond_b, 'ances': cond_b,
              'antic': cond_c, 'arial': None, 'aries': None,
              'arily': None, 'arity': cond_b, 'arize': None,
              'aroid': None, 'ately': None, 'ating': cond_i,
              'ation': cond_b, 'ative': None, 'ators': None,
              'atory': None, 'ature': cond_e, 'early': cond_y,
              'ehood': None, 'eless': None, 'elity': None,
              'ement': None, 'enced': None, 'ences': None,
              'eness': cond_e, 'ening': cond_e, 'ental': None,
              'ented': cond_c, 'ently': None, 'fully': None,
              'ially': None, 'icant': None, 'ician': None,
              'icide': None, 'icism': None, 'icist': None,
              'icity': None, 'idine': cond_i, 'iedly': None,
              'ihood': None, 'inate': None, 'iness': None,
              'ingly': cond_b, 'inism': cond_j, 'inity': cond_cc,
              'ional': None, 'ioned': None, 'ished': None,
              'istic': None, 'ities': None, 'itous': None,
              'ively': None, 'ivity': None, 'izers': cond_f,
              'izing': cond_f, 'oidal': None, 'oides': None,
              'otide': None, 'ously': None, 'able': None, 'ably': None,
              'ages': cond_b, 'ally': cond_b, 'ance': cond_b, 'ancy': cond_b,
              'ants': cond_b, 'aric': None, 'arly': cond_k, 'ated': cond_i,
              'ates': None, 'atic': cond_b, 'ator': None, 'ealy': cond_y,
              'edly': cond_e, 'eful': None, 'eity': None, 'ence': None,
              'ency': None, 'ened': cond_e, 'enly': cond_e, 'eous': None,
              'hood': None, 'ials': None, 'ians': None, 'ible': None,
              'ibly': None, 'ical': None, 'ides': cond_l, 'iers': None,
              'iful': None, 'ines': cond_m, 'ings': cond_n, 'ions': cond_b,
              'ious': None, 'isms': cond_b, 'ists': None, 'itic': cond_h,
              'ized': cond_f, 'izer': cond_f, 'less': None, 'lily': None,
              'ness': None, 'ogen': None, 'ward': None, 'wise': None,
              'ying': cond_b, 'yish': None, 'acy': None, 'age': cond_b,
              'aic': None, 'als': cond_bb, 'ant': cond_b, 'ars': cond_o,
              'ary': cond_f, 'ata': None, 'ate': None, 'eal': cond_y,
              'ear': cond_y, 'ely': cond_e, 'ene': cond_e, 'ent': cond_c,
              'ery': cond_e, 'ese': None, 'ful': None, 'ial': None,
              'ian': None, 'ics': None, 'ide': cond_l, 'ied': None,
              'ier': None, 'ies': cond_p, 'ily': None, 'ine': cond_m,
              'ing': cond_n, 'ion': cond_q, 'ish': cond_c, 'ism': cond_b,
              'ist': None, 'ite': cond_aa, 'ity': None, 'ium': None,
              'ive': None, 'ize': cond_f, 'oid': None, 'one': cond_r,
              'ous': None, 'ae': None, 'al': cond_bb, 'ar': cond_x,
              'as': cond_b, 'ed': cond_e, 'en': cond_f, 'es': cond_e,
              'ia': None, 'ic': None, 'is': None, 'ly': cond_b,
              'on': cond_s, 'or': cond_t, 'um': cond_u, 'us': cond_v,
              'yl': cond_r, '\'s': None, 's\'': None, 'a': None,
              'e': None, 'i': None, 'o': None, 's': cond_w, 'y': cond_b}

    for suffix_len in range(11, 0, -1):
        ending = word[-suffix_len:]
        if (ending in suffix and
                len(word)-suffix_len >= 2 and
                (suffix[ending] is None or
                 suffix[ending](word, suffix_len))):
            word = word[:-suffix_len]
            break

    def recode9(stem):
        """Return Lovins' conditional recode rule 9."""
        if stem[-3:-2] in {'a', 'i', 'o'}:
            return stem
        return stem[:-2]+'l'

    def recode24(stem):
        """Return Lovins' conditional recode rule 24."""
        if stem[-4:-3] == 's':
            return stem
        return stem[:-1]+'s'

    def recode28(stem):
        """Return Lovins' conditional recode rule 28."""
        if stem[-4:-3] in {'p', 't'}:
            return stem
        return stem[:-1]+'s'

    def recode30(stem):
        """Return Lovins' conditional recode rule 30."""
        if stem[-4:-3] == 'm':
            return stem
        return stem[:-1]+'s'

    def recode32(stem):
        """Return Lovins' conditional recode rule 32."""
        if stem[-3:-2] == 'n':
            return stem
        return stem[:-1]+'s'

    if word[-2:] in {'bb', 'dd', 'gg', 'll', 'mm', 'nn', 'pp', 'rr', 'ss',
                     'tt'}:
        word = word[:-1]

    recode = (('iev', 'ief'),
              ('uct', 'uc'),
              ('umpt', 'um'),
              ('rpt', 'rb'),
              ('urs', 'ur'),
              ('istr', 'ister'),
              ('metr', 'meter'),
              ('olv', 'olut'),
              ('ul', recode9),
              ('bex', 'bic'),
              ('dex', 'dic'),
              ('pex', 'pic'),
              ('tex', 'tic'),
              ('ax', 'ac'),
              ('ex', 'ec'),
              ('ix', 'ic'),
              ('lux', 'luc'),
              ('uad', 'uas'),
              ('vad', 'vas'),
              ('cid', 'cis'),
              ('lid', 'lis'),
              ('erid', 'eris'),
              ('pand', 'pans'),
              ('end', recode24),
              ('ond', 'ons'),
              ('lud', 'lus'),
              ('rud', 'rus'),
              ('her', recode28),
              ('mit', 'mis'),
              ('ent', recode30),
              ('ert', 'ers'),
              ('et', recode32),
              ('yt', 'ys'),
              ('yz', 'ys'))

    for ending, replacement in recode:
        if word.endswith(ending):
            if callable(replacement):
                word = replacement(word)
            else:
                word = word[:-len(ending)] + replacement

    return word


def _m_degree(term, vowels):
    """Return Porter helper function _m_degree value.

    m-degree is equal to the number of V to C transitions

    :param term: the word for which to calculate the m-degree
    :param vowels: the set of vowels in the language
    :returns: the m-degree as defined in the Porter stemmer definition
    """
    mdeg = 0
    last_was_vowel = False
    for letter in term:
        if letter in vowels:
            last_was_vowel = True
        else:
            if last_was_vowel:
                mdeg += 1
            last_was_vowel = False
    return mdeg


def _sb_has_vowel(term, vowels):
    """Return Porter helper function _sb_has_vowel value.

    :param term: the word to scan for vowels
    :param vowels: the set of vowels in the language
    :returns: true iff a vowel exists in the term (as defined in the Porter
        stemmer definition)
    """
    for letter in term:
        if letter in vowels:
            return True
    return False


def _ends_in_doubled_cons(term, vowels):
    """Return Porter helper function _ends_in_doubled_cons value.

    :param term: the word to check for a final doubled consonant
    :param vowels: the set of vowels in the language
    :returns: true iff the stem ends in a doubled consonant (as defined in the
        Porter stemmer definition)
    """
    if len(term) > 1 and term[-1] not in vowels and term[-2] == term[-1]:
        return True
    return False


def _ends_in_cvc(term, vowels):
    """Return Porter helper function _ends_in_cvc value.

    :param term: the word to scan for cvc
    :param vowels: the set of vowels in the language
    :returns: true iff the stem ends in cvc (as defined in the Porter stemmer
        definition)
    """
    if len(term) > 2 and (term[-1] not in vowels and
                          term[-2] in vowels and
                          term[-3] not in vowels and
                          term[-1] not in tuple('wxY')):
        return True
    return False


def porter(word, early_english=False):
    """Return Porter stem.

    The Porter stemmer is defined at:
    http://snowball.tartarus.org/algorithms/porter/stemmer.html

    :param word: the word to calculate the stem of
    :param early_english: set to True in order to remove -eth & -est (2nd & 3rd
        person singular verbal agreement suffixes)
    :returns: word stem
    :rtype: str

    >>> porter('reading')
    'read'
    >>> porter('suspension')
    'suspens'
    >>> porter('elusiveness')
    'elus'

    >>> porter('eateth', early_english=True)
    'eat'
    """
    # pylint: disable=too-many-branches

    # lowercase, normalize, and compose
    word = unicodedata.normalize('NFC', text_type(word.lower()))

    # Return word if stem is shorter than 2
    if len(word) < 3:
        return word

    _vowels = {'a', 'e', 'i', 'o', 'u', 'y'}
    # Re-map consonantal y to Y (Y will be C, y will be V)
    if word[0] == 'y':
        word = 'Y' + word[1:]
    for i in range(1, len(word)):
        if word[i] == 'y' and word[i-1] in _vowels:
            word = word[:i] + 'Y' + word[i+1:]

    # Step 1a
    if word[-1] == 's':
        if word[-4:] == 'sses':
            word = word[:-2]
        elif word[-3:] == 'ies':
            word = word[:-2]
        elif word[-2:] == 'ss':
            pass
        else:
            word = word[:-1]

    # Step 1b
    step1b_flag = False
    if word[-3:] == 'eed':
        if _m_degree(word[:-3], _vowels) > 0:
            word = word[:-1]
    elif word[-2:] == 'ed':
        if _sb_has_vowel(word[:-2], _vowels):
            word = word[:-2]
            step1b_flag = True
    elif word[-3:] == 'ing':
        if _sb_has_vowel(word[:-3], _vowels):
            word = word[:-3]
            step1b_flag = True
    elif early_english:
        if word[-3:] == 'est':
            if _sb_has_vowel(word[:-3], _vowels):
                word = word[:-3]
                step1b_flag = True
        elif word[-3:] == 'eth':
            if _sb_has_vowel(word[:-3], _vowels):
                word = word[:-3]
                step1b_flag = True

    if step1b_flag:
        if word[-2:] in {'at', 'bl', 'iz'}:
            word += 'e'
        elif (_ends_in_doubled_cons(word, _vowels) and
              word[-1] not in {'l', 's', 'z'}):
            word = word[:-1]
        elif _m_degree(word, _vowels) == 1 and _ends_in_cvc(word, _vowels):
            word += 'e'

    # Step 1c
    if word[-1] in {'Y', 'y'} and _sb_has_vowel(word[:-1], _vowels):
        word = word[:-1] + 'i'

    # Step 2
    if len(word) > 1:
        if word[-2] == 'a':
            if word[-7:] == 'ational':
                if _m_degree(word[:-7], _vowels) > 0:
                    word = word[:-5] + 'e'
            elif word[-6:] == 'tional':
                if _m_degree(word[:-6], _vowels) > 0:
                    word = word[:-2]
        elif word[-2] == 'c':
            if word[-4:] in {'enci', 'anci'}:
                if _m_degree(word[:-4], _vowels) > 0:
                    word = word[:-1] + 'e'
        elif word[-2] == 'e':
            if word[-4:] == 'izer':
                if _m_degree(word[:-4], _vowels) > 0:
                    word = word[:-1]
        elif word[-2] == 'g':
            if word[-4:] == 'logi':
                if _m_degree(word[:-4], _vowels) > 0:
                    word = word[:-1]
        elif word[-2] == 'l':
            if word[-3:] == 'bli':
                if _m_degree(word[:-3], _vowels) > 0:
                    word = word[:-1] + 'e'
            elif word[-4:] == 'alli':
                if _m_degree(word[:-4], _vowels) > 0:
                    word = word[:-2]
            elif word[-5:] == 'entli':
                if _m_degree(word[:-5], _vowels) > 0:
                    word = word[:-2]
            elif word[-3:] == 'eli':
                if _m_degree(word[:-3], _vowels) > 0:
                    word = word[:-2]
            elif word[-5:] == 'ousli':
                if _m_degree(word[:-5], _vowels) > 0:
                    word = word[:-2]
        elif word[-2] == 'o':
            if word[-7:] == 'ization':
                if _m_degree(word[:-7], _vowels) > 0:
                    word = word[:-5] + 'e'
            elif word[-5:] == 'ation':
                if _m_degree(word[:-5], _vowels) > 0:
                    word = word[:-3] + 'e'
            elif word[-4:] == 'ator':
                if _m_degree(word[:-4], _vowels) > 0:
                    word = word[:-2] + 'e'
        elif word[-2] == 's':
            if word[-5:] == 'alism':
                if _m_degree(word[:-5], _vowels) > 0:
                    word = word[:-3]
            elif word[-7:] in {'iveness', 'fulness', 'ousness'}:
                if _m_degree(word[:-7], _vowels) > 0:
                    word = word[:-4]
        elif word[-2] == 't':
            if word[-5:] == 'aliti':
                if _m_degree(word[:-5], _vowels) > 0:
                    word = word[:-3]
            elif word[-5:] == 'iviti':
                if _m_degree(word[:-5], _vowels) > 0:
                    word = word[:-3] + 'e'
            elif word[-6:] == 'biliti':
                if _m_degree(word[:-6], _vowels) > 0:
                    word = word[:-5] + 'le'

    # Step 3
    if word[-5:] == 'icate':
        if _m_degree(word[:-5], _vowels) > 0:
            word = word[:-3]
    elif word[-5:] == 'ative':
        if _m_degree(word[:-5], _vowels) > 0:
            word = word[:-5]
    elif word[-5:] in {'alize', 'iciti'}:
        if _m_degree(word[:-5], _vowels) > 0:
            word = word[:-3]
    elif word[-4:] == 'ical':
        if _m_degree(word[:-4], _vowels) > 0:
            word = word[:-2]
    elif word[-3:] == 'ful':
        if _m_degree(word[:-3], _vowels) > 0:
            word = word[:-3]
    elif word[-4:] == 'ness':
        if _m_degree(word[:-4], _vowels) > 0:
            word = word[:-4]

    # Step 4
    if word[-2:] == 'al':
        if _m_degree(word[:-2], _vowels) > 1:
            word = word[:-2]
    elif word[-4:] == 'ance':
        if _m_degree(word[:-4], _vowels) > 1:
            word = word[:-4]
    elif word[-4:] == 'ence':
        if _m_degree(word[:-4], _vowels) > 1:
            word = word[:-4]
    elif word[-2:] == 'er':
        if _m_degree(word[:-2], _vowels) > 1:
            word = word[:-2]
    elif word[-2:] == 'ic':
        if _m_degree(word[:-2], _vowels) > 1:
            word = word[:-2]
    elif word[-4:] == 'able':
        if _m_degree(word[:-4], _vowels) > 1:
            word = word[:-4]
    elif word[-4:] == 'ible':
        if _m_degree(word[:-4], _vowels) > 1:
            word = word[:-4]
    elif word[-3:] == 'ant':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-5:] == 'ement':
        if _m_degree(word[:-5], _vowels) > 1:
            word = word[:-5]
    elif word[-4:] == 'ment':
        if _m_degree(word[:-4], _vowels) > 1:
            word = word[:-4]
    elif word[-3:] == 'ent':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-4:] in {'sion', 'tion'}:
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-2:] == 'ou':
        if _m_degree(word[:-2], _vowels) > 1:
            word = word[:-2]
    elif word[-3:] == 'ism':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-3:] == 'ate':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-3:] == 'iti':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-3:] == 'ous':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-3:] == 'ive':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-3:] == 'ize':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]

    # Step 5a
    if word[-1] == 'e':
        if _m_degree(word[:-1], _vowels) > 1:
            word = word[:-1]
        elif (_m_degree(word[:-1], _vowels) == 1 and
              not _ends_in_cvc(word[:-1], _vowels)):
            word = word[:-1]

    # Step 5b
    if word[-2:] == 'll' and _m_degree(word, _vowels) > 1:
        word = word[:-1]

    # Change 'Y' back to 'y' if it survived stemming
    for i in range(len(word)):
        if word[i] == 'Y':
            word = word[:i] + 'y' + word[i+1:]

    return word


def _sb_r1(term, vowels, r1_prefixes=None):
    """Return the R1 region, as defined in the Porter2 specification."""
    vowel_found = False
    if hasattr(r1_prefixes, '__iter__'):
        for prefix in r1_prefixes:
            if term[:len(prefix)] == prefix:
                return len(prefix)

    for i in range(len(term)):
        if not vowel_found and term[i] in vowels:
            vowel_found = True
        elif vowel_found and term[i] not in vowels:
            return i + 1
    return len(term)


def _sb_r2(term, vowels, r1_prefixes=None):
    """Return the R2 region, as defined in the Porter2 specification."""
    r1_start = _sb_r1(term, vowels, r1_prefixes)
    return r1_start + _sb_r1(term[r1_start:], vowels)


def _sb_ends_in_short_syllable(term, vowels, codanonvowels):
    """Return True iff term ends in a short syllable.

    (...according to the Porter2 specification.)

    NB: This is akin to the CVC test from the Porter stemmer. The description
    is unfortunately poor/ambiguous.
    """
    if not term:
        return False
    if len(term) == 2:
        if term[-2] in vowels and term[-1] not in vowels:
            return True
    elif len(term) >= 3:
        if ((term[-3] not in vowels and term[-2] in vowels and
             term[-1] in codanonvowels)):
            return True
    return False


def _sb_short_word(term, vowels, codanonvowels, r1_prefixes=None):
    """Return True iff term is a short word.

    (...according to the Porter2 specification.)
    """
    if ((_sb_r1(term, vowels, r1_prefixes) == len(term) and
         _sb_ends_in_short_syllable(term, vowels, codanonvowels))):
        return True
    return False


def porter2(word, early_english=False):
    """Return the Porter2 (Snowball English) stem.

    The Porter2 (Snowball English) stemmer is defined at:
    http://snowball.tartarus.org/algorithms/english/stemmer.html

    :param word: the word to calculate the stem of
    :param early_english: set to True in order to remove -eth & -est (2nd & 3rd
        person singular verbal agreement suffixes)
    :returns: word stem
    :rtype: str

    >>> porter2('reading')
    'read'
    >>> porter2('suspension')
    'suspens'
    >>> porter2('elusiveness')
    'elus'

    >>> porter2('eateth', early_english=True)
    'eat'
    """
    # pylint: disable=too-many-branches
    # pylint: disable=too-many-return-statements

    _vowels = {'a', 'e', 'i', 'o', 'u', 'y'}
    _codanonvowels = {"'", 'b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm',
                      'n', 'p', 'q', 'r', 's', 't', 'v', 'z'}
    _doubles = {'bb', 'dd', 'ff', 'gg', 'mm', 'nn', 'pp', 'rr', 'tt'}
    _li = {'c', 'd', 'e', 'g', 'h', 'k', 'm', 'n', 'r', 't'}

    # R1 prefixes should be in order from longest to shortest to prevent
    # masking
    _r1_prefixes = ('commun', 'gener', 'arsen')
    _exception1dict = {  # special changes:
        'skis': 'ski', 'skies': 'sky', 'dying': 'die',
        'lying': 'lie', 'tying': 'tie',
        # special -LY cases:
        'idly': 'idl', 'gently': 'gentl', 'ugly': 'ugli',
        'early': 'earli', 'only': 'onli', 'singly': 'singl'}
    _exception1set = {'sky', 'news', 'howe', 'atlas', 'cosmos', 'bias',
                      'andes'}
    _exception2set = {'inning', 'outing', 'canning', 'herring', 'earring',
                      'proceed', 'exceed', 'succeed'}

    # lowercase, normalize, and compose
    word = unicodedata.normalize('NFC', text_type(word.lower()))
790
    # replace apostrophe-like characters with U+0027, per
791
    # http://snowball.tartarus.org/texts/apostrophe.html
792
    word = word.replace('’', '\'')
793
    word = word.replace('’', '\'')
794
795
    # Exceptions 1
796
    if word in _exception1dict:
0 ignored issues
show
unused-code introduced by
Unnecessary "elif" after "return"
Loading history...
797
        return _exception1dict[word]
798
    elif word in _exception1set:
799
        return word
800
801
    # Return word if stem is shorter than 3
802
    if len(word) < 3:
803
        return word
804
805
    # Remove initial ', if present.
806
    while word and word[0] == '\'':
807
        word = word[1:]
808
        # Return word if stem is shorter than 2
809
        if len(word) < 2:
810
            return word
811
812
    # Re-map vocalic Y to y (Y will be C, y will be V)
813
    if word[0] == 'y':
814
        word = 'Y' + word[1:]
815
    for i in range(1, len(word)):
816
        if word[i] == 'y' and word[i-1] in _vowels:
817
            word = word[:i] + 'Y' + word[i+1:]
818
819
    r1_start = _sb_r1(word, _vowels, _r1_prefixes)
820
    r2_start = _sb_r2(word, _vowels, _r1_prefixes)
821
822
    # Step 0
823
    if word[-3:] == '\'s\'':
824
        word = word[:-3]
825
    elif word[-2:] == '\'s':
826
        word = word[:-2]
827
    elif word[-1:] == '\'':
828
        word = word[:-1]
829
    # Return word if stem is shorter than 3
    if len(word) < 3:
        return word

    # Step 1a
    if word[-4:] == 'sses':
        word = word[:-2]
    elif word[-3:] in {'ied', 'ies'}:
        if len(word) > 4:
            word = word[:-2]
        else:
            word = word[:-1]
    elif word[-2:] in {'us', 'ss'}:
        pass
    elif word[-1] == 's':
        if _sb_has_vowel(word[:-2], _vowels):
            word = word[:-1]

    # Exceptions 2
    if word in _exception2set:
        return word

    # Step 1b
    step1b_flag = False
    if word[-5:] == 'eedly':
        if len(word[r1_start:]) >= 5:
            word = word[:-3]
    elif word[-5:] == 'ingly':
        if _sb_has_vowel(word[:-5], _vowels):
            word = word[:-5]
            step1b_flag = True
    elif word[-4:] == 'edly':
        if _sb_has_vowel(word[:-4], _vowels):
            word = word[:-4]
            step1b_flag = True
    elif word[-3:] == 'eed':
        if len(word[r1_start:]) >= 3:
            word = word[:-1]
    elif word[-3:] == 'ing':
        if _sb_has_vowel(word[:-3], _vowels):
            word = word[:-3]
            step1b_flag = True
    elif word[-2:] == 'ed':
        if _sb_has_vowel(word[:-2], _vowels):
            word = word[:-2]
            step1b_flag = True
    elif early_english:
        if word[-3:] == 'est':
            if _sb_has_vowel(word[:-3], _vowels):
                word = word[:-3]
                step1b_flag = True
        elif word[-3:] == 'eth':
            if _sb_has_vowel(word[:-3], _vowels):
                word = word[:-3]
                step1b_flag = True

    if step1b_flag:
        if word[-2:] in {'at', 'bl', 'iz'}:
            word += 'e'
        elif word[-2:] in _doubles:
            word = word[:-1]
        elif _sb_short_word(word, _vowels, _codanonvowels, _r1_prefixes):
            word += 'e'

    # Step 1c
    if ((len(word) > 2 and word[-1] in {'Y', 'y'} and
         word[-2] not in _vowels)):
        word = word[:-1] + 'i'

    # Step 2
    if word[-2] == 'a':
        if word[-7:] == 'ational':
            if len(word[r1_start:]) >= 7:
                word = word[:-5] + 'e'
        elif word[-6:] == 'tional':
            if len(word[r1_start:]) >= 6:
                word = word[:-2]
    elif word[-2] == 'c':
        if word[-4:] in {'enci', 'anci'}:
            if len(word[r1_start:]) >= 4:
                word = word[:-1] + 'e'
    elif word[-2] == 'e':
        if word[-4:] == 'izer':
            if len(word[r1_start:]) >= 4:
                word = word[:-1]
    elif word[-2] == 'g':
        if word[-3:] == 'ogi':
            if ((r1_start >= 1 and len(word[r1_start:]) >= 3 and
                 word[-4] == 'l')):
                word = word[:-1]
    elif word[-2] == 'l':
        if word[-6:] == 'lessli':
            if len(word[r1_start:]) >= 6:
                word = word[:-2]
        elif word[-5:] in {'entli', 'fulli', 'ousli'}:
            if len(word[r1_start:]) >= 5:
                word = word[:-2]
        elif word[-4:] == 'abli':
            if len(word[r1_start:]) >= 4:
                word = word[:-1] + 'e'
        elif word[-4:] == 'alli':
            if len(word[r1_start:]) >= 4:
                word = word[:-2]
        elif word[-3:] == 'bli':
            if len(word[r1_start:]) >= 3:
                word = word[:-1] + 'e'
        elif word[-2:] == 'li':
            if ((r1_start >= 1 and len(word[r1_start:]) >= 2 and
                 word[-3] in _li)):
                word = word[:-2]
    elif word[-2] == 'o':
        if word[-7:] == 'ization':
            if len(word[r1_start:]) >= 7:
                word = word[:-5] + 'e'
        elif word[-5:] == 'ation':
            if len(word[r1_start:]) >= 5:
                word = word[:-3] + 'e'
        elif word[-4:] == 'ator':
            if len(word[r1_start:]) >= 4:
                word = word[:-2] + 'e'
    elif word[-2] == 's':
        if word[-7:] in {'fulness', 'ousness', 'iveness'}:
            if len(word[r1_start:]) >= 7:
                word = word[:-4]
        elif word[-5:] == 'alism':
            if len(word[r1_start:]) >= 5:
                word = word[:-3]
    elif word[-2] == 't':
        if word[-6:] == 'biliti':
            if len(word[r1_start:]) >= 6:
                word = word[:-5] + 'le'
        elif word[-5:] == 'aliti':
            if len(word[r1_start:]) >= 5:
                word = word[:-3]
        elif word[-5:] == 'iviti':
            if len(word[r1_start:]) >= 5:
                word = word[:-3] + 'e'

    # Step 3
    if word[-7:] == 'ational':
        if len(word[r1_start:]) >= 7:
            word = word[:-5] + 'e'
    elif word[-6:] == 'tional':
        if len(word[r1_start:]) >= 6:
            word = word[:-2]
    elif word[-5:] in {'alize', 'icate', 'iciti'}:
        if len(word[r1_start:]) >= 5:
            word = word[:-3]
    elif word[-5:] == 'ative':
        if len(word[r2_start:]) >= 5:
            word = word[:-5]
    elif word[-4:] == 'ical':
        if len(word[r1_start:]) >= 4:
            word = word[:-2]
    elif word[-4:] == 'ness':
        if len(word[r1_start:]) >= 4:
            word = word[:-4]
    elif word[-3:] == 'ful':
        if len(word[r1_start:]) >= 3:
            word = word[:-3]

    # Step 4
    for suffix in ('ement', 'ance', 'ence', 'able', 'ible', 'ment', 'ant',
                   'ent', 'ism', 'ate', 'iti', 'ous', 'ive', 'ize', 'al', 'er',
                   'ic'):
        if word[-len(suffix):] == suffix:
            if len(word[r2_start:]) >= len(suffix):
                word = word[:-len(suffix)]
            break
    else:
        if word[-3:] == 'ion':
            if ((len(word[r2_start:]) >= 3 and len(word) >= 4 and
                 word[-4] in tuple('st'))):
                word = word[:-3]

    # Step 5
    if word[-1] == 'e':
        if (len(word[r2_start:]) >= 1 or
                (len(word[r1_start:]) >= 1 and
                 not _sb_ends_in_short_syllable(word[:-1], _vowels,
                                                _codanonvowels))):
            word = word[:-1]
    elif word[-1] == 'l':
        if len(word[r2_start:]) >= 1 and word[-2] == 'l':
            word = word[:-1]

    # Change 'Y' back to 'y' if it survived stemming
    for i in range(0, len(word)):
        if word[i] == 'Y':
            word = word[:i] + 'y' + word[i+1:]

    return word


def sb_german(word, alternate_vowels=False):
    """Return Snowball German stem.

    The Snowball German stemmer is defined at:
    http://snowball.tartarus.org/algorithms/german/stemmer.html

    :param word: the word to calculate the stem of
    :param alternate_vowels: composes ae as ä, oe as ö, and ue as ü before
        running the algorithm
    :returns: word stem
    :rtype: str

    >>> sb_german('lesen')
    'les'
    >>> sb_german('graues')
    'grau'
    >>> sb_german('buchstabieren')
    'buchstabi'
    """
    # pylint: disable=too-many-branches

    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'ä', 'ö', 'ü'}
    _s_endings = {'b', 'd', 'f', 'g', 'h', 'k', 'l', 'm', 'n', 'r', 't'}
    _st_endings = {'b', 'd', 'f', 'g', 'h', 'k', 'l', 'm', 'n', 't'}

    # lowercase, normalize, and compose
    word = unicodedata.normalize('NFC', word.lower())
    word = word.replace('ß', 'ss')

    if len(word) > 2:
        for i in range(2, len(word)):
            if word[i] in _vowels and word[i-2] in _vowels:
                if word[i-1] == 'u':
                    word = word[:i-1] + 'U' + word[i:]
                elif word[i-1] == 'y':
                    word = word[:i-1] + 'Y' + word[i:]

    if alternate_vowels:
        word = word.replace('ae', 'ä')
        word = word.replace('oe', 'ö')
        word = word.replace('que', 'Q')
        word = word.replace('ue', 'ü')
        word = word.replace('Q', 'que')

    r1_start = max(3, _sb_r1(word, _vowels))
    r2_start = _sb_r2(word, _vowels)

    # Step 1
    niss_flag = False
    if word[-3:] == 'ern':
        if len(word[r1_start:]) >= 3:
            word = word[:-3]
    elif word[-2:] == 'em':
        if len(word[r1_start:]) >= 2:
            word = word[:-2]
    elif word[-2:] == 'er':
        if len(word[r1_start:]) >= 2:
            word = word[:-2]
    elif word[-2:] == 'en':
        if len(word[r1_start:]) >= 2:
            word = word[:-2]
            niss_flag = True
    elif word[-2:] == 'es':
        if len(word[r1_start:]) >= 2:
            word = word[:-2]
            niss_flag = True
    elif word[-1:] == 'e':
        if len(word[r1_start:]) >= 1:
            word = word[:-1]
            niss_flag = True
    elif word[-1:] == 's':
        if ((len(word[r1_start:]) >= 1 and len(word) >= 2 and
             word[-2] in _s_endings)):
            word = word[:-1]

    if niss_flag and word[-4:] == 'niss':
        word = word[:-1]

    # Step 2
    if word[-3:] == 'est':
        if len(word[r1_start:]) >= 3:
            word = word[:-3]
    elif word[-2:] == 'en':
        if len(word[r1_start:]) >= 2:
            word = word[:-2]
    elif word[-2:] == 'er':
        if len(word[r1_start:]) >= 2:
            word = word[:-2]
    elif word[-2:] == 'st':
        if ((len(word[r1_start:]) >= 2 and len(word) >= 6 and
             word[-3] in _st_endings)):
            word = word[:-2]

    # Step 3
    if word[-4:] == 'isch':
        if len(word[r2_start:]) >= 4 and word[-5] != 'e':
            word = word[:-4]
    elif word[-4:] in {'lich', 'heit'}:
        if len(word[r2_start:]) >= 4:
            word = word[:-4]
            if ((word[-2:] in {'er', 'en'} and
                 len(word[r1_start:]) >= 2)):
                word = word[:-2]
    elif word[-4:] == 'keit':
        if len(word[r2_start:]) >= 4:
            word = word[:-4]
            if word[-4:] == 'lich' and len(word[r2_start:]) >= 4:
                word = word[:-4]
            elif word[-2:] == 'ig' and len(word[r2_start:]) >= 2:
                word = word[:-2]
    elif word[-3:] in {'end', 'ung'}:
        if len(word[r2_start:]) >= 3:
            word = word[:-3]
            if ((word[-2:] == 'ig' and len(word[r2_start:]) >= 2 and
                 word[-3] != 'e')):
                word = word[:-2]
    elif word[-2:] in {'ig', 'ik'}:
        if len(word[r2_start:]) >= 2 and word[-3] != 'e':
            word = word[:-2]

    # Change 'Y' and 'U' back to lowercase if survived stemming
    for i in range(0, len(word)):
        if word[i] == 'Y':
            word = word[:i] + 'y' + word[i+1:]
        elif word[i] == 'U':
            word = word[:i] + 'u' + word[i+1:]

    # Remove umlauts
    _umlauts = dict(zip((ord(_) for _ in 'äöü'), 'aou'))
    word = word.translate(_umlauts)

    return word


def sb_dutch(word):
    """Return Snowball Dutch stem.

    The Snowball Dutch stemmer is defined at:
    http://snowball.tartarus.org/algorithms/dutch/stemmer.html

    :param word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> sb_dutch('lezen')
    'lez'
    >>> sb_dutch('opschorting')
    'opschort'
    >>> sb_dutch('ongrijpbaarheid')
    'ongrijp'
    """
    # pylint: disable=too-many-branches

    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'è'}
    _not_s_endings = {'a', 'e', 'i', 'j', 'o', 'u', 'y', 'è'}

    def _undouble(word):
        """Undouble endings -kk, -dd, and -tt."""
        if ((len(word) > 1 and word[-1] == word[-2] and
             word[-1] in {'d', 'k', 't'})):
            return word[:-1]
        return word

    # lowercase, normalize, decompose, filter umlauts & acutes out, and compose
    word = unicodedata.normalize('NFC', text_type(word.lower()))
    _accented = dict(zip((ord(_) for _ in 'äëïöüáéíóú'), 'aeiouaeiou'))
    word = word.translate(_accented)

    for i in range(len(word)):
        if i == 0 and word[0] == 'y':
            word = 'Y' + word[1:]
        elif word[i] == 'y' and word[i-1] in _vowels:
            word = word[:i] + 'Y' + word[i+1:]
        elif (word[i] == 'i' and word[i-1] in _vowels and i+1 < len(word) and
              word[i+1] in _vowels):
            word = word[:i] + 'I' + word[i+1:]

    r1_start = max(3, _sb_r1(word, _vowels))
    r2_start = _sb_r2(word, _vowels)

    # Step 1
    if word[-5:] == 'heden':
        if len(word[r1_start:]) >= 5:
            word = word[:-3] + 'id'
    elif word[-3:] == 'ene':
        if ((len(word[r1_start:]) >= 3 and
             (word[-4] not in _vowels and word[-6:-3] != 'gem'))):
            word = _undouble(word[:-3])
    elif word[-2:] == 'en':
        if ((len(word[r1_start:]) >= 2 and
             (word[-3] not in _vowels and word[-5:-2] != 'gem'))):
            word = _undouble(word[:-2])
    elif word[-2:] == 'se':
        if len(word[r1_start:]) >= 2 and word[-3] not in _not_s_endings:
            word = word[:-2]
    elif word[-1:] == 's':
        if len(word[r1_start:]) >= 1 and word[-2] not in _not_s_endings:
            word = word[:-1]

    # Step 2
    e_removed = False
    if word[-1:] == 'e':
        if len(word[r1_start:]) >= 1 and word[-2] not in _vowels:
            word = _undouble(word[:-1])
            e_removed = True

    # Step 3a
    if word[-4:] == 'heid':
        if len(word[r2_start:]) >= 4 and word[-5] != 'c':
            word = word[:-4]
            if word[-2:] == 'en':
                if ((len(word[r1_start:]) >= 2 and
                     (word[-3] not in _vowels and word[-5:-2] != 'gem'))):
                    word = _undouble(word[:-2])

    # Step 3b
    if word[-4:] == 'lijk':
        if len(word[r2_start:]) >= 4:
            word = word[:-4]
            # Repeat step 2
            if word[-1:] == 'e':
                if len(word[r1_start:]) >= 1 and word[-2] not in _vowels:
                    word = _undouble(word[:-1])
    elif word[-4:] == 'baar':
        if len(word[r2_start:]) >= 4:
            word = word[:-4]
    elif word[-3:] in ('end', 'ing'):
        if len(word[r2_start:]) >= 3:
            word = word[:-3]
            if ((word[-2:] == 'ig' and len(word[r2_start:]) >= 2 and
                 word[-3] != 'e')):
                word = word[:-2]
            else:
                word = _undouble(word)
    elif word[-3:] == 'bar':
        if len(word[r2_start:]) >= 3 and e_removed:
            word = word[:-3]
    elif word[-2:] == 'ig':
        if len(word[r2_start:]) >= 2 and word[-3] != 'e':
            word = word[:-2]

    # Step 4
    if ((len(word) >= 4 and
         word[-3] == word[-2] and word[-2] in {'a', 'e', 'o', 'u'} and
         word[-4] not in _vowels and
         word[-1] not in _vowels and word[-1] != 'I')):
        word = word[:-2] + word[-1]

    # Change 'Y' and 'I' back to lowercase if survived stemming
    for i in range(0, len(word)):
        if word[i] == 'Y':
            word = word[:i] + 'y' + word[i+1:]
        elif word[i] == 'I':
            word = word[:i] + 'i' + word[i+1:]

    return word


def sb_norwegian(word):
    """Return Snowball Norwegian stem.

    The Snowball Norwegian stemmer is defined at:
    http://snowball.tartarus.org/algorithms/norwegian/stemmer.html

    :param word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> sb_norwegian('lese')
    'les'
    >>> sb_norwegian('suspensjon')
    'suspensjon'
    >>> sb_norwegian('sikkerhet')
    'sikker'
    """
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'å', 'æ', 'ø'}
    _s_endings = {'b', 'c', 'd', 'f', 'g', 'h', 'j', 'l', 'm', 'n', 'o', 'p',
                  'r', 't', 'v', 'y', 'z'}
    # lowercase, normalize, and compose
    word = unicodedata.normalize('NFC', text_type(word.lower()))

    r1_start = min(max(3, _sb_r1(word, _vowels)), len(word))

    # Step 1
    _r1 = word[r1_start:]
    if _r1[-7:] == 'hetenes':
        word = word[:-7]
    elif _r1[-6:] in {'hetene', 'hetens'}:
        word = word[:-6]
    elif _r1[-5:] in {'heten', 'heter', 'endes'}:
        word = word[:-5]
    elif _r1[-4:] in {'ande', 'ende', 'edes', 'enes', 'erte'}:
        if word[-4:] == 'erte':
            word = word[:-2]
        else:
            word = word[:-4]
    elif _r1[-3:] in {'ede', 'ane', 'ene', 'ens', 'ers', 'ets', 'het', 'ast',
                      'ert'}:
        if word[-3:] == 'ert':
            word = word[:-1]
        else:
            word = word[:-3]
    elif _r1[-2:] in {'en', 'ar', 'er', 'as', 'es', 'et'}:
        word = word[:-2]
    elif _r1[-1:] in {'a', 'e'}:
        word = word[:-1]
    elif _r1[-1:] == 's':
        if (((len(word) > 1 and word[-2] in _s_endings) or
             (len(word) > 2 and word[-2] == 'k' and word[-3] not in _vowels))):
            word = word[:-1]

    # Step 2
    if word[r1_start:][-2:] in {'dt', 'vt'}:
        word = word[:-1]

    # Step 3
    _r1 = word[r1_start:]
    if _r1[-7:] == 'hetslov':
        word = word[:-7]
    elif _r1[-4:] in {'eleg', 'elig', 'elov', 'slov'}:
        word = word[:-4]
    elif _r1[-3:] in {'leg', 'eig', 'lig', 'els', 'lov'}:
        word = word[:-3]
    elif _r1[-2:] == 'ig':
        word = word[:-2]

    return word


def sb_swedish(word):
    """Return Snowball Swedish stem.

    The Snowball Swedish stemmer is defined at:
    http://snowball.tartarus.org/algorithms/swedish/stemmer.html

    :param word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> sb_swedish('undervisa')
    'undervis'
    >>> sb_swedish('suspension')
    'suspension'
    >>> sb_swedish('visshet')
    'viss'
    """
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'ä', 'å', 'ö'}
    _s_endings = {'b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n',
                  'o', 'p', 'r', 't', 'v', 'y'}

    # lowercase, normalize, and compose
    word = unicodedata.normalize('NFC', text_type(word.lower()))

    r1_start = min(max(3, _sb_r1(word, _vowels)), len(word))

    # Step 1
    _r1 = word[r1_start:]
    if _r1[-7:] == 'heterna':
        word = word[:-7]
    elif _r1[-6:] == 'hetens':
        word = word[:-6]
    elif _r1[-5:] in {'anden', 'heten', 'heter', 'arnas', 'ernas', 'ornas',
                      'andes', 'arens', 'andet'}:
        word = word[:-5]
    elif _r1[-4:] in {'arna', 'erna', 'orna', 'ande', 'arne', 'aste', 'aren',
                      'ades', 'erns'}:
        word = word[:-4]
    elif _r1[-3:] in {'ade', 'are', 'ern', 'ens', 'het', 'ast'}:
        word = word[:-3]
    elif _r1[-2:] in {'ad', 'en', 'ar', 'er', 'or', 'as', 'es', 'at'}:
        word = word[:-2]
    elif _r1[-1:] in {'a', 'e'}:
        word = word[:-1]
    elif _r1[-1:] == 's':
        if len(word) > 1 and word[-2] in _s_endings:
            word = word[:-1]

    # Step 2
    if word[r1_start:][-2:] in {'dd', 'gd', 'nn', 'dt', 'gt', 'kt', 'tt'}:
        word = word[:-1]

    # Step 3
    _r1 = word[r1_start:]
    if _r1[-5:] == 'fullt':
        word = word[:-1]
    elif _r1[-4:] == 'löst':
        word = word[:-1]
    elif _r1[-3:] in {'lig', 'els'}:
        word = word[:-3]
    elif _r1[-2:] == 'ig':
        word = word[:-2]

    return word


def sb_danish(word):
    """Return Snowball Danish stem.

    The Snowball Danish stemmer is defined at:
    http://snowball.tartarus.org/algorithms/danish/stemmer.html

    :param word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> sb_danish('underviser')
    'undervis'
    >>> sb_danish('suspension')
    'suspension'
    >>> sb_danish('sikkerhed')
    'sikker'
    """
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'å', 'æ', 'ø'}
    _s_endings = {'a', 'b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n',
                  'o', 'p', 'r', 't', 'v', 'y', 'z', 'å'}

    # lowercase, normalize, and compose
    word = unicodedata.normalize('NFC', text_type(word.lower()))

    r1_start = min(max(3, _sb_r1(word, _vowels)), len(word))

    # Step 1
    _r1 = word[r1_start:]
    if _r1[-7:] == 'erendes':
        word = word[:-7]
    elif _r1[-6:] in {'erende', 'hedens'}:
        word = word[:-6]
    elif _r1[-5:] in {'ethed', 'erede', 'heden', 'heder', 'endes', 'ernes',
                      'erens', 'erets'}:
        word = word[:-5]
    elif _r1[-4:] in {'ered', 'ende', 'erne', 'eren', 'erer', 'heds', 'enes',
                      'eres', 'eret'}:
        word = word[:-4]
    elif _r1[-3:] in {'hed', 'ene', 'ere', 'ens', 'ers', 'ets'}:
        word = word[:-3]
    elif _r1[-2:] in {'en', 'er', 'es', 'et'}:
        word = word[:-2]
    elif _r1[-1:] == 'e':
        word = word[:-1]
    elif _r1[-1:] == 's':
        if len(word) > 1 and word[-2] in _s_endings:
            word = word[:-1]

    # Step 2
    if word[r1_start:][-2:] in {'gd', 'dt', 'gt', 'kt'}:
        word = word[:-1]

    # Step 3
    if word[-4:] == 'igst':
        word = word[:-2]

    _r1 = word[r1_start:]
    repeat_step2 = False
    if _r1[-4:] == 'elig':
        word = word[:-4]
        repeat_step2 = True
    elif _r1[-4:] == 'løst':
        word = word[:-1]
    elif _r1[-3:] in {'lig', 'els'}:
        word = word[:-3]
        repeat_step2 = True
    elif _r1[-2:] == 'ig':
        word = word[:-2]
        repeat_step2 = True

    if repeat_step2:
        if word[r1_start:][-2:] in {'gd', 'dt', 'gt', 'kt'}:
            word = word[:-1]

    # Step 4
    if ((len(word[r1_start:]) >= 1 and len(word) >= 2 and
         word[-1] == word[-2] and word[-1] not in _vowels)):
        word = word[:-1]

    return word


def clef_german(word):
    """Return CLEF German stem.

    The CLEF German stemmer is defined at:
    http://members.unine.ch/jacques.savoy/clef/germanStemmer.txt

    :param word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> clef_german('lesen')
    'lese'
    >>> clef_german('graues')
    'grau'
    >>> clef_german('buchstabieren')
    'buchstabier'
    """
    # lowercase, normalize, and compose
    word = unicodedata.normalize('NFC', text_type(word.lower()))

    # remove umlauts
    _umlauts = dict(zip((ord(_) for _ in 'äöü'), 'aou'))
    word = word.translate(_umlauts)

    # remove plurals
    wlen = len(word)-1

    if wlen > 3:
        if wlen > 5:
            if word[-3:] == 'nen':
                return word[:-3]
        if wlen > 4:
            if word[-2:] in {'en', 'se', 'es', 'er'}:
                return word[:-2]
        if word[-1] in {'e', 'n', 'r', 's'}:
            return word[:-1]
    return word


def clef_german_plus(word):
    """Return 'CLEF German stemmer plus' stem.

    The CLEF German stemmer plus is defined at:
    http://members.unine.ch/jacques.savoy/clef/germanStemmerPlus.txt

    :param word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> clef_german_plus('lesen')
    'les'
    >>> clef_german_plus('graues')
    'grau'
    >>> clef_german_plus('buchstabieren')
    'buchstabi'
    """
    _st_ending = {'b', 'd', 'f', 'g', 'h', 'k', 'l', 'm', 'n', 't'}

    # lowercase, normalize, and compose
    word = unicodedata.normalize('NFC', text_type(word.lower()))

    # remove umlauts and other accented vowels
    _accents = dict(zip((ord(_) for _ in 'äàáâöòóôïìíîüùúû'),
                        'aaaaooooiiiiuuuu'))
    word = word.translate(_accents)

    # Step 1
    wlen = len(word)-1
    if wlen > 4 and word[-3:] == 'ern':
        word = word[:-3]
    elif wlen > 3 and word[-2:] in {'em', 'en', 'er', 'es'}:
        word = word[:-2]
    elif wlen > 2 and (word[-1] == 'e' or
                       (word[-1] == 's' and word[-2] in _st_ending)):
        word = word[:-1]

    # Step 2
    wlen = len(word)-1
    if wlen > 4 and word[-3:] == 'est':
        word = word[:-3]
    elif wlen > 3 and (word[-2:] in {'er', 'en'} or
                       (word[-2:] == 'st' and word[-3] in _st_ending)):
        word = word[:-2]

    return word


def clef_swedish(word):
    """Return CLEF Swedish stem.

    The CLEF Swedish stemmer is defined at:
    http://members.unine.ch/jacques.savoy/clef/swedishStemmer.txt

    :param word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> clef_swedish('undervisa')
    'undervis'
    >>> clef_swedish('suspension')
    'suspensio'
    >>> clef_swedish('visshet')
    'viss'
    """
    wlen = len(word)-1

    if wlen > 3 and word[-1] == 's':
        word = word[:-1]
        wlen -= 1

    if wlen > 6:
        if word[-5:] in {'elser', 'heten'}:
            return word[:-5]
    if wlen > 5:
        if word[-4:] in {'arne', 'erna', 'ande', 'else', 'aste', 'orna',
                         'aren'}:
            return word[:-4]
    if wlen > 4:
        if word[-3:] in {'are', 'ast', 'het'}:
            return word[:-3]
    if wlen > 3:
        if word[-2:] in {'ar', 'er', 'or', 'en', 'at', 'te', 'et'}:
            return word[:-2]
    if wlen > 2:
        if word[-1] in {'a', 'e', 'n', 't'}:
            return word[:-1]
    return word


def caumanns(word):
    """Return Caumanns German stem.

    Jörg Caumanns' stemmer is described in his article at:
    http://edocs.fu-berlin.de/docs/servlets/MCRFileNodeServlet/FUDOCS_derivate_000000000350/tr-b-99-16.pdf

    This implementation is based on the GermanStemFilter described at:
    http://www.evelix.ch/unternehmen/Blog/evelix/2013/11/11/inner-workings-of-the-german-analyzer-in-lucene

    :param word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> caumanns('lesen')
    'les'
    >>> caumanns('graues')
    'grau'
    >>> caumanns('buchstabieren')
    'buchstabier'
    """
    if not word:
        return ''

    upper_initial = word[0].isupper()
    word = unicodedata.normalize('NFC', text_type(word.lower()))

    # # Part 2: Substitution
    # 1. Change umlauts to corresponding vowels & ß to ss
    _umlauts = dict(zip((ord(_) for _ in 'äöü'), 'aou'))
    word = word.translate(_umlauts)
    word = word.replace('ß', 'ss')

    # 2. Change second of doubled characters to *
    newword = word[0]
    for i in range(1, len(word)):
        if newword[i-1] == word[i]:
            newword += '*'
        else:
            newword += word[i]
    word = newword

    # 3. Replace sch, ch, ei, ie, ig, st with $, §, %, &, #, !
    word = word.replace('sch', '$')
    word = word.replace('ch', '§')
    word = word.replace('ei', '%')
    word = word.replace('ie', '&')
    word = word.replace('ig', '#')
    word = word.replace('st', '!')

    # # Part 1: Recursive Context-Free Stripping
    # 1. Remove the following 7 suffixes recursively
    while len(word) > 3:
        if (((len(word) > 4 and word[-2:] in {'em', 'er'}) or
             (len(word) > 5 and word[-2:] == 'nd'))):
            word = word[:-2]
        elif ((word[-1] in {'e', 's', 'n'}) or
              (not upper_initial and word[-1] in {'t', '!'})):
            word = word[:-1]
        else:
            break

    # Additional optimizations:
    if len(word) > 5 and word[-5:] == 'erin*':
        word = word[:-1]
    if word[-1] == 'z':
        word = word[:-1] + 'x'

    # Reverse substitutions:
    word = word.replace('$', 'sch')
    word = word.replace('§', 'ch')
    word = word.replace('%', 'ei')
    word = word.replace('&', 'ie')
    word = word.replace('#', 'ig')
    word = word.replace('!', 'st')

    # Expand doubled
    word = ''.join([word[0]] + [word[i-1] if word[i] == '*' else word[i] for
                                i in range(1, len(word))])

    # Finally, convert gege to ge
    if len(word) > 4:
        word = word.replace('gege', 'ge', 1)

    return word

def uealite(word, max_word_length=20, max_acro_length=8, return_rule_no=False,
            var=None):
    """Return UEA-Lite stem.

    The UEA-Lite stemmer is discussed in:
    Jenkins, Marie-Claire and Dan Smith. 2005. "Conservative stemming for
    search and indexing."
    http://lemur.cmp.uea.ac.uk/Research/stemmer/stemmer25feb.pdf

    This is chiefly based on the Java implementation of the algorithm, with
    variants based on the Perl implementation and Jason Adams' Ruby port.

    Java version: http://lemur.cmp.uea.ac.uk/Research/stemmer/UEAstem.java
    Perl version: http://lemur.cmp.uea.ac.uk/Research/stemmer/UEAstem.pl
    Ruby version: https://github.com/ealdent/uea-stemmer

    :param word: the word to calculate the stem of
    :param max_word_length: the maximum word length allowed
    :param max_acro_length: the maximum acronym length allowed
    :param return_rule_no: if True, returns the stem along with rule number
    :param var: variant to use (set to 'Adams' to use Jason Adams' rules,
                or 'Perl' to use the original Perl set of rules)
    :returns: word stem
    :rtype: str or tuple(str, int)
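
    >>> uealite('readings')
    'read'
    >>> uealite('insulted')
    'insult'
    >>> uealite('fancies')
    'fancy'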
    """
    problem_words = {'is', 'as', 'this', 'has', 'was', 'during'}

    # rule table format:
    # top-level dictionary: length-of-suffix: dict-of-rules
    # dict-of-rules: suffix: (rule_no, suffix_length_to_delete,
    #                         suffix_to_append)
    rule_table = {7: {'titudes': (30, 1, None),
                      'fulness': (34, 4, None),
                      'ousness': (35, 4, None),
                      'eadings': (40.7, 4, None),
                      'oadings': (40.6, 4, None),
                      'ealings': (42.4, 4, None),
                      'ailings': (42.2, 4, None),
                      },
                  6: {'aceous': (1, 6, None),
                      'aining': (24, 3, None),
                      'acting': (25, 3, None),
                      'ttings': (26, 5, None),
                      'viding': (27, 3, 'e'),
                      'ssings': (37, 4, None),
                      'ulting': (38, 3, None),
                      'eading': (40.7, 3, None),
                      'oading': (40.6, 3, None),
                      'edings': (40.5, 4, None),
                      'ddings': (40.4, 5, None),
                      'ldings': (40.3, 4, None),
                      'rdings': (40.2, 4, None),
                      'ndings': (40.1, 4, None),
                      'llings': (41, 5, None),
                      'ealing': (42.4, 3, None),
                      'olings': (42.3, 4, None),
                      'ailing': (42.2, 3, None),
                      'elings': (42.1, 4, None),
                      'mmings': (44.3, 5, None),
                      'ngings': (45.2, 4, None),
                      'ggings': (45.1, 5, None),
                      'stings': (47, 4, None),
                      'etings': (48.4, 4, None),
                      'ntings': (48.2, 4, None),
                      'irings': (54.4, 4, 'e'),
                      'urings': (54.3, 4, 'e'),
                      'ncings': (54.2, 4, 'e'),
                      'things': (58.1, 1, None),
                      },
                  5: {'iases': (11.4, 2, None),
                      'ained': (13.6, 2, None),
                      'erned': (13.5, 2, None),
                      'ifted': (14, 2, None),
                      'ected': (15, 2, None),
                      'vided': (16, 1, None),
                      'erred': (19, 3, None),
                      'urred': (20.5, 3, None),
                      'lored': (20.4, 2, None),
                      'eared': (20.3, 2, None),
                      'tored': (20.2, 1, None),
                      'noted': (22.4, 1, None),
                      'leted': (22.3, 1, None),
                      'anges': (23, 1, None),
                      'tting': (26, 4, None),
                      'ulted': (32, 2, None),
                      'uming': (33, 3, 'e'),
                      'rabed': (36.1, 1, None),
                      'rebed': (36.1, 1, None),
                      'ribed': (36.1, 1, None),
                      'robed': (36.1, 1, None),
                      'rubed': (36.1, 1, None),
                      'ssing': (37, 3, None),
                      'vings': (39, 4, 'e'),
                      'eding': (40.5, 3, None),
                      'dding': (40.4, 4, None),
                      'lding': (40.3, 3, None),
                      'rding': (40.2, 3, None),
                      'nding': (40.1, 3, None),
                      'dings': (40, 4, 'e'),
                      'lling': (41, 4, None),
                      'oling': (42.3, 3, None),
                      'eling': (42.1, 3, None),
                      'lings': (42, 4, 'e'),
                      'mming': (44.3, 4, None),
                      'rming': (44.2, 3, None),
                      'lming': (44.1, 3, None),
                      'mings': (44, 4, 'e'),
                      'nging': (45.2, 3, None),
                      'gging': (45.1, 4, None),
                      'gings': (45, 4, 'e'),
                      'aning': (46.6, 3, None),
                      'ening': (46.5, 3, None),
                      'gning': (46.4, 3, None),
                      'nning': (46.3, 4, None),
                      'oning': (46.2, 3, None),
                      'rning': (46.1, 3, None),
                      'sting': (47, 3, None),
                      'eting': (48.4, 3, None),
                      'pting': (48.3, 3, None),
                      'nting': (48.2, 3, None),
                      'cting': (48.1, 3, None),
                      'tings': (48, 4, 'e'),
                      'iring': (54.4, 3, 'e'),
                      'uring': (54.3, 3, 'e'),
                      'ncing': (54.2, 3, 'e'),
                      'sings': (54, 4, 'e'),
                      # 'lling': (55, 3, None),  # masked by 41
                      'ating': (57, 3, 'e'),
                      'thing': (58.1, 0, None),
                      },
                  4: {'eeds': (7, 1, None),
                      'uses': (11.3, 1, None),
                      'sses': (11.2, 2, None),
                      'eses': (11.1, 2, 'is'),
                      'tled': (12.5, 1, None),
                      'pled': (12.4, 1, None),
                      'bled': (12.3, 1, None),
                      'eled': (12.2, 2, None),
                      'lled': (12.1, 2, None),
                      'ened': (13.7, 2, None),
                      'rned': (13.4, 2, None),
                      'nned': (13.3, 3, None),
                      'oned': (13.2, 2, None),
                      'gned': (13.1, 2, None),
                      'ered': (20.1, 2, None),
                      'reds': (20, 2, None),
                      'tted': (21, 3, None),
                      'uted': (22.2, 1, None),
                      'ated': (22.1, 1, None),
                      'ssed': (28, 2, None),
                      'umed': (31, 1, None),
                      'beds': (36, 3, None),
                      'ving': (39, 3, 'e'),
                      'ding': (40, 3, 'e'),
                      'ling': (42, 3, 'e'),
                      'nged': (43.2, 1, None),
                      'gged': (43.1, 3, None),
                      'ming': (44, 3, 'e'),
                      'ging': (45, 3, 'e'),
                      'ning': (46, 3, 'e'),
                      'ting': (48, 3, 'e'),
                      # 'ssed': (49, 2, None),  # masked by 28
                      # 'lled': (53, 2, None),  # masked by 12.1
                      'zing': (54.1, 3, 'e'),
                      'sing': (54, 3, 'e'),
                      'lves': (60.1, 3, 'f'),
                      'aped': (61.3, 1, None),
                      'uded': (61.2, 1, None),
                      'oded': (61.1, 1, None),
                      # 'ated': (61, 1, None),  # masked by 22.1
                      'ones': (63.6, 1, None),
                      'izes': (63.5, 1, None),
                      'ures': (63.4, 1, None),
                      'ines': (63.3, 1, None),
                      'ides': (63.2, 1, None),
                      },
                  3: {'ces': (2, 1, None),
                      'sis': (4, 0, None),
                      'tis': (5, 0, None),
                      'eed': (7, 0, None),
                      'ued': (8, 1, None),
                      'ues': (9, 1, None),
                      'ees': (10, 1, None),
                      'ses': (11, 1, None),
                      'led': (12, 2, None),
                      'ned': (13, 1, None),
                      'ved': (17, 1, None),
                      'ced': (18, 1, None),
                      'red': (20, 1, None),
                      'ted': (22, 2, None),
                      'sed': (29, 1, None),
                      'bed': (36, 2, None),
                      'ged': (43, 1, None),
                      'les': (50, 1, None),
                      'tes': (51, 1, None),
                      'zed': (52, 1, None),
                      'ied': (56, 3, 'y'),
                      'ies': (59, 3, 'y'),
                      'ves': (60, 1, None),
                      'pes': (63.8, 1, None),
                      'mes': (63.7, 1, None),
                      'ges': (63.1, 1, None),
                      'ous': (65, 0, None),
                      'ums': (66, 0, None),
                      },
                  2: {'cs': (3, 0, None),
                      'ss': (6, 0, None),
                      'es': (63, 2, None),
                      'is': (64, 2, 'e'),
                      'us': (67, 0, None),
                      }}

    if var == 'Perl':
        perl_deletions = {7: ['eadings', 'oadings', 'ealings', 'ailings'],
                          6: ['ttings', 'ssings', 'edings', 'ddings',
                              'ldings', 'rdings', 'ndings', 'llings',
                              'olings', 'elings', 'mmings', 'ngings',
                              'ggings', 'stings', 'etings', 'ntings',
                              'irings', 'urings', 'ncings', 'things'],
                          5: ['vings', 'dings', 'lings', 'mings', 'gings',
                              'tings', 'sings'],
                          4: ['eeds', 'reds', 'beds']}

        # Delete the above rules from rule_table
        for del_len in perl_deletions:
            for term in perl_deletions[del_len]:
                del rule_table[del_len][term]

    elif var == 'Adams':
        adams_additions = {6: {'chited': (22.8, 1, None)},
                           5: {'dying': (58.2, 4, 'ie'),
                               'tying': (58.2, 4, 'ie'),
                               'vited': (22.6, 1, None),
                               'mited': (22.5, 1, None),
                               'vided': (22.9, 1, None),
                               'mided': (22.10, 1, None),
                               'lying': (58.2, 4, 'ie'),
                               'arred': (19.1, 3, None),
                               },
                           4: {'ited': (22.7, 2, None),
                               'oked': (31.1, 1, None),
                               'aked': (31.1, 1, None),
                               'iked': (31.1, 1, None),
                               'uked': (31.1, 1, None),
                               'amed': (31, 1, None),
                               'imed': (31, 1, None),
                               'does': (31.2, 2, None),
                               },
                           3: {'oed': (31.3, 1, None),
                               'oes': (31.2, 1, None),
                               'kes': (63.1, 1, None),
                               'des': (63.10, 1, None),
                               'res': (63.9, 1, None),
                               }}

        # Add the above additional rules to rule_table
        for del_len in adams_additions:
            rule_table[del_len] = dict(rule_table[del_len],
                                       **adams_additions[del_len])
        # Add additional problem word
        problem_words.add('menses')

    def _stem_with_duplicate_character_check(word, del_len):
        if word[-1] == 's':
            del_len += 1
        stemmed_word = word[:-del_len]
        if re.match(r'.*(\w)\1$', stemmed_word):
            stemmed_word = stemmed_word[:-1]
        return stemmed_word

    def _stem(word):
        stemmed_word = word
        rule_no = 0

        if not word:
            return word, 0
        if word in problem_words:
            return word, 90
        if max_word_length and len(word) > max_word_length:
            return word, 95

        if "'" in word:
            if word[-2:] in {"'s", "'S"}:
                stemmed_word = word[:-2]
            if word[-1:] == "'":
                stemmed_word = word[:-1]
            stemmed_word = stemmed_word.replace("n't", 'not')
            stemmed_word = stemmed_word.replace("'ve", 'have')
            stemmed_word = stemmed_word.replace("'re", 'are')
            stemmed_word = stemmed_word.replace("'m", 'am')
            return stemmed_word, 94

        if word.isdigit():
            return word, 90.3
        else:
            hyphen = word.find('-')
            if 0 < hyphen < len(word):
                if word[:hyphen].isalpha() and word[hyphen+1:].isalpha():
                    return word, 90.2
                else:
                    return word, 90.1
            elif '_' in word:
                return word, 90
            elif word[-1] == 's' and word[:-1].isupper():
                if var == 'Adams' and len(word)-1 > max_acro_length:
                    return word, 96
                return word[:-1], 91.1
            elif word.isupper():
                if var == 'Adams' and len(word) > max_acro_length:
                    return word, 96
                return word, 91
            elif re.match(r'^.*[A-Z].*[A-Z].*$', word):
                return word, 92
            elif word[0].isupper():
                return word, 93
            elif var == 'Adams' and re.match(r'^[a-z]{1}(|[rl])(ing|ed)$',
                                             word):
                return word, 97

        for n in range(7, 1, -1):
            if word[-n:] in rule_table[n]:
                rule_no, del_len, add_str = rule_table[n][word[-n:]]
                if del_len:
                    stemmed_word = word[:-del_len]
                else:
                    stemmed_word = word
                if add_str:
                    stemmed_word += add_str
                break

        if not rule_no:
            if re.match(r'.*\w\wings?$', word):  # rule 58
                stemmed_word = _stem_with_duplicate_character_check(word, 3)
                rule_no = 58
            elif re.match(r'.*\w\weds?$', word):  # rule 62
                stemmed_word = _stem_with_duplicate_character_check(word, 2)
                rule_no = 62
            elif word[-1] == 's':  # rule 68
                stemmed_word = word[:-1]
                rule_no = 68

        return stemmed_word, rule_no

    stem, rule_no = _stem(word)
    if return_rule_no:
        return stem, rule_no
    return stem

def paice_husk(word):
    """Return Paice-Husk stem.

    Implementation of the Paice-Husk Stemmer, also known as the Lancaster
    Stemmer, developed by Chris Paice, with the assistance of Gareth Husk.

    This is based on the algorithm's description in:
    Paice, Chris D. 1990. "Another stemmer." ACM SIGIR Forum 24:3, Fall 1990.
    56-61. doi:10.1145/101306.101310.

    :param word: the word to calculate the stem of
    :returns: word stem
    :rtype: str
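
    >>> paice_husk('maximum')
    'maxim'
    >>> paice_husk('connection')
    'connect'
    >>> paice_husk('happiness')
    'happy'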
    """
    rule_table = {6: {'ifiabl': (False, 6, None, True),
                      'plicat': (False, 4, 'y', True)},
                  5: {'guish': (False, 5, 'ct', True),
                      'sumpt': (False, 2, None, True),
                      'istry': (False, 5, None, True)},
                  4: {'ytic': (False, 3, 's', True),
                      'ceed': (False, 2, 'ss', True),
                      'hood': (False, 4, None, False),
                      'lief': (False, 1, 'v', True),
                      'verj': (False, 1, 't', True),
                      'misj': (False, 2, 't', True),
                      'iabl': (False, 4, 'y', True),
                      'iful': (False, 4, 'y', True),
                      'sion': (False, 4, 'j', False),
                      'xion': (False, 4, 'ct', True),
                      'ship': (False, 4, None, False),
                      'ness': (False, 4, None, False),
                      'ment': (False, 4, None, False),
                      'ript': (False, 2, 'b', True),
                      'orpt': (False, 2, 'b', True),
                      'duct': (False, 1, None, True),
                      'cept': (False, 2, 'iv', True),
                      'olut': (False, 2, 'v', True),
                      'sist': (False, 0, None, True)},
                  3: {'ied': (False, 3, 'y', False),
                      'eed': (False, 1, None, True),
                      'ing': (False, 3, None, False),
                      'iag': (False, 3, 'y', True),
                      'ish': (False, 3, None, False),
                      'fuj': (False, 1, 's', True),
                      'hej': (False, 1, 'r', True),
                      'abl': (False, 3, None, False),
                      'ibl': (False, 3, None, True),
                      'bil': (False, 2, 'l', False),
                      'ful': (False, 3, None, False),
                      'ial': (False, 3, None, False),
                      'ual': (False, 3, None, False),
                      'ium': (False, 3, None, True),
                      'ism': (False, 3, None, False),
                      'ion': (False, 3, None, False),
                      'ian': (False, 3, None, False),
                      'een': (False, 0, None, True),
                      'ear': (False, 0, None, True),
                      'ier': (False, 3, 'y', False),
                      'ies': (False, 3, 'y', False),
                      'sis': (False, 2, None, True),
                      'ous': (False, 3, None, False),
                      'ent': (False, 3, None, False),
                      'ant': (False, 3, None, False),
                      'ist': (False, 3, None, False),
                      'iqu': (False, 3, None, True),
                      'ogu': (False, 1, None, True),
                      'siv': (False, 3, 'j', False),
                      'eiv': (False, 0, None, True),
                      'bly': (False, 1, None, False),
                      'ily': (False, 3, 'y', False),
                      'ply': (False, 0, None, True),
                      'ogy': (False, 1, None, True),
                      'phy': (False, 1, None, True),
                      'omy': (False, 1, None, True),
                      'opy': (False, 1, None, True),
                      'ity': (False, 3, None, False),
                      'ety': (False, 3, None, False),
                      'lty': (False, 2, None, True),
                      'ary': (False, 3, None, False),
                      'ory': (False, 3, None, False),
                      'ify': (False, 3, None, True),
                      'ncy': (False, 2, 't', False),
                      'acy': (False, 3, None, False)},
                  2: {'ia': (True, 2, None, True),
                      'bb': (False, 1, None, True),
                      'ic': (False, 2, None, False),
                      'nc': (False, 1, 't', False),
                      'dd': (False, 1, None, True),
                      'ed': (False, 2, None, False),
                      'if': (False, 2, None, False),
                      'ag': (False, 2, None, False),
                      'gg': (False, 1, None, True),
                      'th': (True, 2, None, True),
                      'ij': (False, 1, 'd', True),
                      'uj': (False, 1, 'd', True),
                      'oj': (False, 1, 'd', True),
                      'nj': (False, 1, 'd', True),
                      'cl': (False, 1, None, True),
                      'ul': (False, 2, None, True),
                      'al': (False, 2, None, False),
                      'll': (False, 1, None, True),
                      'um': (True, 2, None, True),
                      'mm': (False, 1, None, True),
                      'an': (False, 2, None, False),
                      'en': (False, 2, None, False),
                      'nn': (False, 1, None, True),
                      'pp': (False, 1, None, True),
                      'er': (False, 2, None, False),
                      'ar': (False, 2, None, True),
                      'or': (False, 2, None, False),
                      'ur': (False, 2, None, False),
                      'rr': (False, 1, None, True),
                      'tr': (False, 1, None, False),
                      'is': (False, 2, None, False),
                      'ss': (False, 0, None, True),
                      'us': (True, 2, None, True),
                      'at': (False, 2, None, False),
                      'tt': (False, 1, None, True),
                      'iv': (False, 2, None, False),
                      'ly': (False, 2, None, False),
                      'iz': (False, 2, None, False),
                      'yz': (False, 1, 's', True)},
                  1: {'a': (True, 1, None, True),
                      'e': (False, 1, None, False),
                      'i': ((True, 1, None, True), (False, 1, 'y', False)),
                      'j': (False, 1, 's', True),
                      's': ((True, 1, None, False), (False, 0, None, True))}}

    def _has_vowel(word):
        for char in word:
            if char in {'a', 'e', 'i', 'o', 'u', 'y'}:
                return True
        return False

    def _acceptable(word):
        if word and word[0] in {'a', 'e', 'i', 'o', 'u'}:
            return len(word) > 1
        return len(word) > 2 and _has_vowel(word[1:])

    def _apply_rule(word, rule, intact):
        old_word = word
        only_intact, del_len, add_str, set_terminate = rule

        if (not only_intact) or (intact and only_intact):
            if del_len:
                word = word[:-del_len]
            if add_str:
                word += add_str
        else:
            return word, False, intact, terminate

        if _acceptable(word):
            return word, True, False, set_terminate
        else:
            return old_word, False, intact, terminate

    terminate = False
    intact = True
    while not terminate:
        for n in range(6, 0, -1):
            if word[-n:] in rule_table[n]:
                accept = False
                if len(rule_table[n][word[-n:]]) < 4:
                    for rule in rule_table[n][word[-n:]]:
                        (word, accept, intact,
                         terminate) = _apply_rule(word, rule, intact)
                        if accept:
                            break
                else:
                    rule = rule_table[n][word[-n:]]
                    (word, accept, intact,
                     terminate) = _apply_rule(word, rule, intact)

                if accept:
                    break
        else:
            break

    return word