Completed
Push — master ( 1afb32...5a86ce )
by Chris
13:22
created

abydos.stemmer.sb_norwegian()   F

Complexity

Conditions 21

Size

Total Lines 69
Code Lines 42

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0

Metric  Value
cc      21
eloc    42
nop     1
dl      0
loc     69
rs      0
c       0
b       0
f       0

How to fix

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

The refactoring most commonly applied to a long method is Extract Method.
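
A minimal sketch of the Extract Method refactoring described above. The function names and the suffix rules are hypothetical and are not taken from abydos; the point is only that each commented block becomes a small function named after its comment.

```python
# Hypothetical example: none of these names or rules come from abydos.


def normalize_and_stem_long(word):
    """Long version: two commented blocks inside one body."""
    # lowercase and trim
    word = word.strip().lower()
    # remove a plural ending
    if word.endswith('es') and len(word) > 4:
        word = word[:-2]
    elif word.endswith('s') and len(word) > 3:
        word = word[:-1]
    return word


def _normalize(word):
    """Lowercase and trim (formerly the first commented block)."""
    return word.strip().lower()


def _remove_plural_ending(word):
    """Remove a plural ending (formerly the second commented block)."""
    if word.endswith('es') and len(word) > 4:
        return word[:-2]
    if word.endswith('s') and len(word) > 3:
        return word[:-1]
    return word


def normalize_and_stem(word):
    """Refactored version: the body now reads as a list of named steps."""
    return _remove_plural_ending(_normalize(word))
```

Both versions return the same result for the same input; only the structure changes, which is what makes the refactoring safe to apply step by step.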

Complexity

Complex units like abydos.stemmer.sb_norwegian() often do a lot of different things. To break such a unit down, we need to identify a cohesive component within it. A common approach to finding such a component is to look for fields or methods that share the same prefixes or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
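
As a rough illustration of Extract Class, the module listed below has several helpers that share the `_sb_` prefix and all receive the same `vowels` argument. The sketch groups three of them into one class. The class name `SnowballHelper` and its method names are assumptions made for this example and are not part of abydos; the R1 logic mirrors the `_sb_r1` function from the listing.

```python
class SnowballHelper:
    """Hypothetical grouping of the _sb_* helpers; not part of abydos."""

    def __init__(self, vowels, r1_prefixes=()):
        # The shared state that every _sb_* helper currently takes as arguments
        self.vowels = frozenset(vowels)
        self.r1_prefixes = tuple(r1_prefixes)

    def has_vowel(self, term):
        """Return True iff term contains a vowel."""
        return any(letter in self.vowels for letter in term)

    def r1(self, term, use_prefixes=True):
        """Return the start index of the R1 region."""
        if use_prefixes:
            for prefix in self.r1_prefixes:
                if term.startswith(prefix):
                    return len(prefix)
        vowel_found = False
        for i, letter in enumerate(term):
            if not vowel_found and letter in self.vowels:
                vowel_found = True
            elif vowel_found and letter not in self.vowels:
                return i + 1
        return len(term)

    def r2(self, term):
        """Return the start index of the R2 region."""
        r1_start = self.r1(term)
        return r1_start + self.r1(term[r1_start:], use_prefixes=False)


helper = SnowballHelper('aeiouy', ('commun', 'gener', 'arsen'))
print(helper.r1('beautiful'), helper.r2('beautiful'))
```

With the vowel set stored once on the instance, callers no longer pass `vowels` to every helper, and each language-specific stemmer could build its own instance.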

# -*- coding: utf-8 -*-
# Review issue (coding-style): Too many lines in module (1791/1000)

# Copyright 2014-2018 by Christopher C. Little.
# This file is part of Abydos.
#
# Abydos is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Abydos is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.

"""abydos.stemmer.

The stemmer module defines word stemmers including:

    - the Lovins stemmer
    - the Porter and Porter2 (Snowball English) stemmers
    - Snowball stemmers for German, Dutch, Norwegian, Swedish, and Danish
    - CLEF German, German plus, and Swedish stemmers
    - Caumanns' German stemmer
"""

from __future__ import unicode_literals

import unicodedata

from six import text_type
from six.moves import range


def lovins(word):
    """Return Lovins stem.

    Lovins stemmer

    The Lovins stemmer is described in Julie Beth Lovins's article at:
    http://www.mt-archive.info/MT-1968-Lovins.pdf

    :param word: the word to stem
    :returns: word stem
    :rtype: string

    >>> lovins('reading')
    'read'
    >>> lovins('suspension')
    'suspens'
    >>> lovins('elusiveness')
    'elus'
    """
    # pylint: disable=too-many-branches, too-many-locals

    # lowercase, normalize, and compose
    word = unicodedata.normalize('NFC', text_type(word.lower()))

    def cond_b(word, suffix_len):
        """Return Lovins' condition B."""
        return len(word)-suffix_len >= 3

    def cond_c(word, suffix_len):
        """Return Lovins' condition C."""
        return len(word)-suffix_len >= 4

    def cond_d(word, suffix_len):
        """Return Lovins' condition D."""
        return len(word)-suffix_len >= 5

    def cond_e(word, suffix_len):
        """Return Lovins' condition E."""
        return word[-suffix_len-1] != 'e'

    def cond_f(word, suffix_len):
        """Return Lovins' condition F."""
        return (len(word)-suffix_len >= 3 and
                word[-suffix_len-1] != 'e')

    def cond_g(word, suffix_len):
        """Return Lovins' condition G."""
        return (len(word)-suffix_len >= 3 and
                word[-suffix_len-1] == 'f')

    def cond_h(word, suffix_len):
        """Return Lovins' condition H."""
        return (word[-suffix_len-1] == 't' or
                word[-suffix_len-2:-suffix_len] == 'll')

    def cond_i(word, suffix_len):
        """Return Lovins' condition I."""
        return word[-suffix_len-1] not in {'e', 'o'}

    def cond_j(word, suffix_len):
        """Return Lovins' condition J."""
        return word[-suffix_len-1] not in {'a', 'e'}

    def cond_k(word, suffix_len):
        """Return Lovins' condition K."""
        return (len(word)-suffix_len >= 3 and
                (word[-suffix_len-1] in {'i', 'l'} or
                 (word[-suffix_len-3] == 'u' and word[-suffix_len-1] == 'e')))

    def cond_l(word, suffix_len):
        """Return Lovins' condition L."""
        # compare the two characters before the suffix with 'os'; a single
        # character can never equal a two-character string
        return (word[-suffix_len-1] not in {'s', 'u', 'x'} or
                word[-suffix_len-2:-suffix_len] == 'os')

    def cond_m(word, suffix_len):
        """Return Lovins' condition M."""
        return word[-suffix_len-1] not in {'a', 'c', 'e', 'm'}

    def cond_n(word, suffix_len):
        """Return Lovins' condition N."""
        if len(word)-suffix_len >= 3:
            if word[-suffix_len-3] == 's':
                if len(word)-suffix_len >= 4:
                    return True
            else:
                return True
        return False

    def cond_o(word, suffix_len):
        """Return Lovins' condition O."""
        return word[-suffix_len-1] in {'i', 'l'}

    def cond_p(word, suffix_len):
        """Return Lovins' condition P."""
        return word[-suffix_len-1] != 'c'

    def cond_q(word, suffix_len):
        """Return Lovins' condition Q."""
        return (len(word)-suffix_len >= 3 and
                word[-suffix_len-1] not in {'l', 'n'})

    def cond_r(word, suffix_len):
        """Return Lovins' condition R."""
        return word[-suffix_len-1] in {'n', 'r'}

    def cond_s(word, suffix_len):
        """Return Lovins' condition S."""
        return (word[-suffix_len-2:-suffix_len] == 'dr' or
                (word[-suffix_len-1] == 't' and
                 word[-suffix_len-2:-suffix_len] != 'tt'))

    def cond_t(word, suffix_len):
        """Return Lovins' condition T."""
        return (word[-suffix_len-1] in {'s', 't'} and
                word[-suffix_len-2:-suffix_len] != 'ot')

    def cond_u(word, suffix_len):
        """Return Lovins' condition U."""
        return word[-suffix_len-1] in {'l', 'm', 'n', 'r'}

    def cond_v(word, suffix_len):
        """Return Lovins' condition V."""
        return word[-suffix_len-1] == 'c'

    def cond_w(word, suffix_len):
        """Return Lovins' condition W."""
        return word[-suffix_len-1] not in {'s', 'u'}

    def cond_x(word, suffix_len):
        """Return Lovins' condition X."""
        # compare only the third character before the suffix with 'u'; the
        # one-character slice also avoids an IndexError on short stems
        return (word[-suffix_len-1] in {'i', 'l'} or
                (word[-suffix_len-3:-suffix_len-2] == 'u' and
                 word[-suffix_len-1] == 'e'))

    def cond_y(word, suffix_len):
        """Return Lovins' condition Y."""
        return word[-suffix_len-2:-suffix_len] == 'in'

    def cond_z(word, suffix_len):
        """Return Lovins' condition Z."""
        return word[-suffix_len-1] != 'f'

    def cond_aa(word, suffix_len):
        """Return Lovins' condition AA."""
        return (word[-suffix_len-1] in {'d', 'f', 'l', 't'} or
                word[-suffix_len-2:-suffix_len] in {'ph', 'th', 'er', 'or',
                                                    'es'})

    def cond_bb(word, suffix_len):
        """Return Lovins' condition BB."""
        return (len(word)-suffix_len >= 3 and
                word[-suffix_len-3:-suffix_len] != 'met' and
                word[-suffix_len-4:-suffix_len] != 'ryst')

    def cond_cc(word, suffix_len):
        """Return Lovins' condition CC."""
        return word[-suffix_len-1] == 'l'

    suffix = {'alistically': cond_b, 'arizability': None,
              'izationally': cond_b, 'antialness': None,
              'arisations': None, 'arizations': None, 'entialness': None,
              'allically': cond_c, 'antaneous': None, 'antiality': None,
              'arisation': None, 'arization': None, 'ationally': cond_b,
              'ativeness': None, 'eableness': cond_e, 'entations': None,
              'entiality': None, 'entialize': None, 'entiation': None,
              'ionalness': None, 'istically': None, 'itousness': None,
              'izability': None, 'izational': None, 'ableness': None,
              'arizable': None, 'entation': None, 'entially': None,
              'eousness': None, 'ibleness': None, 'icalness': None,
              'ionalism': None, 'ionality': None, 'ionalize': None,
              'iousness': None, 'izations': None, 'lessness': None,
              'ability': None, 'aically': None, 'alistic': cond_b,
              'alities': None, 'ariness': cond_e, 'aristic': None,
              'arizing': None, 'ateness': None, 'atingly': None,
              'ational': cond_b, 'atively': None, 'ativism': None,
              'elihood': cond_e, 'encible': None, 'entally': None,
              'entials': None, 'entiate': None, 'entness': None,
              'fulness': None, 'ibility': None, 'icalism': None,
              'icalist': None, 'icality': None, 'icalize': None,
              'ication': cond_g, 'icianry': None, 'ination': None,
              'ingness': None, 'ionally': None, 'isation': None,
              'ishness': None, 'istical': None, 'iteness': None,
              'iveness': None, 'ivistic': None, 'ivities': None,
              'ization': cond_f, 'izement': None, 'oidally': None,
              'ousness': None, 'aceous': None, 'acious': cond_b,
              'action': cond_g, 'alness': None, 'ancial': None,
              'ancies': None, 'ancing': cond_b, 'ariser': None,
              'arized': None, 'arizer': None, 'atable': None,
              'ations': cond_b, 'atives': None, 'eature': cond_z,
              'efully': None, 'encies': None, 'encing': None,
              'ential': None, 'enting': cond_c, 'entist': None,
              'eously': None, 'ialist': None, 'iality': None,
              'ialize': None, 'ically': None, 'icance': None,
              'icians': None, 'icists': None, 'ifully': None,
              'ionals': None, 'ionate': cond_d, 'ioning': None,
              'ionist': None, 'iously': None, 'istics': None,
              'izable': cond_e, 'lessly': None, 'nesses': None,
              'oidism': None, 'acies': None, 'acity': None,
              'aging': cond_b, 'aical': None, 'alist': None,
              'alism': cond_b, 'ality': None, 'alize': None,
              'allic': cond_bb, 'anced': cond_b, 'ances': cond_b,
              'antic': cond_c, 'arial': None, 'aries': None,
              'arily': None, 'arity': cond_b, 'arize': None,
              'aroid': None, 'ately': None, 'ating': cond_i,
              'ation': cond_b, 'ative': None, 'ators': None,
              'atory': None, 'ature': cond_e, 'early': cond_y,
              'ehood': None, 'eless': None, 'elity': None,
              'ement': None, 'enced': None, 'ences': None,
              'eness': cond_e, 'ening': cond_e, 'ental': None,
              'ented': cond_c, 'ently': None, 'fully': None,
              'ially': None, 'icant': None, 'ician': None,
              'icide': None, 'icism': None, 'icist': None,
              'icity': None, 'idine': cond_i, 'iedly': None,
              'ihood': None, 'inate': None, 'iness': None,
              'ingly': cond_b, 'inism': cond_j, 'inity': cond_cc,
              'ional': None, 'ioned': None, 'ished': None,
              'istic': None, 'ities': None, 'itous': None,
              'ively': None, 'ivity': None, 'izers': cond_f,
              'izing': cond_f, 'oidal': None, 'oides': None,
              'otide': None, 'ously': None, 'able': None, 'ably': None,
              'ages': cond_b, 'ally': cond_b, 'ance': cond_b, 'ancy': cond_b,
              'ants': cond_b, 'aric': None, 'arly': cond_k, 'ated': cond_i,
              'ates': None, 'atic': cond_b, 'ator': None, 'ealy': cond_y,
              'edly': cond_e, 'eful': None, 'eity': None, 'ence': None,
              'ency': None, 'ened': cond_e, 'enly': cond_e, 'eous': None,
              'hood': None, 'ials': None, 'ians': None, 'ible': None,
              'ibly': None, 'ical': None, 'ides': cond_l, 'iers': None,
              'iful': None, 'ines': cond_m, 'ings': cond_n, 'ions': cond_b,
              'ious': None, 'isms': cond_b, 'ists': None, 'itic': cond_h,
              'ized': cond_f, 'izer': cond_f, 'less': None, 'lily': None,
              'ness': None, 'ogen': None, 'ward': None, 'wise': None,
              'ying': cond_b, 'yish': None, 'acy': None, 'age': cond_b,
              'aic': None, 'als': cond_bb, 'ant': cond_b, 'ars': cond_o,
              'ary': cond_f, 'ata': None, 'ate': None, 'eal': cond_y,
              'ear': cond_y, 'ely': cond_e, 'ene': cond_e, 'ent': cond_c,
              'ery': cond_e, 'ese': None, 'ful': None, 'ial': None,
              'ian': None, 'ics': None, 'ide': cond_l, 'ied': None,
              'ier': None, 'ies': cond_p, 'ily': None, 'ine': cond_m,
              'ing': cond_n, 'ion': cond_q, 'ish': cond_c, 'ism': cond_b,
              'ist': None, 'ite': cond_aa, 'ity': None, 'ium': None,
              'ive': None, 'ize': cond_f, 'oid': None, 'one': cond_r,
              'ous': None, 'ae': None, 'al': cond_bb, 'ar': cond_x,
              'as': cond_b, 'ed': cond_e, 'en': cond_f, 'es': cond_e,
              'ia': None, 'ic': None, 'is': None, 'ly': cond_b,
              'on': cond_s, 'or': cond_t, 'um': cond_u, 'us': cond_v,
              'yl': cond_r, '\'s': None, 's\'': None, 'a': None,
              'e': None, 'i': None, 'o': None, 's': cond_w, 'y': cond_b}

    for suffix_len in range(11, 0, -1):
        ending = word[-suffix_len:]
        if (ending in suffix and
                len(word)-suffix_len >= 2 and
                (suffix[ending] is None or
                 suffix[ending](word, suffix_len))):
            word = word[:-suffix_len]
            break

    def recode9(stem):
        """Return Lovins' conditional recode rule 9."""
        if stem[-3:-2] in {'a', 'i', 'o'}:
            return stem
        return stem[:-2]+'l'

    def recode24(stem):
        """Return Lovins' conditional recode rule 24."""
        if stem[-4:-3] == 's':
            return stem
        return stem[:-1]+'s'

    def recode28(stem):
        """Return Lovins' conditional recode rule 28."""
        if stem[-4:-3] in {'p', 't'}:
            return stem
        return stem[:-1]+'s'

    def recode30(stem):
        """Return Lovins' conditional recode rule 30."""
        if stem[-4:-3] == 'm':
            return stem
        return stem[:-1]+'s'

    def recode32(stem):
        """Return Lovins' conditional recode rule 32."""
        if stem[-3:-2] == 'n':
            return stem
        return stem[:-1]+'s'

    if word[-2:] in {'bb', 'dd', 'gg', 'll', 'mm', 'nn', 'pp', 'rr', 'ss',
                     'tt'}:
        word = word[:-1]

    recode = (('iev', 'ief'),
              ('uct', 'uc'),
              ('umpt', 'um'),
              ('rpt', 'rb'),
              ('urs', 'ur'),
              ('istr', 'ister'),
              ('metr', 'meter'),
              ('olv', 'olut'),
              ('ul', recode9),
              ('bex', 'bic'),
              ('dex', 'dic'),
              ('pex', 'pic'),
              ('tex', 'tic'),
              ('ax', 'ac'),
              ('ex', 'ec'),
              ('ix', 'ic'),
              ('lux', 'luc'),
              ('uad', 'uas'),
              ('vad', 'vas'),
              ('cid', 'cis'),
              ('lid', 'lis'),
              ('erid', 'eris'),
              ('pand', 'pans'),
              ('end', recode24),
              ('ond', 'ons'),
              ('lud', 'lus'),
              ('rud', 'rus'),
              ('her', recode28),
              ('mit', 'mis'),
              ('ent', recode30),
              ('ert', 'ers'),
              ('et', recode32),
              ('yt', 'ys'),
              ('yz', 'ys'))

    for ending, replacement in recode:
        if word.endswith(ending):
            if callable(replacement):
                word = replacement(word)
            else:
                word = word[:-len(ending)] + replacement

    return word


def _m_degree(term, vowels):
    """Return Porter helper function _m_degree value.

    m-degree is equal to the number of V to C transitions

    :param term: the word for which to calculate the m-degree
    :param vowels: the set of vowels in the language
    :returns: the m-degree as defined in the Porter stemmer definition
    """
    mdeg = 0
    last_was_vowel = False
    for letter in term:
        if letter in vowels:
            last_was_vowel = True
        else:
            if last_was_vowel:
                mdeg += 1
            last_was_vowel = False
    return mdeg


def _sb_has_vowel(term, vowels):
    """Return Porter helper function _sb_has_vowel value.

    :param term: the word to scan for vowels
    :param vowels: the set of vowels in the language
    :returns: true iff a vowel exists in the term (as defined in the Porter
        stemmer definition)
    """
    for letter in term:
        if letter in vowels:
            return True
    return False


def _ends_in_doubled_cons(term, vowels):
    """Return Porter helper function _ends_in_doubled_cons value.

    :param term: the word to check for a final doubled consonant
    :param vowels: the set of vowels in the language
    :returns: true iff the stem ends in a doubled consonant (as defined in the
        Porter stemmer definition)
    """
    if len(term) > 1 and term[-1] not in vowels and term[-2] == term[-1]:
        return True
    return False


def _ends_in_cvc(term, vowels):
    """Return Porter helper function _ends_in_cvc value.

    :param term: the word to scan for cvc
    :param vowels: the set of vowels in the language
    :returns: true iff the stem ends in cvc (as defined in the Porter stemmer
        definition)
    """
    if len(term) > 2 and (term[-1] not in vowels and
                          term[-2] in vowels and
                          term[-3] not in vowels and
                          term[-1] not in tuple('wxY')):
        return True
    return False


def porter(word, early_english=False):
    """Return Porter stem.

    The Porter stemmer is defined at:
    http://snowball.tartarus.org/algorithms/porter/stemmer.html

    :param word: the word to calculate the stem of
    :param early_english: set to True in order to remove -est & -eth (2nd & 3rd
        person singular verbal agreement suffixes)
    :returns: word stem
    :rtype: str

    >>> porter('reading')
    'read'
    >>> porter('suspension')
    'suspens'
    >>> porter('elusiveness')
    'elus'

    >>> porter('eateth', early_english=True)
    'eat'
    """
    # pylint: disable=too-many-branches

    # lowercase, normalize, and compose
    word = unicodedata.normalize('NFC', text_type(word.lower()))

    # Return word unchanged if it is shorter than 3 characters
    if len(word) < 3:
        return word

    _vowels = {'a', 'e', 'i', 'o', 'u', 'y'}
    # Re-map consonantal y to Y (Y will be C, y will be V)
    if word[0] == 'y':
        word = 'Y' + word[1:]
    for i in range(1, len(word)):
        if word[i] == 'y' and word[i-1] in _vowels:
            word = word[:i] + 'Y' + word[i+1:]

    # Step 1a
    if word[-1] == 's':
        if word[-4:] == 'sses':
            word = word[:-2]
        elif word[-3:] == 'ies':
            word = word[:-2]
        elif word[-2:] == 'ss':
            pass
        else:
            word = word[:-1]

    # Step 1b
    step1b_flag = False
    if word[-3:] == 'eed':
        if _m_degree(word[:-3], _vowels) > 0:
            word = word[:-1]
    elif word[-2:] == 'ed':
        if _sb_has_vowel(word[:-2], _vowels):
            word = word[:-2]
            step1b_flag = True
    elif word[-3:] == 'ing':
        if _sb_has_vowel(word[:-3], _vowels):
            word = word[:-3]
            step1b_flag = True
    elif early_english:
        if word[-3:] == 'est':
            if _sb_has_vowel(word[:-3], _vowels):
                word = word[:-3]
                step1b_flag = True
        elif word[-3:] == 'eth':
            if _sb_has_vowel(word[:-3], _vowels):
                word = word[:-3]
                step1b_flag = True

    if step1b_flag:
        if word[-2:] in {'at', 'bl', 'iz'}:
            word += 'e'
        elif (_ends_in_doubled_cons(word, _vowels) and
              word[-1] not in {'l', 's', 'z'}):
            word = word[:-1]
        elif _m_degree(word, _vowels) == 1 and _ends_in_cvc(word, _vowels):
            word += 'e'

    # Step 1c
    if word[-1] in {'Y', 'y'} and _sb_has_vowel(word[:-1], _vowels):
        word = word[:-1] + 'i'

    # Step 2
    if len(word) > 1:
        if word[-2] == 'a':
            if word[-7:] == 'ational':
                if _m_degree(word[:-7], _vowels) > 0:
                    word = word[:-5] + 'e'
            elif word[-6:] == 'tional':
                if _m_degree(word[:-6], _vowels) > 0:
                    word = word[:-2]
        elif word[-2] == 'c':
            if word[-4:] in {'enci', 'anci'}:
                if _m_degree(word[:-4], _vowels) > 0:
                    word = word[:-1] + 'e'
        elif word[-2] == 'e':
            if word[-4:] == 'izer':
                if _m_degree(word[:-4], _vowels) > 0:
                    word = word[:-1]
        elif word[-2] == 'g':
            if word[-4:] == 'logi':
                if _m_degree(word[:-4], _vowels) > 0:
                    word = word[:-1]
        elif word[-2] == 'l':
            if word[-3:] == 'bli':
                if _m_degree(word[:-3], _vowels) > 0:
                    word = word[:-1] + 'e'
            elif word[-4:] == 'alli':
                if _m_degree(word[:-4], _vowels) > 0:
                    word = word[:-2]
            elif word[-5:] == 'entli':
                if _m_degree(word[:-5], _vowels) > 0:
                    word = word[:-2]
            elif word[-3:] == 'eli':
                if _m_degree(word[:-3], _vowels) > 0:
                    word = word[:-2]
            elif word[-5:] == 'ousli':
                if _m_degree(word[:-5], _vowels) > 0:
                    word = word[:-2]
        elif word[-2] == 'o':
            if word[-7:] == 'ization':
                if _m_degree(word[:-7], _vowels) > 0:
                    word = word[:-5] + 'e'
            elif word[-5:] == 'ation':
                if _m_degree(word[:-5], _vowels) > 0:
                    word = word[:-3] + 'e'
            elif word[-4:] == 'ator':
                if _m_degree(word[:-4], _vowels) > 0:
                    word = word[:-2] + 'e'
        elif word[-2] == 's':
            if word[-5:] == 'alism':
                if _m_degree(word[:-5], _vowels) > 0:
                    word = word[:-3]
            elif word[-7:] in {'iveness', 'fulness', 'ousness'}:
                if _m_degree(word[:-7], _vowels) > 0:
                    word = word[:-4]
        elif word[-2] == 't':
            if word[-5:] == 'aliti':
                if _m_degree(word[:-5], _vowels) > 0:
                    word = word[:-3]
            elif word[-5:] == 'iviti':
                if _m_degree(word[:-5], _vowels) > 0:
                    word = word[:-3] + 'e'
            elif word[-6:] == 'biliti':
                if _m_degree(word[:-6], _vowels) > 0:
                    word = word[:-5] + 'le'

    # Step 3
    if word[-5:] == 'icate':
        if _m_degree(word[:-5], _vowels) > 0:
            word = word[:-3]
    elif word[-5:] == 'ative':
        if _m_degree(word[:-5], _vowels) > 0:
            word = word[:-5]
    elif word[-5:] in {'alize', 'iciti'}:
        if _m_degree(word[:-5], _vowels) > 0:
            word = word[:-3]
    elif word[-4:] == 'ical':
        if _m_degree(word[:-4], _vowels) > 0:
            word = word[:-2]
    elif word[-3:] == 'ful':
        if _m_degree(word[:-3], _vowels) > 0:
            word = word[:-3]
    elif word[-4:] == 'ness':
        if _m_degree(word[:-4], _vowels) > 0:
            word = word[:-4]

    # Step 4
    if word[-2:] == 'al':
        if _m_degree(word[:-2], _vowels) > 1:
            word = word[:-2]
    elif word[-4:] == 'ance':
        if _m_degree(word[:-4], _vowels) > 1:
            word = word[:-4]
    elif word[-4:] == 'ence':
        if _m_degree(word[:-4], _vowels) > 1:
            word = word[:-4]
    elif word[-2:] == 'er':
        if _m_degree(word[:-2], _vowels) > 1:
            word = word[:-2]
    elif word[-2:] == 'ic':
        if _m_degree(word[:-2], _vowels) > 1:
            word = word[:-2]
    elif word[-4:] == 'able':
        if _m_degree(word[:-4], _vowels) > 1:
            word = word[:-4]
    elif word[-4:] == 'ible':
        if _m_degree(word[:-4], _vowels) > 1:
            word = word[:-4]
    elif word[-3:] == 'ant':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-5:] == 'ement':
        if _m_degree(word[:-5], _vowels) > 1:
            word = word[:-5]
    elif word[-4:] == 'ment':
        if _m_degree(word[:-4], _vowels) > 1:
            word = word[:-4]
    elif word[-3:] == 'ent':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-4:] in {'sion', 'tion'}:
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-2:] == 'ou':
        if _m_degree(word[:-2], _vowels) > 1:
            word = word[:-2]
    elif word[-3:] == 'ism':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-3:] == 'ate':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-3:] == 'iti':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-3:] == 'ous':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-3:] == 'ive':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-3:] == 'ize':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]

    # Step 5a
    if word[-1] == 'e':
        if _m_degree(word[:-1], _vowels) > 1:
            word = word[:-1]
        elif (_m_degree(word[:-1], _vowels) == 1 and
              not _ends_in_cvc(word[:-1], _vowels)):
            word = word[:-1]

    # Step 5b
    if word[-2:] == 'll' and _m_degree(word, _vowels) > 1:
        word = word[:-1]

    # Change 'Y' back to 'y' if it survived stemming
    for i in range(len(word)):
        # Review issue (unused-code): Consider using enumerate instead of
        # iterating with range and len
        if word[i] == 'Y':
            word = word[:i] + 'y' + word[i+1:]

    return word


def _sb_r1(term, vowels, r1_prefixes=None):
    """Return the R1 region, as defined in the Porter2 specification."""
    vowel_found = False
    if hasattr(r1_prefixes, '__iter__'):
        for prefix in r1_prefixes:
            if term[:len(prefix)] == prefix:
                return len(prefix)

    for i in range(len(term)):
        # Review issue (unused-code): Consider using enumerate instead of
        # iterating with range and len
        if not vowel_found and term[i] in vowels:
            vowel_found = True
        elif vowel_found and term[i] not in vowels:
            return i + 1
    return len(term)


def _sb_r2(term, vowels, r1_prefixes=None):
    """Return the R2 region, as defined in the Porter2 specification."""
    r1_start = _sb_r1(term, vowels, r1_prefixes)
    return r1_start + _sb_r1(term[r1_start:], vowels)


def _sb_ends_in_short_syllable(term, vowels, codanonvowels):
    """Return True iff term ends in a short syllable.

    (...according to the Porter2 specification.)

    NB: This is akin to the CVC test from the Porter stemmer. The description
    is unfortunately poor/ambiguous.
    """
    if not term:
        return False
    if len(term) == 2:
        if term[-2] in vowels and term[-1] not in vowels:
            return True
    elif len(term) >= 3:
        if ((term[-3] not in vowels and term[-2] in vowels and
             term[-1] in codanonvowels)):
            return True
    return False


def _sb_short_word(term, vowels, codanonvowels, r1_prefixes=None):
    """Return True iff term is a short word.

    (...according to the Porter2 specification.)
    """
    if ((_sb_r1(term, vowels, r1_prefixes) == len(term) and
         _sb_ends_in_short_syllable(term, vowels, codanonvowels))):
        return True
    return False


def porter2(word, early_english=False):
742
    """Return the Porter2 (Snowball English) stem.
743
744
    The Porter2 (Snowball English) stemmer is defined at:
745
    http://snowball.tartarus.org/algorithms/english/stemmer.html
746
747
    :param word: the word to calculate the stem of
748
    :param early_english: set to True in order to remove -eth & -est (2nd & 3rd
749
        person singular verbal agreement suffixes)
750
    :returns: word stem
751
    :rtype: str
752
753
    >>> porter2('reading')
754
    'read'
755
    >>> porter2('suspension')
756
    'suspens'
757
    >>> porter2('elusiveness')
758
    'elus'
759
760
    >>> porter2('eateth', early_english=True)
761
    'eat'
762
    """
763
    # pylint: disable=too-many-branches
764
    # pylint: disable=too-many-return-statements
765
766
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y'}
767
    _codanonvowels = {"'", 'b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm',
768
                      'n', 'p', 'q', 'r', 's', 't', 'v', 'z'}
769
    _doubles = {'bb', 'dd', 'ff', 'gg', 'mm', 'nn', 'pp', 'rr', 'tt'}
770
    _li = {'c', 'd', 'e', 'g', 'h', 'k', 'm', 'n', 'r', 't'}
771
772
    # R1 prefixes should be in order from longest to shortest to prevent
773
    # masking
774
    _r1_prefixes = ('commun', 'gener', 'arsen')
775
    _exception1dict = {  # special changes:
776
        'skis': 'ski', 'skies': 'sky', 'dying': 'die',
777
        'lying': 'lie', 'tying': 'tie',
778
        # special -LY cases:
779
        'idly': 'idl', 'gently': 'gentl', 'ugly': 'ugli',
780
        'early': 'earli', 'only': 'onli', 'singly': 'singl'}
781
    _exception1set = {'sky', 'news', 'howe', 'atlas', 'cosmos', 'bias',
782
                      'andes'}
783
    _exception2set = {'inning', 'outing', 'canning', 'herring', 'earring',
784
                      'proceed', 'exceed', 'succeed'}
785
786
    # lowercase, normalize, and compose
787
    word = unicodedata.normalize('NFC', text_type(word.lower()))
    # replace apostrophe-like characters (U+2018, U+2019, U+201B) with U+0027,
    # per http://snowball.tartarus.org/texts/apostrophe.html
    for apostrophe in ('‘', '’', '‛'):
        word = word.replace(apostrophe, '\'')

    # Exceptions 1
    if word in _exception1dict:
        return _exception1dict[word]
    if word in _exception1set:
        return word

    # Return word if it is shorter than 3 characters
    if len(word) < 3:
        return word

    # Remove initial ', if present.
    while word and word[0] == '\'':
        word = word[1:]
        # Return word if it is shorter than 2 characters
        if len(word) < 2:
            return word

    # Re-map consonantal y to Y (Y will be C, y will be V)
    if word[0] == 'y':
        word = 'Y' + word[1:]
    for i in range(1, len(word)):
        if word[i] == 'y' and word[i-1] in _vowels:
            word = word[:i] + 'Y' + word[i+1:]

    r1_start = _sb_r1(word, _vowels, _r1_prefixes)
    r2_start = _sb_r2(word, _vowels, _r1_prefixes)

    # Step 0
    if word[-3:] == '\'s\'':
        word = word[:-3]
    elif word[-2:] == '\'s':
        word = word[:-2]
    elif word[-1:] == '\'':
        word = word[:-1]
    # Return word if it is shorter than 3 characters
    if len(word) < 3:
        return word

    # Step 1a
    if word[-4:] == 'sses':
        word = word[:-2]
    elif word[-3:] in {'ied', 'ies'}:
        if len(word) > 4:
            word = word[:-2]
        else:
            word = word[:-1]
    elif word[-2:] in {'us', 'ss'}:
        pass
    elif word[-1] == 's':
        if _sb_has_vowel(word[:-2], _vowels):
            word = word[:-1]

    # Exceptions 2
    if word in _exception2set:
        return word

    # Step 1b
    step1b_flag = False
    if word[-5:] == 'eedly':
        if len(word[r1_start:]) >= 5:
            word = word[:-3]
    elif word[-5:] == 'ingly':
        if _sb_has_vowel(word[:-5], _vowels):
            word = word[:-5]
            step1b_flag = True
    elif word[-4:] == 'edly':
        if _sb_has_vowel(word[:-4], _vowels):
            word = word[:-4]
            step1b_flag = True
    elif word[-3:] == 'eed':
        if len(word[r1_start:]) >= 3:
            word = word[:-1]
    elif word[-3:] == 'ing':
        if _sb_has_vowel(word[:-3], _vowels):
            word = word[:-3]
            step1b_flag = True
    elif word[-2:] == 'ed':
        if _sb_has_vowel(word[:-2], _vowels):
            word = word[:-2]
            step1b_flag = True
    elif early_english:
        if word[-3:] == 'est':
            if _sb_has_vowel(word[:-3], _vowels):
                word = word[:-3]
                step1b_flag = True
        elif word[-3:] == 'eth':
            if _sb_has_vowel(word[:-3], _vowels):
                word = word[:-3]
                step1b_flag = True

    if step1b_flag:
        if word[-2:] in {'at', 'bl', 'iz'}:
            word += 'e'
        elif word[-2:] in _doubles:
            word = word[:-1]
        elif _sb_short_word(word, _vowels, _codanonvowels, _r1_prefixes):
            word += 'e'

    # Step 1c
    if ((len(word) > 2 and word[-1] in {'Y', 'y'} and
         word[-2] not in _vowels)):
        word = word[:-1] + 'i'

    # Step 2
    if word[-2] == 'a':
        if word[-7:] == 'ational':
            if len(word[r1_start:]) >= 7:
                word = word[:-5] + 'e'
        elif word[-6:] == 'tional':
            if len(word[r1_start:]) >= 6:
                word = word[:-2]
    elif word[-2] == 'c':
        if word[-4:] in {'enci', 'anci'}:
            if len(word[r1_start:]) >= 4:
                word = word[:-1] + 'e'
    elif word[-2] == 'e':
        if word[-4:] == 'izer':
            if len(word[r1_start:]) >= 4:
                word = word[:-1]
    elif word[-2] == 'g':
        if word[-3:] == 'ogi':
            if ((r1_start >= 1 and len(word[r1_start:]) >= 3 and
                 word[-4] == 'l')):
                word = word[:-1]
    elif word[-2] == 'l':
        if word[-6:] == 'lessli':
            if len(word[r1_start:]) >= 6:
                word = word[:-2]
        elif word[-5:] in {'entli', 'fulli', 'ousli'}:
            if len(word[r1_start:]) >= 5:
                word = word[:-2]
        elif word[-4:] == 'abli':
            if len(word[r1_start:]) >= 4:
                word = word[:-1] + 'e'
        elif word[-4:] == 'alli':
            if len(word[r1_start:]) >= 4:
                word = word[:-2]
        elif word[-3:] == 'bli':
            if len(word[r1_start:]) >= 3:
                word = word[:-1] + 'e'
        elif word[-2:] == 'li':
            if ((r1_start >= 1 and len(word[r1_start:]) >= 2 and
                 word[-3] in _li)):
                word = word[:-2]
    elif word[-2] == 'o':
        if word[-7:] == 'ization':
            if len(word[r1_start:]) >= 7:
                word = word[:-5] + 'e'
        elif word[-5:] == 'ation':
            if len(word[r1_start:]) >= 5:
                word = word[:-3] + 'e'
        elif word[-4:] == 'ator':
            if len(word[r1_start:]) >= 4:
                word = word[:-2] + 'e'
    elif word[-2] == 's':
        if word[-7:] in {'fulness', 'ousness', 'iveness'}:
            if len(word[r1_start:]) >= 7:
                word = word[:-4]
        elif word[-5:] == 'alism':
            if len(word[r1_start:]) >= 5:
                word = word[:-3]
    elif word[-2] == 't':
        if word[-6:] == 'biliti':
            if len(word[r1_start:]) >= 6:
                word = word[:-5] + 'le'
        elif word[-5:] == 'aliti':
            if len(word[r1_start:]) >= 5:
                word = word[:-3]
        elif word[-5:] == 'iviti':
            if len(word[r1_start:]) >= 5:
                word = word[:-3] + 'e'

    # Step 3
    if word[-7:] == 'ational':
        if len(word[r1_start:]) >= 7:
            word = word[:-5] + 'e'
    elif word[-6:] == 'tional':
        if len(word[r1_start:]) >= 6:
            word = word[:-2]
    elif word[-5:] in {'alize', 'icate', 'iciti'}:
        if len(word[r1_start:]) >= 5:
            word = word[:-3]
    elif word[-5:] == 'ative':
        if len(word[r2_start:]) >= 5:
            word = word[:-5]
    elif word[-4:] == 'ical':
        if len(word[r1_start:]) >= 4:
            word = word[:-2]
    elif word[-4:] == 'ness':
        if len(word[r1_start:]) >= 4:
            word = word[:-4]
    elif word[-3:] == 'ful':
        if len(word[r1_start:]) >= 3:
            word = word[:-3]

    # Step 4
    for suffix in ('ement', 'ance', 'ence', 'able', 'ible', 'ment', 'ant',
                   'ent', 'ism', 'ate', 'iti', 'ous', 'ive', 'ize', 'al', 'er',
                   'ic'):
        if word[-len(suffix):] == suffix:
            if len(word[r2_start:]) >= len(suffix):
                word = word[:-len(suffix)]
            break
    else:
        if word[-3:] == 'ion':
            if ((len(word[r2_start:]) >= 3 and len(word) >= 4 and
                 word[-4] in tuple('st'))):
                word = word[:-3]

    # Step 5
    if word[-1] == 'e':
        if (len(word[r2_start:]) >= 1 or
                (len(word[r1_start:]) >= 1 and
                 not _sb_ends_in_short_syllable(word[:-1], _vowels,
                                                _codanonvowels))):
            word = word[:-1]
    elif word[-1] == 'l':
        if len(word[r2_start:]) >= 1 and word[-2] == 'l':
            word = word[:-1]

    # Change 'Y' back to 'y' if it survived stemming
    word = word.replace('Y', 'y')

    return word
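The y/Y bookkeeping in `porter2` can be seen in isolation in the sketch below. It assumes Python 3, and the helper names `mark_consonant_y` and `restore_y` are illustrative only, not part of the module: an initial y, or a y that follows a vowel, is upper-cased so that later vowel tests treat it as a consonant, and any surviving marker is lowered again at the end.

```python
VOWELS = set('aeiouy')


def mark_consonant_y(word):
    """Upper-case an initial y and any y that directly follows a vowel."""
    chars = list(word)
    for i, char in enumerate(chars):
        # chars[i - 1] already reflects earlier marking, so a y that
        # follows a marked 'Y' is left alone ('Y' is not in VOWELS)
        if char == 'y' and (i == 0 or chars[i - 1] in VOWELS):
            chars[i] = 'Y'
    return ''.join(chars)


def restore_y(word):
    """Lower-case any 'Y' markers that survived stemming."""
    return word.replace('Y', 'y')


for example in ('youth', 'boyish', 'flying'):
    marked = mark_consonant_y(example)
    print(example, '->', marked, '->', restore_y(marked))
```

Marking inside the string avoids carrying a separate list of consonantal positions through every suffix-stripping step, at the cost of remembering to undo the marking before returning.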


def sb_german(word, alternate_vowels=False):
    """Return Snowball German stem.

    The Snowball German stemmer is defined at:
    http://snowball.tartarus.org/algorithms/german/stemmer.html

    :param word: the word to calculate the stem of
    :param alternate_vowels: composes ae as ä, oe as ö, and ue as ü before
        running the algorithm
    :returns: word stem
    :rtype: str

    >>> sb_german('lesen')
    'les'
    >>> sb_german('graues')
    'grau'
    >>> sb_german('buchstabieren')
    'buchstabi'
    """
    # pylint: disable=too-many-branches

    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'ä', 'ö', 'ü'}
    _s_endings = {'b', 'd', 'f', 'g', 'h', 'k', 'l', 'm', 'n', 'r', 't'}
    _st_endings = {'b', 'd', 'f', 'g', 'h', 'k', 'l', 'm', 'n', 't'}

    # lowercase, normalize, and compose
    word = unicodedata.normalize('NFC', word.lower())
    word = word.replace('ß', 'ss')

    # mark u and y between vowels as consonants (U and Y)
    if len(word) > 2:
        for i in range(2, len(word)):
            if word[i] in _vowels and word[i-2] in _vowels:
                if word[i-1] == 'u':
                    word = word[:i-1] + 'U' + word[i:]
                elif word[i-1] == 'y':
                    word = word[:i-1] + 'Y' + word[i:]

    if alternate_vowels:
        word = word.replace('ae', 'ä')
        word = word.replace('oe', 'ö')
        # shield 'que' as Q so that its 'ue' is not composed to ü
        word = word.replace('que', 'Q')
        word = word.replace('ue', 'ü')
        word = word.replace('Q', 'que')

    r1_start = max(3, _sb_r1(word, _vowels))
    r2_start = _sb_r2(word, _vowels)

    # Step 1
    niss_flag = False
    if word[-3:] == 'ern':
        if len(word[r1_start:]) >= 3:
            word = word[:-3]
    elif word[-2:] == 'em':
        if len(word[r1_start:]) >= 2:
            word = word[:-2]
    elif word[-2:] == 'er':
        if len(word[r1_start:]) >= 2:
            word = word[:-2]
    elif word[-2:] == 'en':
        if len(word[r1_start:]) >= 2:
            word = word[:-2]
            niss_flag = True
    elif word[-2:] == 'es':
        if len(word[r1_start:]) >= 2:
            word = word[:-2]
            niss_flag = True
    elif word[-1:] == 'e':
        if len(word[r1_start:]) >= 1:
            word = word[:-1]
            niss_flag = True
    elif word[-1:] == 's':
        if ((len(word[r1_start:]) >= 1 and len(word) >= 2 and
             word[-2] in _s_endings)):
            word = word[:-1]

    if niss_flag and word[-4:] == 'niss':
        word = word[:-1]

    # Step 2
    if word[-3:] == 'est':
        if len(word[r1_start:]) >= 3:
            word = word[:-3]
    elif word[-2:] == 'en':
        if len(word[r1_start:]) >= 2:
            word = word[:-2]
    elif word[-2:] == 'er':
        if len(word[r1_start:]) >= 2:
            word = word[:-2]
    elif word[-2:] == 'st':
        if ((len(word[r1_start:]) >= 2 and len(word) >= 6 and
             word[-3] in _st_endings)):
            word = word[:-2]

    # Step 3
    if word[-4:] == 'isch':
        if len(word[r2_start:]) >= 4 and word[-5] != 'e':
            word = word[:-4]
    elif word[-4:] in {'lich', 'heit'}:
        if len(word[r2_start:]) >= 4:
            word = word[:-4]
            if ((word[-2:] in {'er', 'en'} and
                 len(word[r1_start:]) >= 2)):
                word = word[:-2]
    elif word[-4:] == 'keit':
        if len(word[r2_start:]) >= 4:
            word = word[:-4]
            if word[-4:] == 'lich' and len(word[r2_start:]) >= 4:
                word = word[:-4]
            elif word[-2:] == 'ig' and len(word[r2_start:]) >= 2:
                word = word[:-2]
    elif word[-3:] in {'end', 'ung'}:
        if len(word[r2_start:]) >= 3:
            word = word[:-3]
            if ((word[-2:] == 'ig' and len(word[r2_start:]) >= 2 and
                 word[-3] != 'e')):
                word = word[:-2]
    elif word[-2:] in {'ig', 'ik'}:
        if len(word[r2_start:]) >= 2 and word[-3] != 'e':
            word = word[:-2]

    # Change 'Y' and 'U' back to lowercase if they survived stemming
    word = word.replace('Y', 'y').replace('U', 'u')

    # Remove umlauts
    _umlauts = dict(zip((ord(_) for _ in 'äöü'), 'aou'))
    word = word.translate(_umlauts)

    return word
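The umlaut removal at the end of `sb_german` depends on a translation table keyed by code point. The sketch below shows the same idea on its own, assuming Python 3 (where `str.translate` accepts such a dict); `strip_umlauts` and `UMLAUTS` are illustrative names, not part of the module. Composing to NFC first matters because a decomposed "o" plus a combining diaeresis would otherwise not match the single code point for "ö".

```python
import unicodedata

# Map each precomposed umlaut code point to its plain vowel
UMLAUTS = {ord('ä'): 'a', ord('ö'): 'o', ord('ü'): 'u'}


def strip_umlauts(word):
    """Compose to NFC, then replace ä, ö, and ü with a, o, and u."""
    composed = unicodedata.normalize('NFC', word)
    return composed.translate(UMLAUTS)


print(strip_umlauts('schön'))        # precomposed input
print(strip_umlauts('scho\u0308n'))  # decomposed input, composed by NFC first
```

A literal dict is used here in place of the `dict(zip(...))` construction; both build the same mapping, and the literal form avoids the generator variable that the linter flags.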


def sb_dutch(word):
    """Return Snowball Dutch stem.

    The Snowball Dutch stemmer is defined at:
    http://snowball.tartarus.org/algorithms/dutch/stemmer.html

    :param word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> sb_dutch('lezen')
    'lez'
    >>> sb_dutch('opschorting')
    'opschort'
    >>> sb_dutch('ongrijpbaarheid')
    'ongrijp'
    """
    # pylint: disable=too-many-branches

    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'è'}
    _not_s_endings = {'a', 'e', 'i', 'j', 'o', 'u', 'y', 'è'}

    def _undouble(word):
        """Undouble endings -kk, -dd, and -tt."""
        if ((len(word) > 1 and word[-1] == word[-2] and
             word[-1] in {'d', 'k', 't'})):
            return word[:-1]
        return word

    # lowercase, normalize, decompose, filter umlauts & acutes out, and compose
    word = unicodedata.normalize('NFC', text_type(word.lower()))
    _accented = dict(zip((ord(_) for _ in 'äëïöüáéíóú'), 'aeiouaeiou'))
    word = word.translate(_accented)

    # mark initial y, y after a vowel, and i between vowels as consonants
    for i in range(len(word)):
        if i == 0 and word[0] == 'y':
            word = 'Y' + word[1:]
        elif word[i] == 'y' and word[i-1] in _vowels:
            word = word[:i] + 'Y' + word[i+1:]
        # i > 0 keeps word[i-1] from wrapping around to the last character
        elif (i > 0 and word[i] == 'i' and word[i-1] in _vowels and
              i+1 < len(word) and word[i+1] in _vowels):
            word = word[:i] + 'I' + word[i+1:]

    r1_start = max(3, _sb_r1(word, _vowels))
    r2_start = _sb_r2(word, _vowels)

    # Step 1
    if word[-5:] == 'heden':
        if len(word[r1_start:]) >= 5:
            word = word[:-3] + 'id'
    elif word[-3:] == 'ene':
        if ((len(word[r1_start:]) >= 3 and
             (word[-4] not in _vowels and word[-6:-3] != 'gem'))):
            word = _undouble(word[:-3])
    elif word[-2:] == 'en':
        if ((len(word[r1_start:]) >= 2 and
             (word[-3] not in _vowels and word[-5:-2] != 'gem'))):
            word = _undouble(word[:-2])
    elif word[-2:] == 'se':
        if len(word[r1_start:]) >= 2 and word[-3] not in _not_s_endings:
            word = word[:-2]
    elif word[-1:] == 's':
        if len(word[r1_start:]) >= 1 and word[-2] not in _not_s_endings:
            word = word[:-1]

    # Step 2
    e_removed = False
    if word[-1:] == 'e':
        if len(word[r1_start:]) >= 1 and word[-2] not in _vowels:
            word = _undouble(word[:-1])
            e_removed = True

    # Step 3a
    if word[-4:] == 'heid':
        if len(word[r2_start:]) >= 4 and word[-5] != 'c':
            word = word[:-4]
            if word[-2:] == 'en':
                if ((len(word[r1_start:]) >= 2 and
                     (word[-3] not in _vowels and word[-5:-2] != 'gem'))):
                    word = _undouble(word[:-2])

    # Step 3b
    if word[-4:] == 'lijk':
        if len(word[r2_start:]) >= 4:
            word = word[:-4]
            # Repeat step 2
            if word[-1:] == 'e':
                if len(word[r1_start:]) >= 1 and word[-2] not in _vowels:
                    word = _undouble(word[:-1])
    elif word[-4:] == 'baar':
        if len(word[r2_start:]) >= 4:
            word = word[:-4]
    elif word[-3:] in ('end', 'ing'):
        if len(word[r2_start:]) >= 3:
            word = word[:-3]
            if ((word[-2:] == 'ig' and len(word[r2_start:]) >= 2 and
                 word[-3] != 'e')):
                word = word[:-2]
            else:
                word = _undouble(word)
    elif word[-3:] == 'bar':
        if len(word[r2_start:]) >= 3 and e_removed:
            word = word[:-3]
    elif word[-2:] == 'ig':
        if len(word[r2_start:]) >= 2 and word[-3] != 'e':
            word = word[:-2]

    # Step 4
    if ((len(word) >= 4 and
         word[-3] == word[-2] and word[-2] in {'a', 'e', 'o', 'u'} and
         word[-4] not in _vowels and
         word[-1] not in _vowels and word[-1] != 'I')):
        word = word[:-2] + word[-1]

    # Change 'Y' and 'I' back to lowercase if they survived stemming
    word = word.replace('Y', 'y').replace('I', 'i')

    return word


def sb_norwegian(word):
    """Return Snowball Norwegian stem.

    The Snowball Norwegian stemmer is defined at:
    http://snowball.tartarus.org/algorithms/norwegian/stemmer.html

    :param word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> sb_norwegian('lese')
    'les'
    >>> sb_norwegian('suspensjon')
    'suspensjon'
    >>> sb_norwegian('sikkerhet')
    'sikker'
    """
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'å', 'æ', 'ø'}
    _s_endings = {'b', 'c', 'd', 'f', 'g', 'h', 'j', 'l', 'm', 'n', 'o', 'p',
                  'r', 't', 'v', 'y', 'z'}
    # lowercase, normalize, and compose
    word = unicodedata.normalize('NFC', text_type(word.lower()))

    r1_start = min(max(3, _sb_r1(word, _vowels)), len(word))

    # Step 1
    _r1 = word[r1_start:]
    if _r1[-7:] == 'hetenes':
        word = word[:-7]
    elif _r1[-6:] in {'hetene', 'hetens'}:
        word = word[:-6]
    elif _r1[-5:] in {'heten', 'heter', 'endes'}:
        word = word[:-5]
    elif _r1[-4:] in {'ande', 'ende', 'edes', 'enes', 'erte'}:
        if word[-4:] == 'erte':
            word = word[:-2]
        else:
            word = word[:-4]
    elif _r1[-3:] in {'ede', 'ane', 'ene', 'ens', 'ers', 'ets', 'het', 'ast',
                      'ert'}:
        if word[-3:] == 'ert':
            word = word[:-1]
        else:
            word = word[:-3]
    elif _r1[-2:] in {'en', 'ar', 'er', 'as', 'es', 'et'}:
        word = word[:-2]
    elif _r1[-1:] in {'a', 'e'}:
        word = word[:-1]
    elif _r1[-1:] == 's':
        if (((len(word) > 1 and word[-2] in _s_endings) or
             (len(word) > 2 and word[-2] == 'k' and word[-3] not in _vowels))):
            word = word[:-1]

    # Step 2
    if word[r1_start:][-2:] in {'dt', 'vt'}:
        word = word[:-1]

    # Step 3
    _r1 = word[r1_start:]
    if _r1[-7:] == 'hetslov':
        word = word[:-7]
    elif _r1[-4:] in {'eleg', 'elig', 'elov', 'slov'}:
        word = word[:-4]
    elif _r1[-3:] in {'leg', 'eig', 'lig', 'els', 'lov'}:
        word = word[:-3]
    elif _r1[-2:] == 'ig':
        word = word[:-2]

    return word


def sb_swedish(word):
    """Return Snowball Swedish stem.

    The Snowball Swedish stemmer is defined at:
    http://snowball.tartarus.org/algorithms/swedish/stemmer.html

    :param word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> sb_swedish('undervisa')
    'undervis'
    >>> sb_swedish('suspension')
    'suspension'
    >>> sb_swedish('visshet')
    'viss'
    """
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'ä', 'å', 'ö'}
    _s_endings = {'b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n',
                  'o', 'p', 'r', 't', 'v', 'y'}

    # lowercase, normalize, and compose
    word = unicodedata.normalize('NFC', text_type(word.lower()))

    r1_start = min(max(3, _sb_r1(word, _vowels)), len(word))

    # Step 1
    _r1 = word[r1_start:]
    if _r1[-7:] == 'heterna':
        word = word[:-7]
    elif _r1[-6:] == 'hetens':
        word = word[:-6]
    elif _r1[-5:] in {'anden', 'heten', 'heter', 'arnas', 'ernas', 'ornas',
                      'andes', 'arens', 'andet'}:
        word = word[:-5]
    elif _r1[-4:] in {'arna', 'erna', 'orna', 'ande', 'arne', 'aste', 'aren',
                      'ades', 'erns'}:
        word = word[:-4]
    elif _r1[-3:] in {'ade', 'are', 'ern', 'ens', 'het', 'ast'}:
        word = word[:-3]
    elif _r1[-2:] in {'ad', 'en', 'ar', 'er', 'or', 'as', 'es', 'at'}:
        word = word[:-2]
    elif _r1[-1:] in {'a', 'e'}:
        word = word[:-1]
    elif _r1[-1:] == 's':
        if len(word) > 1 and word[-2] in _s_endings:
            word = word[:-1]

    # Step 2
    if word[r1_start:][-2:] in {'dd', 'gd', 'nn', 'dt', 'gt', 'kt', 'tt'}:
        word = word[:-1]

    # Step 3
    _r1 = word[r1_start:]
    if _r1[-5:] == 'fullt':
        word = word[:-1]
    elif _r1[-4:] == 'löst':
        word = word[:-1]
    elif _r1[-3:] in {'lig', 'els'}:
        word = word[:-3]
    elif _r1[-2:] == 'ig':
        word = word[:-2]

    return word


def sb_danish(word):
    """Return Snowball Danish stem.

    The Snowball Danish stemmer is defined at:
    http://snowball.tartarus.org/algorithms/danish/stemmer.html

    :param word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> sb_danish('underviser')
    'undervis'
    >>> sb_danish('suspension')
    'suspension'
    >>> sb_danish('sikkerhed')
    'sikker'
    """
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'å', 'æ', 'ø'}
    _s_endings = {'a', 'b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n',
                  'o', 'p', 'r', 't', 'v', 'y', 'z', 'å'}

    # lowercase, normalize, and compose
    word = unicodedata.normalize('NFC', text_type(word.lower()))

    r1_start = min(max(3, _sb_r1(word, _vowels)), len(word))

    # Step 1
    _r1 = word[r1_start:]
    if _r1[-7:] == 'erendes':
        word = word[:-7]
    elif _r1[-6:] in {'erende', 'hedens'}:
        word = word[:-6]
    elif _r1[-5:] in {'ethed', 'erede', 'heden', 'heder', 'endes', 'ernes',
                      'erens', 'erets'}:
        word = word[:-5]
    elif _r1[-4:] in {'ered', 'ende', 'erne', 'eren', 'erer', 'heds', 'enes',
                      'eres', 'eret'}:
        word = word[:-4]
    elif _r1[-3:] in {'hed', 'ene', 'ere', 'ens', 'ers', 'ets'}:
        word = word[:-3]
    elif _r1[-2:] in {'en', 'er', 'es', 'et'}:
        word = word[:-2]
    elif _r1[-1:] == 'e':
        word = word[:-1]
    elif _r1[-1:] == 's':
        if len(word) > 1 and word[-2] in _s_endings:
            word = word[:-1]

    # Step 2
    if word[r1_start:][-2:] in {'gd', 'dt', 'gt', 'kt'}:
        word = word[:-1]

    # Step 3
    if word[-4:] == 'igst':
        word = word[:-2]

    _r1 = word[r1_start:]
    repeat_step2 = False
    if _r1[-4:] == 'elig':
        word = word[:-4]
        repeat_step2 = True
    elif _r1[-4:] == 'løst':
        word = word[:-1]
    elif _r1[-3:] in {'lig', 'els'}:
        word = word[:-3]
        repeat_step2 = True
    elif _r1[-2:] == 'ig':
        word = word[:-2]
        repeat_step2 = True

    if repeat_step2:
        if word[r1_start:][-2:] in {'gd', 'dt', 'gt', 'kt'}:
            word = word[:-1]

    # Step 4
    if ((len(word[r1_start:]) >= 1 and len(word) >= 2 and
         word[-1] == word[-2] and word[-1] not in _vowels)):
        word = word[:-1]

    return word
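The Norwegian, Swedish, and Danish stemmers above all compute `r1_start` as `min(max(3, ...), len(word))`: R1 is pushed right so that at least three letters precede it, then clamped so it never points past the end of a short word. A minimal sketch of that adjustment follows; `plain_r1_start` is a hypothetical stand-in for `_sb_r1` (which is not shown in this part of the listing), and `scandinavian_r1_start` is an illustrative name.

```python
DANISH_VOWELS = set('aeiouyåæø')


def plain_r1_start(word, vowels):
    """Index just past the first non-vowel that follows a vowel."""
    for i in range(1, len(word)):
        if word[i] not in vowels and word[i - 1] in vowels:
            return i + 1
    return len(word)


def scandinavian_r1_start(word, vowels):
    """Adjust R1 so that at least 3 letters precede it, without overrunning."""
    return min(max(3, plain_r1_start(word, vowels)), len(word))


for example in ('underviser', 'sikkerhed', 'en'):
    start = scandinavian_r1_start(example, DANISH_VOWELS)
    print(example, start, repr(example[start:]))
```

For `'underviser'` the unadjusted R1 would begin at index 2, so the `max(3, ...)` term moves it to 3; for the two-letter word `'en'` the `min(..., len(word))` term keeps the index at 2, leaving an empty R1.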


def clef_german(word):
    """Return CLEF German stem.

    The CLEF German stemmer is defined at:
    http://members.unine.ch/jacques.savoy/clef/germanStemmer.txt

    :param word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> clef_german('lesen')
    'lese'
    >>> clef_german('graues')
    'grau'
    >>> clef_german('buchstabieren')
    'buchstabier'
    """
    # lowercase, normalize, and compose
    word = unicodedata.normalize('NFC', text_type(word.lower()))

    # remove umlauts
    _umlauts = dict(zip((ord(_) for _ in 'äöü'), 'aou'))
    word = word.translate(_umlauts)

    # remove plurals
    wlen = len(word)-1

    if wlen > 3:
        if wlen > 5:
            if word[-3:] == 'nen':
                return word[:-3]
        if wlen > 4:
            if word[-2:] in {'en', 'se', 'es', 'er'}:
                return word[:-2]
        if word[-1] in {'e', 'n', 'r', 's'}:
            return word[:-1]
    return word


def clef_german_plus(word):
    """Return 'CLEF German stemmer plus' stem.

    The CLEF German stemmer plus is defined at:
    http://members.unine.ch/jacques.savoy/clef/germanStemmerPlus.txt

    :param word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> clef_german_plus('lesen')
    'les'
    >>> clef_german_plus('graues')
    'grau'
    >>> clef_german_plus('buchstabieren')
    'buchstabi'
    """
    _st_ending = {'b', 'd', 'f', 'g', 'h', 'k', 'l', 'm', 'n', 't'}

    # lowercase, normalize, and compose
    word = unicodedata.normalize('NFC', text_type(word.lower()))

    # remove umlauts and other accents
    _accents = dict(zip((ord(_) for _ in 'äàáâöòóôïìíîüùúû'),
                        'aaaaooooiiiiuuuu'))
    word = word.translate(_accents)

    # Step 1
    wlen = len(word)-1
    if wlen > 4 and word[-3:] == 'ern':
        word = word[:-3]
    elif wlen > 3 and word[-2:] in {'em', 'en', 'er', 'es'}:
        word = word[:-2]
    elif wlen > 2 and (word[-1] == 'e' or
                       (word[-1] == 's' and word[-2] in _st_ending)):
        word = word[:-1]

    # Step 2
    wlen = len(word)-1
    if wlen > 4 and word[-3:] == 'est':
        word = word[:-3]
    elif wlen > 3 and (word[-2:] in {'er', 'en'} or
                       (word[-2:] == 'st' and word[-3] in _st_ending)):
        word = word[:-2]

    return word
1583
1584
1585
def clef_swedish(word):
    """Return CLEF Swedish stem.

    The CLEF Swedish stemmer is defined at:
    http://members.unine.ch/jacques.savoy/clef/swedishStemmer.txt

    :param word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> clef_swedish('undervisa')
    'undervis'
    >>> clef_swedish('suspension')
    'suspensio'
    >>> clef_swedish('visshet')
    'viss'
    """
    wlen = len(word)-1

    if wlen > 3 and word[-1] == 's':
        word = word[:-1]
        wlen -= 1

    if wlen > 6:
        if word[-5:] in {'elser', 'heten'}:
            return word[:-5]
    if wlen > 5:
        if word[-4:] in {'arne', 'erna', 'ande', 'else', 'aste', 'orna',
                         'aren'}:
            return word[:-4]
    if wlen > 4:
        if word[-3:] in {'are', 'ast', 'het'}:
            return word[:-3]
    if wlen > 3:
        if word[-2:] in {'ar', 'er', 'or', 'en', 'at', 'te', 'et'}:
            return word[:-2]
    if wlen > 2:
        if word[-1] in {'a', 'e', 'n', 't'}:
            return word[:-1]
    return word
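
Review note: the five length-gated suffix checks in clef_swedish() all have the same shape, so they could be driven by a table, which is one way to reduce the branch count this report flags for the stemmer functions. The following is only a sketch of that idea. The names `_SUFFIX_TABLE` and `clef_swedish_table` are hypothetical and are not part of Abydos; the thresholds and suffix sets are copied from the function above.

```python
# Hypothetical table-driven variant of clef_swedish(); illustrative only.
_SUFFIX_TABLE = (
    # (threshold that len(word)-1 must exceed, suffix length, suffixes)
    (6, 5, {'elser', 'heten'}),
    (5, 4, {'arne', 'erna', 'ande', 'else', 'aste', 'orna', 'aren'}),
    (4, 3, {'are', 'ast', 'het'}),
    (3, 2, {'ar', 'er', 'or', 'en', 'at', 'te', 'et'}),
    (2, 1, {'a', 'e', 'n', 't'}),
)


def clef_swedish_table(word):
    """Return the CLEF Swedish stem using a suffix table (sketch)."""
    wlen = len(word) - 1

    # A final 's' is dropped first, as in the original function.
    if wlen > 3 and word[-1] == 's':
        word = word[:-1]
        wlen -= 1

    # Try the longest suffixes first and strip at most one of them.
    for threshold, size, suffixes in _SUFFIX_TABLE:
        if wlen > threshold and word[-size:] in suffixes:
            return word[:-size]
    return word
```

Because the table is ordered from longest to shortest suffix and the loop returns on the first match, the control flow mirrors the original chain of `if` blocks.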


def caumanns(word):
    """Return Caumanns German stem.

    Jörg Caumanns' stemmer is described in his article at:
    http://edocs.fu-berlin.de/docs/servlets/MCRFileNodeServlet/FUDOCS_derivate_000000000350/tr-b-99-16.pdf

    This implementation is based on the GermanStemFilter described at:
    http://www.evelix.ch/unternehmen/Blog/evelix/2013/11/11/inner-workings-of-the-german-analyzer-in-lucene

    :param word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> caumanns('lesen')
    'les'
    >>> caumanns('graues')
    'grau'
    >>> caumanns('buchstabieren')
    'buchstabier'
    """
    if not word:
        return ''

    upper_initial = word[0].isupper()
    word = unicodedata.normalize('NFC', text_type(word.lower()))

    # # Part 2: Substitution
    # 1. Change umlauts to corresponding vowels & ß to ss
    # Review note (Comprehensibility, Best Practice): The variable _ does not
    # seem to be defined.
    _umlauts = dict(zip((ord(_) for _ in 'äöü'), 'aou'))
    word = word.translate(_umlauts)
    word = word.replace('ß', 'ss')

    # 2. Change second of doubled characters to *
    newword = word[0]
    for i in range(1, len(word)):
        if newword[i-1] == word[i]:
            newword += '*'
        else:
            newword += word[i]
    word = newword

    # 3. Replace sch, ch, ei, ie, ig, st with $, §, %, &, #, !
    word = word.replace('sch', '$')
    word = word.replace('ch', '§')
    word = word.replace('ei', '%')
    word = word.replace('ie', '&')
    word = word.replace('ig', '#')
    word = word.replace('st', '!')

    # # Part 1: Recursive Context-Free Stripping
    # 1. Remove the following 7 suffixes recursively
    while len(word) > 3:
        if (((len(word) > 4 and word[-2:] in {'em', 'er'}) or
             (len(word) > 5 and word[-2:] == 'nd'))):
            word = word[:-2]
        elif ((word[-1] in {'e', 's', 'n'}) or
              (not upper_initial and word[-1] in {'t', '!'})):
            word = word[:-1]
        else:
            break

    # Additional optimizations:
    if len(word) > 5 and word[-5:] == 'erin*':
        word = word[:-1]
    if word[-1] == 'z':
        word = word[:-1] + 'x'

    # Reverse substitutions:
    word = word.replace('$', 'sch')
    word = word.replace('§', 'ch')
    word = word.replace('%', 'ei')
    word = word.replace('&', 'ie')
    word = word.replace('#', 'ig')
    word = word.replace('!', 'st')

    # Expand doubled
    word = ''.join([word[0]] + [word[i-1] if word[i] == '*' else word[i] for
                                i in range(1, len(word))])

    # Finally, convert gege to ge
    if len(word) > 4:
        word = word.replace('gege', 'ge', 1)

    return word
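
Review note: caumanns() masks the second of two equal adjacent characters with `*` before stripping suffixes and restores it afterwards, so the second letter of a pair cannot match a test such as `word[-1] in {'e', 's', 'n'}`. A minimal sketch of that round trip in isolation, which could also serve as a starting point for an Extract Method refactoring. The helper names `mask_doubles` and `expand_doubles` are hypothetical and do not exist in Abydos.

```python
def mask_doubles(word):
    """Replace the second of two equal adjacent characters with '*'."""
    if not word:
        return word
    masked = word[0]
    for char in word[1:]:
        # Compare against the masked string, so 'sss' becomes 's*s'.
        masked += '*' if masked[-1] == char else char
    return masked


def expand_doubles(word):
    """Restore each '*' to a copy of the character before it."""
    if not word:
        return word
    return ''.join(
        [word[0]] +
        [word[i - 1] if word[i] == '*' else word[i]
         for i in range(1, len(word))]
    )
```

For any input that contains no literal `*`, `expand_doubles(mask_doubles(word))` returns the input unchanged, since every `*` written by the masking step directly follows the character it replaced.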


# def uealite(word):
#     """Return UEA-Lite stem.
#
#     The UEA-Lite stemmer is defined in Marie-Claire Jenkins and Dan Smith's
#     article at:
# http://wayback.archive.org/web/20121012154211/http://www.uea.ac.uk/polopoly_fs/1.85493!stemmer25feb.pdf
#
#     :param word: the word to calculate the stem of
#     :returns: word stem
#     :rtype: str
#     """
#     return word


# Review note (Unused Code): The argument word seems to be unused.
def lancaster(word):
    """Return Lancaster stem.

    Implementation of the Lancaster Stemming Algorithm, developed by
    Chris Paice, with the assistance of Gareth Husk.

    Arguments:
    word -- the word to calculate the stem of

    Description:
    The Lancaster Stemming Algorithm, described at:
    http://wayback.archive.org/web/20140826000545/http://www.comp.lancs.ac.uk/computing/research/stemming/Links/paice.htm

    Based on Paice & Husk's original Pascal reference implementation:
    http://wayback.archive.org/web/20150104225538/http://www.comp.lancs.ac.uk/computing/research/stemming/Files/Pascal.zip
    """
    _lancaster_rules = ('ai*2.', 'a*1.', 'bb1.', 'city3s.', 'ci2>', 'cn1t>',
                        'dd1.', 'dei3y>', 'deec2ss.', 'dee1.', 'de2>',
                        'dooh4>', 'e1>', 'feil1v.', 'fi2>', 'gni3>', 'gai3y.',
                        'ga2>', 'gg1.', 'ht*2.', 'hsiug5ct.', 'hsi3>', 'i*1.',
                        'i1y>', 'ji1d.', 'juf1s.', 'ju1d.', 'jo1d.', 'jeh1r.',
                        'jrev1t.', 'jsim2t.', 'jn1d.', 'j1s.', 'lbaifi6.',
                        'lbai4y.', 'lba3>', 'lbi3.', 'lib2l>', 'lc1.',
                        'lufi4y.', 'luf3>', 'lu2.', 'lai3>', 'lau3>', 'la2>',
                        'll1.', 'mui3.', 'mu*2.', 'msi3>', 'mm1.', 'nois4j>',
                        'noix4ct.', 'noi3>', 'nai3>', 'na2>', 'nee0.', 'ne2>',
                        'nn1.', 'pihs4>', 'pp1.', 're2>', 'rae0.', 'ra2.',
                        'ro2>', 'ru2>', 'rr1.', 'rt1>', 'rei3y>', 'sei3y>',
                        'sis2.', 'si2>', 'ssen4>', 'ss0.', 'suo3>', 'su*2.',
                        's*1>', 's0.', 'tacilp4y.', 'ta2>', 'tnem4>', 'tne3>',
                        'tna3>', 'tpir2b.', 'tpro2b.', 'tcud1.', 'tpmus2.',
                        'tpec2iv.', 'tulo2v.', 'tsis0.', 'tsi3>', 'tt1.',
                        'uqi3.', 'ugo1.', 'vis3j>', 'vie0.', 'vi2>', 'ylb1>',
                        'yli3y>', 'ylp0.', 'yl2>', 'ygo1.', 'yhp1.', 'ymo1.',
                        'ypo1.', 'yti3>', 'yte3>', 'ytl2.', 'yrtsi5.',
                        'yra3>', 'yro3>', 'yfi3.', 'ycn2t>', 'yca3>', 'zi2>',
                        'zy1s.')

    _rule_table = []
    _rule_index = {'a': -1, 'b': -1, 'c': -1, 'd': -1, 'e': -1, 'f': -1,
                   'g': -1, 'h': -1, 'i': -1, 'j': -1, 'k': -1, 'l': -1,
                   'm': -1, 'n': -1, 'o': -1, 'p': -1, 'q': -1, 'r': -1,
                   's': -1, 't': -1, 'u': -1, 'v': -1, 'w': -1, 'x': -1,
                   'y': -1, 'z': -1}

    # Review note (Unused Code): The variable read_rules seems to be unused.
    def read_rules(stem_rules=_lancaster_rules):
        """Read the rules table.

        read_rules reads the stemming rules in stem_rules and enters them
        into _rule_table. _rule_index is set up to provide faster access to
        relevant rules.
        """
        for rule in stem_rules:
            _rule_table.append(rule)
            # Review note (Comprehensibility, Best Practice): The variable
            # _rule_index does not seem to be defined.
            if _rule_index[rule[0]] == -1:
                _rule_index[rule[0]] = len(_rule_table)-1

    # Review note (Unused Code): The variable stemmers seems to be unused.
    def stemmers(word):
        """Reduce a word.

        stemmers takes the specified word and reduces it to a stem by
        referring to _rule_table.
        """
        # TODO: This looks very incomplete.
        # Review note (Coding Style): TODO and FIXME comments should generally
        # be avoided.
        return word
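
Review note: the entries in `_lancaster_rules` appear to follow the Paice/Husk rule notation: the ending spelled in reverse, an optional `*` meaning the rule applies only to a word that has not been stemmed yet, one digit giving the number of characters to remove, an optional string to append, and a terminator (`>` to continue stemming, `.` to stop). A minimal sketch of a parser under that assumption, which the unfinished stemmers() could build on. `LancasterRule`, `_RULE_RE`, and `parse_rule` are hypothetical names and are not part of Abydos.

```python
import re
from collections import namedtuple

# Hypothetical record type for one parsed rule; illustrative only.
LancasterRule = namedtuple(
    'LancasterRule', ['ending', 'intact_only', 'remove', 'append', 'proceed'])

# reversed ending, optional '*', one digit, optional append string, '>' or '.'
_RULE_RE = re.compile(r'^([a-z]+)(\*?)([0-9])([a-z]*)([>.])$')


def parse_rule(rule):
    """Split a rule string such as 'city3s.' into its components."""
    match = _RULE_RE.match(rule)
    if match is None:
        raise ValueError('malformed rule: {!r}'.format(rule))
    reversed_ending, intact, remove, append, terminator = match.groups()
    return LancasterRule(
        ending=reversed_ending[::-1],      # stored reversed in the rule
        intact_only=(intact == '*'),       # only for unstemmed words
        remove=int(remove),                # characters to strip
        append=append,                     # characters to add afterwards
        proceed=(terminator == '>'),       # '>' continues, '.' stops
    )
```

Under this reading, `'nois4j>'` describes the ending `sion`, which is removed (4 characters) and replaced with `j`, after which stemming continues.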