Completed
Push to master (0e5847...f79c73)
Created by Chris at 14:58

abydos.stemmer.sb_swedish()   (rated F)

Complexity

Conditions 16

Size

Total Lines 64
Code Lines 38

Duplication

Lines 19
Ratio 29.69 %

Importance

Changes 0
Metric Value
cc (cyclomatic complexity) 16
eloc (executable lines of code) 38
nop (number of parameters) 1
dl (duplicated lines) 19
loc (total lines) 64
rs 2.4
c 0
b 0
f 0
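
For orientation, cc corresponds to the Conditions figure above, eloc to Code Lines, loc to Total Lines, and dl to the duplicated lines. As a minimal sketch of how the complexity and raw size figures could be reproduced locally, assuming the third-party radon package is installed (the file path is an assumption):

from radon.complexity import cc_visit
from radon.raw import analyze

with open('abydos/stemmer.py') as src:  # path is illustrative; point it at your checkout
    source = src.read()

# cyclomatic complexity per top-level function in the module
for block in cc_visit(source):
    if block.name == 'sb_swedish':
        # should land near the reported cc of 16; counting rules differ slightly between tools
        print(block.name, block.complexity)

# raw size metrics for the whole module: loc = total lines, sloc = source lines of code
raw = analyze(source)
print(raw.loc, raw.sloc)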

How to fix

Long Method

Small methods make your code easier to understand, especially when combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, that is usually a good sign to extract the commented part into a new method, using the comment as a starting point for naming it.

Commonly applied refactorings include Extract Method, as sketched below.
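
As a minimal sketch of Extract Method (the function and names below are hypothetical, not taken from abydos), the comments in a long body become the names of the extracted helpers:

# before: one function whose comments describe its sub-steps
def stem_input(word):
    # normalize the input word
    word = word.strip().lower()
    # drop a plural 's', but keep a double 'ss'
    if word.endswith('s') and not word.endswith('ss'):
        word = word[:-1]
    return word

# after Extract Method: each commented step becomes a small, well-named helper
def _normalize(word):
    return word.strip().lower()

def _drop_plural_s(word):
    if word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word

def stem_input(word):
    return _drop_plural_s(_normalize(word))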

Complexity

Complex classes and functions like abydos.stemmer.sb_swedish() often do a lot of different things. To break such a unit down, we need to identify a cohesive component within it. A common approach to finding such a component is to look for fields or methods that share the same prefixes or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring, as sketched below. If the component makes sense as a subclass, Extract Subclass is also a candidate, and is often faster.
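
A sketch of Extract Class along the same lines (the class below is hypothetical and not part of abydos): helpers sharing a common prefix, such as this module's _sb_-prefixed vowel and region helpers, could be grouped behind one cohesive object:

class SnowballRegions:
    """Illustrative grouping of Snowball-style helper logic into one component."""

    def __init__(self, vowels):
        self.vowels = vowels

    def has_vowel(self, term):
        return any(letter in self.vowels for letter in term)

    def r1_start(self, term):
        # index just past the first vowel that is followed by a non-vowel, else len(term)
        for i in range(1, len(term)):
            if term[i] not in self.vowels and term[i - 1] in self.vowels:
                return i + 1
        return len(term)

# usage sketch
regions = SnowballRegions({'a', 'e', 'i', 'o', 'u', 'y', 'ä', 'å', 'ö'})
print(regions.has_vowel('undervisa'), regions.r1_start('undervisa'))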

1
# -*- coding: utf-8 -*-
Issue (coding-style): Too many lines in module (2488/1000)
2
3
# Copyright 2014-2018 by Christopher C. Little.
4
# This file is part of Abydos.
5
#
6
# Abydos is free software: you can redistribute it and/or modify
7
# it under the terms of the GNU General Public License as published by
8
# the Free Software Foundation, either version 3 of the License, or
9
# (at your option) any later version.
10
#
11
# Abydos is distributed in the hope that it will be useful,
12
# but WITHOUT ANY WARRANTY; without even the implied warranty of
13
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
# GNU General Public License for more details.
15
#
16
# You should have received a copy of the GNU General Public License
17
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.
18
19
"""abydos.stemmer.
20
21
The stemmer module defines word stemmers including:
22
23
    - the Lovins stemmer
24
    - the Porter and Porter2 (Snowball English) stemmers
25
    - Snowball stemmers for German, Dutch, Norwegian, Swedish, and Danish
26
    - CLEF German, German plus, and Swedish stemmers
27
    - Caumanns German stemmer
28
    - UEA-Lite Stemmer
29
    - Paice-Husk Stemmer
30
    - Schinke Latin stemmer
31
    - S stemmer
32
"""
33
34
from __future__ import unicode_literals
35
36
from re import match as re_match
37
from unicodedata import normalize
38
39
from six import text_type
40
from six.moves import range
41
42
__all__ = ['caumanns', 'clef_german', 'clef_german_plus', 'clef_swedish',
43
           'lovins', 'paice_husk', 'porter', 'porter2', 's_stemmer',
44
           'sb_danish', 'sb_dutch', 'sb_german', 'sb_norwegian', 'sb_swedish',
45
           'schinke', 'uealite']
46
47
48
def lovins(word):
Issue (comprehensibility): This function exceeds the maximum number of variables (39/15).
49
    """Return Lovins stem.
50
51
    Lovins stemmer
52
53
    The Lovins stemmer is described in Julie Beth Lovins's article
54
    :cite:`Lovins:1968`.
55
56
    :param str word: the word to stem
57
    :returns: word stem
58
    :rtype: str
59
60
    >>> lovins('reading')
61
    'read'
62
    >>> lovins('suspension')
63
    'suspens'
64
    >>> lovins('elusiveness')
65
    'elus'
66
    """
67
    # lowercase, normalize, and compose
68
    word = normalize('NFC', text_type(word.lower()))
69
70
    def cond_b(word, suffix_len):
71
        """Return Lovins' condition B.
72
73
        :param str word: word to check
74
        :param int suffix_len: suffix length
75
        :rtype: bool
76
        """
77
        return len(word)-suffix_len >= 3
78
79
    def cond_c(word, suffix_len):
80
        """Return Lovins' condition C.
81
82
        :param str word: word to check
83
        :param int suffix_len: suffix length
84
        :rtype: bool
85
        """
86
        return len(word)-suffix_len >= 4
87
88
    def cond_d(word, suffix_len):
89
        """Return Lovins' condition D.
90
91
        :param str word: word to check
92
        :param int suffix_len: suffix length
93
        :rtype: bool
94
        """
95
        return len(word)-suffix_len >= 5
96
97
    def cond_e(word, suffix_len):
98
        """Return Lovins' condition E.
99
100
        :param str word: word to check
101
        :param int suffix_len: suffix length
102
        :rtype: bool
103
        """
104
        return word[-suffix_len-1] != 'e'
105
106
    def cond_f(word, suffix_len):
107
        """Return Lovins' condition F.
108
109
        :param str word: word to check
110
        :param int suffix_len: suffix length
111
        :rtype: bool
112
        """
113
        return (len(word)-suffix_len >= 3 and
114
                word[-suffix_len-1] != 'e')
115
116
    def cond_g(word, suffix_len):
117
        """Return Lovins' condition G.
118
119
        :param str word: word to check
120
        :param int suffix_len: suffix length
121
        :rtype: bool
122
        """
123
        return (len(word)-suffix_len >= 3 and
124
                word[-suffix_len-1] == 'f')
125
126
    def cond_h(word, suffix_len):
127
        """Return Lovins' condition H.
128
129
        :param str word: word to check
130
        :param int suffix_len: suffix length
131
        :rtype: bool
132
        """
133
        return (word[-suffix_len-1] == 't' or
134
                word[-suffix_len-2:-suffix_len] == 'll')
135
136
    def cond_i(word, suffix_len):
137
        """Return Lovins' condition I.
138
139
        :param str word: word to check
140
        :param int suffix_len: suffix length
141
        :rtype: bool
142
        """
143
        return word[-suffix_len-1] not in {'e', 'o'}
144
145
    def cond_j(word, suffix_len):
146
        """Return Lovins' condition J.
147
148
        :param str word: word to check
149
        :param int suffix_len: suffix length
150
        :rtype: bool
151
        """
152
        return word[-suffix_len-1] not in {'a', 'e'}
153
154
    def cond_k(word, suffix_len):
155
        """Return Lovins' condition K.
156
157
        :param str word: word to check
158
        :param int suffix_len: suffix length
159
        :rtype: bool
160
        """
161
        return (len(word)-suffix_len >= 3 and
162
                (word[-suffix_len-1] in {'i', 'l'} or
163
                 (word[-suffix_len-3] == 'u' and word[-suffix_len-1] == 'e')))
164
165
    def cond_l(word, suffix_len):
166
        """Return Lovins' condition L.
167
168
        :param str word: word to check
169
        :param int suffix_len: suffix length
170
        :rtype: bool
171
        """
172
        return (word[-suffix_len-1] not in {'s', 'u', 'x'} or
173
                word[-suffix_len-2:-suffix_len] == 'os')
174
175
    def cond_m(word, suffix_len):
176
        """Return Lovins' condition M.
177
178
        :param str word: word to check
179
        :param int suffix_len: suffix length
180
        :rtype: bool
181
        """
182
        return word[-suffix_len-1] not in {'a', 'c', 'e', 'm'}
183
184
    def cond_n(word, suffix_len):
185
        """Return Lovins' condition N.
186
187
        :param str word: word to check
188
        :param int suffix_len: suffix length
189
        :rtype: bool
190
        """
191
        if len(word)-suffix_len >= 3:
192
            if word[-suffix_len-3] == 's':
193
                if len(word)-suffix_len >= 4:
194
                    return True
195
            else:
196
                return True
197
        return False
198
199
    def cond_o(word, suffix_len):
200
        """Return Lovins' condition O.
201
202
        :param str word: word to check
203
        :param int suffix_len: suffix length
204
        :rtype: bool
205
        """
206
        return word[-suffix_len-1] in {'i', 'l'}
207
208
    def cond_p(word, suffix_len):
209
        """Return Lovins' condition P.
210
211
        :param str word: word to check
212
        :param int suffix_len: suffix length
213
        :rtype: bool
214
        """
215
        return word[-suffix_len-1] != 'c'
216
217
    def cond_q(word, suffix_len):
218
        """Return Lovins' condition Q.
219
220
        :param str word: word to check
221
        :param int suffix_len: suffix length
222
        :rtype: bool
223
        """
224
        return (len(word)-suffix_len >= 3 and
225
                word[-suffix_len-1] not in {'l', 'n'})
226
227
    def cond_r(word, suffix_len):
228
        """Return Lovins' condition R.
229
230
        :param str word: word to check
231
        :param int suffix_len: suffix length
232
        :rtype: bool
233
        """
234
        return word[-suffix_len-1] in {'n', 'r'}
235
236
    def cond_s(word, suffix_len):
237
        """Return Lovins' condition S.
238
239
        :param str word: word to check
240
        :param int suffix_len: suffix length
241
        :rtype: bool
242
        """
243
        return (word[-suffix_len-2:-suffix_len] == 'dr' or
244
                (word[-suffix_len-1] == 't' and
245
                 word[-suffix_len-2:-suffix_len] != 'tt'))
246
247
    def cond_t(word, suffix_len):
248
        """Return Lovins' condition T.
249
250
        :param str word: word to check
251
        :param int suffix_len: suffix length
252
        :rtype: bool
253
        """
254
        return (word[-suffix_len-1] in {'s', 't'} and
255
                word[-suffix_len-2:-suffix_len] != 'ot')
256
257
    def cond_u(word, suffix_len):
258
        """Return Lovins' condition U.
259
260
        :param str word: word to check
261
        :param int suffix_len: suffix length
262
        :rtype: bool
263
        """
264
        return word[-suffix_len-1] in {'l', 'm', 'n', 'r'}
265
266
    def cond_v(word, suffix_len):
267
        """Return Lovins' condition V.
268
269
        :param str word: word to check
270
        :param int suffix_len: suffix length
271
        :rtype: bool
272
        """
273
        return word[-suffix_len-1] == 'c'
274
275
    def cond_w(word, suffix_len):
276
        """Return Lovins' condition W.
277
278
        :param str word: word to check
279
        :param int suffix_len: suffix length
280
        :rtype: bool
281
        """
282
        return word[-suffix_len-1] not in {'s', 'u'}
283
284
    def cond_x(word, suffix_len):
285
        """Return Lovins' condition X.
286
287
        :param str word: word to check
288
        :param int suffix_len: suffix length
289
        :rtype: bool
290
        """
291
        return (word[-suffix_len-1] in {'i', 'l'} or
292
                (word[-suffix_len-3] == 'u' and
293
                 word[-suffix_len-1] == 'e'))
294
295
    def cond_y(word, suffix_len):
296
        """Return Lovins' condition Y.
297
298
        :param str word: word to check
299
        :param int suffix_len: suffix length
300
        :rtype: bool
301
        """
302
        return word[-suffix_len-2:-suffix_len] == 'in'
303
304
    def cond_z(word, suffix_len):
305
        """Return Lovins' condition Z.
306
307
        :param str word: word to check
308
        :param int suffix_len: suffix length
309
        :rtype: bool
310
        """
311
        return word[-suffix_len-1] != 'f'
312
313
    def cond_aa(word, suffix_len):
314
        """Return Lovins' condition AA.
315
316
        :param str word: word to check
317
        :param int suffix_len: suffix length
318
        :rtype: bool
319
        """
320
        return (word[-suffix_len-1] in {'d', 'f', 'l', 't'} or
321
                word[-suffix_len-2:-suffix_len] in {'ph', 'th', 'er', 'or',
322
                                                    'es'})
323
324
    def cond_bb(word, suffix_len):
325
        """Return Lovins' condition BB.
326
327
        :param str word: word to check
328
        :param int suffix_len: suffix length
329
        :rtype: bool
330
        """
331
        return (len(word)-suffix_len >= 3 and
332
                word[-suffix_len-3:-suffix_len] != 'met' and
333
                word[-suffix_len-4:-suffix_len] != 'ryst')
334
335
    def cond_cc(word, suffix_len):
336
        """Return Lovins' condition CC.
337
338
        :param str word: word to check
339
        :param int suffix_len: suffix length
340
        :rtype: bool
341
        """
342
        return word[-suffix_len-1] == 'l'
343
344
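    # Lovins suffix table: each ending maps to the condition that must hold for its
    # removal; None means no extra condition beyond the minimum stem length checked below.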
    suffix = {'alistically': cond_b, 'arizability': None,
345
              'izationally': cond_b, 'antialness': None,
346
              'arisations': None, 'arizations': None, 'entialness': None,
347
              'allically': cond_c, 'antaneous': None, 'antiality': None,
348
              'arisation': None, 'arization': None, 'ationally': cond_b,
349
              'ativeness': None, 'eableness': cond_e, 'entations': None,
350
              'entiality': None, 'entialize': None, 'entiation': None,
351
              'ionalness': None, 'istically': None, 'itousness': None,
352
              'izability': None, 'izational': None, 'ableness': None,
353
              'arizable': None, 'entation': None, 'entially': None,
354
              'eousness': None, 'ibleness': None, 'icalness': None,
355
              'ionalism': None, 'ionality': None, 'ionalize': None,
356
              'iousness': None, 'izations': None, 'lessness': None,
357
              'ability': None, 'aically': None, 'alistic': cond_b,
358
              'alities': None, 'ariness': cond_e, 'aristic': None,
359
              'arizing': None, 'ateness': None, 'atingly': None,
360
              'ational': cond_b, 'atively': None, 'ativism': None,
361
              'elihood': cond_e, 'encible': None, 'entally': None,
362
              'entials': None, 'entiate': None, 'entness': None,
363
              'fulness': None, 'ibility': None, 'icalism': None,
364
              'icalist': None, 'icality': None, 'icalize': None,
365
              'ication': cond_g, 'icianry': None, 'ination': None,
366
              'ingness': None, 'ionally': None, 'isation': None,
367
              'ishness': None, 'istical': None, 'iteness': None,
368
              'iveness': None, 'ivistic': None, 'ivities': None,
369
              'ization': cond_f, 'izement': None, 'oidally': None,
370
              'ousness': None, 'aceous': None, 'acious': cond_b,
371
              'action': cond_g, 'alness': None, 'ancial': None,
372
              'ancies': None, 'ancing': cond_b, 'ariser': None,
373
              'arized': None, 'arizer': None, 'atable': None,
374
              'ations': cond_b, 'atives': None, 'eature': cond_z,
375
              'efully': None, 'encies': None, 'encing': None,
376
              'ential': None, 'enting': cond_c, 'entist': None,
377
              'eously': None, 'ialist': None, 'iality': None,
378
              'ialize': None, 'ically': None, 'icance': None,
379
              'icians': None, 'icists': None, 'ifully': None,
380
              'ionals': None, 'ionate': cond_d, 'ioning': None,
381
              'ionist': None, 'iously': None, 'istics': None,
382
              'izable': cond_e, 'lessly': None, 'nesses': None,
383
              'oidism': None, 'acies': None, 'acity': None,
384
              'aging': cond_b, 'aical': None, 'alist': None,
385
              'alism': cond_b, 'ality': None, 'alize': None,
386
              'allic': cond_bb, 'anced': cond_b, 'ances': cond_b,
387
              'antic': cond_c, 'arial': None, 'aries': None,
388
              'arily': None, 'arity': cond_b, 'arize': None,
389
              'aroid': None, 'ately': None, 'ating': cond_i,
390
              'ation': cond_b, 'ative': None, 'ators': None,
391
              'atory': None, 'ature': cond_e, 'early': cond_y,
392
              'ehood': None, 'eless': None, 'elity': None,
393
              'ement': None, 'enced': None, 'ences': None,
394
              'eness': cond_e, 'ening': cond_e, 'ental': None,
395
              'ented': cond_c, 'ently': None, 'fully': None,
396
              'ially': None, 'icant': None, 'ician': None,
397
              'icide': None, 'icism': None, 'icist': None,
398
              'icity': None, 'idine': cond_i, 'iedly': None,
399
              'ihood': None, 'inate': None, 'iness': None,
400
              'ingly': cond_b, 'inism': cond_j, 'inity': cond_cc,
401
              'ional': None, 'ioned': None, 'ished': None,
402
              'istic': None, 'ities': None, 'itous': None,
403
              'ively': None, 'ivity': None, 'izers': cond_f,
404
              'izing': cond_f, 'oidal': None, 'oides': None,
405
              'otide': None, 'ously': None, 'able': None, 'ably': None,
406
              'ages': cond_b, 'ally': cond_b, 'ance': cond_b, 'ancy': cond_b,
407
              'ants': cond_b, 'aric': None, 'arly': cond_k, 'ated': cond_i,
408
              'ates': None, 'atic': cond_b, 'ator': None, 'ealy': cond_y,
409
              'edly': cond_e, 'eful': None, 'eity': None, 'ence': None,
410
              'ency': None, 'ened': cond_e, 'enly': cond_e, 'eous': None,
411
              'hood': None, 'ials': None, 'ians': None, 'ible': None,
412
              'ibly': None, 'ical': None, 'ides': cond_l, 'iers': None,
413
              'iful': None, 'ines': cond_m, 'ings': cond_n, 'ions': cond_b,
414
              'ious': None, 'isms': cond_b, 'ists': None, 'itic': cond_h,
415
              'ized': cond_f, 'izer': cond_f, 'less': None, 'lily': None,
416
              'ness': None, 'ogen': None, 'ward': None, 'wise': None,
417
              'ying': cond_b, 'yish': None, 'acy': None, 'age': cond_b,
418
              'aic': None, 'als': cond_bb, 'ant': cond_b, 'ars': cond_o,
419
              'ary': cond_f, 'ata': None, 'ate': None, 'eal': cond_y,
420
              'ear': cond_y, 'ely': cond_e, 'ene': cond_e, 'ent': cond_c,
421
              'ery': cond_e, 'ese': None, 'ful': None, 'ial': None,
422
              'ian': None, 'ics': None, 'ide': cond_l, 'ied': None,
423
              'ier': None, 'ies': cond_p, 'ily': None, 'ine': cond_m,
424
              'ing': cond_n, 'ion': cond_q, 'ish': cond_c, 'ism': cond_b,
425
              'ist': None, 'ite': cond_aa, 'ity': None, 'ium': None,
426
              'ive': None, 'ize': cond_f, 'oid': None, 'one': cond_r,
427
              'ous': None, 'ae': None, 'al': cond_bb, 'ar': cond_x,
428
              'as': cond_b, 'ed': cond_e, 'en': cond_f, 'es': cond_e,
429
              'ia': None, 'ic': None, 'is': None, 'ly': cond_b,
430
              'on': cond_s, 'or': cond_t, 'um': cond_u, 'us': cond_v,
431
              'yl': cond_r, '\'s': None, 's\'': None, 'a': None,
432
              'e': None, 'i': None, 'o': None, 's': cond_w, 'y': cond_b}
433
434
    for suffix_len in range(11, 0, -1):
435
        ending = word[-suffix_len:]
436
        if (ending in suffix and
437
                len(word)-suffix_len >= 2 and
438
                (suffix[ending] is None or
439
                 suffix[ending](word, suffix_len))):
440
            word = word[:-suffix_len]
441
            break
442
443
    def recode9(stem):
444
        """Return Lovins' conditional recode rule 9."""
445
        if stem[-3:-2] in {'a', 'i', 'o'}:
446
            return stem
447
        return stem[:-2]+'l'
448
449
    def recode24(stem):
450
        """Return Lovins' conditional recode rule 24."""
451
        if stem[-4:-3] == 's':
452
            return stem
453
        return stem[:-1]+'s'
454
455
    def recode28(stem):
456
        """Return Lovins' conditional recode rule 28."""
457
        if stem[-4:-3] in {'p', 't'}:
458
            return stem
459
        return stem[:-1]+'s'
460
461
    def recode30(stem):
462
        """Return Lovins' conditional recode rule 30."""
463
        if stem[-4:-3] == 'm':
464
            return stem
465
        return stem[:-1]+'s'
466
467
    def recode32(stem):
468
        """Return Lovins' conditional recode rule 32."""
469
        if stem[-3:-2] == 'n':
470
            return stem
471
        return stem[:-1]+'s'
472
473
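    # Undouble a final doubled consonant before applying the recoding rules below.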
    if word[-2:] in {'bb', 'dd', 'gg', 'll', 'mm', 'nn', 'pp', 'rr', 'ss',
474
                     'tt'}:
475
        word = word[:-1]
476
477
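    # Lovins recoding rules, applied after suffix removal: each pair is (ending,
    # replacement); a callable replacement implements one of the conditional recode rules.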
    recode = (('iev', 'ief'),
478
              ('uct', 'uc'),
479
              ('umpt', 'um'),
480
              ('rpt', 'rb'),
481
              ('urs', 'ur'),
482
              ('istr', 'ister'),
483
              ('metr', 'meter'),
484
              ('olv', 'olut'),
485
              ('ul', recode9),
486
              ('bex', 'bic'),
487
              ('dex', 'dic'),
488
              ('pex', 'pic'),
489
              ('tex', 'tic'),
490
              ('ax', 'ac'),
491
              ('ex', 'ec'),
492
              ('ix', 'ic'),
493
              ('lux', 'luc'),
494
              ('uad', 'uas'),
495
              ('vad', 'vas'),
496
              ('cid', 'cis'),
497
              ('lid', 'lis'),
498
              ('erid', 'eris'),
499
              ('pand', 'pans'),
500
              ('end', recode24),
501
              ('ond', 'ons'),
502
              ('lud', 'lus'),
503
              ('rud', 'rus'),
504
              ('her', recode28),
505
              ('mit', 'mis'),
506
              ('ent', recode30),
507
              ('ert', 'ers'),
508
              ('et', recode32),
509
              ('yt', 'ys'),
510
              ('yz', 'ys'))
511
512
    for ending, replacement in recode:
513
        if word.endswith(ending):
514
            if callable(replacement):
515
                word = replacement(word)
516
            else:
517
                word = word[:-len(ending)] + replacement
518
519
    return word
520
521
522
def _m_degree(term, vowels):
523
    """Return Porter helper function _m_degree value.
524
525
    m-degree is equal to the number of V to C transitions
526
527
    :param str term: the word for which to calculate the m-degree
528
    :param set vowels: the set of vowels in the language
529
    :returns: the m-degree as defined in the Porter stemmer definition
530
    :rtype: int
531
    """
532
    mdeg = 0
533
    last_was_vowel = False
534
    for letter in term:
535
        if letter in vowels:
536
            last_was_vowel = True
537
        else:
538
            if last_was_vowel:
539
                mdeg += 1
540
            last_was_vowel = False
541
    return mdeg
542
543
544
def _sb_has_vowel(term, vowels):
545
    """Return Porter helper function _sb_has_vowel value.
546
547
    :param str term: the word to scan for vowels
548
    :param set vowels: the set of vowels in the language
549
    :returns: true iff a vowel exists in the term (as defined in the Porter
550
        stemmer definition)
551
    :rtype: bool
552
    """
553
    for letter in term:
554
        if letter in vowels:
555
            return True
556
    return False
557
558
559
def _ends_in_doubled_cons(term, vowels):
560
    """Return Porter helper function _ends_in_doubled_cons value.
561
562
    :param str term: the word to check for a final doubled consonant
563
    :param set vowels: the set of vowels in the language
564
    :returns: true iff the stem ends in a doubled consonant (as defined in the
565
        Porter stemmer definition)
566
    :rtype: bool
567
    """
568
    return len(term) > 1 and term[-1] not in vowels and term[-2] == term[-1]
569
570
571
def _ends_in_cvc(term, vowels):
572
    """Return Porter helper function _ends_in_cvc value.
573
574
    :param str term: the word to scan for cvc
575
    :param set vowels: the set of vowels in the language
576
    :returns: true iff the stem ends in cvc (as defined in the Porter stemmer
577
        definition)
578
    :rtype: bool
579
    """
580
    return (len(term) > 2 and (term[-1] not in vowels and
581
                               term[-2] in vowels and
582
                               term[-3] not in vowels and
583
                               term[-1] not in tuple('wxY')))
584
585
586
def porter(word, early_english=False):
587
    """Return Porter stem.
588
589
    The Porter stemmer is described in :cite:`Porter:1980`.
590
591
    :param str word: the word to calculate the stem of
592
    :param bool early_english: set to True in order to remove -eth & -est
593
        (2nd & 3rd person singular verbal agreement suffixes)
594
    :returns: word stem
595
    :rtype: str
596
597
    >>> porter('reading')
598
    'read'
599
    >>> porter('suspension')
600
    'suspens'
601
    >>> porter('elusiveness')
602
    'elus'
603
604
    >>> porter('eateth', early_english=True)
605
    'eat'
606
    """
607
    # lowercase, normalize, and compose
608
    word = normalize('NFC', text_type(word.lower()))
609
610
    # Return word if stem is shorter than 2
611
    if len(word) < 3:
612
        return word
613
614
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y'}
615
    # Re-map consonantal y to Y (Y will be C, y will be V)
616
    if word[0] == 'y':
617
        word = 'Y' + word[1:]
618
    for i in range(1, len(word)):
619
        if word[i] == 'y' and word[i-1] in _vowels:
620
            word = word[:i] + 'Y' + word[i+1:]
621
622
    # Step 1a
623
    if word[-1] == 's':
624
        if word[-4:] == 'sses':
625
            word = word[:-2]
626
        elif word[-3:] == 'ies':
627
            word = word[:-2]
628
        elif word[-2:] == 'ss':
629
            pass
630
        else:
631
            word = word[:-1]
632
633
    # Step 1b
634
    step1b_flag = False
635
    if word[-3:] == 'eed':
636
        if _m_degree(word[:-3], _vowels) > 0:
637
            word = word[:-1]
638
    elif word[-2:] == 'ed':
639
        if _sb_has_vowel(word[:-2], _vowels):
640
            word = word[:-2]
641
            step1b_flag = True
642
    elif word[-3:] == 'ing':
643
        if _sb_has_vowel(word[:-3], _vowels):
644
            word = word[:-3]
645
            step1b_flag = True
646
    elif early_english:
647
        if word[-3:] == 'est':
648
            if _sb_has_vowel(word[:-3], _vowels):
649
                word = word[:-3]
650
                step1b_flag = True
651
        elif word[-3:] == 'eth':
652
            if _sb_has_vowel(word[:-3], _vowels):
653
                word = word[:-3]
654
                step1b_flag = True
655
656
    if step1b_flag:
657
        if word[-2:] in {'at', 'bl', 'iz'}:
658
            word += 'e'
659
        elif (_ends_in_doubled_cons(word, _vowels) and
660
              word[-1] not in {'l', 's', 'z'}):
661
            word = word[:-1]
662
        elif _m_degree(word, _vowels) == 1 and _ends_in_cvc(word, _vowels):
663
            word += 'e'
664
665
    # Step 1c
666
    if word[-1] in {'Y', 'y'} and _sb_has_vowel(word[:-1], _vowels):
667
        word = word[:-1] + 'i'
668
669
    # Step 2
670
    if len(word) > 1:
671
        if word[-2] == 'a':
672
            if word[-7:] == 'ational':
673
                if _m_degree(word[:-7], _vowels) > 0:
674
                    word = word[:-5] + 'e'
675
            elif word[-6:] == 'tional':
676
                if _m_degree(word[:-6], _vowels) > 0:
677
                    word = word[:-2]
678
        elif word[-2] == 'c':
679
            if word[-4:] in {'enci', 'anci'}:
680
                if _m_degree(word[:-4], _vowels) > 0:
681
                    word = word[:-1] + 'e'
682
        elif word[-2] == 'e':
683
            if word[-4:] == 'izer':
684
                if _m_degree(word[:-4], _vowels) > 0:
685
                    word = word[:-1]
686
        elif word[-2] == 'g':
687
            if word[-4:] == 'logi':
688
                if _m_degree(word[:-4], _vowels) > 0:
689
                    word = word[:-1]
690
        elif word[-2] == 'l':
691
            if word[-3:] == 'bli':
692
                if _m_degree(word[:-3], _vowels) > 0:
693
                    word = word[:-1] + 'e'
694
            elif word[-4:] == 'alli':
695
                if _m_degree(word[:-4], _vowels) > 0:
696
                    word = word[:-2]
697
            elif word[-5:] == 'entli':
698
                if _m_degree(word[:-5], _vowels) > 0:
699
                    word = word[:-2]
700
            elif word[-3:] == 'eli':
701
                if _m_degree(word[:-3], _vowels) > 0:
702
                    word = word[:-2]
703
            elif word[-5:] == 'ousli':
704
                if _m_degree(word[:-5], _vowels) > 0:
705
                    word = word[:-2]
706
        elif word[-2] == 'o':
707
            if word[-7:] == 'ization':
708
                if _m_degree(word[:-7], _vowels) > 0:
709
                    word = word[:-5] + 'e'
710
            elif word[-5:] == 'ation':
711
                if _m_degree(word[:-5], _vowels) > 0:
712
                    word = word[:-3] + 'e'
713
            elif word[-4:] == 'ator':
714
                if _m_degree(word[:-4], _vowels) > 0:
715
                    word = word[:-2] + 'e'
716
        elif word[-2] == 's':
717
            if word[-5:] == 'alism':
718
                if _m_degree(word[:-5], _vowels) > 0:
719
                    word = word[:-3]
720
            elif word[-7:] in {'iveness', 'fulness', 'ousness'}:
721
                if _m_degree(word[:-7], _vowels) > 0:
722
                    word = word[:-4]
723
        elif word[-2] == 't':
724
            if word[-5:] == 'aliti':
725
                if _m_degree(word[:-5], _vowels) > 0:
726
                    word = word[:-3]
727
            elif word[-5:] == 'iviti':
728
                if _m_degree(word[:-5], _vowels) > 0:
729
                    word = word[:-3] + 'e'
730
            elif word[-6:] == 'biliti':
731
                if _m_degree(word[:-6], _vowels) > 0:
732
                    word = word[:-5] + 'le'
733
734
    # Step 3
735
    if word[-5:] == 'icate':
736
        if _m_degree(word[:-5], _vowels) > 0:
737
            word = word[:-3]
738
    elif word[-5:] == 'ative':
739
        if _m_degree(word[:-5], _vowels) > 0:
740
            word = word[:-5]
741
    elif word[-5:] in {'alize', 'iciti'}:
742
        if _m_degree(word[:-5], _vowels) > 0:
743
            word = word[:-3]
744
    elif word[-4:] == 'ical':
745
        if _m_degree(word[:-4], _vowels) > 0:
746
            word = word[:-2]
747
    elif word[-3:] == 'ful':
748
        if _m_degree(word[:-3], _vowels) > 0:
749
            word = word[:-3]
750
    elif word[-4:] == 'ness':
751
        if _m_degree(word[:-4], _vowels) > 0:
752
            word = word[:-4]
753
754
    # Step 4
755
    if word[-2:] == 'al':
756
        if _m_degree(word[:-2], _vowels) > 1:
757
            word = word[:-2]
758
    elif word[-4:] == 'ance':
759
        if _m_degree(word[:-4], _vowels) > 1:
760
            word = word[:-4]
761
    elif word[-4:] == 'ence':
762
        if _m_degree(word[:-4], _vowels) > 1:
763
            word = word[:-4]
764
    elif word[-2:] == 'er':
765
        if _m_degree(word[:-2], _vowels) > 1:
766
            word = word[:-2]
767
    elif word[-2:] == 'ic':
768
        if _m_degree(word[:-2], _vowels) > 1:
769
            word = word[:-2]
770
    elif word[-4:] == 'able':
771
        if _m_degree(word[:-4], _vowels) > 1:
772
            word = word[:-4]
773
    elif word[-4:] == 'ible':
774
        if _m_degree(word[:-4], _vowels) > 1:
775
            word = word[:-4]
776
    elif word[-3:] == 'ant':
777
        if _m_degree(word[:-3], _vowels) > 1:
778
            word = word[:-3]
779
    elif word[-5:] == 'ement':
780
        if _m_degree(word[:-5], _vowels) > 1:
781
            word = word[:-5]
782
    elif word[-4:] == 'ment':
783
        if _m_degree(word[:-4], _vowels) > 1:
784
            word = word[:-4]
785
    elif word[-3:] == 'ent':
786
        if _m_degree(word[:-3], _vowels) > 1:
787
            word = word[:-3]
788
    elif word[-4:] in {'sion', 'tion'}:
789
        if _m_degree(word[:-3], _vowels) > 1:
790
            word = word[:-3]
791
    elif word[-2:] == 'ou':
792
        if _m_degree(word[:-2], _vowels) > 1:
793
            word = word[:-2]
794
    elif word[-3:] == 'ism':
795
        if _m_degree(word[:-3], _vowels) > 1:
796
            word = word[:-3]
797
    elif word[-3:] == 'ate':
798
        if _m_degree(word[:-3], _vowels) > 1:
799
            word = word[:-3]
800
    elif word[-3:] == 'iti':
801
        if _m_degree(word[:-3], _vowels) > 1:
802
            word = word[:-3]
803
    elif word[-3:] == 'ous':
804
        if _m_degree(word[:-3], _vowels) > 1:
805
            word = word[:-3]
806
    elif word[-3:] == 'ive':
807
        if _m_degree(word[:-3], _vowels) > 1:
808
            word = word[:-3]
809
    elif word[-3:] == 'ize':
810
        if _m_degree(word[:-3], _vowels) > 1:
811
            word = word[:-3]
812
813
    # Step 5a
814
    if word[-1] == 'e':
815
        if _m_degree(word[:-1], _vowels) > 1:
816
            word = word[:-1]
817
        elif (_m_degree(word[:-1], _vowels) == 1 and
818
              not _ends_in_cvc(word[:-1], _vowels)):
819
            word = word[:-1]
820
821
    # Step 5b
822
    if word[-2:] == 'll' and _m_degree(word, _vowels) > 1:
823
        word = word[:-1]
824
825
    # Change 'Y' back to 'y' if it survived stemming
826
    for i in range(len(word)):
Issue (unused-code): Consider using enumerate instead of iterating with range and len
827
        if word[i] == 'Y':
828
            word = word[:i] + 'y' + word[i+1:]
829
830
    return word
831
832
833
def _sb_r1(term, vowels, r1_prefixes=None):
834
    """Return the R1 region, as defined in the Porter2 specification."""
835
    vowel_found = False
836
    if hasattr(r1_prefixes, '__iter__'):
837
        for prefix in r1_prefixes:
838
            if term[:len(prefix)] == prefix:
839
                return len(prefix)
840
841
    for i in range(len(term)):
Issue (unused-code): Consider using enumerate instead of iterating with range and len
842
        if not vowel_found and term[i] in vowels:
843
            vowel_found = True
844
        elif vowel_found and term[i] not in vowels:
845
            return i + 1
846
    return len(term)
847
848
849
def _sb_r2(term, vowels, r1_prefixes=None):
850
    """Return the R2 region, as defined in the Porter2 specification."""
851
    r1_start = _sb_r1(term, vowels, r1_prefixes)
852
    return r1_start + _sb_r1(term[r1_start:], vowels)
853
854
855
def _sb_ends_in_short_syllable(term, vowels, codanonvowels):
856
    """Return True iff term ends in a short syllable.
857
858
    (...according to the Porter2 specification.)
859
860
    NB: This is akin to the CVC test from the Porter stemmer. The description
861
    is unfortunately poor/ambiguous.
862
    """
863
    if not term:
864
        return False
865
    if len(term) == 2:
866
        if term[-2] in vowels and term[-1] not in vowels:
867
            return True
868
    elif len(term) >= 3:
869
        if ((term[-3] not in vowels and term[-2] in vowels and
870
             term[-1] in codanonvowels)):
871
            return True
872
    return False
873
874
875
def _sb_short_word(term, vowels, codanonvowels, r1_prefixes=None):
876
    """Return True iff term is a short word.
877
878
    (...according to the Porter2 specification.)
879
    """
880
    if ((_sb_r1(term, vowels, r1_prefixes) == len(term) and
881
         _sb_ends_in_short_syllable(term, vowels, codanonvowels))):
882
        return True
883
    return False
884
885
886
def porter2(word, early_english=False):
Issue (best-practice): Too many return statements (7/6)
887
    """Return the Porter2 (Snowball English) stem.
888
889
    The Porter2 (Snowball English) stemmer is defined in :cite:`Porter:2002`.
890
891
    :param str word: the word to calculate the stem of
892
    :param bool early_english: set to True in order to remove -eth & -est
893
        (2nd & 3rd person singular verbal agreement suffixes)
894
    :returns: word stem
895
    :rtype: str
896
897
    >>> porter2('reading')
898
    'read'
899
    >>> porter2('suspension')
900
    'suspens'
901
    >>> porter2('elusiveness')
902
    'elus'
903
904
    >>> porter2('eateth', early_english=True)
905
    'eat'
906
    """
907
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y'}
908
    _codanonvowels = {"'", 'b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm',
909
                      'n', 'p', 'q', 'r', 's', 't', 'v', 'z'}
910
    _doubles = {'bb', 'dd', 'ff', 'gg', 'mm', 'nn', 'pp', 'rr', 'tt'}
911
    _li = {'c', 'd', 'e', 'g', 'h', 'k', 'm', 'n', 'r', 't'}
912
913
    # R1 prefixes should be in order from longest to shortest to prevent
914
    # masking
915
    _r1_prefixes = ('commun', 'gener', 'arsen')
916
    _exception1dict = {  # special changes:
917
        'skis': 'ski', 'skies': 'sky', 'dying': 'die',
918
        'lying': 'lie', 'tying': 'tie',
919
        # special -LY cases:
920
        'idly': 'idl', 'gently': 'gentl', 'ugly': 'ugli',
921
        'early': 'earli', 'only': 'onli', 'singly': 'singl'}
922
    _exception1set = {'sky', 'news', 'howe', 'atlas', 'cosmos', 'bias',
923
                      'andes'}
924
    _exception2set = {'inning', 'outing', 'canning', 'herring', 'earring',
925
                      'proceed', 'exceed', 'succeed'}
926
927
    # lowercase, normalize, and compose
928
    word = normalize('NFC', text_type(word.lower()))
929
    # replace apostrophe-like characters with U+0027, per
930
    # http://snowball.tartarus.org/texts/apostrophe.html
931
    word = word.replace('’', '\'')
932
    word = word.replace('’', '\'')
933
934
    # Exceptions 1
935
    if word in _exception1dict:
Issue (unused-code): Unnecessary "elif" after "return"
936
        return _exception1dict[word]
937
    elif word in _exception1set:
938
        return word
939
940
    # Return word if stem is shorter than 3
941
    if len(word) < 3:
942
        return word
943
944
    # Remove initial ', if present.
945
    while word and word[0] == '\'':
946
        word = word[1:]
947
        # Return word if stem is shorter than 2
948
        if len(word) < 2:
949
            return word
950
951
    # Re-map consonantal y to Y (Y will be C, y will be V)
952
    if word[0] == 'y':
953
        word = 'Y' + word[1:]
954
    for i in range(1, len(word)):
955
        if word[i] == 'y' and word[i-1] in _vowels:
956
            word = word[:i] + 'Y' + word[i+1:]
957
958
    r1_start = _sb_r1(word, _vowels, _r1_prefixes)
959
    r2_start = _sb_r2(word, _vowels, _r1_prefixes)
960
961
    # Step 0
962
    if word[-3:] == '\'s\'':
963
        word = word[:-3]
964
    elif word[-2:] == '\'s':
965
        word = word[:-2]
966
    elif word[-1:] == '\'':
967
        word = word[:-1]
968
    # Return word if stem is shorter than 2
969
    if len(word) < 3:
970
        return word
971
972
    # Step 1a
973
    if word[-4:] == 'sses':
974
        word = word[:-2]
975
    elif word[-3:] in {'ied', 'ies'}:
976
        if len(word) > 4:
977
            word = word[:-2]
978
        else:
979
            word = word[:-1]
980
    elif word[-2:] in {'us', 'ss'}:
981
        pass
982
    elif word[-1] == 's':
983
        if _sb_has_vowel(word[:-2], _vowels):
984
            word = word[:-1]
985
986
    # Exceptions 2
987
    if word in _exception2set:
988
        return word
989
990
    # Step 1b
991
    step1b_flag = False
992
    if word[-5:] == 'eedly':
993
        if len(word[r1_start:]) >= 5:
994
            word = word[:-3]
995
    elif word[-5:] == 'ingly':
996
        if _sb_has_vowel(word[:-5], _vowels):
997
            word = word[:-5]
998
            step1b_flag = True
999
    elif word[-4:] == 'edly':
1000
        if _sb_has_vowel(word[:-4], _vowels):
1001
            word = word[:-4]
1002
            step1b_flag = True
1003
    elif word[-3:] == 'eed':
1004
        if len(word[r1_start:]) >= 3:
1005
            word = word[:-1]
1006
    elif word[-3:] == 'ing':
1007
        if _sb_has_vowel(word[:-3], _vowels):
1008
            word = word[:-3]
1009
            step1b_flag = True
1010
    elif word[-2:] == 'ed':
1011
        if _sb_has_vowel(word[:-2], _vowels):
1012
            word = word[:-2]
1013
            step1b_flag = True
1014
    elif early_english:
1015
        if word[-3:] == 'est':
1016
            if _sb_has_vowel(word[:-3], _vowels):
1017
                word = word[:-3]
1018
                step1b_flag = True
1019
        elif word[-3:] == 'eth':
1020
            if _sb_has_vowel(word[:-3], _vowels):
1021
                word = word[:-3]
1022
                step1b_flag = True
1023
1024
    if step1b_flag:
1025
        if word[-2:] in {'at', 'bl', 'iz'}:
1026
            word += 'e'
1027
        elif word[-2:] in _doubles:
1028
            word = word[:-1]
1029
        elif _sb_short_word(word, _vowels, _codanonvowels, _r1_prefixes):
1030
            word += 'e'
1031
1032
    # Step 1c
1033
    if ((len(word) > 2 and word[-1] in {'Y', 'y'} and
1034
         word[-2] not in _vowels)):
1035
        word = word[:-1] + 'i'
1036
1037
    # Step 2
1038
    if word[-2] == 'a':
1039
        if word[-7:] == 'ational':
1040
            if len(word[r1_start:]) >= 7:
1041
                word = word[:-5] + 'e'
1042
        elif word[-6:] == 'tional':
1043
            if len(word[r1_start:]) >= 6:
1044
                word = word[:-2]
1045
    elif word[-2] == 'c':
1046
        if word[-4:] in {'enci', 'anci'}:
1047
            if len(word[r1_start:]) >= 4:
1048
                word = word[:-1] + 'e'
1049
    elif word[-2] == 'e':
1050
        if word[-4:] == 'izer':
1051
            if len(word[r1_start:]) >= 4:
1052
                word = word[:-1]
1053
    elif word[-2] == 'g':
1054
        if word[-3:] == 'ogi':
1055
            if ((r1_start >= 1 and len(word[r1_start:]) >= 3 and
1056
                 word[-4] == 'l')):
1057
                word = word[:-1]
1058
    elif word[-2] == 'l':
1059
        if word[-6:] == 'lessli':
1060
            if len(word[r1_start:]) >= 6:
1061
                word = word[:-2]
1062
        elif word[-5:] in {'entli', 'fulli', 'ousli'}:
1063
            if len(word[r1_start:]) >= 5:
1064
                word = word[:-2]
1065
        elif word[-4:] == 'abli':
1066
            if len(word[r1_start:]) >= 4:
1067
                word = word[:-1] + 'e'
1068
        elif word[-4:] == 'alli':
1069
            if len(word[r1_start:]) >= 4:
1070
                word = word[:-2]
1071
        elif word[-3:] == 'bli':
1072
            if len(word[r1_start:]) >= 3:
1073
                word = word[:-1] + 'e'
1074
        elif word[-2:] == 'li':
1075
            if ((r1_start >= 1 and len(word[r1_start:]) >= 2 and
1076
                 word[-3] in _li)):
1077
                word = word[:-2]
1078
    elif word[-2] == 'o':
1079
        if word[-7:] == 'ization':
1080
            if len(word[r1_start:]) >= 7:
1081
                word = word[:-5] + 'e'
1082
        elif word[-5:] == 'ation':
1083
            if len(word[r1_start:]) >= 5:
1084
                word = word[:-3] + 'e'
1085
        elif word[-4:] == 'ator':
1086
            if len(word[r1_start:]) >= 4:
1087
                word = word[:-2] + 'e'
1088
    elif word[-2] == 's':
1089
        if word[-7:] in {'fulness', 'ousness', 'iveness'}:
1090
            if len(word[r1_start:]) >= 7:
1091
                word = word[:-4]
1092
        elif word[-5:] == 'alism':
1093
            if len(word[r1_start:]) >= 5:
1094
                word = word[:-3]
1095
    elif word[-2] == 't':
1096
        if word[-6:] == 'biliti':
1097
            if len(word[r1_start:]) >= 6:
1098
                word = word[:-5] + 'le'
1099
        elif word[-5:] == 'aliti':
1100
            if len(word[r1_start:]) >= 5:
1101
                word = word[:-3]
1102
        elif word[-5:] == 'iviti':
1103
            if len(word[r1_start:]) >= 5:
1104
                word = word[:-3] + 'e'
1105
1106
    # Step 3
1107
    if word[-7:] == 'ational':
1108
        if len(word[r1_start:]) >= 7:
1109
            word = word[:-5] + 'e'
1110
    elif word[-6:] == 'tional':
1111
        if len(word[r1_start:]) >= 6:
1112
            word = word[:-2]
1113
    elif word[-5:] in {'alize', 'icate', 'iciti'}:
1114
        if len(word[r1_start:]) >= 5:
1115
            word = word[:-3]
1116
    elif word[-5:] == 'ative':
1117
        if len(word[r2_start:]) >= 5:
1118
            word = word[:-5]
1119
    elif word[-4:] == 'ical':
1120
        if len(word[r1_start:]) >= 4:
1121
            word = word[:-2]
1122
    elif word[-4:] == 'ness':
1123
        if len(word[r1_start:]) >= 4:
1124
            word = word[:-4]
1125
    elif word[-3:] == 'ful':
1126
        if len(word[r1_start:]) >= 3:
1127
            word = word[:-3]
1128
1129
    # Step 4
1130
    for suffix in ('ement', 'ance', 'ence', 'able', 'ible', 'ment', 'ant',
1131
                   'ent', 'ism', 'ate', 'iti', 'ous', 'ive', 'ize', 'al', 'er',
1132
                   'ic'):
1133
        if word[-len(suffix):] == suffix:
1134
            if len(word[r2_start:]) >= len(suffix):
1135
                word = word[:-len(suffix)]
1136
            break
1137
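    # for-else: the else branch runs only if no suffix in the tuple above matched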
    else:
1138
        if word[-3:] == 'ion':
1139
            if ((len(word[r2_start:]) >= 3 and len(word) >= 4 and
1140
                 word[-4] in tuple('st'))):
1141
                word = word[:-3]
1142
1143
    # Step 5
1144
    if word[-1] == 'e':
1145
        if (len(word[r2_start:]) >= 1 or
1146
                (len(word[r1_start:]) >= 1 and
1147
                 not _sb_ends_in_short_syllable(word[:-1], _vowels,
1148
                                                _codanonvowels))):
1149
            word = word[:-1]
1150
    elif word[-1] == 'l':
1151
        if len(word[r2_start:]) >= 1 and word[-2] == 'l':
1152
            word = word[:-1]
1153
1154
    # Change 'Y' back to 'y' if it survived stemming
1155
    for i in range(0, len(word)):
Issue (unused-code): Consider using enumerate instead of iterating with range and len
1156
        if word[i] == 'Y':
1157
            word = word[:i] + 'y' + word[i+1:]
1158
1159
    return word
1160
1161
1162
def sb_german(word, alternate_vowels=False):
1163
    """Return Snowball German stem.
1164
1165
    The Snowball German stemmer is defined at:
1166
    http://snowball.tartarus.org/algorithms/german/stemmer.html
1167
1168
    :param str word: the word to calculate the stem of
1169
    :param bool alternate_vowels: composes ae as ä, oe as ö, and ue as ü before
1170
        running the algorithm
1171
    :returns: word stem
1172
    :rtype: str
1173
1174
    >>> sb_german('lesen')
1175
    'les'
1176
    >>> sb_german('graues')
1177
    'grau'
1178
    >>> sb_german('buchstabieren')
1179
    'buchstabi'
1180
    """
1181
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'ä', 'ö', 'ü'}
1182
    _s_endings = {'b', 'd', 'f', 'g', 'h', 'k', 'l', 'm', 'n', 'r', 't'}
1183
    _st_endings = {'b', 'd', 'f', 'g', 'h', 'k', 'l', 'm', 'n', 't'}
1184
1185
    # lowercase, normalize, and compose
1186
    word = normalize('NFC', word.lower())
1187
    word = word.replace('ß', 'ss')
1188
1189
    if len(word) > 2:
1190
        for i in range(2, len(word)):
1191
            if word[i] in _vowels and word[i-2] in _vowels:
1192
                if word[i-1] == 'u':
1193
                    word = word[:i-1] + 'U' + word[i:]
1194
                elif word[i-1] == 'y':
1195
                    word = word[:i-1] + 'Y' + word[i:]
1196
1197
    if alternate_vowels:
1198
        word = word.replace('ae', 'ä')
1199
        word = word.replace('oe', 'ö')
1200
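        # temporarily mask 'que' so the 'ue' -> 'ü' substitution below leaves it intact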
        word = word.replace('que', 'Q')
1201
        word = word.replace('ue', 'ü')
1202
        word = word.replace('Q', 'que')
1203
1204
    r1_start = max(3, _sb_r1(word, _vowels))
1205
    r2_start = _sb_r2(word, _vowels)
1206
1207
    # Step 1
1208
    niss_flag = False
1209
    if word[-3:] == 'ern':
1210
        if len(word[r1_start:]) >= 3:
1211
            word = word[:-3]
1212
    elif word[-2:] == 'em':
1213
        if len(word[r1_start:]) >= 2:
1214
            word = word[:-2]
1215
    elif word[-2:] == 'er':
1216
        if len(word[r1_start:]) >= 2:
1217
            word = word[:-2]
1218
    elif word[-2:] == 'en':
1219
        if len(word[r1_start:]) >= 2:
1220
            word = word[:-2]
1221
            niss_flag = True
1222
    elif word[-2:] == 'es':
1223
        if len(word[r1_start:]) >= 2:
1224
            word = word[:-2]
1225
            niss_flag = True
1226
    elif word[-1:] == 'e':
1227
        if len(word[r1_start:]) >= 1:
1228
            word = word[:-1]
1229
            niss_flag = True
1230
    elif word[-1:] == 's':
1231
        if ((len(word[r1_start:]) >= 1 and len(word) >= 2 and
1232
             word[-2] in _s_endings)):
1233
            word = word[:-1]
1234
1235
    if niss_flag and word[-4:] == 'niss':
1236
        word = word[:-1]
1237
1238
    # Step 2
1239
    if word[-3:] == 'est':
1240
        if len(word[r1_start:]) >= 3:
1241
            word = word[:-3]
1242
    elif word[-2:] == 'en':
1243
        if len(word[r1_start:]) >= 2:
1244
            word = word[:-2]
1245
    elif word[-2:] == 'er':
1246
        if len(word[r1_start:]) >= 2:
1247
            word = word[:-2]
1248
    elif word[-2:] == 'st':
1249
        if ((len(word[r1_start:]) >= 2 and len(word) >= 6 and
1250
             word[-3] in _st_endings)):
1251
            word = word[:-2]
1252
1253
    # Step 3
1254
    if word[-4:] == 'isch':
1255
        if len(word[r2_start:]) >= 4 and word[-5] != 'e':
1256
            word = word[:-4]
1257
    elif word[-4:] in {'lich', 'heit'}:
1258
        if len(word[r2_start:]) >= 4:
1259
            word = word[:-4]
1260
            if ((word[-2:] in {'er', 'en'} and
1261
                 len(word[r1_start:]) >= 2)):
1262
                word = word[:-2]
1263
    elif word[-4:] == 'keit':
1264
        if len(word[r2_start:]) >= 4:
1265
            word = word[:-4]
1266
            if word[-4:] == 'lich' and len(word[r2_start:]) >= 4:
1267
                word = word[:-4]
1268
            elif word[-2:] == 'ig' and len(word[r2_start:]) >= 2:
1269
                word = word[:-2]
1270
    elif word[-3:] in {'end', 'ung'}:
1271
        if len(word[r2_start:]) >= 3:
1272
            word = word[:-3]
1273
            if ((word[-2:] == 'ig' and len(word[r2_start:]) >= 2 and
1274
                 word[-3] != 'e')):
1275
                word = word[:-2]
1276
    elif word[-2:] in {'ig', 'ik'}:
1277
        if len(word[r2_start:]) >= 2 and word[-3] != 'e':
1278
            word = word[:-2]
1279
1280
    # Change 'Y' and 'U' back to lowercase if survived stemming
1281
    for i in range(0, len(word)):
Issue (unused-code): Consider using enumerate instead of iterating with range and len
1282
        if word[i] == 'Y':
1283
            word = word[:i] + 'y' + word[i+1:]
1284
        elif word[i] == 'U':
1285
            word = word[:i] + 'u' + word[i+1:]
1286
1287
    # Remove umlauts
1288
    _umlauts = dict(zip((ord(_) for _ in 'äöü'), 'aou'))
Issue (comprehensibility best practice): The variable _ does not seem to be defined.
1289
    word = word.translate(_umlauts)
1290
1291
    return word
1292
1293
1294
def sb_dutch(word):
1295
    """Return Snowball Dutch stem.
1296
1297
    The Snowball Dutch stemmer is defined at:
1298
    http://snowball.tartarus.org/algorithms/dutch/stemmer.html
1299
1300
    :param str word: the word to calculate the stem of
1301
    :returns: word stem
1302
    :rtype: str
1303
1304
    >>> sb_dutch('lezen')
1305
    'lez'
1306
    >>> sb_dutch('opschorting')
1307
    'opschort'
1308
    >>> sb_dutch('ongrijpbaarheid')
1309
    'ongrijp'
1310
    """
1311
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'è'}
1312
    _not_s_endings = {'a', 'e', 'i', 'j', 'o', 'u', 'y', 'è'}
1313
1314
    def _undouble(word):
1315
        """Undouble endings -kk, -dd, and -tt."""
1316
        if ((len(word) > 1 and word[-1] == word[-2] and
1317
             word[-1] in {'d', 'k', 't'})):
1318
            return word[:-1]
1319
        return word
1320
1321
    # lowercase, normalize, decompose, filter umlauts & acutes out, and compose
1322
    word = normalize('NFC', text_type(word.lower()))
1323
    _accented = dict(zip((ord(_) for _ in 'äëïöüáéíóú'), 'aeiouaeiou'))
Issue (comprehensibility best practice): The variable _ does not seem to be defined.
1324
    word = word.translate(_accented)
1325
1326
    for i in range(len(word)):
Issue (unused-code): Consider using enumerate instead of iterating with range and len
1327
        if i == 0 and word[0] == 'y':
1328
            word = 'Y' + word[1:]
1329
        elif word[i] == 'y' and word[i-1] in _vowels:
1330
            word = word[:i] + 'Y' + word[i+1:]
1331
        elif (word[i] == 'i' and word[i-1] in _vowels and i+1 < len(word) and
1332
              word[i+1] in _vowels):
1333
            word = word[:i] + 'I' + word[i+1:]
1334
1335
    r1_start = max(3, _sb_r1(word, _vowels))
1336
    r2_start = _sb_r2(word, _vowels)
1337
1338
    # Step 1
1339
    if word[-5:] == 'heden':
1340
        if len(word[r1_start:]) >= 5:
1341
            word = word[:-3] + 'id'
1342
    elif word[-3:] == 'ene':
1343
        if ((len(word[r1_start:]) >= 3 and
1344
             (word[-4] not in _vowels and word[-6:-3] != 'gem'))):
1345
            word = _undouble(word[:-3])
1346
    elif word[-2:] == 'en':
1347
        if ((len(word[r1_start:]) >= 2 and
1348
             (word[-3] not in _vowels and word[-5:-2] != 'gem'))):
1349
            word = _undouble(word[:-2])
1350
    elif word[-2:] == 'se':
1351
        if len(word[r1_start:]) >= 2 and word[-3] not in _not_s_endings:
1352
            word = word[:-2]
1353
    elif word[-1:] == 's':
1354
        if len(word[r1_start:]) >= 1 and word[-2] not in _not_s_endings:
1355
            word = word[:-1]
1356
1357
    # Step 2
1358
    e_removed = False
1359
    if word[-1:] == 'e':
1360
        if len(word[r1_start:]) >= 1 and word[-2] not in _vowels:
1361
            word = _undouble(word[:-1])
1362
            e_removed = True
1363
1364
    # Step 3a
1365
    if word[-4:] == 'heid':
1366
        if len(word[r2_start:]) >= 4 and word[-5] != 'c':
1367
            word = word[:-4]
1368
            if word[-2:] == 'en':
1369
                if ((len(word[r1_start:]) >= 2 and
1370
                     (word[-3] not in _vowels and word[-5:-2] != 'gem'))):
1371
                    word = _undouble(word[:-2])
1372
1373
    # Step 3b
1374
    if word[-4:] == 'lijk':
1375
        if len(word[r2_start:]) >= 4:
1376
            word = word[:-4]
1377
            # Repeat step 2
1378
            if word[-1:] == 'e':
1379
                if len(word[r1_start:]) >= 1 and word[-2] not in _vowels:
1380
                    word = _undouble(word[:-1])
1381
    elif word[-4:] == 'baar':
1382
        if len(word[r2_start:]) >= 4:
1383
            word = word[:-4]
1384
    elif word[-3:] in ('end', 'ing'):
1385
        if len(word[r2_start:]) >= 3:
1386
            word = word[:-3]
1387
            if ((word[-2:] == 'ig' and len(word[r2_start:]) >= 2 and
1388
                 word[-3] != 'e')):
1389
                word = word[:-2]
1390
            else:
1391
                word = _undouble(word)
1392
    elif word[-3:] == 'bar':
1393
        if len(word[r2_start:]) >= 3 and e_removed:
1394
            word = word[:-3]
1395
    elif word[-2:] == 'ig':
1396
        if len(word[r2_start:]) >= 2 and word[-3] != 'e':
1397
            word = word[:-2]
1398
1399
    # Step 4
1400
    if ((len(word) >= 4 and
Issue (best-practice): Too many boolean expressions in if statement (6/5)
1401
         word[-3] == word[-2] and word[-2] in {'a', 'e', 'o', 'u'} and
1402
         word[-4] not in _vowels and
1403
         word[-1] not in _vowels and word[-1] != 'I')):
1404
        word = word[:-2] + word[-1]
1405
1406
    # Change 'Y' and 'U' back to lowercase if survived stemming
1407
    for i in range(0, len(word)):
Issue (unused-code): Consider using enumerate instead of iterating with range and len
1408
        if word[i] == 'Y':
1409
            word = word[:i] + 'y' + word[i+1:]
1410
        elif word[i] == 'I':
1411
            word = word[:i] + 'i' + word[i+1:]
1412
1413
    return word
1414
1415
1416
def sb_norwegian(word):
1417
    """Return Snowball Norwegian stem.
1418
1419
    The Snowball Norwegian stemmer is defined at:
1420
    http://snowball.tartarus.org/algorithms/norwegian/stemmer.html
1421
1422
    :param str word: the word to calculate the stem of
1423
    :returns: word stem
1424
    :rtype: str
1425
1426
    >>> sb_norwegian('lese')
1427
    'les'
1428
    >>> sb_norwegian('suspensjon')
1429
    'suspensjon'
1430
    >>> sb_norwegian('sikkerhet')
1431
    'sikker'
1432
    """
1433
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'å', 'æ', 'ø'}
1434
    _s_endings = {'b', 'c', 'd', 'f', 'g', 'h', 'j', 'l', 'm', 'n', 'o', 'p',
1435
                  'r', 't', 'v', 'y', 'z'}
1436
    # lowercase, normalize, and compose
1437
    word = normalize('NFC', text_type(word.lower()))
1438
1439
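    # R1 is the region after the first non-vowel following a vowel (the
    # standard Snowball definition, assuming _sb_r1 implements it as elsewhere
    # in this module), adjusted to start no earlier than position 3.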
    r1_start = min(max(3, _sb_r1(word, _vowels)), len(word))
1440
1441
    # Step 1
1442
    _r1 = word[r1_start:]
1443
    if _r1[-7:] == 'hetenes':
1444
        word = word[:-7]
1445
    elif _r1[-6:] in {'hetene', 'hetens'}:
1446
        word = word[:-6]
1447
    elif _r1[-5:] in {'heten', 'heter', 'endes'}:
1448
        word = word[:-5]
1449
    elif _r1[-4:] in {'ande', 'ende', 'edes', 'enes', 'erte'}:
1450
        if word[-4:] == 'erte':
1451
            word = word[:-2]
1452
        else:
1453
            word = word[:-4]
1454
    elif _r1[-3:] in {'ede', 'ane', 'ene', 'ens', 'ers', 'ets', 'het', 'ast',
1455
                      'ert'}:
1456
        if word[-3:] == 'ert':
1457
            word = word[:-1]
1458
        else:
1459
            word = word[:-3]
1460
    elif _r1[-2:] in {'en', 'ar', 'er', 'as', 'es', 'et'}:
1461
        word = word[:-2]
1462
    elif _r1[-1:] in {'a', 'e'}:
1463
        word = word[:-1]
1464
    elif _r1[-1:] == 's':
1465
        if (((len(word) > 1 and word[-2] in _s_endings) or
1466
             (len(word) > 2 and word[-2] == 'k' and word[-3] not in _vowels))):
1467
            word = word[:-1]
1468
1469
    # Step 2
1470
    if word[r1_start:][-2:] in {'dt', 'vt'}:
1471
        word = word[:-1]
1472
1473
    # Step 3
1474
    _r1 = word[r1_start:]
1475
    if _r1[-7:] == 'hetslov':
1476
        word = word[:-7]
1477
    elif _r1[-4:] in {'eleg', 'elig', 'elov', 'slov'}:
1478
        word = word[:-4]
1479
    elif _r1[-3:] in {'leg', 'eig', 'lig', 'els', 'lov'}:
1480
        word = word[:-3]
1481
    elif _r1[-2:] == 'ig':
1482
        word = word[:-2]
1483
1484
    return word
1485
1486
1487
def sb_swedish(word):
1488
    """Return Snowball Swedish stem.
1489
1490
    The Snowball Swedish stemmer is defined at:
1491
    http://snowball.tartarus.org/algorithms/swedish/stemmer.html
1492
1493
    :param str word: the word to calculate the stem of
1494
    :returns: word stem
1495
    :rtype: str
1496
1497
    >>> sb_swedish('undervisa')
1498
    'undervis'
1499
    >>> sb_swedish('suspension')
1500
    'suspension'
1501
    >>> sb_swedish('visshet')
1502
    'viss'
1503
    """
1504
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'ä', 'å', 'ö'}
1505
    _s_endings = {'b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n',
1506
                  'o', 'p', 'r', 't', 'v', 'y'}
1507
1508
    # lowercase, normalize, and compose
1509
    word = normalize('NFC', text_type(word.lower()))
1510
1511
    r1_start = min(max(3, _sb_r1(word, _vowels)), len(word))
1512
1513
    # Step 1
1514
    _r1 = word[r1_start:]
1515
    if _r1[-7:] == 'heterna':
1516
        word = word[:-7]
1517
    elif _r1[-6:] == 'hetens':
1518
        word = word[:-6]
1519
    elif _r1[-5:] in {'anden', 'heten', 'heter', 'arnas', 'ernas', 'ornas',
1520
                      'andes', 'arens', 'andet'}:
1521
        word = word[:-5]
1522
    elif _r1[-4:] in {'arna', 'erna', 'orna', 'ande', 'arne', 'aste', 'aren',
1523
                      'ades', 'erns'}:
1524
        word = word[:-4]
1525
    elif _r1[-3:] in {'ade', 'are', 'ern', 'ens', 'het', 'ast'}:
1526
        word = word[:-3]
1527
    elif _r1[-2:] in {'ad', 'en', 'ar', 'er', 'or', 'as', 'es', 'at'}:
1528
        word = word[:-2]
1529
    elif _r1[-1:] in {'a', 'e'}:
1530
        word = word[:-1]
1531
    elif _r1[-1:] == 's':
1532
        if len(word) > 1 and word[-2] in _s_endings:
1533
            word = word[:-1]
1534
1535
    # Step 2
1536
    if word[r1_start:][-2:] in {'dd', 'gd', 'nn', 'dt', 'gt', 'kt', 'tt'}:
1537
        word = word[:-1]
1538
1539
    # Step 3
1540
    _r1 = word[r1_start:]
1541
    if _r1[-5:] == 'fullt':
1542
        word = word[:-1]
1543
    elif _r1[-4:] == 'löst':
1544
        word = word[:-1]
1545
    elif _r1[-3:] in {'lig', 'els'}:
1546
        word = word[:-3]
1547
    elif _r1[-2:] == 'ig':
1548
        word = word[:-2]
1549
1550
    return word
1551
1552
1553
def sb_danish(word):
1554
    """Return Snowball Danish stem.
1555
1556
    The Snowball Danish stemmer is defined at:
1557
    http://snowball.tartarus.org/algorithms/danish/stemmer.html
1558
1559
    :param str word: the word to calculate the stem of
1560
    :returns: word stem
1561
    :rtype: str
1562
1563
    >>> sb_danish('underviser')
1564
    'undervis'
1565
    >>> sb_danish('suspension')
1566
    'suspension'
1567
    >>> sb_danish('sikkerhed')
1568
    'sikker'
1569
    """
1570
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'å', 'æ', 'ø'}
1571
    _s_endings = {'a', 'b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n',
1572
                  'o', 'p', 'r', 't', 'v', 'y', 'z', 'å'}
1573
1574
    # lowercase, normalize, and compose
1575
    word = normalize('NFC', text_type(word.lower()))
1576
1577
    r1_start = min(max(3, _sb_r1(word, _vowels)), len(word))
1578
1579
    # Step 1
1580
    _r1 = word[r1_start:]
1581
    if _r1[-7:] == 'erendes':
1582
        word = word[:-7]
1583
    elif _r1[-6:] in {'erende', 'hedens'}:
1584
        word = word[:-6]
1585
    elif _r1[-5:] in {'ethed', 'erede', 'heden', 'heder', 'endes', 'ernes',
1586
                      'erens', 'erets'}:
1587
        word = word[:-5]
1588
    elif _r1[-4:] in {'ered', 'ende', 'erne', 'eren', 'erer', 'heds', 'enes',
1589
                      'eres', 'eret'}:
1590
        word = word[:-4]
1591
    elif _r1[-3:] in {'hed', 'ene', 'ere', 'ens', 'ers', 'ets'}:
1592
        word = word[:-3]
1593
    elif _r1[-2:] in {'en', 'er', 'es', 'et'}:
1594
        word = word[:-2]
1595
    elif _r1[-1:] == 'e':
1596
        word = word[:-1]
1597
    elif _r1[-1:] == 's':
1598
        if len(word) > 1 and word[-2] in _s_endings:
1599
            word = word[:-1]
1600
1601
    # Step 2
1602
    if word[r1_start:][-2:] in {'gd', 'dt', 'gt', 'kt'}:
1603
        word = word[:-1]
1604
1605
    # Step 3
1606
    if word[-4:] == 'igst':
1607
        word = word[:-2]
1608
1609
    _r1 = word[r1_start:]
1610
    repeat_step2 = False
1611
    if _r1[-4:] == 'elig':
1612
        word = word[:-4]
1613
        repeat_step2 = True
1614
    elif _r1[-4:] == 'løst':
1615
        word = word[:-1]
1616
    elif _r1[-3:] in {'lig', 'els'}:
1617
        word = word[:-3]
1618
        repeat_step2 = True
1619
    elif _r1[-2:] == 'ig':
1620
        word = word[:-2]
1621
        repeat_step2 = True
1622
1623
    if repeat_step2:
1624
        if word[r1_start:][-2:] in {'gd', 'dt', 'gt', 'kt'}:
1625
            word = word[:-1]
1626
1627
    # Step 4
1628
    if ((len(word[r1_start:]) >= 1 and len(word) >= 2 and
1629
         word[-1] == word[-2] and word[-1] not in _vowels)):
1630
        word = word[:-1]
1631
1632
    return word
1633
1634
1635
def clef_german(word):
1636
    """Return CLEF German stem.
1637
1638
    The CLEF German stemmer is defined at :cite:`Savoy:2005`.
1639
1640
    :param str word: the word to calculate the stem of
1641
    :returns: word stem
1642
    :rtype: str
1643
1644
    >>> clef_german('lesen')
1645
    'lese'
1646
    >>> clef_german('graues')
1647
    'grau'
1648
    >>> clef_german('buchstabieren')
1649
    'buchstabier'
1650
    """
1651
    # lowercase, normalize, and compose
1652
    word = normalize('NFC', text_type(word.lower()))
1653
1654
    # remove umlauts
1655
    _umlauts = dict(zip((ord(_) for _ in 'äöü'), 'aou'))
1656
    word = word.translate(_umlauts)
1657
1658
    # remove plurals
1659
    wlen = len(word)-1
1660
1661
    if wlen > 3:
1662
        if wlen > 5:
1663
            if word[-3:] == 'nen':
1664
                return word[:-3]
1665
        if wlen > 4:
1666
            if word[-2:] in {'en', 'se', 'es', 'er'}:
1667
                return word[:-2]
1668
        if word[-1] in {'e', 'n', 'r', 's'}:
1669
            return word[:-1]
1670
    return word
1671
1672
1673
def clef_german_plus(word):
1674
    """Return 'CLEF German stemmer plus' stem.
1675
1676
    The CLEF German stemmer plus is defined at :cite:`Savoy:2005`.
1677
1678
    :param str word: the word to calculate the stem of
1679
    :returns: word stem
1680
    :rtype: str
1681
1682
    >>> clef_german_plus('lesen')
1683
    'les'
1684
    >>> clef_german_plus('graues')
1685
    'grau'
1686
    >>> clef_german_plus('buchstabieren')
1687
    'buchstabi'
1688
    """
1689
    _st_ending = {'b', 'd', 'f', 'g', 'h', 'k', 'l', 'm', 'n', 't'}
1690
1691
    # lowercase, normalize, and compose
1692
    word = normalize('NFC', text_type(word.lower()))
1693
1694
    # remove umlauts and other accented vowels
1695
    _accents = dict(zip((ord(_) for _ in 'äàáâöòóôïìíîüùúû'),
1696
                        'aaaaooooiiiiuuuu'))
1697
    word = word.translate(_accents)
1698
1699
    # Step 1
1700
    wlen = len(word)-1
1701
    if wlen > 4 and word[-3:] == 'ern':
1702
        word = word[:-3]
1703
    elif wlen > 3 and word[-2:] in {'em', 'en', 'er', 'es'}:
1704
        word = word[:-2]
1705
    elif wlen > 2 and (word[-1] == 'e' or
1706
                       (word[-1] == 's' and word[-2] in _st_ending)):
1707
        word = word[:-1]
1708
1709
    # Step 2
1710
    wlen = len(word)-1
1711
    if wlen > 4 and word[-3:] == 'est':
1712
        word = word[:-3]
1713
    elif wlen > 3 and (word[-2:] in {'er', 'en'} or
1714
                       (word[-2:] == 'st' and word[-3] in _st_ending)):
1715
        word = word[:-2]
1716
1717
    return word
1718
1719
1720
def clef_swedish(word):
1721
    """Return CLEF Swedish stem.
1722
1723
    The CLEF Swedish stemmer is defined at :cite:`Savoy:2005`.
1724
1725
    :param str word: the word to calculate the stem of
1726
    :returns: word stem
1727
    :rtype: str
1728
1729
    >>> clef_swedish('undervisa')
1730
    'undervis'
1731
    >>> clef_swedish('suspension')
1732
    'suspensio'
1733
    >>> clef_swedish('visshet')
1734
    'viss'
1735
    """
1736
    wlen = len(word)-1
1737
1738
    if wlen > 3 and word[-1] == 's':
1739
        word = word[:-1]
1740
        wlen -= 1
1741
1742
    if wlen > 6:
1743
        if word[-5:] in {'elser', 'heten'}:
1744
            return word[:-5]
1745
    if wlen > 5:
1746
        if word[-4:] in {'arne', 'erna', 'ande', 'else', 'aste', 'orna',
1747
                         'aren'}:
1748
            return word[:-4]
1749
    if wlen > 4:
1750
        if word[-3:] in {'are', 'ast', 'het'}:
1751
            return word[:-3]
1752
    if wlen > 3:
1753
        if word[-2:] in {'ar', 'er', 'or', 'en', 'at', 'te', 'et'}:
1754
            return word[:-2]
1755
    if wlen > 2:
1756
        if word[-1] in {'a', 'e', 'n', 't'}:
1757
            return word[:-1]
1758
    return word
1759
1760
1761
def caumanns(word):
1762
    """Return Caumanns German stem.
1763
1764
    Jörg Caumanns' stemmer is described in his article
1765
    :cite:`Caumanns:1999`.
1766
1767
    This implementation is based on the GermanStemFilter described at
1768
    :cite:`Lang:2013`.
1769
1770
    :param str word: the word to calculate the stem of
1771
    :returns: word stem
1772
    :rtype: str
1773
1774
    >>> caumanns('lesen')
1775
    'les'
1776
    >>> caumanns('graues')
1777
    'grau'
1778
    >>> caumanns('buchstabieren')
1779
    'buchstabier'
1780
    """
1781
    if not word:
1782
        return ''
1783
1784
    upper_initial = word[0].isupper()
1785
    word = normalize('NFC', text_type(word.lower()))
1786
1787
    # # Part 2: Substitution
1788
    # 1. Change umlauts to corresponding vowels & ß to ss
1789
    _umlauts = dict(zip((ord(_) for _ in 'äöü'), 'aou'))
1790
    word = word.translate(_umlauts)
1791
    word = word.replace('ß', 'ss')
1792
1793
    # 2. Change second of doubled characters to *
1794
    new_word = word[0]
1795
    for i in range(1, len(word)):
1796
        if new_word[i-1] == word[i]:
1797
            new_word += '*'
1798
        else:
1799
            new_word += word[i]
1800
    word = new_word
1801
1802
    # 3. Replace sch, ch, ei, ie with $, §, %, &
1803
    word = word.replace('sch', '$')
1804
    word = word.replace('ch', '§')
1805
    word = word.replace('ei', '%')
1806
    word = word.replace('ie', '&')
1807
    word = word.replace('ig', '#')
1808
    word = word.replace('st', '!')
1809
1810
    # # Part 1: Recursive Context-Free Stripping
1811
    # 1. Remove the following 7 suffixes recursively
1812
    while len(word) > 3:
1813
        if (((len(word) > 4 and word[-2:] in {'em', 'er'}) or
1814
             (len(word) > 5 and word[-2:] == 'nd'))):
1815
            word = word[:-2]
1816
        elif ((word[-1] in {'e', 's', 'n'}) or
1817
              (not upper_initial and word[-1] in {'t', '!'})):
1818
            word = word[:-1]
1819
        else:
1820
            break
1821
1822
    # Additional optimizations:
1823
    if len(word) > 5 and word[-5:] == 'erin*':
1824
        word = word[:-1]
1825
    if word[-1] == 'z':
1826
        word = word[:-1] + 'x'
1827
1828
    # Reverse substitutions:
1829
    word = word.replace('$', 'sch')
1830
    word = word.replace('§', 'ch')
1831
    word = word.replace('%', 'ei')
1832
    word = word.replace('&', 'ie')
1833
    word = word.replace('#', 'ig')
1834
    word = word.replace('!', 'st')
1835
1836
    # Expand doubled
1837
    word = ''.join([word[0]] + [word[i-1] if word[i] == '*' else word[i] for
1838
                                i in range(1, len(word))])
1839
1840
    # Finally, convert gege to ge
1841
    if len(word) > 4:
1842
        word = word.replace('gege', 'ge', 1)
1843
1844
    return word
1845
1846
1847
def uealite(word, max_word_length=20, max_acro_length=8, return_rule_no=False,
1848
            var=None):
1849
    """Return UEA-Lite stem.
1850
1851
    The UEA-Lite stemmer is discussed in :cite:`Jenkins:2005`.
1852
1853
    This is chiefly based on the Java implementation of the algorithm, with
1854
    variants based on the Perl implementation and Jason Adams' Ruby port.
1855
1856
    Java version: :cite:`Churchill:2005`
1857
    Perl version: :cite:`Jenkins:2005`
1858
    Ruby version: :cite:`Adams:2017`
1859
1860
    :param str word: the word to calculate the stem of
1861
    :param int max_word_length: the maximum word length allowed
    :param int max_acro_length: the maximum acronym length allowed (checked
        only by the 'Adams' variant)
1862
    :param bool return_rule_no: if True, returns the stem along with rule
1863
        number
1864
    :param str var: variant to use (set to 'Adams' to use Jason Adams' rules,
1865
        or 'Perl' to use the original Perl set of rules)
1866
    :returns: word stem
1867
    :rtype: str or (str, int)
1868
    """
1869
    problem_words = {'is', 'as', 'this', 'has', 'was', 'during'}
1870
1871
    # rule table format:
1872
    # top-level dictionary: length-of-suffix: dict-of-rules
1873
    # dict-of-rules: suffix: (rule_no, suffix_length_to_delete,
1874
    #                         suffix_to_append)
1875
    rule_table = {7: {'titudes': (30, 1, None),
1876
                      'fulness': (34, 4, None),
1877
                      'ousness': (35, 4, None),
1878
                      'eadings': (40.7, 4, None),
1879
                      'oadings': (40.6, 4, None),
1880
                      'ealings': (42.4, 4, None),
1881
                      'ailings': (42.2, 4, None),
1882
                      },
1883
                  6: {'aceous': (1, 6, None),
1884
                      'aining': (24, 3, None),
1885
                      'acting': (25, 3, None),
1886
                      'ttings': (26, 5, None),
1887
                      'viding': (27, 3, 'e'),
1888
                      'ssings': (37, 4, None),
1889
                      'ulting': (38, 3, None),
1890
                      'eading': (40.7, 3, None),
1891
                      'oading': (40.6, 3, None),
1892
                      'edings': (40.5, 4, None),
1893
                      'ddings': (40.4, 5, None),
1894
                      'ldings': (40.3, 4, None),
1895
                      'rdings': (40.2, 4, None),
1896
                      'ndings': (40.1, 4, None),
1897
                      'llings': (41, 5, None),
1898
                      'ealing': (42.4, 3, None),
1899
                      'olings': (42.3, 4, None),
1900
                      'ailing': (42.2, 3, None),
1901
                      'elings': (42.1, 4, None),
1902
                      'mmings': (44.3, 5, None),
1903
                      'ngings': (45.2, 4, None),
1904
                      'ggings': (45.1, 5, None),
1905
                      'stings': (47, 4, None),
1906
                      'etings': (48.4, 4, None),
1907
                      'ntings': (48.2, 4, None),
1908
                      'irings': (54.4, 4, 'e'),
1909
                      'urings': (54.3, 4, 'e'),
1910
                      'ncings': (54.2, 4, 'e'),
1911
                      'things': (58.1, 1, None),
1912
                      },
1913
                  5: {'iases': (11.4, 2, None),
1914
                      'ained': (13.6, 2, None),
1915
                      'erned': (13.5, 2, None),
1916
                      'ifted': (14, 2, None),
1917
                      'ected': (15, 2, None),
1918
                      'vided': (16, 1, None),
1919
                      'erred': (19, 3, None),
1920
                      'urred': (20.5, 3, None),
1921
                      'lored': (20.4, 2, None),
1922
                      'eared': (20.3, 2, None),
1923
                      'tored': (20.2, 1, None),
1924
                      'noted': (22.4, 1, None),
1925
                      'leted': (22.3, 1, None),
1926
                      'anges': (23, 1, None),
1927
                      'tting': (26, 4, None),
1928
                      'ulted': (32, 2, None),
1929
                      'uming': (33, 3, 'e'),
1930
                      'rabed': (36.1, 1, None),
1931
                      'rebed': (36.1, 1, None),
1932
                      'ribed': (36.1, 1, None),
1933
                      'robed': (36.1, 1, None),
1934
                      'rubed': (36.1, 1, None),
1935
                      'ssing': (37, 3, None),
1936
                      'vings': (39, 4, 'e'),
1937
                      'eding': (40.5, 3, None),
1938
                      'dding': (40.4, 4, None),
1939
                      'lding': (40.3, 3, None),
1940
                      'rding': (40.2, 3, None),
1941
                      'nding': (40.1, 3, None),
1942
                      'dings': (40, 4, 'e'),
1943
                      'lling': (41, 4, None),
1944
                      'oling': (42.3, 3, None),
1945
                      'eling': (42.1, 3, None),
1946
                      'lings': (42, 4, 'e'),
1947
                      'mming': (44.3, 4, None),
1948
                      'rming': (44.2, 3, None),
1949
                      'lming': (44.1, 3, None),
1950
                      'mings': (44, 4, 'e'),
1951
                      'nging': (45.2, 3, None),
1952
                      'gging': (45.1, 4, None),
1953
                      'gings': (45, 4, 'e'),
1954
                      'aning': (46.6, 3, None),
1955
                      'ening': (46.5, 3, None),
1956
                      'gning': (46.4, 3, None),
1957
                      'nning': (46.3, 4, None),
1958
                      'oning': (46.2, 3, None),
1959
                      'rning': (46.1, 3, None),
1960
                      'sting': (47, 3, None),
1961
                      'eting': (48.4, 3, None),
1962
                      'pting': (48.3, 3, None),
1963
                      'nting': (48.2, 3, None),
1964
                      'cting': (48.1, 3, None),
1965
                      'tings': (48, 4, 'e'),
1966
                      'iring': (54.4, 3, 'e'),
1967
                      'uring': (54.3, 3, 'e'),
1968
                      'ncing': (54.2, 3, 'e'),
1969
                      'sings': (54, 4, 'e'),
1970
                      # 'lling': (55, 3, None),  # masked by 41
1971
                      'ating': (57, 3, 'e'),
1972
                      'thing': (58.1, 0, None),
1973
                      },
1974
                  4: {'eeds': (7, 1, None),
1975
                      'uses': (11.3, 1, None),
1976
                      'sses': (11.2, 2, None),
1977
                      'eses': (11.1, 2, 'is'),
1978
                      'tled': (12.5, 1, None),
1979
                      'pled': (12.4, 1, None),
1980
                      'bled': (12.3, 1, None),
1981
                      'eled': (12.2, 2, None),
1982
                      'lled': (12.1, 2, None),
1983
                      'ened': (13.7, 2, None),
1984
                      'rned': (13.4, 2, None),
1985
                      'nned': (13.3, 3, None),
1986
                      'oned': (13.2, 2, None),
1987
                      'gned': (13.1, 2, None),
1988
                      'ered': (20.1, 2, None),
1989
                      'reds': (20, 2, None),
1990
                      'tted': (21, 3, None),
1991
                      'uted': (22.2, 1, None),
1992
                      'ated': (22.1, 1, None),
1993
                      'ssed': (28, 2, None),
1994
                      'umed': (31, 1, None),
1995
                      'beds': (36, 3, None),
1996
                      'ving': (39, 3, 'e'),
1997
                      'ding': (40, 3, 'e'),
1998
                      'ling': (42, 3, 'e'),
1999
                      'nged': (43.2, 1, None),
2000
                      'gged': (43.1, 3, None),
2001
                      'ming': (44, 3, 'e'),
2002
                      'ging': (45, 3, 'e'),
2003
                      'ning': (46, 3, 'e'),
2004
                      'ting': (48, 3, 'e'),
2005
                      # 'ssed': (49, 2, None),  # masked by 28
2006
                      # 'lled': (53, 2, None),  # masked by 12.1
2007
                      'zing': (54.1, 3, 'e'),
2008
                      'sing': (54, 3, 'e'),
2009
                      'lves': (60.1, 3, 'f'),
2010
                      'aped': (61.3, 1, None),
2011
                      'uded': (61.2, 1, None),
2012
                      'oded': (61.1, 1, None),
2013
                      # 'ated': (61, 1, None),  # masked by 22.1
2014
                      'ones': (63.6, 1, None),
2015
                      'izes': (63.5, 1, None),
2016
                      'ures': (63.4, 1, None),
2017
                      'ines': (63.3, 1, None),
2018
                      'ides': (63.2, 1, None),
2019
                      },
2020
                  3: {'ces': (2, 1, None),
2021
                      'sis': (4, 0, None),
2022
                      'tis': (5, 0, None),
2023
                      'eed': (7, 0, None),
2024
                      'ued': (8, 1, None),
2025
                      'ues': (9, 1, None),
2026
                      'ees': (10, 1, None),
2027
                      'ses': (11, 1, None),
2028
                      'led': (12, 2, None),
2029
                      'ned': (13, 1, None),
2030
                      'ved': (17, 1, None),
2031
                      'ced': (18, 1, None),
2032
                      'red': (20, 1, None),
2033
                      'ted': (22, 2, None),
2034
                      'sed': (29, 1, None),
2035
                      'bed': (36, 2, None),
2036
                      'ged': (43, 1, None),
2037
                      'les': (50, 1, None),
2038
                      'tes': (51, 1, None),
2039
                      'zed': (52, 1, None),
2040
                      'ied': (56, 3, 'y'),
2041
                      'ies': (59, 3, 'y'),
2042
                      'ves': (60, 1, None),
2043
                      'pes': (63.8, 1, None),
2044
                      'mes': (63.7, 1, None),
2045
                      'ges': (63.1, 1, None),
2046
                      'ous': (65, 0, None),
2047
                      'ums': (66, 0, None),
2048
                      },
2049
                  2: {'cs': (3, 0, None),
2050
                      'ss': (6, 0, None),
2051
                      'es': (63, 2, None),
2052
                      'is': (64, 2, 'e'),
2053
                      'us': (67, 0, None),
2054
                      }}
2055
2056
    if var == 'Perl':
2057
        perl_deletions = {7: ['eadings', 'oadings', 'ealings', 'ailings'],
2058
                          6: ['ttings', 'ssings', 'edings', 'ddings',
2059
                              'ldings', 'rdings', 'ndings', 'llings',
2060
                              'olings', 'elings', 'mmings', 'ngings',
2061
                              'ggings', 'stings', 'etings', 'ntings',
2062
                              'irings', 'urings', 'ncings', 'things'],
2063
                          5: ['vings', 'dings', 'lings', 'mings', 'gings',
2064
                              'tings', 'sings'],
2065
                          4: ['eeds', 'reds', 'beds']}
2066
2067
        # Delete the above rules from rule_table
2068
        for del_len in perl_deletions:
2069
            for term in perl_deletions[del_len]:
2070
                del rule_table[del_len][term]
2071
2072
    elif var == 'Adams':
2073
        adams_additions = {6: {'chited': (22.8, 1, None)},
2074
                           5: {'dying': (58.2, 4, 'ie'),
2075
                               'tying': (58.2, 4, 'ie'),
2076
                               'vited': (22.6, 1, None),
2077
                               'mited': (22.5, 1, None),
2078
                               'vided': (22.9, 1, None),
2079
                               'mided': (22.10, 1, None),
2080
                               'lying': (58.2, 4, 'ie'),
2081
                               'arred': (19.1, 3, None),
2082
                               },
2083
                           4: {'ited': (22.7, 2, None),
2084
                               'oked': (31.1, 1, None),
2085
                               'aked': (31.1, 1, None),
2086
                               'iked': (31.1, 1, None),
2087
                               'uked': (31.1, 1, None),
2088
                               'amed': (31, 1, None),
2089
                               'imed': (31, 1, None),
2090
                               'does': (31.2, 2, None),
2091
                               },
2092
                           3: {'oed': (31.3, 1, None),
2093
                               'oes': (31.2, 1, None),
2094
                               'kes': (63.1, 1, None),
2095
                               'des': (63.10, 1, None),
2096
                               'res': (63.9, 1, None),
2097
                               }}
2098
2099
        # Add the above additional rules to rule_table
2100
        for del_len in adams_additions:
2101
            rule_table[del_len] = dict(rule_table[del_len],
2102
                                       **adams_additions[del_len])
2103
        # Add additional problem word
2104
        problem_words.add('menses')
2105
2106
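    # Helper (descriptive note, not in the original source): when the word
    # ends in 's', strip one extra character, then undouble a trailing
    # repeated letter in the resulting stem; used by fallback rules 58 and 62.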
    def _stem_with_duplicate_character_check(word, del_len):
2107
        if word[-1] == 's':
2108
            del_len += 1
2109
        stemmed_word = word[:-del_len]
2110
        if re_match(r'.*(\w)\1$', stemmed_word):
2111
            stemmed_word = stemmed_word[:-1]
2112
        return stemmed_word
2113
2114
    def _stem(word):
2115
        stemmed_word = word
2116
        rule_no = 0
2117
2118
        if not word:
2119
            return word, 0
2120
        if word in problem_words:
2121
            return word, 90
2122
        if max_word_length and len(word) > max_word_length:
2123
            return word, 95
2124
2125
        if "'" in word:
2126
            if word[-2:] in {"'s", "'S"}:
2127
                stemmed_word = word[:-2]
2128
            if word[-1:] == "'":
2129
                stemmed_word = word[:-1]
2130
            stemmed_word = stemmed_word.replace("n't", 'not')
2131
            stemmed_word = stemmed_word.replace("'ve", 'have')
2132
            stemmed_word = stemmed_word.replace("'re", 'are')
2133
            stemmed_word = stemmed_word.replace("'m", 'am')
2134
            return stemmed_word, 94
2135
2136
        if word.isdigit():
2137
            return word, 90.3
2138
        else:
2139
            hyphen = word.find('-')
2140
            if len(word) > hyphen > 0:
2141
                if word[:hyphen].isalpha() and word[hyphen+1:].isalpha():
2142
                    return word, 90.2
2143
                else:
2144
                    return word, 90.1
2145
            elif '_' in word:
2146
                return word, 90
2147
            elif word[-1] == 's' and word[:-1].isupper():
2148
                if var == 'Adams' and len(word)-1 > max_acro_length:
2149
                    return word, 96
2150
                return word[:-1], 91.1
2151
            elif word.isupper():
2152
                if var == 'Adams' and len(word) > max_acro_length:
2153
                    return word, 96
2154
                return word, 91
2155
            elif re_match(r'^.*[A-Z].*[A-Z].*$', word):
2156
                return word, 92
2157
            elif word[0].isupper():
2158
                return word, 93
2159
            elif var == 'Adams' and re_match(r'^[a-z]{1}(|[rl])(ing|ed)$',
2160
                                             word):
2161
                return word, 97
2162
2163
        for n in range(7, 1, -1):
2164
            if word[-n:] in rule_table[n]:
2165
                rule_no, del_len, add_str = rule_table[n][word[-n:]]
2166
                if del_len:
2167
                    stemmed_word = word[:-del_len]
2168
                else:
2169
                    stemmed_word = word
2170
                if add_str:
2171
                    stemmed_word += add_str
2172
                break
2173
2174
        if not rule_no:
2175
            if re_match(r'.*\w\wings?$', word):  # rule 58
2176
                stemmed_word = _stem_with_duplicate_character_check(word, 3)
2177
                rule_no = 58
2178
            elif re_match(r'.*\w\weds?$', word):  # rule 62
2179
                stemmed_word = _stem_with_duplicate_character_check(word, 2)
2180
                rule_no = 62
2181
            elif word[-1] == 's':  # rule 68
2182
                stemmed_word = word[:-1]
2183
                rule_no = 68
2184
2185
        return stemmed_word, rule_no
2186
2187
    stem, rule_no = _stem(word)
2188
    if return_rule_no:
2189
        return stem, rule_no
2190
    return stem
2191
2192
2193
def paice_husk(word):
2194
    """Return Paice-Husk stem.
2195
2196
    Implementation of the Paice-Husk Stemmer, also known as the Lancaster
2197
    Stemmer, developed by Chris Paice, with the assistance of Gareth Husk.
2198
2199
    This is based on the algorithm's description in :cite:`Paice:1990`.
2200
2201
    :param str word: the word to stem
2202
    :returns: the stemmed word
2203
    :rtype: str
2204
    """
2205
    rule_table = {6: {'ifiabl': (False, 6, None, True),
2206
                      'plicat': (False, 4, 'y', True)},
2207
                  5: {'guish': (False, 5, 'ct', True),
2208
                      'sumpt': (False, 2, None, True),
2209
                      'istry': (False, 5, None, True)},
2210
                  4: {'ytic': (False, 3, 's', True),
2211
                      'ceed': (False, 2, 'ss', True),
2212
                      'hood': (False, 4, None, False),
2213
                      'lief': (False, 1, 'v', True),
2214
                      'verj': (False, 1, 't', True),
2215
                      'misj': (False, 2, 't', True),
2216
                      'iabl': (False, 4, 'y', True),
2217
                      'iful': (False, 4, 'y', True),
2218
                      'sion': (False, 4, 'j', False),
2219
                      'xion': (False, 4, 'ct', True),
2220
                      'ship': (False, 4, None, False),
2221
                      'ness': (False, 4, None, False),
2222
                      'ment': (False, 4, None, False),
2223
                      'ript': (False, 2, 'b', True),
2224
                      'orpt': (False, 2, 'b', True),
2225
                      'duct': (False, 1, None, True),
2226
                      'cept': (False, 2, 'iv', True),
2227
                      'olut': (False, 2, 'v', True),
2228
                      'sist': (False, 0, None, True)},
2229
                  3: {'ied': (False, 3, 'y', False),
2230
                      'eed': (False, 1, None, True),
2231
                      'ing': (False, 3, None, False),
2232
                      'iag': (False, 3, 'y', True),
2233
                      'ish': (False, 3, None, False),
2234
                      'fuj': (False, 1, 's', True),
2235
                      'hej': (False, 1, 'r', True),
2236
                      'abl': (False, 3, None, False),
2237
                      'ibl': (False, 3, None, True),
2238
                      'bil': (False, 2, 'l', False),
2239
                      'ful': (False, 3, None, False),
2240
                      'ial': (False, 3, None, False),
2241
                      'ual': (False, 3, None, False),
2242
                      'ium': (False, 3, None, True),
2243
                      'ism': (False, 3, None, False),
2244
                      'ion': (False, 3, None, False),
2245
                      'ian': (False, 3, None, False),
2246
                      'een': (False, 0, None, True),
2247
                      'ear': (False, 0, None, True),
2248
                      'ier': (False, 3, 'y', False),
2249
                      'ies': (False, 3, 'y', False),
2250
                      'sis': (False, 2, None, True),
2251
                      'ous': (False, 3, None, False),
2252
                      'ent': (False, 3, None, False),
2253
                      'ant': (False, 3, None, False),
2254
                      'ist': (False, 3, None, False),
2255
                      'iqu': (False, 3, None, True),
2256
                      'ogu': (False, 1, None, True),
2257
                      'siv': (False, 3, 'j', False),
2258
                      'eiv': (False, 0, None, True),
2259
                      'bly': (False, 1, None, False),
2260
                      'ily': (False, 3, 'y', False),
2261
                      'ply': (False, 0, None, True),
2262
                      'ogy': (False, 1, None, True),
2263
                      'phy': (False, 1, None, True),
2264
                      'omy': (False, 1, None, True),
2265
                      'opy': (False, 1, None, True),
2266
                      'ity': (False, 3, None, False),
2267
                      'ety': (False, 3, None, False),
2268
                      'lty': (False, 2, None, True),
2269
                      'ary': (False, 3, None, False),
2270
                      'ory': (False, 3, None, False),
2271
                      'ify': (False, 3, None, True),
2272
                      'ncy': (False, 2, 't', False),
2273
                      'acy': (False, 3, None, False)},
2274
                  2: {'ia': (True, 2, None, True),
2275
                      'bb': (False, 1, None, True),
2276
                      'ic': (False, 2, None, False),
2277
                      'nc': (False, 1, 't', False),
2278
                      'dd': (False, 1, None, True),
2279
                      'ed': (False, 2, None, False),
2280
                      'if': (False, 2, None, False),
2281
                      'ag': (False, 2, None, False),
2282
                      'gg': (False, 1, None, True),
2283
                      'th': (True, 2, None, True),
2284
                      'ij': (False, 1, 'd', True),
2285
                      'uj': (False, 1, 'd', True),
2286
                      'oj': (False, 1, 'd', True),
2287
                      'nj': (False, 1, 'd', True),
2288
                      'cl': (False, 1, None, True),
2289
                      'ul': (False, 2, None, True),
2290
                      'al': (False, 2, None, False),
2291
                      'll': (False, 1, None, True),
2292
                      'um': (True, 2, None, True),
2293
                      'mm': (False, 1, None, True),
2294
                      'an': (False, 2, None, False),
2295
                      'en': (False, 2, None, False),
2296
                      'nn': (False, 1, None, True),
2297
                      'pp': (False, 1, None, True),
2298
                      'er': (False, 2, None, False),
2299
                      'ar': (False, 2, None, True),
2300
                      'or': (False, 2, None, False),
2301
                      'ur': (False, 2, None, False),
2302
                      'rr': (False, 1, None, True),
2303
                      'tr': (False, 1, None, False),
2304
                      'is': (False, 2, None, False),
2305
                      'ss': (False, 0, None, True),
2306
                      'us': (True, 2, None, True),
2307
                      'at': (False, 2, None, False),
2308
                      'tt': (False, 1, None, True),
2309
                      'iv': (False, 2, None, False),
2310
                      'ly': (False, 2, None, False),
2311
                      'iz': (False, 2, None, False),
2312
                      'yz': (False, 1, 's', True)},
2313
                  1: {'a': (True, 1, None, True),
2314
                      'e': (False, 1, None, False),
2315
                      'i': ((True, 1, None, True), (False, 1, 'y', False)),
2316
                      'j': (False, 1, 's', True),
2317
                      's': ((True, 1, None, False), (False, 0, None, True))}}
2318
2319
    def _has_vowel(word):
2320
        for char in word:
2321
            if char in {'a', 'e', 'i', 'o', 'u', 'y'}:
2322
                return True
2323
        return False
2324
2325
    def _acceptable(word):
2326
        if word and word[0] in {'a', 'e', 'i', 'o', 'u'}:
2327
            return len(word) > 1
2328
        return len(word) > 2 and _has_vowel(word[1:])
2329
2330
    def _apply_rule(word, rule, intact):
2331
        old_word = word
2332
        only_intact, del_len, add_str, set_terminate = rule
2333
        # print(word, word[-n:], rule)
2334
2335
        if (not only_intact) or (intact and only_intact):
2336
            if del_len:
2337
                word = word[:-del_len]
2338
            if add_str:
2339
                word += add_str
2340
        else:
2341
            return word, False, intact, terminate
2342
2343
        if _acceptable(word):
2344
            return word, True, False, set_terminate
2345
        else:
2346
            return old_word, False, intact, terminate
2347
2348
    terminate = False
2349
    intact = True
2350
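    # Repeatedly apply the longest matching ending rule; the for-else below
    # leaves the loop when no listed ending matches or none is accepted.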
    while not terminate:
2351
        for n in range(6, 0, -1):
2352
            if word[-n:] in rule_table[n]:
2353
                accept = False
2354
                if len(rule_table[n][word[-n:]]) < 4:
2355
                    for rule in rule_table[n][word[-n:]]:
2356
                        (word, accept, intact,
2357
                         terminate) = _apply_rule(word, rule, intact)
2358
                        if accept:
2359
                            break
2360
                else:
2361
                    rule = rule_table[n][word[-n:]]
2362
                    (word, accept, intact,
2363
                     terminate) = _apply_rule(word, rule, intact)
2364
2365
                if accept:
2366
                    break
2367
        else:
2368
            break
2369
2370
    return word
2371
2372
2373
def schinke(word):
2374
    """Return the stem of a word according to the Schinke stemmer.
2375
2376
    This is defined in :cite:`Schinke:1996`.
2377
2378
    :param str word: the word to stem
2379
    :returns: a dict with the noun stem (key 'n') and verb stem (key 'v')
2380
    :rtype: dict
2381
    """
2382
    word = normalize('NFKD', text_type(word.lower()))
2383
    word = ''.join(c for c in word if c in
2384
                   {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
2385
                    'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x',
2386
                    'y', 'z'})
2387
2388
    # Rule 2
2389
    word = word.replace('j', 'i').replace('v', 'u')
2390
2391
    # Rule 3
2392
    keep_que = {'at', 'quo', 'ne', 'ita', 'abs', 'aps', 'abus', 'adae', 'adus',
2393
                'deni', 'de', 'sus', 'obli', 'perae', 'plenis', 'quando',
2394
                'quis', 'quae', 'cuius', 'cui', 'quem', 'quam', 'qua', 'qui',
2395
                'quorum', 'quarum', 'quibus', 'quos', 'quas', 'quotusquis',
2396
                'quous', 'ubi', 'undi', 'us', 'uter', 'uti', 'utro', 'utribi',
2397
                'tor', 'co', 'conco', 'contor', 'detor', 'deco', 'exco',
2398
                'extor', 'obtor', 'optor', 'retor', 'reco', 'attor', 'inco',
2399
                'intor', 'praetor'}
2400
    if word[-3:] == 'que':
2401
        # This diverges from the paper by also returning 'que' itself unstemmed
2402
        if word[:-3] in keep_que or word == 'que':
2403
            return {'n': word, 'v': word}
2404
        else:
2405
            word = word[:-3]
2406
2407
    # Base case: both noun and verb default to the word as-is
2408
    noun = word
2409
    verb = word
2410
2411
    # Rule 4
2412
    n_endings = {4: {'ibus'},
2413
                 3: {'ius'},
2414
                 2: {'is', 'nt', 'ae', 'os', 'am', 'ud', 'as', 'um', 'em',
2415
                     'us', 'es', 'ia'},
2416
                 1: {'a', 'e', 'i', 'o', 'u'}}
2417
    for endlen in range(4, 0, -1):
2418
        if word[-endlen:] in n_endings[endlen]:
2419
            if len(word)-2 >= endlen:
2420
                noun = word[:-endlen]
2421
            else:
2422
                noun = word
2423
            break
2424
2425
    v_endings_strip = {6: {},
2426
                       5: {},
2427
                       4: {'mini', 'ntur', 'stis'},
2428
                       3: {'mur', 'mus', 'ris', 'sti', 'tis', 'tur'},
2429
                       2: {'ns', 'nt', 'ri'},
2430
                       1: {'m', 'r', 's', 't'}}
2431
    v_endings_alter = {6: {'iuntur'},
2432
                       5: {'beris', 'erunt', 'untur'},
2433
                       4: {'iunt'},
2434
                       3: {'bor', 'ero', 'unt'},
2435
                       2: {'bo'},
2436
                       1: {}}
2437
    for endlen in range(6, 0, -1):
2438
        if word[-endlen:] in v_endings_strip[endlen]:
2439
            addlen = 0
2440
            if len(word)-2 >= endlen:
2441
                verb = word[:-endlen]
2442
            else:
2443
                verb = word
2444
            break
2445
        if word[-endlen:] in v_endings_alter[endlen]:
2446
            if word[-endlen:] in {'iuntur', 'erunt', 'untur', 'iunt', 'unt'}:
2447
                new_word = word[:-endlen]+'i'
2448
                addlen = 1
2449
            elif word[-endlen:] in {'beris', 'bor', 'bo'}:
2450
                new_word = word[:-endlen]+'bi'
2451
                addlen = 2
2452
            else:
2453
                new_word = word[:-endlen]+'eri'
2454
                addlen = 3
2455
2456
            # Technically this diverges from the paper by considering the
2457
            # length of the stem without the new suffix
2458
            if len(new_word) >= 2+addlen:
2459
                verb = new_word
2460
            else:
2461
                verb = word
2462
            break
2463
2464
    return {'n': noun, 'v': verb}
2465
2466
2467
def s_stemmer(word):
2468
    """Return the S-stemmed form of a word.
2469
2470
    The S stemmer is defined in :cite:`Harman:1991`.
2471
2472
    :param str word: the word to stem
2473
    :returns: the stemmed word
2474
    :rtype: str
2475
    """
2476
    lowered = word.lower()
2477
    if lowered[-3:] == 'ies' and lowered[-4:-3] not in {'e', 'a'}:
2478
        return word[:-3] + ('Y' if word[-1:].isupper() else 'y')
2479
    if lowered[-2:] == 'es' and lowered[-3:-2] not in {'a', 'e', 'o'}:
2480
        return word[:-1]
2481
    if lowered[-1:] == 's' and lowered[-2:-1] not in {'u', 's'}:
2482
        return word[:-1]
2483
    return word
2484
2485
2486
if __name__ == '__main__':
2487
    import doctest
2488
    doctest.testmod()
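
    # Illustrative usage (not part of the original module): compare a few of
    # the stemmers defined above on the same word.
    for _stem_func in (lovins, porter, porter2, uealite, paice_husk,
                       s_stemmer):
        print('{}: {}'.format(_stem_func.__name__, _stem_func('readings')))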
2489