Completed
Branch master (78a222)
by Chris
14:36
created

abydos.stemmer._snowball.sb_norwegian()   F

Complexity

Conditions 21

Size

Total Lines 96
Code Lines 66

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 39
CRAP Score 21

Importance

Changes 0
Metric Value
eloc 66
dl 0
loc 96
ccs 39
cts 39
cp 1
rs 0
c 0
b 0
f 0
cc 21
nop 1
crap 21

How to fix: Long Method, Complexity

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include Extract Method.

Complexity

Complex code units like abydos.stemmer._snowball.sb_norwegian() often do a lot of different things. To break such a unit down, we need to identify a cohesive component within it. A common approach to finding such a component is to look for fields/methods (or, in a function, variables and code blocks) that share the same prefixes or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

# -*- coding: utf-8 -*-
# Reported issue (coding style): Too many lines in module (1364/1000)

# Copyright 2014-2018 by Christopher C. Little.
# This file is part of Abydos.
#
# Abydos is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Abydos is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.

"""abydos.stemmer._snowball.

The stemmer._snowball module defines the stemmers:

    - Porter
    - Porter2 (Snowball English)
    - Snowball German
    - Snowball Dutch
    - Snowball Norwegian
    - Snowball Swedish
    - Snowball Danish
"""

from __future__ import unicode_literals

from unicodedata import normalize

from six import text_type
from six.moves import range

__all__ = [
    'porter',
    'porter2',
    'sb_danish',
    'sb_dutch',
    'sb_german',
    'sb_norwegian',
    'sb_swedish',
]


def _m_degree(term, vowels):
    """Return Porter helper function _m_degree value.

    m-degree is equal to the number of V to C transitions

    :param str term: the word for which to calculate the m-degree
    :param set vowels: the set of vowels in the language
    :returns: the m-degree as defined in the Porter stemmer definition
    :rtype: int
    """
    mdeg = 0
    last_was_vowel = False
    for letter in term:
        if letter in vowels:
            last_was_vowel = True
        else:
            if last_was_vowel:
                mdeg += 1
            last_was_vowel = False
    return mdeg
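The docstring's definition, "the number of V to C transitions", can also be read as a count of adjacent letter pairs in which a vowel is followed by a non-vowel. The sketch below states that reading directly; the name `m_degree_pairs` is hypothetical, and the vowel set is the one `porter()` uses.

```python
VOWELS = {'a', 'e', 'i', 'o', 'u', 'y'}


def m_degree_pairs(term, vowels=VOWELS):
    """Count adjacent (vowel, non-vowel) pairs in term.

    Hypothetical restatement of _m_degree: each such pair is one
    V-to-C transition.
    """
    return sum(
        1
        for first, second in zip(term, term[1:])
        if first in vowels and second not in vowels
    )


for example in ('tree', 'trouble', 'troubles', 'oaten'):
    print(example, m_degree_pairs(example))
# tree 0
# trouble 1
# troubles 2
# oaten 2
```

A word that ends in a vowel contributes nothing for its final letter, which is why `tree` scores 0 and `trouble` scores 1.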


def _sb_has_vowel(term, vowels):
    """Return Porter helper function _sb_has_vowel value.

    :param str term: the word to scan for vowels
    :param set vowels: the set of vowels in the language
    :returns: true iff a vowel exists in the term (as defined in the Porter
        stemmer definition)
    :rtype: bool
    """
    for letter in term:
        if letter in vowels:
            return True
    return False


def _ends_in_doubled_cons(term, vowels):
    """Return Porter helper function _ends_in_doubled_cons value.

    :param str term: the word to check for a final doubled consonant
    :param set vowels: the set of vowels in the language
    :returns: true iff the stem ends in a doubled consonant (as defined in the
        Porter stemmer definition)
    :rtype: bool
    """
    return len(term) > 1 and term[-1] not in vowels and term[-2] == term[-1]


def _ends_in_cvc(term, vowels):
    """Return Porter helper function _ends_in_cvc value.

    :param str term: the word to scan for cvc
    :param set vowels: the set of vowels in the language
    :returns: true iff the stem ends in cvc (as defined in the Porter stemmer
        definition)
    :rtype: bool
    """
    return len(term) > 2 and (
        term[-1] not in vowels
        and term[-2] in vowels
        and term[-3] not in vowels
        and term[-1] not in tuple('wxY')
    )
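A few worked cases may make the consonant-vowel-consonant test easier to follow. The block below is a standalone copy of the check above under the hypothetical name `ends_in_cvc`, using the vowel set from `porter()`. Note the extra exclusion: a final `w`, `x`, or consonantal `Y` never counts.

```python
VOWELS = {'a', 'e', 'i', 'o', 'u', 'y'}


def ends_in_cvc(term, vowels=VOWELS):
    """Standalone copy of the CVC check, for illustration only."""
    return len(term) > 2 and (
        term[-1] not in vowels
        and term[-2] in vowels
        and term[-3] not in vowels
        and term[-1] not in tuple('wxY')
    )


print(ends_in_cvc('hop'))   # True: h-o-p
print(ends_in_cvc('fil'))   # True: f-i-l
print(ends_in_cvc('snow'))  # False: final 'w' is excluded
print(ends_in_cvc('box'))   # False: final 'x' is excluded
print(ends_in_cvc('fail'))  # False: 'a' precedes the vowel 'i'
```

In Step 1b of `porter()`, a stem that passes this test and has an m-degree of 1 gets an `e` appended, which is how `hop` (left over from `hoping`) becomes `hope`.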


def porter(word, early_english=False):
    """Return Porter stem.

    The Porter stemmer is described in :cite:`Porter:1980`.

    :param str word: the word to calculate the stem of
    :param bool early_english: set to True in order to remove -est & -eth
        (2nd & 3rd person singular verbal agreement suffixes)
    :returns: word stem
    :rtype: str

    >>> porter('reading')
    'read'
    >>> porter('suspension')
    'suspens'
    >>> porter('elusiveness')
    'elus'

    >>> porter('eateth', early_english=True)
    'eat'
    """
    # lowercase, normalize, and compose
    word = normalize('NFC', text_type(word.lower()))

    # Return word unchanged if it is shorter than 3 characters
    if len(word) < 3:
        return word

    _vowels = {'a', 'e', 'i', 'o', 'u', 'y'}
    # Re-map consonantal y to Y (Y will be C, y will be V)
    if word[0] == 'y':
        word = 'Y' + word[1:]
    for i in range(1, len(word)):
        if word[i] == 'y' and word[i - 1] in _vowels:
            word = word[:i] + 'Y' + word[i + 1 :]

    # Step 1a
    if word[-1] == 's':
        if word[-4:] == 'sses':
            word = word[:-2]
        elif word[-3:] == 'ies':
            word = word[:-2]
        elif word[-2:] == 'ss':
            pass
        else:
            word = word[:-1]

    # Step 1b
    step1b_flag = False
    if word[-3:] == 'eed':
        if _m_degree(word[:-3], _vowels) > 0:
            word = word[:-1]
    elif word[-2:] == 'ed':
        if _sb_has_vowel(word[:-2], _vowels):
            word = word[:-2]
            step1b_flag = True
    elif word[-3:] == 'ing':
        if _sb_has_vowel(word[:-3], _vowels):
            word = word[:-3]
            step1b_flag = True
    elif early_english:
        if word[-3:] == 'est':
            if _sb_has_vowel(word[:-3], _vowels):
                word = word[:-3]
                step1b_flag = True
        elif word[-3:] == 'eth':
            if _sb_has_vowel(word[:-3], _vowels):
                word = word[:-3]
                step1b_flag = True

    if step1b_flag:
        if word[-2:] in {'at', 'bl', 'iz'}:
            word += 'e'
        # Reported issue (coding style), on each of the three set elements
        # below: Wrong hanging indentation before block (add 4 spaces).
        elif _ends_in_doubled_cons(word, _vowels) and word[-1] not in {
            'l',
            's',
            'z',
        }:
            word = word[:-1]
        elif _m_degree(word, _vowels) == 1 and _ends_in_cvc(word, _vowels):
            word += 'e'

    # Step 1c
    if word[-1] in {'Y', 'y'} and _sb_has_vowel(word[:-1], _vowels):
        word = word[:-1] + 'i'

    # Step 2
    if len(word) > 1:
        if word[-2] == 'a':
            if word[-7:] == 'ational':
                if _m_degree(word[:-7], _vowels) > 0:
                    word = word[:-5] + 'e'
            elif word[-6:] == 'tional':
                if _m_degree(word[:-6], _vowels) > 0:
                    word = word[:-2]
        elif word[-2] == 'c':
            if word[-4:] in {'enci', 'anci'}:
                if _m_degree(word[:-4], _vowels) > 0:
                    word = word[:-1] + 'e'
        elif word[-2] == 'e':
            if word[-4:] == 'izer':
                if _m_degree(word[:-4], _vowels) > 0:
                    word = word[:-1]
        elif word[-2] == 'g':
            if word[-4:] == 'logi':
                if _m_degree(word[:-4], _vowels) > 0:
                    word = word[:-1]
        elif word[-2] == 'l':
            if word[-3:] == 'bli':
                if _m_degree(word[:-3], _vowels) > 0:
                    word = word[:-1] + 'e'
            elif word[-4:] == 'alli':
                if _m_degree(word[:-4], _vowels) > 0:
                    word = word[:-2]
            elif word[-5:] == 'entli':
                if _m_degree(word[:-5], _vowels) > 0:
                    word = word[:-2]
            elif word[-3:] == 'eli':
                if _m_degree(word[:-3], _vowels) > 0:
                    word = word[:-2]
            elif word[-5:] == 'ousli':
                if _m_degree(word[:-5], _vowels) > 0:
                    word = word[:-2]
        elif word[-2] == 'o':
            if word[-7:] == 'ization':
                if _m_degree(word[:-7], _vowels) > 0:
                    word = word[:-5] + 'e'
            elif word[-5:] == 'ation':
                if _m_degree(word[:-5], _vowels) > 0:
                    word = word[:-3] + 'e'
            elif word[-4:] == 'ator':
                if _m_degree(word[:-4], _vowels) > 0:
                    word = word[:-2] + 'e'
        elif word[-2] == 's':
            if word[-5:] == 'alism':
                if _m_degree(word[:-5], _vowels) > 0:
                    word = word[:-3]
            elif word[-7:] in {'iveness', 'fulness', 'ousness'}:
                if _m_degree(word[:-7], _vowels) > 0:
                    word = word[:-4]
        elif word[-2] == 't':
            if word[-5:] == 'aliti':
                if _m_degree(word[:-5], _vowels) > 0:
                    word = word[:-3]
            elif word[-5:] == 'iviti':
                if _m_degree(word[:-5], _vowels) > 0:
                    word = word[:-3] + 'e'
            elif word[-6:] == 'biliti':
                if _m_degree(word[:-6], _vowels) > 0:
                    word = word[:-5] + 'le'

    # Step 3
    if word[-5:] == 'icate':
        if _m_degree(word[:-5], _vowels) > 0:
            word = word[:-3]
    elif word[-5:] == 'ative':
        if _m_degree(word[:-5], _vowels) > 0:
            word = word[:-5]
    elif word[-5:] in {'alize', 'iciti'}:
        if _m_degree(word[:-5], _vowels) > 0:
            word = word[:-3]
    elif word[-4:] == 'ical':
        if _m_degree(word[:-4], _vowels) > 0:
            word = word[:-2]
    elif word[-3:] == 'ful':
        if _m_degree(word[:-3], _vowels) > 0:
            word = word[:-3]
    elif word[-4:] == 'ness':
        if _m_degree(word[:-4], _vowels) > 0:
            word = word[:-4]

    # Step 4
    if word[-2:] == 'al':
        if _m_degree(word[:-2], _vowels) > 1:
            word = word[:-2]
    elif word[-4:] == 'ance':
        if _m_degree(word[:-4], _vowels) > 1:
            word = word[:-4]
    elif word[-4:] == 'ence':
        if _m_degree(word[:-4], _vowels) > 1:
            word = word[:-4]
    elif word[-2:] == 'er':
        if _m_degree(word[:-2], _vowels) > 1:
            word = word[:-2]
    elif word[-2:] == 'ic':
        if _m_degree(word[:-2], _vowels) > 1:
            word = word[:-2]
    elif word[-4:] == 'able':
        if _m_degree(word[:-4], _vowels) > 1:
            word = word[:-4]
    elif word[-4:] == 'ible':
        if _m_degree(word[:-4], _vowels) > 1:
            word = word[:-4]
    elif word[-3:] == 'ant':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-5:] == 'ement':
        if _m_degree(word[:-5], _vowels) > 1:
            word = word[:-5]
    elif word[-4:] == 'ment':
        if _m_degree(word[:-4], _vowels) > 1:
            word = word[:-4]
    elif word[-3:] == 'ent':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-4:] in {'sion', 'tion'}:
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-2:] == 'ou':
        if _m_degree(word[:-2], _vowels) > 1:
            word = word[:-2]
    elif word[-3:] == 'ism':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-3:] == 'ate':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-3:] == 'iti':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-3:] == 'ous':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-3:] == 'ive':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]
    elif word[-3:] == 'ize':
        if _m_degree(word[:-3], _vowels) > 1:
            word = word[:-3]

    # Step 5a
    if word[-1] == 'e':
        if _m_degree(word[:-1], _vowels) > 1:
            word = word[:-1]
        # Reported issue (coding style), on the continuation line below:
        # Wrong hanging indentation before block (add 4 spaces).
        elif _m_degree(word[:-1], _vowels) == 1 and not _ends_in_cvc(
            word[:-1], _vowels
        ):
            word = word[:-1]

    # Step 5b
    if word[-2:] == 'll' and _m_degree(word, _vowels) > 1:
        word = word[:-1]

    # Change 'Y' back to 'y' if it survived stemming
    # Reported issue (unused code): Consider using enumerate instead of
    # iterating with range and len
    for i in range(len(word)):
        if word[i] == 'Y':
            word = word[:i] + 'y' + word[i + 1 :]

    return word


def _sb_r1(term, vowels, r1_prefixes=None):
    """Return the start index of the R1 region (Porter2 specification)."""
    vowel_found = False
    if hasattr(r1_prefixes, '__iter__'):
        for prefix in r1_prefixes:
            if term[: len(prefix)] == prefix:
                return len(prefix)

    # Reported issue (unused code): Consider using enumerate instead of
    # iterating with range and len
    for i in range(len(term)):
        if not vowel_found and term[i] in vowels:
            vowel_found = True
        elif vowel_found and term[i] not in vowels:
            return i + 1
    return len(term)


def _sb_r2(term, vowels, r1_prefixes=None):
    """Return the start index of the R2 region (Porter2 specification)."""
    r1_start = _sb_r1(term, vowels, r1_prefixes)
    return r1_start + _sb_r1(term[r1_start:], vowels)
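Both helpers return an index, and callers slice the word with it (`word[r1_start:]`) to obtain the region itself. The following standalone sketch copies the two functions under the hypothetical names `r1_index` and `r2_index` and prints the regions for a few words, using the vowel set and prefix tuple that `porter2()` defines.

```python
VOWELS = {'a', 'e', 'i', 'o', 'u', 'y'}
R1_PREFIXES = ('commun', 'gener', 'arsen')


def r1_index(term, vowels, r1_prefixes=None):
    """Copy of _sb_r1: index just past the first vowel + non-vowel."""
    vowel_found = False
    if hasattr(r1_prefixes, '__iter__'):
        for prefix in r1_prefixes:
            if term[: len(prefix)] == prefix:
                return len(prefix)
    for i in range(len(term)):
        if not vowel_found and term[i] in vowels:
            vowel_found = True
        elif vowel_found and term[i] not in vowels:
            return i + 1
    return len(term)


def r2_index(term, vowels, r1_prefixes=None):
    """Copy of _sb_r2: the same rule applied again inside R1."""
    start = r1_index(term, vowels, r1_prefixes)
    return start + r1_index(term[start:], vowels)


for word in ('beautiful', 'beauty', 'animadversion', 'sprinkled'):
    r1 = word[r1_index(word, VOWELS):]
    r2 = word[r2_index(word, VOWELS):]
    print(word, repr(r1), repr(r2))
# beautiful 'iful' 'ul'
# beauty 'y' ''
# animadversion 'imadversion' 'adversion'
# sprinkled 'kled' ''

# With the English prefix exceptions, R1 starts right after the prefix.
print(r1_index('communism', VOWELS, R1_PREFIXES))  # 6
```

An empty region means no suffix can lie inside it, so any rule guarded by a test such as `len(word[r2_start:]) >= 3` leaves such a word untouched.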


def _sb_ends_in_short_syllable(term, vowels, codanonvowels):
    """Return True iff term ends in a short syllable.

    (...according to the Porter2 specification.)

    NB: This is akin to the CVC test from the Porter stemmer. The description
    is unfortunately poor/ambiguous.
    """
    if not term:
        return False
    if len(term) == 2:
        if term[-2] in vowels and term[-1] not in vowels:
            return True
    elif len(term) >= 3:
        # Reported issue (coding style), on each of the three condition
        # lines below: Wrong hanging indentation before block (add 4 spaces).
        if (
            term[-3] not in vowels
            and term[-2] in vowels
            and term[-1] in codanonvowels
        ):
            return True
    return False


def _sb_short_word(term, vowels, codanonvowels, r1_prefixes=None):
    """Return True iff term is a short word.

    (...according to the Porter2 specification.)
    """
    # Reported issue (coding style), on the continuation line below:
    # Wrong hanging indentation before block (add 4 spaces).
    if _sb_r1(term, vowels, r1_prefixes) == len(
        term
    ) and _sb_ends_in_short_syllable(term, vowels, codanonvowels):
        return True
    return False
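A word is "short" here when two things hold at once: its R1 region is empty (the R1 index equals the word length) and it ends in a short syllable. The sketch below restates the helpers in condensed, standalone form so that the two conditions can be tried on sample words. All function names are hypothetical, and the R1 prefix exceptions are left out for brevity.

```python
VOWELS = set('aeiouy')
CODA_NON_VOWELS = set("'bcdfghjklmnpqrstvz")  # no w, x, or Y


def r1_index(term, vowels):
    """Index just past the first vowel + non-vowel pair (no prefixes)."""
    vowel_found = False
    for i, letter in enumerate(term):
        if not vowel_found and letter in vowels:
            vowel_found = True
        elif vowel_found and letter not in vowels:
            return i + 1
    return len(term)


def ends_in_short_syllable(term, vowels, coda):
    """Condensed copy of _sb_ends_in_short_syllable."""
    if len(term) == 2:
        return term[-2] in vowels and term[-1] not in vowels
    if len(term) >= 3:
        return (
            term[-3] not in vowels
            and term[-2] in vowels
            and term[-1] in coda
        )
    return False


def is_short_word(term, vowels=VOWELS, coda=CODA_NON_VOWELS):
    """R1 is empty and the word ends in a short syllable."""
    return r1_index(term, vowels) == len(term) and ends_in_short_syllable(
        term, vowels, coda
    )


for word in ('bed', 'shed', 'shred', 'bead', 'embed', 'beds'):
    print(word, is_short_word(word))
# bed True
# shed True
# shred True
# bead False
# embed False
# beds False
```

`bead` fails the syllable test because two vowels precede the final consonant, while `embed` and `beds` fail because their R1 region is not empty.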


# Reported issue (best practice): Too many return statements (7/6)
def porter2(word, early_english=False):
    """Return the Porter2 (Snowball English) stem.

    The Porter2 (Snowball English) stemmer is defined in :cite:`Porter:2002`.

    :param str word: the word to calculate the stem of
    :param bool early_english: set to True in order to remove -est & -eth
        (2nd & 3rd person singular verbal agreement suffixes)
    :returns: word stem
    :rtype: str

    >>> porter2('reading')
    'read'
    >>> porter2('suspension')
    'suspens'
    >>> porter2('elusiveness')
    'elus'

    >>> porter2('eateth', early_english=True)
    'eat'
    """
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y'}
    _codanonvowels = {
        "'",
        'b',
        'c',
        'd',
        'f',
        'g',
        'h',
        'j',
        'k',
        'l',
        'm',
        'n',
        'p',
        'q',
        'r',
        's',
        't',
        'v',
        'z',
    }
    _doubles = {'bb', 'dd', 'ff', 'gg', 'mm', 'nn', 'pp', 'rr', 'tt'}
    _li = {'c', 'd', 'e', 'g', 'h', 'k', 'm', 'n', 'r', 't'}

    # R1 prefixes should be in order from longest to shortest to prevent
    # masking
    _r1_prefixes = ('commun', 'gener', 'arsen')
    _exception1dict = {  # special changes:
        'skis': 'ski',
        'skies': 'sky',
        'dying': 'die',
        'lying': 'lie',
        'tying': 'tie',
        # special -LY cases:
        'idly': 'idl',
        'gently': 'gentl',
        'ugly': 'ugli',
        'early': 'earli',
        'only': 'onli',
        'singly': 'singl',
    }
    _exception1set = {
        'sky',
        'news',
        'howe',
        'atlas',
        'cosmos',
        'bias',
        'andes',
    }
    _exception2set = {
        'inning',
        'outing',
        'canning',
        'herring',
        'earring',
        'proceed',
        'exceed',
        'succeed',
    }

    # lowercase, normalize, and compose
    word = normalize('NFC', text_type(word.lower()))
    # replace apostrophe-like characters with U+0027, per
    # http://snowball.tartarus.org/texts/apostrophe.html
    word = word.replace('’', '\'')
    word = word.replace('’', '\'')

    # Exceptions 1
    if word in _exception1dict:
        return _exception1dict[word]
    elif word in _exception1set:
        return word

    # Return word unchanged if it is shorter than 3 characters
    if len(word) < 3:
        return word

    # Remove initial ', if present.
    while word and word[0] == '\'':
        word = word[1:]
        # Return word if what remains is shorter than 2 characters
        if len(word) < 2:
            return word

    # Re-map consonantal y to Y (Y will be C, y will be V)
    if word[0] == 'y':
        word = 'Y' + word[1:]
    for i in range(1, len(word)):
        if word[i] == 'y' and word[i - 1] in _vowels:
            word = word[:i] + 'Y' + word[i + 1 :]

    r1_start = _sb_r1(word, _vowels, _r1_prefixes)
    r2_start = _sb_r2(word, _vowels, _r1_prefixes)

    # Step 0
    if word[-3:] == '\'s\'':
        word = word[:-3]
    elif word[-2:] == '\'s':
        word = word[:-2]
    elif word[-1:] == '\'':
        word = word[:-1]
    # Return word if what remains is shorter than 3 characters
    if len(word) < 3:
        return word

    # Step 1a
    if word[-4:] == 'sses':
        word = word[:-2]
    elif word[-3:] in {'ied', 'ies'}:
        if len(word) > 4:
            word = word[:-2]
        else:
            word = word[:-1]
    elif word[-2:] in {'us', 'ss'}:
        pass
    elif word[-1] == 's':
        if _sb_has_vowel(word[:-2], _vowels):
            word = word[:-1]

    # Exceptions 2
    if word in _exception2set:
        return word

    # Step 1b
    step1b_flag = False
    if word[-5:] == 'eedly':
        if len(word[r1_start:]) >= 5:
            word = word[:-3]
    elif word[-5:] == 'ingly':
        if _sb_has_vowel(word[:-5], _vowels):
            word = word[:-5]
            step1b_flag = True
    elif word[-4:] == 'edly':
        if _sb_has_vowel(word[:-4], _vowels):
            word = word[:-4]
            step1b_flag = True
    elif word[-3:] == 'eed':
        if len(word[r1_start:]) >= 3:
            word = word[:-1]
    elif word[-3:] == 'ing':
        if _sb_has_vowel(word[:-3], _vowels):
            word = word[:-3]
            step1b_flag = True
    elif word[-2:] == 'ed':
        if _sb_has_vowel(word[:-2], _vowels):
            word = word[:-2]
            step1b_flag = True
    elif early_english:
        if word[-3:] == 'est':
            if _sb_has_vowel(word[:-3], _vowels):
                word = word[:-3]
                step1b_flag = True
        elif word[-3:] == 'eth':
            if _sb_has_vowel(word[:-3], _vowels):
                word = word[:-3]
                step1b_flag = True

    if step1b_flag:
        if word[-2:] in {'at', 'bl', 'iz'}:
            word += 'e'
        elif word[-2:] in _doubles:
            word = word[:-1]
        elif _sb_short_word(word, _vowels, _codanonvowels, _r1_prefixes):
            word += 'e'

    # Step 1c
    if len(word) > 2 and word[-1] in {'Y', 'y'} and word[-2] not in _vowels:
        word = word[:-1] + 'i'

    # Step 2
    if word[-2] == 'a':
        if word[-7:] == 'ational':
            if len(word[r1_start:]) >= 7:
                word = word[:-5] + 'e'
        elif word[-6:] == 'tional':
            if len(word[r1_start:]) >= 6:
                word = word[:-2]
    elif word[-2] == 'c':
        if word[-4:] in {'enci', 'anci'}:
            if len(word[r1_start:]) >= 4:
                word = word[:-1] + 'e'
    elif word[-2] == 'e':
        if word[-4:] == 'izer':
            if len(word[r1_start:]) >= 4:
                word = word[:-1]
    elif word[-2] == 'g':
        if word[-3:] == 'ogi':
            if r1_start >= 1 and len(word[r1_start:]) >= 3 and word[-4] == 'l':
                word = word[:-1]
    elif word[-2] == 'l':
        if word[-6:] == 'lessli':
            if len(word[r1_start:]) >= 6:
                word = word[:-2]
        elif word[-5:] in {'entli', 'fulli', 'ousli'}:
            if len(word[r1_start:]) >= 5:
                word = word[:-2]
        elif word[-4:] == 'abli':
            if len(word[r1_start:]) >= 4:
                word = word[:-1] + 'e'
        elif word[-4:] == 'alli':
            if len(word[r1_start:]) >= 4:
                word = word[:-2]
        elif word[-3:] == 'bli':
            if len(word[r1_start:]) >= 3:
                word = word[:-1] + 'e'
        elif word[-2:] == 'li':
            if r1_start >= 1 and len(word[r1_start:]) >= 2 and word[-3] in _li:
                word = word[:-2]
    elif word[-2] == 'o':
        if word[-7:] == 'ization':
            if len(word[r1_start:]) >= 7:
                word = word[:-5] + 'e'
        elif word[-5:] == 'ation':
            if len(word[r1_start:]) >= 5:
                word = word[:-3] + 'e'
        elif word[-4:] == 'ator':
            if len(word[r1_start:]) >= 4:
                word = word[:-2] + 'e'
    elif word[-2] == 's':
        if word[-7:] in {'fulness', 'ousness', 'iveness'}:
            if len(word[r1_start:]) >= 7:
                word = word[:-4]
        elif word[-5:] == 'alism':
            if len(word[r1_start:]) >= 5:
                word = word[:-3]
    elif word[-2] == 't':
        if word[-6:] == 'biliti':
            if len(word[r1_start:]) >= 6:
                word = word[:-5] + 'le'
        elif word[-5:] == 'aliti':
            if len(word[r1_start:]) >= 5:
                word = word[:-3]
        elif word[-5:] == 'iviti':
            if len(word[r1_start:]) >= 5:
                word = word[:-3] + 'e'

    # Step 3
    if word[-7:] == 'ational':
        if len(word[r1_start:]) >= 7:
            word = word[:-5] + 'e'
    elif word[-6:] == 'tional':
        if len(word[r1_start:]) >= 6:
            word = word[:-2]
    elif word[-5:] in {'alize', 'icate', 'iciti'}:
        if len(word[r1_start:]) >= 5:
            word = word[:-3]
    elif word[-5:] == 'ative':
        if len(word[r2_start:]) >= 5:
            word = word[:-5]
    elif word[-4:] == 'ical':
        if len(word[r1_start:]) >= 4:
            word = word[:-2]
    elif word[-4:] == 'ness':
        if len(word[r1_start:]) >= 4:
            word = word[:-4]
    elif word[-3:] == 'ful':
        if len(word[r1_start:]) >= 3:
            word = word[:-3]

    # Step 4
    # Reported issue (coding style), on each suffix line of the tuple below:
    # Wrong hanging indentation before block (add 4 spaces).
    for suffix in (
        'ement',
        'ance',
        'ence',
        'able',
        'ible',
        'ment',
        'ant',
        'ent',
        'ism',
        'ate',
        'iti',
        'ous',
        'ive',
        'ize',
        'al',
        'er',
        'ic',
    ):
        if word[-len(suffix) :] == suffix:
            if len(word[r2_start:]) >= len(suffix):
                word = word[: -len(suffix)]
            break
    else:
        if word[-3:] == 'ion':
            # Reported issue (coding style), on each of the three condition
            # lines below: Wrong hanging indentation before block (add 4
            # spaces).
            if (
                len(word[r2_start:]) >= 3
                and len(word) >= 4
                and word[-4] in tuple('st')
            ):
                word = word[:-3]

    # Step 5
    if word[-1] == 'e':
        # Reported issue (coding style), on the first two continuation lines
        # below: Wrong hanging indentation before block (add 4 spaces).
        if len(word[r2_start:]) >= 1 or (
            len(word[r1_start:]) >= 1
            and not _sb_ends_in_short_syllable(
                word[:-1], _vowels, _codanonvowels
            )
        ):
            word = word[:-1]
    elif word[-1] == 'l':
        if len(word[r2_start:]) >= 1 and word[-2] == 'l':
            word = word[:-1]

    # Change 'Y' back to 'y' if it survived stemming
    for i in range(0, len(word)):
        if word[i] == 'Y':
            word = word[:i] + 'y' + word[i + 1 :]

    return word


def sb_german(word, alternate_vowels=False):
    """Return Snowball German stem.

    The Snowball German stemmer is defined at:
    http://snowball.tartarus.org/algorithms/german/stemmer.html

    :param str word: the word to calculate the stem of
    :param bool alternate_vowels: composes ae as ä, oe as ö, and ue as ü before
        running the algorithm
    :returns: word stem
    :rtype: str

    >>> sb_german('lesen')
    'les'
    >>> sb_german('graues')
    'grau'
    >>> sb_german('buchstabieren')
    'buchstabi'
    """
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'ä', 'ö', 'ü'}
    _s_endings = {'b', 'd', 'f', 'g', 'h', 'k', 'l', 'm', 'n', 'r', 't'}
    _st_endings = {'b', 'd', 'f', 'g', 'h', 'k', 'l', 'm', 'n', 't'}

    # lowercase, normalize, and compose
    word = normalize('NFC', word.lower())
    word = word.replace('ß', 'ss')

    # mark 'u' and 'y' between vowels as consonants by uppercasing them
    if len(word) > 2:
        for i in range(2, len(word)):
            if word[i] in _vowels and word[i - 2] in _vowels:
                if word[i - 1] == 'u':
                    word = word[: i - 1] + 'U' + word[i:]
                elif word[i - 1] == 'y':
                    word = word[: i - 1] + 'Y' + word[i:]

    if alternate_vowels:
        word = word.replace('ae', 'ä')
        word = word.replace('oe', 'ö')
        # shield 'que' behind a placeholder so its 'ue' is not composed to 'ü'
        word = word.replace('que', 'Q')
        word = word.replace('ue', 'ü')
        word = word.replace('Q', 'que')

    r1_start = max(3, _sb_r1(word, _vowels))
    r2_start = _sb_r2(word, _vowels)

    # Step 1
    niss_flag = False
    if word[-3:] == 'ern':
        if len(word[r1_start:]) >= 3:
            word = word[:-3]
    elif word[-2:] == 'em':
        if len(word[r1_start:]) >= 2:
            word = word[:-2]
    elif word[-2:] == 'er':
        if len(word[r1_start:]) >= 2:
            word = word[:-2]
    elif word[-2:] == 'en':
        if len(word[r1_start:]) >= 2:
            word = word[:-2]
            niss_flag = True
    elif word[-2:] == 'es':
        if len(word[r1_start:]) >= 2:
            word = word[:-2]
            niss_flag = True
    elif word[-1:] == 'e':
        if len(word[r1_start:]) >= 1:
            word = word[:-1]
            niss_flag = True
    elif word[-1:] == 's':
        if (
            len(word[r1_start:]) >= 1
            and len(word) >= 2
            and word[-2] in _s_endings
        ):
            word = word[:-1]

    if niss_flag and word[-4:] == 'niss':
        word = word[:-1]

    # Step 2
    if word[-3:] == 'est':
        if len(word[r1_start:]) >= 3:
            word = word[:-3]
    elif word[-2:] == 'en':
        if len(word[r1_start:]) >= 2:
            word = word[:-2]
    elif word[-2:] == 'er':
        if len(word[r1_start:]) >= 2:
            word = word[:-2]
    elif word[-2:] == 'st':
        if (
            len(word[r1_start:]) >= 2
            and len(word) >= 6
            and word[-3] in _st_endings
        ):
            word = word[:-2]

    # Step 3
    if word[-4:] == 'isch':
        if len(word[r2_start:]) >= 4 and word[-5] != 'e':
            word = word[:-4]
    elif word[-4:] in {'lich', 'heit'}:
        if len(word[r2_start:]) >= 4:
            word = word[:-4]
            if word[-2:] in {'er', 'en'} and len(word[r1_start:]) >= 2:
                word = word[:-2]
    elif word[-4:] == 'keit':
        if len(word[r2_start:]) >= 4:
            word = word[:-4]
            if word[-4:] == 'lich' and len(word[r2_start:]) >= 4:
                word = word[:-4]
            elif word[-2:] == 'ig' and len(word[r2_start:]) >= 2:
                word = word[:-2]
    elif word[-3:] in {'end', 'ung'}:
        if len(word[r2_start:]) >= 3:
            word = word[:-3]
            if (
                word[-2:] == 'ig'
                and len(word[r2_start:]) >= 2
                and word[-3] != 'e'
            ):
                word = word[:-2]
    elif word[-2:] in {'ig', 'ik'}:
        if len(word[r2_start:]) >= 2 and word[-3] != 'e':
            word = word[:-2]

    # Change 'Y' and 'U' back to lowercase if they survived stemming
    for i in range(0, len(word)):
        if word[i] == 'Y':
            word = word[:i] + 'y' + word[i + 1 :]
        elif word[i] == 'U':
            word = word[:i] + 'u' + word[i + 1 :]

    # Remove umlauts
    _umlauts = dict(zip((ord(_) for _ in 'äöü'), 'aou'))
    word = word.translate(_umlauts)

    return word
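
The stemmers in this listing test suffixes against the regions R1 and R2, whose start offsets come from `_sb_r1` and `_sb_r2`; those helpers are defined elsewhere in the module and are not shown here. As a rough sketch of the standard Snowball definition (R1 begins after the first non-vowel that follows a vowel, and R2 applies the same rule inside R1), a hypothetical stand-in named `region_starts` might look like this. It is an illustration only and may differ from the module's own helpers.

```python
def region_starts(word, vowels=frozenset('aeiouy')):
    """Return (r1_start, r2_start) under the standard Snowball definition.

    Hypothetical helper for illustration; not the module's _sb_r1/_sb_r2.
    """

    def after_first_vowel_consonant(start):
        # Find the first non-vowel that directly follows a vowel, looking
        # only at positions from `start` on, and return the index just
        # past it. If there is none, the region is empty (end of word).
        for i in range(start + 1, len(word)):
            if word[i] not in vowels and word[i - 1] in vowels:
                return i + 1
        return len(word)

    r1_start = after_first_vowel_consonant(0)
    r2_start = after_first_vowel_consonant(r1_start)
    return r1_start, r2_start


r1, r2 = region_starts('beautiful')
print('beautiful'[r1:], 'beautiful'[r2:])
```

The German and Dutch functions then apply `max(3, ...)` to the R1 offset, so that at least three letters always precede R1.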


def sb_dutch(word):
    """Return Snowball Dutch stem.

    The Snowball Dutch stemmer is defined at:
    http://snowball.tartarus.org/algorithms/dutch/stemmer.html

    :param str word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> sb_dutch('lezen')
    'lez'
    >>> sb_dutch('opschorting')
    'opschort'
    >>> sb_dutch('ongrijpbaarheid')
    'ongrijp'
    """
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'è'}
    _not_s_endings = {'a', 'e', 'i', 'j', 'o', 'u', 'y', 'è'}

    def _undouble(word):
        """Undouble endings -kk, -dd, and -tt."""
        if (
            len(word) > 1
            and word[-1] == word[-2]
            and word[-1] in {'d', 'k', 't'}
        ):
            return word[:-1]
        return word

    # lowercase, normalize to NFC, and strip umlauts and acute accents
    word = normalize('NFC', text_type(word.lower()))
    _accented = dict(zip((ord(_) for _ in 'äëïöüáéíóú'), 'aeiouaeiou'))
    word = word.translate(_accented)

    # mark initial 'y', 'y' after a vowel, and 'i' between vowels as
    # consonants by uppercasing them
    for i in range(len(word)):
        if i == 0 and word[0] == 'y':
            word = 'Y' + word[1:]
        elif word[i] == 'y' and word[i - 1] in _vowels:
            word = word[:i] + 'Y' + word[i + 1 :]
        elif (
            word[i] == 'i'
            and i > 0  # word[i - 1] would otherwise wrap around to word[-1]
            and word[i - 1] in _vowels
            and i + 1 < len(word)
            and word[i + 1] in _vowels
        ):
            word = word[:i] + 'I' + word[i + 1 :]

    r1_start = max(3, _sb_r1(word, _vowels))
    r2_start = _sb_r2(word, _vowels)

    # Step 1
    if word[-5:] == 'heden':
        if len(word[r1_start:]) >= 5:
            word = word[:-3] + 'id'
    elif word[-3:] == 'ene':
        if len(word[r1_start:]) >= 3 and (
            word[-4] not in _vowels and word[-6:-3] != 'gem'
        ):
            word = _undouble(word[:-3])
    elif word[-2:] == 'en':
        if len(word[r1_start:]) >= 2 and (
            word[-3] not in _vowels and word[-5:-2] != 'gem'
        ):
            word = _undouble(word[:-2])
    elif word[-2:] == 'se':
        if len(word[r1_start:]) >= 2 and word[-3] not in _not_s_endings:
            word = word[:-2]
    elif word[-1:] == 's':
        if len(word[r1_start:]) >= 1 and word[-2] not in _not_s_endings:
            word = word[:-1]

    # Step 2
    e_removed = False
    if word[-1:] == 'e':
        if len(word[r1_start:]) >= 1 and word[-2] not in _vowels:
            word = _undouble(word[:-1])
            e_removed = True

    # Step 3a
    if word[-4:] == 'heid':
        if len(word[r2_start:]) >= 4 and word[-5] != 'c':
            word = word[:-4]
            if word[-2:] == 'en':
                if len(word[r1_start:]) >= 2 and (
                    word[-3] not in _vowels and word[-5:-2] != 'gem'
                ):
                    word = _undouble(word[:-2])

    # Step 3b
    if word[-4:] == 'lijk':
        if len(word[r2_start:]) >= 4:
            word = word[:-4]
            # Repeat step 2
            if word[-1:] == 'e':
                if len(word[r1_start:]) >= 1 and word[-2] not in _vowels:
                    word = _undouble(word[:-1])
    elif word[-4:] == 'baar':
        if len(word[r2_start:]) >= 4:
            word = word[:-4]
    elif word[-3:] in ('end', 'ing'):
        if len(word[r2_start:]) >= 3:
            word = word[:-3]
            if (
                word[-2:] == 'ig'
                and len(word[r2_start:]) >= 2
                and word[-3] != 'e'
            ):
                word = word[:-2]
            else:
                word = _undouble(word)
    elif word[-3:] == 'bar':
        if len(word[r2_start:]) >= 3 and e_removed:
            word = word[:-3]
    elif word[-2:] == 'ig':
        if len(word[r2_start:]) >= 2 and word[-3] != 'e':
            word = word[:-2]

    # Step 4
    if (
        len(word) >= 4
        and word[-3] == word[-2]
        and word[-2] in {'a', 'e', 'o', 'u'}
        and word[-4] not in _vowels
        and word[-1] not in _vowels
        and word[-1] != 'I'
    ):
        word = word[:-2] + word[-1]

    # Change 'Y' and 'I' back to lowercase if they survived stemming
    for i in range(0, len(word)):
        if word[i] == 'Y':
            word = word[:i] + 'y' + word[i + 1 :]
        elif word[i] == 'I':
            word = word[:i] + 'i' + word[i + 1 :]

    return word
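
The German and Dutch stemmers both finish by walking the word index by index to lowercase the consonant markers ('Y' and 'U', or 'Y' and 'I') that were uppercased before stemming. One possible simplification, shown here as a sketch with a hypothetical helper name `restore_markers`, is to do the same replacement in a single `str.translate` pass. It assumes the markers are the only uppercase letters left in the word, which holds here because the input is lowercased first.

```python
def restore_markers(word, markers='YI'):
    """Lowercase the given uppercase marker letters in one pass.

    Hypothetical helper for illustration; `markers` would be 'YU' for the
    German stemmer and 'YI' for the Dutch one.
    """
    # str.translate accepts a dict mapping code points to replacements.
    table = {ord(ch): ch.lower() for ch in markers}
    return word.translate(table)


print(restore_markers('boeIen'))
print(restore_markers('baUer', markers='YU'))
```

The result is the same as the explicit loop, but without rebuilding the string once per marker.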


def sb_norwegian(word):
    """Return Snowball Norwegian stem.

    The Snowball Norwegian stemmer is defined at:
    http://snowball.tartarus.org/algorithms/norwegian/stemmer.html

    :param str word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> sb_norwegian('lese')
    'les'
    >>> sb_norwegian('suspensjon')
    'suspensjon'
    >>> sb_norwegian('sikkerhet')
    'sikker'
    """
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'å', 'æ', 'ø'}
    _s_endings = {
        'b',
        'c',
        'd',
        'f',
        'g',
        'h',
        'j',
        'l',
        'm',
        'n',
        'o',
        'p',
        'r',
        't',
        'v',
        'y',
        'z',
    }
    # lowercase, normalize, and compose
    word = normalize('NFC', text_type(word.lower()))

    r1_start = min(max(3, _sb_r1(word, _vowels)), len(word))

    # Step 1
    _r1 = word[r1_start:]
    if _r1[-7:] == 'hetenes':
        word = word[:-7]
    elif _r1[-6:] in {'hetene', 'hetens'}:
        word = word[:-6]
    elif _r1[-5:] in {'heten', 'heter', 'endes'}:
        word = word[:-5]
    elif _r1[-4:] in {'ande', 'ende', 'edes', 'enes', 'erte'}:
        if word[-4:] == 'erte':
            word = word[:-2]
        else:
            word = word[:-4]
    elif _r1[-3:] in {
        'ede',
        'ane',
        'ene',
        'ens',
        'ers',
        'ets',
        'het',
        'ast',
        'ert',
    }:
        if word[-3:] == 'ert':
            word = word[:-1]
        else:
            word = word[:-3]
    elif _r1[-2:] in {'en', 'ar', 'er', 'as', 'es', 'et'}:
        word = word[:-2]
    elif _r1[-1:] in {'a', 'e'}:
        word = word[:-1]
    elif _r1[-1:] == 's':
        if (len(word) > 1 and word[-2] in _s_endings) or (
            len(word) > 2 and word[-2] == 'k' and word[-3] not in _vowels
        ):
            word = word[:-1]

    # Step 2
    if word[r1_start:][-2:] in {'dt', 'vt'}:
        word = word[:-1]

    # Step 3
    _r1 = word[r1_start:]
    if _r1[-7:] == 'hetslov':
        word = word[:-7]
    elif _r1[-4:] in {'eleg', 'elig', 'elov', 'slov'}:
        word = word[:-4]
    elif _r1[-3:] in {'leg', 'eig', 'lig', 'els', 'lov'}:
        word = word[:-3]
    elif _r1[-2:] == 'ig':
        word = word[:-2]

    return word
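
The report above rates `sb_norwegian()` at 21 conditions, and most of them come from the `elif` ladders that try suffixes from longest to shortest. One way to act on the Long Method advice, sketched here under the assumption that a step only deletes the matched suffix, is to extract the ladder into a small table-driven helper. The name `strip_longest_suffix` is hypothetical and does not exist in the module.

```python
def strip_longest_suffix(word, r1_start, suffixes):
    """Remove the longest of `suffixes` that lies wholly inside R1.

    Hypothetical helper for illustration. `suffixes` must not contain the
    empty string.
    """
    r1 = word[r1_start:]
    # Try longer suffixes first so that 'hetene' wins over 'ene' or 'e'.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if r1.endswith(suffix):
            return word[: -len(suffix)]
    return word


# A subset of the Norwegian Step 1 suffixes, with R1 starting at index 3.
print(strip_longest_suffix('sikkerhet', 3, {'hetenes', 'hetene', 'het', 'et', 'e'}))
```

Suffixes with special handling, such as 'erte' and 'ert' (which are replaced rather than deleted) or the conditional 's', would still need their own branches, so this covers only the plain deletions. The same helper could also serve the Swedish and Danish Step 1 blocks that the report flags as duplicated.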


def sb_swedish(word):
    """Return Snowball Swedish stem.

    The Snowball Swedish stemmer is defined at:
    http://snowball.tartarus.org/algorithms/swedish/stemmer.html

    :param str word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> sb_swedish('undervisa')
    'undervis'
    >>> sb_swedish('suspension')
    'suspension'
    >>> sb_swedish('visshet')
    'viss'
    """
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'ä', 'å', 'ö'}
    _s_endings = {
        'b',
        'c',
        'd',
        'f',
        'g',
        'h',
        'j',
        'k',
        'l',
        'm',
        'n',
        'o',
        'p',
        'r',
        't',
        'v',
        'y',
    }

    # lowercase, normalize, and compose
    word = normalize('NFC', text_type(word.lower()))

    r1_start = min(max(3, _sb_r1(word, _vowels)), len(word))

    # Step 1
    _r1 = word[r1_start:]
    if _r1[-7:] == 'heterna':
        word = word[:-7]
    elif _r1[-6:] == 'hetens':
        word = word[:-6]
    elif _r1[-5:] in {
        'anden',
        'heten',
        'heter',
        'arnas',
        'ernas',
        'ornas',
        'andes',
        'arens',
        'andet',
    }:
        word = word[:-5]
    elif _r1[-4:] in {
        'arna',
        'erna',
        'orna',
        'ande',
        'arne',
        'aste',
        'aren',
        'ades',
        'erns',
    }:
        word = word[:-4]
    elif _r1[-3:] in {'ade', 'are', 'ern', 'ens', 'het', 'ast'}:
        word = word[:-3]
    elif _r1[-2:] in {'ad', 'en', 'ar', 'er', 'or', 'as', 'es', 'at'}:
        word = word[:-2]
    elif _r1[-1:] in {'a', 'e'}:
        word = word[:-1]
    elif _r1[-1:] == 's':
        if len(word) > 1 and word[-2] in _s_endings:
            word = word[:-1]

    # Step 2
    if word[r1_start:][-2:] in {'dd', 'gd', 'nn', 'dt', 'gt', 'kt', 'tt'}:
        word = word[:-1]

    # Step 3
    _r1 = word[r1_start:]
    if _r1[-5:] == 'fullt':
        word = word[:-1]
    elif _r1[-4:] == 'löst':
        word = word[:-1]
    elif _r1[-3:] in {'lig', 'els'}:
        word = word[:-3]
    elif _r1[-2:] == 'ig':
        word = word[:-2]

    return word


def sb_danish(word):
    """Return Snowball Danish stem.

    The Snowball Danish stemmer is defined at:
    http://snowball.tartarus.org/algorithms/danish/stemmer.html

    :param str word: the word to calculate the stem of
    :returns: word stem
    :rtype: str

    >>> sb_danish('underviser')
    'undervis'
    >>> sb_danish('suspension')
    'suspension'
    >>> sb_danish('sikkerhed')
    'sikker'
    """
    _vowels = {'a', 'e', 'i', 'o', 'u', 'y', 'å', 'æ', 'ø'}
    _s_endings = {
        'a',
        'b',
        'c',
        'd',
        'f',
        'g',
        'h',
        'j',
        'k',
        'l',
        'm',
        'n',
        'o',
        'p',
        'r',
        't',
        'v',
        'y',
        'z',
        'å',
    }

    # lowercase, normalize, and compose
    word = normalize('NFC', text_type(word.lower()))

    r1_start = min(max(3, _sb_r1(word, _vowels)), len(word))

    # Step 1
    _r1 = word[r1_start:]
    if _r1[-7:] == 'erendes':
        word = word[:-7]
    elif _r1[-6:] in {'erende', 'hedens'}:
        word = word[:-6]
    elif _r1[-5:] in {
        'ethed',
        'erede',
        'heden',
        'heder',
        'endes',
        'ernes',
        'erens',
        'erets',
    }:
        word = word[:-5]
    elif _r1[-4:] in {
        'ered',
        'ende',
        'erne',
        'eren',
        'erer',
        'heds',
        'enes',
        'eres',
        'eret',
    }:
        word = word[:-4]
    elif _r1[-3:] in {'hed', 'ene', 'ere', 'ens', 'ers', 'ets'}:
        word = word[:-3]
    elif _r1[-2:] in {'en', 'er', 'es', 'et'}:
        word = word[:-2]
    elif _r1[-1:] == 'e':
        word = word[:-1]
    elif _r1[-1:] == 's':
        if len(word) > 1 and word[-2] in _s_endings:
            word = word[:-1]

    # Step 2
    if word[r1_start:][-2:] in {'gd', 'dt', 'gt', 'kt'}:
        word = word[:-1]

    # Step 3
    if word[-4:] == 'igst':
        word = word[:-2]

    _r1 = word[r1_start:]
    repeat_step2 = False
    if _r1[-4:] == 'elig':
        word = word[:-4]
        repeat_step2 = True
    elif _r1[-4:] == 'løst':
        word = word[:-1]
    elif _r1[-3:] in {'lig', 'els'}:
        word = word[:-3]
        repeat_step2 = True
    elif _r1[-2:] == 'ig':
        word = word[:-2]
        repeat_step2 = True

    if repeat_step2:
        if word[r1_start:][-2:] in {'gd', 'dt', 'gt', 'kt'}:
            word = word[:-1]

    # Step 4
    if (
        len(word[r1_start:]) >= 1
        and len(word) >= 2
        and word[-1] == word[-2]
        and word[-1] not in _vowels
    ):
        word = word[:-1]

    return word


if __name__ == '__main__':
    import doctest

    doctest.testmod()