| 1 | """Common functionality for analyzers.""" |
||
| 2 | |||
| 3 | import abc |
||
| 4 | import functools |
||
| 5 | import unicodedata |
||
| 6 | import nltk.tokenize |
||
| 7 | |||
| 8 | _KEY_TOKEN_MIN_LENGTH = 'token_min_length' |
||
| 9 | |||
| 10 | |||
class Analyzer(metaclass=abc.ABCMeta):
    """Base class for language-specific analyzers. Abstract methods
    must be implemented in subclasses; the tokenize methods may be
    overridden when necessary."""

    name = None

    def __init__(self, **kwargs):
        self.token_min_length = int(kwargs.get(_KEY_TOKEN_MIN_LENGTH, 3))

    def tokenize_sentences(self, text):
        """Tokenize a piece of text (e.g. a document) into sentences."""
        return nltk.tokenize.sent_tokenize(text)

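    # The lru_cache below keys on the (self, word) pair, so each distinct
    # word form is validated at most once per analyzer instance (up to
    # maxsize cached results).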
    @functools.lru_cache(maxsize=50000)
    def is_valid_token(self, word):
        """Return True if the word is an acceptable token."""
        if len(word) < self.token_min_length:
            return False
        for char in word:
            category = unicodedata.category(char)
            if category[0] == 'L':  # letter
                return True
        return False

    def tokenize_words(self, text):
        """Tokenize a piece of text (e.g. a sentence) into words."""
        return [self.normalize_word(word)
                for word in nltk.tokenize.word_tokenize(text)
                if self.is_valid_token(word)]

    @abc.abstractmethod
    def normalize_word(self, word):
        """Normalize (stem or lemmatize) a word form into a normal form."""
        pass  # pragma: no cover
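
# The sketch below is illustrative and not part of the original module: a
# minimal concrete subclass of Analyzer. The class name 'EnglishAnalyzer',
# the analyzer name 'english', and the choice of NLTK's Snowball stemmer
# are assumptions for demonstration; real subclasses may use any stemmer
# or lemmatizer to implement normalize_word.

import nltk.stem.snowball


class EnglishAnalyzer(Analyzer):
    """Example analyzer that stems English words with the Snowball stemmer."""

    name = 'english'

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.stemmer = nltk.stem.snowball.SnowballStemmer('english')

    def normalize_word(self, word):
        """Stem a word form into its Snowball stem."""
        return self.stemmer.stem(word.lower())


# Example usage (requires the NLTK 'punkt' tokenizer data to be installed):
#     analyzer = EnglishAnalyzer(token_min_length=3)
#     analyzer.tokenize_words('The cats were running quickly')
#     # -> ['the', 'cat', 'were', 'run', 'quick']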