_simple_document()    rating: B

Complexity
    Conditions: 5

Size
    Total Lines: 25

Duplication
    Lines: 0
    Ratio: 0 %

Importance
    Changes: 2
    Bugs: 1    Features: 0

Metric   Value
c        2
b        1
f        0
dl       0
loc      25
rs       8.0894
cc       5
import gensim
Coding Style: This module should have a docstring.

The coding style of this project requires that you add a docstring to this code element. Below is an example for methods:

class SomeClass:
    def some_method(self):
        """Do x and return foo."""

If you would like to know more about docstrings, we recommend reading PEP-257: Docstring Conventions.
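Here the flagged element is the module itself, so the fix would be a module-level docstring as the first statement of the file. A minimal sketch (the path and wording are assumptions, not part of the reviewed code):

# topik/tokenizers/simple.py (assumed path), top of file; comments may precede the docstring
"""Simple tokenizers that lowercase text, keep alphabetic tokens and drop stopwords."""
import gensim
import logging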

Configuration: The import gensim could not be resolved.

This can be caused by one of the following:

1. Missing Dependencies

This error can indicate a configuration issue with Pylint. Make sure that your libraries are available by adding the necessary install commands:

# .scrutinizer.yml
before_commands:
    - sudo pip install abc # Python2
    - sudo pip3 install abc # Python3
Tip: We are currently not using virtualenv to run Pylint; when installing your modules, make sure to use the command for the correct Python version.

2. Missing __init__.py files

This error could also result from missing __init__.py files in your module folders. Make sure that you place one file in each sub-folder.

import logging
Unused Code: The import logging seems to be unused.
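The only references to logging in this file are the commented-out debug calls further down, which is why the import is flagged. If that tracing were re-enabled, a module-level logger would make the import live again; a sketch using only the standard library, not part of the reviewed code:

import logging

logger = logging.getLogger(__name__)  # standard one-logger-per-module pattern

# inside _simple_document, in place of the commented-out call:
logger.debug("Tokenizing text: %s", text)  # %-style args defer formatting until the record is emitted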
# imports used only for doctests
from topik.tokenizers._registry import register


def _simple_document(text, min_length=1, stopwords=None):
    """A text tokenizer that simply lowercases, matches alphabetic
    characters and removes stopwords.  For use on individual text documents.

    Parameters
    ----------
    text : str
        A single document's text to be tokenized
    min_length : int
        Minimum length of any single word
    stopwords: None or iterable of str
        Collection of words to ignore as tokens

    Examples
    --------
    >>> text = "frank FRANK the frank dog cat"
    >>> tokenized_text = _simple_document(text)
    >>> tokenized_text == ["frank", "frank", "frank", "dog", "cat"]
    True
    """
    if not stopwords:
        from gensim.parsing.preprocessing import STOPWORDS as stopwords
Configuration: The import gensim.parsing.preprocessing could not be resolved (same possible causes as the unresolved gensim import above: missing dependencies in the Pylint environment or missing __init__.py files).

    #logging.debug("Tokenizing text: {}".format(text))
    return [word for word in gensim.utils.tokenize(text, lower=True)
Comprehensibility Best Practice: The variable word does not seem to be defined.
            if word not in stopwords and len(word) >= min_length]
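Two notes on the function above. The "variable word does not seem to be defined" warning appears to be an artifact of the analyzer reading only the first line of the two-line list comprehension; word is bound by its for clause, which continues on the next line. Also, with the default stopwords=None the function falls back to gensim's STOPWORDS (imported lazily inside the function), which is why "the" disappears in the doctest. For illustration, a hypothetical call with explicit arguments, assuming gensim is installed:

>>> _simple_document("The quick brown fox", min_length=4, stopwords={"the"})
['quick', 'brown']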


@register
Comprehensibility Best Practice: The variable register does not seem to be defined. (register is imported from topik.tokenizers._registry at the top of the file, so this looks like an analyzer false positive.)
def simple(raw_corpus, min_length=1, stopwords=None):
    """A text tokenizer that simply lowercases, matches alphabetic
    characters and removes stopwords.

    Parameters
    ----------
    raw_corpus : iterable of tuple of (doc_id(str/int), doc_text(str))
        body of documents to examine
    min_length : int
        Minimum length of any single word
    stopwords: None or iterable of str
        Collection of words to ignore as tokens

    Examples
    --------
    >>> sample_corpus = [("doc1", "frank FRANK the frank dog cat"),
    ...               ("doc2", "frank a dog of the llama")]
    >>> tokenized_corpora = simple(sample_corpus)
    >>> next(tokenized_corpora) == ("doc1",
    ... ["frank", "frank", "frank", "dog", "cat"])
    True
    """
    for doc_id, doc_text in raw_corpus:
        # logging.debug("Tokenizing doc_id: {}".format(doc_id))
        yield(doc_id, _simple_document(doc_text, min_length=min_length, stopwords=stopwords))
Coding Style: Final newline missing.
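Note that yield(doc_id, ...) reads like a function call but simply yields a (doc_id, tokens) tuple; yield is a statement, not a function. Because simple() is a generator, the doctest above only pulls the first document with next(); the whole corpus can be materialized, for example into a dict. A hypothetical usage sketch (the stopword set here is illustrative, not gensim's default):

>>> corpus = [("doc1", "frank FRANK the frank dog cat"),
...           ("doc2", "frank a dog of the llama")]
>>> dict(simple(corpus, stopwords={"the", "a", "of"}))
{'doc1': ['frank', 'frank', 'frank', 'dog', 'cat'], 'doc2': ['frank', 'dog', 'llama']}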