import gensim
import logging

# imports used only for doctests
from topik.tokenizers._registry import register


def _simple_document(text, min_length=1, stopwords=None):
    """A text tokenizer that simply lowercases, matches alphabetic
    characters and removes stopwords.  For use on individual text documents.

    Parameters
    ----------
    text : str
        A single document's text to be tokenized
    min_length : int
        Minimum length of any single word
    stopwords : None or iterable of str
        Collection of words to ignore as tokens

    Examples
    --------
    >>> text = "frank FRANK the frank dog cat"
    >>> tokenized_text = _simple_document(text)
    >>> tokenized_text == ["frank", "frank", "frank", "dog", "cat"]
    True
    """
    if not stopwords:
        from gensim.parsing.preprocessing import STOPWORDS as stopwords
    # logging.debug("Tokenizing text: {}".format(text))
    return [word for word in gensim.utils.tokenize(text, lower=True)
            if word not in stopwords and len(word) >= min_length]


@register
def simple(raw_corpus, min_length=1, stopwords=None):
    """A text tokenizer that simply lowercases, matches alphabetic
    characters and removes stopwords.

    Parameters
    ----------
    raw_corpus : iterable of tuple of (doc_id(str/int), doc_text(str))
        Body of documents to examine
    min_length : int
        Minimum length of any single word
    stopwords : None or iterable of str
        Collection of words to ignore as tokens

    Examples
    --------
    >>> sample_corpus = [("doc1", "frank FRANK the frank dog cat"),
    ...                  ("doc2", "frank a dog of the llama")]
    >>> tokenized_corpora = simple(sample_corpus)
    >>> next(tokenized_corpora) == ("doc1",
    ...     ["frank", "frank", "frank", "dog", "cat"])
    True
    """
    for doc_id, doc_text in raw_corpus:
        # logging.debug("Tokenizing doc_id: {}".format(doc_id))
        yield doc_id, _simple_document(doc_text, min_length=min_length,
                                       stopwords=stopwords)
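

# A minimal usage sketch: runs the `simple` tokenizer over a tiny in-memory
# corpus and prints each (doc_id, tokens) pair. It assumes gensim is installed
# and that the @register decorator returns the function unchanged; the corpus
# literal below is illustrative only.
if __name__ == "__main__":
    sample_corpus = [("doc1", "frank FRANK the frank dog cat"),
                     ("doc2", "frank a dog of the llama")]
    # `simple` is a generator, so iterate over it to materialize results.
    for doc_id, tokens in simple(sample_corpus, min_length=1):
        print(doc_id, tokens)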