read_document_folder() - Code Metrics - ContinuumIO/topik - Measure and Improve Code Quality continuously with Scrutinizer

read_document_folder() C
last analyzed 2016-04-26 18:00 UTC

↳ Parent: Project

Complexity

Conditions

Size

Total Lines

Duplication

Lines	0
Ratio	0 %

Importance

Changes	2
Bugs	0	Features	0

Metric	Value
c	2
b	0
f	0
dl	0
loc	38
rs	5.5
cc	7

import os
class SomeClass:
    def some_method(self):
        """Do x and return foo."""
import logging
import gzip

from six import text_type
from topik.fileio._registry import register_input
from topik.fileio.tests import test_data_path


@register_input

def read_document_folder(folder, content_field='text'):
    """Iterate over the files in a folder to retrieve the content to process and tokenize.

    Parameters
    ----------
    folder : str
        The folder containing the files you want to analyze.

    content_field : str
        The usage of 'content_field' in this source is different from most other sources.  The 

        assumption in this source is that each file contains raw text, NOT dictionaries of 

        categorized data.  The content_field argument here specifies what key to store the raw
        text under in the returned dictionary for each document.

    Examples
    --------
    >>> documents = read_document_folder(
    ...     '{}/test_data_folder_files'.format(test_data_path))
    >>> next(documents)['text'] == (
    ...     u"'Interstellar' was incredible. The visuals, the score, " +
    ...     u"the acting, were all amazing. The plot is definitely one " +
    ...     u"of the most original I've seen in a while.")
    True
    """

    if not os.path.exists(folder):
        raise IOError("Folder not found!")

    for directory, subdirectories, files in os.walk(folder):

        for n, file in enumerate(sorted(files)):

            _open = gzip.open if file.endswith('.gz') else open

            fullpath = os.path.join(directory, file)
            try:
                with _open(fullpath, 'rb') as fd:

                    yield _process_file(fd, fullpath, content_field)
            except ValueError as err:

                logging.warning("Unable to process file: {}, error: {}".format(fullpath, err))


def _process_file(fd, fullpath, content_field):
class SomeClass:
    def some_method(self):
        """Do x and return foo."""
    content = fd.read()
    try:
        u_content = text_type(content)
    except UnicodeDecodeError:

        logging.warning("Encountered invalid unicode in file {}, ignoring invalid bytes".format(fullpath))
        u_content = text_type(content, errors='ignore')
    return {content_field: u_content, 'filename': fullpath}


1			import os
			0 ignored issues – show Coding Style introduced 2015-11-23 14:50 UTC by Report Bug Copy Issue Report This module should have a docstring. The coding style of this project requires that you add a docstring to this code element. Below, you find an example for methods: class SomeClass: def some_method(self): """Do x and return foo.""" If you would like to know more about docstrings, we recommend to read PEP-257: Docstring Conventions. Loading history...
2			import logging
3			import gzip
4
5			from six import text_type
6			from topik.fileio._registry import register_input
7			from topik.fileio.tests import test_data_path
			0 ignored issues – show Unused Code introduced 2015-11-23 14:50 UTC by Report Bug Copy Issue Report Unused test_data_path imported from topik.fileio.tests Loading history...
8
9			@register_input
			0 ignored issues – show Comprehensibility Best Practice introduced 2016-04-26 18:00 UTC by Report Bug Copy Issue Report The variable `register_input` does not seem to be defined. Loading history...
10			def read_document_folder(folder, content_field='text'):
11			"""Iterate over the files in a folder to retrieve the content to process and tokenize.
12
13			Parameters
14			----------
15			folder : str
16			The folder containing the files you want to analyze.
17
18			content_field : str
19			The usage of 'content_field' in this source is different from most other sources. The
			0 ignored issues – show Coding Style introduced 2015-11-23 14:50 UTC by Report Bug Copy Issue Report Trailing whitespace Loading history...
20			assumption in this source is that each file contains raw text, NOT dictionaries of
			0 ignored issues – show Coding Style introduced 2015-11-23 14:50 UTC by Report Bug Copy Issue Report Trailing whitespace Loading history...
21			categorized data. The content_field argument here specifies what key to store the raw
22			text under in the returned dictionary for each document.
23
24			Examples
25			--------
26			>>> documents = read_document_folder(
27			... '{}/test_data_folder_files'.format(test_data_path))
28			>>> next(documents)['text'] == (
29			... u"'Interstellar' was incredible. The visuals, the score, " +
30			... u"the acting, were all amazing. The plot is definitely one " +
31			... u"of the most original I've seen in a while.")
32			True
33			"""
34
35			if not os.path.exists(folder):
36			raise IOError("Folder not found!")
37
38			for directory, subdirectories, files in os.walk(folder):
			0 ignored issues – show Unused Code introduced 2015-11-23 14:50 UTC by Report Bug Copy Issue Report The variable `subdirectories` seems to be unused. Loading history...
39			for n, file in enumerate(sorted(files)):
			0 ignored issues – show Coding Style Naming introduced 2015-11-23 14:50 UTC by Report Bug Copy Issue Report The name `n` does not conform to the variable naming conventions (`[a-z_][a-z0-9_]{2,30}$`). This check looks for invalid names for a range of different identifiers. You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements. If your project includes a Pylint configuration file, the settings contained in that file take precedence. To find out more about Pylint, please refer to their site. Loading history... Unused Code introduced 2015-11-23 14:50 UTC by Report Bug Copy Issue Report The variable `n` seems to be unused. Loading history...
40			_open = gzip.open if file.endswith('.gz') else open
			0 ignored issues – show Comprehensibility Best Practice introduced 2016-04-26 18:00 UTC by Report Bug Copy Issue Report The variable `gzip` does not seem to be defined. Loading history... Comprehensibility Best Practice introduced 2016-04-26 18:00 UTC by Report Bug Copy Issue Report The variable `open` does not seem to be defined. Loading history...
41			fullpath = os.path.join(directory, file)
42			try:
43			with _open(fullpath, 'rb') as fd:
			0 ignored issues – show Coding Style Naming introduced 2016-04-21 18:30 UTC by Report Bug Copy Issue Report The name `fd` does not conform to the variable naming conventions (`[a-z_][a-z0-9_]{2,30}$`). This check looks for invalid names for a range of different identifiers. You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements. If your project includes a Pylint configuration file, the settings contained in that file take precedence. To find out more about Pylint, please refer to their site. Loading history... Comprehensibility Best Practice introduced 2016-04-26 18:00 UTC by Report Bug Copy Issue Report The variable `fullpath` does not seem to be defined. Loading history... Comprehensibility Best Practice introduced 2016-04-26 18:00 UTC by Report Bug Copy Issue Report The variable `fd` does not seem to be defined. Loading history...
44			yield _process_file(fd, fullpath, content_field)
45			except ValueError as err:
			0 ignored issues – show Comprehensibility Best Practice introduced 2016-04-26 18:00 UTC by Report Bug Copy Issue Report The variable `ValueError` does not seem to be defined. Loading history...
46			logging.warning("Unable to process file: {}, error: {}".format(fullpath, err))
47
48
49			def _process_file(fd, fullpath, content_field):
			0 ignored issues – show Coding Style Naming introduced 2016-04-21 18:30 UTC by Report Bug Copy Issue Report The name `fd` does not conform to the argument naming conventions (`[a-z_][a-z0-9_]{2,30}$`). This check looks for invalid names for a range of different identifiers. You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements. If your project includes a Pylint configuration file, the settings contained in that file take precedence. To find out more about Pylint, please refer to their site. Loading history... Coding Style introduced 2016-04-21 18:16 UTC by Report Bug Copy Issue Report This function should have a docstring. The coding style of this project requires that you add a docstring to this code element. Below, you find an example for methods: class SomeClass: def some_method(self): """Do x and return foo.""" If you would like to know more about docstrings, we recommend to read PEP-257: Docstring Conventions. Loading history...
50			content = fd.read()
51			try:
52			u_content = text_type(content)
53			except UnicodeDecodeError:
			0 ignored issues – show Comprehensibility Best Practice introduced 2016-04-26 18:00 UTC by Report Bug Copy Issue Report The variable `UnicodeDecodeError` does not seem to be defined. Loading history...
54			logging.warning("Encountered invalid unicode in file {}, ignoring invalid bytes".format(fullpath))
55			u_content = text_type(content, errors='ignore')
56			return {content_field: u_content, 'filename': fullpath}
57

ContinuumIO / topik

read_document_folder() C last analyzed 2016-04-26 18:00 UTC

Complexity

Size

Duplication

Importance

Duplication Side-by-Side

Filter issues like

read_document_folder() C
last analyzed 2016-04-26 18:00 UTC