Completed
Push — master ( 980041...30b693 )
by
unknown
10s
created

topik.fileio.read_document_folder()   C

Complexity

Conditions 7

Size

Total Lines 39

Duplication

Lines 0
Ratio 0 %
Metric  Value
cc      7
dl      0
loc     39
rs      5.5
import os
Coding Style
This module should have a docstring.

The coding style of this project requires that you add a docstring to this code element. Below is an example for methods:

class SomeClass:
    def some_method(self):
        """Do x and return foo."""

If you would like to know more about docstrings, we recommend reading PEP 257: Docstring Conventions.

import logging
import gzip

from topik.fileio._registry import register_input
from topik.fileio.tests import test_data_path
Unused Code
Unused test_data_path imported from topik.fileio.tests

@register_input
def read_document_folder(folder, content_field='text'):
    """Iterate over the files in a folder to retrieve the content to process and tokenize.

    Parameters
    ----------
    folder : str
        The folder containing the files you want to analyze.

    content_field : str
        The usage of 'content_field' in this source is different from most other sources.  The 
Coding Style
Trailing whitespace
        assumption in this source is that each file contains raw text, NOT dictionaries of 
Coding Style
Trailing whitespace
        categorized data.  The content_field argument here specifies what key to store the raw
        text under in the returned dictionary for each document.

    Examples
    --------
    >>> documents = read_document_folder(
    ...     '{}/test_data_folder_files'.format(test_data_path))
    >>> next(documents)['text'] == (
    ...     u"'Interstellar' was incredible. The visuals, the score, " +
    ...     u"the acting, were all amazing. The plot is definitely one " +
    ...     u"of the most original I've seen in a while.")
    True
    """

    if not os.path.exists(folder):
        raise IOError("Folder not found!")

    for directory, subdirectories, files in os.walk(folder):
Unused Code
The variable subdirectories seems to be unused.
        for n, file in enumerate(sorted(files)):
Coding Style Naming
The name n does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).

This check looks for invalid names for a range of different identifiers.

You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements.

If your project includes a Pylint configuration file, the settings contained in that file take precedence.

To find out more about Pylint, please refer to their site.
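
As a rough illustration of the naming check, the pattern quoted in the message can be tried directly with Python's `re` module. This is only a sketch: the helper `conforms` and the sample names are hypothetical, and Pylint's own matching logic may differ in detail.

```python
import re

# Pattern copied from the message above. re.match anchors at the start of the
# string and the trailing `$` anchors at the end.
NAME_PATTERN = re.compile(r'[a-z_][a-z0-9_]{2,30}$')


def conforms(name):
    """Return True if `name` matches the quoted naming pattern (hypothetical helper)."""
    return NAME_PATTERN.match(name) is not None


# 'n' and 'f' fail because the pattern requires at least three characters;
# longer lowercase names such as 'index' or 'handle' pass.
for candidate in ('n', 'f', 'index', 'handle'):
    print(candidate, conforms(candidate))
```

Under this pattern, any one- or two-character name is rejected, which is why both `n` and `f` are flagged in this function.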

Unused Code
The variable n seems to be unused.
            _open = gzip.open if file.endswith('.gz') else open
            try:
                fullpath = os.path.join(directory, file)
                with _open(fullpath, 'rb') as f:
Coding Style Naming
The name f does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).

                    yield {content_field: f.read().decode('utf-8'),
                           'filename': fullpath}
            except (ValueError, UnicodeDecodeError) as err:
                logging.warning("Unable to process file: {}, error: {}".format(fullpath, err))
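
One way the flagged issues could be addressed is sketched below. It is not the project's code: the `@register_input` decorator and the doctest are left out so the block runs on its own, and the names `filename`, `handle`, and `_subdirectories` are illustrative choices. Whether a leading underscore silences the unused-variable check depends on the checker's configuration.

```python
"""Sketch: read raw-text documents from a folder (plain or gzip-compressed)."""
import gzip
import logging
import os


def read_document_folder(folder, content_field='text'):
    """Yield one dict per file under `folder`, with the text stored under `content_field`."""
    if not os.path.exists(folder):
        raise IOError("Folder not found!")

    # The leading underscore conventionally marks a variable as intentionally unused.
    for directory, _subdirectories, files in os.walk(folder):
        # enumerate() is dropped because the index was never used.
        for filename in sorted(files):
            fullpath = os.path.join(directory, filename)
            _open = gzip.open if filename.endswith('.gz') else open
            try:
                with _open(fullpath, 'rb') as handle:
                    text = handle.read().decode('utf-8')
            except ValueError as err:  # UnicodeDecodeError is a subclass of ValueError
                logging.warning("Unable to process file: %s, error: %s", fullpath, err)
                continue
            # Unlike the original, the yield sits outside the try block, so only
            # read/decode failures are caught here.
            yield {content_field: text, 'filename': fullpath}
```

Files that cannot be decoded as UTF-8 are logged and skipped, as in the original; the remaining documents are still yielded in sorted filename order within each directory.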