read_input()   F
last analyzed

Complexity

Conditions 12

Size

Total Lines 64

Duplication

Lines 0
Ratio 0 %

Importance

Changes 4
Bugs 1 Features 0
Metric                        Value
c   (changes)                 4
b   (bugs)                    1
f   (features)                0
dl  (duplicated lines)        0
loc (lines of code)           64
rs                            2.7469
cc  (cyclomatic complexity)   12

How to fix

Long Method

Small methods make your code easier to understand, especially when combined with a good name. Moreover, when a method is small, finding a good name for it is usually much easier.

For example, if you find yourself adding comments to a method's body, that is usually a good sign that you should extract the commented part into a new method, and use the comment as a starting point when coming up with a good name for this new method.

The most commonly applied refactoring here is Extract Method.
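The advice above — extract each commented section and let the comment seed the new method's name — can be sketched as a before/after. This is a hypothetical pricing example, not code from the module under review:

```python
# Before: one method whose commented sections hint at hidden sub-methods.
def total_price_before(items, discount_rate):
    # sum the base prices
    subtotal = 0
    for price, qty in items:
        subtotal += price * qty
    # apply the discount
    return subtotal * (1 - discount_rate)


# After: each commented section becomes a small, well-named method,
# and the comments are no longer needed.
def sum_base_prices(items):
    return sum(price * qty for price, qty in items)


def apply_discount(subtotal, discount_rate):
    return subtotal * (1 - discount_rate)


def total_price(items, discount_rate):
    return apply_discount(sum_base_prices(items), discount_rate)
```

Both versions compute the same result; the second reads as a sentence and each piece can be tested on its own.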

Complexity

Complex code elements like read_input() often do a lot of different things. To break such an element down, we need to identify a cohesive component within it. A common approach to finding such a component is to look for fields or methods that share the same prefixes or suffixes.

Once you have determined which fields belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often the faster option.
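A before/after sketch of Extract Class driven by shared field prefixes. The Customer/billing example is hypothetical, not from this codebase:

```python
# Before: the "billing_" prefix marks a cohesive component hiding in Customer.
class Customer:
    def __init__(self, name, billing_street, billing_city):
        self.name = name
        self.billing_street = billing_street
        self.billing_city = billing_city

    def billing_label(self):
        return "{}, {}".format(self.billing_street, self.billing_city)


# After: the shared-prefix fields move into their own class, and the
# prefix disappears because the class name now carries that meaning.
class BillingAddress:
    def __init__(self, street, city):
        self.street = street
        self.city = city

    def label(self):
        return "{}, {}".format(self.street, self.city)


class CustomerRefactored:
    def __init__(self, name, billing_address):
        self.name = name
        self.billing = billing_address
```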

import os

Issue (Coding Style): This module should have a docstring.

The coding style of this project requires that you add a docstring to this code element. Below, you find an example for methods:

    class SomeClass:
        def some_method(self):
            """Do x and return foo."""

If you would like to know more about docstrings, we recommend reading PEP-257: Docstring Conventions.
from topik.fileio._registry import registered_inputs
from topik.fileio.tests import test_data_path

Issue (Unused Code): Unused test_data_path imported from topik.fileio.tests. (Note: test_data_path is referenced only inside the doctest in read_input's docstring, which static analysis treats as a string rather than code.)
# this function is the primary API for people using any registered functions.
def read_input(source, source_type="auto", folder_content_field='text', **kwargs):
    """
    Read data from given source into Topik's internal data structures.

    Parameters
    ----------
    source : str
        input data.  Can be file path, directory, or server address.
    source_type : str
        "auto" tries to figure out data type of source.  Can be manually specified instead.
        options for manual specification are ['solr', 'elastic', 'json_stream', 'large_json', 'folder']
    folder_content_field : str
        Only used for document_folder source. This argument is used as the key
        (field name), where each document represents the value of that field.
    kwargs : any other arguments to pass to input parsers

    Returns
    -------
    iterable output object

    >> ids, texts = zip(*list(iter(raw_data)))
    Examples
    --------
    >>> loaded_corpus = read_input(
    ...         '{}/test_data_json_stream.json'.format(test_data_path))
    >>> solution_text = (
    ... u'Transition metal oxides are being considered as the next generation '+
    ... u'materials in field such as electronics and advanced catalysts; '+
    ... u'between them is Tantalum (V) Oxide; however, there are few reports '+
    ... u'for the synthesis of this material at the nanometer size which could '+
    ... u'have unusual properties. Hence, in this work we present the '+
    ... u'synthesis of Ta2O5 nanorods by sol gel method using DNA as structure '+
    ... u'directing agent, the size of the nanorods was of the order of 40 to '+
    ... u'100 nm in diameter and several microns in length; this easy method '+
    ... u'can be useful in the preparation of nanomaterials for electronics, '+
    ... u'biomedical applications as well as catalysts.')
    >>> solution_text == next(loaded_corpus)['abstract']
    True
    """
    json_extensions = [".js", ".json"]

    # web addresses default to elasticsearch
    if (source_type == "auto" and "9200" in source) or source_type == "elastic":
        data_iterator = registered_inputs["read_elastic"](source, **kwargs)

Issue (Comprehensibility Best Practice): The variable kwargs does not seem to be defined.
    # files must end in .json.  Try json parser first, try large_json parser next.  Fail otherwise.
    elif (source_type == "auto" and os.path.splitext(source)[1] in json_extensions) or source_type == "json_stream":
        try:
            data_iterator = registered_inputs["read_json_stream"](source, **kwargs)
            # tee the iterator and try to get the first element.  If it fails, this is actually a large_json file.
            next(data_iterator)
            # reset the iterator after this check so that it starts at document 0 rather than document 1
            data_iterator = registered_inputs["read_json_stream"](source, **kwargs)
        except ValueError:

Issue (Comprehensibility Best Practice): The variable ValueError does not seem to be defined.
            data_iterator = registered_inputs["read_large_json"](source, **kwargs)
    elif source_type == "large_json":
        data_iterator = registered_inputs["read_large_json"](source, **kwargs)
    # folder paths are simple strings that don't end in an extension (.+3-4 characters), or end in a /
    elif (source_type == "auto" and os.path.splitext(source)[1] == "") or source_type == "folder":
        data_iterator = registered_inputs["read_document_folder"](source,
                                                                  content_field=folder_content_field)
    else:
        raise ValueError("Unrecognized source type: {}.  Please either manually specify the type, or convert your input"
                         " to a supported type.".format(source))
    return data_iterator
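One way to lower the 12 conditions counted above is to pull each auto-detection rule out of the if/elif chain into its own small, separately testable function. The sketch below only covers the detection step, under stated assumptions: detect_source_type and the looks_like_* helpers are invented names, and the real readers in registered_inputs (plus the json_stream/large_json fallback) are deliberately left out:

```python
import os


def looks_like_elastic(source):
    # web addresses default to elasticsearch (the ":9200" port heuristic
    # used by read_input above)
    return "9200" in source


def looks_like_json(source):
    return os.path.splitext(source)[1] in (".js", ".json")


def looks_like_folder(source):
    # folder paths are plain strings with no file extension
    return os.path.splitext(source)[1] == ""


def detect_source_type(source):
    # first matching rule wins, mirroring the order of the if/elif chain
    rules = [
        ("elastic", looks_like_elastic),
        ("json_stream", looks_like_json),
        ("folder", looks_like_folder),
    ]
    for name, rule in rules:
        if rule(source):
            return name
    raise ValueError("Unrecognized source type: {}".format(source))
```

Each rule then has a cyclomatic complexity of 1 or 2, and new source types can be supported by appending a (name, rule) pair rather than growing the branch chain.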