read_large_json()   C

Complexity
    Conditions: 8

Size
    Total Lines: 72

Duplication
    Lines: 0
    Ratio: 0 %

Metric    Value
dl        0
loc       72
rs        5.081
cc        8

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:

import logging
Coding Style
This module should have a docstring.

The coding style of this project requires that you add a docstring to this code element. Below, you find an example for methods:

class SomeClass:
    def some_method(self):
        """Do x and return foo."""

If you would like to know more about docstrings, we recommend reading PEP-257: Docstring Conventions.

import json
import ijson

from topik.fileio._registry import register_input
from topik.fileio.tests import test_data_path
Unused Code
Unused test_data_path imported from topik.fileio.tests

@register_input
Comprehensibility Best Practice
The variable register_input does not seem to be defined.
def read_json_stream(filename, json_prefix='item', **kwargs):
Unused Code
The argument kwargs seems to be unused.
Unused Code
The argument json_prefix seems to be unused.
    # TODO: decide between:
Coding Style
TODO and FIXME comments should generally be avoided.
    #   a) allow this unused json_prefix argument so that current check in read_input works
    #   b) allow **kwargs instead
    #   c) improve the check in read_input.. (maybe read first line and see if it is a valid, self-contained json object?
    #   d) actually do use an optional json_prefix argument to only return a subset of each json object.

    """Iterate over a json stream of items and get the field that contains the text to process and tokenize.

    Parameters
    ----------
    filename : str
        The filename of the json stream.

    Examples
    --------
    >>> documents = read_json_stream(
    ... '{}/test_data_json_stream.json'.format(test_data_path))
    >>> next(documents) == {
    ... u'doi': u'http://dx.doi.org/10.1557/PROC-879-Z3.3',
    ... u'title': u'Sol Gel Preparation of Ta2O5 Nanorods Using DNA as Structure Directing Agent',
    ... u'url': u'http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=8081671&fulltextType=RA&fileId=S1946427400119281.html',
    ... u'abstract': u'Transition metal oxides are being considered as the next generation materials in field such as electronics and advanced catalysts; between them is Tantalum (V) Oxide; however, there are few reports for the synthesis of this material at the nanometer size which could have unusual properties. Hence, in this work we present the synthesis of Ta2O5 nanorods by sol gel method using DNA as structure directing agent, the size of the nanorods was of the order of 40 to 100 nm in diameter and several microns in length; this easy method can be useful in the preparation of nanomaterials for electronics, biomedical applications as well as catalysts.',
    ... u'filepath': u'abstracts/879/http%3A%2F%2Fjournals.cambridge.org%2Faction%2FdisplayAbstract%3FfromPage%3Donline%26aid%3D8081671%26fulltextType%3DRA%26fileId%3DS1946427400119281.html',
    ... u'filename': '{}/test_data_json_stream.json'.format(test_data_path),
    ... u'vol': u'879',
    ... u'authors': [u'Humberto A. Monreala', u' Alberto M. Villafa\xf1e',
    ...              u' Jos\xe9 G. Chac\xf3n', u' Perla E. Garc\xeda',
    ...              u'Carlos A. Mart\xednez'],
    ... u'year': u'1917'}
    True

    """

    with open(filename, 'r') as f:
Coding Style Naming
The name f does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).

This check looks for invalid names for a range of different identifiers.

You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements.

If your project includes a Pylint configuration file, the settings contained in that file take precedence.

To find out more about Pylint, please refer to their site.
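To see why a name like `f` is flagged, the pattern quoted in the message can be tried directly with Python's `re` module. This is a minimal sketch, assuming the pattern is applied as a match anchored at the start of the name; the `conforms` helper is hypothetical and is not part of Pylint.

```python
import re

# Pattern copied from the message above: one lowercase letter or underscore,
# followed by 2 to 30 lowercase letters, digits, or underscores.
VARIABLE_NAME = re.compile(r'[a-z_][a-z0-9_]{2,30}$')


def conforms(name):
    """Return True if name matches the variable naming pattern."""
    return VARIABLE_NAME.match(name) is not None


for name in ('f', 'handle', 'sub_item'):
    print(name, conforms(name))
```

Because the pattern requires at least three characters in total, single-letter names such as `f` fail it, while longer names such as `handle` pass.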

Comprehensibility Best Practice
The variable f does not seem to be defined.
Comprehensibility Best Practice
The variable filename does not seem to be defined.
        for n, line in enumerate(f):
Coding Style Naming
The name n does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).
Unused Code
The variable n seems to be unused.
            try:
                output = json.loads(line)
                output["filename"] = filename
                yield output
            except ValueError as e:
Coding Style Naming
The name e does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).
Comprehensibility Best Practice
The variable ValueError does not seem to be defined.
                logging.debug("Unable to process line: {} (error was: {})".format(str(line), e))
                raise
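Option (c) in the TODO at the top of read_json_stream() suggests having read_input inspect the first line of the file. A minimal sketch of such a check follows. The function name `looks_like_json_stream` is hypothetical, and how read_input would call it is an assumption, since read_input itself is not shown on this page.

```python
import json


def looks_like_json_stream(filename):
    """Return True if the first non-empty line is a self-contained JSON object."""
    with open(filename, 'r') as handle:
        for line in handle:
            if not line.strip():
                continue  # skip leading blank lines
            try:
                return isinstance(json.loads(line), dict)
            except ValueError:
                # The line alone is not valid JSON, for example the first
                # line of a pretty-printed document spanning several lines.
                return False
    return False  # empty file
```

A file with one JSON object per line passes this check, while a single pretty-printed document or a top-level array does not, which is the distinction the TODO is concerned with.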

def __is_iterable(obj):
Coding Style
This function should have a docstring.
    try:
        iter(obj)
    except TypeError as te:
Coding Style Naming
The name te does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).
Unused Code
The variable te seems to be unused.
Comprehensibility Best Practice
The variable TypeError does not seem to be defined.
        return False
    return True


@register_input
Comprehensibility Best Practice
The variable register_input does not seem to be defined.
def read_large_json(filename, json_prefix='item', **kwargs):
Unused Code
The argument kwargs seems to be unused.
    # TODO: add the script to automatically find the json_prefix based on a key
Coding Style
TODO and FIXME comments should generally be avoided.
    # Also should still have the option to manually specify a prefix for complex
    # json structures.
    """Iterate over all items and sub-items in a json object that match the specified prefix


    Parameters
    ----------
    filename : str
        The filename of the large json file

    json_prefix : str
        The string representation of the hierarchical prefix where the items of
        interest may be located within the larger json object.

        Try the following script if you need help determining the desired prefix:
        $   import ijson
        $   with open('test_data_large_json_2.json', 'r') as f:
        $       parser = ijson.parse(f)
        $       for prefix, event, value in parser:
        $           print("prefix = '%r' || event = '%r' || value = '%r'" %
        $                 (prefix, event, value))


    Examples
    --------
    >>> documents = read_large_json(
    ...             '{}/test_data_large_json.json'.format(test_data_path),
    ...             json_prefix='item._source.isAuthorOf')
    >>> next(documents) == {
    ... u'a': u'ScholarlyArticle',
    ... u'name': u'Path planning and formation control via potential function for UAV Quadrotor',
    ... u'author': [
    ...     u'http://dig.isi.edu/autonomy/data/author/a.a.a.rizqi',
    ...     u'http://dig.isi.edu/autonomy/data/author/t.b.adji',
    ...     u'http://dig.isi.edu/autonomy/data/author/a.i.cahyadi'],
    ... u'text': u"Potential-function-based control strategy for path planning and formation " +
    ...     u"control of Quadrotors is proposed in this work. The potential function is " +
    ...     u"used to attract the Quadrotor to the goal location as well as avoiding the " +
    ...     u"obstacle. The algorithm to solve the so called local minima problem by utilizing " +
    ...     u"the wall-following behavior is also explained. The resulted path planning via " +
    ...     u"potential function strategy is then used to design formation control algorithm. " +
    ...     u"Using the hybrid virtual leader and behavioral approach schema, the formation " +
    ...     u"control strategy by means of potential function is proposed. The overall strategy " +
    ...     u"has been successfully applied to the Quadrotor's model of Parrot AR Drone 2.0 in " +
    ...     u"Gazebo simulator programmed using Robot Operating System.\\nAuthor(s) Rizqi, A.A.A. " +
    ...     u"Dept. of Electr. Eng. & Inf. Technol., Univ. Gadjah Mada, Yogyakarta, Indonesia " +
    ...     u"Cahyadi, A.I. ; Adji, T.B.\\nReferenced Items are not available for this document.\\n" +
    ...     u"No versions found for this document.\\nStandards Dictionary Terms are available to " +
    ...     u"subscribers only.",
    ... u'uri': u'http://dig.isi.edu/autonomy/data/article/6871517',
    ... u'datePublished': u'2014',
    ... 'filename': '{}/test_data_large_json.json'.format(test_data_path)}
    True
    """

    with open(filename, 'r') as handle:
Comprehensibility Best Practice
The variable filename does not seem to be defined.
Comprehensibility Best Practice
The variable handle does not seem to be defined.
        for item in ijson.items(handle, json_prefix):
            if hasattr(item, 'keys'): # check if item is a dictionary
                item['filename'] = filename
                yield item
            # check if item is both iterable and not a string
            elif __is_iterable(item) and not isinstance(item, str):
Comprehensibility Best Practice
The variable str does not seem to be defined.
                for sub_item in item:
                    # check if sub_item is a dictionary
                    if hasattr(sub_item, 'keys'):
                        sub_item['filename'] = filename
                        yield sub_item
            else:
                raise ValueError("'item' in json source is not a dict, and is either a string or not iterable: %r" % item)
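The docstring of read_large_json() suggests running ijson.parse to discover a suitable json_prefix. For a small sample that fits in memory, a similar listing can be produced with the standard library alone. The sketch below assumes ijson's prefix convention (object keys joined with '.', array elements named 'item'). `iter_prefixes` is a hypothetical helper that belongs to neither ijson nor topik, and the sample document is made up for illustration. For genuinely large files, the streaming script in the docstring remains the better fit.

```python
import json


def iter_prefixes(node, prefix=''):
    """Yield the prefix of every value in a parsed JSON document, depth first."""
    yield prefix
    if isinstance(node, dict):
        for key, value in node.items():
            child = '%s.%s' % (prefix, key) if prefix else key
            for found in iter_prefixes(value, child):
                yield found
    elif isinstance(node, list):
        # Assumed convention: every array element shares the name 'item'.
        child = '%s.item' % prefix if prefix else 'item'
        for value in node:
            for found in iter_prefixes(value, child):
                yield found


# Made-up sample shaped like the prefix used in the docstring example.
sample = json.loads('[{"_source": {"isAuthorOf": [{"name": "x"}]}}]')
for found in sorted(set(iter_prefixes(sample))):
    print(repr(found))
```

For this sample the listing contains 'item._source.isAuthorOf', the same prefix that the docstring example passes as json_prefix.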