e2edutch.minimize   A

Complexity

Total Complexity 41

Size/Duplication

Total Lines 226
Duplicated Lines 0 %

Importance

Changes 0
Metric Value
wmc 41
eloc 183
dl 0
loc 226
rs 9.1199
c 0
b 0
f 0

6 Functions

Rating  Name                       Duplication  Size  Complexity
A       normalize_word()           0            5     3
A       minimize_partition()       0            9     4
F       handle_line()              0            60    14
A       minimize_partition_file()  0            12    3
A       get_parser()               0            7     1
B       handle_bit()               0            30    5

5 Methods

Rating  Name                                Duplication  Size  Complexity
A       DocumentState.assert_empty()        0            10    1
A       DocumentState.assert_finalizable()  0            7     1
B       DocumentState.finalize()            0            25    7
A       DocumentState.__init__()            0            10    1
A       DocumentState.span_dict_to_list()   0            2     1

Complexity

Complex modules like e2edutch.minimize often do many different things. To break such a module (or a class inside it) down, identify a cohesive component within it. A common way to find one is to look for fields or methods that share the same prefix or suffix.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
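
As a minimal sketch of Extract Class for this module: in `DocumentState`, the pairs `constituents`/`const_stack` and `ner`/`ner_stack` each combine a dictionary of closed spans with a stack of open ones, so they could plausibly move into one small class. `SpanLayer` and its method names are hypothetical and do not exist in e2edutch; this is an illustration under that assumption, not the project's design.

```python
import collections


class SpanLayer:
    """Hypothetical extracted class: open-span stack plus closed spans."""

    def __init__(self):
        self.stack = []   # (start_index, label) of spans still open
        self.spans = {}   # (start_index, end_index) -> label

    def open(self, word_index, label):
        self.stack.append((word_index, label))

    def close(self, word_index):
        start, label = self.stack.pop()
        self.spans[(start, word_index)] = label

    def is_balanced(self):
        return not self.stack


class DocumentState:
    """Sketch: two attribute pairs collapse into two SpanLayer instances."""

    def __init__(self):
        self.doc_key = None
        self.text = []
        self.sentences = []
        self.constituents = SpanLayer()
        self.ner = SpanLayer()
        self.clusters = collections.defaultdict(list)
        self.coref_stacks = collections.defaultdict(list)
```

With this grouping the class would hold seven instance attributes instead of nine, which is the limit the "Too many instance attributes (9/7)" check reports.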

1
import re
Missing module docstring
2
import os
3
import sys
4
import json
5
import collections
6
import logging
7
import argparse
8
9
from e2edutch import util
10
from e2edutch import conll
11
12
logger = logging.getLogger('e2edutch')
13
14
15
class DocumentState(object):
Missing class docstring
Class 'DocumentState' inherits from object, can be safely removed from bases in python3
best-practice
Too many instance attributes (9/7)
16
    def __init__(self):
17
        self.doc_key = None
18
        self.text = []
19
        self.sentences = []
20
        self.constituents = {}
21
        self.const_stack = []
22
        self.ner = {}
23
        self.ner_stack = []
24
        self.clusters = collections.defaultdict(list)
25
        self.coref_stacks = collections.defaultdict(list)
26
27
    def assert_empty(self):
Missing function or method docstring
28
        assert self.doc_key is None
29
        assert len(self.text) == 0
30
        assert len(self.sentences) == 0
31
        assert len(self.constituents) == 0
32
        assert len(self.const_stack) == 0
33
        assert len(self.ner) == 0
34
        assert len(self.ner_stack) == 0
35
        assert len(self.coref_stacks) == 0
36
        assert len(self.clusters) == 0
37
38
    def assert_finalizable(self):
Missing function or method docstring
39
        assert self.doc_key is not None
40
        assert len(self.text) == 0
41
        assert len(self.sentences) > 0
42
        assert len(self.const_stack) == 0
43
        assert len(self.ner_stack) == 0
44
        assert all(len(s) == 0 for s in self.coref_stacks.values())
45
46
    def span_dict_to_list(self, span_dict):
Missing function or method docstring
Coding Style
This method could be written as a function/class method.

If a method does not access any attributes of the class, it could also be implemented as a function or static method. This can help improve readability. For example

class Foo:
    def some_method(self, x, y):
        return x + y

could be written as

class Foo:
    @classmethod
    def some_method(cls, x, y):
        return x + y
47
        return [(s, e, l) for (s, e), l in span_dict.items()]
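
Applied to `span_dict_to_list`, the static-method variant mentioned in the note could look like the sketch below. It assumes nothing else in the class relies on `self` here; the longer variable names are an illustrative choice, not the project's.

```python
class DocumentState:
    """Trimmed to the one method under discussion."""

    @staticmethod
    def span_dict_to_list(span_dict):
        # Flatten {(start, end): label} into [(start, end, label), ...].
        return [(start, end, label)
                for (start, end), label in span_dict.items()]
```

Existing calls of the form `state.span_dict_to_list(spans)` keep working, and the method can also be called on the class directly.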
48
49
    def finalize(self):
Missing function or method docstring
50
        merged_clusters = []
51
        for c1 in self.clusters.values():
Coding Style Naming
Variable name "c1" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)

This check looks for invalid names for a range of different identifiers.

You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements.

If your project includes a Pylint configuration file, the settings contained in that file take precedence.

To find out more about Pylint, please refer to their site.
52
            existing = None
53
            for m in c1:
Coding Style Naming
Variable name "m" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
54
                for c2 in merged_clusters:
Coding Style Naming
Variable name "c2" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
55
                    if m in c2:
56
                        existing = c2
57
                        break
58
                if existing is not None:
59
                    break
60
            if existing is not None:
61
                print("Merging clusters (shouldn't happen very often.)")
62
                print(self.doc_key, m)
The variable m does not seem to be defined for all execution paths.
Bug
The loop variable m might not be defined here.
63
                existing.update(c1)
64
            else:
65
                merged_clusters.append(set(c1))
66
        merged_clusters = [list(c) for c in merged_clusters]
67
        all_mentions = util.flatten(merged_clusters)
68
        assert len(all_mentions) == len(set(all_mentions))
69
70
        return {
71
            "doc_key": self.doc_key,
72
            "sentences": self.sentences,
73
            "clusters": merged_clusters
74
        }
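
One way to sidestep the "loop variable m might not be defined" warning is to move the overlap search into a helper, so that no loop variable is read after its loop ends. The following is a minimal sketch, not the project's code: `merge_clusters` is a hypothetical name, it merges into the first already-merged cluster that shares any mention, it returns sorted mention lists for determinism, and it omits the "Merging clusters" notice that `finalize()` prints.

```python
def merge_clusters(clusters):
    """Merge clusters that share at least one mention (hypothetical helper)."""
    merged = []
    for cluster in clusters:
        mentions = set(cluster)
        # First already-merged cluster that overlaps this one, if any.
        existing = next((c for c in merged if c & mentions), None)
        if existing is not None:
            existing.update(mentions)
        else:
            merged.append(mentions)
    return [sorted(c) for c in merged]
```

For example, `merge_clusters([[(0, 0), (1, 2)], [(1, 2), (5, 5)], [(7, 8)]])` folds the second cluster into the first because both contain the mention `(1, 2)`.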
75
76
77
def normalize_word(word):
Missing function or method docstring
78
    if word == "/." or word == "/?":
unused-code
Unnecessary "else" after "return"
Unused Code
Consider merging these comparisons with "in" to "word in ('/.', '/?')"
79
        return word[1:]
80
    else:
81
        return word
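
Both notes on `normalize_word` can be addressed at once. A minimal sketch with the same behavior, using the `in` test the checker suggests and dropping the `else` branch:

```python
def normalize_word(word):
    """Strip the leading slash from the escaped tokens "/." and "/?"."""
    if word in ("/.", "/?"):
        return word[1:]
    return word
```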
82
83
84
def handle_bit(word_index, bit, stack, spans):
Missing function or method docstring
85
    asterisk_idx = bit.find("*")
86
    if asterisk_idx >= 0:
87
        open_parens = bit[:asterisk_idx]
88
        close_parens = bit[asterisk_idx + 1:]
89
    else:
90
        open_parens = bit[:-1]
91
        close_parens = bit[-1]
92
93
    current_idx = open_parens.find("(")
94
    while current_idx >= 0:
95
        next_idx = open_parens.find("(", current_idx + 1)
96
        if next_idx >= 0:
97
            label = open_parens[current_idx + 1:next_idx]
98
        else:
99
            label = open_parens[current_idx + 1:]
100
        stack.append((word_index, label))
101
        current_idx = next_idx
102
103
    for c in close_parens:
Coding Style Naming
Variable name "c" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
104
        assert c == ")"
105
        open_index, label = stack.pop()
106
        current_span = (open_index, word_index)
107
        """
108
    if current_span in spans:
109
      spans[current_span] += "_" + label
110
    else:
111
      spans[current_span] = label
112
    """
Unused Code
This string statement has no effect and could be removed.
113
        spans[current_span] = label
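
For reference, a sketch of `handle_bit` with the dead string statement removed and the single-letter loop variable renamed. Replacing the `find` loop with `split("(")` is an illustrative simplification that should yield the same labels; it is not taken from the project.

```python
def handle_bit(word_index, bit, stack, spans):
    """Record spans opened and closed by one bracketed column value."""
    asterisk_idx = bit.find("*")
    if asterisk_idx >= 0:
        open_parens = bit[:asterisk_idx]
        close_parens = bit[asterisk_idx + 1:]
    else:
        open_parens = bit[:-1]
        close_parens = bit[-1]

    # Each "(" introduces one label; text before the first "(" is ignored.
    for label in open_parens.split("(")[1:]:
        stack.append((word_index, label))

    for paren in close_parens:
        assert paren == ")"
        open_index, label = stack.pop()
        spans[(open_index, word_index)] = label
```

Feeding it the bits `"(NP(DT*)"` for word 0 and `"*)"` for word 1 records a `DT` span over word 0 and an `NP` span over words 0 to 1.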
114
115
116
def handle_line(line, document_state, labels, stats, word_col):
Comprehensibility Bug
labels is re-defining a name which is already available in the outer-scope (previously defined on line 218).

It is generally a bad practice to shadow variables from the outer-scope. In most cases, this is done unintentionally and might lead to unexpected behavior:

param = 5

class Foo:
    def __init__(self, param):   # "param" would be flagged here
        self.param = param
Comprehensibility Bug
stats is re-defining a name which is already available in the outer-scope (previously defined on line 219).
Comprehensibility Bug
word_col is re-defining a name which is already available in the outer-scope (previously defined on line 217).
Missing function or method docstring
Unused Code
The argument labels seems to be unused.
117
    begin_document_match = re.match(conll.BEGIN_DOCUMENT_REGEX, line)
118
    if begin_document_match:
unused-code
Unnecessary "elif" after "return"
119
        document_state.assert_empty()
120
        document_state.doc_key = conll.get_doc_key(
121
            *begin_document_match.groups())
122
        return None
123
    elif line.startswith("#end document"):
124
        if len(document_state.text) > 0:  # no newline before end document
125
            stats["max_sent_len"] = max(
126
                len(document_state.text), stats["max_sent_len"])
127
            stats["num_sents"] += 1
128
            document_state.sentences.append(tuple(document_state.text))
129
            del document_state.text[:]
130
        document_state.assert_finalizable()
131
        finalized_state = document_state.finalize()
132
        stats["num_clusters"] += len(finalized_state["clusters"])
133
        stats["num_mentions"] += sum(len(c)
134
                                     for c in finalized_state["clusters"])
135
        # labels["const_labels"].update(
136
        #     l for _, _, l in finalized_state["constituents"])
137
        # labels["ner"].update(l for _, _, l in finalized_state["ner"])
138
        return finalized_state
139
    else:
140
        row = line.split()
141
        if len(row) == 0 and len(document_state.text) > 0:
unused-code
Unnecessary "elif" after "return"
142
            stats["max_sent_len"] = max(
143
                len(document_state.text), stats["max_sent_len"])
144
            stats["num_sents"] += 1
145
            document_state.sentences.append(tuple(document_state.text))
146
            del document_state.text[:]
147
            return None
148
        elif len(row) == 0 and len(document_state.text) == 0:
149
            return None
150
        assert len(row) >= 4
151
152
        word = normalize_word(row[word_col])
153
        coref = row[-1]
154
155
        word_index = (len(document_state.text)
156
                      + sum(len(s) for s in document_state.sentences))
157
        document_state.text.append(word)
158
159
        if coref != "-" and coref != '_':
Unused Code
Consider merging these comparisons with "in" to "coref not in ('-', '_')"
160
            for segment in coref.split("|"):
161
                if segment[0] == "(":
162
                    if segment[-1] == ")":
163
                        cluster_id = int(segment[1:-1])
164
                        document_state.clusters[cluster_id].append(
165
                            (word_index, word_index))
166
                    else:
167
                        cluster_id = int(segment[1:])
168
                        document_state.coref_stacks[cluster_id].append(
169
                            word_index)
170
                elif segment[-1] == ")":
171
                    cluster_id = int(segment[:-1])
172
                    start = document_state.coref_stacks[cluster_id].pop()
173
                    document_state.clusters[cluster_id].append(
174
                        (start, word_index))
175
        return None
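
`handle_line()` carries the only F rating in the module (size 60, complexity 14). One possible way to reduce its branching is to pull the coreference-column parsing into a generator. The sketch below is an assumption about how that could look; `parse_coref_segments` is a hypothetical name and is not part of e2edutch.

```python
def parse_coref_segments(coref):
    """Yield (cluster_id, opens, closes) for one coreference cell.

    "(12" opens cluster 12, "3)" closes cluster 3, and "(7)" is a
    single-token mention that does both. "-" and "_" mean no annotation.
    """
    if coref in ("-", "_"):
        return
    for segment in coref.split("|"):
        opens = segment.startswith("(")
        closes = segment.endswith(")")
        yield int(segment.strip("()")), opens, closes
```

The caller would then append to `clusters` when both flags are set, push onto `coref_stacks` when only `opens` is set, and pop when only `closes` is set, which mirrors the three branches in the original loop.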
176
177
178
def minimize_partition(input_path, labels, stats, word_col):
Comprehensibility Bug
input_path is re-defining a name which is already available in the outer-scope (previously defined on line 215).
Comprehensibility Bug
labels is re-defining a name which is already available in the outer-scope (previously defined on line 218).
Comprehensibility Bug
stats is re-defining a name which is already available in the outer-scope (previously defined on line 219).
Comprehensibility Bug
word_col is re-defining a name which is already available in the outer-scope (previously defined on line 217).
Missing function or method docstring
179
    with open(input_path, "r") as input_file:
180
        document_state = DocumentState()
181
        for line in input_file.readlines():
182
            document = handle_line(line, document_state,
183
                                   labels, stats, word_col)
184
            if document is not None:
185
                yield document
186
                document_state = DocumentState()
187
188
189
def minimize_partition_file(
Missing function or method docstring
190
        input_path, labels, stats, word_col, output_file=None):
Comprehensibility Bug
input_path is re-defining a name which is already available in the outer-scope (previously defined on line 215).
Comprehensibility Bug
labels is re-defining a name which is already available in the outer-scope (previously defined on line 218).
Comprehensibility Bug
stats is re-defining a name which is already available in the outer-scope (previously defined on line 219).
Comprehensibility Bug
word_col is re-defining a name which is already available in the outer-scope (previously defined on line 217).
Comprehensibility Bug
output_file is re-defining a name which is already available in the outer-scope (previously defined on line 216).
191
    if output_file is None:
192
        output_path = "{}.jsonlines".format(os.path.splitext(input_path)[0])
193
        output_file = open(output_path, "w")
194
    count = 0
195
    logger.info("Minimizing {}".format(input_path))
Use lazy % formatting in logging functions
196
    for document in minimize_partition(input_path, labels, stats, word_col):
197
        output_file.write(json.dumps(document))
198
        output_file.write("\n")
199
        count += 1
200
    logger.info("Wrote {} documents to {}".format(count, output_path))
The variable output_path does not seem to be defined in case output_file is None on line 191 is False. Are you sure this can never be the case?
Use lazy % formatting in logging functions
201
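
The undefined `output_path` note points at a real failure mode: when a file object is passed in, as the `__main__` block does, the final log call raises `NameError`. Below is a minimal sketch of one possible fix that also switches to lazy `%` logging. `write_documents` is a hypothetical stand-in that takes an iterable of documents instead of calling `minimize_partition`, and the `"<stream>"` fallback name is an assumption.

```python
import json
import logging
import os

logger = logging.getLogger("e2edutch")


def write_documents(documents, input_path, output_file=None):
    """Write one JSON document per line and return how many were written."""
    opened_here = output_file is None
    if opened_here:
        output_path = "{}.jsonlines".format(os.path.splitext(input_path)[0])
        output_file = open(output_path, "w")
    else:
        # Always bind output_path so the final log call cannot fail.
        output_path = getattr(output_file, "name", "<stream>")
    count = 0
    try:
        for document in documents:
            output_file.write(json.dumps(document))
            output_file.write("\n")
            count += 1
    finally:
        if opened_here:
            output_file.close()
    # Lazy formatting: arguments are interpolated only if the record is emitted.
    logger.info("Wrote %d documents to %s", count, output_path)
    return count
```

Closing only the file the function opened itself leaves caller-owned streams such as `sys.stdout` untouched.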
202
203
def get_parser():
Missing function or method docstring
204
    parser = argparse.ArgumentParser()
Comprehensibility Bug
parser is re-defining a name which is already available in the outer-scope (previously defined on line 213).
205
    parser.add_argument('input_filename')
206
    parser.add_argument('-o', '--output_file',
207
                        type=argparse.FileType('w'), default=sys.stdout)
208
    parser.add_argument('-v', '--verbose', action='store_true')
209
    return parser
210
211
212
if __name__ == "__main__":
213
    parser = get_parser()
214
    args = parser.parse_args()
215
    input_path = args.input_filename
216
    output_file = args.output_file
217
    word_col = 2
Coding Style Naming
Constant name "word_col" doesn't conform to UPPER_CASE naming style ('([^\\W\\da-z][^\\Wa-z]*|__.*__)$' pattern)
218
    labels = collections.defaultdict(set)
219
    stats = collections.defaultdict(int)
220
    minimize_partition_file(input_path, labels, stats, word_col, output_file)
221
    for k, v in labels.items():
222
        print("{} = [{}]".format(k, ", ".join(
223
            "\"{}\"".format(label) for label in v)))
224
    for k, v in stats.items():
225
        print("{} = {}".format(k, v))
226
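
Most of the "re-defining a name from the outer-scope" notes, and the UPPER_CASE constant note, stem from the script body living at module level. A common remedy is to wrap it in a `main()` function so those names become locals. The sketch below assumes this is acceptable for the project; the `minimize` parameter is a hypothetical injection point standing in for `minimize_partition_file`, added only to keep the example self-contained.

```python
import argparse
import collections
import sys

WORD_COL = 2  # module-level constant, named in UPPER_CASE


def get_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('input_filename')
    parser.add_argument('-o', '--output_file',
                        type=argparse.FileType('w'), default=sys.stdout)
    parser.add_argument('-v', '--verbose', action='store_true')
    return parser


def main(argv=None, minimize=None):
    """Parse arguments and run `minimize`; every name here is local."""
    args = get_parser().parse_args(argv)
    labels = collections.defaultdict(set)
    stats = collections.defaultdict(int)
    if minimize is not None:
        minimize(args.input_filename, labels, stats, WORD_COL,
                 args.output_file)
    for key, value in stats.items():
        print("{} = {}".format(key, value))
    return dict(stats)
```

The module would then end with a two-line `if __name__ == "__main__":` guard that calls `main()`.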