Metric | Value |
---|---|
Conditions | 1 |
Total Lines | 2 |
Lines | 2 |
Ratio | 100 % |
dl | 2 |
loc | 2 |
rs | 10 |
cc | 1 |
1 | from six.moves import UserDict |
||
The import six.moves could not be resolved.
This can be caused by one of the following:
1. Missing Dependencies
This error could indicate a configuration issue of Pylint. Make sure that your libraries are available by adding the necessary commands.
    # .scrutinizer.yml
    before_commands:
        - sudo pip install abc # Python2
        - sudo pip3 install abc # Python3
Tip: We are currently not using virtualenv to run pylint. When installing your modules, make sure to use the command for the correct version.
2. Missing __init__.py files
This error could also result from missing __init__.py files. |
|||
2 | import logging |
||
3 | import time |
||
4 | |||
5 | from elasticsearch import Elasticsearch, helpers |
||
6 | |||
7 | from ._registry import register_output |
||
8 | from .base_output import OutputInterface |
||
9 | from topik.vectorizers.vectorizer_output import VectorizerOutput |
||
10 | from topik.models.base_model_output import ModelOutput |
||
11 | |||
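If six cannot be made available to the analysis environment, one alternative is to import UserDict from the standard library directly. The following is a minimal sketch, assuming only the standard-library module locations in Python 2 and Python 3; the Example class is hypothetical and this is not what the reviewed module does.

```python
try:
    from collections import UserDict  # Python 3 location
except ImportError:
    from UserDict import UserDict  # Python 2 location


class Example(UserDict):
    """Tiny subclass used only to show that the import is usable."""


example = Example()
example["key"] = "value"
print(dict(example))
```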
12 | def es_setitem(key, value, doc_type, instance, index, batch_size=1000): |
||
13 | """Load an iterable of (id, value) pairs into the specified |
||
14 | new or existing field within existing documents.""" |
||
15 | batch = [] |
||
16 | for id, val in value: |
||
The name id does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).
This check looks for invalid names for a range of different identifiers. You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements. If your project includes a Pylint configuration file, the settings contained in that file take precedence. To find out more about Pylint, please refer to their site. |
|||
17 | action = {'_op_type': 'update', |
||
18 | '_index': index, |
||
19 | '_type': doc_type, |
||
20 | '_id': id, |
||
21 | 'doc': {key: val}, |
||
22 | 'doc_as_upsert': "true", |
||
23 | } |
||
24 | batch.append(action) |
||
25 | if len(batch) >= batch_size: |
||
26 | helpers.bulk(client=instance, actions=batch, |
||
27 | index=index) |
||
28 | batch = [] |
||
29 | if batch: |
||
30 | helpers.bulk(client=instance, actions=batch, index=index) |
||
31 | instance.indices.refresh(index) |
||
32 | |||
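The batching pattern in es_setitem (collect actions, flush when the batch is full, then flush the remainder) can be looked at separately from the Elasticsearch client. The following is a minimal sketch, assuming a hypothetical helper named iter_batches that is not part of the reviewed module. It uses doc_id instead of id, which satisfies the naming check and avoids shadowing the built-in.

```python
def iter_batches(pairs, batch_size=1000):
    """Yield lists of at most batch_size (doc_id, value) pairs.

    Hypothetical helper for illustration only; each yielded list could be
    converted to update actions and handed to a bulk call.
    """
    batch = []
    for doc_id, val in pairs:
        batch.append((doc_id, val))
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, partially filled batch
        yield batch


batches = list(iter_batches([(1, "a"), (2, "b"), (3, "c")], batch_size=2))
print(batches)
```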
33 | def es_getitem(key, doc_type, instance, index, query=None): |
||
This function should have a docstring.
The coding style of this project requires that you add a docstring to this code element. Below, you find an example for methods:
    class SomeClass:
        def some_method(self):
            """Do x and return foo."""
If you would like to know more about docstrings, we recommend reading PEP-257: Docstring Conventions. |
|||
34 | results = helpers.scan(instance, index=index, |
||
35 | query=query, doc_type=doc_type) |
||
36 | for result in results: |
||
37 | try: |
||
38 | id = int(result["_id"]) |
||
The name id does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$). |
|||
39 | except ValueError: |
||
40 | id = result["_id"] |
||
The name id does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$). |
|||
41 | yield id, result['_source'][key] |
||
42 | |||
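The try/except in es_getitem converts numeric document IDs to integers and leaves other IDs as strings. The following sketch isolates that step in a documented helper; the name coerce_id is hypothetical and is used here only to show how the docstring and naming issues above could be addressed together.

```python
def coerce_id(raw_id):
    """Return raw_id as an int when it is numeric, otherwise unchanged."""
    try:
        return int(raw_id)
    except ValueError:
        return raw_id


print(coerce_id("42"), coerce_id("abc"))
```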
43 | class BaseElasticCorpora(UserDict): |
||
This class should have a docstring. |
|||
44 | def __init__(self, instance, index, corpus_type, query=None, |
||
45 | batch_size=1000): |
||
46 | self.instance = instance |
||
47 | self.index = index |
||
48 | self.corpus_type = corpus_type |
||
49 | self.query = query |
||
50 | self.batch_size = batch_size |
||
51 | pass |
||
52 | |||
53 | def __setitem__(self, key, value): |
||
54 | es_setitem(key, value, self.corpus_type, self.instance, self.index) |
||
55 | |||
56 | |||
57 | def __getitem__(self, key): |
||
58 | return es_getitem(key,self.corpus_type,self.instance,self.index, |
||
59 | self.query) |
||
60 | |||
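BaseElasticCorpora gives dictionary-style access to a field stored on many documents: assignment writes (id, value) pairs and lookup yields them back. The following sketch shows the same shape against an in-memory dict rather than Elasticsearch. The class InMemoryCorpora and its store argument are assumptions made for illustration and do not exist in the reviewed module.

```python
from collections import UserDict


class InMemoryCorpora(UserDict):
    """Dict-like view that stores each key as a field on per-document records."""

    def __init__(self, store):
        super().__init__()
        # store maps document id -> {field: value}; it stands in for the index.
        self.store = store

    def __setitem__(self, key, value):
        # value is an iterable of (doc_id, field_value) pairs, as in es_setitem.
        for doc_id, val in value:
            self.store.setdefault(doc_id, {})[key] = val

    def __getitem__(self, key):
        # Yield (doc_id, field_value) for every document that has the field.
        return ((doc_id, doc[key]) for doc_id, doc in self.store.items()
                if key in doc)


store = {}
corpora = InMemoryCorpora(store)
corpora["tokens"] = [(1, ["a"]), (2, ["b", "c"])]
print(list(corpora["tokens"]))
```

Calling the base-class initializer keeps the UserDict machinery in a consistent state even though the data lives elsewhere.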
61 | class VectorizedElasticCorpora(BaseElasticCorpora): |
||
This class should have a docstring. |
|||
62 | def __setitem__(self, key, value): |
||
63 | #id_term_map |
||
64 | es_setitem(key,value.id_term_map.items(),"term",self.instance,self.index) |
||
65 | #document_term_counts |
||
66 | es_setitem(key,value.document_term_counts.items(),"document_term_count",self.instance,self.index) |
||
67 | #doc_lengths |
||
68 | es_setitem(key,value.doc_lengths.items(),"document_length",self.instance,self.index) |
||
69 | #global term_frequency |
||
70 | es_setitem(key,value.term_frequency.items(),"term_frequency",self.instance,self.index) |
||
71 | #vectors |
||
72 | es_setitem(key,value.vectors.items(),"vector",self.instance,self.index) |
||
73 | # could either upload vectors explicitly here (above) or using Super (below) |
||
74 | #super(VectorizedElasticCorpora, self).__setitem__(key, value) |
||
75 | |||
76 | def __getitem__(self, key): |
||
77 | # TODO: each of these should be retrieved from a query. Populate the VectorizerOutput object |
||
78 | # and return it. These things can be iterators instead of dicts; VectorizerOutput should |
||
79 | # not care. |
||
80 | # TODO: this is the id->term map for the full set of unique terms across all docs |
||
81 | id_term_map = {int(term_id): term for term_id, term in es_getitem(key,"term",self.instance,self.index,self.query)} |
||
82 | # 15 |
||
83 | # TODO: this is the count of terms associated with each document |
||
84 | document_term_count = {int(doc_id): doc_term_count for doc_id, doc_term_count in es_getitem(key,"document_term_count",self.instance,self.index,self.query)} |
||
85 | # {"doc1": 3, "doc2": 5} |
||
86 | doc_lengths = {int(doc_id): doc_length for doc_id, doc_length in es_getitem(key,"document_length",self.instance,self.index,self.query)} |
||
87 | term_frequency = {int(term_id): global_frequency for term_id, global_frequency in es_getitem(key,"term_frequency",self.instance,self.index,self.query)} |
||
88 | # TODO: this is the vectorized representation of each document |
||
89 | vectors = {int(doc_id): {int(term_id): term_weight for term_id, term_weight in doc_term_weights.items()} for doc_id, doc_term_weights in es_getitem(key,"vector",self.instance,self.index,self.query)} |
||
90 | #vectors = {int(doc_id): {doc_term_weights for doc_id, doc_term_weights in es_getitem(key,"vector",self.instance,self.index,self.query)} |
||
91 | #vectors = list(es_getitem(key,"vector",self.instance,self.index,self.query)) |
||
92 | # {"doc1": {1: 3, 2: 1} # word id is key, word count is value (for bag of words model) |
||
93 | return VectorizerOutput(id_term_map=id_term_map, |
||
94 | document_term_counts=document_term_count, |
||
95 | doc_lengths=doc_lengths, |
||
96 | term_frequency=term_frequency, |
||
97 | vectors=vectors) |
||
98 | |||
99 | class ModeledElasticCorpora(BaseElasticCorpora): |
||
This class should have a docstring. |
|||
100 | def __setitem__(self, key, value): |
||
101 | es_setitem(key,value.vocab.items(),"term",self.instance,self.index) |
||
102 | es_setitem(key,value.term_frequency.items(),"term_frequency",self.instance,self.index) |
||
103 | es_setitem(key,value.topic_term_matrix.items(),"topic_term_dist",self.instance,self.index) |
||
104 | es_setitem(key,value.doc_lengths.items(),"doc_length",self.instance,self.index) |
||
105 | es_setitem(key,value.doc_topic_matrix.items(),"doc_topic_dist",self.instance,self.index) |
||
106 | |||
107 | def __lt__(self, y): |
||
The name y does not conform to the argument naming conventions ([a-z_][a-z0-9_]{2,30}$). |
|||
108 | return super(ModeledElasticCorpora, self).__lt__(y) |
||
109 | |||
110 | def __getitem__(self, key): |
||
111 | vocab = {int(term_id): term for term_id, term in \ |
||
112 | es_getitem(key,"term",self.instance,self.index,self.query)} |
||
113 | term_frequency = {int(term_id): tf for term_id, tf in \ |
||
114 | es_getitem(key,"term_frequency",self.instance,self.index,self.query)} |
||
115 | topic_term_matrix = {topic_id: topic_term_dist for topic_id, topic_term_dist in \ |
||
116 | es_getitem(key,"topic_term_dist",self.instance,self.index,self.query)} |
||
117 | doc_lengths = {topic_id: doc_length for topic_id, doc_length in \ |
||
118 | es_getitem(key,"doc_length",self.instance,self.index,self.query)} |
||
119 | doc_topic_matrix = {int(doc_id): doc_topic_dist for doc_id, doc_topic_dist in \ |
||
120 | es_getitem(key,"doc_topic_dist",self.instance,self.index,self.query)} |
||
121 | return ModelOutput(vocab=vocab, term_frequency=term_frequency, |
||
122 | topic_term_matrix=topic_term_matrix, |
||
123 | doc_lengths=doc_lengths, |
||
124 | doc_topic_matrix=doc_topic_matrix) |
||
125 | |||
126 | @register_output |
||
This class should have a docstring. |
|||
127 | class ElasticSearchOutput(OutputInterface): |
||
128 | def __init__(self, source, index, hash_field=None, doc_type='continuum', |
||
129 | query=None, iterable=None, filter_expression="", |
||
130 | vectorized_corpora=None, tokenized_corpora=None, modeled_corpora=None, |
||
131 | **kwargs): |
||
132 | super(ElasticSearchOutput, self).__init__() |
||
133 | self.hosts = source |
||
134 | self.instance = Elasticsearch(hosts=source, **kwargs) |
||
135 | self.index = index |
||
136 | self.doc_type = doc_type |
||
137 | self.query = query |
||
138 | self.hash_field = hash_field |
||
139 | if iterable: |
||
140 | self.import_from_iterable(iterable, hash_field) |
||
141 | self.filter_expression = filter_expression |
||
142 | |||
143 | self.tokenized_corpora = tokenized_corpora if tokenized_corpora else \ |
||
144 | BaseElasticCorpora(self.instance, self.index, 'tokenized', self.query) |
||
145 | self.vectorized_corpora = vectorized_corpora if vectorized_corpora else \ |
||
146 | VectorizedElasticCorpora(self.instance, self.index, 'vectorized', self.query) |
||
147 | self.modeled_corpora = modeled_corpora if modeled_corpora else \ |
||
148 | ModeledElasticCorpora(self.instance, self.index, "models", self.query) |
||
149 | |||
150 | |||
151 | @property |
||
152 | def filter_string(self): |
||
This method should have a docstring. |
|||
153 | return self.filter_expression |
||
154 | |||
155 | def import_from_iterable(self, iterable, field_to_hash='text', batch_size=500): |
||
156 | """Load data into Elasticsearch from iterable. |
||
157 | |||
158 | iterable: generally a list of dicts, but possibly a list of strings |
||
159 | This is your data. Your dictionary structure defines the schema |
||
160 | of the elasticsearch index. |
||
161 | field_to_hash: string identifier of field to hash for content ID. For |
||
162 | list of dicts, a valid key value in the dictionary is required. For |
||
163 | list of strings, a dictionary with one key, "text" is created and |
||
164 | used. |
||
165 | """ |
||
166 | if field_to_hash: |
||
167 | self.hash_field = field_to_hash |
||
168 | batch = [] |
||
169 | for item in iterable: |
||
170 | if isinstance(item, basestring): |
||
171 | item = {field_to_hash: item} |
||
172 | id = hash(item[field_to_hash]) |
||
The name id does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$). |
|||
173 | action = {'_op_type': 'update', |
||
174 | '_index': self.index, |
||
175 | '_type': self.doc_type, |
||
176 | '_id': id, |
||
177 | 'doc': item, |
||
178 | 'doc_as_upsert': "true", |
||
179 | } |
||
180 | batch.append(action) |
||
181 | if len(batch) >= batch_size: |
||
182 | helpers.bulk(client=self.instance, actions=batch, index=self.index) |
||
183 | batch = [] |
||
184 | if batch: |
||
185 | helpers.bulk(client=self.instance, actions=batch, index=self.index) |
||
186 | self.instance.indices.refresh(self.index) |
||
187 | else: |
||
188 | raise ValueError("A field_to_hash is required for import_from_iterable") |
||
189 | |||
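import_from_iterable derives document IDs with the built-in hash(). In Python 3, string hashes are randomized per process unless PYTHONHASHSEED is fixed, so the same text may receive a different ID on a later run. The following is a sketch of a deterministic alternative, assuming a hypothetical helper named content_id built on hashlib. It also replaces the Python 2-only basestring check with str. This is a suggested variation, not the reviewed module's behaviour.

```python
import hashlib


def content_id(item, field_to_hash="text"):
    """Return (doc_id, document) with an ID derived from the hashed field.

    Hypothetical helper: plain strings are wrapped in a one-key dict, as
    described in the import_from_iterable docstring.
    """
    if isinstance(item, str):
        item = {field_to_hash: item}
    raw = str(item[field_to_hash]).encode("utf-8")
    # SHA-1 is used here only as a stable fingerprint, not for security.
    return hashlib.sha1(raw).hexdigest(), item


doc_id, doc = content_id("some text")
print(doc_id, doc)
```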
190 | def convert_date_field_and_reindex(self, field): |
||
This method should have a docstring. |
|||
191 | index = self.index |
||
192 | if self.instance.indices.get_field_mapping(fields=[field], |
||
193 | index=index, |
||
194 | doc_type=self.doc_type) != 'date': |
||
195 | index = self.index+"_{}_alias_date".format(field) |
||
196 | if not self.instance.indices.exists(index) or self.instance.indices.get_field_mapping(field=field, |
||
197 | index=index, |
||
198 | doc_type=self.doc_type) != 'date': |
||
199 | mapping = self.instance.indices.get_mapping(index=self.index, |
||
200 | doc_type=self.doc_type) |
||
201 | mapping[self.index]["mappings"][self.doc_type]["properties"][field] = {"type": "date"} |
||
202 | self.instance.indices.put_alias(index=self.index, |
||
203 | name=index, |
||
204 | body=mapping) |
||
205 | self.instance.indices.refresh(index) |
||
206 | while self.instance.count(index=self.index) != self.instance.count(index=index): |
||
207 | logging.info("Waiting for date indexed data to be indexed...") |
||
208 | time.sleep(1) |
||
209 | return index |
||
210 | |||
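convert_date_field_and_reindex compares the return value of get_field_mapping directly with the string 'date'. The Elasticsearch client returns mapping responses as nested dictionaries, so that comparison appears likely to be unequal every time. The following sketch reads a field type out of a mapping dictionary. The helper name field_type is hypothetical, and the nesting it assumes is the one the reviewed code itself indexes into on the get_mapping result (index, "mappings", doc_type, "properties", field).

```python
def field_type(mapping, index, doc_type, field):
    """Return the mapped type of field, or None if it is not mapped."""
    properties = (mapping.get(index, {})
                  .get("mappings", {})
                  .get(doc_type, {})
                  .get("properties", {}))
    return properties.get(field, {}).get("type")


# Example mapping in the assumed layout; the index name is made up.
mapping = {"my_index": {"mappings": {"continuum": {"properties": {
    "date": {"type": "string"}}}}}}
print(field_type(mapping, "my_index", "continuum", "date"))
```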
211 | # TODO: validate input data to ensure that it has valid year data |
||
212 | def get_date_filtered_data(self, field_to_get, start, end, filter_field="date"): |
||
This method should have a docstring. |
|||
213 | converted_index = self.convert_date_field_and_reindex(field=filter_field) |
||
214 | |||
215 | results = helpers.scan(self.instance, index=converted_index, |
||
216 | doc_type=self.doc_type, query={ |
||
217 | "query": {"filtered": {"filter": {"range": {filter_field: { |
||
218 | "gte": start,"lte": end}}}}}}) |
||
219 | for result in results: |
||
220 | yield result["_id"], result['_source'][field_to_get] |
||
221 | |||
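The nested query literal in get_date_filtered_data is hard to read inline. The following sketch builds the same body in a small documented function. The name date_range_query is hypothetical, and the "filtered" query form is copied from the reviewed code as-is; whether the target Elasticsearch version accepts it is not checked here.

```python
def date_range_query(filter_field, start, end):
    """Build the range-filter body used by get_date_filtered_data."""
    return {"query": {"filtered": {"filter": {"range": {
        filter_field: {"gte": start, "lte": end}}}}}}


body = date_range_query("date", 1990, 2000)
print(body)
```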
222 | def get_filtered_data(self, field_to_get, filter=""): |
||
223 | results = helpers.scan(self.instance, index=self.index, |
||
224 | query=self.query, doc_type=self.doc_type) |
||
225 | for result in results: |
||
226 | yield result["_id"], result['_source'][field_to_get] |
||
227 | |||
228 | def save(self, filename, saved_data=None): |
||
229 | if saved_data is None: |
||
230 | saved_data = {"source": self.hosts, "index": self.index, "hash_field": self.hash_field, |
||
231 | "doc_type": self.doc_type, "query": self.query} |
||
232 | return super(ElasticSearchOutput, self).save(filename, saved_data) |
||
233 | |||
234 | def synchronize(self, max_wait, field): |
||
235 | # TODO: change this to a more general condition for wider use, including read_input |
||
236 | # could just pass in a string condition and then 'while not eval(condition)' |
||
237 | count_not_yet_updated = -1 |
||
238 | while count_not_yet_updated != 0: |
||
239 | count_not_yet_updated = self.instance.count(index=self.index, |
||
240 | doc_type=self.doc_type, |
||
241 | body={"query": { |
||
242 | "constant_score" : { |
||
243 | "filter" : { |
||
244 | "missing" : { |
||
245 | "field" : field}}}}})['count'] |
||
246 | logging.debug("Count not yet updated: {}".format(count_not_yet_updated)) |
||
247 | time.sleep(0.01) |
||
248 | pass |
||
249 | |||
250 |
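synchronize accepts max_wait, but the loop shown never consults it, so it can poll indefinitely if the count never reaches zero. The following is a minimal sketch of a bounded polling loop, assuming a hypothetical helper named wait_until; the condition callable stands in for the Elasticsearch count query.

```python
import time


def wait_until(condition, max_wait, interval=0.01):
    """Poll condition() until it returns True or max_wait seconds elapse.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + max_wait
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


remaining = [3, 2, 1, 0]  # stand-in for successive "not yet updated" counts
done = wait_until(lambda: remaining.pop(0) == 0, max_wait=1.0)
print(done)
```

Returning a boolean lets the caller decide whether a timeout should raise, log, or retry.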