ContinuumIO / topik
| Metric | Value |
|---|---|
| Conditions | 1 |
| Total Lines | 2 |
| Lines | 0 |
| Ratio | 0 % |
| dl | 0 |
| loc | 2 |
| rs | 10 |
| cc | 1 |
| 1 | from six.moves import UserDict |
||
The import six.moves could not be resolved.
This can be caused by one of the following:
1. Missing Dependencies
This error could indicate a configuration issue of Pylint. Make sure that your libraries are available by adding the necessary commands.
# .scrutinizer.yml
before_commands:
    - sudo pip install abc # Python2
    - sudo pip3 install abc # Python3
Tip: We are currently not using virtualenv to run pylint, so when installing your modules make sure to use the command for the correct version.
2. Missing __init__.py files
This error could also result from missing ...
| 2 | import logging |
||
| 3 | import time |
||
| 4 | |||
| 5 | from elasticsearch import Elasticsearch, helpers |
||
| 6 | |||
| 7 | from ._registry import register_output |
||
| 8 | from .base_output import OutputInterface |
||
| 9 | from topik.vectorizers.vectorizer_output import VectorizerOutput |
||
| 10 | from topik.models.base_model_output import ModelOutput |
||
| 11 | |||
| 12 | def es_setitem(key, value, doc_type, instance, index, batch_size=1000): |
||
| 13 | """Load an iterable of (id, value) pairs into the specified new or |
||
| 14 | existing field within existing documents.""" |
||
| 15 | batch = [] |
||
| 16 | for id, val in value: |
||
The name id does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).
This check looks for invalid names for a range of different identifiers. You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements. If your project includes a Pylint configuration file, the settings contained in that file take precedence. To find out more about Pylint, please refer to their site.
| 17 | action = {'_op_type': 'update', |
||
| 18 | '_index': index, |
||
| 19 | '_type': doc_type, |
||
| 20 | '_id': id, |
||
| 21 | 'doc': {key: val}, |
||
| 22 | 'doc_as_upsert': "true", |
||
| 23 | } |
||
| 24 | batch.append(action) |
||
| 25 | if len(batch) >= batch_size: |
||
| 26 | helpers.bulk(client=instance, actions=batch, |
||
| 27 | index=index) |
||
| 28 | batch = [] |
||
| 29 | if batch: |
||
| 30 | helpers.bulk(client=instance, actions=batch, index=index) |
||
| 31 | instance.indices.refresh(index) |
||
| 32 | |||
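The batching pattern in es_setitem (accumulate actions, send them when the batch is full, then send the remainder) can be tried without an Elasticsearch connection. The following is a minimal sketch, not the project's code: `batched`, `build_actions` and `fake_bulk` are hypothetical names, and `fake_bulk` only stands in for `helpers.bulk` by recording batch sizes.

```python
def batched(actions, batch_size=1000):
    """Yield lists of at most batch_size actions (hypothetical helper)."""
    batch = []
    for action in actions:
        batch.append(action)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        # Flush whatever is left after the last full batch.
        yield batch


def build_actions(key, pairs, doc_type, index):
    """Build update actions shaped like the ones in the listing above."""
    for doc_id, val in pairs:
        yield {'_op_type': 'update', '_index': index, '_type': doc_type,
               '_id': doc_id, 'doc': {key: val}, 'doc_as_upsert': "true"}


sent = []  # size of each simulated bulk request


def fake_bulk(actions):
    """Stand-in for helpers.bulk; it only records how many actions it got."""
    sent.append(len(actions))


pairs = ((i, "value %d" % i) for i in range(2500))
for batch in batched(build_actions("text", pairs, "continuum", "demo"), 1000):
    fake_bulk(batch)

print(sent)  # [1000, 1000, 500]
```

Keeping the batching in a generator means the flush logic is written once instead of being repeated in every function that uploads documents.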
| 33 | def es_getitem(key, doc_type, instance, index, query=None): |
||
This function should have a docstring.
The coding style of this project requires that you add a docstring to this code element. Below, you find an example for methods:
class SomeClass:
    def some_method(self):
        """Do x and return foo."""
If you would like to know more about docstrings, we recommend reading PEP-257: Docstring Conventions.
| 34 | results = helpers.scan(instance, index=index, |
||
| 35 | query=query, doc_type=doc_type) |
||
| 36 | for result in results: |
||
| 37 | try: |
||
| 38 | id = int(result["_id"]) |
||
The name id does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).
| 39 | except ValueError: |
||
| 40 | id = result["_id"] |
||
The name id does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).
| 41 | yield id, result['_source'][key] |
||
| 42 | |||
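The naming warnings on `id` in the functions above follow directly from the quoted pattern, which requires one leading character plus at least two more. A small sketch, assuming the pattern is applied from the start of the name (`conforms` is a hypothetical helper, not part of Pylint):

```python
import re

# Pattern quoted in the Pylint message for variable and argument names.
NAME_PATTERN = re.compile(r"[a-z_][a-z0-9_]{2,30}$")


def conforms(name):
    """Return True if name satisfies the quoted naming pattern."""
    return NAME_PATTERN.match(name) is not None


for name in ("id", "y", "doc_id", "other"):
    print(name, conforms(name))
```

Names such as `id` and `y` are rejected only because they are shorter than three characters; `id` additionally shadows a Python built-in, so a rename like `doc_id` addresses both points.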
| 43 | class BaseElasticCorpora(UserDict): |
||
This class should have a docstring.
| 44 | def __init__(self, instance, index, corpus_type, query=None, |
||
| 45 | batch_size=1000): |
||
| 46 | self.instance = instance |
||
| 47 | self.index = index |
||
| 48 | self.corpus_type = corpus_type |
||
| 49 | self.query = query |
||
| 50 | self.batch_size = batch_size |
||
| 51 | pass |
||
| 52 | |||
| 53 | def __setitem__(self, key, value): |
||
| 54 | es_setitem(key, value, self.corpus_type, self.instance, self.index) |
||
| 55 | |||
| 56 | |||
| 57 | def __getitem__(self, key): |
||
| 58 | return es_getitem(key,self.corpus_type,self.instance,self.index, |
||
| 59 | self.query) |
||
| 60 | |||
| 61 | View Code Duplication | class VectorizedElasticCorpora(BaseElasticCorpora): |
||
This class should have a docstring.
| 62 | def __setitem__(self, key, value): |
||
| 63 | #id_term_map |
||
| 64 | es_setitem(key,value.id_term_map.items(),"term",self.instance,self.index) |
||
| 65 | #document_term_counts |
||
| 66 | es_setitem(key,value.document_term_counts.items(),"document_term_count",self.instance,self.index) |
||
| 67 | #doc_lengths |
||
| 68 | es_setitem(key,value.doc_lengths.items(),"document_length",self.instance,self.index) |
||
| 69 | #global term_frequency |
||
| 70 | es_setitem(key,value.term_frequency.items(),"term_frequency",self.instance,self.index) |
||
| 71 | #vectors |
||
| 72 | es_setitem(key,value.vectors.items(),"vector",self.instance,self.index) |
||
| 73 | # could either upload vectors explicitly here (above) or using Super (below) |
||
| 74 | #super(VectorizedElasticCorpora, self).__setitem__(key, value) |
||
| 75 | |||
| 76 | def __getitem__(self, key): |
||
| 77 | # TODO: each of these should be retrieved from a query. Populate the VectorizerOutput object |
||
| 78 | # and return it. These things can be iterators instead of dicts; VectorizerOutput should |
||
| 79 | # not care. |
||
| 80 | # TODO: this is the id->term map for the full set of unique terms across all docs |
||
| 81 | id_term_map = {int(term_id): term for term_id, term in es_getitem(key,"term",self.instance,self.index,self.query)} |
||
| 82 | # 15 |
||
| 83 | # TODO: this is the count of terms associated with each document |
||
| 84 | document_term_count = {int(doc_id): doc_term_count for doc_id, doc_term_count in es_getitem(key,"document_term_count",self.instance,self.index,self.query)} |
||
| 85 | # {"doc1": 3, "doc2": 5} |
||
| 86 | doc_lengths = {int(doc_id): doc_length for doc_id, doc_length in es_getitem(key,"document_length",self.instance,self.index,self.query)} |
||
| 87 | term_frequency = {int(term_id): global_frequency for term_id, global_frequency in es_getitem(key,"term_frequency",self.instance,self.index,self.query)} |
||
| 88 | # TODO: this is the vectorized representation of each document |
||
| 89 | vectors = {int(doc_id): {int(term_id): term_weight for term_id, term_weight in doc_term_weights.items()} for doc_id, doc_term_weights in es_getitem(key,"vector",self.instance,self.index,self.query)} |
||
| 90 | #vectors = {int(doc_id): {doc_term_weights for doc_id, doc_term_weights in es_getitem(key,"vector",self.instance,self.index,self.query)} |
||
| 91 | #vectors = list(es_getitem(key,"vector",self.instance,self.index,self.query)) |
||
| 92 | # {"doc1": {1: 3, 2: 1} # word id is key, word count is value (for bag of words model) |
||
| 93 | return VectorizerOutput(id_term_map=id_term_map, |
||
| 94 | document_term_counts=document_term_count, |
||
| 95 | doc_lengths=doc_lengths, |
||
| 96 | term_frequency=term_frequency, |
||
| 97 | vectors=vectors) |
||
| 98 | |||
| 99 | View Code Duplication | class ModeledElasticCorpora(BaseElasticCorpora): |
||
This class should have a docstring.
| 100 | def __setitem__(self, key, value): |
||
| 101 | es_setitem(key,value.vocab.items(),"term",self.instance,self.index) |
||
| 102 | es_setitem(key,value.term_frequency.items(),"term_frequency",self.instance,self.index) |
||
| 103 | es_setitem(key,value.topic_term_matrix.items(),"topic_term_dist",self.instance,self.index) |
||
| 104 | es_setitem(key,value.doc_lengths.items(),"doc_length",self.instance,self.index) |
||
| 105 | es_setitem(key,value.doc_topic_matrix.items(),"doc_topic_dist",self.instance,self.index) |
||
| 106 | |||
| 107 | def __lt__(self, y): |
||
The name y does not conform to the argument naming conventions ([a-z_][a-z0-9_]{2,30}$).
| 108 | return super(ModeledElasticCorpora, self).__lt__(y) |
||
| 109 | |||
| 110 | def __getitem__(self, key): |
||
| 111 | vocab = {int(term_id): term for term_id, term in \ |
||
| 112 | es_getitem(key,"term",self.instance,self.index,self.query)} |
||
| 113 | term_frequency = {int(term_id): tf for term_id, tf in \ |
||
| 114 | es_getitem(key,"term_frequency",self.instance,self.index,self.query)} |
||
| 115 | topic_term_matrix = {topic_id: topic_term_dist for topic_id, topic_term_dist in \ |
||
| 116 | es_getitem(key,"topic_term_dist",self.instance,self.index,self.query)} |
||
| 117 | doc_lengths = {topic_id: doc_length for topic_id, doc_length in \ |
||
| 118 | es_getitem(key,"doc_length",self.instance,self.index,self.query)} |
||
| 119 | doc_topic_matrix = {int(doc_id): doc_topic_dist for doc_id, doc_topic_dist in \ |
||
| 120 | es_getitem(key,"doc_topic_dist",self.instance,self.index,self.query)} |
||
| 121 | return ModelOutput(vocab=vocab, term_frequency=term_frequency, |
||
| 122 | topic_term_matrix=topic_term_matrix, |
||
| 123 | doc_lengths=doc_lengths, |
||
| 124 | doc_topic_matrix=doc_topic_matrix) |
||
| 125 | |||
| 126 | @register_output |
||
This class should have a docstring.
| 127 | class ElasticSearchOutput(OutputInterface): |
||
| 128 | def __init__(self, source, index, hash_field=None, doc_type='continuum', |
||
| 129 | query=None, iterable=None, filter_expression="", |
||
| 130 | vectorized_corpora=None, tokenized_corpora=None, modeled_corpora=None, |
||
| 131 | **kwargs): |
||
| 132 | super(ElasticSearchOutput, self).__init__() |
||
| 133 | self.hosts = source |
||
| 134 | self.instance = Elasticsearch(hosts=source, **kwargs) |
||
| 135 | self.index = index |
||
| 136 | self.doc_type = doc_type |
||
| 137 | self.query = query |
||
| 138 | self.hash_field = hash_field |
||
| 139 | if iterable: |
||
| 140 | self.import_from_iterable(iterable, hash_field) |
||
| 141 | self.filter_expression = filter_expression |
||
| 142 | |||
| 143 | self.tokenized_corpora = tokenized_corpora if tokenized_corpora else \ |
||
| 144 | BaseElasticCorpora(self.instance, self.index, 'tokenized', self.query) |
||
| 145 | self.vectorized_corpora = vectorized_corpora if vectorized_corpora else \ |
||
| 146 | VectorizedElasticCorpora(self.instance, self.index, 'vectorized', self.query) |
||
| 147 | self.modeled_corpora = modeled_corpora if modeled_corpora else \ |
||
| 148 | ModeledElasticCorpora(self.instance, self.index, "models", self.query) |
||
| 149 | |||
| 150 | |||
| 151 | @property |
||
| 152 | def filter_string(self): |
||
This method should have a docstring.
| 153 | return self.filter_expression |
||
| 154 | |||
| 155 | def import_from_iterable(self, iterable, field_to_hash='text', batch_size=500): |
||
| 156 | """Load data into Elasticsearch from iterable. |
||
| 157 | |||
| 158 | iterable: generally a list of dicts, but possibly a list of strings |
||
| 159 | This is your data. Your dictionary structure defines the schema |
||
| 160 | of the elasticsearch index. |
||
| 161 | field_to_hash: string identifier of field to hash for content ID. For |
||
| 162 | list of dicts, a valid key value in the dictionary is required. For |
||
| 163 | list of strings, a dictionary with one key, "text" is created and |
||
| 164 | used. |
||
| 165 | """ |
||
| 166 | if field_to_hash: |
||
| 167 | self.hash_field = field_to_hash |
||
| 168 | batch = [] |
||
| 169 | for item in iterable: |
||
| 170 | if isinstance(item, basestring): |
||
| 171 | item = {field_to_hash: item} |
||
| 172 | id = hash(item[field_to_hash]) |
||
The name id does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).
| 173 | action = {'_op_type': 'update', |
||
| 174 | '_index': self.index, |
||
| 175 | '_type': self.doc_type, |
||
| 176 | '_id': id, |
||
| 177 | 'doc': item, |
||
| 178 | 'doc_as_upsert': "true", |
||
| 179 | } |
||
| 180 | batch.append(action) |
||
| 181 | if len(batch) >= batch_size: |
||
| 182 | helpers.bulk(client=self.instance, actions=batch, index=self.index) |
||
| 183 | batch = [] |
||
| 184 | if batch: |
||
| 185 | helpers.bulk(client=self.instance, actions=batch, index=self.index) |
||
| 186 | self.instance.indices.refresh(self.index) |
||
| 187 | else: |
||
| 188 | raise ValueError("A field_to_hash is required for import_from_iterable") |
||
| 189 | |||
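import_from_iterable derives each document `_id` from the built-in `hash()`. In Python 3, string hashes are salted per process unless PYTHONHASHSEED is fixed, so the same text can receive a different id on the next run. One possible alternative is a content digest. This is only a sketch under that assumption, not the project's approach: `content_id` and `normalize` are hypothetical helpers.

```python
import hashlib


def content_id(value):
    """Return a hex digest of value that is stable across processes."""
    if not isinstance(value, bytes):
        value = str(value).encode("utf-8")
    return hashlib.sha1(value).hexdigest()


def normalize(item, field_to_hash="text"):
    """Wrap a bare string in a dict, mirroring import_from_iterable."""
    # The listing checks against basestring, which exists only in Python 2;
    # str is used here so that the sketch runs on Python 3.
    if isinstance(item, str):
        item = {field_to_hash: item}
    return item


doc = normalize("hello world")
print(doc, content_id(doc["text"]))
```

Because es_getitem already falls back to the raw string when `int(result["_id"])` fails, hexadecimal ids of this kind would still be returned by it, only as strings rather than integers.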
| 190 | def convert_date_field_and_reindex(self, field): |
||
This method should have a docstring.
| 191 | index = self.index |
||
| 192 | if self.instance.indices.get_field_mapping(fields=[field], |
||
| 193 | index=index, |
||
| 194 | doc_type=self.doc_type) != 'date': |
||
| 195 | index = self.index+"_{}_alias_date".format(field) |
||
| 196 | if not self.instance.indices.exists(index) or self.instance.indices.get_field_mapping(field=field, |
||
| 197 | index=index, |
||
| 198 | doc_type=self.doc_type) != 'date': |
||
| 199 | mapping = self.instance.indices.get_mapping(index=self.index, |
||
| 200 | doc_type=self.doc_type) |
||
| 201 | mapping[self.index]["mappings"][self.doc_type]["properties"][field] = {"type": "date"} |
||
| 202 | self.instance.indices.put_alias(index=self.index, |
||
| 203 | name=index, |
||
| 204 | body=mapping) |
||
| 205 | self.instance.indices.refresh(index) |
||
| 206 | while self.instance.count(index=self.index) != self.instance.count(index=index): |
||
| 207 | logging.info("Waiting for date indexed data to be indexed...") |
||
| 208 | time.sleep(1) |
||
| 209 | return index |
||
| 210 | |||
| 211 | # TODO: validate input data to ensure that it has valid year data |
||
| 212 | def get_date_filtered_data(self, field_to_get, start, end, filter_field="date"): |
||
This method should have a docstring.
| 213 | converted_index = self.convert_date_field_and_reindex(field=filter_field) |
||
| 214 | |||
| 215 | results = helpers.scan(self.instance, index=converted_index, |
||
| 216 | doc_type=self.doc_type, query={ |
||
| 217 | "query": {"filtered": {"filter": {"range": {filter_field: { |
||
| 218 | "gte": start,"lte": end}}}}}}) |
||
| 219 | for result in results: |
||
| 220 | yield result["_id"], result['_source'][field_to_get] |
||
| 221 | |||
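The range filter in get_date_filtered_data is wrapped in a `filtered` query, which Elasticsearch deprecated in 2.0 and removed in 5.0. On newer servers the same range clause is usually placed in the `filter` part of a `bool` query. The sketch below only builds the request bodies as plain dicts; `filtered_range_query` and `bool_range_query` are hypothetical helper names, and which form applies depends on the server version in use.

```python
def filtered_range_query(filter_field, start, end):
    """Body in the shape used by the listing (pre-5.0 servers)."""
    return {"query": {"filtered": {"filter": {"range": {
        filter_field: {"gte": start, "lte": end}}}}}}


def bool_range_query(filter_field, start, end):
    """Equivalent body using a bool query with a filter clause."""
    return {"query": {"bool": {"filter": {"range": {
        filter_field: {"gte": start, "lte": end}}}}}}


old_body = filtered_range_query("date", "1990", "2000")
new_body = bool_range_query("date", "1990", "2000")

# Both bodies carry the same range clause; only the wrapper differs.
print(old_body["query"]["filtered"]["filter"] == new_body["query"]["bool"]["filter"])
```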
| 222 | def get_filtered_data(self, field_to_get, filter=""): |
||
| 223 | results = helpers.scan(self.instance, index=self.index, |
||
| 224 | query=self.query, doc_type=self.doc_type) |
||
| 225 | for result in results: |
||
| 226 | yield result["_id"], result['_source'][field_to_get] |
||
| 227 | |||
| 228 | def save(self, filename, saved_data=None): |
||
| 229 | if saved_data is None: |
||
| 230 | saved_data = {"source": self.hosts, "index": self.index, "hash_field": self.hash_field, |
||
| 231 | "doc_type": self.doc_type, "query": self.query} |
||
| 232 | return super(ElasticSearchOutput, self).save(filename, saved_data) |
||
| 233 | |||
| 234 | def synchronize(self, max_wait, field): |
||
| 235 | # TODO: change this to a more general condition for wider use, including read_input |
||
| 236 | # could just pass in a string condition and then 'while not eval(condition)' |
||
| 237 | count_not_yet_updated = -1 |
||
| 238 | while count_not_yet_updated != 0: |
||
| 239 | count_not_yet_updated = self.instance.count(index=self.index, |
||
| 240 | doc_type=self.doc_type, |
||
| 241 | body={"query": { |
||
| 242 | "constant_score" : { |
||
| 243 | "filter" : { |
||
| 244 | "missing" : { |
||
| 245 | "field" : field}}}}})['count'] |
||
| 246 | logging.debug("Count not yet updated: {}".format(count_not_yet_updated)) |
||
| 247 | time.sleep(0.01) |
||
| 248 | pass |
||
| 249 | |||
| 250 |
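The synchronize method accepts `max_wait` but its loop never consults it, so it polls until the count reaches zero, however long that takes. A polling loop that honours a deadline could look like the sketch below. It is an illustration only: `wait_until` is a hypothetical helper, and `all_updated` merely simulates the count query instead of calling Elasticsearch.

```python
import time


def wait_until(condition, max_wait, interval=0.01):
    """Poll condition() until it returns True or max_wait seconds pass.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + max_wait
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


remaining = [3]  # simulated number of documents still missing the field


def all_updated():
    """Pretend one more document has been updated on each poll."""
    remaining[0] = max(0, remaining[0] - 1)
    return remaining[0] == 0


print(wait_until(all_updated, max_wait=1.0))      # True
print(wait_until(lambda: False, max_wait=0.05))   # False
```

Passing the condition as a callable also covers the TODO in the listing about a more general condition, without resorting to `eval` on a string.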