Completed
Push — youngblood/update_intermediate... (46ae62...676980), by unknown, created 01:55

topik.fileio.TopikProject.select_modeled_corpus()   (rated A)

Complexity:    Conditions 2
Size:          Total Lines 8
Duplication:   Lines 0, Ratio 0 %

Metric   Value
cc       2
dl       0
loc      8
rs       9.4286
import itertools
[Coding Style] This module should have a docstring.
The coding style of this project requires that you add a docstring to this code element. Below, you find an example for methods:

    class SomeClass:
        def some_method(self):
            """Do x and return foo."""

If you would like to know more about docstrings, we recommend reading PEP-257: Docstring Conventions.
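
A possible module docstring, offered only as a sketch (the wording is an assumption, not taken from the project):

    """Project-level persistence for topik: the TopikProject class and its helpers."""
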
import jsonpickle
import os

from topik import tokenizers, transformers, vectorizers, models, visualizers
from ._registry import registered_outputs
from .reader import read_input


def _get_parameters_string(**kwargs):
    """Used to create identifiers for output"""
    _id = ""
    if kwargs:
        _id = "_" + ''.join('{}={}_'.format(key, val) for key, val in sorted(kwargs.items()))[:-1]
    return _id
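
As a worked example of the identifier suffix this helper builds (the argument values here are hypothetical):

    _get_parameters_string(method="simple", min_length=1)
    # kwargs are sorted by key and joined as key=value pairs behind a leading underscore:
    # returns "_method=simple_min_length=1"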


class TopikProject(object):
[Coding Style] This class should have a docstring (same advice as the module note above; see PEP-257).

    def __init__(self, project_name, output_type=None, output_args=None, **kwargs):
        """Class that abstracts persistence.  Drives different output types, and handles
        storing intermediate results to given output type.

        output_type : string
            internal format for handling user data.  Current options are
            present in topik.fileio.registered_outputs.  default is "InMemoryOutput".
        output_args : dictionary or None
            configuration to pass through to output
        synchronous_wait : integer
            number of seconds to wait for data to finish uploading to output, when using an asynchronous
             output type.  Only relevant for some output types ("ElasticSearchOutput", not "InMemoryOutput")
        **kwargs : passed through to superclass __init__.  Not passed to output.
        """
        if output_args is None:
            output_args = {}
        if os.path.exists(project_name + ".topikproject") and output_type is None:
            with open(project_name + ".topikproject") as project_meta:
                project_data = jsonpickle.decode(project_meta.read())
            kwargs.update(project_data)
            with open(project_name + ".topikdata") as project_data:
                loaded_data = jsonpickle.decode(project_data.read())
[Bug] The instance (inferred as list, dict, set, tuple, or _trivialclassic) does not seem to have a member named read.
This check looks for calls to members that are non-existent. These calls will fail. The member could have been renamed or removed.
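
The warnings above come from the name project_data being reused: it first holds the decoded metadata dict and is then rebound to the open file handle, which confuses the type inference. A sketch that keeps the two names distinct (the variable name data_file is illustrative):

    with open(project_name + ".topikdata") as data_file:
        loaded_data = jsonpickle.decode(data_file.read())
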
                output_type = loaded_data["class"]
                output_args.update(loaded_data["saved_data"])
        self.project_name = project_name
        if output_type is None:
            output_type = "InMemoryOutput"
        # loading the output here is sufficient to restore all results: the output is responsible for loading them as
        #    necessary, and returning iterators or output objects appropriately.
        self.output = registered_outputs[output_type](**output_args)
        # not used, but stored here for persistence purposes
        self._output_type = output_type
        self._output_args = output_args
        # None or a string expression in Elasticsearch query format
        self.corpus_filter = kwargs["corpus_filter"] if "corpus_filter" in kwargs else ""
        # None or a string name
        self.content_field = kwargs["content_field"] if "content_field" in kwargs else ""
        # Initially None, set to string value when tokenize or transform method called
        self._selected_source_field = kwargs["_selected_content_field"] if "_selected_content_field" in kwargs else None
        # Initially None, set to string value when tokenize or transform method called
        self._selected_tokenized_corpus_id = kwargs["_selected_tokenized_corpus_id"] if "_selected_tokenized_corpus_id" in kwargs else None
        # Initially None, set to string value when vectorize method called
        self._selected_vectorized_corpus_id = kwargs["_selected_vectorized_corpus_id"] if "_selected_vectorized_corpus_id" in kwargs else None
        # Initially None, set to string value when run_model method called
        self._selected_modeled_corpus_id = kwargs["_selected_modeled_corpus_id"] if "_selected_modeled_corpus_id" in kwargs else None
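
The repeated "kwargs[key] if key in kwargs else default" pattern above could also be written with dict.get; a behavior-equivalent sketch for two of the attributes, offered purely as a style note:

    self.corpus_filter = kwargs.get("corpus_filter", "")
    self._selected_modeled_corpus_id = kwargs.get("_selected_modeled_corpus_id")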

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()

    def close(self):
[Coding Style] This method should have a docstring (see PEP-257, as noted above).
        self.save()
        self.output.close()  # close any open file handles or network connections
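
Because __enter__ returns the project and __exit__ calls close() (which in turn calls save()), the class can be driven as a context manager; a usage sketch with a hypothetical project name:

    with TopikProject("example_project") as project:
        ...  # read data, tokenize, vectorize, model
    # on exit, close() has saved the project metadata and closed the output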

    def save(self):
        """Save project as .topikproject metafile and some number of sidecar data files."""
        with open(self.project_name + ".topikproject", "w") as f:
[Coding Style Naming] The name f does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).
This check looks for invalid names for a range of different identifiers. You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements. If your project includes a Pylint configuration file, the settings contained in that file take precedence. To find out more about Pylint, please refer to their site.
            f.write(jsonpickle.encode({
               "_selected_tokenized_corpus_id": self._selected_tokenized_corpus_id,
               "_selected_vectorized_corpus_id": self._selected_vectorized_corpus_id,
               "_selected_modeled_corpus_id": self._selected_modeled_corpus_id,
               "corpus_filter": self.corpus_filter,
               "project_name": self.project_name,
               "output_type": self._output_type,
               "output_args": self._output_args,
               "content_field": self.content_field},
               f))
        self.output.save(self.project_name + ".topikdata")
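
A sketch that addresses the naming warning and drops the stray second positional argument to jsonpickle.encode (that slot appears to be the unpicklable flag, so passing the file handle there is very likely unintended); the file name and metadata shown are illustrative:

    import jsonpickle

    metadata = {"project_name": "example", "output_type": "InMemoryOutput"}
    with open("example.topikproject", "w") as project_file:   # descriptive name satisfies the naming check
        project_file.write(jsonpickle.encode(metadata))       # no extra positional argument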

    def read_input(self, source, content_field, source_type="auto", **kwargs):
        """Import data from external source into Topik's internal format"""
        self.output.import_from_iterable(read_input(source,
                                                    content_field=content_field,
                                                    source_type=source_type,
                                                    **kwargs),
                                         field_to_hash=content_field)
        self.content_field = content_field
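
A usage sketch for read_input (the source path and field name are hypothetical; source_type="auto" presumably lets the reader detect the format):

    project.read_input("./reviews.json", content_field="text")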

    def get_filtered_corpus_iterator(self, field=None, filter_expression=None):
[Coding Style] This method should have a docstring (see PEP-257, as noted above).
        if field is None:
            field = self.content_field
        if filter_expression is None:
            filter_expression = self.corpus_filter
        return self.output.get_filtered_data(field, filter_expression)

    def get_date_filtered_corpus_iterator(self, start, end, filter_field,
                                          field_to_get=None):
[Coding Style Naming] The name get_date_filtered_corpus_iterator does not conform to the method naming conventions ([a-z_][a-z0-9_]{2,30}$).
[Coding Style] This method should have a docstring (see PEP-257, as noted above).
        if field_to_get is None:
            field_to_get = self.content_field
        return self.output.get_date_filtered_data(field_to_get=field_to_get,
                                                  start=start,
                                                  end=end,
                                                  filter_field=filter_field)

    def tokenize(self, method="simple", **kwargs):
        """Break raw text into substituent terms (or collections of terms)"""
        # tokenize, and store the results on this object somehow
        tokenized_corpus = tokenizers.tokenize(self.selected_filtered_corpus,
                                             method=method, **kwargs)
        tokenize_parameter_string = self.corpus_filter + "_tk_{method}{params}".format(
            method=method,
            params=_get_parameters_string(**kwargs))

        # store this
        self.output.tokenized_corpora[tokenize_parameter_string] = tokenized_corpus
        # set _tokenizer_id internal handle to point to this data
        self._selected_tokenized_corpus_id = tokenize_parameter_string
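
As a worked example of the identifier tokenize() stores under, with the default empty corpus_filter and no extra kwargs:

    project.tokenize()   # method defaults to "simple"
    # tokenize_parameter_string == "" + "_tk_simple" + "" == "_tk_simple"
    # so project.output.tokenized_corpora["_tk_simple"] now holds the tokenized corpus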

    def transform(self, method, **kwargs):
        """Stem or lemmatize input text that has already been tokenized"""
        transformed_data = transformers.transform(method=method, **kwargs)
        tokenize_parameter_string = "_".join([self.tokenizer_id, "xform", method,
                                              _get_parameters_string(**kwargs)])
[Bug] The Instance of TopikProject does not seem to have a member named tokenizer_id. Calls to non-existent members will fail.
        # store this
        self.output.tokenized_corpora[tokenize_parameter_string] = transformed_data
        # set _tokenizer_id internal handle to point to this data
        self._selected_tokenized_corpus_id = tokenize_parameter_string
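
A possible fix for the tokenizer_id bug flagged above, assuming the intent is to derive the new key from the currently selected tokenized corpus id:

    tokenize_parameter_string = "_".join([self._selected_tokenized_corpus_id, "xform", method,
                                          _get_parameters_string(**kwargs)])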

    def vectorize(self, method="bag_of_words", **kwargs):
        """Convert tokenized text to vector form - mathematical representation used for modeling."""
        tokenizer_iterators = itertools.tee(self.selected_tokenized_corpus)
        vectorized_corpus = vectorizers.vectorize(tokenizer_iterators[0],
                                                method=method, **kwargs)
        vectorize_parameter_string = self.corpus_filter + self._selected_tokenized_corpus_id + "_".join([method, _get_parameters_string(**kwargs)])
        # store this internally
        self.output.vectorized_corpora[vectorize_parameter_string] = vectorized_corpus
        # set _vectorizer_id internal handle to point to this data
        self._selected_vectorized_corpus_id = vectorize_parameter_string

    def run_model(self, model_name="plsa", **kwargs):
        """Analyze vectorized text; determine topics and assign document probabilities"""
        modeled_corpus = models.run_model(self.selected_vectorized_corpus,
                                        model_name=model_name, **kwargs)
        model_id = "_".join([model_name, _get_parameters_string(**kwargs)])
        # store this internally
        self.output.modeled_corpora[model_id] = modeled_corpus
        # set _model_id internal handle to point to this data
        self._selected_modeled_corpus_id = model_id

    def visualize(self, vis_name='termite', model_id=None, **kwargs):
        """Plot model output"""
        if not model_id:
            modeled_corpus = self.selected_modeled_corpus
        else:
            modeled_corpus = self.output.model_data[model_id]
        return visualizers.visualize(modeled_corpus, vis_name, **kwargs)
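
A sketch of how the vectorize → run_model → visualize chain is meant to be driven, using the defaults shown above (assumes tokenization has already been run):

    project.vectorize(method="bag_of_words")   # stores and selects a vectorized corpus
    project.run_model(model_name="plsa")       # models the selected vectorized corpus
    project.visualize(vis_name="termite")      # plots the selected modeled corpus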

    def select_tokenized_corpus(self, _id):
        """Assign active tokenized corpus.

        When more than one tokenized corpus available (ran tokenization more than once with different
        methods), this allows you to switch to a different data set.
        """
        if _id in self.output.tokenized_corpora:
            self._selected_tokenized_corpus_id = _id
        else:
            raise ValueError("tokenized data {} not found in storage.".format(id))

    def select_vectorized_corpus(self, _id):
        """Assign active vectorized corpus.

        When more than one vectorized corpus available (ran tokenization more than once with different
        methods), this allows you to switch to a different data set.
        """
        if _id in self.output.vectorized_corpora:
            self._selected_vectorized_corpus_id = _id
        else:
            raise ValueError("vectorized data {} not found in storage.".format(_id))

    def select_modeled_corpus(self, _id):
        """When more than one model output available (ran modeling more than once with different
        methods), this allows you to switch to a different data set.
        """
        if _id in self.output.modeled_corpus:
            self._selected_modeled_corpus_id = _id
        else:
            raise ValueError("model {} not found in storage.".format(_id))

    @property
    def selected_filtered_corpus(self):
        """Corpus documents, potentially a subset.

        Output from read_input step.
        Input to tokenization step.
        """
        return self.output.get_filtered_data(field_to_get=self.content_field,
                                             filter=self.corpus_filter)

    @property
    def selected_tokenized_corpus(self):
        """Documents broken into component words.  May also be transformed.

        Output from tokenization and/or transformation steps.
        Input to vectorization step.
        """
        return self.output.tokenized_corpora[self._selected_tokenized_corpus_id]

    @property
    def selected_vectorized_corpus(self):
        """Data that has been vectorized into term frequencies, TF/IDF, or
        other vector representation.

        Output from vectorization step.
        Input to modeling step.
        """
        return self.output.vectorized_corpora[self._selected_vectorized_corpus_id]

    @property
    def selected_modeled_corpus(self):
        """matrices representing the model derived.

        Output from modeling step.
        Input to visualization step.
        """
        return self.output.modeled_corpora[self._selected_modeled_corpus_id]