| Metric | Value |
| --- | --- |
| Conditions | 5 |
| Total Lines | 58 |
| Lines | 0 |
| Ratio | 0 % |
Small methods make your code easier to understand, especially when combined with a good name. Conversely, when a method is small, finding a good name for it is usually much easier.

For example, if you find yourself adding comments to a method's body, that is usually a good sign that you should extract the commented part into a new method, using the comment as a starting point for naming it (see the sketch below).
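A minimal, hypothetical sketch of this; the names `process_order` and `apply_bulk_discount` are invented for illustration, not taken from this project:

```python
# Before: a comment labels a block that wants to be its own method.
def process_order(order):
    total = sum(item.price * item.qty for item in order.items)
    # apply a bulk discount for orders of ten or more items
    if sum(item.qty for item in order.items) >= 10:
        total *= 0.9
    return total


# After: the comment has become the method name.
def apply_bulk_discount(order, total):
    if sum(item.qty for item in order.items) >= 10:
        return total * 0.9
    return total


def process_order(order):
    total = sum(item.price * item.qty for item in order.items)
    return apply_bulk_discount(order, total)
```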
Commonly applied refactorings include:

- Extract Method

If many parameters/temporary variables are present:

- Introduce Parameter Object
- Replace Method with Method Object

A sketch applying Introduce Parameter Object to the reviewed function follows the code listing below.
```python
from __future__ import absolute_import, print_function

# NOTE: the original listing elides its import block; the imports below are
# assumed module paths for the topik package, added so the snippet is
# self-contained.
import os

import numpy as np

from topik import models, tokenizers, vectorizers, visualizers
from topik.fileio import read_input
from topik.visualizers.termite_plot import termite_html


def run_pipeline(data_source, source_type="auto", year_field=None, start_year=None, stop_year=None,
                 content_field=None, tokenizer='simple', vectorizer='bag_of_words', ntopics=10,
                 dir_path='./topic_model', model='lda', termite_plot=False, output_file=False,
                 lda_vis=True, seed=42, **kwargs):
    """Run your data through all topik functionality and save all results to a specified directory.

    Parameters
    ----------
    data_source : str
        Input data (e.g. file or folder or solr/elasticsearch instance).
    source_type : {'auto', 'json_stream', 'folder_files', 'json_large', 'solr', 'elastic'}
        The format of your data input. Currently available: a json stream or a folder
        containing text files. Default is 'auto'.
    year_field : str
        The field name (if any) that contains the year associated with each document
        (for filtering).
    start_year : int
        Beginning of the range filter on year_field values.
    stop_year : int
        End of the range filter on year_field values.
    content_field : str
        The primary text field to parse.
    tokenizer : {'simple', 'collocations', 'entities', 'mixed'}
        The type of tokenizer to use. Default is 'simple'.
    vectorizer : {'bag_of_words', 'tfidf'}
        The type of vectorizer to use. Default is 'bag_of_words'.
    ntopics : int
        Number of topics to find in your data.
    dir_path : str
        Directory path to store all topic modeling results files. Default is './topic_model'.
    model : {'lda', 'plsa'}
        Statistical modeling algorithm to use. Default is 'lda'.
    termite_plot : bool
        Generate a termite plot of your model if True. Default is False.
    output_file : bool
        Write results to an output file if True. Default is False.
    lda_vis : bool
        Generate an interactive data visualization of your topics. Default is True.
    seed : int
        Seed for the random number generator, to make results reproducible. Default is 42.
    **kwargs : additional keyword arguments, passed through to each individual step
    """
    np.random.seed(seed)

    # Read the raw documents and key each one by a hash of its content.
    raw_data = read_input(data_source, content_field=content_field,
                          source_type=source_type, **kwargs)
    raw_data = ((hash(item[content_field]), item[content_field]) for item in raw_data)

    # Each pipeline step is looked up by name in its registry.
    tokenized_data = tokenizers.registered_tokenizers[tokenizer](raw_data, **kwargs)
    vectorized_data = vectorizers.registered_vectorizers[vectorizer](tokenized_data, **kwargs)
    trained_model = models.registered_models[model](vectorized_data, ntopics=ntopics, **kwargs)

    # Make sure the output directory exists before writing any results.
    if not os.path.exists(dir_path):
        os.mkdir(dir_path)

    if termite_plot:
        termite_html(trained_model, filename="termite.html", plot_title="Termite plot", topn=15)

    if lda_vis:
        visualizers.visualize(trained_model, "lda_vis")
```
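Applied to `run_pipeline`, an Introduce Parameter Object refactoring might look like the sketch below. This is a hypothetical illustration, not topik's API: `PipelineConfig` and the refactored signature are invented, the body reuses the imports from the listing above, and a dataclass is used for brevity (Python 3.7+) even though the original file also targets Python 2.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PipelineConfig:
    """Hypothetical parameter object bundling run_pipeline's keyword arguments."""
    source_type: str = "auto"
    year_field: Optional[str] = None
    start_year: Optional[int] = None
    stop_year: Optional[int] = None
    content_field: Optional[str] = None
    tokenizer: str = "simple"
    vectorizer: str = "bag_of_words"
    ntopics: int = 10
    dir_path: str = "./topic_model"
    model: str = "lda"
    termite_plot: bool = False
    lda_vis: bool = True
    seed: int = 42


def run_pipeline(data_source, config=None, **kwargs):
    """Same behavior as above, but callers pass one configuration object."""
    cfg = config or PipelineConfig()
    np.random.seed(cfg.seed)

    raw_data = read_input(data_source, content_field=cfg.content_field,
                          source_type=cfg.source_type, **kwargs)
    raw_data = ((hash(item[cfg.content_field]), item[cfg.content_field]) for item in raw_data)

    tokenized = tokenizers.registered_tokenizers[cfg.tokenizer](raw_data, **kwargs)
    vectorized = vectorizers.registered_vectorizers[cfg.vectorizer](tokenized, **kwargs)
    trained = models.registered_models[cfg.model](vectorized, ntopics=cfg.ntopics, **kwargs)

    if not os.path.exists(cfg.dir_path):
        os.mkdir(cfg.dir_path)
    if cfg.termite_plot:
        termite_html(trained, filename="termite.html", plot_title="Termite plot", topn=15)
    if cfg.lda_vis:
        visualizers.visualize(trained, "lda_vis")


# Hypothetical call site: pass only the settings that differ from the defaults.
# run_pipeline("./reviews.json", PipelineConfig(ntopics=25, vectorizer="tfidf"))
```

With this shape, callers read and write one named bundle of settings, and adding a new option no longer grows the function signature.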
The coding style of this project requires that you add a docstring to this code element. Below is an example for methods.
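This is a hypothetical sketch (the `tokenize` function is invented for illustration), written in the numpydoc style that `run_pipeline` above already uses:

```python
def tokenize(text, lowercase=True):
    """Split raw document text into a list of tokens.

    Parameters
    ----------
    text : str
        The raw document text to split on whitespace.
    lowercase : bool
        Fold tokens to lower case before returning them. Default is True.

    Returns
    -------
    list of str
        The tokens, in document order.
    """
    tokens = text.split()
    return [token.lower() for token in tokens] if lowercase else tokens
```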
If you would like to know more about docstrings, we recommend reading PEP 257: Docstring Conventions.