| Conditions | 12 |
| Total Lines | 64 |
| Lines | 0 |
| Ratio | 0 % |
Small methods make your code easier to understand, particularly when combined with a good name. Besides, if your method is small, finding a good name is usually much easier.
For example, if you find yourself adding comments to a method's body, this is usually a good sign that you should extract the commented part into a new method, and use the comment as a starting point when coming up with a good name for this new method.
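As a minimal sketch of this idea, the hypothetical function below (not part of topik) first carries a commented step inline; that step is then extracted into a method whose name is derived from the comment. All names here are illustrative assumptions.

```python
def summarize_before(lines):
    """Original shape: the commented step lives inside the method body."""
    # drop blank lines and surrounding whitespace
    cleaned = []
    for line in lines:
        stripped = line.strip()
        if stripped:
            cleaned.append(stripped)
    return "{} non-blank lines".format(len(cleaned))


def drop_blank_lines(lines):
    """Extracted step: strip each line and skip the ones that end up empty."""
    return [line.strip() for line in lines if line.strip()]


def summarize_after(lines):
    """Refactored shape: the comment has become the helper's name."""
    return "{} non-blank lines".format(len(drop_blank_lines(lines)))
```

Both versions return the same result for the same input; the refactored one simply no longer needs the comment.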
Commonly applied refactorings include:
If many parameters/temporary variables are present:
Complex code elements like topik.fileio.read_input() often do a lot of different things. To break such an element down, we need to identify a cohesive component within it. A common approach to find such a component is to look for fields/methods that share the same prefixes or suffixes.
Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
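The Extract Class step can be sketched as follows, assuming a hypothetical class whose `output_`-prefixed members form the cohesive component. None of these names come from topik; they only illustrate the prefix heuristic.

```python
class ReportBefore:
    """Hypothetical class mixing report data with output-file details."""

    def __init__(self, title, output_path, output_encoding):
        self.title = title
        self.output_path = output_path
        self.output_encoding = output_encoding

    def output_description(self):
        return "{} ({})".format(self.output_path, self.output_encoding)


class Output:
    """Extracted component: everything that carried the ``output_`` prefix."""

    def __init__(self, path, encoding):
        self.path = path
        self.encoding = encoding

    def description(self):
        return "{} ({})".format(self.path, self.encoding)


class ReportAfter:
    """The original class now delegates output concerns to ``Output``."""

    def __init__(self, title, output):
        self.title = title
        self.output = output
```

After the extraction, the shared prefix disappears from the member names because the new class name already carries that meaning.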

    import os

    # (lines 2-6 of the source file are not shown in this listing)

    def read_input(source, source_type="auto", folder_content_field='text', **kwargs):
        """
        Read data from given source into Topik's internal data structures.

        Parameters
        ----------
        source : str
            input data. Can be file path, directory, or server address.
        source_type : str
            "auto" tries to figure out data type of source. Can be manually specified instead.
            options for manual specification are ['solr', 'elastic', 'json_stream', 'large_json', 'folder']
        folder_content_field : str
            Only used for document_folder source. This argument is used as the key
            (field name), where each document represents the value of that field.
        kwargs : any other arguments to pass to input parsers

        Returns
        -------
        iterable output object

        >> ids, texts = zip(*list(iter(raw_data)))
        Examples
        --------
        >>> loaded_corpus = read_input(
        ...         '{}/test_data_json_stream.json'.format(test_data_path))
        >>> solution_text = (
        ... u'Transition metal oxides are being considered as the next generation '+
        ... u'materials in field such as electronics and advanced catalysts; '+
        ... u'between them is Tantalum (V) Oxide; however, there are few reports '+
        ... u'for the synthesis of this material at the nanometer size which could '+
        ... u'have unusual properties. Hence, in this work we present the '+
        ... u'synthesis of Ta2O5 nanorods by sol gel method using DNA as structure '+
        ... u'directing agent, the size of the nanorods was of the order of 40 to '+
        ... u'100 nm in diameter and several microns in length; this easy method '+
        ... u'can be useful in the preparation of nanomaterials for electronics, '+
        ... u'biomedical applications as well as catalysts.')
        >>> solution_text == next(loaded_corpus)['abstract']
        True
        """
        json_extensions = [".js", ".json"]

        # web addresses default to elasticsearch
        if (source_type == "auto" and "9200" in source) or source_type == "elastic":
            data_iterator = registered_inputs["read_elastic"](source, **kwargs)
        # files must end in .json. Try json parser first, try large_json parser next. Fail otherwise.
        elif (source_type == "auto" and os.path.splitext(source)[1] in json_extensions) or source_type == "json_stream":
            try:
                data_iterator = registered_inputs["read_json_stream"](source, **kwargs)
                # tee the iterator and try to get the first element. If it fails, this is actually a large_json file.
                next(data_iterator)
                # reset the iterator after this check so that it starts at document 0 rather than document 1
                data_iterator = registered_inputs["read_json_stream"](source, **kwargs)
            except ValueError:
                data_iterator = registered_inputs["read_large_json"](source, **kwargs)
        elif source_type == "large_json":
            data_iterator = registered_inputs["read_large_json"](source, **kwargs)
        # folder paths are simple strings that don't end in an extension (.+3-4 characters), or end in a /
        elif (source_type == "auto" and os.path.splitext(source)[1] == "") or source_type == "folder":
            data_iterator = registered_inputs["read_document_folder"](source,
                                                                      content_field=folder_content_field)
        else:
            raise ValueError("Unrecognized source type: {}. Please either manually specify the type, or convert your input"
                             " to a supported type.".format(source))
        return data_iterator

The coding style of this project requires that you add a docstring to this code element. Below is an example for methods:
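A minimal sketch of a one-line method docstring in the style of PEP 257; `SomeClass` and `some_method` are hypothetical names used only for illustration:

```python
class SomeClass:
    def some_method(self):
        """Do x and return foo."""
        return "foo"
```

The docstring is the first statement in the method body, is written as an imperative summary, and ends with a period.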
If you would like to know more about docstrings, we recommend reading PEP 257: Docstring Conventions.