| Metric | Value |
| --- | --- |
| Conditions | 4 |
| Total Lines | 54 |
| Comment Lines | 0 |
| Comment Ratio | 0 % |
Small methods make your code easier to understand, particularly when combined with a good name. Moreover, if your method is small, finding a good name is usually much easier.
For example, if you find yourself adding comments to a method's body, that is usually a good sign that you should extract the commented part into a new method, using the comment as a starting point for the new method's name.
Commonly applied refactorings include Extract Method; if many parameters or temporary variables are present, Replace Method with Method Object is often a better fit.
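For instance, here is a minimal before-and-after sketch of Extract Method (the `report` and `_format_header` names are hypothetical):

```python
# Before: a comment labels a block inside a longer method.
def report(entries):
    # format the header
    lines = ['Name'.ljust(20) + 'Count', '-' * 25]
    for name, count in entries:
        lines.append(name.ljust(20) + str(count))
    return '\n'.join(lines)

# After: the commented block becomes its own method, named after the comment.
def _format_header():
    return ['Name'.ljust(20) + 'Count', '-' * 25]

def report(entries):
    lines = _format_header()
    for name, count in entries:
        lines.append(name.ljust(20) + str(count))
    return '\n'.join(lines)
```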
```python
import itertools
import re  # used by the pattern compilation below; assumed to sit among the elided imports

# ... (other module-level code elided) ...
# NOTE: _simple_document is a tokenizer helper assumed to be available in this module's scope.

def _collect_bigrams_and_trigrams(raw_corpus, top_n=10000, min_length=1, min_freqs=None,
                                  stopwords=None, stop_regex=None):
    """Collect bigrams and trigrams from a collection of documents. Input to collocation tokenizer.

    Bigrams are pairs of words that recur in the collection; trigrams are triplets.

    Parameters
    ----------
    raw_corpus : iterable of tuple of (doc_id(str/int), doc_text(str))
        Body of documents to examine.
    top_n : int
        Limit results to this many entries.
    min_length : int
        Minimum length of any single word.
    min_freqs : iterable of int
        Thresholds at which to consider a group of words a recognized n-gram,
        starting with bigrams.
    stopwords : None or iterable of str
        Collection of words to ignore as tokens.
    stop_regex : str
        A regular expression of content to remove from text before tokenizing.
        Potentially useful for ignoring code (HTML tags).

    Examples
    --------
    >>> patterns = _collect_bigrams_and_trigrams(sample_corpus, min_freqs=[2, 2])
    >>> patterns[0].pattern
    u'(frank swank|swank tank|sassy unicorns)'
    >>> patterns[1].pattern
    u'(frank swank tank)'
    """
    from nltk.collocations import TrigramCollocationFinder
    from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures

    # min_freqs has no usable default: it must supply one threshold per n-gram size.
    if min_freqs is None:
        raise ValueError("min_freqs must provide two frequency thresholds: [bigram, trigram]")

    # Generator of documents; turn each element into its list of words.
    doc_texts = (_simple_document(doc_text, min_length=min_length, stopwords=stopwords,
                                  stop_regex=stop_regex)
                 for doc_id, doc_text in raw_corpus)
    # Generator; concatenate (chain) all words into a single sequence, lazily.
    words = itertools.chain.from_iterable(doc_texts)
    tcf = TrigramCollocationFinder.from_words(iter(words))

    # Score bigram candidates by pointwise mutual information, keeping only
    # pairs that occur at least min_freqs[0] times.
    bcf = tcf.bigram_finder()
    bcf.apply_freq_filter(min_freqs[0])
    bigrams = [' '.join(w) for w in bcf.nbest(BigramAssocMeasures.pmi, top_n)]

    # Score trigram candidates by chi-squared, with their own frequency floor.
    tcf.apply_freq_filter(min_freqs[1])
    trigrams = [' '.join(w) for w in tcf.nbest(TrigramAssocMeasures.chi_sq, top_n)]

    # Compile each n-gram list into a single alternation pattern.
    bigrams_patterns = re.compile('(%s)' % '|'.join(bigrams), re.UNICODE)
    trigrams_patterns = re.compile('(%s)' % '|'.join(trigrams), re.UNICODE)

    return bigrams_patterns, trigrams_patterns
```
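The docstring notes that the returned patterns feed a collocation tokenizer. As a sketch of what a consumer might do with them, the doctest's bigram pattern can be used to fuse matched pairs into single tokens; the `join_ngrams` name and the underscore-joining convention here are assumptions, not the project's actual tokenizer:

```python
import re

# Pattern taken verbatim from the doctest above.
bigram_pattern = re.compile(u'(frank swank|swank tank|sassy unicorns)', re.UNICODE)

def join_ngrams(text, pattern):
    # Replace every matched n-gram with a single underscore-joined token.
    return pattern.sub(lambda m: m.group(1).replace(' ', '_'), text)

print(join_ngrams(u'frank swank rides with sassy unicorns', bigram_pattern))
# -> frank_swank rides with sassy_unicorns
```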
The coding style of this project requires that you add a docstring to this code element. Below you find an example for methods:
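A minimal sketch, with a hypothetical `normalize` function, in the same numpydoc layout used by the function above:

```python
def normalize(token, lowercase=True):
    """Return the canonical form of a single token.

    Parameters
    ----------
    token : str
        The raw token to normalize.
    lowercase : bool
        Whether to lowercase the token before returning it.

    Returns
    -------
    str
        The normalized token.
    """
    return token.lower() if lowercase else token
```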
If you would like to know more about docstrings, we recommend reading PEP-257: Docstring Conventions.