Conditions: 4
Total Lines: 54
Lines: 0
Ratio: 0 %
Small methods make your code easier to understand, particularly when combined with a good name. Besides, if your method is small, finding a good name is usually much easier.
For example, if you find yourself adding comments to a method's body, that is usually a sign that you should extract the commented part into a new method, using the comment as a starting point when coming up with a good name for it.
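For instance, the comment below can become the name of an extracted method (a hypothetical sketch; the functions and names are illustrative, not from this project):

```python
# Before: a comment explains what the next lines do.
def report(orders):
    # compute the total price including tax
    total = sum(o["price"] for o in orders)
    total *= 1.19
    return "Total: %.2f" % total


# After: the commented part is extracted, and the comment
# becomes the new method's name.
def total_price_including_tax(orders, tax_rate=0.19):
    return sum(o["price"] for o in orders) * (1 + tax_rate)


def report_refactored(orders):
    return "Total: %.2f" % total_price_including_tax(orders)
```

Both versions behave identically, but the second one needs no comment at all.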
Commonly applied refactorings include Extract Method. If many parameters/temporary variables are present, consider Replace Temp with Query, Introduce Parameter Object, or Replace Method with Method Object.
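As an illustration of Introduce Parameter Object (a hypothetical sketch; the names are illustrative, not from this project), parameters that always travel together can be grouped into one object:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerOptions:
    # groups tokenizer parameters that previously formed a long parameter list
    min_length: int = 1
    stopwords: frozenset = frozenset()


def tokenize(text, opts=TokenizerOptions()):
    # a single options object replaces several loose parameters
    return [w for w in text.split()
            if len(w) >= opts.min_length and w not in opts.stopwords]
```

Adding a new option later changes only `TokenizerOptions`, not every call site.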
```python
import itertools
import re


def _collect_bigrams_and_trigrams(raw_corpus, top_n=10000, min_length=1, min_freqs=None,
                                  stopwords=None, stop_regex=None):
    """Collects bigrams and trigrams from a collection of documents. Input to collocation tokenizer.

    Bigrams are pairs of words that recur in the collection; trigrams are triplets.

    Parameters
    ----------
    raw_corpus : iterable of tuple of (doc_id(str/int), doc_text(str))
        Body of documents to examine
    top_n : int
        Limit results to this many entries
    min_length : int
        Minimum length of any single word
    min_freqs : iterable of int
        Threshold of when to consider a group of words as a recognized n-gram,
        starting with bigrams.
    stopwords : None or iterable of str
        Collection of words to ignore as tokens
    stop_regex : str
        A regular expression of content to remove from text before tokenizing.
        Potentially useful for ignoring code (HTML tags).

    Examples
    --------
    >>> patterns = _collect_bigrams_and_trigrams(sample_corpus, min_freqs=[2, 2])
    >>> patterns[0].pattern
    u'(frank swank|swank tank|sassy unicorns)'
    >>> patterns[1].pattern
    u'(frank swank tank)'
    """
    from nltk.collocations import TrigramCollocationFinder
    from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures

    if min_freqs is None:
        # guard against the None default being subscripted below;
        # [2, 2] mirrors the thresholds used in the Examples above
        min_freqs = [2, 2]

    # generator of documents, turn each element to its list of words
    doc_texts = (_simple_document(doc_text, min_length=min_length, stopwords=stopwords,
                                  stop_regex=stop_regex)
                 for doc_id, doc_text in raw_corpus)
    # generator, concatenate (chain) all words into a single sequence, lazily
    words = itertools.chain.from_iterable(doc_texts)
    tcf = TrigramCollocationFinder.from_words(iter(words))

    bcf = tcf.bigram_finder()
    bcf.apply_freq_filter(min_freqs[0])
    bigrams = [' '.join(w) for w in bcf.nbest(BigramAssocMeasures.pmi, top_n)]

    tcf.apply_freq_filter(min_freqs[1])
    trigrams = [' '.join(w) for w in tcf.nbest(TrigramAssocMeasures.chi_sq, top_n)]

    bigrams_patterns = re.compile('(%s)' % '|'.join(bigrams), re.UNICODE)
    trigrams_patterns = re.compile('(%s)' % '|'.join(trigrams), re.UNICODE)

    return bigrams_patterns, trigrams_patterns
```
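The returned compiled patterns can then be used to rewrite recognized n-grams as single tokens, for example by replacing their internal spaces with underscores so they survive whitespace tokenization. This is a hedged sketch of one possible use; the hard-coded pattern and the underscore scheme are illustrative, not the project's actual tokenizer:

```python
import re

# a pattern like the one _collect_bigrams_and_trigrams returns for bigrams
bigram_pat = re.compile('(frank swank|swank tank|sassy unicorns)', re.UNICODE)


def join_ngrams(text, pattern):
    # replace spaces inside each matched n-gram so it becomes one token
    return pattern.sub(lambda m: m.group(0).replace(' ', '_'), text)


joined = join_ngrams('i saw sassy unicorns today', bigram_pat)
```

Here `joined` contains `sassy_unicorns` as a single whitespace-delimited token.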
The coding style of this project requires that you add a docstring to this code element. Below you will find an example for methods:
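A minimal method docstring in the NumPy style used elsewhere in this code might look like this (an illustrative example, not taken from the project):

```python
def count_tokens(text):
    """Return the number of whitespace-separated tokens in text.

    Parameters
    ----------
    text : str
        The input string to split.

    Returns
    -------
    int
        Number of tokens found.
    """
    return len(text.split())
```

The first line is a short imperative summary; parameter and return sections follow.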
If you would like to know more about docstrings, we recommend reading PEP 257: Docstring Conventions.