| Metric | Value |
| --- | --- |
| Conditions | 20 |
| Total Lines | 104 |
| Code Lines | 49 |
| Lines | 0 |
| Ratio | 0 % |
| Changes | 0 |
Small methods make your code easier to understand, especially when combined with a good name. Moreover, if your method is small, finding a good name for it is usually much easier.
For example, if you find yourself adding comments to a method's body, that is usually a sign that the commented part should be extracted into a new method, with the comment as a starting point for the new method's name.
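As an illustration, here is a minimal sketch of that idea with hypothetical names (not code from ethically): the comment labelling a block becomes the name of the extracted method.

```python
EU_COUNTRIES = {'DE', 'FR', 'NL'}  # illustrative placeholder


# Before: a comment explains what the middle block does.
def total_price(items, country):
    total = sum(item['price'] for item in items)
    # add VAT for EU countries
    if country in EU_COUNTRIES:
        total *= 1.2
    return total


# After: the comment has become the name of a small, extracted method.
def add_vat(total, country):
    if country in EU_COUNTRIES:
        return total * 1.2
    return total


def total_price_refactored(items, country):
    total = sum(item['price'] for item in items)
    return add_vat(total, country)
```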
Commonly applied refactorings include Extract Method, often guided by the comments or blank lines that already separate the individual steps.
If many parameters or temporary variables are present, Replace Temp with Query, Introduce Parameter Object, or Replace Method with Method Object can help; a sketch of the parameter-object variant follows.
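The sketch below loosely mirrors the options of `most_similar()` shown later in this report; the `SimilarityQuery` class and the `most_similar_query` wrapper are hypothetical names, not part of ethically.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical parameter object bundling the query options that
# most_similar() (listed below) takes as separate arguments.
@dataclass
class SimilarityQuery:
    positive: List[str] = field(default_factory=list)
    negative: List[str] = field(default_factory=list)
    topn: int = 10
    restrict_vocab: Optional[int] = None
    unrestricted: bool = True


def most_similar_query(model, query):
    # Delegates to the original signature; only the call sites change.
    return most_similar(model,
                        positive=query.positive,
                        negative=query.negative,
                        topn=query.topn,
                        restrict_vocab=query.restrict_vocab,
                        unrestricted=query.unrestricted)
```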
Complex units like ethically.we.utils.most_similar() often do a lot of different things. To break such a unit down, we need to identify a cohesive component within it. A common approach to finding such a component is to look for fields, methods, or local variables that share the same prefixes or suffixes.
Once you have determined which members belong together, you can apply the Extract Class refactoring. If the component makes sense as a subclass, Extract Subclass is also a candidate and is often quicker to apply.
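A minimal sketch of Extract Class, again with hypothetical names that are not taken from ethically: fields sharing a `report_` prefix point at a component that can move into its own class.

```python
# Before: the shared "report_" prefix hints at a hidden component.
class Experiment:
    def __init__(self):
        self.model = None
        self.report_title = ''
        self.report_rows = []

    def report_render(self):
        return '\n'.join([self.report_title] + self.report_rows)


# After: the "report_" members become their own class (Extract Class).
class Report:
    def __init__(self, title=''):
        self.title = title
        self.rows = []

    def render(self):
        return '\n'.join([self.title] + self.rows)


class ExperimentRefactored:
    def __init__(self):
        self.model = None
        self.report = Report()
```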
```python
import math

# Imports implied by the function body (not shown in the original excerpt):
# numpy, gensim's matutils, and six's string_types.
import gensim.matutils
import numpy as np
from six import string_types

# ... (intervening module code omitted in this excerpt) ...


def most_similar(model, positive=None, negative=None,
                 topn=10, restrict_vocab=None, indexer=None,
                 unrestricted=True):
    """
    Find the top-N most similar words.

    Positive words contribute positively towards the similarity,
    negative words negatively.

    This function computes cosine similarity between a simple mean
    of the projection weight vectors of the given words and
    the vectors for each word in the model.
    The function corresponds to the `word-analogy` and `distance`
    scripts in the original word2vec implementation.

    Based on the Gensim implementation.

    :param model: Word embedding model of ``gensim.models.KeyedVectors``.
    :param list positive: List of words that contribute positively.
    :param list negative: List of words that contribute negatively.
    :param int topn: Number of top-N similar words to return.
    :param int restrict_vocab: Optional integer which limits the
                               range of vectors which are searched
                               for most-similar values.
                               For example, restrict_vocab=10000 would
                               only check the first 10000 word vectors
                               in the vocabulary order. (This may be
                               meaningful if you've sorted the vocabulary
                               by descending frequency.)
    :param bool unrestricted: Whether to allow the results to include
                              words from the positive or negative lists.
    :return: Sequence of (word, similarity).
    """
    if topn is not None and topn < 1:
        return []

    if positive is None:
        positive = []
    if negative is None:
        negative = []

    model.init_sims()

    if (isinstance(positive, string_types)
            and not negative):
        # allow calls like most_similar('dog'),
        # as a shorthand for most_similar(['dog'])
        positive = [positive]

    if ((isinstance(positive, string_types) and negative)
            or (isinstance(negative, string_types) and positive)):
        raise ValueError('If positives and negatives are given, '
                         'both should be lists!')

    # add weights for each word, if not already present;
    # default to 1.0 for positive and -1.0 for negative words
    positive = [
        (word, 1.0) if isinstance(word, string_types + (np.ndarray,))
        else word
        for word in positive
    ]
    negative = [
        (word, -1.0) if isinstance(word, string_types + (np.ndarray,))
        else word
        for word in negative
    ]

    # compute the weighted average of all words
    all_words, mean = set(), []
    for word, weight in positive + negative:
        if isinstance(word, np.ndarray):
            mean.append(weight * word)
        else:
            mean.append(weight * model.word_vec(word, use_norm=True))
            if word in model.vocab:
                all_words.add(model.vocab[word].index)

    if not mean:
        raise ValueError("Cannot compute similarity with no input.")
    mean = gensim.matutils.unitvec(np.array(mean)
                                   .mean(axis=0)).astype(float)

    if indexer is not None:
        return indexer.most_similar(mean, topn)

    limited = (model.vectors_norm if restrict_vocab is None
               else model.vectors_norm[:restrict_vocab])
    dists = limited @ mean

    if topn is None:
        return dists

    best = gensim.matutils.argsort(dists,
                                   topn=topn + len(all_words),
                                   reverse=True)

    # if not unrestricted, then ignore (don't return)
    # words from the input
    result = [(model.index2word[sim], float(dists[sim]))
              for sim in best
              if unrestricted or sim not in all_words]

    return result[:topn]
```
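For reference, a minimal usage sketch, assuming a gensim 3.x `KeyedVectors` model loaded from a word2vec-format embedding file (the file name here is hypothetical):

```python
from gensim.models import KeyedVectors

# hypothetical embedding file; any word2vec-format vectors work here
model = KeyedVectors.load_word2vec_format('embeddings.bin', binary=True)

# classic analogy query: king - man + woman ~ queen;
# unrestricted=False keeps the input words out of the results
print(most_similar(model,
                   positive=['king', 'woman'],
                   negative=['man'],
                   topn=3,
                   unrestricted=False))
```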