| Metric | Value |
| --- | --- |
| Conditions | 20 |
| Total Lines | 104 |
| Code Lines | 49 |
| Lines | 0 |
| Ratio | 0 % |
| Changes | 0 |
Small methods make your code easier to understand, especially when combined with a good name. Besides, if your method is small, finding a good name is usually much easier.
For example, if you find yourself adding comments to a method's body, that is usually a sign that the commented part should be extracted into a new method; the comment itself is a good starting point for naming it.
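The comment-to-method-name idea can be sketched as follows; the `Item`, `order_report`, and `total_price_including_tax` names are hypothetical, not from the code under review:

```python
from dataclasses import dataclass

@dataclass
class Item:
    price: float
    qty: int

# Before: a comment stands in for a missing method name.
def order_report(items):
    # compute the total price including tax
    total = sum(item.price * item.qty for item in items)
    total *= 1.17
    return f"Total: {total:.2f}"

# After: the commented lines are extracted, and the comment
# becomes the method name.
def total_price_including_tax(items, tax_rate=1.17):
    return sum(item.price * item.qty for item in items) * tax_rate

def order_report_refactored(items):
    return f"Total: {total_price_including_tax(items):.2f}"
```

The refactored report method now reads as a single sentence, and the extracted helper can be tested on its own.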
Commonly applied refactorings include Extract Method. If many parameters/temporary variables are present, also consider Replace Temp with Query, Introduce Parameter Object, or Preserve Whole Object.
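For instance, Introduce Parameter Object could bundle the search-tuning arguments that `most_similar` threads through its signature (`topn`, `restrict_vocab`, `unrestricted`). A minimal sketch, with a hypothetical `SearchOptions` class that is not part of the library's API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SearchOptions:
    """Groups the tuning parameters that always travel together."""
    topn: Optional[int] = 10
    restrict_vocab: Optional[int] = None
    unrestricted: bool = True

def describe(opts: SearchOptions) -> str:
    # One object replaces three separate parameters on every call.
    scope = "all" if opts.restrict_vocab is None else f"first {opts.restrict_vocab}"
    return f"top {opts.topn} over {scope} vectors"
```

Callers then pass one object (`describe(SearchOptions(topn=5))`), and adding a new tuning knob no longer changes every signature in the call chain.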
Complex classes and long functions like `responsibly.we.utils.most_similar()` often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.
Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
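A minimal sketch of Extract Class, with hypothetical names: the fields sharing a `printer_` prefix move into their own class.

```python
class Report:
    """Before: the 'printer_' prefix hints at a hidden component."""
    def __init__(self, title):
        self.title = title
        self.printer_name = "lp0"
        self.printer_tray = "A4"

class Printer:
    """After: the shared-prefix fields become a class of their own."""
    def __init__(self, name="lp0", tray="A4"):
        self.name = name
        self.tray = tray

class ReportRefactored:
    def __init__(self, title, printer=None):
        self.title = title
        self.printer = printer or Printer()
```

The extracted `Printer` can now grow behavior of its own without bloating `Report`.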
```python
# Imports required by the excerpt below (string_types, np, and gensim
# are referenced but were outside the report's excerpt):
import math

import gensim
import numpy as np
from six import string_types

# ... (intervening module code elided in the report) ...


def most_similar(model, positive=None, negative=None,
                 topn=10, restrict_vocab=None, indexer=None,
                 unrestricted=True):
    """
    Find the top-N most similar words.

    Positive words contribute positively towards the similarity,
    negative words negatively.

    This function computes cosine similarity between a simple mean
    of the projection weight vectors of the given words and
    the vectors for each word in the model.
    The function corresponds to the `word-analogy` and `distance`
    scripts in the original word2vec implementation.

    Based on the Gensim implementation.

    :param model: Word embedding model of ``gensim.model.KeyedVectors``.
    :param list positive: List of words that contribute positively.
    :param list negative: List of words that contribute negatively.
    :param int topn: Number of top-N similar words to return.
    :param int restrict_vocab: Optional integer which limits the
                               range of vectors which are searched
                               for most-similar values.
                               For example, restrict_vocab=10000 would
                               only check the first 10000 word vectors
                               in the vocabulary order. (This may be
                               meaningful if you've sorted the vocabulary
                               by descending frequency.)
    :param bool unrestricted: Whether the returned words may come
                              from the positive or negative word lists.
    :return: Sequence of (word, similarity).
    """
    if topn is not None and topn < 1:
        return []

    if positive is None:
        positive = []
    if negative is None:
        negative = []

    model.init_sims()

    if (isinstance(positive, string_types)
            and not negative):
        # allow calls like most_similar('dog'),
        # as a shorthand for most_similar(['dog'])
        positive = [positive]

    if ((isinstance(positive, string_types) and negative)
            or (isinstance(negative, string_types) and positive)):
        raise ValueError('If positives and negatives are given, '
                         'both should be lists!')

    # add weights for each word, if not already present;
    # default to 1.0 for positive and -1.0 for negative words
    positive = [
        (word, 1.0) if isinstance(word, string_types + (np.ndarray,))
        else word
        for word in positive
    ]
    negative = [
        (word, -1.0) if isinstance(word, string_types + (np.ndarray,))
        else word
        for word in negative
    ]

    # compute the weighted average of all words
    all_words, mean = set(), []
    for word, weight in positive + negative:
        if isinstance(word, np.ndarray):
            mean.append(weight * word)
        else:
            mean.append(weight * model.word_vec(word, use_norm=True))
            if word in model.vocab:
                all_words.add(model.vocab[word].index)

    if not mean:
        raise ValueError("Cannot compute similarity with no input.")
    mean = gensim.matutils.unitvec(np.array(mean)
                                   .mean(axis=0)).astype(float)

    if indexer is not None:
        return indexer.most_similar(mean, topn)

    limited = (model.vectors_norm if restrict_vocab is None
               else model.vectors_norm[:restrict_vocab])
    dists = limited @ mean

    if topn is None:
        return dists

    best = gensim.matutils.argsort(dists,
                                   topn=topn + len(all_words),
                                   reverse=True)

    # if not unrestricted, then ignore (don't return)
    # words from the input
    result = [(model.index2word[sim], float(dists[sim]))
              for sim in best
              if unrestricted or sim not in all_words]

    return result[:topn]
```
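As an illustration of Extract Method applied to this function, the input-weighting step could become a helper. This is a sketch with a hypothetical helper name, reduced to plain strings so it runs without gensim:

```python
def add_default_weights(words, default_weight):
    """Attach a default weight to every bare word, leaving existing
    (word, weight) pairs untouched - mirroring the two list
    comprehensions in most_similar."""
    return [
        # hypothetical helper: str stands in for string_types/np.ndarray
        (word, default_weight) if isinstance(word, str) else word
        for word in words
    ]

# most_similar's body would then shrink to:
#     positive = add_default_weights(positive, 1.0)
#     negative = add_default_weights(negative, -1.0)
```

The duplicated comprehension disappears, and the helper's name documents the default-weighting rule that the original comment had to explain.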