Metric | Value
Conditions | 31
Total Lines | 177
Code Lines | 75
Lines | 0
Ratio | 0 %
Tests | 57
CRAP Score | 31
Changes | 0
Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.
For example, if you find yourself adding comments to a method's body, this is usually a good sign that the commented part should be extracted into a new method. The comment is then a useful starting point for naming that method.
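As a minimal sketch of this idea, take the comment "Adjust for having up to the first 4 characters in common" in the sim() listing further down. The loop it describes could become a small function of its own, named after the comment. The name common_prefix_length and its limit parameter are hypothetical and not part of abydos.

```python
def common_prefix_length(src, tar, limit=4):
    """Count the leading positions where src and tar agree, up to limit."""
    # Hypothetical helper: never look past the shorter input or the limit.
    bound = min(limit, len(src), len(tar))
    i = 0
    while i < bound and src[i] == tar[i]:
        i += 1
    return i
```

With such a helper, the calling code would read as a single named step, and the comment would no longer be needed.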
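Commonly applied refactorings include Extract Method. If many parameters or temporary variables are present, Replace Temp with Query, Introduce Parameter Object, and Preserve Whole Object are also worth considering.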
Complex classes, such as the one containing abydos.distance._jaro_winkler.JaroWinkler.sim(), often do a lot of different things. To break such a class down, we need to identify a cohesive component within it. A common way to find such a component is to look for fields and methods that share the same prefixes or suffixes.
Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
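A minimal sketch of Extract Class, assuming the two settings that the sim() docstring marks as "Used in 'winkler' mode only" form such a cohesive component. The class name WinklerBoost and its apply method are hypothetical. The range checks and messages are copied from the listing, and the long_strings adjustment is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WinklerBoost:
    """Hypothetical holder for the settings used in 'winkler' mode only."""

    boost_threshold: float = 0.7
    scaling_factor: float = 0.1

    def __post_init__(self):
        # Same range checks and messages as in the original method.
        if self.boost_threshold > 1 or self.boost_threshold < 0:
            raise ValueError(
                'Unsupported boost_threshold assignment; '
                'boost_threshold must be between 0 and 1.'
            )
        if self.scaling_factor > 0.25 or self.scaling_factor < 0:
            raise ValueError(
                'Unsupported scaling_factor assignment; '
                'scaling_factor must be between 0 and 0.25.'
            )

    def apply(self, weight, prefix_len):
        """Boost a Jaro weight for prefix_len agreeing leading characters."""
        if weight <= self.boost_threshold:
            return weight  # below or at the threshold: no boost
        return weight + prefix_len * self.scaling_factor * (1.0 - weight)
```

For instance, WinklerBoost().apply(0.783333333333, 1) gives roughly 0.805, which agrees with the 'Niall'/'Neil' values in the docstring of sim().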
Methods with many parameters are not only hard to understand, but their parameters also often become inconsistent when you need more, or different data.
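There are several approaches to avoid long parameter lists, among them Replace Parameter with Method Call, Preserve Whole Object, and Introduce Parameter Object.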
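One way to shorten such a list is to group the optional arguments into a single object. The sketch below assumes the five optional parameters of sim() and their defaults as shown in the listing. SimOptions, call_with_options, and JARO_ONLY are hypothetical names, not part of abydos.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class SimOptions:
    """Hypothetical parameter object for the optional arguments of sim()."""

    qval: int = 1
    mode: str = 'winkler'
    long_strings: bool = False
    boost_threshold: float = 0.7
    scaling_factor: float = 0.1


def call_with_options(sim, src, tar, options=SimOptions()):
    """Forward the grouped options to a sim-like callable as keywords."""
    return sim(src, tar, **asdict(options))


# A named variant can be derived once and reused by every caller.
JARO_ONLY = replace(SimOptions(), mode='jaro')
```

The forwarding function lets existing code keep its keyword-based signature while callers migrate to the options object one at a time.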
# -*- coding: utf-8 -*-

    def sim(
        self,
        src,
        tar,
        qval=1,
        mode='winkler',
        long_strings=False,
        boost_threshold=0.7,
        scaling_factor=0.1,
    ):
        """Return the Jaro or Jaro-Winkler similarity of two strings.

        Parameters
        ----------
        src : str
            Source string for comparison
        tar : str
            Target string for comparison
        qval : int
            The length of each q-gram (defaults to 1: character-wise matching)
        mode : str
            Indicates which variant of this distance metric to compute:

                - ``winkler`` -- computes the Jaro-Winkler distance (default)
                  which increases the score for matches near the start of the
                  word
                - ``jaro`` -- computes the Jaro distance

        long_strings : bool
            Set to True to "Increase the probability of a match when the number
            of matched characters is large. This option allows for a little
            more tolerance when the strings are large. It is not an appropriate
            test when comparing fixed length fields such as phone and social
            security numbers." (Used in 'winkler' mode only.)
        boost_threshold : float
            A value between 0 and 1, below which the Winkler boost is not
            applied (defaults to 0.7). (Used in 'winkler' mode only.)
        scaling_factor : float
            A value between 0 and 0.25, indicating by how much to boost scores
            for matching prefixes (defaults to 0.1). (Used in 'winkler' mode
            only.)

        Returns
        -------
        float
            Jaro or Jaro-Winkler similarity

        Raises
        ------
        ValueError
            Unsupported boost_threshold assignment; boost_threshold must be
            between 0 and 1.
        ValueError
            Unsupported scaling_factor assignment; scaling_factor must be
            between 0 and 0.25.

        Examples
        --------
        >>> round(sim_jaro_winkler('cat', 'hat'), 12)
        0.777777777778
        >>> round(sim_jaro_winkler('Niall', 'Neil'), 12)
        0.805
        >>> round(sim_jaro_winkler('aluminum', 'Catalan'), 12)
        0.60119047619
        >>> round(sim_jaro_winkler('ATCG', 'TAGC'), 12)
        0.833333333333

        >>> round(sim_jaro_winkler('cat', 'hat', mode='jaro'), 12)
        0.777777777778
        >>> round(sim_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
        0.783333333333
        >>> round(sim_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
        0.60119047619
        >>> round(sim_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
        0.833333333333

        """
        if mode == 'winkler':
            if boost_threshold > 1 or boost_threshold < 0:
                raise ValueError(
                    'Unsupported boost_threshold assignment; '
                    + 'boost_threshold must be between 0 and 1.'
                )
            if scaling_factor > 0.25 or scaling_factor < 0:
                raise ValueError(
                    'Unsupported scaling_factor assignment; '
                    + 'scaling_factor must be between 0 and 0.25.'
                )

        if src == tar:
            return 1.0

        src = QGrams(src.strip(), qval)._ordered_list
        tar = QGrams(tar.strip(), qval)._ordered_list

        lens = len(src)
        lent = len(tar)

        # If either string is blank - return - added in Version 2
        if lens == 0 or lent == 0:
            return 0.0

        if lens > lent:
            search_range = lens
            minv = lent
        else:
            search_range = lent
            minv = lens

        # Zero out the flags
        src_flag = [0] * search_range
        tar_flag = [0] * search_range
        search_range = max(0, search_range // 2 - 1)

        # Looking only within the search range,
        # count and flag the matched pairs.
        num_com = 0
        yl1 = lent - 1
        for i in range(lens):
            low_lim = (i - search_range) if (i >= search_range) else 0
            hi_lim = (i + search_range) if ((i + search_range) <= yl1) else yl1
            for j in range(low_lim, hi_lim + 1):
                if (tar_flag[j] == 0) and (tar[j] == src[i]):
                    tar_flag[j] = 1
                    src_flag[i] = 1
                    num_com += 1
                    break

        # If no characters in common - return
        if num_com == 0:
            return 0.0

        # Count the number of transpositions
        k = n_trans = 0
        for i in range(lens):
            if src_flag[i] != 0:
                j = 0
                for j in range(k, lent):  # pragma: no branch
                    if tar_flag[j] != 0:
                        k = j + 1
                        break
                if src[i] != tar[j]:
                    n_trans += 1
        n_trans //= 2

        # Main weight computation for Jaro distance
        weight = (
            num_com / lens + num_com / lent + (num_com - n_trans) / num_com
        )
        weight /= 3.0

        # Continue to boost the weight if the strings are similar
        # This is the Winkler portion of Jaro-Winkler distance
        if mode == 'winkler' and weight > boost_threshold:

            # Adjust for having up to the first 4 characters in common
            j = 4 if (minv >= 4) else minv
            i = 0
            while (i < j) and (src[i] == tar[i]):
                i += 1
            weight += i * scaling_factor * (1.0 - weight)

            # Optionally adjust for long strings.

            # After agreeing beginning chars, at least two more must agree and
            # the agreeing characters must be > .5 of remaining characters.
            if (
                long_strings
                and (minv > 4)
                and (num_com > i + 1)
                and (2 * num_com >= minv + i)
            ):
                weight += (1.0 - weight) * (
                    (num_com - i - 1) / (lens + lent - i * 2 + 2)
                )

        return weight
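The main weight computation can be checked by hand against two of the docstring examples. The sketch below isolates that formula in a hypothetical function, jaro_weight (not part of abydos), and assumes Python 3 true division. The match and transposition counts in the comments were worked out by tracing the matching and transposition loops in the listing.

```python
def jaro_weight(num_com, lens, lent, n_trans):
    """Average the three Jaro ratios, as in the main weight computation."""
    if num_com == 0:
        return 0.0  # mirrors the early return for no common characters
    return (
        num_com / lens + num_com / lent + (num_com - n_trans) / num_com
    ) / 3.0


# 'cat' vs 'hat': 2 matched characters, no transpositions.
print(round(jaro_weight(2, 3, 3, 0), 12))  # 0.777777777778
# 'ATCG' vs 'TAGC': 4 matched characters, 4 out-of-order pairs halved to 2.
print(round(jaro_weight(4, 4, 4, 2), 12))  # 0.833333333333
```

Neither pair shares a leading character, so the Winkler boost adds nothing, which is why the docstring lists the same value for both modes.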