Duplicate code is one of the most pungent code smells. A common rule of thumb is to restructure code once it is duplicated in three or more places.
Complex classes like DecisionStump often do many different things. To break such a class down, first identify a cohesive component within it. A common approach is to look for fields and methods that share the same prefix or suffix. You can also inspect the cohesion graph to spot unconnected or weakly connected components.
Once you have determined which fields belong together, you can apply the Extract Class refactoring. If the component makes sense as a subclass, Extract Subclass is also a candidate and is often faster.
While breaking up the class, it is a good idea to analyze how other classes use DecisionStump and, based on those observations, to apply Extract Interface as well.
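As a rough illustration of Extract Class, the sketch below assumes that the $column, $operator and $value fields of DecisionStump, together with its evaluate() method, form one cohesive component. The class name SplitRule, the method matches() and the set of supported operators are hypothetical; the body of the original evaluate() is not shown in the listing, so this is not the library's implementation.

```php
<?php

/**
 * Hypothetical class extracted from DecisionStump: it groups the
 * column/operator/value trio with the comparison that uses them.
 */
final class SplitRule
{
    /** @var int */
    private $column;

    /** @var string */
    private $operator;

    /** @var mixed */
    private $value;

    public function __construct(int $column, string $operator, $value)
    {
        $this->column = $column;
        $this->operator = $operator;
        $this->value = $value;
    }

    /**
     * Returns true when the sample's value in the rule's column
     * satisfies the rule.
     */
    public function matches(array $sample): bool
    {
        $left = $sample[$this->column];

        // Assumed operator set; the original evaluate() may differ.
        switch ($this->operator) {
            case '>':
                return $left > $this->value;
            case '>=':
                return $left >= $this->value;
            case '<':
                return $left < $this->value;
            case '<=':
                return $left <= $this->value;
            case '=':
                return $left == $this->value;
            case '!=':
                return $left != $this->value;
            default:
                throw new InvalidArgumentException('Unsupported operator: ' . $this->operator);
        }
    }
}

// Usage: a rule on column 1 that accepts values of at least 5.0.
$rule = new SplitRule(1, '>=', 5.0);
var_dump($rule->matches([0.3, 7.2])); // bool(true)
var_dump($rule->matches([0.3, 4.9])); // bool(false)
```

With such a class in place, DecisionStump would hold a single rule object instead of three loosely related fields, and the comparison logic could be tested on its own.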
<?php

// ...

class DecisionStump extends WeightedClassifier
{
    use Predictable, OneVsRest;

    const AUTO_SELECT = -1;

    /**
     * @var int
     */
    protected $givenColumnIndex;

    /**
     * @var array
     */
    protected $binaryLabels;

    /**
     * Sample weights : If used the optimization on the decision value
     * will take these weights into account. If not given, all samples
     * will be weighed with the same value of 1
     *
     * @var array
     */
    protected $weights = null;

    /**
     * Lowest error rate obtained while training/optimizing the model
     *
     * @var float
     */
    protected $trainingErrorRate;

    /**
     * @var int
     */
    protected $column;

    /**
     * @var mixed
     */
    protected $value;

    /**
     * @var string
     */
    protected $operator;

    /**
     * @var array
     */
    protected $columnTypes;

    /**
     * @var int
     */
    protected $featureCount;

    /**
     * @var float
     */
    protected $numSplitCount = 100.0;

    /**
     * Distribution of samples in the leaves
     *
     * @var array
     */
    protected $prob;

    /**
     * A DecisionStump classifier is a one-level deep DecisionTree. It is generally
     * used with ensemble algorithms as in the weak classifier role. <br>
     *
     * If columnIndex is given, then the stump tries to produce a decision node
     * on this column, otherwise in cases given the value of -1, the stump itself
     * decides which column to take for the decision (Default DecisionTree behaviour)
     *
     * @param int $columnIndex
     */
    public function __construct(int $columnIndex = self::AUTO_SELECT)
    {
        // ...
    }

    /**
     * @param array $samples
     * @param array $targets
     */
    protected function trainBinary(array $samples, array $targets)
    {
        // ...
    }

    /**
     * While finding best split point for a numerical valued column,
     * DecisionStump looks for equally distanced values between minimum and maximum
     * values in the column. Given <i>$count</i> value determines how many split
     * points to be probed. The more split counts, the better performance but
     * worse processing time (Default value is 10.0)
     *
     * @param float $count
     */
    public function setNumericalSplitCount(float $count)
    {
        // ...
    }

    /**
     * Determines best split point for the given column
     *
     * @param int $col
     *
     * @return array
     */
    protected function getBestNumericalSplit(int $col)
    {
        // ...
    }

    /**
     *
     * @param int $col
     *
     * @return array
     */
    protected function getBestNominalSplit(int $col)
    {
        // ...
    }

    /**
     *
     * @param type $leftValue
     * @param type $operator
     * @param type $rightValue
     *
     * @return boolean
     */
    protected function evaluate($leftValue, $operator, $rightValue)
    {
        // ...
    }

    /**
     * Calculates the ratio of wrong predictions based on the new threshold
     * value given as the parameter
     *
     * @param float $threshold
     * @param string $operator
     * @param array $values
     *
     * @return array
     */
    protected function calculateErrorRate(float $threshold, string $operator, array $values)
    {
        // ...
    }

    /**
     * Returns the probability of the sample of belonging to the given label
     *
     * Probability of a sample is calculated as the proportion of the label
     * within the labels of the training samples in the decision node
     *
     * @param array $sample
     * @param mixed $label
     *
     * @return float
     */
    protected function predictProbability(array $sample, $label)
    {
        // ...
    }

    /**
     * @param array $sample
     *
     * @return mixed
     */
    protected function predictSampleBinary(array $sample)
    {
        // ...
    }

    /**
     * @return string
     */
    public function __toString()
    {
        // ...
    }
}
This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.
Consider making the comparison explicit by using empty(...) or !empty(...) instead.
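A minimal sketch of the explicit form follows. The helper describeWeights() and its return values are hypothetical and serve only to contrast the two styles; note that empty() is true for null as well as for an empty array, which may matter for a field such as $weights that defaults to null.

```php
<?php

/**
 * Hypothetical helper: reports which weighting applies to the samples.
 *
 * @param array|null $weights
 */
function describeWeights($weights): string
{
    // Implicit form flagged by the check, e.g.: if ($weights == false) { ... }
    // Explicit form: empty() is true for both [] and null.
    if (empty($weights)) {
        return 'uniform';
    }

    return 'custom';
}

echo describeWeights([]), PHP_EOL;         // uniform
echo describeWeights(null), PHP_EOL;       // uniform
echo describeWeights([0.5, 1.5]), PHP_EOL; // custom
```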