Test Failed
Branch master (01bb82)
by Arkadiusz
02:59
created

DecisionStump   B

Complexity

Total Complexity 43

Size/Duplication

Total Lines 350
Duplicated Lines 4.29 %

Coupling/Cohesion

Components 1
Dependencies 4

Importance

Changes 0
Metric Value
wmc 43
lcom 1
cbo 4
dl 15
loc 350
rs 8.3157
c 0
b 0
f 0

10 Methods

Rating   Name   Duplication   Size   Complexity  
A __construct() 0 4 1
B trainBinary() 0 54 9
A setNumericalSplitCount() 0 4 1
B getBestNumericalSplit() 10 38 6
B getBestNominalSplit() 5 22 5
B evaluate() 0 14 8
C calculateErrorRate() 0 38 8
A predictProbability() 0 9 2
A predictSampleBinary() 0 8 2
A __toString() 0 6 1


Duplicated Code

Duplicate code is one of the most pungent code smells. A rule that is often used is to re-structure code once it is duplicated in three or more places.
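As a minimal sketch of removing this kind of duplication, the near-identical blocks that build a split array in DecisionStump could be routed through shared helpers. The function names `makeSplit` and `keepBetterSplit` are hypothetical and not part of the library; the array keys are the ones used in the listing further down.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: builds the split description that DecisionStump
 * assembles inline in several places.
 */
function makeSplit($value, string $operator, array $prob, int $column, float $errorRate): array
{
    return [
        'value' => $value,
        'operator' => $operator,
        'prob' => $prob,
        'column' => $column,
        'trainingErrorRate' => $errorRate,
    ];
}

/**
 * Hypothetical helper: keeps the candidate when there is no current split
 * or when the candidate has a strictly lower error rate.
 */
function keepBetterSplit(?array $current, array $candidate): array
{
    if ($current === null || $candidate['trainingErrorRate'] < $current['trainingErrorRate']) {
        return $candidate;
    }

    return $current;
}

// Example: three candidate splits, the one with the lowest error rate is kept.
$best = null;
$best = keepBetterSplit($best, makeSplit(2.5, '<=', [], 0, 0.4));
$best = keepBetterSplit($best, makeSplit(3.0, '>', [], 0, 0.1));
$best = keepBetterSplit($best, makeSplit(1.0, '<=', [], 0, 0.3));
```

With helpers like these, each call site shrinks to one line, and the comparison rule lives in a single place.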

Common duplication problems, and corresponding solutions are:

Complex Class

Tip: Before tackling complexity, make sure that you eliminate any duplication first. This can often reduce the size of classes significantly.

Complex classes like DecisionStump often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes or suffixes. You can also look at the cohesion graph to spot any unconnected or weakly connected components.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
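One possible Extract Class candidate, sketched under assumptions: the `$column`, `$operator` and `$value` fields and the `evaluate()` method of DecisionStump always travel together, so they could move into a small value object. The class name `SplitRule` and its method `matches` are hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical class extracted from DecisionStump: it owns the column,
 * operator and comparison value of the decision node.
 */
final class SplitRule
{
    private $column;
    private $operator;
    private $value;

    public function __construct(int $column, string $operator, $value)
    {
        $this->column = $column;
        $this->operator = $operator;
        $this->value = $value;
    }

    /** Returns true when the sample satisfies the rule (the "left" leaf). */
    public function matches(array $sample): bool
    {
        $left = $sample[$this->column];

        switch ($this->operator) {
            case '>':
                return $left > $this->value;
            case '>=':
                return $left >= $this->value;
            case '<':
                return $left < $this->value;
            case '<=':
                return $left <= $this->value;
            case '=':
                return $left === $this->value;
            case '!=':
            case '<>':
                return $left !== $this->value;
        }

        // Unknown operators never match, as in the original evaluate().
        return false;
    }
}

$rule = new SplitRule(1, '<=', 2.5);
$inLeftLeaf = $rule->matches([10, 2.0]);
$inRightLeaf = ! $rule->matches([10, 3.0]);
```

The stump would then hold a single `SplitRule` instead of three loosely related fields, which should also shrink `predictSampleBinary()` and `__toString()`.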

While breaking up the class, it is a good idea to analyze how other classes use DecisionStump, and based on these observations, apply Extract Interface, too.

<?php

declare(strict_types=1);

namespace Phpml\Classification\Linear;

use Phpml\Helper\Predictable;
use Phpml\Helper\OneVsRest;
use Phpml\Classification\WeightedClassifier;
use Phpml\Classification\DecisionTree;

class DecisionStump extends WeightedClassifier
{
    use Predictable, OneVsRest;

    const AUTO_SELECT = -1;

    /**
     * @var int
     */
    protected $givenColumnIndex;

    /**
     * @var array
     */
    protected $binaryLabels;

    /**
     * Sample weights: if given, the optimization of the decision value
     * takes these weights into account. If not given, all samples
     * are weighted with the same value of 1
     *
     * @var array
     */
    protected $weights = null;

    /**
     * Lowest error rate obtained while training/optimizing the model
     *
     * @var float
     */
    protected $trainingErrorRate;

    /**
     * @var int
     */
    protected $column;

    /**
     * @var mixed
     */
    protected $value;

    /**
     * @var string
     */
    protected $operator;

    /**
     * @var array
     */
    protected $columnTypes;

    /**
     * @var int
     */
    protected $featureCount;

    /**
     * @var float
     */
    protected $numSplitCount = 100.0;

    /**
     * Distribution of samples in the leaves
     *
     * @var array
     */
    protected $prob;

    /**
     * A DecisionStump classifier is a one-level deep DecisionTree. It is generally
     * used as the weak classifier in ensemble algorithms. <br>
     *
     * If columnIndex is given, the stump tries to produce a decision node
     * on this column; if it is -1, the stump itself decides which column
     * to use for the decision (default DecisionTree behaviour)
     *
     * @param int $columnIndex
     */
    public function __construct(int $columnIndex = self::AUTO_SELECT)
    {
        $this->givenColumnIndex = $columnIndex;
    }

    /**
     * @param array $samples
     * @param array $targets
     */
    protected function trainBinary(array $samples, array $targets)
    {
        $this->samples = array_merge($this->samples, $samples);
        $this->targets = array_merge($this->targets, $targets);
        $this->binaryLabels = array_keys(array_count_values($this->targets));
        $this->featureCount = count($this->samples[0]);

        // If a column index is given, it should be among the existing columns
        if ($this->givenColumnIndex > count($this->samples[0]) - 1) {
            $this->givenColumnIndex = self::AUTO_SELECT;
        }

        // Check the size of the weights given.
        // If none given, then assign 1 as a weight to each sample
        if ($this->weights) {
Bug Best Practice introduced by
The expression $this->weights of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using ! empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.
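A minimal sketch of the explicit check, written as a stand-alone function so it can run outside the class. The name `resolveWeights` is hypothetical, and `InvalidArgumentException` is swapped in here for the generic `\Exception` thrown in the listing; both are illustrative choices.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-alone version of the weight check in trainBinary():
 * returns the given weights, or a weight of 1 per sample when none are set.
 */
function resolveWeights(?array $weights, int $sampleCount): array
{
    // Explicit emptiness check instead of relying on array-to-bool conversion.
    if (! empty($weights)) {
        if (count($weights) !== $sampleCount) {
            throw new InvalidArgumentException(
                'Number of sample weights does not match number of samples'
            );
        }

        return $weights;
    }

    return array_fill(0, $sampleCount, 1);
}

$defaults = resolveWeights(null, 3);
```

Both `null` and an empty array fall through to the default branch, which matches what the implicit conversion did, but the intent is now visible at the call site.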

            $numWeights = count($this->weights);
            if ($numWeights != count($this->samples)) {
                throw new \Exception("Number of sample weights does not match with number of samples");
            }
        } else {
            $this->weights = array_fill(0, count($this->samples), 1);
        }

        // Determine type of each column as either "continuous" or "nominal"
        $this->columnTypes = DecisionTree::getColumnTypes($this->samples);

        // Try to find the best split in the columns of the dataset
        // by calculating error rate for each split point in each column
        $columns = range(0, count($this->samples[0]) - 1);
        if ($this->givenColumnIndex != self::AUTO_SELECT) {
            $columns = [$this->givenColumnIndex];
        }

        $bestSplit = [
            'value' => 0, 'operator' => '',
            'prob' => [], 'column' => 0,
            'trainingErrorRate' => 1.0];
        foreach ($columns as $col) {
            if ($this->columnTypes[$col] == DecisionTree::CONTINUOS) {
                $split = $this->getBestNumericalSplit($col);
            } else {
                $split = $this->getBestNominalSplit($col);
            }

            if ($split['trainingErrorRate'] < $bestSplit['trainingErrorRate']) {
                $bestSplit = $split;
            }
        }

        // Assign determined best values to the stump
        foreach ($bestSplit as $name => $value) {
            $this->{$name} = $value;
        }
    }

    /**
     * While finding the best split point for a numerical column,
     * DecisionStump probes equally spaced values between the minimum and maximum
     * values in the column. The given <i>$count</i> determines how many split
     * points are probed. More split points give better predictive performance but
     * longer processing time (default value is 100.0)
     *
     * @param float $count
     */
    public function setNumericalSplitCount(float $count)
    {
        $this->numSplitCount = $count;
    }

    /**
     * Determines best split point for the given column
     *
     * @param int $col
     *
     * @return array
     */
    protected function getBestNumericalSplit(int $col)
    {
        $values = array_column($this->samples, $col);
        // Trying all possible points may be accomplished in two general ways:
        // 1- Try all values in the $samples array ($values)
        // 2- Artificially split the range of values into several parts and try them
        // We choose the second one because it is faster in larger datasets
        $minValue = min($values);
        $maxValue = max($values);
        $stepSize = ($maxValue - $minValue) / $this->numSplitCount;

        $split = null;

        foreach (['<=', '>'] as $operator) {
            // Before trying all possible split points, let's first try
            // the average value for the cut point
            $threshold = array_sum($values) / (float) count($values);
            list($errorRate, $prob) = $this->calculateErrorRate($threshold, $operator, $values);
            if ($split == null || $errorRate < $split['trainingErrorRate']) {
Duplication introduced by
This code seems to be duplicated across your project.

Duplicated code is one of the most pungent code smells. If you need to duplicate the same code in three or more different places, we strongly encourage you to look into extracting the code into a single class or operation.

You can also find more detailed suggestions in the “Code” section of your repository.

                $split = ['value' => $threshold, 'operator' => $operator,
                        'prob' => $prob, 'column' => $col,
                        'trainingErrorRate' => $errorRate];
            }

            // Try other possible points one by one
            for ($step = $minValue; $step <= $maxValue; $step+= $stepSize) {
                $threshold = (float)$step;
                list($errorRate, $prob) = $this->calculateErrorRate($threshold, $operator, $values);
                if ($errorRate < $split['trainingErrorRate']) {
Duplication introduced by
This code seems to be duplicated across your project.

                    $split = ['value' => $threshold, 'operator' => $operator,
                        'prob' => $prob, 'column' => $col,
                        'trainingErrorRate' => $errorRate];
                }
            } // for
        }

        return $split;
    }

    /**
     *
     * @param int $col
     *
     * @return array
     */
    protected function getBestNominalSplit(int $col)
    {
        $values = array_column($this->samples, $col);
        $valueCounts = array_count_values($values);
        $distinctVals = array_keys($valueCounts);

        $split = null;

        foreach (['=', '!='] as $operator) {
            foreach ($distinctVals as $val) {
                list($errorRate, $prob) = $this->calculateErrorRate($val, $operator, $values);

                if ($split == null || $split['trainingErrorRate'] < $errorRate) {
Duplication introduced by
This code seems to be duplicated across your project.

                    $split = ['value' => $val, 'operator' => $operator,
                        'prob' => $prob, 'column' => $col,
                        'trainingErrorRate' => $errorRate];
                }
            } // for
        }

        return $split;
    }

    /**
     *
     * @param type $leftValue
     * @param type $operator
     * @param type $rightValue
     *
     * @return boolean
     */
    protected function evaluate($leftValue, $operator, $rightValue)
    {
        switch ($operator) {
            case '>': return $leftValue > $rightValue;
Coding Style introduced by
The case body in a switch statement must start on the line following the statement.

According to the PSR-2, the body of a case statement must start on the line immediately following the case statement.

switch ($expr) {
case "A":
    doSomething(); //right
    break;
case "B":

    doSomethingElse(); //wrong
    break;

}

To learn more about the PSR-2 coding standard, please refer to the PHP-Fig.

Coding Style introduced by
Terminating statement must be on a line by itself

As per the PSR-2 coding standard, the break (or other terminating) statement must be on a line of its own.

switch ($expr) {
     case "A":
         doSomething();
         break; //wrong
     case "B":
         doSomething();
         break; //right
     case "C":
         doSomething();
         return true; //right
 }

To learn more about the PSR-2 coding standard, please refer to the PHP-Fig.

            case '>=': return $leftValue >= $rightValue;
            case '<': return $leftValue < $rightValue;
            case '<=': return $leftValue <= $rightValue;
            case '=': return $leftValue === $rightValue;
            case '!=':
            case '<>': return $leftValue !== $rightValue;
        }

        return false;
    }

    /**
     * Calculates the ratio of wrong predictions based on the new threshold
     * value given as the parameter
     *
     * @param float $threshold
     * @param string $operator
     * @param array $values
     *
     * @return array
     */
    protected function calculateErrorRate(float $threshold, string $operator, array $values)
    {
        $wrong = 0.0;
        $prob = [];
        $leftLabel = $this->binaryLabels[0];
        $rightLabel = $this->binaryLabels[1];

        foreach ($values as $index => $value) {
            if ($this->evaluate($value, $operator, $threshold)) {
Documentation introduced by
$operator is of type string, but the function expects a object<Phpml\Classification\Linear\type>.

It seems like the type of the argument is not accepted by the function/method which you are calling.

In some cases, in particular if PHP’s automatic type-juggling kicks in this might be fine. In other cases, however this might be a bug.

We suggest to add an explicit type cast like in the following example:

function acceptsInteger($int) { }

$x = '123'; // string "123"

// Instead of
acceptsInteger($x);

// we recommend to use
acceptsInteger((integer) $x);
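For this particular warning a cast is probably not what is needed. The reported expected type `object<Phpml\Classification\Linear\type>` appears to stem from the placeholder `@param type` annotations on `evaluate()`, which are read as a class named `type` in the current namespace. A minimal sketch of the docblock with real types follows; it is written as a stand-alone function and handles only two operators for brevity, which is an illustrative simplification rather than the class's full behaviour.

```php
<?php

declare(strict_types=1);

/**
 * Sketch of evaluate() with the placeholder "type" annotations replaced
 * by the types the callers actually pass.
 *
 * @param mixed  $leftValue
 * @param string $operator
 * @param mixed  $rightValue
 *
 * @return bool
 */
function evaluate($leftValue, string $operator, $rightValue): bool
{
    switch ($operator) {
        case '<=':
            return $leftValue <= $rightValue;
        case '>':
            return $leftValue > $rightValue;
    }

    // Operators not listed above are treated as "no match" in this sketch.
    return false;
}

$belowThreshold = evaluate(1.5, '<=', 2.0);
```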
Documentation introduced by
$threshold is of type double, but the function expects a object<Phpml\Classification\Linear\type>.

                $predicted = $leftLabel;
            } else {
                $predicted = $rightLabel;
            }

            $target = $this->targets[$index];
            if (strval($predicted) != strval($this->targets[$index])) {
                $wrong += $this->weights[$index];
            }

            if (! isset($prob[$predicted][$target])) {
                $prob[$predicted][$target] = 0;
            }
            $prob[$predicted][$target]++;
        }

        // Calculate probabilities: Proportion of labels in each leaf
        $dist = array_combine($this->binaryLabels, array_fill(0, 2, 0.0));
        foreach ($prob as $leaf => $counts) {
            $leafTotal = (float)array_sum($prob[$leaf]);
            foreach ($counts as $label => $count) {
                if (strval($leaf) == strval($label)) {
                    $dist[$leaf] = $count / $leafTotal;
                }
            }
        }

        return [$wrong / (float) array_sum($this->weights), $dist];
    }

    /**
     * Returns the probability of the sample belonging to the given label
     *
     * The probability is calculated as the proportion of the label
     * within the labels of the training samples in the decision node
     *
     * @param array $sample
     * @param mixed $label
     *
     * @return float
     */
    protected function predictProbability(array $sample, $label)
    {
        $predicted = $this->predictSampleBinary($sample);
        if (strval($predicted) == strval($label)) {
            return $this->prob[$label];
        }

        return 0.0;
    }

    /**
     * @param array $sample
     *
     * @return mixed
     */
    protected function predictSampleBinary(array $sample)
    {
        if ($this->evaluate($sample[$this->column], $this->operator, $this->value)) {
Documentation introduced by
$this->operator is of type string, but the function expects a object<Phpml\Classification\Linear\type>.

            return $this->binaryLabels[0];
        }

        return $this->binaryLabels[1];
    }

    /**
     * @return string
     */
    public function __toString()
    {
        return "IF $this->column $this->operator $this->value " .
            "THEN " . $this->binaryLabels[0] . " ".
            "ELSE " . $this->binaryLabels[1];
    }
}
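To illustrate the numerical split search described in the comments of getBestNumericalSplit(), here is a simplified, self-contained sketch. It is not the class's implementation: the function name `findBestThreshold` is hypothetical, errors are counted unweighted, only the `<=` operator is tried, the mean value is not probed first, and the input is assumed to be non-empty. It also steps by index rather than by adding `$stepSize`, because a step size of 0 (all values in the column equal) would never advance a loop of the form `$step += $stepSize`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-alone threshold search for one numeric column and two labels.
 * Samples with value <= threshold are predicted as $labels[0], all others as $labels[1].
 */
function findBestThreshold(array $values, array $targets, array $labels, int $splitCount = 100): array
{
    $min = min($values);
    $max = max($values);
    $best = ['threshold' => $min, 'errorRate' => 1.0];

    for ($i = 0; $i <= $splitCount; $i++) {
        // Equally spaced candidate between the minimum and maximum value.
        $threshold = $min + ($max - $min) * $i / $splitCount;

        $wrong = 0.0;
        foreach ($values as $index => $value) {
            $predicted = $value <= $threshold ? $labels[0] : $labels[1];
            if ($predicted !== $targets[$index]) {
                $wrong++;
            }
        }

        // Keep the candidate only if it is strictly better than the best so far.
        $errorRate = $wrong / count($values);
        if ($errorRate < $best['errorRate']) {
            $best = ['threshold' => $threshold, 'errorRate' => $errorRate];
        }
    }

    return $best;
}

$best = findBestThreshold([1.0, 2.0, 3.0, 4.0], ['a', 'a', 'b', 'b'], ['a', 'b'], 10);
```

In this example the two labels are separable, so the search ends on a threshold between 2.0 and 3.0 with an error rate of zero.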