Completed
Push — master ( f0a798...cf222b )
by Arkadiusz
02:58
created

DecisionTree::getSelectedFeatures()   B

Complexity

Conditions 5
Paths 4

Size

Total Lines 21
Code Lines 13

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric Value
dl 0
loc 21
rs 8.7624
c 0
b 0
f 0
cc 5
eloc 13
nc 4
nop 0
1
<?php
2
3
declare(strict_types=1);
4
5
namespace Phpml\Classification;
6
7
use Phpml\Helper\Predictable;
8
use Phpml\Helper\Trainable;
9
use Phpml\Math\Statistic\Mean;
10
use Phpml\Classification\DecisionTree\DecisionTreeLeaf;
11
12
class DecisionTree implements Classifier
13
{
14
    use Trainable, Predictable;
15
16
    const CONTINUOS = 1;
17
    const NOMINAL = 2;
18
19
    /**
20
     * @var array
21
     */
22
    private $samples = [];
23
24
    /**
25
     * @var array
26
     */
27
    private $columnTypes;
28
29
    /**
30
     * @var array
31
     */
32
    private $labels = [];
33
34
    /**
35
     * @var int
36
     */
37
    private $featureCount = 0;
38
39
    /**
40
     * @var DecisionTreeLeaf
41
     */
42
    private $tree = null;
43
44
    /**
45
     * @var int
46
     */
47
    private $maxDepth;
48
49
    /**
50
     * @var int
51
     */
52
    public $actualDepth = 0;
53
54
    /**
55
     * @var int
56
     */
57
    private $numUsableFeatures = 0;
58
59
    /**
60
     * @var array
61
     */
62
    private $selectedFeatures;
63
64
    /**
65
     * @var array
66
     */
67
    private $featureImportances = null;
68
69
    /**
70
     *
71
     * @var array
72
     */
73
    private $columnNames = null;
74
75
    /**
76
     * @param int $maxDepth
77
     */
78
    public function __construct($maxDepth = 10)
79
    {
80
        $this->maxDepth = $maxDepth;
81
    }
82
    /**
83
     * @param array $samples
84
     * @param array $targets
85
     */
86
    public function train(array $samples, array $targets)
87
    {
88
        $this->samples = array_merge($this->samples, $samples);
89
        $this->targets = array_merge($this->targets, $targets);
90
91
        $this->featureCount = count($this->samples[0]);
92
        $this->columnTypes = $this->getColumnTypes($this->samples);
93
        $this->labels = array_keys(array_count_values($this->targets));
94
        $this->tree = $this->getSplitLeaf(range(0, count($this->samples) - 1));
95
96
        // Each time the tree is trained, feature importances are reset so that
97
        // they are recomputed from the new data
98
        $this->featureImportances = null;
Documentation Bug introduced by
It seems like null of type null is incompatible with the declared type array of property $featureImportances.

Our type inference engine has found an assignment to a property that is incompatible with the declared type of that property.

Either this assignment is in error or the assigned type should be added to the documentation/type hint for that property.

99
100
        // If column names were already given or computed, do not reinitialize
101
        // them, so that previously supplied names are not accidentally lost
102
        if ($this->columnNames === null) {
103
            $this->columnNames = range(0, $this->featureCount - 1);
104
        } elseif (count($this->columnNames) > $this->featureCount) {
105
            $this->columnNames = array_slice($this->columnNames, 0, $this->featureCount);
106
        } elseif (count($this->columnNames) < $this->featureCount) {
107
            $this->columnNames = array_merge($this->columnNames,
108
                range(count($this->columnNames), $this->featureCount - 1));
109
        }
110
    }
111
112
    protected function getColumnTypes(array $samples)
113
    {
114
        $types = [];
115
        for ($i=0; $i<$this->featureCount; $i++) {
116
            $values = array_column($samples, $i);
117
            $isCategorical = $this->isCategoricalColumn($values);
118
            $types[] = $isCategorical ? self::NOMINAL : self::CONTINUOS;
119
        }
120
        return $types;
121
    }
122
123
    /**
124
     * @param null|array $records
125
     * @return DecisionTreeLeaf
126
     */
127
    protected function getSplitLeaf($records, $depth = 0)
128
    {
129
        $split = $this->getBestSplit($records);
Bug introduced by
It seems like $records, defined by parameter $records on line 127, can also be of type null; however, Phpml\Classification\DecisionTree::getBestSplit() seems to accept only array. Maybe add an additional type check?

This check looks at variables that have been passed in as parameters and are passed out again to other methods.

If the outgoing method call has stricter type requirements than the method itself, an issue is raised.

An additional type check may prevent trouble.
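
A minimal sketch of such a type check, under stated assumptions: `splitLeafSketch()` and `bestSplitSketch()` are hypothetical stand-ins for the two methods named above, reduced to plain functions so the example runs on its own. They are not part of the reviewed class.

```php
<?php
// Hypothetical stand-in for the stricter callee: it accepts only an array.
function bestSplitSketch(array $records): int
{
    return count($records);
}

// Hypothetical stand-in for the caller, whose parameter may also be null.
function splitLeafSketch(?array $records): int
{
    // Reject null before handing the value to the stricter callee.
    if (!is_array($records)) {
        throw new InvalidArgumentException('$records must be an array.');
    }

    return bestSplitSketch($records);
}

$recordCount = splitLeafSketch([4, 7, 9]);
```

Alternatively, if null is never a valid argument, declaring the parameter as `array` in the signature would make the extra check unnecessary.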

130
        $split->level = $depth;
131
        if ($this->actualDepth < $depth) {
132
            $this->actualDepth = $depth;
133
        }
134
135
        // Traverse all records to see if all records belong to the same class,
136
        // otherwise group the records so that we can classify the leaf
137
        // in case maximum depth is reached
138
        $leftRecords = [];
139
        $rightRecords = [];
140
        $remainingTargets = [];
141
        $prevRecord = null;
142
        $allSame = true;
143
144
        foreach ($records as $recordNo) {
Bug introduced by
The expression $records of type null|array is not guaranteed to be traversable. How about adding an additional type check?

There are different options of fixing this problem.

  1. If you want to be on the safe side, you can add an additional type-check:

    $collection = json_decode($data, true);
    if ( ! is_array($collection)) {
        throw new \RuntimeException('$collection must be an array.');
    }
    
    foreach ($collection as $item) { /** ... */ }
    
  2. If you are sure that the expression is traversable, you might want to add a doc comment cast to improve IDE auto-completion and static analysis:

    /** @var array $collection */
    $collection = json_decode($data, true);
    
    foreach ($collection as $item) { /** .. */ }
    
  3. Mark the issue as a false-positive: Just hover the remove button, in the top-right corner of this issue for more options.

145
            // Check if the previous record is the same as the current one
146
            $record = $this->samples[$recordNo];
147
            if ($prevRecord && $prevRecord != $record) {
148
                $allSame = false;
149
            }
150
            $prevRecord = $record;
151
152
            // According to the split criterion, this record will
153
            // belong to either the left or the right side in the next split
154
            if ($split->evaluate($record)) {
155
                $leftRecords[] = $recordNo;
156
            } else {
157
                $rightRecords[] = $recordNo;
158
            }
159
160
            // Group remaining targets
161
            $target = $this->targets[$recordNo];
162
            if (! array_key_exists($target, $remainingTargets)) {
163
                $remainingTargets[$target] = 1;
164
            } else {
165
                $remainingTargets[$target]++;
166
            }
167
        }
168
169
        if (count($remainingTargets) == 1 || $allSame || $depth >= $this->maxDepth) {
170
            $split->isTerminal = 1;
Documentation Bug introduced by
The property $isTerminal was declared of type boolean, but 1 is of type integer. Maybe add a type cast?

This check looks for assignments to scalar types that may be of the wrong type.

To ensure the code behaves as expected, it may be a good idea to add an explicit type cast.

$answer = 42;

$correct = false;

$correct = (bool) $answer;
171
            arsort($remainingTargets);
172
            $split->classValue = key($remainingTargets);
173
        } else {
174
            if ($leftRecords) {
Bug Best Practice introduced by
The expression $leftRecords of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using ! empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.
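
A short illustration of the difference, as a sketch only: the variable name `$leftRecords` is borrowed from the listing, but the snippet is standalone and not part of the reviewed file.

```php
<?php
// Stand-in for the $leftRecords array built in the listing.
$leftRecords = [];

// An empty array is loosely equal to false, but not identical to it.
$looselyFalse  = ($leftRecords == false);
$strictlyFalse = ($leftRecords === false);

// Explicit form suggested by the check: same branch taken, clearer intent.
$hasRecords = !empty($leftRecords);

// Once an element exists, the explicit check passes.
$leftRecords[] = 3;
$hasRecordsAfterPush = !empty($leftRecords);
```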

175
                $split->leftLeaf = $this->getSplitLeaf($leftRecords, $depth + 1);
176
            }
177
            if ($rightRecords) {
Bug Best Practice introduced by
The expression $rightRecords of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using ! empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

178
                $split->rightLeaf = $this->getSplitLeaf($rightRecords, $depth + 1);
179
            }
180
        }
181
        return $split;
182
    }
183
184
    /**
185
     * @param array $records
186
     * @return DecisionTreeLeaf
187
     */
188
    protected function getBestSplit($records)
189
    {
190
        $targets = array_intersect_key($this->targets, array_flip($records));
191
        $samples = array_intersect_key($this->samples, array_flip($records));
192
        $samples = array_combine($records, $this->preprocess($samples));
193
        $bestGiniVal = 1;
194
        $bestSplit = null;
195
        $features = $this->getSelectedFeatures();
196
        foreach ($features as $i) {
197
            $colValues = [];
198
            foreach ($samples as $index => $row) {
199
                $colValues[$index] = $row[$i];
200
            }
201
            $counts = array_count_values($colValues);
202
            arsort($counts);
203
            $baseValue = key($counts);
204
            $gini = $this->getGiniIndex($baseValue, $colValues, $targets);
205
            if ($bestSplit == null || $bestGiniVal > $gini) {
206
                $split = new DecisionTreeLeaf();
207
                $split->value = $baseValue;
208
                $split->giniIndex = $gini;
Documentation Bug introduced by
It seems like $gini can also be of type integer. However, the property $giniIndex is declared as type double. Maybe add an additional type check?

Our type inference engine has found a suspicious assignment of a value to a property. This check raises an issue when a value that can be of a mixed type is assigned to a property that is type hinted more strictly.

For example, imagine you have a variable $accountId that can either hold an Id object or false (if there is no account id yet). Your code now assigns that value to the id property of an instance of the Account class. This class holds a proper account, so the id value must no longer be false.

Either this assignment is in error or a type check should be added for that assignment.

class Id
{
    public $id;

    public function __construct($id)
    {
        $this->id = $id;
    }

}

class Account
{
    /** @var  Id $id */
    public $id;
}

$account_id = false;

if (starsAreRight()) {
    $account_id = new Id(42);
}

$account = new Account();
if ($account_id instanceof Id)
{
    $account->id = $account_id;
}
209
                $split->columnIndex = $i;
210
                $split->isContinuous = $this->columnTypes[$i] == self::CONTINUOS;
211
                $split->records = $records;
212
                $bestSplit = $split;
213
                $bestGiniVal = $gini;
214
            }
215
        }
216
        return $bestSplit;
217
    }
218
219
    /**
220
     * Returns available features/columns to the tree for the decision making
221
     * process. <br>
222
     *
223
     * If a number is given with the setNumFeatures() method, then a random selection
224
     * of features up to this number is returned. <br>
225
     *
226
     * If some features are manually selected by use of setSelectedFeatures(),
227
     * then only these features are returned <br>
228
     *
229
     * If neither of the above methods was called beforehand, then all features
230
     * are returned by default.
231
     *
232
     * @return array
233
     */
234
    protected function getSelectedFeatures()
235
    {
236
        $allFeatures = range(0, $this->featureCount - 1);
237
        if ($this->numUsableFeatures == 0 && ! $this->selectedFeatures) {
Bug Best Practice introduced by
The expression $this->selectedFeatures of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

238
            return $allFeatures;
239
        }
240
241
        if ($this->selectedFeatures) {
Bug Best Practice introduced by
The expression $this->selectedFeatures of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using ! empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

242
            return $this->selectedFeatures;
243
        }
244
245
        $numFeatures = $this->numUsableFeatures;
246
        if ($numFeatures > $this->featureCount) {
247
            $numFeatures = $this->featureCount;
248
        }
249
        shuffle($allFeatures);
250
        $selectedFeatures = array_slice($allFeatures, 0, $numFeatures, false);
251
        sort($selectedFeatures);
252
253
        return $selectedFeatures;
254
    }
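
The random-subset branch of the method above can be tried in isolation. The following is a sketch under stated assumptions: `randomFeatureSubset()` is a hypothetical standalone helper, and it takes the feature count and the requested subset size as arguments instead of reading them from the object.

```php
<?php
/**
 * Picks at most $numFeatures distinct column indices out of $featureCount
 * and returns them in ascending order. Hypothetical helper for illustration;
 * assumes $featureCount >= 1.
 */
function randomFeatureSubset(int $featureCount, int $numFeatures): array
{
    $all = range(0, $featureCount - 1);
    shuffle($all);

    // Never ask for more features than exist.
    $subset = array_slice($all, 0, min($numFeatures, $featureCount));
    sort($subset);

    return $subset;
}

$subset  = randomFeatureSubset(6, 3);   // three distinct indices from 0..5
$clamped = randomFeatureSubset(3, 10);  // request exceeds the feature count
```

Because the selection is random, only the size, range, and ordering of `$subset` are predictable. When the request exceeds the feature count, every index is returned.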
255
256
    /**
257
     * @param string $baseValue
258
     * @param array $colValues
259
     * @param array $targets
260
     */
261
    public function getGiniIndex($baseValue, $colValues, $targets)
262
    {
263
        $countMatrix = [];
264
        foreach ($this->labels as $label) {
265
            $countMatrix[$label] = [0, 0];
266
        }
267
        foreach ($colValues as $index => $value) {
268
            $label = $targets[$index];
269
            $rowIndex = $value == $baseValue ? 0 : 1;
270
            $countMatrix[$label][$rowIndex]++;
271
        }
272
        $giniParts = [0, 0];
273
        for ($i=0; $i<=1; $i++) {
274
            $part = 0;
275
            $sum = array_sum(array_column($countMatrix, $i));
276
            if ($sum > 0) {
277
                foreach ($this->labels as $label) {
278
                    $part += pow($countMatrix[$label][$i] / floatval($sum), 2);
279
                }
280
            }
281
            $giniParts[$i] = (1 - $part) * $sum;
282
        }
283
        return array_sum($giniParts) / count($colValues);
284
    }
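
To make the computation above easier to follow, here is a standalone sketch of the same weighted Gini calculation with two worked inputs. Each side of the split contributes (1 - sum of squared class proportions) weighted by the number of rows on that side, and the total is divided by the row count. `giniIndexSketch()` is a hypothetical function name, and passing `$labels` as a parameter is an assumption made so the example does not depend on the class.

```php
<?php
/**
 * Weighted Gini impurity of the binary split "value == $baseValue" vs. the rest.
 * Standalone sketch; $labels is passed in explicitly (an assumption).
 */
function giniIndexSketch($baseValue, array $colValues, array $targets, array $labels): float
{
    // $counts[label] = [rows matching the base value, rows not matching it]
    $counts = [];
    foreach ($labels as $label) {
        $counts[$label] = [0, 0];
    }
    foreach ($colValues as $index => $value) {
        $side = ($value == $baseValue) ? 0 : 1;
        $counts[$targets[$index]][$side]++;
    }

    $weighted = 0.0;
    for ($side = 0; $side <= 1; $side++) {
        $sum = array_sum(array_column($counts, $side));
        $squares = 0.0;
        if ($sum > 0) {
            foreach ($labels as $label) {
                $squares += ($counts[$label][$side] / $sum) ** 2;
            }
        }
        // Impurity of this side, weighted by the number of rows on it.
        $weighted += (1 - $squares) * $sum;
    }

    return $weighted / count($colValues);
}

$labels  = ['yes', 'no'];
$targets = ['yes', 'yes', 'no', 'no'];

// A column that separates the two classes perfectly: impurity 0.
$pure = giniIndexSketch('a', ['a', 'a', 'b', 'b'], $targets, $labels);

// A column that carries no information about two balanced classes: impurity 0.5.
$mixed = giniIndexSketch('a', ['a', 'b', 'a', 'b'], $targets, $labels);
```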
285
286
    /**
287
     * @param array $samples
288
     * @return array
289
     */
290
    protected function preprocess(array $samples)
291
    {
292
        // Detect and convert continuous data column values into
293
        // discrete values by using the median as a threshold value
294
        $columns = [];
295
        for ($i=0; $i<$this->featureCount; $i++) {
296
            $values = array_column($samples, $i);
297
            if ($this->columnTypes[$i] == self::CONTINUOS) {
298
                $median = Mean::median($values);
299
                foreach ($values as &$value) {
300
                    if ($value <= $median) {
301
                        $value = "<= $median";
302
                    } else {
303
                        $value = "> $median";
304
                    }
305
                }
306
            }
307
            $columns[] = $values;
308
        }
309
        // The call below is an unusual yet simple and efficient way
310
        // to transpose a 2D array
311
        return array_map(null, ...$columns);
312
    }
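
The two steps of the method above, median thresholding and transposition, can be sketched separately. Assumptions: `discretizeByMedianSketch()` is a hypothetical helper with its own simple median (it is not claimed to match `Mean::median()` in every case), and it expects a non-empty numeric array.

```php
<?php
// Hypothetical helper: replaces each numeric value by a "<= m" or "> m" label,
// where m is the median of the column. Assumes a non-empty numeric array.
function discretizeByMedianSketch(array $values): array
{
    $sorted = $values;
    sort($sorted);
    $n = count($sorted);
    $mid = intdiv($n, 2);
    $median = ($n % 2 === 1)
        ? $sorted[$mid]
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;

    $labels = [];
    foreach ($values as $value) {
        $labels[] = ($value <= $median) ? "<= $median" : "> $median";
    }

    return $labels;
}

$discrete = discretizeByMedianSketch([1, 2, 3, 10]);

// Two feature columns, stored column by column.
$columns = [
    [1, 2, 3],
    ['x', 'y', 'z'],
];

// array_map() with a null callback zips its array arguments,
// which turns the columns back into rows.
$rows = array_map(null, ...$columns);

// Caveat: with a single array argument, array_map() returns that array
// unchanged, so a one-column dataset is not wrapped into one-element rows.
$single = array_map(null, ...[[1, 2, 3]]);
```

The single-column case is worth keeping in mind when a dataset has only one feature, since the result is then a flat list rather than a list of rows.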
313
314
    /**
315
     * @param array $columnValues
316
     * @return bool
317
     */
318
    protected function isCategoricalColumn(array $columnValues)
319
    {
320
        $count = count($columnValues);
321
        // There are two main indicators that *may* show whether a
322
        // column is composed of discrete set of values:
323
        // 1- Column may contain string values
324
        // 2- Number of unique values in the column is only a small fraction of
325
        //    all values in that column (lower than or equal to 20% of all values)
326
        $numericValues = array_filter($columnValues, 'is_numeric');
327
        if (count($numericValues) != $count) {
328
            return true;
329
        }
330
        $distinctValues = array_count_values($columnValues);
331
        if (count($distinctValues) <= $count / 5) {
332
            return true;
333
        }
334
        return false;
335
    }
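
The heuristic above can be exercised on small inputs. This is a sketch with a hypothetical function name, `looksCategorical()`. Since `array_count_values()` counts only integer and string values, the numeric examples here stick to integers.

```php
<?php
// Hypothetical standalone version of the two-indicator heuristic.
function looksCategorical(array $columnValues): bool
{
    $count = count($columnValues);

    // Indicator 1: any non-numeric entry marks the column as categorical.
    if (count(array_filter($columnValues, 'is_numeric')) !== $count) {
        return true;
    }

    // Indicator 2: distinct values make up at most 20% of all values.
    return count(array_count_values($columnValues)) <= $count / 5;
}

$withStrings = looksCategorical(['red', 'green', 'red']);
$fewDistinct = looksCategorical([0, 1, 0, 1, 0, 1, 0, 1, 0, 1]);   // 2 distinct of 10
$allDistinct = looksCategorical([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]);  // 10 distinct of 10
```

The first two columns are treated as categorical, the third as continuous.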
336
337
    /**
338
     * This method is used to set number of columns to be used
339
     * when deciding a split at an internal node of the tree.  <br>
340
     * If the given value is 0, then all features are used (default behaviour),
341
     * otherwise the given value will be used as a maximum for number of columns
342
     * randomly selected for each split operation.
343
     *
344
     * @param int $numFeatures
345
     * @return $this
346
     * @throws \Exception
347
     */
348
    public function setNumFeatures(int $numFeatures)
349
    {
350
        if ($numFeatures < 0) {
351
            throw new \Exception("Selected column count should be greater than or equal to zero");
352
        }
353
354
        $this->numUsableFeatures = $numFeatures;
355
356
        return $this;
357
    }
358
359
    /**
360
     * Used to set predefined features to consider while deciding which column to use for a split.
361
     *
362
     * @param array $features
Documentation introduced by
There is no parameter named $features. Did you maybe mean $selectedFeatures?

This check looks for PHPDoc comments describing methods or function parameters that do not exist on the corresponding method or function. It has, however, found a similar but not annotated parameter which might be a good fit.

Consider the following example. The parameter $ireland is not defined by the method finale(...).

/**
 * @param array $germany
 * @param array $ireland
 */
function finale($germany, $island) {
    return "2:1";
}

The most likely cause is that the parameter was changed, but the annotation was not.

363
     */
364
    protected function setSelectedFeatures(array $selectedFeatures)
365
    {
366
        $this->selectedFeatures = $selectedFeatures;
367
    }
368
369
    /**
370
     * A string array to represent columns. Useful when HTML output or
371
     * column importances are desired to be inspected.
372
     *
373
     * @param array $names
374
     * @return $this
375
     */
376
    public function setColumnNames(array $names)
377
    {
378
        if ($this->featureCount != 0 && count($names) != $this->featureCount) {
379
            throw new \Exception("Length of the given array should be equal to feature count ($this->featureCount)");
380
        }
381
382
        $this->columnNames = $names;
383
384
        return $this;
385
    }
386
387
    /**
388
     * @return string
389
     */
390
    public function getHtml()
391
    {
392
        return $this->tree->getHTML($this->columnNames);
393
    }
394
395
    /**
396
     * This will return an array including an importance value for
397
     * each column in the given dataset. The importance values are
398
     * normalized so that they sum to 1.<br/>
399
     *
400
     * @param array $labels
Bug introduced by
There is no parameter named $labels. Was it maybe removed?

This check looks for PHPDoc comments describing methods or function parameters that do not exist on the corresponding method or function.

Consider the following example. The parameter $italy is not defined by the method finale(...).

/**
 * @param array $germany
 * @param array $island
 * @param array $italy
 */
function finale($germany, $island) {
    return "2:1";
}

The most likely cause is that the parameter was removed, but the annotation was not.

401
     * @return array
402
     */
403
    public function getFeatureImportances()
404
    {
405
        if ($this->featureImportances !== null) {
406
            return $this->featureImportances;
407
        }
408
409
        $sampleCount = count($this->samples);
410
        $this->featureImportances = [];
411
        foreach ($this->columnNames as $column => $columnName) {
412
            $nodes = $this->getSplitNodesByColumn($column, $this->tree);
413
414
            $importance = 0;
415
            foreach ($nodes as $node) {
416
                $importance += $node->getNodeImpurityDecrease($sampleCount);
417
            }
418
419
            $this->featureImportances[$columnName] = $importance;
420
        }
421
422
        // Normalize & sort the importances
423
        $total = array_sum($this->featureImportances);
424
        if ($total > 0) {
425
            foreach ($this->featureImportances as &$importance) {
426
                $importance /= $total;
427
            }
428
            arsort($this->featureImportances);
429
        }
430
431
        return $this->featureImportances;
432
    }
433
434
    /**
435
     * Collects and returns an array of internal nodes that use the given
436
     * column as a split criterion
437
     *
438
     * @param int $column
439
     * @param DecisionTreeLeaf $node
440
     * @param array $collected
Bug introduced by
There is no parameter named $collected. Was it maybe removed?

This check looks for PHPDoc comments describing methods or function parameters that do not exist on the corresponding method or function.

Consider the following example. The parameter $italy is not defined by the method finale(...).

/**
 * @param array $germany
 * @param array $island
 * @param array $italy
 */
function finale($germany, $island) {
    return "2:1";
}

The most likely cause is that the parameter was removed, but the annotation was not.

441
     *
442
     * @return array
443
     */
444
    protected function getSplitNodesByColumn($column, DecisionTreeLeaf $node)
445
    {
446
        if (!$node || $node->isTerminal) {
447
            return [];
448
        }
449
450
        $nodes = [];
451
        if ($node->columnIndex == $column) {
452
            $nodes[] = $node;
453
        }
454
455
        $lNodes = [];
456
        $rNodes = [];
457
        if ($node->leftLeaf) {
458
            $lNodes = $this->getSplitNodesByColumn($column, $node->leftLeaf);
459
        }
460
        if ($node->rightLeaf) {
461
            $rNodes = $this->getSplitNodesByColumn($column, $node->rightLeaf);
462
        }
463
        $nodes = array_merge($nodes, $lNodes, $rNodes);
464
465
        return $nodes;
466
    }
467
468
    /**
469
     * @param array $sample
470
     * @return mixed
471
     */
472
    protected function predictSample(array $sample)
473
    {
474
        $node = $this->tree;
475
        do {
476
            if ($node->isTerminal) {
477
                break;
478
            }
479
            if ($node->evaluate($sample)) {
480
                $node = $node->leftLeaf;
481
            } else {
482
                $node = $node->rightLeaf;
483
            }
484
        } while ($node);
485
486
        return $node ? $node->classValue : $this->labels[0];
487
    }
488
}
489