Completed
Push — master ( cf222b...4daa0a )
by Arkadiusz
03:24
created

DecisionTree::isCategoricalColumn()   B

Complexity

Conditions 4
Paths 4

Size

Total Lines 24
Code Lines 12

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric Value
dl 0
loc 24
rs 8.6845
c 0
b 0
f 0
cc 4
eloc 12
nc 4
nop 1
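
For orientation, the heuristic measured above can be tried in isolation. The sketch below re-implements the logic of `isCategoricalColumn()` from the listing as a free function. The name `looksCategorical` is an assumption made for illustration and is not part of the class.

```php
<?php

// Standalone copy of the heuristic for experimentation (hypothetical helper name).
function looksCategorical(array $columnValues): bool
{
    $count = count($columnValues);

    // Any float value marks the column as continuous.
    if (!empty(array_filter($columnValues, 'is_float'))) {
        return false;
    }

    // Any non-numeric value (for example a plain string) marks it as categorical.
    if (count(array_filter($columnValues, 'is_numeric')) != $count) {
        return true;
    }

    // All-integer column: categorical if distinct values are at most 20% of the rows.
    return count(array_count_values($columnValues)) <= $count / 5;
}

var_dump(looksCategorical(['red', 'blue', 'red']));           // strings: true
var_dump(looksCategorical([1.5, 2.0, 3.25]));                 // floats: false
var_dump(looksCategorical([1, 2, 1, 2, 1, 2, 1, 2, 1, 2]));   // 2 distinct of 10: true
var_dump(looksCategorical([1, 2, 3, 4, 5]));                  // 5 distinct of 5: false
```

Note that an integer column is only treated as nominal when it has many repeated values, so a short column of distinct integer codes would be classified as continuous by this rule.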
1
<?php
2
3
declare(strict_types=1);
4
5
namespace Phpml\Classification;
6
7
use Phpml\Helper\Predictable;
8
use Phpml\Helper\Trainable;
9
use Phpml\Math\Statistic\Mean;
10
use Phpml\Classification\DecisionTree\DecisionTreeLeaf;
11
12
class DecisionTree implements Classifier
13
{
14
    use Trainable, Predictable;
15
16
    const CONTINUOS = 1;
17
    const NOMINAL = 2;
18
19
    /**
20
     * @var array
21
     */
22
    private $samples = [];
23
24
    /**
25
     * @var array
26
     */
27
    protected $columnTypes;
28
29
    /**
30
     * @var array
31
     */
32
    private $labels = [];
33
34
    /**
35
     * @var int
36
     */
37
    private $featureCount = 0;
38
39
    /**
40
     * @var DecisionTreeLeaf
41
     */
42
    protected $tree = null;
43
44
    /**
45
     * @var int
46
     */
47
    protected $maxDepth;
48
49
    /**
50
     * @var int
51
     */
52
    public $actualDepth = 0;
53
54
    /**
55
     * @var int
56
     */
57
    private $numUsableFeatures = 0;
58
59
    /**
60
     * @var array
61
     */
62
    private $selectedFeatures;
63
64
    /**
65
     * @var array
66
     */
67
    private $featureImportances = null;
68
69
    /**
70
     *
71
     * @var array
72
     */
73
    private $columnNames = null;
74
75
    /**
76
     * @param int $maxDepth
77
     */
78
    public function __construct($maxDepth = 10)
79
    {
80
        $this->maxDepth = $maxDepth;
81
    }
82
83
    /**
84
     * @param array $samples
85
     * @param array $targets
86
     */
87
    public function train(array $samples, array $targets)
88
    {
89
        $this->samples = array_merge($this->samples, $samples);
90
        $this->targets = array_merge($this->targets, $targets);
91
92
        $this->featureCount = count($this->samples[0]);
93
        $this->columnTypes = $this->getColumnTypes($this->samples);
94
        $this->labels = array_keys(array_count_values($this->targets));
95
        $this->tree = $this->getSplitLeaf(range(0, count($this->samples) - 1));
96
97
        // Each time the tree is trained, feature importances are reset so that
98
        // they have to be computed again from the new data
99
        $this->featureImportances = null;
Documentation Bug introduced by
It seems like null of type null is incompatible with the declared type array of property $featureImportances.

Our type inference engine has found an assignment to a property that is incompatible with the declared type of that property.

Either this assignment is in error or the assigned type should be added to the documentation/type hint for that property.
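
A minimal sketch of the second option, widening the documented type so that resetting to `null` is legal. The class `ImportanceCache` and its members are hypothetical and only illustrate the pattern; they are not part of the reviewed code.

```php
<?php

// Hypothetical class illustrating a lazily computed, resettable cache.
class ImportanceCache
{
    /**
     * Null means "not computed yet"; the doc type states that explicitly.
     *
     * @var array|null
     */
    private $importances = null;

    /** @var int Counts how often the values were recomputed. */
    public $computations = 0;

    public function reset(): void
    {
        // Legal under the widened type array|null.
        $this->importances = null;
    }

    public function get(): array
    {
        if ($this->importances === null) {
            $this->computations++;
            // Placeholder values standing in for a real computation.
            $this->importances = ['a' => 0.75, 'b' => 0.25];
        }

        return $this->importances;
    }
}
```

With the annotation widened, the reset and the `!== null` check in the getter agree with the declared type.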

100
101
        // If column names were given or computed before, there is no
102
        // need to initialize them and accidentally overwrite the previously given names
103
        if ($this->columnNames === null) {
104
            $this->columnNames = range(0, $this->featureCount - 1);
105
        } elseif (count($this->columnNames) > $this->featureCount) {
106
            $this->columnNames = array_slice($this->columnNames, 0, $this->featureCount);
107
        } elseif (count($this->columnNames) < $this->featureCount) {
108
            $this->columnNames = array_merge($this->columnNames,
109
                range(count($this->columnNames), $this->featureCount - 1));
110
        }
111
    }
112
113
    protected function getColumnTypes(array $samples)
114
    {
115
        $types = [];
116
        for ($i=0; $i<$this->featureCount; $i++) {
117
            $values = array_column($samples, $i);
118
            $isCategorical = $this->isCategoricalColumn($values);
119
            $types[] = $isCategorical ? self::NOMINAL : self::CONTINUOS;
120
        }
121
        return $types;
122
    }
123
124
    /**
125
     * @param null|array $records
126
     * @return DecisionTreeLeaf
127
     */
128
    protected function getSplitLeaf($records, $depth = 0)
129
    {
130
        $split = $this->getBestSplit($records);
Bug introduced by
It seems like $records, defined by parameter $records on line 128, can also be of type null; however, Phpml\Classification\DecisionTree::getBestSplit() only seems to accept array. Maybe add an additional type check?

This check looks at variables that have been passed in as parameters and are passed out again to other methods.

If the outgoing method call has stricter type requirements than the method itself, an issue is raised.

An additional type check may prevent trouble.
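
One way to follow that suggestion is to reject `null` before the value is passed on. The sketch below uses hypothetical function names (`splitRecords`, `bestSplit`) rather than the methods in this file.

```php
<?php

// Hypothetical stand-in for a callee that only accepts arrays.
function bestSplit(array $records): int
{
    return count($records);
}

/**
 * @param null|array $records
 */
function splitRecords($records): int
{
    // Guard: fail early instead of handing null to a stricter callee.
    if (!is_array($records)) {
        throw new InvalidArgumentException('$records must be an array.');
    }

    return bestSplit($records);
}
```

An alternative, if `null` is never a meaningful input, would be to declare the parameter as `array` and drop `null` from the doc comment.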

131
        $split->level = $depth;
132
        if ($this->actualDepth < $depth) {
133
            $this->actualDepth = $depth;
134
        }
135
136
        // Traverse all records to see if all records belong to the same class,
137
        // otherwise group the records so that we can classify the leaf
138
        // in case maximum depth is reached
139
        $leftRecords = [];
140
        $rightRecords= [];
141
        $remainingTargets = [];
142
        $prevRecord = null;
143
        $allSame = true;
144
145
        foreach ($records as $recordNo) {
Bug introduced by
The expression $records of type null|array is not guaranteed to be traversable. How about adding an additional type check?

There are different options of fixing this problem.

  1. If you want to be on the safe side, you can add an additional type-check:

    $collection = json_decode($data, true);
    if ( ! is_array($collection)) {
        throw new \RuntimeException('$collection must be an array.');
    }
    
    foreach ($collection as $item) { /** ... */ }
    
  2. If you are sure that the expression is traversable, you might want to add a doc comment cast to improve IDE auto-completion and static analysis:

    /** @var array $collection */
    $collection = json_decode($data, true);
    
    foreach ($collection as $item) { /** .. */ }
    
  3. Mark the issue as a false-positive: Just hover the remove button, in the top-right corner of this issue for more options.

146
            // Check if the previous record is the same as the current one
147
            $record = $this->samples[$recordNo];
148
            if ($prevRecord && $prevRecord != $record) {
149
                $allSame = false;
150
            }
151
            $prevRecord = $record;
152
153
            // According to the split criterion, this record will
154
            // belong to either left or the right side in the next split
155
            if ($split->evaluate($record)) {
156
                $leftRecords[] = $recordNo;
157
            } else {
158
                $rightRecords[]= $recordNo;
159
            }
160
161
            // Group remaining targets
162
            $target = $this->targets[$recordNo];
163
            if (! array_key_exists($target, $remainingTargets)) {
164
                $remainingTargets[$target] = 1;
165
            } else {
166
                $remainingTargets[$target]++;
167
            }
168
        }
169
170
        if (count($remainingTargets) == 1 || $allSame || $depth >= $this->maxDepth) {
171
            $split->isTerminal = 1;
Documentation Bug introduced by
The property $isTerminal was declared of type boolean, but 1 is of type integer. Maybe add a type cast?

This check looks for assignments to scalar types that may be of the wrong type.

To ensure the code behaves as expected, it may be a good idea to add an explicit type cast.

$answer = 42;

$correct = false;

$correct = (bool) $answer;
172
            arsort($remainingTargets);
173
            $split->classValue = key($remainingTargets);
174
        } else {
175
            if ($leftRecords) {
Bug Best Practice introduced by
The expression $leftRecords of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using ! empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.
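
A short sketch of the difference in intent. The variable names mirror the listing, but the values are made up for illustration.

```php
<?php

$leftRecords = [];
$rightRecords = [3, 7];

// Implicit conversion: an empty array is falsy, a non-empty one is truthy.
$implicitLeft = (bool) $leftRecords;
$implicitRight = (bool) $rightRecords;

// Explicit checks state the intent and give the same results for arrays.
$explicitLeft = !empty($leftRecords);
$explicitRight = !empty($rightRecords);

var_dump($implicitLeft === $explicitLeft);   // bool(true)
var_dump($implicitRight === $explicitRight); // bool(true)
```

Both forms behave identically for arrays; the explicit form only makes it clearer to a reader that emptiness, not some boolean flag, is being tested.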

176
                $split->leftLeaf = $this->getSplitLeaf($leftRecords, $depth + 1);
177
            }
178
            if ($rightRecords) {
Bug Best Practice introduced by
The expression $rightRecords of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using ! empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

179
                $split->rightLeaf= $this->getSplitLeaf($rightRecords, $depth + 1);
180
            }
181
        }
182
        return $split;
183
    }
184
185
    /**
186
     * @param array $records
187
     * @return DecisionTreeLeaf|null
188
     */
189
    protected function getBestSplit($records)
190
    {
191
        $targets = array_intersect_key($this->targets, array_flip($records));
192
        $samples = array_intersect_key($this->samples, array_flip($records));
193
        $samples = array_combine($records, $this->preprocess($samples));
194
        $bestGiniVal = 1;
195
        $bestSplit = null;
196
        $features = $this->getSelectedFeatures();
197
        foreach ($features as $i) {
198
            $colValues = [];
199
            foreach ($samples as $index => $row) {
200
                $colValues[$index] = $row[$i];
201
            }
202
            $counts = array_count_values($colValues);
203
            arsort($counts);
204
            $baseValue = key($counts);
205
            $gini = $this->getGiniIndex($baseValue, $colValues, $targets);
206
            if ($bestSplit == null || $bestGiniVal > $gini) {
207
                $split = new DecisionTreeLeaf();
208
                $split->value = $baseValue;
209
                $split->giniIndex = $gini;
Documentation Bug introduced by
It seems like $gini can also be of type integer. However, the property $giniIndex is declared as type double. Maybe add an additional type check?

Our type inference engine has found a suspicious assignment of a value to a property. This check raises an issue when a value that can be of a mixed type is assigned to a property that is type hinted more strictly.

For example, imagine you have a variable $accountId that can either hold an Id object or false (if there is no account id yet). Your code now assigns that value to the id property of an instance of the Account class. This class holds a proper account, so the id value must no longer be false.

Either this assignment is in error or a type check should be added for that assignment.

class Id
{
    public $id;

    public function __construct($id)
    {
        $this->id = $id;
    }

}

class Account
{
    /** @var  Id $id */
    public $id;
}

$account_id = false;

if (starsAreRight()) {
    $account_id = new Id(42);
}

$account = new Account();
if ($account_id instanceof Id)
{
    $account->id = $account_id;
}
210
                $split->columnIndex = $i;
211
                $split->isContinuous = $this->columnTypes[$i] == self::CONTINUOS;
212
                $split->records = $records;
213
214
                // If a numeric column is to be selected, then
215
                // the original numeric value and the selected operator
216
                // will also be saved into the leaf for future access
217
                if ($this->columnTypes[$i] == self::CONTINUOS) {
218
                    $matches = [];
219
                    preg_match("/^([<>=]{1,2})\s*(.*)/", strval($split->value), $matches);
220
                    $split->operator = $matches[1];
221
                    $split->numericValue = floatval($matches[2]);
222
                }
223
224
                $bestSplit = $split;
225
                $bestGiniVal = $gini;
226
            }
227
        }
228
        return $bestSplit;
229
    }
230
231
    /**
232
     * Returns available features/columns to the tree for the decision making
233
     * process. <br>
234
     *
235
     * If a number is given with setNumFeatures() method, then a random selection
236
     * of features up to this number is returned. <br>
237
     *
238
     * If some features are manually selected by use of setSelectedFeatures(),
239
     * then only these features are returned <br>
240
     *
241
     * If neither of the above methods was called beforehand, then all features
242
     * are returned by default.
243
     *
244
     * @return array
245
     */
246
    protected function getSelectedFeatures()
247
    {
248
        $allFeatures = range(0, $this->featureCount - 1);
249
        if ($this->numUsableFeatures == 0 && ! $this->selectedFeatures) {
Bug Best Practice introduced by
The expression $this->selectedFeatures of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

250
            return $allFeatures;
251
        }
252
253
        if ($this->selectedFeatures) {
Bug Best Practice introduced by
The expression $this->selectedFeatures of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using ! empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

254
            return $this->selectedFeatures;
255
        }
256
257
        $numFeatures = $this->numUsableFeatures;
258
        if ($numFeatures > $this->featureCount) {
259
            $numFeatures = $this->featureCount;
260
        }
261
        shuffle($allFeatures);
262
        $selectedFeatures = array_slice($allFeatures, 0, $numFeatures, false);
263
        sort($selectedFeatures);
264
265
        return $selectedFeatures;
266
    }
267
268
    /**
269
     * @param string $baseValue
270
     * @param array $colValues
271
     * @param array $targets
272
     */
273
    public function getGiniIndex($baseValue, $colValues, $targets)
274
    {
275
        $countMatrix = [];
276
        foreach ($this->labels as $label) {
277
            $countMatrix[$label] = [0, 0];
278
        }
279
        foreach ($colValues as $index => $value) {
280
            $label = $targets[$index];
281
            $rowIndex = $value == $baseValue ? 0 : 1;
282
            $countMatrix[$label][$rowIndex]++;
283
        }
284
        $giniParts = [0, 0];
285
        for ($i=0; $i<=1; $i++) {
286
            $part = 0;
287
            $sum = array_sum(array_column($countMatrix, $i));
288
            if ($sum > 0) {
289
                foreach ($this->labels as $label) {
290
                    $part += pow($countMatrix[$label][$i] / floatval($sum), 2);
291
                }
292
            }
293
            $giniParts[$i] = (1 - $part) * $sum;
294
        }
295
        return array_sum($giniParts) / count($colValues);
296
    }
297
298
    /**
299
     * @param array $samples
300
     * @return array
301
     */
302
    protected function preprocess(array $samples)
303
    {
304
        // Detect and convert continuous data column values into
305
        // discrete values by using the median as a threshold value
306
        $columns = [];
307
        for ($i=0; $i<$this->featureCount; $i++) {
308
            $values = array_column($samples, $i);
309
            if ($this->columnTypes[$i] == self::CONTINUOS) {
310
                $median = Mean::median($values);
311
                foreach ($values as &$value) {
312
                    if ($value <= $median) {
313
                        $value = "<= $median";
314
                    } else {
315
                        $value = "> $median";
316
                    }
317
                }
318
            }
319
            $columns[] = $values;
320
        }
321
        // The call below is an unusual yet simple and efficient way
322
        // to get the transpose of a 2D array
323
        return array_map(null, ...$columns);
324
    }
325
326
    /**
327
     * @param array $columnValues
328
     * @return bool
329
     */
330
    protected function isCategoricalColumn(array $columnValues)
331
    {
332
        $count = count($columnValues);
333
334
        // There are two main indicators that *may* show whether a
335
        // column is composed of discrete set of values:
336
        // 1- Column may contain string values and not float values
337
        // 2- Number of unique values in the column is only a small fraction of
338
        //    all values in that column (lower than or equal to 20% of all values)
339
        $numericValues = array_filter($columnValues, 'is_numeric');
340
        $floatValues = array_filter($columnValues, 'is_float');
341
        if ($floatValues) {
Bug Best Practice introduced by
The expression $floatValues of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using ! empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

342
            return false;
343
        }
344
        if (count($numericValues) != $count) {
345
            return true;
346
        }
347
348
        $distinctValues = array_count_values($columnValues);
349
        if (count($distinctValues) <= $count / 5) {
350
            return true;
351
        }
352
        return false;
353
    }
354
355
    /**
356
     * This method is used to set number of columns to be used
357
     * when deciding a split at an internal node of the tree.  <br>
358
     * If the given value is 0, then all features are used (default behaviour),
359
     * otherwise the given value will be used as a maximum for number of columns
360
     * randomly selected for each split operation.
361
     *
362
     * @param int $numFeatures
363
     * @return $this
364
     * @throws \Exception
365
     */
366
    public function setNumFeatures(int $numFeatures)
367
    {
368
        if ($numFeatures < 0) {
369
            throw new \Exception("Selected column count should be greater or equal to zero");
370
        }
371
372
        $this->numUsableFeatures = $numFeatures;
373
374
        return $this;
375
    }
376
377
    /**
378
     * Used to set predefined features to consider while deciding which column to use for a split
379
     *
380
     * @param array $selectedFeatures
381
     */
382
    protected function setSelectedFeatures(array $selectedFeatures)
383
    {
384
        $this->selectedFeatures = $selectedFeatures;
385
    }
386
387
    /**
388
     * A string array to represent columns. Useful when HTML output or
389
     * column importances are desired to be inspected.
390
     *
391
     * @param array $names
392
     * @return $this
393
     */
394
    public function setColumnNames(array $names)
395
    {
396
        if ($this->featureCount != 0 && count($names) != $this->featureCount) {
397
            throw new \Exception("Length of the given array should be equal to feature count ($this->featureCount)");
398
        }
399
400
        $this->columnNames = $names;
401
402
        return $this;
403
    }
404
405
    /**
406
     * @return string
407
     */
408
    public function getHtml()
409
    {
410
        return $this->tree->getHTML($this->columnNames);
411
    }
412
413
    /**
414
     * This will return an array including an importance value for
415
     * each column in the given dataset. The importance values are
416
     * normalized and their total makes 1.<br/>
417
     *
418
     * @param array $labels
Bug introduced by
There is no parameter named $labels. Was it maybe removed?

This check looks for PHPDoc comments describing methods or function parameters that do not exist on the corresponding method or function.

Consider the following example. The parameter $italy is not defined by the method finale(...).

/**
 * @param array $germany
 * @param array $island
 * @param array $italy
 */
function finale($germany, $island) {
    return "2:1";
}

The most likely cause is that the parameter was removed, but the annotation was not.

419
     * @return array
420
     */
421
    public function getFeatureImportances()
422
    {
423
        if ($this->featureImportances !== null) {
424
            return $this->featureImportances;
425
        }
426
427
        $sampleCount = count($this->samples);
428
        $this->featureImportances = [];
429
        foreach ($this->columnNames as $column => $columnName) {
430
            $nodes = $this->getSplitNodesByColumn($column, $this->tree);
431
432
            $importance = 0;
433
            foreach ($nodes as $node) {
434
                $importance += $node->getNodeImpurityDecrease($sampleCount);
435
            }
436
437
            $this->featureImportances[$columnName] = $importance;
438
        }
439
440
        // Normalize & sort the importances
441
        $total = array_sum($this->featureImportances);
442
        if ($total > 0) {
443
            foreach ($this->featureImportances as &$importance) {
444
                $importance /= $total;
445
            }
446
            arsort($this->featureImportances);
447
        }
448
449
        return $this->featureImportances;
450
    }
451
452
    /**
453
     * Collects and returns an array of internal nodes that use the given
454
     * column as a split criterion
455
     *
456
     * @param int $column
457
     * @param DecisionTreeLeaf $node
458
     * @param array $collected
Bug introduced by
There is no parameter named $collected. Was it maybe removed?

This check looks for PHPDoc comments describing methods or function parameters that do not exist on the corresponding method or function.

Consider the following example. The parameter $italy is not defined by the method finale(...).

/**
 * @param array $germany
 * @param array $island
 * @param array $italy
 */
function finale($germany, $island) {
    return "2:1";
}

The most likely cause is that the parameter was removed, but the annotation was not.

459
     *
460
     * @return array
461
     */
462
    protected function getSplitNodesByColumn($column, DecisionTreeLeaf $node)
463
    {
464
        if (!$node || $node->isTerminal) {
465
            return [];
466
        }
467
468
        $nodes = [];
469
        if ($node->columnIndex == $column) {
470
            $nodes[] = $node;
471
        }
472
473
        $lNodes = [];
474
        $rNodes = [];
475
        if ($node->leftLeaf) {
476
            $lNodes = $this->getSplitNodesByColumn($column, $node->leftLeaf);
477
        }
478
        if ($node->rightLeaf) {
479
            $rNodes = $this->getSplitNodesByColumn($column, $node->rightLeaf);
480
        }
481
        $nodes = array_merge($nodes, $lNodes, $rNodes);
482
483
        return $nodes;
484
    }
485
486
    /**
487
     * @param array $sample
488
     * @return mixed
489
     */
490
    protected function predictSample(array $sample)
491
    {
492
        $node = $this->tree;
493
        do {
494
            if ($node->isTerminal) {
495
                break;
496
            }
497
            if ($node->evaluate($sample)) {
498
                $node = $node->leftLeaf;
499
            } else {
500
                $node = $node->rightLeaf;
501
            }
502
        } while ($node);
503
504
        return $node ? $node->classValue : $this->labels[0];
505
    }
506
}
507
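
To make the split criterion in `getGiniIndex()` easier to check by hand, here is a standalone sketch of the same weighted Gini computation. The function name `weightedGini` and the explicit `$labels` argument are assumptions for illustration; in the class, the labels come from `$this->labels`.

```php
<?php

// Standalone sketch of the weighted Gini computation (hypothetical function name).
function weightedGini($baseValue, array $colValues, array $targets, array $labels): float
{
    // Per label: [count where value == base, count where value != base].
    $countMatrix = [];
    foreach ($labels as $label) {
        $countMatrix[$label] = [0, 0];
    }
    foreach ($colValues as $index => $value) {
        $side = $value == $baseValue ? 0 : 1;
        $countMatrix[$targets[$index]][$side]++;
    }

    $total = 0.0;
    for ($side = 0; $side <= 1; $side++) {
        $sum = array_sum(array_column($countMatrix, $side));
        $squares = 0.0;
        if ($sum > 0) {
            foreach ($labels as $label) {
                $squares += ($countMatrix[$label][$side] / $sum) ** 2;
            }
        }
        // Impurity of this side, weighted by the number of rows on it.
        $total += (1 - $squares) * $sum;
    }

    return $total / count($colValues);
}

$labels = ['yes', 'no'];
// A column that separates the classes perfectly has impurity 0.
echo weightedGini('a', ['a', 'a', 'b', 'b'], ['yes', 'yes', 'no', 'no'], $labels), "\n";
// A column that carries no information about the class has impurity 0.5.
echo weightedGini('a', ['a', 'a', 'b', 'b'], ['yes', 'no', 'yes', 'no'], $labels), "\n";
```

Lower values indicate a better split, which is why `getBestSplit()` keeps the candidate with the smallest Gini value.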