Test Failed
Pull Request — master (#100)
by Никита
03:55
created

DecisionTree::arrayCountValues()   A

Complexity

Conditions 4
Paths 3

Size

Total Lines 10
Code Lines 6

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric Value
dl 0
loc 10
rs 9.2
c 0
b 0
f 0
cc 4
eloc 6
nc 3
nop 1
1
<?php
2
3
declare(strict_types=1);
4
5
namespace Phpml\Classification;
6
7
use Phpml\Exception\InvalidArgumentException;
8
use Phpml\Helper\Predictable;
9
use Phpml\Helper\Trainable;
10
use Phpml\Math\Statistic\Mean;
11
use Phpml\Classification\DecisionTree\DecisionTreeLeaf;
12
13
class DecisionTree implements Classifier
14
{
15
    use Trainable, Predictable;
16
17
    const CONTINUOUS = 1;
18
    const NOMINAL = 2;
19
20
    /**
21
     * @var array
22
     */
23
    protected $columnTypes = [];
24
25
    /**
26
     * @var array
27
     */
28
    private $labels = [];
29
30
    /**
31
     * @var int
32
     */
33
    private $featureCount = 0;
34
35
    /**
36
     * @var DecisionTreeLeaf
37
     */
38
    protected $tree = null;
39
40
    /**
41
     * @var int
42
     */
43
    protected $maxDepth;
44
45
    public static $categoricalColumnMinimumUniqueValueCount = 0.2;
46
47
    /**
48
     * @var int
49
     */
50
    public $actualDepth = 0;
51
52
    /**
53
     * @var int
54
     */
55
    private $numUsableFeatures = 0;
56
57
    /**
58
     * @var array
59
     */
60
    private $selectedFeatures;
61
62
    /**
63
     * @var array
64
     */
65
    private $featureImportances = null;
66
67
    /**
68
     *
69
     * @var array
70
     */
71
    private $columnNames = null;
72
73
    /**
74
     * @param int $maxDepth
75
     */
76
    public function __construct(int $maxDepth = 10)
77
    {
78
        $this->maxDepth = $maxDepth;
79
    }
80
81
    /**
82
     * @param array $samples
83
     * @param array $targets
84
     */
85
    public function train(array $samples, array $targets)
86
    {
87
        $this->samples = array_merge($this->samples, $samples);
88
        $this->targets = array_merge($this->targets, $targets);
89
90
        $this->featureCount = count($this->samples[0]);
91
        if (count($this->columnTypes) != $this->featureCount) {
92
            $this->columnTypes = self::getColumnTypes($this->samples);
93
        } else {
94
            foreach (self::getColumnTypes($this->samples) as $key => $value) {
95
                if (is_null($this->columnTypes[$key])) {
96
                    $this->columnTypes[$key] = $value;
97
                }
98
            }
99
        }
100
        unset($key, $value);
101
        $this->labels = array_keys(array_count_values($this->targets));
102
        $this->tree = $this->getSplitLeaf(range(0, count($this->samples) - 1));
103
104
        // Each time the tree is trained, feature importances are reset so that
105
        // we will have to compute it again depending on the new data
106
        $this->featureImportances = null;
0 ignored issues
show
Documentation Bug introduced by
It seems like null of type null is incompatible with the declared type array of property $featureImportances.

Our type inference engine has found an assignment to a property that is incompatible with the declared type of that property.

Either this assignment is in error or the assigned type should be added to the documentation/type hint for that property..

Loading history...
107
108
        // If column names are given or computed before, then there is no
109
        // need to init it and accidentally remove the previous given names
110
        if ($this->columnNames === null) {
111
            $this->columnNames = range(0, $this->featureCount - 1);
112
        } elseif (count($this->columnNames) > $this->featureCount) {
113
            $this->columnNames = array_slice($this->columnNames, 0, $this->featureCount);
114
        } elseif (count($this->columnNames) < $this->featureCount) {
115
            $this->columnNames = array_merge($this->columnNames,
116
                range(count($this->columnNames), $this->featureCount - 1)
117
            );
118
        }
119
    }
120
121
    /**
122
     * @param array $samples
123
     *
124
     * @return array
125
     */
126
    public static function getColumnTypes(array $samples) : array
127
    {
128
        $types = [];
129
        $featureCount = count($samples[0]);
130
        for ($i = 0; $i < $featureCount; ++$i) {
131
            $values = array_column($samples, $i);
132
            $isCategorical = self::isCategoricalColumn($values);
133
            $types[] = $isCategorical ? self::NOMINAL : self::CONTINUOUS;
134
        }
135
136
        return $types;
137
    }
138
139
    /**
140
     * @param array $records
141
     * @param int   $depth
142
     *
143
     * @return DecisionTreeLeaf
144
     */
145
    protected function getSplitLeaf(array $records, int $depth = 0) : DecisionTreeLeaf
146
    {
147
        $split = $this->getBestSplit($records);
148
        $split->level = $depth;
149
        if ($this->actualDepth < $depth) {
150
            $this->actualDepth = $depth;
151
        }
152
153
        // Traverse all records to see if all records belong to the same class,
154
        // otherwise group the records so that we can classify the leaf
155
        // in case maximum depth is reached
156
        $leftRecords = [];
157
        $rightRecords= [];
158
        $remainingTargets = [];
159
        $prevRecord = null;
160
        $allSame = true;
161
162
        foreach ($records as $recordNo) {
163
            // Check if the previous record is the same with the current one
164
            $record = $this->samples[$recordNo];
165
            if ($prevRecord && $prevRecord != $record) {
166
                $allSame = false;
167
            }
168
            $prevRecord = $record;
169
170
            // According to the split criteron, this record will
171
            // belong to either left or the right side in the next split
172
            if ($split->evaluate($record)) {
173
                $leftRecords[] = $recordNo;
174
            } else {
175
                $rightRecords[]= $recordNo;
176
            }
177
178
            // Group remaining targets
179
            $target = $this->targets[$recordNo];
180
            if (!array_key_exists($target, $remainingTargets)) {
181
                $remainingTargets[$target] = 1;
182
            } else {
183
                ++$remainingTargets[$target];
184
            }
185
        }
186
187
        if ($allSame || $depth >= $this->maxDepth || count($remainingTargets) === 1) {
188
            $split->isTerminal = 1;
0 ignored issues
show
Documentation Bug introduced by
The property $isTerminal was declared of type boolean, but 1 is of type integer. Maybe add a type cast?

This check looks for assignments to scalar types that may be of the wrong type.

To ensure the code behaves as expected, it may be a good idea to add an explicit type cast.

$answer = 42;

$correct = false;

$correct = (bool) $answer;
Loading history...
189
            arsort($remainingTargets);
190
            $split->classValue = key($remainingTargets);
191
        } else {
192
            if ($leftRecords) {
0 ignored issues
show
Bug Best Practice introduced by
The expression $leftRecords of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using ! empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

Loading history...
193
                $split->leftLeaf = $this->getSplitLeaf($leftRecords, $depth + 1);
194
            }
195
            if ($rightRecords) {
0 ignored issues
show
Bug Best Practice introduced by
The expression $rightRecords of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using ! empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

Loading history...
196
                $split->rightLeaf= $this->getSplitLeaf($rightRecords, $depth + 1);
197
            }
198
        }
199
200
        return $split;
201
    }
202
203
    /**
204
     * @param array $records
205
     *
206
     * @return DecisionTreeLeaf|null
207
     */
208
    protected function getBestSplit(array $records) : DecisionTreeLeaf
209
    {
210
        $targets = array_intersect_key($this->targets, array_flip($records));
211
        $samples = array_intersect_key($this->samples, array_flip($records));
212
        $samples = array_combine($records, $this->preprocess($samples));
213
        $bestGiniVal = 1;
214
        $bestSplit = null;
215
        $features = $this->getSelectedFeatures();
216
        foreach ($features as $i) {
217
            $colValues = [];
218
            foreach ($samples as $index => $row) {
219
                if (!is_null($row[$i])) {
220
                    $colValues[$index] = $row[$i];
221
                }
222
            }
223
            $counts = self::arrayCountValues($colValues);
224
            arsort($counts);
225
            $baseValue = key($counts);
226
            $gini = $this->getGiniIndex($baseValue, $colValues, $targets);
227
            if (!is_null($gini) and (is_null($bestSplit) or ($bestGiniVal > $gini))) {
0 ignored issues
show
Comprehensibility Best Practice introduced by
Using logical operators such as and instead of && is generally not recommended.

PHP has two types of connecting operators (logical operators, and boolean operators):

  Logical Operators Boolean Operator
AND - meaning and &&
OR - meaning or ||

The difference between these is the order in which they are executed. In most cases, you would want to use a boolean operator like &&, or ||.

Let’s take a look at a few examples:

// Logical operators have lower precedence:
$f = false or true;

// is executed like this:
($f = false) or true;


// Boolean operators have higher precedence:
$f = false || true;

// is executed like this:
$f = (false || true);

Logical Operators are used for Control-Flow

One case where you explicitly want to use logical operators is for control-flow such as this:

$x === 5
    or die('$x must be 5.');

// Instead of
if ($x !== 5) {
    die('$x must be 5.');
}

Since die introduces problems of its own, f.e. it makes our code hardly testable, and prevents any kind of more sophisticated error handling; you probably do not want to use this in real-world code. Unfortunately, logical operators cannot be combined with throw at this point:

// The following is currently a parse error.
$x === 5
    or throw new RuntimeException('$x must be 5.');

These limitations lead to logical operators rarely being of use in current PHP code.

Loading history...
Comprehensibility Best Practice introduced by
Using logical operators such as or instead of || is generally not recommended.

PHP has two types of connecting operators (logical operators, and boolean operators):

  Logical Operators Boolean Operator
AND - meaning and &&
OR - meaning or ||

The difference between these is the order in which they are executed. In most cases, you would want to use a boolean operator like &&, or ||.

Let’s take a look at a few examples:

// Logical operators have lower precedence:
$f = false or true;

// is executed like this:
($f = false) or true;


// Boolean operators have higher precedence:
$f = false || true;

// is executed like this:
$f = (false || true);

Logical Operators are used for Control-Flow

One case where you explicitly want to use logical operators is for control-flow such as this:

$x === 5
    or die('$x must be 5.');

// Instead of
if ($x !== 5) {
    die('$x must be 5.');
}

Since die introduces problems of its own, f.e. it makes our code hardly testable, and prevents any kind of more sophisticated error handling; you probably do not want to use this in real-world code. Unfortunately, logical operators cannot be combined with throw at this point:

// The following is currently a parse error.
$x === 5
    or throw new RuntimeException('$x must be 5.');

These limitations lead to logical operators rarely being of use in current PHP code.

Loading history...
228
                $split = new DecisionTreeLeaf();
229
                $split->value = $baseValue;
230
                $split->giniIndex = $gini;
231
                $split->columnIndex = $i;
232
                $split->isContinuous = $this->columnTypes[$i] == self::CONTINUOUS;
233
                $split->records = $records;
234
235
                // If a numeric column is to be selected, then
236
                // the original numeric value and the selected operator
237
                // will also be saved into the leaf for future access
238
                if ($this->columnTypes[$i] == self::CONTINUOUS) {
239
                    $matches = [];
240
                    preg_match("/^([<>=]{1,2})\s*(.*)/", strval($split->value), $matches);
241
                    $split->operator = $matches[1];
242
                    $split->numericValue = floatval($matches[2]);
243
                }
244
245
                $bestSplit = $split;
246
                $bestGiniVal = $gini;
247
            }
248
        }
249
250
        return $bestSplit;
251
    }
252
253
    /**
254
     * Returns available features/columns to the tree for the decision making
255
     * process. <br>
256
     *
257
     * If a number is given with setNumFeatures() method, then a random selection
258
     * of features up to this number is returned. <br>
259
     *
260
     * If some features are manually selected by use of setSelectedFeatures(),
261
     * then only these features are returned <br>
262
     *
263
     * If any of above methods were not called beforehand, then all features
264
     * are returned by default.
265
     *
266
     * @return array
267
     */
268
    protected function getSelectedFeatures() : array
269
    {
270
        $allFeatures = range(0, $this->featureCount - 1);
271
        if ($this->numUsableFeatures === 0 && !$this->selectedFeatures) {
0 ignored issues
show
Bug Best Practice introduced by
The expression $this->selectedFeatures of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

Loading history...
272
            return $allFeatures;
273
        }
274
275
        if ($this->selectedFeatures) {
0 ignored issues
show
Bug Best Practice introduced by
The expression $this->selectedFeatures of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using ! empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

Loading history...
276
            return $this->selectedFeatures;
277
        }
278
279
        $numFeatures = $this->numUsableFeatures;
280
        if ($numFeatures > $this->featureCount) {
281
            $numFeatures = $this->featureCount;
282
        }
283
        shuffle($allFeatures);
284
        $selectedFeatures = array_slice($allFeatures, 0, $numFeatures, false);
285
        sort($selectedFeatures);
286
287
        return $selectedFeatures;
288
    }
289
290
    /**
291
     * @param mixed $baseValue
292
     * @param array $colValues
293
     * @param array $targets
294
     *
295
     * @return float
296
     */
297
    public function getGiniIndex($baseValue, array $colValues, array $targets) : float
298
    {
299
        if (empty($colValues)) {
300
            return 1;
301
        }
302
        $countMatrix = [];
303
        foreach ($this->labels as $label) {
304
            $countMatrix[$label] = [0, 0];
305
        }
306
307
        foreach ($colValues as $index => $value) {
308
            $label = $targets[$index];
309
            $rowIndex = $value === $baseValue ? 0 : 1;
310
            ++$countMatrix[$label][$rowIndex];
311
        }
312
313
        $giniParts = [0, 0];
314
        for ($i = 0; $i <= 1; ++$i) {
315
            $part = 0;
316
            $sum = array_sum(array_column($countMatrix, $i));
317
            if ($sum > 0) {
318
                foreach ($this->labels as $label) {
319
                    $part += pow($countMatrix[$label][$i] / floatval($sum), 2);
320
                }
321
            }
322
323
            $giniParts[$i] = (1 - $part) * $sum;
324
        }
325
326
        return array_sum($giniParts) / count($colValues);
327
    }
328
329
    /**
330
     * @param array $samples
331
     *
332
     * @return array
333
     */
334
    protected function preprocess(array $samples) : array
335
    {
336
        // Detect and convert continuous data column values into
337
        // discrete values by using the median as a threshold value
338
        $columns = [];
339
        for ($i = 0; $i < $this->featureCount; ++$i) {
340
            $values = array_column($samples, $i);
341
            if ($this->columnTypes[$i] == self::CONTINUOUS) {
342
                $median = Mean::median($values);
343
                foreach ($values as &$value) {
344
                    if ($value <= $median) {
345
                        $value = "<= $median";
346
                    } else {
347
                        $value = "> $median";
348
                    }
349
                }
350
            }
351
            $columns[] = $values;
352
        }
353
        // Below method is a strange yet very simple & efficient method
354
        // to get the transpose of a 2D array
355
        return array_map(null, ...$columns);
356
    }
357
358
    /**
359
     * @param array $columnValues
360
     *
361
     * @return bool
362
     */
363
    protected static function isCategoricalColumn(array $columnValues) : bool
364
    {
365
        $count = count($columnValues);
366
367
        // There are two main indicators that *may* show whether a
368
        // column is composed of discrete set of values:
369
        // 1- Column may contain string values and non-float values
370
        // 2- Number of unique values in the column is only a small fraction of
371
        //	  all values in that column (Lower than or equal to %20 of all values)
372
        $numericValues = array_filter($columnValues, 'is_numeric');
373
        $floatValues = array_filter($columnValues, 'is_float');
374
        $nullCount = count(array_filter($columnValues, 'is_null'));
375
        if (!empty($floatValues)) {
376
            return false;
377
        }
378
379
        if (count($numericValues) !== $count - $nullCount) {
380
            return true;
381
        }
382
383
        $distinctValuesCount = count(self::arrayCountValues($columnValues));
384
385
        return $distinctValuesCount <= ($count - $nullCount) * self::$categoricalColumnMinimumUniqueValueCount;
386
    }
387
388
    /**
389
     * This method is used to set number of columns to be used
390
     * when deciding a split at an internal node of the tree.  <br>
391
     * If the value is given 0, then all features are used (default behaviour),
392
     * otherwise the given value will be used as a maximum for number of columns
393
     * randomly selected for each split operation.
394
     *
395
     * @param int $numFeatures
396
     *
397
     * @return $this
398
     *
399
     * @throws InvalidArgumentException
400
     */
401
    public function setNumFeatures(int $numFeatures)
402
    {
403
        if ($numFeatures < 0) {
404
            throw new InvalidArgumentException('Selected column count should be greater or equal to zero');
405
        }
406
407
        $this->numUsableFeatures = $numFeatures;
408
409
        return $this;
410
    }
411
412
    /**
413
     * Used to set predefined features to consider while deciding which column to use for a split
414
     *
415
     * @param array $selectedFeatures
416
     */
417
    protected function setSelectedFeatures(array $selectedFeatures)
418
    {
419
        $this->selectedFeatures = $selectedFeatures;
420
    }
421
422
    /**
423
     * A string array to represent columns. Useful when HTML output or
424
     * column importances are desired to be inspected.
425
     *
426
     * @param array $names
427
     *
428
     * @return $this
429
     *
430
     * @throws InvalidArgumentException
431
     */
432
    public function setColumnNames(array $names)
433
    {
434
        if ($this->featureCount !== 0 && count($names) !== $this->featureCount) {
435
            throw new InvalidArgumentException(sprintf('Length of the given array should be equal to feature count %s', $this->featureCount));
436
        }
437
438
        $this->columnNames = $names;
439
440
        return $this;
441
    }
442
443
    /**
444
     * @return string
445
     */
446
    public function getHtml()
447
    {
448
        return $this->tree->getHTML($this->columnNames);
449
    }
450
451
    /**
452
     * This will return an array including an importance value for
453
     * each column in the given dataset. The importance values are
454
     * normalized and their total makes 1.<br/>
455
     *
456
     * @return array
457
     */
458
    public function getFeatureImportances()
459
    {
460
        if ($this->featureImportances !== null) {
461
            return $this->featureImportances;
462
        }
463
464
        $sampleCount = count($this->samples);
465
        $this->featureImportances = [];
466
        foreach ($this->columnNames as $column => $columnName) {
467
            $nodes = $this->getSplitNodesByColumn($column, $this->tree);
468
469
            $importance = 0;
470
            foreach ($nodes as $node) {
471
                $importance += $node->getNodeImpurityDecrease($sampleCount);
472
            }
473
474
            $this->featureImportances[$columnName] = $importance;
475
        }
476
477
        // Normalize & sort the importances
478
        $total = array_sum($this->featureImportances);
479
        if ($total > 0) {
480
            foreach ($this->featureImportances as &$importance) {
481
                $importance /= $total;
482
            }
483
            arsort($this->featureImportances);
484
        }
485
486
        return $this->featureImportances;
487
    }
488
489
    /**
490
     * Collects and returns an array of internal nodes that use the given
491
     * column as a split criterion
492
     *
493
     * @param int              $column
494
     * @param DecisionTreeLeaf $node
495
     *
496
     * @return array
497
     */
498
    protected function getSplitNodesByColumn(int $column, DecisionTreeLeaf $node) : array
499
    {
500
        if (!$node || $node->isTerminal) {
501
            return [];
502
        }
503
504
        $nodes = [];
505
        if ($node->columnIndex === $column) {
506
            $nodes[] = $node;
507
        }
508
509
        $lNodes = [];
510
        $rNodes = [];
511
        if ($node->leftLeaf) {
512
            $lNodes = $this->getSplitNodesByColumn($column, $node->leftLeaf);
513
        }
514
515
        if ($node->rightLeaf) {
516
            $rNodes = $this->getSplitNodesByColumn($column, $node->rightLeaf);
517
        }
518
519
        $nodes = array_merge($nodes, $lNodes, $rNodes);
520
521
        return $nodes;
522
    }
523
524
    /**
525
     * @param array $sample
526
     *
527
     * @return mixed
528
     */
529
    protected function predictSample(array $sample)
530
    {
531
        $node = $this->tree;
532
        do {
533
            if ($node->isTerminal) {
534
                break;
535
            }
536
537
            if ($node->evaluate($sample)) {
538
                $node = $node->leftLeaf;
539
            } else {
540
                $node = $node->rightLeaf;
541
            }
542
        } while ($node);
543
544
        return $node ? $node->classValue : $this->labels[0];
545
    }
546
547
    /**
548
     * @return integer[]|null[]
549
     */
550
    public function getInstanceColumnTypes() {
551
        return $this->columnTypes;
552
    }
553
554
    /**
555
     * @param integer[]|null[] $columnTypes
556
     */
557
    public function setInstanceColumnTypes(array $columnTypes) {
558
        $this->columnTypes = $columnTypes;
559
    }
560
561
    /**
562
     * @param array $values
563
     *
564
     * @return array
565
     */
566
    protected static function arrayCountValues(array $values) {
567
        $raw = [];
568
        foreach ($values as &$value) {
569
            if (!is_null($value) and !is_float($value)) {
0 ignored issues
show
Comprehensibility Best Practice introduced by
Using logical operators such as and instead of && is generally not recommended.

PHP has two types of connecting operators (logical operators, and boolean operators):

  Logical Operators Boolean Operator
AND - meaning and &&
OR - meaning or ||

The difference between these is the order in which they are executed. In most cases, you would want to use a boolean operator like &&, or ||.

Let’s take a look at a few examples:

// Logical operators have lower precedence:
$f = false or true;

// is executed like this:
($f = false) or true;


// Boolean operators have higher precedence:
$f = false || true;

// is executed like this:
$f = (false || true);

Logical Operators are used for Control-Flow

One case where you explicitly want to use logical operators is for control-flow such as this:

$x === 5
    or die('$x must be 5.');

// Instead of
if ($x !== 5) {
    die('$x must be 5.');
}

Since die introduces problems of its own, f.e. it makes our code hardly testable, and prevents any kind of more sophisticated error handling; you probably do not want to use this in real-world code. Unfortunately, logical operators cannot be combined with throw at this point:

// The following is currently a parse error.
$x === 5
    or throw new RuntimeException('$x must be 5.');

These limitations lead to logical operators rarely being of use in current PHP code.

Loading history...
570
                $raw[] = $value;
571
            }
572
        }
573
574
        return array_count_values($raw);
575
    }
576
}
577