Completed
Push — master ( 240a22...a33d5f )
by Arkadiusz
02:53
created

DecisionTree::__construct()   A

Complexity

Conditions 1
Paths 1

Size

Total Lines 4
Code Lines 2

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric Value
dl 0
loc 4
rs 10
c 0
b 0
f 0
cc 1
eloc 2
nc 1
nop 1
1
<?php
2
3
declare(strict_types=1);
4
5
namespace Phpml\Classification;
6
7
use Phpml\Helper\Predictable;
8
use Phpml\Helper\Trainable;
9
use Phpml\Math\Statistic\Mean;
10
use Phpml\Classification\DecisionTree\DecisionTreeLeaf;
11
12
class DecisionTree implements Classifier
13
{
14
    use Trainable, Predictable;
15
16
    const CONTINUOS = 1;
17
    const NOMINAL = 2;
18
19
    /**
20
     * @var array
21
     */
22
    private $samples = [];
23
24
    /**
25
     * @var array
26
     */
27
    private $columnTypes;
28
29
    /**
30
     * @var array
31
     */
32
    private $labels = [];
33
34
    /**
35
     * @var int
36
     */
37
    private $featureCount = 0;
38
39
    /**
40
     * @var DecisionTreeLeaf
41
     */
42
    private $tree = null;
43
44
    /**
45
     * @var int
46
     */
47
    private $maxDepth;
48
49
    /**
50
     * @var int
51
     */
52
    public $actualDepth = 0;
53
54
    /**
55
     * @var int
56
     */
57
    private $numUsableFeatures = 0;
58
59
    /**
60
     * @var array
61
     */
62
    private $featureImportances = null;
63
64
    /**
65
     *
66
     * @var array
67
     */
68
    private $columnNames = null;
69
70
    /**
71
     * @param int $maxDepth
72
     */
73
    public function __construct($maxDepth = 10)
74
    {
75
        $this->maxDepth = $maxDepth;
76
    }
77
    /**
78
     * @param array $samples
79
     * @param array $targets
80
     */
81
    public function train(array $samples, array $targets)
82
    {
83
        $this->samples = array_merge($this->samples, $samples);
84
        $this->targets = array_merge($this->targets, $targets);
85
86
        $this->featureCount = count($this->samples[0]);
87
        $this->columnTypes = $this->getColumnTypes($this->samples);
88
        $this->labels = array_keys(array_count_values($this->targets));
89
        $this->tree = $this->getSplitLeaf(range(0, count($this->samples) - 1));
90
91
        // Each time the tree is trained, feature importances are reset so that
92
        // we will have to compute it again depending on the new data
93
        $this->featureImportances = null;
0 ignored issues
show
Documentation Bug introduced by
It seems like null of type null is incompatible with the declared type array of property $featureImportances.

Our type inference engine has found an assignment to a property that is incompatible with the declared type of that property.

Either this assignment is in error or the assigned type should be added to the documentation/type hint for that property..

Loading history...
94
95
        // If column names are given or computed before, then there is no
96
        // need to init it and accidentally remove the previous given names
97
        if ($this->columnNames === null) {
98
            $this->columnNames = range(0, $this->featureCount - 1);
99
        } elseif (count($this->columnNames) > $this->featureCount) {
100
            $this->columnNames = array_slice($this->columnNames, 0, $this->featureCount);
101
        } elseif (count($this->columnNames) < $this->featureCount) {
102
            $this->columnNames = array_merge($this->columnNames,
103
                range(count($this->columnNames), $this->featureCount - 1));
104
        }
105
    }
106
107
    protected function getColumnTypes(array $samples)
108
    {
109
        $types = [];
110
        for ($i=0; $i<$this->featureCount; $i++) {
111
            $values = array_column($samples, $i);
112
            $isCategorical = $this->isCategoricalColumn($values);
113
            $types[] = $isCategorical ? self::NOMINAL : self::CONTINUOS;
114
        }
115
        return $types;
116
    }
117
118
    /**
119
     * @param null|array $records
120
     * @return DecisionTreeLeaf
121
     */
122
    protected function getSplitLeaf($records, $depth = 0)
123
    {
124
        $split = $this->getBestSplit($records);
0 ignored issues
show
Bug introduced by
It seems like $records defined by parameter $records on line 122 can also be of type null; however, Phpml\Classification\DecisionTree::getBestSplit() does only seem to accept array, maybe add an additional type check?

This check looks at variables that have been passed in as parameters and are passed out again to other methods.

If the outgoing method call has stricter type requirements than the method itself, an issue is raised.

An additional type check may prevent trouble.

Loading history...
125
        $split->level = $depth;
126
        if ($this->actualDepth < $depth) {
127
            $this->actualDepth = $depth;
128
        }
129
        $leftRecords = [];
130
        $rightRecords= [];
131
        $remainingTargets = [];
132
        $prevRecord = null;
133
        $allSame = true;
134
        foreach ($records as $recordNo) {
0 ignored issues
show
Bug introduced by
The expression $records of type null|array is not guaranteed to be traversable. How about adding an additional type check?

There are different options of fixing this problem.

  1. If you want to be on the safe side, you can add an additional type-check:

    $collection = json_decode($data, true);
    if ( ! is_array($collection)) {
        throw new \RuntimeException('$collection must be an array.');
    }
    
    foreach ($collection as $item) { /** ... */ }
    
  2. If you are sure that the expression is traversable, you might want to add a doc comment cast to improve IDE auto-completion and static analysis:

    /** @var array $collection */
    $collection = json_decode($data, true);
    
    foreach ($collection as $item) { /** .. */ }
    
  3. Mark the issue as a false-positive: Just hover the remove button, in the top-right corner of this issue for more options.

Loading history...
135
            $record = $this->samples[$recordNo];
136
            if ($prevRecord && $prevRecord != $record) {
137
                $allSame = false;
138
            }
139
            $prevRecord = $record;
140
            if ($split->evaluate($record)) {
141
                $leftRecords[] = $recordNo;
142
            } else {
143
                $rightRecords[]= $recordNo;
144
            }
145
            $target = $this->targets[$recordNo];
146
            if (! in_array($target, $remainingTargets)) {
147
                $remainingTargets[] = $target;
148
            }
149
        }
150
151
        if (count($remainingTargets) == 1 || $allSame || $depth >= $this->maxDepth) {
152
            $split->isTerminal = 1;
0 ignored issues
show
Documentation Bug introduced by
The property $isTerminal was declared of type boolean, but 1 is of type integer. Maybe add a type cast?

This check looks for assignments to scalar types that may be of the wrong type.

To ensure the code behaves as expected, it may be a good idea to add an explicit type cast.

$answer = 42;

$correct = false;

$correct = (bool) $answer;
Loading history...
153
            $classes = array_count_values($remainingTargets);
154
            arsort($classes);
155
            $split->classValue = key($classes);
156
        } else {
157
            if ($leftRecords) {
0 ignored issues
show
Bug Best Practice introduced by
The expression $leftRecords of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using ! empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

Loading history...
158
                $split->leftLeaf = $this->getSplitLeaf($leftRecords, $depth + 1);
159
            }
160
            if ($rightRecords) {
0 ignored issues
show
Bug Best Practice introduced by
The expression $rightRecords of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using ! empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

Loading history...
161
                $split->rightLeaf= $this->getSplitLeaf($rightRecords, $depth + 1);
162
            }
163
        }
164
        return $split;
165
    }
166
167
    /**
168
     * @param array $records
169
     * @return DecisionTreeLeaf[]
170
     */
171
    protected function getBestSplit($records)
172
    {
173
        $targets = array_intersect_key($this->targets, array_flip($records));
174
        $samples = array_intersect_key($this->samples, array_flip($records));
175
        $samples = array_combine($records, $this->preprocess($samples));
176
        $bestGiniVal = 1;
177
        $bestSplit = null;
178
        $features = $this->getSelectedFeatures();
179
        foreach ($features as $i) {
180
            $colValues = [];
181
            foreach ($samples as $index => $row) {
182
                $colValues[$index] = $row[$i];
183
            }
184
            $counts = array_count_values($colValues);
185
            arsort($counts);
186
            $baseValue = key($counts);
187
            $gini = $this->getGiniIndex($baseValue, $colValues, $targets);
188
            if ($bestSplit == null || $bestGiniVal > $gini) {
189
                $split = new DecisionTreeLeaf();
190
                $split->value = $baseValue;
191
                $split->giniIndex = $gini;
0 ignored issues
show
Documentation Bug introduced by
It seems like $gini can also be of type integer. However, the property $giniIndex is declared as type double. Maybe add an additional type check?

Our type inference engine has found a suspicous assignment of a value to a property. This check raises an issue when a value that can be of a mixed type is assigned to a property that is type hinted more strictly.

For example, imagine you have a variable $accountId that can either hold an Id object or false (if there is no account id yet). Your code now assigns that value to the id property of an instance of the Account class. This class holds a proper account, so the id value must no longer be false.

Either this assignment is in error or a type check should be added for that assignment.

class Id
{
    public $id;

    public function __construct($id)
    {
        $this->id = $id;
    }

}

class Account
{
    /** @var  Id $id */
    public $id;
}

$account_id = false;

if (starsAreRight()) {
    $account_id = new Id(42);
}

$account = new Account();
if ($account instanceof Id)
{
    $account->id = $account_id;
}
Loading history...
192
                $split->columnIndex = $i;
193
                $split->isContinuous = $this->columnTypes[$i] == self::CONTINUOS;
194
                $split->records = $records;
195
                $bestSplit = $split;
196
                $bestGiniVal = $gini;
197
            }
198
        }
199
        return $bestSplit;
200
    }
201
202
    /**
203
     * @return array
204
     */
205
    protected function getSelectedFeatures()
206
    {
207
        $allFeatures = range(0, $this->featureCount - 1);
208
        if ($this->numUsableFeatures == 0) {
209
            return $allFeatures;
210
        }
211
212
        $numFeatures = $this->numUsableFeatures;
213
        if ($numFeatures > $this->featureCount) {
214
            $numFeatures = $this->featureCount;
215
        }
216
        shuffle($allFeatures);
217
        $selectedFeatures = array_slice($allFeatures, 0, $numFeatures, false);
218
        sort($selectedFeatures);
219
220
        return $selectedFeatures;
221
    }
222
223
    /**
224
     * @param string $baseValue
225
     * @param array $colValues
226
     * @param array $targets
227
     */
228
    public function getGiniIndex($baseValue, $colValues, $targets)
229
    {
230
        $countMatrix = [];
231
        foreach ($this->labels as $label) {
232
            $countMatrix[$label] = [0, 0];
233
        }
234
        foreach ($colValues as $index => $value) {
235
            $label = $targets[$index];
236
            $rowIndex = $value == $baseValue ? 0 : 1;
237
            $countMatrix[$label][$rowIndex]++;
238
        }
239
        $giniParts = [0, 0];
240
        for ($i=0; $i<=1; $i++) {
241
            $part = 0;
242
            $sum = array_sum(array_column($countMatrix, $i));
243
            if ($sum > 0) {
244
                foreach ($this->labels as $label) {
245
                    $part += pow($countMatrix[$label][$i] / floatval($sum), 2);
246
                }
247
            }
248
            $giniParts[$i] = (1 - $part) * $sum;
249
        }
250
        return array_sum($giniParts) / count($colValues);
251
    }
252
253
    /**
254
     * @param array $samples
255
     * @return array
256
     */
257
    protected function preprocess(array $samples)
258
    {
259
        // Detect and convert continuous data column values into
260
        // discrete values by using the median as a threshold value
261
        $columns = [];
262
        for ($i=0; $i<$this->featureCount; $i++) {
263
            $values = array_column($samples, $i);
264
            if ($this->columnTypes[$i] == self::CONTINUOS) {
265
                $median = Mean::median($values);
266
                foreach ($values as &$value) {
267
                    if ($value <= $median) {
268
                        $value = "<= $median";
269
                    } else {
270
                        $value = "> $median";
271
                    }
272
                }
273
            }
274
            $columns[] = $values;
275
        }
276
        // Below method is a strange yet very simple & efficient method
277
        // to get the transpose of a 2D array
278
        return array_map(null, ...$columns);
279
    }
280
281
    /**
282
     * @param array $columnValues
283
     * @return bool
284
     */
285
    protected function isCategoricalColumn(array $columnValues)
286
    {
287
        $count = count($columnValues);
288
        // There are two main indicators that *may* show whether a
289
        // column is composed of discrete set of values:
290
        // 1- Column may contain string values
291
        // 2- Number of unique values in the column is only a small fraction of
292
        //	  all values in that column (Lower than or equal to %20 of all values)
293
        $numericValues = array_filter($columnValues, 'is_numeric');
294
        if (count($numericValues) != $count) {
295
            return true;
296
        }
297
        $distinctValues = array_count_values($columnValues);
298
        if (count($distinctValues) <= $count / 5) {
299
            return true;
300
        }
301
        return false;
302
    }
303
304
    /**
305
     * This method is used to set number of columns to be used
306
     * when deciding a split at an internal node of the tree.  <br>
307
     * If the value is given 0, then all features are used (default behaviour),
308
     * otherwise the given value will be used as a maximum for number of columns
309
     * randomly selected for each split operation.
310
     *
311
     * @param int $numFeatures
312
     * @return $this
313
     * @throws Exception
314
     */
315
    public function setNumFeatures(int $numFeatures)
316
    {
317
        if ($numFeatures < 0) {
318
            throw new \Exception("Selected column count should be greater or equal to zero");
319
        }
320
321
        $this->numUsableFeatures = $numFeatures;
322
323
        return $this;
324
    }
325
326
    /**
327
     * A string array to represent columns. Useful when HTML output or
328
     * column importances are desired to be inspected.
329
     *
330
     * @param array $names
331
     * @return $this
332
     */
333
    public function setColumnNames(array $names)
334
    {
335
        if ($this->featureCount != 0 && count($names) != $this->featureCount) {
336
            throw new \Exception("Length of the given array should be equal to feature count ($this->featureCount)");
337
        }
338
339
        $this->columnNames = $names;
340
341
        return $this;
342
    }
343
344
    /**
345
     * @return string
346
     */
347
    public function getHtml()
348
    {
349
        return $this->tree->getHTML($this->columnNames);
350
    }
351
352
    /**
353
     * This will return an array including an importance value for
354
     * each column in the given dataset. The importance values are
355
     * normalized and their total makes 1.<br/>
356
     *
357
     * @param array $labels
0 ignored issues
show
Bug introduced by
There is no parameter named $labels. Was it maybe removed?

This check looks for PHPDoc comments describing methods or function parameters that do not exist on the corresponding method or function.

Consider the following example. The parameter $italy is not defined by the method finale(...).

/**
 * @param array $germany
 * @param array $island
 * @param array $italy
 */
function finale($germany, $island) {
    return "2:1";
}

The most likely cause is that the parameter was removed, but the annotation was not.

Loading history...
358
     * @return array
359
     */
360
    public function getFeatureImportances()
361
    {
362
        if ($this->featureImportances !== null) {
363
            return $this->featureImportances;
364
        }
365
366
        $sampleCount = count($this->samples);
367
        $this->featureImportances = [];
368
        foreach ($this->columnNames as $column => $columnName) {
369
            $nodes = $this->getSplitNodesByColumn($column, $this->tree);
370
371
            $importance = 0;
372
            foreach ($nodes as $node) {
373
                $importance += $node->getNodeImpurityDecrease($sampleCount);
374
            }
375
376
            $this->featureImportances[$columnName] = $importance;
377
        }
378
379
        // Normalize & sort the importances
380
        $total = array_sum($this->featureImportances);
381
        if ($total > 0) {
382
            foreach ($this->featureImportances as &$importance) {
383
                $importance /= $total;
384
            }
385
            arsort($this->featureImportances);
386
        }
387
388
        return $this->featureImportances;
389
    }
390
391
    /**
392
     * Collects and returns an array of internal nodes that use the given
393
     * column as a split criteron
394
     *
395
     * @param int $column
396
     * @param DecisionTreeLeaf
397
     * @param array $collected
0 ignored issues
show
Bug introduced by
There is no parameter named $collected. Was it maybe removed?

This check looks for PHPDoc comments describing methods or function parameters that do not exist on the corresponding method or function.

Consider the following example. The parameter $italy is not defined by the method finale(...).

/**
 * @param array $germany
 * @param array $island
 * @param array $italy
 */
function finale($germany, $island) {
    return "2:1";
}

The most likely cause is that the parameter was removed, but the annotation was not.

Loading history...
398
     *
399
     * @return array
400
     */
401
    protected function getSplitNodesByColumn($column, DecisionTreeLeaf $node)
402
    {
403
        if (!$node || $node->isTerminal) {
404
            return [];
405
        }
406
407
        $nodes = [];
408
        if ($node->columnIndex == $column) {
409
            $nodes[] = $node;
410
        }
411
412
        $lNodes = [];
413
        $rNodes = [];
414
        if ($node->leftLeaf) {
415
            $lNodes = $this->getSplitNodesByColumn($column, $node->leftLeaf);
416
        }
417
        if ($node->rightLeaf) {
418
            $rNodes = $this->getSplitNodesByColumn($column, $node->rightLeaf);
419
        }
420
        $nodes = array_merge($nodes, $lNodes, $rNodes);
421
422
        return $nodes;
423
    }
424
425
    /**
426
     * @param array $sample
427
     * @return mixed
428
     */
429
    protected function predictSample(array $sample)
430
    {
431
        $node = $this->tree;
432
        do {
433
            if ($node->isTerminal) {
434
                break;
435
            }
436
            if ($node->evaluate($sample)) {
437
                $node = $node->leftLeaf;
438
            } else {
439
                $node = $node->rightLeaf;
440
            }
441
        } while ($node);
442
443
        return $node ? $node->classValue : $this->labels[0];
444
    }
445
}
446