Passed
Pull Request — master (#691)
by Matias
06:21 queued 04:13
created

ChineseWhispers::predict()   B

Complexity

Conditions 11
Paths 121

Size

Total Lines 67
Code Lines 34

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 32
CRAP Score 11.0245

Importance

Changes 2
Bugs 0 Features 0
Metric Value
cc 11
eloc 34
c 2
b 0
f 0
nc 121
nop 3
dl 0
loc 67
ccs 32
cts 34
cp 0.9412
crap 11.0245
rs 7.1416

How to fix   Long Method    Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:

1
<?php
2
declare(strict_types=1);
3
/**
4
 * @copyright Copyright (c) 2023, Matias De lellis
5
 *
6
 * @author Matias De lellis <[email protected]>
7
 *
8
 * @license AGPL-3.0-or-later
9
 *
10
 * This code is free software: you can redistribute it and/or modify
11
 * it under the terms of the GNU Affero General Public License, version 3,
12
 * as published by the Free Software Foundation.
13
 *
14
 * This program is distributed in the hope that it will be useful,
15
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
16
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
17
 * GNU Affero General Public License for more details.
18
 *
19
 * You should have received a copy of the GNU Affero General Public License, version 3,
20
 * along with this program. If not, see <http://www.gnu.org/licenses/>
21
 *
22
 */
23
24
namespace OCA\FaceRecognition\Clusterer;
25
26
27
/**
28
 * This class implements the graph clustering algorithm described in the
29
 * paper: Chinese Whispers - an Efficient Graph Clustering Algorithm and its
30
 * Application to Natural Language Processing Problems by Chris Biemann.
31
 *
32
 * In particular, it tries to be a shameless copy of the original dlib
33
 * implementation.
34
 *  - https://github.com/davisking/dlib/blob/master/dlib/clustering/chinese_whispers.h
35
 */
36
class ChineseWhispers {
37
38
	/**
39
	 * Cluster the dataset by assigning a label to each sample.from the edges
40
	 */
41 1
	static public function predict(array &$edges, array &$labels, int $num_iterations = 100)
42
	{
43
		// To improve the stability of the clusters, we must
44
		// iterate the neighbors in a pseudo-random way.
45 1
		mt_srand(2023);
46
47 1
		$labels = [];
48 1
		if (count($edges) == 0)
49
			return 0;
50
51 1
		$neighbors = [];
52 1
		self::find_neighbor_ranges($edges, $neighbors);
53
54
		// Initialize the labels, each node gets a different label.
55 1
		for ($i = 0; $i < count($neighbors); ++$i)
0 ignored issues
show
Performance Best Practice introduced by
It seems like you are calling the size function count() as part of the test condition. You might want to compute the size beforehand, and not on each iteration.

If the size of the collection does not change during the iteration, it is generally a good practice to compute it beforehand, and not on each iteration:

for ($i=0; $i<count($array); $i++) { // calls count() on each iteration
}

// Better
for ($i=0, $c=count($array); $i<$c; $i++) { // calls count() just once
}
Loading history...
56 1
			$labels[$i] = $i;
57 1
		for ($iter = 0; $iter < count($neighbors)*$num_iterations; ++$iter)
58
		{
59
			// Pick a random node.
60 1
			$idx = mt_rand()%count($neighbors);
61
62
			// Count how many times each label happens amongst our neighbors.
63 1
			$labels_to_counts = [];
64 1
			$end = $neighbors[$idx][1];
65
66 1
			for ($i = $neighbors[$idx][0]; $i != $end; ++$i)
67
			{
68 1
				$iLabelFirst = $edges[$i][1];
69 1
				$iLabel = $labels[$iLabelFirst];
70 1
				if (isset($labels_to_counts[$iLabel]))
71
					$labels_to_counts[$iLabel]++;
72
				else
73 1
					$labels_to_counts[$iLabel] = 1;
74
			}
75
76
			// find the most common label
77
			// std::map<unsigned long, double>::iterator i;
78 1
			$best_score = PHP_INT_MIN;
79 1
			$best_label = $labels[$idx];
80 1
			foreach ($labels_to_counts as $key => $value)
81
			{
82 1
				if ($value > $best_score)
83
				{
84 1
					$best_score = $value;
85 1
					$best_label = $key;
86
				}
87
			}
88
89 1
			$labels[$idx] = $best_label;
90
		}
91
92
		// Remap the labels into a contiguous range.  First we find the
93
		// mapping.
94 1
		$label_remap = [];
95 1
		for ($i = 0; $i < count($labels); ++$i)
0 ignored issues
show
Performance Best Practice introduced by
It seems like you are calling the size function count() as part of the test condition. You might want to compute the size beforehand, and not on each iteration.

If the size of the collection does not change during the iteration, it is generally a good practice to compute it beforehand, and not on each iteration:

for ($i=0; $i<count($array); $i++) { // calls count() on each iteration
}

// Better
for ($i=0, $c=count($array); $i<$c; $i++) { // calls count() just once
}
Loading history...
96
		{
97 1
			$next_id = count($label_remap);
98 1
			if (!isset($label_remap[$labels[$i]]))
99 1
				$label_remap[$labels[$i]] = $next_id;
100
		}
101
		// now apply the mapping to all the labels.
102 1
		for ($i = 0; $i < count($labels); ++$i)
0 ignored issues
show
Performance Best Practice introduced by
It seems like you are calling the size function count() as part of the test condition. You might want to compute the size beforehand, and not on each iteration.

If the size of the collection does not change during the iteration, it is generally a good practice to compute it beforehand, and not on each iteration:

for ($i=0; $i<count($array); $i++) { // calls count() on each iteration
}

// Better
for ($i=0, $c=count($array); $i<$c; $i++) { // calls count() just once
}
Loading history...
103
		{
104 1
			$labels[$i] = $label_remap[$labels[$i]];
105
		}
106
107 1
		return count($label_remap);
108
	}
109
110 1
	static function find_neighbor_ranges (&$edges, &$neighbors) {
0 ignored issues
show
Best Practice introduced by
It is generally recommended to explicitly declare the visibility for methods.

Adding explicit visibility (private, protected, or public) is generally recommend to communicate to other developers how, and from where this method is intended to be used.

Loading history...
111
		// setup neighbors so that [neighbors[i].first, neighbors[i].second) is the range
112
		// within edges that contains all node i's edges.
113 1
		$num_nodes = self::max_index_plus_one($edges);
114 1
		for ($i = 0; $i < $num_nodes; ++$i) $neighbors[$i] = [0, 0];
115 1
		$cur_node = 0;
116 1
		$start_idx = 0;
117 1
		for ($i = 0; $i < count($edges); ++$i)
0 ignored issues
show
Performance Best Practice introduced by
It seems like you are calling the size function count() as part of the test condition. You might want to compute the size beforehand, and not on each iteration.

If the size of the collection does not change during the iteration, it is generally a good practice to compute it beforehand, and not on each iteration:

for ($i=0; $i<count($array); $i++) { // calls count() on each iteration
}

// Better
for ($i=0, $c=count($array); $i<$c; $i++) { // calls count() just once
}
Loading history...
118
		{
119 1
			if ($edges[$i][0] != $cur_node)
120
			{
121
				$neighbors[$cur_node] = [$start_idx, $i];
122
				$start_idx = $i;
123
				$cur_node = $edges[$i][0];
124
			}
125
		}
126 1
		if (count($neighbors) !== 0)
127 1
			$neighbors[$cur_node] = [$start_idx, count($edges)];
128
	}
129
130 1
	static function max_index_plus_one ($pairs): int {
0 ignored issues
show
Best Practice introduced by
It is generally recommended to explicitly declare the visibility for methods.

Adding explicit visibility (private, protected, or public) is generally recommend to communicate to other developers how, and from where this method is intended to be used.

Loading history...
131 1
		if (count($pairs) === 0)
132
		{
133
			return 0;
134
		}
135
		else {
136 1
			$max_idx = 0;
137 1
			for ($i = 0; $i < count($pairs); ++$i)
0 ignored issues
show
Performance Best Practice introduced by
It seems like you are calling the size function count() as part of the test condition. You might want to compute the size beforehand, and not on each iteration.

If the size of the collection does not change during the iteration, it is generally a good practice to compute it beforehand, and not on each iteration:

for ($i=0; $i<count($array); $i++) { // calls count() on each iteration
}

// Better
for ($i=0, $c=count($array); $i<$c; $i++) { // calls count() just once
}
Loading history...
138
			{
139 1
				if ($pairs[$i][0] > $max_idx)
140
					$max_idx = $pairs[$i][0];
141 1
				if ($pairs[$i][1] > $max_idx)
142
					$max_idx = $pairs[$i][1];
143
			}
144 1
			return $max_idx + 1;
145
		}
146
	}
147
148 1
	static function convert_unordered_to_ordered (&$edges, &$out_edges)
0 ignored issues
show
Best Practice introduced by
It is generally recommended to explicitly declare the visibility for methods.

Adding explicit visibility (private, protected, or public) is generally recommend to communicate to other developers how, and from where this method is intended to be used.

Loading history...
149
	{
150 1
		$out_edges = [];
151 1
		for ($i = 0; $i < count($edges); ++$i)
0 ignored issues
show
Performance Best Practice introduced by
It seems like you are calling the size function count() as part of the test condition. You might want to compute the size beforehand, and not on each iteration.

If the size of the collection does not change during the iteration, it is generally a good practice to compute it beforehand, and not on each iteration:

for ($i=0; $i<count($array); $i++) { // calls count() on each iteration
}

// Better
for ($i=0, $c=count($array); $i<$c; $i++) { // calls count() just once
}
Loading history...
152
		{
153 1
			$out_edges[] = [$edges[$i][0], $edges[$i][1]];
154 1
			if ($edges[$i][0] != $edges[$i][1])
155
				$out_edges[] = [$edges[$i][1], $edges[$i][0]];
156
		}
157
	}
158
}
159