Completed
Push — master ( c8bc69...89bc14 )
by Andrei
17:38
created

kmeans_plusplus_initializer.initialize()   A

Complexity

Conditions 2

Size

Total Lines 17

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric Value
cc 2
dl 0
loc 17
rs 9.4285
c 0
b 0
f 0
1
"""!
2
3
@brief Collection of center initializers for algorithm that uses initial centers, for example, for K-Means or X-Means.
4
@details Implementations based on articles:
5
         - K-Means++: The Advantages of careful seeding. D. Arthur, S. Vassilvitskii. 2007.
6
         
7
@authors Andrei Novikov, Aleksey Kukushkin ([email protected])
8
@date 2014-2018
9
@copyright GNU Public License
10
11
@see kmeans
12
@see xmeans
13
14
@cond GNU_PUBLIC_LICENSE
15
    PyClustering is free software: you can redistribute it and/or modify
16
    it under the terms of the GNU General Public License as published by
17
    the Free Software Foundation, either version 3 of the License, or
18
    (at your option) any later version.
19
    
20
    PyClustering is distributed in the hope that it will be useful,
21
    but WITHOUT ANY WARRANTY; without even the implied warranty of
22
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
23
    GNU General Public License for more details.
24
    
25
    You should have received a copy of the GNU General Public License
26
    along with this program.  If not, see <http://www.gnu.org/licenses/>.
27
@endcond
28
29
"""
30
31
32
import numpy;
0 ignored issues
show
Configuration introduced by
The import numpy could not be resolved.

This can be caused by one of the following:

1. Missing Dependencies

This error could indicate a configuration issue of Pylint. Make sure that your libraries are available by adding the necessary commands.

# .scrutinizer.yml
before_commands:
    - sudo pip install abc # Python2
    - sudo pip3 install abc # Python3
Tip: We are currently not using virtualenv to run pylint, when installing your modules make sure to use the command for the correct version.

2. Missing __init__.py files

This error could also result from missing __init__.py files in your module folders. Make sure that you place one file in each sub-folder.

Loading history...
33
import random;
34
35
from pyclustering.utils import euclidean_distance;
0 ignored issues
show
Unused Code introduced by
Unused euclidean_distance imported from pyclustering.utils
Loading history...
36
37
38
class random_center_initializer:
39
    """!
40
    @brief Random center initializer is for generation specified amount of random of centers for specified data.
41
    
42
    """
43
44
    def __init__(self, data, amount_centers):
45
        """!
46
        @brief Creates instance of random center initializer.
47
        
48
        @param[in] data (list): List of points where each point is represented by list of coordinates.
49
        @param[in] amount_centers (unit): Amount of centers that should be initialized.
50
        
51
        """
52
        
53
        self.__data = data;
54
        self.__amount = amount_centers;
55
56
        if self.__amount <= 0:
57
            raise AttributeError("Amount of cluster centers should be at least 1.");
58
59
60
    def initialize(self):
61
        """!
62
        @brief Generates random centers in line with input parameters.
63
        
64
        @return (list) List of centers where each center is represented by list of coordinates.
65
        
66
        """
67
        return [ self.__create_center() for _ in range(self.__amount) ];
68
69
70
    def __create_center(self):
71
        """!
72
        @brief Generates and returns random center.
73
        
74
        """
75
        return [ random.random() for _ in range(len(self.__data[0])) ];
76
77
78
79
class kmeans_plusplus_initializer:
80
    """!
0 ignored issues
show
Bug introduced by
A suspicious escape sequence \s was found. Did you maybe forget to add an r prefix?

Escape sequences in Python are generally interpreted according to rules similar to standard C. Only if strings are prefixed with r or R are they interpreted as regular expressions.

The escape sequence that was used indicates that you might have intended to write a regular expression.

Learn more about the available escape sequences. in the Python documentation.

Loading history...
81
    @brief K-Means++ is an algorithm for choosing the initial centers for algorithms like K-Means or X-Means.
82
    @details K-Means++ algorithm guarantees an approximation ratio O(log k). Clustering results are depends on
83
              initial centers in case of K-Means algorithm and even in case of X-Means. This method is used to find
84
              out optimal initial centers.
85
86
    Algorithm can be divided into three steps. The first center is chosen from input data randomly with
87
    uniform distribution at the first step. At the second, probability to being center is calculated for each point:
88
    \f[p_{i}=\frac{D(x_{i})}{\sum_{j=0}^{N}D(x_{j})}\f]
89
    where \f$D(x_{i})\f$ is a distance from point \f$i\f$ to the closest center. Using this probabilities next center
90
    is chosen. The last step is repeated until required amount of centers is initialized.
91
92
    Pyclustering implementation of the algorithm provides feature to consider several candidates on the second
93
    step, for example:
94
95
    @code
96
        amount_centers = 4;
97
        amount_candidates = 3;
98
        initializer = kmeans_plusplus_initializer(sample, amount_centers, amount_candidates);
99
    @endcode
100
101
    If the farthest points should be used as centers then special constant 'FARTHEST_CENTER_CANDIDATE' should be used
102
    for that purpose, for example:
103
    @code
104
        amount_centers = 4;
105
        amount_candidates = kmeans_plusplus_initializer.FARTHEST_CENTER_CANDIDATE;
106
        initializer = kmeans_plusplus_initializer(sample, amount_centers, amount_candidates);
107
    @endcode
108
109
    There is an example of initial centers that were calculated by the K-Means++ method:
110
111
    @image html kmeans_plusplus_initializer_results.png
112
    
113
    Code example where initial centers are prepared for K-Means algorithm:
114
    @code
115
        # Read data 'SampleSimple3' from Simple Sample collection.
116
        sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3);
117
        
118
        # Calculate initial centers using K-Means++ method.
119
        centers = kmeans_plusplus_initializer(sample, 4).initialize();
120
        
121
        # Display initial centers.
122
        visualizer = cluster_visualizer();
123
        visualizer.append_cluster(sample);
124
        visualizer.append_cluster(centers, marker = '*', markersize = 10);
125
        visualizer.show();
126
        
127
        # Perform cluster analysis using K-Means algorithm with initial centers.
128
        kmeans_instance = kmeans(sample, centers);
129
        
130
        # Run clustering process and obtain result.
131
        kmeans_instance.process();
132
        clusters = kmeans_instance.get_clusters();
133
    @endcode
134
    
135
    """
136
137
138
    ## Constant denotes that only points with highest probabilities should be considered as centers.
139
    FARTHEST_CENTER_CANDIDATE = "farthest";
140
141
142
    def __init__(self, data, amount_centers, amount_candidates = 1):
143
        """!
144
        @brief Creates K-Means++ center initializer instance.
145
        
146
        @param[in] data (array_like): List of points where each point is represented by list of coordinates.
147
        @param[in] amount_centers (uint): Amount of centers that should be initialized.
148
        @param[in] amount_candidates (uint): Amount of candidates that is considered as a center, if the farthest points (with the highest probability) should
149
                    be considered as centers then special constant should be used 'FARTHEST_CENTER_CANDIDATE'.
150
151
        @see FARTHEST_CENTER_CANDIDATE
152
153
        """
154
        
155
        self.__data = numpy.array(data);
156
        self.__amount = amount_centers;
157
        self.__candidates = amount_candidates;
158
159
        self.__check_parameters();
160
161
162
    def __check_parameters(self):
163
        """!
164
        @brief Checks input parameters of the algorithm and if something wrong then corresponding exception is thrown.
165
166
        """
167
        if (self.__amount <= 0) or (self.__amount > len(self.__data)):
168
            raise AttributeError("Amount of cluster centers should be at least 1 and should be less or equal to amount of points in data.");
169
170
        if self.__candidates != kmeans_plusplus_initializer.FARTHEST_CENTER_CANDIDATE:
171
            if (self.__candidates <= 0) or (self.__candidates > len(self.__data)):
172
                raise AttributeError("Amount of candidates centers should be at least 1 and should be less or equal to amount of points in data.");
173
174
        if len(self.__data) == 0:
175
            raise AttributeError("Data is empty.")
176
177
178
    def __calculate_shortest_distances(self, data, centers):
179
        """!
180
        @brief Calculates distance from each data point to nearest center.
181
        
182
        @param[in] data (numpy.array): Array of points for that initialization is performed.
183
        @param[in] centers (numpy.array): Array of points that represents centers.
184
        
185
        @return (numpy.array) List of distances to closest center for each data point.
186
        
187
        """
188
189
        dataset_differences = numpy.zeros((len(centers), len(data)));
190
        for index_center in range(len(centers)):
191
            dataset_differences[index_center] = numpy.sum(
192
                numpy.square(data - centers[index_center]), axis=1).T;
193
194
        shortest_distances = numpy.min(dataset_differences, axis=0);
195
        return shortest_distances;
196
197
198
    def __get_next_center(self, centers):
199
        """!
200
        @brief Calculates the next center for the data.
201
202
        @param[in] centers (array_like): Current initialized centers.
203
204
        @return (array_like) Next initialized center.
205
206
        """
207
208
        distances = self.__calculate_shortest_distances(data=self.__data, centers=centers);
209
210
        if self.__candidates == kmeans_plusplus_initializer.FARTHEST_CENTER_CANDIDATE:
211
            center_index = numpy.argmax(distances);
212
        else:
213
            probabilities = self.__calculate_probabilities(distances);
214
            center_index = self.__get_probable_center(distances, probabilities);
215
216
        return self.__data[center_index];
217
218
219
    def __calculate_probabilities(self, distances):
220
        """!
221
        @brief Calculates cumulative probabilities of being center of each point.
222
223
        @param[in] distances (array_like): Distances from each point to closest center.
224
225
        @return (array_like) Cumulative probabilities of being center of each point.
226
227
        """
228
229
        total_distance = numpy.sum(distances);
230
        if total_distance != 0.0:
231
            probabilities = distances / total_distance;
232
            return numpy.cumsum(probabilities);
233
        else:
234
            return numpy.zeros(len(distances));
235
236
237
    def __get_probable_center(self, distances, probabilities):
238
        """!
239
        @brief Calculates the next probable center considering amount candidates.
240
241
        @param[in] distances (array_like): Distances from each point to closest center.
242
        @param[in] probabilities (array_like): Cumulative probabilities of being center of each point.
243
244
        @return (uint) Index point that is next initialized center.
245
246
        """
247
248
        index_best_candidate = -1;
249
        for _ in range(self.__candidates):
250
            candidate_probability = random.random();
251
            index_candidate = 0;
0 ignored issues
show
Unused Code introduced by
The variable index_candidate seems to be unused.
Loading history...
252
253
            for index_object in range(len(probabilities)):
254
                if candidate_probability < probabilities[index_object]:
255
                    index_candidate = index_object;
256
                    break;
257
258
            if index_best_candidate == -1:
259
                index_best_candidate = index_object;
0 ignored issues
show
Bug introduced by
The loop variable index_object might not be defined here.
Loading history...
260
            elif distances[index_best_candidate] < distances[index_object]:
0 ignored issues
show
Bug introduced by
The loop variable index_object might not be defined here.
Loading history...
261
                index_best_candidate = index_object;
0 ignored issues
show
Bug introduced by
The loop variable index_object might not be defined here.
Loading history...
262
263
        return index_best_candidate;
264
265
266
    def initialize(self):
267
        """!
268
        @brief Calculates initial centers using K-Means++ method.
269
        
270
        @return (list) List of initialized initial centers.
271
        
272
        """
273
274
        index_center = random.randint(0, len(self.__data) - 1);
275
        centers = [ self.__data[ index_center ] ];
276
277
        # For each next center
278
        for _ in range(1, self.__amount):
279
            next_center = self.__get_next_center(centers);
280
            centers.append(next_center);
281
282
        return centers;