Completed
Push — 0.8.dev ( 2aeeeb...e36293 )
by Andrei
01:07
created

kmeans_visualizer   A

Complexity

Total Complexity 19

Size/Duplication

Total Lines 100
Duplicated Lines 0 %

Importance

Changes 0
Metric Value
dl 0
loc 100
rs 10
c 0
b 0
f 0
wmc 19

6 Methods

Rating   Name   Duplication   Size   Complexity  
A __draw_rays() 0 7 2
A __draw_centers() 0 10 3
A __draw_center() 0 10 4
B show_clusters() 0 34 3
A __get_argument() 0 6 2
B __draw_cluster_rays() 0 12 5
1
"""!
2
3
@brief Cluster analysis algorithm: K-Means
4
@details Based on book description:
5
         - J.B.MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. 1967.
6
7
@authors Andrei Novikov ([email protected])
8
@date 2014-2018
9
@copyright GNU Public License
10
11
@cond GNU_PUBLIC_LICENSE
12
    PyClustering is free software: you can redistribute it and/or modify
13
    it under the terms of the GNU General Public License as published by
14
    the Free Software Foundation, either version 3 of the License, or
15
    (at your option) any later version.
16
    
17
    PyClustering is distributed in the hope that it will be useful,
18
    but WITHOUT ANY WARRANTY; without even the implied warranty of
19
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
20
    GNU General Public License for more details.
21
    
22
    You should have received a copy of the GNU General Public License
23
    along with this program.  If not, see <http://www.gnu.org/licenses/>.
24
@endcond
25
26
"""
27
28
29
import numpy;
0 ignored issues
show
Configuration introduced by
The import numpy could not be resolved.

This can be caused by one of the following:

1. Missing Dependencies

This error could indicate a configuration issue of Pylint. Make sure that your libraries are available by adding the necessary commands.

# .scrutinizer.yml
before_commands:
    - sudo pip install abc # Python2
    - sudo pip3 install abc # Python3
Tip: We are currently not using virtualenv to run pylint, when installing your modules make sure to use the command for the correct version.

2. Missing __init__.py files

This error could also result from missing __init__.py files in your module folders. Make sure that you place one file in each sub-folder.

Loading history...
30
31
import matplotlib.pyplot as plt;
0 ignored issues
show
Configuration introduced by
The import matplotlib.pyplot could not be resolved.

This can be caused by one of the following:

1. Missing Dependencies

This error could indicate a configuration issue of Pylint. Make sure that your libraries are available by adding the necessary commands.

# .scrutinizer.yml
before_commands:
    - sudo pip install abc # Python2
    - sudo pip3 install abc # Python3
Tip: We are currently not using virtualenv to run pylint, when installing your modules make sure to use the command for the correct version.

2. Missing __init__.py files

This error could also result from missing __init__.py files in your module folders. Make sure that you place one file in each sub-folder.

Loading history...
32
33
import pyclustering.core.kmeans_wrapper as wrapper;
34
35
from pyclustering.core.wrapper import ccore_library;
36
37
from pyclustering.cluster.encoder import type_encoding;
38
from pyclustering.cluster import cluster_visualizer;
39
40
41
class kmeans_observer:
42
    """!
43
    @brief Observer of K-Means algorithm that is used to collect information about clustering process on each iteration of the algorithm.
44
    
45
    @see kmeans
46
    
47
    """
48
    
49
    def __init__(self, evolution_clusters = [], evolution_centers = []):
0 ignored issues
show
Bug Best Practice introduced by
The default value [] might cause unintended side-effects.

Objects as default values are only created once in Python and not on each invocation of the function. If the default object is modified, this modification is carried over to the next invocation of the method.

# Bad:
# If array_param is modified inside the function, the next invocation will
# receive the modified object.
def some_function(array_param=[]):
    # ...

# Better: Create an array on each invocation
def some_function(array_param=None):
    array_param = array_param or []
    # ...
Loading history...
Unused Code introduced by
The argument evolution_centers seems to be unused.
Loading history...
Unused Code introduced by
The argument evolution_clusters seems to be unused.
Loading history...
50
        """!
51
        @brief Initializer of observer of K-Means algorithm.
52
        
53
        @param[in] evolution_clusters (array_like): Evolution of clusters changes during clustering process.
54
        @param[in] evolution_centers (array_like): Evolution of centers changes during clustering process.
55
        
56
        """
57
        self.__evolution_clusters   = [];
58
        self.__evolution_centers    = [];
59
60
61
    def __len__(self):
62
        """!
63
        @brief Returns amount of steps that were observer during clustering process in K-Means algorithm.
64
        
65
        """
66
        return len(self.__evolution_clusters);
67
68
69
    def notify(self, clusters, centers):
70
        """!
71
        @brief This method is called by K-Means algorithm to notify about changes.
72
        
73
        @param[in] clusters (array_like): Allocated clusters by K-Means algorithm.
74
        @param[in] centers (array_like): Allocated centers by K-Means algorithm.
75
        
76
        """
77
        self.__evolution_clusters.append(clusters);
78
        self.__evolution_clusters.append(centers);
79
80
81
    def get_centers(self, iteration):
82
        """!
83
        @brief Get method to return centers at specific iteration of clustering process.
84
        
85
        @param[in] iteration (uint): Clustering process iteration at which centers are required.
86
        
87
        @return (array_like) Centers at specific iteration.
88
        
89
        """
90
        return self.__evolution_centers[iteration];
91
92
93
    def get_clusters(self, iteration):
94
        """!
95
        @brief Get method to return allocated clusters at specific iteration of clustering process.
96
        
97
        @param[in] iteration (uint): Clustering process iteration at which clusters are required.
98
        
99
        @return (array_like) Clusters at specific iteration.
100
        
101
        """
102
        return self.__evolution_clusters[iteration];
103
104
105
106
class kmeans_visualizer:
107
    """!
108
    @brief Visualizer of K-Means algorithm's results.
109
    @details K-Means visualizer provides visualization services that are specific for K-Means algorithm.
110
    
111
    """
112
    
113
    __default_2d_marker_size = 15;
114
    __default_3d_marker_size = 70;
115
    
116
    
117
    @staticmethod
118
    def show_clusters(sample, clusters, centers, initial_centers = None, **kwargs):
119
        """!
120
        @brief 
121
        
122
        @param[in] sample (list): Dataset that were used for clustering.
123
        @param[in] clusters (array_like): Clusters that were allocated by the algorithm.
124
        @param[in] centers (array_like): Centers that were allocated by the algorithm.
125
        @param[in] initial_centers (array_like): Initial centers that were used by the algorithm, if 'None' then initial centers are not displyed.
126
        @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'figure', 'display').
127
        
128
        Keyword Args:
129
            figure (figure): If 'None' then new is figure is creater, otherwise specified figure is used for visualization.
130
            display (bool): If 'True' then figure will be shown by the method, otherwise it should be shown manually using matplotlib function 'plt.show()'.
131
        
132
        @return (figure) Figure where clusters were drawn.
133
        
134
        """
135
136
        visualizer = cluster_visualizer();
137
        visualizer.append_clusters(clusters, sample);
138
        
139
        if (kmeans_visualizer.__get_argument('figure', None, **kwargs) is None):
140
            figure = visualizer.show(display = False);
141
        else:
142
            visualizer.show(figure = figure, display = False);
143
        
144
        kmeans_visualizer.__draw_centers(figure, visualizer, centers, initial_centers);
145
        kmeans_visualizer.__draw_rays(figure, visualizer, sample, clusters, centers);
146
        
147
        if (kmeans_visualizer.__get_argument('display', True, **kwargs) is True):
148
            plt.show();
149
150
        return figure;
151
152
153
    @staticmethod
154
    def __draw_rays(figure, visualizer, sample, clusters, centers):
155
        ax = figure.get_axes()[0];
156
        
157
        for index_cluster in range(len(clusters)):
158
            color = visualizer.get_cluster_color(index_cluster, 0);
159
            kmeans_visualizer.__draw_cluster_rays(ax, color, sample, clusters[index_cluster], centers[index_cluster])
160
161
162
    @staticmethod
163
    def __draw_cluster_rays(ax, color, sample, cluster, center):
164
        dimension = len(sample[0]);
165
        
166
        for index_point in cluster:
167
            point = sample[index_point];
168
            if (dimension == 1):
169
                ax.plot([point[0], center[0]], [0.0, 0.0], '-', color=color, linewidth=0.5);
170
            elif (dimension == 2):
171
                ax.plot([point[0], center[0]], [point[1], center[1]], '-', color=color, linewidth=0.5);
172
            elif (dimension == 3):
173
                ax.plot([point[0], center[0]], [point[1], center[1]], [point[2], center[2]], '-', color=color, linewidth=0.5)
174
175
176
    @staticmethod
177
    def __draw_center(ax, center, color, marker, alpha):
178
        dimension = len(center);
179
        
180
        if (dimension == 1):
181
            ax.plot(center[0], 0.0, color=color, alpha=alpha, marker=marker, markersize=kmeans_visualizer.__default_2d_marker_size);
182
        elif (dimension == 2):
183
            ax.plot(center[0], center[1], color=color, alpha=alpha, marker=marker, markersize=kmeans_visualizer.__default_2d_marker_size);
184
        elif (dimension == 3):
185
            ax.scatter(center[0], center[1], center[2], c=color, alpha=alpha, marker=marker, s=kmeans_visualizer.__default_3d_marker_size);
186
187
188
    @staticmethod
189
    def __draw_centers(figure, visualizer, centers, initial_centers):
190
        ax = figure.get_axes()[0];
191
        
192
        for index_center in range(len(centers)):
193
            color = visualizer.get_cluster_color(index_center, 0);
194
            kmeans_visualizer.__draw_center(ax, centers[index_center], color, '*', 1.0);
195
            
196
            if initial_centers:
197
                kmeans_visualizer.__draw_center(ax, initial_centers[index_center], color, '*', 0.4);
198
199
200
    @staticmethod
201
    def __get_argument(argument_name, default_value, **kwargs):
202
        if (argument_name in kwargs):
203
            return kwargs[argument_name];
204
        
205
        return default_value;
206
207
208
209
class kmeans:
210
    """!
211
    @brief Class represents clustering algorithm K-Means.
212
    @details CCORE option can be used to use the pyclustering core - C/C++ shared library for processing that significantly increases performance.
213
    
214
             CCORE implementation of the algorithm uses thread pool to parallelize the clustering process.
215
             
216
             K-Means clustering results depend on initial centers. Algorithm K-Means++ can used for initialization 
217
             initial centers from module 'pyclustering.cluster.center_initializer'.
218
    
219
    Example #1 - Trivial clustering:
220
    @code
221
        # load list of points for cluster analysis
222
        sample = read_sample(path);
223
        
224
        # create instance of K-Means algorithm
225
        kmeans_instance = kmeans(sample, [ [0.0, 0.1], [2.5, 2.6] ]);
226
        
227
        # run cluster analysis and obtain results
228
        kmeans_instance.process();
229
        clusters = kmeans_instance.get_clusters();
230
    @endcode
231
    
232
    Example #2 - Clustering using K-Means++ for center initialization:
233
    @code
234
        # load list of points for cluster analysis
235
        sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE2);
236
        
237
        # initialize initial centers using K-Means++ method
238
        initial_centers = kmeans_plusplus_initializer(sample, 3).initialize();
239
        
240
        # create instance of K-Means algorithm with prepared centers
241
        kmeans_instance = kmeans(sample, initial_centers);
242
        
243
        # run cluster analysis and obtain results
244
        kmeans_instance.process();
245
        clusters = kmeans_instance.get_clusters();
246
        final_centers = kmeans_instance.get_centers();
247
    @endcode
248
    
249
    @see center_initializer
250
    
251
    """
252
    
253
    def __init__(self, data, initial_centers, tolerance = 0.001, ccore = True, **kwargs):
254
        """!
255
        @brief Constructor of clustering algorithm K-Means.
256
        @details Center initializer can be used for creating initial centers, for example, K-Means++ method.
257
        
258
        @param[in] data (array_like): Input data that is presented as array of points (objects), each point should be represented by array_like data structure.
259
        @param[in] initial_centers (array_like): Initial coordinates of centers of clusters that are represented by array_like data structure: [center1, center2, ...].
260
        @param[in] tolerance (double): Stop condition: if maximum value of change of centers of clusters is less than tolerance then algorithm stops processing.
261
        @param[in] ccore (bool): Defines should be CCORE library (C++ pyclustering library) used instead of Python code or not.
262
        @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'observer').
263
        
264
        Keyword Args:
265
            observer (kmeans_observer): Observer of the algorithm to collect information about clustering process on each iteration.
266
        
267
        @see center_initializer
268
        
269
        """
270
        self.__pointer_data = numpy.matrix(data);
271
        self.__clusters = [];
272
        self.__centers = numpy.matrix(initial_centers);
273
        self.__tolerance = tolerance;
274
        
275
        self.__observer = None;
276
        if 'observer' in kwargs:
277
            self.__observer = kwargs['observer'];
278
        
279
        self.__ccore = ccore;
280
        if (self.__ccore):
281
            self.__ccore = ccore_library.workable();
282
283
284
    def process(self):
285
        """!
286
        @brief Performs cluster analysis in line with rules of K-Means algorithm.
287
        
288
        @remark Results of clustering can be obtained using corresponding get methods.
289
        
290
        @see get_clusters()
291
        @see get_centers()
292
        
293
        """
294
        
295
        if (self.__ccore is True):
296
            results = wrapper.kmeans(self.__pointer_data, self.__centers, self.__tolerance, (self.__observer is not None));
297
            self.__clusters = results[0];
298
            self.__centers  = results[1];
299
            
300
            if self.__observer:
301
                self.__observer = kmeans_observer(results[2], results[3]);
302
        else:
303
            maximum_change = float('inf');
304
             
305
            stop_condition = self.__tolerance * self.__tolerance;   # Fast solution
306
            #stop_condition = self.__tolerance;              # Slow solution
307
             
308
            # Check for dimension
309
            if (len(self.__pointer_data[0]) != len(self.__centers[0])):
310
                raise NameError('Dimension of the input data and dimension of the initial cluster centers must be equal.');
311
             
312
            while (maximum_change > stop_condition):
313
                self.__clusters = self.__update_clusters();
314
                updated_centers = self.__update_centers();  # changes should be calculated before asignment
315
                
316
                if self.__observer:
317
                    self.__observer.notify(self.__clusters, updated_centers);
318
                
319
                if (len(self.__centers) != len(updated_centers)):
320
                    maximum_change = float('inf');
321
                
322
                else:
323
                    changes = numpy.sum(numpy.square(self.__centers - updated_centers), axis=1);
324
                    maximum_change = numpy.max(changes);
325
                
326
                self.__centers = updated_centers;
327
328
329
    def get_clusters(self):
330
        """!
331
        @brief Returns list of allocated clusters, each cluster contains indexes of objects in list of data.
332
        
333
        @see process()
334
        @see get_centers()
335
        
336
        """
337
        
338
        return self.__clusters;
339
    
340
    
341
    def get_centers(self):
342
        """!
343
        @brief Returns list of centers of allocated clusters.
344
        
345
        @see process()
346
        @see get_clusters()
347
        
348
        """
349
350
        return self.__centers.tolist();
351
352
353
    def get_cluster_encoding(self):
354
        """!
355
        @brief Returns clustering result representation type that indicate how clusters are encoded.
356
        
357
        @return (type_encoding) Clustering result representation.
358
        
359
        @see get_clusters()
360
        
361
        """
362
        
363
        return type_encoding.CLUSTER_INDEX_LIST_SEPARATION;
364
365
366
    def __update_clusters(self):
367
        """!
368
        @brief Calculate Euclidean distance to each point from the each cluster. Nearest points are captured by according clusters and as a result clusters are updated.
369
        
370
        @return (list) updated clusters as list of clusters. Each cluster contains indexes of objects from data.
371
        
372
        """
373
        
374
        clusters = [[] for _ in range(len(self.__centers))];
375
        
376
        dataset_differences = numpy.zeros((len(clusters), len(self.__pointer_data)));
377
        for index_center in range(len(self.__centers)):
378
            dataset_differences[index_center] = numpy.sum(numpy.square(self.__pointer_data - self.__centers[index_center]), axis=1).T;
379
        
380
        optimum_indexes = numpy.argmin(dataset_differences, axis=0);
381
        for index_point in range(len(optimum_indexes)):
382
            index_cluster = optimum_indexes[index_point];
383
            clusters[index_cluster].append(index_point);
384
        
385
        clusters = [cluster for cluster in clusters if len(cluster) > 0];
386
        
387
        return clusters;
388
    
389
    
390
    def __update_centers(self):
391
        """!
392
        @brief Calculate centers of clusters in line with contained objects.
393
        
394
        @return (numpy.matrix) Updated centers as list of centers.
395
        
396
        """
397
        
398
        dimension = self.__pointer_data.shape[1];
399
        centers = numpy.zeros((len(self.__clusters), dimension));
400
        
401
        for index in range(len(self.__clusters)):
402
            cluster_points = self.__pointer_data[self.__clusters[index], :];
403
            centers[index] = cluster_points.mean(axis=0);
404
405
        return numpy.matrix(centers);
406