Completed
Push — master ( 0ab444...af3192 )
by Andrei
01:25
created

kmeans_visualizer   A

Complexity

Total Complexity 21

Size/Duplication

Total Lines 137
Duplicated Lines 23.36 %

Importance

Changes 0
Metric Value
dl 32
loc 137
rs 10
c 0
b 0
f 0
wmc 21

8 Methods

Rating   Name   Duplication   Size   Complexity  
A __draw_rays() 0 7 2
A __draw_centers() 0 10 3
A __draw_center() 0 10 4
B show_clusters() 0 39 3
A init_frame() 2 2 1
A frame_generation() 12 12 1
B animate_cluster_allocation() 32 38 4
B __draw_cluster_rays() 0 12 5

How to fix   Duplicated Code   

Duplicated Code

Duplicate code is one of the most pungent code smells. A rule that is often used is to re-structure code once it is duplicated in three or more places.

Common duplication problems, and corresponding solutions are:

1
"""!
2
3
@brief Cluster analysis algorithm: K-Means
4
@details Based on book description:
5
         - J.B.MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. 1967.
6
7
@authors Andrei Novikov ([email protected])
8
@date 2014-2018
9
@copyright GNU Public License
10
11
@cond GNU_PUBLIC_LICENSE
12
    PyClustering is free software: you can redistribute it and/or modify
13
    it under the terms of the GNU General Public License as published by
14
    the Free Software Foundation, either version 3 of the License, or
15
    (at your option) any later version.
16
    
17
    PyClustering is distributed in the hope that it will be useful,
18
    but WITHOUT ANY WARRANTY; without even the implied warranty of
19
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
20
    GNU General Public License for more details.
21
    
22
    You should have received a copy of the GNU General Public License
23
    along with this program.  If not, see <http://www.gnu.org/licenses/>.
24
@endcond
25
26
"""
27
28
29
import numpy;
0 ignored issues
show
Configuration introduced by
The import numpy could not be resolved.

This can be caused by one of the following:

1. Missing Dependencies

This error could indicate a configuration issue of Pylint. Make sure that your libraries are available by adding the necessary commands.

# .scrutinizer.yml
before_commands:
    - sudo pip install abc # Python2
    - sudo pip3 install abc # Python3
Tip: We are currently not using virtualenv to run pylint, when installing your modules make sure to use the command for the correct version.

2. Missing __init__.py files

This error could also result from missing __init__.py files in your module folders. Make sure that you place one file in each sub-folder.

Loading history...
30
31
import matplotlib.pyplot as plt;
0 ignored issues
show
Configuration introduced by
The import matplotlib.pyplot could not be resolved.

This can be caused by one of the following:

1. Missing Dependencies

This error could indicate a configuration issue of Pylint. Make sure that your libraries are available by adding the necessary commands.

# .scrutinizer.yml
before_commands:
    - sudo pip install abc # Python2
    - sudo pip3 install abc # Python3
Tip: We are currently not using virtualenv to run pylint, when installing your modules make sure to use the command for the correct version.

2. Missing __init__.py files

This error could also result from missing __init__.py files in your module folders. Make sure that you place one file in each sub-folder.

Loading history...
32
import matplotlib.animation as animation;
0 ignored issues
show
Configuration introduced by
The import matplotlib.animation could not be resolved.

This can be caused by one of the following:

1. Missing Dependencies

This error could indicate a configuration issue of Pylint. Make sure that your libraries are available by adding the necessary commands.

# .scrutinizer.yml
before_commands:
    - sudo pip install abc # Python2
    - sudo pip3 install abc # Python3
Tip: We are currently not using virtualenv to run pylint, when installing your modules make sure to use the command for the correct version.

2. Missing __init__.py files

This error could also result from missing __init__.py files in your module folders. Make sure that you place one file in each sub-folder.

Loading history...
33
34
import pyclustering.core.kmeans_wrapper as wrapper;
35
36
from pyclustering.core.wrapper import ccore_library;
37
38
from pyclustering.cluster.encoder import type_encoding;
39
from pyclustering.cluster import cluster_visualizer;
40
41
42
class kmeans_observer:
43
    """!
44
    @brief Observer of K-Means algorithm that is used to collect information about clustering process on each iteration of the algorithm.
45
    
46
    @see kmeans
47
    
48
    """
49
    
50
    def __init__(self):
51
        """!
52
        @brief Initializer of observer of K-Means algorithm.
53
        
54
        """
55
        self.__evolution_clusters   = [];
56
        self.__evolution_centers    = [];
57
        self.__initial_centers      = [];
58
59
60
    def __len__(self):
61
        """!
62
        @brief Returns amount of steps that were observer during clustering process in K-Means algorithm.
63
        
64
        """
65
        return len(self.__evolution_clusters);
66
67
68
    def notify(self, clusters, centers):
69
        """!
70
        @brief This method is called by K-Means algorithm to notify about changes.
71
        
72
        @param[in] clusters (array_like): Allocated clusters by K-Means algorithm.
73
        @param[in] centers (array_like): Allocated centers by K-Means algorithm.
74
        
75
        """
76
        self.__evolution_clusters.append(clusters);
77
        self.__evolution_centers.append(centers);
78
79
80
    def set_evolution_centers(self, evolution_centers):
81
        """!
82
        @brief Set evolution of changes of centers during clustering process.
83
        
84
        @param[in] evolution_centers (array_like): Evolution of changes of centers during clustering process.
85
        
86
        """
87
        self.__evolution_centers = evolution_centers;
88
89
90
    def get_centers(self, iteration):
91
        """!
92
        @brief Get method to return centers at specific iteration of clustering process.
93
        
94
        @param[in] iteration (uint): Clustering process iteration at which centers are required.
95
        
96
        @return (array_like) Centers at specific iteration.
97
        
98
        """
99
        return self.__evolution_centers[iteration];
100
101
102
    def set_evolution_clusters(self, evolution_clusters):
103
        """!
104
        @brief Set evolution of changes of centers during clustering process.
105
        
106
        @param[in] evolution_clusters (array_like): Evolution of changes of clusters during clustering process.
107
        
108
        """
109
        self.__evolution_clusters = evolution_clusters;
110
111
112
    def get_clusters(self, iteration):
113
        """!
114
        @brief Get method to return allocated clusters at specific iteration of clustering process.
115
        
116
        @param[in] iteration (uint): Clustering process iteration at which clusters are required.
117
        
118
        @return (array_like) Clusters at specific iteration.
119
        
120
        """
121
        return self.__evolution_clusters[iteration];
122
123
124
125
class kmeans_visualizer:
126
    """!
127
    @brief Visualizer of K-Means algorithm's results.
128
    @details K-Means visualizer provides visualization services that are specific for K-Means algorithm.
129
    
130
    """
131
    
132
    __default_2d_marker_size = 15;
133
    __default_3d_marker_size = 70;
134
    
135
    
136
    @staticmethod
137
    def show_clusters(sample, clusters, centers, initial_centers = None, **kwargs):
138
        """!
139
        @brief Display K-Means clustering results.
140
        
141
        @param[in] sample (list): Dataset that was used for clustering.
142
        @param[in] clusters (array_like): Clusters that were allocated by the algorithm.
143
        @param[in] centers (array_like): Centers that were allocated by the algorithm.
144
        @param[in] initial_centers (array_like): Initial centers that were used by the algorithm, if 'None' then initial centers are not displyed.
145
        @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'figure', 'display').
146
        
147
        Keyword Args:
148
            figure (figure): If 'None' then new is figure is created, otherwise specified figure is used for visualization.
149
            display (bool): If 'True' then figure will be shown by the method, otherwise it should be shown manually using matplotlib function 'plt.show()'.
150
            offset (uint): Specify axes index on the figure where results should be drawn (only if argument 'figure' is specified).
151
        
152
        @return (figure) Figure where clusters were drawn.
153
        
154
        """
155
156
        visualizer = cluster_visualizer();
157
        visualizer.append_clusters(clusters, sample);
158
        
159
        offset = kwargs.get('offset', 0);
160
        figure = kwargs.get('figure', None);
161
        display = kwargs.get('display', True);
162
163
        if (figure is None):
164
            figure = visualizer.show(display = False);
165
        else:
166
            visualizer.show(figure = figure, display = False);
167
        
168
        kmeans_visualizer.__draw_centers(figure, offset, visualizer, centers, initial_centers);
169
        kmeans_visualizer.__draw_rays(figure, offset, visualizer, sample, clusters, centers);
170
        
171
        if (display is True):
172
            plt.show();
173
174
        return figure;
175
176
177
    @staticmethod
178
    def __draw_rays(figure, offset, visualizer, sample, clusters, centers):
179
        ax = figure.get_axes()[offset];
180
        
181
        for index_cluster in range(len(clusters)):
182
            color = visualizer.get_cluster_color(index_cluster, 0);
183
            kmeans_visualizer.__draw_cluster_rays(ax, color, sample, clusters[index_cluster], centers[index_cluster])
184
185
186
    @staticmethod
187
    def __draw_cluster_rays(ax, color, sample, cluster, center):
188
        dimension = len(sample[0]);
189
        
190
        for index_point in cluster:
191
            point = sample[index_point];
192
            if dimension == 1:
193
                ax.plot([point[0], center[0]], [0.0, 0.0], '-', color=color, linewidth=0.5);
194
            elif dimension == 2:
195
                ax.plot([point[0], center[0]], [point[1], center[1]], '-', color=color, linewidth=0.5);
196
            elif dimension == 3:
197
                ax.plot([point[0], center[0]], [point[1], center[1]], [point[2], center[2]], '-', color=color, linewidth=0.5)
198
199
200
    @staticmethod
201
    def __draw_center(ax, center, color, marker, alpha):
202
        dimension = len(center);
203
        
204
        if dimension == 1:
205
            ax.plot(center[0], 0.0, color=color, alpha=alpha, marker=marker, markersize=kmeans_visualizer.__default_2d_marker_size);
206
        elif dimension == 2:
207
            ax.plot(center[0], center[1], color=color, alpha=alpha, marker=marker, markersize=kmeans_visualizer.__default_2d_marker_size);
208
        elif dimension == 3:
209
            ax.scatter(center[0], center[1], center[2], c=color, alpha=alpha, marker=marker, s=kmeans_visualizer.__default_3d_marker_size);
210
211
212
    @staticmethod
213
    def __draw_centers(figure, offset, visualizer, centers, initial_centers):
214
        ax = figure.get_axes()[offset];
215
        
216
        for index_center in range(len(centers)):
217
            color = visualizer.get_cluster_color(index_center, 0);
218
            kmeans_visualizer.__draw_center(ax, centers[index_center], color, '*', 1.0);
219
            
220
            if initial_centers is not None:
221
                kmeans_visualizer.__draw_center(ax, initial_centers[index_center], color, '*', 0.4);
222
223
224
    @staticmethod
225
    def animate_cluster_allocation(data, observer, animation_velocity = 500, movie_fps = 1, save_movie = None):
226
        """!
227
        @brief Animates clustering process that is performed by K-Means algorithm.
228
229
        @param[in] data (list): Dataset that is used for clustering.
230 View Code Duplication
        @param[in] observer (kmeans_observer): EM observer that was used for collection information about clustering process.
0 ignored issues
show
Duplication introduced by
This code seems to be duplicated in your project.
Loading history...
231
        @param[in] animation_velocity (uint): Interval between frames in milliseconds (for run-time animation only).
232
        @param[in] movie_fps (uint): Defines frames per second (for rendering movie only).
233
        @param[in] save_movie (string): If it is specified then animation will be stored to file that is specified in this parameter.
234
235
        """
236
        figure = plt.figure();
237
238
        def init_frame():
239
            return frame_generation(0);
240
241
        def frame_generation(index_iteration):
242
            figure.clf();
243
244
            figure.suptitle("K-Means algorithm (iteration: " + str(index_iteration) + ")", fontsize=18, fontweight='bold');
245
246
            clusters = observer.get_clusters(index_iteration);
247
            centers = observer.get_centers(index_iteration);
248
            kmeans_visualizer.show_clusters(data, clusters, centers, None, figure=figure, display=False);
249
250
            figure.subplots_adjust(top=0.85);
251
252
            return [figure.gca()];
253
254
        iterations = len(observer);
255
        cluster_animation = animation.FuncAnimation(figure, frame_generation, iterations, interval=animation_velocity,
256
                                                    init_func=init_frame, repeat_delay=5000);
257
258
        if save_movie is not None:
259
            cluster_animation.save(save_movie, writer='ffmpeg', fps=movie_fps, bitrate=3000);
260
        else:
261
            plt.show();
262
263
264
265
class kmeans:
266
    """!
267
    @brief Class represents K-Means clustering algorithm.
268
    @details CCORE option can be used to use the pyclustering core - C/C++ shared library for processing that significantly increases performance.
269
    
270
    CCORE implementation of the algorithm uses thread pool to parallelize the clustering process.
271
    
272
    K-Means clustering results depend on initial centers. Algorithm K-Means++ can used for initialization 
273
    initial centers from module 'pyclustering.cluster.center_initializer'.
274
    
275
    @image html kmeans_example_clustering.png "K-Means clustering results. At the left - 'Simple03.data' sample, at the right - 'Lsun.data' sample."
276
    
277
    Example #1 - Trivial clustering:
278
    @code
279
        # load list of points for cluster analysis
280
        sample = read_sample(path);
281
        
282
        # create instance of K-Means algorithm
283
        kmeans_instance = kmeans(sample, [ [0.0, 0.1], [2.5, 2.6] ]);
284
        
285
        # run cluster analysis and obtain results
286
        kmeans_instance.process();
287
        clusters = kmeans_instance.get_clusters();
288
    @endcode
289
    
290
    Example #2 - Clustering using K-Means++ for center initialization:
291
    @code
292
        # load list of points for cluster analysis
293
        sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE2);
294
        
295
        # initialize initial centers using K-Means++ method
296
        initial_centers = kmeans_plusplus_initializer(sample, 3).initialize();
297
        
298
        # create instance of K-Means algorithm with prepared centers
299
        kmeans_instance = kmeans(sample, initial_centers);
300
        
301
        # run cluster analysis and obtain results
302
        kmeans_instance.process();
303
        clusters = kmeans_instance.get_clusters();
304
        final_centers = kmeans_instance.get_centers();
305
    @endcode
306
    
307
    @see center_initializer
308
    
309
    """
310
    
311
    def __init__(self, data, initial_centers, tolerance = 0.001, ccore = True, **kwargs):
312
        """!
313
        @brief Constructor of clustering algorithm K-Means.
314
        @details Center initializer can be used for creating initial centers, for example, K-Means++ method.
315
        
316
        @param[in] data (array_like): Input data that is presented as array of points (objects), each point should be represented by array_like data structure.
317
        @param[in] initial_centers (array_like): Initial coordinates of centers of clusters that are represented by array_like data structure: [center1, center2, ...].
318
        @param[in] tolerance (double): Stop condition: if maximum value of change of centers of clusters is less than tolerance then algorithm stops processing.
319
        @param[in] ccore (bool): Defines should be CCORE library (C++ pyclustering library) used instead of Python code or not.
320
        @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'observer').
321
        
322
        Keyword Args:
323
            observer (kmeans_observer): Observer of the algorithm to collect information about clustering process on each iteration.
324
        
325
        @see center_initializer
326
        
327
        """
328
        self.__pointer_data = numpy.matrix(data);
329
        self.__clusters = [];
330
        self.__centers = numpy.matrix(initial_centers);
331
        self.__tolerance = tolerance;
332
        
333
        self.__observer = None;
334
        if 'observer' in kwargs:
335
            self.__observer = kwargs['observer'];
336
        
337
        self.__ccore = ccore;
338
        if self.__ccore is True:
339
            self.__ccore = ccore_library.workable();
340
341
342
    def process(self):
343
        """!
344
        @brief Performs cluster analysis in line with rules of K-Means algorithm.
345
        
346
        @remark Results of clustering can be obtained using corresponding get methods.
347
        
348
        @see get_clusters()
349
        @see get_centers()
350
        
351
        """
352
353
        if len(self.__pointer_data[0]) != len(self.__centers[0]):
354
            raise NameError('Dimension of the input data and dimension of the initial cluster centers must be equal.');
355
356
        if self.__ccore is True:
357
            self.__process_by_ccore();
358
        else:
359
            self.__process_by_python();
360
361
362
    def __process_by_ccore(self):
363
        """!
364
        @brief Performs cluster analysis using CCORE (C/C++ part of pyclustering library).
365
366
        """
367
        results = wrapper.kmeans(self.__pointer_data, self.__centers, self.__tolerance, (self.__observer is not None));
368
        self.__clusters = results[0];
369
        self.__centers = results[1];
370
371
        if self.__observer is not None:
372
            self.__observer.set_evolution_clusters(results[2]);
373
            self.__observer.set_evolution_centers(results[3]);
374
375
376
    def __process_by_python(self):
377
        """!
378
        @brief Performs cluster analysis using python code.
379
380
        """
381
382
        maximum_change = float('inf');
383
        stop_condition = self.__tolerance * self.__tolerance;
384
385
        if self.__observer is not None:
386
            initial_clusters = self.__update_clusters();
387
            self.__observer.notify(initial_clusters, self.__centers.tolist());
388
389
        while maximum_change > stop_condition:
390
            self.__clusters = self.__update_clusters();
391
            updated_centers = self.__update_centers();  # changes should be calculated before assignment
392
393
            if self.__observer is not None:
394
                self.__observer.notify(self.__clusters, updated_centers.tolist());
395
396
            if len(self.__centers) != len(updated_centers):
397
                maximum_change = float('inf');
398
399
            else:
400
                changes = numpy.sum(numpy.square(self.__centers - updated_centers), axis=1);
401
                maximum_change = numpy.max(changes);
402
403
            self.__centers = updated_centers.tolist();
404
405
406
    def get_clusters(self):
407
        """!
408
        @brief Returns list of allocated clusters, each cluster contains indexes of objects in list of data.
409
        
410
        @see process()
411
        @see get_centers()
412
        
413
        """
414
        
415
        return self.__clusters;
416
    
417
    
418
    def get_centers(self):
419
        """!
420
        @brief Returns list of centers of allocated clusters.
421
        
422
        @see process()
423
        @see get_clusters()
424
        
425
        """
426
427
        if isinstance(self.__centers, list):
428
            return self.__centers;
429
        
430
        return self.__centers.tolist();
431
432
433
    def get_cluster_encoding(self):
434
        """!
435
        @brief Returns clustering result representation type that indicate how clusters are encoded.
436
        
437
        @return (type_encoding) Clustering result representation.
438
        
439
        @see get_clusters()
440
        
441
        """
442
        
443
        return type_encoding.CLUSTER_INDEX_LIST_SEPARATION;
444
445
446
    def __update_clusters(self):
447
        """!
448
        @brief Calculate Euclidean distance to each point from the each cluster. Nearest points are captured by according clusters and as a result clusters are updated.
449
        
450
        @return (list) updated clusters as list of clusters. Each cluster contains indexes of objects from data.
451
        
452
        """
453
        
454
        clusters = [[] for _ in range(len(self.__centers))];
455
        
456
        dataset_differences = numpy.zeros((len(clusters), len(self.__pointer_data)));
457
        for index_center in range(len(self.__centers)):
458
            dataset_differences[index_center] = numpy.sum(numpy.square(self.__pointer_data - self.__centers[index_center]), axis=1).T;
459
        
460
        optimum_indexes = numpy.argmin(dataset_differences, axis=0);
461
        for index_point in range(len(optimum_indexes)):
462
            index_cluster = optimum_indexes[index_point];
463
            clusters[index_cluster].append(index_point);
464
        
465
        clusters = [cluster for cluster in clusters if len(cluster) > 0];
466
        
467
        return clusters;
468
    
469
    
470
    def __update_centers(self):
471
        """!
472
        @brief Calculate centers of clusters in line with contained objects.
473
        
474
        @return (numpy.matrix) Updated centers as list of centers.
475
        
476
        """
477
        
478
        dimension = self.__pointer_data.shape[1];
479
        centers = numpy.zeros((len(self.__clusters), dimension));
480
        
481
        for index in range(len(self.__clusters)):
482
            cluster_points = self.__pointer_data[self.__clusters[index], :];
483
            centers[index] = cluster_points.mean(axis=0);
484
485
        return numpy.matrix(centers);
486