kmeans_plusplus_initializer.initialize() - Code Metrics - Inspection of "[.] Merge branch '0.8.dev'." - annoviko/pyclustering - Measure and Improve Code Quality continuously with Scrutinizer

Completed

Push — master ( c8bc69...89bc14 )

by Andrei

created 2018-02-23 14:25 UTC

kmeans_plusplus_initializer.initialize() A

↳ Parent: kmeans_plusplus_initializer

Complexity

Conditions

Size

Total Lines

Duplication

Lines	0
Ratio	0 %

Importance

Changes

Metric	Value
cc	2
dl	0
loc	17
rs	9.4285
c	0
b	0
f	0

"""!

@brief Collection of center initializers for algorithm that uses initial centers, for example, for K-Means or X-Means.
@details Implementations based on articles:
         - K-Means++: The Advantages of careful seeding. D. Arthur, S. Vassilvitskii. 2007.
         
@authors Andrei Novikov, Aleksey Kukushkin ([email protected])
@date 2014-2018
@copyright GNU Public License

@see kmeans
@see xmeans

@cond GNU_PUBLIC_LICENSE
    PyClustering is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.
    
    PyClustering is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.
    
    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.
@endcond

"""


import numpy;
# .scrutinizer.yml
before_commands:
    - sudo pip install abc # Python2
    - sudo pip3 install abc # Python3
import random;

from pyclustering.utils import euclidean_distance;



class random_center_initializer:
    """!
    @brief Random center initializer is for generation specified amount of random of centers for specified data.
    
    """

    def __init__(self, data, amount_centers):
        """!
        @brief Creates instance of random center initializer.
        
        @param[in] data (list): List of points where each point is represented by list of coordinates.
        @param[in] amount_centers (unit): Amount of centers that should be initialized.
        
        """
        
        self.__data = data;
        self.__amount = amount_centers;

        if self.__amount <= 0:
            raise AttributeError("Amount of cluster centers should be at least 1.");


    def initialize(self):
        """!
        @brief Generates random centers in line with input parameters.
        
        @return (list) List of centers where each center is represented by list of coordinates.
        
        """
        return [ self.__create_center() for _ in range(self.__amount) ];


    def __create_center(self):
        """!
        @brief Generates and returns random center.
        
        """
        return [ random.random() for _ in range(len(self.__data[0])) ];



class kmeans_plusplus_initializer:
    """!

    @brief K-Means++ is an algorithm for choosing the initial centers for algorithms like K-Means or X-Means.
    @details K-Means++ algorithm guarantees an approximation ratio O(log k). Clustering results are depends on
              initial centers in case of K-Means algorithm and even in case of X-Means. This method is used to find
              out optimal initial centers.

    Algorithm can be divided into three steps. The first center is chosen from input data randomly with
    uniform distribution at the first step. At the second, probability to being center is calculated for each point:
    \f[p_{i}=\frac{D(x_{i})}{\sum_{j=0}^{N}D(x_{j})}\f]
    where \f$D(x_{i})\f$ is a distance from point \f$i\f$ to the closest center. Using this probabilities next center
    is chosen. The last step is repeated until required amount of centers is initialized.

    Pyclustering implementation of the algorithm provides feature to consider several candidates on the second
    step, for example:

    @code
        amount_centers = 4;
        amount_candidates = 3;
        initializer = kmeans_plusplus_initializer(sample, amount_centers, amount_candidates);
    @endcode

    If the farthest points should be used as centers then special constant 'FARTHEST_CENTER_CANDIDATE' should be used
    for that purpose, for example:
    @code
        amount_centers = 4;
        amount_candidates = kmeans_plusplus_initializer.FARTHEST_CENTER_CANDIDATE;
        initializer = kmeans_plusplus_initializer(sample, amount_centers, amount_candidates);
    @endcode

    There is an example of initial centers that were calculated by the K-Means++ method:

    @image html kmeans_plusplus_initializer_results.png
    
    Code example where initial centers are prepared for K-Means algorithm:
    @code
        # Read data 'SampleSimple3' from Simple Sample collection.
        sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3);
        
        # Calculate initial centers using K-Means++ method.
        centers = kmeans_plusplus_initializer(sample, 4).initialize();
        
        # Display initial centers.
        visualizer = cluster_visualizer();
        visualizer.append_cluster(sample);
        visualizer.append_cluster(centers, marker = '*', markersize = 10);
        visualizer.show();
        
        # Perform cluster analysis using K-Means algorithm with initial centers.
        kmeans_instance = kmeans(sample, centers);
        
        # Run clustering process and obtain result.
        kmeans_instance.process();
        clusters = kmeans_instance.get_clusters();
    @endcode
    
    """


    ## Constant denotes that only points with highest probabilities should be considered as centers.
    FARTHEST_CENTER_CANDIDATE = "farthest";


    def __init__(self, data, amount_centers, amount_candidates = 1):
        """!
        @brief Creates K-Means++ center initializer instance.
        
        @param[in] data (array_like): List of points where each point is represented by list of coordinates.
        @param[in] amount_centers (uint): Amount of centers that should be initialized.
        @param[in] amount_candidates (uint): Amount of candidates that is considered as a center, if the farthest points (with the highest probability) should
                    be considered as centers then special constant should be used 'FARTHEST_CENTER_CANDIDATE'.

        @see FARTHEST_CENTER_CANDIDATE

        """
        
        self.__data = numpy.array(data);
        self.__amount = amount_centers;
        self.__candidates = amount_candidates;

        self.__check_parameters();


    def __check_parameters(self):
        """!
        @brief Checks input parameters of the algorithm and if something wrong then corresponding exception is thrown.

        """
        if (self.__amount <= 0) or (self.__amount > len(self.__data)):
            raise AttributeError("Amount of cluster centers should be at least 1 and should be less or equal to amount of points in data.");

        if self.__candidates != kmeans_plusplus_initializer.FARTHEST_CENTER_CANDIDATE:
            if (self.__candidates <= 0) or (self.__candidates > len(self.__data)):
                raise AttributeError("Amount of candidates centers should be at least 1 and should be less or equal to amount of points in data.");

        if len(self.__data) == 0:
            raise AttributeError("Data is empty.")


    def __calculate_shortest_distances(self, data, centers):
        """!
        @brief Calculates distance from each data point to nearest center.
        
        @param[in] data (numpy.array): Array of points for that initialization is performed.
        @param[in] centers (numpy.array): Array of points that represents centers.
        
        @return (numpy.array) List of distances to closest center for each data point.
        
        """

        dataset_differences = numpy.zeros((len(centers), len(data)));
        for index_center in range(len(centers)):
            dataset_differences[index_center] = numpy.sum(
                numpy.square(data - centers[index_center]), axis=1).T;

        shortest_distances = numpy.min(dataset_differences, axis=0);
        return shortest_distances;


    def __get_next_center(self, centers):
        """!
        @brief Calculates the next center for the data.

        @param[in] centers (array_like): Current initialized centers.

        @return (array_like) Next initialized center.

        """

        distances = self.__calculate_shortest_distances(data=self.__data, centers=centers);

        if self.__candidates == kmeans_plusplus_initializer.FARTHEST_CENTER_CANDIDATE:
            center_index = numpy.argmax(distances);
        else:
            probabilities = self.__calculate_probabilities(distances);
            center_index = self.__get_probable_center(distances, probabilities);

        return self.__data[center_index];


    def __calculate_probabilities(self, distances):
        """!
        @brief Calculates cumulative probabilities of being center of each point.

        @param[in] distances (array_like): Distances from each point to closest center.

        @return (array_like) Cumulative probabilities of being center of each point.

        """

        total_distance = numpy.sum(distances);
        if total_distance != 0.0:
            probabilities = distances / total_distance;
            return numpy.cumsum(probabilities);
        else:
            return numpy.zeros(len(distances));


    def __get_probable_center(self, distances, probabilities):
        """!
        @brief Calculates the next probable center considering amount candidates.

        @param[in] distances (array_like): Distances from each point to closest center.
        @param[in] probabilities (array_like): Cumulative probabilities of being center of each point.

        @return (uint) Index point that is next initialized center.

        """

        index_best_candidate = -1;
        for _ in range(self.__candidates):
            candidate_probability = random.random();
            index_candidate = 0;


            for index_object in range(len(probabilities)):
                if candidate_probability < probabilities[index_object]:
                    index_candidate = index_object;
                    break;

            if index_best_candidate == -1:
                index_best_candidate = index_object;

            elif distances[index_best_candidate] < distances[index_object]:

                index_best_candidate = index_object;


        return index_best_candidate;


    def initialize(self):
        """!
        @brief Calculates initial centers using K-Means++ method.
        
        @return (list) List of initialized initial centers.
        
        """

        index_center = random.randint(0, len(self.__data) - 1);
        centers = [ self.__data[ index_center ] ];

        # For each next center
        for _ in range(1, self.__amount):
            next_center = self.__get_next_center(centers);
            centers.append(next_center);

        return centers;

Push — master ( c8bc69...89bc14 )

kmeans_plusplus_initializer.initialize() A

Complexity

Size

Duplication

Importance

1. Missing Dependencies

2. Missing init.py files

1			"""!
2
3			@brief Collection of center initializers for algorithm that uses initial centers, for example, for K-Means or X-Means.
4			@details Implementations based on articles:
5			- K-Means++: The Advantages of careful seeding. D. Arthur, S. Vassilvitskii. 2007.
6
7			@authors Andrei Novikov, Aleksey Kukushkin ([email protected])
8			@date 2014-2018
9			@copyright GNU Public License
10
11			@see kmeans
12			@see xmeans
13
14			@cond GNU_PUBLIC_LICENSE
15			PyClustering is free software: you can redistribute it and/or modify
16			it under the terms of the GNU General Public License as published by
17			the Free Software Foundation, either version 3 of the License, or
18			(at your option) any later version.
19
20			PyClustering is distributed in the hope that it will be useful,
21			but WITHOUT ANY WARRANTY; without even the implied warranty of
22			MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
23			GNU General Public License for more details.
24
25			You should have received a copy of the GNU General Public License
26			along with this program. If not, see <http://www.gnu.org/licenses/>.
27			@endcond
28
29			"""
30
31
32			import numpy;
			0 ignored issues – show Configuration introduced 2018-02-21 18:56 UTC by Report Bug Copy Issue Report The import `numpy` could not be resolved. This can be caused by one of the following: 1. Missing Dependencies This error could indicate a configuration issue of Pylint. Make sure that your libraries are available by adding the necessary commands. # .scrutinizer.yml before_commands: - sudo pip install abc # Python2 - sudo pip3 install abc # Python3 Tip: We are currently not using virtualenv to run pylint, when installing your modules make sure to use the command for the correct version. 2. Missing __init__.py files This error could also result from missing `__init__.py` files in your module folders. Make sure that you place one file in each sub-folder. Loading history...
33			import random;
34
35			from pyclustering.utils import euclidean_distance;
			0 ignored issues – show Unused Code introduced 2017-10-27 12:21 UTC by Report Bug Copy Issue Report Unused euclidean_distance imported from pyclustering.utils Loading history...
36
37
38			class random_center_initializer:
39			"""!
40			@brief Random center initializer is for generation specified amount of random of centers for specified data.
41
42			"""
43
44			def __init__(self, data, amount_centers):
45			"""!
46			@brief Creates instance of random center initializer.
47
48			@param[in] data (list): List of points where each point is represented by list of coordinates.
49			@param[in] amount_centers (unit): Amount of centers that should be initialized.
50
51			"""
52
53			self.__data = data;
54			self.__amount = amount_centers;
55
56			if self.__amount <= 0:
57			raise AttributeError("Amount of cluster centers should be at least 1.");
58
59
60			def initialize(self):
61			"""!
62			@brief Generates random centers in line with input parameters.
63
64			@return (list) List of centers where each center is represented by list of coordinates.
65
66			"""
67			return [ self.__create_center() for _ in range(self.__amount) ];
68
69
70			def __create_center(self):
71			"""!
72			@brief Generates and returns random center.
73
74			"""
75			return [ random.random() for _ in range(len(self.__data[0])) ];
76
77
78
79			class kmeans_plusplus_initializer:
80			"""!
			0 ignored issues – show Bug introduced 2018-02-22 12:38 UTC by Report Bug Copy Issue Report A suspicious escape sequence `\s` was found. Did you maybe forget to add an `r` prefix? Escape sequences in Python are generally interpreted according to rules similar to standard C. Only if strings are prefixed with `r` or `R` are they interpreted as regular expressions. The escape sequence that was used indicates that you might have intended to write a regular expression. Learn more about the available escape sequences. in the Python documentation. Loading history...
81			@brief K-Means++ is an algorithm for choosing the initial centers for algorithms like K-Means or X-Means.
82			@details K-Means++ algorithm guarantees an approximation ratio O(log k). Clustering results are depends on
83			initial centers in case of K-Means algorithm and even in case of X-Means. This method is used to find
84			out optimal initial centers.
85
86			Algorithm can be divided into three steps. The first center is chosen from input data randomly with
87			uniform distribution at the first step. At the second, probability to being center is calculated for each point:
88			\f[p_{i}=\frac{D(x_{i})}{\sum_{j=0}^{N}D(x_{j})}\f]
89			where \f$D(x_{i})\f$ is a distance from point \f$i\f$ to the closest center. Using this probabilities next center
90			is chosen. The last step is repeated until required amount of centers is initialized.
91
92			Pyclustering implementation of the algorithm provides feature to consider several candidates on the second
93			step, for example:
94
95			@code
96			amount_centers = 4;
97			amount_candidates = 3;
98			initializer = kmeans_plusplus_initializer(sample, amount_centers, amount_candidates);
99			@endcode
100
101			If the farthest points should be used as centers then special constant 'FARTHEST_CENTER_CANDIDATE' should be used
102			for that purpose, for example:
103			@code
104			amount_centers = 4;
105			amount_candidates = kmeans_plusplus_initializer.FARTHEST_CENTER_CANDIDATE;
106			initializer = kmeans_plusplus_initializer(sample, amount_centers, amount_candidates);
107			@endcode
108
109			There is an example of initial centers that were calculated by the K-Means++ method:
110
111			@image html kmeans_plusplus_initializer_results.png
112
113			Code example where initial centers are prepared for K-Means algorithm:
114			@code
115			# Read data 'SampleSimple3' from Simple Sample collection.
116			sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3);
117
118			# Calculate initial centers using K-Means++ method.
119			centers = kmeans_plusplus_initializer(sample, 4).initialize();
120
121			# Display initial centers.
122			visualizer = cluster_visualizer();
123			visualizer.append_cluster(sample);
124			visualizer.append_cluster(centers, marker = '*', markersize = 10);
125			visualizer.show();
126
127			# Perform cluster analysis using K-Means algorithm with initial centers.
128			kmeans_instance = kmeans(sample, centers);
129
130			# Run clustering process and obtain result.
131			kmeans_instance.process();
132			clusters = kmeans_instance.get_clusters();
133			@endcode
134
135			"""
136
137
138			## Constant denotes that only points with highest probabilities should be considered as centers.
139			FARTHEST_CENTER_CANDIDATE = "farthest";
140
141
142			def __init__(self, data, amount_centers, amount_candidates = 1):
143			"""!
144			@brief Creates K-Means++ center initializer instance.
145
146			@param[in] data (array_like): List of points where each point is represented by list of coordinates.
147			@param[in] amount_centers (uint): Amount of centers that should be initialized.
148			@param[in] amount_candidates (uint): Amount of candidates that is considered as a center, if the farthest points (with the highest probability) should
149			be considered as centers then special constant should be used 'FARTHEST_CENTER_CANDIDATE'.
150
151			@see FARTHEST_CENTER_CANDIDATE
152
153			"""
154
155			self.__data = numpy.array(data);
156			self.__amount = amount_centers;
157			self.__candidates = amount_candidates;
158
159			self.__check_parameters();
160
161
162			def __check_parameters(self):
163			"""!
164			@brief Checks input parameters of the algorithm and if something wrong then corresponding exception is thrown.
165
166			"""
167			if (self.__amount <= 0) or (self.__amount > len(self.__data)):
168			raise AttributeError("Amount of cluster centers should be at least 1 and should be less or equal to amount of points in data.");
169
170			if self.__candidates != kmeans_plusplus_initializer.FARTHEST_CENTER_CANDIDATE:
171			if (self.__candidates <= 0) or (self.__candidates > len(self.__data)):
172			raise AttributeError("Amount of candidates centers should be at least 1 and should be less or equal to amount of points in data.");
173
174			if len(self.__data) == 0:
175			raise AttributeError("Data is empty.")
176
177
178			def __calculate_shortest_distances(self, data, centers):
179			"""!
180			@brief Calculates distance from each data point to nearest center.
181
182			@param[in] data (numpy.array): Array of points for that initialization is performed.
183			@param[in] centers (numpy.array): Array of points that represents centers.
184
185			@return (numpy.array) List of distances to closest center for each data point.
186
187			"""
188
189			dataset_differences = numpy.zeros((len(centers), len(data)));
190			for index_center in range(len(centers)):
191			dataset_differences[index_center] = numpy.sum(
192			numpy.square(data - centers[index_center]), axis=1).T;
193
194			shortest_distances = numpy.min(dataset_differences, axis=0);
195			return shortest_distances;
196
197
198			def __get_next_center(self, centers):
199			"""!
200			@brief Calculates the next center for the data.
201
202			@param[in] centers (array_like): Current initialized centers.
203
204			@return (array_like) Next initialized center.
205
206			"""
207
208			distances = self.__calculate_shortest_distances(data=self.__data, centers=centers);
209
210			if self.__candidates == kmeans_plusplus_initializer.FARTHEST_CENTER_CANDIDATE:
211			center_index = numpy.argmax(distances);
212			else:
213			probabilities = self.__calculate_probabilities(distances);
214			center_index = self.__get_probable_center(distances, probabilities);
215
216			return self.__data[center_index];
217
218
219			def __calculate_probabilities(self, distances):
220			"""!
221			@brief Calculates cumulative probabilities of being center of each point.
222
223			@param[in] distances (array_like): Distances from each point to closest center.
224
225			@return (array_like) Cumulative probabilities of being center of each point.
226
227			"""
228
229			total_distance = numpy.sum(distances);
230			if total_distance != 0.0:
231			probabilities = distances / total_distance;
232			return numpy.cumsum(probabilities);
233			else:
234			return numpy.zeros(len(distances));
235
236
237			def __get_probable_center(self, distances, probabilities):
238			"""!
239			@brief Calculates the next probable center considering amount candidates.
240
241			@param[in] distances (array_like): Distances from each point to closest center.
242			@param[in] probabilities (array_like): Cumulative probabilities of being center of each point.
243
244			@return (uint) Index point that is next initialized center.
245
246			"""
247
248			index_best_candidate = -1;
249			for _ in range(self.__candidates):
250			candidate_probability = random.random();
251			index_candidate = 0;
			0 ignored issues – show Unused Code introduced 2018-02-21 19:29 UTC by Report Bug Copy Issue Report The variable `index_candidate` seems to be unused. Loading history...
252
253			for index_object in range(len(probabilities)):
254			if candidate_probability < probabilities[index_object]:
255			index_candidate = index_object;
256			break;
257
258			if index_best_candidate == -1:
259			index_best_candidate = index_object;
			0 ignored issues – show Bug introduced 2018-02-21 19:29 UTC by Report Bug Copy Issue Report The loop variable `index_object` might not be defined here. Loading history...
260			elif distances[index_best_candidate] < distances[index_object]:
			0 ignored issues – show Bug introduced 2018-02-21 19:29 UTC by Report Bug Copy Issue Report The loop variable `index_object` might not be defined here. Loading history...
261			index_best_candidate = index_object;
			0 ignored issues – show Bug introduced 2018-02-21 19:29 UTC by Report Bug Copy Issue Report The loop variable `index_object` might not be defined here. Loading history...
262
263			return index_best_candidate;
264
265
266			def initialize(self):
267			"""!
268			@brief Calculates initial centers using K-Means++ method.
269
270			@return (list) List of initialized initial centers.
271
272			"""
273
274			index_center = random.randint(0, len(self.__data) - 1);
275			centers = [ self.__data[ index_center ] ];
276
277			# For each next center
278			for _ in range(1, self.__amount):
279			next_center = self.__get_next_center(centers);
280			centers.append(next_center);
281
282			return centers;

annoviko / pyclustering

Push — master ( c8bc69...89bc14 )

kmeans_plusplus_initializer.initialize() A

Complexity

Size

Duplication

Importance

1. Missing Dependencies

2. Missing __init__.py files

Duplication Side-by-Side

Filter issues like

2. Missing init.py files