Completed
Push — master ( f02a27...82a0b0 )
by Andrei
01:51
created

kmedoids   A

Complexity

Total Complexity 18

Size/Duplication

Total Lines 131
Duplicated Lines 20.61 %
Metric Value
dl 27
loc 131
rs 10
wmc 18

6 Methods

Rating   Name   Duplication   Size   Complexity  
A __init__() 0 13 2
A __update_medoids() 2 15 3
A get_medoids() 0 10 1
A get_clusters() 0 10 1
D __update_clusters() 23 27 8
A process() 0 23 3

How to fix   Duplicated Code   

Duplicated Code

Duplicate code is one of the most pungent code smells. A rule that is often used is to re-structure code once it is duplicated in three or more places.

Common duplication problems, and corresponding solutions are:

1
"""!
2
3
@brief Cluster analysis algorithm: K-Medoids (PAM - Partitioning Around Medoids).
4
@details Based on book description:
5
         - A.K. Jain, R.C Dubes, Algorithms for Clustering Data. 1988.
6
         - L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: an Introduction to Cluster Analysis. 1990.
7
8
@authors Andrei Novikov ([email protected])
9
@date 2014-2016
10
@copyright GNU Public License
11
12
@cond GNU_PUBLIC_LICENSE
13
    PyClustering is free software: you can redistribute it and/or modify
14
    it under the terms of the GNU General Public License as published by
15
    the Free Software Foundation, either version 3 of the License, or
16
    (at your option) any later version.
17
    
18
    PyClustering is distributed in the hope that it will be useful,
19
    but WITHOUT ANY WARRANTY; without even the implied warranty of
20
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
21
    GNU General Public License for more details.
22
    
23
    You should have received a copy of the GNU General Public License
24
    along with this program.  If not, see <http://www.gnu.org/licenses/>.
25
@endcond
26
27
"""
28
29
30
from pyclustering.utils import euclidean_distance_sqrt, median;
31
32
class kmedoids:
33
    """!
34
    @brief Class represents clustering algorithm K-Medoids (another one title is PAM - Parti).
35
    @details The algorithm is less sensitive to outliers tham K-Means. The principle difference between K-Medoids and K-Medians is that
36
             K-Medoids uses existed points from input data space as medoids, but median in K-Medians can be unreal object (not from
37
             input data space).
38
    
39
    Example:
40
    @code
41
        # load list of points for cluster analysis
42
        sample = read_sample(path);
43
        
44
        # create instance of K-Medoids algorithm
45
        kmedians_instance = kmedians(sample, [1, 10]);
46
        
47
        # run cluster analysis and obtain results
48
        kmedians_instance.process();
49
        kmedians_instance.get_clusters();    
50
    @endcode
51
    
52
    """
53
    
54
    
55
    def __init__(self, data, initial_index_medoids, tolerance = 0.25):
56
        """!
57
        @brief Constructor of clustering algorithm K-Medoids.
58
        
59
        @param[in] data (list): Input data that is presented as list of points (objects), each point should be represented by list or tuple.
60
        @param[in] initial_index_medoids (list): Indexes of intial medoids (indexes of points in input data).
61
        @param[in] tolerance (double): Stop condition: if maximum value of distance change of medoids of clusters is less than tolerance than algorithm will stop processing
62
        
63
        """
64
        self.__pointer_data = data;
65
        self.__clusters = [];
66
        self.__medoids = [ self.__pointer_data[medoid_index] for medoid_index in initial_index_medoids ];
67
        self.__tolerance = tolerance;
68
69
70
    def process(self):
71
        """!
72
        @brief Performs cluster analysis in line with rules of K-Medoids algorithm.
73
        
74
        @remark Results of clustering can be obtained using corresponding get methods.
75
        
76
        @see get_clusters()
77
        @see get_medoids()
78
        
79
        """
80
        
81
        changes = float('inf');
82
         
83
        stop_condition = self.__tolerance * self.__tolerance;   # Fast solution
84
        #stop_condition = self.__tolerance;              # Slow solution
85
         
86
        while (changes > stop_condition):
87
            self.__clusters = self.__update_clusters();
88
            updated_medoids = self.__update_medoids();  # changes should be calculated before asignment
89
         
90
            changes = max([euclidean_distance_sqrt(self.__medoids[index], updated_medoids[index]) for index in range(len(updated_medoids))]);    # Fast solution
91
             
92
            self.__medoids = updated_medoids;
93
94
95
    def get_clusters(self):
96
        """!
97
        @brief Returns list of allocated clusters, each cluster contains indexes of objects in list of data.
98
        
99
        @see process()
100
        @see get_medoids()
101
        
102
        """
103
        
104
        return self.__clusters;
105
    
106
    
107
    def get_medoids(self):
108
        """!
109
        @brief Returns list of medoids of allocated clusters.
110
        
111
        @see process()
112
        @see get_clusters()
113
        
114
        """
115
116
        return self.__medoids;
117
118
119
    def __update_clusters(self):
120
        """!
121
        @brief Calculate distance to each point from the each cluster. 
122
        @details Nearest points are captured by according clusters and as a result clusters are updated.
123 View Code Duplication
        
0 ignored issues
show
Duplication introduced by
This code seems to be duplicated in your project.
Loading history...
124
        @return (list) updated clusters as list of clusters where each cluster contains indexes of objects from data.
125
        
126
        """
127
        
128
        clusters = [[] for i in range(len(self.__medoids))];
129
        for index_point in range(len(self.__pointer_data)):
130
            index_optim = -1;
131
            dist_optim = 0.0;
132
             
133
            for index in range(len(self.__medoids)):
134
                dist = euclidean_distance_sqrt(self.__pointer_data[index_point], self.__medoids[index]);
135
                 
136
                if ( (dist < dist_optim) or (index is 0)):
137
                    index_optim = index;
138
                    dist_optim = dist;
139
             
140
            clusters[index_optim].append(index_point);
141
        
142
        # If cluster is not able to capture object it should be removed
143
        clusters = [cluster for cluster in clusters if len(cluster) > 0];
144
        
145
        return clusters;
146
    
147
    
148
    def __update_medoids(self):
149
        """!
150
        @brief Find medoids of clusters in line with contained objects.
151
        
152
        @return (list) list of medoids for current number of clusters.
153
        
154
        """
155
         
156
        medoids = [[] for i in range(len(self.__clusters))];
157
        
158
        for index in range(len(self.__clusters)):
159
            medoid_index = median(self.__pointer_data, self.__clusters[index]);
160
            medoids[index] = self.__pointer_data[medoid_index];
161
             
162
        return medoids;