Canonical PSO Based K-Means Clustering Approach for Real Datasets

“Clustering” the significance and application of this technique is spread over various fields. Clustering is an unsupervised process in data mining, that is why the proper evaluation of the results and measuring the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measure. Different types of indexes are used to solve different types of problems and indices selection depends on the kind of available data. This paper first proposes Canonical PSO based K-means clustering algorithm and also analyses some important clustering indices (intercluster, intracluster) and then evaluates the effects of those indices on real-time air pollution database, wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and Hierarchical clustering algorithms. This paper also describes the nature of the clusters and finally compares the performances of these clustering algorithms according to the validity assessment. It also defines which algorithm will be more desirable among all these algorithms to make proper compact clusters on this particular real life datasets. It actually deals with the behaviour of these clustering algorithms with respect to validation indexes and represents their results of evaluation in terms of mathematical and graphical forms.


Introduction
One of the best known problems in the data mining is the clustering. Clustering is the task of categorising objects having several attributes into different classes such that the objects belonging to the same class are similar, and those that are broken down into different classes are not [1]. There are several clustering algorithms that have been proposed till now. Due to no prior information in clustering, the suitable evaluation of the results is necessary. Evaluation means measuring the similarity between clusters, measuring the compactness, and separation between clusters [2]. Evaluation measurement is also proposed as a key feature in internal and external cluster validation indexes [3]. Such a measure can be used to compare the performance of different data clustering algorithms on different real life datasets. These measures are usually tied to the type of criterion being considered in assessing the quality of a clustering method. Three different techniques are available to evaluate the clustering results: external, internal, and relative [4]. Both internal and external criteria are based on statistical methods and they have high computation demand. The external validity methods evaluate the clustering based on some user specific intuitions [4]. The objective of this paper is the comparison of the different clustering schemas that have been already proposed [5] with Canonical PSO based K-means clustering algorithm.
The rest of the paper is organized as follows. The Canonical PSO based K-means algorithm is proposed in Section 2 with some other existing clustering algorithms. Some popular and widely used validity indices are introduced in Section 3. Section 4 demonstrates the clustering compactness measurements on a toy example dataset using K-means and DBSCAN clustering algorithms. Section 5 demonstrates the clustering compactness measurements with experimental results and comparison of the indices is outlined in this section, and Section 7 gives a brief conclusion of this paper. Interested 2 International Scholarly Research Notices readers may found some significant references at the end section of this paper. This paper mainly focuses on the implementation of Canonical PSO based K-means algorithm. Some cluster validity indices (e.g., Dunn index, DB index, Silhouette index, Rand index, Mirkin index, etc.) and ANOVA test are analysed. All these indices are individually experimented on the air pollution dataset, customer, wine, and vehicle dataset using typical K-means, DBSCAN, Hierarchical, simple PSO based K-means, and Canonical PSO based K-means algorithms. The overall motivation of this paper is given in Figure 2.

Clustering Algorithms
This section introduces a new modified Canonical PSO based K-means clustering algorithm and also describes all these well-known algorithms briefly.

Simple PSO Based K-Means
Clustering. James Kennedy and Russell C. Eberhert originally proposed the Particle Swarm Optimization (PSO) algorithm for optimization. It is a population based, robust, and stochastic optimization technique mainly designed for the balancing weights in neural networks [6]. In data clustering, it is possible to view the clustering problem as an optimization problem that locates the optimal centroids of the clusters rather than to find an optimal partition. This view offers us a probability to apply PSO optimal algorithm on the clustering solution. The use of basic PSO technique in document clustering analysis is proposed in many papers [7,8].
In the PSO based K-means algorithm, the capability of globalized searching of the PSO algorithm and the fast convergence of the K-means algorithm are combined [9]. The algorithm results in better accuracy than existing algorithms. In the original PSO, at any instance each particle has a position and a velocity. At the beginning, population of particles is initialized with random positions denoted by vectors and random velocities V . The population of such particles is called a swarm, . Each particle is searching for the optimum. Each particle remembers the position it was in where it had its best result so far (its personal best) [7]. The particles in the swarm cooperate. They exchange information about what they have discovered in the places they have visited.
In each time step, a particle has to move to a new position. It does this by adjusting its velocity. Velocity is updated based on information obtained in previous steps of the algorithm.
Here, V ( + 1) is the new velocity at time step ( + 1), V ( ) is the old velocity at time step , ( ) is the best position of each particle, ( ) is the best position of swarm, ( + 1) is current position of each particle, and ( ) is old position of each particle.
1 and 2 are random variables uniformly distributed within [0, 1] and 1 2 are weighting factors, also called the cognitive and social parameter, respectively. In the first version of PSO, a single weight, = 1 = 2, called acceleration constant, was used instead of the two distinct weights in (1).

Canonical PSO Based K-Means
Clustering. Oscan and Mohan (1999) focused on the early PSO model of (1) and (2), and they showed that particles were actually moving on sinusoidal waves per coordinate of the search space, while causticity was offering a means to control its frequency and amplitude. They had modified the PSO model and the model is defined by the following equation: where = 1, 2, . . . , particles (vector data) and = 1, 2, . . . , .
Where chi( ) is a parameter called constriction coefficient or constriction factor, responsible for keeping the particle moving in the same direction, it was originally heading, while the rest of the parameters remain the same as for the previously described PSO models [9]. (1) Initialize each particle randomly by taking different data samples from the dataset as the initial cluster centers.
(2) Initialize velocity and personal best position of each particle.
(a) Calculate the performance of each particle based on the -means fitness function where denotes the th data point, V denotes the center of the th cluster , and (V , ) denotes the distance (e.g., Euclidean distance) of from V . (4) Repeat Step (3) until one of the following termination conditions is satisfied.
(a) The maximum number of iterations is exceeded.
The average change of centroid vector is less than the predefined value.

K-Means
Algorithm. K-means clustering is a method of cluster analysis which aims to partition observations into K clusters depending on some similarity/dissimilarity metric where the value of K may or may not be known a priori [7]. The objective of K-means clustering algorithm is usually to create one set of clusters that partitions the data into similar groups. As a result, maximum similarity samples are placed in same cluster and low similarity samples are placed in different clusters [10].

Hierarchical Algorithm.
In hierarchical clustering clusters are generated by grouping data with similar pattern of expression across a range of samples located near each other.
Hierarchical clustering calculates all pairs-wise distance associations between samples and experiments to merge pairs of values that are mainly similar [9].
2.6. DBSCAN Algorithm. This is a density-based clustering algorithm that produces a partitional clustering, in which the number of clusters is automatically determined by the algorithm. DBSCAN requires two parameters: (eps) and the minimum number of points required to form a dense region (minpts) [11]. It starts with an arbitrary starting point that has not been visited. This point's -neighbourhood is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labelled as noise. If a point is found to be a dense part of a cluster, its -neighbourhood is also part of that cluster [12]. Hence, all points that are found within theneighbourhood are added as is their own -neighbourhood when they are also dense. This process continues until the density-connected cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise.

Clustering Validity Indices
In this paper the behaviour of several clustering validity indices has been examined ( Figure 1). Among these indices

Silhouette
Dunn index

DB index Rand index
Compare the quality of the results and find optimal clustering algorithm for the given dataset  some of them are clearly illustrated in this section. These indices are used for measuring "goodness" of a clustering result comparing to other ones which were created by other clustering algorithms or by the same algorithms but using different parameter values [13].

Silhouette Index.
Silhouette index is used for cluster analysis formed by different clustering algorithms (Tables 6,  7, and 8). The silhouette validation technique calculates the silhouette width for each sample, average silhouette width for each cluster, and overall average silhouette width for a total dataset in this paper [14]. The average silhouette width could be applied for evaluation of clustering validity and also could be used to decide how good the number of selected clusters is.

International Scholarly Research Notices
To construct the silhouettes ( ) the following formula is used: where ( ) is the average dissimilarity of th object to all other objects in the same cluster and ( ) is the minimum of average dissimilarity of -object to all objects in other clusters [9]. If silhouette value is close to 1, it means that sample is "well-clustered" and it was assigned to a very appropriate cluster. If silhouette value is about zero, it means that sample could be assigned to another closest cluster as well, and the sample lies equally far away from both clusters. If silhouette value is close to -1, it means that sample is "misclassified" and is merely somewhere in between the clusters. The overall average silhouette width for the entire plot is simply the average of the ( ) for all objects in the whole dataset.

Davies-Bouldin Index (DB).
The Davies-Bouldin index (DB) can be calculated by the following formula: where is the number of clusters, is the centroid of cluster, is the average distance of all elements in cluster to centroids, and ( , ) is the distance between centroids [15]. Since algorithms that produce clusters with low intracluster distances (high intracluster similarity) and high intercluster distances (low intercluster similarity) will have a low Davies-Bouldin index, the clustering algorithm that produces a collection of clusters with the smallest Davies-Bouldin index is considered the best algorithm based on this criterion [9].

Dunn Index.
The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio between the minimal intercluster distances to maximal intracluster distance. For each cluster partition, the Dunn index can be calculated by the following formula: where ( , ) represents the distance between clusters and and / ( ) measures the intracluster distance of cluster . The intercluster distance ( , ) between two clusters may be any number of distance measures, such as the distance between the centroids of the clusters [15]. Similarly, the intracluster distance / ( ) may be measured in a variety of ways, such as the maximal distance between any pair of elements in cluster . Since internal criterion seeks clusters with high intracluster similarity and low intercluster similarity, algorithms that produce clusters with high Dunn index are more desirable [16].

Rand
Index. The Rand index computes how similar the clusters (returned by the clustering algorithm) are to the benchmark classifications. One can also view the Rand index as a measure of the percentage of correct decisions made by the algorithm [16].
Given a set of elements = { 1 , . . . , } and two partitions of to compare, = { 1 , . . . , }, a partition of into subsets, and = { 1 , . . . , }, a partition of into subsets, then the Rand index, , is where is the number of pairs of elements in that are in the same set in and in the same set in , is the number of pairs of elements in that are in different sets in and in different sets in , is the number of pairs of elements in that are in the same set in and in different sets in , and is the number of pairs of elements in that are in different sets in and in the same set in .

Mirkin Index.
Mirkin index is also known as equivalence mismatch distance. It is defined by where = 1, . . . , , = 1, . . . , . Let be a finite set with cardinality | | − . A clustering is a set{ 1 , . . . , } of nonempty disjoint subsets of such that their union equals . Clustering is a refinement of cluster of , formally: 3.6. ANOVA Test. "Analysis of variance" test is used to compare three or more groups or conditions in an experiment. A one-way ANOVA can help to find out if the means for each group/condition are significantly different from one another or if they are relatively the same. The null hypothesis typically corresponds to a general or default position. Oneway ANOVA is a simple special case of the linear model. The one-way ANOVA form of the model is where is a matrix of observations in which each column represents a different group. . is a matrix whose columns are the group means (the "dot " notation means that applies to all rows of the th column. That is, the value is the same for all ). C is a matrix of random disturbances.
The standard ANOVA table has six columns: (i) the source of the variability,  (v) the -statistic, which is the ratio of the mean squares, (vi) the probability value, which is derived from the cdf of .
The ANOVA test makes the following assumptions about the data: (i) all sample populations are normally distributed; (ii) all sample populations have equal variance; (iii) all observations are mutually independent.

K-Means Clustering
Step 1. In the first step typical K-means algorithm is applied on this dataset. Suppose the total number of clusters is 3 and also assume their centroids are Clust1 = 10, Clust2 = 22, and Clust3 = 1. Now use Manhattan distance measure | − | to calculate the following.
First iteration: see Table 1 and Figure 3. Now, the new centroids are Clust1 = 10.5, Clust2 = 27.3, and Clust3 = 3.25.  Second iteration: see Table 2. →No change occurs. So those three clusters are final clusters and now based on those clusters two popular indexes Davis-Bouldin (DB) index and Dunn index which are calculated below.

Davies-Bouldin Index. Consider
where min is the minimum distance between two points belonging to different clusters and max is the maximum distance between any two points selected from the same cluster [13]. According to this form, the calculated values are as shown in Table 3.
So DBSCAN Clustering. Now, on the same dataset DBSCAN clustering algorithm is applied with assuming the value of minimum number of points (minpts) = 3 and the radius of a cluster (eps) = 1 cm. Initially there are three clusters with the same means and they are shown in Figure 4. According to the DBSCAN clustering criteria, we have the following.
Here two data objects are treated like outliers because both objects 15 and 31 are ≥eps (1 cm) and due to the minimum number points condition does not satisfied by the Clust3 that is why the objects of the clust3 also acts like outliers. But the objects of Clust3 and the objects 15 and 31 all together form a cluster ( new ) which satisfies both criteria of DBSCAN clustering. It is also called the cluster of noise.  Evaluation. After checking the values of the indexes it can be said that for that particular set of data objects the typical K-means clustering gives better compactness and better separability than DBSCAN clustering (due to the fact that less value of DB index and high value of Dunn index give more desirable results). But in most of the cases DBSCAN gives bad results compared to K-means clustering because it forms arbitrary shape of clusters.

Customer Dataset.
The wholesale customer dataset is taken from UCI Machine Learning Repository. The dataset contains eight attributes like fresh products, milk, grocery, frozen, detergents, delicatessen, channel, and region.

Wine Dataset.
The wholesale wine dataset is taken from UCI Machine Learning Repository. The dataset contains thirteen attributes like alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline.

Vehicle Dataset.
The wine dataset is taken from UCI Machine Learning Repository. The dataset contains eighteen     attributes like compactness, circularity, distance circularity area, radius ratio, Pr.Axis aspect ratio, max. length aspect ratio, scatter ratio, elongatedness, Pr.Axis rectangularity, max. length rectangularity area, scaled variance along major axis, scaled variance along minor axis, scaled radius of gyration, skewness about major axis, skewness about minor axis, kurtosis about minor axis, kurtosis about major axis, and hollows ratio. Here 4 classes are considered: Opel, Saab, Bus, and Van. The 4 real datasets and their main characteristics are shown in " Table 4. " The experiment is based on 150 iterations and five clustering algorithms.

Experimental Analysis.
At first analysis, the accuracy of different clustering method is measured using silhouette index depicted in Tables 2, 3, 4, and 5. Probably 150 iterations are considered in each case. One-way ANOVA test is applied on the values of silhouette index at different iterations and null hypothesis has been measured and reported in Tables 9, 10, 11, and 12. The accuracy comparisons of all clustering algorithms on different datasets are explained in Table 13. The results of ANOVA test is as shown in Tables 9, 10, 11, and 12 and Figures 5, 6, 7, and 8.
The reason for doing an ANOVA test is to see if there is any difference between groups on some variables. If probability > , we reject null hypothesis. Here in all cases, Prob > . Therefore, we reject the null hypothesis. The means of the five algorithms are not equal. At least one of the means is different. However, the ANOVA test does not tell where the difference lies. It requires a " -test" to test each pair of means.

Discussion
So from Table 13 it can be concluded that Canonical PSO based -means algorithm provides most desirable results compared to other clustering algorithms stated in Section 2 and DBSCAN clustering provides the worst results than the others (based on the initial values of eps = 25 and minpts = 65). In this part, the results of these algorithms are evaluated by applying other well-known indexes (such as, Dunn, Davies-Bouldin, Rand, and Mirkin indices). From " Table 14" the results of this experiment can be summerised as follows.
(1) According to the Dunn index, algorithm that produces clusters with high Dunn index is more desirable. So here the Canonical PSO based -means algorithm produces better results compared to the others (0.7132 > 0.6542 > 0.6152 > 0.5922 > 0.3984).
(2) According to the Davis-Bouldin (DB) index, algorithm that produces a collection of clusters with the smallest Davies-Bouldin index is considered the best algorithm. So here again the Canonical PSO based -means algorithm produces better results compared to the others (0.2074 < 0.3283 < 0.3344 < 0.5484 < 0.5816).
(3) According to the Rand index, algorithm that produces clusters with high Rand index is more suitable and compact. So here the Canonical PSO based -means algorithm produces more good results than the others (0.7825 > 0.6812 > 0.4687 > 0.3702 > 0.3022).    All these results and analysis are represented in a pictorial form as in Figure 9.
The same approach and results are also valid for the other 3 datasets (such as customer, wine, and vehicle). The detailed analysis of results is tabulated in Tables 15, 16, and 17 and their corresponding pictorial forms are represented in Figures 10,  11, and 12.

Conclusion and Future Work
Although there are several proposals for validity indices in the literature, most of them succeed only in certain situations and are based on some certain conditions. This paper introduces the new concept like Canonical PSO based -means algorithm and also presents ample comparisons of six popular clustering validity approaches. All these analyses are done on a few real time datasets which contain air pollutant particles, wholesale customer, wine, and vehicle data. This proposed algorithm provides better impact to produce desirable clustering results compared to other existing algorithms described in this paper (provided some certain assumptions). Several extensions to the present work are in progress. Finally, studies of other clustering algorithms and others indices and their effects to measure proper, compact clusters are also underway. In future work, we can amalgamate DBSCAN and Hierarchical algorithms with PSO approach to seek some better ways of producing improved clustering output.