Random selection of initial centroids (centers) for clusters is a fundamental defect in
Clustering is a branch of unsupervised learning. This method is widely used as a first step to interpret the data. In this method, samples are divided into groups whose members are similar to each other [
Meanwhile, an important task in cluster analysis is evaluating the results of a clustering method or comparing them with another clustering result. Many validity measures have been proposed in the literature [
Accordingly, the organization of this paper is as follows. Ordinary
All hybrid methods introduced with
Description of the seven datasets used for comparisons among the methods¹.
Name of dataset | Sample size (+/-) | No. of variables (features) | No. of classes (labels) | No. of optimal clusters² |
---|---|---|---|---|
Leukemia | 64 (26/38) | 4 | 2 | 2 |
Prostate | 30 (15/15) | 3 | 2 | 2 |
Colon Cancer | 111 (56/55) | 4 | 2 | 2 |
Haberman | 306 | 3 | 2 | 2 |
Iris | 150 | 4 | 3 | 3 |
Wine | 178 | 13 | 3 | 3 |
Glass | 214 | 10 | 7 | 7 |
¹Gene Expression Omnibus (
The basic idea in
The goal of
In general, the algorithmic steps of this method are summarized as follows (Figure
1. Initial centroids are selected at random
2. The distance between each observation and each cluster centroid is calculated, and the observation is assigned to the cluster whose centroid is nearest
3. Cluster centroids are updated by averaging the observations contained in each cluster
4. The distance between each observation and the new cluster centroids is recalculated, and the observations are placed in new clusters based on the minimum distance to the centroids
5. Steps 3 and 4 are repeated until the cluster centroids no longer change and convergence occurs
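The steps above can be sketched as a short Lloyd-style loop. This is a minimal illustration in plain Python, not the authors' implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, the choice of drawing initial centroids from the observations, and the handling of empty clusters are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd-style loop: assign, update, repeat until centroids stop changing."""
    rng = random.Random(seed)
    # Step 1: pick k distinct observations as initial centroids (assumed rule).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Steps 2 and 4: assign every observation to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append([sum(c) / len(members) for c in zip(*members)])
            else:
                # Keep the old centroid if a cluster ends up empty (assumption).
                new_centroids.append(centroids[j])
        # Step 5: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
```

Because the result depends on the random draw in step 1, different seeds can give different partitions on less clearly separated data, which is the weakness the hybrid methods try to address.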
Minimum spanning trees (MSTs) have been applied in data mining, pattern recognition, and machine learning for a long time [
In graph theory, a dataset can be shown by a complete graph
A tree is an undirected connected graph that does not contain any cycle. A spanning tree is a subgraph of a complete weighted graph that has all the properties of a tree and also contains all vertices of the complete weighted graph. For a complete weighted graph, the minimum spanning tree has the least total weight among all spanning trees of that graph. In the present study, we followed the idea introduced by Yang et al. [
Accordingly, the MST-based
1. Number of points (
2. MST is generated using Prim’s algorithm
3. The set
4. Distances between any two skeleton points of
5. The skeleton point
6. The rest skeleton point
Step 6 is repeated until the number of initial cluster centroids is equal to
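The MST generation step named above can be illustrated with a small version of Prim's algorithm on the complete graph whose edge weights are Euclidean distances between observations. This sketch covers only that step; the selection of skeleton points is not reproduced here. The function name `prim_mst` and the choice of vertex 0 as the starting vertex are assumptions made for the example.

```python
import math


def prim_mst(points):
    """Return MST edges (i, j, weight) of the complete Euclidean graph."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known link from the tree to each vertex
    best_from = [-1] * n         # tree vertex that provides that link
    if n:
        best_dist[0] = 0.0       # start from vertex 0 (arbitrary choice)
    edges = []
    for _ in range(n):
        # Pick the cheapest vertex that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u, best_dist[u]))
        # Update the cheapest links from the new tree vertex to outside vertices.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges
```

For n observations the result always has n - 1 edges. This array-based variant runs in quadratic time, which suits a complete graph where every pair of vertices is connected.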
Figure
MST-based
The genetic algorithms (GAs) in clustering analysis are usually used to determine the number of clusters automatically and to find initial centroids for
The genetic algorithm is inspired by genetics and Darwin’s theory of evolution and is based on natural selection, that is, the survival of the fittest. Genetic algorithms are commonly used as optimizers: they create a population of candidate solutions, as nature creates a population of beings, and reach an optimal candidate or set of candidates by acting on this population. The hybrid method used in the present paper provided a hybrid version of the
- The input parameters are determined including
- A target function is calculated for each chromosome. Based on the target function, the fitness value is calculated
- Crossover, selection, and mutation operators are used to generate the next generation
- If the number of generations produced is less than the number of generations specified by the user, the procedure returns to stage 3; otherwise, it proceeds to stage 6
- The fitness is calculated for each chromosome of the last generation, the best fitness in this generation is compared with the best fitness obtained from previous generations, and the larger one is selected based on the estimator function
- Finally, the initial centroids obtained from the best chromosome are used according to stage 2 as the initial centroids in the
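The stages above can be illustrated with a compact genetic search for initial centroids. This is a rough sketch under several assumptions that the text does not confirm: the target function is taken to be the sum of squared errors (SSE), fitness is 1 / (1 + SSE), selection is fitness-proportional, crossover is single-point, and the mutation rate is 0.1. The names `sse` and `ga_initial_centroids` and all default parameter values are hypothetical.

```python
import math
import random


def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)


def ga_initial_centroids(points, k, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    # A chromosome is a list of k candidate centroids drawn from the data.
    population = [rng.sample(points, k) for _ in range(pop_size)]
    best = min(population, key=lambda ch: sse(points, ch))
    for _ in range(generations):
        # Fitness is higher when the SSE (the assumed target function) is lower.
        fitness = [1.0 / (1.0 + sse(points, ch)) for ch in population]
        next_gen = []
        while len(next_gen) < pop_size:
            # Selection: fitness-proportional choice of two parents.
            mum, dad = rng.choices(population, weights=fitness, k=2)
            # Crossover: first genes from one parent, the rest from the other.
            cut = rng.randint(1, k - 1) if k > 1 else 0
            child = list(mum[:cut]) + list(dad[cut:])
            # Mutation: occasionally replace one gene with a random observation.
            if rng.random() < 0.1:
                child[rng.randrange(k)] = rng.choice(points)
            next_gen.append(child)
        population = next_gen
        # Keep the best chromosome seen in any generation.
        challenger = min(population, key=lambda ch: sse(points, ch))
        if sse(points, challenger) < sse(points, best):
            best = challenger
    return best
```

The returned chromosome would then serve as the starting centroids for the ordinary centroid-update loop, replacing the random draw.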
GA-based
The hierarchical method is the second most important crisp clustering method in microarray technology. In this method, clusters are formed by calculating the similarity or distance between each pair of elements [
Steps of hierarchical-based
1. An agglomerative hierarchical clustering method is applied to data and the resulting tree is divided by
2. Centroid of each cluster (mean clusters) is calculated and set
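The two steps above can be sketched in plain Python. The linkage rule is not recoverable from the text, so average linkage is assumed here. Stopping the merges once the requested number of clusters remains stands in for cutting the resulting tree at that number. The function name `hierarchical_centroids` is hypothetical.

```python
import math


def hierarchical_centroids(points, k):
    """Merge clusters bottom-up until k remain, then return the cluster means."""
    clusters = [[p] for p in points]   # every observation starts on its own

    def linkage(a, b):
        # Average pairwise distance between two clusters (assumed linkage rule).
        return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

    while len(clusters) > k:
        # Find the closest pair of clusters and merge it.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # The mean of each cluster is the proposed initial centroid.
    return [[sum(c) / len(cl) for c in zip(*cl)] for cl in clusters]
```

Unlike a random draw, this initialization is deterministic, so repeated runs on the same data give the same starting centroids.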
Figure
Hierarchical-based
To evaluate the results of clustering algorithms, some cluster validation methods are used. These methods help to avoid mistaking random patterns in the data for real structure and also allow different clustering algorithms to be compared. A good validity measure should be invariant to changes in sample size, cluster size, and number of clusters [
In general, clustering evaluation indices are classified into three categories:
It should be noted that the optimal number of clusters in the present paper was determined by the majority rule and using three methods including the
To compare the performance of three hybrid methods and ordinary
To decrease the dimension of gene expression datasets and find the important genes (attributes), the result of the article by Ram et al. [
It should be noted that these datasets already contain class labels. Ignoring these labels, we obtained the optimal number of clusters (among 2-15 clusters) for each dataset based on the majority rule according to the mean value of
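The majority rule mentioned above can be illustrated as a simple vote: each validity method proposes a number of clusters, and the number proposed most often is kept. The sketch below is only an illustration; the function name `majority_rule`, the placeholder method names in the usage note, and the tie-breaking rule (the smaller number wins) are assumptions.

```python
from collections import Counter


def majority_rule(votes):
    """Return the cluster count proposed by the most validity methods.

    `votes` maps a method name to the number of clusters it prefers.
    Ties are broken in favour of the smaller number (an assumption).
    """
    counts = Counter(votes.values())
    top = max(counts.values())
    return min(k for k, c in counts.items() if c == top)
```

For example, `majority_rule({"method_a": 3, "method_b": 3, "method_c": 2})` returns 3, because two of the three hypothetical methods agree on three clusters.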
Then, the data analysis was organized in two phases:
To investigate whether
Comparison of four different ordinary clustering methods based on the silhouette and RPT indexes.
Index method | Hierarchical | DBSCAN | EM algorithm | |
---|---|---|---|---|
Leukemia dataset | ||||
Silhouette | 0.4663 | 0.2693 | 0.4419 | |
RPT | 0.8612 | 0.5087 | 0.8160 | |
Prostate dataset | ||||
Silhouette | 0.3265 | 0.3339 | 0.2756 | |
RPT | 0.6141 | 0.6319 | 0.5295 | |
Colon | ||||
Silhouette | 0.5189 | 0.3156 | 0.5176 | |
RPT | 0.9516 | 0.5747 | 0.9478 | |
Haberman | ||||
Silhouette | 0.2477 | 0.6266 | 0.1384 | |
RPT | 0.4787 | 1.15 | 0.2704 | |
Iris | ||||
Silhouette | 0.4589 | 0.4796 | 0.3728 | |
RPT | 0.8446 | 0.8614 | 0.6812 | |
Wine | ||||
Silhouette | 0.2469 | 0.1575 | 0.1911 | |
RPT | 0.4788 | 0.3092 | 0.3742 | |
Glass | ||||
Silhouette | 0.3411 | 0.4281 | 0.2809 | |
RPT | 0.6369 | 0.7921 | 0.5148 |
RPT: robustness performance trade-off.
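The silhouette index reported in the table above has a standard definition that can be sketched in a few lines: for each observation, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the silhouette width is (b - a) / max(a, b). The sketch below assumes Euclidean distance and at least two clusters; the function name `mean_silhouette` is hypothetical, and the RPT index is not reproduced because its formula is not given in this section.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette width over all observations (needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes 0 by convention
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)
```

Values close to 1 indicate compact, well-separated clusters, while negative values suggest that many observations sit closer to another cluster than to their own.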
The hybrid
Comparison among the hybrid and ordinary
Indexes | SSE | Si | RPT | Dunn | RI | ARI | AC | F | HI | VI | |
---|---|---|---|---|---|---|---|---|---|---|---|
Leukemia dataset | |||||||||||
5 | 116.2 | 0.4702 | 0.880 | 0.1431 | 0.8809 | 0.7617 | 0.9375 | 0.8848 | 0.76197 | 0.6477 | |
2 | 116.2 | 0.4650 | 0.880 | 0.1431 | 0.8809 | 0.7617 | 0.9375 | 0.8848 | 0.76197 | 0.6477 | |
1 | 116.8 | 0.4675 | 0.8719 | 0.1679 | 0.9092 | 0.8183 | 0.9531 | 0.9115 | 0.8184 | 0.5357 | |
4 | 116.2 | 0.4702 | 0.8801 | 0.1431 | 0.8809 | 0.7617 | 0.9375 | 0.8848 | 0.76197 | 0.6477 | |
Prostate dataset | |||||||||||
6 | 60.3 | 0.2677 | 0.5149 | 0.0969 | 0.6298 | 0.2599 | 0.7667 | 0.6247 | 0.2602 | 1.51 | |
1 | 58.1 | 0.3944 | 0.6141 | 0.1549 | 0.5954 | 0.1980 | 0.7337 | 0.6364 | 0.2069 | 1.21 | |
2 | 62.1 | 0.3935 | 0.7498 | 0.2239 | 0.4919 | 0.0019 | 0.5667 | 0.5915 | 0.0022 | 1.51 | |
4 | 58.7 | 0.2796 | 0.5385 | 0.1498 | 0.7126 | 0.4247 | 0.8333 | 0.7031 | 0.4247 | 1.29 | |
Colon dataset | |||||||||||
4 | 161.47 | 0.5248 | 0.9650 | 0.1431 | 0.8650 | 0.73 | 0.9279 | 0.8638 | 0.7300 | 0.7411 | |
2 | 161.47 | 0.5248 | 0.9650 | 0.1431 | 0.8650 | 0.73 | 0.9279 | 0.8638 | 0.7300 | 0.7411 | |
3 | 161.47 | 0.5248 | 0.9650 | 0.1431 | 0.8650 | 0.73 | 0.9279 | 0.8638 | 0.7300 | 0.7411 | |
2 | 161.47 | 0.5248 | 0.9650 | 0.1431 | 0.8650 | 0.73 | 0.9279 | 0.8638 | 0.7300 | 0.7411 | |
Haberman dataset | |||||||||||
6 | 698.8 | 0.2477 | 0.4787 | 0.023 | 0.4991 | -0.002 | 0.5196 | 0.5483 | -0.0015 | 1.83 | |
4 | 684.4 | 0.2733 | 0.5256 | 0.035 | 0.5038 | 0.0083 | 0.5523 | 0.5523 | 0.0085 | 1.82 | |
4 | 702.8 | 0.3888 | 0.7427 | 0.073 | 0.6189 | 0.1284 | 0.7451 | 0.7270 | 0.7451 | 0.1405 | |
5 | 682.1 | 0.2751 | 0.5305 | 0.039 | 0.4997 | -0.001 | 0.5261 | 0.5488 | -0.003 | 1.83 | |
Iris dataset | |||||||||||
7 | 140 | 0.4589 | 0.8446 | 0.02637 | 0.8322 | 0.6201 | 0.8333 | 0.7452 | 0.6201 | 1.079 | |
3 | 141.1 | 0.4554 | 0.8359 | 0.07756 | 0.8431 | 0.6451 | 0.8533 | 0.7622 | 0.6452 | 1.072 | |
5 | 191.7 | 0.4787 | 0.8917 | 0.05309 | 0.7197 | 0.4290 | 0.5732 | 0.6505 | 0.4488 | 1.19 | |
3 | 140 | 0.4589 | 0.8446 | 0.02637 | 0.8322 | 0.6201 | 0.8333 | 0.7452 | 0.6201 | 1.079 | |
Wine dataset | |||||||||||
8 | 1589.1 | 0.2469 | 0.4788 | 0.1357 | 0.6915 | 0.3757 | 0.6067 | 0.6237 | 0.3927 | 1.42 | |
2 | 1270.2 | 0.2905 | 0.5481 | 0.2323 | 0.9543 | 0.8975 | 0.9663 | 0.9319 | 0.8976 | 0.39 | |
4 | 1270.2 | 0.2849 | 0.5481 | 0.2323 | 0.9543 | 0.8975 | 0.9663 | 0.9319 | 0.8976 | 0.39 | |
4 | 1270.2 | 0.2849 | 0.5481 | 0.2323 | 0.9543 | 0.8975 | 0.9663 | 0.9319 | 0.8976 | 0.39 | |
Glass dataset | |||||||||||
13 | 687.4 | 0.3411 | 0.6369 | 0.05804 | 0.6891 | 0.1966 | 0.4346 | 0.4073 | 0.1966 | 2.8 | |
2 | 679.9 | 0.3458 | 0.6433 | 0.04906 | 0.6926 | 0.2036 | 0.4395 | 0.4116 | 0.2036 | 2.73 | |
4 | 790.2 | 0.3021 | 0.5754 | 0.06644 | 0.6531 | 0.1908 | 0.3598 | 0.4327 | 0.1954 | 2.60 | |
10 | 678.6 | 0.3427 | 0.6390 | 0.04502 | 0.6879 | 0.1946 | 0.4766 | 0.4062 | 0.1946 | 2.84 |
Number of iterations to converge for the hybrid methods in comparison with
No single clustering method was superior on all evaluation criteria. However, depending on the purpose of the study, either internal or external validity indices may be the more important. According to the internal validity indices, the MST-based clustering method was the best for all datasets except the leukemia, wine, and glass datasets. For the former, the GA-based method and, for the latter two, the hierarchical-based method was the best hybrid method (Table
Overall, the hybrid methods could not greatly improve the performance of
We have conducted a comparison study of three hybrid clustering methods that try to solve the random centroids problem in
To the best of our knowledge, MST-, GA-, and hierarchical-based
Results of this research indicated that the hybrid methods did not necessarily improve the ordinary
Overall, the hybrid methods could not greatly improve the performance of
Finally, since some previous studies reported better performance for these three hybrid methods than the ordinary
The data used to support the findings of this study have been deposited in the Gene Expression Omnibus repository for Leukemia, Prostate, and colon cancers (
This article was extracted from Atefeh Bassirat’s Master of Science Thesis.
The authors declare that they have no conflicts of interest.
This work was supported by the grant number 98-20079 from the Shiraz University of Medical Sciences Research Council.