

Cluster analysis has a long research history. Owing to its ability to learn without a priori knowledge, it is widely applied in fields such as pattern recognition, image processing, web mining, spatiotemporal databases, and business intelligence.

Clustering is often treated as unsupervised learning and aims to partition the objects in a dataset into several natural groupings, the so-called clusters, such that objects within a cluster tend to be similar while objects in different clusters are dissimilar. Generally, datasets from different application fields vary in their features, and the purposes of clustering are manifold; the best method of cluster analysis therefore depends on the dataset and the purpose of use. There is no universal clustering technique applicable to the diverse structures presented by various datasets [

The contributions of this paper are as follows. (1) A density-based algorithm is proposed that quickly generates high-quality initial cluster centroids instead of choosing them randomly, which stabilizes the clustering result and speeds up the convergence of the clustering algorithm. This algorithm can also automatically estimate the maximum number of clusters from the features of a dataset, thereby determining the search range for estimating the optimal number of clusters and effectively reducing the number of iterations of the clustering algorithm. (2) Based on compactness within a cluster and separation between clusters, a new fuzzy clustering validity index (CVI) is defined that avoids tending to zero as the number of clusters approaches the number of objects, and thus yields the optimal clustering result. (3) A self-adaptive method is put forward that iteratively runs the improved FCM algorithm to estimate the optimal number of clusters.

The easiest way to determine the number of clusters is data visualization. For a dataset that can be effectively mapped to a 2-dimensional Euclidean space, the number of clusters can be read directly from the distribution graph of the data points. For high-dimensional and complicated data, however, this method is impractical. Rodriguez and Laio proposed a clustering algorithm based on density peaks, declaring that it was able to detect nonspherical clusters and to automatically find the true number of clusters [

A clustering validity index is used to evaluate the quality of the partitions of a dataset generated by a clustering algorithm, and constructing an appropriate validity index is an effective way to determine the number of clusters. The idea is to assign different values to the number of clusters, generate the corresponding partitions, and select the value whose partition scores best on the index.

Based on a clustering validity index, the optimal number of clusters is determined through exhaustive search, so narrowing the search range is the key to increasing the efficiency of estimating the optimal number of clusters.

Some new clustering algorithms have recently been proposed in succession. Their main idea is to use certain criteria to guide the clustering process while adjusting the number of clusters, so that when the clustering completes, an appropriate number of clusters is obtained as well. For example,

Since Ruspini first introduced the theory of fuzzy sets into cluster analysis in 1973 [

Assume that $X=\{x_1,x_2,\dots,x_n\}$ is a dataset containing $n$ objects and that it is to be partitioned into $c$ clusters ($2 \le c \le n$).

According to the clustering criterion, an appropriate fuzzy partition matrix $U=[u_{ij}]_{c\times n}$ and cluster centroids $V=\{v_1,v_2,\dots,v_c\}$ are sought, where $u_{ij}\in[0,1]$ is the membership degree of object $x_j$ in the $i$th cluster and $\sum_{i=1}^{c}u_{ij}=1$ for every $j$.

The FCM algorithm is carried out through an iterative process of minimizing the objective function $J_m(U,V)=\sum_{i=1}^{c}\sum_{j=1}^{n}u_{ij}^{m}\|x_j-v_i\|^{2}$, where $m>1$ is the fuzzification exponent, as follows.

Assign the initial value of the number of clusters $c$, the fuzzification exponent $m$ (commonly $m=2$), and the convergence threshold $\varepsilon$.

Initialize the fuzzy partition matrix $U^{(0)}$ subject to the constraint $\sum_{i=1}^{c}u_{ij}=1$.

At the $t$th iteration, update the cluster centroids: $v_i^{(t)}=\sum_{j=1}^{n}\big(u_{ij}^{(t-1)}\big)^{m}x_j \big/ \sum_{j=1}^{n}\big(u_{ij}^{(t-1)}\big)^{m}$.

Calculate the objective function $J_m^{(t)}$. If $|J_m^{(t)}-J_m^{(t-1)}|<\varepsilon$, stop; otherwise continue.

Calculate the new fuzzy partition matrix: $u_{ij}^{(t)}=1\big/\sum_{k=1}^{c}\big(\|x_j-v_i^{(t)}\|/\|x_j-v_k^{(t)}\|\big)^{2/(m-1)}$, and return to the centroid-update step.

At last, each object is assigned to one cluster according to the principle of the maximum degree of membership. The advantages of FCM are a simple algorithm, quick convergence, and ease of extension. Its disadvantages lie in the selection of the initial cluster centroids, the sensitivity to noise and outliers, and the need to preset the number of clusters, all of which have a great impact on the clustering result. Since randomly selected initial cluster centroids cannot ensure that FCM converges to an optimal solution, FCM is either run multiple times with different initial cluster centroids, or the centroids are determined by another fast algorithm.
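The alternating updates above can be sketched in code; the following is a minimal NumPy illustration (function and variable names are our own, not from the paper):

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Minimal FCM sketch: alternate the centroid and membership updates
    until the fuzzy partition matrix U stops changing."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                                       # each column sums to 1
    for _ in range(max_iter):
        V = (U**m @ X) / (U**m).sum(axis=1, keepdims=True)   # centroid update
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        p = 2.0 / (m - 1.0)
        U_new = (d**-p) / (d**-p).sum(axis=0)                # membership update
        if np.abs(U_new - U).max() < eps:
            return U_new, V
        U = U_new
    return U, V

# Toy demo: two well-separated groups of points.
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
U, V = fcm(X, c=2)
labels = U.argmax(axis=0)   # crisp assignment by maximum membership
```

The final `argmax` step implements the maximum-membership principle mentioned above.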

The traditional method to determine the optimal number of clusters for FCM is to set a search range for the number of clusters, run FCM to generate clustering results for each candidate number of clusters, evaluate the results with an appropriate clustering validity index, and finally obtain the optimal number of clusters from the evaluation. The method is composed of the following steps.

Input the search range $[c_{\min}, c_{\max}]$; usually $c_{\min}=2$, and $c_{\max}=\lfloor\sqrt{n}\rfloor$ according to the empirical rule.

For each integer $c \in [c_{\min}, c_{\max}]$, run FCM and calculate the value of the clustering validity index for the resulting partition.

Compare all the values of the clustering validity index.

Output the number of clusters corresponding to the best value of the index as the optimal number of clusters.
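These steps amount to an exhaustive search; a sketch, with `run_fcm` and `validity` as placeholder hooks (hypothetical names) for the clustering algorithm and the chosen index:

```python
def best_c(X, c_min, c_max, run_fcm, validity):
    """Exhaustive search: cluster with every candidate number of clusters
    and keep the count whose partition scores best (lowest) on the index."""
    best_score, best_count = None, None
    for c in range(c_min, c_max + 1):
        U, V = run_fcm(X, c)
        score = validity(U, V, X)
        if best_score is None or score < best_score:
            best_score, best_count = score, c
    return best_count

# Toy demo with stand-in hooks: the fake index is minimized at c == 3.
fake_scores = {2: 0.9, 3: 0.1, 4: 0.5}
c_opt = best_c(None, 2, 4,
               run_fcm=lambda X, c: (c, None),
               validity=lambda U, V, X: fake_scores[U])
```

Whether "best" means lowest or highest depends on the index chosen; the sketch assumes lower-is-better.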

Considering the large influence of randomly selected initial cluster centroids on the clustering result, a density-based algorithm is proposed to select the initial cluster centroids, which at the same time estimates the maximum number of clusters. Some related terms are defined first.

The local density $\rho_i$ of an object $x_i$ is defined as the number of objects whose distance from $x_i$ is smaller than the cutoff distance $d_c$.

Assume that there is an object $x_i$ that does not belong to any cluster and has the highest local density among all such objects; then $x_i$ is called a core point.

Assume that there are two objects $x_i$ and $x_j$, where $x_i$ is a core point. If the distance between $x_i$ and $x_j$ is smaller than the cutoff distance $d_c$, then $x_j$ is said to be directly density-reachable from $x_i$.

Assume that $p_1,p_2,\dots,p_k$ is a chain of objects with $p_1=x_i$ and $p_k=x_j$ such that each $p_{l+1}$ is directly density-reachable from $p_l$; then $x_j$ is said to be density-reachable from $x_i$.

Neighbors of an object are the objects that are density-reachable from it.

An object is called a border point if it has no neighbors.

The example is shown in Figure

An example. Point A has the highest local density. If A does not belong to any cluster, then A is a core point. Points B, C, D, and E are directly density-reachable from point A. Point F is density-reachable from point A. Point H is a border point.

The selection principle of the initial cluster centroids in the density-based algorithm is that a cluster centroid is usually an object with a higher local density, surrounded by neighbors whose local density is lower than its own, and at a relatively large distance from the other cluster centroids [

Demonstration of the process of the density-based algorithm. (a) The initial data distribution of a synthetic dataset consisting of two pieces of 2-dimensional Gaussian distribution data with centroids at (2, 3) and (7, 8), each class containing 100 samples. In (b), the blue circle represents the highest-density core point, taken as the centroid of the first cluster, and the red plus signs represent objects belonging to the first cluster. In (c), the red circle represents the core point taken as the centroid of the second cluster, and the blue asterisks represent objects belonging to the second cluster. In (d), the purple circle represents the core point taken as the centroid of the third cluster, the green times signs represent objects belonging to the third cluster, and the black dots represent the final border points that do not belong to any cluster. With the chosen cutoff distance, the maximum number of clusters is 3; calculated by the empirical rule, it would be 14. The algorithm therefore effectively reduces the number of FCM iterations.

Initial data

Iteration 1

Iteration 2

Iteration 3
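The selection process illustrated above can be sketched as follows (our own simplified reading, not the paper's exact code; `dc` is the cutoff distance, and the reachability expansion follows the definitions above):

```python
import numpy as np

def density_init(X, dc):
    """Sketch of the density-based initialization: local density = number of
    objects within the cutoff distance dc; the densest unassigned object is
    taken as a core point (centroid) and every object density-reachable from
    it joins that cluster; the loop stops at border points (no neighbors)."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    density = (D < dc).sum(axis=1) - 1          # exclude the object itself
    unassigned = set(range(n))
    centroids = []
    while unassigned:
        core = max(unassigned, key=lambda i: density[i])
        reach, frontier = {core}, [core]
        while frontier:                         # expand density-reachability
            i = frontier.pop()
            for j in list(unassigned - reach):
                if D[i, j] < dc:
                    reach.add(j)
                    frontier.append(j)
        if len(reach) == 1:                     # border point: no neighbors left
            break
        centroids.append(core)
        unassigned -= reach
    return centroids

# Toy demo: two tight, well-separated groups should give two centroids.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
centers = density_init(X, dc=0.5)
```

The number of centroids returned serves as the estimated maximum number of clusters, and the centroid objects themselves seed FCM.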

First, based on the geometric structure of the dataset, Xie and Beni put forward the Xie-Beni fuzzy clustering validity index [

The index is defined as $V_{\mathrm{XB}}(U,V)=\frac{\sum_{i=1}^{c}\sum_{j=1}^{n}u_{ij}^{2}\|x_j-v_i\|^{2}}{n\cdot\min_{i\neq k}\|v_i-v_k\|^{2}}$. In this formula, the numerator is the average fuzzy distance from the objects to their cluster centroids, measuring compactness, and the denominator is the minimum distance between any two centroids, measuring separation.
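A sketch of computing the Xie-Beni index for a given partition (assuming the standard formulation; lower values are better):

```python
import numpy as np

def xie_beni(U, V, X):
    """Xie-Beni index: fuzzy within-cluster scatter (compactness) divided by
    n times the minimum squared centroid-to-centroid distance (separation).
    Lower values mean a better partition."""
    n = X.shape[0]
    d2 = ((X[None, :, :] - V[:, None, :])**2).sum(axis=2)   # (c, n) squared dists
    compact = (U**2 * d2).sum() / n
    sep = min(((V[i] - V[j])**2).sum()
              for i in range(len(V)) for j in range(len(V)) if i != j)
    return compact / sep

# Toy demo: a crisp 2-cluster partition of four points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
V = np.array([[0.0, 0.5], [10.0, 0.5]])
U = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
score = xie_beni(U, V, X)   # 0.25 / 100 = 0.0025
```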

However, Bensaid et al. found that the size of each cluster had a large influence on the Xie-Beni index and put forward a new index [

Because

The numerator represents the compactness of the $i$th cluster, and the denominator represents its separation from the other clusters.

In this paper, the iterative trial-and-error process [

Update the fuzzy partition matrix $U$.

Update the cluster centroids $V$.

Update the value of the clustering validity index.

This paper selects 8 experimental datasets, among which 3 come from the UCI repository (Iris, Wine, and Seeds) and one dataset (SubKDD) is randomly sampled from KDD Cup 1999, shown in Table

The data type and distribution of SubKDD.

Attack behavior | Number of samples |
---|---|
normal | 200 |
ipsweep | 50 |
portsweep | 50 |
neptune | 200 |
smurf | 300 |
back | 50 |

Four synthetic datasets.

SD1

SD2

SD3

SD4

In this paper, the numeric value

For the categorical attributes of SubKDD, simple matching is used as the dissimilarity measure, that is, 0 for identical values and 1 for different values.
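This simple-matching dissimilarity can be sketched as follows (the record fields below are only illustrative):

```python
def simple_matching(a, b):
    """Dissimilarity of two categorical records: count 1 for each attribute
    whose values differ and 0 for each identical value."""
    return sum(0 if x == y else 1 for x, y in zip(a, b))

# Toy demo with made-up connection records (protocol, service, flag).
d = simple_matching(("tcp", "http", "SF"), ("udp", "http", "SF"))   # one mismatch
```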

In [

Dataset | | | | | | | | |
---|---|---|---|---|---|---|---|---|
Iris | 150 | 3 | 12 | 9 | 20 | 14 | 9 | 6 |
Wine | 178 | 3 | 13 | 15 | 14 | 7 | 6 | 3 |
Seeds | 210 | 3 | 14 | 13 | 18 | 12 | 5 | 2 |
SubKDD | 1050 | 6 | 32 | 24 | 21 | 17 | 10 | 7 |
SD1 | 200 | 20 | 14 | 19 | 38 | 22 | 20 | — |
SD2 | 2000 | 4 | 44 | 25 | 16 | 3 | 4 | 2 |
SD3 | 885 | 3 | 29 | 27 | 24 | 19 | 5 | 3 |
SD4 | 947 | 3 | 30 | 31 | 23 | 13 | 8 | 4 |

Experimental results show that, for a dataset with a true number of clusters greater than

To show that the initial cluster centroids obtained by the density-based algorithm can quicken the convergence of the clustering algorithm, the traditional FCM is adopted for verification. The number of clusters is assigned its true value, with the convergence threshold being

Comparison of iterations of FCM algorithm. Method 1 uses the random initial cluster centroids, and Method 2 uses the cluster centroids obtained by density-based algorithm.

Dataset | Method 1 | Method 2 |
---|---|---|
Iris | 21 | 16 |
Wine | 27 | 18 |
Seeds | 19 | 16 |
SubKDD | 31 | 23 |
SD1 | 38 | 14 |
SD2 | 18 | 12 |
SD3 | 30 | 22 |
SD4 | 26 | 21 |

As shown in Table

For the 3 UCI datasets and SubKDD, clustering accuracy is adopted to evaluate the clustering results, defined as $r=\sum_{i=1}^{c} a_i / n$.

Here $a_i$ is the number of objects that are correctly assigned to the $i$th cluster, and $n$ is the number of objects in the dataset.
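Because cluster labels are arbitrary, clustering accuracy is typically computed under the best one-to-one matching between cluster labels and class labels; a brute-force sketch (our own illustration, practical only for a small number of clusters):

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_labels):
    """Fraction of correctly assigned objects, maximized over one-to-one
    relabelings of the clusters (brute force over label permutations)."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    best = 0
    for perm in permutations(classes, len(clusters)):
        relabel = dict(zip(clusters, perm))
        hits = sum(1 for t, c in zip(true_labels, cluster_labels)
                   if relabel[c] == t)
        best = max(best, hits)
    return best / len(true_labels)

# Toy demo: the partition is perfect up to a label swap.
acc = clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0])   # -> 1.0
```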

In the experiments, the true number of clusters is assigned to each dataset and the initial cluster centroids are obtained by the density-based algorithm. The resulting clustering accuracies are shown in Table

Clustering accuracy.

Dataset | Iris | Wine | Seeds | SubKDD |
---|---|---|---|---|
Clustering accuracy | 84.00% | 96.63% | 91.90% | 94.35% |

Apparently, the clustering accuracies on Wine, Seeds, and SubKDD are high, while that on Iris is relatively low because two of its clusters are not linearly separable.

The clustering results of the proposed clustering validity index on 4 synthetic datasets are shown in Figure

Clustering results of four synthetic datasets.

SD1

SD2

SD3

SD4

Finally, the Xie-Beni index, the Bensaid index, the Kwon index, and the proposed index are each adopted in runs of SAFCM to determine the optimal number of clusters. The results are shown in Table

Optimal number of clusters estimated by several clustering validity indices.

Dataset | Xie-Beni | Bensaid | Kwon | Proposed |
---|---|---|---|---|
Iris | 2 | 9 | 2 | 2 |
Wine | 3 | 6 | 3 | 3 |
Seeds | 2 | 5 | 2 | 2 |
SubKDD | 10 | 10 | 9 | 4 |
SD1 | 20 | 20 | 20 | 20 |
SD2 | 4 | 4 | 4 | 4 |
SD3 | 5 | 3 | 5 | 4 |
SD4 | 2 | 8 | 2 | 3 |

It shows that, for the synthetic datasets SD1 and SD2 with simple structure, all the indices obtain the optimal number of clusters. Except on SD3, the Bensaid index cannot obtain an accurate number of clusters on the 3 UCI datasets, SubKDD, or SD4. For the dataset Wine, both the Xie-Beni index and the Kwon index obtain the accurate number of clusters, while for Iris, Seeds, SubKDD, SD3, and SD4 they only obtain results approximate to the true number of clusters. The proposed index obtains the right result on Wine and SD4, a better result than the Xie-Beni and Kwon indices on SubKDD and SD3, and the same result as those two indices on Iris and Seeds. There are two reasons. First, the importance degree of each property of the dataset is not considered in the clustering process but assumed to be the same, which affects the experimental result. Second, SAFCM has a weak capacity to deal with overlapping clusters or nonconvex datasets. These issues are the authors' subsequent research topics.

Tables

The value of clustering validity index on Iris.

c | Xie-Beni | Bensaid | Kwon | Proposed |
---|---|---|---|---|
2 | | 0.223604 | | |
3 | 0.223973 | 0.124598 | 34.572146 | 0.551565 |
4 | 0.316742 | 0.099103 | 49.279488 | 0.615436 |
5 | 0.560109 | 0.089108 | 87.540968 | 0.676350 |
6 | 0.574475 | 0.072201 | 90.563379 | 0.691340 |
7 | 0.400071 | 0.067005 | 63.328679 | 0.735311 |
8 | 0.275682 | 0.036283 | 45.736972 | 0.614055 |
9 | 0.250971 | | 42.868449 | 0.584244 |

The value of clustering validity index on Wine.

c | Xie-Beni | Bensaid | Kwon | Proposed |
---|---|---|---|---|
2 | 0.663406 | 1.328291 | 118.33902 | 1.578291 |
3 | | 0.513071 | | |
4 | — | 0.473254 | — | 1.735791 |
5 | — | 0.373668 | — | 1.846686 |
6 | — | | — | 1.683222 |

The value of clustering validity index on Seeds.

c | Xie-Beni | Bensaid | Kwon | Proposed |
---|---|---|---|---|
2 | | 0.293609 | | |
3 | 0.212127 | 0.150899 | 45.326216 | 0.599001 |
4 | 0.243483 | 0.127720 | 52.215334 | 0.697943 |
5 | 0.348842 | | 75.493654 | 0.701153 |

The value of clustering validity index on SubKDD.

c | Xie-Beni | Bensaid | Kwon | Proposed |
---|---|---|---|---|
2 | 0.646989 | 1.324434 | 550.166676 | 1.574431 |
3 | 0.260755 | 0.378775 | 222.020838 | 1.090289 |
4 | 0.133843 | 0.062126 | 119.544560 | |
5 | 0.234402 | 0.052499 | 202.641204 | 0.537852 |
6 | 0.180728 | 0.054938 | 156.812271 | 0.583800 |
7 | 0.134636 | 0.047514 | 119.029265 | 0.619720 |
8 | 0.104511 | 0.032849 | 91.9852740 | 0.690873 |
9 | 0.129721 | 0.027639 | | 0.562636 |
10 | | | 91.3528560 | 0.528528 |

The value of clustering validity index on SD1.

c | Xie-Beni | Bensaid | Kwon | Proposed |
---|---|---|---|---|
2 | 0.221693 | 0.443390 | 44.592968 | 0.693390 |
3 | 0.206035 | 0.198853 | 40.264251 | 0.726245 |
4 | 0.127731 | 0.093653 | 26.220200 | 0.655550 |
5 | 0.130781 | 0.069848 | 27.154867 | 0.651465 |
6 | 0.144894 | 0.050067 | 22.922121 | 0.639325 |
7 | 0.136562 | 0.040275 | 29.126152 | 0.636258 |
8 | 0.112480 | 0.032874 | 24.323625 | 0.627442 |
9 | 0.115090 | 0.026833 | 24.242580 | 0.624936 |
10 | 0.141415 | 0.022611 | 28.574579 | 0.616701 |
11 | 0.126680 | 0.019256 | 28.821707 | 0.611524 |
12 | 0.103178 | 0.016634 | 23.931865 | 0.605990 |
13 | 0.110355 | 0.013253 | 26.517065 | 0.588246 |
14 | 0.095513 | 0.011083 | 23.635022 | 0.576808 |
15 | 0.075928 | 0.009817 | 19.302095 | 0.562289 |
16 | 0.066025 | 0.008824 | 17.236138 | 0.557990 |
17 | 0.054314 | 0.007248 | 14.995284 | 0.544341 |
18 | 0.045398 | 0.006090 | 13.208810 | 0.534882 |
19 | 0.039492 | 0.005365 | 11.977437 | 0.527131 |
20 | | | | |

The value of clustering validity index on SD2.

c | Xie-Beni | Bensaid | Kwon | Proposed |
---|---|---|---|---|
2 | 0.066286 | 0.132572 | 132.81503 | 0.382572 |
3 | 0.068242 | 0.063751 | 137.52535 | 0.394200 |
4 | | | | |

The value of clustering validity index on SD3.

c | Xie-Beni | Bensaid | Kwon | Proposed |
---|---|---|---|---|
2 | 0.148379 | 0.300899 | 131.557269 | 0.570876 |
3 | 0.195663 | | 173.900551 | 0.599680 |
4 | 0.127512 | 0.142150 | 113.748947 | |
5 | | 0.738070 | | 0.589535 |

The value of clustering validity index on SD4.

c | Xie-Beni | Bensaid | Kwon | Proposed |
---|---|---|---|---|
2 | | 0.208832 | | 0.473748 |
3 | 0.170326 | 0.142561 | 162.044450 | |
4 | 0.221884 | 0.081007 | 211.692699 | 0.583529 |
5 | 0.156253 | 0.053094 | 157.683921 | 0.603211 |
6 | 0.123191 | 0.041799 | 118.279116 | 0.575396 |
7 | 0.165465 | 0.032411 | 107.210082 | 0.592625 |
8 | 0.145164 | | 139.310969 | 0.606049 |

FCM is widely used in many fields, but it needs a preset number of clusters and is greatly influenced by the initial cluster centroids. This paper studies a self-adaptive method for determining the number of clusters using the FCM algorithm. In this method, a density-based algorithm is first put forward that can estimate the maximum number of clusters to reduce the search range of the optimal number, which is especially fit for datasets on which the empirical rule is inoperative. Moreover, it generates high-quality initial cluster centroids so that the clustering result is stable and the convergence of FCM is quick. Then, a new fuzzy clustering validity index based on fuzzy compactness and separation is put forward so that the clustering result is closer to the global optimum; the index remains robust and interpretable as the number of clusters tends to the number of objects in the dataset. Finally, a self-adaptive FCM algorithm is proposed that determines the optimal number of clusters through an iterative trial-and-error process.

The contributions are validated by experimental results. However, in most cases each property plays a different role in the clustering process; in other words, the weights of the properties are not the same. This issue will be the focus of the authors' future work.

The authors declare that they have no competing interests.

This work is partially supported by the National Natural Science Foundation of China (61373148 and 61502151), Shandong Province Natural Science Foundation (ZR2014FL010), Science Foundation of Ministry of Education of China (14YJC 860042), Shandong Province Outstanding Young Scientist Award Fund (BS2013DX033), Shandong Province Social Sciences Project (15CXWJ13 and 16CFXJ05), and Project of Shandong Province Higher Educational Science and Technology Program (no. J15LN02).