A Self-Adaptive Fuzzy c-Means Algorithm for Determining the Optimal Number of Clusters

To address the shortcoming that the fuzzy c-means algorithm (FCM) needs to know the number of clusters in advance, this paper proposed a new self-adaptive method to determine the optimal number of clusters. Firstly, a density-based algorithm was put forward. According to the characteristics of the dataset, the algorithm automatically determined the possible maximum number of clusters instead of using the empirical rule $\sqrt{n}$ and obtained the optimal initial cluster centroids, overcoming the limitation of FCM that randomly selected cluster centroids lead the convergence result to a local minimum. Secondly, by introducing a penalty function, this paper proposed a new fuzzy clustering validity index based on fuzzy compactness and separation. The index ensured that, when the number of clusters verged on the number of objects in the dataset, its value did not monotonically decrease toward zero, a behavior that would otherwise cost the index its robustness and its decision function for the optimal number of clusters. Then, based on these studies, a self-adaptive FCM algorithm was put forward to estimate the optimal number of clusters by an iterative trial-and-error process. At last, experiments were done on the UCI, KDD Cup 1999, and synthetic datasets, which showed that the method not only effectively determined the optimal number of clusters, but also reduced the iterations of FCM while giving a stable clustering result.


Introduction
Cluster analysis has a long research history. Due to its advantage of learning without a priori knowledge, it is widely applied in the fields of pattern recognition, image processing, web mining, spatiotemporal database application, business intelligence, and so forth.
Clustering is often regarded as unsupervised learning and aims to partition objects in a dataset into several natural groupings, namely, the so-called clusters, such that objects within a cluster tend to be similar while objects belonging to different clusters are dissimilar. Generally, the datasets from different application fields vary in feature, and the purposes of clustering are multifarious. Therefore, the best method of cluster analysis depends on datasets and purposes of use. There is no universal clustering technology that can be widely applicable to the diverse structures presented by various datasets [1]. According to the accumulation rules of objects in clusters and the methods of applying these rules, clustering algorithms are divided into many types. However, for most clustering algorithms, including partitional clustering and hierarchical clustering, the number of clusters is a parameter that needs to be preset, and the quality of the clustering result is closely related to it. In practical application, it usually relies on users' experience or background knowledge in related fields. But in most cases, the number of clusters is unknown to users. If the number is set too large, it may produce more complicated clustering results that are difficult to explain. On the contrary, if it is too small, a lot of valuable information in the clustering result may be lost [2]. Thus, determining the optimal number of clusters in a dataset is still a fundamental problem in the research of cluster analysis [3].
Contributions of this paper are as follows.
(1) A density-based algorithm is proposed, which can quickly and succinctly generate high-quality initial cluster centroids instead of choosing them randomly, so as to stabilize the clustering result and also quicken the convergence of the clustering algorithm. Besides, this algorithm can automatically estimate the maximum number of clusters based on the features of a dataset, thereby determining the search range for estimating the optimal number of clusters and effectively reducing the iterations of the clustering algorithm.
(2) Based on the features of compactness within a cluster and separation between clusters, a new fuzzy clustering validity index (CVI) is defined in this paper. It prevents the index value from approaching zero as the number of clusters tends to the number of objects, so that the optimal clustering result can be obtained.
(3) A self-adaptive method that iteratively uses the improved FCM algorithm to estimate the optimal number of clusters is put forward.
Computational Intelligence and Neuroscience

Related Work
The easiest method of determining the number of clusters is data visualization. For a dataset that can be effectively mapped to a 2-dimensional Euclidean space, the number of clusters can be intuitively acquired through the distribution graph of data points. However, for high-dimensional and complicated data, this method is impractical. Rodriguez and Laio proposed a clustering algorithm based on density peaks, declaring that it was able to detect nonspherical clusters and to automatically find the true number of clusters [4]. But, in fact, the number of cluster centroids still needs to be selected manually according to the decision graph. Relevant technologies to determine the optimal number of clusters are summarized below.

Clustering Validity Index Based Method.
A clustering validity index is used to evaluate the quality of partitions on a dataset generated by the clustering algorithm. Constructing an appropriate clustering validity index is an effective way to determine the number of clusters. The idea is to assign different values of the number of clusters $c$ within a certain range, then run the fuzzy clustering algorithm on the dataset, and finally evaluate the results by the clustering validity index. When the value of the clustering validity index reaches its maximum or minimum, or an obvious inflection point appears, the corresponding value of $c$ is the optimal number of clusters $c_{opt}$. So far, researchers have put forward many fuzzy clustering validity indices, divided into the following two types.
(1) Clustering Validity Index Based on Fuzzy Partition. These indices rest on the view that, for a well-separated dataset, the smaller the fuzziness of the fuzzy partition is, the more reliable the clustering result is. Based on it, Zadeh, the founder of fuzzy sets, put forward the first clustering validity index, the degree of separation, in 1965 [5]. But its discriminating effect is not ideal. In 1974, Bezdek put forward the concept of partition coefficient (PC) [6], which was the first practical generic function for measuring the validity of fuzzy clustering, and subsequently proposed another clustering validity function, partition entropy (PE) [7], closely related to the partition coefficient. Later, Windham defined the proportion exponent by use of the maximum value of a fuzzy membership function [8]. Lee put forward a fuzzy clustering validity index using the distinguishableness of clusters measured by the object proximities [9]. Based on the Shannon entropy and fuzzy variation theory, Zhang and Jiang put forward a new fuzzy clustering validity index taking account of the geometry structure of the dataset [10]. Saha et al. put forward an algorithm based on differential evolution for automatic cluster detection, which well evaluated the validity of the clustering result [11]. Yue et al. partitioned the original data space into a grid-based structure and proposed a cluster separation measure based on grid distances [12]. The clustering validity index based on fuzzy partition is only related to the fuzzy degree of membership and has the advantages of simplicity and low computational cost. But it lacks a direct relation to the structural features of the dataset.
(2) Clustering Validity Index Based on the Geometry Structure of the Dataset. These indices rest on the view that, for a well-separated dataset, every cluster should be compact and the clusters should be separated from each other as far as possible. The ratio of compactness to separation is used as the standard of clustering validity. Representative clustering validity indices of this type include the Xie-Beni index [13], the Bensaid index [14], and the Kwon index [15]. Sun et al. proposed a new validity index based on a linear combination of compactness and separation and inspired by Rezaee's validity [16]. Li and Yu defined new compactness and separation and put forward a new fuzzy clustering validity function [17]. Based on fuzzy granulation-degranulation, Saha and Bandyopadhyay put forward a fuzzy clustering validity function [18]. Zhang et al. adopted Pearson correlation to measure the distance and put forward a validity function [19]. Kim et al. proposed a clustering validity index for the GK algorithm based on the average value of the relative degrees of sharing of all possible pairs of fuzzy clusters [20]. Rezaee proposed a new validity index for the GK algorithm to overcome the shortcomings of Kim's index [21]. Zhang et al. proposed a novel WGLI to detect the optimal cluster number, using global optimum membership as the global property and modularity of a bipartite network as the local independent property [22]. The clustering validity index based on the geometric structure of the dataset considers both the fuzzy degree of membership and the geometric structure, but its membership function is quite complicated and computationally expensive.
Based on the clustering validity index, the optimal number of clusters is determined through exhaustive search. In order to increase the efficiency of estimating the optimal number of clusters $c_{opt}$, the search range of $c_{opt}$ must be set; that is, $c_{max}$ is the maximum number of clusters, assigned to meet the condition $c_{opt} \le c_{max}$. Most researchers used the empirical rule $c_{max} \le \sqrt{n}$, where $n$ is the number of data in the dataset. For this problem, theoretical analysis and example verification were conducted in [23], which indicated that it was reasonable in a sense. However, this method has the following disadvantages. (1) Each $c$ must be tried in turn, which incurs a huge computational cost. (2) For each $c$, it cannot be guaranteed that the clustering result is the globally optimal solution. (3) When noise and outliers exist, the reliability of the clustering validity index is weak. (4) For some datasets like FaceImage [24], if $c \ge \sqrt{n}$, the empirical rule will be invalid. Research has shown that, due to the diversified data types and structures, no universal fuzzy clustering validity index is applicable to all datasets. Research on this problem is ongoing and remains urgent.

Heuristic Method.
Some new clustering algorithms have been proposed in succession recently. The main idea is to use some criteria to guide the clustering process, with the number of clusters being adjusted along the way. In this way, when the clustering is completed, the appropriate number of clusters is obtained as well. For example, the X-means algorithm [25], based on split hierarchical clustering, is representative. Contrary to the aforesaid process, RCA [26] determined the actual number of clusters by a process of competitive agglomeration. Combining the single-point iterative technology with hierarchical clustering, a similarity-based clustering method (SCM) [27] was proposed. A Mercer kernel-based clustering [28] estimated the number of clusters by the eigenvectors of a kernel matrix. A clustering algorithm based on maximal-distant subtrees [29] detected any number of well-separated clusters with any shapes. In addition, Frey and Dueck put forward an affinity propagation clustering algorithm (AP) [24], which generated high-quality cluster centroids via message passing between objects to determine the optimal number of clusters of large-scale data. Shihong et al. proposed a Gerschgorin disk estimation-based criterion to estimate the true number of clusters [30]. José-García and Gómez-Flores presented an up-to-date review of all major nature-inspired metaheuristic algorithms used thus far for automatic clustering, determining the best estimate of the number of clusters [31]. All these have broadened the thinking for relevant research.

Traditional FCM for Determining the Optimal Number of Clusters
Since Ruspini first introduced the theory of fuzzy sets into cluster analysis in 1973 [32], different fuzzy clustering algorithms have been widely discussed, developed, and applied in various areas. FCM [33] is one of the most commonly used algorithms. FCM was first presented by Dunn in 1974. Subsequently, Bezdek introduced a weighting exponent to the fuzzy degree of membership [34], further developing FCM. FCM divides the dataset $X$ into $c$ fuzzy clusters. This algorithm holds that each object belongs to a certain cluster with a different degree of membership; that is, a cluster is considered as a fuzzy subset on the dataset.
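Let $X = \{x_1, x_2, \ldots, x_n\}$ denote the dataset, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{il})$ is an $l$-dimensional object and $x_{ij}$ is the $j$th property of the $i$th object, and let $V = \{v_1, v_2, \ldots, v_c\}$ denote the $c$ cluster centroids.
$U = (u_{ik})_{n \times c}$ is a fuzzy partition matrix, and $u_{ik}$ is the degree of membership of the $i$th object in the $k$th cluster, where $\sum_{k=1}^{c} u_{ik} = 1$, $\forall i = 1, \ldots, n$. The objective function is the quadratic sum of weighted distances from the samples to the cluster centroid in each cluster; that is,

$$J(U, V) = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} d_{ik}^{2},$$

where $d_{ik} = \|x_i - v_k\|$ is the Euclidean distance between the $i$th object and the $k$th cluster centroid. $m$ ($m \in [1, \infty)$) is a fuzziness index, controlling the fuzziness of the memberships. As the value of $m$ becomes progressively higher, the resulting memberships become fuzzier [35]. Pal and Bezdek advised that $m$ should be between 1.5 and 2.5, and usually $m = 2$ is used if there is no special requirement [36].
According to the clustering criterion, an appropriate fuzzy partition matrix $U$ and cluster centroids $V$ are obtained to minimize the objective function $J$. Based on the Lagrange multiplier method, $U$ and $V$ are, respectively, calculated by the formulas below:

$$u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( d_{ik} / d_{ij} \right)^{2/(m-1)}},$$

$$v_k = \frac{\sum_{i=1}^{n} u_{ik}^{m} x_i}{\sum_{i=1}^{n} u_{ik}^{m}}.$$

The FCM algorithm is carried out through an iterative process of minimizing the objective function $J$, with the update of $U$ and $V$. The specific steps are as follows.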
Step 1. Assign the initial value of the number of clusters $c$, fuzziness index $m$, maximum iterations $t_{max}$, and threshold $\varepsilon$.
Step 2. Initialize the fuzzy partition $U^{(0)}$ randomly according to the constraints of the degree of membership.
At last, each object can be assigned to one cluster in accordance with the principle of the maximum degree of membership. The advantages of FCM may be summarized as a simple algorithm, quick convergence, and ease of extension. Its disadvantages lie in the selection of the initial cluster centroids, the sensitivity to noise and outliers, and the setting of the number of clusters, which have a great impact on the clustering result. As the random selection of the initial cluster centroids cannot ensure that FCM converges to an optimal solution, either FCM is run multiple times with different initial cluster centroids, or the initial centroids are determined by other fast algorithms.
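The FCM iteration described in this section can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`fcm`, `memberships`, `sqdist`), the seed handling, the small floor on distances, and the stopping test on the change of the objective function are assumptions made for this example, and $m > 1$ is required for the membership update to be defined.

```python
import numpy as np

def sqdist(X, V):
    """Squared Euclidean distances between objects and centroids, shape (n, c)."""
    return ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)

def memberships(X, V, m):
    """Membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))."""
    d2 = np.maximum(sqdist(X, V), 1e-12)      # assumed guard against zero distance
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fcm(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Return (U, V, number of iterations) for a random initial partition."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Random fuzzy partition whose rows sum to 1 (membership constraint).
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    prev_J = np.inf
    for t in range(1, max_iter + 1):
        W = U ** m
        V = (W.T @ X) / W.sum(axis=0)[:, None]   # centroid update
        U = memberships(X, V, m)                 # membership update
        J = float(((U ** m) * sqdist(X, V)).sum())
        if abs(prev_J - J) < tol:                # assumed stopping rule
            break
        prev_J = J
    return U, V, t

# Hard assignment by the maximum degree of membership:
# labels = U.argmax(axis=1)
```

Because the initial partition is random, different seeds can lead to different local minima, which is the sensitivity to initialization discussed above.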
The traditional method to determine the optimal number of clusters of FCM is to set the search range of the number of clusters, run FCM to generate clustering results for different numbers of clusters, select an appropriate clustering validity index to evaluate the clustering results, and finally obtain the optimal number of clusters according to the evaluation result. The method is composed of the following steps.
Step 3. Compare all values of the clustering validity index. The value of $c$ corresponding to the maximum or minimum value is the optimal number of clusters $c_{opt}$.
Step 4. Output $c_{opt}$, the optimal value of the clustering validity index, and the clustering result.
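The exhaustive search described above can be sketched as follows. This is only an illustration under stated assumptions: Bezdek's partition coefficient (to be maximized) stands in for the validity index, and a deterministic farthest-point initialization replaces the random initialization so that the example is reproducible. The names `search_optimal_c`, `farthest_point_init`, and `partition_coefficient` are hypothetical.

```python
import numpy as np

def sqdist(A, B):
    """Pairwise squared Euclidean distances, shape (len(A), len(B))."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)

def farthest_point_init(X, c):
    """Assumed stand-in for random initialization: greedy farthest-point picks."""
    idx = [0]
    for _ in range(c - 1):
        idx.append(int(sqdist(X, X[idx]).min(axis=1).argmax()))
    return X[idx].copy()

def fcm(X, V, m=2.0, max_iter=100, tol=1e-5):
    """FCM started from given centroids; returns memberships and centroids."""
    prev_J = np.inf
    for _ in range(max_iter):
        d2 = np.maximum(sqdist(X, V), 1e-12)
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
        W = U ** m
        J = float((W * d2).sum())
        if abs(prev_J - J) < tol:
            break
        prev_J = J
        V = (W.T @ X) / W.sum(axis=0)[:, None]
    return U, V

def partition_coefficient(U):
    """PC = (1/n) * sum of squared memberships; larger means a crisper partition."""
    return float((U ** 2).sum() / U.shape[0])

def search_optimal_c(X, c_max, c_min=2):
    """Try every c in [c_min, c_max] and keep the one with the best index value."""
    X = np.asarray(X, dtype=float)
    scores = {}
    for c in range(c_min, c_max + 1):
        U, _ = fcm(X, farthest_point_init(X, c))
        scores[c] = partition_coefficient(U)
    c_opt = max(scores, key=scores.get)
    return c_opt, scores
```

Every candidate value of $c$ costs one full FCM run, which is why a tight upper bound on the search range matters.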

Density-Based Algorithm.
Considering the large influence of the randomly selected initial cluster centroid on the clustering result, a density-based algorithm is proposed to select the initial cluster centroids, and the maximum number of clusters can be estimated at the same time. Some related terms will be defined at first.
Definition 1 (local density). The local density $\rho_i$ of object $x_i$ is defined as

$$\rho_i = \sum_{j=1}^{n} e^{-d_{ij}^{2}/d_c^{2}},$$

where $d_{ij}$ is the distance between two objects $x_i$ and $x_j$ and $d_c$ is a cutoff distance. The recommended approach is to sort the distances between any two objects in ascending order and then assign $d_c$ as the value at the first $p\%$ of the sorted distances (appropriately $p \in [2, 5]$). It shows that the more objects with a distance from $x_i$ less than $d_c$ there are, the bigger the value of $\rho_i$ is. The cutoff distance $d_c$ finally decides the number of initial cluster centroids, namely, the maximum number of clusters. The density-based algorithm is robust with respect to the choice of $d_c$.
Definition 3 (directly density-reachable). Assume that there are two objects $x_i, x_j \in X$; if their distance $d_{ij} < d_c$, then $x_j$ is directly density-reachable to $x_i$ and vice versa.
Definition 5 (neighbor). Neighbors of an object $x_i$ are those objects that are directly density-reachable or density-reachable to it, denoted as Neighbor($x_i$). The example is shown in Figure 1. The selection principle of initial cluster centroids of the density-based algorithm is that, usually, a cluster centroid is an object with higher local density, surrounded by neighbors with lower local density than it, and has a relatively large distance from other cluster centroids [4]. The density-based algorithm can automatically select the initial cluster centroids and determine the maximum number of clusters $c_{max}$ according to local density. The pseudocode is shown in Algorithm 1. The cluster centroids obtained are sorted in descending order according to local density. Figure 2 demonstrates the process of the density-based algorithm.
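A rough sketch of a density-based selection of initial centroids in the spirit of the definitions above is given below. It is not Algorithm 1 itself, whose pseudocode is not reproduced in this text. The assumptions are: a Gaussian kernel for the local density, a cutoff distance taken as the $p$th percentile of the pairwise distances, core points defined as objects whose density is at least the mean density, and clusters grown only through core points. The name `density_init` is hypothetical.

```python
import numpy as np

def density_init(X, p=3.0):
    """Return (initial centroids sorted by density, labels, cutoff distance)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    # Cutoff distance: p-th percentile of all pairwise distances (assumption).
    d_c = np.percentile(D[np.triu_indices(n, k=1)], p)
    # Local density with a Gaussian kernel (assumption); the constant self
    # term does not change the ordering of the objects.
    rho = np.exp(-(D / d_c) ** 2).sum(axis=1)
    core = rho >= rho.mean()              # assumed definition of a core point
    labels = np.full(n, -1)               # -1 marks objects left unassigned
    centroids = []
    for i in np.argsort(-rho):            # visit objects by descending density
        if labels[i] != -1 or not core[i]:
            continue
        k = len(centroids)
        centroids.append(int(i))
        labels[i] = k
        frontier = [int(i)]
        while frontier:                   # collect density-reachable objects
            j = frontier.pop()
            near = np.flatnonzero((labels == -1) & (D[j] < d_c))
            labels[near] = k
            frontier.extend(int(q) for q in near if core[q])
    return X[centroids], labels, d_c
```

The number of returned centroids plays the role of $c_{max}$, and the rows are already ordered by descending local density, so the first $c$ rows can serve as the initial centroids when FCM is run with $c$ clusters.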

Fuzzy Clustering Validity Index.
Based on the geometric structure of the dataset, Xie and Beni first put forward the Xie-Beni fuzzy clustering validity index [13], which tried to find a balance point between fuzzy compactness and separation so as to acquire the optimal clustering result. The index is defined as

$$V_{XB}(U, V) = \frac{\sum_{k=1}^{c} \sum_{i=1}^{n} u_{ik}^{m} \|x_i - v_k\|^2}{n \cdot \min_{j \neq k} \|v_j - v_k\|^2}.$$

In the formula, the numerator, together with the factor $1/n$, is the average distance from the objects to the centroids, used to measure the compactness, and the remaining term of the denominator is the minimum distance between any two centroids, measuring the separation.
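As a small numerical illustration of the compactness-to-separation ratio, the following sketch evaluates the Xie-Beni index for a given partition. The function name `xie_beni` and the use of the exponent $m$ on the memberships (equal to the classical squared memberships when $m = 2$) are assumptions of this example.

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    """Xie-Beni index: average weighted squared distance to the centroids,
    divided by the minimum squared distance between two distinct centroids."""
    X, U, V = (np.asarray(a, dtype=float) for a in (X, U, V))
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)     # (n, c)
    compactness = ((U ** m) * d2).sum() / X.shape[0]
    vv = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)     # (c, c)
    separation = vv[~np.eye(V.shape[0], dtype=bool)].min()      # skip the diagonal
    return float(compactness / separation)

# Four points on a line, two crisp clusters with centroids 0.5 and 10.5:
X = np.array([[0.0], [1.0], [10.0], [11.0]])
V = np.array([[0.5], [10.5]])
U = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
value = xie_beni(X, U, V)   # compactness 0.25, separation 100
```

Smaller values indicate compact clusters that lie far apart.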
However, Bensaid et al. found that the size of each cluster had a large influence on the Xie-Beni index and put forward a new index [14], which was insensitive to the number of objects in each cluster. The Bensaid index is defined as

$$V_B(U, V) = \sum_{k=1}^{c} \frac{\sum_{i=1}^{n} u_{ik}^{m} \|x_i - v_k\|^2}{n_k \sum_{j=1}^{c} \|v_j - v_k\|^2},$$

where $n_k$ is the fuzzy cardinality of the $k$th cluster, defined as

$$n_k = \sum_{i=1}^{n} u_{ik}.$$

$\sum_{i=1}^{n} u_{ik}^{m} \|x_i - v_k\|^2$ shows the variation of the $k$th fuzzy cluster. Then, the compactness is computed as

$$\frac{1}{n_k} \sum_{i=1}^{n} u_{ik}^{m} \|x_i - v_k\|^2.$$

$\sum_{j=1}^{c} \|v_j - v_k\|^2$ denotes the separation of the $k$th fuzzy cluster, defined as the sum of the distances from its cluster centroid to the centroids of the other $c - 1$ clusters. Like the Xie-Beni index, this index decreases monotonically toward 0 when $c \to n$ and thus loses robustness and its judgment function for determining the optimal number of clusters. Thus, this paper improves the Bensaid index and proposes a new index $V_R$:

$$V_R(U, V) = \sum_{k=1}^{c} \frac{\dfrac{1}{n_k} \sum_{i=1}^{n} u_{ik}^{m} \|x_i - v_k\|^2 + \dfrac{1}{c} \|v_k - \bar{v}\|^2}{\dfrac{1}{c-1} \sum_{j=1}^{c} \|v_j - v_k\|^2},$$

where $\bar{v} = \frac{1}{c} \sum_{k=1}^{c} v_k$ is the average of all cluster centroids. The first item of the numerator represents the compactness of the $k$th cluster, where $n_k$ is its fuzzy cardinality. Its second item, an introduced punishing function, denotes the distance from the cluster centroid of the $k$th cluster to the average of all cluster centroids, which can eliminate the monotonically decreasing tendency as the number of clusters increases to $n$. The denominator represents the mean distance from the $k$th cluster centroid to the other cluster centroids, which is used for measuring the separation. The ratio of the numerator to the denominator represents the clustering effect of the $k$th cluster. The clustering validity index is defined as the sum of the clustering effect (the ratio) of all clusters. Obviously, the smaller the value is, the better the clustering effect of the dataset is, and the $c$ corresponding to the minimum value is the optimal number of clusters.
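The effect of the punishing term can be checked numerically with the sketch below, which evaluates the Bensaid index and an index of the form described above (compactness plus a centroid-to-mean penalty, divided by the mean distance to the other centroids). The function names are hypothetical, and the weight $1/c$ on the penalty term is an assumption of this sketch rather than something the prose above specifies.

```python
import numpy as np

def sqdist(A, B):
    """Pairwise squared Euclidean distances, shape (len(A), len(B))."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)

def bensaid_index(X, U, V, m=2.0):
    n_k = U.sum(axis=0)                                   # fuzzy cardinality
    compact = ((U ** m) * sqdist(X, V)).sum(axis=0) / n_k
    separation = sqdist(V, V).sum(axis=1)                 # sum over other centroids
    return float((compact / separation).sum())

def proposed_index(X, U, V, m=2.0):
    c = V.shape[0]
    n_k = U.sum(axis=0)
    compact = ((U ** m) * sqdist(X, V)).sum(axis=0) / n_k
    # Penalty: squared distance of each centroid to the mean centroid,
    # weighted by 1/c (assumed weighting).
    penalty = ((V - V.mean(axis=0)) ** 2).sum(axis=1) / c
    separation = sqdist(V, V).sum(axis=1) / (c - 1)       # mean over other centroids
    return float(((compact + penalty) / separation).sum())

X = np.array([[0.0], [1.0], [10.0], [11.0]])

# Two crisp clusters with centroids 0.5 and 10.5.
V2 = np.array([[0.5], [10.5]])
U2 = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

# Degenerate case c = n: every object is its own centroid.
Vn, Un = X.copy(), np.eye(4)
```

In the degenerate case every compactness term is zero, so the Bensaid index collapses to 0, while the penalty keeps the other index strictly positive.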

Self-Adaptive FCM.
In this paper, the iterative trial-and-error process [16] is still used to determine the optimal number of clusters by self-adaptive FCM (SAFCM). The pseudocode of the SAFCM algorithm is described in Algorithm 2.
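The trial-and-error loop can be sketched as follows; this is an illustration rather than Algorithm 2 itself, which is not reproduced in this text. It assumes that candidate centroids sorted by descending local density are already available, that the run with $c$ clusters starts from the first $c$ candidates, and that the validity index (with an assumed $1/c$ weight on the penalty term) is minimized. The names `safcm`, `fcm_from_centroids`, and `validity` are hypothetical.

```python
import numpy as np

def sqdist(A, B):
    """Pairwise squared Euclidean distances, shape (len(A), len(B))."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)

def fcm_from_centroids(X, V, m=2.0, max_iter=100, tol=1e-5):
    """FCM started from given centroids instead of a random partition."""
    V = np.array(V, dtype=float)
    prev_J = np.inf
    for _ in range(max_iter):
        d2 = np.maximum(sqdist(X, V), 1e-12)
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
        W = U ** m
        J = float((W * d2).sum())
        if abs(prev_J - J) < tol:
            break
        prev_J = J
        V = (W.T @ X) / W.sum(axis=0)[:, None]
    return U, V

def validity(X, U, V, m=2.0):
    """Compactness plus penalty over mean separation, summed over clusters."""
    c = V.shape[0]
    compact = ((U ** m) * sqdist(X, V)).sum(axis=0) / U.sum(axis=0)
    penalty = ((V - V.mean(axis=0)) ** 2).sum(axis=1) / c   # assumed weight 1/c
    separation = sqdist(V, V).sum(axis=1) / (c - 1)
    return float(((compact + penalty) / separation).sum())

def safcm(X, candidates, m=2.0):
    """Search c = 2..c_max, where c_max is the number of candidate centroids."""
    X = np.asarray(X, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    scores, best = {}, None
    for c in range(2, len(candidates) + 1):
        U, V = fcm_from_centroids(X, candidates[:c], m)
        scores[c] = validity(X, U, V, m)
        if best is None or scores[c] < best[0]:
            best = (scores[c], c, U, V)
    _, c_opt, U, V = best
    return c_opt, U, V, scores
```

Compared with an exhaustive search up to $\lfloor\sqrt{n}\rfloor$, the loop runs only up to the number of candidate centroids, and each run starts from fixed centroids, so repeated calls give the same result.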

Experiment and Analysis
This paper selects 8 experimental datasets: 3 come from the UCI repository (Iris, Wine, and Seeds), one (SubKDD) is randomly selected from KDD Cup 1999 and is shown in Table 1, and the remaining 4 are synthetic datasets shown in Figure 3. SubKDD includes normal traffic, 2 kinds of Probe attack (ipsweep and portsweep), and 3 kinds of DoS attack (neptune, smurf, and back). The first synthetic dataset (SD1) consists of 20 two-dimensional Gaussian clusters with 10 samples each. Their covariance matrices are the second-order unit matrix $I_2$. The structural feature of the dataset is that the distance between any two clusters is large, and the number of classes is greater than $\lfloor\sqrt{n}\rfloor$. The second synthetic dataset (SD2) consists of 4 two-dimensional Gaussian clusters. The cluster centroids are, respectively, (5,5), (10,10), (15,15), and (20,20), with each cluster containing 500 samples and having covariance matrix $2I_2$. The structural feature of the dataset is a short intercluster distance with a small overlap. The third synthetic dataset (SD3) has a complicated structure. The fourth synthetic dataset (SD4) is a nonconvex one.

Figure 2: Demonstration of the process of the density-based algorithm. The dataset consists of two two-dimensional Gaussian clusters with centroids (2, 3) and (7, 8), respectively. Each class has 100 samples. In (b), the blue circle represents the highest density core point as the centroid of the first cluster, and the red plus signs represent the objects belonging to the first cluster. In (c), the red circle represents the core point as the centroid of the second cluster, and the blue asterisks represent the objects belonging to the second cluster. In (d), the purple circle represents the core point as the centroid of the third cluster, the green times signs represent the objects belonging to the third cluster, and the black dot represents the final border point, which does not belong to any cluster. According to a certain cutoff distance, the maximum number of clusters is 3. If calculated in accordance with the empirical rule, the maximum number of clusters should be 14. Therefore, the algorithm can effectively reduce the number of FCM runs.

Experiment of $c_{max}$ Simulation.
In [37], it was proposed that the number of clusters generated by the AP algorithm could be selected as the maximum number of clusters. So this paper estimates the value of $c_{max}$ by the empirical rule, the AP algorithm, and the density-based algorithm. The specific experimental results are shown in Table 2. Let = in AP algorithm.
Experiment results show that, for the dataset with the true number of clusters greater than $\lfloor\sqrt{n}\rfloor$, it is obviously incorrect to let $c_{max} = \lfloor\sqrt{n}\rfloor$. In other words, the empirical rule is invalid. The number of clusters finally obtained by the AP algorithm is close to the actual number of clusters. For the dataset with the true number of clusters smaller than $\lfloor\sqrt{n}\rfloor$, the number of clusters generated by the AP algorithm can be used as an appropriate $c_{max}$, but, sometimes, the value is greater than the value estimated by the empirical rule, which enlarges the search range of the optimal number of clusters. If $c_{max}$ is estimated by the proposed density-based algorithm, the results in most cases are appropriate. The method is invalid only on SD1 when the cutoff distance is the first 5% distance. When the cutoff distance is selected as the first 3% distance, $c_{max}$ generated by the proposed algorithm is much smaller than $\lfloor\sqrt{n}\rfloor$. It is closer to the true number of clusters and greatly narrows the search range of the optimal number of clusters. Therefore, the cutoff distance is selected as the distance of the first 3% in the later experiments.
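For SD1, the failure of the empirical rule can be checked by hand from the dataset description (20 clusters of 10 samples each). The symbol $c_{\mathrm{true}}$ for the true number of clusters is introduced here only for this worked example.

```latex
n = 20 \times 10 = 200, \qquad
\lfloor \sqrt{n} \rfloor = \lfloor 14.14\ldots \rfloor = 14 < 20 = c_{\mathrm{true}}
```

A search restricted to $c \le 14$ therefore cannot reach the true number of clusters of SD1, whatever validity index is used.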

Experiment of Influence of the Initial Cluster Centroids on the Convergence of FCM.
To show that the initial cluster centroids obtained by the density-based algorithm can quicken the convergence of the clustering algorithm, the traditional FCM is adopted for verification. The number of clusters is assigned as the true value, with the convergence threshold being $10^{-5}$. Because the random selection of initial cluster centroids has a large influence on FCM, the experiment on each dataset is repeated 50 times, and the rounded mean numbers of iterations are compared. The specific experimental result is shown in Table 3. As shown in Table 3, the initial cluster centroids obtained by the density-based algorithm can effectively reduce the iterations of FCM. Particularly, on SD1, the number of FCM iterations is far smaller than that with randomly selected initial cluster centroids, whose minimum is 24 and maximum reaches 63, with unstable clustering results. Therefore, the proposed method can not only effectively quicken the convergence of FCM, but also obtain a stable clustering result.

Experiment of Clustering Accuracy Based on Clustering Validity Index.
For 3 UCI datasets and SubKDD, clustering accuracy $r$ is adopted to measure the clustering effect, defined as

$$r = \frac{\sum_{i=1}^{c} b_i}{n}.$$

Here $b_i$ is the number of objects which co-occur in the $i$th cluster and the $i$th real cluster, and $n$ is the number of objects in the dataset. According to this measurement, the higher the clustering accuracy is, the better the clustering result of FCM is. When $r = 1$, the clustering result of FCM is totally accurate.
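A small sketch of this accuracy measure is given below. Since cluster indices produced by FCM are arbitrary, the sketch pairs each cluster with a real class by trying every one-to-one matching and keeping the best one; this matching step, the function name `clustering_accuracy`, and the requirement that the number of clusters equals the number of classes are assumptions of the example.

```python
from itertools import permutations
import numpy as np

def clustering_accuracy(pred, true):
    """Fraction of objects that fall in the cluster matched to their real class."""
    pred, true = np.asarray(pred), np.asarray(true)
    clusters, classes = np.unique(pred), np.unique(true)
    if len(clusters) != len(classes):
        raise ValueError("this sketch assumes as many clusters as real classes")
    # counts[a, b] = number of objects in cluster a that belong to real class b
    counts = np.array([[np.sum((pred == a) & (true == b)) for b in classes]
                       for a in clusters])
    # Exhaustive one-to-one matching; fine for the small c used here.
    best = max(sum(counts[i, j] for i, j in enumerate(perm))
               for perm in permutations(range(len(classes))))
    return float(best / pred.size)

pred = [1, 1, 0, 0, 2, 2]
true = [0, 0, 1, 1, 2, 1]
acc = clustering_accuracy(pred, true)   # 5 of 6 objects are matched correctly
```

The exhaustive matching grows factorially with the number of clusters, so it is only meant for small $c$ such as the datasets used here.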
In the experiments, the true number of clusters is assigned to each dataset and the initial cluster centroids are obtained by density-based algorithm. Then, experimental results of clustering accuracy are shown in Table 4.
Apparently, clustering accuracies of the datasets Wine, Seeds, and SubKDD are high, while that of the dataset Iris is relatively low because two of its clusters are not linearly separable.
The clustering results of the proposed clustering validity index on the 4 synthetic datasets are shown in Figure 4, which depicts good partitions on the datasets SD1 and SD2 and slightly poorer partitions on SD3 and SD4 because of the complexity of their structure or their nonconvexity. When the number of clusters is set as 4, SD3 is divided into 4 groups; that is, the right and the left parts each contain 2 groups.

Experiment of the Optimal Number of Clusters.
At last, the Xie-Beni index, Bensaid index, Kwon index, and the proposed index are, respectively, adopted for running SAFCM so as to determine the optimal number of clusters. The results are shown in Table 5. $V_{XB}$, $V_B$, $V_K$, and $V_R$ represent the Xie-Beni index, Bensaid index, Kwon index, and the proposed index, respectively. The results show that, for the synthetic datasets SD1 and SD2 with simple structure, these indices can all obtain the optimal number of clusters. Among the remaining datasets (the 3 UCI datasets, SubKDD, SD3, and SD4), the Bensaid index obtains an accurate number of clusters only on SD3. For the dataset Wine, both Xie-Beni index and

Conclusions
FCM is widely used in many fields. But it needs the number of clusters to be preset and is greatly influenced by the initial cluster centroids. This paper studies a self-adaptive method for determining the number of clusters by using the FCM algorithm. In this method, a density-based algorithm is put forward to estimate the maximum number of clusters and to select the initial cluster centroids, so that the clustering result is stable and the convergence of FCM is quick. Then, a new fuzzy clustering validity index is put forward based on fuzzy compactness and separation so that the clustering result is closer to the global optimum. The index is robust and interpretable when the number of clusters tends to that of objects in the dataset. Finally, a self-adaptive FCM algorithm is proposed to determine the optimal number of clusters through an iterative trial-and-error process. The contributions are validated by experimental results. However, in most cases, each property plays a different role in the clustering process. In other words, the weights of the properties are not the same. This issue will be the focus of the authors' future work.