Nonuniform Sparse Data Clustering Cascade Algorithm Based on Dynamic Cumulative Entropy

Limited prior knowledge and randomly chosen initial cluster centers directly affect the accuracy of iterative clustering algorithms. In this paper we propose a new algorithm that computes the initial cluster centers for k-means clustering and the best number of clusters with little prior knowledge, and that optimizes the clustering result. It constructs a Euclidean distance control factor based on aggregation density sparse degree to select the initial cluster centers of nonuniform sparse data and obtains initial data clusters by multidimensional diffusion density distribution. A multiobjective clustering approach based on dynamic cumulative entropy is then adopted to optimize the initial data clusters and the number of clusters. The experimental results show that the proposed algorithm performs well in obtaining the initial cluster centers for the k-means algorithm and that it improves the clustering accuracy of nonuniform sparse data by about 5%.


Introduction
Clustering is an important technique in exploratory data mining and a common technique for statistical data analysis. Iterative clustering algorithms form one family of clustering algorithms, and k-means is the most popular and a fast method in that family. Because of its simplicity, the k-means algorithm is used in many fields, including machine learning, medicine, image analysis, pattern recognition, information retrieval, bioinformatics, and computer science. For example, in the medical field, cancer genomics [1], cell signaling [2], and viral genomes [3] use k-means as a data analysis tool; in the bioinformatics field, bioanalytical chemistry [4], the vibrational spectra of biomolecules [5], and the nervous system [6] use k-means to mine potential information; in the image analysis field, imaging techniques [7] use k-means to partition a given set of points into homogeneous groups; in the pattern recognition field, an automatic system for imbalance diagnosis in wind turbines [8] uses k-means to suggest the optimum number of groups. Reference [9] uses k-means to analyze network data. Reference [10] generates profiles by k-means to group together days with a similar pattern of request arrivals.
Although the k-means algorithm has been applied to a wide range of different problems, it has three major drawbacks:
(1) The user must predetermine the number of clusters k. In practice, because little prior knowledge is available, the value of k is generally difficult to determine.
(2) It is sensitive to the selection of the initial cluster centers; that is, k-means gives different results for different initial cluster centers. Because the initial cluster centers are chosen randomly, the populations generally consist exclusively of low-quality individuals.
(3) The k-means algorithm performs poorly on nonuniform sparse data.
To overcome these drawbacks, many evolutionary algorithms such as GA, TS, and SA have been introduced. Kao et al. have proposed a hybrid technique based on the k-means algorithm [11]. Bahmani Firouzi et al. have introduced a hybrid evolutionary algorithm that combines PSO, SA, and k-means to find the optimal solution [12]. Niknam and Amiri have proposed a hybrid algorithm based on a fuzzy adaptive PSO, ACO, and k-means for cluster analysis [13]. Niknam et al. have proposed a novel algorithm that combines two clustering algorithms: k-means and the Modified Imperialist Competitive Algorithm [14]. Evolutionary algorithms require large amounts of data to learn from; however, many real-world problems behave like black boxes, so sufficient data about their internals is not available.
In addition, to solve the problem of selecting the initial cluster centers, Bianchi et al. have proposed two density-based k-means initialization algorithms for nonmetric data clustering [15]. Tunali et al. have proposed an improved clustering algorithm for text mining: multicluster spherical k-means [16]. Tvrdík and Křivý have proposed a new algorithm combining differential evolution and k-means [17]. Rodriguez and Laio proposed selecting the initial cluster centers by density peaks. Khan and Ahmad [18] have proposed a cluster center initialization algorithm for k-means clustering. However, these methods are computationally laborious.
To address the fact that limited prior knowledge and randomly chosen initial cluster centers directly affect the accuracy of iterative clustering algorithms, in this paper we propose a new algorithm for nonuniform sparse data clustering based on cascade entropy increase and decrease. It designs a Euclidean distance control factor based on aggregation density sparse degree, determines the initial cluster centers of nonuniform sparse data, and groups the initial data clusters by multidimensional diffusion density distribution. A multiobjective clustering approach is then adopted to compensate for the clustering error of the initial data clusters. The experimental results show that the new data clustering algorithm can effectively improve the clustering accuracy of nonuniform sparse data.

Nonuniform Sparse Data Clustering Cascade Algorithm Based on Dynamic Cumulative Entropy
To obtain the optimal clustering results for nonuniform sparse data, in this paper we use multidimensional diffusion density distribution to obtain the initial data clusters, and we put forward the Euclidean distance control factor based on aggregation density sparse degree to address the tendency of multidimensional data to be misjudged. The number of initial data clusters exceeds the number of real clusters. So we need to select the optimal initial cluster centers by a decision graph and then execute k-means on the complete data set based on the multiobjective clustering approach.

Initial Data Clustering Using Multidimensional Diffusion Density Distribution.
In iterative clustering algorithms, choosing the initial cluster centers is extremely important because it has a direct impact on the formation of the final clusters. Selecting samples that lie far from the normal samples as initial centers is risky. In this paper, we first define the multidimensional diffusion density distribution of samples and the comprehensive distance, according to which we obtain the initial data clusters.

Multidimensional Diffusion Data Normalization.
Different attributes of multidimensional data have different units of measurement and value ranges, which seriously affects cluster formation. To avoid this, we first normalize the clustering data so that all attributes have the same weight.
Normalization is based on the average of the absolute deviation of each attribute. The data set consists of $n$ data elements described by attributes $1, 2, \ldots, m$, where $m$ is the number of attributes and all attributes are numeric; $x_{11}$ is the measured value of the first attribute belonging to the first data element.
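As an illustration of this normalization step, the following Python sketch standardizes each attribute by its average absolute deviation. It assumes the common form $z_{if} = (x_{if} - m_f)/s_f$, where $m_f$ is the mean of attribute $f$ and $s_f$ is its average absolute deviation. The function name `standardize` and the treatment of constant attributes are illustrative choices and are not taken from the paper.

```python
from statistics import fmean


def standardize(data):
    """Standardize every attribute (column) by its average absolute deviation.

    `data` is a list of rows; each row is a list of numeric attribute values.
    Assumed form: z_if = (x_if - m_f) / s_f.
    """
    n = len(data)
    m = len(data[0])
    result = [[0.0] * m for _ in range(n)]
    for f in range(m):
        column = [row[f] for row in data]
        mean_f = fmean(column)
        # Average of the absolute deviations of attribute f from its mean.
        s_f = fmean([abs(v - mean_f) for v in column])
        for i in range(n):
            # Assumption: a constant attribute (s_f == 0) is mapped to 0.
            result[i][f] = (column[i] - mean_f) / s_f if s_f > 0 else 0.0
    return result


if __name__ == "__main__":
    # Two attributes with very different value ranges end up on the same scale.
    print(standardize([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]))
```

After this step both attributes in the example take the same standardized values, so neither dominates a Euclidean distance because of its unit.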

Multidimensional Diffusion Density Distribution.
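In iterative clustering algorithms the function adopted for density is the cut-off kernel [19]:
$$\rho_i = \sum_{j} \chi(d_{ij} - d_c),$$
where
$$\chi(x) = \begin{cases} 1, & x < 0, \\ 0, & x \ge 0, \end{cases}$$
$d_c$ is the cut-off distance determined by the user, and $d_c > 0$. $x_{i1}$ is the $i$th data element with attribute 1 and $x_{j1}$ is the $j$th data element with attribute 1. $d_{ij}$ measures the Euclidean distance between $x_{i1}$ and $x_{j1}$. So $\rho_i$ is the number of data elements whose distance to data element $i$ is less than $d_c$.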
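The cut-off kernel described above simply counts neighbours inside a fixed radius. A minimal Python sketch follows, assuming plain Euclidean distance over all attributes; the names `cutoff_density`, `d_c`, and `rho` are illustrative, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
import math


def cutoff_density(data, d_c):
    """Return, for each point, the number of other points closer than d_c."""
    if d_c <= 0:
        raise ValueError("the cut-off distance must be positive")
    n = len(data)
    rho = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            # Each pair is visited once and counted for both points.
            if math.dist(data[i], data[j]) < d_c:
                rho[i] += 1
                rho[j] += 1
    return rho


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (10, 10)]
    print(cutoff_density(points, d_c=1.5))
```

In this example the three nearby points each have two neighbours within the cut-off distance, while the isolated point has none.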
Using the above density function for multidimensional data has a shortcoming: some multidimensional data are misjudged because the function does not consider the differences among attributes. If a few attributes of a data element change a lot while its other attributes stay close to those of other data, the mutated element can be assigned to a cluster that it does not resemble, because the changed attributes are ignored. To solve this problem, we put forward the Euclidean distance control factor based on aggregation density sparse degree. It is built from the attribute on which $x_i$ and $x_j$ have the maximum distance and the attribute on which $x_i$ and $x_j$ have the minimum distance.
Based on this factor, we propose the optimized density formula. Formula (6) distinguishes among attributes by means of the Euclidean distance control factor based on aggregation density sparse degree. Attributes with large differences are given more weight when computing density, which reduces the risk of misjudging data. The optimized density is the multidimensional diffusion density.
The standard value of the multidimensional diffusion density is determined from the average and the standard deviation of the multidimensional diffusion density. The data whose multidimensional diffusion density is bigger than this standard value are placed in a collection of high-density data.
Next, the distance between $x_i$ and $x_j$ is computed for data that both belong to the high-density collection. The standard value of distance is determined from the average and the standard deviation of the distances between the data in that collection. The data whose distance is bigger than the standard value of distance are placed in a second collection. Reference [19] shows that initial cluster centers have a high density and a large distance. So the data in the second collection are likely to be initial cluster centers, and we choose them as the initial cluster centers of the initial data clusters.

Obtaining Initial Data Clusters by Multidimensional Diffusion Density Distribution.
In this subsection we present the execution steps of the proposed initial data clustering using multidimensional diffusion density distribution for k-means clustering. The algorithm consists of two parts. The first part determines the initial cluster centers of the initial data. The second part then groups the data.
The data set consists of $n$ data elements described by attributes $1, 2, \ldots, m$, where $m$ is the number of attributes and all attributes are numeric. The first part proceeds as follows.
(1) Compute the multidimensional diffusion density of each data element, together with its mean, standard deviation, and standard value.
(2) Choose the data whose multidimensional diffusion density is bigger than the standard value of density and put them into the high-density collection.
(3) Compute the distance of each data element in the high-density collection, together with its mean, standard deviation, and standard value.
(4) Choose the data whose distance is bigger than the standard value of distance and put them into the collection of initial cluster centers. This collection is the set of the initial cluster centers of the initial data, and its size is the number of initial cluster centers.
$d(x_i, c_j)$ measures the Euclidean distance between a pattern $x_i$ and its cluster center $c_j$, which belongs to the collection of initial cluster centers.
The k-means algorithm minimizes the objective function [18]
$$J_{MSE} = \sum_{j=1}^{k} \sum_{x_i \in C_j} d(x_i, c_j)^2,$$
where $C_j$ denotes the $j$th cluster. The k-means algorithm groups the data iteratively as follows.
(1) Choose the data in the collection of initial cluster centers as the initial cluster centers; their number is the number of initial data clusters.
(2) Assign each pattern to one of the clusters according to the minimum distance from the cluster center.
(3) Calculate the new cluster centers as $c_j = \frac{1}{n_j} \sum_{x_i \in C_j} x_i$, where $n_j$ is the number of data belonging to the $j$th cluster center.
(4) Repeat steps (2) and (3) until the cluster centers no longer change.
The resulting clusters are the initial data clusters, and together they form the cluster set of the initial data clusters.
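The iterative grouping described above can be sketched in Python as follows. This is a minimal illustration that assumes the initial centers have already been selected and that the objective is the usual sum of squared Euclidean distances between each pattern and its cluster center; the function name `kmeans_from_centers`, the `max_iter` safeguard, and the handling of empty clusters are assumptions and are not taken from the paper.

```python
import math


def kmeans_from_centers(data, initial_centers, max_iter=100):
    """Run k-means starting from the given centers.

    Returns (centers, labels, j_mse), where j_mse is the sum of squared
    distances between every pattern and its assigned cluster center.
    """
    centers = [list(c) for c in initial_centers]
    labels = [0] * len(data)
    for _ in range(max_iter):
        # Assignment step: each pattern joins its nearest cluster center.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))
            for x in data
        ]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for j, old_center in enumerate(centers):
            members = [x for x, label in zip(data, labels) if label == j]
            if members:
                new_centers.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(old_center)
        if new_centers == centers:
            break  # no change in cluster centers: converged
        centers = new_centers
    j_mse = sum(math.dist(x, centers[l]) ** 2 for x, l in zip(data, labels))
    return centers, labels, j_mse


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 0), (10, 2)]
    print(kmeans_from_centers(data, [(0, 0), (10, 0)]))
```

Because the starting centers are passed in explicitly, the result is deterministic, which is the property that a careful choice of initial centers is meant to provide.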

Multiobjective Clustering Approach Based on Dynamic Cumulative Entropy.
Obtaining the initial data clusters by multidimensional diffusion density distribution is only the primary data processing step. To get accurate clustering results we need to cluster again on that basis, with the initial data clusters as the basic elements.

Multiobjective Clustering Function Based on Dynamic Cumulative Entropy.
In the k-means algorithm the objective function is JMSE. However, it is not suitable for optimizing the primary clusters of data or for determining k, the number of clusters, because JMSE decreases monotonically as the number of clusters increases. When JMSE is reduced to its global minimum, the cluster centers are spread out until each single data element forms its own cluster. The proposed objective function therefore takes into account the information of each initial data cluster as the clustering is refined.
The paper addresses the above problems by studying the structure of the clusters with information entropy theory and the principle of Maximum Informational Entropy. We propose a multiobjective clustering function based on dynamic cumulative entropy to determine the final clustering results. First, we define the information entropy of the clusters:
$$H = -\sum_{i=1}^{k} \frac{n_i}{n} \ln \frac{n_i}{n},$$
where $n_i$ is the number of data in the $i$th initial data cluster, $n$ is the number of data to be clustered, and $k$ is the number of clusters.
According to the principle of Maximum Informational Entropy, if the clusters have no elements, the information entropy is 0, $H = 0$. When the clusters tend to be stable and the numbers of elements $n_i$ belonging to the different clusters are very similar, the information entropy is maximal, $H_{\max} = \ln k$. Based on the information entropy formula (14), we define the cluster structure equilibrium degree as the ratio of the time-varying entropy to the maximum entropy, $H / H_{\max}$, which shows the balance degree of the clusters and lies between 0 and 1.
When the equilibrium degree is 0, the clusters are in the most uneven condition. When it is 1, the clusters are in the ideal equilibrium state. From the equilibrium degree we define the degree of equilibrium gain, which reflects the cluster occupancy of data.
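To make the entropy and the equilibrium degree concrete, the following Python sketch computes both from a list of cluster sizes. It assumes the Shannon form of the entropy with natural logarithm, which matches the stated maximum of ln k for equally sized clusters; the function names and the convention for fewer than two clusters are illustrative assumptions.

```python
import math


def cluster_entropy(sizes):
    """Information entropy of a partition given its cluster sizes."""
    n = sum(sizes)
    entropy = 0.0
    for n_i in sizes:
        if n_i > 0:  # an empty cluster contributes nothing
            p = n_i / n
            entropy -= p * math.log(p)
    return entropy


def equilibrium_degree(sizes):
    """Ratio of the entropy to its maximum ln(k); lies between 0 and 1."""
    k = len(sizes)
    if k < 2:
        # Assumption: with a single cluster ln(k) = 0, so return 0.
        return 0.0
    return cluster_entropy(sizes) / math.log(k)


if __name__ == "__main__":
    print(equilibrium_degree([25, 25, 25, 25]))  # balanced clusters
    print(equilibrium_degree([97, 1, 1, 1]))     # very uneven clusters
```

Balanced cluster sizes drive the ratio toward 1, while a partition dominated by one cluster drives it toward 0.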
Finally, we obtain the multiobjective clustering function based on dynamic cumulative entropy, in which R is the average distance among the clusters. The multiobjective clustering function considers not only the influence on the clustering results of the distance between the cluster centers and their data, but also the amount of information. JMSE decreases monotonically as the distance between the cluster centers and their data decreases. In contrast, the summation term weighted by R increases monotonically with the number of global clusters. This term therefore acts as a restricting factor that prevents the number of clusters k from exceeding the real k as JMSE decreases. It improves the clustering accuracy and reliability when the number of clusters k is not determined in advance.
The less information a cluster has, the smaller its uncertainty is and the more likely it is to be a final cluster. At the same time, the global classification is more stable. The information of each initial data cluster is computed first, and then the distance among the initial data clusters is calculated from the cluster means $\bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_j$, where $x_j$ is a data element belonging to the $i$th cluster center.
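The distance among the initial data clusters can be illustrated with a short Python sketch. It assumes that each cluster is represented by the mean of its members and that the distance between two clusters is the Euclidean distance between those means; the exact distance used in the paper may differ, and the function names are illustrative.

```python
import math


def centroid(cluster):
    """Mean of the data elements in one cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def centroid_distances(clusters):
    """Symmetric matrix of Euclidean distances between cluster means."""
    centers = [centroid(c) for c in clusters]
    k = len(centers)
    return [
        [math.dist(centers[a], centers[b]) for b in range(k)]
        for a in range(k)
    ]


if __name__ == "__main__":
    clusters = [[(0, 0), (2, 0)], [(1, 3), (1, 5)]]
    print(centroid_distances(clusters))
```

A cluster that combines a large distance to the others with low information is a natural candidate for a final cluster center in the decision graph.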
By the nature of clustering, the uncertainty of a cluster should be small and the distance among clusters should be large. Accordingly, we can draw a distance-information decision graph to determine k and the initial cluster centers. Take two-dimensional data as an example, as shown in Figure 1. Multidimensional diffusion density distribution obtains a set of labeled initial data clusters.

Experimental Results and Analysis
3.1. Experimental Objects and Related Settings.
All experiments were performed on an Intel Core i5 CPU at 3.30 GHz with 4.00 GB of random access memory (RAM). All programs were coded in standard MATLAB, and the operating system was Windows 7. To show the accuracy of the proposed algorithm, it was applied to two types of data sets: the AR artificial data set obtained from experimental data and UCI data. To assess the quality of the proposed algorithm we use four indicators: accuracy, Adjusted Rand Index, MSE, and BIC. The Adjusted Rand Index is the best clustering validity evaluation criterion [20], and BIC is often used as an accurate evaluation criterion [21].
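Of the indicators listed above, the Adjusted Rand Index is the least obvious to compute. The Python sketch below follows the standard pair-counting definition based on the contingency table of true and predicted labels. It is offered only as an illustration and is not the authors' MATLAB code; the function name and the conventions for degenerate inputs are assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index of two labelings of the same data elements."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # assumption: fewer than two elements counts as agreement
    # Pairs placed together in both labelings (contingency table cells).
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    # Pairs placed together in each labeling separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / total_pairs
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # assumption: degenerate case, e.g. one cluster in both
    return (sum_cells - expected) / (maximum - expected)


if __name__ == "__main__":
    # Same partition with renamed labels: perfect agreement.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The index equals 1 for identical partitions regardless of how the clusters are named, and it is close to 0 or negative when the agreement is no better than chance.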
Because the value of MSE is large, its results are scaled down. Unlike the other indicators, a smaller MSE is better. In this paper, an experimental system is used to evaluate the effectiveness of the proposed approach. A wireless information collection system for field soil temperature and humidity is used to obtain the real data sets. The measurement system is shown in Figure 3. The field soil temperature is measured by the TM-100 and the humidity by the SP40A. The temperature measuring range is from −20 °C to 60 °C, and the humidity measuring range is from 0% to 100%. The JN5148 is used for receiving and transmitting the soil temperature and humidity data. The data are then controlled and recorded digitally by a PC. The measurements were performed in four fields, so the recorded data come from four real data sets. The collection sites are shown in Figure 4. Table 1 shows the measurement data, named artificial data AR. The two-dimensional distribution of the artificial data AR is shown in Figure 5.
The optimal clustering result is obtained by using the multiobjective clustering approach based on dynamic cumulative entropy.The information and the distance among the initial data clusters are obtained by the proposed formula.
The distance-information decision graph is shown in Figure 7. From the graph we can clearly see that the uncertainty of four of the initial data clusters is low and the distance among them is large. We first select the respective average values of five initial data clusters as the initial cluster centers, set $k_1 = 5$, and calculate the minimum of the multiobjective clustering function. Then we select the respective average values of four initial data clusters as the initial cluster centers, set $k_2 = 4$, and again calculate the minimum of the function. The proposed formula gives 247.49 for $k_1$ and 218.36 for $k_2$, so the optimal number of clusters is 4. The optimal clustering result is shown in Figure 8. The cluster centers obtained for AR are (10.2, 26.2), (23.7, 26.4), (17.8, 9.1), and (36, 10.5), and the real cluster centers are (10, 25), (24, 25), (18, 9), and (35, 10). The similarity between our cluster centers and the real cluster centers is 95.582%, 97.628%, 99.337%, and 96.968%. The average value of similarity of cluster centers is 94.72%, the clustering accuracy is 92.14%, the Adjusted Rand Index is 0.7840, and the MSE (Mean Squared Error) is 3.82.
The AR artificial data are also clustered by the algorithms of [11, 15]. The comparisons of cluster center similarity, clustering accuracy, Adjusted Rand Index, and MSE between our algorithm and [11, 15] for the AR data set are shown in Figures 9, 10, 11, 12, and 13. The comparison of the similarity of the initial cluster centers computed by our algorithm and by [11, 15] for the AR data set is shown in Figure 9. An algorithm with a high value has a good selection of cluster centers. Our algorithm has the highest similarity of initial cluster centers. Moreover, on the other measures used for clustering validation (accuracy, Adjusted Rand Index, MSE, and BIC), our algorithm performs better than the other algorithms. Our algorithm can thus find the optimal number of clusters k and the optimal cluster centers to get the best clustering result, which ensures the accuracy of clustering. Table 2 shows the comparison of the average running time of the clustering algorithms. Figures 14, 15, 16, and 17 compare the clustering accuracy, the Adjusted Rand Index, the MSE, and the BIC of our algorithm and [11, 15].

Experimental Results and
According to the clustering results on the UCI data, the proposed algorithm produces improved and consistent clusters on all tested data sets in comparison to [11, 15]. Its accuracy on complicated data sets was also high. However, the algorithm of [11] is faster than the proposed algorithm.

Conclusion
We have presented an algorithm for iterative clustering of nonuniform sparse data with a small amount of prior knowledge. The procedure is based on the dynamic cumulative entropy. However, the outliers are more susceptible to a change in cluster membership. We propose multidimensional diffusion density distribution for computing the initial cluster centers, which generates initial data clusters that may be more numerous than the desired clusters. We therefore cluster again on their basis, using the multiobjective clustering approach based on dynamic cumulative entropy, to get accurate clustering results. Experimental results show improved and consistent cluster structures. The experimental results also show that, compared with full search, the proposed method may improve the clustering accuracy by about 5 percent for the nonuniform sparse data set from field soil temperature and humidity. The superiority of our method over other algorithms is more remarkable when a data set with higher dimension and a larger number of clusters is used. Because our method handles clustering with little prior knowledge, it has high complexity. Following our paper, several interesting problems deserve to be explored, for example, how to reduce computation and increase the speed of the algorithm with a similar approximation ratio. Generalizing our method to the setting of arbitrary partitions is also challenging but useful in some scenarios.

Mathematical Problems in Engineering

Figure 3: Field soil temperature and humidity collection system.

Figure 9: The comparisons of cluster center's similarity.

Figure 10: The comparisons of clustering results accuracy.

Figure 11: The comparisons of Adjusted Rand Index.

Figure 14: Comparison of the accuracy for UCI data.

Figure 15: Comparison of the Adjusted Rand Index for UCI data.

Figure 16: Comparison of the MSE for UCI data.

Figure 17: Comparison of the BIC for UCI data.

Table 1: The artificial data AR.

Table 2: Comparison of running time on UCI data set.