Fast Density Clustering Algorithm for Numerical Data and Categorical Data

Data objects with mixed numerical and categorical attributes are common in real-world applications. Most existing clustering algorithms for such data suffer from low clustering quality, difficulty in determining cluster centers, and sensitivity to initial parameters. This paper puts forward a fast density clustering algorithm (FDCA) that assigns clusters in a one-time scan, with cluster centers determined automatically by a center set algorithm (CSA). A novel similarity metric is designed for data with both numerical and categorical attributes. CSA chooses cluster centers from the data objects automatically, which overcomes the difficulty of setting cluster centers that most clustering algorithms face. The performance of the proposed method is verified through a series of experiments on ten mixed data sets, in comparison with several other clustering algorithms, in terms of clustering purity, efficiency, and time complexity.


Introduction
As one of the most important techniques in data mining, clustering partitions a set of unlabeled objects into clusters, such that objects in the same cluster are more similar to each other than to objects in other clusters [1]. Clustering algorithms have been developed and applied in various fields, including text analysis, customer segmentation, and image recognition. They are also useful in daily life, since massive data with mixed attributes are now emerging. Typically, these data contain both numeric and categorical attributes [2,3]. For example, the analysis of a credit card applicant would involve age (integer), income (float), marital status (categorical), and so forth, forming a typical example of data with mixed attributes.
Up to now, most research on data clustering has focused on either numeric or categorical data rather than on both types of attributes. $k$-means [4], BIRCH [5], DBSCAN [6], $k$-modes [7], fuzzy $k$-modes [8], BFCM [9], COOLCAT [10], TCGA [11], AS fuzzy $k$-modes [12], and a $k$-means based method [13] are classic clustering algorithms. The $k$-means clustering algorithm [4] is partition based, and its $k$ cluster centers need to be initialized by the user or from experience. The number of initial cluster centers $k$ largely determines the clustering purity and efficiency. BIRCH [5] is short for balanced iterative reducing and clustering using hierarchies. It describes clusters by clustering features and clustering feature trees and runs in two stages: a database scan that builds a clustering feature tree, followed by global clustering to improve purity and efficiency. DBSCAN [6] (Density-Based Spatial Clustering of Applications with Noise) is a classic density-based clustering algorithm that is capable of dealing with noisy data. Compared with $k$-means, DBSCAN does not need the number of clusters to be set in advance. However, it depends on two sensitive parameters, eps and minPts. Various revised versions of DBSCAN have been proposed to improve its performance, but parameter sensitivity remains a challenge for its further application. $k$-modes [7] extends $k$-means with the capability to cluster categorical attributes. Fuzzy $k$-modes [8] is a modified $k$-modes clustering algorithm with a fuzzy mechanism that improves its robustness for various types of data sets. BFCM [9] is short for bias-correction fuzzy clustering algorithm, an extension of hard clustering based on fuzzy membership partitions. COOLCAT [10] is an entropy-based algorithm for categorical clustering, which introduced the idea of generating clusters according to their entropy values. TCGA [11] is a two-stage genetic algorithm for automatic clustering. This bioinspired algorithm formulates clustering as an optimization problem and adopts a genetic algorithm to converge to the global optimum. The above-mentioned methods face difficulties when dealing with data with mixed attributes, while such data are emerging very quickly [14][15][16][17][18][19][20][21][22][23]. A fast density clustering algorithm has been put forward to solve the cluster center determination problem [24]. However, its mixed similarity calculation is based on the relationships among all attributes, which has high computational complexity, and its cluster center determination depends mainly on a parameter that is difficult to set in advance.
For example, distance measure functions for numerical values cannot capture the similarity among data with mixed attributes. Moreover, a cluster of numerical data is often represented by the mean value of all data objects in the cluster, which is meaningless for other attribute types. Algorithms have been proposed [14,15,17,21,22] to cluster hybrid data, most of which are based on partition. First, a set of disjoint clusters is obtained and refined to minimize a predefined criterion function. The objective is to maximize the intracluster connectivity or compactness while minimizing the intercluster connectivity [25]. However, most partition clustering algorithms are sensitive to the initial cluster centers, which are difficult to determine. They are also suited only to spherically distributed data and lack the capacity to handle outliers.
The main contributions of our work include four aspects. A novel similarity metric is proposed for mixed data clustering. A clustering center self-set algorithm (CSA) is applied to determine cluster centers automatically. The bisection method is adopted to calculate the clustering parameter, which overcomes the parameter sensitivity problem. A fast one-time scan density clustering algorithm (FDCA) is proposed to implement fast and efficient clustering of mixed data.
The rest of this paper is organized as follows. Section 2 introduces related work on mixed data clustering. Section 3 presents the similarity metric for data with mixed attributes and explains how FDCA works. In Section 4, extensive simulations are carried out to compare FDCA's performance with that of other classic algorithms. Section 5 presents a practical application of FDCA to handwritten number image recognition. Finally, Section 6 concludes the paper.

Mixed Data Clustering Algorithms Overview.
As stated above, mixed data clustering algorithms are designed for data sets with mixed attributes, including numerical and categorical attributes. Numerical attributes take real values, while categorical attributes take discrete values, some of which are ordinal. It is still a challenge to cluster data with both numerical and categorical attributes, and many clustering algorithms have been put forward to deal with mixed data. Huang proposed the $k$-prototypes algorithm [14], which combines the $k$-means and $k$-modes algorithms and is designed especially for mixed data. It is a very early mixed data clustering algorithm. When the data set is uncertain, most clustering algorithms cannot achieve the expected purity and efficiency. The KL-FCM-GM algorithm [15], proposed by Chatzis, extends $k$-prototypes. It is a fuzzy $c$-means-type algorithm for clustering data with mixed numeric and categorical attributes that employs a probabilistic dissimilarity functional, and it is designed for Gauss-multinomial distributed data. When the data set is large, computing the similarity metric costs much more time than expected, so the algorithm is not well suited to big data. Zheng et al. developed a new algorithm called EKP [17], an improved $k$-prototypes algorithm that overcomes its flaws. EKP gains global search capability by introducing an evolutionary algorithm. Later, Li and Biswas proposed the Similarity-Based Agglomerative Clustering (SBAC) algorithm [18], which adopts the similarity measure defined by Goodall [19]. It is an unsupervised analysis method for identifying critical samples in large populations, so the efficiency of the similarity metric is not stable. Hsu and Chen proposed a clustering algorithm based on variance and entropy (CAVE) [20] for clustering mixed data. However, CAVE needs to build a distance hierarchy for every categorical attribute, and determining the distance hierarchy requires domain expertise.
Besides the above-mentioned unsupervised similarity metrics for clustering, further methods for calculating mixed data similarity have been proposed. Ahmad and Dey proposed a $k$-means type algorithm [21] to deal with mixed data, in which the cooccurrence of categorical attribute values is used to evaluate the significance of each attribute. For mixed data attributes, Ji et al. proposed the IWKM algorithm [22], in which a distribution centroid is applied to represent the cluster prototypes and the significance of different attributes is taken into account in the clustering process. Ji et al. also proposed WFK-prototypes [23], which introduces a fuzzy centroid to represent the cluster prototypes. WFK-prototypes adopts the significance concept proposed by Ahmad and Dey [21] to extend the $k$-prototypes algorithm, and it remains a classic mixed data clustering algorithm. David and Averbuch proposed a categorical spectral clustering algorithm for numerical and nominal data, called SpectralCAT [26]. Cheung and Jia [27] proposed a mixed data clustering algorithm based on a unified similarity metric that does not need to know the number of clusters. Embedded competition and penalization mechanisms determine the number of clusters automatically by gradually eliminating redundant clusters.
In summary, many mixed data similarity metrics and clustering algorithms have been designed for different applications. We still want to develop a universal similarity metric and clustering algorithm for numerical and categorical data that can be applied to most cases and practical data sets.

Fast Data Clustering Algorithm.
Rodriguez and Laio published their paper "Clustering by Fast Search and Find of Density Peaks" in Science in June 2014 [28]. In their algorithm, cluster centers can be identified from the density-distance relationship graph. We summarize their method as follows: cluster centers are surrounded by neighbors with lower density, and they are at a relatively large distance from any point with a higher density. Noise points have a comparatively large distance and a small density.
The density $\rho_i$ of data point $i$ is defined as follows:

$$\rho_i = \sum_{j} \chi(d_{ij} - d_c), \tag{1}$$

$$\chi(x) = \begin{cases} 1, & x < 0, \\ 0, & \text{otherwise}, \end{cases} \tag{2}$$

where $\rho_i$ denotes the density of data point $i$, $d_{ij}$ represents the distance between data points $i$ and $j$, and $d_c$ is the threshold distance defined in advance. According to (2), every data point $j$ whose distance from data point $i$ is less than $d_c$ increases $\rho_i$ by 1. In other words, $\rho_i$ is equal to the number of points that are closer than $d_c$ to point $i$. The distance $\delta_i$ is measured by computing the minimum distance between point $i$ and any other point with higher density:

$$\delta_i = \min_{j:\,\rho_j > \rho_i} d_{ij}. \tag{3}$$

For the point with the highest density, we conventionally take $\delta_i = \max_j (d_{ij})$. Note that $\delta_i$ is much larger than the typical nearest neighbor distance only for points that are local or global maxima in the density. Thus, cluster centers are recognized as points for which the value of $\delta_i$ is anomalously large.
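As a minimal sketch of these two quantities, the following Python function computes the density and the distance of every point from a precomputed distance matrix. The function name, the list-of-lists matrix format, and the handling of ties (every point with no denser neighbor receives its largest distance) are assumptions made for illustration.

```python
def density_and_delta(dist, dc):
    """Return (rho, delta) for a symmetric distance matrix (list of lists).

    rho[i]   counts the other points strictly closer than dc to point i.
    delta[i] is the smallest distance from i to any point of higher density.
    """
    n = len(dist)
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < dc)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        if higher:
            delta.append(min(higher))
        else:
            # No denser point exists: use the largest distance from i.
            delta.append(max(dist[i]))
    return rho, delta


# Toy one-dimensional example: three nearby points and one isolated point.
points = [0, 1, 2, 10]
dist = [[abs(a - b) for b in points] for a in points]
rho, delta = density_and_delta(dist, dc=1.5)
```

In this toy example the middle point of the group has both the largest density and the largest distance, while the isolated point has a large distance but zero density, which matches the qualitative description of centers and noise.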
This observation, which is the core of the algorithm, is illustrated by the simple example in Figure 1(a). The density and distance of every point are then computed, and the resulting distribution of $\rho$ and $\delta$ is shown in Figure 1(b).
There is a mapping between the point distribution and the distribution of $\rho$ and $\delta$. For example, the three red points A1, A2, and A3 in Figure 1(a) are cluster centers in the original point distribution; the corresponding points A1, A2, and A3 in Figure 1(b) have larger distance and larger density than other points. In addition, the three black points B1, B2, and B3 in Figure 1(a) are isolated and are called noise points. The corresponding points B1, B2, and B3 in Figure 1(b) have larger distance and smaller density than other points. All other points belong to a cluster and are called border points.
For all the data objects, we sort the density in descending order, as shown in Figure 2.
For any data point , there are some qualitative relationships as follows: (1) If   ∈ (  ,   ) and   ∈ (  ,   ), the data point  is the cluster center.(2) If   ∈ ( 1 ,   ) and   ∈ (  ,   ), the data point  is a noise point.
If the data point does not meet situations 1 and 2, then the data point  is a border point.Because cluster center has relatively larger density and larger distance compared to  other centers, while noise data only has relatively larger distance from cluster centers and much less density, both cluster centers number and noise amount are relatively small compared with other data objects.The average density value and distance value are mainly dependent on majority of data objects besides centers and noise.So the specific value of  and  for different data set could be self-determined during the finding cluster center process.For instance, if the data size is 1000, the cluster center is selected from   .For one data object   , if its density is   , then we check if its  is   or a little bit less than   .If so, then data object   is one of the cluster centers.And if its density is more like  1 while its distance is more like   , then data object   is noise data.By checking those data objects according to CSA in Section 3.2.2,we could get all those cluster centers one by one.In summary, the only points of high  and relatively high  are the cluster centers.The points have relatively high  and low  because they are isolated; they can be considered as noise points.

Fast Density Clustering Algorithm for Numerical Data and Categorical Data
In Table 1, six classic mixed data similarity metrics are listed and compared. For each algorithm, a different distance measure is developed, consisting of a part for numerical attributes and a part for categorical attributes. For instance, the $k$-means algorithm is suitable for numerical data only, so it defines no measure for the categorical part. The $k$-modes algorithm is designed for categorical data and has no similarity metric for numerical attributes. The other four similarity metrics are applied to mixed data, so each of them has distance metrics for both the numerical and the categorical part.
The Euclidean distance is adopted by the $k$-means algorithm to deal with purely numerical data. The simple matching distance is adopted by the $k$-modes algorithm to deal with purely categorical data. The $k$-prototypes algorithm integrates $k$-means and $k$-modes to deal with mixed data. The EKP and WFK-prototypes algorithms improve $k$-prototypes by introducing a fuzzy factor or weight coefficient into the original distance measure, so that the similarity between objects is measured more accurately. The FPC-MDACC algorithm [29] adopts three different distance measures for mixed data depending on the data type, which requires prior work to determine the type of the current mixed data and therefore adds time cost and algorithm complexity.
Until now, we still need an efficient similarity metric for calculating distance of data objects of mixed data.We believe that one unified similarity metric for both numerical and categorical data is more efficient and reasonable for mixed data instead of independent calculation for each of the other attributes.

Unified Similarity Metric for Numerical and Categorical Data.
A unified similarity metric for mixed data is presented in this section; it is applicable to data that have numerical attributes, categorical attributes, or both.
Definition 1.Given the data set  = { 1 ,  2 , . . .,   , . . .,   }, each data object   has  dimensions.The distance (  ,   ) between two data objects   and   is defined as where  (1) If th attribute is numerical, then   , is defined as follows: where ℎ goes through every possible attribute value of data objects   and   .
Since the numerical attribute for different data could be quite different, in case the value is quite large or small, we have to balance its contribution to the final distance.So numerical attributes need to be normalized into [0, 1].
(2) If the $p$th attribute is categorical or binary, then the attribute distance $d_{ij}^{(p)}$ is defined as follows:

$$d_{ij}^{(p)} = \begin{cases} 0, & x_{ip} = x_{jp}, \\ 1, & \text{otherwise}. \end{cases} \tag{6}$$

The categorical attribute distance evaluates whether data objects $i$ and $j$ take the same value on this attribute. If they do, the distance equals 0; otherwise the distance is 1.
(3) If th attribute is order, then   , is defined as follows: where  goes through every possible attribute value of data objects   and   .   is defined as follows: where   the distance between  and  equals 1; distance between  and  equals 1 as well.However, in our case, the th attribute is ordinal, so these two distances should be distinguished.We calculated their th attribute distance according to (7) and (8).
In this way, similarity for all the data objects could be calculated based on (5) to (8).In order to demonstrate how these three types of attribute are defined and measured according to the above proposed methods, we take data set Heart from UCI as an example.
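As a rough illustration of Definition 1, the three attribute cases can be sketched in Python. The sketch assumes that the per-attribute distances are averaged over all attributes (a Gower-style aggregation), that numerical attributes are normalized by their observed range, and that ordinal ranks $1, \ldots, M$ are mapped to $(r - 1)/(M - 1)$. These choices, the function name, and the argument layout are illustrative assumptions and are not necessarily the exact form of (4) to (8).

```python
def mixed_distance(x, y, kinds, ranges=None, levels=None):
    """Distance between two mixed-attribute records (illustrative sketch).

    kinds[p]  is "num", "cat", or "ord" for attribute p.
    ranges[p] is (min, max) over the data set for numerical attribute p.
    levels[p] is the number of ordered states M for ordinal attribute p,
              with values encoded as ranks 1..M.
    """
    ranges = ranges or {}
    levels = levels or {}
    total = 0.0
    for p, kind in enumerate(kinds):
        if kind == "num":
            lo, hi = ranges[p]
            # Range normalization keeps the contribution in [0, 1].
            d = abs(x[p] - y[p]) / (hi - lo) if hi > lo else 0.0
        elif kind == "cat":
            d = 0.0 if x[p] == y[p] else 1.0
        elif kind == "ord":
            m = levels[p]
            if m > 1:
                d = abs((x[p] - 1) / (m - 1) - (y[p] - 1) / (m - 1))
            else:
                d = 0.0
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        total += d
    # Assumption: plain average over all attributes.
    return total / len(kinds)


a = (30, "M", 1)
b = (50, "F", 3)
d_ab = mixed_distance(a, b, kinds=("num", "cat", "ord"),
                      ranges={0: (20, 70)}, levels={2: 3})
```

Here the numerical attribute contributes 20/50, the categorical attribute contributes 1 because the values differ, and the ordinal attribute contributes 1 because the two ranks lie at opposite ends of a three-level scale.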

Illustration for Unified Similarity Metric.
With the unified similarity metric put forward in Section 3.1.2, we take the Heart data set from UCI as an example to illustrate how it works.
Data in Heart has 13 attributes including the following: (1) Age. According to their practical meanings, five attributes are defined as numerical attributes (1,4,5,8,10), two attributes are defined as ordinal attributes (3,12), and the remaining six attributes are defined as categorical attributes (2,6,7,9,11,13). Based on (1) to (8), the similarity of data in Heart can be measured according to attribute type. For instance, three data samples (data sample 1, data sample 2, and data sample 3) are listed in Table 2 to explain how the proposed similarity metric works. According to the unified similarity metric, the distances between the data samples are as given in Table 3.
From Table 2, we can conclude that data sample 1 and data sample 3 are more likely to be clustered into one cluster because the distance between them is smaller, while data sample 2 is less similar to data sample 1 and data sample 3. The Heart data set from UCI provides the original class labels of the data samples: data sample 1 and data sample 3 are labeled as the same class, while data sample 2 belongs to another class. In this illustration, the similarity results of the unified metric are therefore consistent with the class labels.

Main Idea.
Based on the analysis of Figure 3, the only points with relatively large $\delta$ and large $\rho$ are the cluster centers. Points with relatively large $\delta$ and small $\rho$ can be considered noise points because they are isolated. To let the cluster centers be determined automatically, more information is extracted from all data objects in descending order of $\rho$ and $\delta$.
First, all data objects are sorted in descending order of their $\rho$ values and of their $\delta$ values. A fast center set algorithm (CSA) is then adopted to choose the cluster centers automatically. After the cluster centers have been found, each remaining point is assigned to the same cluster as its nearest neighbor of higher density. The cluster assignment is executed in a one-time scan. Unlike partition-based clustering algorithms, FDCA can deal with clusters of arbitrary shape. As shown in Figure 3, the number indicates the level of density: the bigger the number, the larger the density. Data object "3" is a cluster center and its cluster label is CENTER-1. The cluster label of data object "4" should be the same as that of its nearest neighbor of higher density, so it takes the label of data object "5," which is CENTER-1.
For noise points, FDCA does not introduce a noise-signal cutoff. Instead, we first find for each cluster a border region, defined as the set of points assigned to that cluster but lying within a distance $d_c$ of data points belonging to other clusters. We then find, for each cluster, the point with the highest density within its border region. Its density is denoted by $\rho_b$, and only the points of the cluster that have density larger than or equal to $\rho_b$ are kept.
The main idea of how CSA is applied in FDCA is shown as a flowchart in Figure 4.
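The one-time scan assignment described above can be sketched as follows, assuming that the cluster centers have already been chosen and that the densities are available. Ties in density are broken by sort order, and a non-center point with no denser neighbor falls back to the nearest center; both rules, like the function name, are assumptions of this sketch.

```python
def assign_clusters(dist, rho, centers):
    """Assign each point to the cluster of its nearest denser neighbor.

    dist    symmetric distance matrix (list of lists)
    rho     density of every point
    centers indices of the chosen cluster centers
    Returns a list of cluster labels (0 .. len(centers) - 1).
    """
    n = len(rho)
    # Visit points from the highest to the lowest density.
    order = sorted(range(n), key=lambda i: -rho[i])
    labels = [-1] * n
    for label, c in enumerate(centers):
        labels[c] = label
    for pos, i in enumerate(order):
        if labels[i] != -1:
            continue
        denser = order[:pos]  # every earlier point is already labeled
        if denser:
            nearest = min(denser, key=lambda j: dist[i][j])
        else:
            # Assumed fallback: densest point that is not itself a center.
            nearest = min(centers, key=lambda j: dist[i][j])
        labels[i] = labels[nearest]
    return labels


points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
rho = [1, 2, 1, 1, 2, 1]  # densities for a cutoff distance of 1.5
labels = assign_clusters(dist, rho, centers=[1, 4])
```

Because points are visited in descending order of density, every candidate neighbor already carries a label when it is consulted, so a single pass over the sorted points is sufficient.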

Clustering Center Set Algorithm (CSA).
The CSA algorithm is proposed to find the cluster centers for data clustering automatically, based on the descending order of $\rho$ and $\delta$ over all data objects. The process of the CSA algorithm is shown in Algorithm 1.

Parameter 𝑑𝑐 Optimization.
The CSA algorithm is sensitive only to the choice of $d_c$; a proper $d_c$ helps CSA find the correct cluster centers, which in turn makes FDCA highly efficient. This section focuses on how to obtain a proper value of $d_c$.
In Rodriguez and Laio's algorithm [28], as a rule of thumb, a proper value for $d_c$ is in the range of 1% to 2% of the number of data objects in the data set. For example, if the total number of data objects is 1000, then $d_c \in [10, 20]$. Since FDCA aims to cluster mixed data, its target data sets differ from those of Rodriguez and Laio's algorithm. Therefore, mixed data sets from the UCI Machine Learning Repository are examined. Because mixed data have a more complicated similarity metric, the distances between a cluster center and its data objects are likely to span a wider range. We can therefore choose $d_c$ in the range of 1% to 20% of the number of data objects to cover all possibilities. For example, if the total number of data objects is 1000, then $d_c \in [10, 200]$. However, this only fixes the range of $d_c$ and does not yield the optimal value. Suppose that the data set has $n$ data samples; the range for $d_c$ is then defined by $d_c^{\mathrm{low}} = n \times 1\%$ and $d_c^{\mathrm{high}} = n \times 20\%$. For each $d_c$ from $[d_c^{\mathrm{low}}, d_c^{\mathrm{high}}]$, the density and distance of all data objects can be calculated. From the resulting density-distance relationship graph, CSA determines the cluster centers. After the cluster centers have been found, each remaining point is assigned to the same cluster as its nearest neighbor of higher density, following FDCA (described in Section 3.2). This whole process is called one iteration for one $d_c$. Because clustering is unsupervised, whether $d_c$ is an optimal distance threshold cannot be evaluated from the original class labels of the data samples, so another performance evaluation index is designed.
Suppose that there are  clusters; each cluster center could be represented as   .Data objects clustered into   are denoted as    ,  ∈ [0,   ], where   represents the number of data objects belonging to cluster   .Then the performance evaluation index for each  is defined as where (   ,   ) is short for distance between data object    and its cluster center   .
The value of this index reflects the closeness of the clusters, so we select the value of $d_c$ that minimizes it. Finding a proper value of $d_c$ can thus be summarized as an optimization problem. Optimization algorithms such as PSO [29] have been applied to select optimal parameters for clustering algorithms, and the PSO-based parameter self-adaptive method has proven useful in comprehensive simulations. However, PSO is an iterative bioinspired optimization algorithm, which results in high algorithmic and time complexity. To realize fast data clustering, dichotomy (bisection) [30] is adopted instead of bioinspired algorithms to search for the optimal $d_c$. According to the rule above, for each data set we can get an initial range for $d_c$ as $[d_c^{\mathrm{low}}, d_c^{\mathrm{high}}]$. The remaining problem is how to get the optimal value of $d_c$. We already know that a proper $d_c$ lets CSA find the optimal cluster centers, so we need to examine how $d_c$ influences clustering efficiency. We take the Iris data set as an example. $d_c$ is set from 0.1 to 0.9 with a step of 0.05, and CSA is adopted to get the number of cluster centers $k$, as in Table 4.
From Table 4, we can conclude that, as $d_c$ runs from the minimum value of 0.1 to the maximum value of 0.95 in steps of 0.05, the optimal values of $d_c$ are 0.2, 0.25, and 0.3, for which the number of cluster centers is 3. So a fast searching algorithm can be used to find the value of $d_c$ that yields the optimal cluster centers. We apply the self-adaptive strategy for $d_c$ to the Iris data, as shown in Figure 5, to verify how the number of cluster centers depends on $d_c$.
The dichotomy algorithm is applied to search for the optimal value of $d_c$ for the clustering algorithm. We define the range of $d_c$ as $[d_c^{\mathrm{low}}, d_c^{\mathrm{high}}]$, where $d_c$ is from 1% to 20% of the total number of data samples. For a fixed precision, the dichotomy algorithm uses a function of $d_c$ to find its approximate zero by the following steps.
Step 2. Calculate the midpoint of the range $[d_c^{\mathrm{low}}, d_c^{\mathrm{high}}]$.
Step 3. Evaluate the function at the midpoint according to CSA, to get the specific number of cluster centers $k$ with $d_c$ set to the midpoint.
Step 5. If the required precision is achieved, in other words, $|d_c^{\mathrm{high}} - d_c^{\mathrm{low}}|$ is smaller than the precision, then the optimal value of $d_c$ is the current $d_c^{\mathrm{high}}$ or $d_c^{\mathrm{low}}$, and the algorithm ends; otherwise go to Step 2.
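A generic bisection loop consistent with the midpoint and stopping steps above might look as follows. The objective `f` is a hypothetical user-supplied function of the cutoff distance (for example, the number of centers returned by CSA minus a desired count); the interval-halving rule based on a sign change and the returned midpoint are assumptions of this sketch, since only the midpoint and termination steps are specified above.

```python
def bisect_dc(f, dc_low, dc_high, eps=1e-3, max_iter=100):
    """Search [dc_low, dc_high] for an approximate zero of f by bisection.

    Assumes f changes sign exactly once on the interval. Returns the
    midpoint of the final interval, whose width is below eps.
    """
    f_low = f(dc_low)
    if f_low == 0:
        return dc_low
    if f(dc_high) == 0:
        return dc_high
    for _ in range(max_iter):
        if dc_high - dc_low < eps:
            break
        mid = (dc_low + dc_high) / 2.0
        f_mid = f(mid)
        if f_mid == 0:
            return mid
        # Keep the half interval on which the sign change occurs.
        if (f_mid > 0) == (f_low > 0):
            dc_low, f_low = mid, f_mid
        else:
            dc_high = mid
    return (dc_low + dc_high) / 2.0


# Toy objective with a zero at 0.25, standing in for a CSA-based function.
best_dc = bisect_dc(lambda dc: dc - 0.25, 0.1, 0.9)
```

Each iteration halves the search interval, so the number of CSA evaluations grows only logarithmically with the ratio of the initial range to the required precision.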
Therefore, the initial $d_c$ is selected randomly in the range $[d_c^{\mathrm{low}}, d_c^{\mathrm{high}}]$, and CSA is executed to determine the cluster centers automatically. According to the current clustering result, the index defined in (9) is computed to evaluate whether the current $d_c$ is good enough for clustering. If it is, we fix the current $d_c$ as the optimum and calculate the purity and efficiency of FDCA. Otherwise, the dichotomy searching algorithm is applied to find another $d_c$, and CSA and FDCA are repeated. The proposed self-adaptive algorithm for the optimal $d_c$ is faster than algorithms based on PSO or other bioinspired optimization methods. From Table 4, we can conclude that a proper value of $d_c$ helps CSA find the correct cluster centers. However, for slightly different values of $d_c$ from 0.1 to 0.5, the number of clusters $k$ is the same, which means that we only have to find a proper range of $d_c$ within its initial range instead of the exact optimal value. The dichotomy searching algorithm is a fast way to find the proper half of the range for $d_c$. In Section 4, extensive simulations and a real-life application verify its efficiency in finding a proper $d_c$.

Data Settings.
Ten data sets from the UCI Machine Learning Repository are used for the clustering algorithm simulations, as shown in Table 5.

Performance Analysis.
(1) In clustering analysis, the clustering accuracy $r$ [11] is one of the most commonly used criteria to evaluate the quality of clustering results, defined as follows:

$$r = \frac{1}{n} \sum_{i=1}^{k} a_i,$$

where $a_i$ is the number of data objects occurring in both the $i$th cluster and its corresponding true class and $n$ is the number of data objects in the data set. According to this measure, the larger $r$ is, the better the clustering results are, and for perfect clustering $r = 1.0$.
(2) Another clustering quality measure is the average purity of clusters, defined as follows:

$$\mathrm{Purity} = \frac{1}{k} \sum_{i=1}^{k} \frac{|C_i^d|}{|C_i|},$$

where $k$ denotes the number of clusters, $|C_i^d|$ denotes the number of points with the dominant class label in cluster $i$, and $|C_i|$ denotes the number of points in cluster $i$. Intuitively, the purity measures how pure the clusters are with respect to the true cluster (class) labels that are known for our data sets.
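Both measures can be computed from predicted cluster labels and true class labels, as in the short sketch below. It assumes that the true class corresponding to a cluster is the dominant class within that cluster; the function name and label encoding are illustrative.

```python
from collections import Counter


def accuracy_and_purity(labels_pred, labels_true):
    """Return (clustering accuracy, average purity) for two label lists."""
    clusters = {}
    for c, t in zip(labels_pred, labels_true):
        clusters.setdefault(c, []).append(t)
    # Size of the dominant class and total size for every cluster.
    dominant = [Counter(m).most_common(1)[0][1] for m in clusters.values()]
    sizes = [len(m) for m in clusters.values()]
    accuracy = sum(dominant) / len(labels_true)
    purity = sum(d / s for d, s in zip(dominant, sizes)) / len(clusters)
    return accuracy, purity


pred = [0, 0, 0, 0, 1, 1]
true = ["a", "a", "a", "b", "b", "b"]
acc, pur = accuracy_and_purity(pred, true)
```

In this example five of the six objects fall in the dominant class of their cluster, so the accuracy is 5/6, while the purity averages 3/4 and 2/2 over the two clusters and is therefore 0.875. The two measures differ whenever the clusters have unequal sizes.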

Clustering Efficiency.
There are four 2-dimensional data sets (Aggregation, Jain, Spiral, and Flame) with various shapes of clusters (circular, elongated, spiral, etc.).The results are presented in Figure 6.
The results in Table 6 show that the algorithm is capable of clustering arbitrary shape, variable density clusters and has a good clustering quality.
The performance of FDCA is compared with the $k$-prototypes, SBAC, KL-FCM-GM, IWKM, DBSCAN, BIRCH, SpectralCAT, TGCA, and FPC-MDACC algorithms. The experimental results on different data sets show that FDCA is able to find the optimal solution after a small number of iterations. The following reasons contribute to the better performance of our proposed algorithm. FDCA analyses the density and distance of each point, and dichotomy analysis is then adopted to fit the functional relationship between density and distance. Afterwards, the cluster centers are found automatically by analyzing the distribution of the residuals. This procedure conforms with the original distribution of the mixed data, which leads to a good clustering result. Because the number of data records in the Iris, Soybean, Zoo, and Acute data sets is small, the execution is fast. The KDD CUP sample data sets and the Breast data have a relatively large number of data records, and thus the execution time is longer. Balanced data sets such as Heart and Credit adopt a probability and statistics method in the pretreatment stage and therefore need more time than the others.

Complexity Analysis.
Assume that the data set has $n$ data objects; the time complexity of FDCA mainly consists of the computation of the distance and density of each data object, with computational costs of $O(n^2)$ and $O((n^2 - n)/2)$. After the cluster centers are found, the cluster assignment is performed in a single step, and the corresponding computational cost is $O(n - k)$, where $k$ denotes the number of cluster centers.
The time complexity of partition-based clustering algorithms and hierarchical clustering algorithms is $O(\mathrm{iter} \cdot k \cdot n)$ and $O(n^2)$, respectively. So the time complexity of our proposed algorithm is higher than that of partition-based clustering algorithms and comparable to that of hierarchical clustering algorithms. The advantages of our proposed algorithm are that it can determine the cluster centers automatically, can deal with clusters of arbitrary shape, and is not sensitive to parameters.

Unsupervised Number Image Recognition Based on FDCA
labeled examples for classifiers to train, while in practical cases, it is not always suitable. To address those problems, an unsupervised number image recognition method based on FDCA is proposed to improve the recognition rate of handwritten number images without any labeled samples in advance. First, the number images are clustered by FDCA. Then a strict filter is designed to extract cluster centers and typical cluster members automatically for the classifier, which guarantees that the training samples have pure cluster features. Finally, a traditional classifier, the BP artificial neural network (ANN) [31], is adopted to classify the number images, using the selected cluster centers and typical images as training samples instead of images with known labels, so that the method is unsupervised. The MNIST data set is used to test the designed unsupervised image recognition method based on FDCA.

Handwriting Images
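The mother wavelet of a symmetric complex wavelet can be written as a modulated low-pass filter,

$$w(u) = g(u)\, e^{j \omega_c u},$$

where $\omega_c$ is the central frequency of the modulated band-pass filter and $g(u)$ is a slowly varying, symmetric function. After stretching and shifting, we obtain the corresponding wavelet family:

$$w_{s,p}(u) = \frac{1}{\sqrt{s}}\, w\!\left(\frac{u - p}{s}\right) = \frac{1}{\sqrt{s}}\, g\!\left(\frac{u - p}{s}\right) e^{j \omega_c (u - p)/s},$$

where $s \in \mathbb{R}^{+}$ is the stretch factor and $p \in \mathbb{R}$ is the shift factor. The continuous complex wavelet transform of a real signal $x(u)$ is

$$X(s, p) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(\omega)\, \sqrt{s}\, G(s\omega - \omega_c)\, e^{j \omega p}\, d\omega,$$

where $X(\omega)$ and $G(\omega)$, respectively, represent the Fourier transforms of $x(u)$ and $g(u)$.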
In the complex wavelet transform domain, assume that $c_x = \{c_{x,i} \mid i = 1, \ldots, N\}$ and $c_y = \{c_{y,i} \mid i = 1, \ldots, N\}$, respectively, represent two coefficient sets of the images to be compared, extracted from the same wavelet subband and the same spatial location. The CW-SSIM index is

$$S(c_x, c_y) = \frac{2 \left| \sum_{i=1}^{N} c_{x,i}\, c_{y,i}^{*} \right| + K}{\sum_{i=1}^{N} |c_{x,i}|^2 + \sum_{i=1}^{N} |c_{y,i}|^2 + K},$$

where $c^{*}$ denotes the complex conjugate of $c$ and $K$ is a small positive constant, which is used to improve the robustness of $S$ at low signal-to-noise ratio.
To better understand CW-SSIM, the right-hand side of the equation is multiplied by an equivalent factor whose value is 1:

$$S(c_x, c_y) = \frac{2 \sum_{i=1}^{N} |c_{x,i}|\,|c_{y,i}| + K}{\sum_{i=1}^{N} |c_{x,i}|^2 + \sum_{i=1}^{N} |c_{y,i}|^2 + K} \cdot \frac{2 \left| \sum_{i=1}^{N} c_{x,i}\, c_{y,i}^{*} \right| + K}{2 \sum_{i=1}^{N} \left| c_{x,i}\, c_{y,i}^{*} \right| + K}.$$

In the first part of the right-hand side, each term is a constant or the magnitude of a complex wavelet coefficient. For two given images, each complex wavelet coefficient has a definite value. If the condition $|c_{x,i}| = |c_{y,i}|$ is met for all $i$, then the first part of the right-hand side reaches its maximum value of 1, and the value of the second part depends on the phase change between $c_x$ and $c_y$. If the phase change between $c_{x,i}$ and $c_{y,i}$ is constant for all $i$, then the second part reaches its maximum value of 1. Taking this part as an image structural similarity index is mainly based on the following two points: (1) The structural information of local image features is contained in the phase pattern of the wavelet coefficients.
(2) The constant phase change of all coefficients does not change structure of local image features.
With dual-tree complex wavelet transform with shift invariance and good direction selectivity, CW-SSIM index based on dual-tree complex wavelet transform is given.Firstly, the image is decomposed into 6 levels through dualtree complex wavelet decomposition, which can avoid serrated subband.And then calculate local CW-SSIM index of each wavelet subband by moving sliding window on subband, whose size is 7 * 7.In the experiment, we found that performance of CW-SSIM will not be obviously affected by slight perturbation of parameter , so that take  = 0.However, the value of  must be adjusted to obtain the best results Initialize ℎ  = 0,    = 0; For  = 1, 2, . . .,  − 1 For  =  + The range of CW-SSIM is [0, 1].The larger the value is, the higher the image similarity is.
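The local index described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the complex coefficients of one 7 × 7 window have already been extracted from a subband (the dual-tree decomposition itself is not shown), and the function name `cw_ssim_local` is hypothetical. It uses the standard CW-SSIM ratio: twice the magnitude of the summed conjugate products plus K, divided by the summed squared magnitudes of both coefficient sets plus K.

```python
import numpy as np

def cw_ssim_local(cx, cy, K=0.0):
    """CW-SSIM of two equally sized sets of complex wavelet coefficients."""
    cx = np.asarray(cx, dtype=complex).ravel()
    cy = np.asarray(cy, dtype=complex).ravel()
    # Numerator: magnitude of the summed conjugate products (phase-sensitive).
    num = 2.0 * np.abs(np.sum(cx * np.conj(cy))) + K
    # Denominator: total energy of both coefficient sets.
    den = np.sum(np.abs(cx) ** 2) + np.sum(np.abs(cy) ** 2) + K
    if den == 0.0:
        return 1.0  # assumption: two all-zero windows count as identical
    return float(num / den)

rng = np.random.default_rng(0)
c = rng.normal(size=49) + 1j * rng.normal(size=49)   # one flattened 7x7 window

same = cw_ssim_local(c, c)                            # identical coefficients
rotated = cw_ssim_local(c, c * np.exp(1j * 0.7))      # constant phase shift
scaled = cw_ssim_local(c, 3.0 * c)                    # magnitude change only
```

With K = 0, a constant phase shift leaves the index at 1, while scaling one set by 3 lowers it to 2·3/(1 + 9) = 0.6, which matches the two-factor reading of the index given above.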

Strict Filter Design.
To guarantee the purity of each cluster, a strict filter is designed to remove the members that lie on the edge of a cluster. After the cluster centers are determined and the remaining points are assigned to the appropriate clusters, a boundary region is set for each cluster. Data points within this region belong to the cluster, but within an adjustable distance of them there are objects belonging to another cluster. From the objects in the boundary region, we determine a local average density of the cluster. Objects whose density is larger than this local density are kept in the cluster, whereas the other objects are rejected to ensure the cluster's purity. The implementation process is shown in Algorithm 2.
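The filtering rule in this paragraph can be sketched as follows. This is an illustrative reading of the prose, not a reproduction of Algorithm 2: it assumes a precomputed pairwise distance matrix, one density value per object, and hard cluster labels, and the names `strict_filter`, `dist`, `density`, and `radius` are hypothetical.

```python
import numpy as np

def strict_filter(dist, labels, density, radius):
    """Boolean mask of the cluster members kept by the boundary-density rule."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    density = np.asarray(density, dtype=float)
    keep = np.ones(len(labels), dtype=bool)
    for c in np.unique(labels):
        members = labels == c
        # Boundary region: members with an object of another cluster within `radius`.
        near_other = (dist[:, ~members] <= radius).any(axis=1)
        boundary = members & near_other
        if not boundary.any():
            continue  # no neighbouring cluster nearby: keep every member
        threshold = density[boundary].mean()   # local average density
        # Reject members whose density does not exceed the local average.
        keep[members & (density <= threshold)] = False
    return keep

# Two clusters on a line; the inner end points face each other.
x = np.array([0.0, 1.0, 2.0, 3.0, 5.0, 6.0, 7.0, 8.0])
dist = np.abs(x[:, None] - x[None, :])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
density = np.array([3.0, 4.0, 4.0, 1.0, 1.0, 4.0, 4.0, 3.0])

keep = strict_filter(dist, labels, density, radius=2.5)
```

In this toy case only the two low-density points at x = 3 and x = 5 fall in a boundary region, so they set the threshold and are the only members rejected.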

Unsupervised Number Image Recognition Based on FDCA

5.3.1. Data Set and Evaluation Index.
The MNIST data set, which consists of 60000 handwritten number images, is used to verify the performance of the image recognition method based on FDCA. Numbers from 0 to 9 are all included. The images are stored as a binary file, and each image is 28 × 28 pixels, as shown in Figure 8.
In this paper, we adopt the consistency indicators and the recognition rate to evaluate the results.
The consistency indicators are the true rate and the false rate. The true rate is the fraction of pairs of images of the same subject correctly associated with the same cluster. The false rate is the fraction of pairs of images of different subjects erroneously assigned to the same cluster. Their definitions use the following quantities: $n$, the number of objects in the data set, so that $n(n-1)/2$ is the number of object pairs; the number of objects of the same type assigned to the same cluster; the number of objects of different classes assigned to different clusters; and the number of objects of different classes assigned to the same cluster.
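The pair-based quantities above can be computed by enumerating all n(n − 1)/2 object pairs. The sketch below is hedged: the paper's own equations are not reproduced here, so the normalization is an assumption taken from the prose definitions (the true rate is taken over same-class pairs and the false rate over different-class pairs). The function names and the `class_cluster` key format are hypothetical.

```python
from itertools import combinations

def pair_counts(true_labels, cluster_labels):
    """Count object pairs by class agreement and by cluster agreement."""
    counts = {"same_same": 0, "same_diff": 0, "diff_same": 0, "diff_diff": 0}
    for i, j in combinations(range(len(true_labels)), 2):
        cls = "same" if true_labels[i] == true_labels[j] else "diff"
        clu = "same" if cluster_labels[i] == cluster_labels[j] else "diff"
        counts[f"{cls}_{clu}"] += 1   # key is "<class>_<cluster>"
    return counts

def pair_rates(true_labels, cluster_labels):
    """True rate and false rate under the assumed normalization."""
    c = pair_counts(true_labels, cluster_labels)
    same_class = c["same_same"] + c["same_diff"]
    diff_class = c["diff_same"] + c["diff_diff"]
    true_rate = c["same_same"] / same_class if same_class else 0.0
    false_rate = c["diff_same"] / diff_class if diff_class else 0.0
    return true_rate, false_rate

true_labels = [0, 0, 0, 1, 1, 1]
cluster_labels = [0, 0, 1, 1, 1, 1]   # the third object is misassigned

counts = pair_counts(true_labels, cluster_labels)
true_rate, false_rate = pair_rates(true_labels, cluster_labels)
```

Enumerating pairs is quadratic in n, which is acceptable for the 600-image experiment but would call for a contingency-table formulation on larger sets.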

Application Results and Analysis.
The recognition algorithm proceeds as follows.
Step 1. The original images are input, and their similarity is calculated based on CW-SSIM.
Step 2. FDCA is applied to cluster the images and obtain training samples for the BP ANN. The cluster centers and typical members are selected by the strict filter.
Step 3. The BP ANN is trained on the images labeled with their cluster information.
Step 4. Recognition is carried out with the BP ANN. In our method, the BP ANN is adopted according to [31].
First, we select 600 images for clustering to obtain the cluster centers and other typical images on which the classifier is trained. These images contain the numbers from 0 to 9, with 60 images per number. Figure 9(a) shows the cluster center self-determination process of FDCA, which is based on density and distance values, as the distribution of the sorted density and distance values. Different colors denote the different cluster centers. Before the strict filter is added to the method, a clustering result falls into one of two cases. (1) An image x is assigned to cluster A, and the label of A equals the true label of x; then image x has been clustered correctly. (2) Otherwise, image x has been clustered wrongly. To make sure that the training samples for the classifier have been clustered as correctly as possible, we adopt the strict filter, which keeps clusters pure by deleting cluster edge members.
In Table 7, the radius parameter of the strict filter is the distance from the cluster center. In other words, for a strict filter with a given radius, if the distance between a cluster member and the center is larger than the radius, this member is removed from the cluster to guarantee the purity of the cluster. Different filter radii yield different clustering performance, as shown in Table 7, where the true rate denotes clustering accuracy and the false rate denotes the error rate. We can conclude from Table 7 that, without filters, recognition based on FDCA achieves a true rate of 89.8% and a false rate of 4.6%. The higher the true rate is, the higher the false rate is; conversely, the lower the true rate is, the lower the false rate is. The reason is that the stricter the filter is, the more cluster members are excluded and the purer the clusters become, so the true rate is lower and the false rate is lower at the same time.
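The radius rule and the trade-off it produces can be illustrated on a toy cluster. The numbers below are invented for illustration only and are not taken from Table 7; `radius_filter` is a hypothetical name, and the distances to the cluster center are assumed to be precomputed.

```python
import numpy as np

def radius_filter(dist_to_center, radius):
    """Mask of members whose distance to their cluster center is at most `radius`."""
    return np.asarray(dist_to_center, dtype=float) <= radius

# Toy cluster: distance of each member to the center, and whether the member
# actually belongs to the cluster's class (illustrative values only).
dist_to_center = np.array([0.1, 0.2, 0.3, 0.5, 0.7, 0.9])
correct = np.array([True, True, True, True, False, False])

summary = {}
for radius in (1.0, 0.6, 0.25):
    kept = radius_filter(dist_to_center, radius)
    # (number of members retained, fraction of retained members that are correct)
    summary[radius] = (int(kept.sum()), float(correct[kept].mean()))
```

Shrinking the radius from 1.0 to 0.6 removes the two wrongly clustered edge members, and shrinking it further to 0.25 only discards correct members, which is the behavior the paragraph describes: a stricter filter gives purer but smaller training sets.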

Conclusion
A novel fast density clustering algorithm (FDCA) for mixed data, which determines cluster centers automatically, is proposed. A unified mixed data similarity metric is defined to calculate data distances. Moreover, CSA is used to fit the relationship between the density and distance of every data object, and residual analysis is used to determine the centers automatically, which conforms to the original mixed data distribution. Finally, dichotomy analysis is adopted to eliminate the parameter sensitivity problem. The experiments validated the feasibility and effectiveness of the proposed algorithm. Furthermore, FDCA is applied to number image recognition as an unsupervised method. The MNIST data set is used to verify the high recognition rate and low false rate of the FDCA-based method as a typical application. Future research will focus on clustering data streams to achieve high clustering quality based on this work.

Figure 1: The algorithm in two dimensions. (a) Point distribution. (b) Density and distance distribution of (a).

Figure 2: The descending order of the density and distance values.
Each attribute of a data object carries a weight. If the value of the $p$th attribute is missing, its weight is 0; otherwise, its weight is 1. The attribute distance is the distance between two data objects on the $p$th attribute.
For an ordinal attribute, the distance uses the order of each attribute value and the total number of values that the attribute takes among all data objects. In this paper, ordinal attributes are defined differently from categorical attributes. Ordinal attributes are ordered by their values from largest to smallest. For instance, the $p$th attribute of one data object is represented as 1, the $p$th attribute of a second data object is represented as 2, and the $p$th attribute of a third data object is represented as 3. If the $p$th attribute is categorical, then

Figure 5: Relationship between the parameter value and the number of clustering centers on the Iris data set.
Figure 7 lists the average execution time of our proposed algorithm and other algorithms on the eight data sets.

The recognition rate is defined as follows:
$$\text{recognition} = \frac{\text{number of faces recognized correctly}}{\text{total number of faces attending recognition}} \times 100\%. \quad (17)$$

Figure 8: Examples of each number in MNIST.

Figure 9(b) is the result of number image clustering based on FDCA for all 600 images.
Figure 9: (a) Cluster center self-determination process based on FDCA. (b) Numbers of the same color belong to the same cluster, while grey images do not belong to any cluster. In each cluster, the images marked with a tiny black circle are the cluster centers.

Table 1: Six distance measures of partition-based clustering algorithms.

Table 2: Attribute information of three data samples.

Table 3: Distance between two data samples.

Table 4: Influence of the parameter value on the number of clustering centers on the Iris data set.

Table 5: Twelve data sets from UCI, with the number of numerical attributes, the number of categorical attributes, and the number of clusters of each; "/" is for unknown parameters.

Table 6: Clustering quality evaluation on all data sets.

Table 7: Performance comparison of different strict filters for the MNIST data set.