Structure Identification-Based Clustering According to Density Consistency

The structure of a data set is of critical importance in identifying clusters, especially the density difference feature. In this paper, we present a clustering algorithm based on density consistency, which is a filtering process that identifies data points with the same structure feature and classifies them into the same cluster. The method is not restricted by cluster shape or by high-dimensional data, and it is robust to noise and outliers. Extensive experiments on synthetic and real-world data sets validate the proposed clustering algorithm.


Introduction
As an important stage of the data mining process, data clustering divides data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects in other groups. Representing data by fewer clusters necessarily loses certain fine details but achieves a compact representation. Clustering is the basis of pattern recognition and knowledge discovery. A large body of literature on clustering techniques has appeared since the last century. These techniques are widely applied in image segmentation [1], information retrieval [2], text mining [3], image quantization [4], and so on.
Clustering algorithms can basically be divided into two categories, the partitional approach [5, 6] and the hierarchical approach [7-9]. The partitional approach generally obtains a partition of the data set by minimizing a cost function, which is commonly modeled to prefer clusters with maximal intracluster similarity and minimal intercluster similarity. Existing algorithms of this kind include FORGY [10], ISODATA [11], WISH [5, 6], and so forth.

Mathematical Problems in Engineering
Hierarchical clustering algorithms are among the most commonly used clustering techniques and are especially useful in biology and medicine. The basic idea of such algorithms is to successively group a data set into clusters in a hierarchical fashion. Hierarchical clustering algorithms are able to produce multilevel clusters, which is necessary for applications in biology and other fields. Examples of hierarchical clustering algorithms are CURE [7], ROCK [8], and Chameleon [9].
Density-based clustering is a classical family of clustering algorithms. One kind of density-based method relies on an assumption about the distribution of the whole data set, and the other mines the structure of the data set from local density features.
To cluster by assuming the distribution of the whole data set, the first step is to suppose different distribution functions on the different density zones of the data set, which is commonly done with nonparametric density estimation. Using nonparametric density estimation with a kernel function to estimate the distribution, the clusters of a data set are obtained by repeatedly moving the data toward the local maxima of the estimated density. Well-known density-based clustering algorithms of this kind include Mean Shift [12-15], CSSF [16], and so on. The computation of these algorithms is simple, and the number of clusters need not be known in advance. However, they depend heavily on the quality of the density estimate. Without prior knowledge of the structure of the data set, it is difficult to provide a density estimate that fits the data well, so the clustering results might not be desirable for real data sets or images.
When the structure of a data set is described by mining the local density feature, clustering agglomerates the data points that have the same density feature. For a data set, the density difference at each data point can be measured with its k-NN neighborhood: a larger k-NN neighborhood indicates that the local density is low, that is, that the data points in this region are sparse. Data points classified into the same cluster should have the same local density value or the same k-NN neighborhood radius. The work in [17] and DBScan [18] are of this kind. Methods of this kind need not assume a distribution for the whole data set and are not restricted by the shape of the data set. However, they need to compare density differences in order to agglomerate the data points.
The present research proposes a new clustering algorithm based on density consistency that clusters a data set with a filtering process. By describing the local structure features of the data set, data points with the same feature can be classified into the same cluster. The new algorithm is not restricted by cluster shape or by high-dimensional data, and it is robust to noise and outliers.
The remainder of this paper is organized as follows. In Section 2 we introduce some useful structure features of data sets, which are important to clustering problems. Section 3 introduces the clustering process, including the modeling of filters, the feature extraction process, and the feature integration process followed by the top-down process. The experiments and conclusion are given in Sections 4 and 5.

Density Feature of Data Sets for Clustering
Structures of data sets and clustering are closely related. It is commonly considered that clusters are formed from discrete data points according to some structure features, and these structure features include local and large-scale structure features. Local structure is mainly described by the spatial relationship between nearby data points, which is also called the neighborhood relationship. The local spatial relationship is the most basic and elementary one for the clustering problem. As shown in Figure 1(a), many tiny clusters may be extracted at the local scale even though the data set follows a uniform distribution and has only one cluster at the large scale.
The large-scale structure features mainly comprise density, connectedness, and direction; we explain these structure features in the following subsections.

Density Feature
Density characterizes the distribution of a data set and can be measured by the number of data points in a given volume of pattern space. According to the proximity and similarity laws in psychology, a region of a data set with higher density is prone to be recognized as a cluster by human eyes. See Figure 1(b) for an example, where two regions of data with different densities are very naturally perceived as two different clusters, despite the fact that they are very close to each other. A uniformly distributed data set is naturally regarded as having no feature because no visual difference is observed (see Figure 1(a)).

Connectedness Feature
Neighboring regions with local structure features may connect to each other and form a curve or manifold structure, as in Figure 1(c). By the Gestalt continuity law, connected data are easily perceived as one cluster, whereas disconnected data are prone to be perceived as different clusters.

Direction Feature
If the local regions connect to each other and have the same principal direction, then this principal direction is called the direction feature of the data set. See Figure 1(d) for an example, which has two different direction features.
The local and large-scale structure features are the most typical features for valid clustering of a data set. Local structure features form large-scale structure features through a bottom-up process, while the large-scale features provide feedback information through a top-down process. In the next section, we introduce a clustering process that extracts these features and integrates them into clusters.

Clustering Process
Since the data set to be clustered has many native structure features, in this section we propose feature extraction and feature integration methods to obtain valid clustering results.
Let $X = [x_1, x_2, \ldots, x_N]_{N \times d}$ be a pattern matrix and $x_i^T \in \mathbb{R}^d$ be a pattern. The data set to be clustered can be viewed as a kind of imaginary image with its native structure features, and the clustering problem is then viewed as a cognition problem. To extract the structure features, we employ filtering methods from the theory of vision research. Given a stimulus, the response of a neuron in the primary visual system can be measured as
$$f(x; \Theta) = I(x) * K(x; \Theta), \tag{3.1}$$
where $*$ is the convolution operation, $I(x)$ is the stimulus (an image or imaginary image), $K(x; \Theta)$ is a series of filters (or kernel functions), and $\Theta$ is a set of parameters. In (3.1), $f(x; \Theta)$ is the response of the neuron to the stimulus $I(x)$; it is the filtering response and is called a feature of $I(x)$. Data sets may have many native structure features, which need to be extracted based on (3.1) with different filters (or kernels). These filters should be determined automatically in order to deal with the different structure features of different data sets. Next we introduce how to construct filters automatically based on these salient structural features.

Modeling of Filters to Extract Local Structure Features
As in image processing, each data point corresponds to the center of a filter, and for the clustering problem we model a new type of low-pass filter (3.2), where $\Omega(x_i; k)$ is the k-NN neighborhood of $x_i$ obtained with a new special method, and $x_{j|i}$ is the jth neighbor of data point $x_i$. The matrix $A$ is a covariance matrix that controls the similarity measurement according to the neighborhood relationship and the structure of the data set.
Determining the special neighborhood of each data point is very important for the new clustering algorithm; the aim is to construct $\Omega(x_i; k)$ according to density differences. We next introduce how to determine the set $\Omega(x_i; k)$ for each data point $x_i$ and how to scale the neighborhood $\Omega(x_i; k)$ according to the matrix $A$.

Determination of Neighborhood
Unlike the classical k-NN strategy, we use a new method to find the nearest neighborhood of each data point. First, we initialize k nearest neighbor data points [19], where each newly added data point is the one nearest to the current set; then we select the proper neighbors among these initial k data points.
Definition 3.1 (acceptable neighborhood radius). Each data point $x$ has k initial neighbors, and $d_i(x)$ is the distance between the ith newly added nearest neighbor $x_i$ and the current set, which at that moment contains i data points (the original point and the neighbors added before). Then the neighborhood radius of the data point $x$ is defined as
$$r(x; k) = \frac{1}{k} \sum_{i=1}^{k} d_i(x). \tag{3.3}$$
Take Figure 2(a) as an example, in which we find the nearest 4 data points to a. First, data point b is the nearest to a, at distance d1; then a and b are viewed as a whole, and the second nearest data point is c rather than d, at distance d2. Continuing in this way, the acceptable neighborhood radius of data point a is (d1 + d2 + d3 + d4)/4.
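The neighborhood growth of Definition 3.1 can be illustrated with a short Python sketch. It assumes, following the Figure 2(a) example, that the distance from a candidate to the current set is the minimum distance to any member of the set; the function name `acceptable_radius` and the sample points are hypothetical and not taken from the paper.

```python
import math

def acceptable_radius(points, start, k):
    """Grow a neighborhood from points[start] by repeatedly adding the point
    closest to the current set; return (added indices, mean added distance)."""
    # best[j]: distance from candidate j to the nearest member of the grown set
    best = {j: math.dist(points[start], p)
            for j, p in enumerate(points) if j != start}
    added, dists = [], []
    for _ in range(min(k, len(best))):
        j = min(best, key=best.get)      # candidate closest to the current set
        dists.append(best.pop(j))
        added.append(j)
        for other in best:               # the set grew, so refresh distances
            best[other] = min(best[other], math.dist(points[j], points[other]))
    radius = sum(dists) / len(dists) if dists else float("inf")
    return added, radius

# Illustrative data: point 2 is farther from point 0 than point 3 is,
# but it is added first because it is close to the already added point 1.
pts = [(0, 0), (1, 0), (2, 0), (0, 1.5), (10, 10)]
order, r = acceptable_radius(pts, start=0, k=3)
```

In this example the added distances are 1, 1, and 1.5, so the radius is 3.5/3, whereas classical k-NN from point 0 alone would rank point 3 ahead of point 2.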
With this definition, each data point has an associated neighborhood radius. To select the true neighbors of a data point, the neighbors should have features similar to those of the data point. So a consistency criterion, the internal consistency criterion (ICC) of Definition 3.2, is built up,
where $d_i$ is the ith added distance in (3.3) and α is a parameter.
The aim of the ICC is to remove data points whose density differs from that of the point under consideration. With the criterion, the neighbors of each data point are determined by the structure of the data set, and the selection mechanism for neighbors can in turn explore that structure. In Figure 2(b), the neighbors of data point a are b, c, d, and e according to classical k-NN. However, data points a and b do not satisfy the two criteria (3.4), so data point a has no neighbor and is an outlier. The two criteria can also detect density structure, as shown in Figure 2(c), where the initial 6 nearest neighbors are b, c, d, e, f, and g. The first three data points b, c, and d are neighbors of a because they satisfy the criteria. However, the data points d and e with distance d4 do not satisfy the criteria, so the number of nearest neighbors is 3, namely {b, c, d}. Thus, the data points e, f, and g are not neighbors of a. Note that the number of nearest neighbors may differ from the initial number k. Note also that selecting neighbors according to the ICC does not depend on the order of the nearest neighbors.
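As a rough illustration of the selection step, the sketch below applies only the radius condition of Definition 3.2, max{r(x; k), r(y; k)} ≤ α · min{r(x; k), r(y; k)}. The second condition of (3.4), which involves the added distances d_i, is not reproduced here, so this is a partial sketch rather than the full criterion. The function name, the radii, and the value α = 1.5 are hypothetical.

```python
def icc_neighbors(candidates, radius, alpha):
    """Keep a candidate neighbor y of x only if
    max(r_x, r_y) <= alpha * min(r_x, r_y) (radius part of the ICC only)."""
    kept = {}
    for x, cand in candidates.items():
        kept[x] = [y for y in cand
                   if max(radius[x], radius[y]) <= alpha * min(radius[x], radius[y])]
    return kept

# Illustrative radii: a and b lie in a dense region, c in a sparse one.
radius = {"a": 1.0, "b": 1.2, "c": 5.0}
candidates = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
neighbors = icc_neighbors(candidates, radius, alpha=1.5)
```

Here a and b keep each other as neighbors, while c loses all of its candidates and would be treated as an outlier at this stage.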

Local Scale and Direction of Neighborhoods
Together with its neighbors, data point $x$ has a neighborhood $\Omega(x_i; x, k)$, which is important for describing the interaction between $x$ and each of its neighbors. The matrix $A$ in (3.2) not only provides a scale restriction but also endows each data point with a direction feature. The matrix $A$ can be made concrete by employing a covariance matrix $B$ of the neighborhood, where $|\cdot|$ is the number of neighbors in the neighborhood. The local scale matrix in (3.2) is then represented in terms of $B$. Whenever $B$ is not invertible, we add a small positive constant to each diagonal element of $B$. With the local scale matrix $A$ in (3.2), the neighborhood of each data point has a local scale that restricts the similarity, or relationship, between each pair of data points to a proper scope. Meanwhile, the eigenvector of $A_x$ with the largest eigenvalue is the local direction at the point $x$, which provides a further agglomeration criterion and feedback information to refine clusters.
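A minimal NumPy sketch of the local scale computation follows. Since the exact forms of $B$ and $A$ are not reproduced above, it makes several labeled assumptions: the covariance is centered at the data point itself and normalized by the number of neighbors, the regularization constant `eps` is arbitrary, and the local direction is taken as the principal eigenvector of the covariance. The function name is hypothetical.

```python
import numpy as np

def local_scale_and_direction(x, neighbors, eps=1e-8):
    """Neighborhood covariance around x (regularized if singular) and its
    principal eigenvector. Centering at x and 1/|Omega| scaling are assumptions."""
    diffs = np.asarray(neighbors, dtype=float) - np.asarray(x, dtype=float)
    B = diffs.T @ diffs / len(diffs)
    if np.linalg.matrix_rank(B) < B.shape[0]:
        # B is not invertible: add a small positive constant to the diagonal
        B = B + eps * np.eye(B.shape[0])
    eigvals, eigvecs = np.linalg.eigh(B)   # eigenvalues in ascending order
    direction = eigvecs[:, -1]             # eigenvector of the largest eigenvalue
    return B, direction

# Illustrative neighborhood stretched along the first axis.
B, direction = local_scale_and_direction(
    [0.0, 0.0], [[1, 0], [-1, 0], [2, 0.1], [-2, -0.1]])
```

For this elongated neighborhood the returned direction is almost parallel to the first coordinate axis, up to sign.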

Agglomeration Process of Clusters
In this subsection, we introduce the agglomeration process of clusters, in which different clustering features are extracted, from local features to large-scale features and then to clusters. The main process contains three steps: feature extraction with local similarity, feature cognition with large-scale structure features, and reidentification with a top-down process. In the feature extraction procedure, the data set is successively filtered (convolved) with multiscale filters, yielding various feature representations of the data set. In the feature cognition procedure, the clustering result is obtained by integrating various large-scale features. In the top-down process, the clusters are rechecked to form satisfactory ones, handling, for example, noisy data points, manifold clusters, and clusters with a principal direction.

Extraction of Local Scale Features with Filtering Process
Local scale features embody the local structures of data sets, which are based on local similarity. Neighboring data points with a close relationship form local structure features, such as the density, shape, and principal direction of the neighborhood. To extract these local scale features automatically, we use a series of self-tuning filters to agglomerate local neighborhood data points.
Let $X^t$ be the feature of the data set extracted at layer $t$, where $X^0$ simply corresponds to $X$. Then the features at successive layers are related by
$$X^{t+1} = U(X^t) X^t, \tag{3.7}$$
where $U$ is a data-driven filtering matrix constructed according to (3.2) and given by (3.8).

The filtering matrix is sparse; each element $u_{ij}$ represents the relationship between data points $i$ and $j$, and it is zero if they are not neighbors. The representation (3.7) defines a scale space $\{X^{t+1} = U(X^t) X^t \mid t \ge 0\}$, which we call the discrete scale space (DSS) of the data set deduced from its feature. For definiteness, we assume that the feature extraction procedure is accomplished within $T$ layers of processing. Furthermore, it can be proved that $X^t$ defined by (3.7) eventually stabilizes as $t \to \infty$. So we can also assume that $T$ is so large that $X^t$ reaches a stationary state when $t \ge T$, at which point the data points with strong relationships have shrunk to a common destination to form a cluster.
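The iteration (3.7) can be sketched in Python as follows. Because the filter (3.2) and the matrix (3.8) are not reproduced above, the sketch substitutes an isotropic Gaussian weight with a hypothetical bandwidth `sigma` in place of the local scale matrix $A$, keeps the neighbor lists fixed, and row-normalizes the weights; these choices are assumptions for illustration only.

```python
import numpy as np

def filter_step(X, neighbors, sigma):
    """One layer X_next = U(X) X with a sparse, row-normalized Gaussian U.
    The isotropic Gaussian weight is an assumption standing in for (3.2)."""
    X = np.asarray(X, dtype=float)
    X_next = np.empty_like(X)
    for i, nbrs in enumerate(neighbors):
        idx = np.array([i] + list(nbrs))           # a point also weights itself
        d2 = np.sum((X[idx] - X[i]) ** 2, axis=1)
        w = np.exp(-d2 / (2.0 * sigma ** 2))
        X_next[i] = w @ X[idx] / w.sum()           # weighted mean of neighbors
    return X_next

def run_filter(X, neighbors, sigma, tol=1e-9, max_layers=500):
    """Iterate filter_step until the features stop moving (stationary state)."""
    X = np.asarray(X, dtype=float)
    for _ in range(max_layers):
        X_new = filter_step(X, neighbors, sigma)
        if np.max(np.abs(X_new - X)) < tol:
            return X_new
        X = X_new
    return X

# Two small groups; neighbor lists connect points only within each group.
X0 = [[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5]]
nbrs = [[1, 2], [0, 2], [0, 1], [4], [3]]
XT = run_filter(X0, nbrs, sigma=1.0)
```

After the iteration settles, the rows of `XT` belonging to the same connected group coincide up to numerical tolerance, which is the shrinking behavior described above; grouping identical rows then yields the elementary clusters.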

Integration of Large Scale Feature Clusters
At local scales, data points with close relationships are grouped together to form local structure features, and these structure features are the rudiments of clusters. These rudimentary clusters can then be grouped into final clusters according to some aggregate properties. Next we introduce the integration of large-scale features.
Two similar regions (clusters) merge, under the precondition that the two regions are close and that their average distances are almost equal. For the distance between two clusters, we use a modified distance of sets: the m smallest distances between the clusters are selected, and their average is taken as the intracluster distance. In this paper, m is set to 3 to make the algorithm more robust. The average distance of a cluster is defined as follows.
Definition 3.3 (average distance of cluster). Let $C_i$ be the ith cluster obtained with the filtering process, and for $x \in C_i$ let $\Omega(x_i; x, k)$ be the neighborhood of $x$. The average distance of cluster $C_i$ is denoted $d(C_i)$. As discussed above, two clusters $C_i$ and $C_j$ merge when the two criteria (3.10) are satisfied.
With the definitions of the average distance and of the intracluster distance between two close clusters, and with the agglomeration criteria (3.10), the local structure features are integrated into large-scale structure features, such as connectedness and direction features. This procedure is a rough, bottom-up agglomeration in which different local structure features may merge. A top-down process is then needed to analyze whether these local features should merge or not.
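The modified distance of sets described above, the average of the m smallest point-to-point distances between two clusters, can be sketched as follows. The function name and the sample clusters are hypothetical; the merge criteria (3.10) themselves are not implemented here.

```python
import math
from itertools import product

def modified_set_distance(cluster_a, cluster_b, m=3):
    """Average of the m smallest point-to-point distances between two clusters.
    With m = 1 this reduces to the classical minimum distance of sets."""
    dists = sorted(math.dist(p, q) for p, q in product(cluster_a, cluster_b))
    smallest = dists[:m]
    return sum(smallest) / len(smallest)

# Illustrative clusters on a line.
a = [(0, 0), (1, 0)]
b = [(2, 0), (4, 0)]
d = modified_set_distance(a, b, m=3)
```

For these two clusters the pairwise distances are 1, 2, 3, and 4, so the value with m = 3 is 2. Averaging several small distances instead of taking the single minimum makes the measure less sensitive to one stray point lying between the clusters.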

Top-Down Procedure
The identification of noisy data points and of large-scale structure features of a data set is difficult for the clustering problem and proceeds in parallel with the clustering algorithm. Here the identification process embeds top-down procedures, because the agglomeration process is only a bottom-up procedure without any large-scale feature information. In this paper, the identification of noisy data points, connectedness, and direction features is considered.

Identification of Noisy Data Points or Outlier
Many noisy data points or outliers may appear during the agglomeration of local features in the filtering process. These data points depend strongly on local similarity and may not be true noise; whether they are noisy data points depends on the large-scale structure features after agglomeration. Some outliers appear when neighbors are sought for each data point, and some of these outliers may then agglomerate into clusters in the filtering process. To avoid this, each outlier should be examined to decide whether it is noise or whether it should be assigned to a nearby cluster according to statistical properties (such as the mean and variance of a Gaussian). An example is given in Section 3.2.4 to demonstrate the identification procedure.

Identification of Connectedness and Direction Feature
Over the whole agglomeration procedure, fine clusters merge into coarse clusters, and the number of clusters decreases. The coarse clusters are made of small fine clusters, each of which has its own local principal direction. The fine clusters are regions of various shapes, and the region of a fine cluster $C_i$ can be measured with its covariance matrix $\Sigma_i$, where $\bar{x}$ is the mean of the local cluster. The principal direction of $\Sigma_i$ is the local direction of the cluster. When fine clusters agglomerate into a coarse cluster, if the directions of the subagglomerates are almost the same, the coarse cluster has a direction feature.
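The comparison of local directions can be sketched as follows. The covariance normalization by the cluster size and the 15-degree tolerance for "almost the same" direction are assumptions made for illustration; the paper does not state a threshold, and both function names are hypothetical.

```python
import numpy as np

def principal_direction(points):
    """Unit eigenvector of the cluster covariance with the largest eigenvalue."""
    P = np.asarray(points, dtype=float)
    centered = P - P.mean(axis=0)            # subtract the cluster mean
    cov = centered.T @ centered / len(P)
    _, vecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    return vecs[:, -1]

def same_direction(points_a, points_b, max_angle_deg=15.0):
    """True if the two principal directions differ by at most max_angle_deg.
    The sign of an eigenvector is arbitrary, hence the absolute value."""
    cos = abs(principal_direction(points_a) @ principal_direction(points_b))
    return bool(cos >= np.cos(np.radians(max_angle_deg)))

# Two nearly horizontal fine clusters and one nearly vertical one.
a = [[0, 0], [1, 0.05], [2, -0.05], [3, 0]]
b = [[4, 0.02], [5, 0], [6, -0.02], [7, 0.01]]
c = [[0, 0], [0.05, 1], [-0.05, 2], [0, 3]]
```

With these sample clusters, `same_direction(a, b)` holds while `same_direction(a, c)` does not, so only a and b would contribute to a common direction feature.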
Input: Data set X, in which each row is a pattern; parameters k, α, β.
Output: Label vector L, in which each element is a cluster label.
Step 1: Find the k nearest neighbors of each data point x according to Euclidean distance.
Step 2: Select the nearest neighbors of each data point x according to Definition 3.2.
Step 3: Construct the filtering matrix U based on (3.2) and (3.8).
Step 4: Apply the filtering process X ← UX to obtain elementary clusters, which are the elements with various kinds of structure features. A top-down process is included to identify connectedness and direction.
Step 5: Integrate elementary clusters with the same structure features into meaningful clusters according to (3.9).
Step 6: Apply the top-down process to identify noisy data points and outliers.

Algorithm 1: The process of the new clustering algorithm.

Algorithm Outline
The clustering algorithm includes three procedures: local structure feature extraction with the filtering process, integration of large-scale features, and the top-down process to refine clustering results. The algorithm is summarized in Algorithm 1.
An example is shown in Figure 3, where two Gaussian data sets have different means and variances. The subfigures, from left to right and from top to bottom, show the agglomeration procedure of clustering. The second subfigure is the result of local feature integration under the filtering procedure, and the third subfigure is the result after the integration of large-scale structure features. The last is the result of the top-down process, in which some noisy data points are assigned to the two final clusters according to the variances of the two main clusters.

Complexity of the Algorithm
The high complexity of clustering algorithms makes it difficult to deal with large data sets, and preprocessing the original data sets is also expensive. Within the scope of this paper, we do not intend to solve the preprocessing efficiently. The KNN procedure (typically $k \le 40$) is used in Step 1 of the new algorithm. In low dimensions, KNN has complexity $O(N \log N)$, while in high dimensions the complexity is $O(N^2)$, where $N$ is the number of data points.
The complexity of our clustering algorithm is $O(N)$. In the filtering process, the complexity is no more than $O(N \cdot k \cdot d)$ because the filtering matrix is sparse: the number of nonzero elements in each row is no more than $k$ (typically $k \le 40$), and $d$ is the dimension of each pattern. So the complexity of the filtering process is $n \cdot O(N)$. In the integration procedure, the complexity is $O(N_c^2)$, where $N_c$ is the number of elementary clusters. In practice $N_c \ll N$, and the total complexity of the algorithm is linear in the size of the data set, which is much lower than the $O(N^3)$ of CSSF [20] and VClust [21].
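To give a feel for these orders of magnitude, the following arithmetic uses a hypothetical data set with $N = 10^5$ points, $d = 2$, and $k = 40$. These values are illustrative assumptions and are not taken from the experiments.

```latex
\begin{align*}
N \cdot k \cdot d &= 10^{5} \times 40 \times 2 = 8 \times 10^{6}
  && \text{(one filtering layer with sparse } U\text{)} \\
N^{2} &= 10^{10}
  && \text{(KNN search in high dimension)} \\
N^{3} &= 10^{15}
  && \text{(a cubic-cost method)}
\end{align*}
```

Under these assumptions, one filtering layer costs roughly three orders of magnitude less than the quadratic KNN preprocessing, which is why the neighbor search rather than the filtering dominates the overall cost.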

Experiment
We provide a series of experiments and applications in this section to demonstrate the effectiveness of the proposed clustering algorithm. The test data sets include synthetic data sets and a set of real-world data sets from public domains that are extensively used to test the performance of classical clustering algorithms. The synthetic data sets are specially generated to exhibit complex structure features, such as unbalanced clusters and complex manifold shapes. The real-world data sets are derived from image segmentation and from a runway detection task on a remote sensing image.
We have applied the new algorithm to the above data sets and compared it with several well-known clustering algorithms that are representative of the existing clustering approaches: clustering by scale space filtering (CSSF) [20] for the vision-simulation-based approach, Chameleon [9] for the graph-based approach, spectral-Ng [22] for the spectrum-based approach, Mean Shift [15] for the density-based approach, and the recent LEGClust algorithm [23] for the information-entropy-based approach.
In the experiments, we use the normalized mutual information (NMI) [24] as the criterion for measuring the performance of a clustering algorithm whenever the data sets are of high dimension. There are three parameters in the algorithm: the initial neighbor parameter k and the agglomeration criterion parameters α and β. These parameters are very stable for the algorithm; k (usually 10 ≤ k ≤ 40) is only an initial number of neighbors for each data point. The three toy data sets and the results of the different clustering algorithms are shown in Figure 4, where each row is a synthetic data set and each column contains the results of one algorithm on the three data sets. We recruited 50 people who are familiar with the clustering problem to evaluate these results from the figures, and the most acceptable results are marked with a red box. Judging from the results of each clustering algorithm (each column), the new algorithm and VClust are better than the others. The only advantage of the new algorithm over VClust is its lower computational load, $O(N^2)$ compared with $O(N^3)$.
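For reference, the NMI criterion mentioned above can be computed as in the following sketch. Several normalizations of mutual information are in use; this version divides by the geometric mean of the two label entropies, which is an assumption and may differ from the variant defined in [24]. The function name `nmi` is hypothetical.

```python
import math
from collections import Counter

def nmi(labels_true, labels_pred):
    """Normalized mutual information I(T; P) / sqrt(H(T) * H(P)).
    The geometric-mean normalization is an assumption of this sketch."""
    n = len(labels_true)
    count_t = Counter(labels_true)
    count_p = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mi = sum(c / n * math.log(c * n / (count_t[a] * count_p[b]))
             for (a, b), c in joint.items())
    h_t = -sum(c / n * math.log(c / n) for c in count_t.values())
    h_p = -sum(c / n * math.log(c / n) for c in count_p.values())
    if h_t == 0.0 or h_p == 0.0:
        # Degenerate case: at least one labeling has a single cluster
        return 1.0 if h_t == h_p else 0.0
    return mi / math.sqrt(h_t * h_p)

same = nmi([0, 0, 1, 1], [1, 1, 0, 0])      # identical partitions, renamed labels
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1]) # statistically independent partitions
```

The score is invariant to renaming cluster labels: the first call gives 1 up to rounding, and the second gives 0 because the two partitions share no information.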
We further compare the new algorithm with other clustering algorithms on real-world problems. The real data sets are from UCI and USPS, with added noisy data points that follow a uniform distribution: 30 noisy data points are added to the Iris data set, and 100 uniform noisy points are added to USPS-01. Because the density-based algorithms performed uniformly unsatisfactorily in the preceding simulations, we compare the new algorithm only with the graph-based algorithms, whose results are better. Thus, as representatives, Chameleon, spectral-Ng, noise robust spectral clustering (NRSC) [24], and self-tuning spectral clustering (STSC) [25] are selected for comparison with the new algorithm. All the algorithms are applied to the four real data sets, as shown in Table 1.
The clustering algorithms are also applied to two natural images, which are shown in Figure 5. The images are composed of regions with different textures and gradually changing colors. Two well-known algorithms that are very successful in image segmentation are normalized cut [26] and mean shift [15]. We apply the new algorithm and CSSF to the chosen images and compare the results with those of normalized cut and mean shift. The resulting segmentations are shown in Figure 5. Note that the second column shows one kind

Conclusion
We have proposed a new clustering algorithm that identifies structure features based on density consistency. By viewing a data set as an imaginary image and treating data clustering as a cognition problem, it solves the problem by mimicking the mechanism of visual information processing. First, the results of the new algorithm are closest to human perception for low-dimensional data sets with complex structure. Second, it can deal with various structures of subspace clusters and is not restricted by shape or high dimension. Third, almost no parameters have to be tuned, and it is easy to use. Fourth, the computational load is much lower than that of CSSF and VClust.

Figure 1: Structure features of data sets. (a) A data set with no structure. (b) A data set with density feature: two clusters of data with different densities. (c) A data set with connectedness feature: it contains two connected manifolds. (d) A data set with orientation feature.

Definition 3.2 (internal consistency criterion, ICC). If data point $y$ is a neighbor of data point $x$ according to Euclidean distance, they must satisfy
$$\max\{r(x; k), r(y; k)\} \le \alpha \cdot \min\{r(x; k), r(y; k)\},$$

Figure 2: Criteria to determine the nearest neighbors of each data point. (a) Initial 4 nearest neighbors for data point a. (b) An example of determining neighbors for data point e according to the two criteria. (c) An example of determining neighbors for data point a according to the two criteria.

Figure 3: An example to show the clustering procedure.

Figure 4: Comparison of results with different classical algorithms on three toy data sets. (a) Clustering results with the new algorithm on three data sets. (b) Clustering results with CSSF. (c) Clustering results with Chameleon. (d) Clustering results with spectral-Ng. (e) Clustering results with LEGClust.

Figure 6: Runway detection in a remote sensing image. (a) Remote sensing image. (b) Pixels extracted from the image. (c) Detected result with RCMD. (d) Clustering result with the new algorithm. (e) Two runways detected with the new algorithm.