Density Peak Clustering Based on Relative Density Optimization

Among numerous clustering algorithms, clustering by fast search and find of density peaks (DPC) is favoured because it is less affected by the shapes and density structures of the data set. However, DPC still shows limitations when clustering data sets with heterogeneous clusters, and it easily makes mistakes in the assignment of remaining points. The new algorithm, density peak clustering based on relative density optimization (RDO-DPC), is proposed to address these problems and obtain better results. Using the neighborhood information of sample points, the proposed algorithm defines a relative density for the sample data and recognizes density peaks of the nonhomogeneous distribution as cluster centers. A new assignment strategy is proposed to solve the overclassification problem. The experiments on synthetic and real data sets show good performance of the proposed algorithm.


Introduction
As an unsupervised machine learning method, clustering groups sample data into reasonable classes based on the similarity between sample points. The process tries to make the similarity between samples inside the same cluster as high as possible and the similarity between samples in different clusters as low as possible. Many different types of clustering algorithms have been proposed for different applications. In general, clustering algorithms can be divided into divisive clustering [1-3], hierarchical clustering [4,5], grid-based algorithms [6,7], model-based algorithms [8,9], and density-based algorithms [10,11]. In practical applications, data sets are varied, complex, and high-dimensional, which poses a huge challenge to clustering. Some scholars have put forward the idea of combining multiple clustering algorithms, that is, integrated clustering [12,13], which effectively improves the accuracy of clustering. With the development of cluster analysis theory and technology, clustering plays an increasingly important role in image processing, machine learning, artificial intelligence, natural language processing, pattern recognition, information retrieval, and bioinformatics [14].
Clustering by fast search and find of density peaks (DPC) [15] proposes a totally new clustering framework and a new way of defining cluster centers. The structure of the data is mapped into a two-dimensional space (local density and nearest distance), in which centers are recognized and clusters are grouped. With DPC, density peaks of the sample data are found easily and quickly, and DPC also shows high efficiency in the assignment of points and the elimination of noise. However, there are still limitations in clustering with DPC. (1) There is no unified density measurement, and the parameter $d_c$ is difficult to set because it depends on the specific problem. (2) Cluster centers need to be selected manually, which is a qualitative analysis with subjective factors. As a result, objective and reasonable centers are difficult to find in decision graphs. (3) Each sample point is assigned to the cluster of its nearest neighbor with higher density, so one wrong assignment is easily propagated to the points that follow it. (4) According to the definition of the distance $\delta_i$, if two points in the same cluster both have the highest density, both are selected as cluster centers, which means one cluster is mistakenly divided into two. (5) DPC shows limits in clustering data sets with high dimension, unevenly distributed density, and noise.
To improve DPC, a new algorithm is proposed from two aspects: density measurement and assignment of the remaining points. The classical DPC algorithm uses global density, which cannot effectively identify the density peaks in low-density areas. In this paper, the information of the samples within the $d_c$ neighborhood is employed to calculate the local relative density, in an attempt to recognize the centers of data sets with nonhomogeneous distribution. To solve the overclassification problem in DPC, a new assignment strategy is proposed that sorts the local densities and redefines the corresponding distances of the data samples. Based on the two improvements, a density peak clustering algorithm based on relative density optimization (RDO-DPC) achieves satisfactory clustering results on synthetic and real data sets with various density types and irregular shapes. The remainder of the paper is organized as follows: Section 2 introduces the definition and process of classical DPC and related works; the density peak clustering algorithm based on relative density optimization (RDO-DPC) is proposed in Section 3; experiments on synthetic and real data sets are shown in Section 4; and Section 5 gives the conclusion and prospects.

DPC Algorithm.
Clustering by fast search and find of density peaks (DPC) [15] can find clusters of various densities and shapes with a simple strategy. The fundamental principle of DPC is that ideal density peaks possess two essential features: (1) the local density of a peak is higher than the density of its neighbors; (2) the distances between different peaks are relatively large. To find density peaks meeting these two conditions, DPC introduces the local density $\rho_i$ of sample i and the corresponding distance $\delta_i$, which is the distance from i to j, where j is the sample nearest to i among those whose local density is higher than that of i.
Local density depends on distance, which means it can be regarded as a function of the distance, for example, a kernel function. One local density is defined by the cut-off kernel:

$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c), \tag{1}$$

where $d_{ij}$ represents the distance between points i and j, and the positive number $d_c$ is a user-specified parameter. The value of $\chi(x)$ is 1 if $x < 0$; otherwise, it is set as 0. The other local density is defined by the Gaussian kernel:

$$\rho_i = \sum_{j \neq i} \exp\left(-\left(\frac{d_{ij}}{d_c}\right)^2\right). \tag{2}$$

The parameter $d_c$ in equations (1) and (2) controls the influence of neighbors on a sample point, which plays the same role as the neighborhood radius ε. When the data set is large (in the number of points it contains), the clustering result from DPC is only slightly influenced by the cut-off distance; the influence grows as the data set becomes smaller. To limit the influence of the cut-off distance on the local density, and in turn on the clustering results, DPC employs the Gaussian kernel in equation (2) to calculate the density of the samples when it is used to cluster small-scale data.
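The two kernels can be compared on a toy example. The following is a minimal sketch in plain Python, assuming Euclidean distances and a sum over all other points; the function names and the four sample points are illustrative and do not come from the paper.

```python
import math


def pairwise_distances(points):
    """Return the full Euclidean distance matrix as a list of lists."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def cutoff_density(dist, d_c):
    """Cut-off kernel: count the other points that lie closer than d_c."""
    n = len(dist)
    return [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
            for i in range(n)]


def gaussian_density(dist, d_c):
    """Gaussian kernel: sum exp(-(d_ij / d_c)^2) over all other points."""
    n = len(dist)
    return [sum(math.exp(-(dist[i][j] / d_c) ** 2) for j in range(n) if j != i)
            for i in range(n)]


# Three nearby points and one isolated point (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
dist = pairwise_distances(points)
rho_cut = cutoff_density(dist, d_c=1.5)
rho_gauss = gaussian_density(dist, d_c=1.5)
```

With the cut-off kernel the three nearby points receive the same integer density, whereas the Gaussian kernel gives a continuous value that separates them, which is one reason it is preferred on small data sets.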
Another feature of ideal cluster centers is that the distance between different centers should be as large as possible. As a result, $\delta_i$, the distance from sample i to the nearest sample j whose local density is larger than that of i, is defined as

$$\delta_i = \begin{cases} \min_{j:\, \rho_j > \rho_i} d_{ij}, & \text{if some } j \text{ has } \rho_j > \rho_i, \\ \max_{j} d_{ij}, & \text{otherwise.} \end{cases} \tag{3}$$

The definition in equation (3) shows that if the density of sample i is the largest local density or the largest overall density, the distance $\delta_i$ of sample i is far larger than the distance $\delta_j$ of the neighbors of i. Therefore, cluster centers are often points with extremely large $\delta_i$, and the density $\rho_i$ of those center points is also very large. By constructing a decision graph of the distance δ against the density ρ, DPC selects sample points with relatively large ρ and δ as cluster centers. Each remaining point j is then assigned to the cluster of the sample that is nearest to j among those with higher density than j, which completes the assignment of the remaining points with high efficiency.
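The distance δ and the assignment of remaining points can be sketched as follows. This is an illustrative plain-Python version under stated assumptions: the densities are supplied by hand, the centers are passed in explicitly (in DPC they are picked from the decision graph), and the names `delta_and_parent` and `assign_remaining` are hypothetical.

```python
def delta_and_parent(dist, rho):
    """Distance to, and index of, the nearest sample with higher density."""
    n = len(rho)
    delta, parent = [0.0] * n, [-1] * n
    for i in range(n):
        denser = [j for j in range(n) if rho[j] > rho[i]]
        if denser:
            parent[i] = min(denser, key=lambda j: dist[i][j])
            delta[i] = dist[i][parent[i]]
        else:
            # No denser sample exists: use the largest distance from i.
            delta[i] = max(dist[i])
    return delta, parent


def assign_remaining(rho, parent, centers):
    """Give every non-center the label of its nearest denser neighbor."""
    labels = [-1] * len(rho)
    for k, c in enumerate(centers):
        labels[c] = k
    # Visit samples from dense to sparse so a parent is labelled before its children.
    for i in sorted(range(len(rho)), key=lambda i: -rho[i]):
        if labels[i] == -1 and parent[i] != -1:
            labels[i] = labels[parent[i]]
    return labels


# Two groups on a line, with hand-picked densities (hypothetical data).
xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(a - b) for b in xs] for a in xs]
rho = [2.0, 3.0, 2.5, 1.0, 2.8, 1.5]
delta, parent = delta_and_parent(dist, rho)
labels = assign_remaining(rho, parent, centers=[1, 4])
```

Samples 1 and 4 combine high density with large δ, so they stand out as centers; every other sample inherits its label through the chain of nearest denser neighbors, which is also why a single wrong link can propagate to the samples below it.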

Related Work.
Researchers have improved DPC [15] in many ways to adapt it to different applications, mainly focusing on the definition of cluster centers and the assignment strategy.
In terms of the definition of cluster centers, some scholars try to widen the gap between cluster centers and other sample points, so that cluster centers are easier to select in the decision graph, for example, through the normalization of local density and distance [16], the gravitational analogy minimum distance [17,18], and the parameter-free Laplacian centrality [19]. Although this kind of method widens the gap between density peaks and other points to a certain extent, it is still difficult to determine the cluster centers directly and effectively in some complex decision graphs, and manual selection is still needed. Therefore, other scholars have proposed methods to select the cluster centers quantitatively based on the decision graph, among which the most prominent are the fuzzy theory σ principle [20], the normal distribution 3σ criterion [21], the inflexion point [22] of the data distribution in the decision graph, the linear fitting of the distribution curve of the product of density and distance [23], and the Chebyshev inequality [24] or the upper bound of the generalized extremum [25]. This kind of method can automatically determine the potential cluster centers of the data set without human intervention. However, because of the influence of multiple density extrema, it is often necessary to merge subclusters to further optimize the sample allocation.
The remaining-point assignment strategy of the classical DPC is prone to chains of mistaken assignments. Many improvements have been proposed to modify it, such as the assignment of the remaining points based on the k nearest neighbors [26,27], the similarity measurement of the samples based on shared nearest neighbors [28], the combination of initial clusters with boundary samples [29] or density reachability [30], and the assignment of remaining points in combination with other algorithms [11,31].
The assignment strategies based on k nearest neighbors and shared nearest neighbors take full account of the neighbor information of samples, which helps to obtain a reasonable cluster assignment. However, considering only the distances between samples cannot reflect the impact of the real cluster attribution on the similarities between samples. The assignment strategy of the remaining points based on the combination of initial clusters works well on multiple density peaks, but it shows high time complexity. Moreover, some algorithms use DPC as the strategy for selecting initial cluster centers, which reduces the impact of initial center selection on the clustering results, but these algorithms all show high time complexity and are not suitable for clustering large-scale high-dimensional data.
For high-dimensional data with noise, a noise-filtering criterion is constructed based on the k nearest neighbors, and cluster center recognition and remaining-point assignment are conducted after the noise is filtered [26,27]. DenPEHC [23] treats sample points with a higher ratio of δ to ρ as noise, but errors and manual factors remain. Furthermore, dimension reduction is applied to high-dimensional data [32], and then sample points are assigned with the nearest-neighbor parameter k. In addition, geodesic distance [33,34] is used to calculate the manifold distance between data points, and isometric mapping is introduced to reduce the dimension of high-dimensional data sets. The above analysis shows that many improvements and optimizations have been proposed to solve the problems in DPC, with satisfying results. However, many problems still exist in the clustering of complex data sets, for example, uneven density of clusters, high dimensions, optimization of parameters, recognition of centers, noise treatment, and high time complexity.

RDO-DPC Algorithm
The proposed RDO-DPC improves the classical DPC in two aspects: the definition of local density and the assignment strategy of cluster members. Taking advantage of neighbor information, RDO-DPC defines a new measurement of relative density. Then, cluster centers are selected with the help of the decision graph, so as to obtain satisfying results when clustering data sets with uneven density between clusters. The remaining points are allocated according to the structure information of the data set, which effectively avoids the disadvantage of the one-step assignment strategy in DPC.
Recognizing the cluster centers of areas with different densities is the guarantee of effective clustering results. With the local density definition in equation (2), peaks of low-density areas are buried under high-density peaks because the local density of a dense area is much higher than that of a sparse area. In order to give prominence to peaks of sparse areas, the relative local density is defined in equation (4), where the radius of influence $d_c$ is the p quantile of the pairwise distances sorted from the smallest to the biggest, and $N_i$ is the number of samples in the spherical $d_c$ neighborhood of sample i. The revised local density $\rho_i$ is defined in equation (5), where the strict condition $d_{ij} \leq d_c$ is equivalent to a truncated Gaussian kernel function and eliminates the interference from samples far away. Compared with classical DPC, the relative local density in equations (4) and (5) can recognize the cluster centers of regions with different densities by employing a relative index rather than an absolute index.
The ideal cluster centers of DPC possess two features: one is that the local density is higher than the density of the surrounding samples, and the other is that cluster centers are far away from each other. This shows that distance is also important in the selection of cluster centers. As a result, cluster centers are often samples with a higher density and a larger distance. If there are two largest density peaks in one cluster, both points will be selected as cluster centers according to equation (3). The result is that one cluster is mistakenly divided into two clusters, which eventually leads to unsatisfactory clustering results. Therefore, the relative densities are ranked before the samples with density higher than $\rho_i$ and the shortest distance to sample i are determined, which helps to distinguish two largest density peaks. The corresponding distance of $q_i$ is defined as

$$\delta_{q_i} = \begin{cases} \min_{j < i} d_{q_i q_j}, & i \geq 2, \\ \max_{j \geq 2} \delta_{q_j}, & i = 1, \end{cases} \tag{6}$$

where $\{q_i\}_{i=1}^{n}$ represents the subscript sequence of one descending order of $\{\rho_i\}_{i=1}^{n}$, satisfying $\rho_{q_1} \geq \rho_{q_2} \geq \cdots \geq \rho_{q_n}$.
If the local densities of the two biggest peaks $q_i$ and $q_j$ in a data set, computed according to equation (2) or (4), are very close, it is hard to identify the real peak in the decision graph, and $q_i$ and $q_j$ may each be recognized as a cluster center. After the two peaks are ranked, if $\rho_{q_i} \geq \rho_{q_j}$, the distance corresponding to $q_i$ is set as the largest corresponding distance of the other density peaks with equation (6). The distance corresponding to $q_j$ is the distance between $q_i$ and $q_j$, which reduces the value of $\delta_{q_j}$. As a result, $q_j$ is no longer a cluster center.
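The effect of ranking on tied peaks can be illustrated with a short sketch. It assumes that equation (6) takes the rank-based form described in the text: every sample looks only at the samples ranked before it, and the top-ranked sample receives the largest δ of all the others. The function name `ranked_delta` and the sample data are hypothetical.

```python
def ranked_delta(dist, rho):
    """Rank-based distance: each sample uses only the samples ranked before it."""
    n = len(rho)
    # q[0] is the densest sample; sorted() is stable, so ties keep index order.
    q = sorted(range(n), key=lambda i: -rho[i])
    delta, parent = [0.0] * n, [-1] * n
    for pos in range(1, n):
        i = q[pos]
        # Nearest sample among those ranked higher than i.
        j = min(q[:pos], key=lambda k: dist[i][k])
        parent[i] = j
        delta[i] = dist[i][j]
    if n > 1:
        # The top-ranked sample takes the largest delta of the other samples.
        delta[q[0]] = max(delta[k] for k in q[1:])
    return delta, parent


# Samples 0 and 1 are close together and tie for the highest density.
xs = [0.0, 0.5, 10.0, 10.5]
dist = [[abs(a - b) for b in xs] for a in xs]
rho = [3.0, 3.0, 2.0, 1.0]
delta, parent = ranked_delta(dist, rho)
```

Under the strict comparison of equation (3), samples 0 and 1 would both have no denser neighbor and both would receive a very large δ. With the ranking, sample 1 gets δ = 0.5 (its distance to sample 0), so only sample 0 remains a candidate center for that cluster.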
Combining equations (5) and (6), the peaks of areas with a large density difference are easy to recognize in the decision graph, and the discriminability is strengthened by the decision distances corresponding to the peaks. Therefore, a stronger generalization ability is obtained. The resulting RDO-DPC algorithm is shown in Algorithm 1.
RDO-DPC takes relative density as its measurement of density. With relative density, the density calculation of each point is restricted to the $d_c$ scope, and the values are only related to points inside the $d_c$ neighborhood. The relative closeness of a sample to the samples inside its $d_c$ scope is revealed more clearly, as is the local information of each point and its neighbors inside that scope. Therefore, RDO-DPC suits not only data sets with relatively even density between clusters but also data sets with obvious density differences between clusters. The time complexity of RDO-DPC is $O(n^2)$, which consists of the measurement of the relative local density ρ and the assignment of remaining points based on the nearest distance δ. The computation of ρ requires the Euclidean distances between sample points and the determination of the $d_c$ neighborhood, whose computing complexity is $O(n^2)$. The assignment strategy of the remaining points based on the nearest distance δ employs a classical sorting algorithm, whose computing complexity is $O(n^2)$.

Experiments
In this section, 8 synthetic and 7 real data sets are employed to test the proposed algorithm. The data sets differ greatly from each other in density distribution, scale, shape, and so on. Among those data sets, DS1-DS5, aggregation, compound, and flame are synthetic two-dimensional data sets, which are shown in Figure 1. The 7 real data sets are from the UCI machine learning repository.
In the experiments, the clustering results of RDO-DPC are compared with those of the classical DPC. Both algorithms need the setting of the cut-off distance $d_c$, which is defined as the distance at p% in the ascending sequence of all distances among samples. The clustering results are measured with AMI (adjusted mutual information) and ARI (adjusted Rand index) [35]. Both indexes have an upper bound of 1, with values near 0 corresponding to chance-level agreement, and the larger the value is, the better the clustering result is. Besides, the clustering results of the two-dimensional synthetic data sets are labelled with different colors, and the centers are labelled with red stars to give a clear view of the results. The results shown in this section are the best results of both RDO-DPC and DPC with their best parameters. In this way, the algorithms are better judged concerning their adaptability to data sets of different types and their clustering effectiveness.
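The choice of $d_c$ from the ratio parameter p can be sketched as follows. This is a minimal plain-Python illustration; the rounding rule used to turn p% into a position in the sorted list is an assumption, since the text does not specify it, and `cutoff_distance` is a hypothetical name.

```python
import math


def cutoff_distance(points, p):
    """Return the distance found at p% of the ascending list of pairwise distances."""
    n = len(points)
    d = sorted(math.dist(points[i], points[j])
               for i in range(n) for j in range(i + 1, n))
    # Assumed rounding rule: nearest position, clamped to the valid index range.
    k = min(len(d) - 1, max(0, round(len(d) * p / 100) - 1))
    return d[k]


# Four points on a line give the sorted distances [1, 2, 3, 3, 5, 6].
points = [(0.0,), (1.0,), (3.0,), (6.0,)]
d_c_20 = cutoff_distance(points, 20)
d_c_50 = cutoff_distance(points, 50)
```

A larger p yields a larger $d_c$ and therefore a wider neighborhood for every sample, which is why the clustering indexes vary with p.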
Eight two-dimensional synthetic data sets are employed to test the clustering efficiency of RDO-DPC and DPC. Both algorithms found centers quickly and assigned the remaining samples effectively. Some comparative visualization results on the synthetic data sets are shown in Figure 2, in which sparse clusters are recognized and excessive clustering is avoided. The validation of the comparative clustering results on the 8 synthetic data sets is shown in Table 1, which includes ARI, AMI, and their variances. The variances of ARI and AMI are expressed as "ARI.Var" and "AMI.Var" in the table. With different values of the parameter p, AMI and ARI vary to some degree. The variance in the accuracy over different values of p is given in the table to show the validity of the RDO-DPC algorithm. Furthermore, the ARI and AMI listed in the table are the best values obtained with the proper parameter p. Compared with the DPC algorithm, RDO-DPC exhibits superior performance in clustering data sets with extremely large density differences among clusters and with various shapes. The comparison of the quantitative indexes between RDO-DPC and DPC shows the clear superiority of RDO-DPC.
RDO-DPC is slightly lower than DPC in the clustering indexes of DS1 but clearly higher than DPC in the indexes of the other data sets. The superior performance of RDO-DPC comes from its use of relative density when clustering data sets with uneven density among clusters. Therefore, RDO-DPC can recognize cluster centers more effectively and correctly and assign the remaining points correctly, thus achieving better clustering results than DPC.
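For reference, ARI can be computed from pair counts. The sketch below implements the standard pair-counting formula in plain Python and assumes at least two samples; it is not the evaluation code used for the tables, and the function name is illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the contingency counts of two labelings."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    # Number of sample pairs that fall in the same cell, row, or column.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index ignores how clusters are named, so a relabelled but identical partition scores 1, while a partition that cuts across the true clusters scores at or below the chance level.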
Seven real data sets from the UCI machine learning repository are employed to test the performance of RDO-DPC and classical DPC. These benchmark data sets include data of high dimensions, complicated structures, and various shapes. With different values of the parameter p, the efficiency of the two algorithms varies slightly. AMI and ARI are employed to measure the different clustering results, and the variance in the accuracy and the best parameters are listed in Table 2.

ALGORITHM 1: RDO-DPC.
Input: Sample matrix $X \in R^{n \times m}$ and cut-off ratio parameter p
Output: Clustering label $y \in R^{n}$
(1) Calculate distance matrix
(2) Calculate relative local density $\rho_i$ according to equation (4)
(3) Calculate distance $\delta_{q_i}$ with equation (6)
(4) Draw decision graph and select cluster centers
(5) Assign points to centers according to the nearest distance principle
(6) Clustering result

Mathematical Problems in Engineering
From Table 2, the comparative results of the two algorithms on real data sets show the superior performance of the proposed RDO-DPC, which can find the centers and meaningful groups of real data sets. For the data set Wdbc in particular, DPC could not find cluster centers or recognize meaningful groups because of its deficiency in clustering high-dimensional data. The ARI and AMI of the proposed RDO-DPC show that it performs well on high-dimensional data. The robustness of the new algorithm is also considered. In RDO-DPC, $d_c$ is important because it is used to determine the relative density of each sample, which has an impact on many critical steps in clustering. The value of $d_c$ is determined by the parameter p, which means that p governs the performance of RDO-DPC. Figure 3 shows the influence of different values of p on the ARI and AMI of some synthetic and real data sets. Based on the experiments, the robust interval of p for the proposed algorithm is suggested to be from 10 to 20. As shown in Figure 3, the accuracy of the new algorithm remains stable overall with respect to p. The above comparative results on synthetic and real data sets show that the proposed RDO-DPC is effective in clustering data sets with extremely large density differences among clusters and with various shapes, and that the algorithm is robust overall. For data sets with a low number of records and a huge number of features, the new algorithm also shows a certain efficiency, although clustering such data sets is difficult.

Conclusions
Based on the neighborhood information of samples, relative density is introduced in this paper. The relative density describes the density of each sample relative to the samples around it and takes full advantage of the information of adjacent samples, thus facilitating the effective discovery of centers and the distinction of clusters of different densities. In addition, the assignment strategy of the original DPC is improved. The experiments on different types of data sets show that the proposed algorithm performs effectively on data sets with arbitrary shapes, uneven density, and high dimensions, avoiding the mistaken assignment of samples that occurs in the original DPC. Compared with classical DPC, the proposed RDO-DPC considers not only the local density of the samples but also the relative density, which enables RDO-DPC to cluster data sets with uneven density with higher efficiency. For further research, the reduction of computational complexity is still an important problem.

Data Availability
The 7 real data sets used in this paper are from the UCI machine learning repository. The other data sets used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.