Among the many clustering algorithms, clustering by fast search and find of density peaks (DPC) is favoured because it is little affected by the shapes and density structures of the data set. However, DPC still performs poorly on data sets with heterogeneous cluster densities and easily makes mistakes when assigning the remaining points. A new algorithm, density peak clustering based on relative density optimization (RDO-DPC), is proposed to settle these problems and obtain better results. With the help of the neighborhood information of sample points, the proposed algorithm defines a relative density of the sample data and searches for and recognizes the density peaks of nonhomogeneous distributions as cluster centers. A new assignment strategy is proposed to solve the misassignment problem for the remaining points. Experiments on synthetic and real data sets show the good performance of the proposed algorithm.

As an unsupervised machine learning technique, clustering groups sample data into reasonable classes based on the similarity between sample points. The process tries to make the similarity between samples within the same cluster as high as possible and the similarity between samples in different clusters as low as possible. Many different types of clustering algorithms have been proposed for different applications. In general, clustering can be divided into divisive clustering [

Clustering by fast search and find of density peaks (DPC) [

To improve DPC, a new algorithm is proposed that addresses two aspects: density measurement and the assignment of the remaining points. The classical DPC algorithm uses a global density, which cannot effectively identify density peaks in low-density areas. In this paper, the

The remainder of the paper is organized as follows: Section

Clustering by fast search and find of density peaks (DPC) [

Local density depends on distance, which means it can be regarded as a function of the distance, for example, a kernel function. One form of local density is defined by the cut-off kernel:
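As an illustrative sketch (assuming Euclidean distances; the cut-off distance is the only parameter), the cut-off kernel density simply counts how many other points lie within the cut-off distance of each point:

```python
import numpy as np

def cutoff_density(X, dc):
    """Cut-off kernel local density: rho_i is the number of points
    within distance dc of point i (the point itself excluded)."""
    # Pairwise Euclidean distance matrix.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Indicator chi(d_ij - dc) = 1 when d_ij < dc; subtract 1
    # to remove the zero self-distance on the diagonal.
    return (dist < dc).sum(axis=1) - 1
```

Because the kernel only counts neighbours, the resulting densities are integers and are sensitive to the choice of the cut-off distance, which is why the choice of that parameter is discussed in the experiments below.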

Another feature of an ideal cluster center is that different centers should be as far from each other as possible. As a result,

The definition in equation (
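This distance can be sketched as follows, assuming a precomputed distance matrix and the usual DPC convention that the point of globally highest density receives the maximum distance:

```python
import numpy as np

def delta_distance(dist, rho):
    """For each point, delta_i is the distance to the nearest point of
    strictly higher density; the global density peak is assigned the
    maximum of its distances to all other points."""
    n = len(rho)
    delta = np.zeros(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if higher.size == 0:
            delta[i] = dist[i].max()      # global peak convention
        else:
            delta[i] = dist[i, higher].min()
    return delta
```

Cluster centers then stand out in the decision graph as points for which both the density and this distance are large; practical implementations usually also break density ties by a fixed ordering.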

The researchers have improved the DPC [

In terms of the definition of cluster centers, some scholars try to enlarge the differentiation between cluster centers and other sample points so as to make the centers easier to select from the decision graph, for example by normalizing local density and distance [

The remaining-points assignment strategy of the classical DPC is prone to chained misassignments. Many improvements have been proposed to modify this strategy, such as distributing the remaining points based on the

For high-dimensional data with noise, a noise-filtering standard is constructed based on the nearest

The proposed RDO-DPC improves the classical DPC in two aspects: the definition of local density and the assignment strategy for cluster members. Taking advantage of neighbor information, RDO-DPC defines a new relative density measure. Cluster centers are then selected with the help of the decision graph, so as to obtain satisfying results on data sets with uneven density between clusters. The remaining points are allocated according to the structural information of the data set, which effectively avoids the disadvantage of the one-step assignment strategy in DPC.

Recognizing the cluster centers of areas with different densities is the prerequisite for effective clustering results. Peaks in low-density areas are buried by high-density peaks under the local density definition in equation (

The ideal cluster centers of DPC possess two features: their local density is higher than that of the surrounding samples, and they are far away from each other. This shows that distance is also important in the selection of cluster centers. As a result, cluster centers are usually samples with both a higher density and a larger distance. If the two largest density peaks lie in one cluster, both points will be selected as cluster centers according to equation (

Combined with equations (

Input: sample matrix.

Output: clustering label.

(1) Calculate the distance matrix.

(2) Calculate the relative local density.

(3) Calculate the distance.

(4) Draw the decision graph and select the cluster centers.

(5) Assign the remaining points to centers according to the nearest-distance principle.

(6) Output the clustering result.
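The steps above can be sketched end to end as follows. Note that the relative-density formula used here (a point's cut-off density divided by the mean density of its k nearest neighbours) is an assumed form for illustration only, not necessarily the paper's exact definition, and the nearest-centre assignment in the last step stands in for the paper's structure-based strategy:

```python
import numpy as np

def rdo_dpc_sketch(X, dc, k, n_centers):
    """Illustrative RDO-DPC-style pipeline (assumed relative-density form)."""
    n = len(X)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))          # step 1: distance matrix
    rho = (dist < dc).sum(axis=1) - 1.0               # cut-off local density
    knn = np.argsort(dist, axis=1)[:, 1:k + 1]        # k nearest neighbours
    rel_rho = rho / (rho[knn].mean(axis=1) + 1e-12)   # step 2: relative density (assumed)
    # Step 3: delta = distance to the nearest point ranked higher in
    # density; ranking by sorted order also breaks density ties.
    order = np.argsort(-rel_rho, kind="stable")
    delta = np.empty(n)
    delta[order[0]] = dist[order[0]].max()
    for pos in range(1, n):
        i = order[pos]
        delta[i] = dist[i, order[:pos]].min()
    # Step 4: in place of a visual decision graph, take the n_centers
    # largest products rel_rho * delta as cluster centers.
    centers = np.argsort(-rel_rho * delta, kind="stable")[:n_centers]
    # Step 5: assign every point to its nearest centre.
    return centers[np.argmin(dist[:, centers], axis=1)]
```

With a suitable cut-off distance and neighbourhood size, two well-separated groups of points receive two different labels; on real data the decision graph is inspected instead of fixing the number of centers in advance.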

RDO-DPC takes relative density as the measurement of density. With relative density, the density calculation of each point is restricted to

The time complexity of RDO-DPC is

In this section, 8 synthetic and 7 real data sets are employed to test the newly proposed algorithm. The data sets differ greatly from each other in density distribution, scale, shape, and so on. Among them, DS1–DS5, aggregation, compound, and flame are synthetic two-dimensional data sets, which are shown in Figure

Two-dimensional exhibition of 8 synthetic data sets: (a) DS1; (b) DS2; (c) DS3; (d) DS4; (e) DS5; (f) aggregation; (g) compound; (h) flame.

In the experiment, the clustering results of RDO-DPC are compared with those of the classical DPC. Both algorithms, RDO-DPC and DPC, need the setting of the cut-off distance

The clustering results are measured with AMI (adjusted mutual information) and ARI (adjusted Rand index) [
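Both indices are chance-corrected, so random labelings score near zero and identical partitions score one, regardless of how the cluster labels are named. For reference, the pair-counting ARI can be computed from the contingency table of the two labelings in a few lines (AMI is analogous but information-theoretic):

```python
import numpy as np

def comb2(x):
    """Number of unordered pairs that can be drawn from x items."""
    return x * (x - 1) // 2

def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labelings of the same samples."""
    a, b = np.asarray(a), np.asarray(b)
    _, ca = np.unique(a, return_inverse=True)
    _, cb = np.unique(b, return_inverse=True)
    # Contingency table n_ij: samples with class i in `a` and cluster j in `b`.
    table = np.zeros((ca.max() + 1, cb.max() + 1), dtype=np.int64)
    np.add.at(table, (ca, cb), 1)
    sum_ij = comb2(table).sum()
    sum_a = comb2(table.sum(axis=1)).sum()
    sum_b = comb2(table.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(len(a))
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Relabeling a partition does not change the score, which is exactly why ARI (rather than raw accuracy) is suitable for comparing clusterings.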

Eight two-dimensional synthetic data sets are employed to test the clustering efficiency of RDO-DPC and DPC. Both algorithms found the centers quickly and assigned the remaining samples effectively. Some comparative visualization results on the synthetic data sets are shown in Figure

Some comparative clustering results of RDO-DPC and DPC with proper parameters on some synthetic data sets: (a) RDO-DPC on DS2; (b) RDO-DPC on DS3; (c) RDO-DPC on DS4; (d) DPC on DS2; (e) DPC on DS3; (f) DPC on DS4.

The quantitative comparison of the clustering results on the 8 synthetic data sets is shown in Table

Clustering results measured with AMI and ARI on eight synthetic data sets.

Algorithm | RDO-DPC | | | | | DPC | | |
---|---|---|---|---|---|---|---|---|
 | ARI | ARI.Var | AMI | AMI.Var | | ARI | AMI | |
DS1 | 0.988 | 0.010 | 0.980 | 0.004 | 10.0 | 0.994 | 0.989 | 2.0 |
DS2 | 0.979 | 0.000 | 0.959 | 0.000 | 12.8 | 0.585 | 0.606 | 2.0 |
DS3 | 1.000 | 0.035 | 1.000 | 0.013 | 8.0 | — | 0.095 | 2.0 |
DS4 | 1.000 | 0.003 | 1.000 | 0.004 | 8.1 | 0.015 | 0.041 | 1.5 |
DS5 | 0.904 | 0.000 | 0.829 | 0.000 | 10.0 | 0.691 | 0.696 | 1.5 |
Aggregation | 0.895 | 0.002 | 0.882 | 0.000 | 13.0 | 0.755 | 0.860 | 2.0 |
Compound | 0.789 | 0.008 | 0.773 | 0.002 | 12.5 | 0.546 | 0.697 | 2.0 |
Flame | 0.476 | 0.002 | 0.421 | 0.003 | 13.0 | 0.327 | 0.403 | 2.0 |

ARI.Var and AMI.Var represent the variance in ARI and AMI, respectively.

The comparison of the quantitative indexes between RDO-DPC and DPC shows the obvious superiority of RDO-DPC. RDO-DPC is slightly lower than DPC on the indexes of DS1 but clearly higher on those of the other data sets. The superior performance of RDO-DPC comes from its use of relative density when clustering data sets with uneven density among clusters. Therefore, RDO-DPC can recognize cluster centers more effectively and assign the remaining points correctly, thus achieving better clustering results than DPC.

Seven real data sets from the UCI machine learning repository are employed to test the performance of RDO-DPC and the classical DPC. These benchmark data sets include high-dimensional data, complicated structures, and various shapes. With different parameter

Clustering results measured with AMI and ARI on real data sets.

Algorithm | RDO-DPC | | | | | DPC | | |
---|---|---|---|---|---|---|---|---|
 | ARI | ARI.Var | AMI | AMI.Var | | ARI | AMI | |
Iris | 0.759 | 0.008 | 0.793 | 0.002 | 13.0 | 0.720 | 0.767 | 2.0 |
Seeds | 0.787 | 0.005 | 0.736 | 0.002 | 6.8 | 0.734 | 0.717 | 2.0 |
Wine | 0.742 | 0.002 | 0.746 | 0.000 | 2.0 | 0.672 | 0.706 | 2.0 |
E. coli | 0.796 | 0.002 | 0.699 | 0.001 | 9.5 | 0.309 | 0.443 | 2.0 |
Wdbc | 0.850 | 0.001 | 0.762 | 0.001 | 10.5 | — | — | 1.0 |
Zoo | 0.883 | 0.040 | 0.771 | 0.015 | 8.0 | 0.363 | 0.321 | 2.0 |
Glass | 0.266 | 0.000 | 0.378 | 0.001 | 4.5 | 0.224 | 0.246 | 1.0 |

From Table

The robustness of the new algorithm is also considered. In RDO-DPC,

Figure

Parameter sensitivity analysis (measured with ARI and NMI) of RDO-DPC on some synthetic and real data sets.

The above comparative results on synthetic and real data sets show that the newly proposed RDO-DPC algorithm is effective in clustering data sets with extremely large density differences among clusters and with various shapes, and the algorithm is robust overall. On data sets with few records and a huge number of features, the new algorithm also shows a certain efficiency, although clustering such data sets is difficult.

Based on the neighborhood information of samples, relative density is introduced in this paper. It describes the density of each sample relative to the samples around it and takes full advantage of the information of adjacent samples, thus facilitating the effective identification of centers and the distinction of clusters with different densities. In addition, the assignment strategy of the original DPC is also improved. Experiments on different types of data sets show that the proposed algorithm performs effectively on data sets with arbitrary shapes, uneven density, and high dimensions, avoiding the mistaken assignments of the original DPC. Compared with the classical DPC, the proposed RDO-DPC considers not only the local density of the samples but also the relative density, which enables it to cluster data sets with uneven density more effectively. For further research, reducing the computational complexity remains an important problem.

The 7 real data sets used in this paper are from the UCI machine learning repository. The other data sets used to support the findings of this study are available from the corresponding author upon request.

The authors declare that they have no conflicts of interest.

The study presented in this article was supported by the National Science Foundation of China (Grant nos. 61305070 and 61703001).