Clustering is an important unsupervised machine learning method that can efficiently partition points without a training data set. However, most existing clustering algorithms require parameters to be set manually, and the clustering results are strongly influenced by these parameters, so optimizing clustering parameters is a key factor in improving clustering performance. In this paper, we propose a parameter-adaptive clustering algorithm, DDPA-DP, based on the density-peak algorithm. In DDPA-DP, all parameters are adaptively adjusted in a data-driven way, so the accuracy of clustering is greatly improved while the time complexity is not obviously increased. To evaluate the performance of DDPA-DP, a series of experiments is designed with several artificial data sets and a real application data set, and the clustering results of DDPA-DP are compared with those of several typical algorithms. The results show that DDPA-DP has an obvious advantage in accuracy over all of them, while its time complexity is close to that of classical DP-Clust.

Clustering is one of the most important methods in machine learning; by clustering, data points are partitioned into several groups [

More and more research has focused on designing highly efficient clustering algorithms, and this research can be divided into four categories: partition-based methods, such as K-means [

Based on the above analysis, the manual setting of clustering parameters has been a key factor limiting the performance of clustering, so some researchers have focused on optimizing parameters to improve clustering efficiency: FEAC, proposed by Silva, can adapt the number of clusters to overcome this defect of K-means [

In 2014, Rodriguez proposed a density-based clustering algorithm named DP-Clust. The basic idea of DP-Clust is that the center of each cluster is located at a peak of the local-density curve, the borders are located in the neighborhoods of the centers, and outliers lie far away from high-density areas [

Local density is an attribute that measures the density state of a point

In (

The distance from the nearest neighbor with larger local density than

According to (

After obtaining the
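As a sketch of the two density attributes just introduced, the snippet below computes each point's local density (with the common cutoff kernel, assuming Euclidean distance) and its distance to the nearest neighbor of higher density; the exact kernel used in the paper may differ, so treat this as an illustration of the DP-Clust idea rather than the paper's implementation.

```python
import numpy as np

def density_peaks_attributes(points, d_c):
    """Compute each point's local density rho (cutoff kernel: the number
    of neighbors closer than d_c) and delta, the distance to its nearest
    neighbor of higher local density. The global density peak receives
    the maximum distance as its delta, as in DP-Clust."""
    n = len(points)
    # pairwise Euclidean distance matrix
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # count neighbors inside the cutoff radius, excluding the point itself
    rho = (dist < d_c).sum(axis=1) - 1
    delta = np.zeros(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if higher.size:
            delta[i] = dist[i, higher].min()
        else:
            # point of globally maximal density: use the largest distance
            delta[i] = dist[i].max()
    return rho, delta
```

Cluster centers then show up as points where both attributes are large, while outliers have a large delta but a small rho.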

Because only local density, rather than global density, needs to be computed, DP-Clust has an obvious advantage over DBSCAN in clustering fields of nonuniform density, and its clustering procedure is simple to deploy in applications. DP-Clust has now drawn much attention, and many density-peak-based clustering algorithms have been proposed [

To reduce the influence of the initial parameter setting of DP-Clust, two problems must be resolved: one is how to determine the thresholds of local density

In ref [

Besides the defects of setting thresholds for centers and outliers, the algorithms mentioned above adopt a fixed, preset local-field radius to accomplish clustering, which greatly restricts their performance in complex environments. Currently, there are two kinds of effective algorithms that focus on optimizing the local-field radius of density-peak clustering: one adopts a K-nearest-neighbors-based method to define the local field instead of a radius; the other directly optimizes the local field's radius to reduce the effect of the initial setting.

DPC-KNN is a classical KNN-based algorithm [
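To illustrate the KNN-based idea, the sketch below replaces the fixed-radius density with a density computed from each point's K nearest neighbors; the particular formula (the exponential of the negative mean squared neighbor distance) is the one commonly used in DPC-KNN-style algorithms and is assumed here for illustration.

```python
import numpy as np

def knn_density(points, k):
    """KNN-based local density: instead of counting points inside a
    fixed radius, use each point's k nearest neighbors. The density
    rho_i = exp(-mean of squared distances to the k nearest neighbors)
    follows the common DPC-KNN-style definition (an assumption here)."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # sort each row; column 0 is the point itself (distance 0), so skip it
    knn_d = np.sort(dist, axis=1)[:, 1:k + 1]
    return np.exp(-(knn_d ** 2).mean(axis=1))
```

Because k scales with the data rather than with a geometric radius, this density adapts naturally to clusters of different local densities.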

FKNN-DPC [

Comparing (

In ref [

In (

From the applications of clustering, it can be concluded that three key problems should be resolved when designing a clustering algorithm: accuracy on arbitrary data sets, parameter independence, and time and memory complexity. However, based on the above analysis, existing research suffers from these defects to varying degrees, so the problems are not well resolved, and they have become the major obstacles restricting the application of clustering.

To achieve the goal of improving clustering performance, we propose a fully data-driven parameter-adaptive clustering algorithm based on density peaks (DDPA-DP), in which the parameter of the local field's radius

The flow of DDPA-DP.

In the first step of DDPA-DP,

Following the idea of DP-Clust, we propose a series of fitted curves to predict the combined value of

When a point’s

A simple model of clusters.

Assuming

In (

The distribution of

The corresponding points of centers.

After detecting centers, outliers should be detected from the remaining points. So a variable

By (

The distribution of

The corresponding points of outliers.

As in the operations of detecting centers and outliers, pending points can be detected by finding the points with smaller

The corresponding points of pending points.

After the centers, outliers, and pending points are detected, the remaining points can be regarded as borders of clusters; every border then joins the nearest center to form a cluster, and the result is shown in Figure
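The border-assignment step described here can be sketched as follows (assuming the centers and borders have already been detected by the preceding steps; the index lists are hypothetical inputs for illustration):

```python
import numpy as np

def assign_borders(points, center_idx, border_idx):
    """Assign each border point to the cluster of its nearest center.
    Points whose roles are not passed in (e.g. outliers) keep label -1."""
    labels = -np.ones(len(points), dtype=int)
    labels[center_idx] = np.arange(len(center_idx))  # one cluster per center
    for i in border_idx:
        # distance from border point i to every center; pick the closest
        d = np.linalg.norm(points[center_idx] - points[i], axis=1)
        labels[i] = int(np.argmin(d))
    return labels
```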

The distribution of points’ roles when

In Figure

In (

The points’ distribution is shown in Figure

The distribution of points’ roles when

The distribution of points’ roles when

Although the result of Figure

Time complexity is an important consideration in designing a clustering algorithm because large data sets must be processed. In DDPA-DP, the role-detection step has complexity O(n^2); then, in the local-field-radius optimizing step, only the pending points need to be redetected, and since their average number is far smaller than n, this step stays within O(n^2) as well. Because the number of optimization rounds is limited, the total time complexity of DDPA-DP is O(n^2), which is the same as DP-Clust. Based on this analysis, it can be concluded that DDPA-DP maintains relatively high performance in complexity while achieving parameter independence.

To demonstrate the advantages of DDPA-DP, a series of experiments is designed and simulated, and three typical clustering algorithms, DBSCAN [

The parameters of experiment data set.

Data set | Points | Clusters | Dimensions
---|---|---|---

Aggregation | 788 | 7 | 2 |

Pathbased | 300 | 3 | 2 |

Spiral | 312 | 3 | 2 |

Jain | 373 | 2 | 2 |

DIM | 1024 | 16 | 32 |

KDDCUP04Bio | 145751 | 2000 | 74 |

GL1 | 280307 | 16 | 18 |

The distribution of JAIN.

The distribution of Spiral.

The distribution of Pathbased.

The distribution of Aggregation.

Accuracy is one of the most important performance measures of a clustering algorithm. To compare the clustering accuracy of different algorithms, the clustering purity of each algorithm on the same data set is calculated; clustering purity has been used in most research to judge clustering accuracy [

In (
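The purity measure referred to above can be computed as in the sketch below: each predicted cluster is credited with its best-matching true class, and the matched counts are summed and normalized by the total number of points.

```python
from collections import Counter

def purity(pred_labels, true_labels):
    """Clustering purity: for every predicted cluster, count how many of
    its points belong to that cluster's majority true class, then divide
    the total by the number of points. Returns a value in (0, 1]."""
    clusters = {}
    for p, t in zip(pred_labels, true_labels):
        clusters.setdefault(p, []).append(t)
    majority_total = sum(max(Counter(members).values())
                         for members in clusters.values())
    return majority_total / len(pred_labels)
```

A purity of 1.0 means every cluster contains points of a single true class; note that purity rewards over-segmentation, which is why it is usually reported together with the number of clusters.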

Before clustering, initial parameters should be preset for different data sets: in DBSCAN, DP-Clust, DCore, DCNaN, and DDPA-DP, the initial parameter to be set is the radius of the local field, while in DPC-KNN and FKNN-DPC, it is the number of neighbors of every point. In this paper, the initial parameters are also defined in a data-driven way instead of by experience, and to simulate different applications comprehensively, three initial states are designed in this section to better illustrate the parameter independence and accuracy of the different algorithms. The first parameter state sets the local field's radius
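One common data-driven way to seed the local-field radius in DP-Clust-style algorithms is to take a low percentile of all pairwise distances, so that each point has on average a small fraction of the others inside its local field; the exact rules used for the three states here are given by the equations above, so the snippet below is only a hedged sketch of the general heuristic.

```python
import numpy as np

def initial_radius(points, percentile=2.0):
    """Seed the local-field radius as a low percentile of the pairwise
    distances (the common DP-Clust heuristic: on average each point then
    has roughly that fraction of the others in its local field). This is
    a sketch of the general data-driven idea, not the paper's exact rule."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    iu = np.triu_indices(len(points), k=1)  # keep each pair once
    return np.percentile(dist[iu], percentile)
```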

In (

The second parameter state sets the local field's radius

The third parameter state sets the local field's radius

In (

Among the three initial local-field radii,

In Figure

The simulation results in JAIN.

In Figure

The simulation results in Spiral.

In Figures

The simulation results in Pathbased.

The simulation results in Aggregation.

In Figures

The simulation results in DIM.

The simulation results in KDDCUP04Bio.

The simulation results in GL1.

In Figure

In Figure

From the simulation results in this section, it can be concluded that DDPA-DP has an obvious advantage in clustering accuracy, because its parameters are continuously adapted by the data-driven method: the parameters are optimized to improve clustering accuracy, and the optimized parameters reduce the influence of the initial values, which keeps the clustering accuracy stable at a high level. The advantages of DDPA-DP become more obvious as the data set grows more complex, so DDPA-DP is better suited to big-data applications.

Runtime is also an important criterion for evaluating the performance of a clustering algorithm, and it can be used to estimate the time complexity of clustering. In this section, the runtime of DDPA-DP is compared with those of DBSCAN, DP-Clust, DPC-KNN, FKNN-DPC, DCore, and DCNaN, and the results of the experiments are shown in Figures

The runtimes in small data sets.

The runtimes in large data sets.

In Figure

In Figure

Based on the classical density-based clustering algorithm DP-Clust, we have proposed a parameter-adaptive clustering algorithm named DDPA-DP in this paper. The data-driven idea runs through the design of DDPA-DP: first, a series of fitted curves is established to automatically detect points' roles from their density attributes instead of using any artificial thresholds; meanwhile, a new point role, the "pending point", is defined, and the local field's radius is then adaptively optimized according to the change in the number of pending points.

DDPA-DP improves the flexibility of clustering by avoiding the influence of artificial parameters, and its time complexity is not significantly increased compared with DP-Clust because little extra calculation is added to optimize the parameters. A series of experiments is designed to compare DDPA-DP with several existing clustering algorithms; in these experiments, some typical synthetic data sets and a real-world data set from the thermal power industry are simulated under different initial conditions to evaluate the algorithms comprehensively. From the results of the experiments, it can be concluded that DDPA-DP has an advantage in both clustering accuracy and time complexity.

All artificial data sets can be downloaded from the following website:

The authors declare that there are no conflicts of interest regarding the publication of this paper.

This research is supported by the Natural Science Foundation of China under Contract no. 60573065 and the Science and Technology Development Plan of Shandong Province under Contract no. 2014GGX101039, and it is partially supported by the Natural Science Foundation of China under Contract no. 60903176.