Clustering is widely used in data analysis, and density-based methods have developed rapidly over the past 10 years. Although state-of-the-art density peak clustering algorithms are efficient and can detect clusters of arbitrary shape, they are essentially nonspherical centroid-based methods. In this paper, a novel local density hierarchical clustering algorithm based on reverse nearest neighbors, RNN-LDH, is proposed. First, a reverse nearest neighbor graph is constructed and used to find the extended core regions, which serve as initial clusters. Then, a new local density metric is defined to calculate the density of each object, and the density hierarchical relationships among the objects are built according to their densities and neighbor relations. Finally, each unclustered object is assigned to one of the initial clusters or labeled as noise. Experimental results on synthetic and real data sets show that RNN-LDH outperforms current clustering methods based on density peaks or reverse nearest neighbors.
Clustering is the task of finding a set of groups such that similar objects fall in the same group and dissimilar objects are separated into different groups. Because clustering can uncover inherent, latent, and previously unknown knowledge, principles, or rules in real-world data, it has been widely used in many fields, including data mining, pattern recognition, machine learning, information retrieval, image analysis, and computer graphics [
In density-based clustering, clusters are considered to be dense regions of objects separated by low-density regions that represent noise. The clustering procedure can be broken into two steps: estimating the density of each object and grouping density-connected objects.
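As an illustration only, the two steps can be sketched as follows. The sketch is not RNN-LDH: it assumes a fixed radius `eps` and a count threshold `min_pts` (both hypothetical parameters, in the style of the classic radius-based approach) and labels every object that cannot be reached from a dense object as noise.

```python
from collections import deque
from math import dist

NOISE = -1  # label used for objects that belong to no cluster


def density_cluster(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    n = len(points)

    # Step 1: estimate density as the number of points within radius eps
    # (the point itself is counted, since its distance to itself is 0).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_dense = [len(nb) >= min_pts for nb in neighbors]

    # Step 2: group density-connected objects with a breadth-first search
    # that only expands through dense points.
    labels = [NOISE] * n
    cluster_id = 0
    for start in range(n):
        if not is_dense[start] or labels[start] != NOISE:
            continue
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in neighbors[i]:
                if labels[j] == NOISE:
                    labels[j] = cluster_id
                    if is_dense[j]:
                        queue.append(j)
        cluster_id += 1
    return labels
```

Calling `density_cluster(points, eps=1.5, min_pts=3)` on two well-separated groups of points plus one isolated point should yield two cluster ids and one `NOISE` label. The quadratic neighbor search keeps the sketch short and is only suitable for small data sets.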
The first approach adopted the density-based strategy proposed by Ester et al. [
Density peak clustering (DPC) [
To remedy these limitations of DPC, many improved methods have been proposed [
In contrast to the algorithms listed above, RECORD [
In this paper, we propose an improved clustering approach by combining the
The proposed algorithm is evaluated on synthetic and real-world data sets that are widely used to test the performance of clustering algorithms. The results of RNN-LDH are compared with those of IS-DSC, ISB-DSC, RNN-DSC, and ADPC in terms of three widely used benchmarks: F-measure (F1) [
The rest of the paper is organized as follows: Section
In this section, we describe RNN-LDH in detail. Some definitions in this section were introduced in other papers but are modified for our method.
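The definitions in this section build on nearest-neighbor and reverse-nearest-neighbor relations. As a minimal sketch, assuming the standard definition in which an object q is a reverse neighbor of p whenever p is among the k nearest neighbors of q, the two relations can be computed as follows. The function names and the brute-force search are illustrative choices, not part of the proposed method.

```python
from math import dist


def knn(points, k):
    """Return, for each point index, the indices of its k nearest neighbors."""
    n = len(points)
    result = []
    for i in range(n):
        # Sort all other points by Euclidean distance to point i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: dist(points[i], points[j]),
        )
        result.append(others[:k])
    return result


def reverse_knn(nn):
    """Invert a k-nearest-neighbor list: q is in rnn[p] iff p is in nn[q]."""
    rnn = [[] for _ in nn]
    for q, neighbors in enumerate(nn):
        for p in neighbors:
            rnn[p].append(q)
    return rnn


points = [(0, 0), (1, 0), (3, 0), (10, 0)]
nn = knn(points, k=1)    # [[1], [0], [1], [2]]
rnn = reverse_knn(nn)    # [[1], [0, 2], [3], []]
```

In this example the isolated point at (10, 0) has an empty reverse-neighbor set, while the centrally located point at (1, 0) has two reverse neighbors. Unlike the k-nearest-neighbor set, whose size is always k, the size of the reverse set varies with the local structure of the data.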
The notations used in this paper are listed below. In particular, std denotes the standard deviation function.
Object
Object
A core region (
Given a core region (
Local density of an object
The parent of an object
The parent represents a local density hierarchical relationship of object
The hierarchical distance of an object
The inner distance of an extended core region
An object
Given an extended core region
An object
NR is the noise ratio, which is defined as the percentage of objects in the data set that are labeled as noise.
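A minimal sketch of this quantity, assuming that NR is the percentage of objects labeled as noise (consistent with the "NR (%)" columns in the result tables) and that noise is marked with the label -1, which is a convention chosen here for illustration:

```python
def noise_ratio(labels, noise_label=-1):
    """Return the percentage of objects whose label equals noise_label."""
    if not labels:
        raise ValueError("labels must not be empty")
    noise_count = sum(1 for label in labels if label == noise_label)
    return 100.0 * noise_count / len(labels)
```

For example, one noise object among four gives `noise_ratio([0, 0, 1, -1]) == 25.0`.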
In this section, we discuss our algorithm in detail.
Algorithm
for all if if if end if else end if end if end for
while for all if end if end for end while return
In the procedure of Algorithm
The algorithm
Figure
Local density hierarchical relationship.
Algorithm
initialize an empty queue while not empty for each if if else end if end if end for end while
return {}; end if for each for each if end if end for end for return
Figure
Extended core regions.
The five algorithms discussed in this paper all need one parameter
The time complexity of RNN-LDH depends on the following aspects: (1) computing the distance between points
The above analysis shows that RNN-LDH has the same complexity as RNN-DSC and ADPC.
To evaluate the performance of RNN-LDH, we perform a set of experiments on synthetic and real-world data sets that are commonly used to test clustering algorithms. Specifically, we compare RNN-LDH with well-known clustering algorithms, including RNN-DSC in [
Table
Synthetic data sets.
Data | Objects | Dimensions | Classes |
---|---|---|---|
Pathbased | 300 | 2 | 3 |
Compound | 399 | 2 | 6 |
Flame | 240 | 2 | 2 |
Dim1024 | 1024 | 1024 | 16 |
Spiral | 312 | 2 | 3 |
Jain | 373 | 2 | 2 |
t4.8k | 8000 | 2 | 6 |
t5.8k | 8000 | 2 | 6 |
t7.10k | 10000 | 2 | 9 |
t8.8k | 8000 | 2 | 8 |
Comparison of RNN-LDH with RNN-DSC, IS-DSC, ISB-DSC, and ADPC. Different clusters are marked by different markers and colors.
Results of the algorithm on synthetic data sets.
Algorithms | k | C | F1 | AMI | ARI | NR (%) | k | C | F1 | AMI | ARI | NR (%) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Pathbased | | | | | | | cluto-t4.8k | | | | | |
RNN_LDH | 6 | 3 | | | | | 34 | | 0.91 | 0.87 | 0.91 | 1.70 |
RNN_DSC | 6 | 3 | 0.99 | 0.94 | 0.96 | 0.30 | 31 | | 0.88 | 0.85 | 0.87 | 0.10 |
IS_DSC | 10 | 3 | | | | 50.70 | 63 | | 0.84 | 0.79 | 0.81 | 16.60 |
ISB_DSC | 5 | | 0.94 | 0.82 | 0.88 | 4.70 | 21 | | 0.68 | 0.74 | 0.55 | 0.90 |
ADPC | 35 | | 0.66 | 0.4 | 0.4 | 0.00 | 400 | | 0.69 | 0.7 | 0.59 | 0.00 |
Compound | | | | | | | cluto-t5.8k | | | | | |
RNN_LDH | 7 | 6 | | | | 6.80 | 33 | 6 | | | | 5.00 |
RNN_DSC | 8 | 6 | 0.89 | 0.86 | 0.87 | | 32 | | 0.83 | 0.8 | 0.8 | 0.90 |
IS_DSC | 10 | | 1 | 1 | 1 | 31.60 | 62 | | 0.99 | 0.98 | 0.99 | 20.50 |
ISB_DSC | 8 | | 0.97 | 0.92 | 0.97 | 1.80 | 39 | | 0.85 | 0.85 | 0.84 | 2.10 |
ADPC | 9 | | 0.8 | 0.8 | 0.62 | 0.00 | 560 | 6 | 0.82 | 0.79 | 0.78 | |
Flame | | | | | | | cluto-t7.10k | | | | | |
RNN_LDH | 7 | 2 | | | | 0.80 | 40 | 9 | | | | 2.60 |
RNN_DSC | 8 | 2 | | 0.96 | 0.98 | | 28 | | 0.88 | 0.88 | 0.9 | 0.30 |
IS_DSC | 4 | 2 | 0.7 | 0 | −0.01 | 21.70 | 30 | | 0.87 | 0.91 | 0.82 | 16.60 |
ISB_DSC | 4 | 2 | 0.69 | 0.01 | −0.01 | 1.30 | 18 | | 0.86 | 0.87 | 0.83 | 1.50 |
ADPC | 27 | 2 | | | | | 395 | | 0.49 | 0.56 | 0.33 | 0.00 |
Dim1024 | | | | | | | cluto-t8.8k | | | | | |
RNN_LDH | 63 | 16 | | | | | 27 | 8 | | | | 1.30 |
RNN_DSC | 59 | 16 | | | | 1.40 | 22 | 8 | 0.94 | 0.91 | 0.95 | 0.10 |
IS_DSC | 63 | 16 | | | | 39.60 | 30 | | 0.68 | 0.78 | 0.56 | 12.00 |
ISB_DSC | 58 | 16 | | | | 6.30 | 10 | | 0.98 | 0.96 | 0.98 | 2.10 |
ADPC | 2 | 16 | | | | | 240 | | 0.59 | 0.6 | 0.45 | 0.00 |
Spiral | | | | | | | Jain | | | | | |
RNN_LDH | 2 | 3 | | | | | 16 | 2 | | | | 0.50 |
RNN_DSC | 2 | 3 | | | | | 15 | 2 | | | | |
IS_DSC | 5 | 3 | | | | 50.30 | 16 | 2 | | | | 36.20 |
ISB_DSC | 2 | 3 | | | | | 16 | 2 | | | | |
ADPC | 13 | 3 | | | | | 38 | 2 | 0.59 | 0.18 | −0.02 | |
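For readers who wish to check the ARI column, the following sketch computes the adjusted Rand index by pair counting over the contingency table of two labelings. How noise objects enter the scores reported above is not stated in this text; the sketch simply treats noise as one more label, which is an assumption.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """Pair-counting ARI between two labelings of the same objects."""
    if len(truth) != len(pred):
        raise ValueError("labelings must have the same length")
    n = len(truth)
    if n < 2:
        raise ValueError("at least two objects are required")

    # Number of object pairs that share a cell, a true class, or a cluster.
    contingency = Counter(zip(truth, pred))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Both labelings are a single cluster, or both are all singletons.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The index equals 1 for identical partitions regardless of how the cluster ids are named, is close to 0 for chance-level agreement, and can be negative when agreement is below chance, as in some entries of the table.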
If the number of clusters (
The Pathbased data set contains 300 objects in 3 classes. One class forms a three-quarter circular ring, and the other two classes lie at the two ends of the ring's horizontal diameter. As shown in the first row of Figure
Compound has six classes with different densities. The two adjacent classes in the upper-left corner follow Gaussian distributions, and on the right side of the figure, the irregularly shaped class is surrounded by the class with the lowest density. In the bottom-left corner, the smallest class is encircled by the ring-shaped class. As shown in the second row of Figure
A particularly challenging feature of Flame, t7.10k, and t8.8k is that their classes have homogeneous distributions and lie very close to each other. RNN-LDH outperforms the other algorithms on these data sets. On Flame, RNN-LDH treats the two outliers in the upper-left corner as noise, while ADPC assigns these two objects to the upper class. RNN-DSC misclassifies one object in the area where the two classes adjoin. On t8.8k, the result of RNN-DSC is close to that of RNN-LDH. Although ISB-DSC has the highest benchmarks, we can see that it partitions the data set incorrectly from Figure
Spiral has 3 classes that wind around each other, and Dim1024 is a high-dimensional data set with 1024 points in 16 Gaussian classes. From Table
t4.8k has six classes with random noise, and a thin sine curve runs across the classes. RNN-LDH partitions the data set into 8 clusters because the sine curve is divided into several segments: the upper-left and bottom-right segments are treated as two clusters, the bottom-left segment is regarded as noise, and the other segments are assigned to their nearest clusters. RNN-DSC detects only one segment of this curve. The other three algorithms fail to separate some of the main classes.
t5.8k has six label-like classes and a thick stick running across them. It also contains random noise. All five algorithms find the label-like classes. IS-DSC again achieves the highest benchmarks with the highest noise ratio. Our algorithm treats the stick as noise. RNN-DSC finds one segment of the stick. ISB-DSC identifies 3 segments of the stick as 3 independent clusters and also classifies some noise into 3 independent clusters. ADPC partitions all objects into 6 clusters.
Table
All samples with null or uncertain values, as well as duplicates, were removed from the data sets. The affected data sets are Breast_C_W, Echocardiogram, and Internet-Ads.
Most of the data sets have class attributes or character attributes. So Table
The SPECT-Heart data set has two subsets, and we used the SPECT.test subset to test the algorithms.
All text values in Chess were replaced by numbers, such as “
Attributes 1 and 10–13 were removed from Echocardiogram, and the second attribute (“still-alive”) was selected as the clustering label.
Lung-cancer is a sparse data set. Four values of the fifth attribute and one value of the ninth attribute were “?” (unknown); we replaced them with 0.
Heart-disease has 10 sub-data sets. We used “reprocessed hungarian data” to test the algorithms. This data set is also unbalanced because its largest cluster contains more than 60% of the samples, while the smallest contains fewer than 6%.
Real-world data sets.
Data | Objects | Attributes | Classes |
---|---|---|---|
Breast_C_W | 683 | 9 | 2 |
Internet-Ads | 2359 | 1558 | 2 |
Image-seg | 2100 | 19 | 7 |
Lung-cancer | 32 | 56 | 3 |
SPECT-Heart | 187 | 22 | 2 |
Zoo | 101 | 16 | 7 |
Wine | 178 | 13 | 3 |
Echocardiogram | 106 | 11 | 2 |
Liver-disorders | 345 | 6 | 2 |
Monks-3 | 432 | 6 | 2 |
Sonar | 208 | 60 | 2 |
Ionosphere | 351 | 34 | 2 |
Wholesale | 440 | 8 | 3 |
Heart-disease | 294 | 13 | 5 |
Contraceptive-M | 1473 | 8 | 3 |
Hayes-roth | 160 | 4 | 3 |
Table
Results of the algorithms on real-world data sets.
Algorithm | k | C | F1 | AMI | ARI | NR (%) | k | C | F1 | AMI | ARI | NR (%) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Breast_C_W | | | | | | | Internet-Ads | | | | | |
RNN_LDH | 30 | 2 | | | | 11.00 | 130 | 2 | | | | 13.80 |
RNN_DSC | 11 | 2 | 0.69 | 0.01 | −0.01 | 0.30 | 107 | 2 | 0.92 | 0.39 | 0.59 | 2.90 |
IS_DSC | — | — | — | — | — | — | — | — | — | — | — | — |
ISB_DSC | 78 | 2 | 0.97 | 0.81 | 0.87 | 8.70 | 146 | 2 | 0.91 | 0.38 | 0.58 | 2.50 |
ADPC | 44 | 2 | 0.93 | 0.62 | 0.74 | 0.00 | 400 | 2 | 0.9 | 0.31 | 0.53 | 0.00 |
Image-seg | | | | | | | Lung-cancer | | | | | |
RNN_LDH | 22 | 7 | | | | 1.80 | 4 | 3 | | | | 15.60 |
RNN_DSC | 12 | 7 | 0.59 | 0.56 | 0.36 | 0.30 | 2 | | 0.58 | — | 0 | 0.00 |
IS_DSC | 33 | | 0.71 | 0.64 | 0.49 | 39.90 | 7 | | 0.62 | 0.12 | 0.13 | 50.00 |
ISB_DSC | 19 | | 0.65 | 0.58 | 0.41 | 2.80 | 7 | | 0.69 | 0.28 | 0.29 | 0.00 |
ADPC | 12 | 7 | 0.63 | 0.58 | 0.36 | 0.00 | 2 | | 0.58 | — | 0 | 0.00 |
SPECT-Heart | | | | | | | Zoo | | | | | |
RNN_LDH | 14 | 2 | | 0 | −0.02 | 0.40 | 9 | 7 | | 0.73 | 0.66 | 28.70 |
RNN_DSC | 5 | | 0.96 | 0 | 0 | 0.40 | 4 | | 0.77 | 0.7 | 0.6 | 7.90 |
IS_DSC | 25 | | 0.94 | — | 0 | 41.60 | 23 | | 1 | 1 | 1 | 53.50 |
ISB_DSC | 16 | 2 | 0.89 | | | 5.20 | 6 | 7 | 0.79 | | | 44.60 |
ADPC | 10 | 2 | 0.82 | 0.02 | 0 | 0.00 | 12 | 7 | 0.63 | 0.58 | 0.36 | 0.00 |
Lung-cancer | | | | | | | Liver-disorders | | | | | |
RNN_LDH | 4 | 3 | | | | 15.60 | 9 | 2 | | 0 | | 0.00 |
RNN_DSC | 2 | | 0.58 | — | 0 | 0.00 | 5 | 2 | 0.67 | 0 | 0 | 0.30 |
IS_DSC | 7 | | 0.62 | 0.12 | 0.13 | 50.00 | — | — | — | — | — | — |
ISB_DSC | 7 | | 0.69 | 0.28 | 0.29 | 0.00 | 11 | 2 | 0.67 | 0 | −0.01 | 5.20 |
ADPC | 2 | | 0.58 | — | 0 | 0.00 | 31 | 2 | 0.67 | | 0 | 0.00 |
Hayes-roth | | | | | | | Wine | | | | | |
RNN_LDH | 6 | 3 | | | 0.14 | 19.40 | 34 | 3 | 0.72 | 0.39 | | 3.90 |
RNN_DSC | 4 | 3 | 0.45 | 0.02 | 0.02 | 0.60 | 20 | 3 | | | 0.38 | 0.60 |
IS_DSC | 13 | | 0.67 | 0.05 | 0.05 | 34.40 | 16 | 3 | 0.7 | 0.31 | 0.29 | 38.20 |
ISB_DSC | 8 | 3 | 0.58 | 0.16 | | 28.80 | 12 | | 0.64 | 0.41 | 0.34 | 3.40 |
ADPC | 13 | | 0.46 | −0.01 | −0.01 | 0.00 | 21 | 3 | 0.72 | 0.41 | 0.37 | 0.00 |
Sonar | | | | | | | Monks-3 | | | | | |
RNN_LDH | 11 | 2 | | 0.01 | | 4.30 | 5 | 2 | | 0 | 0 | 1.90 |
RNN_DSC | 6 | 2 | 0.57 | 0 | 0 | 2.40 | 3 | | 0.65 | 0 | 0 | 0.50 |
IS_DSC | — | — | — | — | — | — | 9 | 2 | | 0 | 0 | 14.60 |
ISB_DSC | 27 | 2 | 0.62 | 0 | −0.01 | 5.30 | 7 | 2 | | −0.01 | 0 | 0.00 |
ADPC | 14 | 2 | 0.66 | | 0 | 0.00 | 10 | 2 | 0.65 | | | 0.00 |
Echocardiogram | | | | | | | Heart-disease | | | | | |
RNN_LDH | 6 | 2 | 0.71 | 0.01 | −0.03 | 2.30 | 6 | 4 | | 0.02 | 0.03 | 4.00 |
RNN_DSC | 4 | 2 | 0.71 | 0.01 | 0.06 | 0.80 | 4 | 4 | 0.53 | 0 | −0.02 | 1.70 |
IS_DSC | 28 | 2 | | | | 47.00 | — | — | — | — | — | — |
ISB_DSC | 23 | 2 | 0.7 | 0.06 | 0.15 | 2.30 | 7 | | 0.53 | 0.03 | 0 | 7.60 |
ADPC | 5 | 2 | 0.7 | 0.01 | 0.06 | 0.00 | 6 | 4 | 0.55 | | | 0.00 |
Ionosphere | | | | | | | Contraceptive-M | | | | | |
RNN_LDH | 15 | 2 | 0.7 | 0.01 | | 2.80 | 8 | 3 | | 0.03 | 0 | 2.50 |
RNN_DSC | 4 | | 0.66 | 0.03 | −0.05 | 2.60 | 4 | | 0.52 | 0.01 | 0 | 0.20 |
IS_DSC | — | — | — | — | — | — | — | — | — | — | — | — |
ISB_DSC | 18 | 2 | | | −0.08 | 37.90 | 7 | | 0.52 | 0.01 | 0 | 6.40 |
ADPC | 2 | | 0.77 | 0.26 | 0.32 | 0.00 | 44 | 3 | 0.45 | | | 0.00 |
The attribute types of Internet-Ads, Echocardiogram, Heart-disease, and Liver-disorders are categorical, integer, and real. The first three data sets are unbalanced because the vast majority of their samples belong to one class. Internet-Ads is also sparse. The benchmarks show that RNN-LDH outperforms the other algorithms on Internet-Ads. For Echocardiogram, IS_DSC gets the best benchmarks, but it classifies nearly half of the samples as noise. Compared with the other 3 algorithms, RNN-LDH gets the best result on F1.
The attribute values of Breast_C_W, Lung-cancer, and Wholesale are all integers. Our algorithm outperforms the others on all benchmarks for the first two data sets. For Lung-cancer, the other four algorithms cannot find the correct number of clusters.
The attributes of Image-seg, Wine, and Sonar are real-valued. For Image-seg, RNN-LDH performs better than the others. IS_DSC gets the highest benchmarks but with the highest noise ratio and the wrong cluster number. For Wine and Sonar, RNN-LDH outperforms the other algorithms on one benchmark.
The attributes of SPECT-Heart, Monks-3, and Hayes-roth are categorical. SPECT-Heart is also unbalanced. For these data sets, our method outperforms the other methods on F1. The remaining data sets have mixed attribute types. On them, our method does better than RNN-DSC, IS-DSC, and ISB-DSC.
The experimental results of RNN-LDH are paired with those of RNN-DSC, ISB-DSC, and ADPC, respectively, to form three data groups. Each data group has 2 columns and 135 rows. One column holds the results of RNN-LDH, and the other holds those of one of the other three methods. The 135 rows are divided among 5 labels: F1, AMI, ARI, NR, and CR. The label CR represents the correct ratio of cluster numbers, which is calculated by the following equation:
The Friedman tests are carried out on these 3 data groups, and the
Friedman test.
Data group | |
---|---|
RNN-LDH and RNN-DSC | 5.53 |
RNN-LDH and ISB-DSC | 9.90 |
RNN-LDH and ADPC | 7.49 |
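To make the test concrete, the sketch below computes the Friedman chi-square statistic for a data group laid out as described above: each row is one paired measurement and each column is one algorithm. It uses the textbook formula with average ranks for ties and applies no tie correction. Whether the values in the table were obtained in exactly this way is not stated, so the code should be read as an illustration rather than a reproduction.

```python
def average_ranks(values):
    """Rank values from 1 to k, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks


def friedman_statistic(rows):
    """Friedman chi-square for n rows (blocks) of k paired measurements."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)
```

With k columns the statistic is compared against a chi-square distribution with k − 1 degrees of freedom; for two columns that is one degree of freedom, for which the critical value at the 0.05 level is about 3.84.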
The
In this paper, we proposed an improved density-based clustering algorithm, termed RNN-LDH, by combining the
The data sets used in this paper are standard test data sets which are all available online and could be freely accessed. The synthetic data sets were downloaded from
There are no conflicts of interest regarding the publication of this paper.
This work was supported in part by NSFC under Grant 61773022, Hunan Provincial Education Department (nos. 16B244, 17A200, and 18B504), and Natural Science Foundation of Hunan Province (nos. 2017JJ3287 and 2018JJ3479).