Improved Density Peaks Clustering Based on Natural Neighbor Expanded Group

Density peaks clustering (DPC) is an advanced clustering technique thanks to its multiple advantages: it determines cluster centers efficiently, needs few parameters, requires no iterations, and produces no border noise. However, it suffers from the following defects: (1) it is difficult to determine a suitable value of its crucial cutoff distance parameter; (2) its local density metric is too simple to find the proper center(s) of sparse cluster(s); and (3) its assignment is not robust, as some prominent density peaks are assigned to remote points. This paper proposes an improved density peaks clustering based on the natural neighbor expanded group (DPC-NNEG). The core of the proposed algorithm has two parts: (1) defining the natural neighbor expanded (NNE) and the natural neighbor expanded group (NNEG) and (2) dividing all NNEGs into a goal number of sets as the final clustering result, according to the closeness degree of NNEGs, for which the paper also provides a measurement. We compared the state of the art with our proposal on public datasets, including several complex and real datasets. Experiments show the effectiveness and robustness of the proposed algorithm.

Due to their flexibility and validity, various clustering algorithms have been proposed one after another. Jain classified these methods into partitioning-based, model-based, hierarchical-based, grid-based, and density-based approaches [19]. Partitioning methods aim to group the dataset into a preset number of clusters via an iterative process. K-means [20,21] and Fuzzy c-means [22,23] are two famous partitioning-based clustering algorithms. Although they are simple to understand and easy to implement, K-means is extremely sensitive to outliers and to the selection of the initial cluster centers; besides, Fuzzy c-means approaches suffer from initial partition dependence [1]. Model-based clustering methods require one or more appropriate probability models to represent the dataset and often use the expectation-maximization approach to maximize the likelihood function [24]. Hierarchical-based approaches [25-28] partition the dataset into several categories in one of two opposite ways: top-down or bottom-up [23]. The first considers the whole dataset as a single cluster and splits it into a suitable number of subclusters. The second regards each sample as a cluster and then merges these atomic clusters into larger and larger clusters. However, the effectiveness of hierarchical clustering algorithms depends on the type of distance measurement chosen for the clusters. Grid-based [29] and density-based [30,31] approaches automatically determine the number of categories using suitable preset parameters such as epsilon and min-pts. While these two types of algorithms require extensive parameter adjustment to obtain optimal clustering results, they also generate noise at the cluster borders.
To overcome the above shortcomings, density peaks clustering [32] was recently proposed, based on the assumption that cluster centers are relatively denser and are far from each other. Using a suitable value of the cutoff distance (namely, dc, the only parameter of DPC), this approach manually selects the appropriate center of each cluster from a decision graph. It then assigns each of the remaining elements to its nearest denser point (NDP), that is, the nearest neighbor possessing bigger density than the assigned sample. It has many advantages, including higher efficiency in finding cluster centers, fewer parameters, no iterations, and no noise around the cluster border. However, the algorithm is still affected by the following defects: (1) It is challenging to determine a suitable dc. It must also be mentioned that the original DPC algorithm does not cover a reliable and specific method to ensure a proper dc. Besides, several studies [33,34] demonstrated that DPC is sensitive to its parameter; even after normalization or using the relative percentage method, a small change in dc still causes a conspicuous fluctuation in the result.
(2) The formula of local density is too simple to find suitable center(s) of sparse cluster(s) and is only useful on datasets with balanced density [33]. As shown in Figure 1(a), the Jain dataset has two clusters: the upper one is sparse and the lower one is denser. However, DPC overlooks the center of the upper cluster in favor of a prominent density peak of the lower cluster. (3) Its assignment strategy is not robust [35]. Each point is assigned to its NDP, so some prominent density peaks (PDPs), points with relatively large density and δ_i values that are nevertheless not cluster centers, are mistakenly attributed to a denser superordinate that is far away. Accordingly, the subordinates of the incorrectly assigned PDP are apportioned to an incorrect group. Figure 1(b) shows what happens when we manually set the center to the densest point of the upper cluster: the prominent local peak of the top cluster is still assigned to its NDP in the lower cluster, which leads to the incorrect assignment of its subordinates, and there is a distinct gap along the assignment path.
To improve the performance of DPC, and inspired by the idea of the natural neighbor (NN) [36], we propose an improved density peaks clustering based on the natural neighbor expanded group. The main innovations and improvements in our algorithm are as follows: (1) We define the natural neighbor expanded and the natural neighbor expanded group based on the well-known K-nearest neighbor method and its adaptive variant, the natural neighbor. The concept of the natural neighbor expanded is to absorb those close neighbors overlooked by the NN method. (2) The NNEG is able to overcome the shortcoming of the remote assignment of PDPs and to mine the potential structure of the data. The remainder of this paper comprises four sections. Section 2 describes the related works. Section 3 presents DPC, the NN method, and the details of our algorithm. Section 4 presents the clustering results of our proposal and the related works. In Section 5, we summarize the contributions and features of this paper.

Related Works
To improve the performance of the DPC algorithm, scholars have proposed many optimization methods, as shown in Figure 2. Xie et al. modified the density metric formula using the K-nearest neighbor (KNN) method, which uses the number of the nearest neighbors to replace dc. Besides, they devised an entirely new assignment scheme based on fuzzy weighted K-nearest neighbors (FKNN-DPC) [33]. Furthermore, a suitable value of this method's parameter is easier to determine. Lotfi et al. proposed a technique called IDPC [37]. The algorithm sorts samples by local density and then propagates the labels of the centers to their KNN to develop cluster cores. Finally, IDPC implements a specific propagation strategy to attach labels to the remaining points. Guo et al. capitalized on the linear regression method to fit the decision values of DPC, with a preset proper dc required (DPC-LRA), and then chose the instances above the fitting function as the centers [38]. Ding et al. proposed an algorithm based on the generalized extreme value distribution (GEV) to fit the DPC decision values in descending order (DPC-GEV). To reduce the time complexity, they also presented a substitute method using the Chebyshev inequality (DPC-CI) [39]. Ni et al. presented the definitions of the density gap and the density path, as well as a new threshold [35]. Instead of the decision graph of DPC, the proper value of dc is determined by manually observing a summary graph incorporating the density gaps calculated under different dc values. The method, named PPC, obviously reduces the difficulty of threshold determination. Jiang et al. provided a novel density peaks clustering algorithm based on K-nearest neighbors (DPC-KNN) to overcome the assignment issue [40]. In this method, there are two sets for each sample i: the first one is S i , which is composed of sample i and its KNN, while the second is H i , which covers the data points possessing

Methods
This section presents short overviews of the original DPC algorithm and the NN method and then gives a detailed description of our method.

The Original DPC Algorithm.
DPC rests on the premise that cluster centers are relatively denser and are distant from each other. For a given dataset X = {x_1, x_2, . . . , x_n}, cluster centers are manually picked from the decision graph, which is two-dimensional with δ_i as the ordinate and the local density ρ_i as the abscissa. The local density measures the number of neighbors of each sample and their distances within its neighborhood, and it is a crucial concept of DPC. The ordinate δ_i is the distance between sample i and its nearest denser point. Since the centers have relatively larger density, each of them must be far away from its NDP, namely, it has a large value of δ_i. In this two-dimensional coordinate system, cluster centers simultaneously possess large values of δ_i and local density and therefore appear in the upper right corner of the graph. To measure the local density of each element, the author provides two formulae, the cutoff kernel and the Gaussian kernel:

ρ_i = Σ_{j≠i} χ(d_ij − dc), with χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, (1)

ρ_i = Σ_{j≠i} exp(−d_ij^2 / dc^2), (2)

and δ_i is calculated by

δ_i = min_{j: ρ_j > ρ_i} d_ij, (3)

where d_ij is the distance between the pairwise elements i and j and dc is the cutoff distance, the only argument of DPC.
Therefore, the DPC algorithm inherits a defect of the Gaussian kernel, which is sensitive to its bandwidth dc. As shown in equation (3), δ_i is the minimum distance between element i and any element j whose density is higher than that of i; for the element i with the highest density, δ_i is instead taken as the maximum distance between i and any j. After the cluster centers have been found, each remaining point is assigned to the same cluster as its nearest neighbor of higher density.
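To make the procedure above concrete, the following is a minimal sketch of the original DPC with the cutoff kernel of equation (1); it is not the paper's code, and the function name `dpc` and the ρ·δ ranking used to pick centers (a common stand-in for reading the decision graph manually) are illustrative assumptions. It also assumes the densest point ends up among the selected centers, which holds for well-separated data.

```python
import math

def dpc(points, dc, n_centers):
    """Minimal sketch of original DPC (cutoff kernel). Assumed names throughout."""
    n = len(points)
    d = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Equation (1): cutoff-kernel local density = number of neighbors within dc.
    rho = [sum(1 for j in range(n) if j != i and d[i][j] < dc) for i in range(n)]
    # Equation (3): delta_i = distance to the nearest denser point (NDP);
    # the densest point takes the maximum distance instead.
    order = sorted(range(n), key=lambda i: -rho[i])
    delta, ndp = [0.0] * n, [-1] * n
    delta[order[0]] = max(d[order[0]])
    for k, i in enumerate(order[1:], 1):
        j = min(order[:k], key=lambda j: d[i][j])
        delta[i], ndp[i] = d[i][j], j
    # Pick centers as the points with the largest rho_i * delta_i products
    # (a proxy for manual selection from the decision graph).
    centers = sorted(range(n), key=lambda i: -rho[i] * delta[i])[:n_centers]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    # Assign each remaining point, densest first, to the cluster of its NDP.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[ndp[i]]
    return labels
```

On two well-separated blobs, e.g. `dpc([(0,0),(0.1,0),(0,0.1),(5,5),(5.1,5),(5,5.1)], 0.5, 2)`, the sketch recovers the two groups.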

Natural Neighbor
Method. K-nearest neighbor is a popular machine learning method for classification and clustering tasks. However, its crucial argument K must be preset manually. The natural neighbor, in contrast, is an adaptive method for finding the relatively near neighbors of each sample. The basic idea of NN is that samples in dense regions have more neighbors, data points in sparse areas have relatively fewer neighbors, and outliers have only a few or no natural neighbors.
In the dataset X, the authors assume that s_ij is the similarity between two points x_i and x_j. Let findKNN(x_i, r) denote the KNN searching function, which returns the r-th nearest neighbor of the point x_i, and let KNN_r(x_i) ⊆ X denote the set of the r nearest neighbors of x_i, defined as KNN_r(x_i) = KNN_{r−1}(x_i) ∪ {findKNN(x_i, r)}.

Definition 1 (natural neighbor). The natural neighbors of x_i are the points that are mutually among each other's r nearest neighbors: NN(x_i) = {x_j | x_i ∈ KNN_r(x_j) ∧ x_j ∈ KNN_r(x_i)}.

Definition 2 (natural neighbor eigenvalue). When the algorithm reaches the stable searching state, the natural neighbor eigenvalue (NaNE) λ is equal to the searching round r.
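The round-by-round NN search can be sketched as follows. This is a hedged reconstruction of the method in [36], not the paper's code: the stopping rule used here (the count of points without reverse neighbors no longer changing between rounds) follows the common description of the natural neighbor algorithm, and the function name `natural_neighbors` is an assumption.

```python
import math

def natural_neighbors(points):
    """Sketch of the NN search: returns (list of NN sets, eigenvalue lambda)."""
    n = len(points)
    # Neighbor lists sorted by distance (excluding the point itself).
    knn = [sorted((j for j in range(n) if j != i),
                  key=lambda j: math.dist(points[i], points[j])) for i in range(n)]
    reverse = [set() for _ in range(n)]   # who counts i among their r nearest so far
    prev_orphans = -1
    for r in range(1, n):
        for i in range(n):
            reverse[knn[i][r - 1]].add(i)  # round r: each point reaches out one more
        orphans = sum(1 for i in range(n) if not reverse[i])
        if orphans == prev_orphans:        # stable searching state reached
            break
        prev_orphans = orphans
    # Definition 1: keep only mutual neighbors.
    nn = [{j for j in reverse[i] if i in reverse[j]} for i in range(n)]
    return nn, r
```

On a 1-D example such as `natural_neighbors([(0,), (1,), (2,), (10,)])`, the outlier at 10 ends with no natural neighbors, matching the intuition stated above.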

The Proposed Method.
In this section, the improved density peaks clustering based on the natural neighbor expanded group is presented. Our method consists of three major steps: (1) calculating the local density of each sample according to the proposed formula, (2) determining the natural neighbor expanded groups, and (3) grouping the NNEGs into several sets as the final clustering result. To realize this processing, we define the concept of the natural neighbor expanded and then provide a straightforward but useful formula for the local density. The natural neighbor expanded group is then defined to reveal the structure of the dataset and divide it into several local groups. To ensure the accuracy of the grouping of NNEGs, we propose a measurement of the closeness degree. The details of these steps are described in the remainder of this section.

Basic Concepts.
The NN method only considers the relationship of mutual neighbors and overlooks the impact of the distance between samples. To fit the density metric and the search for density peaks, we propose the concept of the natural neighbor expanded.

Definition 3 (natural neighbor expanded). The natural neighbor expanded of x_i is the set of its 2K_i nearest neighbors:

NNE(x_i) = KNN_{2K_i}(x_i), (7)

where K_i = |NN(x_i)| is the number of natural neighbors of x_i; hence, K_i ≤ r. As shown in Figure 3, sample 1 is not an NN of sample 8, since it does not belong to KNN_6(8). However, sample 1 is closer to sample 8 than sample 14 is. Hence, to calculate the density more completely and accurately, we expand the natural neighborhood of sample 8 to include samples 1, 2, and 7.
The natural neighbors form the set of close neighbors. Still, as shown in equation (2), the local density formula measures not only the close neighbors whose distances to sample i are smaller than dc but also the remaining samples of the whole dataset: samples whose distance to sample i is close to dc also impact the density of i. Therefore, the 2K_i in equation (7) covers the secondary-adjacent samples besides the close neighbors. The new local density formula based on the NNE is

ρ_i = 1 / Σ_{x_j ∈ NNE(x_i)} d(x_i, x_j), (8)

where distNNE(x_i) denotes the set of distances from x_i to all of the elements in NNE(x_i), so that ρ_i is the reciprocal of the sum over distNNE(x_i). Inspired by the famous K-means method, equation (8) considers each point as a core and calculates the sum of its distances to its NNE: the smaller the distance sum is, the more likely the point is to be a local center.
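Under the reconstruction above (equation (7): NNE as the 2K_i nearest neighbors; equation (8): density as the reciprocal of the NNE distance sum), the computation might look like the following sketch. The name `nne_density` and the guard for samples with no natural neighbors are our assumptions, not the paper's specification.

```python
import math

def nne_density(points, nn_sets):
    """Sketch of equations (7)-(8): NNE sets and NNE-based local densities."""
    n = len(points)
    nne, rho = [], []
    for i in range(n):
        k = max(1, len(nn_sets[i]))        # assumed guard for isolated points
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        expanded = ranked[:min(2 * k, n - 1)]   # equation (7): 2*K_i nearest
        nne.append(expanded)
        # Equation (8): reciprocal of the sum of distances to the NNE members.
        rho.append(1.0 / sum(math.dist(points[i], points[j]) for j in expanded))
    return nne, rho
```

A point surrounded by close NNE members gets a small distance sum and hence a large ρ_i, matching the local-center intuition stated above.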
In our method, each point is assigned to the nearest denser point within its NNE. The assignment process is stored in a list: the index numbers represent the samples of the given dataset, and each unit stores the index number of that sample's superordinate; if the density of a sample is bigger than that of every member of its NNE, it has no NDP and the related unit stores 0. The samples storing 0 are the prominent density peaks. This assignment adaptively divides the dataset into several NNEGs.
Essentially, the NNEGs reveal the potential structure of the analyzed dataset and are relatively tighter subclusters, local groups within the clusters of the ground truth. Because of the NNEG construction, each sample points only to a neighbor, so our method avoids the long-distance assignment of PDPs.
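The NNEG construction described above can be sketched as follows; `build_nnegs` is an assumed name, and -1 stands in for the paper's 0 marker of a prominent density peak. Each sample follows its superordinate chain up to a PDP, and all samples reaching the same PDP form one NNEG.

```python
def build_nnegs(nne, rho):
    """Sketch: assign each sample to the nearest denser point inside its NNE.
    nne[i] must be distance-sorted; returns (NNEG labels, superior list)."""
    n = len(rho)
    superior = []
    for i in range(n):
        denser = [j for j in nne[i] if rho[j] > rho[i]]
        # nne[i] is sorted by distance, so denser[0] is the nearest denser point.
        superior.append(denser[0] if denser else -1)   # -1 marks a PDP
    def root(i):
        return i if superior[i] == -1 else root(superior[i])
    # Density strictly increases along each chain, so the recursion terminates.
    peaks, labels = {}, [-1] * n
    for i in range(n):
        labels[i] = peaks.setdefault(root(i), len(peaks))
    return labels, superior
```

Since every point follows a single short hop to a neighbor, no long-distance assignment of a PDP can occur at this stage.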
As shown in Figure 4, after the NNEGs are determined, our method only needs to merge such local groups into the goal number of clusters, and it hence removes the operation of center selection from the decision graph, which overcomes the mentioned issue of the density metric of DPC. To clarify the close relationships between NNEGs, we propose the concept of the adjacent group graph.
Definition 5 (adjacent group graph). The adjacent group graph (AGG) takes the NNEGs as its vertices; E(v_t, v_τ) is the set of edges linking NNEGs v_t and v_τ, subject to

E(v_t, v_τ) = {e(x_i, x_j) | x_i ∈ v_t, x_j ∈ v_τ, x_j ∈ NNE(x_i)}. (9)

The adjacent group graph is usually a multigraph, since there can be several edges e(x_i, x_j) between v_t and v_τ, and the more edges there are, the closer the two groups are. Obviously, in Figure 4, there are no edges between the upper and the lower clusters. Moreover, the degree of closeness (DC) of a neighboring pairwise NNEGs is calculated by equation (10) as the sum, over all edges e(x_i, x_j) ∈ E(v_t, v_τ), of the normalized similarity of the endpoints weighted by w_i and w_j, where w_i = |NNE(x_i) ∩ v_t| / |NNE(x_i)| and w_j = |NNE(x_j) ∩ v_τ| / |NNE(x_j)|. As shown in equation (10), the closeness degree consists of two parts: the weight and the normalized similarity. It is based on the assumption that the more compact the endpoints are with their respective NNEGs, the more reliable the edge is. w_i represents the compactness between the sample x_i and the group v_t, viz., a bigger number of elements shared by NNE(x_i) and v_t means a more intense relationship between them. Dividing the number of intersected elements by |NNE(x_i)| ensures w_i ∈ [0, 1].

The Specific Processing
Inputs: dataset X, the goal number of clusters. Output: the clustering result.
Step 1: Create a k-d tree. Search the NNE of each sample using the k-d tree.
Step 2: Calculate the local density of each sample according to equation (8).
Step 3: Assign each sample to the NDP within its NNE to generate the NNEGs.
Step 4: Generate the adjacent group graph as in Definition 5, and find all edges of each pairwise NNEGs as in equation (9).
Step 5: Calculate the degree of closeness, according to equation (10).
Step 6: Break up the original cluster containing all NNEGs into the goal number of sets, according to the closeness degree.
To clarify Step 6 in detail, we present an example in Table 1. As shown in Table 1(A), there are five NNEGs in a dataset, and the closeness degrees of the adjacent pairwise NNEGs are recorded. Assume the goal number is 2. Our method first considers the whole dataset as a single cluster. We then force the minimum closeness degree, DC(v_2, v_3), to 0, as shown in Table 1(B), which splits the NNEGs into two parts: {v_1, v_2} and {v_3, v_4, v_5}. That is, the split is a for-loop operation that sets the minimum DC to 0 until the cluster number equals the goal one.
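Step 6 and the Table 1 example can be sketched as follows. The function name `split_nnegs` is an assumption, and the edge weights used in the usage note are hypothetical, since Table 1's actual values are not reproduced in the text.

```python
def split_nnegs(n_groups, dc_edges, goal):
    """Sketch of Step 6: treat NNEGs as nodes of a weighted graph and repeatedly
    force the weakest remaining DC to 0 until the graph has `goal` connected
    components. dc_edges: {(u, v): DC value}. Returns a component label per NNEG."""
    edges = dict(dc_edges)

    def labels():
        adj = {u: set() for u in range(n_groups)}
        for (u, v), w in edges.items():
            if w > 0:                       # zeroed edges no longer connect groups
                adj[u].add(v); adj[v].add(u)
        lab, nxt = [-1] * n_groups, 0
        for s in range(n_groups):
            if lab[s] == -1:                # flood-fill one connected component
                stack = [s]
                while stack:
                    x = stack.pop()
                    if lab[x] == -1:
                        lab[x] = nxt
                        stack.extend(adj[x])
                nxt += 1
        return lab, nxt

    lab, k = labels()
    while k < goal:
        weakest = min((e for e, w in edges.items() if w > 0), key=edges.get)
        edges[weakest] = 0                  # "force the minimum DC to 0"
        lab, k = labels()
    return lab
```

With five NNEGs chained by hypothetical weights {(0,1): 0.9, (1,2): 0.1, (2,3): 0.8, (3,4): 0.7} and goal 2, the weakest edge (1,2) is removed first, yielding the two parts {v_1, v_2} and {v_3, v_4, v_5} as in the Table 1 example.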
More details are shown in the pseudocode of Algorithm 1. In its 6th line, AGG is a matrix whose rows and columns each correspond to one of the NNEGs. In the 16th line, inspired by top-down hierarchical clustering, we consider the whole dataset as a cluster containing all NNEGs and break the weakest E(v_t, v_τ) in the AGG until the cluster number equals the goal, which corresponds to the process shown in Table 1(A) and (B).

Time Complexity Analysis.
This section analyzes the computational complexity of our method (Algorithm 1). Suppose that the number of samples in a dataset is n, the number of NNEGs is n_NNEG, the goal number of clusters is G, the NDP of sample i is its nNDP_i-th neighbor, and the biggest K_i equals K. The time complexity of creating a k-d tree is O(n log n) [41]. It has been demonstrated that determining the NN of all samples also costs O(n log n) [36]. For finding the NNE, we can record K_i during the NN search; hence, searching the NNE of a sample needs only 2K_i search operations, and the whole complexity for all samples is less than O(2Kn). Our local density metric is based on the NNE, so it is not necessary to generate a distance matrix; calculating the density of each sample needs only 2K_i additions, and therefore at most O(2Kn) time is required to calculate the local densities of all instances. For each sample, the method takes nNDP_i k-d tree searches to find its NDP among its 2K_i neighbors, with nNDP_i ≤ 2K_i. In the process of generating each NNEG, we store the labels along an assignment path in a list whose first unit is any unallocated instance and whose end is an already assigned sample or a prominent density peak; storing the labels of all samples costs only O(n), and dividing the dataset into NNEGs costs O(2Kn). In equations (9) and (10), the edges e(x_i, x_j) are determined by searching the NNE of each sample for neighbors with different labels; thus, finding all edges also takes O(2Kn) search operations. Furthermore, the grouping in the last step takes fewer than n_NNEG iterations to reach G clusters. Overall, we can conclude that the time complexity of the entire algorithm is O(Kn log n).

Results
In this section, several datasets are used to evaluate the performance of our method in comparison with state-of-the-art techniques such as DPC-DBFN [34], DPC-KNN [40], IDPC [37], and FKNN-DPC [33]. The experiments were performed on a computer with Windows 10, an Intel (R) Core (TM) i7-8750H, 16 GB of memory, and Matlab 2016b. The results are measured by several performance metrics, including Normalized Mutual Information (NMI) [42], the Rand Index (RI) [43], and the Adjusted Rand Index (ARI) [44]. Throughout this section, the similarity between points is measured using the Euclidean distance metric.

Datasets.
In this paper, the tested datasets comprise three low-dimensional datasets and five high-dimensional datasets, all of which are public and from the UCI repository. The two-dimensional datasets have different numbers of samples and different objective distributions. The DIM512 dataset, containing 1024 elements with 512-dimensional features belonging to 16 clusters sampled from Gaussian distributions, is often used to test algorithm performance in high-dimensional space. The experiments on the four datasets Statlog (Shuttle), Abalone, Wine Quality, and Libras Movement are applications of our method to physics (the positioning of radiators in the Space Shuttle), population biology, wine preference modeling, and hand movement recognition, respectively. More details are presented in Table 2.
To reduce the influence of dimension weights and ensure the validity of the experimental comparison, we normalized every tested dataset. The normalization formula is

x′_ij = (x_ij − min(x_j)) / (max(x_j) − min(x_j)),

where x_ij is the j-th feature value of the i-th sample, while max(x_j) and min(x_j) represent the maximum and minimum values of the j-th feature, respectively.
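The column-wise min-max normalization above can be sketched as follows; the function name `minmax_normalize` and the zero fallback for constant features are our assumptions.

```python
def minmax_normalize(data):
    """Column-wise min-max normalization: x'_ij = (x_ij - min_j) / (max_j - min_j)."""
    cols = list(zip(*data))                       # transpose rows into feature columns
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0  # constant feature -> 0 (assumed)
             for v, l, h in zip(row, lo, hi)]
            for row in data]
```

Every feature is mapped into [0, 1], so no single dimension dominates the Euclidean distances used in the experiments.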

Evaluation Measures.
We tested our algorithm and several related works on the above datasets. For an intuitive comparison, we chose RI, ARI, and NMI to measure the clustering results. The RI formula is

RI = (TP + TN) / C(n, 2),

where TP denotes the number of true positive pairs, TN denotes the number of true negative pairs, and the denominator C(n, 2) is the total number of sample pairs in a dataset consisting of n samples.
The ARI formula is

ARI = (RI − E[RI]) / (max(RI) − E[RI]),

where E[RI] represents the expectation of RI. The NMI formula normalizes the mutual information MI(A, B), where E[MI(A, B)] represents the expectation of MI(A, B) and MI(A, B) is expressed as

MI(A, B) = Σ_i Σ_j P(i, j) log(P(i, j) / (P(i)P(j))),

where P(i) = |A_i|/n, P(j) = |B_j|/n, P(i, j) = |A_i ∩ B_j|/n, A = {A_i | i = 1, 2, . . . , |A|}, and B = {B_j | j = 1, 2, . . . , |B|}. A and B represent two allocations of a dataset containing n elements, and A_i and B_j are clusters. In the experimental verification, A and B are the original labels and the clustering result of an algorithm, respectively. If the clustering result is the same as the real labels, the three metrics take the value of 1, and if the clustering result is entirely different from the labels, the values equal 0.
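As an illustration of the RI formula above, here is a minimal pair-counting sketch (the function name `rand_index` is ours); a pair counts as TP if both labelings put the two samples in the same cluster and as TN if both put them in different clusters.

```python
from itertools import combinations

def rand_index(a, b):
    """Rand Index: (TP + TN) / C(n, 2) over all sample pairs of labelings a, b."""
    pairs = list(combinations(range(len(a)), 2))
    tp = sum(1 for i, j in pairs if a[i] == a[j] and b[i] == b[j])
    tn = sum(1 for i, j in pairs if a[i] != a[j] and b[i] != b[j])
    return (tp + tn) / len(pairs)
```

Note that RI is invariant to permutations of the cluster labels, e.g. `rand_index([0,0,1,1], [1,1,0,0])` equals 1.0, which is why it suits the comparison of clustering results against ground-truth labels.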

Results.
This section shows the detailed clustering results and evaluates the performance of the different clustering algorithms on the various datasets. Tables 3-5 compare the performance of our method with DPC-DBFN, DPC-KNN, IDPC, and FKNN-DPC in terms of the NMI, RI, and ARI measures, respectively. All these methods use the KNN method, and the number of nearest neighbors (K) can be set from 1 to n. In these tables, the numbers in parentheses are the values of K at which the corresponding algorithm obtains the reported results, and boldface marks the best results. The Jain dataset has 373 points and two clusters: the upper one and the lower one. As shown in Figure 5, DPC-NNEG divides the dataset into nineteen NNEGs and then successfully and efficiently groups them into two sets, since there are no edges between the two clusters. Similarly, as shown in Figure 6, our algorithm divides the Spiral dataset into several local groups and subsequently merges all NNEGs accurately into the goal number of clusters.

Require: Dataset X = {x_1, x_2, . . . , x_n}, the goal number of clusters G
Ensure: The result of clustering: C = {C_1, C_2, . . . , C_G}
(1) Create a k-d tree;
(2) Search the k-d tree;
(3) Determine NN according to [36], and record K_i, which determines NNE(x_i);
(4) Calculate the local density ρ_i according to equation (8);
(5) Assign each point to the NDP within its NNE to generate several NNEGs;
(6) Create a matrix AGG = (|NNEG|, |NNEG|);
(7) for i = 1 : n do
(8) for t = 1 : 2K_i do
(9) if the t-th NNE member and sample i belong to different NNEGs do
(10) Calculate the closeness degree of this edge, referring to equation (10);
(11) Add the DC of this edge to the corresponding unit of AGG;
(12) end if
(13) end for
(14) end for
(15) while the number of clusters does not equal G do
(16) Store zero in the unit with the minimum value greater than zero;
(17) Count the number of clusters;
(18) end while
ALGORITHM 1: DPC-NNEG.
Unlike Jain and Spiral, as shown in Figure 7, the Flame dataset, containing 240 data points, has no clear gap between its two adjacent clusters. Hence, it is more sensitive to the value of dc in the DPC algorithm, because a tiny change in dc causes border points to be assigned to another cluster. However, our method not only partitions all samples into eight NNEGs but also measures the tightness between the different groups accurately, which realizes the correct grouping of those local groups. Figure 7 shows that the clustering result of Flame by DPC-NNEG is consonant with the ground truth. As shown in Tables 3-5, there is no difference in performance among our algorithm, DPC-DBFN, DPC-KNN, IDPC, and FKNN-DPC on the three two-dimensional datasets. However, as shown in Table 3, the clustering results on the more complex high-dimensional datasets show that our method outperforms the others: DPC-NNEG gains the best marks measured by NMI on all datasets. For example, the results of DPC-NNEG on the Statlog (Shuttle), Abalone, Wine Quality, DIM512, and Libras Movement datasets are 0.6101, 0.1852, 0.0935, 1.0000, and 0.5855, respectively. Moreover, its improvements over the second-best method (in %) for the Statlog (Shuttle), Abalone, Wine Quality, and Libras Movement datasets are 11.13, 0.32, 33.38, and 0.12, respectively. Tables 4 and 5 show similar results measured by RI and ARI, respectively; these results also demonstrate that the proposed method obtains the biggest values of RI and ARI in most cases, except on the Wine Quality dataset.
Hence, based on these results, it can be concluded that DPC-NNEG gives an overall excellent clustering performance.

Conclusions and Future Works
This paper proposed an efficient clustering algorithm called DPC-NNEG, which can easily split a dataset into local groups and then merge those groups into the goal number of clusters, for clusters with various densities, shapes, and sizes. The proposed method clusters the data in three major steps: calculating the local density of each sample, identifying the natural neighbor expanded groups, and merging those groups into clusters. The first step utilizes the natural neighbor method in the local density calculation; it is entirely different from the formula of the original DPC and can avoid the impact of outliers and reduce the sensitivity to dc. In the second step, the defined NNE is used to mine the potential structure of the data, which makes it possible to divide the dataset into several relatively more compact local groups called NNEGs. The last step groups all NNEGs into the goal number of clusters using the proposed formula for the closeness degree of local groups. The second and third steps not only overcome the issue of the remote assignment of the prominent density peaks but also remove the step of center selection of the original DPC. The effectiveness of the proposed method was verified on several datasets. The results show that our approach is more effective than the related improved DPC algorithms. In future work, we shall develop the concept of the NNE further to find a more suitable method for the secondary-adjacent samples, instead of the fixed parameter 2K_i of equation (7). Fuzzy theory is a proper technique for mining relatively adjacent samples, in which the NNE would be used to construct the membership function of closeness and then to deduce the functions of secondary-adjacent samples and remote samples.

Data Availability
All datasets in this paper are available in UCI.

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the present study.