RFDPC: Density Peaks Clustering Algorithm Based on Resultant Force

Density peaks clustering (DPC) is an efficient and effective algorithm owing to its outstanding performance in discovering clusters of varying densities. However, the quality of this method depends heavily on the cutoff distance. To improve the performance of DPC, the gravitation-based density peaks clustering (GDPC) algorithm was proposed; however, it cannot identify clusters of varying densities. We develop a novel density peaks clustering algorithm based on the magnitude and direction of the resultant force acting on a data point (RFDPC). RFDPC is based on the idea that the resultant forces acting on the data points in the same cluster are more likely to point towards the cluster center. The cluster centers are selected based on the force directional factor and distance in the decision graph. Experimental results indicate the superior performance of the proposed algorithm in detecting clusters of different densities, irregular shapes, and numbers of clusters.


Introduction
The main goal of clustering is to divide a data set into groups of data points so that the points within the same group are close to one another and those from different groups are distinct [1]. Clustering is widely used in many fields, such as image analysis, medical applications, data mining, and bioinformatics [2][3][4][5]. In general, clustering algorithms can be categorized as partitional clustering [6], density-based clustering [7], hierarchical clustering [8], model-based clustering [9], grid-based clustering [10], and graph-based clustering [11].
The k-means algorithm partitions data into k clusters by iteratively minimizing the sum of distances between each data point and its cluster center [12]. Nevertheless, owing to its sensitivity to the initial cluster centers, this algorithm may converge to a local optimum. Among density-based methods, DBSCAN is one of the most widely adopted algorithms; it can discover arbitrarily shaped clusters using neighboring relationships among data points [13]. However, the choice of the density threshold has a significant impact on this algorithm. Rodriguez and Laio proposed a fast clustering algorithm that employs the density peaks of the data (DPC) [14]. This algorithm assumes that cluster centers have a higher density than their neighbors and are relatively far from points of higher density. DPC can detect cluster centers rapidly and recognize clusters of certain shapes. However, DPC cannot detect clusters with no obvious center, and the cutoff distance affects its performance. The clustering results of DPC on the two-circles data set [1] are shown in Figure 1, where d_c is the cutoff distance. As Figure 1 shows, DPC cannot detect the two circles correctly when d_c is set to 0.1 and 1.21, respectively.
Numerous methods have been proposed to improve the DPC algorithm [15][16][17][18]. A residual error computation has been proposed by Parmar et al. [19] to measure local density in a neighborhood region. Guo et al. [20] proposed an improved DPC algorithm (DPC-CE) that estimates local center connectivity. Wang et al. [21] developed a hierarchical density peak clustering algorithm (McDPC) that assumes that cluster centers are relatively far apart from one another. A model by Wang et al. [22] describes the local gravitational effects of data points, where each data point is viewed as a mass subject to a local force resulting from its neighbors. A density peaks clustering algorithm using gravity theory (GDPC) has been developed by Jiang et al. [23], which assumes that gravity is inversely proportional to the distance between data points; the horizontal axis of its decision graph is density, and the vertical axis is the reciprocal of gravity. GDPC can discover outliers. However, this algorithm struggles to cluster a data set with wide variation in density among the clusters. Figure 2 shows the clustering results of GDPC, where the cluster centers are represented by green stars. As shown in Figure 2(a), GDPC can identify clusters of similar densities, but it cannot discover the expected clusters of varying densities shown in Figure 2(b). This is mainly because most of the data points in the upper right corner of Figure 2(b) have lower densities than those in the lower-left corner, and the density affects the determination of the cluster centers to a certain extent. As a result, the two cluster centers are located in the same cluster, which leads to wrong clustering results. This paper proposes a density peaks clustering algorithm based on the resultant force, called RFDPC.
As shown in Table 1, in comparison to algorithms such as k-means, DBSCAN, DPC, and GDPC, the proposed RFDPC algorithm can detect clusters of various densities, irregular shapes, and numbers of clusters. Our reasoning is as follows: (1) the resultant forces acting on the data points in the same cluster are highly likely to point towards the cluster center; that is, for a cluster center c_i, the resultant forces acting on its neighbors point towards c_i; (2) cluster centers are generally located far from points of higher density. Therefore, a data point is more likely to be a cluster center if most resultant forces acting on other points point towards it. In addition, two cluster centers are separated by a larger distance than points in the same cluster, so each noncentral point can be assigned to its nearest cluster center.
The key contributions of this paper are as follows: (1) a data point can be viewed as a force object, and the resultant force acting on it can determine whether the data point is located near a cluster center; (2) the cluster centers are selected based on the force directional factor and distance in the decision graph; (3) this paper proposes a clustering algorithm that is less sensitive to the shape or density of clusters. The rest of this paper is organized as follows: Section 2 reviews the DPC algorithm. Section 3 presents the proposed algorithm. Section 4 analyzes the time complexity of the proposed algorithm. Section 5 discusses the proposed algorithm. Section 6 presents the experimental evaluations. Finally, Section 7 concludes our work.

Related Works
The DPC algorithm is based on the idea that cluster centers are surrounded by neighbors of lower density and lie at a relatively large distance from any point of higher density. The density ρ_i of data point i can be defined as

ρ_i = Σ_{j≠i} χ(d_ij − d_c), with χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, (1)

where d_c is the cutoff distance and d_ij is the Euclidean distance between points x_i and x_j:

d_ij = ‖x_i − x_j‖. (2)
In addition, the density ρ_i of data point i can also be defined with a Gaussian kernel as

ρ_i = Σ_{j≠i} exp(−(d_ij/d_c)²). (3)

Let δ_i be the minimum distance from point x_i to any other point of higher density:

δ_i = min_{j: ρ_j > ρ_i} d_ij; (4)

for the point with the highest density, δ_i is set to the maximum distance to any other point. The decision graph can be obtained after calculating ρ_i and δ_i, and the cluster centers are determined by selecting the data points with high values of both ρ_i and δ_i. DPC suffers from shortcomings such as failing to identify clusters of varying densities and the number of clusters. The reason is that the density of a data point is determined by the cutoff distance d_c, which can affect the clustering results. To address this issue, the algorithm proposed in this paper uses the force directional factor c instead of the density of the data point, obtained by analyzing the resultant force acting on each point.
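To make the two decision-graph quantities concrete, the sketch below computes ρ_i with the Gaussian kernel of equation (3) and δ_i per equation (4) from raw coordinates. It is a minimal illustration, not the authors' code; the function name `dpc_decision_values` is ours.

```python
import numpy as np

def dpc_decision_values(X, dc):
    """Compute the DPC decision-graph quantities for points X (n x D).

    rho_i: Gaussian-kernel local density, sum over j != i of exp(-(d_ij/dc)^2).
    delta_i: distance to the nearest point of strictly higher density;
             for the densest point, the maximum distance to any point.
    """
    n = len(X)
    # pairwise Euclidean distances
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0  # subtract the self term
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = d[i].max() if higher.size == 0 else d[i, higher].min()
    return rho, delta
```

Cluster centers then correspond to points with large values of both ρ_i and δ_i in the decision graph.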

Method
The RFDPC algorithm makes three assumptions: (1) all points attract each other with a gravitational force; (2) the resultant force acting on a data point is directed mainly along the line between the point and its cluster center; (3) the intercluster distance is larger than the intracluster distance. RFDPC includes three major steps: Step 1: for each data point i, calculate the density ρ_i, the distance δ_i, and the resultant force F_i.
Step 2: calculate the angle α_ij between the vector connecting two points and the resultant force, as well as the force directional factor c_i.
Step 3: determine the cluster centers and assign each remaining point to its nearest cluster.

Calculate and Sort the Densities of Data Points.
First, calculate the Euclidean distance d_ij in equation (2) between each pair of data points i and j. The density ρ_i can then be calculated according to equation (1) or (3). The choice of the cutoff distance d_c follows Ref. [14]: choose d_c so that the average number of neighbors is about 1 to 2% of the total number of points in the data set. Then, all data points are sorted by density in descending order, and the distance δ_i from each data point i to its nearest point of higher density is calculated according to equation (4).
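One common way to realize the 1-2% rule of thumb above is to take a low percentile of the pairwise distances, so that roughly that fraction of all point pairs lie within d_c of each other. The helper name `choose_dc` and the percentile formulation below are ours, not prescribed by the paper:

```python
import numpy as np

def choose_dc(d, percent=2.0):
    """Pick a cutoff distance from an n x n distance matrix d so that
    roughly `percent` % of all point pairs lie within dc of each other."""
    n = d.shape[0]
    upper = d[np.triu_indices(n, k=1)]  # each pair once, no self-distances
    return np.percentile(upper, percent)
```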

Resultant Force Acting on a Data Point.
The concept of gravity is introduced on the basis of DPC. Newton's law of universal gravitation [23] states that any particle in the universe attracts any other with a force varying directly as the product of the masses and inversely as the square of the distance between their centers. The gravitational force is formulated as

F = G·m_1·m_2 / r², (5)

where G is the gravitational constant, m_1 and m_2 are the two masses, and r is the distance between the centers of the masses. Table 2 provides a mapping of Newton's law of universal gravitation to the parameters of DPC.
According to Table 2, the gravitational force f_ij acting between data points i and j is defined in equation (6), where the Euclidean distance is used; this avoids the sensitivity of the algorithm to the d_c value. Here, f_ij is the gravitational force between data point i and data point j, directed from i to j. For each data point i, the gravitational force f_ij exerted by every other data point j is calculated according to equation (6), and the magnitude and direction of the resultant force F_i acting on data point i are then obtained by decomposing and summing these gravitational forces, as shown in equation (7).
An example in two-dimensional space is shown in Figure 3. Suppose that data point i is located at the origin of a coordinate system and two gravitational forces F_1 and F_2 act on it. F_1 can be decomposed into a force fx_1 along the x axis and a force fy_1 along the y axis; similarly, F_2 can be decomposed into fx_2 and fy_2. The resultant force F_i acting on point i is then obtained by summing the components of all forces along the x axis (fx_1 and fx_2) and along the y axis (fy_1 and fy_2). Figure 4 shows the resultant force acting on each data point in the Flame data set, where the direction of an arrow represents the direction of the resultant force and its length represents the magnitude. As can be seen from Figure 4, most resultant forces point towards the two cluster centers. Meanwhile, the farther a data point is from the cluster center, the less it is affected by the cluster center. In addition, the magnitude of the resultant force is usually larger in a higher-density area and smaller in a lower-density area.
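The decomposition-and-summation step can be written compactly with vectors: the force from each neighbor is a vector along the line joining the two points, and the resultant is their sum. In the sketch below the magnitude of f_ij is assumed to be ρ_i·ρ_j/d_ij², a gravity-style stand-in for the paper's equation (6), whose exact definition follows Table 2; treat the formula as illustrative only.

```python
import numpy as np

def resultant_forces(X, rho):
    """Resultant force vector F_i on each point, summing attractive
    pairwise forces directed from i towards j.

    Assumed magnitude (stand-in for equation (6)): rho_i * rho_j / d_ij^2.
    """
    n = len(X)
    F = np.zeros_like(X, dtype=float)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            diff = X[j] - X[i]               # direction from i to j
            dij = np.linalg.norm(diff)
            F[i] += rho[i] * rho[j] / dij**2 * (diff / dij)
    return F
```

For a point at the center of a symmetric neighborhood the contributions cancel, so the resultant force on such a point can vanish.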

Force Directional Factor.
Here, a parameter c_i is introduced that represents the effect of data point i on the resultant forces acting on all other points. The detailed method is given as follows.
Take points i and j in Figure 4 as an example; for clarity, they are magnified in Figure 5. As shown in Figure 5, α_ij is the angle between the vector ji→ connecting point j to point i and the resultant force F_j:

cos α_ij = (ji→ · F_j) / (|ji→| |F_j|). (8)

cos α_ij equals 1 when α_ij equals 0, which means that the resultant force F_j acting on j points towards i. The larger α_ij is, the smaller cos α_ij is, and the more the direction of the resultant force F_j deviates from data point i.
In addition, it can be seen from Figure 4 that the resultant force acting on data point k also points towards point i. However, i and k belong to different clusters, which indicates that clustering errors can arise if only cos α_ij is considered. To address this issue, the distance between data points should be considered together with the resultant force; the parameter c_ij defined in equation (9) therefore combines cos α_ij with the distance between the points. For data point i, c_ij is calculated between it and all other points, and the force directional factor c_i is then obtained from these c_ij values, as defined in equation (10). A larger value of c_i indicates that point i has a significant effect on the resultant forces acting on the other points; hence, point i is most likely to be a cluster center or located near one.
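The cosine and the distance weighting can be combined as sketched below. The exact forms of equations (9) and (10) are not reproduced here; as a stated assumption, the sketch uses c_ij = cos α_ij / d_ij and c_i = Σ_j c_ij, so that nearby points whose resultant forces point at i raise c_i the most.

```python
import numpy as np

def force_directional_factor(X, F, eps=1e-12):
    """Force directional factor c_i of every point.

    cos(alpha_ij) is the cosine of the angle between the vector from j
    to i and the resultant force F_j (equation (8)); the combination
    c_ij = cos(alpha_ij) / d_ij is an assumed stand-in for equation (9).
    """
    n = len(X)
    c = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            v = X[i] - X[j]                  # vector from j towards i
            d = np.linalg.norm(v)
            fj = np.linalg.norm(F[j])
            if fj < eps:                     # no resultant force on j
                continue
            cos_a = np.dot(v, F[j]) / (d * fj)
            c[i] += cos_a / d                # assumed c_ij
    return c
```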

Determine the Cluster Centers.
For a data point i near the cluster center, the value of c_i will be large owing to the effect of the cluster center; thus, both δ_i and c_i are considered, and the cluster centers are selected according to the product of δ_i and c_i. Each remaining point is then assigned to the same cluster as its nearest neighbor of higher density. Algorithm 1 summarizes the proposed clustering algorithm based on the resultant force, and Figure 6 illustrates its flowchart.
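The center-selection and assignment steps can be sketched as follows, assuming ρ, δ, and c have already been computed and that the densest point ends up among the K selected centers (which holds whenever its δ is the global maximum); the function name `rfdpc_assign` is ours.

```python
import numpy as np

def rfdpc_assign(d, rho, delta, c, K):
    """Select K centers by the largest delta*c products, then assign each
    remaining point, in descending-density order, to the cluster of its
    nearest neighbor of strictly higher density."""
    n = len(rho)
    centers = np.argsort(delta * c)[::-1][:K]
    labels = -np.ones(n, dtype=int)
    labels[centers] = np.arange(K)
    for i in np.argsort(rho)[::-1]:          # densest points first
        if labels[i] >= 0:
            continue                         # already a labeled center
        higher = np.where(rho > rho[i])[0]
        nn = higher[np.argmin(d[i, higher])]
        labels[i] = labels[nn]               # nn was labeled earlier
    return labels, centers
```

Because each point inherits the label of a denser neighbor, a single wrong assignment can propagate to the points assigned after it, which is exactly the domino effect the paper discusses for DS11.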

Complexity Analysis
The computational complexity of RFDPC is analyzed as follows. First, we calculate the density ρ_i and the distance δ_i of each data point i, which requires O(n²) operations. Next, we calculate the resultant force acting on each data point; computing the gravitational forces and the resultant force takes O(n²) time. The calculation of c_i also takes O(n²) time. Finally, we determine the cluster centers. Since K ≤ n, the time for selecting the K cluster centers can be ignored, and assigning each remaining point to its nearest cluster takes about O(Kn). As a result, the total time complexity of the RFDPC algorithm is approximately O(n²).

Experimental Results
To test the feasibility and effectiveness of the RFDPC algorithm, it is compared with k-means [6], DPC [14], GDPC [23], single linkage [8], spectral clustering [24], DPC-CE [20], and McDPC [21] on 11 synthetic data sets DS1-DS11 (https://github.com/milaan9/Clustering-Datasets) and 15 UCI real data sets [6] listed in Table 3. Shuttle and Eeg are large-scale data sets, and Gene is a high-dimensional data set. The code used in this paper is written in Matlab and released at https://github.com/djhahaha/cluster.

Mathematical Problems in Engineering
The single linkage algorithm is an example of hierarchical clustering, and spectral clustering is a graph-based clustering algorithm. For k-means and spectral clustering, the best clustering result was selected from 50 trial runs according to the external clustering validity index. The parameter d_c in DPC, GDPC, DPC-CE, and RFDPC is set to 1.2. McDPC includes three parameters c, θ, and λ, whose values are determined by a heuristic algorithm [21]. All algorithms were implemented in MATLAB, and the experiments were carried out on a machine with an Intel Core i5 2.2 GHz CPU and 8 GB RAM running Windows 10.

Detect Clusters of Varying Densities.
Algorithm 1: RFDPC.
Step 1. Calculate the Euclidean distance d_ij according to equation (2), calculate the density ρ_i according to equation (1) or (3) and the distance δ_i according to equation (4), and sort all ρ_i in descending order.
Step 2. Calculate the resultant force acting on each data point. (2.1) Calculate the gravitational force acting on the data point according to equation (6), and the resultant force according to equation (7). (2.2) Calculate c_i according to equations (9) and (10).
Step 3. Determine the cluster centers. (3.1) Select the K cluster centers according to the product of δ_i and c_i; that is, the cluster centers are selected from the data points with the largest product values. (3.2) Assign each remaining point to the same cluster as its nearest neighbor of higher density.

There are two clusters in the DS1 data set, one with a higher density and another with a lower density. DS2 has four clusters, one with a higher density and the other three with lower densities. The experiments on DS1 and DS2 test the ability to detect clusters of varying densities. For DS1, Figure 7 shows that k-means, GDPC, single linkage, and spectral clustering cannot identify the clusters of varying densities, while the four remaining clustering algorithms produce the correct clustering results. Figure 8 illustrates the clustering results on DS2. For DS2, the results of k-means, DPC-CE, and McDPC are imperfect but better than those of DPC, GDPC, and single linkage; both spectral clustering and RFDPC identify the clusters correctly. The reason why RFDPC can properly aggregate clusters of varying densities is that it finds the cluster centers based on the direction of the resultant force acting on each data point instead of the density of the data point.

Detect Clusters of Irregular Shapes.
We evaluate the performance of RFDPC using nine data sets (DS3-DS11). The DS3 data set comprises a spherical cluster and three half-moon clusters. The DS4 data set consists of three spiral clusters that are isolated from one another. The DS5 data set includes seven Gaussian-distributed clusters. The DS6 data set contains 31 two-dimensional Gaussian clusters, each with 100 data points. There are a crescent-shaped cluster and a spherical cluster in the DS7 data set. The DS8 data set contains three spiral clusters and four Gaussian-distributed clusters. The DS9 data set consists of four sphere-shaped clusters and one non-sphere-shaped cluster. The DS10 data set contains a linear cluster, a ring-shaped cluster, and a compact rectangular cluster. There are two round clusters in the DS11 data set, along with one Gaussian-distributed cluster. Figure 9 illustrates that spectral clustering and RFDPC perform better than the other algorithms on DS3. Figure 10 illustrates the clustering results on DS4: except for k-means and spectral clustering, all algorithms identify the expected clusters. Figure 11 shows the clustering results on DS5: DPC, GDPC, spectral clustering, DPC-CE, McDPC, and RFDPC give correct partitions, while k-means and single linkage do not. Figure 12 illustrates the clustering results on DS6, where GDPC, DPC-CE, and RFDPC discover all 31 clusters. We can see from Figure 13 that both DPC-CE and RFDPC produce the proper partitions. For DS8, shown in Figure 14, DPC, DPC-CE, and RFDPC discover the expected clusters. For DS9, shown in Figure 15, DPC, DPC-CE, McDPC, and RFDPC identify satisfactory partitions except for the outliers in the upper right corner. Figure 16 shows the clustering results on DS10, where DPC, single linkage, DPC-CE, and RFDPC generate the proper partitions. Finally, as Figure 17 shows, the proper partitions on DS11 are identified only by single linkage, spectral clustering, and DPC-CE.
We further explain why RFDPC cannot identify the proper partitions for DS11. Figure 18 shows the three cluster centers determined by RFDPC, represented by purple stars; each of the three centers is evidently located in a different cluster. After the three cluster centers are determined, each remaining point is assigned to a cluster based on its nearest neighbor of higher density. An incorrectly assigned data point may trigger a domino effect: once one data point is incorrectly assigned, subsequent data points may also be incorrectly assigned. Figures 19-21 show the decision graphs for three data sets: Wine, Seeds, and DS3. It can be seen from Figure 19 that RFDPC is more accurate than both DPC and GDPC in detecting the number of clusters: DPC can distinguish only two cluster centers, and GDPC finds only one and cannot identify all three. In the decision graph of Figure 19(c), there are three distinct points, so RFDPC finds the three cluster centers correctly. Figure 20 shows that GDPC can discover only one cluster center. Figure 21(a) shows that DPC can discover four cluster centers; however, the point in the red circle can easily be mistaken for a center.

Determine the Number of Clusters.
In contrast, Figure 21(c) shows that RFDPC can discover the proper cluster centers. Tables 4-18 compare the performance of the eight clustering algorithms on the 15 real data sets, with the optimal value of each index indicated in bold. RFDPC obtains the best performance on most data sets. For the large-scale Shuttle data set, RFDPC achieves the best results for all indices except the Purity index. In addition, RFDPC achieves the best results for FM and NMI on the Gene data set.

Parameter Setting.
In this section, we discuss the value of d_c involved in RFDPC. The time complexities of the compared algorithms are expressed in terms of n, D, K, and I, which denote the number of data points in the data set, the dimensionality, the number of clusters, and the number of iterations for k-means, respectively. Table 21 indicates that the running times on the real data sets are comparable to those in Table 20. In addition, the running times of single linkage and spectral clustering on the three large-scale or high-dimensional data sets (Shuttle, Gene, and Eeg) are significantly higher than on the other data sets. Although RFDPC adds the calculation of the resultant force, its code is optimized, so its running time on the Gene and Eeg data sets is less than that of DPC.

Discussion
To further explain the proposed algorithm, consider the test case in Figure 22. In Figure 22(a), 25 data points are embedded in a two-dimensional space, ranked in order of decreasing c. Each black arrow represents the direction of the resultant force acting on a data point; all arrows in Figure 22(a) are of the same length because the magnitude of the resultant force is not involved in the calculation of c. Figure 22(b) shows the corresponding decision graph. According to Step 1 of Algorithm 1, we calculate the density ρ_i and the distance δ_i of each data point. Specifically, we calculate the Euclidean distance from each data point x_i to the other points; for simplicity, the value of d_c in this example is set to 2%. Then, we calculate ρ_i and δ_i of each data point according to equations (1) and (4), respectively. Next, we calculate the resultant force acting on each data point according to Step 2 of Algorithm 1. We can see from Figure 22(a) that, for the red cluster, most resultant forces point towards data point 1; there is no arrow on data point 1 itself because the magnitude of the resultant force acting on it is equal to 0. Furthermore, the number of data points in the blue cluster is much smaller than that in the red cluster, so the resultant forces acting on the data points in the blue cluster are strongly affected by the data points in the red cluster; hence, for the blue cluster, the resultant forces do not point towards any particular data point. Next, we calculate c_i according to equations (9) and (10). Finally, we determine the cluster centers according to the product of δ_i and c_i. It can be seen from Figure 22(b) that data points 1 and 17, which have the largest product values, are selected as the cluster centers.
Then, each remaining point is assigned to the same cluster as its nearest neighbor of higher density. The density of the red cluster is higher than that of the blue one in Figure 22(a), so the c value of data point 17 is smaller than that of data point 1. However, our algorithm considers both δ and c, which tackles the problem of varying densities: since the δ value of data point 17 is close to 1, data point 17 is still selected as a cluster center.

Conclusions
This paper proposes a resultant force-based density peaks clustering algorithm grounded in gravitation theory. The clustering performance of RFDPC was evaluated using 11 synthetic data sets and 15 real data sets.
The results indicate that RFDPC performs well in the following aspects: (i) aggregating clusters of different shapes and densities in an efficient manner; (ii) detecting the number of clusters.
Experimental results indicate that RFDPC is superior to k-means, DPC, GDPC, single linkage, spectral clustering, DPC-CE, and McDPC. GDPC considers the magnitude of the gravitational force between two points but ignores the effect of gravitational forces from the other points. We extend the gravitational force in GDPC to the resultant force and select the cluster centers depending on both the force directional factor c and the distance δ in the decision graph. Hence, RFDPC can accurately recognize the cluster centers.
A major limitation of RFDPC is the assignment scheme for the remaining data points, which is prone to producing consecutive assignment errors. Future work will improve RFDPC by reducing such errors. In addition, ensemble clustering techniques can improve clustering robustness by fusing the information of multiple clustering results [30]; our algorithm may also be improved by using ensemble clustering.
Data Availability

The code and data that support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.