A Novel Minimum Spanning Tree Clustering Algorithm Based on Density Core

Clustering analysis is an unsupervised learning method with applications across many fields, such as pattern recognition, machine learning, information security, and image segmentation. The density-based method, as one of the various clustering approaches, has achieved good performance. However, it performs poorly on multidensity and complex-shaped datasets, and its results depend heavily on user-supplied parameters. Thus, we propose a novel clustering algorithm based on the density core, called MST-DC. Firstly, we employ the reverse nearest neighbors to extract core objects. Secondly, we use the minimum spanning tree algorithm to cluster the core objects. Finally, the remaining objects are assigned to the cluster to which their nearest core object belongs. Experimental results on several synthetic and real-world datasets show the superiority of MST-DC over Kmeans, DBSCAN, DPC, DCore, SNNDPC, and LDP-MST.


Introduction
Clustering analysis, which classifies unlabeled data into clusters, refers to the task of discovering the internal structure of data or potential data models [1]. Since the early 1950s, quite a few clustering algorithms have been put forward [2,3]. These algorithms can be roughly classified into four categories: partition-based clustering algorithms [4,5], hierarchical clustering algorithms [6,7], density-based clustering algorithms [8,9], and graph-based clustering algorithms [10-12]. Thanks to their predominant capability of discovering clusters of different shapes and sizes along with outliers, density-based and partition-based clustering technologies are widely used in fields such as health care [13], information security [14], and the Internet [15]. Besides, clustering is also vital for analyzing big data.
Partition-based clustering algorithms are the simplest and most fundamental clustering algorithms. They organize the data objects into several nonoverlapping partitions, where each partition represents a cluster and each data object belongs to one cluster [16]. Nevertheless, traditional partition-based methods usually cannot find clusters with arbitrary shapes.
However, identifying clusters with arbitrary shapes is a very important task in the applications of clustering algorithms. Since the density-based clustering algorithm does not need to know the number of clusters in advance and can effectively process datasets with arbitrary shapes, it has always been a focus of clustering research. The idea of the density-based clustering algorithm [17] is that the clusters in a dataset are collections of dense data regions separated by sparse data regions. DBSCAN [18] and DPC [8] are two typical density-based algorithms. The DBSCAN algorithm requires the user to set the neighborhood radius ε and the minimum number of points Minpts, and it classifies data points into core points, boundary points, and outliers. DBSCAN can be effectively applied to datasets with complex shapes, and outliers can be detected during the clustering process. However, the algorithm yields poor clustering results on multidensity datasets. Furthermore, different parameter settings in DBSCAN produce unstable clustering results on different datasets. The DPC (density peak clustering) algorithm was published in the journal Science in 2014; it holds that cluster centers are characterized by a higher density and a relatively longer distance [8]. However, DPC still has some drawbacks. Firstly, a cutoff distance dc needs to be set by users. Besides, the cluster centers are obtained from a decision graph, which involves human judgment.
To improve the performance of DPC, DPC-KNN-PCA [19] and SNN-DPC [20] have been proposed. DPC-KNN-PCA integrates PCA, DPC, and KNN to avoid the defects of DPC. However, this density-based algorithm still cannot recognize clusters containing manifold distributions [21]. To remove the dependence on the threshold dc, paper [20] proposed a shared-nearest-neighbor-based clustering by fast search and find of density peaks (SNN-DPC) algorithm. The computation of its local density ρ and of the distance δ to the nearest point of larger density takes the information of the nearest neighbors and the shared neighbors into consideration. The assignment process in DPC is sensitive and of low fault tolerance: if a data point is assigned incorrectly, the subsequent assignments magnify the error, producing more errors that have a seriously negative impact on the clustering process.
Therefore, paper [20] adopted a two-step assignment to address these drawbacks of DPC. Yet SNN-DPC still has several apparent defects. Firstly, the number of shared nearest neighbors k needs to be set through manual experience. Besides, SNN-DPC still uses a decision graph to select cluster centers. MST-based clustering methods [22,23] do not assume that data points are grouped around centers or separated by regular geometric curves. Instead, they use tree edge information to divide a dataset into clusters and are thus able to recognize clusters with arbitrary shapes; however, they are time-consuming and susceptible to noise points. The algorithm in [24] (LDP-MST) uses a new shared-neighbor-based distance between local density peaks to construct a minimum spanning tree on the local density peaks, which excludes the interference of noise points and reduces the running time of MST-based clustering. Nevertheless, LDP-MST still needs input parameters, which means that the algorithm cannot exclude the interference of human factors.
To resolve the problems mentioned above, we propose a novel clustering algorithm called MST-DC. Firstly, we automatically obtain the reverse neighbors of each object based on the concept of natural neighbor searching. Secondly, we obtain the core objects according to the number of reverse neighbors of each object. Thirdly, based on the Prim algorithm from graph theory, we construct a minimum spanning tree of the core objects to obtain their clustering result. Lastly, unallocated objects are assigned the label of their nearest local core object. There is no need to set parameters in MST-DC. Furthermore, MST-DC can be applied to complex patterns with extremely large variations in density. The remainder of this paper is organized as follows: Section 2 presents a brief overview of density core and natural neighbors; Section 3 presents the clustering algorithm (MST-DC); Section 4 presents the analysis on synthetic and real datasets; and finally, Section 5 presents the summary of this paper and future work.

Related Work
In this section, we review related work on density core, originally introduced by Dai et al. [25], and natural neighbors [26].

Density Core.
There exist some intrinsic defects in centroid-based clustering methods, including shape loss, false distances, and false peaks, which cause centroid-based methods to fail on complex patterns [27]. Hence, Chen et al. [27] proposed a hybrid decentralized approach named DCore to overcome these defects. Density cores can roughly maintain the shape of a cluster and are located far from each other.
As is well known, the mean shift algorithm can identify nonspherical patterns by shifting tracks. Thus, DCore uses mean shift and k-center to obtain convergence points. DCore is a hybrid method that decentralizes each density peak into a loose density core, which avoids some intrinsic defects of centroid-based clustering approaches. The application of DCore to different datasets indicates that it performs well on many complex datasets. Nevertheless, it still has some obvious limitations: (1) DCore uses a globally fixed scanning radius to search for convergent points and density cores; for a dataset with multiple density levels, a globally fixed radius cannot yield ideal density representative points and thus cannot yield ideal clustering results. (2) To filter noise, DCore adopts three filtering strategies; however, it is usually difficult to determine the specific pattern of a dataset and hence to select the corresponding strategy for detecting outliers and noise. (3) DCore needs five parameters to be tuned to attain good clustering results, and it is difficult to find the ideal parameter combination.

Natural Neighbor.

Natural neighbor [26] is a new concept originating from the observation that the number of one's real friends should be the number of people who regard him or her as a friend while he or she regards them as friends at the same time. For example, if object p regards object q as a neighbor and object q regards object p as a neighbor at the same time, then q is one of the natural neighbors of p. To put it another way, the main idea of the natural neighbor stable structure is that objects lying in sparse regions possess a small number of neighbors, whereas objects lying in dense regions possess a large number of neighbors [28]. Thus, the natural neighbor stable structure of objects is formulated as follows:

(∀p ∈ X)(∃q ∈ X)(p ≠ q) ∧ (p ∈ KNN_k(q)) ∧ (q ∈ KNN_k(p)),

where KNN_k(p) denotes the k nearest neighbors of object p.

Computational Intelligence and Neuroscience
Definition 1 (k nearest neighbors). The k nearest neighbors of object p are the set of objects q in dataset X satisfying dist(p, q) ≤ dist(p, o), that is, KNN_k(p) = {q ∈ X | dist(p, q) ≤ dist(p, o)}, where o is the kth nearest object to p and dist(p, o) is the distance between objects p and o.
Definition 2 (Reverse neighbors). The reverse neighbors of object p are the set of objects q that regard p as one of their k nearest neighbors, i.e., RNN_k(p) = {q ∈ X | p ∈ KNN_k(q)}.
The formation process of the natural neighbor stable structure is as follows: continuously expand the neighbor searching range k from 1 to λ (λ is named the natural neighbor eigenvalue (NaNE)) [26]; in each search round, calculate the number of reverse neighbors of each object and check the following two conditions: (1) all objects have reverse neighbors, or (2) the number of objects without reverse neighbors remains unchanged. When either condition is met, the natural neighbor stable structure is formed, and the searching range k at that moment equals the natural characteristic value λ. Therefore, λ is obtained by

λ = min{ k | Σ_{p∈X} f(nb_k(p)) = 0 or Σ_{p∈X} f(nb_k(p)) = Σ_{p∈X} f(nb_{k−1}(p)) },

where k is initialized to 1, nb_k(p) is the number of reverse neighbors of object p in the kth iteration (note that nb_k(p) ≥ 0), and f(x) is defined as

f(x) = 1 if x = 0, and f(x) = 0 otherwise.

Based on the above concepts, the natural neighbor is defined as follows: Definition 3 (Natural neighbors). For each object p, its k nearest neighbors with k equal to the natural characteristic value λ are its natural neighbors, denoted NaN(p).
Apparently, each object has the same number of natural neighbors in this paper. The details of the natural neighbor searching algorithm are shown in Algorithm 1.
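The searching procedure above can be sketched in Python. This is a brute-force version without the KD-tree speedup mentioned later in the paper; the function names are illustrative, not the authors' code.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i] (Euclidean)."""
    order = sorted(range(len(points)),
                   key=lambda j: math.dist(points[i], points[j]))
    return order[1:k + 1]  # skip the point itself

def natural_neighbor_search(points):
    """Expand k until every point has a reverse neighbor, or the count
    of points without reverse neighbors stops changing.
    Returns (lam, nb), where lam is the natural eigenvalue and nb[i]
    is the reverse-neighbor count of point i at k = lam."""
    n = len(points)
    k, prev_zero = 0, None
    nb = [0] * n
    while True:
        k += 1
        nb = [0] * n
        for i in range(n):
            for j in knn(points, i, k):
                nb[j] += 1  # point i regards j as a neighbor
        zero = sum(1 for c in nb if c == 0)
        if zero == 0 or zero == prev_zero or k >= n - 1:
            return k, nb
        prev_zero = zero
```

On a small dataset of three mutually close points plus one remote point, the search stops once every point has at least one reverse neighbor.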
In the proposed method, we construct a minimum spanning tree of density core points for clustering. Besides, we use the reverse nearest neighbors obtained during natural neighbor searching, which does not require any parameter settings. MST-DC can recognize extremely complicated clusters with large variations in density.

Density Core Set.
According to Algorithm 1, we calculate the number of reverse nearest neighbors of each object. Since core objects have more reverse nearest neighbors than noncore objects, we use this count to extract core objects. The core object is defined as follows:

Definition 4 (Core object). An object p is a core object if it satisfies

SRNN(p) ≥ λ,

where SRNN(p) represents the number of reverse nearest neighbors of object p, and λ is the natural characteristic value.
As mentioned above, each data point regards its neighbor points as potential density core points. A neighbor point is a true density core point when enough data objects treat it as a potential density core point. Figure 1(a) shows an original dataset with three clusters. After the core point extraction process, as shown in Figure 1(b), the red regions represent potential clusters, and each point in the red regions is a core object; the gray points are noncore objects. Algorithm 2 presents the process of finding core sets, which roughly retains the shape of the clusters.
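A minimal sketch of the core-extraction step, assuming the criterion SRNN(p) ≥ λ from Definition 4 (brute-force kNN; `find_cores` and `reverse_neighbor_counts` are illustrative names):

```python
import math

def reverse_neighbor_counts(points, k):
    """nb[j] = how many points include j among their k nearest neighbors."""
    n = len(points)
    nb = [0] * n
    for i in range(n):
        order = sorted(range(n),
                       key=lambda j: math.dist(points[i], points[j]))
        for j in order[1:k + 1]:  # skip the point itself
            nb[j] += 1
    return nb

def find_cores(points, lam):
    """Indices whose reverse-neighbor count reaches the natural
    eigenvalue lam (sketch of Definition 4)."""
    nb = reverse_neighbor_counts(points, lam)
    return [i for i, c in enumerate(nb) if c >= lam]
```

A point far from any dense region collects few reverse neighbors and is excluded from the core set.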

Clustering Core Objects.
After we obtain the density core points, how to cluster them becomes a key task. We propose a method of clustering density core points based on the minimum spanning tree. The density core sets extracted from the dataset retain the general shape of the clusters, while the density core sets themselves lie far apart from each other. After constructing the minimum spanning tree, it is therefore easy to find the longest edges of the tree for cutting. The process of clustering the density core points based on the minimum spanning tree is as follows: firstly, we construct a minimum spanning tree based on the set of density core points; secondly, we cut off the edges whose length is greater than the trimming threshold. Afterwards, we obtain the clusters of the density core sets according to the tree structure after trimming. The trimming threshold cutθ is defined as

cutθ = mean(Edge) + ω · std(Edge),

where mean(Edge) represents the average of all edge weights in the minimum spanning tree, and std(Edge) represents the standard deviation of all edge weights in the minimum spanning tree. ω is an empirical value whose range is [2, 5], which has been verified by a large number of experiments; ω = 3 meets the requirements of most datasets, so we choose ω = 3 as the experimental parameter in this article. The trimming threshold is based on statistical principles for checking whether there are outliers in the data: the edge lengths of the constructed minimum spanning tree approximately follow a Gaussian distribution, and the edges we need to trim are the longer ones located between different clusters.

Figure 1: Density core points extracted based on the reverse nearest neighbor number.

Algorithm 2: Find core set (find-cores).
Definition 5 (Trimming threshold). The trimming threshold cutθ is derived from the overall minimum spanning tree.
The steps of clustering density core points are as follows: (1) We use the Prim algorithm to construct the minimum spanning tree over all the density core points. The length of each edge, computed as the Euclidean distance, is used as the edge weight. The minimum spanning tree built on the density core points is shown in Figure 2.
(2) After building the minimum spanning tree, we obtain its edge set. Since the tree is built on the density core, edge weights within the same cluster are relatively small and vary little, while the weights of edges between different clusters are larger, so it is easy to find the edges connecting different clusters and cut them off. As shown in Figure 3(a), the colored dots represent the weights of the edges of the minimum spanning tree: the weights of two edges (red dots) are much larger than those of the other edges (blue dots). The red dotted line indicates the calculated trimming threshold, which effectively identifies the longer edges in the minimum spanning tree. Figure 3(b) shows the minimum spanning trees of the subclusters after cutting off the edges greater than the trimming threshold; the red density core points have been divided into three parts, each of which retains its own minimum spanning tree.
(3) We cluster the density cores according to the minimum spanning subtree structure retained after trimming, namely, by assigning the points on the same minimum spanning subtree to the same cluster.
According to the description of the above steps, the specific steps of clustering density core points are shown in Algorithm 3.
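Steps (1)-(3) can be sketched as follows: a pure-Python Prim algorithm plus the mean + ω·std cut, with connected components labeled via union-find. This is a sketch under the threshold form stated above, not the authors' implementation; all names are illustrative.

```python
import math
from statistics import mean, stdev

def prim_mst(points):
    """Prim's algorithm: returns MST edges (i, j, weight) over all points."""
    n = len(points)
    in_tree = [False] * n
    best = [(math.inf, -1)] * n   # (cost to reach the tree, parent)
    best[0] = (0.0, -1)
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best[i][0])
        in_tree[u] = True
        if best[u][1] >= 0:
            edges.append((best[u][1], u, best[u][0]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v][0]:
                    best[v] = (d, u)
    return edges

def cluster_cores(points, omega=3.0):
    """Cut MST edges longer than mean + omega*std, then label the
    resulting connected components (one label per subtree)."""
    edges = prim_mst(points)
    w = [e[2] for e in edges]
    cut = mean(w) + omega * (stdev(w) if len(w) > 1 else 0.0)
    parent = list(range(len(points)))   # union-find over kept edges
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j, d in edges:
        if d <= cut:
            parent[find(i)] = find(j)
    roots = {find(i) for i in range(len(points))}
    label = {r: c for c, r in enumerate(sorted(roots))}
    return [label[find(i)] for i in range(len(points))]
```

With two elongated groups separated by a wide gap, the single long bridging edge exceeds the threshold and is removed, leaving one subtree (and hence one label) per group.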

The MST-DC Clustering Algorithm.
In this section, we introduce the proposed clustering algorithm, MST-DC. The basic steps are as follows: firstly, we find the reverse nearest neighbors of each object according to the natural neighbor searching algorithm; secondly, we use formula (5) to obtain the core objects; thirdly, we build the minimum spanning tree of the density core set RCore, cut the edges between clusters according to formula (6), and then cluster the density cores according to the resulting subcluster trees; fourthly, we apply the concept of outlier clusters from paper [29] to eliminate erroneous clusters (in that paper, an outlier cluster detection algorithm called ROCF is proposed based on the mutual neighbor graph and on the observation that outlier clusters are usually much smaller than normal clusters); and finally, the noncore points are assigned to the clusters to which their closest density core points belong. The overall steps of the MST-DC algorithm are shown in Algorithm 4.
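The final assignment step admits a short sketch (the helper name is illustrative; it assumes core labels have already been produced by the MST stage):

```python
import math

def assign_noncores(points, core_idx, core_labels):
    """Give every non-core point the label of its nearest core point."""
    labels = [None] * len(points)
    for c, lab in zip(core_idx, core_labels):
        labels[c] = lab
    for i in range(len(points)):
        if labels[i] is None:
            nearest = min(core_idx,
                          key=lambda c: math.dist(points[i], points[c]))
            labels[i] = labels[nearest]
    return labels
```

For example, with two core points labeled 0 and 1, each remaining point simply inherits the label of whichever core is closer.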

The Complexity Analysis.
Based on the description of the MST-DC clustering algorithm in the previous sections, the time complexity of MST-DC depends on the following parts: (1) we use the natural neighbor algorithm optimized by a KD-tree [30] to obtain the reverse nearest neighbors of each data point, the natural eigenvalue, and the Euclidean distances of the data points, with time complexity O(n log(n)); (2) extracting core points is equivalent to traversing the data points, with time complexity O(n); (3) the time complexity of clustering the core points is dominated by building the minimum spanning tree; this paper uses the heap-optimized Prim algorithm, whose time complexity is O(m log(m)), where m (m ≪ n) is the number of core points; and (4) we assign the remaining points to their nearest density cores with time complexity O(l), where l (l < n, n = l + m) is the number of remaining noncore points. In summary, the total time complexity of MST-DC is O(n log(n)).
The larger the values of the three evaluation indexes, Accuracy, F-Measure, and NMI, are, the better the clustering result is.
We chose Accuracy as the first evaluation indicator. For n objects x_i ∈ D_j, let p_i and c_i be the intrinsic category label and the predicted cluster label of x_i, respectively. Accuracy is calculated as

Accuracy = ( Σ_{i=1}^{n} δ(p_i, map(c_i)) ) / n,

where map(·) is a mapping function that maps each predicted cluster label to its intrinsic cluster label via the Hungarian algorithm [34], and δ(a, b) equals 1 if a = b and 0 otherwise. Accuracy ∈ [0, 1]; the higher the Accuracy, the better the clustering performance.
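A small sketch of the Accuracy computation. For simplicity it tries every label mapping by brute force, which stands in for the Hungarian algorithm and is fine for a handful of clusters; the function name is illustrative.

```python
from itertools import permutations

def accuracy(true_labels, pred_labels):
    """Best-match accuracy: try every mapping from predicted cluster
    ids to true ids (assumes no more predicted clusters than true ones;
    brute force replaces the Hungarian method for small label sets)."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    best = 0
    for perm in permutations(true_ids, len(pred_ids)):
        m = dict(zip(pred_ids, perm))
        hits = sum(1 for t, p in zip(true_labels, pred_labels)
                   if m[p] == t)
        best = max(best, hits)
    return best / len(true_labels)
```

A clustering that is correct up to a relabeling of cluster ids scores 1.0, as the δ-with-mapping formula requires.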
The mutual information (MI) [32] can be used to measure the information shared by two clusterings. Given a set S of N data points and two partitions of S, X = {X_1, X_2, ..., X_r} and Y = {Y_1, Y_2, ..., Y_s}, if we select a point at random from S, the probability that the point belongs to cluster X_i is

P(X_i) = |X_i| / N.

Entropy describes the uncertainty about which cluster a randomly selected point belongs to. The entropy of the clustering X is given by

H(X) = − Σ_{i=1}^{r} P(X_i) log P(X_i).

The MI between the clusterings X and Y is defined by

MI(X, Y) = Σ_{i=1}^{r} Σ_{j=1}^{s} P(X_i ∩ Y_j) log [ P(X_i ∩ Y_j) / (P(X_i) P(Y_j)) ],

and the NMI is calculated as

NMI(X, Y) = 2 MI(X, Y) / (H(X) + H(Y)).
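The entropy, MI, and NMI definitions translate directly into code. This sketch normalizes by the mean of the two entropies (other normalizations, such as the geometric mean, are also common); the function name is illustrative.

```python
import math
from collections import Counter

def nmi(x, y):
    """Normalized mutual information between two labelings of the
    same points, normalized by the mean of the two entropies."""
    n = len(x)
    px, py = Counter(x), Counter(y)
    pxy = Counter(zip(x, y))          # joint cluster membership counts

    def entropy(counts):
        return -sum((v / n) * math.log(v / n) for v in counts.values())

    mi = sum((v / n) * math.log((v / n) / ((px[a] / n) * (py[b] / n)))
             for (a, b), v in pxy.items())
    hx, hy = entropy(px), entropy(py)
    return mi / ((hx + hy) / 2) if hx and hy else 0.0
```

Identical labelings give NMI = 1, and statistically independent labelings give NMI = 0.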

Experiments on Synthetic Datasets.
We first conduct comparison experiments on ten synthetic datasets, whose characteristics are described in Table 1; the original datasets are displayed in Figure 4. D1 and D2 contain spherical clusters with different numbers of clusters and densities. D1 contains three clusters and a total of 600 objects. D2 consists of five clusters with a skewed distribution and a total of 6699 objects. In contrast, the remaining synthetic datasets contain clusters with arbitrary shapes. D3 is composed of four line clusters and a total of 1268 objects. D4 has a spherical cluster in the middle of two ring clusters and a total of 1897 objects.
D5 is composed of two moon manifold clusters with 1532 objects, including noise objects. D6 includes four manifold clusters and a total of 630 objects. D7 has four spherical clusters within two right-angle line clusters, with some noise objects and a total of 1916 objects. D8 has three spherical clusters within one manifold cluster, with several noise objects and a total of 1427 objects. D9 is composed of six square clusters that cross and run parallel to each other, with a total of 8000 objects including some noise objects. D10 consists of three circle clusters, two spiral clusters, and two spherical clusters, with a total of 8533 objects including some noise objects.
The parameter settings of each clustering algorithm on the ten synthetic datasets are displayed in Table 2. For the Kmeans algorithm, N represents the number of clusters in the dataset, and the initial clustering centers are randomly selected. DBSCAN needs two parameters, Eps and Minpts. The cutoff distance dc of DPC is set to 2%. SNNDPC needs the parameter K to find the k nearest neighbors; we test different values of K to achieve better results. The results of DCore are affected by the selection of the parameters r1, r2, T_1, T_n, and R, so we use different parameter settings to achieve better results. The LDP-MST algorithm needs the parameter N to be set manually. Concerning MST-DC, there is no need to set parameters; in Table 2, the symbol "-" indicates that MST-DC requires no parameters. The experimental result on D1 is shown in Figure 5. All clustering algorithms find the correct clusters in D1, which means that these algorithms are effective for uniformly distributed spherical datasets; however, all of them except MST-DC need parameter settings. Figure 6 shows that DPC, LDP-MST, and MST-DC obtain the correct clustering on D2, while Kmeans, DBSCAN, DCore, and SNNDPC do not. The number of clusters in the Kmeans algorithm is input by users; because it cannot recognize clusters with different densities, a low-density area is mistakenly recognized as one cluster, while a high-density area is erroneously partitioned. Because the chosen Eps is too large, DBSCAN aggregates D2 into four clusters. Because of improper parameter choices, DCore and SNNDPC cannot correctly identify the clusters in D2. Therefore, Figure 6 shows that global fixed parameter settings are not applicable to multidensity patterns.
The experimental results on D3 are shown in Figure 7. Except for Kmeans and SNNDPC, all algorithms find the correct clusters, which illustrates that Kmeans cannot be applied to line cluster datasets; moreover, incorrect parameter settings lead to incorrect clustering results. The experimental results on D4 are shown in Figure 8, which demonstrates whether the algorithms can process circle clusters. Kmeans, DPC, DCore, and SNNDPC are not suitable for datasets containing circular clusters and cannot obtain correct clustering results, while DBSCAN, LDP-MST, and MST-DC find the correct clusters for D4. The clustering results displayed in Figures 9-11 demonstrate whether the algorithms can process clusters of arbitrary shape. The results of DBSCAN, LDP-MST, and MST-DC are similar: all three find the clusters of the three datasets correctly. As shown in Figure 9(d), DCore misidentifies only two points in D5. Besides, none of Kmeans, DPC, DCore, and SNNDPC can find the right clusters on D5, D6, and D7. The clustering results shown in Figure 12 display that MST-DC, LDP-MST, and DCore recognize the D8 dataset correctly, while Kmeans, DBSCAN, DPC, and SNNDPC do not; in other words, MST-DC, LDP-MST, and DCore are good at dealing with manifold structure datasets. The results shown in Figures 13 and 14 demonstrate that only the LDP-MST and MST-DC algorithms obtain the correct clusters. DBSCAN and DPC have similar results in that neither can deal with spiral clusters and circle clusters. Although SNNDPC can detect the spiral clusters, it fails to handle the circle clusters; DCore can detect circle clusters but not spiral clusters. Hence, MST-DC can be applied to more complex situations without parameter settings.
From Figures 5-14, we can see that MST-DC performs better than the other algorithms. Moreover, no parameters need to be set, so several intrinsic flaws of the other algorithms are avoided. The running time of the seven clustering algorithms on the synthetic datasets is shown in Table 3. Although MST-DC runs slower than Kmeans, DBSCAN, and LDP-MST, it runs evidently faster than DPC and SNNDPC. Moreover, the running time of MST-DC is similar to that of DCore.

Experiments on Real-World Datasets.
To further prove the superiority of the MST-DC algorithm, we also apply the proposed method to six real-world datasets, Segmentation, Pageblock, Iris, Control, Column2C, and Breast, obtained from the University of California, Irvine (UCI) Machine Learning Repository. The characteristics of the six real datasets are displayed in Table 4. Table 5 lists the parameter setting of each clustering algorithm on the six UCI datasets; the symbol "-" indicates that MST-DC requires no parameters. Table 6 shows the clustering performance of the seven clustering algorithms according to three external criteria, namely, Accuracy, F-Measure, and NMI. Table 7 shows the running time of each algorithm, where "-" indicates that the algorithm did not finish within the specified time (20 minutes). As shown in Table 6, except on the Iris dataset, the clustering performance of MST-DC is superior to that of the Kmeans, DBSCAN, DPC, DCore, SNNDPC, and LDP-MST algorithms. LDP-MST obtains the optimal clustering results on the Iris dataset in terms of Accuracy, NMI, and F-Measure; it also achieves the best Accuracy on the Segmentation and Control datasets and the best NMI on the Column2C dataset. On relatively simple datasets, the Accuracy, NMI, and F-Measure values of the Kmeans, DBSCAN, DPC, and SNNDPC algorithms are high; however, on datasets with higher dimensions or complex structures, these four algorithms perform poorly. Except on the Segmentation dataset, DCore obtains relatively good clustering results on multiple datasets. However, DCore has a major flaw: it needs five parameters to be set manually, and it is usually very difficult to adjust the parameter combination for better clustering results.
According to Table 7, MST-DC is slower than the Kmeans, DBSCAN, and LDP-MST algorithms on the UCI datasets. However, MST-DC is much faster than the DPC and SNNDPC algorithms. The running times of the MST-DC and DCore algorithms are similar on most datasets.
From the above analysis, we conclude that MST-DC provides overall good clustering performance compared with the other existing methods. Firstly, the MST-DC algorithm employs the natural neighbor algorithm to obtain reverse neighbor information and then extracts the density core points according to the number of reverse neighbors and the natural characteristic value; this process needs no parameters, while the other algorithms need parameters set manually. Secondly, MST-DC builds the minimum spanning tree only on the density core points instead of all points, which reduces the computational cost while excluding the interference of noise points. Thirdly, the MST-DC algorithm can recognize complex datasets efficiently and accurately. However, compared with Kmeans, DBSCAN, and LDP-MST, the time efficiency of MST-DC is not optimal, which is worth further exploration.

Conclusions
In this paper, we propose a novel clustering algorithm named MST-DC. The algorithm has four steps: firstly, we automatically obtain the reverse neighbors of each object based on the concept of natural neighbor searching, without any user-set parameters; secondly, we obtain the core objects according to the number of reverse neighbors of each object; thirdly, we construct a minimum spanning tree of the core objects to obtain their clustering result; and finally, unallocated objects are assigned the label of their nearest local core object. The experimental results on synthetic and real-world datasets demonstrate that MST-DC can detect quite complex patterns with large variations in density. Besides, unlike most clustering methods, MST-DC needs no user-specified parameters: its only internal parameter, k, is obtained automatically through the natural neighbor concept. Therefore, our proposed algorithm, MST-DC, is superior to the other algorithms.
Nevertheless, there are several aspects to be improved. Firstly, a similarity measure based on the Euclidean distance is used when acquiring natural neighbor information, extracting density cores, and assigning the remaining points; however, the Euclidean distance is prone to the curse of dimensionality in high-dimensional data spaces, resulting in poor clustering effects. Therefore, we will explore the adaptability of this algorithm to high-dimensional data in the future. Secondly, the remaining points are currently assigned directly to the cluster of their closest density core; in future work, we will study new methods of allocating the remaining points.
Data Availability

The data that support the findings of this study are openly available on GitHub at https://github.com/qczggaoqiang/MST-DC.

Conflicts of Interest
The authors declare that they have no conflicts of interest.