MPE Mathematical Problems in Engineering 1563-5147 1024-123X Hindawi 10.1155/2018/8451796 8451796 Research Article CciMST: A Clustering Algorithm Based on Minimum Spanning Tree and Cluster Centers Lv Xiaobo 1 http://orcid.org/0000-0003-4626-1401 Ma Yan 1 He Xiaofu 2 Huang Hui 1 Yang Jie 3 Papakostas George A. 1 College of Information and Electrical Engineering Shanghai Normal University Shanghai 200234 China shnu.edu.cn 2 College of Physicians & Surgeons Columbia University New York USA columbia.edu 3 Computational Intelligence and Brain Computer Interface (CIBCI) Center University of Technology Sydney Australia uts.edu.au 2018 17122018 2018 10 10 2018 04 12 2018 17122018 2018 Copyright © 2018 Xiaobo Lv et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The minimum spanning tree- (MST-) based clustering method can identify clusters of arbitrary shape by removing inconsistent edges. The definition of the inconsistent edges is a major issue that has to be addressed in all MST-based clustering algorithms. In this paper, we propose a novel MST-based clustering algorithm through the cluster center initialization algorithm, called cciMST. First, in order to capture the intrinsic structure of the data sets, we propose the cluster center initialization algorithm based on geodesic distance and dual densities of the points. Second, we propose and demonstrate that the inconsistent edge is located on the shortest path between the cluster centers, so we can find the inconsistent edge with the length of the edges as well as the densities of their endpoints on the shortest path. Correspondingly, we obtain two groups of clustering results. Third, we propose a novel intercluster separation by computing the distance between the points at the intersection of clusters. Furthermore, we propose a new internal clustering validation measure to select the best clustering result. The experimental results on the synthetic data sets, real data sets, and image data sets demonstrate the good performance of the proposed MST-based method.

Funding: National Natural Science Foundation of China (grant numbers 61373004 and 61501297).
1. Introduction

Clustering aims to group a set of objects into clusters such that the objects of the same cluster are similar, and objects belonging to different clusters are dissimilar. Clustering is an active research topic in statistics, pattern recognition, machine learning, and data mining. A wide variety of clustering algorithms have been proposed for different applications. The different clustering methods, such as partitional, hierarchical, density-based, and grid-based approaches, are not completely satisfactory due to the multiplicity of problems and the data distributions. For instance, as a well-known partitional clustering algorithm, the K-means algorithm often assumes a spherical shape structure of the underlying data, and it cannot detect clusters with irregular boundaries. Most of the hierarchical clustering algorithms cannot satisfy the requirement of clustering efficiency and accuracy simultaneously. DBSCAN is a classical density-based clustering algorithm that can find clusters with arbitrary shapes. However, it requires two input parameters, which are difficult to determine. CLIQUE combines grid-based and density-based clustering algorithms, and it works efficiently for small data sets. However, its cluster boundaries are either horizontal or vertical, owing to the nature of the rectangular grid. Sufficient empirical evidence has shown that the minimum spanning tree (MST) representation is invariant to detailed geometric changes in the boundaries of clusters. Therefore, the shape of the cluster boundary has little impact on the performance of the algorithm, which allows us to overcome the problems commonly faced by the classical clustering algorithms.

The MST-based clustering algorithm is able to achieve the clustering result provided that the inconsistent edges between the clusters have been determined and removed. Hence, defining the inconsistent edge is one of the main problems to be solved in this paper. If we tackle this issue from the view of the length of edges as well as the density of points, the MST method commonly requires a set of parameters that are difficult to tune in practice, which makes the clustering result unstable. Furthermore, many factors, including the arbitrary shape of clusters, different densities, and noise, make this problem more complex. We found that the shortest path between the cluster centers contains the inconsistent edge; that is, the search scope of inconsistent edges can be narrowed to the shortest path between the cluster centers. Based on this finding, we propose the cluster center initialization algorithm based on the geodesic distance and dual densities of points. In this method, the Euclidean distance between the vertices is replaced with the geodesic distance in the MST. Global and local densities of the vertices are defined by adjusting the variance in the Gaussian function. Correspondingly, two groups of K cluster centers under different densities are obtained. Next, we find the K-1 shortest paths among the K(K-1)/2 paths between pairs of the K cluster centers. The K-1 inconsistent edges are determined and removed with consideration of the length of each edge as well as the densities of its two endpoints on the shortest path. Hence, we obtain two groups of clustering results. Then, we define a novel intercluster separation with the distance between the points at the intersection of clusters. The optimal clustering result is determined by combining intercluster separation and intracluster compactness.
The key contributions of this paper include the following: (i) propose the use of cluster center initialization in MST-based clustering, (ii) give a cluster center initialization algorithm that takes advantage of geodesic distance, and (iii) develop a new intercluster separation.

The rest of this paper is organized as follows: in Section 2, we review some existing work on MST-based clustering algorithms. We next present our proposed cluster center initialization method in Section 3. In Section 4, we give the definition of inconsistent edges. Section 5 presents a new internal clustering validation measure. In Section 6, we analyze the time complexity of the algorithm. Section 7 presents the experimental evaluations. Finally, Section 8 concludes our work and discusses future work.

2. Related Work

A spanning tree is a connected acyclic subgraph of a graph G, which contains all the vertices from G. The minimum spanning tree (MST) of a weighted graph is the minimum weight spanning tree of that graph. The cost of constructing an MST is O(m log n) with the classical MST algorithm, where m is the number of edges in the graph and n is the number of vertices. Enormous amounts of data in various application domains can be represented in a graph. The set of vertices in the graph represents the points in the data set and the edge connecting those vertices reveals the relationship between points. Usually, MST-based clustering algorithms consist of three steps: (1) construct a minimum spanning tree; (2) remove the inconsistent edges to get a set of connected components (clusters); (3) repeat step (2) until the terminating condition is satisfied. Since Zahn first proposed the MST-based clustering method, recent efforts have focused on the definition of the inconsistent edges. Under the ideal condition that the clusters are well separated and there exist no outliers, the inconsistent edges are the longest edges. However, the longest edge does not always correspond to the inconsistent edge if there are outliers in the data set. Xu et al. used an MST to represent multidimensional gene expression data. They describe three objective functions. The first objective function removes the K-1 longest edges so that the total weight of the K subtrees is minimized. The second objective function is to minimize the total distance between the center and each point in a cluster. The third objective function is to minimize the total distance between the "representative" of a cluster and each point in the cluster. The clustering result is vulnerable to outliers when the inconsistent edges are removed according to the lengths of edges. To solve this problem, Laszlo et al. proposed an MST-based clustering algorithm that puts a constraint on the minimum cluster size rather than on the number of clusters. Grygorash et al. proposed a hierarchical MST-based clustering approach (HEMST) that iteratively cuts edges, merges points in the resulting components, and rebuilds the spanning tree.
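The generic three-step scheme above can be illustrated with a minimal sketch. It builds the MST of the complete Euclidean graph with Prim's algorithm and then applies the simplest inconsistency rule mentioned in this section, cutting the K-1 longest edges. The function names `prim_mst` and `cut_longest_edges` are hypothetical, and this is only the longest-edge baseline, not the method proposed in this paper.

```python
import math


def prim_mst(points):
    """Return MST edges (i, j, length) of the complete Euclidean graph in O(n^2)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection of each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest vertex that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def cut_longest_edges(n, edges, k):
    """Drop the k-1 longest MST edges and label the resulting components."""
    kept = sorted(edges, key=lambda e: e[2])[: len(edges) - (k - 1)]
    adj = [[] for _ in range(n)]
    for i, j, _ in kept:
        adj[i].append(j)
        adj[j].append(i)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = next_label
                    stack.append(v)
        next_label += 1
    return labels
```

With a single outlier far from all clusters, this baseline would spend one of its K-1 cuts on isolating the outlier, which is the weakness the density-aware definitions discussed next try to avoid.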

In addition to the inconsistent edges, the definition of the density of points is also one of the crucial factors that affect the performance of the clustering result. The traditional MST-based clustering algorithms only exploit the information of edges contained in the tree to partition a data set, which makes these algorithms more vulnerable to outliers. The recent MST-based methods tend to define the inconsistent edges based on the local density around the point. Some methods define the density of points with the degree of the vertex. Chowdhury et al. proposed a density-oriented MST-based clustering technique that assumes that the boundary between any two clusters must belong to a valley region (a region where the density of the data points is the lowest compared to those of the neighboring regions) and that bases the inconsistency measure on finding such valley regions. Luo et al. proposed an MST-based clustering algorithm with neighborhood density difference estimation. Wang et al. proposed finding a local density factor for each data point during the construction of an MST and discarding outliers. Zhong et al. proposed a graph-theoretical clustering method based on two rounds of minimum spanning trees to deal with separated clusters and touching clusters. For some specially distributed data, such as uniformly distributed data, if only the local density of the point is taken into account, it cannot be guaranteed that the best clustering result will be achieved. To address this problem, we propose to calculate both the global and the local density of the point. Some MST-based algorithms are combined with other methods, such as information theory, k-means, and multivariate Gaussians.

3. The Proposed Cluster Center Initialization Method

3.1. Density Peaks Clustering

Among the recent cluster center initialization methods, density peaks clustering (DPC) has been widely used. We propose a new cluster center initialization method based on DPC in this paper. Here, we briefly describe DPC.

It is assumed in DPC that the cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities. DPC utilizes two quantities: one is the density ρi of point xi and the other is its distance δi from points of higher densities.

The density ρi of point xi is defined as

$$\rho_i=\sum_{j}\exp\left(-\frac{D(x_i,x_j)^2}{2\sigma^2}\right) \tag{1}$$

where D(xi,xj) is the Euclidean distance between points xi and xj, and σ is the variance. Algorithm 1 shows the definition of σ.

Algorithm 1: Pseudocode of the definition of σ.

(1) Input: Data set $X=\{x_i \mid x_i\in\mathbb{R}^P, i=1,2,\ldots,n\}$, the total number of points n, a predefined parameter s

(2) Output: The value of σ

(3) Begin

(4) Calculate the pairwise distances between points and sort them in ascending order, that is, $d_1,d_2,\ldots,d_{n(n-1)/2}$

(5) Calculate $th=[s\cdot n(n-1)/2]$ ("[ ]" represents a rounding operation)

(6) Calculate $\sigma=d_{th}$

(7) End
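Algorithm 1 can be sketched in a few lines. The function name `cutoff_sigma` is hypothetical, s is passed as a fraction (0.02 for 2%), and two details are assumptions not fixed by the pseudocode: Python's built-in `round` is used for the rounding operation, and the index th is clamped to the valid range so that very small s still selects the smallest distance.

```python
import math
from itertools import combinations


def cutoff_sigma(points, s):
    """Return sigma = d_th, the th-th smallest pairwise Euclidean distance."""
    # Step (4): all n(n-1)/2 pairwise distances in ascending order.
    dists = sorted(math.dist(p, q) for p, q in combinations(points, 2))
    # Step (5): th = [s * n(n-1)/2]; rounding mode and clamping are assumptions.
    th = round(s * len(dists))
    th = min(max(th, 1), len(dists))
    # Step (6): d_th with 1-based indexing as in the pseudocode.
    return dists[th - 1]
```

Because the distances are sorted, a larger s can never select a smaller σ, which is the property the later sections rely on when switching between local and global density.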

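The distance between the point xi and the other points with higher densities, denoted by δi, is defined as

$$\delta_i=\begin{cases}\min\limits_{j:\rho_i<\rho_j} D(x_i,x_j), & \text{if there exists } j \text{ with } \rho_i<\rho_j\\ \max\limits_{j} D(x_i,x_j), & \text{otherwise}\end{cases} \tag{2}$$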

When the density ρi and the distance δi for each point have been calculated, the decision graph is further generated. The points with relatively high ρi and δi are considered as cluster centers.
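The two DPC quantities can be computed directly from a distance matrix, as in the following minimal sketch. The names `dpc_quantities` and `pick_centers` are hypothetical. Two choices are assumptions rather than part of the description above: the sum in (1) skips j = i (including it would only add 1 to every density), and centers are ranked by the product ρi·δi, a common heuristic that stands in for reading the decision graph by eye.

```python
import math


def dpc_quantities(dist, sigma):
    """Return (rho, delta) as in (1) and (2) for a symmetric distance matrix."""
    n = len(dist)
    rho = [
        sum(math.exp(-dist[i][j] ** 2 / (2 * sigma ** 2)) for j in range(n) if j != i)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        # Distances to all points of strictly higher density.
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # Densest point(s): fall back to the largest distance.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


def pick_centers(rho, delta, k):
    """Assumed heuristic: take the k points with the largest rho * delta."""
    order = sorted(range(len(rho)), key=lambda i: rho[i] * delta[i], reverse=True)
    return order[:k]
```

Since the function only sees a distance matrix, the same code works whether the matrix holds Euclidean distances or any other distance measure.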

3.2. Geodesic-Based Initialization Method

In the DPC method, the precondition for finding the correct cluster centers is that the distribution of cluster centers conforms to the abovementioned assumptions. However, many studies show that the two assumptions have certain limitations in different scenarios. As can be seen from Figure 1(a), for the Three circles data set (taken from the literature), the three cluster centers (represented as solid triangles) obtained by the DPC method lie in the red cluster and the green cluster, yet none lies in the blue cluster. As shown in Figure 1(b), there is only one point with relatively large values of ρi and δi, and it lies in the red cluster. This is due to the fact that both the green cluster and the blue cluster are nonconvex, and the densities of points in the blue cluster are smaller than those in the green cluster, which leads to the result that no cluster center lies in the blue cluster.

Figure 1: Three circles data set. (a) The cluster centers. (b) Decision graph under Euclidean distance.

The DPC method exploits the Euclidean distance between two points as the distance measure. This distance measure is suitable for data sets with convex shapes, yet it is not suitable for data sets with nonconvex shapes. To address this issue, this paper adopts a different distance metric, the geodesic distance.

Let X be a data set with K clusters and n data points, that is, $X=\{x_i \mid x_i\in\mathbb{R}^P, i=1,2,\ldots,n\}$. Data set X is represented by an undirected complete graph G=(V,E), where $V=\{v_1,v_2,\ldots,v_n\}$ and |E|=n(n-1)/2. Each data point xi in data set X corresponds to a vertex $v_i\in V$. For the sake of convenience, the vertex vi in graph G is represented by xi. Let $T=(V,E_T)$ denote the MST of G=(V,E), where $E_T=\{e_1,e_2,\ldots,e_{n-1}\}$, $e_i\in E(G)$.

Lemma 1.

There is one and only one path between each pair of vertices in T.

Definition 2 (geodesic distance).

Suppose $p=\{p_1,p_2,\ldots,p_l\}\subseteq V$ is the path between two vertices xi and xj in T, with $p_1=x_i$ and $p_l=x_j$, where edge $(p_k,p_{k+1})\in E_T$, $1\le k\le l-1$. The geodesic distance between two vertices xi and xj is defined as

$$D_g(x_i,x_j)=\sum_{k=1}^{l-1}D(p_k,p_{k+1}) \tag{3}$$

where D(pk,pk+1) is the Euclidean distance between the consecutive path vertices pk and pk+1.
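Because the path between two vertices of T is unique (Lemma 1), all pairwise geodesic distances can be obtained by one tree traversal per source vertex. The sketch below assumes the MST is given as a list of `(i, j, length)` edges; the function name `geodesic_distances` is hypothetical.

```python
def geodesic_distances(n, mst_edges):
    """Return the n x n matrix of path lengths in the tree, as in (3)."""
    adj = [[] for _ in range(n)]
    for i, j, w in mst_edges:
        adj[i].append((j, w))
        adj[j].append((i, w))
    dg = [[0.0] * n for _ in range(n)]
    for src in range(n):
        # Depth-first traversal; a tree has no cycles, so tracking the
        # parent is enough to avoid walking back along the same edge.
        stack = [(src, -1)]
        while stack:
            u, parent = stack.pop()
            for v, w in adj[u]:
                if v != parent:
                    dg[src][v] = dg[src][u] + w
                    stack.append((v, u))
    return dg
```

Each traversal visits the n-1 tree edges once, so filling the whole matrix this way costs O(n^2) in this sketch.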

When the Euclidean distance between pairwise points is replaced by the geodesic distance, the normalized distance between pairwise points in the same cluster remains relatively small, while the distance between pairwise points from different clusters becomes larger. For example, we run a statistical comparison on the Three circles data set. We divide the interval [0,1] of the normalized distance measure into ten subintervals of equal length. Then we count the number of point pairs, from the same cluster or from different clusters, whose Euclidean distance or geodesic distance falls into each subinterval. It can be seen from Figure 2 that, for both the Euclidean distance and the geodesic distance, a large quantity of point pairs in the same cluster fall into the first four subintervals, which implies that the difference between the two measures is small within a cluster. In contrast, the distributions of point pairs from different clusters differ significantly between the two measures. Under the Euclidean distance they are concentrated in the 2nd-7th subintervals, while under the geodesic distance they are distributed among all of the subintervals. The reason is that the shape of the Three circles data set is nonspherical: for point pairs from different clusters, the distance is smaller under the Euclidean metric and larger under the geodesic metric.

Figure 2: The histograms of two distance measures for pairwise points in the same and different clusters.

After the geodesic distance is defined, the density ρi of point xi is redefined as

$$\rho_i=\sum_{j}\exp\left(-\frac{D_g(x_i,x_j)^2}{2\sigma^2}\right) \tag{4}$$

The magnitude of the density ρi depends on σ in (4), and σ increases with s in Algorithm 1 of Section 3.1; that is, the larger s is, the larger σ will be, and vice versa.

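In addition, the distance δi between the point xi and the other points with higher densities is redefined as

$$\delta_i=\begin{cases}\min\limits_{j:\rho_i<\rho_j} D_g(x_i,x_j), & \text{if there exists } j \text{ with } \rho_i<\rho_j\\ \max\limits_{j} D_g(x_i,x_j), & \text{otherwise}\end{cases} \tag{5}$$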

For the purpose of adapting the selected cluster centers to data sets with arbitrary shape, we introduce the concept of global density and local density. The variance σ can be seen as the scale factor. The smaller the value of σ is, the smaller the scale is. Hence, the corresponding density ρi can be seen as the local density around the point xi. In contrast, the larger the value of σ is, the larger the scale is. And the corresponding density ρi can be seen as the global density around the point xi. The parameter s is set as 2% and 20% after a number of experiments in this paper, with which we can obtain the local density and global density of the point. The points with relatively higher ρi and δi are considered as cluster centers, and correspondingly two groups of cluster centers are achieved.

4. The Definition of Inconsistent Edge

For a data set with K categories, the MST-based clustering method attempts to partition the MST into K subtrees, $\{T_i\}_{i=1}^{K}$, by removing the K-1 inconsistent edges.

Lemma 3.

The inconsistent edge between two vertices must be in the path connecting two cluster centers of the different clusters to which the two vertices belong.

Proof.

Suppose data set X contains two clusters A and B whose cluster centers are Ca and Cb, respectively. Construct the MST $T=(V,E_T)$ for data set X. Let $e_{ab}\in E_T$, connecting a vertex $a\in A$ to a vertex $b\in B$, be the inconsistent edge. According to Lemma 1, there is one and only one path between Ca and a and between Cb and b, represented as $p_{C_a a}=\{p_{a1},p_{a2},\ldots,p_{al}\}\subseteq A$ and $p_{C_b b}=\{p_{b1},p_{b2},\ldots,p_{bm}\}\subseteq B$. Correspondingly, the path between the cluster centers Ca and Cb is $p_{C_a a}\cup e_{ab}\cup p_{C_b b}$. Thus, eab belongs to the path between Ca and Cb.

There are K(K-1)/2 paths among the K cluster centers. Next, we need to select K-1 paths from them. The inconsistent edge must lie at the intersection of a pair of adjacent clusters. In general, the geodesic distance between the cluster centers of two adjacent clusters is smaller than that of two nonadjacent clusters. Therefore, the methodology for selecting the K-1 paths is to construct an MST Tc over the K cluster centers, according to the geodesic distances of the K(K-1)/2 pairs of cluster centers in T; correspondingly, the K-1 edges in Tc correspond to K-1 paths in T.
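The path selection described above can be sketched as Prim's algorithm run on the cluster centers only, with the geodesic distance in T as the edge weight. The function name `center_tree` is hypothetical; `centers` is assumed to hold vertex indices and `dg` an all-pairs geodesic distance matrix indexed by those vertices.

```python
def center_tree(centers, dg):
    """Return the K-1 center pairs that form the MST over the cluster centers."""
    in_tree = {centers[0]}
    pairs = []
    while len(in_tree) < len(centers):
        # Cheapest connection from a center already in the tree to one outside it.
        u, v = min(
            ((a, b) for a in in_tree for b in centers if b not in in_tree),
            key=lambda ab: dg[ab[0]][ab[1]],
        )
        in_tree.add(v)
        pairs.append((u, v))
    return pairs
```

Each returned pair names two centers of presumably adjacent clusters, and the unique tree path between them in T is where the inconsistent edge is searched for.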

After determining the K-1 paths, the next task is to find one inconsistent edge on each of them. Generally, the inconsistent edge has two features: (1) it is relatively long; (2) the densities of its two end points are relatively small. Based on this fact, we define a new parameter for the edge eij connecting xi and xj in the path:

$$\varsigma_{ij}=\frac{D(x_i,x_j)}{\rho_i+\rho_j} \tag{6}$$

where D(xi,xj) is the Euclidean distance between points xi and xj and ρi and ρj are the local or global densities of points xi and xj, respectively. On each of the K-1 paths, we find and remove the edge with the largest value of ςij, and correspondingly we obtain K clusters.
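As a minimal sketch of this step, the code below walks the unique tree path between two cluster centers and returns the edge that maximizes the parameter in (6). The names `tree_path` and `inconsistent_edge` are hypothetical; `adj` is assumed to be the adjacency list of the MST T, and `rho` the densities computed beforehand.

```python
import math


def tree_path(adj, src, dst):
    """Return the unique path src -> dst in a tree given as adjacency lists."""
    parent = {src: None}
    stack = [src]
    while stack:
        u = stack.pop()
        if u == dst:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                stack.append(v)
    path = [dst]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]


def inconsistent_edge(points, rho, adj, center_a, center_b):
    """Return the path edge (i, j) with the largest D(x_i, x_j) / (rho_i + rho_j)."""
    path = tree_path(adj, center_a, center_b)

    def score(edge):
        i, j = edge
        return math.dist(points[i], points[j]) / (rho[i] + rho[j])

    return max(zip(path, path[1:]), key=score)
```

Dividing by the summed densities means that a long edge inside a dense region scores lower than an equally long edge between two sparse end points.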

5. Internal Clustering Validation Index

To adapt to data sets with various characteristics, we obtain two groups of cluster centers under local density and global density, and thus we achieve two groups of clustering results. We can exploit internal validation measures to determine the optimal result from the two clustering results when external information is not available.

In general, the intercluster separation and intracluster compactness are used as the internal validation measures, where intercluster separation plays a more important role. The calculation of intercluster separation can be categorized into two classes. The first takes the distance of a single pair of points as the intercluster separation; for example, the maximum or minimum distance between pairwise points or the distance between the cluster centers. The second is based on the average pairwise distance between points in the different clusters. Let us analyze the two categories. In the first category, the distance between a single pair of points cannot represent the distance between two clusters, and the resulting intercluster separation is unavoidably wrong if there exist outliers in the data set. In the second category, the average distance between pairwise points reflects the average value of the pairwise distances of points, which cannot accurately reflect the distance between clusters. Yang et al. proposed an internal clustering validation index based on the neighbors (CVN), which can be exploited to select the optimal result among multiple clustering results. Similar to CVN, Liu et al. proposed the internal clustering validation index CVNN, which exploits the intracluster or intercluster relationship between a point and its neighbors. Calculating the intercluster distance with CVN or CVNN requires taking the relation between each point and its neighbors into consideration. But in fact, we need not consider all points of the data set. Figure 3 illustrates the clustering results of the proposed method on the Two moons data set for s=2% and s=20%, respectively. According to Figure 3, the proposed method gives the optimal clustering result in Figure 3(b) and the undesirable clustering result in Figure 3(a). When judging by eye whether a clustering result is correct, the main basis is the distance between the points at the intersection of the two clusters; the distance between the points far from the intersection is not considered. In Figure 3, the green circles denote the points at the intersection of the two clusters. The distance between these points from the different clusters is smaller in Figure 3(a) than in Figure 3(b); that is, the intercluster distance in Figure 3(b) is greater than that in Figure 3(a). Thus, we select the clustering result in Figure 3(b) as the optimal solution.

Figure 3: The clustering results of the Two moons data set. (a) s=2%. (b) s=20%.

Based on the above idea, we propose an intercluster distance based on the distance between the points at the intersection of two different clusters. Let us consider the example in Figure 3(a). There are two clusters, which we call the red cluster and the blue cluster. First, we calculate the minimum geodesic distance from each point in the red cluster to all of the points in the blue cluster. Then, we sort all of the minimum geodesic distances in ascending order and sum up the top 20% of them, that is, the smallest ones. Next, we exchange the roles of the red cluster and the blue cluster and, similarly, sum up the top 20% of the minimum geodesic distances. The average of the two previous results is taken as the distance between the red cluster and the blue cluster. For each pair of clusters located at the two end points of an inconsistent edge, we calculate the intercluster distance according to the above method. Finally, we take the average of all intercluster distances as the intercluster separation. The detailed algorithm is shown in Algorithm 2.

Algorithm 2: Pseudocode of intercluster separation.

(1) Input: Data set $X=\{x_i \mid x_i\in\mathbb{R}^P, i=1,2,\ldots,n\}$, the total number of points n, the number of clusters K, the clustering result $\{Cluster_1,Cluster_2,\ldots,Cluster_K\}$, the K-1 inconsistent edges $E_{inc}=\{e_1,e_2,\ldots,e_{K-1}\}$, the geodesic distance $D_g(x_i,x_j)$ between points xi and xj

(2) Output: Intercluster separation Sep

(3) Begin

(4) Construct K-1 pairs of adjacent clusters $(Cluster_i,Cluster_j)$ according to $e_i\in E_{inc}$ (the two end points of ei belong to Clusteri and Clusterj)

(5) Calculate the intercluster distance $sep_{ij}$ between adjacent clusters Clusteri and Clusterj

(5.1) Select a pair of adjacent clusters $(Cluster_i,Cluster_j)$

(5.2) Calculate the minimum geodesic distance $\min\{D_g(x_i,x_j)\mid x_j\in Cluster_j\}$ from each point $x_i\in Cluster_i$ to all of the points in Clusterj

(5.3) Sort all of the minimum geodesic distances obtained in Step (5.2) in ascending order

(5.4) Sum up the top 20% minimum geodesic distances $\{Dg_{i1},Dg_{i2},\ldots,Dg_{i\chi_1}\}$ (here, suppose there are a total of $\chi_1$ such minimum geodesic distances)

(5.5) Similar to Steps (5.2)-(5.4), for the adjacent clusters $(Cluster_j,Cluster_i)$, sum up the top 20% minimum geodesic distances $\{Dg_{j1},Dg_{j2},\ldots,Dg_{j\chi_2}\}$ (here, suppose there are a total of $\chi_2$ such minimum geodesic distances)

(5.6) Calculate the distance $sep_{ij}=\left(\sum_{o=1}^{\chi_1}Dg_{io}+\sum_{p=1}^{\chi_2}Dg_{jp}\right)/(\chi_1+\chi_2)$ between Clusteri and Clusterj

(6) Calculate Sep as the average of the K-1 values $sep_{ij}$

(7) End
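Algorithm 2 can be sketched as follows, assuming clusters are lists of vertex indices, `dg` is the all-pairs geodesic distance matrix, and `adjacent_pairs` lists the cluster index pairs joined by an inconsistent edge. The function names are hypothetical. How the 20% count is rounded is not specified in the pseudocode; rounding up with at least one distance is an assumption made here.

```python
import math


def pair_separation(cluster_i, cluster_j, dg, frac=0.2):
    """Return sep_ij of Step (5.6) for one pair of adjacent clusters."""

    def smallest_min_dists(src, dst):
        # Steps (5.2)-(5.3): nearest-point distance into dst for each point of src.
        mins = sorted(min(dg[a][b] for b in dst) for a in src)
        # Step (5.4): keep the first 20% (assumed: rounded up, at least one).
        count = max(1, math.ceil(frac * len(mins)))
        return mins[:count]

    from_i = smallest_min_dists(cluster_i, cluster_j)
    from_j = smallest_min_dists(cluster_j, cluster_i)
    return (sum(from_i) + sum(from_j)) / (len(from_i) + len(from_j))


def separation(clusters, adjacent_pairs, dg):
    """Step (6): average sep_ij over the K-1 adjacent cluster pairs."""
    seps = [pair_separation(clusters[i], clusters[j], dg) for i, j in adjacent_pairs]
    return sum(seps) / len(seps)
```

Only the points with the smallest nearest-point distances contribute, so the measure is driven by the boundary region between two clusters rather than by their interiors.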

Next, we define the intracluster compactness CP. Numerous measures estimate the intracluster compactness based on the average pairwise distance. Hence the compactness of Clusteri with ni points can be defined as

$$cp_i=\frac{2}{n_i(n_i-1)}\sum_{x_i,y_i\in Cluster_i}D_g(x_i,y_i) \tag{7}$$

The intracluster compactness of data set X is

$$CP=\frac{1}{K}\sum_{i=1}^{K}cp_i \tag{8}$$

where K is the cluster number.

The smaller the value of CP according to (7) and (8), the more compact the data set. We calculate the value of CP for the clustering results of Figure 3 with the above method. The value of CP for Figures 3(a) and 3(b) is 1.5835 and 1.8233, respectively, which indicates that the intracluster distance for Figure 3(a) is smaller than that of Figure 3(b). The value of cpi for the red cluster and the blue cluster in Figure 3(a) is 0.3997 and 2.7673, respectively, and the value of cpi for the red cluster and the blue cluster in Figure 3(b) is 1.9575 and 1.6892, respectively. For Figure 3(a), the value of cp of the blue cluster is much greater than that of the red cluster, yet the unweighted average CP is still smaller than the corresponding result of Figure 3(b), even though Figure 3(a) is the undesirable clustering result. In conclusion, the previous method has its limitations.
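As a quick check of these figures, substituting the reported cpi values into (8) with K=2 gives

$$CP_{3(a)}=\frac{0.3997+2.7673}{2}=1.5835,\qquad CP_{3(b)}=\frac{1.9575+1.6892}{2}=1.82335\approx 1.8233,$$

where the last digit of the second value depends on the rounding of the reported inputs. The very compact red cluster in Figure 3(a) pulls the unweighted mean below that of Figure 3(b), although the blue cluster in Figure 3(a) is the least compact of the four.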

This paper redefines the intracluster distance based on the largest pairwise geodesic distances between the points in the cluster; that is, the average of the largest pairwise geodesic distances is taken as the intracluster distance. For the intracluster compactness of the data set, we assign a weight to each intracluster distance before summing them up to avoid the aforementioned wrong result. The detailed algorithm is shown in Algorithm 3.

Algorithm 3: Pseudocode of intracluster compactness.

(1) Input: Data set $X=\{x_i \mid x_i\in\mathbb{R}^P, i=1,2,\ldots,n\}$, the total number of points n, the number of clusters K, the clustering result $\{Cluster_1,Cluster_2,\ldots,Cluster_K\}$, the K-1 inconsistent edges $E_{inc}=\{e_1,e_2,\ldots,e_{K-1}\}$, the geodesic distance $D_g(x_i,x_j)$ between points xi and xj

(2) Output: Intracluster compactness CP

(3) Begin

(4) For each Clusteri, $i\in\{1,2,\ldots,K\}$, sort the pairwise geodesic distances of all of its points

(5) Extract the top 20% maximum geodesic distances $\{Dg_{i1},Dg_{i2},\ldots,Dg_{i\omega_i}\}$ (here, suppose there are a total of $\omega_i$ maximum geodesic distances)

(6) Calculate the average of the $\omega_i$ maximum geodesic distances

(7) Take this average as the intracluster distance of Clusteri, $cp_i=(Dg_{i1}+Dg_{i2}+\cdots+Dg_{i\omega_i})/\omega_i$

(8) Calculate $\Omega=\omega_1+\omega_2+\cdots+\omega_K$

(9) Calculate the intracluster compactness of data set X, $CP=\sum_{i=1}^{K}(\omega_i/\Omega)\,cp_i$ ($\omega_i/\Omega$ is the weight of Clusteri)

(10) End
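A minimal sketch of Algorithm 3 follows, under the same assumptions as before: clusters are lists of vertex indices, `dg` is the all-pairs geodesic distance matrix, and the 20% count is rounded up with at least one distance (the rounding rule is an assumption). The function name `compactness` is hypothetical.

```python
import math
from itertools import combinations


def compactness(clusters, dg, frac=0.2):
    """Return the weighted intracluster compactness CP of Step (9)."""
    cps, omegas = [], []
    for members in clusters:
        # Step (4): pairwise geodesic distances inside the cluster, largest first.
        dists = sorted((dg[a][b] for a, b in combinations(members, 2)), reverse=True)
        # Step (5): number of distances kept (assumed: rounded up, at least one).
        omega = max(1, math.ceil(frac * len(dists)))
        # Steps (6)-(7): cp_i is the mean of the kept distances
        # (0.0 for a single-point cluster, which has no pairs).
        cps.append(sum(dists[:omega]) / omega)
        omegas.append(omega)
    total = sum(omegas)
    # Steps (8)-(9): weight each cp_i by omega_i / Omega.
    return sum(w / total * cp for w, cp in zip(omegas, cps))
```

Because the number of pairs grows quadratically with cluster size, the weights favor larger clusters, so a small and very tight cluster can no longer mask a large and loose one.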

We propose the internal clustering validation index ICV by combining intercluster separation Sep and intracluster compactness CP:

$$ICV=\frac{Sep}{CP} \tag{9}$$

In (9), the greater the value of Sep and the smaller the value of CP, the greater the value of ICV, which indicates a better clustering result. Hence, the clustering result corresponding to the greater value of ICV is taken as the optimal result.
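The selection rule can be written down in a few lines. In this sketch each candidate is a hypothetical tuple `(name, sep, cp)` holding the separation and compactness already computed for one clustering result; the names and the numbers in the usage note are illustrative only and are not taken from the experiments.

```python
def icv(sep, cp):
    """Internal clustering validation index of (9)."""
    return sep / cp


def select_best(candidates):
    """Return the candidate (name, sep, cp) with the greatest ICV."""
    return max(candidates, key=lambda c: icv(c[1], c[2]))
```

For instance, `select_best([("local", 1.0, 2.0), ("global", 3.0, 2.5)])` returns the `"global"` entry, because 3.0/2.5 exceeds 1.0/2.0.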

6. Complexity Analysis

The flowchart of cciMST is illustrated in Figure 4. The computational complexity of cciMST is analyzed as follows.

Figure 4: Flowchart of cciMST.

Firstly, we do the initialization work. We construct the MST for data set X with K clusters and n data points by using the Prim algorithm, which requires $O(n^2)$ calculations. The calculations of all pairwise Euclidean distances and geodesic distances of data points require $O(n^2)$ and $O(n)$, respectively.

Next, we determine the cluster centers. The time complexity of calculating the densities ρi and distances δi of all data points is $O(n^2)$. It is required to sort all pairwise geodesic distances in ascending order to obtain the variance σ used in (4) (see Algorithm 1), which takes $O(n\log n)$ time. The time for the selection of K data points with larger values of ρi and δi as cluster centers can be ignored because $K\ll n$.

Then, we determine the inconsistent edges. It takes $2O(K^2)$ to construct the MST Tc for the two groups of K cluster centers and determine the edge with the largest value of ςij.

Finally, we select the optimal clustering result with the internal validation measure. It takes $O(n^2)$ calculations to compute Sep, and likewise to compute CP. The value of ICV needs to be calculated for both clustering results, and hence the time complexity is $2O(n^2)$.

Therefore, the whole time complexity of the proposed algorithm is $7O(n^2)+O(n)+O(n\log n)+2O(K^2)$, which is dominated by the $O(n^2)$ term since $K\ll n$.

7. Experimental Results

7.1. Experimental Setup

We evaluated cciMST on four synthetic data sets DS1-DS4, six real data sets, and seven images. The four synthetic data sets are taken from the literature [15, 17, 19]; see Figure 5. The six real data sets are taken from the UCI data sets, including Iris, Wine, Zoo, Soybean, Liver-disorders, and Pendigits. The seven images are taken from the Berkeley image segmentation data set. The descriptions of the four synthetic data sets and the six real data sets are shown in Table 1. The experiments were conducted with MATLAB 2016a, which offers convenient built-in functions. CciMST is compared to the following five clustering algorithms:

k-means.

Single linkage.

Spectral clustering.

Density peaks clustering (DPC).

Split-and-merge clustering (SAM).

Table 1: Description of the four synthetic data sets and the six real data sets.

| Data set | Number of instances | Number of attributes | Number of classes |
|---|---|---|---|
| DS1 | 512 | 2 | 4 |
| DS2 | 299 | 2 | 3 |
| DS3 | 1502 | 2 | 2 |
| DS4 | 788 | 2 | 7 |
| Iris | 150 | 4 | 3 |
| Wine | 178 | 13 | 3 |
| Zoo | 101 | 16 | 7 |
| Soybean | 47 | 35 | 4 |
| Liver-disorders | 145 | 5 | 2 |
| Pendigits | 3498 | 16 | 10 |

Figure 5: Four synthetic data sets.

In the above five algorithms, k-means is one of the partitional clustering algorithms and single linkage is one of the hierarchical clustering algorithms. Both of them are traditional clustering algorithms. Spectral clustering is one of the graph-based clustering algorithms. DPC is a clustering algorithm by fast search and find of density peaks. SAM is a split-and-merge hierarchical clustering method based on MST. For k-means and spectral clustering, we take the best clustering result out of 1000 trial runs in terms of the external clustering validity index. The parameter of σ in spectral clustering is set as 0.

To evaluate the goodness of clustering results, we exploit four external clustering validation indices (CVI): accuracy (AC), precision (PR), recall (RE), and F1-measure (F1). The larger the values of AC, PR, RE, and F1, the better the clustering solution. Suppose that a data set contains K classes denoted by $C_1,C_2,\ldots,C_K$. Let pi denote the number of points that are correctly assigned to class Ci. Let qi denote the number of points that are incorrectly assigned to class Ci. Let ri denote the number of points that are incorrectly rejected from class Ci. AC, PR, RE, and F1 are defined as follows:

$$AC=\frac{\sum_{i=1}^{K}p_i}{|D|} \tag{10}$$

$$PR=\frac{\sum_{i=1}^{K}p_i/(p_i+q_i)}{K} \tag{11}$$

$$RE=\frac{\sum_{i=1}^{K}p_i/(p_i+r_i)}{K} \tag{12}$$

$$F1=\frac{2\times PR\times RE}{PR+RE} \tag{13}$$

where |D| is the total number of points in the data set.
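The four indices can be computed from two label lists, as in the sketch below. The function name `external_indices` is hypothetical, and the code assumes that the predicted cluster labels have already been matched to the class labels; how that matching is done is not covered here. Classes with an empty denominator contribute 0 to the corresponding sum, which is also an assumption.

```python
def external_indices(y_true, y_pred, classes):
    """Return (AC, PR, RE, F1) as in (10)-(13) for aligned label lists."""
    p = {c: 0 for c in classes}  # correctly assigned to class c
    q = {c: 0 for c in classes}  # incorrectly assigned to class c
    r = {c: 0 for c in classes}  # incorrectly rejected from class c
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            p[truth] += 1
        else:
            q[pred] += 1
            r[truth] += 1
    k = len(classes)
    ac = sum(p.values()) / len(y_true)
    pr = sum(p[c] / (p[c] + q[c]) for c in classes if p[c] + q[c]) / k
    re = sum(p[c] / (p[c] + r[c]) for c in classes if p[c] + r[c]) / k
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    return ac, pr, re, f1
```

Note that PR and RE are averaged over classes, so a small class weighs as much as a large one, whereas AC is dominated by the large classes.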

7.2. Experimental Results on the Synthetic Data Sets

DS1. This data set contains four parallel clusters with different densities. The clustering results are illustrated in Figure 6. Single linkage, SAM, and cciMST can identify the proper clusters. k-means can discover the sphere-shaped clusters properly, whereas it produces unsatisfactory partitions for the non-sphere-shaped clusters. For the spectral clustering algorithm, the similarity matrix is constructed by a Gaussian kernel function with Euclidean distance. However, its clustering result is similar to that of k-means. DPC determines the cluster centers through the decision graph constructed by ρi and δi. Wrong cluster centers will lead to the incorrect clustering result.

Figure 6: Clustering results on DS1.
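As a reminder of what a Gaussian-kernel similarity matrix with Euclidean distance looks like, the following Python sketch builds one for a small point set. The exact kernel form used in the experiments is not stated here, so the common form $w_{ij} = \exp(-\lVert x_i - x_j \rVert^2 / (2\sigma^2))$ with a zero diagonal is an assumption, as is the function name `gaussian_similarity`.

```python
import math
from typing import List, Sequence


def gaussian_similarity(points: Sequence[Sequence[float]],
                        sigma: float) -> List[List[float]]:
    """Return the symmetric Gaussian-kernel similarity matrix of the points.

    Assumed form: w_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)), w_ii = 0.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Squared Euclidean distance between points i and j.
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            w[i][j] = w[j][i] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return w
```

Because the similarity depends only on the straight-line distance between two points, points that are close in Euclidean space but lie on different elongated clusters still receive a high similarity, which is one plausible reason for a result that resembles that of k-means.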

DS2. This data set is composed of one Gaussian distributed cluster and two ring clusters surrounding it. Figure 7 illustrates the clustering results. k-means, spectral clustering, and DPC cannot provide proper clustering results. Single linkage, SAM, and cciMST can identify the three clusters properly.

Figure 7: Clustering results on DS2.

DS3. This data set contains two clusters shaped like crescent moons. The clustering results are illustrated in Figure 8. k-means, DPC, and SAM produce unsatisfactory partitions. In the clustering process of SAM, the data points in the subsets produced by k-means are reallocated to the main tree. As shown in Figure 9, one data point from each of clusters C2 and C3 is redistributed into cluster C1, which leads to the improper clustering result. Single linkage, spectral clustering, and cciMST can identify the two clusters properly.

Figure 8: Clustering results on DS3.

Figure 9: Clustering result with SAM.

DS4. This data set contains seven Gaussian distributed clusters. Figure 10 illustrates the clustering results. Except for k-means and single linkage, all of the clustering algorithms can identify the clusters properly.

Figure 10: Clustering results on DS4.

7.3. Experimental Results on the Real Data Sets

In Tables 2 to 7, the optimal result for each index is denoted in bold. For the Iris data set, Table 2 indicates that cciMST has the best performance and that the performance of SAM is slightly weaker than that of cciMST. For the Wine data set, the clustering performances are shown in Table 3. Except for the PR index, the AC, RE, and F1 values of cciMST are higher than those of the other five methods. Moreover, the clustering performances of spectral clustering, DPC, and SAM are better than those of k-means and single linkage. For the Zoo data set, it can be seen from Table 4 that the performances of cciMST, SAM, and k-means are better than those of the other three methods. For the Soybean data set, Table 5 indicates that cciMST and DPC outperform the others. It can be seen from Table 6 that spectral clustering outperforms the other methods on the Liver-disorders data set, and the performance of cciMST is slightly lower than that of spectral clustering. For the Pendigits data set, Table 7 indicates that cciMST outperforms the other methods.

Table 2: Clustering performances on Iris.

| Index | k-means | Single linkage | Spectral clustering | DPC | SAM | cciMST |
|---|---|---|---|---|---|---|
| AC | 0.8572 | 0.6800 | 0.8895 | 0.90667 | 0.9533 | **0.9600** |
| PR | 0.8572 | 0.6800 | 0.8895 | 0.90667 | 0.9533 | **0.9600** |
| RE | 0.8688 | 0.8367 | 0.8978 | 0.92708 | 0.9562 | **0.9619** |
| F1 | 0.8630 | 0.7502 | 0.8936 | 0.91676 | 0.9548 | **0.9609** |

Table 3: Clustering performances on Wine.

| Index | k-means | Single linkage | Spectral clustering | DPC | SAM | cciMST |
|---|---|---|---|---|---|---|
| AC | 0.6525 | 0.4269 | 0.7078 | 0.7079 | 0.6236 | **0.7135** |
| PR | 0.6321 | 0.3615 | 0.7029 | 0.7030 | **0.7944** | 0.7212 |
| RE | 0.6826 | 0.4709 | 0.7300 | 0.7258 | 0.6730 | **0.7470** |
| F1 | 0.6559 | 0.4090 | 0.7162 | 0.7142 | 0.7287 | **0.7338** |

Table 4: Clustering performances on Zoo.

| Index | k-means | Single linkage | Spectral clustering | DPC | SAM | cciMST |
|---|---|---|---|---|---|---|
| AC | 0.7304 | 0.6733 | 0.5087 | 0.5644 | 0.6634 | **0.8218** |
| PR | **0.6507** | 0.4508 | 0.3941 | 0.5094 | 0.5130 | 0.6397 |
| RE | 0.6151 | 0.4738 | 0.4196 | 0.3941 | **0.6481** | 0.5875 |
| F1 | **0.6310** | 0.4620 | 0.4064 | 0.4444 | 0.5727 | 0.6125 |

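Table 5: Clustering performances on Soybean.

| Index | k-means | Single linkage | Spectral clustering | DPC | SAM | cciMST |
|---|---|---|---|---|---|---|
| AC | 0.7525 | 0.8085 | 0.7162 | **0.8936** | 0.7872 | **0.8936** |
| PR | 0.7599 | 0.775 | 0.6681 | **0.9162** | 0.8438 | 0.8750 |
| RE | 0.7647 | 0.9135 | 0.6407 | 0.9052 | 0.8015 | **0.9297** |
| F1 | 0.7621 | 0.8386 | 0.6536 | **0.9107** | 0.8221 | 0.9015 |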

Table 6: Clustering performances on Liver-disorders.

| Index | k-means | Single linkage | Spectral clustering | DPC | SAM | cciMST |
|---|---|---|---|---|---|---|
| AC | 0.6964 | 0.6345 | **0.7117** | 0.5310 | 0.6069 | 0.6966 |
| PR | 0.6105 | 0.5182 | **0.6634** | 0.5869 | 0.5788 | 0.6106 |
| RE | 0.7514 | **0.8147** | 0.7006 | 0.5994 | 0.5737 | 0.7516 |
| F1 | 0.6737 | 0.6335 | **0.6814** | 0.5931 | 0.5753 | 0.6738 |

Table 7: Clustering performances on Pendigits.

| Index | k-means | Single linkage | Spectral clustering | DPC | SAM | cciMST |
|---|---|---|---|---|---|---|
| AC | 0.6631 | 0.1124 | 0.6800 | 0.7064 | 0.6635 | **0.8385** |
| PR | 0.6632 | 0.1086 | 0.6798 | 0.7037 | 0.7353 | **0.8390** |
| RE | 0.6760 | 0.6105 | 0.6855 | 0.6723 | 0.6613 | **0.8607** |
| F1 | 0.6694 | 0.1844 | 0.6826 | 0.6876 | 0.6963 | **0.8497** |
7.4. Image Segmentation Results

To further evaluate the clustering performance of cciMST on real data, we perform image segmentation experiments on the Berkeley Segmentation Data Set 300 (BSDS300). BSDS300 consists of 200 training and 100 testing natural images of size 481 × 321. As shown in Figure 11, seven images are extracted from BSDS300. The first image shows peppers of various colors, broccoli, and wooden frames. The second image shows a deer, grass, and trees. The third image contains the sky, houses, and grass. The fourth image is composed of flowerbeds and concrete. The fifth image contains bears and sea. The sixth image is composed of sky, mountains, and trees. The seventh image shows two horses and grass with different colors.

Figure 11: Image segmentation results. The first column displays the original images, and the second to seventh columns display the segmentation results of k-means, single linkage, spectral clustering, DPC, SAM, and cciMST, respectively.

First, each of the seven images is segmented into 250 superpixels by simple linear iterative clustering (SLIC). Then, the image is transformed from RGB to Lab color space. For each superpixel, we compute the normalized 4-bin histogram of each Lab color channel. Next, we concatenate the three histogram vectors and take the result as one data point in the data set. Hence, an image has 250 data points described by 12 attributes. The 250 data points of each image are clustered using the six methods: k-means, single linkage, spectral clustering, DPC, SAM, and cciMST. The clustering results are shown in Figure 11. For the first image, DPC, SAM, and cciMST can properly detect the peppers, broccoli, and wooden frames, while single linkage cannot properly detect the wooden frames. For the second image, the segmentation results of cciMST are the best. For the third image, cciMST can segment the houses, sky, and grass satisfactorily. The segmentation performance of SAM is slightly lower than that of cciMST, and SAM is unable to properly separate the houses from the grass. For the fourth image, the segmentation results of SAM and cciMST are consistent with human visual perception. For the fifth image, DPC and cciMST properly segment the bear's body but improperly segment its legs. For the sixth image, DPC and cciMST can properly detect the sky, mountains, and trees. k-means, DPC, and cciMST can properly segment the horses in the seventh image.
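The construction of the 12-attribute data point for one superpixel can be sketched as follows. This Python fragment is a hedged illustration rather than the released Matlab code: it takes the Lab pixel values of a single superpixel as input (the SLIC step and the RGB-to-Lab conversion are omitted), and the channel ranges in `LAB_RANGES`, the equal-width binning, and the function names are assumptions of this sketch.

```python
from typing import List, Sequence, Tuple

# Assumed value ranges for the L, a, and b channels; converters may differ.
LAB_RANGES: Tuple[Tuple[float, float], ...] = (
    (0.0, 100.0), (-128.0, 127.0), (-128.0, 127.0))


def channel_histogram(values: Sequence[float], lo: float, hi: float,
                      bins: int = 4) -> List[float]:
    """Normalized equal-width histogram of one color channel."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp so that out-of-range values fall into the edge bins.
        idx = min(max(int((v - lo) / width), 0), bins - 1)
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts] if total else [0.0] * bins


def superpixel_feature(pixels: Sequence[Tuple[float, float, float]],
                       bins: int = 4) -> List[float]:
    """Concatenate the per-channel histograms of one superpixel."""
    feature: List[float] = []
    for ch, (lo, hi) in enumerate(LAB_RANGES):
        channel_values = [p[ch] for p in pixels]
        feature.extend(channel_histogram(channel_values, lo, hi, bins))
    return feature  # 3 channels x 4 bins = 12 attributes
```

Applying `superpixel_feature` to each of the 250 superpixels of an image yields the 250 × 12 data set that is passed to the six clustering methods.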

8. Conclusions

Our MST-based clustering method identifies the inconsistent edges through the cluster centers. We exploit the geodesic distance between two vertices in the MST as the distance measure. We also introduce the concepts of global and local density of vertices. In addition, we propose a novel internal clustering validation index to select the optimal clustering result. The experimental results on synthetic data sets, real data sets, and image data illustrate that the proposed clustering method achieves better overall performance than the compared methods. Our future goal is to further improve the computational efficiency of the method.

Data Availability

The Matlab code used in this paper is available at https://github.com/Magiccbo/CciMST.git.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors are grateful for the support of the National Natural Science Foundation of China (61373004, 61501297).
