Constrained spectral clustering (CSC) can greatly improve clustering accuracy by incorporating constraint information into spectral clustering, and it has therefore received wide academic attention. In this paper, we propose a fast CSC algorithm that encodes landmark-based graph construction into a new CSC model and applies random sampling to decrease the data size after spectral embedding. Compared with the original model, the new algorithm yields similar results asymptotically as its model size increases; compared with the most efficient CSC algorithm known, it runs faster and suits a wider range of data sets. We also propose a scalable semisupervised cluster ensemble algorithm that combines our fast CSC algorithm with random-projection-based dimensionality reduction in the process of spectral ensemble clustering. We demonstrate through theoretical analysis and empirical results that the new cluster ensemble algorithm has advantages in both efficiency and effectiveness. Furthermore, the approximate preservation of clustering accuracy under random projection, proved for the consensus clustering stage, also holds for weighted k-means clustering in general and thus gives a theoretical guarantee for this special kind of k-means clustering in which each point has a corresponding weight.
Funding: National Natural Science Foundation of China (61502527, 61379150); Beijing University of Posts and Telecommunications (SKLNST-2013-1-06).

1. Introduction
With the arrival of the big data era, data has become an important asset, and analysing large scale data efficiently has become a major challenge [1, 2]. As an underlying method for data analysis, clustering partitions a data set into several subsets according to the similarities of points [3], and it has become a basic tool for image analysis [4, 5], community detection [6, 7], disease diagnosis [8], and so on. Therefore, more and more attention has been paid to the design of efficient and effective clustering algorithms.
Constrained clustering improves the accuracy of the clustering result by encoding constraint information into unsupervised clustering, and many constrained clustering algorithms [9–17] have been proposed. Since spectral clustering often attains high clustering accuracy and suits a wide range of geometries [18, 19], constrained spectral clustering (CSC) [11–17] usually performs better than other constrained clustering algorithms. However, the $O(n^2)$ space complexity and $O(n^3)$ time complexity of many CSC algorithms [11–15] restrict their application to large scale data sets, where $n$ is the number of data points. The most efficient CSC algorithm known is the SCACS algorithm [16], which reduces the space and time complexities to linear in $n$ by incorporating landmark-based graph construction [20, 21] into the constrained normalized cuts problem [15]. Note that the constrained normalized cuts problem [15] forces SCACS to solve the generalized eigenvector problem twice. In 2016, Cucuringu et al. [17] proposed a new CSC algorithm that is empirically more accurate and faster than the constrained normalized cuts approach; using a new encoding of the constraint information, their model requires only one eigenvector computation.
By integrating many basic partitions into a unified partition, ensemble clustering enjoys many excellent properties such as improved clustering quality, robustness and stability of clustering results, the handling of noise, the reuse of knowledge [3], and suitability for multisource and heterogeneous data [22]. Researchers have proposed many ensemble clustering algorithms [22–29]. Since notation differs across the literature, in the following we refer to the integration of basic partitions as ensemble clustering or consensus clustering, and to the union of the basic clustering and ensemble clustering stages as a cluster ensemble. Among the different ensemble clustering methods, the one based on the coassociation matrix has become a landmark [22]. Specifically, the coassociation matrix represents the pairwise similarities of points derived from the basic partitions, and the final partition is computed via a graph partitioning method on this matrix. This kind of method therefore suffers from high space and time complexity. Recently, Liu et al. [22] equivalently transformed spectral clustering on the coassociation matrix into weighted k-means clustering over a specific binary matrix, which decreases the space and time complexities vastly. However, when the number of basic partitions or clusters is large, the corresponding binary matrix is high dimensional.
In the seminal work, Johnson and Lindenstrauss [30] pointed out that the random projection produced by a random orthogonal matrix approximately preserves the pairwise distances of a data set with reduced dimensions. Subsequently, many works constructed further matrices with this property: the random Gaussian matrix [31], the random sign matrix [32], random matrices based on the randomized Hadamard transform [33], random matrices based on block random hashing [34], and so on. In addition, dimensionality reduction with random projection has been widely applied to data mining methods such as classification [35], clustering [36–38], and anomaly detection [39]. In terms of the objective function, several works [36–38] prove that random projection approximately maintains the accuracy of k-means clustering. Since its objective function differs from that of k-means clustering, theoretical analysis of the influence of random projection on weighted k-means clustering is still scarce.
Our Contribution. In this paper, our contributions can be divided into three parts: the first is a fast CSC algorithm suitable for a wide range of data sets; the second is an analysis of the effect of random projection on spectral ensemble clustering; the third is a scalable semisupervised cluster ensemble algorithm. More specifically, the contributions are as follows:
We propose a fast CSC algorithm whose space and time complexities are linear in the size of the data set: we compress the size of the original model proposed by Cucuringu et al. [17] by encoding landmark-based graph construction into it, and we improve efficiency further via random sampling in the k-means clustering step. Besides, we prove that the new CSC algorithm asymptotically attains a clustering result comparable to that of the original model. Experimental results show that the new algorithm not only utilizes the constraint information effectively, but also costs less running time and fits a wider range of data sets than the state-of-the-art SCACS method.
With respect to the difference of objective function caused by random projection, we give a detailed proof that random projection can keep the clustering quality of spectral ensemble clustering within a small factor. Based on this theoretical analysis, we design a spectral ensemble clustering algorithm with reduced dimensions caused by sparse random projection. Experiments over different data sets also verify the correctness of our theoretical results. Moreover, since the theoretical analysis is also suitable for the ordinary weighted k-means clustering, the influence of random projection on weighted k-means clustering is also obtained.
We propose a scalable semisupervised cluster ensemble algorithm through the combination of the fast CSC algorithm and spectral ensemble clustering algorithm with random projection. The efficiency and effectiveness of the new cluster ensemble algorithm are also demonstrated theoretically and empirically.
The remainder of our paper is organized as follows. In Section 2, we introduce the CSC model of Cucuringu et al. [17], landmark-based graph construction, and two related components in our cluster ensemble algorithm: spectral ensemble clustering and random projection. In Section 3, we present our fast CSC algorithm and give its asymptotic property. Then, the algorithm formulation and theoretical analysis of spectral ensemble clustering with random projection are displayed in Section 4. In Section 5, we show the experiment results of our algorithms. Finally, we draw the conclusions of the article and put forward the future directions in Section 6.
2. Preliminaries
In this section, we present the CSC algorithm proposed by Cucuringu et al. [17] and introduce landmark-based graph construction [20, 21] which will be applied to our fast CSC algorithm. In addition, we also introduce spectral ensemble clustering algorithm [22] and sparse random projection [34] which can be used to speed up the spectral ensemble clustering.
2.1. Constrained Spectral Clustering
Here, we first introduce the notion of undirected graph which is very important in constrained spectral clustering and then show the CSC model proposed by Cucuringu et al. [17].
Let $G=(V,E,W)$ be an undirected graph, where $V=\{v_1,v_2,\dots,v_n\}$ is the vertex set, $E$ is the edge set, and $W$ is the weight set over the edges. Specifically, $w_{ij}=w_{ji}$ is the nonnegative weight of the edge between vertices $v_i$ and $v_j$, indicating the level of "affinity" between them; if $w_{ij}=0$, there is no edge between $v_i$ and $v_j$. We denote by $L_G = D - W$ the Laplacian matrix of $G$, where $D$ is the diagonal matrix with diagonal entries $D(i,i)=\sum_{j\neq i} w_{ij}$ and $W$ is the adjacency matrix with $W(i,j)=W(j,i)=w_{ij}$.
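As a small concrete illustration (ours, not from the paper), the Laplacian $L_G = D - W$ of a toy weighted graph can be assembled directly from this definition:

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W for a symmetric,
    nonnegative adjacency matrix W (zero diagonal assumed)."""
    D = np.diag(W.sum(axis=1))  # D(i,i) = sum_{j != i} w_ij
    return D - W

# Toy 3-vertex graph: edge weights w_12 = 1, w_13 = 2, no edge (2,3)
W = np.array([[0., 1., 2.],
              [1., 0., 0.],
              [2., 0., 0.]])
L = graph_laplacian(W)
# L is symmetric and each of its rows sums to zero
```

The zero row sums are the defining property of a graph Laplacian and a quick sanity check for any implementation.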
The constrained spectral clustering has three undirected graphs: one data graph GD and two knowledge graphs GML and GCL. In data graph GD=V,ED,WD, each weight indicates the similarity level of vertices in the corresponding edge. The “must link” (ML) graph GML=(V,EML,WML) gives the “must link” information of vertices: each edge in GML indicates that the corresponding vertices should be in the same group and the level of “must link” belief is described by the weight. The “cannot-link” (CL) graph GCL=(V,ECL,WCL) has analogous components to GML. The values of weights in the two knowledge graphs are both nonnegative and set according to the constraint information such as prior knowledge. For example, assuming that the range of value of weight is set from 0 to 1, if we have known that points v1, v2 are in the same group, their corresponding weight wML,12=1. If we only have 40% confidence in the constraint information that the two points are in the same group, the weight wML,12=0.4, and if we have no constraint information about these two points, wML,12=wCL,12=0.
Viewing pairwise similarities of vertices as implicit ML constraint declarations, Cucuringu et al. [17] defined a generalized ML graph $\tilde G_D^\alpha = (V, E_D \cup E_{ML}, W_D + \alpha \cdot W_{ML})$, where $\alpha$ is the level of trust in the ML constraints. Let $k$ be the number of clusters and $x_{C_i}$ the indicator vector of cluster $C_i$, such that $x_{C_i}(j) = 1$ if the $j$th data point belongs to cluster $C_i$ and $x_{C_i}(j) = 0$ otherwise. In order to violate as few ML constraints and meet as many CL constraints as possible, the constrained k-way cuts problem [17] can be described as

(1) $\operatorname{argmin}_{x_{C_1},\dots,x_{C_k}} \max_{x \in \{x_{C_1},\dots,x_{C_k}\}} \dfrac{x^T L_{\tilde G_D} x}{x^T L_{G_{CL}} x}$  s.t. $\sum_{i=1}^k x_{C_i} = \mathbf{1}_n$, $x_{C_i} \in \{0,1\}^n$.
To solve problem (1) approximately, Cucuringu et al. [17] relaxed the condition $x_{C_i} \in \{0,1\}^n, \sum_{i=1}^k x_{C_i} = \mathbf{1}_n$ to real vectors. Thus, the solution vectors of the relaxed problem are the first $k$ nontrivial generalized eigenvectors of the problem

(2) $L_{\tilde G_D} x = \lambda L_{G_{CL}} x$.

After obtaining the generalized eigenvectors, an additional embedding phase maps the row vectors of the eigenvector matrix onto the k-dimensional sphere and provides theoretical guarantees on the clustering result; the detailed embedding procedure can be found in [17]. However, the construction and storage costs of the data graph for large scale data sets are both huge ($O(n^2)$). Moreover, if the number of iterations of k-means clustering on the embedded eigenvector matrix is large, this step is also time-consuming on large scale data sets.
2.2. Landmark-Based Graph Construction
Based on sparse coding theory [40], the landmark-based graph construction [20, 21] scales linearly with the number of data points and can suit large scale data sets very well.
Let the data set be $A \in \mathbb{R}^{n\times d}$ with row vectors $a_i$ as data points; the sparse coding problem is defined as

(3) $\min_{U,Z} \|A^T - UZ\|^2$  s.t. $Z$ is sparse,

where each column of $U \in \mathbb{R}^{d\times p}$ is a basis vector, the columns of $Z \in \mathbb{R}^{p\times n}$ are the representations of the data points over $U$, and $p$ is the number of basis vectors. To avoid the high time complexity of solving the sparse coding problem, landmark-based graph construction simply samples points randomly from the input data $A$ as basis vectors. In the computation of $Z$, if $u_j$ is among the $r$ nearest basis vectors of data point $a_i$, then

(4) $Z(j,i) = \dfrac{K_\sigma(a_i,u_j)}{\sum_{j'\in U_{i,r}} K_\sigma(a_i,u_{j'})}$,

where $U_{i,r}$ is the index set of the $r$ nearest basis vectors of $a_i$ and $K_\sigma(\cdot)$ is the Gaussian kernel with bandwidth $\sigma$; otherwise $Z(j,i)=0$.
After obtaining the sparse representation $Z \in \mathbb{R}^{p\times n}$, the graph affinity matrix is constructed as

(5) $W = \hat Z^T \hat Z$,

where $\hat Z = D^{-1/2} Z$ and $D$ is a diagonal matrix with diagonal entries $D(i,i) = \sum_j Z(i,j)$. Since Chen and Cai [20, 21] pointed out that $W$ is automatically normalized, the normalized graph Laplacian matrix for $A$ is $I - \hat Z^T \hat Z$. Considering $p \ll n$, the $O(npd)$ time of computing $\hat Z$ is much less than the $O(n^2 d)$ time of nearest-neighbor graph construction.
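The construction of (3)–(5) can be sketched in a few lines of numpy (our own illustration; the landmark count $p$, neighborhood size $r$, and bandwidth $\sigma$ are illustrative parameters). Note that the row sums of the resulting $W$ equal 1, reflecting the automatic normalization noted above:

```python
import numpy as np

def landmark_representation(A, p=10, r=3, sigma=1.0, seed=0):
    """Landmark-based sparse representation Z_hat (p x n), eqs. (4)-(5).

    A: (n, d) data matrix; p landmarks are sampled uniformly at random,
    each point is coded over its r nearest landmarks with Gaussian
    kernel weights, and Z is scaled by D^{-1/2} (D = row sums of Z)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    U = A[rng.choice(n, size=p, replace=False)]              # landmarks as basis
    dist2 = ((A[:, None, :] - U[None, :, :]) ** 2).sum(-1)   # (n, p) squared distances
    Z = np.zeros((p, n))
    for i in range(n):
        nearest = np.argsort(dist2[i])[:r]                   # r nearest landmarks
        w = np.exp(-dist2[i, nearest] / (2 * sigma ** 2))
        Z[nearest, i] = w / w.sum()                          # eq. (4): columns sum to 1
    D = Z.sum(axis=1)                                        # row sums of Z
    return Z / np.sqrt(D)[:, None]                           # Z_hat = D^{-1/2} Z

A = np.random.default_rng(1).normal(size=(100, 5))
Z_hat = landmark_representation(A)
W = Z_hat.T @ Z_hat   # approximate affinity matrix, eq. (5)
```

Because the landmarks are themselves data points, every landmark appears among its own nearest neighbors, so no row of $Z$ is empty and the normalization is well defined.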
2.3. Spectral Ensemble Clustering
To obtain a unified result from different basic partitions, spectral ensemble clustering applies spectral clustering to the coassociation matrix [24] derived from the basic partitions. In 2015, Liu et al. [22] transformed spectral ensemble clustering into weighted k-means clustering over a specific binary matrix. This transformation decreases the time and space complexities effectively, and our new ensemble clustering method is based on it.
Given $g$ basic clustering results $\Pi = \{\pi_1,\pi_2,\dots,\pi_g\}$ of data set $A \in \mathbb{R}^{n\times d}$, the coassociation matrix $C$ is constructed as

(6) $C(j,k) = \sum_{i=1}^g \eta(\pi_i(a_j), \pi_i(a_k))$,

where $\pi_i(a_j)$ is the label of $a_j$ in the $i$th clustering result $\pi_i$, and

(7) $\eta(a,b) = \begin{cases} 1, & \text{if } a=b \\ 0, & \text{if } a\neq b. \end{cases}$
Viewing this coassociation matrix as an adjacency matrix, spectral ensemble clustering uses spectral clustering to obtain the final clustering result. In the transformation from spectral clustering to weighted k-means clustering, the binary matrix $B = [b(a)]$ [22] is built as

(8) $b(a) = [b(a)_1, \dots, b(a)_g]$,

where $b(a)_i = [b(a)_{i1}, \dots, b(a)_{ik_i}]$, $b(a)_{ij} = 1$ if $\pi_i(a) = j$, and $b(a)_{ij} = 0$ otherwise; "$[\;]$" indicates a row vector. The following lemma [22] presents the connection between spectral ensemble clustering and weighted k-means clustering.
Lemma 1 (see [22]).
Given a basic partition set $\Pi$, let the corresponding coassociation matrix be $C$, let $D_1$ be the diagonal matrix whose diagonal elements are the row sums of $C$, and let $w_{b(a)}$ be the set of diagonal elements of $D_1$. Then normalized cuts spectral clustering on the coassociation matrix $C$ has an objective function equivalent to weighted k-means clustering on the data set $\{b(a)/w_{b(a)}\}$ with weight set $\{w_{b(a)}\}$.
Through Lemma 1, the space and time complexities of spectral ensemble clustering can be decreased dramatically. However, when the number of basic partitions and cluster number are large, the binary matrix B will be a high dimensional data set, resulting in long running time for weighted k-means clustering.
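The binary matrix of (8) and its relation to the coassociation matrix of (6) can be illustrated with a small numpy sketch (our own illustration): since the inner product of $b(a_j)$ and $b(a_k)$ counts the partitions in which $a_j$ and $a_k$ share a label, $BB^T$ recovers $C$:

```python
import numpy as np

def binary_matrix(partitions):
    """Build the binary matrix B of eq. (8): one one-hot block per
    basic partition. partitions: list of g label arrays of length n,
    each with labels in {0, ..., k_i - 1}."""
    n = len(partitions[0])
    blocks = []
    for labels in partitions:
        k_i = labels.max() + 1
        block = np.zeros((n, k_i))
        block[np.arange(n), labels] = 1      # b(a)_{ij} = 1 iff pi_i(a) = j
        blocks.append(block)
    return np.hstack(blocks)                 # n x (sum_i k_i)

# Two basic partitions of 4 points
pi1 = np.array([0, 0, 1, 1])
pi2 = np.array([0, 1, 1, 2])
B = binary_matrix([pi1, pi2])
# Each row sums to g; B @ B.T equals the coassociation matrix C of eq. (6)
C = B @ B.T
```

This identity is exactly why weighted k-means on (scaled) rows of $B$ can stand in for spectral clustering on $C$ without ever materializing the $n \times n$ matrix.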
2.4. Random Projection
Recently, random projection has become a common technique of dimensionality reduction [36–39, 41]. Random projection often has low computing complexity and can preserve the structure of original data approximately. In this paper, we use the sparse random projection proposed by Kane and Nelson [34]. When most of the elements of data are zero, the sparse random projection can utilize the sparsity of data effectively and speed up the process of dimensionality reduction.
Lemma 2 (see [34]).
For any $0 < \delta, \varepsilon < 1/2$ and $d > 0$, there exists a $d \times av$ sparse random matrix $R$, where $a = \Theta(\varepsilon^{-1}\log(1/\delta))$ and $v = \Theta(\varepsilon^{-1})$, such that for any fixed $x \in \mathbb{R}^d$,

(9) $\Pr\!\left[(1-\varepsilon)\|x\|_2^2 \le \|R^T x\|_2^2 \le (1+\varepsilon)\|x\|_2^2\right] > 1 - \delta$.
The random matrix $R$ can be constructed as

(10) $R^T = \begin{bmatrix} \frac{1}{\sqrt{a}}\,\Phi_1 D_1 \\ \vdots \\ \frac{1}{\sqrt{a}}\,\Phi_a D_a \end{bmatrix}$,

where each $\Phi_l$ ($l \in [1,a]$) is a $v \times d$ sparse matrix with nonzero elements $\Phi_l(h(i), i) = 1$, $h : \{1,\dots,d\} \to \{1,\dots,v\}$ is a random hashing such that $\Pr[h(i) = j] = 1/v$ for $i \in \{1,\dots,d\}$, $j \in \{1,\dots,v\}$, and each $D_l$ is a $d \times d$ diagonal matrix with $\Pr[D_l(i,i) = \pm 1] = 0.5$.
The number of nonzero (nnz) elements of the sparse random matrix $R$ is $ad$, and the time complexity of computing $AR$ is $O(\mathrm{nnz}(A)\,a)$. Lemma 2 implies that sparse random projection approximately preserves the lengths of data points. Thus, for $n$ data points, since there are $n(n-1)/2$ pairwise distances, a union bound shows that the squared pairwise distances are preserved within a factor of $1 \pm \varepsilon$ with $a = \Theta(2\varepsilon^{-1}\log(n/\delta))$.
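The block-hashing construction of (10) is straightforward to implement; the following numpy sketch (our own illustration, with illustrative dimensions) builds $R$ and checks the norm preservation of Lemma 2 on one vector:

```python
import numpy as np

def sparse_random_matrix(d, a, v, seed=0):
    """Sparse random matrix R (d x a*v) of eq. (10): a stacked blocks,
    each block Phi_l D_l placing exactly one +-1/sqrt(a) entry per input
    coordinate i, at hashed position h(i) inside the block."""
    rng = np.random.default_rng(seed)
    R = np.zeros((d, a * v))
    for l in range(a):
        h = rng.integers(0, v, size=d)            # random hashing h(i)
        signs = rng.choice([-1.0, 1.0], size=d)   # diagonal D_l
        R[np.arange(d), l * v + h] = signs / np.sqrt(a)
    return R

d, a, v = 1000, 40, 20
R = sparse_random_matrix(d, a, v)                 # nnz(R) = a*d
x = np.random.default_rng(1).normal(size=d)
# ||R^T x||^2 should be close to ||x||^2 (Lemma 2)
ratio = np.dot(R.T @ x, R.T @ x) / np.dot(x, x)
```

Each block is an unbiased estimator of $\|x\|_2^2$, and averaging $a$ blocks concentrates the estimate, which is exactly what Lemma 2 formalizes.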
3. Fast Constrained Spectral Clustering Framework
In this section, we introduce our fast CSC framework for large scale data sets. Inspired by [20, 21], we also try to compute the sparse representation Z^ and obtain the approximate adjacency matrix W=Z^TZ^, where Z^∈Rp×n, and p≪n. Then, our fast framework decreases the size of graph Laplacian through the above approximate graph reconstruction. At last, we analyse the asymptotic property of our new CSC algorithm.
3.1. Framework Formulation
To obtain the generalized eigenvector $x$ approximately, we can let $x = \hat Z^T y$, where $\hat Z \in \mathbb{R}^{p\times n}$ is the sparse representation in (5) and $y \in \mathbb{R}^p$. Substituting this $x$ back into (1) clearly decreases the size of the problem when $p \ll n$.
Specifically, we use Q to denote constraint matrix, where Qi,j=1 if edge (vi,vj)∈EML, Qi,j=-1 if edge vi,vj∈ECL, and Qi,j=0 otherwise. Let adjacency matrix be computed approximately by W=Z^TZ^. Next, bring x=Z^Ty into (1) and relax their solution over real vectors. Thus, we reformulate the original problem as the following problem.
Problem 3.
One has

(11) $\operatorname{argmin}_{y_1,\dots,y_k} \max_{y \in \{y_1,\dots,y_k\}} \dfrac{y^T \hat Z L_{\tilde G_D} \hat Z^T y}{y^T \hat Z L_{G_{CL}} \hat Z^T y}$  s.t. $y_i \in \mathbb{R}^p$ for any $i \in [1,k]$.

For shorthand, we denote $\hat Z L_{\tilde G_D} \hat Z^T$ by $L_{CGD}$ and $\hat Z L_{G_{CL}} \hat Z^T$ by $L_{CCL}$. Thus, the first $k$ nontrivial generalized eigenvectors of the problem

(12) $L_{CGD}\, y = \lambda L_{CCL}\, y$

are the solution vectors of (11).
In order to speed up the k-means clustering on the embedded eigenvector matrix, we sample row vectors of eigenvectors matrix randomly and get k centers through k-means clustering over the selected row vectors. According to the distances between centers and row vectors, we can partition all the row vectors into different clusters. Cucuringu et al. [17] have pointed out that the specific embedding process after getting the generalized eigenvectors can concentrate the row vectors of eigenvector matrix onto the k-dimensional sphere and a simple partition algorithm such as k-means clustering can be applied to get the final clustering result. Since random sampling is a popular scalability method for k-means clustering [42], we will take it to improve the efficiency of the clustering on the row vectors of eigenvector matrix. The experimental results in Section 5 also show that random sampling has little influence on the clustering results and makes the algorithm more efficient than the original one.
Our fast CSC framework is shown in Algorithm 1. In our new algorithm, parameter α (in LG~D of Step (2)) stands for the trust level on constraint information. Since the α of the original problem (see (2)) has been taken to a constant in the previous work [17], we also set α as a constant.
Input: data set A∈Rn×d, the number of landmark points p, constraint matrix Q, cluster number k,
confidence parameter α, sample rate s;
Output: the grouping result.
(1) Compute the sparse representation Z^∈Rp×n in Equation (5);
(2) Compute Laplacian LCGD=Z^LG~DZ^T and LCCL=Z^LGCLZ^T, where LG~D
is the Laplacian matrix of G~D, LGCL is the Laplacian matrix of GCL;
(3) Solve the first k non-trivial generalized eigenvectors Y of Equation (12);
(4) Compute X=Z^TY;
(5) Embed X into a k-dimensional sphere X^ using the embedding process in [17];
(6) Sample n×s row vectors of X^ randomly and run k-means clustering on the sampled row vectors;
(7) Get the clustering result utilizing distances between centers of k-means clustering and row vectors of X^.
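To make the pipeline concrete, here is a compact Python sketch of Steps (2)–(7) (our own illustration, not the paper's implementation). It assumes $\hat Z$ has already been computed, replaces the full embedding of Step (5) from [17] with simple row normalization, and adds a small ridge term to $L_{CCL}$ (our own regularization) so that the symmetric generalized eigensolver is well posed:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def fast_csc(Z_hat, Q, k, alpha=1.0, s=0.5, seed=0):
    """Sketch of Steps (2)-(7) of Algorithm 1.

    Z_hat: (p, n) sparse representation; Q: (n, n) constraint matrix
    (+1 ML, -1 CL, 0 none); k: cluster number; alpha: trust level;
    s: sampling rate for the final k-means step."""
    p, n = Z_hat.shape
    W_ml = np.maximum(Q, 0)                    # must-link weights
    W_cl = np.maximum(-Q, 0)                   # cannot-link weights
    L_ml = np.diag(W_ml.sum(1)) - W_ml
    L_cl = np.diag(W_cl.sum(1)) - W_cl
    ZZt = Z_hat @ Z_hat.T
    # Step (2): compressed Laplacians, cf. eq. (13)
    L_cgd = ZZt - ZZt @ ZZt + alpha * Z_hat @ L_ml @ Z_hat.T
    L_ccl = Z_hat @ L_cl @ Z_hat.T + 1e-6 * np.eye(p)   # ridge: our addition
    # Step (3): first k nontrivial generalized eigenvectors of eq. (12)
    _, Y = eigh(L_cgd, L_ccl, subset_by_index=[1, k])
    X = Z_hat.T @ Y                            # Step (4)
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)  # simplified Step (5)
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=max(k, int(n * s)), replace=False)
    centers, _ = kmeans2(X[idx], k, minit='++', seed=seed)      # Step (6)
    # Step (7): assign every row of X to its nearest center
    return np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)

# Tiny demo: two separated blobs with a couple of constraints
rng = np.random.default_rng(1)
A = np.vstack([rng.normal(0, .3, (30, 2)), rng.normal(5, .3, (30, 2))])
idx = rng.choice(60, 12, replace=False)                     # landmarks
K = np.exp(-((A[:, None] - A[idx][None]) ** 2).sum(-1))     # (60, 12) kernel
Z = (K / K.sum(1, keepdims=True)).T                         # columns sum to 1
Z_hat = Z / np.sqrt(Z.sum(1, keepdims=True))
Q = np.zeros((60, 60)); Q[0, 1] = Q[1, 0] = 1; Q[0, 31] = Q[31, 0] = -1
labels = fast_csc(Z_hat, Q, k=2)
```

The landmark coding in the demo uses all landmarks per point for brevity rather than the $r$-nearest rule of (4).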
The complexity analysis of Algorithm 1 is as follows. The time for computing $\hat Z$ is $O(npd)$. In Step (2), $L_{CGD}$ is computed as

(13) $L_{CGD} = \hat Z L_{\tilde G_D} \hat Z^T = \hat Z (I - \hat Z^T \hat Z + \alpha L_{G_{ML}}) \hat Z^T = \hat Z \hat Z^T - (\hat Z \hat Z^T)^2 + \alpha \hat Z L_{G_{ML}} \hat Z^T$.

Let $c$ be the number of data points with constraint information; then the time cost of computing $\alpha \hat Z L_{G_{ML}} \hat Z^T$ is $O(p^2 c + p c^2)$. Hence, the time cost of Steps (1) and (2) is $O(p^2 n + p^3) + O(p^2 c + p c^2) = O(p^2 n + p^3 + p^2 c + p c^2)$. Besides, the time complexity of Step (3) is $O(p^3)$, that of Step (4) is $O(kpn)$, and that of Step (5) is $O(kn)$. Thus, the time cost of the first five steps is $O(p^2 n)$, considering $p, c \ll n$ and $k \ll p, c$. Assuming the number of iterations of k-means clustering is $l$, the time cost of Steps (6) and (7) is $O(nsk^2 l + nk^2)$, which is much less than the $O(nlk^2)$ cost of k-means clustering on $\hat X$ when $ns \ll n$. Hence, the time complexity of our algorithm is

(14) $O(np^2 + nk^2 + npd)$.

Since the three matrices $\hat Z$, $L_{CGD}$, and $L_{CCL}$ are stored, the memory complexity is

(15) $O(np + p^2)$.
3.2. Asymptotic Property of the Framework
In this subsection, we show that the partition result of our fast CSC algorithm could be comparable to that of the original model [17] as p converges to n.
Theorem 4.
Assuming the adjacency matrix W in the original model is full rank, the result of Step (4) in Algorithm 1 will converge to the generalized eigenvectors of (2) as p converges to n.
Proof.
From the construction of the sparse representation $\hat Z$, we get

(16) $\lim_{p\to n} \hat Z = \hat W$,

where $\hat W$ is the normalized adjacency matrix. Equation (12) can be rewritten as

(17) $\hat Z (I - \hat W + \alpha L_{G_{ML}}) \hat Z^T y = \lambda \hat Z L_{G_{CL}} \hat Z^T y$.

Equivalently,

(18) $\hat Z \left[(I - \hat W + \alpha L_{G_{ML}}) - \lambda L_{G_{CL}}\right] \hat Z^T y = 0$.

Since the rank of $\hat Z$ becomes equal to $n$, $\hat Z$ can be removed, and the equation becomes

(19) $(I - \hat W + \alpha L_{G_{ML}}) \hat Z^T y = \lambda L_{G_{CL}} \hat Z^T y$.

This shows that $\hat Z^T y$ and $\lambda$ in Step (4) of Algorithm 1 are indeed an eigenvector and eigenvalue of (2), respectively. Moreover, the number of eigenvectors of (19) converges to $n$ as $p$ converges to $n$. Hence Algorithm 1 can also obtain all the eigenvectors of (2) asymptotically.
Algorithm 2: Spectral ensemble clustering with random projection.
Input: binary matrix B∈Rn×d′, weights set wbx, cluster number k.
Output: the final partition result.
(1) Generate a d′×va sparse random matrix R meeting the requirements of Lemma 2,
where $a = \Theta(2\varepsilon^{-1}\log(n/\delta))$, $v = \Theta(\varepsilon^{-1})$, $0 < \delta, \varepsilon < 1/2$, $va < d'$;
(2) Compute $\tilde B = W_B^{-1} B$, where $W_B$ is a diagonal matrix with diagonal entries $w_{b(x)}$;
(3) Compute B^=B~R;
(4) Run weighted k-means clustering on B^ with weight set wbx to obtain the final clustering result.
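The steps of Algorithm 2 can be sketched as follows (our own illustration; the weighted k-means step is a minimal Lloyd-style iteration rather than the exact algorithm of [44], and the dimensions in the demo are purely illustrative):

```python
import numpy as np

def weighted_kmeans(X, w, k, iters=50, seed=0):
    """Minimal weighted Lloyd iteration: centers are weighted means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), 1)
        for j in range(k):
            m = labels == j
            if m.any():
                centers[j] = np.average(X[m], axis=0, weights=w[m])
    return labels

def secrp(B, w, k, a=4, v=2, seed=0):
    """Sketch of Algorithm 2 (SECRP). B: (n, d') binary matrix,
    w: weights (row sums of the coassociation matrix), k: clusters."""
    rng = np.random.default_rng(seed)
    n, d = B.shape
    # Step (1): sparse random matrix of eq. (10)
    R = np.zeros((d, a * v))
    for l in range(a):
        h = rng.integers(0, v, size=d)
        R[np.arange(d), l * v + h] = rng.choice([-1., 1.], size=d) / np.sqrt(a)
    B_tilde = B / w[:, None]          # Step (2): W_B^{-1} B
    B_hat = B_tilde @ R               # Step (3): project to a*v dimensions
    return weighted_kmeans(B_hat, w, k, seed=seed)   # Step (4)

# Demo: 10 identical basic partitions of 20 points into two groups
block = np.zeros((20, 2)); block[:10, 0] = 1; block[10:, 1] = 1
B = np.tile(block, (1, 10))           # d' = 20
w = (B @ B.T).sum(axis=1)             # diagonal of D_1 (row sums of C)
labels = secrp(B, w, k=2)
```

Points with identical rows of $B$ necessarily receive the same label, so within-group agreement of the basic partitions survives the projection.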
Since the eigenvectors of our framework will converge to that of original CSC model [17] and the random sampling has little influence on the clustering result of embedded eigenvectors matrix, our new CSC algorithm will generate the partition result which is comparable to that of original framework. In addition, the reason why we give the assumption of Theorem 4 is that each row vector of adjacency matrix is the similarity representation of certain point over the whole data set, and those representations are often linearly independent. In the experiments, we have demonstrated this theory empirically on the 30 nearest neighbors adjacency matrices of three data sets.
4. Spectral Ensemble Clustering with Random Projection
In this section, we propose an improved spectral ensemble clustering algorithm with random projection. The new ensemble clustering not only improves the efficiency of spectral ensemble clustering algorithm designed by Liu et al. [22], but also can theoretically preserve the approximate clustering result.
4.1. Algorithm Formulation
In this subsection, we give the detailed procedure of our new spectral ensemble clustering algorithm. We denote the original spectral ensemble clustering [22] by SEC and our improved spectral ensemble clustering with random projection by SECRP.
From the description of Section 2.3, we can know that the SEC algorithm transforms the spectral clustering on the coassociation matrix into weighted k-means clustering on the specific binary matrix B. The dimension of binary matrix B is ∑i=1gki, where ki is the cluster number of basic partition πi. When the number of clusters and/or basic partitions is big, B is probably a high dimensional matrix on which the weighted k-means clustering runs slowly.
To avoid the high dimensions of B, we design an improved SEC algorithm with random projection for dimensionality reduction. The new algorithm SECRP is showed in Algorithm 2.
The complexity analysis of the new algorithm is as follows. Obviously, the running time of Steps (1) and (2) is very short compared with that of Step (3). The time of Step (3) is $O(\mathrm{nnz}(B)\,a) = O(nga)$, where $g$ is the number of basic partitions and $\mathrm{nnz}(\cdot)$ denotes the number of nonzero entries. Another common dimensionality reduction method is singular value decomposition (SVD): running SVD on the binary matrix $B$ takes $O(d'^3 + n d'^2)$ time, and the product between the singular vectors and $B$ takes $O(n d' v a)$ time. Since $g \approx d'/k$, random projection with a sparse random matrix is a cost-effective method of dimensionality reduction. With respect to the weighted k-means clustering, random-projection dimensionality reduction decreases the running time of each iteration from $O(nkd')$ to $O(nkva)$.
As a basic module, Algorithm 2 can be combined with different basic partition methods to produce different cluster ensemble algorithms. Thus, taking Algorithm 1 as the basic partition algorithm for Algorithm 2 could generate an efficient constrained cluster ensemble method with high accuracy (both basic partitions and final clustering are spectral clustering). Moreover, the last two steps of Algorithm 2 are just weighted k-means clustering with sparse random projection, which is also suitable for any other applications of weighted k-means clustering.
4.2. Theoretical Analysis of New Ensemble Algorithm
In this subsection, we demonstrate that our new algorithm SECRP can maintain the clustering result of SEC approximately.
For the theoretical analysis, we give the formal definition of weighted k-means clustering problem with matrix notation:
Given an $n$-point set $B$ (each row a data point), a diagonal matrix $W_B$ whose diagonal entry set $w_b$ is the weight set, and the cluster number $k$, find an $n \times k$ indicator matrix $X_{opt}$ such that

(20) $X_{opt} = \operatorname{argmin}_X \|W_B^{1/2}(B - X X^T W_B B)\|_F^2$,

where $\|\cdot\|_F^2$ denotes the squared Frobenius norm and $X$ ranges over the set of all indicator matrices. An indicator matrix has one nonzero element in each row: if the $i$th point belongs to the $j$th cluster, $X(i,j) = 1/\sqrt{w_{C_j}}$, where $w_{C_j}$ denotes the sum of the weights of points in cluster $C_j$.
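To make definition (20) concrete, the following numpy check (our own illustration) verifies that the matrix form of the objective coincides with the familiar weighted sum of squared distances to weighted cluster centroids:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 12, 3, 2
B = rng.normal(size=(n, d))
w = rng.uniform(0.5, 2.0, size=n)        # positive point weights
labels = np.arange(n) % k                # some fixed cluster assignment

# Indicator matrix X of (20): X(i, j) = 1/sqrt(w_Cj) for i in cluster j
X = np.zeros((n, k))
for j in range(k):
    m = labels == j
    X[m, j] = 1.0 / np.sqrt(w[m].sum())

W = np.diag(w)
matrix_cost = np.linalg.norm(np.sqrt(W) @ (B - X @ X.T @ W @ B)) ** 2

# Equivalent form: weighted squared distances to weighted centroids
centroids = np.array([np.average(B[labels == j], axis=0, weights=w[labels == j])
                      for j in range(k)])
cluster_cost = sum(w[i] * np.sum((B[i] - centroids[labels[i]]) ** 2)
                   for i in range(n))
# matrix_cost and cluster_cost agree up to floating point error
```

The agreement holds because the $i$th row of $XX^TW_BB$ is exactly the weighted centroid of the cluster containing point $i$.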
Since computing Xopt is an NP-hard problem, we focus on the approximate algorithm for weighted k-means clustering. The corresponding definition is as follows.
An algorithm is called a "γ-approximation" for the weighted k-means clustering problem if it takes $B$, $k$, and $W_B$ as input and outputs an indicator matrix $X_\gamma$ such that

(21) $\Pr\!\left[\|W_B^{1/2}(B - X_\gamma X_\gamma^T W_B B)\|_F^2 \le \gamma \min_X \|W_B^{1/2}(B - X X^T W_B B)\|_F^2\right] \ge 1 - \delta_\gamma$,

where $\gamma$ is the approximation factor and $\delta_\gamma$ is the failure probability of the γ-approximation weighted k-means clustering algorithm.
Though γ-approximation k-means clustering algorithms exist, such as [43], it is unclear whether a γ-approximation weighted k-means clustering algorithm exists. To facilitate the proof of our theory, we assume that such an approximation algorithm exists and utilize its definition in the proof. In the following experiments, we take the weighted version of the classical k-means clustering algorithm [44] as the weighted k-means clustering to verify our theoretical results.
Theorem 7.
Let the $n \times d'$ matrix $B$, weight set $w_{b(a)}$, and cluster number $k$ be the inputs of Algorithm 2, and let $\varepsilon \in (0, 1/3)$. Assuming that a γ-approximation weighted k-means clustering algorithm exists, the output $X_{\hat\gamma}$ of Algorithm 2 satisfies, with probability at least $0.97 - \delta_\gamma$,

(22) $\|W_B^{1/2}(\tilde B - X_{\hat\gamma} X_{\hat\gamma}^T W_B \tilde B)\|_F^2 \le \left(1 + (1+\varepsilon)\gamma\right) \|W_B^{1/2}(\tilde B - X_{opt} X_{opt}^T W_B \tilde B)\|_F^2$.

Here $\tilde B = W_B^{-1} B$ is the result of Step (2) in Algorithm 2 and $X_{opt}$ is the optimal solution of weighted k-means clustering on $\tilde B$.
This theorem reveals that random projection not only can be used to improve the efficiency of spectral ensemble clustering with lower dimensions, but also maintains its final result approximately.
In the following, we present a useful lemma which is needed in the proof of Theorem 7. The results of the lemma are based on the results of [36] and Lemma 2.
Lemma 8.
Let $\tilde B$, $R$, $W_B$, $k$, and $\varepsilon$ be the same as those in Theorem 7; denote $W_B^{1/2}\tilde B$ by $H$, and denote by $H_k$ the best rank-$k$ approximation of $H$, i.e., the product of the top $k$ singular vectors (left and right) and singular values of $H$.
(1) (Lemma 5 of [36]) Let the SVD of $H_k$ be $H_k = U_k \Sigma_k V_k^T$, where $U_k$ and $V_k$ are the left and right singular vector matrices and $\Sigma_k$ is the diagonal matrix of the top $k$ singular values. With probability at least 0.97,

(23) $H_k = H R (V_k^T R)^\dagger V_k^T + E$,

where $\dagger$ denotes the matrix pseudoinverse and $E$ is an $n \times d'$ matrix with $\|E\|_F \le 4\varepsilon \|H - H_k\|_F$.
(2) (Lemma 4 of [36]) For any $n \times d'$ matrix $G$, with probability at least 0.99,

(24) $\|G R\|_F \le \sqrt{1+\varepsilon}\, \|G\|_F$.
(3) (Combination of Lemmas 2 and 3 of [36]) With probability at least 0.99,

(25) $\|(V_k^T R)^\dagger\|_2 \le \dfrac{1}{\sqrt{1-\varepsilon}}$.
These conclusions all concern the influence of the random matrix $R$ on the norms of different matrices and are useful for bounding the norms appearing in Theorem 7. In the proof below, we start by decomposing the term $\|W_B^{1/2}(\tilde B - X_{\hat\gamma} X_{\hat\gamma}^T W_B \tilde B)\|_F^2$ of (22); then, based on Lemma 8, we bound the norms of the different terms in the decomposition.
Proof.
Using the notation of Lemma 8, the left-hand side of (22) can be decomposed as

(26) $\|W_B^{1/2}(\tilde B - X_{\hat\gamma} X_{\hat\gamma}^T W_B \tilde B)\|_F^2 = \|(I - W_B^{1/2} X_{\hat\gamma} X_{\hat\gamma}^T W_B^{1/2}) W_B^{1/2} \tilde B\|_F^2 = \|(I - W_B^{1/2} X_{\hat\gamma} X_{\hat\gamma}^T W_B^{1/2}) H\|_F^2 = \|(I - W_B^{1/2} X_{\hat\gamma} X_{\hat\gamma}^T W_B^{1/2}) H_k\|_F^2 + \|(I - W_B^{1/2} X_{\hat\gamma} X_{\hat\gamma}^T W_B^{1/2}) H_{\rho-k}\|_F^2$,

where $H_{\rho-k} = H - H_k$. The last equality follows from the orthogonality of $H_k$ and $H_{\rho-k}$.
We first bound the second term of (26). By our definition of the indicator matrix, $X_{\hat\gamma}^T W_B^{1/2} W_B^{1/2} X_{\hat\gamma} = I_k$. Thus, $I - W_B^{1/2} X_{\hat\gamma} X_{\hat\gamma}^T W_B^{1/2}$ is a projector matrix, namely, its $\ell_2$ norm is 1. As a result,

(27) $\|(I - W_B^{1/2} X_{\hat\gamma} X_{\hat\gamma}^T W_B^{1/2}) H_{\rho-k}\|_F^2 \le \|H_{\rho-k}\|_F^2 \le \|(I - W_B^{1/2} X_{opt} X_{opt}^T W_B^{1/2}) H\|_F^2$,

where the second inequality follows from $\operatorname{rank}(W_B^{1/2} X_{opt} X_{opt}^T W_B^{1/2}) \le k$ and the optimality of the SVD.
We next bound the first term of (26). From the first statement of Lemma 8, we get

(28) $\|(I - W_B^{1/2} X_{\hat\gamma} X_{\hat\gamma}^T W_B^{1/2}) H_k\|_F \le \|(I - W_B^{1/2} X_{\hat\gamma} X_{\hat\gamma}^T W_B^{1/2}) H R (V_k^T R)^\dagger V_k^T\|_F + \|E\|_F \le \|(I - W_B^{1/2} X_{\hat\gamma} X_{\hat\gamma}^T W_B^{1/2}) H R\|_F\, \|(V_k^T R)^\dagger\|_2 + \|E\|_F$.

From Definition 6 and the meaning of $X_{opt}$ in Theorem 7, we get

(29) $\|(I - W_B^{1/2} X_{\hat\gamma} X_{\hat\gamma}^T W_B^{1/2}) H R\|_F \le \sqrt{\gamma}\, \min_X \|(I - W_B^{1/2} X X^T W_B^{1/2}) H R\|_F \le \sqrt{\gamma}\, \|(I - W_B^{1/2} X_{opt} X_{opt}^T W_B^{1/2}) H R\|_F$.

Using statement (2) of Lemma 8, (29) can be transformed into

(30) $\|(I - W_B^{1/2} X_{\hat\gamma} X_{\hat\gamma}^T W_B^{1/2}) H R\|_F \le \sqrt{\gamma(1+\varepsilon)}\, \|(I - W_B^{1/2} X_{opt} X_{opt}^T W_B^{1/2}) H\|_F$.

Combining statement (3) of Lemma 8 with (30), and using $\|E\|_F \le 4\varepsilon\|H - H_k\|_F \le 4\varepsilon\|(I - W_B^{1/2} X_{opt} X_{opt}^T W_B^{1/2}) H\|_F$, we get

(31) $\|(I - W_B^{1/2} X_{\hat\gamma} X_{\hat\gamma}^T W_B^{1/2}) H R\|_F\, \|(V_k^T R)^\dagger\|_2 + \|E\|_F \le \left(\sqrt{\dfrac{\gamma(1+\varepsilon)}{1-\varepsilon}} + 4\varepsilon\right) \|(I - W_B^{1/2} X_{opt} X_{opt}^T W_B^{1/2}) H\|_F$.

From (28) and (31), and rescaling $\varepsilon$, we get

(32) $\|(I - W_B^{1/2} X_{\hat\gamma} X_{\hat\gamma}^T W_B^{1/2}) H_k\|_F \le \sqrt{\gamma(1+\varepsilon)}\, \|(I - W_B^{1/2} X_{opt} X_{opt}^T W_B^{1/2}) H\|_F$.
Finally, combining (27) and (32) concludes the proof.
It is easy to check that the above theoretical analysis also applies to ordinary weighted k-means clustering, indicating that dimensionality reduction with random projection approximately preserves the clustering quality of weighted k-means clustering. Furthermore, the combination of Theorems 4 and 7 means that the new semisupervised cluster ensemble method (the combination of Algorithms 1 and 2) can achieve an encouraging clustering result.
5. Experiments
In this section, we present the experimental results of our new algorithms in Sections 3 and 4. We implemented all the related algorithms in Matlab and conducted our experiments on a Windows machine with the Intel Core 3.6 GHz processor and 16 GB of RAM.
5.1. Data Sets and Experimental Settings
In order to facilitate the comparison, we performed experiments on three publicly available data sets (http://archive.ics.uci.edu/ml/, http://www.cad.zju.edu.cn/home/dengcai/). Table 1 summarizes their basic information.
Table 1: Data sets information.

Data set              #instances   #attributes   #classes
Letter recognition    20,000       16            26
MNIST                 70,000       784           10
CoverType             581,012      54            7
The constraint information is generated from the real labels of the data sets. In our experiments, we sample the labeled points randomly from each data set. The constraint matrix Q is constructed as
\[
Q_{i,j} =
\begin{cases}
1 & x_i,\ x_j \text{ have the same label},\\
-1 & x_i,\ x_j \text{ have different labels},\\
0 & \text{no constraint}.
\end{cases}
\tag{33}
\]
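A minimal sketch of this construction (in Python rather than the paper's MATLAB; `constraint_matrix` is a hypothetical helper name) might look like:

```python
import numpy as np

def constraint_matrix(labels, labeled_idx):
    """Build the constraint matrix Q of (33) from the sampled labeled points."""
    n = len(labels)
    Q = np.zeros((n, n))
    for a, i in enumerate(labeled_idx):
        for j in labeled_idx[a + 1:]:
            q = 1.0 if labels[i] == labels[j] else -1.0  # must-link vs cannot-link
            Q[i, j] = Q[j, i] = q
    return Q

labels = np.array([0, 0, 1, 1, 2])        # toy ground-truth labels
Q = constraint_matrix(labels, [0, 1, 2])  # points 0, 1, 2 are "labeled"
```

Pairs with at least one unlabeled endpoint stay at 0, matching the "no constraint" case of (33).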
The validation measures used in our experiments are cluster accuracy (CA) [45] and normalized mutual information (NMI) [25]. The CA is computed as
\[
\mathrm{CA} = \frac{\sum_{i=1}^{k} \max(\mathrm{cluster}_i \mid \mathrm{label})}{n},
\tag{34}
\]
where $k$ is the number of clusters in the clustering result, $n$ is the number of data points, and $\max(\mathrm{cluster}_i \mid \mathrm{label})$ is the maximum number of points sharing the same true label in the $i$th cluster. For the NMI, we construct two random variables $C$ and $L$ from the clustering result and the true labels, respectively; their distributions are the proportions of the different clusters (or classes) over the whole data set. The NMI is computed as
\[
\mathrm{NMI} = \frac{\mathrm{MI}(C,L)}{\sqrt{H(C)\,H(L)}}
= \frac{\sum_{c,l} n_{c,l}\log\bigl(n\, n_{c,l}/(n_c n_l)\bigr)}{\sqrt{\bigl(\sum_c n_c \log(n_c/n)\bigr)\bigl(\sum_l n_l \log(n_l/n)\bigr)}},
\tag{35}
\]
where $\mathrm{MI}(C,L)$ denotes the mutual information of $C$ and $L$, $H(\cdot)$ denotes the entropy of a random variable, $n$ is the number of data points, $n_{c,l}$ is the number of points in both cluster $c$ and class $l$, $n_c$ is the number of points in cluster $c$, and $n_l$ is the number of points in class $l$. Both CA and NMI range from 0 to 1, and higher values indicate better clustering solutions.
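Both measures can be computed directly from the contingency counts. The following Python sketch (an illustration, not the paper's code) implements (34) and (35); the $n$ factors in (35) cancel between numerator and denominator, so raw counts suffice.

```python
from collections import Counter
from math import log

def cluster_accuracy(y_pred, y_true):
    """CA of (34): each cluster contributes its majority true-label count."""
    n = len(y_true)
    correct = 0
    for c in set(y_pred):
        members = [y_true[i] for i in range(n) if y_pred[i] == c]
        correct += Counter(members).most_common(1)[0][1]
    return correct / n

def nmi(y_pred, y_true):
    """NMI of (35), computed from the joint counts n_{c,l}."""
    n = len(y_true)
    joint = Counter(zip(y_pred, y_true))
    nc, nl = Counter(y_pred), Counter(y_true)
    mi = sum(v * log(n * v / (nc[c] * nl[l])) for (c, l), v in joint.items())
    hc = -sum(v * log(v / n) for v in nc.values())
    hl = -sum(v * log(v / n) for v in nl.values())
    return mi / (hc * hl) ** 0.5
```

Note that a clustering that exactly matches the classes up to relabeling (e.g. swapped cluster ids) scores 1 on both measures.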
5.2. Comparisons of Different Constrained Spectral Clustering
In this subsection, we compare our fast CSC (constrained spectral clustering) algorithm with other spectral clustering algorithms. The algorithms in the comparison are:
LSC-R [20, 21]: the unsupervised spectral clustering baseline with landmark-based graph construction.
SCACS [16]: the most efficient CSC algorithm known, set as the CSC baseline on the MNIST and CoverType data sets.
CCS [17]: the original CSC model proposed in [17], set as the CSC baseline on the LetterRec data set. (Because constructing the nearest-neighbor graphs is too time-consuming on the MNIST and CoverType data sets, we do not run CCS on them.)
CCS-L: our improved CCS algorithm with landmark-based graph construction.
CCS-LS: our improved CCS algorithm with landmark-based graph construction and random sampling.
In the landmark-based graph construction, we fix the number of landmark points at p=500 and the number of nearest neighbors at r=3. For the SCACS algorithm we set β0=0.1, the same value as in [16]. Since the original CCS model [17] pointed out that α can be a constant and set α=5 in its implementation code, we also set α=5 in CCS, CCS-L, and CCS-LS.
First, we investigate the influence of the number of labeled points c on the performance of the algorithms. We vary c from 100 to 1000 with step size 100. For each value of c, we select the c labeled points randomly to produce constraint information and repeat 20 trials with different labeled point sets. The corresponding experimental results are presented in Figure 1: Figures 1(a), 1(b), and 1(c) report CA, Figures 1(d), 1(e), and 1(f) report NMI, and Figures 1(g), 1(h), and 1(i) report running time. Our algorithm CCS-LS outperforms LSC-R on all data sets, and the values of CA and NMI increase with the growth of constraint information; these results indicate that our algorithm employs the constraint information appropriately. Compared with SCACS, our algorithm performs similarly on the LetterRec and MNIST data sets and better on the CoverType data set, indicating that it adapts to a wider range of data geometries. Over the three data sets, the performance of CCS-LS is close to that of CCS-L. What is more, our algorithm runs fastest among these algorithms.
Figure 1: Performance of clustering algorithms with different constraint information. Panels (a)-(c): CA; (d)-(f): NMI; (g)-(i): running time; each on LetterRec, MNIST, and CoverType.
Next, we study the influence of random sampling (Step (5) of Algorithm 1), shown in Figure 2. In the experiments, we fix c=500 and change the sample rate from 0.1 to 1 with step size 0.1. We again run 20 independent trials to account for randomness and compute the means of the validity measures. The values of CA and NMI vary only slightly as the sample rate grows, verifying the feasibility of random sampling.
Figure 2: Influence of sample rates on the proposed algorithms.
5.3. Performance of the Spectral Ensemble Clustering with Random Projection
A cluster ensemble consists of two parts, basic partition clustering and ensemble clustering, so below we combine different basic partition clustering algorithms with different ensemble clustering algorithms to obtain different cluster ensemble algorithms. In this way, the performance of both the new ensemble clustering algorithm (Algorithm 2) and the new cluster ensemble algorithm (the combination of Algorithms 1 and 2) can be demonstrated. The cluster ensemble algorithms in the comparison are:
CK-SE: the basic partition clustering algorithm “CK” is the constrained k-means clustering algorithm [9], and the ensemble clustering algorithm “SE” is the spectral ensemble clustering (SEC) algorithm [22].
SCACS-SE: the basic partition clustering algorithm is SCACS [16] in Section 5.2, and the ensemble clustering algorithm is also SE [22].
CCSS-SE: the basic partition clustering algorithm “CCSS” is our fast CSC algorithm (Algorithm 1), and the ensemble clustering algorithm is also SE [22].
CCSS-SER: the basic partition clustering algorithm is CCSS, and the ensemble clustering algorithm “SER” is our spectral ensemble clustering with random projection (Algorithm 2).
In the basic partition clustering phase, we fix the number of basic partitions at 50, and the parameters of the basic clustering algorithms are the same as those in the last subsection. In addition, following the procedure of SE [22], the basic partitions are obtained by varying the cluster number from k-5 to k+4. We repeat each cluster ensemble algorithm 10 times and report the average results.
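The basic-partition protocol above (50 partitions whose cluster numbers cycle through k-5 to k+4) can be sketched as follows; here a plain Lloyd k-means stands in for the constrained basic clusterer, an assumption for illustration only.

```python
import numpy as np

def lloyd_kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd k-means; a stand-in for the constrained basic clusterer."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(0)
    return labels

def basic_partitions(X, k, r=50):
    """r basic partitions whose cluster numbers cycle through k-5 .. k+4."""
    sizes = [max(2, k - 5 + (t % 10)) for t in range(r)]
    return [lloyd_kmeans(X, s, seed=t) for t, s in enumerate(sizes)]
```

Each partition uses a different seed and cluster count, which injects the diversity that the consensus step relies on.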
First, we compare the different cluster ensemble algorithms under different constraint information in Figure 3. Here the dimensionality rd of CCSS-SER reduced by random projection is 40, and we change the number of labeled points c from 100 to 1000 with step size 100. In the figure, Figures 3(a)-3(c) report CA and Figures 3(d)-3(f) report NMI. Just like the results of the last subsection, CCSS-SE performs similarly to SCACS-SE on the LetterRec and MNIST data sets and much better on the CoverType data set. Comparing Figures 1 and 3, both validity measures are dramatically higher than those of the basic partitions, verifying the improvement in clustering quality brought by ensemble clustering. Compared with CK-SE, CCSS-SE and CCSS-SER both perform significantly better, which indicates that the basic partitions have an obvious impact on the final result and also verifies the high accuracy of our new constrained spectral cluster ensemble method. In addition, the small performance difference between CCSS-SE and CCSS-SER implies that random projection approximately preserves the results of spectral ensemble clustering under different constraint information.
Figure 3: Performance of ensemble clustering algorithms with different constraint information. Panels (a)-(c): CA; (d)-(f): NMI; each on LetterRec, MNIST, and CoverType.
Second, we inspect the influence of the dimension of the random projection on the performance of our algorithm in Figure 4 and Table 2. In Figure 4, "SEC-SVD" denotes the SEC algorithm with dimensionality reduction by SVD. When rd is above a certain bound, the validity measures of SECRP (our algorithm) are almost stable and similar to those of SEC over all three data sets. This indicates that the accuracy of the clustering algorithm is preserved once the dimension surpasses a certain bound, which verifies Theorem 7. The small bound on the dimension (rd=40) also reveals the effectiveness of dimensionality reduction by random projection. As for SEC-SVD, although it can also preserve the accuracy of the clustering algorithm, its running time is not encouraging: even with rd=20, the running times of the original algorithm versus the SVD method over the three data sets are 3.47 s/10.85 s, 4.91 s/14.54 s, and 22.06 s/326.61 s. This may be caused by the slowness of SVD on large matrices and by the loss of sparsity of the binary matrix B. In Table 2, the decrease of running time verifies the efficiency of our new spectral ensemble clustering; combining this with Figures 1(g)-1(i), the efficiency of the new constrained cluster ensemble method is also verified. In addition, the running-time savings from random projection decline as the dimension grows, indicating that a relatively small projection dimension is preferable.
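For reference, the sparse random projection step can be sketched as follows (in Python; an Achlioptas-style density-1/3 construction in the spirit of [32], with an illustrative function name):

```python
import numpy as np

def sparse_random_projection(B, rd, seed=0):
    """Project the n x r matrix B down to rd dimensions with an Achlioptas-style
    sparse matrix: entries +sqrt(3), 0, -sqrt(3) with probabilities 1/6, 2/3, 1/6."""
    rng = np.random.default_rng(seed)
    R = rng.choice([np.sqrt(3.0), 0.0, -np.sqrt(3.0)],
                   size=(B.shape[1], rd), p=[1 / 6, 2 / 3, 1 / 6])
    return (B @ R) / np.sqrt(rd)
```

Since two-thirds of the entries of R are zero and B is a sparse binary matrix, the product B @ R is cheap; the reduced matrix would then replace B as the input of the consensus (spectral ensemble) step.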
Table 2: Decrease of running time of SECRP from SEC with different dimensions rd.

rd          10      20      30      40      50      60      70      80      90      100
LetterRec   2.44    2.39    2.11    2.07    2.04    2.03    1.92    1.74    1.67    1.56
MNIST       2.76    2.68    2.66    2.58    2.51    2.34    2.31    2.11    2.16    2.03
CoverType   18.85   18.64   15.34   15.26   14.04   11.43   9.73    8.31    7.72    7.44
Figure 4: Performance of ensemble clustering algorithms with different dimensions (CA and NMI on LetterRec, MNIST, and CoverType).
6. Conclusion
To handle large scale data sets, we propose a fast CSC algorithm. The new algorithm decreases the space and time complexity of a recently introduced CSC model through landmark-based graph construction and improves its efficiency further by random sampling. It not only matches the behavior of the original model asymptotically but is also, empirically, the most efficient and applicable to a wide range of data sets. Taking the new CSC algorithm as the basic partition algorithm, we design an efficient semisupervised cluster ensemble algorithm. In the consensus clustering stage, we reduce the dimensionality of the input of spectral ensemble clustering by sparse random projection and prove that sparse random projection approximately preserves the clustering quality. The experimental results over several data sets also verify the efficiency and effectiveness of the new cluster ensemble algorithm. Moreover, the analysis of the influence of dimensionality reduction with random projection in the spectral ensemble clustering process also gives a theoretical guarantee for weighted k-means clustering with random projection. In the future, we will improve the performance of our cluster ensemble algorithm further with techniques such as applying several different basic partition methods, selecting among the results of basic partitions, and assigning different weights to basic partitions.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grants 61502527 and 61379150 and in part by the Open Foundation of the State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (no. SKLNST-2013-1-06).
References
[1] J. Shen, D. Liu, J. Shen, Q. Liu, and X. Sun, "A secure cloud-assisted urban data sharing framework for ubiquitous-cities."
[2] Q. Liu, W. Cai, J. Shen, Z. Fu, X. Liu, and N. Linge, "A speculative approach to spatial-temporal efficiency with multi-objective optimization in a heterogeneous cloud environment."
[3] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications.
[4] Y. Zheng, B. Jeon, D. Xu, Q. M. J. Wu, and H. Zhang, "Image segmentation by generalized hierarchical fuzzy C-means algorithm."
[5] Z. Zhou, Q. J. Wu, F. Huang, and X. Sun, "Fast and accurate near-duplicate image elimination for visual sensor networks."
[6] H. Rong, T. Ma, M. Tang, and J. Cao, "A novel subgraph K+-isomorphism method in social network based on graph similarity detection."
[7] T. Ma, Y. Wang, M. Tang, J. Cao, Y. Tian, A. Al-Dhelaan, and M. Al-Rodhaan, "LED: a fast overlapping communities detection algorithm based on structural clustering."
[8] Z. Yu, H. Chen, J. You, J. Liu, H.-S. Wong, G. Han, and L. Li, "Adaptive fuzzy consensus clustering framework for clustering analysis of cancer data."
[9] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl, "Constrained k-means clustering with background knowledge," in Proceedings of the 18th International Conference on Machine Learning (ICML '01), Williamstown, Mass, USA, June 2001, pp. 577-584.
[10] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. J. Russell, "Distance metric learning with application to clustering with side-information," in Proceedings of the 15th International Conference on Neural Information Processing Systems (NIPS '02), Vancouver, Canada, December 2002, pp. 505-512.
[11] S. D. Kamvar, D. Klein, and C. D. Manning, "Spectral learning," in Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI '03), Acapulco, Mexico, August 2003, pp. 561-566.
[12] Z. Lu and M. Á. Carreira-Perpiñán, "Constrained spectral clustering through affinity propagation," in Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), Anchorage, Ala, USA, June 2008.
[13] Z. Li, J. Liu, and X. Tang, "Constrained clustering via spectral regularization," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR '09), Miami, Fla, USA, June 2009, pp. 421-428.
[14] Z. Lu and H. H. Ip, "Constrained spectral clustering via exhaustive and efficient constraint propagation," in Proceedings of the 11th European Conference on Computer Vision (ECCV '10), Crete, Greece, September 2010, pp. 1-14.
[15] X. Wang, B. Qian, and I. Davidson, "On constrained spectral clustering and its applications."
[16] J. Li, Y. Xia, Z. Shan, and Y. Liu, "Scalable constrained spectral clustering."
[17] M. Cucuringu, I. Koutis, S. Chawla, G. L. Miller, and R. Peng, "Simple and scalable constrained clustering: a generalized spectral method," in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS '16), Cadiz, Spain, May 2016, pp. 445-454.
[18] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," in Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic (NIPS '01), Vancouver, Canada, December 2001, pp. 849-856.
[19] J. Shi and J. Malik, "Normalized cuts and image segmentation."
[20] X. Chen and D. Cai, "Large scale spectral clustering with landmark-based representation," in Proceedings of the 25th AAAI Conference on Artificial Intelligence, August 2011, pp. 313-318.
[21] D. Cai and X. Chen, "Large scale spectral clustering via landmark-based sparse representation."
[22] H. Liu, T. Liu, J. Wu, D. Tao, and Y. Fu, "Spectral ensemble clustering," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15), Sydney, Australia, August 2015, pp. 715-724.
[23] X. Z. Fern and C. E. Brodley, "Random projection for high dimensional data clustering: a cluster ensemble approach," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), August 2003, pp. 186-193.
[24] A. L. N. Fred and A. K. Jain, "Combining multiple clusterings using evidence accumulation."
[25] A. Strehl and J. Ghosh, "Cluster ensembles—a knowledge reuse framework for combining multiple partitions."
[26] J. Wu, H. Liu, H. Xiong, and J. Cao, "A theoretic framework of K-means-based consensus clustering," in Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI '13), Beijing, China, August 2013, pp. 1799-1805.
[27] Z. Yu, L. Li, J. Liu, J. Zhang, and G. Han, "Adaptive noise immune cluster ensemble using affinity propagation."
[28] Z. Yu, P. Luo, J. You, H.-S. Wong, H. Leung, S. Wu, J. Zhang, and G. Han, "Incremental semi-supervised clustering ensemble for high dimensional data clustering."
[29] M. Ye, W. Liu, J. Wei, and X. Hu, "Fuzzy c-means and cluster ensemble with random projection for big data clustering."
[30] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space."
[31] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the Annual ACM Symposium on Theory of Computing (STOC '98), 1998, pp. 604-613.
[32] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins."
[33] J. A. Tropp, "Improved analysis of the subsampled randomized Hadamard transform."
[34] D. M. Kane and J. Nelson, "Sparser Johnson-Lindenstrauss transforms."
[35] S. Paul, C. Boutsidis, M. Magdon-Ismail, and P. Drineas, "Random projections for linear support vector machines."
[36] C. Boutsidis, A. Zouzias, and P. Drineas, "Random projections for k-means clustering," in Proceedings of the 24th Annual Conference on Neural Information Processing Systems (NIPS '10), December 2010, pp. 298-306.
[37] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, "Randomized dimensionality reduction for k-means clustering."
[38] M. B. Cohen, S. Elder, C. Musco, C. Musco, and M. Persu, "Dimensionality reduction for k-means clustering and low rank approximation," in Proceedings of the 47th Annual ACM Symposium on Theory of Computing (STOC '15), June 2015, pp. 163-172.
[39] Q. Ding and E. D. Kolaczyk, "A compressed PCA subspace method for anomaly detection in high-dimensional data."
[40] H. Lee, A. Battle, R. Raina, and A. Y. Ng, "Efficient sparse coding algorithms," in Proceedings of the 19th International Conference on Neural Information Processing Systems (NIPS '06), Vancouver, Canada, 2006, pp. 801-808.
[41] M. Popescu, J. Keller, J. Bezdek, and A. Zare, "Random projections fuzzy c-means (RPFCM) for big data clustering," in Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '15), Istanbul, Turkey, August 2015, pp. 1-6.
[42] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques.
[43] A. Kumar, Y. Sabharwal, and S. Sen, "A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions," in Proceedings of the 45th Symposium on Foundations of Computer Science (FOCS '04), Rome, Italy, October 2004, pp. 454-462.
[44] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms (SODA '07), 2007, pp. 1027-1035.
[45] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Y. Zomaya, S. Foufou, and A. Bouras, "A survey of clustering algorithms for big data: taxonomy and empirical analysis."