Fuzzy 𝑐-Means and Cluster Ensemble with Random Projection for Big Data Clustering

Because it mitigates the curse of dimensionality in big data, random projection has recently become a popular method for dimensionality reduction. In this paper, we present a theoretical analysis of the influence of random projection on the variability of a data set and on the dependence among dimensions. Building on this analysis, we present a new fuzzy 𝑐-means (FCM) clustering algorithm with random projection. Empirical results verify that the new algorithm not only preserves the accuracy of the original FCM clustering but also is more efficient than both the original clustering and clustering with singular value decomposition. We also propose a new cluster ensemble approach based on FCM clustering with random projection. The new aggregation method efficiently computes the spectral embedding of data using a cluster-center-based representation, and it scales linearly with data size. Experimental results show the efficiency, effectiveness, and robustness of our algorithm compared with state-of-the-art methods.


Introduction
With the rapid development of mobile Internet, cloud computing, Internet of things, social network services, and other emerging services, data has recently been growing at an explosive rate. How to analyze data quickly and effectively, and thereby maximize the benefits of data assets, has become a focus of attention. The "four Vs" model [ ] of big data (variety, volume, velocity, and value) has made traditional methods of data analysis inapplicable. Therefore, new techniques for big data analysis such as distributed or parallelized computing [ , ], feature extraction [ , ], and sampling [ ] have received wide attention.
Clustering is an essential method of data analysis through which the original data set can be partitioned into several data subsets according to similarities of data points. It has become an underlying tool for outlier detection [ ], biology [ ], indexing [ ], and so on. In fuzzy clustering analysis, each object in the data set no longer belongs to a single group but may belong to any group. The degree to which an object belongs to a group is denoted by a value in [0, 1]. Among the various methods of fuzzy clustering, fuzzy 𝑐-means (FCM) [ ] clustering has received particular attention for its special features. In recent years, many modified FCM algorithms [ -] designed for big data analysis have been proposed, based on different sampling and extension methods. However, these algorithms are unsatisfactory in efficiency for high dimensional data, since they do not take the "curse of dimensionality" into account.
In , Johnson and Lindenstrauss [ ] used the projection generated by a random orthogonal matrix to reduce the dimensionality of data. This method can preserve pairwise distances of the points within a factor of $1 \pm \varepsilon$. Subsequently,

Mathematical Problems in Engineering
Because it can combine multiple base clustering solutions of the same object set into a single consensus solution, cluster ensemble has many attractive properties such as improved quality of solution, robust clustering, and knowledge reuse [ ]. Ensemble approaches of fuzzy clustering with random projection have been proposed in [ -]. These methods are all based on multiple random projections of the original data set and then integrate all fuzzy clustering results of the projected data sets. Reference [ ] pointed out that its method used less memory and ran faster than those of [ , ]. However, to obtain a crisp partition, that method still needs to compute and store the product of membership matrices, which requires time and space quadratic in the data size.
Our Contribution. In this paper, our contributions can be divided into two parts: one is the analysis of the impact of random projection on FCM clustering; the other is the proposal of a cluster ensemble method with random projection which is more efficient, robust, and suitable for a wider range of geometrical data sets. Concretely, the contributions are as follows:
(i) We show theoretically that random projection preserves the entire variability of data, and we prove the effectiveness of random projection for dimensionality reduction from the linear independence of the dimensions of projected data. Together with the property of preserving pairwise distances of points, we obtain a modified FCM clustering algorithm with random projection. The accuracy and efficiency of the modified algorithm have been verified through experiments on both synthetic and real data sets.
(ii) We propose a new cluster ensemble algorithm for FCM clustering with random projection which obtains the spectral embedding efficiently through singular value decomposition (SVD) of the concatenation of membership matrices. The new method avoids the construction of a similarity or distance matrix, so it is more efficient and space-saving than the method in [ ] with respect to crisp partition and the methods in [ , ] for large scale data sets. In addition, the improvements in robustness and efficiency of our approach are verified by the experimental results on both synthetic and real data sets. At the same time, our algorithm is not only as accurate as the existing ones on the Gaussian mixture data set but also clearly more accurate than the existing ones on the real data set, which indicates that our approach is suitable for a wider range of data sets.

Preliminaries
In this section, we present some notations used throughout this paper, introduce the FCM clustering algorithm, and give some traditional cluster ensemble methods using random projection.
Matrix Notations. We use $X$ to denote the data matrix, $x_i$ to denote the $i$th row vector of $X$ and the $i$th point, and $x_{ij}$ to denote the $(i, j)$th element of $X$. $E(\xi)$ means the expectation of a random variable $\xi$ and $\Pr(A)$ denotes the probability of an event $A$. Let $\mathrm{cov}(\xi, \eta)$ be the covariance of random variables $\xi$, $\eta$; let $\mathrm{var}(\xi)$ be the variance of random variable $\xi$. We denote the trace of a matrix by $\mathrm{tr}(\cdot)$; given $A \in R^{n \times n}$, then
$$\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii}.$$
For any matrices $A, B \in R^{n \times n}$, we have the following property:
$$\mathrm{tr}(AB) = \mathrm{tr}(BA).$$
Singular value decomposition is a popular dimensionality reduction method, through which one can get a projection $p: X \to R^{t}$ with $p(x_i) = x_i V_t$, where $V_t$ contains the top $t$ right singular vectors of matrix $X$. The exact SVD of $X$ takes time cubic in the dimension size and quadratic in the data size.
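The SVD projection and the trace property above can be illustrated with a minimal NumPy sketch. The function name `svd_project` and the random test data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def svd_project(X, t):
    """Map each row x_i of X to x_i @ V_t (top-t right singular vectors)."""
    # Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V_t = Vt[:t].T              # d x t
    return X @ V_t              # n x t projected data

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))    # hypothetical data: 50 points, 8 features
Y = svd_project(X, 3)

# Trace property from the notation paragraph: tr(AB) = tr(BA).
A = rng.normal(size=(4, 4))
B = rng.normal(size=(4, 4))
trace_gap = abs(np.trace(A @ B) - np.trace(B @ A))
```

The squared Frobenius norm of the projected data equals the sum of the top $t$ squared singular values, which is why this projection keeps the largest possible share of the variability for a given $t$.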
Fuzzy 𝑐-Means Clustering Algorithm (FCM). The goal of fuzzy clustering is to get a flexible partition, where each point has membership in more than one cluster with values in [0, 1]. Among the various fuzzy clustering algorithms, the FCM clustering algorithm is widely used for low dimensional data because of its efficiency and effectiveness [ ]. We start by giving the definition of the fuzzy 𝑐-means clustering problem and then describe the FCM clustering algorithm precisely.
Definition (the fuzzy 𝑐-means clustering problem). Given a data set of $n$ points with $d$ features denoted by an $n \times d$ matrix $X$, a positive integer $c$ regarded as the number of clusters, and a fuzzy constant $m > 1$, find the partition matrix $U_{opt} \in R^{n \times c}$ and centers of clusters $V_{opt} = \{v_{opt,1}, v_{opt,2}, \ldots, v_{opt,c}\}$, such that
$$(U_{opt}, V_{opt}) = \arg\min_{U, V} \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^{m} \|x_i - v_j\|^2.$$
Here, $\|\cdot\|$ denotes a norm, usually the Euclidean norm; the element $u_{ij}$ of the partition matrix denotes the membership of point $i$ in the cluster $j$. Moreover, for any $x_i \in X$, $\sum_{j=1}^{c} u_{ij} = 1$ and $u_{ij} \ge 0$.
The FCM clustering algorithm first computes the degree of membership through distances between points and centers of clusters and then updates the center of each cluster based on the membership degree. By computing cluster centers and the partition matrix iteratively, a solution is obtained. It should be noted that FCM clustering can only get a locally optimal solution and the final clustering result depends on the initialization. The detailed procedure of FCM clustering is shown in Algorithm .
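The alternating update described above can be sketched as follows. This is a minimal illustration using the standard FCM update rules with the Euclidean norm; the stopping rule (center shift below `eps`), the default `max_iter`, and the initialization from random data points are assumptions, not details confirmed by the text.

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Basic fuzzy c-means; returns (U, V, objective history)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Assumed initialization: pick c distinct data points as centers.
    V = X[rng.choice(n, size=c, replace=False)].astype(float)
    history = []
    for _ in range(max_iter):
        # Squared Euclidean distances between points and centers (n x c).
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)          # guard against division by zero
        # Membership update: u_ij proportional to d_ij^(-2/(m-1)).
        w = d2 ** (-1.0 / (m - 1.0))
        U = w / w.sum(axis=1, keepdims=True)
        # Center update: weighted mean of the points with weights u_ij^m.
        Um = U ** m
        V_new = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Objective evaluated with the new memberships and previous centers.
        history.append(float((Um * d2).sum()))
        shift = np.linalg.norm(V_new - V)
        V = V_new
        if shift < eps:                     # assumed stopping rule
            break
    return U, V, history

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=mu, size=(40, 2)) for mu in (-5.0, 0.0, 5.0)])
U, V, history = fcm(X, c=3)
```

Each row of `U` sums to 1, and the recorded objective values never increase, which reflects the convergence to a local optimum mentioned above.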
Input: data set $X$ (an $n \times d$ matrix), number of clusters $c$, fuzzy constant $m$;
Output: partition matrix $U$, centers of clusters $V$;
Initialize: sample $U$ (or $V$) randomly from proper space;
Algorithm : FCM clustering algorithm.

Ensemble Aggregations for Multiple Fuzzy Clustering Solutions with Random Projection. There are several algorithms proposed for aggregating the multiple fuzzy clustering results with random projection. The main strategy is to generate data membership matrices through multiple fuzzy clustering solutions on the different projected data sets and then to aggregate the resulting membership matrices. Therefore, different methods of generation and aggregation of membership matrices lead to various ensemble approaches for fuzzy clustering. The first cluster ensemble approach using random projection was proposed in [ ]. After projecting the data into a low dimensional space with random projection, the membership matrices were calculated through the probabilistic model of a Gaussian mixture fitted by EM clustering. Subsequently, the similarity of points $i$ and $j$ was computed as $P_{ij}^{\theta} = \sum_{l=1}^{c} P(l \mid i, \theta) \times P(l \mid j, \theta)$, where $P(l \mid i, \theta)$ denoted the probability of point $i$ belonging to cluster $l$ under model $\theta$ and $P_{ij}^{\theta}$ denoted the probability that points $i$ and $j$ belonged to the same cluster under model $\theta$. The aggregated similarity matrix was obtained by averaging across the multiple runs, and the final clustering solution was produced by a hierarchical clustering method called complete linkage. For a mixture model, the estimation of the cluster number and the values of unknown parameters is often complicated [ ]. In addition, this approach needs $O(n^2)$ space for storing the similarity matrix of data points.
Another approach, which was used to find genes in DNA microarray data, was presented in [ ]. Similarly, the data was projected into a low dimensional space with a random matrix. Then the method employed FCM clustering to partition the projected data and generated membership matrices $U_i \in R^{n \times c}$, $i = 1, 2, \ldots, r$, with multiple runs $r$. For each run $i$, the similarity matrix was computed as $M_i = U_i U_i^T$. Then the combined similarity matrix $M$ was calculated by averaging as $M = (1/r) \sum_{i=1}^{r} M_i$. A distance matrix was computed by $D = 1 - M$ and the final partition matrix was gained by FCM clustering on the distance matrix $D$. Since this method needs to compute the product of the partition matrix and its transpose, the time complexity is $O(r \cdot cn^2)$ and the space complexity is $O(n^2)$.
Considering the large scale data sets in the context of big data, [ ] proposed a new method for aggregating partition matrices from FCM clustering. They concatenated the partition matrices as $U_{con} = [U_1, U_2, \ldots]$, instead of averaging the agreement matrix. Finally, they got the ensemble result as $U = \mathrm{FCM}(U_{con}, c)$. This algorithm avoids the products of partition matrices and is more suitable than [ ] for large scale data sets. However, it still needs the multiplication of the concatenated partition matrix when a crisp partition result is wanted.
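The two aggregation schemes described above (averaging the products of membership matrices, and concatenating the membership matrices) differ mainly in what they have to store. The sketch below makes the contrast concrete. The function names and the random row-stochastic membership matrices are illustrative assumptions; the final FCM step on the aggregated matrix is omitted.

```python
import numpy as np

def averaged_distance_matrix(memberships):
    """Average U_i U_i^T over the runs and return D = 1 - M (n x n)."""
    n = memberships[0].shape[0]
    M = np.zeros((n, n))
    for U in memberships:
        M += U @ U.T                 # n x n product: quadratic in n
    M /= len(memberships)
    return 1.0 - M

def concatenated_memberships(memberships):
    """Stack the membership matrices side by side: n x (c * r)."""
    return np.hstack(memberships)

rng = np.random.default_rng(0)
n, c, r = 30, 3, 5
# Hypothetical membership matrices: each row is a probability vector.
memberships = [rng.dirichlet(np.ones(c), size=n) for _ in range(r)]
D = averaged_distance_matrix(memberships)
U_con = concatenated_memberships(memberships)
```

The averaged scheme materializes an $n \times n$ matrix, while the concatenated one only needs $n \times cr$ entries, which is the memory advantage noted in the text.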

Random Projection
Dimensionality reduction is a common technique for the analysis of high dimensional data. The most popular technique is SVD (or principal component analysis), where the original features are replaced by a small number of principal components in order to compress the data. But SVD takes time cubic in the number of dimensions. Recently, several studies have stated that random projection can be applied to dimensionality reduction and preserves pairwise distances within a small factor [ , ]. Its low computing complexity and preservation of the metric structure have drawn much attention to random projection. Lemma indicates that there are three kinds of simple random projection possessing the above properties.
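Lemma (see [ , ]). Let matrix $X \in R^{n \times d}$ be a data set of $n$ points and $d$ features. Given $\varepsilon, \beta > 0$, let
$$t_0 = \frac{4 + 2\beta}{\varepsilon^2/2 - \varepsilon^3/3} \ln n.$$
For integer $t \ge t_0$, let matrix $R$ be a $d \times t$ ($t \le d$) random matrix, wherein elements $R_{ij}$ are independently identically distributed random variables from either one of the following three probability distributions:
$$R_{ij} \sim N(0, 1),$$
$$R_{ij} = \begin{cases} +1 & \text{with probability } 1/2, \\ -1 & \text{with probability } 1/2, \end{cases}$$
$$R_{ij} = \sqrt{3} \times \begin{cases} +1 & \text{with probability } 1/6, \\ 0 & \text{with probability } 2/3, \\ -1 & \text{with probability } 1/6. \end{cases}$$
Let $f: R^d \to R^t$ with $f(x_i) = (1/\sqrt{t}) x_i R$. For any $u, v \in X$, with probability at least $1 - n^{-\beta}$, it holds that
$$(1 - \varepsilon) \|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \varepsilon) \|u - v\|^2.$$
This inequality implies that if the number of dimensions of data reduced by random projection is bigger than a certain bound, then squared pairwise Euclidean distances are preserved within a multiplicative factor of $1 \pm \varepsilon$.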
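The distance-preservation property can be checked empirically. The sketch below assumes the three distributions are the usual Gaussian, random sign, and sparse choices with zero mean and unit variance, and that the projection is scaled by $1/\sqrt{t}$. The function names, the data, and the sizes are hypothetical.

```python
import numpy as np

def random_matrix(d, t, kind, rng):
    """Draw a d x t matrix with i.i.d. zero-mean, unit-variance entries."""
    if kind == "gauss":
        return rng.normal(size=(d, t))
    if kind == "sign":
        return rng.choice([-1.0, 1.0], size=(d, t))
    if kind == "sparse":             # assumed sparse variant: sqrt(3) * {+1, 0, -1}
        return np.sqrt(3.0) * rng.choice([-1.0, 0.0, 1.0], size=(d, t),
                                         p=[1 / 6, 2 / 3, 1 / 6])
    raise ValueError(f"unknown kind: {kind}")

def project(X, R):
    """Random projection scaled by 1/sqrt(t)."""
    return X @ R / np.sqrt(R.shape[1])

def pairwise_sq_dists(A):
    """Squared Euclidean distances for all unordered pairs of rows."""
    g = A @ A.T
    sq = np.diag(g)
    D2 = sq[:, None] + sq[None, :] - 2.0 * g
    return D2[np.triu_indices(A.shape[0], k=1)]

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2000))          # hypothetical high dimensional data
original = pairwise_sq_dists(X)
ratios = {}
for kind in ("gauss", "sign", "sparse"):
    Y = project(X, random_matrix(X.shape[1], 500, kind, rng))
    ratio = pairwise_sq_dists(Y) / original
    ratios[kind] = (float(ratio.min()), float(ratio.max()))
```

Each entry of `ratios` holds the smallest and largest ratio of projected to original squared distance; values close to 1 correspond to a small distortion.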
With the above properties, researchers have checked the feasibility of applying random projection to 𝑘-means clustering in terms of theory and experiment [ , ]. However, as membership degrees for FCM clustering and 𝑘-means clustering are defined differently, that analysis cannot be directly used for assessing the effect of random projection on FCM clustering. Motivated by the idea of principal component analysis, we draw the conclusion, based on an analysis of the variance difference, that the compressed data retains the whole variability of the original data in a probabilistic sense. Besides, the variables referring to dimensions of projected data are linearly independent. As a result, we can achieve dimensionality reduction by replacing the original data with compressed data as "principal components." Next, we give a useful lemma for the proof of the subsequent theorem.
Lemma . Let $r_i$ ($1 \le i \le t$) be independently distributed random variables from one of the three probability distributions described in Lemma ; then

Proof. According to the probability distribution of random variable $r_i$, it is easy to know that

Since centralization of data does not change the distance of any two points and the FCM clustering algorithm is based on pairwise distances to partition data points, we assume that the expectation of the data input is 0. In practice, the covariance matrix of the population is likely unknown. Therefore, we investigate the effect of random projection on the variability of both population and sample.

Theorem . Let data set $X \in R^{n \times d}$ be $n$ independent samples of the $d$-dimensional random vector $(\xi_1, \xi_2, \ldots, \xi_d)$, and let $S$ denote the sample covariance matrix of $X$. The random projection induced by random matrix $R \in R^{d \times t}$ maps the $d$-dimensional random vector to the $t$-dimensional random vector $(\eta_1, \eta_2, \ldots, \eta_t) = (1/\sqrt{t})(\xi_1, \xi_2, \ldots, \xi_d) \cdot R$, and $S^*$ denotes the sample covariance matrix of the projected data. If the elements of random matrix $R$ obey a distribution demanded by Lemma and are mutually independent of the random vector $(\xi_1, \xi_2, \ldots, \xi_d)$, then
(1) dimensions of projected data are linearly independent: $\mathrm{cov}(\eta_i, \eta_j) = 0$, $\forall i \ne j$;
(2) random projection maintains the whole variability: $\sum_{i=1}^{t} \mathrm{var}(\eta_i) = \sum_{i=1}^{d} \mathrm{var}(\xi_i)$; when $t \to \infty$, with probability 1, $\mathrm{tr}(S^*) = \mathrm{tr}(S)$.

Proof. It is easy to know that the expectation of any element of the random matrix is $E(r_{ij}) = 0$, $1 \le i \le d$, $1 \le j \le t$. As the elements of random matrix $R$ and the random vector $(\xi_1, \xi_2, \ldots, \xi_d)$ are mutually independent, the covariance of the random vector induced by random projection is
$$\mathrm{cov}(\eta_i, \eta_j) = \frac{1}{t} \sum_{k=1}^{d} \sum_{l=1}^{d} E(\xi_k \xi_l) E(r_{ki} r_{lj}) = 0, \quad \forall i \ne j.$$
We denote the spectral decomposition of the sample covariance matrix $S$ by $S = V \Lambda V^T$, where $V$ is the matrix of eigenvectors and $\Lambda$ is a diagonal matrix in which the diagonal elements are $\lambda_1, \lambda_2, \ldots, \lambda_d$ and $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$. Supposing the data samples have been centralized, namely, their means are 0, we can get the covariance matrix $S = (1/n) X^T X$. For convenience, we still denote a sample of the random matrix by $R$. Thus, the projected data is $Y = (1/\sqrt{t}) X R$ and the sample covariance matrix of the projected data is
$$S^* = \frac{1}{n} Y^T Y = \frac{1}{t} R^T S R.$$
In practice, the spectrum of a covariance often displays a distinct decay after a few large eigenvalues. So we assume that there exists an integer , limited constant > 0, such that, for all > , it holds that ≤ . Then,

Combining the above arguments, we achieve $\mathrm{tr}(S^*) = \mathrm{tr}(S)$ with probability 1, when $t \to \infty$.
Part (1) of Theorem indicates that the compressed data produced by random projection can carry much information with low dimensionality owing to the linear independence of the reduced dimensions. Part (2) shows that the sum of variances of the dimensions of the original data is consistent with that of the projected data; namely, random projection retains the variability of the original data. Combining the results of Lemma with those of Theorem , we consider that random projection can be employed to improve the efficiency of the FCM clustering algorithm with low dimensionality, and the modified algorithm can approximately keep the accuracy of the partition.
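The variability claim (the sum of variances of the projected dimensions matches that of the original dimensions) can be checked numerically. The sketch below assumes centered data, a random sign matrix, and the $1/\sqrt{t}$ scaling; the data generator with decaying per-feature scales is a hypothetical stand-in for a covariance spectrum that decays.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 1000
scales = np.linspace(3.0, 0.1, d)            # hypothetical decaying feature scales
X = rng.normal(size=(n, d)) * scales
X -= X.mean(axis=0)                           # centralize the data
trace_S = (X ** 2).sum() / n                  # tr(S) with S = X^T X / n

def projected_trace(X, t, rng):
    """tr(S*) for data projected with a random sign matrix scaled by 1/sqrt(t)."""
    R = rng.choice([-1.0, 1.0], size=(X.shape[1], t))
    Y = X @ R / np.sqrt(t)
    return (Y ** 2).sum() / X.shape[0]

# Ratio tr(S*) / tr(S) for a few projected dimensions t.
ratios = {t: projected_trace(X, t, rng) / trace_S for t in (20, 100, 400)}
```

The ratios stay close to 1 and their spread shrinks as $t$ grows, in line with the limit statement. This check covers only the sample trace; the independence of dimensions concerns expectations over the random matrix and is not tested here.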

FCM Clustering with Random Projection and an Efficient Cluster Ensemble Approach
FCM Clustering via Random Projection. According to the results of Section , we design an improved FCM clustering algorithm with random projection for dimensionality reduction. The procedure of the new algorithm is shown in Algorithm .
Algorithm reduces the dimensions of the input data by multiplying it by a random matrix. Compared with the $O(cnd^2)$ time for running each iteration in the original FCM clustering, the new algorithm would imply an $O(cn(\varepsilon^{-2} \ln n)^2)$ time for each iteration. Thus, the time complexity of the new algorithm decreases markedly for high dimensional data in the case $\varepsilon^{-2} \ln n \ll d$. Another common dimensionality reduction method is SVD. Compared with running SVD on data matrix $X$, which takes time cubic in the dimension size $d$ and quadratic in the data size $n$, the new algorithm only needs $O(d\varepsilon^{-2} \ln n)$ time to generate the random matrix $R$. This indicates that random projection is a cost-effective method of dimensionality reduction for the FCM clustering algorithm.
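A minimal sketch of FCM clustering on randomly projected data follows. It assumes a random sign matrix, the $1/\sqrt{t}$ scaling, and standard FCM updates. Recovering the centers in the original feature space from the final memberships is also an assumption of this sketch, as are all function names and parameter defaults.

```python
import numpy as np

def fcm_memberships(Y, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Standard FCM updates on Y; returns the membership matrix (n x c)."""
    rng = np.random.default_rng(seed)
    V = Y[rng.choice(Y.shape[0], size=c, replace=False)].astype(float)
    for _ in range(max_iter):
        d2 = ((Y[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)               # avoid division by zero
        w = d2 ** (-1.0 / (m - 1.0))
        U = w / w.sum(axis=1, keepdims=True)     # membership update
        Um = U ** m
        V_new = (Um.T @ Y) / Um.sum(axis=0)[:, None]   # center update
        converged = np.linalg.norm(V_new - V) < eps
        V = V_new
        if converged:
            break
    return U

def fcm_random_projection(X, c, t, m=2.0, seed=0):
    """Project X to t dimensions, cluster there, map centers back."""
    rng = np.random.default_rng(seed)
    R = rng.choice([-1.0, 1.0], size=(X.shape[1], t))  # random sign matrix
    Y = X @ R / np.sqrt(t)                              # n x t projected data
    U = fcm_memberships(Y, c, m=m, seed=seed)
    Um = U ** m
    # Assumed final step: centers in the original d-dimensional space.
    V = (Um.T @ X) / Um.sum(axis=0)[:, None]
    return U, V, Y

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=mu, size=(50, 500)) for mu in (-4.0, 0.0, 4.0)])
U, V, Y = fcm_random_projection(X, c=3, t=20)
```

All distance computations inside the loop run on the $n \times t$ matrix `Y`, so the per-iteration cost depends on $t$ rather than on the original dimension $d$.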
Ensemble Approach Based on Graph Partition. As different random projections may result in different clustering solutions [ ], it is attractive to design a cluster ensemble framework with random projection for improved and robust clustering performance. Although it uses less memory and runs faster than the ensemble method in [ ], the cluster ensemble algorithm in [ ] still needs the product of the concatenated partition matrix for crisp grouping, which leads to high time and space costs for big data.
Input: data set $X$ (an $n \times d$ matrix), number of clusters $c$, fuzzy constant $m$, FCM clustering algorithm;
Output: partition matrix $U$, centers of clusters $V$.
Algorithm : FCM clustering with random projection.

Algorithm : Cluster ensemble for FCM clustering with random projection.

In this section, we propose a more efficient and effective aggregation method for multiple FCM clustering results. An overview of our new ensemble approach is presented in Figure . The new ensemble method is based on partitioning a similarity graph. For each random projection, a new data set is generated. After performing FCM clustering on the new data sets, membership matrices are output. The elements of a membership matrix are treated as the similarity measure between points and the cluster centers. Through SVD on the concatenation of membership matrices, we get the spectral embedding of data points efficiently. The detailed procedure of the new cluster ensemble approach is shown in Algorithm . In step (3) of the procedure in Algorithm , the left singular vectors of $\hat{U}_{con}$ are equivalent to the eigenvectors of $\hat{U}_{con} \hat{U}_{con}^T$. It implies that we regard the matrix product as a construction of the affinity matrix of data points. This method is motivated by the research on landmark-based representation [ , ]. In our approach, we treat the cluster centers of each FCM clustering run as landmarks and the membership matrix as landmark-based representation.
Thus, the concatenation of membership matrices forms a combinational landmark-based representation matrix. In this way, the graph similarity matrix is computed as
$$W = \hat{U}_{con} \hat{U}_{con}^T,$$
which can create the spectral embedding efficiently through step (3). To normalize the graph similarity matrix, we multiply $U_{con}$ by $(r \cdot D)^{-1/2}$. As a result, the degree matrix of $W$ is an identity matrix.
There are two perspectives to explain why our approach works. Considering the similarity measure defined by $u_{ij}$ in FCM clustering, the proposition in [ ] demonstrated that singular vectors of $U_i$ converged to eigenvectors of $W_s$ as $c$ converges to $n$, where $W_s$ was the affinity matrix generated in standard spectral clustering. As a result, singular vectors of $\hat{U}_{con}$ converge to eigenvectors of the normalized affinity matrix $\hat{W}_s$. Thus, our final output will converge to the one of standard spectral clustering as $c$ converges to $n$. Another explanation is about the similarity measure defined by $K(x_i, x_j) = x_i x_j^T$, where $x_i$ and $x_j$ are data points. We can treat each row of $\hat{U}_{con}$ as a transformed data point. As a result, the affinity matrix obtained here is the same as the one of standard spectral embedding, and our output is just the partition result of standard spectral clustering.
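The normalization and embedding step can be sketched without ever forming the $n \times n$ affinity matrix. The sketch assumes that $D$ is the diagonal matrix of column sums of $U_{con}$, which is the choice that makes every row of $W$ sum to 1. The function name and the random membership matrices are hypothetical, and the final 𝑘-means step on the embedding is left out.

```python
import numpy as np

def spectral_embedding_from_memberships(memberships, k):
    """Normalize the concatenated memberships and take top-k left singular vectors."""
    r = len(memberships)
    U_con = np.hstack(memberships)              # n x (c * r)
    col_sums = U_con.sum(axis=0)                # assumed diagonal of D
    U_hat = U_con / np.sqrt(r * col_sums)       # U_con (r D)^{-1/2}
    # Left singular vectors of U_hat are eigenvectors of W = U_hat U_hat^T.
    left, s, _ = np.linalg.svd(U_hat, full_matrices=False)
    return U_hat, left[:, :k], s

rng = np.random.default_rng(0)
n, c, r = 200, 3, 5
# Hypothetical membership matrices: each row is a probability vector.
memberships = [rng.dirichlet(np.ones(c), size=n) for _ in range(r)]
U_hat, A, s = spectral_embedding_from_memberships(memberships, k=c)

# Row sums of W computed implicitly as U_hat (U_hat^T 1); W is never stored.
degrees = U_hat @ (U_hat.T @ np.ones(n))
```

With this normalization every entry of `degrees` equals 1 and the largest singular value of `U_hat` is 1, consistent with the statement that the degree matrix of $W$ is the identity.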
To facilitate comparison of different ensemble methods for FCM clustering solutions with random projection, we denote the approach of [ ] by EFCM-A (average the products of membership matrices), the algorithm of [ ] by EFCM-C (concatenate the membership matrices), and our new method by EFCM-S (spectral clustering on the membership matrices). In the cluster ensemble phase, the main computations of the EFCM-A method are multiplications of membership matrices. Similarly, the EFCM-C algorithm also needs the product of concatenated membership matrices in order to get the crisp partition result. Thus the above methods both need $O(n^2)$ space and $O(crn^2)$ time. However, the main computations of EFCM-S are SVD for $\hat{U}_{con}$ and 𝑘-means clustering for $A$. The overall space is $O(crn)$, the SVD time is $O((cr)^2 n)$, and the 𝑘-means clustering time is $O(lc^2 n)$, where $l$ is the iteration number of 𝑘-means. Therefore, the computational complexity of EFCM-S is clearly lower than those of EFCM-A and EFCM-C, considering $cr \ll n$ and $l \ll n$ in large scale data sets.

Experiments
In this section, we present the experimental evaluations of the new algorithms proposed in Section . We implemented the related algorithms in the Matlab computing environment and conducted our experiments on a Windows-based system with an Intel Core . GHz processor and GB of RAM.
Data Sets and Parameter Settings. We conducted the experiments on synthetic and real data sets, both of which have relatively high dimensionality. The synthetic data set had data points with dimensions which were generated from Gaussian mixtures in proportions (0.25, 0.5, 0.25).
For the parameters of FCM clustering, we let $\varepsilon = 10^{-5}$, the maximum iteration number be , the fuzzy factor $m$ be , and the number of clusters be $c = 3$ for the synthetic data set and $c = 19$ for the ACT data set. We also normalized the objective function as $obj^* = obj / \|X\|_F^2$, where $\|\cdot\|_F$ is the Frobenius norm of a matrix [ ]. To minimize the influence introduced by different initializations, we present the average values of evaluation indices of independent experiments.
In order to compare different dimensionality reduction methods for FCM clustering, we initialized the algorithms by choosing points randomly as the cluster centers and made sure that every algorithm began with the same initialization. In addition, we ran Algorithm with $t = 10, 20, \ldots, 100$ for the synthetic data set and $t = 100, 200, \ldots, 1000$ for the ACT data set. Two kinds of random projections (with random variables from ( ) in Lemma ) were both tested to verify their feasibility. We also compared Algorithm against another popular method of dimensionality reduction, SVD. Note that the number of eigenvectors corresponding to nonzero eigenvalues of the ACT data is only , so we took only $t = 100, 200, \ldots, 700$ for FCM clustering with SVD on the ACT data set.
In the comparisons of different cluster ensemble algorithms, we set the dimension number of projected data as $t = 10, 20, \ldots, 100$ for both synthetic and ACT data sets. In order to meet $cr \ll n$ for Algorithm , the number of random projections $r$ was set as for the synthetic data set and for the ACT data set, respectively.
Evaluation Criteria. For clustering algorithms, clustering validation and running time are two important indices for judging their performance. Clustering validation measures evaluate the goodness of clustering results [ ] and can often be divided into two categories: external clustering validation and internal clustering validation. External validation measures use external information such as the given class labels to evaluate the goodness of the solution output by a clustering algorithm. In contrast, internal measures evaluate the clustering results using features inherent in the data sets. In this paper, the validity evaluation criteria used are the rand index and the clustering validation index based on nearest neighbors for crisp partition, together with the fuzzy rand index and the Xie-Beni index for fuzzy partition. Here, the rand index and the fuzzy rand index are external validation measures, whereas the clustering validation index based on nearest neighbors and the Xie-Beni index are internal validation measures.
The rand index (RI) is computed as
$$RI = \frac{n_{11} + n_{00}}{C_n^2},$$
where $n_{11}$ is the number of pairs of points that exist in the same cluster in both the clustering result and the given class labels, $n_{00}$ is the number of pairs of points that are in different subsets for both the clustering result and the given class labels, and $C_n^2$ equals $n(n-1)/2$. The value of RI ranges from 0 to 1, and a higher value implies a better clustering solution.
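The pair-counting definition of the rand index translates directly into code. The following is a plain Python sketch with a hypothetical function name; it enumerates all pairs, so it is meant for illustration on small inputs rather than for large data sets.

```python
from itertools import combinations

def rand_index(labels_pred, labels_true):
    """Fraction of point pairs on which the two labelings agree."""
    n11 = n00 = pairs = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_pred = labels_pred[i] == labels_pred[j]
        same_true = labels_true[i] == labels_true[j]
        n11 += same_pred and same_true              # together in both
        n00 += (not same_pred) and (not same_true)  # apart in both
        pairs += 1
    return (n11 + n00) / pairs

perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])    # same partition, renamed labels
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index depends only on which points are grouped together, so renaming the cluster labels does not change it.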

( ) Fuzzy Rand Index (FRI) [ ].
FRI is a generalization of RI with respect to soft partition. It also measures the proportion of pairs of points which exist in the same and different clusters in both the clustering solution and the true class labels. It needs to compute the analogous $n_{11}$ and $n_{00}$ through the contingency table described in [ ]. Therefore, the range of FRI is also [0, 1] and a larger value means a more accurate cluster solution.
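( ) Xie-Beni Index (XB) [ ]. XB takes the minimum squared distance between cluster centers as the separation of the partition and the average squared fuzzy deviation of data points as the compactness of the partition. XB is calculated as follows:
$$XB = \frac{\sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^{m} \|x_i - v_j\|^2}{n \cdot \min_{i \ne j} \|v_i - v_j\|^2},$$
where $\sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^{m} \|x_i - v_j\|^2$ is just the objective function of FCM clustering and $v_j$ is the center of cluster $j$. The smallest XB indicates the optimal cluster partition.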
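As a small worked example of the Xie-Beni index, the sketch below divides the FCM objective by $n$ times the minimum squared distance between centers. The exponent $m$ on the memberships follows the description of the numerator as the FCM objective; the function name and the toy data are assumptions.

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    """FCM objective divided by n times the minimum squared center distance."""
    X = np.asarray(X, dtype=float)
    V = np.asarray(V, dtype=float)
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)     # n x c
    compactness = ((np.asarray(U) ** m) * d2).sum()             # FCM objective
    center_d2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(center_d2, np.inf)                         # ignore i == j
    return compactness / (X.shape[0] * center_d2.min())

# Toy data: two tight groups on a line with a crisp membership matrix.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
V = np.array([[0.5], [10.5]])
U = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
xb = xie_beni(X, U, V)   # objective 1.0, separation 100, n = 4
```

In this toy case each point lies 0.5 from its center, so the objective is 1.0, the squared center distance is 100, and the index is 1 / (4 · 100) = 0.0025.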

( ) Clustering Validation Index Based on Nearest Neighbors (CVNN) [ ].
The separation of CVNN concerns the situation of objects that carry the geometrical information of each cluster, and the compactness is the mean pairwise distance between objects in the same cluster. CVNN is computed as follows:
$$CVNN(c, k) = \frac{Sep(c, k)}{\max_{c_{min} \le c \le c_{max}} Sep(c, k)} + \frac{Com(c)}{\max_{c_{min} \le c \le c_{max}} Com(c)},$$
where $Sep(c, k) = \max_{i=1,2,\ldots,c} \left( (1/n_i) \cdot \sum_{j=1}^{n_i} (q_j / k) \right)$ and $Com(c) = \sum_{i=1}^{c} \left( (2 / (n_i (n_i - 1))) \cdot \sum_{x, y \in Clu_i} d(x, y) \right)$. Here, $c$ is the number of clusters in the partition result, $c_{max}$ is the maximum cluster number given, $c_{min}$ is the minimum cluster number given, $k$ is the number of nearest neighbors, $n_i$ is the number of objects in the $i$th cluster $Clu_i$, $q_j$ denotes the number of nearest neighbors of $Clu_i$'s $j$th object which are not in $Clu_i$, and $d(x, y)$ denotes the distance between $x$ and $y$. A lower CVNN value indicates a better clustering solution.
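The two ingredients of CVNN can be sketched as follows, assuming Euclidean distance and treating the normalization as a maximum over a list of candidate partitions. The function names, the handling of singleton clusters, and the guard against a zero maximum are assumptions of this sketch.

```python
import numpy as np

def _pairwise_dist(X):
    """Euclidean distance matrix between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def cvnn_sep(X, labels, k):
    """Largest per-cluster average share of k nearest neighbors outside the cluster."""
    labels = np.asarray(labels)
    D = _pairwise_dist(X)
    np.fill_diagonal(D, np.inf)                  # a point is not its own neighbor
    nn = np.argsort(D, axis=1)[:, :k]            # indices of the k nearest neighbors
    q = (labels[nn] != labels[:, None]).sum(axis=1)
    return max(float((q[labels == g] / k).mean()) for g in np.unique(labels))

def cvnn_com(X, labels):
    """Sum over clusters of the mean pairwise distance within the cluster."""
    labels = np.asarray(labels)
    D = _pairwise_dist(X)
    total = 0.0
    for g in np.unique(labels):
        idx = np.where(labels == g)[0]
        n_i = len(idx)
        if n_i < 2:                              # assumed: singletons contribute 0
            continue
        total += 2.0 / (n_i * (n_i - 1)) * np.triu(D[np.ix_(idx, idx)], k=1).sum()
    return total

def cvnn(X, candidate_labelings, k):
    """Normalized separation plus normalized compactness for each candidate."""
    seps = np.array([cvnn_sep(X, lab, k) for lab in candidate_labelings])
    coms = np.array([cvnn_com(X, lab) for lab in candidate_labelings])
    sep_max = seps.max() if seps.max() > 0 else 1.0
    com_max = coms.max() if coms.max() > 0 else 1.0
    return seps / sep_max + coms / com_max

X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
good = [0, 0, 0, 1, 1, 1]
bad = [0, 1, 0, 1, 0, 1]
scores = cvnn(X, [good, bad], k=2)
```

On this toy example the partition that follows the two visible groups gets the lower score, matching the rule that a lower CVNN value indicates a better solution.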
The objective function is a special evaluation criterion of validity for the FCM clustering algorithm. A smaller objective function value indicates that the points inside clusters are more "similar." Running time is also an important evaluation criterion, often related to the scalability of an algorithm. One main target of random projection for dimensionality reduction is to decrease the runtime and enhance the applicability of the algorithm in the context of big data.

Performance of FCM Clustering with Random Projection. The experimental results about FCM clustering with random projection are presented in Figure where  (h). "SignRP" denotes the proposed algorithm with random sign matrix, "GaussRP" denotes the FCM clustering with random Gaussian matrix, "FCM" denotes the original FCM clustering algorithm, and "SVD" denotes the FCM clustering with dimensionality reduction through SVD. It should be noted that the true XB value of FCM clustering in subfigure (d) is . e + , not .
From Figure , we can see that FCM clustering with random projection is clearly more efficient than the original FCM clustering. When the number of dimensions is above a certain bound, the validity indices are nearly stable and similar to the ones of naive FCM clustering for both data sets. This verifies the conclusion that "accuracy of clustering algorithm can be preserved when the dimensionality exceeds a certain bound." The effectiveness of the random projection method is also verified by how small the bound is compared to the total dimensions (30/1000 for synthetic data and 300/67500 for ACT data). Besides, the two different kinds of random projection methods have a similar impact on FCM clustering, as their plots are analogous. The higher objective function values and the smaller XB indices of the SVD method for the synthetic data set indicate that the generated clustering solution has a better separation degree between clusters. The external cluster validation indices also verify that the SVD method has better clustering results for synthetic data. These observations indicate that the SVD method is more suitable for Gaussian mixture data sets than FCM clustering with random projection and naive FCM clustering.
Although the SVD method has a higher FRI for the synthetic data set, the random projection methods have analogous FRI values for the ACT data set and better objective function values for both data sets. In addition, the random projection approaches are clearly more efficient, as SVD needs time cubic in the dimensionality. Hence, these observations indicate that our algorithm is quite encouraging in practice.

Comparisons of Different Cluster Ensemble Methods. The comparisons of different cluster ensemble approaches are shown in Figure and Table . Similarly, (a) and (c) of the figure correspond to the synthetic data set and (b) and (d) correspond to the ACT data set. We use RI, (a) and (b), and running time, (c) and (d), to present the performance of the ensemble methods. Meanwhile, the meanings of EFCM-A, EFCM-C, and EFCM-S are identical to the ones in Section . . In order to get a crisp partition for EFCM-A and EFCM-C, we used the hierarchical clustering method with complete linkage after getting the distance matrix as in [ ]. Since all three cluster ensemble methods get perfect partition results on the synthetic data set, we only compare the CVNN indices of the different ensemble methods on the ACT data set, which is presented in Table .
In Figure , the running time of our algorithm is shorter for both data sets. This verifies the result of the time complexity analysis for the different algorithms in Section . . The three cluster ensemble methods all get the perfect partition for the synthetic data set, whereas our method is more accurate than the other two methods for the ACT data set. The perfect partition results suggest that all three ensemble methods are suitable for Gaussian mixture data sets. However, the almost 18% improvement in RI for the ACT data set is likely due to the different grouping ideas. Our method is based on graph partition such that the edges between different clusters have low weight and the edges within a cluster have high weight. This spectral embedding approach to clustering is more suitable for the ACT data set. In Table , the smaller CVNN values of our new method also show that the new approach has better partition results on the ACT data set. These observations indicate that our algorithm has an advantage in efficiency and adapts to a wider range of geometries.
We also compare the stability of the three ensemble methods, presented in Table . From the table, we can see that the standard deviation of RI for EFCM-S is an order of magnitude lower than those of the other methods. Hence, this result shows that our algorithm is more robust.
For the situation in which the number of clusters is unknown, we also varied the number of clusters $c$ in FCM clustering and spectral embedding for our new method. We denote this version of the new method as EFCM-SV. Since the number of random projections was set as for the ACT data set, we changed the clusters' number from to as the input of the FCM clustering algorithm. In addition, we set the clusters' number from to as the input of spectral embedding and applied CVNN to estimate the most plausible number of clusters. The experimental results are presented in Table .
Table : Standard deviations of RI of runs with different dimensions on ACT data.

In Table , the values with respect to "EFCM-SV" are the average RI values with the estimated clusters' numbers for individual runs. The values of "+CVNN" are the average clusters' numbers decided by the CVNN cluster validity index. Using the clusters' numbers estimated by CVNN, our method gets results similar to those of the ensemble method with the correct clusters' number. In addition, the average estimates of the clusters' number are close to the true one. This indicates that our cluster ensemble method EFCM-SV is attractive when the number of clusters is unknown.

Conclusion and Future Work
The "curse of dimensionality" in big data poses new challenges for clustering, and feature extraction for dimensionality reduction is a popular way to deal with these challenges. We studied the feature extraction method of random projection for FCM clustering. Through analyzing the effects of random projection on the entire variability of data theoretically and verifying them empirically on both synthetic and real world data, we designed an enhanced FCM clustering algorithm with random projection. The new algorithm can maintain nearly the same clustering solution as the original FCM clustering and is more efficient than the feature extraction method of SVD. Moreover, we proposed a cluster ensemble approach that is more applicable to large scale data sets than existing ones. The new ensemble approach can achieve spectral embedding efficiently from SVD on the concatenation of membership matrices. The experiments showed that the new ensemble method ran faster, had more robust partition solutions, and fitted a wider range of geometrical data sets.
One direction for future research is to design provably accurate feature extraction and feature selection methods for FCM clustering. Another remaining question is how to choose a proper number of random projections for the cluster ensemble method in order to get a trade-off between clustering accuracy and efficiency.
[ ] J. Zhang, X. Tao, and H. Wang, "Outlier detection from large distributed databases," World Wide Web, vol. , no. , pp. -, .
[ ] C. Ordonez, N. Mohanam, and C. Garcia-Alvarado, "PCA for large data sets with parallel data summarization," Distributed and Parallel Databases, vol. , no. , pp.