Spectral Nonlinearly Embedded Clustering Algorithm

As is well known, traditional spectral clustering (SC) methods are developed based on the manifold assumption, namely, that two nearby data points in the high-density region of a low-dimensional data manifold have the same cluster label. But, for some high-dimensional and sparse data, such an assumption might be invalid. Consequently, the clustering performance of SC will be degraded sharply in this case. To solve this problem, in this paper, we propose a general spectral embedded framework, which embeds the true cluster assignment matrix for high-dimensional data into a nonlinear space by a predefined embedding function. Based on this framework, several algorithms are presented by using different embedding functions, which aim at learning the final cluster assignment matrix and a transformation into a low-dimensional space simultaneously. More importantly, the proposed method can naturally handle the out-of-sample extension problem. The experimental results on benchmark datasets demonstrate that the proposed method significantly outperforms existing clustering methods.


Introduction
As one of the fundamental topics in data mining and machine learning, clustering has been successfully applied in various fields. Generally speaking, the target of clustering is to group the examples into a number of classes, or clusters. Over the past decades, a large family of clustering algorithms has been studied extensively, which is mainly divided into two categories: generative clustering approaches and discriminative clustering models. Generative clustering approaches, for example, mixture models [1,2], generally integrate Bayesian approaches into their models. However, generative models impose restrictive assumptions on the class-conditional densities, which might lead to unconvincing clustering results when these assumptions do not hold. Discriminative methods, such as spectral clustering (SC) [3] and K-means clustering [4], learn discriminative models based on loss functions from unlabeled data through the low-density separation assumption.
Recently, discriminative clustering methods, such as the variants of kernel-based clustering and spectral clustering, have attracted renewed attention. They can readily capture nonlinear cluster structures. Motivated by the outstanding performance of the support vector machine (SVM) in supervised learning, maximum margin clustering (MMC) [5][6][7] methods have been developed to obtain a decision boundary that separates data points into different clusters to the utmost extent. Although these clustering methods are able to exploit nonlinear data structures, they are still sensitive to high-dimensional data points. For example, K-means clustering iteratively computes the distance between each data point and the center of each cluster. Hence, its clustering performance depends severely on the distance measurement. However, for high-dimensional data, such as some image data, the Euclidean distance is a poor basis for similarity computation, and the performance of K-means clustering is degraded dramatically. SC performs clustering by utilizing the spectrum of the similarity matrix to discover the nonlinear and low-dimensional manifold structure of data points. In other words, it relies heavily on the manifold assumption [8,9], namely, that two nearby data points of a low-dimensional manifold have the same class label. However, for high-dimensional and sparse data, the manifold assumption may not hold due to the bias caused by the curse of dimensionality. Nie et al. [10] have validated that graph-based spectral clustering methods cannot always exploit the low-dimensional manifold structure, which results in the performance degradation of SC. Another challenge for traditional SC methods is that they do not solve the out-of-sample extension problem; that is, the discrete cluster assignment vectors for new unseen samples cannot be obtained automatically. The algorithm proposed in [11] takes advantage of the Nyström method to approximate the eigenfunction for the unseen data points. The method described in [12] uses some heuristics to evaluate the implicit eigenfunction for the new data points. However, the performance of these methods relies heavily on the estimated affinity matrix defined between training and new data points.
To further improve the clustering performance of SC for high-dimensional data, in this paper, we first propose a general spectral embedded clustering framework, which incorporates dimensionality reduction methods into the model of SC. Second, by using different low-dimensional embedding functions, we derive the corresponding optimization models and develop the spectral nonlinearly embedded algorithms based on extreme learning machine (ELM) and kernel functions, respectively. Our main contributions include the following: (1) A general spectral embedded clustering framework is presented by imposing a linearity regularization on the objective function of SC.
The rest of this paper is organized as follows. Related works are introduced in Section 2. In Section 3, we present the general spectral embedded clustering framework and derive several different models by using different embedding functions. The relationship between ESEC and KSEC is demonstrated and the ESEC clustering algorithm is described in detail. In addition, clustering for out-of-sample data is also discussed. To validate our model, experimental results are reported in Section 4. Finally, we give the related conclusions and a discussion of future works in Section 5. In order to avoid confusion, we give a list of the main notations used in this paper in the Notations section.

Related Works
2.1. Spectral Clustering. Given a dataset $X = \{x_i\}_{i=1}^{n}$, the main task of clustering is to partition $X$ into $c$ clusters. SC aims at finding a cluster assignment matrix of the training data by means of a weighted graph $G$ whose vertices are the points of $X$. Several SC algorithms have been proposed in [3,13,14]. In this paper, we mainly discuss the SC algorithm with k-way normalized cuts [3].
Specifically, denote an undirected weighted graph by $G = \{X, W\}$, where $X$ is the vertex set and $W \in \mathbb{R}^{n \times n}$ is an affinity matrix. Each entry $w_{ij}$ of the symmetric matrix $W$ records the edge weight that characterizes the similarity between a pair of vertices of $G$; $w_{ij}$ is commonly defined by (1). The graph Laplacian $L$ is defined by $L = D - W$, where $D$ is a diagonal matrix with diagonal elements $d_{ii} = \sum_{j} w_{ij}$, $\forall i$. Based on the normalized cut criterion, where the size of a subset of a graph is measured by the weights of its edges and the normalized Laplacian matrix is used, the optimization problem can be transformed into the following trace maximization problem [3]:

$$\max_{F^{T}F = I_c} \operatorname{tr}\left(F^{T} D^{-1/2} W D^{-1/2} F\right), \qquad (2)$$

where $I_c$ denotes the identity matrix of size $c$ by $c$ and $F \in \mathbb{R}^{n \times c}$ represents the cluster assignment matrix with continuous values obtained by relaxation. The optimal solution $F$ of (2) can then be obtained by eigenvalue decomposition of the matrix $D^{-1/2} W D^{-1/2}$.
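The relaxed solution of (2) can be illustrated with a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the dense Gaussian affinity, the bandwidth `sigma`, the toy data, and the function names are all assumptions made for illustration, and samples are stored as rows here (the paper stores them as columns).

```python
import numpy as np

def gaussian_affinity(X, sigma=1.0):
    """Dense Gaussian affinity matrix W (an assumed choice); X has shape (n, d)."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-d2 / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def relaxed_assignment(W, c):
    """Eigenvectors of D^{-1/2} W D^{-1/2} for the c largest eigenvalues."""
    d = W.sum(axis=1)                       # degrees d_ii = sum_j w_ij
    s = 1.0 / np.sqrt(d)
    S = s[:, None] * W * s[None, :]         # D^{-1/2} W D^{-1/2}
    vals, vecs = np.linalg.eigh(S)          # eigenvalues in ascending order
    return vecs[:, -c:]                     # relaxed F with F^T F = I_c

# Toy data: two well-separated blobs (hypothetical example data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
F = relaxed_assignment(gaussian_affinity(X), c=2)
```

The rows of `F` are then clustered (for example with K-means) to obtain discrete labels.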

Extreme Learning Machine.
The output function of ELM for generalized single-hidden-layer feedforward neural networks (SLFNs) in the case of one output node is

$$f(x) = \sum_{i=1}^{L} \beta_i h_i(x) = h(x)\beta, \qquad (3)$$

where $\beta = [\beta_1, \ldots, \beta_L]^{T}$ is the vector of the output weights between the hidden layer of $L$ nodes and the output node and $h(x) = [h_1(x), \ldots, h_L(x)]$ is the output (row) vector of the hidden layer with respect to the input $x$. In fact, $h(x)$ maps the data from the $d$-dimensional input space to the $L$-dimensional hidden-layer feature space (ELM feature space) $H$. ELM minimizes the training error as well as the norm of the output weights [15], as expressed in (4), where a tradeoff parameter balances the complexity and the fitness of the decision function and $H$ is the hidden-layer output matrix denoted by

$$H = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_n) \end{bmatrix} = \begin{bmatrix} h_1(x_1) & \cdots & h_L(x_1) \\ \vdots & \ddots & \vdots \\ h_1(x_n) & \cdots & h_L(x_n) \end{bmatrix}. \qquad (5)$$

Similar to the support vector machine (SVM), minimizing the norm of the output weights $\|\beta\|$ amounts to maximizing the distance $2/\|\beta\|$ between the separating margins of the two different classes in the ELM feature space, which controls the complexity of the function in the ELM feature space.
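The following sketch shows one common way to realize an ELM with a single output: the hidden layer is random and fixed, and only the output weights are solved in closed form. The sigmoid activation, the ridge form of the objective, the parameter name `reg`, and the toy regression target are assumptions for illustration; the paper's exact objective (4) may be scaled differently.

```python
import numpy as np

def elm_hidden(X, A, b):
    """Hidden-layer output matrix H (n x L) with sigmoid nodes (assumed activation)."""
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))

def elm_fit(H, T, reg=1e-2):
    """Ridge-regularized output weights: beta = (H^T H + reg I)^{-1} H^T T."""
    L = H.shape[1]
    return np.linalg.solve(H.T @ H + reg * np.eye(L), H.T @ T)

rng = np.random.default_rng(0)
n, d, L = 50, 3, 20
X = rng.normal(size=(n, d))
T = np.sin(X[:, :1])            # toy target of shape (n, 1), hypothetical
A = rng.normal(size=(d, L))     # random input weights, never trained
b = rng.normal(size=L)          # random biases, never trained
H = elm_hidden(X, A, b)
beta = elm_fit(H, T)
pred = H @ beta                 # f(x) = h(x) beta for every training point
```

Because the hidden layer is never trained, fitting reduces to one linear solve of size $L \times L$, which is the source of ELM's fast learning speed.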

General Spectral Embedded Clustering Framework
As mentioned above, SC methods greatly depend on the construction of the affinity matrix $W$. Some high-dimensional data might not exhibit an evident low-dimensional manifold structure. In this case, the clustering performance of SC may be inferior to that of K-means clustering.
In the following subsections, we first propose a general spectral embedded clustering framework, which incorporates a linearity regularization into the traditional normalized SC model. By using different embedding functions, this framework can generate a family of spectral embedded clustering algorithms, such as SEC, KSEC, and ESEC. Second, we demonstrate the relationship between ESEC and KSEC. The ESEC algorithm is then proposed for high-dimensional data clustering. Finally, the out-of-sample extension problem is discussed for our proposed ESEC method.

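Formulation. Generally, the clustering models of traditional SC methods can be transformed into the following minimization problem:

$$\min_{F^{T}F = I_c} \operatorname{tr}\left(F^{T} L F\right), \qquad (6)$$

where $L$ here denotes the normalized Laplacian matrix $I_n - D^{-1/2} W D^{-1/2}$.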
To make use of the underlying dense grouping structure of data in a low-dimensional subspace, the proposed general framework introduces a regularization term into the optimization problem (6), which controls the error between the cluster assignment matrix and the low-dimensional embedding of the data. Specifically, we minimize the following objective function:

$$\min_{F^{T}F = I_c,\ \phi} \operatorname{tr}\left(F^{T} L F\right) + \mu\left(\|\phi(X) - F\|^{2} + \gamma \|\phi\|^{2}\right), \qquad (7)$$

where $L$ is the normalized Laplacian matrix of problem (6), $\mu$ and $\gamma$ are two regularization parameters, and $\phi(X) = (\phi_1(X), \ldots, \phi_c(X)) \in \mathbb{R}^{n \times c}$ is the low-dimensional embedding of the training data. The second term represents the error between the relaxed cluster assignment matrix $F$ and the low-dimensional embedding of the data. The third term is the norm penalty of $\phi(X)$ and represents the complexity of functions in a high-dimensional feature space.
In dimensionality reduction, linear embedding functions and nonlinear embedding functions are commonly used to address out-of-sample problems.This is due to the fact that they contain few parameters, which are not expensive in computational time and memory.In this paper, we mainly discuss kernel-based and ELM-based nonlinear embedding functions.
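If we choose a linear embedding function, problem (7) reduces to problem (8), which is equivalent to the SEC method proposed in [12].
Alternatively, if we consider an embedding function in ELM feature space, that is, $\phi(x) = \sum_{i=1}^{L} \beta_i h_i(x) = h(x)\beta$, then $\phi(X) = H\beta$, where $H$ represents the hidden-layer output matrix of ELM. Problem (7) can be reformulated as

$$\min_{F^{T}F = I_c,\ \beta} \operatorname{tr}\left(F^{T} L F\right) + \mu\left(\|H\beta - F\|^{2} + \gamma \|\beta\|^{2}\right), \qquad (10)$$

which is referred to as ESEC. Here $\mu$ and $\gamma$ are the two regularization parameters of problem (7).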

Method.
First, to solve the optimization problem (9), we transform it into a simpler form and have the following theorem.
Theorem 1. The optimization problem (9) can be transformed into the following minimization problem:

$$\min_{F^{T}F = I_c} \operatorname{tr}\left(F^{T}(L + \mu L_F)F\right), \qquad (11)$$

where $L_F = I_n - K(K + \gamma I_n)^{-1}$ and $I_n$ denotes the identity matrix of size $n$ by $n$.
Proof. Problem (9) is first transformed into the form (12). Setting the derivative of the objective function (12) with respect to the expansion coefficients of the embedding function to zero yields (13). Substituting (13) into (12), the optimization problem (12) becomes (14), which can be denoted as follows:

$$\min_{F^{T}F = I_c} \operatorname{tr}\left(F^{T}(L + \mu L_F)F\right), \qquad (15)$$

where $L_F = I_n - K(K + \gamma I_n)^{-1}$. This completes the proof of Theorem 1.
Based on Theorem 1, the relaxed cluster assignment matrix $F^{*}$ of KSEC can be obtained by computing the eigenvectors of $L + \mu L_F$ corresponding to the $c$ smallest eigenvalues; these $c$ eigenvectors form the columns of $F^{*}$. Finally, the discrete-valued cluster assignment matrix can be obtained by clustering the rows of $F^{*}$.
To inherit the advantage of fast learning speed of ELM, we mainly discuss ESEC based on ELM with multioutputs, since ELM with single output can be regarded as a special case of it.We have the following theorem on ESEC, which is the foundation of the proposed ESEC algorithm.

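Theorem 2. The optimization problem (10) can be transformed into the following minimization problem:

$$\min_{F^{T}F = I_c} \operatorname{tr}\left(F^{T}(L + \mu L_H)F\right), \qquad (16)$$

where $L_H = I_n - H(\gamma I_L + H^{T}H)^{-1}H^{T}$.
Proof. By setting the derivative of the objective function (10) with respect to $\beta$ to zero, we have

$$\beta = (H^{T}H + \gamma I_L)^{-1}H^{T}F = H^{T}(\gamma I_n + HH^{T})^{-1}F. \qquad (17)$$

Substituting $\beta$ in (10) by (17) gives problem (18). Problem (18) can be further transformed into the objective function (19), which can be denoted as follows:

$$\min_{F^{T}F = I_c} \operatorname{tr}\left(F^{T}(L + \mu L_H)F\right), \qquad (20)$$

where $L_H = I_n - HH^{T}(\gamma I_n + HH^{T})^{-1}$. $L_H$ can be transformed into another form as follows:

$$L_H = I_n - H(\gamma I_L + H^{T}H)^{-1}H^{T}. \qquad (21)$$

This completes the proof of Theorem 2.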
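The elimination of $\beta$ behind Theorem 2 can be written out in a few lines. The derivation below is a reconstruction under stated assumptions, not necessarily the authors' exact steps: it assumes that the embedding part of problem (10) is $\|H\beta - F\|_F^{2} + \gamma\|\beta\|_F^{2}$ with the Frobenius norm, weighted by $\mu$, where $\mu$ and $\gamma$ stand for the two regularization parameters.

```latex
\begin{aligned}
\frac{\partial}{\partial \beta}\Big(\|H\beta - F\|_F^{2} + \gamma\|\beta\|_F^{2}\Big)
  &= 2H^{T}(H\beta - F) + 2\gamma\beta = 0
  \;\Longrightarrow\;
  \beta = (H^{T}H + \gamma I_L)^{-1}H^{T}F, \\
H\beta - F
  &= -\big(I_n - H(H^{T}H + \gamma I_L)^{-1}H^{T}\big)F = -L_H F, \\
\|H\beta - F\|_F^{2} + \gamma\|\beta\|_F^{2}
  &= \operatorname{tr}\big(F^{T}L_H^{2}F\big)
   + \gamma\operatorname{tr}\big(F^{T}H(H^{T}H + \gamma I_L)^{-2}H^{T}F\big)
   = \operatorname{tr}\big(F^{T}L_H F\big).
\end{aligned}
```

The last equality uses $H^{T}H = (H^{T}H + \gamma I_L) - \gamma I_L$, which gives $L_H^{2} = L_H - \gamma H(H^{T}H + \gamma I_L)^{-2}H^{T}$. Adding $\operatorname{tr}(F^{T}LF)$ and the weight $\mu$ then yields the objective $\operatorname{tr}(F^{T}(L + \mu L_H)F)$.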
ESEC makes good use of an embedding function in ELM feature space instead of RKHS.Thus, the form of ESEC is similar to that of KSEC.It can be proved that there is a link between ESEC and KSEC.We have the following theorem.

Theorem 3. If the mapping $h(x)$ in ELM is $h(x) = [\kappa(a_1, x), \ldots, \kappa(a_L, x)]$, where $\kappa$ denotes any kernel function, $L$ is the number of hidden nodes in ELM, and $\{(a_i, b_i)\}_{i=1}^{L}$ ($b_i$ is the parameter of the kernel function $\kappa(x_1, x_2)$) are random sampling points from any continuous probability distribution, then ESEC is an approximation of KSEC obtained by discretizing the embedding function $\phi(x) = \sum_{i=1}^{n} \alpha_i \kappa(x_i, x)$.
The proposed ESEC algorithm is described as follows.
Input. The input is the training dataset $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$ and the number of clusters $c$.
Output. The output is the class assignment matrix of cluster $Y$.
Step 1. Construct the graph Laplacian $L$ from $X$.
Step 2. Randomly generate input weights $\{(a_i, b_i)\}_{i=1}^{L}$ and initiate an ELM network of $L$ hidden neurons; calculate the output matrix $H$ of the hidden layer.
Step 3. Compute the matrix $L_H$ from the hidden-layer output matrix $H$.
Step 4. Compute the matrix $L + \mu L_H$.
Step 5. Find the eigenvectors of $L + \mu L_H$ corresponding to the $c$ smallest eigenvalues, which form the optimal $F^{*}$.
Step 7. Return the discrete class assignment matrix of cluster obtained from $F^{*}$.
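The steps of the ESEC algorithm can be sketched in NumPy as follows. This is an illustrative sketch under several assumptions, not the authors' code: the Gaussian affinity, the RBF hidden nodes with random centers and widths, the small K-means routine used for the final discretization, and the names `esec`, `mu`, `gamma`, and `L` are all chosen here for illustration. Samples are stored as rows.

```python
import numpy as np

def normalized_laplacian(X, sigma=1.0):
    """I - D^{-1/2} W D^{-1/2} with a dense Gaussian affinity (assumed)."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-d2 / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)
    s = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(X)) - s[:, None] * W * s[None, :]

def rbf_hidden(X, centers, widths):
    """Hidden-layer output matrix H (n x L) of RBF nodes."""
    d2 = ((X[:, None, :] - centers[None, :, :])**2).sum(axis=2)
    return np.exp(-d2 / (2.0 * widths[None, :]**2))

def kmeans(Z, c, rng, iters=50):
    """Plain Lloyd iterations on the rows of Z."""
    centers = Z[rng.choice(len(Z), c, replace=False)]
    for _ in range(iters):
        d2 = ((Z[:, None, :] - centers[None, :, :])**2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(c):
            if np.any(labels == k):
                centers[k] = Z[labels == k].mean(axis=0)
    return labels

def esec(X, c, L=30, mu=1.0, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    lap = normalized_laplacian(X)                           # Step 1
    centers = rng.normal(size=(L, d))                       # random a_i
    widths = rng.uniform(0.5, 2.0, size=L)                  # random b_i
    H = rbf_hidden(X, centers, widths)                      # Step 2
    G = np.linalg.solve(H.T @ H + gamma * np.eye(L), H.T)   # (H^T H + gamma I)^{-1} H^T
    L_H = np.eye(n) - H @ G                                 # Step 3
    M = lap + mu * L_H                                      # Step 4
    M = 0.5 * (M + M.T)                                     # remove round-off asymmetry
    vals, vecs = np.linalg.eigh(M)
    F = vecs[:, :c]                                         # Step 5: c smallest eigenvalues
    labels = kmeans(F, c, rng)                              # Step 6: cluster the rows of F
    beta = G @ F                                            # output weights for new points
    return labels, F, beta

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
labels, F, beta = esec(X, c=2)
```

Only an $L \times L$ system is solved when forming $L_H$, so the cost of this step stays small when $L \ll n$.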

Computational Complexity.
From Algorithm 1, we can see that the most costly computations are computing the matrix $L_H$ and carrying out the eigen-decomposition of $L + \mu L_H$. If $L \ll n$, computing $L_H$ requires the inversion of $\gamma I_L + H^{T}H$, whose computational complexity is $O(L^{3} + L^{2}n)$. In addition, the computational complexity of the eigenvalue decomposition of $L + \mu L_H$ is $O(n^{3})$. Thus, the total computational complexity of ESEC is $O(n^{3} + L^{3} + L^{2}n)$, where $L \ll n$. Correspondingly, for KSEC, the computational complexity of calculating $L_F$ is $O(n^{3})$ and its total computational complexity is $O(2n^{3})$. Consequently, ESEC has lower computational complexity than KSEC.
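The saving rests on the fact that the two forms of $L_H$ agree, so the $n \times n$ inverse can be replaced by an $L \times L$ one. The short check below illustrates this numerically on random data; the sizes and the value of `gamma` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L, gamma = 200, 10, 0.5
H = rng.normal(size=(n, L))     # stand-in for the hidden-layer output matrix

# Form with an n x n inverse: I_n - H H^T (gamma I_n + H H^T)^{-1}
L_H_big = np.eye(n) - H @ H.T @ np.linalg.inv(gamma * np.eye(n) + H @ H.T)

# Form with an L x L solve: I_n - H (gamma I_L + H^T H)^{-1} H^T
L_H_small = np.eye(n) - H @ np.linalg.solve(gamma * np.eye(L) + H.T @ H, H.T)

max_gap = np.abs(L_H_big - L_H_small).max()
```

Both expressions give the same matrix up to round-off, but only the second avoids factorizing an $n \times n$ matrix.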

Clustering for Out-of-Sample Data.
By performing Algorithm 1, we can obtain the discrete cluster assignment matrix for the training data. Thus, $\beta$ can be easily computed by using formula (17). Then, for any new data point $x \in \mathbb{R}^{d}$, we can obtain the prediction result

$$h(x)\beta. \qquad (22)$$

In this paper, we use the spectral rotation method to calculate the discrete cluster assignment vector for $x$. First, an orthogonal matrix $R$ is computed by the spectral rotation method (23), where $1_n$ and $1_c$ denote the $n \times 1$ and $c \times 1$ vectors of all 1s, respectively, $R \in \mathbb{R}^{c \times c}$ is an orthogonal matrix, and $Y^{*}$ is defined by (24), where $\operatorname{Diag}(A)$ represents a diagonal matrix with the same diagonal elements as the square matrix $A$. Second, the discrete cluster assignment vector $\tilde{y}$ for $x$ is calculated by (25). Finally, the class of the data point $x$ is

$$\arg\max_{i} \tilde{y}(i), \qquad (26)$$

where $\tilde{y}(i)$ is the $i$th element in the vector $\tilde{y}$.
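The out-of-sample step can be sketched as follows. Note the substitution: instead of the spectral rotation described above, this sketch assigns a new point to the nearest cluster centroid in the embedding space, which is a simpler stand-in and not the authors' procedure. The RBF hidden nodes, the choice of training points as node centers, the scaled indicator matrix used as `F_train`, and all function names are assumptions for illustration.

```python
import numpy as np

def rbf_hidden(X, centers, widths):
    """Hidden-layer outputs h(x) of RBF nodes for every row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :])**2).sum(axis=2)
    return np.exp(-d2 / (2.0 * widths[None, :]**2))

def fit_beta(H, F, gamma):
    """Output weights beta = (H^T H + gamma I)^{-1} H^T F."""
    L = H.shape[1]
    return np.linalg.solve(H.T @ H + gamma * np.eye(L), H.T @ F)

def predict_cluster(X_new, centers, widths, beta, F_train, labels_train):
    """Embed new points with h(x) beta, then pick the nearest cluster centroid."""
    f = rbf_hidden(np.atleast_2d(X_new), centers, widths) @ beta
    ks = np.unique(labels_train)
    cents = np.stack([F_train[labels_train == k].mean(axis=0) for k in ks])
    d2 = ((f[:, None, :] - cents[None, :, :])**2).sum(axis=2)
    return ks[d2.argmin(axis=1)]

# Toy setup: two blobs with known training labels (hypothetical data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
labels_train = np.repeat([0, 1], 20)
F_train = np.eye(2)[labels_train] / np.sqrt(20)        # orthonormal scaled indicators
centers = X[rng.choice(len(X), 10, replace=False)]     # node centers drawn from the data
widths = rng.uniform(0.5, 2.0, size=10)
beta = fit_beta(rbf_hidden(X, centers, widths), F_train, gamma=1e-2)
pred = predict_cluster(np.array([[0.0, 0.0], [3.0, 3.0]]),
                       centers, widths, beta, F_train, labels_train)
```

No affinity between new and training points is needed at prediction time; only the stored node parameters and `beta` are used.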

Experiments
In in-sample clustering, we assign a cluster label to each unlabeled in-sample data point. The proposed ESEC algorithm is compared with K-means (KM) clustering, SC [3], SEC [12], and KSEC. For KM, an EM-like algorithm is used to assign cluster labels as in [16]. In out-of-sample clustering, for KM, each unseen data point is assigned to the closest cluster center learned from the in-sample data points. We use the proposed out-of-sample approach to cope with unseen data for ESEC. A similar out-of-sample method is also used for KSEC and SEC by using different embedding functions. Since the Nyström method [10] can be used to deal with unseen data, we also compare ESEC with the Nyström method for out-of-sample SC.

Experimental Setup.
Each dataset is randomly divided into seen and unseen samples, and we use the seen data to obtain the optimal parameters of different clustering methods by cross-validation.Then, we use the unseen data to test the performance of all algorithms using the obtained optimal parameters.In the experiments, 80% of the data are randomly selected as seen data and the remaining data are used as unseen data.
For the Nyström SC method, we set the same $\sigma = 2^{r}\sigma_0$, where $\sigma_0 = 1/m$, with $m$ being the mean value of the squared distance between the in-sample data as suggested in [18] and $r \in \{-3, -2, -1, 0, 1, 2, 3\}$. For the in-sample clustering, the best clustering results from the best parameters for SEC, KSEC, and ESEC are reported in Table 2. For ESEC, we use the RBF kernel as the hidden node function, and a grid search over the number of hidden nodes $L$ on $\{10, 20, \ldots, 900, 1000\}$ is conducted to find the optimal result by using fivefold cross-validation. Using the optimal parameters from the in-sample setting, the results for the out-of-sample clustering are obtained and reported in Table 4.
It should be noted that the results of all clustering methods rely on the initialization. To get statistical results for different parameters and random partitions, all clustering algorithms are independently repeated 50 times, and we report the mean clustering result and standard deviation using the best parameters on the seen and unseen data. In the experiments, we set the number of clusters as the number of classes $c$ in each dataset. The clustering accuracy (ACC) (refer to [19] for its definition) and time cost are used to evaluate the clustering performance.
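For readers unfamiliar with ACC, the sketch below implements the commonly used definition: the fraction of points whose cluster label matches the true class under the best one-to-one mapping between clusters and classes. Whether [19] uses exactly this form is an assumption here. The brute-force search over permutations is only practical for a small number of clusters; an assignment solver would be used otherwise.

```python
from itertools import permutations
import numpy as np

def clustering_accuracy(y_true, y_pred):
    """Best-mapping clustering accuracy (brute force over label permutations)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    k = max(len(classes), len(clusters))
    # counts[i, j] = number of points in cluster i that belong to class j
    counts = np.zeros((k, k), dtype=int)
    for i, cl in enumerate(clusters):
        for j, cs in enumerate(classes):
            counts[i, j] = np.sum((y_pred == cl) & (y_true == cs))
    best = max(sum(counts[i, p[i]] for i in range(k))
               for p in permutations(range(k)))
    return best / len(y_true)

acc_perfect = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
acc_partial = clustering_accuracy([0, 0, 1, 1], [0, 0, 0, 1])
```

The first call returns 1.0 because the clusters match the classes up to relabeling; the second returns 0.75 because one point is misplaced under the best mapping.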

In-Sample Clustering Experiments. To compare the clustering performance of the various clustering algorithms, we report the in-sample clustering results on all the datasets in Table 2.
In Figure 1, we further analyze the sensitivity of the in-sample clustering performance of SEC, KSEC, and ESEC with respect to the regularization parameter. We can see from Figure 1 that ESEC prefers a large value of this parameter on Yale and ORL, and its performance on these datasets is relatively stable when the parameter is set to a large value. In contrast, ESEC and KSEC favor a small value for COIL20 and Isolet. We can observe that ESEC outperforms KSEC and SEC over a wide range of parameter values; that is, the clustering accuracy of ESEC is less sensitive to the parameter for most of the datasets when compared with SEC and KSEC.

Out-of-Sample Clustering Experiments. We also study the performance of KM, Nyström SC, SEC, KSEC, and ESEC for the out-of-sample extension. Table 4 shows the clustering accuracies of these methods for the out-of-sample clustering on all the datasets. The optimal parameters of SEC, KSEC, and ESEC are determined by cross-validation from the in-sample clustering. From Table 4, it can be seen that SEC, KSEC, and ESEC significantly outperform the Nyström method for out-of-sample clustering. The reason is that the Nyström method utilizes the Nyström extension to evaluate the similarity matrix between the unseen data, which might be inaccurate or even deviate seriously. In contrast, our proposed framework aims at minimizing the error between the cluster assignment matrix and the low-dimensional embedding of the data, which is feasible for handling real-world data. Thus, ESEC has the natural ability to solve out-of-sample extension problems. In addition, KM is sharply degraded on Yale and ORL compared with the corresponding results in Table 2. This is due to the fact that the unseen face data vary considerably from the seen data. On the other hand, the clustering accuracies of ESEC are comparable to the in-sample testing results, which validates that ESEC has better generalization performance. ESEC achieves 6 of the best clustering results among all ten testing results and has comparable results on the rest of the datasets when compared to KSEC. Consequently, the proposed ESEC algorithm provides a new way to cope with out-of-sample data in clustering tasks.

Conclusion
In this paper, we propose a general spectral embedded clustering framework based on the objective function of SC, from which SEC, KSEC, and ESEC can all be derived by using different embedding functions. By virtue of ELM, the fast spectral nonlinearly embedded clustering algorithm (ESEC) is proposed, which can naturally solve the out-of-sample extension problem for clustering tasks. Experimental results on benchmark datasets validate the effectiveness and efficiency of the proposed ESEC method for both in-sample and out-of-sample clustering. In the future, we intend to develop a new semisupervised clustering framework by incorporating pair constraints into the present framework and propose some semisupervised clustering algorithms based on spectral nonlinearly embedded clustering models.

Notations
$\mathbb{R}^{d}$: The input $d$-dimensional Euclidean space
$\mathbb{B}$: The output 0-1 binary space
$n$: The number of total training data points
$c$: The number of classes that the samples belong to
$X$: $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ is the training data matrix
$Y$: $Y = [y_1, \ldots, y_n]^{T} \in \mathbb{B}^{n \times c}$ is the 0-1 class assignment matrix; $y_i \in \mathbb{B}^{c \times 1}$ is the label vector of $x_i$, and all components of $y_i$ are 0s except one being 1
$\phi(\cdot)$: $\phi(X) = (\phi_1(X), \ldots, \phi_c(X)) \in \mathbb{R}^{n \times c}$ is the embedding vector function
$L_H$: $L_H = I_n - HH^{T}(\gamma I_n + HH^{T})^{-1}$ or $L_H = I_n - H(\gamma I_L + H^{T}H)^{-1}H^{T}$. $I_L$ denotes the identity matrix of size $L$ by $L$ and $L$ is the number of hidden layer nodes in ELM.

Step 6 (of the ESEC algorithm). Treat each row of $F^{*}$ as a new training sample, and use the K-means algorithm to cluster the $n$ training samples into $c$ clusters. The result is the final discrete class assignment matrix of cluster for the training data.

Table 2: Performance comparison of clustering accuracy using KM, SC, SEC, KSEC, and ESEC for the in-sample clustering on ten datasets.

Table 3: Time cost comparison of KM, SC, SEC, KSEC, and ESEC for the in-sample clustering on ten datasets.

As can be seen from Table 2, SC outperforms KM on most of the low-dimensional datasets, such as Iris, Glass, Wine, WPBC, and SpectfHeart. However, it may become worse on the high-dimensional datasets, such as Yale and Isolet. This is due to the fact that SC prefers datasets that have a clear manifold structure in a low-dimensional space. If this assumption does not hold, it even performs worse than the KM algorithm. The performance of SEC is better than that of KM and SC on Glass, WPBC, USPS, and Isolet. Hence, it does not achieve overwhelming advantages for in-sample clustering on all datasets. One possible explanation is that SEC improves SC by introducing linear embedding functions, which are only applicable to data with linear or approximately linear structures. KSEC and ESEC significantly outperform KM, SC, and SEC in most cases. KSEC and ESEC each achieve 5 of the best in-sample clustering results among all datasets.

Table 4: Performance comparison of different algorithms for the out-of-sample clustering on ten datasets.

Compared with KSEC, ESEC achieves better or at least comparable results, which demonstrates that the proposed ESEC method is effective on all the datasets and has the ability to handle datasets that do not have a clear manifold structure in a low-dimensional space. The running time of all algorithms is listed in Table 3. It is shown that ESEC runs much faster than KSEC, which is consistent with the theoretical analysis, and the running time of ESEC and KSEC is lower than that of KM, SC, and SEC for most of the datasets. Overall, compared with other methods, the proposed ESEC method has better or comparable in-sample performance at a much faster training speed.