Multiple Kernel Spectral Regression for Dimensionality Reduction

Traditional manifold learning algorithms, such as locally linear embedding, Isomap, and Laplacian eigenmap, only provide the embedding results of the training samples. To address the out-of-sample extension problem, spectral regression (SR) learns an embedding function within a regression framework, which avoids eigen-decomposition of dense matrices. Motivated by the effectiveness of SR, we incorporate multiple kernel learning (MKL) into SR for dimensionality reduction. The proposed approach (termed MKL-SR) seeks an embedding function in the reproducing kernel Hilbert space (RKHS) induced by multiple base kernels, with the aim of further improving the performance of kernel-based SR (KSR). Furthermore, the proposed MKL-SR algorithm can be applied in supervised, unsupervised, and semi-supervised settings. Experimental results on supervised classification and semi-supervised classification demonstrate the effectiveness and efficiency of our algorithm.


Introduction
In real applications, data representations are generally high dimensional. Practical algorithms usually behave badly when faced with many unnecessary features. Hence, transforming the data into a unified space of lower dimension can facilitate underlying tasks such as pattern recognition or regression. Dimensionality reduction (DR) techniques, which have been widely used in many fields of information processing, can be unsupervised, supervised, or semisupervised, depending on the assumptions made about the data distribution and on the availability of data labels.
In order to handle data sampled from a nonlinear low dimensional manifold, many manifold learning techniques, such as ISOMAP [1], Locally Linear Embedding (LLE) [2], and Laplacian Eigenmap [3], have been proposed in recent years. These techniques reduce the dimensionality of a fixed training set in a way that maximally preserves certain interpoint relationships. One of their major limitations is that they do not generally address the out-of-sample problem. Although some methods explicitly require an embedding function, either linear or in an RKHS, when minimizing the objective function [4,5], their computation involves eigen-decomposition of dense matrices, which is expensive in both time and memory. Spectral regression (SR), which is fundamentally based on regression and spectral graph analysis [6-10], avoids eigen-decomposition of dense matrices and achieves better performance at a faster learning speed. Moreover, it can be performed in the supervised, unsupervised, or semisupervised setting.
Kernel SR (KSR) is the kernelized version of SR in the reproducing kernel Hilbert space (RKHS), which can further improve the performance of SR. While KSR is based on a single kernel, in practice it is often hard to select a suitable kernel. A common way to select kernels automatically is to learn a linear combination of base kernels. Motivated by the effectiveness of SR, we introduce a framework called MKL-SR that incorporates multiple kernel learning (MKL) into the training process of SR. We illustrate the formulation of MKL-SR with graph embedding [11], which provides a unified view of a large family of DR methods. Any DR technique expressible by graph embedding can therefore be generalized by MKL-SR to boost its power by automatically selecting kernels. Like the corresponding SR algorithm, the proposed approach not only solves the out-of-sample extension problem but also improves the performance of kernel-based SR (KSR) for supervised, semisupervised, and unsupervised learning problems.
The paper is structured as follows. In Section 2, we briefly introduce the related work. We provide the MKL-SR framework and present the optimization process in Section 3. The experimental results are shown in Section 4. Finally, we give the conclusions in Section 5.

Related Work
Since the relevant literature is quite extensive, our survey instead emphasizes the key concepts crucial to the establishment of the proposed framework.

Spectral Regression Algorithm.
In the traditional spectral dimensionality reduction algorithms, seeking an embedding function which minimizes the objective function involves eigen-decomposition of dense matrices, which has a high computational cost in both time and memory. The SR algorithm instead uses the least squares method to obtain the best projection direction, rather than eigen-decomposing dense matrices, so it learns much faster. An affinity graph G over both labeled and unlabeled points is constructed to capture the intrinsic geometric structure and to learn the responses for the given data. Then, with these responses, ordinary regression is applied to learn the embedding function.
Given a training set with $l$ labeled samples $x_1, x_2, \ldots, x_l$ and $(N - l)$ unlabeled samples $x_{l+1}, x_{l+2}, \ldots, x_N$, where each sample $x_i \in \mathbb{R}^n$ belongs to one of $c$ classes, let $l_k$ be the number of labeled samples in the $k$th class (the sum of the $l_k$ is equal to $l$). The SR algorithm is summarized as follows.
Step 1. Constructing the adjacency graph G. Let X be the training set and let G denote a graph with $N$ nodes, where the $i$th node corresponds to the sample $x_i$. In order to model the local structure as well as the label information, the graph G is constructed through the following three steps.
(1) If $x_i$ is among the $p$-nearest neighbors of $x_j$ or $x_j$ is among the $p$-nearest neighbors of $x_i$, then nodes $i$ and $j$ are connected by an edge.
(2) If $x_i$ and $x_j$ are in the same class (i.e., have the same label), then nodes $i$ and $j$ are also connected by an edge.
(3) Otherwise, if $x_i$ and $x_j$ are not in the same class, then the edge between nodes $i$ and $j$ is deleted.
Step 2. Constructing the weight matrix W. Let W be the sparse symmetric $N \times N$ matrix, where $W_{ij}$ represents the weight of the edge joining vertices $i$ and $j$.
(1) If there is no edge between nodes $i$ and $j$, then $W_{ij} = 0$.
(2) Otherwise, if both $x_i$ and $x_j$ belong to the $k$th class, then $W_{ij} = 1/l_k$; else $W_{ij} = \delta \cdot s(i, j)$, where $\delta$ ($0 < \delta \le 1$) is a given parameter to adjust the weight between supervised and unsupervised neighbor information. Here, $s(i, j)$ is a similarity evaluation function between $x_i$ and $x_j$. We have two variations: the first is the simple-minded function $s(i, j) = 1$ and the second is the heat kernel function
$$s(i, j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right), \tag{1}$$
where $\sigma \in \mathbb{R}$.
Step 3. Eigen-decomposition. Let D be the $N \times N$ diagonal matrix whose $(i, i)$th element is the sum of the $i$th column (or row) of W. Find $y_0, y_1, \ldots, y_{c-1}$, which are the largest $c$ generalized eigenvectors of the eigenproblem
$$W y = \lambda D y, \tag{2}$$
where the first eigenvector $y_0$ is a vector of all ones with eigenvalue 1.
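Steps 1 to 3 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `build_weight_matrix` and `sr_responses`, the convention that unlabeled samples carry the label -1, and the dense (non-sparse) matrices are all assumptions made for brevity. The generalized eigenproblem is solved through the symmetric normalization of W by D, which has the same eigenvalues.

```python
import numpy as np

def build_weight_matrix(X, labels, p=2, delta=1.0, sigma=None):
    """Weight matrix W of Steps 1-2 (hypothetical helper).

    labels[i] is a class id >= 0 for labeled samples and -1 for unlabeled ones.
    sigma=None uses s(i, j) = 1; otherwise the heat kernel is used.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    N = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)

    # Step 1 (1): connect i and j if either is among the other's p nearest neighbors.
    far = sq + np.diag(np.full(N, np.inf))          # exclude self-matches
    nearest = np.argsort(far, axis=1)[:, :p]
    knn = np.zeros((N, N), dtype=bool)
    knn[np.repeat(np.arange(N), p), nearest.ravel()] = True
    knn = knn | knn.T

    labeled = labels >= 0
    both = labeled[:, None] & labeled[None, :]
    same = both & (labels[:, None] == labels[None, :])   # Step 1 (2)
    diff = both & ~same                                  # Step 1 (3)

    # Step 2: neighbor edges weigh delta * s(i, j), same-class edges weigh 1 / l_k.
    s = np.ones((N, N)) if sigma is None else np.exp(-sq / (2.0 * sigma ** 2))
    W = np.where(knn, delta * s, 0.0)
    l_k = np.array([np.sum(labels == k) for k in labels], dtype=float)
    W = np.where(same, 1.0 / l_k[:, None], W)
    W[diff] = 0.0
    return W

def sr_responses(W, c):
    """Step 3: the c largest generalized eigenpairs of W y = lambda D y."""
    d = W.sum(axis=1)
    d_isqrt = 1.0 / np.sqrt(d)
    S = d_isqrt[:, None] * W * d_isqrt[None, :]      # D^{-1/2} W D^{-1/2}
    vals, vecs = np.linalg.eigh(S)
    top = np.argsort(vals)[::-1][:c]
    return vals[top], d_isqrt[:, None] * vecs[:, top]  # y = D^{-1/2} u
```

For large data sets a sparse representation and a Lanczos-type solver would replace the dense `eigh` call; the dense version is used here only to keep the sketch short.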

Multiple Kernel Learning.
MKL learns a kernel machine with multiple kernel functions or kernel matrices. Recent studies have shown that MKL not only increases the recognition accuracy but also enhances the interpretability of the resulting classifiers. Given a set of base kernel functions $\{k_m\}_{m=1}^{M}$, an ensemble kernel function $k$ is defined by
$$k(x_i, x_j) = \sum_{m=1}^{M} \beta_m k_m(x_i, x_j), \quad \beta_m \ge 0. \tag{8}$$
Consequently, an often-used MKL decision function derived from binary-class SVM is
$$f(x) = \sum_{i=1}^{N} \alpha_i y_i k(x_i, x) + b = \sum_{i=1}^{N} \alpha_i y_i \sum_{m=1}^{M} \beta_m k_m(x_i, x) + b. \tag{9}$$
The training process of MKL generally optimizes over both the coefficients $\{\alpha_i\}_{i=1}^{N}$ and $\{\beta_m\}_{m=1}^{M}$. In recent years, dimensionality reduction methods based on multiple kernels have been proposed to improve the performance of those using a single kernel. In [12], kernel learning was first incorporated into DR methods. Then, a multiple kernel DR framework was designed in [13]. Recently, Zhu et al. proposed a dimensionality reduction method based on Mixed Kernel Canonical Correlation Analysis (CCA) [14,15]. In this method, the high dimensional data space is mapped into the reproducing kernel Hilbert space (RKHS) with a linear combination of a local kernel and a global kernel. Kernel CCA is further improved by performing Principal Component Analysis (PCA) followed by CCA for effective dimensionality reduction, which can be implemented in supervised learning, semisupervised learning, and transfer learning. Motivated by their work, we aim to incorporate the MKL optimization into SR to yield more flexible dimensionality reduction schemes.
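The ensemble kernel of (8) is simply a nonnegative weighted sum of base kernel matrices. The sketch below illustrates this with a linear, a polynomial, and a Gaussian base kernel; the function names, the polynomial degree and offset, and the Gaussian parameterization `exp(-||x - z||^2 / width)` are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T

def polynomial_kernel(X, Z, degree=2, coef0=1.0):
    # degree and coef0 are illustrative defaults
    return (X @ Z.T + coef0) ** degree

def gaussian_kernel(X, Z, width=1.0):
    sq = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / width)

def ensemble_kernel(kernel_matrices, beta):
    """Combine M base kernel matrices with nonnegative weights beta, as in (8)."""
    beta = np.asarray(beta, dtype=float)
    if np.any(beta < 0):
        raise ValueError("kernel weights must be nonnegative")
    stacked = np.stack(kernel_matrices)          # shape (M, N, N)
    return np.tensordot(beta, stacked, axes=1)   # sum_m beta_m * K_m
```

Because each base kernel matrix is positive semidefinite and the weights are nonnegative, the combined matrix is again a valid kernel matrix.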

The MKL-SR Framework
We first explain how to integrate MKL and SR for dimensionality reduction. Then, we propose an optimization procedure to complete the framework.

MKL-SR Model.
Suppose that the ensemble kernel $k$ in MKL-SR is generated by linearly combining the base kernels $\{k_m\}_{m=1}^{M}$ as in (8). The embedding is selected as a nonlinear function in the RKHS induced by the kernel function $k$, and the constrained optimization problem for 1D MKL-SR is defined in (10). The additional constraints in (12) arise from the use of the ensemble kernel in (8) and ensure that the resulting kernel $k$ in MKL-SR is a nonnegative combination of base kernels.
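For orientation, one form consistent with this description can be written out explicitly. It is an assumption pieced together from the ensemble kernel (8) and the standard regularized least squares loss of kernel SR, not a transcription of (10)-(12); the symbols $N$ (number of training samples), $\gamma$ (regularization parameter), $y_i$ (the SR response of sample $x_i$), and $\mathbb{K}^{(x)}$ are notation introduced here for illustration.

```latex
% Projection: sample coefficients alpha and kernel weights beta
f(x) = \sum_{i=1}^{N} \alpha_i \sum_{m=1}^{M} \beta_m k_m(x_i, x)
     = \alpha^{T} \mathbb{K}^{(x)} \beta,
\qquad \mathbb{K}^{(x)}_{im} = k_m(x_i, x).

% Assumed 1D objective: regularized least squares fit to the responses y_i
\min_{\alpha, \beta} \; \sum_{i=1}^{N} \bigl( \alpha^{T} \mathbb{K}^{(x_i)} \beta - y_i \bigr)^{2}
  + \gamma \, \alpha^{T} K \alpha
\quad \text{subject to} \quad \beta_m \ge 0, \; m = 1, \ldots, M,
\qquad K = \sum_{m=1}^{M} \beta_m K_m .
```

In this reading the projection is bilinear in $\alpha$ and $\beta$, which is why fixing one of the two vectors leaves a simpler subproblem in the other.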
Observe from (10) that the one-dimensional projection of MKL-SR is specified by a sample coefficient vector $\alpha$ and a kernel weight vector $\beta$. The two vectors, respectively, account for the relative importance among the samples and the base kernels in the construction of the projection. To generalize the formulation to uncover a multidimensional projection, we consider a set of $c - 1$ sample coefficient vectors, denoted by $A = [\alpha_1, \alpha_2, \ldots, \alpha_{c-1}]$. The resulting projection maps samples to a $(c - 1)$-dimensional Euclidean space. Similar to the 1D case, a projected sample $x_i$ can be written in terms of $A$ and $\beta$, and the optimization problem (10) can be extended to the multidimensional MKL-SR problem (16).

Optimization Algorithm.
Since direct optimization of (16) is difficult, we instead adopt an iterative, two-step strategy to alternately optimize $A$ and $\beta$. At each iteration, one of $A$ and $\beta$ is optimized while the other is fixed, and then the roles of $A$ and $\beta$ are switched. Iterations are repeated until convergence or until a maximum number of iterations is reached.
Solving the eigenproblem (19) directly involves eigen-decomposition of dense matrices, which has a high computational cost in both time and memory. To solve the eigenproblem in (19) efficiently, we use Theorem 1, which states that if $y$ is an eigenvector of the eigenproblem in (2) with eigenvalue $\lambda$ and $K\alpha = y$, then $\alpha$ is an eigenvector of the eigenproblem (19) with the same eigenvalue $\lambda$.
Theorem 1 shows that, instead of solving the eigenproblem (19), the embedding functions can be acquired through two steps.
(1) Solve the eigenproblem in (2) to obtain $y$.
(2) Find $\alpha$ which satisfies $K\alpha = y$. Similar to SR, a possible way is to find the $\alpha$ that best fits this equation in the least squares sense, where $y_i$, the $i$th element of $y$, is the target response for sample $x_i$.
Since the matrix D is guaranteed to be positive definite, the eigenproblem in (2) can be stably solved. Moreover, both D − W and D are sparse matrices. The top $c$ eigenvectors of the eigenproblem in (2) can be efficiently calculated with Lanczos algorithms [13]. In addition, techniques for solving least squares problems are mature, and there exist many efficient iterative algorithms that can handle very large scale least squares problems.
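The two-step procedure can be sketched as follows for a fixed ensemble kernel matrix K. This is a hedged illustration, not the authors' code: the name `ksr_coefficients` is hypothetical, the fit uses the common ridge-regularized form (K + gamma I) alpha = y with an assumed regularization parameter `gamma`, and the graph is assumed connected so that the leading eigenvector is the trivial constant one and can be dropped.

```python
import numpy as np

def ksr_coefficients(K, W, c, gamma=1.0):
    """Responses from W y = lambda D y, then a regularized least squares fit.

    Returns A (N x (c-1)) and the responses Y (N x (c-1)).
    """
    N = K.shape[0]
    d = W.sum(axis=1)
    d_isqrt = 1.0 / np.sqrt(d)
    S = d_isqrt[:, None] * W * d_isqrt[None, :]   # symmetric form of D^{-1} W
    vals, vecs = np.linalg.eigh(S)
    top = np.argsort(vals)[::-1][:c]
    # Drop the first (trivial) eigenvector; assumes a connected graph.
    Y = d_isqrt[:, None] * vecs[:, top[1:]]
    # Step (2): solve (K + gamma I) A = Y, one column per embedding dimension.
    A = np.linalg.solve(K + gamma * np.eye(N), Y)
    return A, Y
```

Only a linear system is solved for the coefficients, so no eigen-decomposition of the dense matrix K is needed; for very large N an iterative least squares solver could replace `np.linalg.solve`.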

On Optimizing 𝛽.
By fixing $A$, the optimization problem (16) becomes the problem (22). The additional constraints $\beta \ge 0$ mean that (22) can no longer be transformed into a generalized eigenvalue problem. It is in fact a nonconvex quadratically constrained quadratic programming (QCQP) problem [13], which is NP-hard. Thus, we instead solve its convex relaxation (24) by adding an auxiliary variable $B$ of size $M \times M$. Its constraints include $e_m^{T}\beta \ge 0$, $m = 1, 2, \ldots, M$, where $e_m$ in (26) is a column vector whose elements are 0 except that its $m$th element is 1. To obtain the convex relaxation of the nonconvex QCQP problem (22), we relax the equation $B = \beta\beta^{T}$ to $B \succeq \beta\beta^{T}$, which can be equivalently expressed by the constraint in (27) according to the Schur complement lemma [16]. The optimization problem (24) is a semidefinite program (SDP), which can be solved efficiently. Note that the numbers of constraints and variables in (24) are linear and quadratic in $M$, respectively. In practice, the value of $M$ is often small. Thus, the proposed MKL-SR algorithm listed in Algorithm 1 mainly consists of a sequence of SR training steps.
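The Schur complement step can be made explicit. With $B$ the auxiliary $M \times M$ matrix variable and $\beta$ the kernel weight vector, the standard statement of the lemma gives the equivalence below; that this block form matches the exact layout of (27) is an assumption.

```latex
B \succeq \beta \beta^{T}
\quad \Longleftrightarrow \quad
\begin{bmatrix} 1 & \beta^{T} \\ \beta & B \end{bmatrix} \succeq 0 .
```

The right-hand side is a linear matrix inequality in $(\beta, B)$, which is what allows the relaxed problem to be posed as a semidefinite program.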

Novel Sample Embedding.
After accomplishing the training procedure of MKL-SR, we can project a testing sample z into the learned subspace by evaluating the learned embedding functions at z, which requires only the base kernel values between z and the training samples together with the learned $A$ and $\beta$. Several algorithms, such as the nearest neighbor rule or $k$-means clustering, can then be used to complete classification or clustering tasks. In the experiments of this paper, we specifically discuss the effectiveness of MKL-SR in different learning tasks, including unsupervised learning for clustering and supervised and semisupervised learning for face recognition.
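The out-of-sample projection can be sketched as below. The sketch assumes the projection has the bilinear form A^T K^(z) beta, where column m of K^(z) holds the values of the m-th base kernel between the training samples and z; the function name `project_sample` and the two example base kernels are hypothetical.

```python
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T

def gaussian_kernel(X, Z, width=1.0):
    sq = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / width)

def project_sample(z, X_train, base_kernels, A, beta):
    """Embed one test sample z using coefficients A (N x (c-1)) and weights beta (M,)."""
    z = np.atleast_2d(np.asarray(z, dtype=float))
    # Column m holds k_m(x_i, z) for every training sample x_i.
    Kz = np.column_stack([k(X_train, z)[:, 0] for k in base_kernels])  # (N, M)
    return A.T @ (Kz @ np.asarray(beta, dtype=float))
```

No retraining and no eigen-decomposition is involved: embedding a new sample costs one evaluation of each base kernel against the training set.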

Experiments
We used seven data sets derived from four sources (ionosphere, letter, digit, and satellite) in the UCI machine learning repository to perform the unsupervised learning task. For the letter and satellite data sets, we only used their first two classes. Several multiclass data sets were created from the digits data. The experiments on supervised and semisupervised classification were performed on the CMU PIE face data set and the extended Yale B data set [17,18], respectively. All the face images are manually aligned and cropped. The pixel values are scaled to [0, 1]. The basic information about these data sets is listed in Table 1.
All the experiments were performed in a MATLAB 7.14.0 environment running on a 3.10 GHz Intel Core i5-2400 with 3 GB RAM.

Experiments on Unsupervised Learning.
To validate that MKL-SR is effective for an unsupervised dimensionality reduction task, we applied the proposed algorithm as a tool to learn an appropriate kernel function for KSR. Each data set was reduced by SR, single kernel based SR (KSR), kernel principal component analysis (KPCA), and MKL-SR, respectively. The normalized cut spectral clustering (NC) algorithm was adopted to evaluate the clustering performance on the reduced data. We set the number of clusters equal to the true number of classes and compared the clusters generated by these algorithms with the true classes by computing a clustering accuracy measure. In this measure, $C_k$ denotes the $k$th cluster in the final results, $L_m$ is the true $m$th class, and $T(C_k, L_m)$ is the number of entities which belong to class $m$ and are assigned to cluster $k$.
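A common way to turn the counts of entities shared by a cluster and a class into a single accuracy is to take the best one-to-one assignment of clusters to classes and divide the matched count by the number of samples. The sketch below implements that variant; whether it coincides exactly with the measure used in the paper is an assumption, and the function name is hypothetical. Exhaustive search over permutations is only practical for a small number of clusters.

```python
from itertools import permutations

import numpy as np

def clustering_accuracy(clusters, classes):
    """Best one-to-one cluster/class matching, as a fraction of all samples."""
    clusters = np.asarray(clusters)
    classes = np.asarray(classes)
    c_ids = np.unique(clusters)
    l_ids = np.unique(classes)
    # T[k, m] = number of samples in cluster k that belong to class m.
    T = np.array([[np.sum((clusters == ck) & (classes == lm)) for lm in l_ids]
                  for ck in c_ids])
    size = max(T.shape)
    padded = np.zeros((size, size), dtype=int)   # pad if the counts differ
    padded[:T.shape[0], :T.shape[1]] = T
    best = max(sum(padded[k, perm[k]] for k in range(size))
               for perm in permutations(range(size)))
    return best / clusters.size
```

The measure is invariant to how the clusters are numbered, which is necessary because a clustering algorithm assigns arbitrary cluster ids.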
To obtain stable results, for each data set, we computed the average results of each algorithm over 20 runs. For comparison, we also performed the NC algorithm in the original data space (Baseline). For SR, KSR, and MKL-SR, the dimension of the subspace is the number of categories. For KPCA, we tested its performance with all the possible dimensions and report the best result. For SR, KSR, and MKL-SR, we simply set the regularization parameter to 1.
For KSR and KPCA, the Gaussian function $\exp(-\|x - a\|^2)$ with width 1 was selected. For MKL-SR, we used a linear kernel function, a polynomial kernel function, and a Gaussian kernel function. Table 2 lists the mean over 20 random repetitions as well as the standard deviation. From Table 2, we observe that the performance of the kernel based algorithms is much better than that of SR, which indicates that the performance of linear DR algorithms can be improved by virtue of nonlinear kernel functions. MKL-SR significantly surpasses KSR and KPCA, which are single kernel based approaches. This is because MKL-SR is able to learn a better kernel by MKL, which is considerably more effective than a single Gaussian kernel. The performance of KSR is very close to that of KPCA, but the number of reduced dimensions for KPCA has to be determined by repeated testing. Besides the fixed number of reduced dimensions used for the SR-based methods, we also examined how KPCA behaves when it projects the data to a varying number of dimensions. Thus, MKL-SR is easier to apply and performs better than the other algorithms.

Experiments on Supervised Learning.
In this experiment, we mainly compared MKL-SR with the following approaches: KPCA, LDA, SR, and KSR. In order to evaluate the performance of these algorithms, we performed the SVM algorithm in the original face image space (baseline) and in the KPCA, LDA, SR, KSR, and MKL-SR subspaces. The kernels and parameters are set in the same way as in the unsupervised learning experiments. From each class of the CMU PIE face data set, we randomly selected a fixed number of samples (the number of training samples per class) for training.
For each training-set size, we averaged the results over 30 random splits and computed the mean as well as the standard deviation, which are listed in Table 3. As can be seen from Table 3, the performance of KPCA and LDA is even worse than that of the baseline method, which results from the limitations of KPCA and LDA. KPCA is unsupervised and thus cannot effectively exploit the supervised information, which results in the worst performance in the supervised case. LDA does not use regularization to control the model complexity and thus cannot avoid over-fitting in the small sample size case. In contrast, SR, KSR, and MKL-SR are regularized.
The key parameter in MKL-SR is the regularization parameter $\gamma \ge 0$, which controls the smoothness of the embedding function based on multiple kernels. Next, we discuss the impact of $\gamma$ on the performance of MKL-SR. Figure 1 shows the performance of MKL-SR as a function of $\gamma$. For convenience, the x-axis is plotted as $\gamma/(1 + \gamma)$, which lies in the interval [0, 1). As can be seen from Figure 1, MKL-SR obtains the best performance near the middle of the interval. When $\gamma/(1 + \gamma)$ decreases to zero or increases to one, the performance of MKL-SR decreases sharply. Fortunately, good performance can be achieved over a wide range of $\gamma$, which shows that parameter selection is not a crucial problem in the MKL-SR algorithm. In practice, we can use cross validation to select the best parameter or simply select a value between 0.1 and 1.

Experiments on Semisupervised Learning.
In the semisupervised case, we compared the performance of MKL-SR with KPCA and semisupervised KSR. For comparison, we performed the SVM algorithm in the original face image space (baseline) and in the KPCA, semisupervised KSR, and MKL-SR subspaces. For KSR and MKL-SR, we simply set the regularization parameter to 1. In the semisupervised MKL-SR, the graph weight parameter $\delta$ ($0 < \delta \le 1$) was selected by cross validation. The kernels and parameters are set in the same way as in the unsupervised learning experiments. For the extended Yale B face data set, a random subset of 5, 10, 20, 30, or 40 images per individual was first taken to form the training set, and the rest of the data set was used as the testing set. In the training set, we use only one half of the data as labeled data and the rest as unlabeled data. KPCA uses only unlabeled data, and the SVM algorithm is also performed on the data reduced by KPCA. KSR and MKL-SR use both labeled and unlabeled data. The number of nearest neighbors is set to 7 for the nearest neighbor graph over all the training samples in KSR and MKL-SR.
We averaged the classification accuracy over 30 random splits for each training-set size. The mean as well as the standard deviation is shown in Table 4. From Table 4, we observe that KSR and MKL-SR can efficiently exploit both labeled and unlabeled data to discover the intrinsic geometric structure in the data; that is, the reduced data preserve the original intrinsic geometric structure very well. Thus, they outperform the baseline method and KPCA, which cannot utilize all the available data. The performance of MKL-SR is much better than that of KSR, which indicates that the final kernel matrix learned by MKL-SR is still better than the one based on a single kernel in the semisupervised case. Overall, the proposed MKL-SR algorithm achieves better performance in the supervised, semisupervised, and unsupervised cases.

Conclusion
In this paper, we propose a new dimensionality reduction framework called MKL-SR. By means of SR, we solve the out-of-sample extension problem by seeking an embedding function in the RKHS induced by multiple kernels. Thus, this method can not only construct the nonlinear embedding function in the form of a convex combination of base kernels but also improve the performance of single kernel based SR in the supervised, semisupervised, and unsupervised cases. Experimental results validate the effectiveness and efficiency of the MKL-SR algorithm. In the near future, we will further explore how to integrate different MKL methods into our model.

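Theorem 1.
Let $y$ be an eigenvector of the eigenproblem in (2) with eigenvalue $\lambda$. If $K\alpha = y$, then $\alpha$ is an eigenvector of the eigenproblem in (19) with the same eigenvalue $\lambda$.
Proof. We have $Wy = \lambda Dy$. On the left side of (19), replacing $K\alpha$ by $y$, we have
$$KWK\alpha = KWy = K\lambda Dy = \lambda KDy = \lambda KDK\alpha. \tag{20}$$
Thus, $\alpha$ is an eigenvector of the eigenproblem (19) with the same eigenvalue $\lambda$.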
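Theorem 1 is easy to check numerically. The short script below is an illustrative sanity check, not part of the paper: it builds a random symmetric weight matrix W with degree matrix D and a Gaussian kernel matrix K (all synthetic assumptions), takes an eigenpair of W y = lambda D y, solves K alpha = y, and measures how far K W K alpha is from lambda K D K alpha.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6

# Synthetic symmetric weights with positive row sums, and the degree matrix D.
P = rng.random((N, N))
W = (P + P.T) / 2.0
D = np.diag(W.sum(axis=1))

# Gaussian kernel matrix on random points (positive definite for distinct points).
X = rng.normal(size=(N, 2))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-sq)

# An eigenpair of W y = lambda D y via the symmetric normalization.
d_isqrt = 1.0 / np.sqrt(np.diag(D))
vals, vecs = np.linalg.eigh(d_isqrt[:, None] * W * d_isqrt[None, :])
lam = vals[-2]
y = d_isqrt * vecs[:, -2]

# Solve K alpha = y and compare both sides of K W K alpha = lambda K D K alpha.
alpha = np.linalg.solve(K, y)
lhs = K @ W @ K @ alpha
rhs = lam * (K @ D @ K @ alpha)
residual = np.max(np.abs(lhs - rhs))
print("max residual:", residual)
```

The residual is at the level of floating-point round-off whenever K alpha = y is solved accurately, which mirrors the chain of equalities in the proof.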

Table 1: Description of the data sets.

Table 2: Clustering accuracy (in percent) based on different DR methods.

Table 3: Recognition accuracy rates on PIE (mean ± std-dev%).

Because SR, KSR, and MKL-SR use the Tikhonov regularizer to improve the smoothness of projection functions, they can perform better than KPCA and LDA. By replacing the linear embedding functions with nonlinear ones, KSR and MKL-SR both outperform SR. The performance of MKL-SR is better than that of KSR based on a single kernel, which indicates that MKL-SR can select an appropriate kernel and validates the effectiveness of our method.