Low-Rank Kernel-Based Semisupervised Discriminant Analysis

Semisupervised Discriminant Analysis (SDA) aims at dimensionality reduction with both limited labeled data and copious unlabeled data, but it may fail to discover the intrinsic geometric structure when the data manifold is highly nonlinear. The kernel trick is widely used to map the original nonlinearly separable problem to a higher-dimensional space where the classes are linearly separable. Inspired by low-rank representation (LRR), we propose a novel kernel SDA method called the low-rank kernel-based SDA (LRKSDA) algorithm, in which LRR is used as the kernel representation. Since LRR can capture the global data structures and obtain the lowest-rank representation in a parameter-free way, the low-rank kernel method is effective and robust for various kinds of data. Extensive experiments on public databases show that the proposed LRKSDA dimensionality reduction algorithm achieves better performance than other related kernel SDA methods.


Introduction
For many real-world data mining and pattern recognition applications, labeled data are expensive or difficult to obtain, while unlabeled data are often copious and readily available. How to use both labeled and unlabeled data to improve performance therefore becomes a significant problem [1,2]. Recently, semisupervised dimensionality reduction, which can be applied directly to the whole database, has attracted considerable attention [3]. Motivated by semisupervised learning (SSL), many methods have been put forward to relieve the so-called small sample size (SSS) problem of linear discriminant analysis (LDA) [4,5]. Semisupervised Discriminant Analysis (SDA) was first proposed by Cai et al. [2]; it can easily resolve the out-of-sample problem [6] and is more suitable for real-world applications. In the SDA algorithm, the labeled samples are used to maximize the separability between different classes, and the unlabeled ones are used to estimate the intrinsic geometric information of the data.
Semisupervised Discriminant Analysis may fail to discover the intrinsic geometric structure when the data manifold is highly nonlinear [2,7]. The kernel trick [8] has been widely used to generalize linear dimensionality reduction algorithms to nonlinear ones; it maps the original nonlinearly separable problem to a higher-dimensional space where the classes are linearly separable. Kernel SDA (KSDA) [2,7] can therefore discover the underlying subspace more exactly in the feature space, which yields a better subspace for the classification task through a nonlinear learning technique. Cai et al. discussed how to perform SDA in a Reproducing Kernel Hilbert Space (RKHS), which gives rise to kernel SDA [2]. You et al. presented the derivation of a first approach to optimize the parameters of a kernel, which maps the original class distributions to a space where they are optimally (with respect to Bayes) separated by a hyperplane [7]. A new kernel-based nonlinear discriminant analysis algorithm was proposed to overcome the fundamental limitations of LDA [9]. A novel KFDA kernel parameter optimization criterion was presented for simultaneously maximizing the uniformity of class-pair separabilities and the class separability in kernel space [10]. To overcome the nonlinear dimensionality reduction problem and the multiple-feature restrictions of LFDA, Wang and Sun proposed a new dimensionality reduction algorithm called multiple kernel local Fisher discriminant analysis (MKLFDA), based on multiple kernel learning [11].
The kernelization of graph embedding applies the kernel trick to the linear graph embedding algorithm to handle data with nonlinear distributions [12]. Weinberger et al. described an algorithm for nonlinear dimensionality reduction based on semidefinite programming and kernel matrix factorization, which learns a kernel matrix for high-dimensional data that lie on or near a low-dimensional manifold [13].
Low-rank matrix decomposition and completion have recently become very popular since Yang et al. and Chen et al. proved that a robust estimate of an underlying subspace can be obtained by decomposing the observations into a low-rank matrix and a sparse error matrix [14,15]. Recently, Liu et al. proposed a low-rank representation method that is robust to noise and data corruption owing to its ability to separate noise from the data set [14]. More recently, low-rank representation (LRR) [16,17], as a promising method to capture the underlying low-dimensional structures of data, has attracted much attention in the pattern analysis and signal processing communities. The LRR method [16-18] seeks the lowest-rank representation of all data jointly, such that each data point can be represented as a linear combination of some bases.
The major problem of kernel methods is finding proper kernel parameters. These kernel methods usually use fixed global parameters to determine the kernel matrix, which makes them very sensitive to the parameter setting. In fact, the most suitable kernel parameters may vary greatly across different random distributions of the same data. Moreover, the kernel mapping of KSDA always analyzes the relationship of the data in a one-to-others mode, which emphasizes local information and lacks global constraints on the solutions. These shortcomings limit the performance and efficiency of KSDA methods. To overcome the disadvantages of the traditional kernel methods, and inspired by LRR, we propose a novel kernel-based Semisupervised Discriminant Analysis called low-rank kernel-based SDA (LRKSDA), in which the low-rank representation is used as the kernel method. Compared with other kernels, the low-rank kernel jointly obtains the representation of all the samples under a global low-rank constraint [19]. Thus it is better at capturing the global data structures and is robust to different random distributions of the data set. In addition, the lowest-rank representation is obtained in a parameter-free way, which is convenient and robust for various kinds of data. Extensive experiments on public databases show that the proposed LRKSDA dimensionality reduction algorithm achieves better performance than other related methods.
The rest of the paper is organized as follows. We start with a brief overview of SDA in Section 2. We then introduce the low-rank kernel-based SDA framework in Section 3. Section 4 reports the experimental results on real-world databases. In Section 5, we conclude the paper.

Overview of SDA
Given a set of samples $[x_1, \ldots, x_l, x_{l+1}, \ldots, x_{l+u}]$, where $n = l + u$, the first $l$ samples are labeled as $[y_1, \ldots, y_l]$ and the remaining $u$ are unlabeled. They all belong to $c$ classes. SDA [2] seeks a projection vector $a$ that respects the prior assumption of consistency, which is expressed through a regularizer term $J(a)$. The objective function is as follows:

$$a_{\mathrm{opt}} = \arg\max_{a} \frac{a^{T} S_b a}{a^{T} S_t a + \alpha J(a)}, \tag{1}$$

where $S_b$ and $S_t$ are the between-class scatter matrix and the total scatter matrix:

$$S_b = \sum_{k=1}^{c} l_k \left(\mu^{(k)} - \mu\right)\left(\mu^{(k)} - \mu\right)^{T}, \qquad S_t = \sum_{i=1}^{l} \left(x_i - \mu\right)\left(x_i - \mu\right)^{T}.$$

The within-class scatter matrix $S_w$ is defined as

$$S_w = \sum_{k=1}^{c} \sum_{i=1}^{l_k} \left(x_i^{(k)} - \mu^{(k)}\right)\left(x_i^{(k)} - \mu^{(k)}\right)^{T},$$

where $\mu$ is the mean vector of the total sample, $l_k$ is the number of samples in the $k$th class, $\mu^{(k)}$ is the average vector of the $k$th class, and $x_i^{(k)}$ is the $i$th sample in the $k$th class. The parameter $\alpha$ in (1) balances the model complexity and the empirical loss. The regularizer term supplies the flexibility to incorporate prior knowledge in the applications. We aim at constructing $J(a)$ from a graph that incorporates the manifold structure revealed by the available unlabeled samples [2]. The key of SSL algorithms is the prior assumption of consistency. For classification, it means that nearby samples are likely to have the same label [20]. For dimensionality reduction, it implies that nearby samples have similar embeddings (low-dimensional representations).
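Given a set of samples $\{x_i\}_{i=1}^{n}$, we can construct a graph $G$ to represent the relationship between nearby samples using the $k$-nearest-neighbor ($k$NN) algorithm: an edge is placed between two samples if one of them is among the $k$ nearest neighbors of the other. The corresponding weight matrix $S$ is defined as follows:

$$S_{ij} = \begin{cases} 1, & \text{if } x_i \in N_k(x_j) \text{ or } x_j \in N_k(x_i), \\ 0, & \text{otherwise}, \end{cases}$$

where $N_k(x_i)$ denotes the set of $k$ nearest neighbors of $x_i$. Then the $J(a)$ term can be defined as follows:

$$J(a) = \sum_{i,j} \left(a^{T} x_i - a^{T} x_j\right)^{2} S_{ij} = 2 a^{T} X (D - S) X^{T} a = 2 a^{T} X L X^{T} a,$$

where $X = [x_1, \ldots, x_n]$ and $D$ is a diagonal matrix whose entries are the column (or row, since $S$ is symmetric) sums of $S$; that is, $D_{ii} = \sum_{j} S_{ij}$. The Laplacian matrix [21] is

$$L = D - S.$$

We can get the objective function of SDA with the regularizer term $J(a)$ [2], where the constant factor 2 is absorbed into $\alpha$:

$$\max_{a} \frac{a^{T} S_b a}{a^{T} \left(S_t + \alpha X L X^{T}\right) a}.$$

The projective vector $a$ is then obtained as the eigenvector associated with the largest eigenvalue of the generalized eigenvalue problem

$$S_b a = \lambda \left(S_t + \alpha X L X^{T}\right) a. \tag{6}$$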
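The SDA construction above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' MATLAB implementation: the function names `knn_graph` and `sda`, the default values of `k` and `alpha`, and the small ridge term `eps` (added so that the Cholesky factorization exists) are all hypothetical choices, and the mean vector is taken over the labeled samples.

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric 0/1 kNN weight matrix S for the columns of X (D x n)."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)  # squared distances
    np.fill_diagonal(dist, np.inf)                      # a sample is not its own neighbor
    idx = np.argsort(dist, axis=1)[:, :k]               # k nearest neighbors per sample
    S = np.zeros((n, n))
    S[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(S, S.T)                           # edge if either one is a neighbor

def sda(X, y, k=4, alpha=0.1, n_components=1, eps=1e-6):
    """X: D x n data, whose first len(y) columns are the labeled samples.
    Returns a D x n_components matrix of projective vectors."""
    y = np.asarray(y)
    D, n = X.shape
    Xl = X[:, :len(y)]
    mu = Xl.mean(axis=1, keepdims=True)                 # assumption: mean of labeled samples
    Sb = np.zeros((D, D))
    for c in np.unique(y):
        Xc = Xl[:, y == c]
        diff = Xc.mean(axis=1, keepdims=True) - mu
        Sb += Xc.shape[1] * (diff @ diff.T)             # between-class scatter
    St = (Xl - mu) @ (Xl - mu).T                        # total scatter
    S = knn_graph(X, k)
    L = np.diag(S.sum(axis=1)) - S                      # graph Laplacian L = D - S
    B = St + alpha * (X @ L @ X.T) + eps * np.eye(D)
    # Solve Sb a = lambda B a by whitening with the Cholesky factor B = C C^T.
    C = np.linalg.cholesky(B)
    Ci = np.linalg.inv(C)
    w, V = np.linalg.eigh(Ci @ Sb @ Ci.T)               # eigenvalues in ascending order
    return Ci.T @ V[:, ::-1][:, :n_components]          # largest eigenvalues first
```

With two classes, `Sb` has rank one, so a single component already carries all the discriminant information available to this criterion.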

Low-Rank Kernel-Based SDA Framework
3.1. Low-Rank Representation. Yan and Wang [22] proposed sparse representation (SR) to construct the $\ell_1$-graph [23] by solving an $\ell_1$ optimization problem. However, the $\ell_1$-graph lacks global constraints, which greatly reduces performance when the data are grossly corrupted. To overcome this drawback, Liu et al. proposed the low-rank representation and used it to construct the affinities of an undirected graph (here called the LR-graph) [19]. It jointly obtains the representation of all the samples under a global low-rank constraint, and thus it is better at capturing the global data structures [24].
Let $X = [x_1, x_2, \ldots, x_n]$ be a set of samples; each column is a sample which can be represented by a linear combination of the dictionary $A$ [19]. Here, we select the samples themselves, $X$, as the dictionary $A$:

$$X = XZ,$$

where $Z = [z_1, z_2, \ldots, z_n]$ is the coefficient matrix with each $z_i$ being the representation coefficient of $x_i$. Different from SR, which may not capture the global structure of the data, LRR seeks the lowest-rank solution by solving the following optimization problem [19]:

$$\min_{Z} \operatorname{rank}(Z) \quad \text{s.t.} \quad X = XZ.$$

The above optimization problem can be relaxed to the following convex optimization [25]:

$$\min_{Z} \|Z\|_{*} \quad \text{s.t.} \quad X = XZ.$$

Here $\|\cdot\|_{*}$ denotes the nuclear norm (or trace norm) [26] of a matrix, that is, the sum of the matrix's singular values. By considering the noise or corruption in real-world applications, a more reasonable objective function is

$$\min_{Z,E} \|Z\|_{*} + \lambda \|E\|_{\ell} \quad \text{s.t.} \quad X = XZ + E,$$

where $\|\cdot\|_{\ell}$ can be the $\ell_{2,1}$-norm or the $\ell_1$-norm. In this paper we choose the $\ell_{2,1}$-norm as the error term, which is defined as

$$\|E\|_{2,1} = \sum_{j=1}^{n} \sqrt{\sum_{i} E_{ij}^{2}}.$$

The parameter $\lambda$ is used to balance the effect of the low-rank term and the error term. The optimal solution $Z^{*}$ can be obtained via the inexact augmented Lagrange multipliers method [27,28].
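For illustration, the inexact augmented Lagrange multiplier iteration for the $\ell_{2,1}$-regularized problem with the data themselves as dictionary can be sketched as follows. The auxiliary variable `J` (a copy of `Z` that carries the nuclear norm), the penalty schedule (`mu`, `rho`, `mu_max`), the tolerance, and the default `lam` are assumptions taken from common practice for this solver, not values reported in this paper.

```python
import numpy as np

def lrr(X, lam=0.1, rho=1.1, mu=1e-6, mu_max=1e10, tol=1e-6, max_iter=500):
    """Approximately solve  min ||Z||_* + lam * ||E||_{2,1}  s.t.  X = X Z + E."""
    d, n = X.shape
    Z = np.zeros((n, n))
    E = np.zeros((d, n))
    Y1 = np.zeros((d, n))          # multiplier for X = XZ + E
    Y2 = np.zeros((n, n))          # multiplier for Z = J
    XtX = X.T @ X
    inv = np.linalg.inv(np.eye(n) + XtX)
    for _ in range(max_iter):
        # J-step: singular value thresholding of Z + Y2/mu with threshold 1/mu
        U, s, Vt = np.linalg.svd(Z + Y2 / mu, full_matrices=False)
        J = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # Z-step: closed-form minimizer of the quadratic terms
        Z = inv @ (XtX - X.T @ E + J + (X.T @ Y1 - Y2) / mu)
        # E-step: column-wise shrinkage (proximal operator of the l2,1 norm)
        Q = X - X @ Z + Y1 / mu
        norms = np.linalg.norm(Q, axis=0)
        E = Q * (np.maximum(norms - lam / mu, 0.0) / np.maximum(norms, 1e-12))
        # Multiplier and penalty updates
        R1 = X - X @ Z - E
        R2 = Z - J
        Y1 += mu * R1
        Y2 += mu * R2
        mu = min(rho * mu, mu_max)
        if max(np.abs(R1).max(), np.abs(R2).max()) < tol:
            break
    return Z, E
```

Each column of the returned `Z` is the representation coefficient of the corresponding sample, and `E` collects the sample-specific corruption.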

Kernel SDA.
Semisupervised Discriminant Analysis may fail to discover the intrinsic geometric structure when the data manifold is highly nonlinear. The kernel trick is a popular technique in machine learning which uses a kernel function to map samples to a high-dimensional space [8,29,30]. By using the kernel trick, we can nonlinearly map the original data to the kernel feature space.
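As a concrete illustration of a conventional kernel mapping, the sketch below computes Gaussian radial basis function and polynomial kernel matrices, the two traditional kernels named later in the experiments. The function names and the parameterization (`sigma`, `degree`, `c`) are assumptions made for this example; the text does not state the exact kernel forms used by the compared methods.

```python
import numpy as np

def rbf_kernel(X, sigma):
    """Gaussian RBF kernel matrix for the columns of X: exp(-||xi - xj||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    dist2 = np.maximum(dist2, 0.0)          # guard against tiny negative round-off
    return np.exp(-dist2 / (2.0 * sigma ** 2))

def poly_kernel(X, degree, c=1.0):
    """Polynomial kernel matrix for the columns of X: (xi^T xj + c)^degree."""
    return (X.T @ X + c) ** degree
```

Both functions depend on globally fixed parameters, which is exactly the sensitivity that the parameter choice for such kernels has to manage.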

Low-Rank Kernel-Based SDA.
The major problem of all these kernel methods is finding proper kernel parameters. They usually use fixed global parameters to determine the kernel matrix, which makes them very sensitive to the parameter setting. In fact, the most suitable kernel parameters may vary greatly across different random distributions, even of the same data. Moreover, the traditional kernel mapping always analyzes the relationship of the data in a one-to-others mode, which emphasizes local information and lacks global constraints on the solutions. These shortcomings limit the performance and efficiency of KSDA methods. To overcome them, and inspired by low-rank representation, we propose a novel kernel-based Semisupervised Discriminant Analysis (LRKSDA) in which LRR is used as the kernel representation.
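Let $\phi: \mathbb{R}^{D} \to \mathcal{F}$ be a low-rank mapping from the input space $\mathbb{R}^{D}$ into a low-rank kernel feature space $\mathcal{F}$. For the database $X = [x_1, x_2, \ldots, x_n]$, a reasonable objective function is as follows:

$$\min_{Z,E} \|Z\|_{*} + \lambda \|E\|_{2,1} \quad \text{s.t.} \quad X = XZ + E.$$

The optimal solution $Z = [z_1, z_2, \ldots, z_n]$ is the coefficient matrix with each $z_i$ being the low-rank representation coefficient of $x_i$.
Let $Z = [z_1, z_2, \ldots, z_n]$ denote the data matrix in the kernel space. The projective vectors $\theta_1, \theta_2, \ldots, \theta_d$ are the eigenvectors of the generalized eigenvalue problem in (6), with $Z$ taking the place of the data matrix, and the $n \times d$ transformation matrix is $\Theta = [\theta_1, \theta_2, \ldots, \theta_d]$. The number of feature dimensions $d$ is chosen by the user. Then a data point can be embedded into the $d$-dimensional feature space by

$$x \to y = \Theta^{T} z, \tag{13}$$

where $z$ is the low-rank representation of $x$.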
Since the low-rank representation jointly obtains the representation of all the samples under a global low-rank constraint, it captures the global data structures, and the lowest-rank representation is obtained in a parameter-free way, which is convenient and robust for various kinds of data. The low-rank kernel-based SDA algorithm can therefore improve the performance considerably. The steps of LRKSDA are as follows.
First, map the labeled and unlabeled data to the LR-graph kernel space. Second, execute the SDA algorithm for dimensionality reduction. Finally, execute the nearest neighbor method for the final classification in the derived low-dimensional feature subspace. The procedure of low-rank kernel-based SDA is described as follows.
Output. The classification results.
Step 1. Map the labeled and unlabeled data $X$ to the feature space by the LRR algorithm:

$$\min_{Z,E} \|Z\|_{*} + \lambda \|E\|_{2,1} \quad \text{s.t.} \quad X = XZ + E.$$

Step 2. Implement the SDA algorithm for dimensionality reduction.
Step 3. Execute the nearest neighbor method for final classification.
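The last part of the procedure, embedding the low-rank representations and classifying with the nearest neighbor rule, can be sketched as below. The sketch assumes that the coefficient matrix from Step 1 and the transformation matrix from Step 2 are already available; the function names `embed` and `nn_classify` and the use of Euclidean distance are hypothetical choices for illustration.

```python
import numpy as np

def embed(Theta, Z):
    """Embed low-rank representations (columns of Z) with y = Theta^T z."""
    return Theta.T @ Z

def nn_classify(Y_train, labels, Y_test):
    """1-nearest-neighbor classification; samples are columns of Y_train and Y_test."""
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances, shape (n_test, n_train)
    diff = Y_test[:, :, None] - Y_train[:, None, :]
    d2 = np.sum(diff ** 2, axis=0)
    return labels[np.argmin(d2, axis=1)]
```

Here `Y_train` would hold the embedded labeled samples and `Y_test` the embedded unlabeled samples whose labels are to be predicted.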

Experiments and Analysis
In this section, we conduct extensive experiments to examine the efficiency of the low-rank kernel-based SDA algorithm. The simulation experiments are conducted in the MATLAB 7.11.0 (R2010b) environment on a computer with an AMD Phenom(tm) II P960 1.79 GHz CPU and 2 GB RAM.
(1) Extended Yale Face Database B [2]. This database has 38 individuals and around 64 near-frontal images under different illuminations per individual. Each face image is resized to 32 × 32 pixels. We select the first 20 persons and choose 20 samples of each subject.

Experiment Overview
(2) ORL Database [22]. The ORL database contains 10 different images of each of 40 distinct subjects. The images were taken at different times, with varying lighting, facial expressions, and facial details. Each face image is manually cropped and resized to 32 × 32 pixels, with 256 grey levels per pixel.
(3) CMU PIE Face Database [2]. It contains 68 subjects with 41,368 face images. The face images were captured under varying poses, illuminations, and expressions. Each image is resized to 32 × 32 pixels. We select the first 20 persons and choose 20 samples per subject.
(5) Seeds Data Set. It contains 210 instances of three different wheat varieties. A soft X-ray technique and the GRAINS package were used to construct all seven real-valued attributes.
(6) SPECT Heart Data Set. The database describes the diagnosis of cardiac Single Photon Emission Computed Tomography (SPECT) images. Each patient is classified into one of two categories: normal and abnormal. The database of 267 SPECT image sets was processed to extract features that summarize the original SPECT images. The pattern was further processed to obtain 22 binary feature patterns.

Compared Algorithms.
In order to demonstrate how the semisupervised dimensionality reduction performance can be improved by low-rank kernel-based SDA, we compare it with the SDA, KSDA1, and KSDA2 algorithms. In all experiments, the number of nearest neighbors in the kNN regularizer graph is set to 4.
The classification accuracy is influenced by the kernel parameters. After comparison, we choose suitable kernel parameters for the KSDA1 and KSDA2 algorithms on each database, given as the following pairs, respectively: (0.9, 0.9) for Extended Yale Face Database B, (0.55, 1.5) for the ORL database, (0.9, 0.9) for the CMU PIE database, (0.65, 0.2) for the Musk database, (0.05, 0.6) for the Seeds Data Set, and (0.8, 0.3) for the SPECT Heart Data Set. Since the most suitable kernel parameters vary greatly across different random distributions, even of the same data, these values were found to be relatively suitable over many runs.

Experiment 1: Performance of Different Algorithms.
To examine the effectiveness of the proposed LRKSDA algorithm, we conduct experiments on the six public databases.
In our experiments, we randomly select 30% of the samples from each class as the labeled samples to evaluate the performance with different numbers of selected features. The evaluations are conducted with 20 independent runs for each algorithm, and the results are averaged to give the final results. First we use the different kernel methods to obtain the kernel mapping, and then we apply the SDA algorithm for dimensionality reduction. Finally, the nearest neighbor approach is employed for the final classification in the derived low-dimensional feature subspace. For each database, the classification accuracy of the different algorithms is shown in Figure 1. Table 1 shows the performance comparison of the different algorithms. Note that the reported results are the best results over all the numbers of selected features mentioned above. From these results, we can observe the following. In most cases, the proposed low-rank kernel-based SDA algorithm achieves the highest classification accuracy among the compared algorithms. LRKSDA achieves the best performance once the dimensionality exceeds a certain low value, and its classification accuracy is considerably higher than that of the other kernel SDA algorithms. It thus improves the classification performance to a large extent, which suggests that the low-rank kernel is more informative and more suitable for the SDA algorithm.
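The evaluation protocol described above (a random per-class labeled subset, repeated over independent runs and averaged) can be sketched as follows. The helper names `labeled_split` and `mean_accuracy`, the rounding rule for the number of labeled samples per class, and the seeding scheme are assumptions made for this illustration.

```python
import numpy as np

def labeled_split(y, frac, rng):
    """Randomly pick a fraction `frac` of the samples of each class as labeled.
    Returns (labeled_indices, unlabeled_indices)."""
    y = np.asarray(y)
    lab = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        n_lab = max(1, int(round(frac * len(idx))))     # at least one label per class
        lab.extend(rng.choice(idx, size=n_lab, replace=False))
    lab = np.sort(np.array(lab))
    unlab = np.setdiff1d(np.arange(len(y)), lab)
    return lab, unlab

def mean_accuracy(run_once, n_runs=20, seed=0):
    """Average the accuracy returned by run_once(rng) over independent runs."""
    rng = np.random.default_rng(seed)
    return float(np.mean([run_once(rng) for _ in range(n_runs)]))
```

In use, `run_once` would draw a split with `labeled_split(y, 0.3, rng)`, run the chosen dimensionality reduction and nearest neighbor classification, and return the accuracy on the unlabeled samples.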
Since proper kernel parameters are the most important factor for these traditional algorithms, and since the kernel parameters of the KSDA1 and KSDA2 algorithms are fixed global parameters, the two algorithms are very sensitive to different data or to different random distributions of the same data. The performance improvement of these KSDA methods is therefore not obvious. More seriously, because the labeled samples are selected randomly, the random distribution in each run may not suit the so-called proper kernel parameters of the KSDA1 and KSDA2 algorithms. Moreover, the traditional kernel mapping always analyzes the relationship of the data in a one-to-others mode, which emphasizes local information and lacks global constraints on the solutions. This may result in poor performance in some cases, whereas the low-rank representation is better at capturing the global data structures. In addition, the lowest-rank representation is obtained in a parameter-free way, which is convenient and robust for various kinds of data. Low-rank kernel-based SDA thus separates the different classes well compared with other kernel SDA methods and improves the performance considerably, which indicates that the proposed low-rank kernel method is effective.

Experiment 2: Influence of the Label Number.
We evaluate the influence of the label number in this part. The experiments are conducted with 20 independent runs for each algorithm, and the results are averaged to give the final results. The procedure is the same as in Experiment 1. For each database, we vary the percentage of labeled samples from 10% to 50%, and the recognition accuracy is shown in Tables 2 and 3, from which we observe the following.
In most cases, the proposed low-rank kernel-based SDA algorithm achieves the best results and is robust to variations in the label percentage. Some of the compared algorithms are not as robust as LRKSDA: their classification accuracy is very poor when the label rate is low. Thus, the proposed method is clearly superior to the traditional KSDA and SDA algorithms. These traditional methods may sometimes achieve good performance on some databases when the label rate is high enough, but they are not as robust as the proposed algorithm. Since labeled data are expensive and difficult to obtain, the proposed algorithm is more robust and better suited to real-world data. As mentioned in the previous part, since the low-rank kernel method obtains the kernel matrix in a parameter-free way, it is robust for different kinds of data. In contrast, for traditional kernels such as the Gaussian radial basis function kernel and the polynomial kernel, if the structure of the data does not fit the fixed kernel parameters, a good representation of the original data set cannot be obtained. Therefore, the low-rank kernel method is much more stable on all the data sets we use. Moreover, the low-rank representation jointly obtains the representation of all the samples under a global low-rank constraint, which captures the global data structures. It is therefore robust to variations in the label percentage, even when the label rate is low.

Experiment 3: Robustness to Different Types of Noise.
In this test we compare the performance of the different algorithms in a noisy environment. Extended Yale Face Database B and the Musk database are randomly selected for this experiment. Gaussian white noise, "salt and pepper" noise, and multiplicative noise are added to the data, respectively. The Gaussian white noise has mean 0 and variances varying from 0 to 0.1. The "salt and pepper" noise is added to the image with noise densities varying from 0 to 0.1. The multiplicative noise is added to the data $I$ using the equation $J = I + n \cdot I$, where $I$ and $J$ are the original and noised data and $n$ is uniformly distributed random noise with mean 0 and variance varying from 0 to 0.1. The proportion of labeled samples in each class is 30%. The experiments are conducted with 20 runs for each algorithm, and the results are averaged to give the final results. The procedure is the same as in Experiment 1. For each graph, we vary the parameter of the different noise types. The results are shown in Tables 4 and 5.
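The three noise models can be sketched as below. This is an illustrative NumPy version written under assumptions rather than the MATLAB routine used in the experiments: the function names are hypothetical, the data are assumed to be scaled to the range [0, 1] for the "salt and pepper" case, and no clipping is applied after adding noise.

```python
import numpy as np

def gaussian_noise(img, var, rng):
    """Additive Gaussian white noise with mean 0 and variance `var`."""
    return img + np.sqrt(var) * rng.standard_normal(img.shape)

def salt_and_pepper(img, density, rng):
    """Set roughly a fraction `density` of the entries to 0 (pepper) or 1 (salt)."""
    out = img.copy()
    r = rng.random(img.shape)
    out[r < density / 2] = 0.0
    out[(r >= density / 2) & (r < density)] = 1.0
    return out

def multiplicative_noise(img, var, rng):
    """J = I + n * I with n uniform, mean 0 and variance `var`.
    A uniform variable on [-a, a] has variance a^2 / 3, hence a = sqrt(3 * var)."""
    a = np.sqrt(3.0 * var)
    n = rng.uniform(-a, a, size=img.shape)
    return img + n * img
```

Sweeping `var` or `density` from 0 to 0.1 reproduces the range of noise levels described above.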
As we can see, our proposed low-rank kernel-based SDA algorithm always achieves the best results, which means that our method is stable for Gaussian noise, "salt and pepper" noise, and multiplicative noise.And because of the robustness of the low-rank representation to noise, our method LRKSDA is much more robust than other algorithms.With the different kinds of gradually increasing noise, the

4.1.1. Databases. The proposed LRKSDA is tested on six real-world databases, including three face databases and three University of California Irvine (UCI) databases. In these experiments, we normalize each sample to unit norm.

(4) Musk (Version 2) Data Set 2. This database contains 2 classes and 6598 instances with 166 features. Here, we randomly select 300 examples for the experiments.

Figure 1: Classification accuracy of different SDA algorithms on the six databases: (a) Extended Yale Face Database B, (b) ORL database, (c) CMU PIE face database, (d) Musk (Version 2) Data Set 2, (e) Seeds Data Set, and (f) SPECT Heart Data Set.

Table 1: Classification accuracy of different SDA algorithms on six databases.

Table 2: Classification accuracy of different graphs on ORL, Yale, and USPS databases.

Table 3: Classification accuracy of different graphs on Musk, Seeds, and SPECT Heart databases.

Table 4: Classification accuracy of different graphs with varying noise on Yale B database.