Manifold Adaptive Kernelized Low-Rank Representation for Semisupervised Image Classification

Constructing a powerful graph that effectively depicts the intrinsic connections among data points is the critical step for graph-based semisupervised learning algorithms to achieve promising performance. Among popular graph construction algorithms, low-rank representation (LRR) is a very competitive one that can simultaneously explore the global structure of data and recover the data from noisy environments. Therefore, the learned low-rank coefficient matrix in LRR can be used to construct the data affinity matrix. Considering that (1) the essentially linear nature of LRR makes it inappropriate for data with possibly nonlinear structure and (2) learning performance can be greatly enhanced by exploring the structure information of data, we propose a new manifold kernelized low-rank representation (MKLRR) model that performs LRR in the data manifold adaptive kernel space. Specifically, the manifold structure is incorporated into the kernel space by using the graph Laplacian, and thus the underlying geometry of data is reflected by the warped kernel space. Experimental results on semisupervised image classification tasks show the effectiveness of MKLRR. For example, MKLRR obtains 96.13%, 98.09%, and 96.08% accuracies on the ORL, Extended Yale B, and PIE data sets when given 5, 20, and 20 labeled face images per subject, respectively.


Introduction
Since it is usually not easy to collect a large number of labeled samples to train learning models, the semisupervised learning (SSL) paradigm, which can harness both labeled and unlabeled samples to improve learning performance, has drawn a lot of attention in recent studies [1][2][3][4][5][6][7]. Among existing SSL algorithms, graph-based algorithms are among the most popular approaches, in which label propagation can be performed on the graph [8][9][10][11]. The underlying idea of graph-based algorithms is to characterize the relationship between data pairs by an affinity matrix. Although researchers have pointed out that sparsity, high discriminative power, and adaptive neighborhood are desirable properties of a good graph [12], how to learn a good graph that accurately uncovers the latent relationships in data is still a challenging problem.
Among existing graph construction methods, the $k$-nearest-neighbor and $\epsilon$-neighborhood graphs are the two most widely used. However, they are usually sensitive to noisy environments, especially when data points contain outliers. To construct more effective graphs, many new algorithms have been proposed. The sparse graph [8], which is derived by encoding each datum as a sparse representation of the remaining samples, is parameter-free and insensitive to outliers. The sparse graph can automatically select the most informative neighbors for each datum. However, since sparse representation encodes each datum individually, the resultant sparse graph only emphasizes the local structure of data and neglects the global structure. This property deteriorates its performance, especially when data are grossly corrupted [13]. Different from sparse representation, which enforces the representation coefficient to be sparse [14], low-rank representation aims to learn the data
(3) We conduct extensive experiments on semisupervised image classification tasks to evaluate the effectiveness of MKLRR, and the experimental results show that MKLRR achieves promising performance.
The remainder of this paper is organized as follows. In Section 2, we give a brief review of the conventional LRR model and the semisupervised learning framework used in our work. Section 3 describes the model formulation, optimization method, and complexity analysis of the manifold adaptive kernelized LRR model in detail. Experimental studies of MKLRR on semisupervised image classification tasks are presented in Section 4. Section 5 concludes the paper and presents an extension of MKLRR as our future work.

Related Work
In this section, we give a brief review of the conventional low-rank representation model [15] and the semisupervised classification framework based on Gaussian Fields and Harmonic Functions (GHF) [1].
2.1. LRR. Given a set of samples $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$, LRR aims to represent each sample as a linear combination of the bases in $A = [a_1, a_2, \ldots, a_m] \in \mathbb{R}^{d \times m}$ by $X = AZ$, where $Z = [z_1, z_2, \ldots, z_n]$ is the matrix in which each $z_i$ is the representation coefficient corresponding to sample $x_i$. Therefore, each entry in $z_i$ can be viewed as the contribution to the reconstruction of $x_i$ with $A$ as the dictionary. LRR seeks the lowest rank solution by solving the following optimization problem [15]:

$$\min_{Z} \; \operatorname{rank}(Z) \quad \text{s.t.} \quad X = AZ. \tag{1}$$

It is NP-hard to directly optimize the rank function. Therefore, the trace norm (also called the nuclear norm) is usually used as the closest convex surrogate of the rank function, which leads to the following objective [29]:

$$\min_{Z} \; \|Z\|_{*} \quad \text{s.t.} \quad X = AZ, \tag{2}$$

where $\|\cdot\|_{*}$ denotes the sum of the singular values of a matrix [30]. Considering that samples are usually noisy or even grossly corrupted, a more reasonable objective for LRR can be expressed as

$$\min_{Z, E} \; \|Z\|_{*} + \lambda \|E\|_{2,1} \quad \text{s.t.} \quad X = AZ + E, \tag{3}$$

where $E \in \mathbb{R}^{d \times n}$ and $\|E\|_{2,1} = \sum_{j=1}^{n} \sqrt{\sum_{i=1}^{d} e_{ij}^{2}}$. The second term in (3) characterizes the error by modeling sample-specific corruptions. Some existing studies instead employ the $\ell_1$-norm to measure the error term [31, 32]. The optimal solution $Z^{*}$ can be obtained via the inexact augmented Lagrange multiplier method [31].
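To make the two norms in objective (3) concrete, the following minimal NumPy sketch evaluates them for given matrices. The function names `nuclear_norm`, `l21_norm`, and `lrr_objective` are illustrative choices and do not come from [15]; the sketch only evaluates the objective value and does not solve the optimization problem.

```python
import numpy as np

def nuclear_norm(Z):
    """Sum of the singular values of Z (trace norm / nuclear norm)."""
    return np.linalg.svd(Z, compute_uv=False).sum()

def l21_norm(E):
    """Sum of the Euclidean norms of the columns of E (one column per sample)."""
    return np.linalg.norm(E, axis=0).sum()

def lrr_objective(Z, E, lam):
    """Value of ||Z||_* + lam * ||E||_{2,1}; lam is the trade-off weight."""
    return nuclear_norm(Z) + lam * l21_norm(E)
```

Because the $\ell_{2,1}$-norm sums column norms, it stays small when only a few columns of $E$ (that is, a few samples) carry large errors, which is the sample-specific corruption model described above.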

2.2. GHF.
Assume that we have a data set $X = [x_1, \ldots, x_l, x_{l+1}, \ldots, x_n] \in \mathbb{R}^{d \times n}$ from $c$ classes, where $x_i$, $i = 1, \ldots, l$, and $x_j$, $j = l + 1, \ldots, n$, are the labeled and unlabeled samples, respectively. The label indicator matrix $Y \in \mathbb{R}^{n \times c}$ is defined as follows: for each sample $x_i$ ($i = 1, \ldots, n$), $y_i \in \mathbb{R}^{c}$ is its label vector. If $x_i$ is from the $k$th ($k = 1, 2, \ldots, c$) class, then only the $k$th entry of $y_i$ is one and all the other entries are zeros. If $x_i$ is an unlabeled sample, then $y_i = 0$.
GHF is a well-known graph-based semisupervised learning framework in which the predicted label matrix $F \in \mathbb{R}^{n \times c}$ is estimated on the graph with respect to the label fitness and the manifold smoothness. Let $f_i$ and $y_i$, respectively, denote the $i$th rows of $F$ and $Y$. GHF tries to minimize the following objective:

$$\min_{F} \; \frac{1}{2} \sum_{i,j=1}^{n} s_{ij} \|f_i - f_j\|^{2} + \lambda_{\infty} \sum_{i=1}^{l} \|f_i - y_i\|^{2}, \tag{4}$$

where $\lambda_{\infty}$ is a very large value such that $\sum_{i=1}^{l} \|f_i - y_i\|^{2} = 0$ is approximately satisfied, and $S \in \mathbb{R}^{n \times n}$ is an affinity matrix that depicts the pairwise similarity of samples. Objective (4) can be rewritten in the compact matrix form

$$\min_{F} \; \operatorname{tr}(F^{T} L_S F) + \operatorname{tr}\big((F - Y)^{T} U (F - Y)\big), \tag{5}$$

where the graph Laplacian matrix $L_S \in \mathbb{R}^{n \times n}$ is calculated as $L_S = D - S$, and $D$ is the diagonal degree matrix with $D_{ii} = \sum_{j} s_{ij}$ (or $\sum_{j} s_{ji}$, since $S$ is usually a symmetric matrix). $U$ is also a diagonal matrix, whose first $l$ diagonal entries are $\lambda_{\infty}$ and whose remaining $n - l$ diagonal entries are 0.
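As an illustration of the compact form (5), setting its gradient with respect to $F$ to zero gives $(L_S + U)F = UY$, which can be solved directly. The sketch below assumes a symmetric affinity matrix with the labeled samples ordered first; the function name `ghf_predict` and the default value `1e6` standing in for the "very large" weight are assumptions made for this example only.

```python
import numpy as np

def ghf_predict(S, Y, n_labeled, lam_inf=1e6):
    """Solve (L_S + U) F = U Y, the stationarity condition of objective (5).

    S         : (n, n) symmetric affinity matrix, labeled samples first.
    Y         : (n, c) label indicator matrix (zero rows for unlabeled samples).
    n_labeled : number of labeled samples l.
    lam_inf   : assumed stand-in for the very large label-fitting weight.
    """
    n = S.shape[0]
    L = np.diag(S.sum(axis=1)) - S      # graph Laplacian L_S = D - S
    u = np.zeros(n)
    u[:n_labeled] = lam_inf             # diagonal of U
    U = np.diag(u)
    F = np.linalg.solve(L + U, U @ Y)   # predicted soft labels
    return F, F.argmax(axis=1)          # soft labels and hard class indices
```

The linear system is well posed whenever every connected component of the graph contains at least one labeled sample, since $L_S + U$ is then positive definite.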

Manifold Adaptive Low-Rank Representation
3.1. Manifold Adaptive Kernel. In this section, we show how to incorporate the manifold structure into the reproducing kernel Hilbert space (RKHS), which leads to the manifold adaptive kernel space. The kernel trick is usually applied with the hope of discovering the nonlinear structure in data by mapping the original nonlinear observations into a higher dimensional linear space [33]. The most commonly used kernels are the Gaussian and polynomial kernels. However, the nonlinear structure captured by such data-independent kernels may not be consistent with the intrinsic manifold structure, such as geodesic distance, curvature, and homology [34, 35].
In this work, we adopt the manifold adaptive kernel proposed in [28]. Let $\mathcal{V}$ be a linear space with a positive semidefinite inner product (quadratic form) and let $S : \mathcal{H} \to \mathcal{V}$ be a bounded linear operator. We define $\tilde{\mathcal{H}}$ to be the space of functions from $\mathcal{H}$ with the manifold inner product

$$\langle f, g \rangle_{\tilde{\mathcal{H}}} = \langle f, g \rangle_{\mathcal{H}} + \langle Sf, Sg \rangle_{\mathcal{V}}.$$

$\tilde{\mathcal{H}}$ is still an RKHS [28]. Given samples $x_1, \ldots, x_n$, let $S : \mathcal{H} \to \mathbb{R}^{n}$ be the evaluation map $S(f) = (f(x_1), \ldots, f(x_n))$.
It can be shown [28] that the reproducing kernel in $\tilde{\mathcal{H}}$ has a closed form, (10), expressed in terms of an identity matrix $I$, the kernel matrix $K$ in $\mathcal{H}$, the matrix $M$, and a nonnegative constant controlling the smoothness of the functions. The key issue now is the choice of $M$, so that the deformation of the kernel induced by the data-dependent norm is motivated with respect to the intrinsic geometry of the data. Without loss of generality, we assume that a subset of the data points is utilized to derive the linear space $\mathcal{V}$. Formulation (10) can then be rewritten in the compact matrix form (11), in which the identity matrix has the same size as the kernel matrix computed on those points. The resulting matrix is referred to as the kernel matrix in the warped RKHS.
As mentioned above, the manifold structure can be discovered by the graph Laplacian associated with the data points.

The Objective Function.
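The objective of kernel low-rank representation (KLRR) was formulated in [11]. In order to learn a low-rank representation that is consistent with the manifold geometry, it is natural to take advantage of the manifold adaptive kernel in KLRR.
In order to model the manifold structure, we construct a nearest-neighbor graph $G$. For each data point $x_i$, we find its $k$ nearest neighbors, denoted by $N(x_i)$, and put an edge between $x_i$ and its neighbors. There are many choices for the weight matrix on the graph, and we use the "0-1" form defined as follows:

$$W_{ij} = \begin{cases} 1, & \text{if } x_i \in N(x_j) \text{ or } x_j \in N(x_i), \\ 0, & \text{otherwise.} \end{cases}$$

The graph Laplacian [36] is defined as $L = D - W$, where $D$ is a diagonal degree matrix with $D_{ii} = \sum_{j} W_{ij}$ (or $\sum_{j} W_{ji}$, since $W$ is symmetric). The graph Laplacian provides the following smoothness penalty on the graph:

$$\mathbf{f}^{T} L \mathbf{f} = \frac{1}{2} \sum_{i,j=1}^{n} \big(f(x_i) - f(x_j)\big)^{2} W_{ij}, \quad \mathbf{f} = (f(x_1), \ldots, f(x_n))^{T}.$$

Therefore, it is natural to substitute $M$ with the graph Laplacian $L$. For convenience, we make use of all the available data points to derive the linear space $\mathcal{V}$ in the warped RKHS; then (11) can be rewritten with $M$ replaced by $L$, which gives (15). The resulting kernel matrix is denoted $\tilde{K}_M$ to indicate that it lies in a manifold RKHS.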
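The graph construction and kernel warping described above can be sketched in NumPy as follows. This is only a sketch under stated assumptions: the warped kernel is written in the form given in [28], $\tilde{K} = K - K(I + MK)^{-1}MK$, with $M = \delta L$, where the symbol $\delta$ for the nonnegative smoothness constant and all function names are choices made for this example. The exact placement of the constant in the paper's equations (10), (11), and (15) may differ.

```python
import numpy as np

def knn_01_graph(X, k):
    """Symmetric 0-1 k-nearest-neighbor weight matrix; columns of X are samples."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)   # squared distances
    np.fill_diagonal(D2, np.inf)                       # exclude self-neighbors
    idx = np.argsort(D2, axis=1)[:, :k]                # k nearest per sample
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(W, W.T)                          # edge if either is a neighbor

def graph_laplacian(W):
    """L = D - W with D the diagonal degree matrix."""
    return np.diag(W.sum(axis=1)) - W

def warped_kernel(K, L, delta):
    """K - K (I + M K)^{-1} M K with M = delta * L (form of [28], assumed here)."""
    n = K.shape[0]
    M = delta * L
    return K - K @ np.linalg.solve(np.eye(n) + M @ K, M @ K)
```

With `delta = 0` the function returns the original kernel matrix unchanged, so no manifold information is used. For positive `delta`, the result remains symmetric positive semidefinite whenever `K` and `L` are.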
Using the nuclear norm to replace the rank function, we arrive at objective (16) of the manifold adaptive kernelized LRR. Figure 1 shows the connection between MKLRR and LRR as well as its variants. As we can see, LRR variants such as GLRR, LRRLC, and MLRR can be reached by incorporating manifold information. By using the kernel trick, the KLRR model can find the lowest rank representation in the RKHS. Further, by considering the geometric structure of data in the RKHS, we can formulate the MKLRR model. Both KLRR and MKLRR are nonlinear models, since an implicit nonlinear mapping is employed.

Optimization.
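To make objective (16) separable, we introduce an auxiliary variable $J$ with respect to $Z$ and obtain an equivalent constrained objective. The corresponding augmented Lagrangian function is (18), where $Y_1$ and $Y_2$ are Lagrange multipliers and the penalty parameter is positive. The inexact augmented Lagrange multiplier (ALM) algorithm is employed to optimize objective (18) [31]. The detailed optimization process is summarized in Algorithm 1.
The updating rule for $J$ is based on the singular value thresholding operator, which is given by the following theorem [30].
Theorem 1. Let $W$ be a given matrix with singular value decomposition $W = U \Sigma V^{T}$, and let $\tau \ge 0$. The optimal solution to

$$\min_{X} \; \tau \|X\|_{*} + \frac{1}{2} \|X - W\|_{F}^{2}$$

is given by $X^{*} = U S_{\tau}(\Sigma) V^{T}$, where $S_{\tau}(\Sigma)$ replaces each singular value $\sigma_i$ by $\max(\sigma_i - \tau, 0)$.
The updating rule for $E$ can be obtained by the soft-shrinkage operator [15], which is given below.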
Theorem 2. Let $Q = [q_1, q_2, \ldots, q_i, \ldots]$ be a given matrix and let $\|\cdot\|_{F}$ be the Frobenius norm. If the optimal solution to

$$\min_{W} \; \alpha \|W\|_{2,1} + \frac{1}{2} \|W - Q\|_{F}^{2}$$

is $W^{*}$, then the $i$th column of $W^{*}$ is

$$W^{*}(:, i) = \begin{cases} \dfrac{\|q_i\| - \alpha}{\|q_i\|} \, q_i, & \text{if } \alpha < \|q_i\|, \\ 0, & \text{otherwise.} \end{cases}$$

The remaining steps of Algorithm 1 are as follows:
(3) Fix the others and update Z.
(4) Fix the others and update E.
(5) Update the multipliers.

Below we give a brief analysis of the computational complexity of MKLRR. Constructing the $k$-nearest-neighbor graph in the first step of MKLRR has a cost that is quadratic in the number of samples $n$. In the second step, computing the data-independent kernel matrix $K$ is likewise quadratic in $n$, and computing the manifold adaptive kernel matrix is cubic in $n$. In the fourth step, semisupervised learning based on GHF has cubic complexity. For Algorithm 1, the main computational burden lies in the updating of $J$, since it involves the singular value decomposition (SVD). Specifically, in equation ($*$) in Algorithm 1, the SVD is performed on an $n \times n$ matrix, which is time-consuming if the number of samples $n$ is large. As noted in [37], by substituting $A$ with the orthogonal basis of the dictionary, this cost can be reduced to one that depends on the rank of the dictionary $A$. The cost of updating $Z$ is trivial owing to its simple closed-form solution, and the cost of updating $E$ is quadratic. Thus, the overall cost of MKLRR-based semisupervised learning combines the per-iteration costs of Algorithm 1, multiplied by the number of iterations of its loop, with the quadratic and cubic costs of the remaining steps.
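The singular value thresholding operator used for the $J$ update and the column-wise soft-shrinkage operator used for the $E$ update can be sketched as two short NumPy functions. The names `svt` and `l21_shrink` and the small constant guarding against division by zero are assumptions of this example; the sketch implements only the two proximal operators and not the full ALM loop of Algorithm 1.

```python
import numpy as np

def svt(W, tau):
    """Singular value thresholding: argmin_X tau*||X||_* + 0.5*||X - W||_F^2."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)          # shrink each singular value
    return (U * s_shrunk) @ Vt                   # U diag(s_shrunk) V^T

def l21_shrink(Q, alpha):
    """Column-wise shrinkage: argmin_W alpha*||W||_{2,1} + 0.5*||W - Q||_F^2."""
    norms = np.linalg.norm(Q, axis=0)            # Euclidean norm of each column
    scale = np.maximum(norms - alpha, 0.0) / np.maximum(norms, 1e-12)
    return Q * scale                             # columns with norm <= alpha become 0
```

The SVD inside `svt` is the step identified above as the main computational burden of each iteration.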

Experiments
This section evaluates the effectiveness of the proposed MKLRR algorithm on semisupervised classification tasks. Specifically, we will compare the performance of MKLRR

(i) kNN: if one sample is among the $k$ nearest neighbors of the other, then these two samples are viewed as connected. In kNN-1, $k$ is set to 5; in kNN-2, $k$ is set to 8. The distance information is measured by the heat kernel function, where the variance is the average of the squared Euclidean distances over all connected pairs on the graph.
(ii) $\ell_1$ graph [8, 38]: the $\ell_1$-norm regularized least squares problem is optimized by the package of [39]. The regularization parameter that enforces sparsity is searched from $\{10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^{2}\}$.
(iii) LNP (linear neighborhood propagation): we follow the pipeline in [40] to construct the graph. The neighborhood size in LNP is set to 40.
(iv) SPG (sparse probability graph) [41]: we implement the SPG algorithm by setting its two parameters to one-quarter of the size of the data set and 0.001, respectively, as suggested by [41].
(v) LRR (low-rank representation) [15]: for all data sets, we tune the parameter in the range $\{10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^{2}\}$ to achieve the best performance.
(vi) GLRR (graph regularized low-rank representation) [17]: in [17], the accelerated gradient method [42] was employed to optimize GLRR by updating $J$, the auxiliary variable corresponding to $Z$, whereas in our implementation the GLRR objective function is relaxed as described in [10] and $J$ is updated by using the SVT operator [31].
(vii) NNLRS (nonnegative low-rank and sparse graph) [9]: we construct the LRR graph with nonnegative and sparse properties. The weighting parameters are set as guided in [9].
(viii) LRCB (low-rank representation with $b$-matching constraint) [20]: as suggested in [20], we set its two regularization parameters to 2 and 0.03 for all four data sets. The parameter $b$ is set to 5, 5, 10, and 5 on the ORL, Extended Yale B, PIE, and USPS data sets, respectively.
(ix) KLRR (kernel low-rank representation) [11].

For both KLRR and MKLRR, we use the Gaussian kernel function defined as $k(x_i, x_j) = \exp(-\|x_i - x_j\|^{2} / 2\sigma^{2})$, and the bandwidth parameter $\sigma$ is set to the mean value of the distances over all data pairs. When constructing the weight matrix, the number of nearest neighbors $k$ is set to 5. The regularization parameter of the noise term, as in the conventional LRR model, is searched from the candidate values $\{10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^{2}\}$. Similar to the usage in [43], we fix the regularization parameter of the manifold adaptive kernel to 1 in all experiments below.
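The Gaussian kernel with the data-driven bandwidth described above can be sketched as follows. The function name `gaussian_kernel` is an illustrative choice, and averaging the Euclidean distances over distinct pairs (excluding each sample paired with itself) is an assumption of this example, since the text does not say whether self-pairs are counted.

```python
import numpy as np

def gaussian_kernel(X):
    """Gaussian kernel matrix for samples stored as the columns of X.

    The bandwidth sigma is the mean Euclidean distance over all distinct
    pairs of samples (an assumed reading of "all data pairs").
    """
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X.T @ X), 0.0)
    dist = np.sqrt(D2)
    sigma = dist[np.triu_indices(n, k=1)].mean()   # mean pairwise distance
    K = np.exp(-D2 / (2.0 * sigma ** 2))
    return K, sigma
```

Tying the bandwidth to the mean pairwise distance keeps the kernel values on a comparable scale across data sets with different feature magnitudes, which avoids tuning $\sigma$ separately for each data set.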

Experiment on Synthetic Data.
Similar to studies [11, 15], a synthetic data set is constructed as follows. We construct 5 independent subspaces $\{S_i\}_{i=1}^{5} \subset \mathbb{R}^{100}$ whose bases $\{U_i\}_{i=1}^{5}$ are computed by $U_{i+1} = TU_i$, $1 \le i \le 4$, where $T$ is a random rotation and $U_1$ is a random orthogonal matrix with dimension $100 \times 100$. Therefore, each subspace has a dimension of 100. We sample 200 data vectors from each subspace by $X_i = U_i Q_i$, $1 \le i \le 5$, with $Q_i$ being a $100 \times 200$ i.i.d. $N(0, 1)$ matrix. We randomly choose 30% of the total samples to corrupt. For example, if data vector $\mathbf{x}$ is chosen, its observed vector is computed by adding Gaussian noise with zero mean and variance $0.3\|\mathbf{x}\|_2$.
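The construction above can be sketched in NumPy as follows. Several details are assumptions of this example rather than statements from the text: the function name and its arguments, drawing the random orthogonal matrices via QR decomposition of Gaussian matrices (which gives an orthogonal matrix that is not guaranteed to be a proper rotation), and reading the noise variance as $0.3$ times the Euclidean norm of the chosen vector.

```python
import numpy as np

def make_synthetic(n_subspaces=5, dim=100, basis_cols=100, n_per=200,
                   corrupt_frac=0.3, noise_scale=0.3, seed=0):
    """Sample points from subspaces U_{i+1} = T U_i and corrupt a fraction of them."""
    rng = np.random.default_rng(seed)
    # Random orthogonal basis U_1 and random orthogonal map T (QR of Gaussian draws).
    U, _ = np.linalg.qr(rng.standard_normal((dim, basis_cols)))
    T, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    blocks, labels = [], []
    for i in range(n_subspaces):
        Q = rng.standard_normal((basis_cols, n_per))   # i.i.d. N(0, 1) coefficients
        blocks.append(U @ Q)                           # X_i = U_i Q_i
        labels.append(np.full(n_per, i))
        U = T @ U                                      # next basis U_{i+1} = T U_i
    X = np.hstack(blocks)
    y = np.concatenate(labels)
    n = X.shape[1]
    idx = rng.choice(n, size=int(round(corrupt_frac * n)), replace=False)
    # Zero-mean Gaussian noise with variance noise_scale * ||x|| per chosen column.
    std = np.sqrt(noise_scale * np.linalg.norm(X[:, idx], axis=0))
    X[:, idx] += rng.standard_normal((dim, idx.size)) * std
    return X, y, idx
```

With the default arguments the function returns a $100 \times 1000$ data matrix, the subspace label of each column, and the indices of the 300 corrupted columns.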
We select different numbers of labeled samples to evaluate the performance of different graph construction methods. Table 1 shows the classification accuracies of different graphs on the synthetic data set. The results are obtained from ten independent runs. From the results, we can see that all LRR variants achieve good performance even when given only a few labeled samples. KLRR is slightly better than MKLRR, by 0.26%, when there is only one labeled sample per class. MKLRR obtains the best results in all the remaining cases.
Since GLRR also incorporates the structure information of data into LRR, we show the block diagonal structures learned by LRR, GLRR, KLRR, and MKLRR in Figure 2. Although the visual discrepancies between MKLRR and its counterparts are minor, the block diagonal structure obtained by MKLRR is clearer than the others. Most of the values within each block of the MKLRR graph are larger than those of the KLRR graph.

Experiment on ORL Data Set.
The ORL data set contains ten different images of each of 40 distinct subjects. The images were taken at different times, varying the lighting, facial expressions, and facial details. Each image is manually cropped and normalized to a size of 32 × 32 pixels. Figure 3 shows some example images of two subjects from the ORL data set.
We repeat all the experiments ten times. Each time, we randomly select a subset of images from each subject to create a labeled sample set. In this experiment, 1, 2, 3, 4, and 5 images per subject are randomly selected as labeled samples and the remaining images are regarded as unlabeled samples. The random indices are kept the same for all compared algorithms. The classification accuracies of different algorithms with different numbers of labeled samples on the ORL data set are shown in Table 2, in which MKLRR outperforms all the compared algorithms. For example, when we select 1, 2, 3, 4, and 5 images per person as labeled samples, the accuracies of MKLRR are higher than those of the second best algorithm by 1.88% (LRCB), 3.18% (KLRR), 2.18% (SPG), 0.85% (SPG), and 1.28% (SPG), respectively.

Experiment on Extended Yale B Data Set.
Figure 4 shows some images of two subjects from the Extended Yale B data set. We use the first 20 subjects, which gives 1262 images in total, to evaluate different methods. In this experiment, 4, 8, 12, 16, and 20 images per subject are randomly selected as labeled samples and the remaining images are regarded as unlabeled samples. The random indices are kept the same for all compared algorithms. Table 3 shows the classification accuracies of different algorithms with different numbers of labeled samples on the Extended Yale B data set. With an increasing number of labeled samples, all algorithms obtain better classification results. Although the results are close to being saturated, MKLRR still makes some improvements. For example, it reaches an accuracy of 98.09% when given 20 labeled images per person, which is 1.13% higher than that of KLRR. In particular, when given a small number of labeled samples, MKLRR shows great superiority over the remaining algorithms. There is about a 3% improvement of MKLRR over KLRR when the number of labeled samples per person is only 4. Since there is some noise in this data set, the performance of the basic kNN algorithm decreases greatly.

Experiment on PIE Data Set.
The CMU PIE data set contains 41368 images of 68 subjects with different poses, illumination, and expressions. We only use the images in five near-frontal poses (C05, C07, C09, C27, and C29) under different illumination and expressions. The first 15 subjects are selected, giving 2550 face images in total. Each image is manually cropped and resized to 32 × 32 pixels. Figure 5 shows some images of two subjects from the PIE data set.
As with the Extended Yale B data set, we select 4, 8, 12, 16, and 20 images per subject as labeled samples and let the remaining images be unlabeled samples. The random indices are kept the same for all compared algorithms. Table 4 shows the classification accuracies of different algorithms with different numbers of labeled samples on the PIE data set. MKLRR outperforms the other algorithms in all cases. In particular, MKLRR performs much better than the others when given 4 labeled samples per person.
4.6. Experiment on USPS Data Set. The USPS digit database [44] consists of 9298 handwritten digit images of 10 numbers (0-9). The size of each image is 16 × 16 pixels. We select 200 samples from each class, and thus the resultant data set has 2000 images in total. Figure 6 shows some images of the 10 numbers from the USPS data set.
In this experiment, we randomly select 10%, 20%, 30%, 40%, and 50% of the samples per digit as labeled samples and let the remaining images be unlabeled samples. The random indices are kept the same for all compared algorithms. Table 5 shows the classification accuracies of different algorithms with different numbers of labeled samples on the USPS data set. All algorithms, including the simple kNN algorithm, obtain excellent performance on this data set. The classification accuracies of MKLRR are higher than those of the other algorithms in most cases.

Parameter Sensitivity Analysis.
There are two important parameters in MKLRR: the regularization parameter used to construct the manifold adaptive kernel and the parameter that controls the impact of the noise term. MKLRR reduces to KLRR when the former is set to zero. In our previous experiments, we empirically fixed the kernel regularization parameter to one, following similar ideas in [43, 45]. In this section, we analyze the sensitivity of the two parameters by investigating one while fixing the other. Figure 7 shows how the performance of MKLRR varies with one of the two parameters on the Extended Yale B and PIE data sets, respectively, with the other fixed to 1. Here, four images per person are labeled and the remaining are unlabeled. To make the comparison easier, we also include the results of KLRR in the figure. We find that $[10^{-2}, 10]$ is a reasonable interval for selecting the value of this parameter.
Figure 8 shows how the performance of MKLRR varies with the other parameter on the Extended Yale B and PIE data sets, respectively, with the first fixed to 1. There are again four labeled samples for each person. Generally, MKLRR is insensitive to the variation of this parameter if it is set to a slightly large value.
For the remaining data sets, the parameter sensitivities of MKLRR with respect to the two parameters show tendencies similar to those in Figures 7 and 8.

Conclusion and Future Work
In this paper, we have proposed a new low-rank representation model for semisupervised image classification, called manifold adaptive kernel low-rank representation (MKLRR). Different from most existing LRR variants, which consider the structure information in the original data space, our proposed model explicitly takes the intrinsic manifold structure depicted by a nearest-neighbor graph into consideration. The graph Laplacian corresponding to the local geometry of the data is incorporated into the manifold adaptive kernel space, in which the low-rank representation model is then calculated. Extensive experiments performed on both synthetic and benchmark data sets have shown the excellent performance of the MKLRR-based graph for semisupervised classification when given limited labeled samples.
As a limitation of general two-stage graph-based semisupervised learning methods, the information of labeled samples is neglected in the graph construction stage. Therefore, it is necessary to take this point into consideration in order to construct a more discriminative graph. This will be our future work, and one possible approach is to introduce into the LRR model a constraint matrix that depicts the partial label information of data.

(8) end while
Algorithm 1: Optimization to (18).

(iv) Semisupervised classification: calculate the Laplacian matrix $L_Z = D - Z$ and do semisupervised classification based on (5).

with some state-of-the-art graph construction methods on four representative image data sets. All experiments are conducted on a platform with an Intel(R) Core(TM) i7-4700MQ CPU @ 2.40 GHz, 16.0 GB RAM, Windows 8.1, and Matlab 2013a.

4.1. Experimental Settings. For the comparison methods, several baseline methods are compared, including some state-of-the-art graph-based semisupervised learning methods:

Yale B Data Set. The Extended Yale B data set consists of 2414 human face images of 38

Figure 3: Example face images of two subjects from the ORL data set.

Figure 4: Example face images of two subjects from the Extended Yale B data set.

Figure 5: Example face images of two subjects from the PIE data set.

Figure 6: Example digit images from the USPS data set.
Input: data points $\{(x_i, y_i)\}_{i=1}^{l} \cup \{x_j\}_{j=l+1}^{n}$, regularization parameters.
(ii) kernel $\tilde{K}_M$ in the warped RKHS according to (15).
(iii) Manifold kernel low-rank representation: optimize the MKLRR model and obtain the low-rank representation coefficient matrix $Z$ via Algorithm 1. Shrink some small values in $Z$ and then make it symmetric and nonnegative as $Z = (|Z| + |Z^{T}|)/2$.

Algorithm 1 (input and first steps). Input: data points, and a tolerance of $10^{-8}$; Output: the low-rank representation coefficient matrix $Z$.
(1) while not converged do
(2) Fix the other variables and update J by
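The post-processing of the coefficient matrix in step (iii) and the Laplacian used in step (iv) can be sketched as follows. The function name `affinity_from_Z` and the threshold value `1e-4` used to "shrink some small values" are hypothetical choices made for this example; the text does not specify the threshold.

```python
import numpy as np

def affinity_from_Z(Z, threshold=1e-4):
    """Turn a low-rank coefficient matrix into a graph affinity and its Laplacian.

    threshold is an assumed cutoff below which entries of Z are set to zero.
    """
    Z = np.where(np.abs(Z) < threshold, 0.0, Z)   # shrink small values
    S = (np.abs(Z) + np.abs(Z.T)) / 2.0           # symmetric, nonnegative affinity
    L = np.diag(S.sum(axis=1)) - S                # Laplacian L = D - S
    return S, L
```

The returned affinity matrix can be passed to a GHF-style label propagation step in place of a $k$-nearest-neighbor or sparse graph.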

Table 1: Classification accuracies (%) of different graphs on the synthetic data set.

Table 2: Classification accuracies (%) of different algorithms with different numbers of labeled samples on the ORL data set.

Table 3: Classification accuracies (%) of different algorithms with different numbers of labeled samples on the Extended Yale B data set.

Table 4: Classification accuracies (%) of different algorithms with different numbers of labeled samples on the PIE data set.

Table 5: Classification accuracies (%) of different algorithms with different numbers of labeled samples on the USPS data set.