Supervised Kernel Optimized Locality Preserving Projection with Its Application to Face Recognition and Palm Biometrics

The Kernel Locality Preserving Projection (KLPP) algorithm can effectively preserve the neighborhood structure of a database using the kernel trick. It is known that supervised KLPP (SKLPP) can preserve within-class geometric structures by using label information. However, the conventional SKLPP algorithm suffers from the kernel selection problem, which has a significant impact on its performance. To overcome this limitation, a method named supervised kernel optimized LPP (SKOLPP) is proposed in this paper, which maximizes the class separability in kernel learning. The proposed method maps the data from the original space to a higher-dimensional kernel space using a data-dependent kernel. The adaptive parameters of the data-dependent kernel are calculated automatically by optimizing an objective function. Consequently, the nonlinear features extracted by SKOLPP have greater discriminative ability than those of SKLPP and are more adaptive to the input data. Experimental results on the ORL, Yale, AR, and Palmprint databases show the effectiveness of the proposed method.


Introduction
In recent years, kernel methods have been widely studied for feature extraction and pattern recognition. They map the input data into a kernel space under the assumption that the nonlinear problem becomes a linear one there, which can then be conveniently solved by linear algorithms. However, different kernel geometrical structures give different class discriminations, and an inappropriate choice of kernel function can have disastrous effects, because the kernel matrix determines the geometrical structure of the mapped data in the kernel space. Thus it is necessary to use an adaptively optimized kernel function to improve classification performance. Since optimizing the kernel parameters cannot change the geometrical structure of the kernel in the feature space [1,2], Schölkopf et al. [3] proposed an empirical kernel map which maps the original input data onto a subspace of the empirical feature space; the training data have the same geometrical structure in both the empirical feature space and the kernel space, and the former is easier to access than the latter. Cristianini et al. [4] and Lanckriet et al. [5], respectively, employed the alignment and the margin as measures of data separation to evaluate the adaptability of a kernel to the input data. He and Niyogi [6] pointed out that locality preserving projection (LPP) can be extended to the nonlinear case with the kernel trick. To exploit this merit of LPP, Wang and Lin [7] proposed supervised kernel LPP (SKLPP), which improves kernel LPP (KLPP) by using class information in the kernel feature extraction process. Li et al. [8] extended LPP with a nonparametric similarity measure and then optimized the kernel with the maximum margin criterion for feature extraction and recognition. Sun and Zhao [9] proposed a normalized-Laplacian-based optimal LPP method. Lu et al. [10] proposed a regularized generalized discriminant LPP approach. Lu and Tan [11] proposed a parametric regularized LPP. Pang and Yuan [12] proposed to substitute the L2-norm with the L1-norm to improve the robustness of LPP against outliers. Although SKLPP showed good performance in [7], the selection of the kernel function has a significant influence on kernel feature extraction, and this problem has been widely studied in previous works [7, 13-15]. In [15], we proposed a Kernel Optimized PCA (KOPCA) with a sparse representation-based classifier (SRC). Although both KOPCA and SKOLPP aim to improve the recognition rate by optimizing the kernel function, they employ different feature extraction methods: KOPCA extracts features by PCA, whereas SKOLPP extracts features by LPP. In [16], we proposed a Supervised Gabor-wavelet-based Kernel Locality Preserving Projections (SGKLPP) method, which integrates the Gabor-wavelet representation of face images and the SKLPP method to improve the recognition rate. Gabor wavelets extract the features brought by illumination and facial expression changes, and SKLPP solves the nonlinear feature extraction and classification problem.
In [14], Pan et al. applied the kernel optimization technique of [17] to kernel discriminant analysis (KDA), yielding adaptive quasiconformal kernel discriminant analysis (AQKDA). Different from kernel optimization based on the Fisher Criterion [18], the maximum margin criterion (MMC) was chosen to extract features by maximizing the average margin between different classes of data in the quasiconformal kernel mapping space.
Li et al. proposed class-wise locality preserving projection (CLPP), which utilizes class information for feature extraction [19]. In CLPP, a nonparametric similarity measure for LPP was proposed, and the kernel optimized with the maximum margin criterion was then used for feature extraction. According to the nonparametric similarity measure, the local structure of the original data is constructed taking into consideration both the local information and the class label information. Moreover, Li et al. applied the kernel trick to CLPP to increase its performance on nonlinear feature extraction.
In [8], Li et al. proposed the Kernel Self-optimized Locality Preserving Discriminant Analysis (KSLPDA), which integrates CLPP [19] and the data-dependent-kernel-based MMC [14] to form a constrained optimization problem.
In [20], Li et al. proposed the Quasiconformal Kernel Common Locality Discriminant Analysis (QKCLDA), in which the quasiconformal kernel based on the Fisher Criterion is used for breast cancer diagnosis. The procedure of QKCLDA has two steps: first, the original data is mapped to a low-dimensional space via quasiconformal kernel locality projection; second, the low-dimensional data is mapped to a common space.
In SKOLPP, we first construct a data-dependent kernel [18] to maximize the class separability based on the Fisher Criterion. Then, we use the gradient descent method to optimize the objective function, from which the combination coefficients are obtained. Last, integrating the supervised kernel locality preserving projections [21], the optimized kernel LPP is used to extract features. In short, SKOLPP aims to optimize the kernel function: by retaining local information and optimizing the kernel through maximizing the between-class distance, SKOLPP surpasses the above methods.
The paper is organized as follows. In Section 2, we optimize the kernel in the empirical feature space by seeking the optimal combination coefficients of a data-dependent kernel based on the Fisher Criterion. In Section 3, we employ the optimized kernel function to construct the supervised kernel optimized LPP (SKOLPP). Finally, in Section 4, experiments are executed on the ORL, Yale, AR, and Palmprint databases to demonstrate the effectiveness of the optimized kernel in classification.

Kernel Optimization in the Empirical Feature Space

Data-Dependent Kernel.
The geometrical structure of the data in the feature space is determined by the kernel function, which means that choosing different kernels may induce different class discrimination performance [4].
Because there is no general kernel function suitable for all databases, it is necessary to choose a data-dependent kernel. In this paper, a data-dependent kernel similar to that used in [13] is employed as the objective kernel to be optimized.
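A data-dependent kernel of this family is usually written as k(x, y) = q(x) q(y) k_0(x, y), where k_0 is a basic kernel and q(x) = alpha_0 + sum_i alpha_i k_1(x, a_i) is a factor built on a set of "empirical cores" {a_i}. As an illustration only (the concrete kernel of [13] may differ in details such as the choice of k_1), a minimal NumPy sketch might look like this; the Gaussian choices for k_0 and k_1 and all parameter values are assumptions:

```python
import numpy as np

def base_kernel(X, Y, sigma=1e5):
    """Gaussian base kernel k0(x, y) = exp(-||x - y||^2 / sigma)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

def factor_q(X, cores, alphas, alpha0=1.0, delta=1e5):
    """Data-dependent factor q(x) = alpha0 + sum_i alpha_i k1(x, a_i).
    Here k1 is taken to be Gaussian as well (an assumption)."""
    K1 = base_kernel(X, cores, delta)
    return alpha0 + K1 @ alphas

def data_dependent_kernel(X, cores, alphas, alpha0=1.0):
    """K(x_i, x_j) = q(x_i) q(x_j) k0(x_i, x_j), i.e. K = Q K0 Q with Q = diag(q)."""
    q = factor_q(X, cores, alphas, alpha0)
    K0 = base_kernel(X, X)
    return (q[:, None] * q[None, :]) * K0
```

Since K = Q K0 Q with Q diagonal and positive, the resulting matrix stays symmetric and positive semidefinite whenever K0 is.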

Fisher Criterion Based Kernel Optimization.
In [18], we note that the geometrical structure of the data in the kernel feature space and the empirical feature space is the same. That is, optimizing the kernel parameters cannot change the geometrical structure of the kernel in the feature space. It is better to measure class separability in the empirical feature space, because it is easier to access than the kernel feature space. Specifically, we use the Fisher Criterion to measure the class separability:

J = tr(S_b) / tr(S_w),

where J is the well-known Fisher scalar, tr denotes the trace of a matrix, S_b is the between-class scatter matrix, and S_w is the within-class scatter matrix. J measures the class separability in the feature space rather than in the projection subspace, which makes it a good choice for kernel optimization because it is independent of the projections. Optimizing the data-dependent kernel therefore means maximizing the Fisher scalar J.
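The Fisher scalar can be evaluated directly from a kernel matrix and the class labels through the standard identities tr(S_b) = sum_c (1/n_c) sum_{i,j in c} K_ij - (1/n) sum_{i,j} K_ij and tr(S_w) = tr(K) - sum_c (1/n_c) sum_{i,j in c} K_ij. A small sketch (the function name and interface are ours):

```python
import numpy as np

def fisher_scalar(K, labels):
    """Class separability J = tr(Sb)/tr(Sw) computed from a kernel matrix K.

    tr(Sb) = sum_c (1/n_c) * sum_{i,j in c} K_ij  -  (1/n) * sum_{i,j} K_ij
    tr(Sw) = trace(K) - sum_c (1/n_c) * sum_{i,j in c} K_ij
    """
    n = K.shape[0]
    # per-class normalized block sums of the kernel matrix
    within = sum(K[np.ix_(idx, idx)].sum() / len(idx)
                 for idx in (np.flatnonzero(labels == c)
                             for c in np.unique(labels)))
    tr_sb = within - K.sum() / n
    tr_sw = np.trace(K) - within
    return tr_sb / tr_sw
```

On well-separated classes the cross-class kernel values shrink, so tr(S_b) grows relative to tr(S_w) and J increases, which is exactly the quantity the optimization below drives up.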
We call the matrices B and W the between-class and within-class kernel scatter matrices, respectively. They can be written as

B = diag((1/n_1) K_11, ..., (1/n_c) K_cc) - (1/n) K,   (6)
W = diag(k_11, ..., k_nn) - diag((1/n_1) K_11, ..., (1/n_c) K_cc),   (7)

where K is the data-dependent kernel matrix and K_ii denotes the submatrix of the kernel matrix K corresponding to the samples in class i.
For the basic kernel k_0, the matrices B_0 and W_0 are defined analogously to formulae (6) and (7). The relationship between the Fisher scalar J and the kernel scatter matrices can then be established as

J(alpha) = (alpha^T M_0 alpha) / (alpha^T N_0 alpha),  with  M_0 = K_1^T B_0 K_1,  N_0 = K_1^T W_0 K_1,

where K_1 is the basis-kernel matrix built on the empirical cores and alpha is the coefficient vector of the data-dependent kernel. The proof is given in the Appendix. We use the standard gradient approach to maximize J(alpha). Let J_1 = q^T B_0 q = alpha^T K_1^T B_0 K_1 alpha and J_2 = q^T W_0 q = alpha^T K_1^T W_0 K_1 alpha. Then dJ_1/d(alpha) = 2 K_1^T B_0 K_1 alpha and dJ_2/d(alpha) = 2 K_1^T W_0 K_1 alpha, so

dJ/d(alpha) = (2 / J_2)(M_0 alpha - J N_0 alpha).

To maximize J, we set dJ/d(alpha) = 0, which leads to the generalized eigenvalue problem N_0^{-1} M_0 alpha = J alpha. However, in real-world applications the number of training samples is often insufficient, so the matrix N_0 may not be invertible. We therefore use the general gradient descent method to obtain an approximate alpha. The updating equation for maximizing the class separability J is given by

alpha^{(t+1)} = alpha^{(t)} + eta(t) (dJ/d(alpha)),

where eta is the learning rate with eta(t) = eta_0 (1 - t/N), N is the total number of iterations, t denotes the current iteration number, and eta_0 is the initial learning rate. Once alpha is obtained, q can be calculated as q = K_1 alpha, and the data-dependent kernel K follows directly.
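The update rule above can be sketched as plain gradient ascent with the decaying learning rate eta(t) = eta_0 (1 - t/N). For brevity this sketch estimates the gradient of J numerically instead of using the closed-form expression derived from the kernel scatter matrices; eta_0 = 0.1 and 200 iterations mirror the settings reported in Section 4, and the function name is ours:

```python
import numpy as np

def optimize_alphas(J, alpha, n_iter=200, eta0=0.1, eps=1e-4):
    """Maximize a separability objective J(alpha) by gradient ascent with
    the decaying rate eta(t) = eta0 * (1 - t / N) used in the paper.
    The gradient is estimated by central finite differences here; the
    paper derives it in closed form from the kernel scatter matrices."""
    alpha = alpha.astype(float).copy()
    for t in range(n_iter):
        grad = np.zeros_like(alpha)
        for i in range(alpha.size):
            step = np.zeros_like(alpha)
            step[i] = eps
            grad[i] = (J(alpha + step) - J(alpha - step)) / (2 * eps)
        alpha += eta0 * (1 - t / n_iter) * grad  # decaying learning rate
    return alpha
```

The decaying schedule takes large steps early and vanishing steps near the end, which damps oscillation around the maximum of J.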

Locality Preserving Projections and Supervised Kernel Optimized LPP
In this section, the LPP algorithm is first reviewed briefly, and then the optimized kernel function obtained above is used to construct the supervised kernel optimized LPP.

Locality Preserving Projections.
Locality Preserving Projections (LPP) [6] is a linear manifold learning method which seeks an embedding that retains local information and obtains a face subspace that best preserves the crucial face manifold structure [22]. Given a data matrix X = [x_1, x_2, ..., x_n] in R^{m x n} with each sample x_i in R^{m x 1}, LPP, like other subspace learning algorithms, uses a transformation matrix A = [a_1, a_2, ..., a_d] in R^{m x d} with basis vectors a_i in R^{m x 1} to map the high-dimensional points x in R^{m x 1} to low-dimensional points y in R^{d x 1}: y = A^T x. The objective function of LPP for computing an optimal basis vector a in R^{m x 1} is defined as

min_a sum_{i,j} (a^T x_i - a^T x_j)^2 S_ij,   (11)

where S_ij measures the similarity of x_i and x_j. The heat kernel is frequently used to define S_ij:

S_ij = exp(-||x_i - x_j||^2 / t),   (12)

where the parameter t in R is predefined. In (12), the similarity S_ij monotonically increases as the distance between x_i and x_j decreases. In the supervised setting, it is worth noting that if x_i and x_j do not belong to the same class, the value of S_ij is set to zero.
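A compact sketch of supervised LPP as reviewed above, with heat-kernel weights within each class, zero weights across classes, and the generalized eigenproblem X L X^T a = lambda X D X^T a, might read as follows. The small ridge added to the right-hand matrix is a numerical safeguard of ours, not part of the original formulation:

```python
import numpy as np
from scipy.linalg import eigh

def slpp(X, labels, t=1.0, n_components=2):
    """Supervised LPP sketch.

    X is (m, n): one sample per column.  Heat-kernel similarities are
    kept only within each class; the basis vectors are the generalized
    eigenvectors of X L X^T a = lambda X D X^T a with smallest lambda."""
    d2 = ((X.T[:, None, :] - X.T[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / t) * (labels[:, None] == labels[None, :])
    D = np.diag(W.sum(axis=1))
    L = D - W                              # graph Laplacian
    A, B = X @ L @ X.T, X @ D @ X.T
    # small ridge keeps B positive definite for the generalized solver
    evals, evecs = eigh(A, B + 1e-8 * np.eye(B.shape[0]))
    return evecs[:, :n_components]         # basis vectors a_1 .. a_d
```

Projecting with Y = A^T X then gives the low-dimensional embedding that the objective in (11) favors.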
The minimization problem in (11) can be reduced to the eigendecomposition problem [6]

X L X^T a = lambda X D X^T a,   (13)

where D is a diagonal matrix with D_ii = sum_j S_ij and L = D - S is the Laplacian matrix [23]. In the kernel case, the samples are mapped into the feature space by Phi, and the optimal transformation matrix A^Phi = [a_1^Phi, a_2^Phi, ..., a_d^Phi] can be obtained through (13). Each eigenvector can be expressed as a linear combination of the mapped samples:

a^Phi = sum_i beta_i Phi(x_i) = Q beta,

where Q = [Phi(x_1), Phi(x_2), ..., Phi(x_n)]. Substituting this expansion into (13) yields

K L^Phi K beta = lambda K D^Phi K beta,

where K is the data-dependent kernel matrix defined in (3) and D^Phi is a diagonal matrix with D^Phi_ii = sum_j S^Phi_ij. The local structure information of the data in the original space is encoded by the matrix D^Phi: the larger D^Phi_ii is, the more important the point x_i is. Under the constraint (a^Phi)^T Q D^Phi Q^T a^Phi = 1, that is, beta^T K D^Phi K beta = 1, the minimization problem can be transformed into

min_beta beta^T K L^Phi K beta  subject to  beta^T K D^Phi K beta = 1.   (18)

We can obtain the optimal beta by solving (18). The essence of SKOLPP is therefore clear: first, the Fisher Criterion is used to maximize the class separability, yielding a data-dependent kernel; then, the optimal projection matrix of the kernel optimized SLPP is sought to extract features; last, a classifier is adopted for classification. In this paper, we use the nearest neighbor classifier for recognition.
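In the kernel version, the eigenproblem K L K beta = lambda K D K beta is solved for the expansion coefficients, and the embedding is Y = K B. The sketch below follows that recipe on a precomputed kernel matrix (for instance the optimized data-dependent kernel); using the kernel values themselves as within-class similarity weights is a simplification of ours, and the ridge term is again only a numerical safeguard:

```python
import numpy as np
from scipy.linalg import eigh

def kernel_lpp(K, labels, n_components=2):
    """Kernel (supervised) LPP sketch on a precomputed kernel matrix K.

    Within-class-only weights make the graph supervised; solving
    K L K b = lambda K D K b for the smallest eigenvalues gives the
    expansion coefficients, and Y = K B is the embedding."""
    W = K * (labels[:, None] == labels[None, :])   # supervised adjacency (assumption)
    D = np.diag(W.sum(axis=1))
    L = D - W
    A, B = K @ L @ K, K @ D @ K
    evals, evecs = eigh(A, B + 1e-8 * np.eye(B.shape[0]))
    coef = evecs[:, :n_components]
    return K @ coef                                # low-dimensional embedding
```

Feeding this function the optimized data-dependent kernel from Section 2 and classifying the rows of the returned embedding with a nearest neighbor rule is the overall pipeline the paper describes.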

Experimental Results
In this section, we first verify the assumption that the classification performance can be worse in the feature space than without kernel tricks in some cases, and we demonstrate that our proposed kernel optimization algorithm can obtain better classification performance. Then we test the proposed SKOLPP and other methods on the ORL, Yale, AR, and Palmprint databases.

Kernel Optimization on a Synthetic Gaussian Distributed Database.

In this part, we generated two simple datasets with Gaussian distributions. Figure 1(a) shows the two classes of samples; some samples of the two classes overlap. We use the polynomial kernel function k_0(x, y) = [<x, y> + 1]^d with d = 3 to project the data into the empirical feature space, and Figure 1(b) shows the projection of the data in the empirical feature space onto the first three significant dimensions, corresponding to the three largest eigenvalues of K. From Figure 1(b), we observe that the class separability is worse in the feature space than in the input space. Figure 1(c) shows the corresponding results when the Gaussian kernel function k_0(x, y) = exp(-||x - y||^2 / sigma) with sigma = 1.0 x 10^5 is used to project the data into the empirical feature space. Again, the class separability is not preserved. Consequently, a kernel optimization algorithm is necessary to overcome this problem. To show the effectiveness of the optimization algorithm, we carried out another experiment, using the third-order polynomial kernel and the Gaussian kernel k_0(x, y) = exp(-1.0 x 10^{-5} ||x - y||^2) as basic kernels. We randomly select one-third of the samples to form the empirical core set {a_i}.
For both the polynomial kernel and the Gaussian kernel, the initial learning rate eta_0 of the algorithm is 0.1 and the number of iterations is 200. Figure 2(a) shows the projections of the data in the empirical feature space when the third-order polynomial kernel is used as the basic kernel. The corresponding results, when the Gaussian kernel is used, are shown in Figure 2(b). From Figure 2, we can see that the class separability of the data in the feature space is improved significantly when our kernel optimization algorithm is used.
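The 3D projections in Figures 1 and 2 come from the empirical kernel map: eigendecompose K = P Lambda P^T and embed the training samples using the r largest eigenvalues. A sketch (the interface is ours):

```python
import numpy as np

def empirical_feature_map(K, r=3):
    """Empirical kernel map of the training set.

    Eigendecompose K = P Lam P^T and embed the training points as the
    rows of Lam_r^{-1/2} P_r^T K (top-r dimensions), which is how the
    3-D projections in the figures are produced."""
    evals, P = np.linalg.eigh(K)
    order = np.argsort(evals)[::-1][:r]       # r largest eigenvalues
    lam, Pr = evals[order], P[:, order]
    return (Pr / np.sqrt(lam)).T @ K          # shape (r, n)
```

Since Lam_r^{-1/2} P_r^T K = Lam_r^{1/2} P_r^T, the pairwise dot products of the embedded points reproduce K whenever K has rank at most r, which is what makes the empirical feature space a faithful stand-in for the kernel feature space.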

SKOLPP on ORL and Yale Databases.

This experiment is conducted on the well-known ORL and Yale face image databases.
The ORL database contains 40 individuals, each with 10 different images, showing variations in facial expression (smiling or not smiling), facial details (glasses or no glasses), and pose. The Yale database is more challenging than ORL; it contains 165 grayscale images of 15 individuals, with variations in lighting condition (left-light, center-light, and right-light), facial expression (normal, happy, sad, sleepy, and surprised), and facial details (glasses or no glasses). Some sample images from the same individual in the ORL and Yale datasets are shown in Figures 3 and 4.
In the experiment, l images (l = 2, 3, 4, 5) are randomly selected from the image gallery of each individual to form the training sample set, and the corresponding remaining images form the testing set. The results are averaged over 5 random replications. Table 1 presents the top recognition accuracy of PCA, KPCA, KOPCA, KFD, SVM, KMSVM, SLPP, SKLPP, and SKOLPP for different numbers of training samples on the ORL database.
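The per-class random split described above can be sketched as a generator of (train, test) index arrays; the naming and interface are ours:

```python
import numpy as np

def random_splits(labels, n_train, n_rep=5, seed=0):
    """Per-class random split used in the experiments: n_train images
    per individual go to training, the rest to testing, repeated
    n_rep times so accuracies can be averaged."""
    rng = np.random.default_rng(seed)
    for _ in range(n_rep):
        train, test = [], []
        for c in np.unique(labels):
            idx = rng.permutation(np.flatnonzero(labels == c))
            train.extend(idx[:n_train])
            test.extend(idx[n_train:])
        yield np.array(train), np.array(test)
```

Averaging the recognition rate over the splits yielded by this generator corresponds to the "averaged over 5 random replications" protocol.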
Table 1 shows the top recognition rates of all the methods on the ORL database. It is clear that SKOLPP performs the best. Moreover, SKOLPP still performs well when the number of training samples is small. It is worth noting that SKOLPP works better than KOPCA and KMSVM even though all three use an optimized kernel. One reason is that SKOLPP retains the local information and obtains a face subspace that best preserves the crucial face manifold structure. Table 2 shows the results of all the algorithms on the Yale database. Obviously, SKOLPP always performs better than the other methods for different numbers of training samples.

Experiment 2. We design this experiment to test the performance of all the algorithms under different values of sigma^2 in the Gaussian kernel function. The value of sigma^2 ranges from 10^4 to 10^8. Similarly, we select five images of each class as training samples, and the results are shown in Figures 5 and 6.
From Figure 5, we can see that SKOLPP performs best compared with the other methods when sigma^2 is around 10^7, whereas the result is not as good when sigma^2 = 10^4 to 10^5. Small values of sigma^2 are more suitable for KPCA and KFD on the ORL database. It is worth noting that SKOLPP always performs better than the other methods, including KMSVM, on the ORL database (Figure 5). Figure 6 shows the corresponding results on the Yale database. Although SKOLPP works slightly worse than KMSVM for small values of sigma^2 in Figure 6, SKOLPP surpasses KMSVM and achieves the highest recognition result as sigma^2 becomes larger. The recognition rate of SKOLPP reaches 95.5% and 92.9% on the ORL and Yale databases, respectively.

Experiment 3. The polynomial function is used in this part to test the performance of the proposed method. Three kinds of polynomial functions are tested. Five images of each class are selected as training samples and the rest of the images are used for testing. Table 3 shows the performance of the different methods under the different polynomial functions mentioned above.
From Table 3, we can see that the performance of KFD is unsatisfactory, whereas SKLPP and SKOLPP achieve better results than the other methods. Not surprisingly, SKOLPP again reaches the highest recognition rate of 94.4%. The corresponding results on the Yale database are presented in Table 4.

SKOLPP on AR Database.

This experiment is conducted on the AR face database. The AR database [24] contains over 4000 color images corresponding to 126 people's faces (70 men and 56 women). The images feature frontal-view faces with different facial expressions, illumination conditions, and occlusions (sunglasses and scarf). The images of each person were taken in two sessions, separated by two weeks, and the same pictures were taken in both sessions. In this experiment, we take 100 individuals (50 men and 50 women) and use the first 13 images of each person to test the performance of all the algorithms; thus, the total number of images used in this experiment is 1300. All images are grayscale with 256 gray levels. To simplify the computation of the experiments, we cropped each image manually and resized each image to 48 x 48 pixels. Figure 7 shows the samples of one person. To fully evaluate the performance of SKOLPP, we make three tests based on variations in facial expressions, lighting conditions, and occlusions. The Gaussian kernel with sigma^2 = 10^7 is used for SKOLPP.

Facial Expressions.
In this test, we randomly select two images from images 1-4 in Figure 7 as training samples; the remaining two images are used for testing. Therefore, the total number of training samples is 200. These images have different facial expressions. Table 5 shows the top recognition rates of the different algorithms. From Table 5, we can see that SKOLPP achieves better results than the other methods (PCA, 2DPCA [25], LDA [26], NPE [27], SLPP, and SKLPP). In particular, SKOLPP outperforms SKLPP by 6.5% in recognition rate.

Lighting Conditions.
To test SKOLPP and the other methods under varying lighting conditions, we selected images 1, 3, and 6 in Figure 7 as training samples and images 2, 4, 5, and 7 as testing samples. Thus the number of training samples is 300, while that of the testing samples is 400. The recognition rates are summarized in Table 6. It is obvious that SKOLPP is the most effective technique for dealing with illumination variation among the listed methods, exceeding SKLPP by 6.6% in recognition rate.

Occlusions.
In this part, we test the recognition rate under varying occlusions. We took images 1-7 in Figure 7 as training samples, so the number of training samples is 700, and the remaining images (8-13 in Figure 7) as test samples. Table 7 shows the top recognition rates of all the involved methods. Apparently, SKOLPP delivers the best result of all the algorithms, while SKLPP is worse than SKOLPP by nearly 20% in recognition rate. SKLPP, SLPP, and NPE also achieve good results.

SKOLPP on the PolyU Palmprint Database.

To simplify the computation of the experiments, we cropped each image manually and resized each image to 64 x 64 pixels. Figure 8 shows the samples of one palm. A random subset of l (= 3, 4, 5) images per palm is taken with labels to form the training set, and the remaining (7, 6, 5) images form the testing set. The maximal recognition rates of each method and the corresponding dimensions are given in Table 8. From Table 8, we notice that SKOLPP consistently outperforms the other methods in all cases. Particularly in the 4/6 and 5/5 cases, SKOLPP boosts the recognition rate by over 5% compared with SKLPP and by nearly 10% compared with PCA. In addition, SKLPP obtains the second best results, the performance of SLPP is slightly better than that of NPE, and PCA performs the worst among all the methods.
From the experimental results, we can conclude that SKOLPP indeed improves the class discrimination in the empirical feature space compared with SKLPP, and that it is robust to the influence of illumination, facial expression, and occlusion.

Conclusion
In this paper, we proposed an efficient classification method, supervised kernel optimized LPP (SKOLPP), which maximizes a measure of class separability in the feature space. Based on the Fisher Criterion, our method achieves satisfactory classification performance by preserving the geometrical structure of the data in the kernel feature space. SKOLPP integrates the merits of kernel optimization and SKLPP to increase the performance of nonlinear feature extraction and classification, and it is robust to the influence of illumination, facial expression, and occlusion. Several experiments were conducted to demonstrate the effectiveness of SKOLPP.

Appendix
Proof. Consider the empirical feature mapping Phi_e : x -> R^r with y_i = Phi_e(x_i); the dot product (kernel) matrix then has exactly r positive eigenvalues.

Figure 1 :
Figure 1: 2D database and its projections in the feature space onto the first three significant dimensions. (a) Two classes of data samples with two Gaussian distributions. (b) 3D projection in the empirical feature space for the polynomial kernel with d = 3. (c) 3D projection in the empirical feature space for the Gaussian kernel with sigma = 1.0 x 10^5.

Figure 2 :
Figure 2: Improvement of class separability via the kernel optimization algorithm. (a) 3D projection in the empirical feature space for the polynomial basic kernel with d = 3. (b) 3D projection in the empirical feature space for the Gaussian basic kernel with sigma = 1.0 x 10^5.

Figure 3 :
Figure 3: Ten sample images from Yale database.

Figure 4 :
Figure 4: Twenty sample images from ORL database.

Figure 5 :
Figure 5: Optimal average recognition accuracy (%) among different parameters of Gaussian kernel function on ORL database.

Figure 6 :
Figure 6: Optimal average recognition accuracy (%) among different parameters of Gaussian kernel function on Yale database.

Figure 7 :
Figure 7: Image sample from AR database under variations in facial expressions, lighting conditions, and occlusions.

Figure 8 :
Figure 8: Samples of the cropped images in the PolyU Palmprint database.

Table 5 :
Top recognition rates of different methods on AR database under varying facial expressions.

Table 6 :
Top recognition rate of different methods on AR database under varying lighting conditions.

Table 7 :
Top recognition rate of different methods on AR database under varying occlusions.
The PolyU Palmprint database contains 7752 grayscale images corresponding to 386 different palms, in BMP format (http://www4.comp.polyu.edu.hk/~biometrics/). Around twenty samples were collected from each of these palms in two sessions: ten samples were captured in the first session and the other ten in the second session, with an average interval of two months between the two sessions. In this experiment, we took 200 different palms and used the first 5 images from each session, so the total number of images used is 2000. All images are grayscale with 256 gray levels and are of size 384 x 284 pixels.

Table 8 :
Top recognition rates of different methods on the PolyU Palmprint database and the corresponding dimensions.