Laplace Graph Embedding Class Specific Dictionary Learning for Face Recognition

The sparse representation based classification (SRC) method and collaborative representation based classification (CRC) method have attracted increasing attention in recent years due to their promising results and robustness. However, both SRC and CRC directly use the training samples as the dictionary, which leads to a large fitting error. In this paper, we propose the Laplace graph embedding class specific dictionary learning (LGECSDL) algorithm, which trains a weight matrix and embeds a Laplace graph to reconstruct the dictionary. Firstly, it can increase the dimension of the dictionary matrix, which makes it suitable for classifying small-sample databases. Secondly, it assigns different weights to different dictionary atoms to improve classification accuracy. Additionally, in each class dictionary training process, the LGECSDL algorithm introduces a Laplace graph embedding term into the objective function to preserve the local structure of each class; combining class specific dictionary learning with the Laplace graph embedding regularizer improves face recognition performance. Moreover, we extend the proposed method to an arbitrary kernel space. Extensive experimental results on several face recognition benchmark databases demonstrate the superior performance of our proposed algorithm.


Introduction
Sparse representation based on dictionary learning is attracting increasing attention in computer vision due to its impressive performance in many applications, such as image processing, image ranking [1], human activity recognition [2], and image classification [3,4]. Unlike traditional subspace methods such as PCA, the sparse representation algorithm allows the number of bases in a dictionary to be much larger than the dimension of the sample features, so the samples can be fitted more effectively.
Deep learning-based methods are currently the mainstream methods in image classification. Taigman et al. [5] proposed the DeepFace neural network for face recognition, which has achieved human-level performance. Ding and Tao [6] proposed a comprehensive framework based on convolutional neural networks to overcome challenges in video-based face recognition. Schroff et al. [7] proposed FaceNet, which learns a mapping from face images to a compact Euclidean space. Sun et al. [8] proposed the DeepID2+ convolutional network, which increases the dimension of hidden representations and adds supervision to early convolutional layers. Liu et al. [9] proposed a multipatch deep CNN and deep metric learning method to extract discriminative features for face recognition. However, deep learning methods perform well when the sample size is large, and their results are unsatisfactory on small databases. Therefore, we propose a dictionary learning method based on Laplacian embedding and sparse representation, which can still achieve good results with very few samples.
The sparse representation based classifier has been widely used in the field of face recognition. Normally, classifying the samples involves two stages: first, the sample features are extracted, and then the features are sent to the classifier for classification. For feature extraction, many subspace methods have been proposed. The principal component analysis method was proposed to reduce a complex data set to a lower dimension to reveal the hidden, simplified data structures [10]. The linear discriminant analysis (LDA) algorithm was proposed to find the projection hyperplane that minimizes the intraclass variance and maximizes the distance between the projected means of the classes [11]. Tao et al. [12] proposed a general tensor discriminant analysis method as a preprocessing step for the LDA algorithm to reduce the undersampling problem. The locality preserving projection algorithm was proposed to preserve the neighborhood structure of the data [13]. For classification, Liu et al. [14] proposed a new belief-based $K$-nearest neighbor classifier to make the classification result more robust to misclassification errors. Noh et al. [15] proposed a nearest neighbor algorithm that enhances the performance of nearest neighbor classification by learning a local metric. Although the $K$-nearest neighbor classifier and the nearest neighbor classifier have achieved good results on some data sets, they do not select the most discriminative features of the samples for classification. So, subspace-based classifier design methods were proposed to improve the classification effect.
The sparse representation based classification algorithm uses training samples to construct an overcomplete dictionary, and the test samples can be well represented as a sparse linear combination of elements from the dictionary [16]. However, subsequent research shows that sparseness alone cannot extract the most discriminative features of the samples. Collaborative representation based classification (CRC) was then proposed, which uses the L2-norm constraint to reveal the internal structure of the testing sample [17]. Although the SRC and CRC methods have achieved superior performance in visual recognition, both algorithms directly use the training samples as the dictionary matrix. Building dictionaries directly from training samples has two drawbacks: first, there may be too few samples to build an overcomplete dictionary, which may result in low classification accuracy, and second, the dictionary samples may be highly redundant, which prevents the original signals from being effectively expressed and results in poor classifier performance.
So, dictionary learning methods are proposed to improve the classification effect. Discriminative dictionary learning approaches can be divided into three types: shared dictionary learning, class specific dictionary learning, and hybrid dictionary learning. The shared dictionary learning method usually uses all training samples to obtain a classification dictionary. Lu et al. [18] proposed a locality weighted sparse representation based classification (WSRC) method which utilizes both data locality and linearity to train a classification dictionary. Yang et al. [19] proposed a novel dictionary learning method based on the Fisher discrimination criterion to improve the pattern classification performance. Yang et al. [20] proposed a latent dictionary learning method to learn a discriminative dictionary and build its relationship to class labels adaptively. Jiang et al. [21] proposed an algorithm to learn a single overcomplete dictionary and an optimal linear classifier for face recognition. Zhou et al. [22] presented a dictionary learning algorithm that exploits the visual correlation within a group of visually similar object categories, in which a commonly shared dictionary and multiple category-specific dictionaries are modeled accordingly. The class specific dictionary learning method trains a dictionary for each class of samples. Sun et al. [23] learned a class specific subdictionary for each class and a common subdictionary shared by all classes to improve the classification performance. Wang and Kong [24] proposed a method to explicitly learn a class specific dictionary for each category, which captures the most discriminative features of this category, and simultaneously learn a common pattern pool, whose atoms are shared by all the categories and only contribute to representation of the data rather than discrimination.
The hybrid dictionary learning method is the combination of the above two methods. Rodriguez and Sapiro [25] proposed a new dictionary learning method which uses a class-dependent supervised constraint and an orthogonal constraint; this method learns the intraclass structure while increasing the interclass discrimination and expands the difference between classes. Gao et al. [26] learned a category-specific dictionary for each category and a shared dictionary for all the categories, and this method improves conventional basic-level object categorization. Liu et al. [27] proposed a locality sensitive dictionary learning algorithm with global consistency and smoothness constraint to overcome the restriction of linearity at a relatively low cost. Although the hybrid dictionary learning method achieved good results, these methods usually operate in the original Euclidean space, which cannot capture nonlinear structures hidden in data. So, many kernel-based classifiers are designed to solve this problem. Nguyen et al. [28] presented a dictionary learning method for sparse representation based on the kernel method. Liu et al. [29] proposed a multiple-view self-explanatory sparse representation dictionary learning algorithm (MSSR) to capture and combine various salient regions and structures from different kernel spaces, and this method achieved superior performance in the field of face recognition.
Although the MSSR algorithm achieved better results, it neither took into consideration the details of the training samples in the original sample space nor preserved this information, which is conducive to classification, in the dictionary space. Therefore, in our algorithm, a Laplace constraint is added to the objective function so that samples that are close in the low dimensional sample space remain close in the high dimensional dictionary space.
Motivated by this, we propose a Laplace graph embedding class specific dictionary learning algorithm and extend this method to arbitrary kernel space. The main contributions are fourfold. (1) We propose a Laplace embedding sparse representation algorithm. It combines the discriminant ability of SRC with Laplace embedding, which maintains the intrinsic local geometric structure of the sample features. (2) We propose a Laplace embedding constraint dictionary learning algorithm to construct a superior subspace and reduce the residual error. (3) We extend this algorithm to arbitrary kernel space to find the nonlinear structure of face images. (4) Experimental results on several benchmark databases demonstrate the superior performance of our proposed algorithm.
The rest of the paper is organized as follows. Section 2 overviews two classical face recognition algorithms, SRC and CRC. Section 3 proposes our Laplace graph embedding class specific dictionary learning algorithm with kernels. The solution to the minimization of the objective function is elaborated in Section 4. Then, experimental results and analysis are shown in Section 5. Finally, discussions and conclusions are drawn in Section 6.

Overview of SRC and CRC
In this section, we will briefly overview two classical face recognition algorithms, SRC and CRC.
Suppose that there are $C$ classes in the training samples and each class has $N_c$ elements, so that $N = \sum_{c=1}^{C} N_c$, where $N$ represents the total number of training samples. $X$ represents all the training samples, $X = [X^1, X^2, \ldots, X^c, \ldots, X^C]$, where $X^c \in \mathbb{R}^{d \times N_c}$ represents the $c$th class of the training samples and $d$ represents the dimension of the sample features. Supposing that $y \in \mathbb{R}^{d \times 1}$ is a test sample, the sparse representation of sample $y$ can be expressed as
$$\hat{s} = \arg\min_{s} \|y - X^c s\|_2^2 + 2\alpha \|s\|_1, \tag{1}$$
where $s$ is the sparse coding of sample $y$ in the $c$th dictionary and $\alpha$ is the regularization parameter in formula (1), which is used to control the sparsity and accuracy of the expression.
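The sparse coding step can be illustrated with a small coordinate descent solver. This is a minimal sketch, assuming the objective has the standard form $\|y - Xs\|_2^2 + 2\alpha\|s\|_1$; the function names `soft_threshold` and `sparse_code` are hypothetical, and the solver is a generic textbook one rather than the one used in the experiments.

```python
def soft_threshold(b, alpha):
    """Return sign(b) * max(|b| - alpha, 0)."""
    if b > alpha:
        return b - alpha
    if b < -alpha:
        return b + alpha
    return 0.0


def sparse_code(X, y, alpha, n_iter=100):
    """Minimize ||y - X s||_2^2 + 2*alpha*||s||_1 by coordinate descent.

    X is a list of D rows with K entries each (columns are dictionary atoms);
    y is a list of D values. Both layouts are assumptions of this sketch.
    """
    D, K = len(X), len(X[0])
    s = [0.0] * K
    for _ in range(n_iter):
        for k in range(K):
            # Squared norm of atom k.
            a = sum(X[d][k] ** 2 for d in range(D))
            if a == 0.0:
                continue
            # Correlation of atom k with the residual that excludes atom k.
            b = sum(
                X[d][k] * (y[d] - sum(X[d][j] * s[j] for j in range(K) if j != k))
                for d in range(D)
            )
            s[k] = soft_threshold(b, alpha) / a
    return s
```

With an identity dictionary the solver reduces to element-wise soft thresholding, so small coefficients are set exactly to zero, which is the sparsity effect that the regularization parameter controls.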
The collaborative representation based classification algorithm applies an L2-norm constraint to the objective function; the objective of the CRC algorithm can be written as follows:
$$\hat{s} = \arg\min_{s} \|y - X^c s\|_2^2 + \beta \|s\|_2^2, \tag{2}$$
where $\beta$ is the regularization parameter to control the expression accuracy of the objective function.
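Unlike the L1-regularized problem, an L2-regularized least-squares problem has a closed-form solution. The short derivation below is a sketch under the assumption that the CRC objective is $\|y - Xs\|_2^2 + \beta\|s\|_2^2$ with dictionary $X$, test sample $y$, code $s$, and regularization weight $\beta > 0$ (these symbol names are assumptions where the text does not fix them).

```latex
\begin{aligned}
g(s) &= \|y - Xs\|_2^2 + \beta \|s\|_2^2, \\
\nabla g(s) &= -2X^{T}(y - Xs) + 2\beta s = 0, \\
\hat{s} &= \left(X^{T}X + \beta I\right)^{-1} X^{T} y .
\end{aligned}
```

Because $X^{T}X + \beta I$ is positive definite for $\beta > 0$, the inverse always exists, and the projection matrix $(X^{T}X + \beta I)^{-1}X^{T}$ can be precomputed once and reused for every test sample.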
Both SRC and CRC methods directly use the training samples as the dictionary, and each base in the dictionary makes the same contribution to the sample expression. The testing sample $y$ can be encoded as
$$y \approx X^c s. \tag{3}$$
Here, $X^c$ is the dictionary matrix composed of the $c$th class training samples, and $s$ is the sparse coding of $y$. Directly using the training samples as the dictionary leads to high residual error. Liu et al. [29] proposed a single-view self-explanatory sparse representation dictionary learning algorithm (SSSR). Supposing that $C$ represents the class number of the training samples and $X^c$ means the collection of sample characteristics of class $c$, the objective function of the SSSR method can be formulated as
$$\min_{W^c, S^c} \|X^c - X^c W^c S^c\|_F^2 + 2\alpha \sum_{n=1}^{N_c} \|S^c_{\cdot n}\|_1 \quad \text{s.t. } \|X^c W^c_{\cdot k}\|_2^2 \le 1,\ \forall k = 1, \ldots, K, \tag{4}$$
where $S^c$ is the sparse codes of the $c$th class and $S^c_{\cdot n}$ represents the $n$th column of $S^c$. The SSSR algorithm reconstructs the dictionary matrix as $X^c W^c$, where $W^c \in \mathbb{R}^{N_c \times K}$ is the dictionary weight matrix and $N_c$ is the number of samples in the $c$th class. $W^c$ expands the original dictionary space into a more complete dictionary space; when $W^c$ is the identity matrix $I \in \mathbb{R}^{N_c \times N_c}$, the class specific dictionary learning algorithm reduces to the SRC method. The existence of the $W^c$ matrix makes dictionary learning more flexible in the process of expression, and the reconstruction error may be reduced as well.
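Meanwhile, Liu et al. [29] extended the SSSR algorithm into kernel spaces, which can map the original sample features into a high dimensional nonlinear space for better mining of nonlinear relationships between samples. The objective function of the multiple-view kernel-based class specific dictionary learning algorithm (KCSDL) is shown as follows:
$$\min_{W^c, S^c} \sum_{c=1}^{C} \Big\{ \|\phi(X^c) - \phi(X^c) W^c S^c\|_F^2 + 2\alpha \sum_{n=1}^{N_c} \|S^c_{\cdot n}\|_1 \Big\} \quad \text{s.t. } \|\phi(X^c) W^c_{\cdot k}\|_2^2 \le 1,\ \forall k = 1, \ldots, K, \tag{5}$$
where $\phi: \mathbb{R}^{d} \to \mathbb{R}^{M}$ denotes the mapping induced by the kernel function; it maps the original feature space into a high dimensional kernel space.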
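The experiments later refer to linear, Hellinger, and polynomial kernels. The following is a minimal sketch of these three kernels and of the Gram matrix they induce, using their common textbook definitions. The polynomial `degree` and `offset` defaults are illustrative assumptions, since the text does not state the values used, and all function names are hypothetical.

```python
import math


def linear_kernel(x, y):
    """Inner product <x, y>."""
    return sum(a * b for a, b in zip(x, y))


def hellinger_kernel(x, y):
    """Sum of sqrt(x_i * y_i); assumes nonnegative features."""
    return sum(math.sqrt(a * b) for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, offset=1.0):
    """(offset + <x, y>) ** degree; degree and offset are assumed defaults."""
    return (offset + linear_kernel(x, y)) ** degree


def gram_matrix(samples, kernel):
    """Kernel matrix with entry (i, j) equal to kernel(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in samples] for xi in samples]
```

Only the Gram matrix is needed by a kernelized objective, so the high dimensional mapping itself never has to be evaluated explicitly.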

Our Proposed Approach
Although the above methods have achieved good results in the field of face recognition, there are still some deficiencies.
The SSSR algorithm uses a reconstructed dictionary matrix to build a sparse representation of the samples; however, it overlooks the fact that a sparsity constraint alone does not necessarily lead to better classification results.
Motivated by this, we propose a sparse representation algorithm based on Laplace graph embedding. In addition to the sparse representation of the samples, this algorithm mines the details implicit in the training samples, so that samples of the same class are more concentrated in the sparse expression space, which reduces the fitting error and improves the classification effect.
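The objective function of our proposed sparse representation algorithm based on Laplace graph embedding now becomes
$$\min_{W^c, S^c} \sum_{c=1}^{C} \Big\{ \|\phi(X^c) - \phi(X^c) W^c S^c\|_F^2 + 2\alpha \sum_{n=1}^{N_c} \|S^c_{\cdot n}\|_1 + \beta \sum_{i,j=1}^{N_c} p_{ij} \|S^c_{\cdot i} - S^c_{\cdot j}\|_2^2 \Big\} \quad \text{s.t. } \|\phi(X^c) W^c_{\cdot k}\|_2^2 \le 1,\ \forall k = 1, \ldots, K, \tag{6}$$
where $\alpha$ and $\beta$ weight the sparsity term and the Laplace graph embedding term, respectively, and $p_{ij}$ describes the neighboring degree of the $i$th and $j$th training samples of class $c$.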

Optimization of the Objective Function
In this section, we focus on solving the optimization problem for the proposed Laplace graph embedding class specific dictionary learning algorithm. The dictionary weight matrix $W^c$ and the sparse representation matrix $S^c$ can be optimized by iterative approaches. When each element in the $W^c$ matrix is updated, the remaining elements in the $W^c$ matrix and the $S^c$ matrix are fixed; at this time, the objective function becomes an L2-norm constrained least-squares minimization subproblem. Similarly, when each element in the $S^c$ matrix is updated, the $W^c$ matrix and the remaining elements in the $S^c$ matrix are fixed, and the objective function can be seen as an L1-norm regularized least-squares minimization subproblem.

L1-Norm Regularized Minimization Subproblem.
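When updating the elements in the $S^c$ matrix, the nonupdated elements in $S^c$ and the $W^c$ matrix are fixed. Here, the objective function can be formulated as
$$f(S^c) = \|\phi(X^c) - \phi(X^c) W^c S^c\|_F^2 + 2\alpha \sum_{n=1}^{N_c} \|S^c_{\cdot n}\|_1 + \beta \sum_{i,j=1}^{N_c} p_{ij} \|S^c_{\cdot i} - S^c_{\cdot j}\|_2^2, \tag{7}$$
where $p_{ij}$ is the weight value which describes the neighboring degree of $x_i^c$ and $x_j^c$, with $p_{ij} = e^{-\|x_i^c - x_j^c\|/t}$. $x_i^c$ and $x_j^c$ are training samples that belong to the $c$th class, and $t$ is a constant which controls the range of $p_{ij}$. Formula (7) can be simplified as
$$f(S^c) = \operatorname{tr}\{\kappa(X^c, X^c) - 2\kappa(X^c, X^c) W^c S^c + S^{cT} W^{cT} \kappa(X^c, X^c) W^c S^c\} + 2\alpha \sum_{n=1}^{N_c} \|S^c_{\cdot n}\|_1 + 2\beta \operatorname{tr}(S^c L S^{cT}), \tag{8}$$
where $L = D - P$, $D_{ii} = \sum_j p_{ij}$, and matrix $P$ is the weight matrix expressing the sample neighboring distance. $\kappa(X^c, X^c)$ means the kernel function of the samples, and it is calculated prior to dictionary updating: $\kappa(X^c, X^c) = \phi(X^c)^T \phi(X^c)$.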
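The step from a pairwise weighted penalty to a trace form with a graph Laplacian can be checked numerically. The sketch below assumes heat-kernel weights of the form $\exp(-\|x_i - x_j\|/t)$, a symmetric weight matrix, and a Laplacian defined as the degree matrix minus the weight matrix; under these assumptions the pairwise sum equals twice the trace form. The function names and the row-wise storage of the code matrix are hypothetical choices for this illustration.

```python
import math


def heat_kernel_weights(samples, t):
    """Weight matrix with entry (i, j) = exp(-||x_i - x_j|| / t)."""
    n = len(samples)
    return [
        [math.exp(-math.dist(samples[i], samples[j]) / t) for j in range(n)]
        for i in range(n)
    ]


def graph_laplacian(P):
    """Degree matrix (row sums of P on the diagonal) minus P."""
    n = len(P)
    return [
        [(sum(P[i]) if i == j else 0.0) - P[i][j] for j in range(n)]
        for i in range(n)
    ]


def pairwise_penalty(S, P):
    """Sum over i, j of P[i][j] * ||S[:, i] - S[:, j]||^2 (S stored as K rows)."""
    n = len(P)
    return sum(
        P[i][j] * sum((row[i] - row[j]) ** 2 for row in S)
        for i in range(n)
        for j in range(n)
    )


def trace_penalty(S, L):
    """2 * tr(S L S^T) computed row by row."""
    n = len(L)
    return 2.0 * sum(
        row[i] * L[i][j] * row[j] for row in S for i in range(n) for j in range(n)
    )
```

For any code matrix, `pairwise_penalty` and `trace_penalty` agree up to rounding, which is why the Laplacian term can be folded into the same trace expression as the reconstruction error.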
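In this algorithm, each element in $S^c$ is updated sequentially; when $S^c_{kn}$ is updated, the other elements in the $S^c$ matrix are regarded as constants. After ignoring the constant terms of formula (8), formula (8) can be written as a function of $S^c_{kn}$ alone:
$$f(S^c_{kn}) = (E_{kk} + 2\beta L_{nn})(S^c_{kn})^2 - 2 S^c_{kn} b_{kn} + 2\alpha |S^c_{kn}|, \quad b_{kn} = [W^{cT}\kappa(X^c, X^c)]_{kn} - [E\tilde{S}^c]_{kn} - 2\beta[\tilde{S}^c L]_{kn}. \tag{9}$$
According to the solving method in [29], it is easy to obtain the solution of the minimum value of $f(S^c_{kn})$ under the current iteration condition:
$$S^c_{kn} = \frac{\min\{b_{kn}, -\alpha\} + \max\{b_{kn}, \alpha\}}{E_{kk} + 2\beta L_{nn}}, \tag{10}$$
where $E = W^{cT} \kappa(X^c, X^c) W^c$ and $\tilde{S}^c_{pq} = \{S^c_{pq},\ p \ne k \,\|\, q \ne n;\ 0,\ p = k,\ q = n\}$.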
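Each element-wise update amounts to minimizing a one-dimensional quadratic plus an absolute-value term. The sketch below assumes the scalar subproblem has the generic form $a x^2 - 2 b x + 2\alpha |x|$ with $a > 0$; the names `a` and `b` are placeholders for whatever curvature and correlation terms the full objective produces, and the function names are hypothetical.

```python
def scalar_objective(x, a, b, alpha):
    """Value of a*x^2 - 2*b*x + 2*alpha*|x|."""
    return a * x * x - 2.0 * b * x + 2.0 * alpha * abs(x)


def coordinate_minimizer(a, b, alpha):
    """Minimizer of a*x^2 - 2*b*x + 2*alpha*|x| for a > 0 and alpha >= 0.

    min(b, -alpha) + max(b, alpha) equals the soft threshold of b at alpha:
    b - alpha if b > alpha, b + alpha if b < -alpha, and 0 otherwise.
    """
    return (min(b, -alpha) + max(b, alpha)) / a
```

For example, `coordinate_minimizer(2.0, 3.0, 1.0)` gives `1.0`, while any `b` with `abs(b) <= alpha` gives exactly zero, which is how the L1 term produces sparse codes during the sequential sweep.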

Experimental Results
In this section, we present experimental results on five benchmark databases to illustrate the effectiveness of our method. We compare the Laplace graph embedding class specific dictionary learning algorithm (LGECSDL) with some state-of-the-art methods. In the following section, we introduce the experimental environment setting, database descriptions, and experimental results. In the end, we analyze the experimental results accordingly.
There are two parameters in the objective function of the LGECSDL algorithm that need to be specified. $\alpha$ is an important parameter in the LGECSDL algorithm which is used to adjust the trade-off between the reconstruction error and the sparsity. We increase $\alpha$ from $2^{-12}$ to $2^{-1}$ in each experiment and find the best $\alpha$ in our experiments.
$\beta$ is another important factor in the LGECSDL algorithm. $\beta$ is used to control the trade-off between the reconstruction error and the collaborative information. We increase $\beta$ from $2^{-12}$ to $2^{-1}$ and find the best $\beta$ in all of our experiments.
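The parameter sweep described above can be sketched as a grid search over powers of two. This is an illustrative sketch only: `evaluate` is a hypothetical callback that trains and tests a model for one pair of parameter values and returns its recognition rate, and an exhaustive search over both parameters is an assumption, since the text does not say whether they were tuned jointly or one at a time.

```python
from itertools import product


def power_of_two_grid(low_exp=-12, high_exp=-1):
    """Values 2**low_exp, ..., 2**high_exp (both ends inclusive)."""
    return [2.0 ** e for e in range(low_exp, high_exp + 1)]


def grid_search(evaluate, first_values, second_values):
    """Return (best_score, first, second) maximizing evaluate(first, second)."""
    return max(
        (evaluate(a, b), a, b) for a, b in product(first_values, second_values)
    )
```

A call such as `grid_search(evaluate, power_of_two_grid(), power_of_two_grid())` runs 144 evaluations, one for each pair of values on the 12 by 12 grid.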

Database Descriptions.
There are five image databases involved in our experiments. The first one is the Extended YaleB database, which contains 38 categories and 2414 frontal-face images. All the images are captured under varying illumination conditions. In our experiments, each image has been cropped and normalized to 32 × 32 pixels. Figure 1 shows several example images in the Extended YaleB database.
The second one is the AR database. The AR database contains over 3000 images of 126 individuals; the images are shot under different conditions of expression, illumination, and occlusion, and each person has 26 images. Figure 3 shows some examples in the AR database.
The third database is the CMU-PIE database. The CMU-PIE database consists of 41,368 images, which are captured under different lighting conditions, poses, and expressions. The database contains 68 individuals in total, and each person is imaged under 13 different poses and 43 different illumination conditions. We selected two types of images to carry out our experiment: five near-frontal poses and all different illumination conditions. We chose 11,554 images in total for our evaluation. Each person has about 170 images. Figure 5 shows some example images in the CMU-PIE database.
We also selected the Caltech101 database to verify the LGECSDL algorithm. The Caltech101 database contains 9144 images belonging to 101 categories; each class has 31 to 800 images. We selected 5 images as training images in each class and the rest as test images. Figure 7 shows some examples in the Caltech101 database.
The fifth database is the Oxford-102 flower database, which contains 8,189 flower images belonging to 102 categories. Each category contains 40 to 250 images and the minimum edge [...]. Figure 9 shows several images in the Oxford-102 flower database.

Experiments on the Extended YaleB Database.
We randomly selected 5 images as the training samples in each category and 10 images as the testing samples. In our experiments, we set the weight of the sparsity term $\alpha$ as $2^{-9}$, $2^{-7}$, and $2^{-7}$ for the linear kernel, Hellinger kernel, and polynomial kernel, respectively. The optimal $\beta$ is $2^{-10}$, $2^{-8}$, and $2^{-10}$ for the linear kernel, Hellinger kernel, and polynomial kernel, respectively. We independently performed all the methods ten times and then reported the average recognition rates. Table 1 shows the recognition rates of all the algorithms using different kernel methods.
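The evaluation protocol (a random per-class split, repeated ten times, with averaged recognition rates) can be sketched as follows. This is a minimal illustration: `run_trial` is a hypothetical callback that trains on the given training indices, tests on the given test indices, and returns a recognition rate, and the fixed `seed` is an assumption added for reproducibility.

```python
import random
from collections import defaultdict


def split_per_class(labels, n_train, n_test, rng):
    """Return (train_idx, test_idx) with n_train and n_test random samples per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        # Sampling without replacement keeps the two sets disjoint.
        chosen = rng.sample(indices, n_train + n_test)
        train.extend(chosen[:n_train])
        test.extend(chosen[n_train:])
    return train, test


def average_recognition_rate(labels, run_trial, n_runs=10, n_train=5, n_test=10, seed=0):
    """Average run_trial(train_idx, test_idx) over n_runs independent random splits."""
    rng = random.Random(seed)
    rates = [
        run_trial(*split_per_class(labels, n_train, n_test, rng))
        for _ in range(n_runs)
    ]
    return sum(rates) / len(rates)
```

Each class must contain at least `n_train + n_test` samples for the split to succeed; otherwise `rng.sample` raises a `ValueError`.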
From Table 1, we can clearly see that LGECSDL achieves the best recognition rates of 80.18%, 91.93%, and 81.05% in the linear kernel, Hellinger kernel, and polynomial kernel space, respectively, while KCSDL, the second best method, arrives at 78.55%, 88.98%, and 79.68%. Since illumination variations of images are relatively large in the Extended YaleB database, these experimental results validate the effectiveness and robustness of LGECSDL for image recognition with illumination variations. The VGG19 neural network reaches a recognition rate of at most 53.79% in this experiment, which suggests that training on such a small database does not exploit the strengths of neural networks. We also verify the effect of $\alpha$ and $\beta$ on the LGECSDL algorithm, and the experimental results are shown in Figure 2.
From Figure 2, we can see that the LGECSDL algorithm achieves better recognition results in Hellinger kernel space. As the parameter $\alpha$ varies from $2^{-13}$ to $2^{-3}$, the recognition rate increases gradually and then decreases. The influence of parameter $\beta$ on the LGECSDL algorithm is similar to that of the parameter $\alpha$. The highest recognition rate was achieved when $\alpha = 2^{-7}$ and $\beta = 2^{-8}$ in Hellinger kernel space. In the linear kernel space, the recognition rate achieves the maximum value at $\alpha = 2^{-9}$ and $\beta = 2^{-10}$, and in the polynomial kernel space, the maximum recognition rate was obtained at $\alpha = 2^{-7}$ and $\beta = 2^{-10}$.

AR Database.
In this experiment, we randomly selected 5 images of each individual as training samples and the rest for testing. Each image has been cropped to 32 × 32 and reshaped into a column vector, and the image vectors have been normalized by $\ell_2$ normalization. All the methods are independently run ten times, and the average recognition rates are reported. The recognition rates on the AR database are shown in Table 2.
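The preprocessing described above (flattening each cropped image into a vector and scaling it to unit L2 norm) can be sketched in a few lines. The function name is hypothetical, and row-major flattening is an assumption; the text does not specify the stacking order.

```python
import math


def image_to_unit_vector(image):
    """Flatten a 2-D image (a list of rows) and scale it to unit l2 norm."""
    vector = [float(pixel) for row in image for pixel in row]
    norm = math.sqrt(sum(v * v for v in vector))
    # An all-zero image has no direction, so it is returned unchanged.
    return vector if norm == 0.0 else [v / norm for v in vector]
```

After this step every sample has unit Euclidean length, so reconstruction errors and kernel values are comparable across images with different overall brightness.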
From Table 2, we can clearly see that the LGECSDL algorithm outperforms the other methods in all kernel spaces. The LGECSDL algorithm achieves the best recognition rate of 94.6% in the polynomial kernel space; in the linear kernel space and Hellinger kernel space, the recognition rates are 94.5% and 94.13%, respectively.
Moreover, Table 2 shows that the KCSDL algorithm achieves its best recognition rate of 91.12% in the linear kernel space, which is the highest among the other methods. These experimental results further confirm the effectiveness and robustness of the LGECSDL algorithm for image recognition with illumination variations and expression changes. We also verify the effect of $\alpha$ and $\beta$ on the AR database; Figure 4 shows the experimental results.
From Figure 4, we can clearly see that the recognition rate reaches the maximum value when $\alpha$ is $2^{-7}$ and $\beta$ is $2^{-8}$ in Hellinger kernel space and polynomial kernel space; in the linear kernel space, the recognition rate achieves the highest value when $\alpha$ is equal to $2^{-9}$ and $\beta$ is $2^{-8}$. As $\alpha$ changes from $2^{-13}$ to $2^{-3}$, the recognition rate increases first and then decreases. The $\beta$ parameter shows a similar trend, and when $\beta$ is greater than $2^{-8}$, the recognition rate decreases rapidly.

CMU-PIE Database.
In this experiment, we chose the CMU-PIE database to evaluate the performance of the LGECSDL algorithm. Five images of each individual are randomly selected for training and the remainder for testing. We also cropped each image to 32 × 32 and then reshaped it into a column vector. Finally, we normalized all the vectors by $\ell_2$ normalization. We independently ran all the methods ten times and then reported the average recognition rates. Table 3 gives the recognition rates of all the methods under different kernel spaces.
From Table 3, we can see that the LGECSDL algorithm always achieves the highest recognition rates under all different kernel spaces. In the polynomial kernel space, the LGECSDL algorithm outperforms KCSDL, which achieves the second highest recognition rate, by more than 4% in recognition rate. In Hellinger kernel space, LGECSDL achieves the best recognition rate of 81.03%, which is 6 percentage points higher than that of the KCSDL algorithm. In the linear kernel space, the face recognition rates of LGECSDL and KCSDL are 79.12% and 74.46%, respectively. These experimental results confirm the effectiveness and robustness of the LGECSDL algorithm. We also evaluate the effect of $\alpha$ and $\beta$ on the CMU-PIE database; Figure 6 shows the experimental results.
From Figure 6, we can see that when $\alpha$ is $2^{-7}$ and $\beta$ is $2^{-8}$, the face recognition rate reaches the highest value of 81.03% in Hellinger kernel space, and when $\alpha$ is $2^{-7}$ and $\beta$ is $2^{-10}$, the highest face recognition rate is obtained in the polynomial kernel space. Figure 6 also shows that when $\alpha$ moves away from its optimal value in either direction, the recognition rate decreases rapidly, and parameter $\beta$ has the same effect on the face recognition rate.

Caltech101 Database.
In this experiment, we further evaluate the performance of the LGECSDL algorithm for image recognition on the Caltech101 database. Figure 7 shows some examples in the Caltech101 database. We randomly split the Caltech101 database into two parts. Part one, which contains about 5 images of each subject, is used as a training set, and the other part is used as a testing set. We use the VGG ILSVRC 19 layers model to obtain the features of each image. Here, we employ the second fully connected layer outputs as the input features, whose dimension is 4096.
We independently ran all the methods ten times and then gave the average recognition rates of all the methods in Figure 8.
LGECSDL-H represents the LGECSDL algorithm in Hellinger kernel space, and LGECSDL-P represents the LGECSDL algorithm in the polynomial kernel space; CSDL-H and CSDL-P are named analogously for the CSDL algorithm. From Figure 8, we can easily see that the LGECSDL-H algorithm achieves the highest recognition rate, and LGECSDL-P is the second one. More concretely, LGECSDL-H and LGECSDL-P achieve the recognition rates of 83.02% and 82.88%, respectively, while CSDL-P, the third best method, arrives at 82.23%. Experimental results show that training the VGG19 network with a small database did not give the desired results: the VGG19 network achieved a recognition rate of 68.63%.
We also measured the computational time of each method on the Caltech101 database. The experimental environment consists of the following: Core i7 CPU (2.4 GHz), 8 GB memory, Windows 7 operating system, and an NVIDIA Quadro K2100M graphics processor. From Table 4, we can see that the VGG19 method is the fastest. This is mainly due to the GPU-accelerated architecture of the neural network, which saves most of the computational time. The second fastest is SVM, followed by the CSDL-H algorithm. The LGECSDL-H algorithm needs 109.50 milliseconds to classify each picture, whereas the LGECSDL-P algorithm requires 112.25 milliseconds.

Oxford-102 Flower Database.
In this experiment, we chose the Oxford-102 database to evaluate the performance of the LGECSDL algorithm on fine-grained image classification. Five images of each category are randomly selected for training, and the rest of the images are used for testing. The image features are obtained from the outputs of a pretrained VGG ILSVRC 19 layers network, which contains sixteen convolutional layers (arranged in five blocks) and three fully connected layers. Here, we use the second fully connected layer outputs as the input features, whose dimension is 4096. We independently ran all the methods ten times and then reported the average recognition rates. The best recognition rates of all the methods are presented in Figure 10.
From Figure 10, we can clearly see that LGECSDL achieves the best recognition rates of 71.41% and 70.85% in polynomial kernel space and Hellinger kernel space, respectively, while CRC arrives at 69.7%, which is the highest among the other methods. LGECSDL-H and LGECSDL-P outperform VGG19, SVM, SRC, CRC, ProKCRC, CSDL-H, and CSDL-P by at least 1.2% in recognition rate. The experimental results show that, in the small database experiment, the other methods have a higher recognition rate than the VGG19 neural network; the classical methods have more advantages when only a few samples are available. We also verify the performance of the LGECSDL algorithm with different values of $\alpha$ or $\beta$ in different kernel spaces on the Oxford-102 database. The performance with different values of $\alpha$ or $\beta$ is reported in Figures 11 and 12.
From Figure 11, we can see that the LGECSDL algorithm in Hellinger kernel space achieves the maximum value when $\alpha = 2^{-4}$ and $\beta = 2^{-4}$; when $\beta$ is fixed, the recognition rate first increases and then decreases with the increase of $\alpha$. Similarly, the recognition rate also first increases and then decreases with the increase of $\beta$.
From Figure 12, we can see that the LGECSDL algorithm in the polynomial kernel space achieves the maximum recognition rate when $\alpha = 2^{-5}$ and $\beta = 2^{-5}$. In the polynomial kernel space, the influence of $\alpha$ and $\beta$ on the algorithm is similar to that in Hellinger kernel space.

Conclusion
We present a novel Laplace graph embedding class specific dictionary learning algorithm with kernels. The proposed LGECSDL algorithm improves the classical classification algorithms in three ways. First, it incorporates the discriminant ability of sparse representation to enhance the interpretability of face recognition. Second, it greatly reduces the residual error through Laplace constraint dictionary learning. Third, it finds the nonlinear structure hidden in face images by extending the LGECSDL algorithm to arbitrary kernel space. Experimental results on several publicly available databases have demonstrated that LGECSDL can provide superior performance to the traditional face recognition approaches.

Figure 1: Examples of the Extended YaleB database.

Figure 2: Parameter tuning on the Extended YaleB database. Panels: tuning $\alpha$ on linear, Hellinger, and polynomial space; tuning $\beta$ on linear, Hellinger, and polynomial space.

Figure 3: Examples of the AR database.

Figure 4: Parameter tuning on the AR database.

Figure 5: Examples of the CMU-PIE database.

Figure 6: Parameter tuning on the CMU-PIE database.

Figure 10: The best recognition rates of all the methods on Oxford-102.

Table 2: Recognition rate on the AR database (%).

Table 4: Computational time of each method in Caltech101.