Jointly Learning the Discriminative Dictionary and Projection for Face Recognition



Introduction
Face recognition (FR) is an important problem in the field of image processing and computer vision. Recently, plenty of face recognition methods have been proposed [1][2][3][4][5]. However, the problems of occlusion, illumination, pose, and small sample size are still huge challenges for face recognition [6][7][8]. Currently, sparse representation-based classification (SRC) [9] has been successfully employed, in which an overcomplete dictionary can represent the query face image well. Significantly, the dictionary designed for SRC utilizes all training images. SRC has shown favorable properties in FR, particularly when images are partly occluded. Nevertheless, uncertain and noisy components may make the dictionary ineffective in representing query samples. Moreover, the dictionary's size grows with the number of training images. Thus, the computational cost of solving sparse representation coefficients will increase if the number of training samples is large. Finally, the dictionary does not take the structure of the training set or the class labels into account, so it lacks discriminant information. To address these issues, predefined dictionaries that use bases such as Haar or Gabor wavelets instead of training samples have been presented [10,11], but none of these bases was designed for SRC [12].
Dictionary learning (DL) is significant for SRC because it can suppress useless information to promote representation and discrimination [13]. To learn a discriminative and small-sized dictionary, a substantial number of methods have been presented [14][15][16], which can be roughly divided into two categories: unsupervised and supervised. Unsupervised DL methods have achieved satisfactory results by minimizing the representation error. The method of optimal directions (MOD) [17] was proposed for unsupervised DL. MOD updated the dictionary by minimizing the representation error and achieved convergence by an iteration-based strategy. However, the computation of the inverse matrix in MOD was very complicated. The K-singular value decomposition (K-SVD) [18] method was proposed based on MOD, which performed SVD decomposition on the representation error term and selected the decomposition terms as the updated dictionary atoms and the corresponding coding coefficients. The most substantive difference between MOD and K-SVD is the dictionary updating strategy, in which K-SVD updates one atom and its corresponding coding coefficients each time until all atoms are updated. Therefore, MOD can be considered a simplified version of K-SVD. Although the performance of K-SVD is improved, the computational complexity of updating atoms is also high. To enhance the efficiency of DL, an effective reconstructed DL method was presented in [19], which was based on alternating optimization over two subsets of variables. Skretting and Engan [20] introduced a forgetting factor λ into the DL algorithm to make the algorithm less dependent on the initial dictionary. In [21], metafaces were learned from the training samples, which can promote the representation ability of the dictionary. Although unsupervised DL methods have achieved impressive recognition results, there still exists a limitation in their practical applications.
Due to the absence of label information, the dictionaries obtained by unsupervised DL methods always lack discriminative ability. To overcome this problem, many supervised DL methods that utilize label information have been proposed. In [22], a discriminative K-SVD algorithm was proposed to ensure the representative and discriminative abilities of the learned dictionary. To better utilize the correspondence between the dictionary and labels, the label consistent K-SVD [23] algorithm, which associated the label information with each atom to promote the discriminative ability of the dictionary, was put forward. Recently, the Fisher discrimination dictionary learning (FDDL) [24] algorithm was proposed to learn a class-specific dictionary for FR. Based on the Fisher discrimination criterion [25], the representation error associated with each class was employed for classification. Ding and Ji [26] applied a kernel-based robust disturbance dictionary to significantly enhance the recognition accuracy of occluded faces. Since supervised DL methods explore the label information of training samples to promote the discriminative ability of the learned dictionary, they have achieved good performance for FR. Recent progress in SRC has made video-based face recognition a growing research topic. The video can be treated as a set of images obtained from different poses, illuminations, and expressions. The main difficulty is how to effectively use the multiframe information. In [27], a video dictionary was adopted to encode different video information, i.e., pose, temporal, and illumination. In [28], a multivariate sparse representation method was suggested for video-based face recognition, which was robust to noise and occlusion. These two methods learned the dictionary for FR, but they did not consider the impact of other constraints on algorithm performance. Xu et al.
[29] proposed a method to learn a structured dictionary for video-based face recognition, which adopted the nuclear norm to make the coding coefficient matrix low-rank. However, this method did not enhance the discriminative ability of the representation coefficients. In addition, it utilized the samples in the original space to learn the dictionary and the coding coefficient matrix, which ignores the influence of noise and other irrelevant information.
Dimensionality reduction (DR) is an essential step to decrease the cost of data computation and storage. It also eliminates the irrelevant information to enhance the discriminative ability of features [30][31][32][33]. Zhang et al. [34] proposed a novel unsupervised algorithm to obtain the orthogonal projection, which can ensure that the samples were well reconstructed in the projected subspace. Clemmensen et al. [35] utilized the sparseness criterion to realize linear discriminant analysis so that the classification and feature selection can be achieved concurrently. In [36], a linear discriminative projection was learned by maximizing the ratio of the between-class representation error to the within-class representation error in the projected space. In [37], the sparsity criterion and the maximum margin criterion [38] were combined to obtain the discriminant projection. Although these SRC-based DR methods yielded notable results, they only acquired the low-dimensional features of the samples and failed to supply an explicit discriminative dictionary.
To overcome this limitation, a series of methods have been suggested to combine DR and DL into a unified framework. By combining the sparseness criterion with PCA, Nguyen et al. [39] presented a sparse embedding method for simultaneously solving the DR and DL problems. The projection matrix was learned to retain the sparse structure of the samples, and the dictionary was learned in the reduced space simultaneously. However, it ignored the separability of samples from different classes in the subspace. In [40], the sigmoid function and the ratio of intraclass representation error to interclass representation error were utilized to learn the discriminative dictionary and projection simultaneously, but it ignored both the intraclass and interclass scatter matrices of the coefficients and the low-dimensional samples. To address this problem, Feng et al. [41] introduced an orthogonal projection matrix, which can be obtained by maximizing the total scatter and between-class scatter of the training set, into the simultaneous projection and dictionary learning framework. Liu et al. [42] utilized discriminative graph constraints to achieve nonnegative feature projection and dictionary learning simultaneously. Lu et al. [43] also presented a framework, which can simultaneously learn low-dimensional features and dictionaries, to deal with the video-based face recognition problem. Although these jointly learning methods have achieved success, they did not exploit the discriminative relationship between the low-dimensional features and the dictionary. To address this issue, a novel method called jointly learning the discriminative dictionary and projection (JLDDP), which simultaneously learns the dictionary and projection in a unified framework, is proposed for FR in this paper. Compared with the existing methods, JLDDP has four characteristics.
First, the discriminative ability of the dictionary can be enhanced via imposing the Fisher discrimination criterion on the coding coefficients. Second, the projection learned by our approach keeps samples from the same class close while keeping samples from different classes far apart in the low-dimensional subspace. Third, JLDDP combines the processes of projection learning and DL into a uniform framework, so the dictionary and projection can be automatically optimized. Last, we design an iterative optimization algorithm to solve our model and provide a theoretical proof of its convergence. The remainder of this paper is organized as follows. Related work is briefly reviewed in Section 2. The details of JLDDP are provided in Section 3. Experiments and comparisons are carried out in Section 4, and conclusions are provided in Section 5.

Related Work
2.1. SRC. SRC was proposed by Wright et al. [9] for face recognition. Assume there are c classes of samples, and the training set can be expressed as X = [X_1, . . . , X_c], where X_i = [x_{i,1}, x_{i,2}, . . . , x_{i,n_i}] ∈ R^{m×n_i} denotes the subset of the training samples that contains n_i samples of class i. Let x_{i,j} (j = 1, 2, . . . , n_i) represent the m-dimensional vector stretched from the j-th sample of class i. SRC assumes that a testing sample can be well estimated by a linear combination of the training samples from the same class; thus, letting y ∈ R^m denote a testing sample of class i, it can be expressed as y = a_{i,1}x_{i,1} + a_{i,2}x_{i,2} + · · · + a_{i,n_i}x_{i,n_i}, where a_{i,j} is the corresponding coding coefficient. If we utilize the whole training set to represent y, the entries of the coefficient vector except those related to the i-th class should be zero. In SRC, l_1-minimization is applied to obtain the coefficient vector, i.e., â = argmin_a ‖y − Xa‖_2^2 + λ‖a‖_1, where λ is a tradeoff parameter. Then e_i = ‖y − Xδ_i(â)‖_2 denotes the representation error of class i, where δ_i(·): R^n ⟶ R^{n_i} selects the coefficients of class i. The classification criterion is identity(y) = argmin_i e_i.
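As an illustration of the SRC pipeline just described, the sketch below is a minimal reimplementation (not the authors' code): an ISTA loop stands in for the l_1-minimization solver, and classification uses the class-wise residual rule e_i = ‖y − Xδ_i(â)‖.

```python
import numpy as np

def ista_lasso(X, y, lam=0.1, n_iter=500):
    """Solve min_a ||y - X a||_2^2 + lam * ||a||_1 by proximal gradient (ISTA)."""
    L = 2 * np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = a - 2 * X.T @ (X @ a - y) / L      # gradient step on the quadratic term
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold (l1 prox)
    return a

def src_classify(X, labels, y, lam=0.1):
    """Assign y to the class whose training columns best reconstruct it."""
    a = ista_lasso(X, y, lam)
    errors = {}
    for c in np.unique(labels):
        mask = labels == c
        # e_i = ||y - X delta_i(a)||: keep only the coefficients of class i
        errors[c] = np.linalg.norm(y - X[:, mask] @ a[mask])
    return min(errors, key=errors.get)
```

In practice, the columns of X are normalized to unit norm before coding, as is standard for SRC.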

Dictionary Learning. In this section, the DL methods, including unsupervised K-SVD [18] and supervised FDDL [24], will be reviewed.

K-SVD.
In the K-SVD algorithm [18], an overcomplete dictionary is learned from the training set for image compression and denoising. The objective function of K-SVD is formulated as

$$\min_{D,\alpha} \|X - D\alpha\|_F^2 \quad \text{s.t.} \quad \|\alpha_i\|_0 \leq T,\ \forall i, \qquad (1)$$

where X is the training set, D is the dictionary, α is the sparse coding coefficient matrix of X over D (with i-th column α_i), and T is the parameter to adjust the sparsity. To optimize equation (1), the sparse coding coefficient matrix α and the dictionary D are updated iteratively. However, there is no corresponding relation between the class labels and the dictionary atoms. Thus, K-SVD is unsuitable for solving classification problems.

FDDL.
Different from K-SVD, FDDL [24] combines the class label information and the Fisher discrimination criterion to learn a structured discriminative dictionary, which performs classification by the representation error for each class. The FDDL model is formulated as

$$J = \min_{D,\alpha}\ r(X, D, \alpha) + \lambda_1 \|\alpha\|_1 + \lambda_2 \big(\mathrm{tr}(S_W(\alpha) - S_B(\alpha)) + \eta \|\alpha\|_F^2\big), \qquad (2)$$

where X is the training set, λ_1 and λ_2 are tradeoff parameters, and each column of D is normalized to a unit vector. Here, r(X, D, α) is the discriminative fidelity term, ‖α‖_1 is the sparse regularization term, and tr(S_W(α) − S_B(α)) + η‖α‖_F^2 is the discriminative coefficient term that enforces the discriminative ability of the sparse representation coefficients. The objective function of FDDL can be optimized by updating the dictionary and the sparse representation coefficients iteratively. Although FDDL has achieved good performance for FR, the process is time-consuming. Therefore, in FDDL, PCA is first applied to extract features from all samples.
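As a hedged illustration of the atom-by-atom updating strategy that distinguishes K-SVD from MOD (reviewed above), the sketch below refits each atom and its coefficients via a rank-1 SVD of the residual restricted to the signals that currently use that atom. This is an illustrative reimplementation, not the reference code, and the sparse-coding stage is assumed to happen elsewhere.

```python
import numpy as np

def ksvd_update_dictionary(X, D, A):
    """One K-SVD sweep: for each atom d_k, take the residual of X with atom k
    removed (on its supporting signals only) and replace (d_k, A[k]) by the
    best rank-1 approximation of that residual."""
    D, A = D.copy(), A.copy()
    for k in range(D.shape[1]):
        users = np.nonzero(A[k])[0]            # signals whose code uses atom k
        if users.size == 0:
            continue                           # unused atom: leave it unchanged
        # residual with atom k's contribution added back, support columns only
        E = X[:, users] - D @ A[:, users] + np.outer(D[:, k], A[k, users])
        U, s, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, k] = U[:, 0]                      # new unit-norm atom
        A[k, users] = s[0] * Vt[0]             # matching coefficients (same support)
    return D, A
```

A full K-SVD iteration alternates this dictionary sweep with a sparse-coding step (e.g., OMP) that recomputes A under the sparsity constraint T.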

Methodology
In this section, we firstly describe the proposed JLDDP, which incorporates DL and projection learning into a unified framework. Secondly, the novel iterative update algorithm of JLDDP is deduced. Thirdly, the convergence analysis is given. Fourthly, we provide the classification scheme, which characterizes the class-specific representation error for FR. Finally, we analyze the guideline for parameter setting.

Mathematical Problems in Engineering
Let P be the projection that reduces the feature dimension of the samples. The structured (class-specific) dictionary is denoted by D = [D_1, D_2, . . . , D_c], where D_i is the i-th class subdictionary. The coding coefficient matrix of P^T Y over D is denoted by X, which can be refined to X = [X_1, X_2, . . . , X_c], where X_i is the i-th class submatrix of the coding coefficient matrix X. Furthermore, X_i can be partitioned as X_i = [X_i^1; X_i^2; . . . ; X_i^c], where X_i^j is the coding coefficient of P^T Y_i over the subdictionary D_j. In JLDDP, the projection, dictionary, and coding coefficients are jointly learned with the following model:

$$\min_{P,D,X}\ R(P, D, X) + \omega_1 \|X\|_1 + \omega_2 C(X) + \omega_3 S(P) \quad \text{s.t.} \quad \|d_k\|_2 = 1,\ \forall k, \qquad (3)$$

where R(P, D, X) denotes the representation error term, ‖X‖_1 is the l_1-regularization on X, C(X) is the coding coefficient term imposing discriminative label information on DL, and S(P) is the projection learning term projecting the samples into a more discriminative space. ω_1, ω_2, and ω_3 are the tradeoff parameters, and each atom d_k in the dictionary has a unit norm. Next, more detailed descriptions of the terms in equation (3) will be given.

Representation Error Term.
When the training samples are represented by a dictionary, we expect the dictionary to have both strong reconstructive ability and strong discriminative ability. In addition, the samples should be reconstructed not only by the whole dictionary but also by the subdictionary of the same class. Therefore, the representation error term is expressed as

$$R(P, D, X) = \sum_{i=1}^{c} \Big( \|P^T Y_i - D X_i\|_F^2 + \|P^T Y_i - D_i X_i^i\|_F^2 + \sum_{j \neq i} \|D_j X_i^j\|_F^2 \Big). \qquad (4)$$

The representation error term is designed to obtain a small representation error between the low-dimensional training samples P^T Y_i and the structured dictionary D. First, each class of low-dimensional training samples P^T Y_i should be well represented by the structured dictionary D, i.e., P^T Y_i ≈ D X_i. Second, each class of low-dimensional training samples should be well represented by the subdictionary of the same class rather than those of other classes, which indicates that P^T Y_i should be represented by D_i as much as possible, but not by D_j (j ≠ i). Hence, X_i^i should have some significant coefficients, and X_i^j (j ≠ i) should have nearly zero coefficients.

Coding Coefficient Term.
We can make the dictionary discriminative by constraining the coding coefficients [24]. According to the Fisher discrimination criterion, the within-class scatter should be minimized and the between-class scatter should be maximized, which makes the coding coefficients discriminative. Hence, the coding coefficient term is formulated as

$$C(X) = \mathrm{tr}(S_w(X)) - \mathrm{tr}(S_b(X)) + \eta \|X\|_F^2, \qquad (5)$$

where S_w(X) = Σ_{i=1}^{c} Σ_{x_k ∈ X_i} (x_k − m_i)(x_k − m_i)^T is the within-class scatter of X, S_b(X) = Σ_{i=1}^{c} n_i (m_i − m)(m_i − m)^T is the between-class scatter of X, n_i is the number of samples in class i, and m_i and m are the mean vectors of X_i and X, respectively. We impose the Fisher discrimination criterion on X to improve the discriminative ability, which indicates that the within-class scatter S_w(X) should be minimized and the between-class scatter S_b(X) should be maximized. ‖X‖_F^2 is an elastic term, and the convexity of equation (5) is proved in [24].
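The coding coefficient term of equation (5) can be evaluated directly from the class-wise scatter definitions. The sketch below computes C(X); the function name and the `labels` layout (one class label per column of X) are our own conventions.

```python
import numpy as np

def fisher_coefficient_term(X, labels, eta=1.0):
    """C(X) = tr(S_w(X)) - tr(S_b(X)) + eta * ||X||_F^2, where the columns
    of X are coding coefficient vectors and `labels` gives their classes.
    The traces are accumulated without forming the scatter matrices."""
    m = X.mean(axis=1, keepdims=True)                 # global mean vector of X
    tr_sw, tr_sb = 0.0, 0.0
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        mc = Xc.mean(axis=1, keepdims=True)           # class mean vector m_i
        tr_sw += ((Xc - mc) ** 2).sum()               # tr(S_w): within-class deviations
        tr_sb += Xc.shape[1] * ((mc - m) ** 2).sum()  # tr(S_b): n_i-weighted mean gaps
    return tr_sw - tr_sb + eta * (X ** 2).sum()       # plus the elastic term
```

The identity tr(Σ v v^T) = Σ ‖v‖² is what lets the traces be computed as sums of squares.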

Projection Learning Term.
The projection matrix P should preserve the energy of the samples as much as possible and make the samples from different classes separable in the low-dimensional space. Therefore, the projection learning term is expressed as

$$S(P) = \mathrm{tr}(S_w(P^T Y)) - \mathrm{tr}(S_b(P^T Y)) - \|P^T Y\|_F^2, \qquad (6)$$

where S_w(P^T Y) = Σ_{i=1}^{c} Σ_{y_k′ ∈ P^T Y_i} (y_k′ − m_i′)(y_k′ − m_i′)^T and S_b(P^T Y) = Σ_{i=1}^{c} n_i (m_i′ − m′)(m_i′ − m′)^T are the within-class scatter and the between-class scatter of P^T Y, respectively. y_k′ denotes the k-th sample from class i in the low-dimensional space, and m_i′ and m′ denote the mean vectors of P^T Y_i and P^T Y, respectively. We adopt the Fisher discrimination criterion on the low-dimensional samples, i.e., tr(S_w(P^T Y)) − tr(S_b(P^T Y)), to enhance the discriminative ability of the features. Moreover, we minimize the term −‖P^T Y‖_F^2 to guarantee that the energy of Y is well preserved.
By incorporating equations (4)-(6), we obtain the JLDDP model as shown in equation (3). The iterative update scheme is adopted to optimize the objective function, and the detailed optimization process of JLDDP is presented in the following section.

Optimization.
The objective function of JLDDP is not jointly convex in P, D, and X, but it is convex with regard to each of them when the others are fixed. Thus, equation (3) can be divided into three subproblems and optimized by an iterative update scheme.

Updating X with Fixed P and D.
Suppose that P and D are fixed; we can update X = [X_1, X_2, . . . , X_c] class by class, i.e., we fix all X_j (j ≠ i) to update X_i. Therefore, the simplified form of equation (3) can be obtained as follows:

$$\min_{X_i} \|P^T Y_i - D X_i\|_F^2 + \|P^T Y_i - D_i X_i^i\|_F^2 + \sum_{j \neq i} \|D_j X_i^j\|_F^2 + \omega_1 \|X_i\|_1 + \omega_2 \Big( \|X_i - M_i\|_F^2 - \sum_{k=1}^{c} \|M_k - M\|_F^2 + \eta \|X_i\|_F^2 \Big), \qquad (7)$$

where M_k and M_i are the mean vector matrices of class k and class i, respectively, and M is the mean vector matrix of all classes. Except for ‖X_i‖_1, the terms in equation (7) are differentiable. Since equation (7) is strictly convex, we can employ the iterative projection method (IPM) [44] to solve it.

Updating D with Fixed P and X.
To obtain the optimal structured dictionary D, we need to update the subdictionary D_i class by class, while P, X, and all other D_j (j ≠ i) are fixed. Then, equation (3) can be simplified as

$$\min_{D_i} \Big\| P^T Y - D_i X^i - \sum_{j \neq i} D_j X^j \Big\|_F^2 + \|P^T Y_i - D_i X_i^i\|_F^2 + \sum_{j \neq i} \|D_i X_j^i\|_F^2 \quad \text{s.t.} \quad \|d_k\|_2 = 1, \qquad (8)$$

where X^i represents the coding coefficients of P^T Y over the subdictionary D_i. We can employ the algorithm in [19] to solve equation (8), i.e., update D_i atom by atom.

Updating P with Fixed D and X.
When the dictionary D and the coding coefficient matrix X are fixed, equation (3) can be simplified to a subproblem in P alone (equation (9)). We can obtain equation (10) by the mathematical derivation of equation (9). If we set the derivative with respect to P to zero in equation (10), we acquire equation (11). For convenience, auxiliary matrices (e.g., t_5 = YY^T) are introduced to replace the corresponding parts of equation (11). Then, we gain the explicit solution of the projection matrix P, as shown in equation (12). The above iterative optimization process of JLDDP stops when the algorithm converges or the maximum number of iterations is attained. Algorithm 1 summarizes the whole optimization process.
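Since equations (9)-(12) are only summarized above, here is a hedged illustration of the derivation pattern rather than the full JLDDP update: keeping only the first representation term ‖P^T Y − DX‖_F² (an assumption made for clarity; the actual update also involves the scatter terms), setting its gradient with respect to P to zero yields the linear system (YY^T)P = YX^T D^T, in which the t_5 = YY^T block appears.

```python
import numpy as np

def projection_closed_form(Y, D, X, eps=1e-6):
    """Illustrative closed-form update: minimize ||P^T Y - D X||_F^2 over P.
    Setting the gradient 2 Y (Y^T P - X^T D^T) to zero gives
    (Y Y^T) P = Y X^T D^T; eps * I regularizes the inverse.
    This is a simplified sketch, not the complete equation (12)."""
    m = Y.shape[0]
    t5 = Y @ Y.T                                   # the t_5 = Y Y^T block
    return np.linalg.solve(t5 + eps * np.eye(m), Y @ X.T @ D.T)
```

The full update would add the within-class, between-class, and energy terms of equation (6) to the linear system before solving.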

Convergence.
The optimization process of JLDDP can be divided into three subproblems that are solved iteratively, as formulated in equations (7), (8), and (12). It has been proved in [24] that the subproblem in equation (7) is convex. Obviously, equation (8) is a quadratic program, so it is convex. In each iteration, the objective value declines after solving X and D via equations (7) and (8), respectively, as proved in [21,44]. Moreover, the subproblem in equation (12) has an explicit solution.
Thus, to justify the convergence of JLDDP, we need to demonstrate that the value of equation (3) is nonincreasing after each optimization step. For convenience, let ϕ(P, D, X) denote the objective function of JLDDP. Before proving the convergence of Algorithm 1, we first establish Theorem 1. Theorem 1. If Algorithm 1 is used to solve ϕ(P, D, X), the objective function value is nonincreasing.
Proof. Let ϕ(P_t, D_t, X_t) denote the objective value in the t-th iteration.
When solving the subproblem min_X ϕ(P_t, D_t, X), we utilize the method in [44] to obtain the optimal X_{t+1} with fixed P_t and D_t. This subproblem is convex, so we can obtain

$$\phi(P_t, D_t, X_{t+1}) \leq \phi(P_t, D_t, X_t). \qquad (13)$$

When solving the subproblem min_D ϕ(P_t, D, X_{t+1}), we employ the method in [21] to obtain the optimal D_{t+1} with fixed P_t and X_{t+1}. It is still a convex problem, so we have

$$\phi(P_t, D_{t+1}, X_{t+1}) \leq \phi(P_t, D_t, X_{t+1}). \qquad (14)$$

When solving the subproblem min_P ϕ(P, D_{t+1}, X_{t+1}), we can obtain the explicit solution P_{t+1} with fixed D_{t+1} and X_{t+1} based on equation (12). Therefore,

$$\phi(P_{t+1}, D_{t+1}, X_{t+1}) \leq \phi(P_t, D_{t+1}, X_{t+1}). \qquad (15)$$

Combining equations (13)-(15), we have

$$\phi(P_{t+1}, D_{t+1}, X_{t+1}) \leq \phi(P_t, D_t, X_t). \qquad (16)$$

Now, the theorem has been proved. Since each term in equation (3) is nonnegative, the objective function value has a lower bound. According to Theorem 1 and the Cauchy convergence criterion [45], the optimization algorithm presented for JLDDP is convergent.
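Theorem 1 is the standard monotonicity argument for alternating minimization. The toy example below (our own construction, unrelated to face data) makes the chain of inequalities concrete: the function is convex in each variable with the other fixed, so exact coordinate-wise updates produce a nonincreasing objective sequence.

```python
def alternating_minimization(n_iter=20):
    """Alternating exact minimization of f(x, y) = (x-1)^2 + (y-2)^2 + x^2 y^2.
    f is biconvex (convex in x for fixed y and vice versa), so each update
    cannot increase f -- the analogue of equations (13)-(16)."""
    f = lambda x, y: (x - 1) ** 2 + (y - 2) ** 2 + x ** 2 * y ** 2
    x, y, values = 0.0, 0.0, []
    for _ in range(n_iter):
        x = 1.0 / (1.0 + y ** 2)   # argmin_x f(x, y): solve df/dx = 0
        y = 2.0 / (1.0 + x ** 2)   # argmin_y f(x, y): solve df/dy = 0
        values.append(f(x, y))
    return values
```

Because f is bounded below by 0, the nonincreasing sequence of values converges, which is exactly the Cauchy-criterion argument used in the text.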

Classification.
The learned projection P reduces the dimension of the testing sample y_t, and the low-dimensional feature P^T y_t can be coded over the learned dictionary D. Therefore, we can obtain the coding coefficient vector x′ by

$$x' = \arg\min_{x} \|P^T y_t - Dx\|_2^2 + \alpha \|x\|_1, \qquad (17)$$

where x′ = [x_1′; x_2′; . . . ; x_c′] is the coding coefficient vector, x_i′ is the coding coefficient vector associated with class i, and α is a tradeoff parameter. The structured dictionary D is learned to ensure that the coding coefficients of the same class are similar and the coding coefficients of different classes are different. In addition, the coding coefficients have a stronger discriminative ability through the constraints of the Fisher discrimination criterion. Therefore, not only the representation error but also the distance information of the coding coefficients obtained by equation (17) is useful for classification. We classify the testing sample y_t by

$$\mathrm{label}(y_t) = \arg\min_{i}\ \|P^T y_t - D_i x_i'\|_2^2 + c \|x' - \bar{x}_i'\|_2^2, \qquad (18)$$

where x̄_i′ is the mean coefficient vector related to class i and c is a tradeoff parameter.
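The classification scheme of equations (17) and (18) can be sketched as follows. The `coeff_means` argument (one mean training code per class) and the ISTA solver are our assumptions about details the text leaves implicit; the atom-to-class map replaces the block notation D_i.

```python
import numpy as np

def ista(D, y, lam, n_iter=300):
    """Minimal l1 solver for x' = argmin_x ||y - D x||_2^2 + lam ||x||_1."""
    L = 2 * np.linalg.norm(D, 2) ** 2
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = x - 2 * D.T @ (D @ x - y) / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return x

def jlddp_classify(P, D, atom_labels, coeff_means, y, alpha=0.01, c=0.5):
    """Score each class i by ||P^T y - D_i x'_i||^2 + c * ||x' - mean_i||^2
    (equation (18)); atom_labels[k] is the class of dictionary column k, and
    coeff_means[i] is the mean training coefficient vector of class i."""
    yp = P.T @ y                       # project the test sample
    x = ista(D, yp, alpha)             # code it over the whole dictionary (eq. (17))
    scores = {}
    for i in np.unique(atom_labels):
        mask = atom_labels == i
        rec_err = np.linalg.norm(yp - D[:, mask] @ x[mask]) ** 2
        scores[i] = rec_err + c * np.linalg.norm(x - coeff_means[i]) ** 2
    return min(scores, key=scores.get)
```

Setting c = 0 recovers the plain SRC residual rule; c > 0 mixes in the coefficient-distance information that the Fisher constraint makes meaningful.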
Parameter Setting.
There are three tradeoff parameters, ω_1, ω_2, and ω_3, in the JLDDP model; therefore, how to properly set their values is important. Fortunately, each parameter has a clear physical meaning, which supplies a guideline for setting its value. The parameter ω_1 controls the sparsity of the coding coefficient matrix, and its value needs to be moderate. The parameter ω_2 adjusts the coding coefficient term based on the Fisher discrimination criterion, and its value should be set neither too small nor too large: an extremely small ω_2 value will lead to the loss of latent discriminative information, while a too large ω_2 value will cause the other terms to be neglected. The parameter ω_3 constrains the projection learning term based on the Fisher discrimination criterion. Analogous to ω_2, a relatively small ω_3 value decreases the effect of the projection learning term, whereas a relatively large ω_3 value makes the objective function dominated by the projection learning term so that the role of the other terms is neglected.

Comparison with the Existing Work.
In order to highlight the novelty of our work, we compare the proposed JLDDP method with some related studies. First, although some terms in the objective function of FDDL [24] are similar to those in our JLDDP, the two methods are different. Specifically, FDDL utilizes PCA to project the original features into a low-dimensional subspace, which is separated from the process of dictionary learning. Thus, FDDL does not exploit the relationship between the low-dimensional features and the learned dictionary and cannot effectively learn appropriate features for the discriminative dictionary learning task. To solve this problem, our proposed JLDDP simultaneously learns the feature projection matrix and the dictionary in a unified framework, which ensures that the learned projection matrix is most beneficial for discriminative dictionary learning.
That is, the learned projection matrix and dictionary in our JLDDP are relevant and mutually beneficial. Hence, jointly optimizing them can achieve better performance for face recognition. Second, the proposed JLDDP also resembles the dictionary learning methods in [46][47][48]. However, there exist some significant differences between them. To be specific, (1) the methods in [46][47][48], respectively, learn multiple class-specific subdictionaries and a common subdictionary shared by all classes. Then, they combine the learned class-specific subdictionaries and the common subdictionary to achieve the recognition task. In our JLDDP, we only need to learn a subdictionary for each class and combine all subdictionaries into a whole dictionary. Therefore, there is no need to learn and update a common dictionary during the model optimization, which ensures that our model has a fast convergence speed and high computational efficiency. (2) Similar to FDDL, the methods in [46][47][48] do not consider feature projection matrix learning in the process of dictionary learning. Thus, the feature projection is separated from the process of dictionary learning, which cannot learn the best combination of the low-dimensional features and the dictionary for face recognition. (3) The regularization criteria in the objective functions adopted in [46][47][48] are different from those of our proposed JLDDP; e.g., [46,48] used the l_1-norm and [47] used the l_{2,1}-norm to enforce the learned coefficients of the dictionary to be sparse, while our proposed JLDDP utilizes the intraclass and interclass scatter of the coefficients as constraints, which improves the discrimination of the model.

(1) Input: the training set Y = [Y_1, Y_2, . . . , Y_c], iteration number T, and parameters ω_1, ω_2, and ω_3.
(2) Initialize the projection P_0 and the dictionary D_0.
(3) Repeat steps 4-6 until convergence or until t = T.
(4) Update X_t with fixed P_{t-1} and D_{t-1} by equation (7).
(5) Update D_t with fixed P_{t-1} and X_t by equation (8).
(6) Update P_t with fixed D_t and X_t by equation (12).
(7) Output: projection matrix P, structured dictionary D, and coding coefficient matrix X.

ALGORITHM 1: The algorithm of JLDDP.

Third, Lin et al. [49] proposed the RDCDL method, which utilizes low-rank and sparse constraints to extract the disturbance components (e.g., noise, outliers, and occlusion) in the training samples. In RDCDL, a set of training samples and a set of alternative training samples with simulated facial variations are employed to build a dictionary learning model with a complex and comprehensive dictionary. The comprehensive dictionary includes a class-shared dictionary, a class-specific dictionary, a simulated disturbance dictionary, and a real disturbance dictionary.
The main difference between our JLDDP and RDCDL lies in that we only adopt class-specific subdictionaries to construct the whole dictionary, which is simpler than Lin's model and greatly decreases the computational complexity. Besides, RDCDL utilizes PCA to reduce the feature dimension of the samples, which is separated from the process of dictionary learning. In contrast, our JLDDP combines the processes of feature projection and dictionary learning into a unified framework to obtain a more suitable low-dimensional feature, which is quite different from RDCDL. Moreover, it is worth noting that RDCDL only adopts the intraclass scatter of the coefficients as the discrimination constraint but neglects the interclass scatter, while our JLDDP utilizes both the intraclass scatter and the interclass scatter to improve the discriminative ability of the learned dictionary. Fourth, Zhang et al. [40] proposed an SS-DSPP model which can simultaneously learn the dictionary and the projection matrix, but it is still very different from our JLDDP in the following aspects. SS-DSPP takes advantage of the relationship between the reconstruction error of training samples over the same-class dictionary and the reconstruction error over the dictionaries of different classes. Nevertheless, the discrimination constraint on the coefficients is not considered. In addition, SS-DSPP ignores the class information of the low-dimensional features obtained after projection and only imposes an orthogonal constraint on the projection matrix, which reduces the discrimination capability of the model to some extent. To solve these problems, our JLDDP utilizes the Fisher discrimination criterion to constrain the intraclass and interclass scatters of the coefficients and the low-dimensional samples, which ensures the discrimination ability of the JLDDP model.
In summary, although the proposed method shares several similarities with the aforementioned approaches [24,40] and [46][47][48][49], our JLDDP differs from them in the dictionary learning process, the projection learning process, or the coefficient constraint. Specifically, JLDDP simultaneously learns the dictionary and the projection matrix in a unified framework while adopting the intraclass and interclass scatter as constraints on both the coefficients and the samples.
Thus, JLDDP can explore the intrinsic relationship between dictionary learning and feature learning, which improves the classification performance for both image-based and video-based face recognition.

Experimental Results
We conduct extensive experiments on image-based and video-based face databases to confirm the validity of JLDDP.

Image Database Description. The ORL [50], CMU PIE [51], FERET [52], and LFW [53] databases are used to prove the validity of JLDDP for image-based face recognition. Some examples from the ORL, CMU PIE, FERET, and LFW databases are shown in Figure 1.
The ORL face database includes 400 images of 40 subjects. The images reflect changes in illumination, pose, expression, and whether glasses are worn. The CMU PIE face database includes 41,368 images of 68 subjects. Under 43 distinct illumination conditions, images are taken across 13 various poses and with 4 diverse expressions. We adopt a subset of 24 images for each person in this experiment. The FERET database is recorded in a real environment with a large number of images. It includes 14,051 face images of more than 1,000 subjects. The face images have the characteristics of different expressions, postures, and illuminations. In addition, the time span of image acquisition in the FERET database is very large. We adopt a subset that contains 1,400 images of 200 subjects in this experiment. The LFW database is collected in unconstrained environments, which is very challenging. This database contains 13,233 face images of 5,749 subjects. However, most of the people have only one image in the database. Therefore, we select 158 subjects from LFW, each of which has at least 10 distinct images, to verify the effectiveness of the algorithms. In [54], a sparse representation-based alignment method was proposed for real-world images, which can eliminate the variation of orientations, expressions, and other factors as much as possible. We use this method to preprocess the original LFW database for all the recognition methods. Table 1 provides the detailed database information. All images are clipped by selecting eye coordinates manually and normalized to 32 × 32 pixels.

Experiment Setting.
In the image-based face recognition task, we compare our method with some representative methods, including SRC [9] with PCA and LDA, LC-KSVD [23], FDDL [24], DRSRC [34], LSD [29], DSRC [40], JDDRDL [41], and JNPDL [42]. The l1-ls toolbox [55] is adopted to handle the l_1-minimization problem in the SRC-related algorithms. The source code of the l1-ls toolbox can be found at http://web.stanford.edu/~boyd/l1_ls/. The source code of FDDL can be found at http://www4.comp.polyu.edu.hk/cslzhang/code/FDDL.zip. The source code of LC-KSVD can be found at http://users.umiacs.umd.edu/~zhuolin/projectlcksvd.html. The other methods are based on our implementations, and the parameters are tuned based on the settings reported in their papers. We set the number of atoms for each class of the dictionary in JLDDP as half of the training samples. With randomly chosen training and testing samples, experiments are conducted 10 times in total, and the average recognition accuracies and standard deviations are reported. All the methods are developed in MATLAB and implemented on a computer with an Intel Core i3-2100 CPU at 3.2 GHz and 8 GB physical memory.
We first compare the recognition performance under various feature dimensions, and next, we compare the recognition performance under various numbers of training samples. For convenience, the numbers of training and testing samples are represented by l and h, respectively. Tables 2 and 3 show the data descriptions.
We also compare the recognition performance under different parameter values. We adjust the parameter values by searching the grid {0, 0.0001, 0.001, 0.01, 0.1, 1} in an alternating manner to obtain the optimal parameter combination. Finally, we provide the convergence evaluation.
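This alternating search over the grid can be sketched as follows. It is an illustrative reconstruction, not the paper's code: `evaluate` stands in for a full train-and-test run returning average accuracy, and the function name and defaults are assumptions.

```python
GRID = [0, 0.0001, 0.001, 0.01, 0.1, 1]

def alternate_grid_search(evaluate, n_params=3, n_sweeps=2):
    """Tune parameters one at a time: sweep each over GRID while the others
    stay fixed, keep any improvement, and repeat for a few passes."""
    params = [GRID[1]] * n_params        # start from a small nonzero value
    best = evaluate(params)
    for _ in range(n_sweeps):
        for i in range(n_params):        # alternate over the parameters
            for v in GRID:
                cand = params.copy()
                cand[i] = v
                score = evaluate(cand)
                if score > best:         # keep only strict improvements
                    best, params = score, cand
    return params, best
```

Compared with an exhaustive search over all 6³ combinations, this needs only a few sweeps of 6 evaluations each per parameter, at the cost of possibly settling in a local optimum of the grid.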

Recognition Results and Analysis. (1) Recognition Performance under Different Feature Dimensions. In the first experiment, we employ different feature dimensions to verify the performance of the various methods. Table 2 shows the number of training samples and the reduced feature dimensions. The reduced feature dimension of LDA can be at most one less than the number of classes, so we cannot vary its feature dimension as we do for the other methods.
Thus, the results of LDA + SRC are not shown in the first experiment. In LC-KSVD and FDDL, PCA is adopted to reduce the sample dimension. Tables 4-7 report the recognition accuracies on the four databases under various feature dimensions. In most cases, JLDDP outperforms the other methods. Several further points can be seen from the tables. First, DRSRC is an unsupervised DR method designed on the basis of SRC, so its accuracy is higher than that of PCA + SRC in most cases. This illustrates that a well-designed projection is more suitable for classification. Second, compared with PCA + SRC and DRSRC, the average recognition accuracies of LC-KSVD, FDDL, and LSD are higher. The reason is that, after the sample dimension is reduced with PCA, LC-KSVD, FDDL, and LSD can learn a representative and discriminative dictionary, which plays a key role in SRC. Third, LC-KSVD, FDDL, and LSD enhance the discrimination ability of the dictionary, but they do not jointly learn a projection that can preserve much discriminative information. Therefore, their performance is not as good as that of JDDRDL, DSRC, JNPDL, and JLDDP. Nevertheless, the experimental results still indicate that JLDDP achieves relatively stable and high recognition accuracy under different feature dimensions. The superiority of our approach lies in the fact that JLDDP can discover the latent discriminative ability of samples in the low-dimensional space and learn the class-specific dictionary simultaneously.
(2) Recognition Performance under Various Numbers of Training Samples. The effectiveness of JLDDP under various numbers of training samples is compared with that of the other methods on the ORL, CMU PIE, FERET, and LFW databases. The numbers of training and testing samples used are listed in Table 3. Tables 8-11 show the recognition accuracies, with the corresponding feature dimensions annotated in parentheses. When there are only 2 training samples per subject, JDDRDL, DSRC, JNPDL, and JLDDP, which learn the dictionary and projection jointly, obtain better performance than the other methods. As the number of training samples increases, the performance of all the methods improves in general, except for LDA + SRC and LC-KSVD on the FERET database. Compared with the other methods, JLDDP achieves the best average recognition accuracies at a relatively small feature dimension, which demonstrates its suitability for practical applications.

(3) Recognition Performance under Different Parameter Values. We test the impact of various parameter values on the four image-based face recognition databases. Since there are three parameters in the proposed JLDDP, we fix two of them and analyze the influence of the remaining one. The physical meaning of the parameters is described in Section 3. For the ORL, CMU PIE, FERET, and LFW databases, the number of training samples is set to 5, 7, 4, and 5, respectively. The top average recognition results obtained by JLDDP under various parameter values are shown in Figure 2. When ω1, ω2, and ω3 equal zero, the recognition accuracy of JLDDP is relatively low, which indicates that each term in the objective function of JLDDP is significant for classification. As each parameter value increases, the performance of JLDDP improves gradually. When ω1 = 0.0001, ω2 = 0.0001 or 0.001, and ω3 = 0.001 or 0.01, the proposed JLDDP performs best on the four databases. However, after reaching its best performance, the recognition accuracy decreases dramatically as each parameter value increases further. Hence, ω1, ω2, and ω3 should be set to moderate values to obtain good performance, which conforms to our analysis in Section 3. That is, if a parameter value is too large, the corresponding term in equation (11) plays a dominant role and causes the other terms to be neglected. In contrast, if a parameter value is too small, the corresponding term loses its constraint ability.
To further evaluate the role of each term in our model, we set each of ω1, ω2, and ω3 to zero in turn and test the performance of JLDDP. Here, the number of training samples is set to 5, 7, 4, and 5 for the ORL, CMU PIE, FERET, and LFW databases, respectively. The top average recognition results obtained by JLDDP in each situation are shown in Table 12. In this table, the baselines are the results obtained with the optimal parameter combination in Tables 9-11. From the experimental results, we can see that the proposed method cannot achieve its best recognition accuracy when any one of ω1, ω2, and ω3 equals zero, which indicates that the sparse constraint term, the coding coefficient term, and the projection learning term are all essential for improving the recognition performance of JLDDP. Moreover, the recognition accuracies decrease dramatically when ω1 is set to zero, that is, when the sparse constraint term is omitted, which indicates that the sparse constraint on the dictionary representation is very important for the discriminative ability of our model. Furthermore, the recognition accuracies are very close when ω2 or ω3 is set to zero, but are much lower than the baselines. This means that the coding coefficient term and the projection learning term are also indispensable in JLDDP, since they bring intraclass and interclass information into the model to ensure the discrimination of the coefficients and the low-dimensional features.
(4) Convergence Evaluation. Figure 3 shows the convergence curves of JLDDP on the ORL, CMU PIE, FERET, and LFW databases. In each figure, the x-axis represents the iteration number, and the y-axis represents the value of the objective function. From this figure, we can see that the proposed iterative updating algorithm of JLDDP converges, which conforms to our convergence analysis in Section 3.

Classification Scheme.
To further evaluate the performance of JLDDP, we perform face recognition experiments on video. Here, we suppose V^t = {v^t_1, ..., v^t_j, ..., v^t_{n_t}} is a testing face video, where v^t_j is the j-th (1 ≤ j ≤ n_t) frame and n_t is the total number of frames. Following Lu et al. [43], we project each frame into the low-dimensional feature space with the learned projection P and then obtain the corresponding coding coefficients by equation (17). The class label of each frame can then be obtained as in [42]:

label(v^t_j) = arg min_i ‖ P v^t_j − \hat{v}^t_{ij} ‖_2,

where \hat{v}^t_{ij} is the projection of P v^t_j onto the span of the atoms in D_i [26]. Finally, after all frames are labeled, we apply majority voting to determine the label of the testing video:

label(V^t) = arg max_i Z_i,

where Z_i denotes the total number of votes for the i-th class.
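The per-frame minimum-residual rule followed by majority voting can be sketched as below. This is a hedged illustration with hypothetical names: for self-containedness, a plain least-squares coder stands in for the sparse coding of equation (17).

```python
import numpy as np
from collections import Counter

def least_squares_code(D, z):
    """Stand-in coder: least-squares coefficients of z over dictionary D
    (the paper instead uses the sparse codes from its equation (17))."""
    return np.linalg.lstsq(D, z, rcond=None)[0]

def classify_video(frames, P, class_dicts, code):
    """Label each frame by the class whose sub-dictionary D_i best
    reconstructs its projected feature, then majority-vote over frames."""
    votes = []
    for v in frames:                       # v: raw frame feature vector
        z = P @ v                          # project into the low-dim space
        # residual of z against each class-specific dictionary D_i
        residuals = [np.linalg.norm(z - D @ code(D, z)) for D in class_dicts]
        votes.append(int(np.argmin(residuals)))
    # Z_i = number of frames voting for class i; return the arg max
    return Counter(votes).most_common(1)[0][0]
```

Voting over all frames makes the video-level decision robust to a few misclassified frames, e.g., those with extreme pose or blur.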

Video Database Description.
The Honda [56], MoBo [57], and YTC [58] databases are employed to verify the performance of JLDDP. All the videos in the Honda database are recorded indoors under normal lighting conditions. The YTC database is a large, highly compressed, low-resolution video database for face recognition. Each video contains 8 to 400 frames. In the experiments, the cascaded face detector [59] is used to detect the faces, which are then resized to 30 × 30 grayscale images.

Experiment Setting.
We compare the proposed JLDDP with several classical video-based face recognition methods, including MSM [60], DCC [61], MMD [62], MDA [63], AHISD [64], CHISD [64], SANP [65], DFRV [27], LSD [29], and SFDL [43]. The source code of DCC can be found at http://mi.eng.cam.ac.uk/~tkk22. The source code of AHISD and CHISD can be found at http://mlcv.ogu.edu.tr/softwareimageset.html. Since the source codes of the other methods are not provided by their authors, we implement them ourselves and follow the parameter settings in the corresponding papers. In the video-based experiments, the parameters ω1, ω2, and ω3 of JLDDP are empirically set to 0.0001, 0.0005, and 0.005, respectively. The number of atoms per class for the Honda, MoBo, and YTC databases is set to 20, 25, and 40, respectively. We report the best accuracy achieved by each method. In the first experiment, the proposed JLDDP is compared with the state-of-the-art methods. The training set of the Honda and MoBo databases contains one video of each subject, and the testing set contains the remaining videos. If a subject has only one video, we separate it into two clips and randomly select one for training and the other for testing. The training set of the YTC database contains 3 videos of each subject, and the testing set contains 6 videos of each subject. In the second experiment, we test the influence of different numbers of training and testing frames on the performance of the various methods. We randomly choose 50, 100, and 200 frames from each video as the training set and another 50, 100, and 200 frames as the testing set.

Comparison with the Contrast Methods.
In the first experiment, our JLDDP is compared with several existing methods. Table 13 tabulates the recognition accuracies of the methods on the Honda, MoBo, and YTC databases. The recognition accuracies of MDA, LSD, SFDL, and JLDDP are higher than those of MSM, DCC, MMD, AHISD, CHISD, SANP, and DFRV in most cases. Therefore, we can infer that the supervised methods exploit more discriminative information than the unsupervised methods. Moreover, our JLDDP surpasses the compared methods. The main reason is that JLDDP projects the frames into a discriminative low-dimensional subspace, which helps it obtain discriminative coding coefficients with the class-specific dictionary.

Comparison under Various Numbers of Frames.
In the second experiment, various numbers of frames are selected as the training set to compare the robustness of JLDDP with that of the other methods. Figure 4 shows the top average recognition accuracies of the different methods on the Honda, MoBo, and YTC databases with various numbers of frames. The recognition accuracies improve as the number of frames increases. JLDDP achieves the best recognition accuracy under different numbers of frames. This is because jointly learning the projection and dictionary enables JLDDP to obtain more discriminative information.

Conclusions
This paper presents JLDDP, a method for sparse representation-based face recognition. By combining DL and DR into a unified framework, JLDDP obtains an adaptive projection and dictionary. The proposed JLDDP achieves commendable performance and robustness on seven benchmark image-based and video-based databases. Moreover, an effective iterative algorithm is proposed to solve the optimization problem, and its convergence is strictly proven.
Data Availability.
The data are derived from public domain resources.