Sparsity Preserving Discriminant Projections with Applications to Face Recognition

Dimensionality reduction is extremely important for understanding the intrinsic structure hidden in high-dimensional data. In recent years, sparse representation models have been widely used in dimensionality reduction. In this paper, a novel supervised learning method, called Sparsity Preserving Discriminant Projections (SPDP), is proposed. SPDP, which attempts to preserve the sparse representation structure of the data and maximize the between-class separability simultaneously, can be regarded as a combination of manifold learning and sparse representation. Specifically, SPDP first creates a concatenated dictionary by classwise PCA decompositions and learns the sparse representation structure of each sample under the constructed dictionary using the least squares method. Secondly, a local between-class separability function is defined to characterize the scatter of the samples in the different submanifolds. Then, SPDP integrates the learned sparse representation information with the local between-class relationship to construct a discriminant function. Finally, the proposed method is transformed into a generalized eigenvalue problem. Extensive experimental results on several popular face databases demonstrate the feasibility and effectiveness of the proposed approach.


Introduction
In many fields such as object recognition [1,2], text categorization [3], and information retrieval [4], the data are usually provided in high-dimensional form; this makes it difficult to describe, understand, and recognize these data. As an effective method, dimensionality reduction has been widely used in practice to handle these problems [5][6][7][8]. Up to now, a variety of dimensionality reduction algorithms have been designed. Based on the data structure they utilize, these methods fall into three categories: global structure-based methods, local neighborhood-based methods, and sparse representation-based methods.
Principal Component Analysis (PCA) [9], Linear Discriminant Analysis (LDA) [10], and their kernelized versions are typical global structure-based methods [11,12]. Owing to its simplicity and effectiveness, PCA, which aims at maximizing the variance of the projected data, has extensive applications in the fields of science and engineering. PCA is a good dimensionality reduction method; however, it does not employ the label information of the samples, which limits its effectiveness for classification. Unlike PCA, LDA is a supervised method that attempts to identify an optimal projection by maximizing the between-class scatter while minimizing the within-class scatter. Because the label information is fully exploited, LDA has been proven more efficient than PCA in classification [13]. However, LDA can extract at most $c - 1$ features ($c$ is the number of categories), which is unacceptable in many situations. Moreover, both PCA and LDA are based on the hypothesis that samples from each class lie on a linear subspace [14,15]; that is, neither of them can identify the local submanifold structure hidden in high-dimensional data.
Recently, manifold learning methods, which are especially useful for the analysis of data that lie on a submanifold of the original space, have been proposed [16][17][18][19][20][21][22][23][24][25][26]. Representative manifold learning methods include Isomap [16], Laplacian Eigenmaps (LE) [17], and Locally Linear Embedding (LLE) [18]. All these nonlinear methods are able to discover the optimal feature subspace by solving an optimization problem based on a weight graph; however, none of them can overcome the "out-of-sample" problem [19]. That is, they yield maps that are defined only on the training data points, and how to evaluate the maps on new test data points remains unclear. In order to address this problem, Cai et al. developed linear versions of the above manifold learning methods: isometric projection [20], Locality Preserving Projections (LPP) [21], and Neighborhood Preserving Embedding (NPE) [22], respectively. However, these methods suffer from the limitation that they do not encode discriminant information, which is very important for recognition tasks. Recently, Gui et al. proposed a new supervised learning algorithm called Locality Preserving Discriminant Projections (LPDP) to improve the classification performance of LPP and applied it to face recognition [26]. Experimental results show that LPDP is more suitable for recognition tasks than LPP.
Sparse representation, as a new branch of the state-of-the-art techniques for signal representation, has attracted considerable research interest [27][28][29][30][31][32][33][34][35][36][37][38]. It attempts to preserve the sparse representation structure of the samples in a low-dimensional embedding subspace. The representative dimensionality reduction algorithms based on sparse representation include Sparsity Preserving Projections (SPP) [39], Sparsity Preserving Discriminant Analysis (SPDA) [40], Discriminative Learning by Sparse Representation Projections (DLSP) [41], Sparse Tensor Discriminant Analysis (STDA) [42], and sparse nonnegative matrix factorization [43]. It is worthwhile to note that a sparse model also depends on the subspace assumption: each sample can be linearly expressed by other samples from the same class; that is, each sample can be sparsely recovered by samples from all classes. In general, these sparse learning algorithms provide superior recognition accuracy compared with the conventional methods. However, all the dimensionality reduction methods based on sparse coding mentioned above must solve an $\ell_1$ norm minimization problem to construct the sparse weight matrix. Therefore, they are computationally prohibitive for large-scale problems. For example, SPP attempts to preserve the sparse reconstructive relationship of the data [39], which is an effective and powerful technique for dimensionality reduction. However, the computational complexity of SPP is excessively high, and hence it cannot be used extensively for large-scale data processing (in fact, the time cost for constructing the sparse weight graph is $O(n^4)$, where $n$ indicates the total number of training samples). Moreover, SPP does not use the label information. Thus, the algorithm is unsupervised.
Motivated by the above works, a novel supervised learning method, called Sparsity Preserving Discriminant Projections (SPDP), is proposed in this paper. By integrating SPP with local discriminant information for dimensionality reduction, SPDP can be viewed as a combination of sparse representation and manifold learning. Because sparse representation can implicitly discover the local structure of the data owing to the sparsity prior, this property can be used to describe the local structure. However, differing from the existing SPP, which is time-consuming in sparse reconstruction for each test sample, SPDP first creates a concatenated dictionary using classwise PCA decompositions and quickly learns the sparse representation structure of each sample under the constructed dictionary with the least squares method. Then, a local between-class separability function is defined to characterize the scatter of the samples in the different submanifolds. Subsequently, by integrating the sparse representation information with the local between-class relationship, SPDP attempts to preserve the sparse representation structure of the data and maximize the local between-class separability simultaneously. Finally, the proposed method is converted into a generalized eigenvalue problem.
It is worth emphasizing some merits of SPDP and the main contributions of this paper:
(1) SPDP is a supervised dimensionality reduction method that attempts to identify a discriminating subspace where the sparse representation structure of the data and the label information are maintained. Meanwhile, the separability of different submanifolds is maximized; that is, different submanifolds can be distinguished more clearly.
(2) SPDP is able to explore the local submanifold structure hidden in high-dimensional data because manifold learning is employed to characterize the local between-class separability.
(3) The time required for extracting discriminant vectors in SPDP is significantly less than that of many algorithms based on sparse representation. Therefore, the proposed method can be widely applied to large-scale problems.
(4) Label information is employed twice in SPDP. First, it is used in constructing the dictionary for sparse representation and calculating the sparse coefficient vector, which may contribute to a more discriminating sparse representation structure. Further, it is utilized in computing the local between-class separability, which is more conducive to classification.
The rest of this paper is organized as follows: Section 2 briefly reviews the existing SPP algorithm. The SPDP algorithm is described in detail in Section 3. The experimental results and analysis are presented in Section 4, and the paper ends with concluding remarks in Section 5.

Brief Review of Sparsity Preserving Projections (SPP)
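SPP aims to preserve the sparse reconstruction relationship of the samples [39]. Given a set of training samples $\{\mathbf{x}_i\}_{i=1}^{n}$, where $\mathbf{x}_i \in \mathbb{R}^{m}$ and $n$ is the number of training samples, let $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n] \in \mathbb{R}^{m \times n}$ be the data matrix consisting of all the training samples. SPP first seeks the sparse reconstruction coefficient vector $\mathbf{s}_i$ for each sample $\mathbf{x}_i$ through the following modified $\ell_1$ minimization problem:

$$\min_{\mathbf{s}_i} \|\mathbf{s}_i\|_1 \quad \text{s.t.} \quad \mathbf{x}_i = \mathbf{X}\mathbf{s}_i, \quad 1 = \mathbf{1}^{T}\mathbf{s}_i, \tag{1}$$

where $\mathbf{s}_i = [s_{i,1}, \ldots, s_{i,i-1}, 0, s_{i,i+1}, \ldots, s_{i,n}]^{T}$ is an $n$-dimensional column vector in which the $i$th element is equal to zero, implying $\mathbf{x}_i$ is removed from $\mathbf{X}$, and the element $s_{i,j}$, $j \neq i$, denotes the contribution of $\mathbf{x}_j$ for reconstructing $\mathbf{x}_i$. Then, the sparse reconstructive weight matrix $\mathbf{S}$ is given as follows:

$$\mathbf{S} = [\tilde{\mathbf{s}}_1, \tilde{\mathbf{s}}_2, \ldots, \tilde{\mathbf{s}}_n]^{T}, \tag{2}$$

where $\tilde{\mathbf{s}}_i$ is the optimal solution of (1). The final optimal projection vector $\mathbf{w}$ is obtained through the following maximization problem:

$$\max_{\mathbf{w}} \frac{\mathbf{w}^{T}\mathbf{X}\mathbf{S}_{\beta}\mathbf{X}^{T}\mathbf{w}}{\mathbf{w}^{T}\mathbf{X}\mathbf{X}^{T}\mathbf{w}}, \tag{3}$$

with $\mathbf{S}_{\beta} = \mathbf{S} + \mathbf{S}^{T} - \mathbf{S}^{T}\mathbf{S}$. This problem transforms to a generalized eigenvalue problem. It follows that SPP must solve $n$ time-consuming $\ell_1$ norm minimization problems to obtain the sparse weight matrix $\mathbf{S}$. Thus, the computational complexity of SPP is excessively high, and the method is therefore not widely applicable to large-scale data processing. Moreover, SPP does not exploit the prior knowledge of class information, which is valuable for classification and recognition problems such as face recognition.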
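As a short check on where the matrix $\mathbf{S}_{\beta} = \mathbf{S} + \mathbf{S}^{T} - \mathbf{S}^{T}\mathbf{S}$ comes from, the following sketch expands the total reconstruction error in the projected space. It assumes that $\mathbf{S}$ stacks the coefficient vectors $\mathbf{s}_i^{T}$ as rows, as in SPP [39]; the standard basis vector $\mathbf{e}_i$ is a symbol introduced here only for the derivation.

$$
\begin{aligned}
\sum_{i=1}^{n} \left\| \mathbf{w}^{T}\mathbf{x}_i - \mathbf{w}^{T}\mathbf{X}\mathbf{s}_i \right\|^{2}
&= \sum_{i=1}^{n} \mathbf{w}^{T}\mathbf{X}(\mathbf{e}_i - \mathbf{s}_i)(\mathbf{e}_i - \mathbf{s}_i)^{T}\mathbf{X}^{T}\mathbf{w} \\
&= \mathbf{w}^{T}\mathbf{X}\,(\mathbf{I} - \mathbf{S}^{T})(\mathbf{I} - \mathbf{S})\,\mathbf{X}^{T}\mathbf{w} \\
&= \mathbf{w}^{T}\mathbf{X}\mathbf{X}^{T}\mathbf{w} - \mathbf{w}^{T}\mathbf{X}\,(\mathbf{S} + \mathbf{S}^{T} - \mathbf{S}^{T}\mathbf{S})\,\mathbf{X}^{T}\mathbf{w}.
\end{aligned}
$$

Under the normalization $\mathbf{w}^{T}\mathbf{X}\mathbf{X}^{T}\mathbf{w} = 1$, minimizing the left-hand side is therefore the same as maximizing $\mathbf{w}^{T}\mathbf{X}\mathbf{S}_{\beta}\mathbf{X}^{T}\mathbf{w}$, which is the ratio form that leads to a generalized eigenvalue problem.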

Sparsity Preserving Discriminative Learning
In this section, the proposed SPDP algorithm is described in more detail. To avoid the main drawback of SPP, namely, that $n$ time-consuming $\ell_1$ norm minimization problems must be solved to obtain the sparse weight matrix $\mathbf{S}$, SPDP first constructs a concatenated dictionary through classwise PCA decompositions and quickly learns the sparse representation structure of each sample under the constructed dictionary using the least squares method. To enhance the discriminant performance, it defines a local between-class separability function to characterize the scatter of the samples in the different submanifolds. Then, by integrating the sparse representation information with the local interclass relationship, SPDP aims to maximize the separation between the submanifolds (or intrinsic clusters) without destroying localities while preserving the sparse representation structure of the data. Hence, the proposed algorithm is expected to preserve the intrinsic geometric structure and have superior discriminant abilities.

Constructing the Concatenated Dictionary.
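For convenience, we first provide some notations used in this paper. Assume that $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ is a set of training samples, where $\mathbf{x}_i \in \mathbb{R}^{m}$. We can categorize the training samples as $\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_c]$, where $\mathbf{X}_i$ ($i = 1, 2, \ldots, c$) consists of the $n_i$ samples from class $i$. Suppose that samples from a single class lie on a linear subspace. Thus, each sample can be sparsely and linearly represented by samples from all classes. The subspace model is a powerful tool to capture the underlying information in real data sets [44]. For the convenience of PCA decomposition and relevant calculations, we first center the samples from each class at the origin, $\tilde{\mathbf{X}}_i = [\mathbf{x}_1 - \boldsymbol{\mu}_i, \mathbf{x}_2 - \boldsymbol{\mu}_i, \ldots, \mathbf{x}_{n_i} - \boldsymbol{\mu}_i]$ ($i = 1, 2, \ldots, c$), where $\boldsymbol{\mu}_i$ denotes the mean of class $i$; that is, $\boldsymbol{\mu}_i = \sum_{j=1}^{n_i} \mathbf{x}_j / n_i$. Therefore, the training samples can be recast as $\tilde{\mathbf{X}} = [\tilde{\mathbf{X}}_1, \tilde{\mathbf{X}}_2, \ldots, \tilde{\mathbf{X}}_c]$. Afterwards, PCA decomposition is conducted for every $\tilde{\mathbf{X}}_i$ ($i = 1, 2, \ldots, c$), whose objective function is

$$\max_{\mathbf{D}_i^{T}\mathbf{D}_i = \mathbf{I}} \operatorname{tr}\left(\mathbf{D}_i^{T}\boldsymbol{\Sigma}_i\mathbf{D}_i\right), \tag{4}$$

where $\boldsymbol{\Sigma}_i$ is the sample covariance matrix of $\tilde{\mathbf{X}}_i$. For every class $i$, the first $m_i$ principal components are selected to construct $\mathbf{D}_i$ (in fact, $m_i$ is automatically selected by the value of the PCA ratio from the system). Thus, a sample $\mathbf{x}$ from class $i$ can be simply represented as

$$\mathbf{x} = \mathbf{D}\mathbf{s}, \tag{5}$$

with $\mathbf{D} = [\mathbf{D}_1, \mathbf{D}_2, \ldots, \mathbf{D}_c]$ and $\mathbf{s} = [\mathbf{0}^{T}, \mathbf{0}^{T}, \ldots, \mathbf{0}^{T}, \mathbf{s}_i^{T}, \mathbf{0}^{T}, \ldots, \mathbf{0}^{T}]^{T}$. $\mathbf{D}_i$ is the dictionary of class $i$ obtained by the PCA decomposition above, $\mathbf{D}$ is the concatenated dictionary composed of all $\mathbf{D}_i$ ($i = 1, 2, \ldots, c$), $\mathbf{s}$ is the sparse representation of a sample $\mathbf{x}$ under the concatenated dictionary $\mathbf{D}$, and $\mathbf{s}_i$ is the coefficient vector under the dictionary $\mathbf{D}_i$. In fact, $\mathbf{s}_i$ can be quickly computed by the least squares method as

$$\mathbf{s}_i = \left(\mathbf{D}_i^{T}\mathbf{D}_i\right)^{-1}\mathbf{D}_i^{T}\mathbf{x} = \mathbf{D}_i^{T}\mathbf{x}. \tag{6}$$

The orthogonality of the principal components obtained from the PCA decomposition of the same class is used to simplify the above formula. The process of constructing the concatenated dictionary is presented in Figure 1.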
According to the preceding procedure, each training sample corresponds to a sparse representation under the concatenated dictionary $\mathbf{D}$, and the sparse coefficient vector $\mathbf{s}$ of any training sample from class $i$ can be quickly computed by the least squares method because the computation of $\mathbf{s}$ involves only $\mathbf{D}_i$, which is column orthogonal in view of (5) and (6). This is the primary reason that the proposed approach is significantly faster than SPP, as explained in detail in Section 4.4.
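The dictionary construction and least squares coding described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the `energy` threshold standing in for the "PCA ratio", and the choice to apply $\mathbf{D}_i^{T}$ to the raw sample (following (6) literally, without subtracting the class mean) are all assumptions made here. Each class is assumed to contain at least two distinct samples.

```python
import numpy as np

def classwise_pca_dictionary(X, labels, energy=0.98):
    """Return the per-class dictionaries [D_1, ..., D_c].

    X: (m, n) data matrix, one sample per column.
    labels: length-n integer array of class labels.
    energy: assumed stand-in for the "PCA ratio"; the fewest leading
            components whose variance reaches this fraction are kept.
    """
    blocks = []
    for cls in np.unique(labels):
        Xc = X[:, labels == cls]
        Xc = Xc - Xc.mean(axis=1, keepdims=True)      # centre the class at the origin
        # Left singular vectors of the centred block are its PCA directions.
        U, sv, _ = np.linalg.svd(Xc, full_matrices=False)
        ratio = np.cumsum(sv ** 2) / np.sum(sv ** 2)
        m_i = min(int(np.searchsorted(ratio, energy)) + 1, sv.size)
        blocks.append(U[:, :m_i])                     # orthonormal columns
    return blocks

def sparse_code(x, cls_index, blocks):
    """Embed s_i = D_i^T x in a zero vector of length sum(m_i)."""
    sizes = [D_i.shape[1] for D_i in blocks]
    s = np.zeros(sum(sizes))
    start = sum(sizes[:cls_index])
    # Orthonormal columns make the least squares solution a plain product.
    s[start:start + sizes[cls_index]] = blocks[cls_index].T @ x
    return s
```

Because each $\mathbf{D}_i$ has orthonormal columns, no linear system has to be solved per sample, which is where the speed advantage over an $\ell_1$ solver comes from.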

Preserving Sparse Representation Structure.
As can be seen in Section 3.1, to some extent, the dictionary $\mathbf{D}$ describes the intrinsic geometric properties of the data and the sparse coefficient vectors explicitly encode the discriminant information of the training samples. Thus, it is hoped that this valuable property of the original high-dimensional space can be preserved in the low-dimensional embedding subspace. Therefore, the objective function is expected to look for an optimal projection that can best preserve the sparse representation structure:

$$\min_{\mathbf{w}} J_s(\mathbf{w}) = \sum_{i=1}^{n} \left\| \mathbf{w}^{T}\mathbf{x}_i - \mathbf{w}^{T}\mathbf{D}\mathbf{s}_i \right\|^{2}, \tag{7}$$

where $\mathbf{s}_i$ is the sparse reconstruction vector corresponding to $\mathbf{x}_i$.
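The sparsity preserving objective can be written as a quadratic form in the projection vector, which is what later allows it to appear in a generalized eigenvalue problem. The sketch below assumes the objective is the sum of squared projected residuals $\sum_i \|\mathbf{w}^{T}\mathbf{x}_i - \mathbf{w}^{T}\mathbf{D}\mathbf{s}_i\|^{2}$ and that the codes are stored as the columns of a matrix `S`; the function name is hypothetical, and the symbol $\mathbf{M}$ is borrowed from the denominator matrix mentioned later in this section, whose exact definition is an assumption here.

```python
import numpy as np

def sparsity_scatter(X, D, S):
    """Return M = (X - D S)(X - D S)^T.

    X: (m, n) samples as columns; D: (m, K) concatenated dictionary;
    S: (K, n) sparse codes as columns (assumed layout).
    With this M, sum_i ||w^T x_i - w^T D s_i||^2 equals w^T M w.
    """
    R = X - D @ S            # reconstruction residuals, one column per sample
    return R @ R.T
```

Forming $\mathbf{M}$ once costs a single matrix product, after which the objective can be evaluated for any candidate $\mathbf{w}$ without revisiting the samples.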

Characterization of the Local Interclass Separability.
To effectively discover the discriminant structure embedded in high-dimensional data and improve the classification performance, in this subsection, we construct a local interclass weight graph. Because data in the same class lie on one or more submanifolds and data belonging to different classes are distributed on different submanifolds, it is important for classification problems to distinguish one submanifold from another. Therefore, a local between-class separability function is defined in this section to characterize the separability of the samples in different submanifolds. The aim of SPDP is that different submanifolds can be distinguished more clearly after being projected; hence, the local between-class separability of different submanifolds should be maximized. Thus, we can construct a label matrix $\mathbf{B}$ to describe the local and interclass relationships of each point as follows:

$$B_{ij} = \begin{cases} \exp\left(-\|\mathbf{x}_i - \mathbf{x}_j\|_2^2 / t\right), & \text{if } i \in N_k^{-}(j) \text{ or } j \in N_k^{-}(i), \\ 0, & \text{otherwise}, \end{cases}$$

where $\|\mathbf{x}_i - \mathbf{x}_j\|_2^2$ denotes the squared Euclidean distance between points $\mathbf{x}_i$ and $\mathbf{x}_j$, $t$ is a parameter which is often set to the standard deviation of the samples, $N_k^{-}(i)$ denotes the index set of the $k$ nearest neighbors of the sample $\mathbf{x}_i$ that have a different class label, and $\mathbf{B}$ is called the local between-class weight matrix (or local interclass weight graph). As can be seen in the above definition, if two distant points $\mathbf{x}_i$ and $\mathbf{x}_j$ belong to different submanifolds, their scatter is large, and vice versa. That is, points belonging to different submanifolds should be located farther apart after projection. Therefore, the local interclass separability can be characterized by the following equation:

$$J_b(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} (y_i - y_j)^2 B_{ij}, \tag{11}$$

where $y_i = \mathbf{w}^{T}\mathbf{x}_i$ ($i = 1, 2, \ldots, n$) is the low-dimensional representation of the original data, which can be obtained by projecting each $\mathbf{x}_i$ onto the direction vector $\mathbf{w} \in \mathbb{R}^{m}$. With algebraic simplifications, (11) can be rewritten as

$$J_b(\mathbf{w}) = \mathbf{w}^{T}\mathbf{X}(\mathbf{D}_B - \mathbf{B})\mathbf{X}^{T}\mathbf{w} = \mathbf{w}^{T}\mathbf{X}\mathbf{L}\mathbf{X}^{T}\mathbf{w}, \tag{12}$$

where $\mathbf{L}$ is the Laplacian matrix defined as $\mathbf{L} = \mathbf{D}_B - \mathbf{B}$ and $\mathbf{D}_B$ is a diagonal matrix [45] with entries $(\mathbf{D}_B)_{ii} = \sum_{j} B_{ij}$. Equation (12) characterizes the separability (or scatter) of the data set in different submanifolds. Therefore, each manifold can be separated clearly, as long as the optimal projection $\mathbf{w}^{*}$ is adopted.
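To make the graph construction concrete, the following NumPy sketch builds a local between-class weight matrix and its Laplacian and can be used to check the pairwise-sum and Laplacian forms of the separability against each other. The heat-kernel weight $\exp(-\|\mathbf{x}_i - \mathbf{x}_j\|_2^2 / t)$, the symmetric "either point is a neighbor of the other" rule, and the names `between_class_weights`, `laplacian`, `k`, and `t` are assumptions made for illustration; the source only states that the weight depends on the squared distance, a scale parameter, and the $k$ nearest neighbors with a different label.

```python
import numpy as np

def between_class_weights(X, labels, k=3, t=1.0):
    """Local between-class weight matrix B (symmetric, zero diagonal).

    X: (m, n) data matrix, one sample per column; labels: length-n array.
    B[i, j] is nonzero only when j is among the k nearest neighbours of i
    that carry a different label, or the other way round (assumed rule).
    """
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
    B = np.zeros((n, n))
    for i in range(n):
        other = np.flatnonzero(labels != labels[i])        # different-class samples
        nearest = other[np.argsort(dist2[i, other])[:k]]
        B[i, nearest] = np.exp(-dist2[i, nearest] / t)     # assumed heat-kernel weight
    return np.maximum(B, B.T)                              # symmetrise the "or" rule

def laplacian(B):
    """L = D_B - B, with D_B the diagonal matrix of row sums of B."""
    return np.diag(B.sum(axis=1)) - B
```

For any direction `w`, `0.5 * sum(B[i, j] * (y[i] - y[j])**2)` with `y = w @ X` should agree with `w @ X @ laplacian(B) @ X.T @ w` up to round-off, which is the identity that turns the pairwise scatter into a quadratic form.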

Sparsity Preserving Discriminant Projections.
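To achieve improved recognition results, we explicitly integrate the sparsity preserving constraint as indicated in (7) with the local between-class separability as illustrated in (12). The novel supervised algorithm SPDP, which not only preserves the sparse representation structure but also separates the submanifolds as far as possible, is defined as

$$\max_{\mathbf{w}} \frac{J_b(\mathbf{w})}{J_s(\mathbf{w})} = \frac{\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \left(\mathbf{w}^{T}\mathbf{x}_i - \mathbf{w}^{T}\mathbf{x}_j\right)^2 B_{ij}}{\sum_{i=1}^{n} \left\| \mathbf{w}^{T}\mathbf{x}_i - \mathbf{w}^{T}\mathbf{D}\mathbf{s}_i \right\|^{2}}, \tag{13}$$

where the denominator term $J_s(\mathbf{w})$ measures the quality of preserving the sparse representation structure and the numerator term $J_b(\mathbf{w})$ measures the separability of different submanifolds. It is well known that the criterion of LDA is to maximize the between-class scatter and, meanwhile, minimize the within-class scatter. Similar to LDA, the aim of SPDP is to maximize the ratio of the local between-class separability to the sparse representation weight scatter. Letting

$$\mathbf{S} = [\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_n], \qquad \mathbf{M} = (\mathbf{X} - \mathbf{D}\mathbf{S})(\mathbf{X} - \mathbf{D}\mathbf{S})^{T}, \tag{14}$$

the objective function can be recast as the following optimization problem:

$$\max_{\mathbf{w}} \frac{\mathbf{w}^{T}\mathbf{X}\mathbf{L}\mathbf{X}^{T}\mathbf{w}}{\mathbf{w}^{T}\mathbf{M}\mathbf{w}}. \tag{15}$$

Then, the optimal $\mathbf{w}$'s are the eigenvectors corresponding to the largest $d$ eigenvalues of the following generalized eigenvalue problem:

$$\mathbf{X}\mathbf{L}\mathbf{X}^{T}\mathbf{w} = \lambda\mathbf{M}\mathbf{w}. \tag{16}$$

It is worth noting that since the training sample size is much smaller than the feature dimension for such high-dimensional data, $\mathbf{M}$ might be singular. This problem can be tackled by projecting the training set $\mathbf{X}$ onto a PCA subspace spanned by the leading eigenvectors to get $\mathbf{X}_{\mathrm{PCA}}$ and replacing $\mathbf{X}$ by $\mathbf{X}_{\mathrm{PCA}}$.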
Based on the above discussion, the proposed SPDP is summarized in Algorithm 1.
Step 2. Calculate the coefficient vector $\mathbf{s}_i$ under the dictionary $\mathbf{D}_i$ for each sample based on (6) to obtain the sparse coefficient vector $\mathbf{s}$ and then calculate $\mathbf{S}$.
Step 4. Calculate the projecting vectors by the generalized eigenvalue problem in (16).
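The final step, solving the generalized eigenvalue problem, can be sketched with NumPy alone by whitening with a Cholesky factor. This is an illustrative sketch, not the authors' code: the function name and the small `ridge` term are assumptions, and the ridge is used here in place of the PCA preprocessing that the text recommends when $\mathbf{M}$ is singular. The argument `A` plays the role of $\mathbf{X}\mathbf{L}\mathbf{X}^{T}$.

```python
import numpy as np

def spdp_projection(A, M, d, ridge=1e-8):
    """Top-d solutions of A w = lambda (M + ridge I) w.

    A: symmetric (m, m) matrix, e.g. X L X^T.
    M: symmetric positive semidefinite (m, m) matrix.
    ridge: assumed regulariser so that the Cholesky factor exists.
    Returns (eigenvalues in descending order, (m, d) matrix of projections).
    """
    m = M.shape[0]
    C = np.linalg.cholesky(M + ridge * np.eye(m))     # M + ridge I = C C^T
    Cinv_A = np.linalg.solve(C, A)                    # C^{-1} A
    A_white = np.linalg.solve(C, Cinv_A.T).T          # C^{-1} A C^{-T}
    A_white = 0.5 * (A_white + A_white.T)             # remove round-off asymmetry
    vals, vecs = np.linalg.eigh(A_white)              # ascending eigenvalues
    order = np.argsort(vals)[::-1][:d]
    W = np.linalg.solve(C.T, vecs[:, order])          # map back: w = C^{-T} v
    return vals[order], W
```

A new sample `x` would then be embedded as `W.T @ x`. Where SciPy is available, `scipy.linalg.eigh(A, M)` solves the same symmetric-definite problem directly; the whitening route is shown only to keep the sketch dependent on NumPy alone.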

Experiments
In this section, the proposed SPDP algorithm is tested on three publicly available face databases (Yale [13], ORL [46], and CMU PIE [47]) and compared with six popular dimensionality reduction methods: PCA, LDA, LPP, NPE, LPDP, and SPP. For PCA, the only model parameter is the subspace dimension, and for LDA, the performance is directly influenced by the energy of the eigenvalues kept in the PCA preprocessing phase. For LPP and NPE, the supervised versions are adopted. In particular, the neighbor mode in LPP and NPE is set to "supervised"; the weight mode in LPP is set to "Cosine." The empirically determined parameter of LPDP is taken to be 1 [26], the parameter of SPP is set to 0.05 as indicated in [39], and the parameter of the local between-class weight in SPDP is set to the standard deviation of the samples. The nearest neighbor classifier (1-NN) is employed to predict the classes of the test data. All experiments are carried out with MATLAB R2013a on a personal computer with an Intel(R) Core i7-4770K 3.50 GHz CPU, 16.0 GB of main memory, and the Windows 7 operating system.
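The classification step used throughout the experiments is a plain nearest neighbor rule applied after projection. The sketch below is written in Python with NumPy for illustration only (the experiments themselves were run in MATLAB); the function name and the column-per-sample layout are assumptions.

```python
import numpy as np

def nearest_neighbor_predict(train_Y, train_labels, test_Y):
    """1-NN prediction in the projected space.

    train_Y: (d, n_train) projected training samples, one per column.
    train_labels: length-n_train array of class labels.
    test_Y: (d, n_test) projected test samples, one per column.
    """
    # Squared Euclidean distances between every test and training sample.
    d2 = (np.sum(test_Y ** 2, axis=0)[:, None]
          + np.sum(train_Y ** 2, axis=0)[None, :]
          - 2.0 * test_Y.T @ train_Y)
    return train_labels[np.argmin(d2, axis=1)]
```

With a learned projection matrix `W`, the inputs would be `W.T @ X_train` and `W.T @ X_test`, and the recognition rate is the fraction of predictions that match the true test labels.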

Experiment on Yale Face Database.
The Yale face database contains 165 face images of 15 individuals, with 11 images per individual. These images were collected under different facial expressions (normal, happy, sad, surprised, sleepy, and wink) and configurations (left-light, center-light, and right-light) and with or without glasses. All the images are cropped to a size of 32 × 32 and then normalized to have unit norm. Some samples from this database are presented in Figure 2. For each person, between 2 and 8 images are randomly selected as the training samples and the remaining images are used for testing. For each training set size, the results are averaged over 50 random splits. Table 1 presents the best recognition rate and the associated standard deviation of the seven algorithms under the different sizes of the training set. Figure 3(a) presents the best recognition rate versus the size of the training set. Figure 3(b) shows how the recognition rates of the seven algorithms vary with the reduced dimension when the number of training samples from each class is fixed at six. Note that the dimensionality of LDA is bounded above by $c - 1$ ($c$ is the number of categories) because there are at most $c - 1$ nonzero generalized eigenvalues [13]; similar situations occur in the other experiments in this paper. One can see that the SPDP algorithm significantly outperforms the other methods.

Experiment on ORL Face Database.
The ORL images were collected at different time points, under different lighting conditions, and with varying facial expressions. In our experiment, each image is cropped to a resolution of 32 × 32 as shown in Figure 4. We randomly select between 2 and 8 pictures from each person for training; the remainder are used for testing. We repeat these splits 50 times and report the average results. Table 2 displays the best classification accuracy of the seven algorithms under the different sizes of the training set; the number in parentheses is the corresponding standard deviation.
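The evaluation protocol used for each database (a fixed number of randomly chosen training images per person, the remainder for testing, averaged over 50 random splits) can be sketched as follows. The function names, the `evaluate` callback, and the seed handling are hypothetical and serve only to illustrate the protocol; `evaluate` stands for any routine that trains on the given indices and returns a recognition rate on the test indices.

```python
import numpy as np

def per_class_split(labels, n_train, rng):
    """Pick n_train random samples per class for training; the rest are test."""
    train = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        train.extend(rng.permutation(idx)[:n_train])
    train = np.sort(np.array(train))
    test = np.setdiff1d(np.arange(labels.size), train)
    return train, test

def mean_accuracy(evaluate, labels, n_train, n_splits=50, seed=0):
    """Average evaluate(train_idx, test_idx) over independent random splits."""
    rng = np.random.default_rng(seed)
    scores = [evaluate(*per_class_split(labels, n_train, rng))
              for _ in range(n_splits)]
    return float(np.mean(scores)), float(np.std(scores))
```

For the Yale setting, `labels` would hold 15 classes with 11 entries each, so six training images per person leave five per person for testing in every split.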

Experiment on CMU PIE Face Database.
In this subsection, it is verified that the proposed algorithm achieves higher classification accuracy than the other dimensionality reduction methods under varying illumination, pose, and expression. The CMU PIE face database contains 41,368 face images of 68 subjects that were captured by 13 synchronized cameras and 21 flashes under varying poses, illumination, and expression. In our experiments, we choose the five frontal poses (C05, C07, C09, C27, and C29). This leaves 170 face images per subject; all the images are cropped to 32 × 32. Figure 6 shows some pictures of one subject. A random subset of 5, 10, 15, or 20 labeled pictures per subject is selected to form the training set; the remainder are used for testing. For each training set size, we average the classification accuracies over 50 random splits. Table 3 presents the best recognition rate and the associated standard deviation (in brackets) of the seven algorithms under the different sizes of the training set.

The critical factor behind the difference in time cost between SPP and SPDP is that their approaches to obtaining the sparse representation structure are entirely different. In SPP, $n$ time-consuming $\ell_1$ norm minimization problems must be solved to construct the sparse weight matrix, at a computational cost of $O(n^4)$ [48,49], whereas SPDP achieves this significantly faster through only $c$ PCA decompositions and $n$ least squares computations. Here $n$ is the number of training samples, $m$ is the feature dimension, $c$ is the number of classes, and $n_i$ and $m_i$ are the number of samples and of retained principal components of class $i$. The $c$ PCA decompositions can be completed in $O(m^2 \sum_{i=1}^{c} m_i)$ with the more efficient algorithm [50]. Learning the sparse coefficient vector of each sample, which involves only the least squares method, costs $O(m m_i)$, so the sparse weight matrix $\mathbf{S}$ can be calculated in $O(m \sum_{i=1}^{c} n_i m_i)$. The computational complexity of SPDP to learn the sparse representation structure is therefore $O(m^2 \sum_{i=1}^{c} m_i + m \sum_{i=1}^{c} n_i m_i)$. In general, $m_i$, $n_i$, and $c$ are small compared with the size of the full problem; hence, SPDP performs considerably faster than SPP, as indicated in Tables 4, 5, and 6.
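A rough feel for the gap between the two cost estimates can be obtained by plugging numbers into them. The sketch below reads the SPDP cost as $m^2 \sum_i m_i + m \sum_i n_i m_i$ and the SPP cost as $n^4$, ignores all constant factors, and uses a hypothetical CMU PIE-like setting; in particular, the value of 10 retained components per class is an assumption made here and is not reported in the source. The counts are order-of-magnitude indicators only, not predictions of running time.

```python
def spp_cost(n):
    """Order-of-magnitude operation count for SPP's n l1 problems: n^4."""
    return n ** 4

def spdp_cost(m, class_sizes, class_components):
    """m^2 * sum(m_i) + m * sum(n_i * m_i), constants ignored."""
    pca = m ** 2 * sum(class_components)
    coding = m * sum(n_i * m_i for n_i, m_i in zip(class_sizes, class_components))
    return pca + coding

# Hypothetical setting: 68 classes, 20 training images each,
# 32 x 32 = 1024 pixels, and an assumed 10 components kept per class.
n_classes, n_i, m, m_i = 68, 20, 1024, 10
ratio = spp_cost(n_classes * n_i) / spdp_cost(m, [n_i] * n_classes, [m_i] * n_classes)
print(f"SPP / SPDP operation-count ratio: {ratio:.0f}")
```

The ratio grows quickly with the number of training samples because only the SPP estimate depends on $n$ to the fourth power, which is consistent with the observation that the advantage is most visible on the largest data set.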

Overall Observations and Discussions.
Several observations can be drawn from the above experimental results.
(1) From Tables 1, 2, and 3 and Figures 3(a
(2) From Figures 3(b), 5(b), and 7(b), it can be observed that the number of reduced dimensions that SPDP needs to achieve its best recognition rate is smaller than that of the other compared algorithms. This saves a considerable amount of time and storage space after obtaining the optimal embedding functions.
(3) Tables 4, 5, and 6 indicate that SPDP is considerably faster than SPP in obtaining the discriminant vectors. This is because the method that SPDP uses to learn the sparse representation structure is more efficient than that of SPP, as analyzed in Section 4.4.

Conclusions
This paper proposed a new supervised learning method, called Sparsity Preserving Discriminant Projections (SPDP), by combining manifold learning and sparse representation. Specifically, SPDP first constructs a concatenated dictionary by means of classwise PCA decompositions and quickly learns the sparse representation structure of each sample under the constructed dictionary using the least squares method. Then, it defines a local between-class separability function to characterize the separability of the samples in different submanifolds. Subsequently, SPDP integrates the sparse representation information with the local between-class relationship. Thus, SPDP preserves the sparse representation structure of the data and maximizes the local between-class separability simultaneously. Finally, the proposed method is transformed into a generalized eigenvalue problem. Extensive experiments on three publicly available face data sets confirmed the promising performance of the proposed SPDP approach.

Figure 1: The process of constructing the concatenated dictionary.

Figure 2: Some face samples from the Yale database.
There are 400 images of 40 people in the ORL face data set, where each person has 10 different pictures. The images were collected at different time points.

Figure 3: Recognition rates of the seven algorithms on the Yale database: (a) the best recognition rates versus the different size of the training set and (b) the average recognition rates versus the variation of dimensions when the size per class is fixed as six.

Figure 5(a) presents the best recognition rate versus the size of the training set. Figure 5(b) shows how the recognition rates of the seven algorithms vary with the reduced dimension when the number of training samples from each class is fixed at five. It can be seen that SPDP and LPDP are superior to the other compared methods (their performances on the ORL database are quite similar), especially when the size of the training set is small. The reason may be that both SPDP and LPDP consider the discriminant information and local structure of the data.

Figure 4: Some face samples from the ORL database.

Figure 7(a) presents the best recognition rate versus the size of the training set. Figure 7(b) shows how the recognition rates of the seven algorithms vary with the reduced dimension when the number of training samples from each class is fixed at ten. We can observe that the proposed SPDP outperforms the other dimensionality reduction methods (PCA, LDA, LPP, NPE, LPDP, and SPP) under pose, illumination, and expression variations.

4.4. Comparison of Time Cost for Acquiring the Discriminant Vectors of SPP with SPDP.
In this subsection, the time cost for acquiring the discriminant vectors of SPDP is compared with that of SPP. Tables 4, 5, and 6 list the average time costs for acquiring the discriminant vectors of SPP and SPDP versus the different sizes of the training set on the three face data sets. SPDP is significantly faster than SPP in acquiring the embedding functions in our experiments, especially in large-scale problems such as CMU PIE.

Figure 5: Recognition rates of the seven algorithms on the ORL database: (a) the best recognition rates versus the different size of the training set and (b) the average recognition rates versus the variation of dimensions when the size per class is fixed as five.

Figure 6: Some face samples from the CMU PIE database.

Figure 7: Recognition rates of the seven algorithms on the CMU PIE database: (a) the best recognition rates versus the different size of the training set and (b) the average recognition rates versus the variation of dimensions when the size per class is fixed as ten.

Table 1: The best recognition rate and the corresponding standard deviation of the seven algorithms under different training set sizes on Yale.

Table 2: The best recognition rate and the corresponding standard deviation of the seven algorithms under different training set sizes on ORL.

Table 3: The best recognition rate and the corresponding standard deviation of the seven algorithms under different training set sizes on CMU PIE.

Table 4: Time (s) for acquiring the discriminant vectors of SPP and SPDP on Yale under different training set sizes.

Table 5: Time (s) for acquiring the discriminant vectors of SPP and SPDP on ORL under different training set sizes.

Table 6: Time (s) for acquiring the discriminant vectors of SPP and SPDP on CMU PIE under different training set sizes.