Fast Second-Order Orthogonal Tensor Subspace Analysis for Face Recognition

Tensor subspace analysis (TSA) and discriminant TSA (DTSA) are two effective two-sided projection methods for dimensionality reduction and feature extraction of face image matrices. However, they have two serious drawbacks. Firstly, TSA and DTSA compute the left and right projection matrices iteratively, and two generalized eigenvalue problems must be solved at each iteration, which makes them impractical for high-dimensional image data. Secondly, the metric structure of the facial image space cannot be preserved, since the left and right projection matrices are not usually orthonormal. In this paper, we propose orthogonal TSA (OTSA) and orthogonal DTSA (ODTSA). In contrast to TSA and DTSA, only two trace ratio optimization problems need to be solved at each iteration. Since the trace ratio optimization problem can be solved by the inexpensive Newton-Lanczos method, OTSA and ODTSA have a much lower computational cost than their nonorthogonal counterparts. Experimental results show that the proposed methods achieve much higher recognition accuracy and have a much lower training cost.


Introduction
Many applications in the field of information processing, such as data mining, information retrieval, machine learning, and pattern recognition, require dealing with high-dimensional data. Dimensionality reduction has been a key technique for achieving high efficiency in manipulating such high-dimensional data. In dimensionality reduction, the high-dimensional data are transformed into a low-dimensional subspace with limited loss of information.
Principal component analysis (PCA) [1] and linear discriminant analysis (LDA) [2] are two of the most well-known and widely used dimension reduction methods. PCA is an unsupervised method that finds the projection directions maximizing the variance of the features in the low-dimensional subspace. It is also considered the best data representation method in the sense that the mean squared error between the original data and the data reconstructed from the PCA transform is minimal. LDA is a supervised method based on the following idea: the transformed data points of different classes should be as far as possible from each other, while the transformed data points of the same class should be as close as possible to each other. To achieve this goal, LDA seeks an optimal linear transformation by simultaneously minimizing the within-class distance and maximizing the between-class distance. The optimal transformation of LDA can be computed by solving a generalized eigenvalue problem involving scatter matrices. LDA has been applied successfully for decades in many important applications including pattern recognition [2][3][4], information retrieval [5], face recognition [6,7], microarray data analysis [8,9], and text classification [10]. One of the main drawbacks of LDA is that the scatter matrices are required to be nonsingular, which does not hold when the data dimension is larger than the number of data samples. This is known as the undersampled problem, also called the small sample size problem [2]. To make LDA applicable to undersampled problems, researchers have proposed several variants of LDA including PCA + LDA [7,11], LDA/GSVD [12,13], two-stage LDA [14], regularized LDA [15][16][17][18], orthogonal LDA [19,20], null space LDA [21,22], and uncorrelated LDA [20,23].
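As a concrete illustration of this generalized eigenvalue formulation, the following sketch (our own minimal example, not code from the paper; it assumes SciPy is available and that the within-class scatter matrix is nonsingular) computes classical LDA directions from a labeled data matrix.

```python
import numpy as np
from scipy.linalg import eigh

def lda_directions(X, y, n_components):
    """Classical LDA: maximize between-class scatter over within-class scatter.

    X : (n_samples, n_features) data matrix, y : class labels.
    Returns the top `n_components` generalized eigenvectors of S_b w = lambda S_w w.
    """
    mean_total = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        S_w += (Xc - mean_c).T @ (Xc - mean_c)       # within-class scatter
        diff = (mean_c - mean_total).reshape(-1, 1)
        S_b += Xc.shape[0] * (diff @ diff.T)         # between-class scatter
    # Generalized symmetric eigenproblem; S_w must be nonsingular here,
    # which is exactly the undersampled-problem limitation discussed above.
    evals, evecs = eigh(S_b, S_w)
    return evecs[:, ::-1][:, :n_components]
```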

As is well known, both PCA and LDA take into account only the global Euclidean structure of the original data. However, high-dimensional data in the real world often lie on or near a smooth low-dimensional manifold, so it is important to preserve the local structure. Locality preserving projection (LPP) [24] is a locality-structure-preserving method that aims to preserve the intrinsic geometry of the original data. LPP usually performs better on recognition problems than methods, such as PCA and LDA, that preserve only global structure information. Moreover, LPP is not sensitive to noise and outliers. In its original form, LPP is an unsupervised dimension reduction method. The supervised version of LPP (SLPP) [25] exploits the class label information of the training samples and thus achieves higher classification accuracy than the unsupervised LPP. Other improvements to LPP include the discriminant locality preserving projection (DLPP) [26] and the orthogonal discriminant locality preserving projection (ODLPP) [27].
When dealing with two-dimensional data such as images, the traditional approach is to first transform the image matrices into one-dimensional vectors and then apply the dimension reduction methods mentioned above to the vectorized image data. Vectorizing image matrices can incur a high computational cost and a loss of the underlying spatial structure information of the images. To overcome these disadvantages of the vectorization approach, researchers have proposed 2D-PCA [28], 2D-LDA [29], 2D-LPP [30], and 2D-DLPP [31]. These methods are based directly on the matrix representation of image data. However, these two-dimensional methods employ only single-sided projection and thus still cannot fully preserve the intrinsic spatial structure information of the images.
In the last decade, several second-order tensor methods have been developed for dimension reduction of image data. These methods aim to find two subspaces for two-sided projection. Ye [32] proposed the generalized low-rank approximation method (GLRAM), which seeks the left and right projections by minimizing the reconstruction error, together with an iterative procedure. One of the main drawbacks of GLRAM is that an eigenvalue decomposition is required at each iteration step, so the computational cost is high. To overcome this disadvantage, Ren and Dai [33] proposed replacing the projection vectors obtained from the eigenvalue decomposition by bilinear Lanczos vectors at each iteration step of GLRAM. Experimental results show that the approach based on bilinear Lanczos vectors is competitive with the conventional GLRAM in classification accuracy while having a much lower computational cost. We note that GLRAM is an unsupervised method and only preserves the global Euclidean structure of the image data. Tensor subspace analysis (TSA) [34] is another two-sided projection method for dimension reduction and feature extraction of image data. TSA preserves the local structure information of the original data, but it does not employ the discriminant information. Wang et al. [35] proposed discriminant TSA (DTSA) by combining TSA with the discriminant information. Like GLRAM, both TSA and DTSA use an iterative procedure to compute the two projection matrices. At each iteration of TSA and DTSA, two generalized eigenvalue problems must be solved, which makes them impractical for dimension reduction and feature extraction of high-dimensional image data.
In this paper, we propose orthogonal TSA (OTSA) and orthogonal DTSA (ODTSA) by constraining the left and right projection matrices to be orthogonal. Similarly to TSA and DTSA, OTSA and ODTSA compute the left and right projection matrices iteratively. However, instead of solving two generalized eigenvalue problems as in TSA and DTSA, only two trace ratio optimization problems need to be solved at each iteration of OTSA and ODTSA. Since the trace ratio optimization problem can be solved by the inexpensive Newton-Lanczos method, OTSA and ODTSA have a much lower computational cost than their nonorthogonal counterparts. Two experiments on face recognition are conducted to evaluate the efficiency and effectiveness of the proposed OTSA and ODTSA. Experimental results show that the proposed methods achieve much higher recognition accuracy and have a much lower training cost than TSA and DTSA.
The remainder of the paper is organized as follows. In Section 2, we briefly review TSA and DTSA. In Section 3, we first propose OTSA and ODTSA. Then, we give a brief review of the trace ratio optimization problem and outline Newton's method and the Newton-Lanczos method for solving it. Finally, we present the algorithms for computing the left and right projection matrices of OTSA and ODTSA. Section 4 is devoted to numerical experiments. Some concluding remarks are provided in Section 5.

A Brief Review of TSA and DTSA
In this section, we give a brief review of TSA and DTSA, two recently proposed linear methods for dimension reduction and feature extraction in face recognition.
Given a set of $N$ image data $\mathcal{X} = \{X_1, X_2, \ldots, X_N\}$, where $X_i \in \mathbb{R}^{m \times n}$. For simplicity of discussion, we assume that the given data set $\mathcal{X}$ is partitioned into $c$ classes $\mathcal{X} = \{\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_c\}$, where $n_k$ is the number of samples in the $k$th class and $N = \sum_{k=1}^{c} n_k$.
Let $W \in \mathbb{R}^{N \times N}$ denote the total within-class similarity matrix. Its entries are defined by
$$W_{ij} = \begin{cases} \exp\left(-\|X_i - X_j\|_F^2 / t\right), & X_i \text{ and } X_j \text{ belong to the same class}, \\ 0, & \text{otherwise}, \end{cases}$$
where $t$ is a positive parameter which can be determined empirically and $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, that is, $\|A\|_F = \sqrt{\sum_i \sum_j a_{ij}^2}$. Note that the total within-class similarity matrix $W$ has a block diagonal form, where the $k$th block is the within-class similarity matrix $W_k$ of the $k$th class and the size of the $k$th block equals the number $n_k$ of samples in the $k$th class; that is, $W = \operatorname{diag}(W_1, W_2, \ldots, W_c)$. The between-class similarity matrix $B \in \mathbb{R}^{c \times c}$ is defined as follows:
$$B_{pq} = \exp\left(-\|\bar{X}_p - \bar{X}_q\|_F^2 / t\right),$$
where $\bar{X}_p$ is the mean of the samples in the $p$th class. Define the diagonal matrix $D = \operatorname{diag}(d_1, d_2, \ldots, d_N)$ with $d_i = \sum_j W_{ij}$. Then $L_w = D - W$ is called the within-class Laplacian matrix and is symmetric positive semidefinite. Similarly, the between-class Laplacian matrix is defined as $L_b = E - B$, where $E$ is a diagonal matrix whose $p$th diagonal entry is the row sum of the $p$th row of $B$.
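To make these constructions concrete, the following sketch (our own illustration under the definitions above; the function name and the use of NumPy are our choices, not the paper's) builds $W$, $D$, $L_w$, $B$, and $L_b$ from a list of labeled image matrices.

```python
import numpy as np

def build_graph_matrices(images, labels, t=1.0):
    """Build W, D, L_w (size N x N) and B, L_b (size c x c) as defined above.

    images : list of N image matrices X_i (each m x n), labels : length-N array.
    """
    N = len(images)
    labels = np.asarray(labels)
    classes = np.unique(labels)

    # Within-class similarity: heat kernel on same-class pairs, zero otherwise.
    W = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if labels[i] == labels[j]:
                W[i, j] = np.exp(-np.linalg.norm(images[i] - images[j], 'fro') ** 2 / t)
    D = np.diag(W.sum(axis=1))
    L_w = D - W                       # within-class Laplacian (symmetric PSD)

    # Between-class similarity built on the class means.
    means = [np.mean([images[i] for i in range(N) if labels[i] == c], axis=0)
             for c in classes]
    c_num = len(classes)
    B = np.zeros((c_num, c_num))
    for p in range(c_num):
        for q in range(c_num):
            B[p, q] = np.exp(-np.linalg.norm(means[p] - means[q], 'fro') ** 2 / t)
    L_b = np.diag(B.sum(axis=1)) - B  # between-class Laplacian
    return W, D, L_w, B, L_b
```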
In two-sided projection methods such as TSA and DTSA for dimension reduction and feature extraction of matrix data, we aim to find two projection matrices $U \in \mathbb{R}^{m \times l_1}$ and $V \in \mathbb{R}^{n \times l_2}$, with $l_1 \ll m$ and $l_2 \ll n$, such that the projected low-dimensional data $Y_i = U^T X_i V$ are easier to distinguish.

Tensor Subspace Analysis.
In TSA, we seek the left and right transformation matrices $U$, $V$ by solving the following optimization problem:
$$\max_{U, V} \frac{\sum_{i=1}^{N} D_{ii}\, \|U^T X_i V\|_F^2}{\sum_{i,j=1}^{N} W_{ij}\, \|U^T X_i V - U^T X_j V\|_F^2}. \qquad (9)$$
The numerator of the objective function in (9) represents the global variance on the manifold in the low-dimensional subspace, while the denominator is a measure of the nearness of samples from the same class. Therefore, by maximizing the objective function, samples from the same class are transformed into data points close to each other and samples from different classes are transformed into data points far from each other. Following [35], one further defines two auxiliary matrices, called the total left and right transformation matrices, which collect the one-sided projections of all training images.
The optimization problem (9) can be equivalently rewritten in terms of these total transformation matrices as problem (11) or problem (12). Here and in the following, $I_k$ denotes an identity matrix of order $k$ and $\otimes$ denotes the Kronecker product of matrices.
Clearly, from the equivalence between the maximization problem (9) and the optimization problem (11) or (12), we obtain the following result, from which an iterative algorithm for computing the transformation matrices $U$ and $V$ follows.
Theorem 1. Let $U$ and $V$ be the solution of the maximization problem (9), and define
$$D_U = \sum_{i=1}^{N} D_{ii}\, X_i^T U U^T X_i, \qquad L_U = \sum_{i,j=1}^{N} W_{ij}\, (X_i - X_j)^T U U^T (X_i - X_j),$$
$$D_V = \sum_{i=1}^{N} D_{ii}\, X_i V V^T X_i^T, \qquad L_V = \sum_{i,j=1}^{N} W_{ij}\, (X_i - X_j) V V^T (X_i - X_j)^T.$$
Then the following hold.
(1) For a given $U$, $V$ consists of the $l_2$ eigenvectors of the generalized eigenvalue problem $D_U v = \lambda L_U v$ corresponding to the largest $l_2$ eigenvalues.
(2) For a given $V$, $U$ consists of the $l_1$ eigenvectors of the generalized eigenvalue problem $D_V u = \lambda L_V u$ corresponding to the largest $l_1$ eigenvalues.
Based on Theorem 1, an iterative implementation of TSA is given in Algorithm 1; see also [34].
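For illustration, one alternating iteration of this kind can be sketched as follows (a minimal sketch in the notation of Theorem 1, not the authors' implementation; the small regularization added to the denominator matrices is our own device to keep them positive definite for the generalized eigensolver).

```python
import numpy as np
from scipy.linalg import eigh

def tsa_step(images, D, W, U, l1, l2):
    """One alternating TSA update: given the current U, refresh V, then refresh U.

    images : list of m x n matrices; D, W : N x N matrices from build_graph_matrices.
    Returns the updated (U, V).
    """
    N = len(images)
    # --- Update V for fixed U: largest generalized eigenvectors of D_U v = lambda L_U v.
    D_U = sum(D[i, i] * images[i].T @ U @ U.T @ images[i] for i in range(N))
    L_U = sum(W[i, j] * (images[i] - images[j]).T @ U @ U.T @ (images[i] - images[j])
              for i in range(N) for j in range(N))
    # Small regularization keeps the denominator matrix positive definite.
    _, evecs = eigh(D_U, L_U + 1e-8 * np.eye(L_U.shape[0]))
    V = evecs[:, ::-1][:, :l2]

    # --- Update U for fixed V: largest generalized eigenvectors of D_V u = lambda L_V u.
    D_V = sum(D[i, i] * images[i] @ V @ V.T @ images[i].T for i in range(N))
    L_V = sum(W[i, j] * (images[i] - images[j]) @ V @ V.T @ (images[i] - images[j]).T
              for i in range(N) for j in range(N))
    _, evecs = eigh(D_V, L_V + 1e-8 * np.eye(L_V.shape[0]))
    U = evecs[:, ::-1][:, :l1]
    return U, V
```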

Discriminant Tensor Subspace Analysis.
In this subsection, we briefly review the second-order DTSA, which was proposed in [35] for face recognition. DTSA combines the advantages of tensor methods and manifold methods and thus preserves both the spatial structure information of the original image data and the local structure of the sample distribution. Moreover, by integrating the class label information into TSA, DTSA obtains higher recognition accuracy for face recognition.
In DTSA, the optimization problem is described as follows:
$$\max_{U, V} \frac{\sum_{p,q=1}^{c} B_{pq}\, \|U^T \bar{X}_p V - U^T \bar{X}_q V\|_F^2}{\sum_{i,j=1}^{N} W_{ij}\, \|U^T X_i V - U^T X_j V\|_F^2}, \qquad (15)$$
where $\bar{X}_p$ is the mean of the samples in the $p$th class. We note that the objective function in (15) has the same denominator as that of (9) but a different numerator. Since the numerator of the objective function in (15) is built from the class label information, DTSA performs better than TSA at transforming samples from different classes into data points far from each other.
Define the mean left and right transformation matrices by collecting the one-sided projections of the class means $\bar{X}_1, \ldots, \bar{X}_c$. Then, similarly, the optimization problem (15) can be rewritten in an equivalent matrix-trace form in terms of these matrices, where $L_w$ is the within-class Laplacian matrix and $L_b$ is the between-class Laplacian matrix.
Similarly, for the optimization problem (15), we have the following result.
Theorem 2. Let $U$ and $V$ be the solution of the maximization problem (15), and define
$$B_U = \sum_{p,q=1}^{c} B_{pq}\, (\bar{X}_p - \bar{X}_q)^T U U^T (\bar{X}_p - \bar{X}_q), \qquad B_V = \sum_{p,q=1}^{c} B_{pq}\, (\bar{X}_p - \bar{X}_q) V V^T (\bar{X}_p - \bar{X}_q)^T,$$
with $L_U$ and $L_V$ as in Theorem 1. Then the following hold.
(1) For a given $U$, $V$ consists of the $l_2$ eigenvectors of the generalized eigenvalue problem $B_U v = \lambda L_U v$ corresponding to the largest $l_2$ eigenvalues.
(2) For a given $V$, $U$ consists of the $l_1$ eigenvectors of the generalized eigenvalue problem $B_V u = \lambda L_V u$ corresponding to the largest $l_1$ eigenvalues.
The algorithm proposed in [35] for implementing DTSA is described in Algorithm 2.
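DTSA fits the same alternating framework as TSA; only the numerator matrices change, being built from the class means and the between-class similarity matrix $B$ instead of from $D$. A minimal sketch of that substitution (our own illustration; it produces the matrix that replaces $D_U$ in the $V$-update of the TSA sketch above, while the denominator matrix $L_U$ is unchanged):

```python
import numpy as np

def dtsa_numerator(class_means, B, U):
    """Numerator matrix for the V-update of DTSA:
    sum_{p,q} B_pq (Xbar_p - Xbar_q)^T U U^T (Xbar_p - Xbar_q)."""
    c = len(class_means)
    n = class_means[0].shape[1]
    M = np.zeros((n, n))
    for p in range(c):
        for q in range(c):
            diff = class_means[p] - class_means[q]
            M += B[p, q] * diff.T @ U @ U.T @ diff
    return M
```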

Orthogonal TSA and DTSA
Although TSA and DTSA are two effective methods for dimension reduction and feature extraction of facial images, they still have two serious defects. Firstly, as shown in the section above, the column vectors of the left and right transformation matrices $U$ and $V$ are eigenvectors of symmetric positive semidefinite pencils, so they are not usually orthonormal. Requiring the columns of the projection matrices to be orthogonal is common because orthogonal projection matrices preserve the metric structure of the facial image space. Thus, orthogonal methods have better locality preserving power and higher discriminating power than nonorthogonal methods. Secondly, at each iteration step of the TSA or DTSA algorithm, two generalized eigenvalue problems must be solved to update the left and right projection matrices. As a result, when computational efficiency is critical, the relatively high computational complexity of TSA and DTSA makes them impractical for real applications.
In this section, we propose the orthogonal TSA (OTSA) and the orthogonal DTSA (ODTSA) for dimension reduction and feature extraction of facial images.
In OTSA, we seek orthogonal projection matrices $U$ and $V$ by solving the optimization problem
$$\max_{U^T U = I_{l_1},\; V^T V = I_{l_2}} \frac{\sum_{i=1}^{N} D_{ii}\, \|U^T X_i V\|_F^2}{\sum_{i,j=1}^{N} W_{ij}\, \|U^T X_i V - U^T X_j V\|_F^2}, \qquad (21)$$
while in ODTSA, the optimization problem to be solved is
$$\max_{U^T U = I_{l_1},\; V^T V = I_{l_2}} \frac{\sum_{p,q=1}^{c} B_{pq}\, \|U^T \bar{X}_p V - U^T \bar{X}_q V\|_F^2}{\sum_{i,j=1}^{N} W_{ij}\, \|U^T X_i V - U^T X_j V\|_F^2}. \qquad (22)$$
Clearly, for OTSA and ODTSA, we have the following theorems.
Theorem 3. Let $U$ and $V$ be the solution of the maximization problem (21), and let $D_U$, $L_U$, $D_V$, $L_V$ be defined as in Theorem 1. Then the following hold.
(1) For a given $U$, $V$ is the solution of the trace ratio optimization problem
$$\max_{V^T V = I_{l_2}} \frac{\operatorname{tr}(V^T D_U V)}{\operatorname{tr}(V^T L_U V)}.$$
(2) For a given $V$, $U$ is the solution of the trace ratio optimization problem
$$\max_{U^T U = I_{l_1}} \frac{\operatorname{tr}(U^T D_V U)}{\operatorname{tr}(U^T L_V U)}.$$

Theorem 4. Let $U$ and $V$ be the solution of the maximization problem (22), and let $B_U$, $B_V$ be defined as in Theorem 2. Then the following hold.
(1) For a given $U$, $V$ is the solution of the trace ratio optimization problem
$$\max_{V^T V = I_{l_2}} \frac{\operatorname{tr}(V^T B_U V)}{\operatorname{tr}(V^T L_U V)}.$$
(2) For a given $V$, $U$ is the solution of the trace ratio optimization problem
$$\max_{U^T U = I_{l_1}} \frac{\operatorname{tr}(U^T B_V U)}{\operatorname{tr}(U^T L_V U)}.$$

The only difference between OTSA and TSA, or between ODTSA and DTSA, is that $U$ and $V$ are constrained to be orthogonal in OTSA and ODTSA. However, the projection matrices $U$ and $V$ of the orthogonal methods are quite different from those of the nonorthogonal methods. In the nonorthogonal methods, $U$ and $V$ are formed from eigenvectors of generalized eigenvalue problems, while those of the orthogonal methods are the solutions of trace ratio optimization problems.

Trace Ratio Optimization.
In this subsection, we consider the following trace ratio optimization problem:
$$\max_{P \in \mathbb{R}^{n \times p},\; P^T P = I_p} \frac{\operatorname{tr}(P^T A P)}{\operatorname{tr}(P^T B P)}, \qquad (27)$$
where $A, B \in \mathbb{R}^{n \times n}$ are symmetric matrices.
For the trace ratio optimization problem (27), we have the following result, which is given in [36].
Theorem 5. Let $A$, $B$ be two symmetric matrices and assume that $B$ is positive semidefinite with rank greater than $n - p$. Then the ratio (27) admits a finite maximum value $\rho^*$.
Define the function $f(\rho)$ as follows:
$$f(\rho) = \max_{P^T P = I_p} \operatorname{tr}\bigl(P^T (A - \rho B) P\bigr).$$
We collect in the following theorem some important properties of $f(\rho)$ presented in [36]; some of them indicate the relation between the trace ratio optimization problem (27) and the function $f(\rho)$.

Theorem 6. Let $A$, $B$ be two symmetric matrices and assume that $B$ is positive semidefinite with rank greater than $n - p$. Then $f(\rho)$ is a decreasing function of $\rho$, the equation $f(\rho) = 0$ has a unique root $\rho^*$, which equals the maximum value of the ratio (27), and the maximizer of (27) consists of the $p$ eigenvectors of $A - \rho^* B$ corresponding to the largest $p$ eigenvalues.

Theorem 6 shows that the solution of the trace ratio optimization problem (27) can instead be obtained in two steps: (1) compute the root $\rho^*$ of the nonlinear equation $f(\rho) = 0$; (2) compute the $p$ eigenvectors of the matrix $A - \rho^* B$ corresponding to the largest $p$ eigenvalues.
Newton's method [37] is the most well-known and widely used method for solving a nonlinear equation. The iterative scheme of Newton's method for solving $f(\rho) = 0$ takes the form
$$\rho_{k+1} = \frac{\operatorname{tr}\bigl(P(\rho_k)^T A\, P(\rho_k)\bigr)}{\operatorname{tr}\bigl(P(\rho_k)^T B\, P(\rho_k)\bigr)},$$
where $P(\rho_k) \in \mathbb{R}^{n \times p}$ consists of the $p$ eigenvectors of the matrix $A - \rho_k B$ corresponding to the largest $p$ eigenvalues. We outline the procedure of Newton's method for solving the trace ratio optimization problem (27) in Algorithm 3.
We remark that since Newton's method typically converges quadratically, only a few iterations are required in Algorithm 3 to obtain a good approximation of $\rho^*$. The main cost at each iteration of Algorithm 3 is the computation of the $p$ eigenvectors of a symmetric matrix corresponding to the largest $p$ eigenvalues.
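A compact sketch of this Newton iteration is given below (our own illustration of the scheme just described, using a dense symmetric eigendecomposition; the tolerance and the iteration cap are arbitrary choices).

```python
import numpy as np

def trace_ratio_newton(A, B, p, tol=1e-10, max_iter=50):
    """Maximize tr(P^T A P) / tr(P^T B P) over P with P^T P = I_p.

    A, B : symmetric n x n matrices, B positive semidefinite with rank > n - p.
    Returns (rho, P) with rho the maximal ratio and P its orthonormal maximizer.
    """
    rho = 0.0
    for _ in range(max_iter):
        # Eigenvectors of A - rho*B for the largest p eigenvalues give f(rho)
        # and the next Newton iterate.
        evals, evecs = np.linalg.eigh(A - rho * B)
        P = evecs[:, -p:]
        f = np.sum(evals[-p:])          # f(rho) = max tr(P^T (A - rho B) P)
        rho_new = np.trace(P.T @ A @ P) / np.trace(P.T @ B @ P)
        if abs(f) < tol or abs(rho_new - rho) < tol:
            rho = rho_new
            break
        rho = rho_new
    return rho, P
```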

Lanczos Vectors.
In this subsection, we review the Lanczos procedure for generating the Lanczos vectors of a symmetric matrix and the Newton-Lanczos method for solving the trace ratio optimization problem (27).
Given a symmetric matrix $A$ and an initial unit vector $v$, let $\mathcal{K}_k(A, v)$ denote the Krylov subspace associated with $A$ and $v$, defined as
$$\mathcal{K}_k(A, v) = \operatorname{span}\{v, Av, A^2 v, \ldots, A^{k-1} v\}.$$
The Lanczos vectors $v_1, v_2, \ldots, v_k$, which form an orthonormal basis of the Krylov subspace $\mathcal{K}_k(A, v)$, can be generated by the three-term recurrence
$$\beta_{j+1} v_{j+1} = A v_j - \alpha_j v_j - \beta_j v_{j-1},$$
with $\beta_1 v_0 = 0$. The coefficients $\alpha_j$ and $\beta_{j+1}$ are computed so as to ensure that $v_{j+1} \perp v_j$ and $\|v_{j+1}\|_2 = 1$. The pseudocode of the Lanczos procedure for constructing the Lanczos vectors is given in Algorithm 4. It is known [38] that Lanczos vectors are commonly good approximations of the eigenvectors of a symmetric matrix corresponding to the largest eigenvalues. It is therefore reasonable to replace the $p$ eigenvectors of the matrix $A - \rho B$ corresponding to the largest $p$ eigenvalues in Algorithm 3 by the Lanczos vectors of the matrix $A - \rho B$, thereby avoiding the expensive eigenvector computation. This substitution yields the Newton-Lanczos method for solving the trace ratio optimization problem (27), which is outlined in Algorithm 5; see also [36].
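The basic recurrence can be sketched as follows (our own illustration; a robust implementation would add reorthogonalization, which is omitted here for brevity).

```python
import numpy as np

def lanczos_vectors(A, v, k):
    """Generate k Lanczos vectors of the symmetric matrix A from the unit vector v.

    The columns of the returned matrix form an (approximately) orthonormal
    basis of the Krylov subspace span{v, Av, ..., A^{k-1} v}.
    """
    n = A.shape[0]
    V = np.zeros((n, k))
    V[:, 0] = v / np.linalg.norm(v)
    beta = 0.0
    v_prev = np.zeros(n)
    for j in range(k):
        w = A @ V[:, j] - beta * v_prev
        alpha = V[:, j] @ w
        w = w - alpha * V[:, j]          # enforce orthogonality against v_j
        beta = np.linalg.norm(w)
        if beta < 1e-12 or j == k - 1:   # breakdown or last vector reached
            break
        v_prev = V[:, j]
        V[:, j + 1] = w / beta           # normalize to obtain v_{j+1}
    return V
```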
Algorithm 3: Newton's method for trace ratio optimization.

OTSA and ODTSA.
Similarly, from Theorems 3 and 4, we obtain two iterative procedures for computing the left and right transformation matrices $U$ and $V$ of OTSA and ODTSA. Algorithms 6 and 7 summarize the steps to compute $U$ and $V$ for OTSA and ODTSA, respectively. The trace ratio optimization problems in Algorithms 6 and 7 can be solved by either Newton's method or the Newton-Lanczos method. To distinguish these two cases, we use OTSA-N and ODTSA-N to denote the OTSA and ODTSA algorithms in which the trace ratio optimization problems are solved by Newton's method, and OTSA-NL and ODTSA-NL to denote those in which they are solved by the Newton-Lanczos method.
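Combining the matrices of the TSA sketch with the trace ratio solver sketched earlier, one alternating OTSA step could look as follows (our own illustration, assuming the trace_ratio_newton routine defined above; OTSA-NL would swap in a Lanczos-based solver for the same subproblems).

```python
def otsa_step(images, D, W, U, l1, l2):
    """One alternating OTSA update: solve a trace ratio problem for V, then for U.

    Uses the same D_U, L_U, D_V, L_V matrices as the TSA step, but constrains
    the projections to have orthonormal columns via the trace ratio solver.
    """
    N = len(images)
    D_U = sum(D[i, i] * images[i].T @ U @ U.T @ images[i] for i in range(N))
    L_U = sum(W[i, j] * (images[i] - images[j]).T @ U @ U.T @ (images[i] - images[j])
              for i in range(N) for j in range(N))
    _, V = trace_ratio_newton(D_U, L_U, l2)    # orthonormal right projection

    D_V = sum(D[i, i] * images[i] @ V @ V.T @ images[i].T for i in range(N))
    L_V = sum(W[i, j] * (images[i] - images[j]) @ V @ V.T @ (images[i] - images[j]).T
              for i in range(N) for j in range(N))
    _, U = trace_ratio_newton(D_V, L_V, l1)    # orthonormal left projection
    return U, V
```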
In each iteration of TSA and DTSA, the dominant cost is the solution of the two generalized eigenvalue problems. For solving the trace ratio optimization problem (27), Newton's method (Algorithm 3) requires a full symmetric eigendecomposition at each Newton step, whereas the Newton-Lanczos method (Algorithm 5) requires only a short Lanczos recurrence per step; in both cases the overall cost also depends on the number of Newton iteration steps. Hence, the cost per iteration of OTSA-NL and ODTSA-NL is much lower than that of OTSA-N and ODTSA-N.

In our experiments, the parameters $l_1$ and $l_2$ in all the methods are set to 10, and the parameter $t$ is set to 1. The mean and standard deviation of the recognition accuracy over 10 runs of the six algorithms are presented in Table 2, and the training time of each method is presented in Table 3. The results show that, for all methods, the recognition accuracy increases as the training sample size increases. Moreover, the orthogonal methods achieve higher recognition accuracy than their nonorthogonal versions, and the orthogonal methods based on the Newton-Lanczos approach require the least computational time. As in the previous experiment, the parameters $l_1$ and $l_2$ are set to 10 and $t$ is set to 1. The mean and standard deviation of the recognition accuracy over 10 runs on the Yale database are presented in Table 4, and the training time of each method on the Yale database is presented in Table 5.

Conclusion
In this paper, we have proposed orthogonal TSA (OTSA) and orthogonal DTSA (ODTSA) for face recognition by constraining the left and right projection matrices to be orthogonal. Similarly to TSA and DTSA, OTSA and ODTSA compute the left and right projection matrices iteratively. However, instead of solving two generalized eigenvalue problems as in TSA and DTSA, only two trace ratio optimization problems need to be solved at each iteration of OTSA and ODTSA. Since the trace ratio optimization problem can be solved by the inexpensive Newton-Lanczos method, OTSA and ODTSA have a much lower computational cost than their nonorthogonal counterparts. Experimental results show that the proposed methods achieve much higher recognition accuracy and have a much lower training cost than TSA and DTSA.

4.2. Experiment on the Yale Database.
The Yale face database contains 165 gray-scale images of 15 individuals, with 11 images per individual. These facial images vary in lighting conditions (left-light, center-light, right-light) and facial expressions (normal, happy, sad, sleepy, surprised, and wink), and are taken with and without glasses. The 11 sample images of one individual from the Yale database are shown in Figure 2.

Figure 1: Sample images for one individual of the ORL database.

Figure 2: Sample images for one individual of the Yale database.

Table 3: Training time (seconds) on the ORL database.

Table 5: Training time (seconds) on the Yale database.