Multiple Data-Dependent Kernel Fisher Discriminant Analysis for Face Recognition

Kernel Fisher discriminant analysis (KFDA) has demonstrated its success in extracting facial features for face recognition. Compared to linear techniques, it can better describe the complex and nonlinear variations of face images. However, a single kernel is not always suitable for face recognition applications that contain data from multiple, heterogeneous sources, such as face images under large variations of pose, illumination, and facial expression. To improve the performance of KFDA in face recognition, a novel algorithm named multiple data-dependent kernel Fisher discriminant analysis (MDKFDA) is proposed in this paper. The constructed multiple data-dependent kernel (MDK) is a combination of several base kernels with a data-dependent kernel constraint on their weights. By solving the optimization problem based on the Fisher criterion and the maximum margin criterion, the parameters of the data-dependent kernel and the multiple base kernels are jointly optimized. Experimental results on three face databases validate the effectiveness of the proposed algorithm.


Introduction
Face recognition has received extensive attention in many image processing applications. In these applications, the original face images commonly lie in a high-dimensional space, resulting in low recognition accuracy and high computational cost. Existing image feature extraction algorithms roughly fall into two categories: feature extraction based on signal processing and learning-based feature extraction [1][2][3][4]. With a learning-based approach, the original images can be mapped into a lower-dimensional feature space in which the essential structure of the original space becomes clear. Fisher discriminant analysis (FDA) [5], principal component analysis (PCA), and locality preserving projection (LPP) [6] are typical learning-based feature extraction techniques. Moreover, one of the most famous algorithms applied in face recognition is Fisherface, which is based on a two-phase framework: PCA plus LDA [3,4]. It maximizes the between-class scatter and minimizes the within-class scatter to separate one class from the others. However, all of the above-mentioned algorithms are essentially linear subspace analysis methods, so they are inadequate for depicting the complexity of face images. To overcome this limitation, many nonlinear algorithms, such as kernel PCA (KPCA) [7] and kernel FDA (KFDA) [8], have been devised and have attained good performance in face recognition. It has been demonstrated that KFDA is a feasible nonlinear feature extraction algorithm for face recognition. However, the performance of KFDA is sensitive to the choice of kernel function and its parameters. Moreover, a single kernel is quite limited in depicting some aspects of the geometrical structure of the input data. Once face images are captured under large variations of pose, illumination, facial expression, and so forth, single kernel-based FDA may not be suitable for face recognition. In summary, kernel functions play an important part in face recognition applications [9,10].
As a consequence, various approaches have been developed to handle the above issues, falling into two main categories. (1) Devise multiple kernels by convex combination of multiple basic kernels. In this way, different data descriptors can be used to depict the geometrical structures of the original data from multiple views, which complement one another to improve recognition performance [11][12][13][14]. (2) Develop a data-dependent kernel (DK) by conformal transformation of a basic kernel. In this way, the designed kernel is adaptive to the input data, leading to a substantial improvement in the performance of the KFDA algorithm [15][16][17].
In this paper, to improve the performance of KFDA, we propose a novel feature extraction algorithm for face recognition called multiple data-dependent kernel Fisher discriminant analysis (MDKFDA), based on multiple kernel learning (MKL). The main contributions of this paper are as follows. (1) By introducing the MDK into KFDA, maximum discrimination performance can be achieved in the feature space. (2) Multiple image features extracted with different descriptors are fully utilized in the MDKFDA algorithm. (3) Nonlinear discriminant features are produced by the adoption of MKL.
The rest of this paper is organized as follows. Section 2 gives a brief overview of MKL and KFDA. In Section 3, we present the proposed MDKFDA algorithm and introduce the parameter optimization scheme for the data-dependent kernel and the multiple base kernels. Extensive experimental results on face recognition are reported in Section 4. Finally, Section 5 concludes this paper.

Related Work
In this section, we will briefly introduce some previous works related to this paper, including KFDA and MKL.

Kernel Fisher Discriminant Analysis.
KFDA is a nonlinear feature extraction algorithm that combines the kernel trick with FDA. Because of its ability to extract discriminatory nonlinear features, KFDA and its variants are frequently used for face recognition. In this paper, the two-phase KFDA framework proposed by Yang et al. [18] is adopted to construct the MDKFDA. The framework contains two parts: KPCA is applied to reduce the dimension of the input space, and then LDA is used to further extract features in the KPCA-transformed space.
Given the training sample set $S = \{(x_i, l_i)\}_{i=1}^{n}$ containing $n$ samples, where $x_i \in \mathbb{R}^{d}$ is a $d$-dimensional training sample and $l_i$ represents the class label of $x_i$. For a sample $x$, its nonlinear mapped image is denoted as $\Phi(x)$, and the discriminant feature vector $z$ can be obtained as
$$z = G^{T} P^{T} \Phi(x). \quad (2)$$
Equation (2) contains two transformations, $P$ and $G$. The transformation $P$ represents KPCA, which maps the input space $\mathbb{R}^{d}$ into the feature space $\mathbb{R}^{m}$, while $G$ is the Fisher discriminant transformation in the KPCA-transformed space $\mathbb{R}^{m}$. Firstly, the KPCA phase is described as follows.
For a given nonlinear mapping $\Phi$, the input space $\mathbb{R}^{d}$ is projected into the feature space $F$, which is a Hilbert space. The covariance operator on $F$ can be represented as
$$H = \frac{1}{n}\sum_{i=1}^{n}\left(\Phi(x_i) - \bar{\Phi}\right)\left(\Phi(x_i) - \bar{\Phi}\right)^{T},$$
where $\bar{\Phi} = (1/n)\sum_{i=1}^{n}\Phi(x_i)$. The nonzero eigenpairs of $H$ are found as follows. Denote $Q = [\Phi(x_1), \ldots, \Phi(x_n)]$ (with the mapped samples centered) and construct the $n \times n$ Gram matrix $K = Q^{T}Q$, whose elements can be calculated through the kernel trick:
$$K_{ij} = \left\langle \Phi(x_i), \Phi(x_j) \right\rangle = k(x_i, x_j).$$
We take the $m$ largest positive eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$ of $K$ and their corresponding orthonormal eigenvectors $u_1, u_2, \ldots, u_m$ to calculate the eigenvectors $w_1, w_2, \ldots, w_m$ of $H$:
$$w_j = \frac{1}{\sqrt{\lambda_j}}\, Q\, u_j, \qquad j = 1, \ldots, m.$$
Hence, we obtain the KPCA-transformed feature vector $y = (y_1, \ldots, y_m)^{T}$, where the $j$th KPCA feature is
$$y_j = w_j^{T}\Phi(x) = \frac{1}{\sqrt{\lambda_j}}\, u_j^{T} Q^{T} \Phi(x).$$
Above all, the transformation $P$ can be written as $P = (w_1, \ldots, w_m)$. Secondly, the Fisher discriminant transformation is described as follows.
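Before turning to the Fisher step, the KPCA phase just described can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation; the Gaussian kernel and the double-centering of the Gram matrix (which realizes the subtraction of the mean image $\bar{\Phi}$) are standard choices assumed here.

```python
import numpy as np

def kpca_transform(X, X_new, kernel, m):
    """KPCA phase: eigendecompose the centered Gram matrix K = Q^T Q and
    project samples onto the top-m directions w_j = Q u_j / sqrt(lambda_j)."""
    n = X.shape[0]
    K = np.array([[kernel(a, b) for b in X] for a in X])
    # Double-center the Gram matrix (accounts for the mean Phi-bar).
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one
    # Top-m eigenpairs, lambda_1 >= ... >= lambda_m.
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:m]
    lam, U = vals[order], vecs[:, order]
    # Cross-kernel between new samples and training samples, centered the same way.
    Kt = np.array([[kernel(a, b) for b in X] for a in X_new])
    one_t = np.ones((len(X_new), n)) / n
    Ktc = Kt - one_t @ K - Kt @ one + one_t @ K @ one
    # y_j = (1 / sqrt(lambda_j)) u_j^T Q^T Phi(x), stacked over j.
    return Ktc @ U / np.sqrt(lam)
```

Projecting the training samples themselves recovers features whose variances are the leading eigenvalues of the centered Gram matrix (divided by $n$), in decreasing order.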
FDA is used for further feature extraction in the KPCA-transformed space $\mathbb{R}^{m}$. In order to maximize the Fisher criterion, we first define the between-class scatter operator $S_b^{\Phi}$ and the within-class scatter operator $S_w^{\Phi}$ in the feature space $F$:
$$S_b^{\Phi} = \frac{1}{n}\sum_{i=1}^{c} n_i \left(\mu_i^{\Phi} - \mu_0^{\Phi}\right)\left(\mu_i^{\Phi} - \mu_0^{\Phi}\right)^{T}, \quad (9)$$
$$S_w^{\Phi} = \frac{1}{n}\sum_{i=1}^{c}\sum_{j=1}^{n_i}\left(\Phi(x_{ij}) - \mu_i^{\Phi}\right)\left(\Phi(x_{ij}) - \mu_i^{\Phi}\right)^{T}, \quad (10)$$
where $n_i$ is the number of training samples in class $i$, $x_{ij}$ represents the $j$th sample in class $i$, $\mu_i^{\Phi}$ is the mean of the mapped samples in class $i$, and $\mu_0^{\Phi}$ is the mean across all mapped samples. Thus, the Fisher criterion is
$$J^{\Phi}(V) = \frac{V^{T} S_b^{\Phi} V}{V^{T} S_w^{\Phi} V},$$
where $V$ is the discriminant vector. According to Mercer kernel function theory, each $V$ can be expressed in terms of the elements of the feature space $\{\Phi(x_1), \Phi(x_2), \ldots, \Phi(x_n)\}$; that is, there always exist coefficients $\alpha_i$, $i = 1, 2, \ldots, n$, such that
$$V = \sum_{i=1}^{n} \alpha_i \Phi(x_i).$$
Hence, the Fisher optimal discriminant vectors are the stationary points $V_1, \ldots, V_q$, $q \leq c - 1$, and, correspondingly, the transformation $G$ in (2) can be written as $G = (V_1, \ldots, V_q)$.
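The FDA step above reduces, in the finite-dimensional KPCA-transformed space, to an ordinary generalized eigenproblem between the two scatter matrices. A minimal sketch follows; the small ridge added to $S_w$ is an assumption of ours for numerical stability, not part of the paper.

```python
import numpy as np

def fisher_directions(Y, labels, q):
    """Ordinary FDA in the KPCA-transformed space R^m: build the
    between-class and within-class scatter matrices and return
    G = (V_1, ..., V_q), the top-q generalized eigenvectors."""
    classes = np.unique(labels)
    m0 = Y.mean(axis=0)
    m = Y.shape[1]
    Sb = np.zeros((m, m))
    Sw = np.zeros((m, m))
    for c in classes:
        Yc = Y[labels == c]
        mc = Yc.mean(axis=0)
        d = (mc - m0)[:, None]
        Sb += len(Yc) * (d @ d.T)
        Sw += (Yc - mc).T @ (Yc - mc)
    # Small ridge so S_w is invertible (an added stability assumption).
    Sw += 1e-6 * np.eye(m)
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(vals.real)[::-1][:q]
    return vecs[:, order].real
```

For two well-separated classes, projecting onto the first Fisher direction cleanly separates the class projections.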

Multiple Kernel Learning.
In general, MKL refers to the process of learning a kernel machine based on a combination of multiple base kernel functions or matrices. Recent research has shown that MKL is able not only to find optimal combination weights for the base kernels but also to improve the performance of the resulting classifiers.
As mentioned above, $S$ is the $d$-dimensional training sample set. For a given nonlinear mapping $\Phi$, the original data are projected into the empirical feature space $F$. Using Mercer's theorem [19], the inner product of two transformed vectors in the nonlinear space can be expressed as
$$k(x, x') = \left\langle \Phi(x), \Phi(x') \right\rangle,$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product. Such a kernel function is usually called a Mercer kernel, and some commonly used Mercer kernels are the linear kernel $k(x, x') = x^{T}x'$, the polynomial kernel $k(x, x') = (x^{T}x' + 1)^{p}$, and the Gaussian kernel $k(x, x') = \exp\left(-\|x - x'\|^{2} / 2\sigma^{2}\right)$ [20]. Among them, the Gaussian kernel is one of the most widespread. However, the Gaussian kernel can only reflect the local nonlinear structure of the data, while the linear kernel and the polynomial kernel are global kernel functions. It has been shown that kernel-based feature extraction algorithms are appropriate for solving the nonlinear problems in face recognition. Nevertheless, the disadvantage of single kernel-based algorithms is their lack of generalized representation capability for multidimensional and multiclass data. Recent applications have indicated that MKL can provide a more flexible framework to fuse information from different data sources and enhance the performance of classifiers [21][22][23][24][25][26].
In the MKL framework, given $M$ basic kernel functions $\{k_p\}_{p=1}^{M}$, the multiple kernel function can be generally represented as [27]
$$k_{\text{multiple}}(x, x') = \sum_{p=1}^{M} \beta_p\, k_p(x, x'), \qquad \beta_p \geq 0,$$
where the weighting coefficients $\beta_p$ are commonly obtained by solving the objective function of the kernel subspace learning algorithm. Optimizing the coefficients $\beta_p$ is thus a critical problem for improving the performance of MKL.
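The three named Mercer kernels and their weighted combination are easy to state in code. This is a generic sketch; the polynomial degree and Gaussian width below are illustrative defaults, not values from the paper.

```python
import numpy as np

# Standard forms of the three Mercer kernels named in the text
# (polynomial degree p and Gaussian width sigma are illustrative):
linear = lambda x, y: float(x @ y)
poly = lambda x, y, p=2: float((x @ y + 1.0) ** p)
gauss = lambda x, y, sigma=1.0: float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma**2)))

def multiple_kernel(x, y, betas, kernels):
    """k_multiple(x, y) = sum_p beta_p * k_p(x, y), with beta_p >= 0."""
    return sum(b * k(x, y) for b, k in zip(betas, kernels))
```

Since each base kernel is symmetric and positive semidefinite, any nonnegative combination is again a valid Mercer kernel.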

The Proposed Multiple Data-Dependent Kernel (MDK)
As mentioned above, given training data $S$, the elements of the MDK can be formulated as
$$k_{\text{MDK}}(x, x') = f(x)\, f(x') \sum_{p=1}^{M} \beta_p\, k_p(x, x'),$$
where $x, x' \in \mathbb{R}^{d}$, $k_p(x, x')$ is the $p$th basic kernel chosen from the commonly used ones such as the Gaussian kernel or the polynomial kernel, $M$ is the number of candidate basic kernels, $\beta_p$ is the weight for the $p$th basic kernel, and $f(\cdot)$ is the factor function of the data-dependent kernel (DK), which takes the form
$$f(x) = b_0 + \sum_{i=1}^{r} b_i\, e(x, \tilde{x}_i), \quad (19)$$
where $b_i$, $i = 0, 1, \ldots, r$, are the combination coefficients, $e(\cdot, \cdot)$ is a basic (Gaussian) kernel, and the set $\{\tilde{x}_i \in \mathbb{R}^{d}\}$, called the "empirical cores," is chosen from the training data. It is notable that the MDK also satisfies the Mercer condition, since $k_{\text{MDK}}(x, x')$ is a conformal transformation (by the DK factor) of a nonnegative linear combination of Mercer kernels.
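The MDK construction above can be sketched directly. This is a minimal illustration under assumptions: the Gaussian width of $e(\cdot,\cdot)$ is set to an arbitrary unit value, and the coefficients `b` and cores are supplied by hand rather than optimized.

```python
import numpy as np

def dk_factor(x, cores, b):
    """Factor function f(x) = b_0 + sum_i b_i * e(x, core_i) of the DK,
    with a Gaussian e(., .); the unit width is an illustrative choice."""
    e = lambda a, c: float(np.exp(-np.sum((a - c) ** 2)))
    return b[0] + sum(bi * e(x, c) for bi, c in zip(b[1:], cores))

def mdk(x, y, betas, kernels, cores, b):
    """k_MDK(x, y) = f(x) f(y) * sum_p beta_p k_p(x, y): a conformal
    transformation of the multiple kernel, which keeps the Mercer
    condition intact."""
    base = sum(bp * k(x, y) for bp, k in zip(betas, kernels))
    return dk_factor(x, cores, b) * dk_factor(y, cores, b) * base
```

A quick sanity check on the Mercer claim: the Gram matrix of `mdk` equals $D K_{\text{multiple}} D$ with $D = \operatorname{diag}(f(x_1), \ldots, f(x_n))$, so it stays symmetric positive semidefinite.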
As mentioned above, the main problems in the MDK are to choose the optimal weights $\beta_p$ for the basic kernels $k_p(x, x')$ and the coefficients $b_i$ of the data-dependent kernel factor $f(x)$. In this paper, we adopt an iterative method based on the maximum margin criterion (MMC) and the Fisher scalar to optimize the weights $\beta_p$ and the coefficients $b_i$, respectively. The schematic of the MDK is shown in Figure 1. In KFDA, the class separability is measured in the kernel feature space, and the kernel Fisher criterion can be expressed as the ratio of the between-class scatter to the within-class scatter, as defined in Section 2.1.

Weight Optimization for Multiple Kernels.
In this section, the diagonalization strategy [28] is adopted to find the optimal discriminant vector $V_{\text{opt}}$, based on which the maximum margin criterion (MMC) [29], $L(\beta) = \operatorname{tr}(S_b^{\Phi} - S_w^{\Phi})$, is employed as the objective function to optimize the weights $\beta_p$, $p = 1, 2, \ldots, M$. To maximize $L(\beta)$ under a norm constraint on the weights, a Lagrangian $L(\beta, \lambda)$ is introduced. A system of equations is obtained by differentiating $L(\beta, \lambda)$ with respect to $\beta_1, \beta_2, \ldots, \beta_M$ and $\lambda$ and setting the partial derivatives to zero; Newton's iteration method is then used to solve this system, yielding the optimized weights for the multiple kernels.
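Because the original update equations are not reproduced here, the following sketch replaces the Newton iteration with a simplified stand-in: each base kernel gets a per-kernel MMC score $s_p = \operatorname{tr}(S_b^{\Phi}) - \operatorname{tr}(S_w^{\Phi})$ (computable entirely from its Gram matrix), and the linear objective $\sum_p \beta_p s_p$ subject to $\|\beta\| = 1$ has the closed-form Lagrangian solution $\beta = s / \|s\|$. All function names are illustrative, not from the paper.

```python
import numpy as np

def mmc_score(K, labels):
    """tr(S_b) - tr(S_w) for one base kernel, expressed purely through
    its Gram matrix K (a standard kernel-trick identity)."""
    n = K.shape[0]
    mu = K.sum() / n**2                            # <mu, mu>
    tr_sb, within = 0.0, 0.0
    for c in np.unique(labels):
        idx = labels == c
        nc = idx.sum()
        kcc = K[np.ix_(idx, idx)].sum() / nc**2    # <mu_c, mu_c>
        kc0 = K[idx].sum() / (nc * n)              # <mu_c, mu>
        tr_sb += nc * (kcc - 2.0 * kc0 + mu)
        within += K[np.ix_(idx, idx)].sum() / nc
    tr_sw = np.trace(K) - within
    return tr_sb - tr_sw

def mmc_weights(grams, labels):
    """Simplified stand-in for the Newton iteration: maximize
    sum_p beta_p * s_p subject to ||beta|| = 1 (closed form
    beta = s / ||s||), clipping negative scores to keep weights
    nonnegative."""
    s = np.maximum([mmc_score(K, labels) for K in grams], 0.0)
    if not s.any():
        return np.full(len(grams), 1.0 / np.sqrt(len(grams)))
    return s / np.linalg.norm(s)
```

In this toy setting a discriminative RBF Gram matrix receives nearly all the weight, while an uninformative identity kernel is suppressed.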

Coefficients Optimization for Data-Dependent Kernel.
Since the optimized weights for the multiple kernels have been obtained, this section describes how to find proper coefficients for the DK. In the empirical feature space, let $J = \operatorname{tr}(S_b^{\Phi}) / \operatorname{tr}(S_w^{\Phi})$ denote the Fisher scalar, where $S_b^{\Phi}$ and $S_w^{\Phi}$ are defined in (9) and (10), respectively. Given the training dataset $S$, $K = [k(x_i, x_j)]_{n \times n}$ is the kernel matrix over all samples, whose elements are $K_{ij} = k(x_i, x_j)$, $i, j = 1, 2, \ldots, n$, and $K_{pq}$, $p, q = 1, 2, \ldots, c$, is the $n_p \times n_q$ submatrix of $K$ formed by the samples of classes $p$ and $q$. Hence, $K$ can be written as the block matrix $K = (K_{pq})_{p,q=1}^{c}$.

Consequently, the between-class kernel matrix $B_0$ and the within-class kernel matrix $W_0$ can be expressed as
$$B_0 = \operatorname{diag}\!\left(\tfrac{1}{n_1}K_{11}, \ldots, \tfrac{1}{n_c}K_{cc}\right) - \tfrac{1}{n}K, \qquad W_0 = \operatorname{diag}\!\left(k(x_1, x_1), \ldots, k(x_n, x_n)\right) - \operatorname{diag}\!\left(\tfrac{1}{n_1}K_{11}, \ldots, \tfrac{1}{n_c}K_{cc}\right).$$
For the multiple kernel $k_{\text{multiple}}(x_i, x_j) = \sum_{p=1}^{M}\beta_p\, k_p(x_i, x_j)$, $B_0$ and $W_0$ are replaced by their multiple-kernel counterparts, and the corresponding data-dependent kernel matrix becomes $K = Q K_{\text{multiple}} Q$, in which $Q = \operatorname{diag}(f(x_1), f(x_2), \ldots, f(x_n))$ and $f(x)$ is defined in (19). We then have the following theorem.

Theorem 1. Let $1_n$ be the $n$-dimensional vector with unity elements and $q = Q 1_n = (f(x_1), \ldots, f(x_n))^{T}$; then the Fisher scalar satisfies
$$J = \frac{q^{T} B_0\, q}{q^{T} W_0\, q}.$$

Proof (sketch). As shown in Section 2.1, the dimension of the empirical feature space is $m$ ($m < n$). Expressing the class means $\mu_0^{\Phi}$ and $\mu_i^{\Phi}$ in terms of the mapped samples and using the fact that the empirical feature space preserves the dot product, $\operatorname{tr}(S_b^{\Phi})$ and $\operatorname{tr}(S_w^{\Phi})$ reduce to quadratic forms in the entries of the data-dependent kernel matrix; since that matrix equals $Q K_{\text{multiple}} Q$, it follows that $\operatorname{tr}(S_b^{\Phi}) = q^{T} B_0 q$ and $\operatorname{tr}(S_w^{\Phi}) = q^{T} W_0 q$, which proves the relation.

In order to obtain the optimal coefficients, $J$ should be maximized. Writing $b = (b_0, b_1, \ldots, b_r)^{T}$ and $q = K_1 b$, where $K_1$ is the $n \times (r+1)$ matrix with rows $(1, e(x_i, \tilde{x}_1), \ldots, e(x_i, \tilde{x}_r))$, $J$ can be reformulated as
$$J(b) = \frac{J_1(b)}{J_2(b)} = \frac{b^{T} M b}{b^{T} N b}, \qquad M = K_1^{T} B_0 K_1, \quad N = K_1^{T} W_0 K_1.$$
To maximize $J(b)$, the standard gradient approach is adopted. The partial derivatives of $J_1(b)$ and $J_2(b)$ with respect to $b$ are $2Mb$ and $2Nb$, respectively, so the gradient of $J(b)$ is
$$\frac{\partial J}{\partial b} = \frac{2}{J_2(b)}\left(M b - J(b)\, N b\right).$$
Setting $\partial J / \partial b = 0$ yields $M b = J N b$; hence, $J$ equals the largest eigenvalue of the matrix $N^{-1}M$, and the optimal $b$ is the corresponding eigenvector. An iterative algorithm is employed to calculate the optimal $b$:
$$b^{(t+1)} = b^{(t)} + \eta(t)\, \frac{\partial J}{\partial b}\bigg|_{b = b^{(t)}}, \qquad \eta(t) = \eta_0 \left(1 - \frac{t}{N_{\text{it}}}\right),$$
where $\eta_0$ is the initial learning rate, and $t$ and $N_{\text{it}}$ denote the current and the prespecified iteration numbers, respectively. In summary, the optimal coefficients can be obtained by choosing $\eta_0$ and $N_{\text{it}}$ properly.

Complete MDKFDA Algorithm.

In summary of the discussion so far, the steps of the complete MDKFDA algorithm are described as follows.
Step 1 (construct the MDK). A Gaussian kernel is adopted to construct the data-dependent kernel (DK), while the linear kernel, Gaussian kernel, and polynomial kernel are employed as the base kernels of the multiple kernel function.

Step 2 (optimize the weights and coefficients). The maximum margin criterion (MMC) is employed as the objective function to optimize the weights of the multiple kernels, and the optimized coefficients of the DK are achieved by virtue of the Fisher scalar.

Step 3 (transform the data). The MDK is used to transform the input space $\mathbb{R}^{d}$ into the feature space $\mathbb{R}^{m}$, in which the original samples are represented by their KPCA features. The transformation is $P = (w_1, \ldots, w_m)$.

Step 4 (extract the Fisher discriminant vectors). $S_b^{\Phi}$ and $S_w^{\Phi}$ in $\mathbb{R}^{m}$ are calculated to obtain the Fisher criterion $J^{\Phi}(V)$. By maximizing $J^{\Phi}(V)$, the Fisher optimal discriminant vectors are achieved, and the Fisher discriminant transformation is $G = (V_1, \ldots, V_q)$.

Step 5 (obtain the MDKFDA feature vector). Based on the first four steps, the MDKFDA feature vector $z = G^{T} P^{T} \Phi(x)$ is obtained.
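The coefficient update used in Step 2 (the gradient ascent on the Fisher scalar from Section 3.2) might be sketched as follows. Here `M` and `N` stand for the two quadratic-form matrices of $J(b)$; the renormalization of `b` at each step is an added assumption of ours that fixes the scale without changing $J$.

```python
import numpy as np

def optimize_dk_coeffs(M, N, n_iter=200, eta0=0.1, seed=0):
    """Gradient ascent on J(b) = (b^T M b) / (b^T N b) with the decaying
    learning rate eta(t) = eta0 * (1 - t / n_iter) described in the
    paper. M and N play the roles of K1^T B0 K1 and K1^T W0 K1."""
    rng = np.random.default_rng(seed)
    b = rng.normal(size=M.shape[0])
    for t in range(n_iter):
        j1, j2 = b @ M @ b, b @ N @ b
        # dJ/db = (2 / J2) * (M b - J * N b)
        grad = (2.0 / j2) * (M @ b - (j1 / j2) * (N @ b))
        b = b + eta0 * (1.0 - t / n_iter) * grad
        b = b / np.linalg.norm(b)  # fix the scale; J is scale-invariant
    return b, (b @ M @ b) / (b @ N @ b)
```

At the stationary point $Mb = J\,Nb$, so the iteration should approach the largest generalized eigenvalue of $(M, N)$ and its eigenvector.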

Experimental Results and Discussions
In this section, we conduct several experiments on three face databases to evaluate the performance of the proposed MDKFDA algorithm by comparing it with several widespread algorithms in face recognition, including PCA, LPP, FDA, KPCA, KFDA, and DKFDA. The ORL face database [30], the YALE face database [31], and the PIE face database [32] are adopted in the experiments, and sample images of one person from each database are shown in Figures 2, 3, and 4. In the following experiments, we randomly select 5 images per individual as the training set and the remaining 5 for testing. To make the experiments more reliable, we repeat the trials 10 times and report the average performance. Three kernels are employed as the base kernels of the multiple kernel function in the MDK: the linear kernel, the Gaussian kernel, and the polynomial kernel. Moreover, a Gaussian kernel is adopted to construct the DK. Table 1 shows the comparison of the maximal average recognition ratio between several nonkernel algorithms and the proposed MDKFDA algorithm, and Table 2 reports the comparison between several single kernel-based algorithms and the proposed MDKFDA. According to the experimental results, the MDKFDA algorithm outperforms the other algorithms, which implies that MDKFDA can effectively integrate multiple base kernels with the data-dependent kernel (DK) to improve the recognition ratio. Besides, it can be seen that all the single kernel-based feature extraction algorithms outperform their corresponding linear versions, which indicates that kernel-based algorithms are advantageous for face recognition. From Tables 5 and 6, it can likewise be seen that the MDKFDA algorithm outperforms the other algorithms.
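The evaluation protocol above (random 5/5 split per individual, 10 trials, average accuracy) can be sketched on synthetic data with a 1-nearest-neighbour rule standing in for the classifier; the paper does not specify the classifier, so the 1-NN choice and the `extract` placeholder are assumptions of this sketch. In practice, the transformations $P$ and $G$ would be fitted on the training split only.

```python
import numpy as np

def average_accuracy(X, labels, extract, n_train=5, n_trials=10, seed=0):
    """Per class, n_train randomly chosen samples train a 1-NN
    classifier in the extracted feature space; the rest are tested.
    Accuracy is averaged over n_trials random splits."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_trials):
        tr, te = [], []
        for c in np.unique(labels):
            idx = rng.permutation(np.where(labels == c)[0])
            tr.extend(idx[:n_train])
            te.extend(idx[n_train:])
        tr, te = np.array(tr), np.array(te)
        Ftr, Fte = extract(X[tr]), extract(X[te])
        # 1-NN rule: each test sample takes the label of its nearest
        # training sample in feature space.
        d = ((Fte[:, None, :] - Ftr[None, :, :]) ** 2).sum(-1)
        pred = labels[tr][d.argmin(axis=1)]
        accs.append(float((pred == labels[te]).mean()))
    return float(np.mean(accs))
```

With an identity feature extractor on two well-separated synthetic classes, the protocol reports perfect accuracy, confirming the harness itself is sound.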

Discussions.
Experiments based on the three face databases have been systematically implemented, and the results reveal some interesting findings which are summarized as follows.
(1) The single kernel-based nonlinear feature extraction algorithms such as KPCA, KFDA, and DKFDA perform better than their corresponding linear versions such as PCA and LDA. The main reason is that, compared to linear techniques, the features extracted by kernel-based algorithms can better describe the complex and nonlinear variations of face images, such as illumination, pose, and facial expression. Hence, a better recognition rate can be achieved.
(2) The average recognition rates of PCA, LDA, and LPP on the PIE database are significantly lower than those on the other two databases. The main reason is that the images in the PIE database are captured under more complicated conditions, so they exhibit more complicated nonlinear characteristics than those from the other databases, which makes them difficult to handle with linear algorithms.
(3) The DKFDA algorithm outperforms the other single kernel-based algorithms, that is, KPCA, KLPP, and KFDA. As shown in Table 6, when KPCA with a polynomial kernel is adopted, the recognition ratio is not improved, because the characteristics of that kernel are ill-suited to some databases. The DK can solve this problem: the DKFDA algorithm adapts to different databases, since the structure of the DK can be changed by adjusting the kernel parameters with an iterative method, so that various input data can be better represented.
(4) The proposed MDKFDA algorithm consistently performs better than KPCA, KLPP, KFDA, and DKFDA, as well as their corresponding linear versions, which indicates that, compared to single kernel-based algorithms, MDKFDA can effectively integrate the multiple base kernels with the data-dependent kernel (DK) and achieve good performance on face recognition.

Conclusions
In this paper, on the premise that multiple kernel-based recognition algorithms can depict complex and heterogeneous face image datasets through the utilization of multiple descriptors, a novel kernel-based approach for face recognition, called multiple data-dependent kernel Fisher discriminant analysis (MDKFDA), has been proposed. Focusing on the construction of the MDK, two main issues have been addressed. The first concerns optimizing the weights of the multiple base kernels: by maximizing the maximum margin criterion (MMC), an iterative method based on Lagrange multipliers is adopted to yield the optimized weights. The second aims at optimizing the coefficients of the data-dependent kernel: by solving the optimization problem based on the Fisher scalar, a gradient-based learning algorithm is employed to yield the optimized coefficients. Finally, the resulting multiple kernel functions and the data-dependent kernel are integrated into a new kernel, which is incorporated into KFDA to construct the MDKFDA. Experiments on three face databases demonstrate the effectiveness of the MDKFDA, and the algorithm can readily be applied to other classification applications in the future.

Figure 1 :
Figure 1: Schematic of the proposed MDK.

Figure 2 :
Figure 2: Sample images of one person in the ORL database.

Figure 3 :
Figure 3: Sample images of one person in the YALE database.

Figure 4 :
Figure 4: Sample images of one person in the PIE database.

Face Recognition Using ORL Database.

The ORL face database contains 400 face images of 40 individuals, with variations in angle, lighting, expression, and facial details. As shown in Figure 2, all the original images were resized to 48 × 48 pixels, preserving the primary part of each image.

Table 1 :
Comparison of recognition ratio between linear algorithms and MDKFDA.

Table 2 :
Comparison of recognition ratio between single kernelbased algorithms and MDKFDA.

Table 3 :
Comparison of recognition ratio between linear algorithms and MDKFDA.

Table 4 :
Comparison of recognition ratio between single kernelbased algorithms and MDKFDA.

Table 5 :
Comparison of recognition ratio between linear algorithms and MDKFDA.

Table 6 :
Comparison of recognition ratio between single kernelbased algorithms and MDKFDA.