Facial Expression Recognition Based on Discriminant Neighborhood Preserving Nonnegative Tensor Factorization and ELM

A novel facial expression recognition algorithm based on discriminant neighborhood preserving nonnegative tensor factorization (DNPNTF) and extreme learning machine (ELM) is proposed. A discriminant constraint is adopted according to manifold learning and graph embedding theory. The constraint exploits the spatial neighborhood structure and the predefined discriminant properties. The parts-based representations obtained by our algorithm vary smoothly along the geodesics of the data manifold and have good discriminant properties. To guarantee convergence, the projected gradient method is used for optimization. The features extracted by DNPNTF are then fed into ELM, a training method for single hidden layer feed-forward networks (SLFNs). Experimental results on the JAFFE database and the Cohn-Kanade database demonstrate that the proposed algorithm extracts effective features and performs well in facial expression recognition.


Introduction
Facial expression recognition plays an important role in human-computer interaction; about 55% of the information in face-to-face human communication is conveyed by facial expression [1]. Although many methods have been proposed, recognizing facial expressions remains challenging because facial expressions are complex, variable, and subtle.
Recently, nonnegative matrix factorization (NMF) was introduced into facial expression recognition [7]. NMF decomposes the face samples into two nonnegative parts: the basis images and the corresponding weights. As many entries in the bases and weights degenerate close to zero, NMF yields parts-based sparse representations. For facial expression recognition, localized subtle features, such as the corners of the mouth, upward or downward eyebrows, and changes of the eyes, are critical for recognition performance. Since NMF yields parts-based representations, it outperforms the subspace-based models. To further improve NMF, several variants have been presented by introducing different constraints into the objective function. Li et al. put forward local NMF (LNMF), which adds a locality constraint to the basis images [8] to learn a localized, parts-based representation. Hoyer gave sparse NMF (SNMF) by incorporating a sparseness constraint into both the bases and the weights [9]. Cai et al. developed graph-regularized NMF (GNMF) by adding a graph preserving constraint to the weights [10]. Zafeiriou et al. used discriminant NMF (DNMF) for frontal face verification [11]. Wang et al. extended NMF to PNMF with a PCA constraint and to FNMF with a Fisher constraint [12].
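To make the factorization concrete, here is a minimal sketch of NMF with the classical Lee-Seung multiplicative updates. This is an illustrative sketch only, not the exact solver used by any of the cited variants; the rank, iteration count, and toy data are my own assumptions.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9):
    """Basic NMF via Lee-Seung multiplicative updates: X ~ W @ H,
    with W (basis images) and H (weights) kept nonnegative."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Update weights, then bases; denominators guarded against zero.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy usage: factor a small nonnegative "image" matrix.
X = np.abs(np.random.default_rng(1).random((20, 15)))
W, H = nmf(X, k=4)
err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Because both factors stay nonnegative and many entries are driven toward zero, the columns of `W` tend to become the sparse, parts-based basis images discussed above.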

Mathematical Problems in Engineering
For facial expression recognition, NMF and its variants vectorize the samples before factorization, which may lose the local geometric structures. However, the spatial neighborhood relationships among pixels are critical for image representation, understanding, and recognition [13]. Another drawback of NMF is that it cannot generate a unique decomposition. Welling and Weber developed a positive tensor factorization (PTF) algorithm, which handles images directly as 2D matrices [14]. Shashua and Hazan proposed nonnegative tensor factorization (NTF), which performs the factorization in the rank-one tensor space [15]. Factorization in the tensor space preserves the local structures and guarantees the uniqueness of the decomposition.
On the other hand, the choice of classifier plays an important role in recognition. For facial expression recognition, nearest neighbor (NN) and support vector machine (SVM) are the commonly used methods [16]. The sparse representation classifier (SRC) was adopted in [17]. Recently, the extreme learning machine (ELM), a training method for single hidden layer feed-forward networks (SLFNs), was proposed for classification [18]. Conventional methods need a long time to converge or may lose generalization due to overfitting, whereas ELM converges fast and provides good generalization performance. In ELM, the input weights and biases are randomly assigned, and the output weights are computed directly via the generalized inverse of the hidden layer output matrix. Therefore it converges extremely fast and obtains excellent generalization capability. Many variants of ELM have been proposed for different applications [19][20][21][22][23][24][25][26], including the kernel-based ELM [21] and the incremental ELM (I-ELM) [23], which lead to state-of-the-art results in different applications.
In this paper, we propose a novel facial expression recognition algorithm based on discriminant neighborhood preserving nonnegative tensor factorization (DNPNTF) and ELM, which works in the rank-one tensor space. The simple ELM is adopted to verify its effectiveness for facial expression recognition [18]. Our algorithm is composed of two stages: feature extraction and classification. First, to extract discriminant features, a neighborhood preserving constrained form of NTF is used. The constraint is derived according to manifold learning and graph embedding theory [27][28][29]. Since the columns of the weighting matrix have a one-to-one correspondence with the columns of the original sample matrix, the discriminant constraint is added to the weighting matrix. With the neighborhood preserving constraint, the obtained parts-based representations vary smoothly along the geodesics of the data manifold and are more discriminant. Second, the discriminant features extracted by DNPNTF are fed into the ELM classifier to perform recognition.
The rest of this paper is organized as follows. The mathematical notations are given in Section 2. In Section 3, we give a detailed analysis of DNPNTF and its optimization procedure. ELM is introduced in Section 4, and the experiments are given in Section 5. Finally, the conclusions are drawn in Section 6.
Definition 1 (inner product and tensor product [30]). The inner product of two tensors $\mathcal{A}, \mathcal{B} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is defined as
$$\langle \mathcal{A}, \mathcal{B} \rangle = \sum_{i_1=1}^{I_1} \cdots \sum_{i_N=1}^{I_N} a_{i_1 \cdots i_N}\, b_{i_1 \cdots i_N}.$$
The tensor product of two tensors $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and $\mathcal{B} \in \mathbb{R}^{J_1 \times \cdots \times J_M}$ is the $(N+M)$th-order tensor
$$(\mathcal{A} \otimes \mathcal{B})_{i_1 \cdots i_N j_1 \cdots j_M} = a_{i_1 \cdots i_N}\, b_{j_1 \cdots j_M}.$$

Definition 2 (rank-one tensor [30]). An $N$th-order tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ that can be represented as the tensor product of $N$ vectors,
$$\mathcal{A} = \mathbf{u}^{(1)} \otimes \mathbf{u}^{(2)} \otimes \cdots \otimes \mathbf{u}^{(N)},$$
is called a rank-one tensor, with $a_{i_1 \cdots i_N} = u^{(1)}_{i_1} u^{(2)}_{i_2} \cdots u^{(N)}_{i_N}$.

Definition 3 (mode product [30]). The mode-$n$ product of a tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and a matrix $\mathbf{M} \in \mathbb{R}^{J \times I_n}$ is the tensor $\mathcal{A} \times_n \mathbf{M}$ with entries
$$(\mathcal{A} \times_n \mathbf{M})_{i_1 \cdots i_{n-1}\, j\, i_{n+1} \cdots i_N} = \sum_{i_n=1}^{I_n} a_{i_1 \cdots i_N}\, m_{j i_n}.$$
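The three definitions above can be checked numerically with `numpy.einsum`; the vector sizes below are my own toy choices.

```python
import numpy as np

# A rank-one 3rd-order tensor is the tensor product of three vectors
# (Definition 2): A[i, j, k] = u[i] * v[j] * z[k].
u = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 2.0, 3.0, 4.0])
z = np.array([1.0, 2.0])
A = np.einsum('i,j,k->ijk', u, v, z)

# Inner product of two same-shaped tensors (Definition 1):
# elementwise multiplication, then summation over all indices.
B = np.ones_like(A)
inner = float(np.sum(A * B))

# Mode-1 product with a matrix M in R^{2x3} (Definition 3):
# contract the first index of A against the rows of M.
M = np.eye(2, 3)
A_mode1 = np.einsum('ijk,li->ljk', A, M)
```

With `M` being the first two rows of the identity, the mode-1 product simply selects the first two slices of `A` along its first mode, which makes the contraction easy to verify by hand.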

The DNPNTF Algorithm
In this section, we give a detailed description of the proposed DNPNTF algorithm. Instead of converting the samples into vectors, it processes them in the rank-one tensor space. The objective function of NTF is adopted, which learns parts-based representations with the sparseness property. To capture the spatial local geometric structure and the discriminant class-based information, a constraint derived from manifold learning and graph embedding analysis is added to the objective function. To guarantee convergence, the projected gradient method is used.

3.1. The Analysis of DNPNTF.
Given an $m_1 \times m_2$ image database containing $n$ sample images, $X = [\mathbf{x}_1, \ldots, \mathbf{x}_n] \in \mathbb{R}^{m \times n}$, the dimension of each sample is $m$ ($m = m_1 \times m_2$). In NTF, the database is organized as a 3rd-order tensor $\mathcal{A} \in \mathbb{R}^{m_1 \times m_2 \times n}$, which is approximated by a sum of $k$ rank-one tensors:
$$\mathcal{A} \approx \sum_{j=1}^{k} \mathbf{u}_j \otimes \mathbf{v}_j \otimes \mathbf{z}_j,$$
where $\mathbf{u}_j \in \mathbb{R}^{m_1}$, $\mathbf{v}_j \in \mathbb{R}^{m_2}$, and $\mathbf{z}_j \in \mathbb{R}^{n}$ ($1 \le j \le k$) describe the first, second, and third modes of $\mathcal{A}$, respectively. Each sample $\mathbf{x}_i$ is approximated by $\sum_{j=1}^{k} (\mathbf{z}_j)_i\, \mathbf{u}_j \mathbf{v}_j^{T}$.

To incorporate more properties into NTF, different constraints can be added to the objective function. The constrained form of the objective function is
$$\min_{\mathbf{u}_j, \mathbf{v}_j, \mathbf{z}_j \ge 0} \Big\| \mathcal{A} - \sum_{j=1}^{k} \mathbf{u}_j \otimes \mathbf{v}_j \otimes \mathbf{z}_j \Big\|^2 + \alpha J_1(\mathbf{u}, \mathbf{v}) + \beta J_2(\mathbf{z}),$$
where $J_1$ is the constraint function on $\{\mathbf{u}, \mathbf{v}\}$, $J_2$ is the constraint on $\mathbf{z}$, and $\alpha$ and $\beta$ are the corresponding positive coefficients. To encode the spatial structure and the discriminant class-based information into the sparse representations, we propose a constraint function $J$ according to manifold learning and graph embedding analysis. In NTF, the columns of the weighting matrix have a one-to-one correspondence with the columns of the original image matrix. Therefore, we add the discriminant constraint to $\{\mathbf{z}_j\}_{j=1}^{k}$, and $J$ for $\mathbf{z}$ is defined as a weighted combination of graph-embedding terms
$$J(\mathbf{z}) = \sum_{c} \lambda_c \sum_{p,q} \| z_p - z_q \|^2\, S^{c}_{pq},$$
where the $S^{c}$ denote graphs with different properties and the $\lambda_c$ are the corresponding coefficients. By choosing different $S^{c}$, the graph embedding model can encode different properties, such as neighborhood preservation and discrimination.

Now we discuss the selection of $S^{c}$. The most commonly used graph is the Laplacian graph, whose weights $S_{pq}$ are computed with the heat kernel:
$$S_{pq} = \begin{cases} \exp\!\big(-\| \mathbf{x}_p - \mathbf{x}_q \|^2 / t\big), & \mathbf{x}_p \in N_k(\mathbf{x}_q) \ \text{or}\ \mathbf{x}_q \in N_k(\mathbf{x}_p), \\ 0, & \text{otherwise}. \end{cases}$$
Here $S_{pq}$ measures the similarity between a pair of vertices and has the neighborhood preserving property. To further incorporate the class-based discriminant information, we derive a universal penalty graph [27], whose similarity matrix $S^{p}_{pq}$ is nonzero only for the marginal sample pairs between different classes. By solving a generalized eigenvalue decomposition problem, the graph embedding criterion in (11) can be written in the quadratic form $\tfrac{1}{2}\mathbf{z}^{T}(L - \lambda L^{p})\mathbf{z}$, where $L$ and $L^{p}$ are the Laplacian matrices of the intrinsic graph and the penalty graph. The final objective function of DNPNTF is
$$\min_{\mathbf{u}_j, \mathbf{v}_j, \mathbf{z}_j \ge 0} \Big\| \mathcal{A} - \sum_{j=1}^{k} \mathbf{u}_j \otimes \mathbf{v}_j \otimes \mathbf{z}_j \Big\|^2 + \gamma \sum_{j=1}^{k} \tfrac{1}{2}\, \mathbf{z}_j^{T} \big(L - \lambda L^{p}\big) \mathbf{z}_j,$$
where $\gamma > 0$ and $\lambda > 0$ are chosen so that (13) remains nonnegative.
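The heat-kernel neighborhood graph used in the constraint can be sketched as follows. This is a generic construction, assuming a k-nearest-neighbor neighborhood rule; the function name, parameters, and toy data are my own choices.

```python
import numpy as np

def heat_kernel_graph(X, k=3, t=1.0):
    """Similarity matrix S of a k-NN graph with heat-kernel weights
    S_pq = exp(-||x_p - x_q||^2 / t) for neighboring pairs.
    Columns of X are samples. Returns S and the Laplacian L = D - S."""
    n = X.shape[1]
    # pairwise squared Euclidean distances between sample columns
    d2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    S = np.zeros((n, n))
    for p in range(n):
        # k nearest neighbors of sample p (index 0 is p itself, skip it)
        nbrs = np.argsort(d2[p])[1:k + 1]
        S[p, nbrs] = np.exp(-d2[p, nbrs] / t)
    S = np.maximum(S, S.T)          # neighbor in either direction counts
    L = np.diag(S.sum(axis=1)) - S  # graph Laplacian D - S
    return S, L

# Toy usage on random sample columns.
rng = np.random.default_rng(0)
Xs = rng.random((5, 10))
S, L = heat_kernel_graph(Xs, k=3, t=1.0)
```

The symmetrization step implements the "either is a neighbor of the other" rule in the definition above, and the Laplacian `L` is the matrix that appears in the quadratic graph-embedding term.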
3.2. Projected Gradient Method of DNPNTF. The most popular approach to minimizing NMF or NTF is the multiplicative update method. However, it cannot ensure convergence for the constrained forms of NMF or NTF. In this paper, the projected gradient method is used to solve DNPNTF. The objective function of DNPNTF can be stated as
$$f_{\mathrm{obj}}(\mathcal{A} \,\|\, UVZ) = \Big\| \mathcal{A} - \sum_{j=1}^{k} \mathbf{u}_j \otimes \mathbf{v}_j \otimes \mathbf{z}_j \Big\|^2 + \gamma \sum_{j=1}^{k} \tfrac{1}{2}\, \mathbf{z}_j^{T} Q \mathbf{z}_j,$$
where $\gamma$ is a positive constant and $Q = L - \lambda L^{p}$. The goal of (14) is to find $\{U, V\}$ and $Z$ by solving
$$\min_{U, V, Z \ge 0} f_{\mathrm{obj}}(\mathcal{A} \,\|\, UVZ).$$
To find the optimal solution, (15) is divided into three subproblems: first, fix $V$ and $Z$ and update $U$ to arrive at the conditional optimum of the subminimization problem; second, fix $U$ and $Z$ and update $V$; last, fix $U$ and $V$ and update $Z$. Three functions are defined: $f_{V,Z}(U) = f_{\mathrm{obj}}(\mathcal{A}\|UVZ)$, $f_{U,Z}(V) = f_{\mathrm{obj}}(\mathcal{A}\|UVZ)$, and $f_{U,V}(Z) = f_{\mathrm{obj}}(\mathcal{A}\|UVZ)$. The update rules are
$$U^{(t+1)} = U^{(t)} - \eta_U \nabla f_{V,Z}(U^{(t)}), \quad V^{(t+1)} = V^{(t)} - \eta_V \nabla f_{U,Z}(V^{(t)}), \quad Z^{(t+1)} = Z^{(t)} - \eta_Z \nabla f_{U,V}(Z^{(t)}),$$
with step sizes chosen to keep the factors nonnegative. Now the task is to calculate $\nabla f_{V,Z}(U^{(t)})$, $\nabla f_{U,Z}(V^{(t)})$, and $\nabla f_{U,V}(Z^{(t)})$.
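The core projected-gradient operation is a gradient step followed by projection onto the nonnegative orthant. A minimal sketch, with the quadratic toy objective and step size being my own assumptions:

```python
import numpy as np

def projected_gradient_step(U, grad, step):
    """One projected-gradient update: take a gradient step, then project
    back onto the nonnegative orthant so the factor stays feasible."""
    return np.maximum(U - step * grad, 0.0)

# Toy check: minimizing ||U - T||^2 subject to U >= 0 converges to the
# nonnegative projection max(T, 0) of the unconstrained optimum T.
T = np.array([[1.0, -2.0], [0.5, 3.0]])
U = np.zeros_like(T)
for _ in range(100):
    U = projected_gradient_step(U, 2.0 * (U - T), step=0.1)
```

Each of the three subproblems above applies exactly this pattern to one factor while the other two are held fixed.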

3.2.1. The Calculation of U and V. First, we discuss the calculation of $\nabla f_{V,Z}(U^{(t)})$ and $\nabla f_{U,Z}(V^{(t)})$. Since the constraint term does not involve $U$ or $V$, the relevant part of the objective function is
$$f(U) = \Big\| \mathcal{A} - \sum_{j=1}^{k} \mathbf{u}_j \otimes \mathbf{v}_j \otimes \mathbf{z}_j \Big\|^2.$$
Let $\mathbf{e}_l \in \mathbb{R}^{m_1}$ be the indicator vector whose $l$th element is 1 and all others are 0; that is, $(\mathbf{e}_l)_l = 1$ and $(\mathbf{e}_l)_{i \ne l} = 0$. Using the inner-product identities of Definition 1, the partial derivative with respect to the $l$th element $u_{jl}$ of $\mathbf{u}_j$ is
$$\frac{\partial f}{\partial u_{jl}} = -2\, \Big\langle \mathcal{A} - \sum_{r=1}^{k} \mathbf{u}_r \otimes \mathbf{v}_r \otimes \mathbf{z}_r,\; \mathbf{e}_l \otimes \mathbf{v}_j \otimes \mathbf{z}_j \Big\rangle.$$
According to (16), the update rule for $u_{jl}$ is the gradient step $u_{jl} \leftarrow u_{jl} - \eta(u_{jl})\, \partial f / \partial u_{jl}$. To guarantee the nonnegativity of $u_{jl}$, the step size $\eta(u_{jl})$ is chosen so that the update takes the multiplicative form
$$u_{jl} \leftarrow u_{jl}\, \frac{\big\langle \mathcal{A},\; \mathbf{e}_l \otimes \mathbf{v}_j \otimes \mathbf{z}_j \big\rangle}{\big\langle \sum_{r=1}^{k} \mathbf{u}_r \otimes \mathbf{v}_r \otimes \mathbf{z}_r,\; \mathbf{e}_l \otimes \mathbf{v}_j \otimes \mathbf{z}_j \big\rangle},$$
where the numerator and denominator contract the tensors over the second and third modes while the first mode is fixed at $l$; $\odot$ denotes the matrix Hadamard product. The update rule for the $l$th element of $\mathbf{v}_j$ is obtained in the same way, with $\mathbf{e}_l \in \mathbb{R}^{m_2}$ and the second mode fixed while the other two modes are traversed. Now $\mathbf{u}_j$ and $\mathbf{v}_j$ are calculated.
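Collecting the element-wise updates over all $l$ and $j$ gives the standard matrix-form multiplicative update for the first-mode factor. The sketch below shows this standard NTF step in NumPy (DNPNTF adds its graph constraint only to the $Z$ update); the array sizes and toy data are my own assumptions.

```python
import numpy as np

def update_U(A, U, V, Z, eps=1e-9):
    """Multiplicative update for the first-mode factor U of the model
    A ~ sum_j u_j (x) v_j (x) z_j (standard unconstrained NTF step)."""
    # numerator: contract A with v_j and z_j over modes 2 and 3
    num = np.einsum('ipq,pj,qj->ij', A, V, Z)
    # denominator: U times the Hadamard product of the Gram matrices
    den = U @ ((V.T @ V) * (Z.T @ Z)) + eps
    return U * num / den

# Toy check on an exactly rank-2 nonnegative tensor (assumed sizes).
rng = np.random.default_rng(0)
U_true, V, Z = rng.random((4, 2)), rng.random((5, 2)), rng.random((6, 2))
A = np.einsum('ij,pj,qj->ipq', U_true, V, Z)

def recon_err(U):
    return np.linalg.norm(A - np.einsum('ij,pj,qj->ipq', U, V, Z))

U = rng.random((4, 2))
err_before = recon_err(U)
for _ in range(50):
    U = update_U(A, U, V, Z)
err_after = recon_err(U)
```

Because both the numerator and denominator are nonnegative, the update preserves nonnegativity without an explicit projection, which is the point of the step-size choice described above.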

3.2.2. The Calculation of Z. Next, we discuss the calculation of $\nabla f_{U,V}(Z^{(t)})$. The differential of $f$ along $\mathbf{z}_j$ ($\forall j,\ 1 \le j \le k$) now includes the graph-embedding term. For the constraint term $\tfrac{1}{2}\mathbf{z}_j^{T} Q \mathbf{z}_j$ with $Q = L - \lambda L^{p}$, the partial derivative with respect to $z_{jl}$ is
$$\frac{\partial}{\partial z_{jl}}\, \tfrac{1}{2}\mathbf{z}_j^{T} Q \mathbf{z}_j = \tfrac{1}{2}\big(\mathbf{e}_l^{T} Q \mathbf{z}_j + \mathbf{z}_j^{T} Q \mathbf{e}_l\big) = (Q \mathbf{z}_j)_l,$$
where $\mathbf{e}_l \in \mathbb{R}^{n}$ is the indicator vector whose $l$th element is 1 and all others are 0 (that is, $(\mathbf{e}_l)_l = 1$ and $(\mathbf{e}_l)_{i \ne l} = 0$), and the last equality uses the symmetry of $Q$. The full partial derivative for $z_{jl}$ is therefore
$$\frac{\partial f}{\partial z_{jl}} = -2\, \Big\langle \mathcal{A} - \sum_{r=1}^{k} \mathbf{u}_r \otimes \mathbf{v}_r \otimes \mathbf{z}_r,\; \mathbf{u}_j \otimes \mathbf{v}_j \otimes \mathbf{e}_l \Big\rangle + \gamma\, (Q \mathbf{z}_j)_l.$$
According to (16), the update rule for $z_{jl}$ is a projected gradient step along this direction.

The step size $\eta(z_{jl})$ is again chosen so that $z_{jl}$ remains nonnegative, which yields the final update rule for $z_{jl}$. Now $\mathbf{u}_j$, $\mathbf{v}_j$, and $\mathbf{z}_j$ in the objective function are all calculated.
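The constrained $Z$ step can be sketched as a projected-gradient update. This is a hypothetical simplification using a fixed step size rather than the adaptive per-element step derived above; the matrix `Q`, the parameter values, and the toy data are my own assumptions.

```python
import numpy as np

def update_Z(A, U, V, Z, Q, gamma=0.1, step=1e-2):
    """One projected-gradient step on the weight matrix Z of
    A ~ sum_j u_j (x) v_j (x) z_j  plus  (gamma/2) tr(Z^T Q Z),
    where Q is the symmetric graph-embedding matrix."""
    num = np.einsum('pqi,pj,qj->ij', A, U, V)  # mode-3 contraction of A
    grad = -2.0 * num + 2.0 * Z @ ((U.T @ U) * (V.T @ V)) + gamma * (Q @ Z)
    return np.maximum(Z - step * grad, 0.0)   # project onto Z >= 0

# Toy check: one step should decrease the penalized objective.
rng = np.random.default_rng(1)
U, V, Z0 = rng.random((4, 2)), rng.random((5, 2)), rng.random((6, 2))
A = np.einsum('ij,pj,qj->ipq', U, V, Z0)
Q = np.eye(6)   # simplest symmetric PSD stand-in for L - lambda*L^p

def obj(Z):
    recon = np.einsum('ij,pj,qj->ipq', U, V, Z)
    return np.linalg.norm(A - recon) ** 2 + 0.05 * np.trace(Z.T @ Q @ Z)

Z = rng.random((6, 2))
f_before = obj(Z)
Z = update_Z(A, U, V, Z, Q, gamma=0.1, step=1e-2)
f_after = obj(Z)
```

The extra `gamma * (Q @ Z)` term is exactly the gradient of the graph-embedding penalty derived above; everything else matches the unconstrained NTF gradient.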

Extreme Learning Machine
ELM was proposed by Huang et al. [18] for SLFNs. Unlike traditional feedforward neural network training methods, such as the gradient-descent method, the standard optimization method, and the least-squares based method, ELM does not need to tune the hidden layer of SLFNs, a process that can make learning complicated and inefficient. It can reach the smallest training error and has better generalization performance. The learning speed of ELM is fast, and its parameters do not have to be tuned manually. In our proposed algorithm, the features extracted by DNPNTF are fed into ELM for classification.
Given a training set $X = \{(\mathbf{x}_i, \mathbf{t}_i) \mid \mathbf{x}_i \in \mathbb{R}^{d},\ \mathbf{t}_i \in \mathbb{R}^{c},\ i = 1, 2, \ldots, N\}$, where $\mathbf{x}_i$ is the $d \times 1$ input feature vector and $\mathbf{t}_i$ is a $c \times 1$ target vector, an ELM with $L$ hidden nodes and activation function $g(\cdot)$ is modeled as
$$\sum_{i=1}^{L} \boldsymbol{\beta}_i\, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) = \mathbf{o}_j, \quad j = 1, 2, \ldots, N,$$
where $\mathbf{w}_i = (w_{i1}, w_{i2}, \ldots, w_{id})^{T}$ is the input weight vector connecting the $i$th hidden neuron to the input layer, $b_i$ is its bias, $\boldsymbol{\beta}_i = (\beta_{i1}, \beta_{i2}, \ldots, \beta_{ic})^{T}$ is the weight vector between the $i$th hidden neuron and the output layer, and $\mathbf{o}_j$ is the output for the $j$th input. In the training step, ELM aims to approximate the $N$ training samples with zero error, which means $\sum_{j=1}^{N} \|\mathbf{o}_j - \mathbf{t}_j\| = 0$. Then there exist $\boldsymbol{\beta}_i$, $\mathbf{w}_i$, and $b_i$ satisfying
$$\sum_{i=1}^{L} \boldsymbol{\beta}_i\, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) = \mathbf{t}_j, \quad j = 1, 2, \ldots, N.$$
Equation (35) can be reformulated compactly as
$$H \boldsymbol{\beta} = T,$$
where $H$ is the hidden layer output matrix of the neural network, whose $i$th column is the output of the $i$th hidden neuron with respect to the inputs $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$. Huang et al. [18] proved that the input weights and biases need not be adjusted and can be assigned arbitrarily. Therefore, the output weights can be determined by finding the least-squares solution
$$\hat{\boldsymbol{\beta}} = H^{\dagger} T,$$
where $H^{\dagger}$ is the Moore-Penrose generalized inverse of the matrix $H$.
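The whole training procedure fits in a few lines, since only the output weights are learned. A minimal sketch with a sigmoid activation; the function names and the toy two-class data are my own assumptions.

```python
import numpy as np

def elm_train(X, T, L, seed=0):
    """Train a basic ELM: random input weights and biases (never tuned),
    sigmoid hidden layer, output weights by the Moore-Penrose
    pseudo-inverse, i.e. beta = pinv(H) @ T."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], L))   # random input weights
    b = rng.standard_normal(L)                 # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # hidden layer output matrix
    beta = np.linalg.pinv(H) @ T               # least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.argmax(H @ beta, axis=1)

# Hypothetical toy data: two well-separated classes, one-hot targets.
rng = np.random.default_rng(42)
X = np.vstack([rng.standard_normal((50, 3)) + 5.0,
               rng.standard_normal((50, 3)) - 5.0])
y = np.array([0] * 50 + [1] * 50)
W, b, beta = elm_train(X, np.eye(2)[y], L=20)
acc = float(np.mean(elm_predict(X, W, b, beta) == y))
```

There is no iterative weight tuning at all: the only "training" is one pseudo-inverse solve, which is why ELM is so much faster than gradient-based SLFN training.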
As analyzed by Huang et al., ELM obtains good generalization performance with a dramatically increased learning speed by solving (39).

Experiments
In this section, we apply DNPNTF via ELM to facial expression recognition. We compare DNPNTF with NMF [7], DNMF [11], and NTF [15] and report experimental results using ELM, NN, SVM [16], and SRC [17]. Two facial expression databases are used: the JAFFE database [31] and the Cohn-Kanade database [32]. The raw facial images are cropped according to the positions of the eyes and normalized to 32 × 32 pixels. Figure 1 shows an example of an original face image and the corresponding cropped image. According to the rank-one tensor theory, the gray-level images are encoded in the tensor space.
Since the results of ELM may vary between executions, we repeat each execution 5 times and take the average value as the final result. Theoretical analysis and experiments have shown that the classification performance of ELM is affected by the hidden activation function and the number of hidden nodes [23]. However, in this paper we focus on the application of ELM to facial expression recognition. The activation function used in our algorithm is a simple sigmoidal function, and the number of hidden nodes is set to the number of facial expression classes (7 for the JAFFE database and 6 for the Cohn-Kanade database). For SVM, the radial basis function (RBF) kernel $\exp(-\gamma\|\mathbf{x}_i - \mathbf{x}_j\|^2)$ is used, with $\gamma$ set to 3 as an empirical value. For SRC, the Homotopy algorithm is used to solve the $\ell_1$-norm constrained minimization.

Experiments on the JAFFE Database. The JAFFE database [31] is an expression database containing 213 static facial images captured from 10 Japanese females. Each person poses 2 to 4 examples for each of the 6 prototypic expressions (anger, disgust, fear, happiness, sadness, and surprise) plus the neutral face. To evaluate the algorithms, we randomly partition all images into 10 groups, with roughly 70 samples in each group. We take any 9 groups for training and compute the recognition rate on the remaining one, repeating this for all 10 possible choices. Finally, the average result over the 10 tests is taken.
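The leave-one-group-out protocol described above can be sketched generically. The partitioning routine and the stand-in nearest-class-mean scorer below are my own illustrative choices, not the paper's classifiers.

```python
import numpy as np

def leave_one_group_out(features, labels, n_groups, train_eval, seed=0):
    """Randomly partition the samples into n_groups, train on all-but-one
    group, test on the held-out group, and average the accuracy over all
    possible choices of held-out group."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    groups = np.array_split(idx, n_groups)
    accs = []
    for g in range(n_groups):
        test_idx = groups[g]
        train_idx = np.concatenate([groups[h] for h in range(n_groups) if h != g])
        accs.append(train_eval(features[train_idx], labels[train_idx],
                               features[test_idx], labels[test_idx]))
    return float(np.mean(accs))

# Hypothetical stand-in classifier: nearest class mean.
def nearest_mean(tr_x, tr_y, te_x, te_y):
    means = np.stack([tr_x[tr_y == c].mean(axis=0) for c in np.unique(tr_y)])
    pred = np.argmin(((te_x[:, None, :] - means[None]) ** 2).sum(-1), axis=1)
    return float(np.mean(pred == te_y))

# Toy usage: two well-separated classes, 10 groups as in the protocol.
rng = np.random.default_rng(7)
feats = np.vstack([rng.standard_normal((50, 2)) + 4.0,
                   rng.standard_normal((50, 2)) - 4.0])
labs = np.array([0] * 50 + [1] * 50)
acc = leave_one_group_out(feats, labs, 10, nearest_mean)
```

Any train-then-score routine with the same signature (for example, an ELM wrapper) can be dropped in as `train_eval`.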
The average recognition rates of the different feature extraction algorithms are shown in Figure 2, where the vertical axis represents the correct recognition rate in percent and the horizontal axis represents the corresponding dimension (from 1 to 120). Here, only the NN classifier is used. In the lower range of dimensions, the recognition rates of DNPNTF are similar to those of the other algorithms. This is because DNPNTF extracts parts-based sparse representations, and only a few features can be generated for recognition at low dimensions. In the higher range of dimensions, DNPNTF outperforms the others: as the number of extracted parts-based features increases, DNPNTF achieves good recognition performance. Since different constraints were added, the improved methods, including DNMF and NTF, outperform the conventional NMF.
The top recognition rates of the different algorithms, with their corresponding dimensions, are listed in Table 1. NMF achieves its highest rate at a low dimension, while DNMF achieves its highest rate at a high dimension. Although more dimensions are needed, DNPNTF achieves the highest recognition rate of all compared methods. This is because the constraints encoding the manifold structure and the discriminant information, which are critical for classification, are taken into account.
Figure 3 shows the basis images obtained on the JAFFE database by NMF, NTF, and DNPNTF. Based on the principle of NMF, the face images are represented by additively combining multiple basis images, and the basis images are expected to represent facial parts. On this database, the basis images calculated by NMF are not sparse. NTF and DNPNTF, which operate in the tensor space, generate parts-based sparse representations. Since more constraints are adopted, DNPNTF generates sparser basis images, which reflect distinct features for recognition. Next, we present experiments verifying the effectiveness of DNPNTF via ELM. The average recognition rates of DNPNTF with ELM, NN, SVM, and SRC are given in Figure 4, where the vertical axis represents the correct recognition rate in percent and the horizontal axis represents the corresponding dimension (from 1 to 120). ELM and SRC achieve better recognition performance than NN and SVM, and ELM achieves the highest recognition rate. The top recognition rates with the corresponding dimensions are given in Table 2.

Experiments on Cohn-Kanade Database.
The Cohn-Kanade database [32] consists of a large number of image sequences starting from the neutral face and ending at the peak of the corresponding expression. 104 subjects of different ages, genders, and races were instructed to pose a series of 23 facial displays, including the 6 prototypic expressions. In our experiments, for every image sequence, we take 2 to 8 continuous frames near the peak expression as static samples. We use the face images of all subjects. We partition the subjects into 3 exclusive groups; in each group, for each of the prototypic expressions, we select 100 samples, so there are 600 samples in each group and the size of the total set is 1800. During the experiments, we adopt a leave-one-group-out strategy with 3-fold cross-validation: each time, two groups are taken as the training set and the remaining group is left for testing. This procedure is repeated 3 times. The average recognition rates of the different algorithms on the Cohn-Kanade database are shown in Figure 5, where the vertical axis represents the correct recognition rate in percent and the horizontal axis represents the corresponding dimension (from 1 to 120). Here, only the NN classifier is used. Table 3 shows the top recognition rates with the corresponding dimensions. The recognition rates obtained on the Cohn-Kanade database are lower than those obtained on the JAFFE database. This can be explained by the fact that the experiments on the Cohn-Kanade database are person-independent, which is more difficult than the person-dependent experiments on the JAFFE database. From Figure 5, we can see that the performance of DNPNTF is superior to the others at nearly all dimensions, and its recognition rate improves as the dimension increases. Lastly, we examine different classifiers on the Cohn-Kanade database. The average recognition rates of DNPNTF via ELM, NN, SVM, and SRC are shown in Figure 6, and the top recognition rates are given in Table 4.
SVM and SRC achieve better performance than NN. ELM achieves the best recognition accuracy among all tested classifiers at almost all dimensions, which indicates that ELM exploits the information contained in the extracted features better than the other classifiers.

Conclusions
In this paper, a novel DNPNTF algorithm was proposed and applied to facial expression recognition, with ELM adopted as the classifier. To incorporate the spatial information and the discriminant class information, a discriminant constraint is added to the objective function according to manifold learning and graph embedding theory. To guarantee convergence, the projected gradient method is used for optimization. Theoretical analysis and experimental results demonstrate that DNPNTF achieves better performance than NMF, its variants, and NTF. The discriminant features generated by DNPNTF are fed into ELM to learn an optimal model for recognition. In our experiments, DNPNTF via ELM achieves a higher recognition rate than NN, SVM, and SRC.

By minimizing (5), the bases $\{\mathbf{u}_j \otimes \mathbf{v}_j\}_{j=1}^{k}$ and the corresponding weights $\{\mathbf{z}_j\}_{j=1}^{k}$ are obtained. The inner product of $\{\mathbf{u}_j \mathbf{v}_j^{T}\}_{j=1}^{k}$ with a sample image is calculated to derive the low-dimensional parts-based representation.

Figure 1: Illustration of the preprocessing of the original images.

Figure 2: The recognition rate versus dimension curves achieved on the JAFFE database.

Figure 3: Basis images obtained on the JAFFE database.

Figure 4: The recognition rate versus dimension curves of DNPNTF by using different classifiers on the JAFFE database.

Figure 5: The recognition rate versus dimension curves achieved on the Cohn-Kanade database.

Figure 6: The recognition rate versus dimension curves of DNPNTF by using different classifiers on the Cohn-Kanade database.
$P_k(c_i)$ represents the nearest $k$ pairs of samples between class $c_i$ and the other classes; the penalty graph weight $S^{p}_{pq}$ is 1 for such marginal pairs and 0 otherwise, so the term $\sum_{p,q} \|z_p - z_q\|^2 S^{p}_{pq}$ penalizes marginal samples between different classes in order to separate them. The objective function of the constrained NTF then becomes
$$\min_{\mathbf{u}_j, \mathbf{v}_j, \mathbf{z}_j} \Big\| \mathcal{A} - \sum_{j=1}^{k} \mathbf{u}_j \otimes \mathbf{v}_j \otimes \mathbf{z}_j \Big\|^2 + \gamma J(\mathbf{z}), \quad \mathbf{u}_j, \mathbf{v}_j, \mathbf{z}_j \ge 0,\ \forall j.$$

The top recognition rate and the corresponding dimension achieved on the JAFFE database.

The top recognition rate of different classifiers achieved on the JAFFE database.

The top recognition rate and the corresponding dimensions achieved on the Cohn-Kanade database.

Table 4: The top recognition rate of different classifiers achieved on the Cohn-Kanade database.