A Facial Expression Recognition Method Using Improved Capsule Network Model

To address facial expression recognition under unconstrained conditions, a facial expression recognition method based on an improved capsule network model is proposed. First, the expression image is illumination-normalized using an improved Weber face, and the facial key points are detected by a Gaussian process regression tree. Then, the 3D Morphable Model (3DMM) is introduced: a 3D face shape consistent with the face in the image is obtained by iterative estimation, which further improves the image quality after face pose standardization. We consider that the convolutional features used in facial expression recognition need to be trained from scratch, with as many different samples as possible added during training. Finally, this paper combines traditional deep learning techniques with the capsule configuration, adds an attention layer after the primary capsule layer of the capsule network, and proposes an improved capsule structure model suited to expression recognition. Experimental results on the JAFFE and BU-3DFE datasets show that the recognition rate reaches 96.66% and 80.64%, respectively.


Introduction
Human facial expression is a kind of representation language that is revealed, naturally or deliberately, under the complex stimulation of environment, context, and mood during communication, and that can be perceived by the visual system [1][2][3]. It is produced when the facial muscles move in response to a semantic stimulus (a stress response) or are moved actively under conscious control. It is generally believed that human facial expressions are both consciously controllable and stress-convergent. Conscious controllability means that human beings (especially actors) can make or suppress any expression at will [4,5]. Stress convergence means that most people make similar expressions when stimulated by a specific semantic environment. For example, when people hear about interesting events or see beautiful things, they naturally show happiness; when they face spoiled, smelly food or ugly scenes, they generally show disgust; when they face sudden emergencies, they usually show surprise, and so on [6].
When a person's stress expression is contrary to what the situation would normally produce, it usually reflects the inhibition of the expression by some psychological factor. Because human facial expression is rich in psychological and emotional information, is stress-convergent, and is controlled by consciousness, it has attracted wide attention from scholars in psychology and pattern recognition [7,8]. With existing technology, researchers began to use pattern recognition to establish the mapping between a face image and the facial expression it shows. The research field of expression recognition arose from this automatic judgment of human facial expression by computer. In a broad sense, facial expression recognition is a process of automatic analysis of face image data by computer, and automatic image analysis by computer is the main content of computer vision [9]. As a discipline that grew out of traditional pattern recognition, computer vision studies the difficulties that traditional pattern recognition methods face with image data and refines image data to facilitate its indexing, classification, and automatic analysis [10].
This paper takes the view that research on facial expression recognition from a single face image has greater practical significance than expression recognition from image sequences. The reason is that, compared with sequence-based recognition, single-image recognition exposes the defects of existing image processing and recognition technology in specific applications, which helps to improve the applicability of existing image processing and discrimination technology. Therefore, this paper focuses on expression recognition from a single face image under unconstrained conditions. Existing illumination and pose standardization methods tend to lose texture details and are therefore not well suited to facial expression recognition. This paper considers that, for unconstrained expression recognition, the degree to which expression details are preserved must be further improved so that the subsequent discriminant model can work effectively. To this end, a facial expression recognition method using an improved capsule network model is proposed. The main contributions of this paper are as follows.
We consider that the convolutional features used in facial expression recognition need to be trained from scratch, with as many different samples as possible added during training. The proposed method focuses on temporal attention. The attention module uses the sigmoid as its activation function, which can not only select important features but also suppress irrelevant information. It can also help smooth the mismatch between the training set and the test set and improve the final recognition rate.

Related Works
Scholars have proposed many methods for facial expression recognition under unconstrained conditions. For example, the study in [11] proposed a hybrid expression recognition method using the High-order Joint Derivative Local Binary Pattern (HJDLBP) and the Local Binary Pattern (LBP); model efficiency is improved by removing unwanted areas and preserving the facial area. The study in [12] proposed a facial expression recognition framework combining two-dimensional Gabor filters and local binary patterns, which improved model efficiency by extracting salient features of facial expression. The study in [13] proposed an adaptive model parameter initialization method based on the linear activation function of a multilayer maxout network, which improved model performance by extracting highly relevant features of the image sequence. The study in [14] proposed an expression recognition method based on the Wasserstein generative adversarial network, which improved model efficiency by suppressing slight changes of the face. The study in [15] proposed a Deep Cascaded Peak-piloted Network, which extracts key and subtle details in the image through peak-conducting feature transformation to improve model accuracy. However, these methods do not consider the edge characteristics of the image.
The study in [16] proposed a facial expression recognition method combining multiple facial features and support vector machines, which improves model accuracy by extracting important facial features and reducing image noise points. The study in [17] proposed a deep convolutional BiLSTM fusion network for facial expression recognition, which extracts spatial features from each frame with a convolutional neural network and then models the temporal dynamics; feature fusion improves the recognition rate. The study in [18] proposed a facial expression recognition method based on facial video sequences, which improved model efficiency by extracting features represented by the temporal local binary pattern. The study in [19] proposed a conditional convolutional neural network enhanced random forest expression recognition method, which reduces the noise points of the dataset and improves model accuracy. However, when the training data is small, these methods are prone to underfitting.
The study in [20] proposed facial expression recognition based on incremental active learning, which improves model accuracy by reducing the noise points of the image. The study in [21] proposed a multifeature fusion facial expression recognition method based on the Extreme Learning Machine (ELM), which improved model accuracy by fusing multiple features. The study in [22] proposed a facial expression recognition method based on feature space and principal component analysis, which encodes the known image through the feature space to improve model accuracy. The study in [23] proposed a facial expression recognition method based on the Two-Stream Convolutional Neural Network (T-SCNN), which improved model accuracy by fusing RGB images and temporal features. However, when the amount of data is large, these methods are prone to overfitting.
The study in [24] proposed a multilayer perceptron algorithm for facial expression recognition, which increased model accuracy by adding hidden neurons; however, the parameters of the model were difficult to adjust. The study in [25] proposed a facial expression recognition method based on the hidden Markov model, which improved model efficiency by extracting the more important features of the image; however, under unconstrained conditions, the model is less robust.
Based on the above analysis, deep learning has good modeling and processing ability for facial expression images, but the model can recognize expressions effectively only when the face illumination and pose are constrained. To address facial expression recognition under unconstrained conditions, a facial expression recognition method based on an improved capsule network model is proposed. The improved capsule model can effectively classify facial expressions under unconstrained conditions, which makes up for the deficiency of a pure deep convolutional network in acquiring sparse features hidden in discriminative texture and improves the generalization ability of existing expression classification models to illumination and pose differences.

Overall Architecture of the Proposed Method
Through analysis of existing deep convolutional neural networks applied to expression recognition, this paper considers that illumination and posture correction have very important application value in alleviating the dependence of deep convolutional neural networks on the number of samples and in improving the quality of the perception weights. Using illumination processing, the lighting conditions of a face can be changed to generate expression samples under different illumination; using posture correction, the face pose can be varied to generate further expression samples. According to the idea of local dense sampling, this may bring the trained model closer to the real model. The framework of facial expression recognition based on deep learning is shown in Figure 1. In this paper, illumination and projection analysis are used to analyze, correct, and perceive the face pattern. Under existing technical conditions, this kind of preprocessing is still a necessary step. In the specific implementation, a batch of dense sample images is generated and fed into the deep convolutional model for weight training to alleviate the problem of an insufficient number of samples. In the recognition stage, illumination and projection analysis are again used to analyze, correct, and perceive the face pattern. We add an attention layer after the primary capsule layer of the capsule network and propose a capsule structure suitable for expression feature extraction.

Illumination Normalization.
A new illumination normalization method based on the Weber face (WF) [26] is proposed, which not only extracts illumination-insensitive features effectively but also suppresses boundary marks at abrupt changes of illumination. Assume that the illumination component is $I(x, y)$. In WF, all ratios are multiplied by a combination coefficient $\alpha \in (0, 1)$, as follows: At this point, the noise is enhanced along with the effective information, and the coding value of the area most affected by the light is: Therefore, to reduce the noise, values in the interval $(-K_0, -K_1) \cup (K_0, K_1)$ are multiplied by a suppression factor. The revised WF definition is as follows: where $\beta \in (0, 1)$ is a coefficient that suppresses the influence of illumination, $A = \{-1, 0, 1\}$, and $\alpha$ adjusts (increases or decreases) the difference between the WF encoding values of adjacent pixels. The interval $B$ is the interval that is strongly affected by illumination and therefore contains more noise. It can be known from equation (2): According to Weber's law, the minimum perceivable ratio is constant. Therefore, a subinterval of $[-\pi/2, \pi/2]$ is called the low perceptual interval. In this subinterval, since the ratio is smaller than the minimum perceptible ratio, its changes are not perceivable by the human eye. That is, even if the pixels in this interval are affected by illumination, the change is small and can be ignored. $k_0$ can be defined as follows: The low-frequency component is regarded as a large-scale feature, which is the part mainly affected by illumination. The high-frequency component is regarded as a small-scale feature, which is the illumination-invariant feature. Values close to $\pm\pi/2$ change quickly and can be regarded as high-frequency components, and values close to 0 change slowly and can be regarded as low-frequency components. The interval $[-\pi/4, \pi/4]$ is regarded as the low-frequency interval.
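The idea can be illustrated with a minimal NumPy sketch. It implements the standard Weber face (the arctangent of the scaled sum of relative differences between a pixel and its 3 × 3 neighbours, with offsets taken from $A = \{-1, 0, 1\}$), followed by one possible reading of the suppression step. The band limits `k_lo` and `k_hi`, the default values, and the function name are assumptions made for illustration; they are not taken from the paper, and the sketch is not the authors' exact revised definition.

```python
import numpy as np

def weber_face(image, alpha=0.4, beta=0.2, k_lo=0.1, k_hi=np.pi / 4, eps=1e-6):
    """Standard Weber face with a hypothetical band-suppression step."""
    f = np.asarray(image, dtype=np.float64) + eps   # avoid division by zero
    h, w = f.shape
    padded = np.pad(f, 1, mode="edge")
    ratio_sum = np.zeros_like(f)
    # Sum of relative differences to the neighbours at offsets in A x A.
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            neighbour = padded[1 + di:1 + di + h, 1 + dj:1 + dj + w]
            ratio_sum += (f - neighbour) / f
    wf = np.arctan(alpha * ratio_sum)               # values lie in (-pi/2, pi/2)
    # Assumed suppression: damp codes whose magnitude falls in (k_lo, k_hi).
    band = (np.abs(wf) > k_lo) & (np.abs(wf) < k_hi)
    wf[band] *= beta
    return wf
```

Because every term is a ratio of intensities, multiplying the whole image by a constant leaves the output essentially unchanged, which is the illumination-insensitivity property the text relies on.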

Key Point Detection.
For key point detection [27], a model based on the Gaussian process regression tree is proposed. A special kernel function, the random partition kernel function, is designed. Given a random partition $P$, the kernel function is defined as follows: First, $k_P(a, b)$ is shown to be a valid kernel function. Define To prove that $k_P$ is positive semidefinite, the expectation is decomposed into the limit of a summation, and each single term is proved to be positive semidefinite.
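The positive semidefiniteness claim can be checked numerically with a small sketch. Here a "random partition" is built from a few random hyperplane splits, and the kernel value for two samples is the fraction of partitions in which they land in the same cell. The way the partitions are generated and all names are illustrative assumptions; the paper's partitions come from the trees of its regression model.

```python
import numpy as np

def random_partition_kernel(x, num_partitions=50, num_splits=3, seed=0):
    """Empirical random partition kernel: share of partitions with equal cell."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    k = np.zeros((n, n))
    weights = 1 << np.arange(num_splits)             # encode split bits as a cell id
    for _ in range(num_partitions):
        directions = rng.normal(size=(d, num_splits))
        proj = x @ directions                        # (n, num_splits)
        thresholds = rng.uniform(proj.min(axis=0), proj.max(axis=0))
        cell = (proj > thresholds).astype(np.int64) @ weights
        # Indicator that two samples share a cell in this partition.
        k += cell[:, None] == cell[None, :]
    return k / num_partitions
```

Each per-partition indicator matrix is, up to a permutation of the samples, block diagonal with all-ones blocks, so it is positive semidefinite, and an average of such matrices stays positive semidefinite.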
For any dataset of size $N$, the covariance matrix of $k_\rho$ can be arranged into a block-diagonal matrix: It can be seen that $Y K_\rho Y^T$ is a positive semidefinite matrix. Therefore, for any dataset, $K_\rho$ is also a positive semidefinite matrix, and it can be concluded that $k_\rho$ is a valid kernel function. Analogously, the kernel function defined on the random partition is applied to the random forest. The kernel function $k(x_1, x_2)$ of the Gaussian process regression tree is composed from $M$ trees and the distribution of nodes in each tree, as shown in the following formulas: τ‴ is the split function, and $\sigma_k^2$ is the scaling parameter of the kernel function. $\sigma_n^2$ and $\sigma_k^2$ are the hyperparameters of the Gaussian process regression tree. The likelihood function is obtained from the probability density function of the training samples, as follows: Take the logarithm of the above formula: Let $\sigma_r^2 = \sigma_n^2 / \sigma_k^2$ and find the maximum of the likelihood function. Differentiate with respect to $\sigma_r$: The nonparametric nature of Gaussian process regression results in a large amount of calculation, and the computational complexity of $K_s^{-1}$ is $O(N^3)$. A reduced-rank approximation can reduce the amount of calculation. Let $K(X, X) = \sigma_k^2 Q Q^T$; then $K_s^{-1}$ can be simplified to the following formula: where, due to the special construction of the kernel function, the quantity involved is the index of the leaf node of tree $m$ into which sample $i$ falls. $K_r$ is a matrix of size $BM \times BM$, and the computational complexity is reduced from $O(N^3)$ accordingly.
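The reduced-rank step can be sketched with the Woodbury identity. The sketch assumes the usual Gaussian process form $K_s = K(X, X) + \sigma_n^2 I$ and takes $K_r = \sigma_r^2 I + Q^T Q$; both are assumptions made for illustration, since the paper's own formula did not survive in the text. Under these assumptions only a small matrix of the size of $Q^T Q$ has to be inverted.

```python
import numpy as np

def reduced_rank_inverse(q, sigma_k2, sigma_n2):
    """Invert sigma_k2 * Q Q^T + sigma_n2 * I via the Woodbury identity."""
    n, r = q.shape
    sigma_r2 = sigma_n2 / sigma_k2                  # ratio used in the text
    k_r = sigma_r2 * np.eye(r) + q.T @ q            # small r x r system (assumed form)
    # K_s^{-1} = (I - Q K_r^{-1} Q^T) / sigma_n2
    return (np.eye(n) - q @ np.linalg.solve(k_r, q.T)) / sigma_n2
```

For clarity the sketch forms the full $N \times N$ inverse; in practice one would apply the factors to a vector instead, so that the cost is dominated by products with $Q$ and by the small solve rather than by an $O(N^3)$ inversion.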

Face Posture Standardization.
Based on the 3D Morphable Model (3DMM), a new face posture normalization method is proposed [28]. The weighting coefficient of each eigenvector follows a Gaussian probability distribution with zero mean and with the corresponding eigenvalue as its variance. The expression is as follows: where $\bar{S}$ is the average three-dimensional face shape and $E_i^S$ is each principal component obtained by applying the PCA algorithm to the three-dimensional vertex dataset of the face. $\lambda_i^S$ is the eigenvalue corresponding to $E_i^S$, and the combination coefficient $\alpha_i^S$ follows a Gaussian distribution with mean 0 and variance $\lambda_i^S$. The facial feature points $P_i$ are extracted from the model; restricting the model to these points does not change the linear combination relationship. So: Given a face image, obtain the two-dimensional feature point estimate $p_i = (x_i'', y_i'')$ of the point $P_i$ in the face image, and denote by $P^+$ the position of the origin of the camera coordinate system in the world coordinate system of $S$. $E_i^P$ is the matrix formed by merging the eigenvector set $\{E_i^P\}$ by column. $R_{3:}$ is the third row of the rotation matrix $R$, $M_c$ is the intrinsic parameter matrix of the camera, and $z''$ is the common scale factor of the projection imaging. The N-point energy function combined with the deformation coefficient estimate is as follows: In the above formula, the left term controls the estimated residual, and the right term constrains the combination coefficient $\alpha$ according to the probability prior. $\bar{\alpha}_i$ is the mean of the probability distribution of $\alpha_i$, which is not zero once the average shape $\bar{S}_{P_i}$ has been updated.
The method of alternating optimization [29] is used to reduce the energy function $F$. Given $\alpha$, the projection parameters $(M_c, R, z'', P^+)$ are estimated by minimizing $F$. Given $(M_c, R, z'', P^+)$, the combination coefficient $\alpha$ is obtained by minimizing $F$.
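The alternating scheme can be sketched on a simplified problem. The sketch below replaces the perspective camera $(M_c, R, z'', P^+)$ with an affine camera (a 3 × 2 matrix plus a 2D offset) so that both sub-steps become linear least-squares problems, and it uses a zero-mean Gaussian prior with the eigenvalues as variances. The affine simplification, the function name, and the weight `reg` are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def fit_shape_alternating(p2d, mean_shape, basis, eigvals, n_iters=10, reg=1e-3):
    """Alternate between camera and shape coefficients.

    p2d: (N, 2) landmarks, mean_shape: (N, 3), basis: (K, N, 3), eigvals: (K,).
    """
    k = basis.shape[0]
    alpha = np.zeros(k)
    energies = []
    for _ in range(n_iters):
        # Step 1: fix alpha, solve the affine camera by least squares.
        shape = mean_shape + np.tensordot(alpha, basis, axes=1)
        homo = np.hstack([shape, np.ones((len(shape), 1))])
        cam, *_ = np.linalg.lstsq(homo, p2d, rcond=None)
        a, t = cam[:3], cam[3]
        # Step 2: fix the camera, solve alpha by ridge regression with the prior.
        design = np.stack([(basis[i] @ a).ravel() for i in range(k)], axis=1)
        target = (p2d - mean_shape @ a - t).ravel()
        lhs = design.T @ design + reg * np.diag(1.0 / eigvals)
        alpha = np.linalg.solve(lhs, design.T @ target)
        # Energy = landmark residual + prior term on alpha.
        shape = mean_shape + np.tensordot(alpha, basis, axes=1)
        resid = p2d - shape @ a - t
        energies.append(float((resid ** 2).sum() + reg * (alpha ** 2 / eigvals).sum()))
    return alpha, (a, t), energies
```

Each sub-step minimizes the same energy over one block of variables while the other is held fixed, so the recorded energy cannot increase from one iteration to the next.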
After updating the position $P_i'$ and the rotation matrix $R$ of each face feature point $P_i$ in camera coordinates, the deformation estimation technique of the N-point mapping can be used to calculate the eigenvector combination coefficient $\alpha$ of the modified three-dimensional model $S$, based on the mean of the probability distribution of the combination coefficients $\alpha$. After updating $S$, this mean is updated as well. Once the projection parameters $(M_c, R, z'', P^+)$ and the 3DMM deformation coefficient vector $\alpha$ have been determined, the paper establishes the solved 3D human face: and the color reference relationship of faces in $I$: Given the standard posture projection parameter combination of $S_0$ as $(M_c, R, z'', P^+)$, a standard-pose face image $I^+$ can be generated from the color reference relationship of the three-dimensional model relative to $I$: The normalized face must be symmetric, and the symmetric point of $S_P$ in the three-dimensional face shape $S$ is $\tilde{S}_P = R^{-1} I_{-x} R S_P$, where $I_{-x} = \mathrm{diag}(-1, 1, 1)$. The quality $q(S_P)$ of the color $c(S_P)$ is inversely related to the number of references $C_{\#I}(S_P)$ of $S_P$ in $I$. Since the number of references of $S_P$ is at least 1, the following correspondence can be given: By the symmetry of the standard-pose face image, reference colors of higher quality can be given as follows: $c^+(S_P) = q(S_P)\, c(S_P) + (1 - q(S_P))\, c(\tilde{S}_P)$.
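The symmetry step can be sketched as follows. The mirror operation follows the formula $R^{-1} I_{-x} R S_P$ from the text, and the blend follows the final color equation. The concrete quality function $q = 1 / C_{\#I}$ is a hypothetical choice: the text only states that quality is inversely related to the reference count and that the count is at least 1.

```python
import numpy as np

def mirror_points(points, rotation):
    """Reflect (N, 3) points across the face symmetry plane: R^-1 diag(-1,1,1) R S."""
    flip = np.diag([-1.0, 1.0, 1.0])
    m = rotation.T @ flip @ rotation      # R^-1 = R^T for a rotation matrix
    return points @ m.T

def blend_symmetric_colors(colors, mirror_colors, ref_counts):
    """Blend each point's color with the color of its mirror point."""
    counts = np.maximum(np.asarray(ref_counts, dtype=np.float64), 1.0)
    quality = (1.0 / counts)[:, None]     # hypothetical q: inverse reference count
    return quality * colors + (1.0 - quality) * mirror_colors
```

With this choice, a point referenced exactly once keeps its own color, and points whose color is shared by many references lean more heavily on the mirrored side.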

Dense Sampling and Preprocessing.
According to the idea of local dense sampling [30], illumination and posture correction have very important application value in alleviating the dependence of deep convolutional neural networks on the number of samples and in improving the quality of the perceived weights. After the illumination and projection analysis of the face samples is complete, 4 random small-range affine transformations are first applied to the illumination-analyzed image, and then the three Euler angles of the three-dimensional rotation matrix, the scale adjustment parameter, and the origin coordinates are randomly changed to generate a batch of 16 dense sample images, which serves as one mini-batch for weight training of the deep convolutional model. This training method expands the sample size by a factor of 16, which alleviates the problem of insufficient samples. In addition, because the samples within a batch are densely distributed, the probability that the perceptual model overfits the data is reduced. During recognition, since face images with arbitrary illumination and pose may appear, illumination and projection analysis are again used to analyze, correct, and then perceive the face pattern.
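One way to read the "16 samples per image" scheme is as 4 small affine perturbations combined with 4 pose perturbations. The sketch below only draws the transformation parameters; applying them to an image or 3D model is left out. The 4 × 4 pairing, the perturbation ranges, the Euler angle convention, and all names are assumptions for illustration.

```python
import numpy as np

def euler_to_rotation(yaw, pitch, roll):
    """Rotation matrix from three Euler angles (Z-Y-X convention, assumed)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    return rz @ ry @ rx

def sample_dense_batch(rng, n_affine=4, n_pose=4, max_angle=np.deg2rad(10.0)):
    """Draw n_affine x n_pose parameter sets for one dense mini-batch."""
    # Small random perturbations of the identity 2 x 3 affine map.
    affines = [np.eye(2, 3) + rng.uniform(-0.05, 0.05, size=(2, 3))
               for _ in range(n_affine)]
    poses = []
    for _ in range(n_pose):
        angles = rng.uniform(-max_angle, max_angle, size=3)
        poses.append({
            "rotation": euler_to_rotation(*angles),
            "scale": rng.uniform(0.9, 1.1),
            "origin": rng.uniform(-2.0, 2.0, size=2),
        })
    return [(a, p) for a in affines for p in poses]
```

Calling `sample_dense_batch(np.random.default_rng(0))` yields 16 parameter pairs, one per generated sample in the mini-batch.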

Attention Capsule Network Model.
The proposed attention capsule network model has five gated convolution modules. Each gated convolution module consists of two gated convolutional layers followed by max pooling. Each gated convolutional layer includes a linear function and a sigmoid activation function. Compared with a traditional CNN, the gated convolutional network replaces the rectified linear unit (ReLU) with a gated linear unit. The learnable gate controls the amount of information passed from the current layer to the next. Gated linear units can mitigate the vanishing gradient problem: the sigmoid activation preserves the nonlinear capability of the network, while the linear function provides a linear path for the gradient. The max pooling operation reduces the spatial dimension of the features. The features output by the five gated convolution modules are sent to the primary capsule layer, which consists of a convolution module, a reshaping module, and a squashing module. After the input features pass through the convolutional layer, have the bias added, and pass through the ReLU nonlinear activation function, they are reshaped into a three-dimensional tensor of size $T \times V \times U$ and compressed with the squashing function. $T$ is the time dimension before reshaping, $V$ is the dimension inferred from the other variables, and $U = 4$ is the size of the capsule. The output of the primary capsule layer has $T$ time slices. Each time slice has $V$ capsules, and each capsule is a tensor of size $1 \times 1 \times U$.
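The two building blocks described above can be sketched in NumPy. The gated linear unit multiplies a linear path by a sigmoid gate, and the squashing function is the standard capsule nonlinearity that scales a vector's length into [0, 1) while keeping its direction. For brevity the sketch uses dense matrix products in place of convolutions, and the weight names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_unit(x, w_lin, b_lin, w_gate, b_gate):
    """Linear path times a learnable sigmoid gate (dense stand-in for a conv)."""
    return (x @ w_lin + b_lin) * sigmoid(x @ w_gate + b_gate)

def squash(s, axis=-1, eps=1e-9):
    """Capsule squashing: keeps direction, maps the length into [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def primary_capsules(features, capsule_size=4):
    """Reshape (T, F) features to (T, V, U) capsules with U = capsule_size."""
    t = features.shape[0]
    capsules = features.reshape(t, -1, capsule_size)   # V is inferred
    return squash(capsules)
```

For example, features of shape (6, 32) become 6 time slices of 8 capsules each, with every capsule a vector of length 4 and norm below 1.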
The $V$ capsules of each time slice are input into the advanced capsule layer. The computation between the primary and the advanced capsule layer uses a dynamic routing algorithm, which matches the $V$ low-level capsules representing image frames with the $J$ high-level capsules representing expression categories. When multiple image frames predict the same event, the expression category of the image is determined. Feedback is then used to increase the weights of image frames related to the expression category and to reduce the weights of unrelated frames, so that the weights between all image frames and expression categories are learned accurately. The routing weights are updated at each training step, and the final weights are saved when the algorithm ends. The dynamic routing algorithm is used to calculate the output vector $v_j$, and then the Euclidean length of $v_j$ is computed. The vector of Euclidean lengths of the $J$ categories at each moment $t$ is used as the output of the advanced capsule layer, denoted $o(t)$.
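A minimal sketch of routing for one time slice is given below, following the widely used routing-by-agreement procedure (softmax coupling coefficients, weighted sum, squash, agreement update). Whether the paper uses exactly this variant is an assumption. The input `u_hat` holds the prediction of each of the $V$ low-level capsules for each of the $J$ high-level capsules; the default of 3 iterations mirrors the value selected later in the experiments.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, n_iters=3):
    """u_hat: (V, J, D) prediction vectors. Returns v (J, D) and lengths (J,)."""
    num_low, num_high, _ = u_hat.shape
    logits = np.zeros((num_low, num_high))
    for it in range(n_iters):
        coupling = softmax(logits, axis=1)             # each low capsule splits its vote
        s = np.einsum("ij,ijd->jd", coupling, u_hat)   # weighted sum per high capsule
        v = squash(s)
        if it < n_iters - 1:
            # Raise the logit where a prediction agrees with the current output.
            logits += np.einsum("ijd,jd->ij", u_hat, v)
    return v, np.linalg.norm(v, axis=-1)
```

The returned lengths play the role of $o(t)$ for one time slice: one value in [0, 1) per expression category.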
The $V$ capsules of each time slice are also input into the attention layer. The attention layer allows the network model to focus on finding the salient frames of the input that are related to the expression category. The sigmoid activation function of this layer predicts the importance of each frame. The output of the attention layer at each moment $t$ is $z(t)$, whose value lies between 0 and 1. The attention layer selects salient frames while suppressing frames that are irrelevant to the expression category; in this way the temporal attention mechanism is realized by selecting salient time slices. Time slices with large attention factors correspond to class-related salient image frames, and time slices with small attention factors correspond to class-irrelevant image frames. Finally, the fusion layer combines the output $o(t)$ of the advanced capsule layer with the output $z(t)$ of the attention layer. The final predicted output $y_j$ is obtained as the weighted sum of the output $o(t)$ of the advanced capsule layer and the attention factor $z(t)$. $y_j$ represents the predicted value for the $j$-th expression category, and the expression is as follows: where $o_j(t)$ is the Euclidean length of the output vector $v_j$ of the $j$-th capsule at time $t$, and $z_j(t)$ is the $j$-th attention factor at time $t$, with $j = 1, \ldots, J$ and $t = 1, \ldots, T$. $z(t)$ controls which salient image frames of $o(t)$ pass on information. Choose a probability threshold $\tau$. When $y_j > \tau$, the output is the $j$-th expression category. The overall framework of the attention capsule network model is shown in Figure 2.
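The fusion step can be sketched as an attention-weighted combination over time. The sketch divides the weighted sum by the total attention per class so that the result stays in the same range as $o_j(t)$ and can be compared with the threshold $\tau$; this normalization, the default threshold, and the function names are assumptions for illustration, since the paper's equation is not available in the text.

```python
import numpy as np

def fuse_attention(o, z, eps=1e-9):
    """o, z: (T, J) capsule lengths and attention factors. Returns y of shape (J,)."""
    # Attention-weighted sum over time, normalized by the total attention (assumed).
    return (o * z).sum(axis=0) / (z.sum(axis=0) + eps)

def predict_classes(y, tau=0.5):
    """Indices j with y_j above the probability threshold tau."""
    return np.flatnonzero(y > tau)
```

With a constant attention factor the fusion reduces to a plain average of $o_j(t)$ over time, and frames with $z_j(t)$ near 0 contribute almost nothing to $y_j$.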

Experimental Results and Analysis
To verify the effectiveness of the proposed facial expression recognition method using a CNN and the improved capsule network model, an experimental evaluation was performed on the BU-3DFE and JAFFE datasets. The proposed algorithm is compared experimentally with those proposed in reference [23], reference [15], and reference [19].

JAFFE.
The JAFFE dataset contains a total of 213 images. Ten Japanese female students were selected, and each person made 7 expressions. In the preprocessing stage, all images are uniformly normalized to 150 × 110 pixels, and then feature extraction is performed on the images. Figure 3 shows example images from the JAFFE dataset.

BU-3DFE.
The dataset provides 2D images derived from 3D face models covering 7 typical expressions. The dataset includes 100 subjects, of which 56% are female and 44% are male. Subjects also vary in age and ethnic origin. Figure 4 shows example images from the BU-3DFE dataset.

Experimental Setup.
In the proposed facial expression recognition method using a CNN and the improved capsule network model, the feature bin is merged by multiple subsequent residual blocks without the single-layer convolution truncated in the residual block chain (RBC). The single-layer convolution consists of 32 convolution kernels of size 256 × 7 × 7 with a stride of 2. In this way, the 8 parallel convolutional layers in the feature bin all have dimensions 32 × 8 × 8. Eight parallel convolutional layers are used in the feature bin in order to establish a middle-level convolutional feature for each expression. Therefore, the number of parallel convolutional layers in the feature bin should be greater than, and close to, the number of expression classes, which is 7. Using exactly 7 convolutional layers is too restrictive, so the restriction is slightly relaxed and the number of convolutional layers in the feature bin is set to 8. The length of each class vector in the class vector bin is 16. To reconstruct from the activated class vector, the vector is first converted into a feature block of 8 × 8 × 32 by a fully connected layer, and the image is then restored by three distributed interleaved convolution modules. 128, 32, 32, 32 distributed interleaved convolution kernels with scales of 6 × 6 × 32, 9 × 9 × 128, 6 × 6 × 32, 10 × 10 × 32 and strides of 2, 1, 2, 2 are adopted. Width and height are set to the uniform scale of the model input image, 128 × 128.

Analysis of Parameter Performance.
To determine the values of the number of dynamic routing iterations, the illumination suppression coefficient $\beta$, and the combination coefficient $\alpha$ of the proposed facial expression recognition method using a CNN and the improved capsule network model, experiments were performed on the JAFFE and BU-3DFE datasets. The number of dynamic routing iterations ranges from 1 to 10 in the experiment. The illumination coefficient ranges from 0.1 to 1, and the combination coefficient ranges from 0.1 to 1. It can be seen from Figures 5-7 that when the number of dynamic routing iterations is 3, the recognition rates on both the JAFFE and BU-3DFE datasets peak. The model works best when the illumination coefficient $\beta$ and the combination coefficient $\alpha$ are 0.2 and 0.4, respectively. Therefore, in the following experiments, the number of dynamic routing iterations is set to 3, the illumination coefficient is set to 0.2, and the combination coefficient is set to 0.4.

Results of Key Point Detection
To illustrate the key point detection performance of this method, the proposed method is compared with existing face key point detection models. Figure 8 compares the key point detection results of the algorithms in reference [23], reference [15], and reference [19] with those of the proposed method on the JAFFE and BU-3DFE datasets.
It can be seen from Figure 8 that the proposed method is significantly better than the three existing methods of reference [23], reference [15], and reference [19]. This shows that, when the original face image is used for training in facial expression recognition, using rough geometric constraints as the real face key points introduces a large error. Therefore, mapping the features to a high-dimensional space by random partitioning is needed.

Effect Verification of Illumination and Posture Preprocessing.
To verify the effects of illumination and posture preprocessing, training was performed on the JAFFE and BU-3DFE datasets, and validation was performed on the CK+ and Multi-PIE datasets. In each sampling step, illumination and posture processing are applied to a single expression image to vary the camera projection perspective and the illumination condition, generating a mini-batch of 32 training samples.
The experimental results are shown in Table 1. It can be seen from Table 1 that, after illumination and posture preprocessing, the cross-dataset recognition accuracy of several deep learning methods is greatly improved. The results verify the effectiveness of the proposed illumination and posture normalization method.

Recognition Result.
To verify the effectiveness and superiority of the proposed algorithm, Tables 2 and 3 show the recognition rates of the individual expressions at different pose angles on the BU-3DFE and multi-PIE datasets. Figures 9 and 10 show the confusion matrices with the best performance. It can be seen from Tables 2 and 3 that the recognition rate of each expression at different poses on the JAFFE dataset exceeds 90%. On the BU-3DFE dataset, the highest recognition rate for a single expression reaches 86.13%, and at the other posture angles the recognition rate of each expression is higher than that at 0°, because the proposed method improves the smoothness of the image and reduces the distortion of the face texture, which increases the recognition rate.
As can be seen from Figures 9 and 10, the accuracy rates on the JAFFE and BU-3DFE datasets reach 96.66% and 80.64%, respectively. Hate is more difficult to identify on the JAFFE dataset, and fear is more difficult to identify on the BU-3DFE dataset. A correct recognition rate of 95.23% is reported. The reason for the difficulty is that the two expressions have similar texture changes around the eyes.
To verify the effectiveness and superiority of the proposed algorithm, a comprehensive comparison is made with existing methods on the JAFFE and BU-3DFE datasets. During the experiment, the training subjects are kept independent of the test subjects, and an SVM is used as the classifier. The results are shown in Tables 4 and 5.
It can be seen from Tables 4 and 5 that, with the same classifier, the proposed method obtains a higher recognition accuracy than several other expression recognition methods. The reason is that the proposed method fully extracts illumination-insensitive features, suppresses boundary marks at abrupt changes of illumination, and reduces image noise points. Meanwhile, mapping features to a high-dimensional space through random partitioning helps to distinguish similar-looking expressions. Therefore, the proposed method can effectively improve the recognition accuracy of CNN models.

Conclusion
A new facial expression recognition method based on an improved capsule network model is proposed, which reduces image noise by adaptive preprocessing of the image illumination, reduces the complexity of the model, and improves the accuracy of the model by using random partitioning. The improved model adds an attention layer after the primary capsule layer of the capsule network, which increases the attention paid to salient parts by weighting. That is, it can automatically select the frames most relevant to the expression class and ignore the irrelevant frames (such as background noise). The attention layer realizes the attention mechanism by selecting salient time slices.
Thus, the overfitting of the model is reduced. Experimental results show that the improved capsule model can effectively classify facial expressions under unconstrained conditions, which makes up for the deficiency of a pure deep convolutional network in acquiring sparse features hidden in discriminative texture and improves the generalization ability of existing expression classification models to illumination and pose differences.
In future work on facial expression recognition, it is planned to integrate the attention matrix into the attention capsule network, to use the attention capsule network for weakly labeled, semisupervised expression image detection, and to apply it to other large-scale data problems with low discrimination.

Data Availability
The data included in this paper are available from the corresponding author upon request.