CNN-LSTM Facial Expression Recognition Method Fused with Two-Layer Attention Mechanism

When exploring facial expression recognition methods, we find that existing algorithms make insufficient use of information about the key facial regions that express emotion. To address this problem, we build on a combined convolutional neural network and long short-term memory network (CNN-LSTM) and propose a facial expression recognition method that incorporates an attention mechanism (CNN-ALSTM). Compared with the general CNN-LSTM algorithm, it mines the information of important regions more effectively. Furthermore, a CNN-LSTM facial expression recognition method incorporating a two-layer attention mechanism (ACNN-ALSTM) is proposed. We conducted comparative experiments on the Fer2013 and processed CK+ data sets with CNN-ALSTM, ACNN-ALSTM, patch-based ACNN (pACNN), facial expression recognition with attention net (FERAtt), and other networks. The results show that the proposed ACNN-ALSTM hybrid neural network model is superior to related work in expression recognition.


Introduction
In the field of facial expression recognition based on deep learning networks, the convolutional neural network (CNN) has been proven to be an important means of extracting local spatial features of images. However, CNN is a typical feedforward deep network. Its layers are connected in a single chain and information flows in only one direction, so it cannot capture contextual sequential information, which restricts the accuracy of the algorithm. The recurrent neural network (RNN) has loops because its neurons receive information not only from other neurons but also from themselves. The input of RNN is sequence data, and a loop unit is executed at each time t. The calculation of the hidden layer at the next time t + 1 depends on the input data at that time and the hidden state at time t [1]. Since each hidden state saves the node information from before time t, RNN can establish the global dependency of sequence data. RNN is used in tasks such as speech recognition and language generation.
Computer vision problems involving sequence input can be solved satisfactorily by RNN. For multilabel image classification tasks [2], Zhang et al. found that traditional multilabel image classification methods cannot explicitly use the label dependency in the image. On the basis of RNN, the CNN-RNN model captures the relationship between semantics and image tags. Guo et al. used CNN to extract discriminative features and used RNN to optimize the classification of coarse and fine labels [3]. The classification performance of traditional hierarchical models is then improved by fusing hierarchical information.
Parameters of RNN are adjusted through the backpropagation algorithm. However, the network is likely to experience vanishing gradients as the length of the sequence data input to the network increases. The long short-term memory network (LSTM) introduces gating units to control the accumulation speed of information [1]. This not only endows it with a stronger memory capacity for storing sequence information but also effectively deals with vanishing gradients, which is an improvement over RNN.
LSTM has been widely used in image annotation, machine translation, semantic recognition, and other directions [4].
When neurons are used as information storage units in neural networks, the more information is stored, the more neurons are needed and the more complex the corresponding network structure becomes [5]. A large amount of information in the network often leads to information overload, so the attention mechanism is introduced to address this problem. The attention mechanism draws on the basic principles of human brain resource allocation [6]. Because human attention is limited, more attention is paid to important areas so that richer information is obtained from limited resources. Initially, the attention mechanism was proposed in the field of computer vision and was later also used in the field of natural language processing [7][8][9]. Its main purpose is to use computing resources rationally by allocating more attention to the key parts. This effectively solves the problem of information redundancy and prevents the loss of key information.
Based on the ability of the attention mechanism to focus on key parts and the excellent memory capability of the recurrent neural network, this paper proposes a CNN-LSTM hybrid neural network that integrates a two-layer attention mechanism (ACNN-ALSTM) [10,11]. The CNN-LSTM network abstracts the local feature information from the facial expression image and establishes the global spatial dependence of the features through the LSTM. At the same time, the two-layer attention mechanism is introduced to mine the information of essential areas to obtain more discriminative expression features, thereby improving the recognition accuracy.
Our contributions are summarized as follows:
(1) A two-layer attention mechanism is introduced to propose a novel ACNN-ALSTM model for facial expression recognition. The two-layer attention mechanism can enhance attention to key areas.
(2) The LSTM algorithm is introduced to establish the global spatial dependency of facial information. The effects of the number of network layers and the number of nodes in the hidden layer of the LSTM model on expression recognition are then investigated.
(3) Comparative experiments verify that introducing the two-layer attention mechanism into the CNN-LSTM model improves the classification performance on the Fer2013 and CK+ data sets. Comparison with other neural networks that incorporate attention mechanisms shows that our work has certain advantages.

Related Works
Yang et al. added a convolution operator across the spatial domain to the network to automatically learn spectral-spatial features from hyperspectral images [12]. Then, a bidirectional convolutional long short-term memory network was built using a bidirectional LSTM to fully capture the spectral information. Considering the complex spatial dependencies of natural scene image labels, Nagaraju et al. proposed a fully learning-based method for pixel-level segmentation and classification of scene images, a 2D LSTM network, which achieves good performance on SIFT stream datasets [13]. The model proposed by Lin et al. is composed of a deep LSTM network [14], eight encoders, and eight decoder layers.
The attention mechanism is used to connect the bottom layer of the decoder to the top layer of the encoder, which improves parallelism, reduces the training time, and provides a new solution for neural machine translation. In their study of 3D human action recognition, Fu et al. found that not all joints provide information for action analysis [15]. Instead, irrelevant joints usually produce some noise. Attention mechanisms can be employed to focus on the joints that do provide information.
Thus, a global context-aware attention LSTM network is proposed to focus on the most informative joints in each frame of the skeleton sequence, which shows excellent performance on 3D action recognition datasets.
In recent years, researchers have been trying to optimize network models using the attention mechanism. For facial expression recognition tasks, Fernandez et al. presented an end-to-end network architecture with an attention model [16]. Li et al. proposed a CNN with attention mechanism (ACNN) [17]. The ACNN can identify the occluded regions of the facial expression image and focus on the most discriminative unoccluded regions.

Methodology
In this chapter, the specific composition of the two network architectures we designed is described. For the first one, a layer of attention mechanism is added to the CNN-LSTM network. For the other, a two-layer attention mechanism is introduced.

CNN-LSTM Facial Expression Recognition Network Model Incorporating an Attention Mechanism.
When the RNN is employed for facial expression classification tasks, the hidden states are averaged over the time steps and the result is used as the classification feature. However, this averaging blurs the focus and cannot emphasize valuable information. The attention mechanism can make the network focus on effective information, so it is employed to selectively process the information input to the classifier and thereby optimize the classification performance of the neural network. The CNN-LSTM facial expression recognition network model incorporating an attention mechanism (CNN-ALSTM) proposed in this section mainly includes four parts: CNN local feature extraction layer, LSTM feature learning layer, global feature attention layer, and classification layer. The CNN-ALSTM model structure diagram is shown in Figure 1.

CNN Local Feature Extraction Layer.
This layer is mainly responsible for extracting the local abstract features of the facial expression image and serializing the generated feature vectors as the input features of the LSTM network. The CNN local feature extraction layer is composed of a 7-layer convolutional neural network, including 4 convolutional layers and 3 down-sampling layers. The parameters of each layer are shown in Figure 1.
Instead of directly applying a large convolution kernel as in the original CNN model, the convolution layer C1 executes point-by-point convolution of the input data using a 1 × 1 convolution kernel. This increases the nonlinear representation of the input and improves the feature representation ability. In addition, the 1 × 1 point convolution kernel has few parameters, which effectively reduces the computational complexity of the network. The pooling layers in the CNN local feature extraction layer use maximum pooling to extract the strongest features. This reduces the resolution of the feature map and the computational complexity.
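The two operations described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the ReLU nonlinearity, the 2 × 2 pooling window, and the choice of 8 output channels are assumptions made only for illustration, since the text specifies just the 1 × 1 kernel, max pooling, and a pooling step size of 2.

```python
import numpy as np

def pointwise_conv(x, w, b):
    """1x1 convolution. x: (H, W, C_in), w: (C_in, C_out), b: (C_out,)."""
    # Each pixel's channel vector is mixed independently of its neighbours.
    # The ReLU is an assumed nonlinearity; the text does not name one.
    return np.maximum(x @ w + b, 0.0)

def max_pool_2x2(x):
    """Max pooling with an assumed 2x2 window and stride 2 on an (H, W, C) map."""
    h, w, c = x.shape  # H and W must be even
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((48, 48, 1))       # one 48x48 gray-scale face image
w1 = rng.standard_normal((1, 8))      # 8 output channels: illustrative only
b1 = np.zeros(8)

c1 = pointwise_conv(image, w1, b1)    # spatial size unchanged: (48, 48, 8)
s1 = max_pool_2x2(c1)                 # resolution halved: (24, 24, 8)
```

With one input channel, this 1 × 1 layer has only 8 weights and 8 biases, which is the parameter saving the paragraph refers to; the pooling step then quarters the number of activations passed on.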

LSTM Global Feature Learning Layer.
The input sequence is $x = \{x_1, x_2, \ldots, x_N\}$, and the input data at moment $n$ is $x_n$. The main work of this layer is roughly divided into four parts. First, the facial expression abstract feature vectors obtained by the CNN feature extraction layer are serialized as the input of the LSTM network. Then, following the order of the time series, they are sequentially input into the main structure of the recurrent network. At the same time, the LSTM memory unit generates state information at each moment by combining the current input data with historical information. Finally, the state information, which contains the historical information of all previous moments, is passed to the next layer.
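The four steps above can be sketched with a standard LSTM cell written in NumPy. This is a generic textbook formulation, not the authors' code: the way the feature map is serialized (one step per spatial position), the feature map size, and the hidden size of 32 are all assumptions chosen to keep the example small.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_n, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4*d_h, d_x), U: (4*d_h, d_h), b: (4*d_h,)."""
    d_h = h_prev.shape[0]
    z = W @ x_n + U @ h_prev + b
    i = sigmoid(z[0:d_h])               # input gate
    f = sigmoid(z[d_h:2 * d_h])         # forget gate
    o = sigmoid(z[2 * d_h:3 * d_h])     # output gate
    g = np.tanh(z[3 * d_h:4 * d_h])     # candidate memory content
    c_n = f * c_prev + i * g            # memory cell mixes history and new input
    h_n = o * np.tanh(c_n)              # hidden state passed to the next layer
    return h_n, c_n

def run_lstm(xs, W, U, b, d_h):
    """Feed the sequence step by step and collect every hidden state."""
    h, c = np.zeros(d_h), np.zeros(d_h)
    hidden_states = []
    for x_n in xs:
        h, c = lstm_step(x_n, h, c, W, U, b)
        hidden_states.append(h)
    return np.stack(hidden_states)

rng = np.random.default_rng(0)
feature_map = rng.random((6, 6, 16))     # assumed CNN output: 6x6 positions, 16 channels
xs = feature_map.reshape(-1, 16)         # assumed serialization: N = 36 steps of dimension 16
d_h = 32                                 # illustrative hidden size
W = rng.standard_normal((4 * d_h, 16)) * 0.1
U = rng.standard_normal((4 * d_h, d_h)) * 0.1
b = np.zeros(4 * d_h)

hidden_states = run_lstm(xs, W, U, b, d_h)   # shape (36, 32)
```

Each row of `hidden_states` depends on all earlier inputs through the memory cell, which is what allows the later attention layer to weigh the steps against each other.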

Global Feature Attention Layer.
After features are extracted from the image by the LSTM network, the hidden state sequence of the LSTM network is defined as $h = \{h_1, h_2, \ldots, h_N\}$, where $h_n$ is the hidden state at moment $n$. To select features that are more relevant to the classification task, a query vector $q$ related to the facial expression recognition task is introduced. A scoring function is used to calculate the correlation between the hidden state and the query vector, so that the higher the correlation between the hidden state and the classification task, the higher the score. For each hidden state, the correlation score is obtained from the attention scoring function $s(h_n, q)$.
After the task relevance of all hidden states has been scored, the scores are normalized to obtain the attention distribution $a_n$ of the hidden state:

$$a_n = \mathrm{softmax}(s(h_n, q)) = \frac{\exp(s(h_n, q))}{\sum_{j=1}^{N} \exp(s(h_j, q))}$$

The attention distribution $a_n$ is the probability representation of the importance of the hidden state $h_n$ in the facial expression recognition task. The weighted average method is employed to fuse the information of all hidden states of the LSTM model, focusing on the features most relevant to the task. The following formula is the weighted feature representation of the sequence of hidden states:

$$att = \sum_{n=1}^{N} a_n h_n$$

Figure 1: Structure of the CNN-LSTM model with the attention mechanism fused into the global features. In the CNN local feature extraction layer, C denotes a convolutional layer and S denotes a pooling layer. The elements in {} represent the size and the number of the convolution kernels, respectively. The convolutional layers and the pooling layers have step sizes of 1 and 2, respectively.
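The scoring, normalization, and weighted fusion steps of the global feature attention layer can be sketched as follows. The form of the scoring function is not recoverable from the text, so the dot product $s(h_n, q) = h_n^\top q$ used here is an assumption; the softmax normalization and the weighted average follow the description above.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def global_attention(h, q):
    """h: (N, d) hidden states, q: (d,) query vector. Returns (a, att)."""
    scores = h @ q        # assumed dot-product scoring function s(h_n, q)
    a = softmax(scores)   # attention distribution a_n, sums to 1
    att = a @ h           # weighted average of the hidden states, shape (d,)
    return a, att

rng = np.random.default_rng(0)
h = rng.standard_normal((36, 32))   # illustrative LSTM hidden states
q = rng.standard_normal(32)         # in the model this vector would be learned
a, att = global_attention(h, q)
```

When every score is equal, `a` is uniform and `att` reduces to the plain average over time steps, so the average sampling criticized earlier is the special case in which the attention layer has learned nothing.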

Classification Layer.
Finally, the classification layer classifies the weighted features fused by the global feature attention layer.

CNN-LSTM Facial Expression Recognition Network Model Incorporating a Two-Layer Attention Mechanism.
An ACNN-ALSTM network model is proposed in this section. The network mines the local features through CNN [18,19] and then uses the long-term memory capability of RNN to establish the global spatial dependence of the local features. LSTM is employed to compensate for the network instability that arises when RNN processes long sequences. For the recognition of facial expression features, the proposed model further incorporates the attention mechanism on the basis of the previous section, and the process is similar to that of CNN-ALSTM. The attention distribution is calculated for the hidden states of the LSTM, and a weighted summation is then performed on the hidden states. The result is sent to the classifier as an optimized facial expression feature. The proposed model mainly includes five parts: CNN local feature extraction layer, local feature attention layer, LSTM feature learning layer, global feature attention layer, and classification layer. The ACNN-ALSTM model structure diagram is shown in Figure 2.

CNN Local Feature Extraction Layer.
The structure of this layer is identical to that of the CNN local feature extraction layer in the previous section, so it is not repeated here.

Local Feature Attention Layer.
The role of this layer is to strengthen the expressive power of essential local features. A query vector $q$ related to the facial expression recognition task is introduced. As in the previous section, the correlation between each local feature $local_n$ and the query vector is calculated through the attention scoring function $s(local_n, q)$. Local features with high scores are more essential. After the task relevance scoring is completed, the scores are normalized to obtain the attention distribution $La_n$ of the local features, as shown in the following formula:

$$La_n = \mathrm{softmax}(s(local_n, q))$$

The attention distribution $La_n$ is the probability representation of the importance of the local feature $local_n$. Each feature is then weighted to form the input of the LSTM:

$$x_n = La_n \cdot local_n$$

where $x_n$ is the weighted local feature.
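A short sketch may help to show how this layer differs from a fusing attention layer: the local features are rescaled one by one, and the sequence length is preserved so that the LSTM still receives N inputs. As before, the dot-product scoring function is an assumption, and the array sizes are illustrative.

```python
import numpy as np

def local_attention(local, q):
    """local: (N, d) local feature vectors, q: (d,) query vector. Returns (La, x)."""
    scores = local @ q                  # assumed dot-product score s(local_n, q)
    e = np.exp(scores - scores.max())   # stable softmax numerator
    la = e / e.sum()                    # attention distribution La_n
    x = la[:, None] * local             # x_n = La_n * local_n, still (N, d)
    return la, x

rng = np.random.default_rng(0)
local = rng.random((36, 16))            # illustrative serialized CNN features
q = rng.standard_normal(16)             # in the model this vector would be learned
la, x = local_attention(local, q)       # x is what the LSTM would consume
```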

LSTM Global Feature Learning Layer.
The process at this layer is similar to that described in the previous section. The LSTM algorithm employs gating units and strong memory ability to establish the global spatial dependence of facial information.

Global Feature Attention Layer.
The hidden state output by the LSTM is employed as the input of the global feature attention layer. The query vector $q$ of the attention layer is defined as a learnable weight vector whose dimension is consistent with the number of hidden layer nodes of the LSTM network.
This layer implements the weighted fusion of the output features of the LSTM using the attention mechanism in two steps. First, the attention distribution $a = \{a_1, a_2, \ldots, a_N\}$ for the hidden state $h = \{h_1, h_2, \ldots, h_N\}$ of the LSTM network is calculated. Second, the weighted average result $att$ of the hidden state is calculated according to the attention distribution.
The attention mechanism can increase the weight of the features that are most relevant to the facial expression recognition task and give them a dominant role in the feature fusion process, thereby improving the classification performance.

Classification Layer.
The classification layer is responsible for classifying the weighted features fused by the global feature attention layer. First, to reduce the dimensionality of the feature to the number of categories, a fully connected layer is used for classification, and a 7-dimensional vector $V$ is obtained. The symbol $V$ corresponds to the attention scoring function $s(h_n, q)$ above. Then, the normalized formula (7) is formed by combining it with a softmax function:

$$S_i = \frac{e^{V_i}}{\sum_{j=1}^{7} e^{V_j}} \tag{7}$$

Using the softmax function, the facial expression features that are weighted and fused through the attention mechanism are classified. $S_i$ represents the probability that the input sample belongs to the $i$th facial expression, $i \in \{1, 2, \ldots, 7\}$.
Finally, the loss of the entire model is calculated by cross entropy, as follows:

$$L(y, S) = -y \log(S) \tag{8}$$

Here $y$ refers to the probability distribution of the real category, and $S$ represents the probability distribution over the categories predicted by the model in this section.
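The classification and loss computation can be checked numerically with a small sketch. The weight names `W_fc` and `b_fc` and the feature dimension of 32 are hypothetical; the cross entropy is summed over the 7 classes, which is the usual reading of formula (8) when $y$ is a one-hot vector.

```python
import numpy as np

def classify(att, W_fc, b_fc):
    """Fully connected layer followed by softmax. Returns S, shape (7,)."""
    v = W_fc @ att + b_fc         # 7-dimensional score vector V
    e = np.exp(v - v.max())       # stable softmax numerator
    return e / e.sum()            # S_i as in formula (7)

def cross_entropy(y, s, eps=1e-12):
    """L(y, S) = -sum_i y_i * log(S_i); eps guards against log(0)."""
    return float(-np.sum(y * np.log(s + eps)))

rng = np.random.default_rng(0)
att = rng.standard_normal(32)                 # illustrative fused feature
W_fc = rng.standard_normal((7, 32)) * 0.1     # hypothetical classifier weights
b_fc = np.zeros(7)

s = classify(att, W_fc, b_fc)
y = np.zeros(7)
y[3] = 1.0                                    # one-hot label for class index 3
loss = cross_entropy(y, s)
```

A classifier that outputs the uniform distribution over the 7 expressions incurs a loss of $\log 7 \approx 1.946$, which is a convenient baseline when reading training curves.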

Experiments
The data sets, implementation, training details, and expression recognition results are described in this part. In the analysis of the experimental results, the effect of the LSTM parameters on the facial expression features extracted by the model is studied experimentally. In addition, the performance of the proposed methods is confirmed by the expression recognition results.

Datasets.
Two data sets are employed, namely, the CK+ and Fer2013 data sets. Both data sets contain facial images of 7 emotions and are described in detail below [20].

Fer2013 Data Set
The Fer2013 data set comes mainly from the well-known data science competition platform Kaggle. The data were obtained by searching for image keywords through the Google search engine and include 35,887 gray-scale images with a resolution of 48 × 48 pixels. Table 1 shows the details of the various expressions in the Fer2013 data set. The data set is rich and diverse since the images are crawled from the web, but the data contain a lot of noise, which is a challenge for facial expression classification algorithms.

CK+ Data Set.
The CK+ (Extended Cohn-Kanade) data set was established in 2010 by the Patrick Lucey team and the Zara Ambadar team [21]. The database consists of close-up face images of 123 subjects, taken according to certain expression requirements. There are 593 facial image data in the database, 327 of which are labeled with one of seven facial emotions. Since few data in this data set carry expression tags, data enhancement is performed on the 327 images with emotions [22]. First, face detection is performed on each image to trim the invalid background. Then, all images are converted to 48 × 48 resolution. Furthermore, the images are flipped and rotated to increase the amount of data to about 3 times the original. At the same time, the brightness and saturation of the images are adjusted. The data distribution of the processed CK+ data set is shown in Table 2.
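The flip, rotate, and brightness steps can be sketched in NumPy as follows. The paper performs data enhancement with OpenCV; this self-contained version is only an illustration of the idea, and the rotation angle of 10 degrees, the nearest-neighbour resampling, and the brightness factor of 1.2 are all assumptions, since the text does not give these values.

```python
import numpy as np

def flip_horizontal(img):
    """Mirror a (H, W) gray-scale image left to right."""
    return img[:, ::-1]

def rotate_nn(img, degrees):
    """Rotate about the image centre with nearest-neighbour sampling."""
    h, w = img.shape
    theta = np.deg2rad(degrees)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for every output pixel, find the source pixel.
    src_x = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    src_y = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    sx = np.rint(src_x).astype(int)
    sy = np.rint(src_y).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(img)            # pixels with no source stay black
    out[valid] = img[sy[valid], sx[valid]]
    return out

def adjust_brightness(img, factor):
    """Scale intensities and clip to the valid 8-bit range."""
    return np.clip(img.astype(float) * factor, 0, 255).astype(np.uint8)

def augment(img):
    """Return the original plus a flipped and a rotated copy (3x the data)."""
    return [img, flip_horizontal(img), rotate_nn(img, 10.0)]

rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(48, 48), dtype=np.uint8)  # stand-in for a cropped face
samples = augment(face)
brighter = adjust_brightness(face, 1.2)
```

Saturation adjustment is omitted because it has no effect on a single-channel gray-scale array.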

Table 1: Details of the expressions in the Fer2013 data set.
Tag        All     0      1     2      3      4      5      6
Quantity   35887   4953   547   5121   8989   6077   4002   6198

Computational Intelligence and Neuroscience

Implementation and Training Details.
Two kinds of image classification experiments are designed to verify the effects of the two deep learning network models proposed in this paper on facial expression recognition. These experiments are implemented on a Linux 16.04 64-bit operating system, using Python 3.6 and Google's TensorFlow 1.12.0 (GPU) framework for programming, and calling OpenCV for data enhancement. An NVIDIA GeForce GTX 1080 Ti graphics card with 16G video memory is employed. The Fer2013 and CK+ data sets are divided into training set, validation set, and test set according to the ratio of 8 : 2. Training runs for 5000 iterations; the training loss and error are printed every 100 rounds, and the test loss and error are printed every 400 rounds. Face images input to the model are rescaled to 48 × 48 pixels. The learning rate of the model is 0.001, and the Adam optimizer and softmax regression are employed for calculation and classification [23,24]. Cross entropy is used as the loss function, and the batch size is 30. The degree of feature abstraction that the LSTM model achieves on an image is found to be related to the number of nodes in the hidden layer and the depth of the hidden layer. Therefore, the influence of these two factors on facial expression recognition is shown in Table 3.
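For readers who want to reproduce the setup, the stated hyperparameters and the 8 : 2 split can be collected in a small sketch. The dictionary keys and function names are hypothetical, and the text does not say how the validation set is carved out of the 8 : 2 partition, so the sketch only separates a training part from a held-out part.

```python
import random

# Values taken from the training details; the key names are illustrative.
CONFIG = {
    "input_size": (48, 48),
    "learning_rate": 0.001,
    "optimizer": "adam",
    "batch_size": 30,
    "iterations": 5000,
    "train_log_every": 100,
    "test_log_every": 400,
}

def split_8_2(n_samples, seed=0):
    """Shuffle sample indices and split them 8 : 2 (training : held-out)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = n_samples * 8 // 10
    return indices[:cut], indices[cut:]

def log_steps(config):
    """Iterations at which training and test metrics would be printed."""
    steps = range(1, config["iterations"] + 1)
    train = [s for s in steps if s % config["train_log_every"] == 0]
    test = [s for s in steps if s % config["test_log_every"] == 0]
    return train, test

train_idx, held_out_idx = split_8_2(35887)   # size of the Fer2013 data set
train_steps, test_steps = log_steps(CONFIG)
```

Fixing the shuffle seed keeps the partition reproducible across runs, which matters when results are averaged over multiple experiments.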

Expression Recognition Results.
We conducted two types of experiments. First, we studied the influence of the LSTM parameters on the facial expression features extracted by the model. Then, we compared the variants of ACNN-ALSTM with one another and with related work.

The Influence of LSTM Model Parameters on Facial Expression Recognition.
The paper considers two essential factors that affect the feature extraction of LSTM: the number of network layers and the number of hidden layer nodes. The classification results of differently constructed LSTM networks on the Fer2013 and the processed CK+ data sets are shown in Figure 3. According to the experimental results, the classification accuracy continues to increase as the number of nodes increases. At the same time, the abstract features that the LSTM network learns from the image data become more discriminative. However, the gain in recognition rate from increasing the number of hidden layer nodes has an upper limit. From Figure 3, it can be found that the effect is the same when the hidden layer nodes are set to 512 and 1024. In addition, increasing the number of network layers can improve the facial expression recognition effect of the model. When the LSTM model is a two-layer network and the number of hidden layer nodes is 512-512, the classification accuracy of the model on the Fer2013 and CK+ data sets is relatively high.

Comparison of Facial Expression Recognition: CNN-LSTM Model and Its Various Variants.
These models are trained on the Fer2013 and the processed CK+ training sets. Table 4 shows the average test results obtained through multiple experiments on the test set. In the table, ACNN-LSTM denotes the model in which the local features are fused with the attention mechanism, CNN-ALSTM denotes the model in which the global features are fused with the attention mechanism, and ACNN-ALSTM denotes the model in which both the local and the global features are fused with the attention mechanism.
Fusing the attention mechanism with both the local and the global features gives better results than fusing it with only the local features or only the global features. At the same time, fusing the attention mechanism with the global features is more effective than fusing it with the local features. These results show that adding an attention mechanism to the CNN-LSTM model can highlight essential features and improve the classification effect.

Comparison of Facial Expression Recognition: ACNN-ALSTM and Related Work.
Although some results cannot be directly compared due to different experimental setups and different preprocessing methods, Table 5 demonstrates that the proposed method yields a feasible and promising recognition rate for static facial images.

Discussion
The efficiency and accuracy of recognition can be improved by making sufficient use of the key parts of a facial expression. However, existing algorithms leave room for better use of these key parts. Thus, the CNN-ALSTM and the ACNN-ALSTM, which incorporate an attention mechanism, are proposed in this paper. Simulation results indicate that the proposed algorithm further improves the recognition rate compared with other researchers' work, even though both their algorithms and ours incorporate attention mechanisms. At the same time, this work demonstrates the effectiveness of attention mechanisms for exploiting key information about facial expressions.

Conclusion
In this work, a novel ACNN-ALSTM model for facial expression recognition that incorporates a two-layer attention mechanism is proposed. Comparative experimental results indicate that introducing the two-layer attention mechanism improves the classification performance of the system over the CNN-LSTM model and its variants. We also investigate the effect of the number of network layers and the number of nodes in the hidden layers of the LSTM model on expression recognition. The results indicate that appropriately increasing the network depth and the number of hidden layer nodes improves the recognition accuracy of the LSTM model. In future work, we want to train the network for extreme conditions such as dark, light, and occlusion [25,26].

Conflicts of Interest
The authors declare that they have no conflicts of interest.