Facial Expression Recognition Method Combined with Attention Mechanism

To address the slow speed and low accuracy of traditional facial expression recognition, a new method combining the attention mechanism is proposed. Firstly, group convolution is used to reduce network parameters: the channels of traditional convolution are grouped to cut off redundant connections, so that the number of parameters decreases significantly. Secondly, the ERFNet network model is improved by combining the asymmetric residual module and the weak bottleneck module to improve the running speed and reduce the loss of accuracy. Finally, the attention mechanism is added to the feature extraction network to improve the recognition precision. Experiments show that, compared with traditional face recognition methods, the proposed method significantly improves recognition precision and recall; on the CK+, JAFFE, and Fer2013 datasets, the recognition precision reaches 88.81%, 82.16%, and 79.33%, respectively.


Introduction
Facial expressions convey clues to a person's emotional state and, together with voice, language, hand gestures, and body posture, compose the basic communication system between human beings in the social environment [1]. In different environments, facial expressions serve many communicative functions. By sending and receiving these signals in turn, people can regulate dialogue, convey biological information, express the intensity of mental effort, and express emotions [2,3]. In the process of information exchange between people, facial expressions play an indispensable role. Building a system that can automatically analyze facial expressions therefore has important practical significance in fields such as medical care, education, and driverless cars [4]. For example, judging a patient's degree of pain from facial expressions can effectively assist doctors in diagnosis [5][6][7]. Detecting facial expressions to judge the degree of distraction can help teachers improve students' learning efficiency [8][9][10]. Monitoring drivers' facial expressions can reveal fatigue in advance [11][12][13]. Therefore, the academic community has explored various facial expression recognition systems to encode and recognize facial expression information. There are two main methods of facial expression measurement: information judgment and indication judgment [14]. Information judgment studies the meaning conveyed by facial expressions (such as happy, angry, or sad). Indication judgment studies the physical signals used to convey information (such as raised cheeks or sunken lips) [15][16][17]. The main disadvantage of the information judgment method is that it cannot explain the full range of facial expressions. It usually assumes that facial expressions and target behaviors (such as expression categories) have a clear many-to-one correspondence, but according to psychological research, this is not the case in reality.
Generally speaking, the relationship between information and its related categories is not universal. The facial display and its interpretation vary from person to person and even from situation to situation [18][19][20].
In recent years, deep learning has developed rapidly. Many network architectures, such as VGG [21], AlexNet [22], and ResNet [23], have been widely used in facial expression recognition. Unlike AlexNet, the VGG network deepens the network and uses small 3 × 3 convolution kernels, thereby reducing the number of parameters. He et al. [23] proposed ResNet, which uses identity paths to further increase the number of network layers without causing gradient explosion. Cheng and Zhou [24] used transfer learning to overcome the lack of training samples; on the basis of VGG19, the network structure and parameters were optimized to improve the precision. Zhong et al. [25] introduced a dropout layer on the basis of ResNet and modified the fully connected (FC) layer to reduce parameters. At the same time, the SE block [26] was added to the network to achieve higher accuracy. Chen and Hu [27] proposed a new method for learning interclass relationships, which extracts features from two different expression images, merges the extracted features at a random ratio to obtain a mixed feature, redistributes the weight to each pixel, and finally improves the feature distribution through the relationship between expressions. In this way, the discriminative ability of the algorithm is increased. Wang and Shen [28] combined global features with regional features in a deep convolutional network. In addition, a facial action unit and Bayesian network model was established to analyze the probability of each action unit. Finally, the learned features were integrated for expression classification. Xu and Zhao [29] used two branches to extract features: one branch is the local binary pattern (LBP), and the other is a convolutional neural network. The two features were then merged, and principal component analysis (PCA) was used to reduce the feature dimensionality, which effectively improves the accuracy of facial expression recognition.
Many existing studies focus only on the entire face image. However, the texture distortion caused by an expression is most obvious at the key points. The eyes, mouth, and left and right cheeks are the areas where facial movements are most obvious when expressions occur. Focusing only on the overall features usually leads to unnecessary calculations. Based on the main problems and research status of facial expression analysis, the main work of this paper is as follows: (1) Firstly, group convolution is used to reduce network parameters. The traditional convolutional channels are grouped, and redundant connections in the network are cut, so the number of parameters is greatly reduced. (2) Secondly, the ERFNet model is improved by combining the asymmetric residual module and the weak bottleneck module, which improves the running speed and reduces the loss of accuracy. (3) Finally, an attention mechanism is added to the feature extraction network to improve the segmentation accuracy of the network.
This paper first gives the general technical steps of face recognition, introduces each step, and then analyzes the convolutional neural network most commonly used in face recognition technology. Secondly, the network structure design of this paper is given. The network parameters are reduced by group convolution, the running speed of the proposed method is improved by the improved ERFNet network model, and the attention mechanism is introduced to improve the recognition accuracy of the network. Finally, the proposed expression recognition method is validated experimentally, and the experimental results are analyzed visually in terms of training process, precision, and recall.

Technical Steps.
As shown in Figure 1, face recognition technology includes four steps: (1) face detection, (2) face alignment, (3) face characterization, and (4) face matching. Face detection refers to finding the position of the face in the picture. If there is a face, a rectangular frame containing the face is returned. Face alignment refers to automatically locating the key feature points of the face. Face characterization refers to converting the face into a feature vector. Face matching refers to comparing the features from the previous step with the features in the database. Finally, the distance between the features of different faces is compared. If the distance is less than a certain threshold, the two are regarded as the same face; otherwise, they are regarded as different faces.

Face Detection and Alignment.
This step first determines whether there is a human face in the picture. The picture is preprocessed to eliminate the influence of factors such as illumination and shooting angle. Here, the active shape model (ASM) algorithm and the active appearance model (AAM) algorithm are often used for face key point detection. Their advantage is that their structures are clear and easy to understand and apply. However, their computational efficiency is low, so they are not suitable for massive sets of face images.

Face Characterization.
The faces detected in the above steps are processed to obtain unique feature vectors. Such feature vectors often contain position information of the eyebrows, eyes, nose, and mouth. In addition, information including the contour, size, and shape of the face may also be added to the feature vectors. The Histogram of Oriented Gradients (HOG) algorithm of Dalal and Triggs, the eigenface algorithm, and the Haar wavelet algorithm are classic methods.

Face Matching.
The extracted feature vector is compared with the face vectors in the database. If the similarity between the feature vectors is sufficiently high, the identity information corresponding to the face is output. If no face in the database matches, the face is reported as unrecognized.
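The matching step can be illustrated with a minimal sketch. It assumes Euclidean distance as the dissimilarity measure and a hand-picked threshold; the function names, the toy database, and the threshold value are all hypothetical and are not taken from the paper.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_face(query, database, threshold):
    """Return the identity of the closest database vector, or None.

    A match is accepted only if the smallest distance is below `threshold`.
    """
    best_name, best_dist = None, float("inf")
    for name, vector in database.items():
        dist = euclidean_distance(query, vector)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else None


# Toy database with three-dimensional feature vectors (illustrative values).
database = {"alice": [0.1, 0.9, 0.3], "bob": [0.8, 0.2, 0.5]}
near_alice = match_face([0.12, 0.88, 0.31], database, threshold=0.5)
unknown = match_face([5.0, 5.0, 5.0], database, threshold=0.5)
print(near_alice, unknown)
```

In practice the threshold trades false accepts against false rejects and would be tuned on held-out data.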

Convolutional Neural Network.
At the beginning of the 21st century, several studies in the facial expression recognition literature found that CNNs perform better than multilayer perceptrons and can handle translation, rotation, and scale variations in facial expression recognition. The convolutional layer scans the entire input image with learnable filters and generates various specific activation feature maps. The convolutional layer is one of the most important modules of a convolutional neural network and can greatly increase the calculation rate.
The parameters of the convolutional layer consist of a series of learnable receptive fields. Although each receptive field is very small, it extends through the full depth of the input. In the forward calculation, each receptive field is slid over the input, and a dot product between the receptive field and the input is computed at each position. When a receptive field of the network receives a specific form of visual feature, such as an oriented edge, a spot, or a colored edge, it is activated, and the generated feature map has a high response. In some deeper networks, even entire honeycomb or wheel patterns can be detected. In order to obtain deeper feature maps, multiple convolutional layers are usually stacked. When processing high-dimensional and more complex input images, the large number of neurons makes it very difficult to connect all neurons between adjacent layers, so local connections are used. The connections are local along the height and width, while full connection is still used along the depth of the input. The convolution operation has three main characteristics: (1) Local connection, which uses the correlation between adjacent pixels to share weights and greatly reduces the number of learned parameters. (2) Translation invariance, which gives a good ability to recognize facial expressions in different positions. The convolutional layer is followed by the pooling layer, which greatly reduces the spatial size of the feature map and the computational cost of the network. (3) Fully connected layer. The neurons of adjacent layers are connected in pairs. The FC layer acts as a "classifier" in the convolutional neural network. The activation function layer, pooling layer, convolutional layer, and other operations turn data into abstract feature representations, and the FC layer then classifies the learned features. As the network grows, the parameters of the FC layer increase exponentially and can even account for 80% of the entire network.
This slows down training. All neurons in the FC layer are connected to the previous layer, which converts the two-dimensional feature map into a one-dimensional feature vector for feature representation and classification.
In a CNN, it is very common to insert a pooling layer between two adjacent convolutional layers, because downsampling helps control overfitting. The pooling operation reduces the spatial size of each depth slice of the input. The most common pooling layer uses a 2 × 2 receptive field with a stride of 2. Through this dimensionality reduction of each layer's input, 75% of the activation values are discarded.
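The 75% figure follows directly from keeping one value out of every 2 × 2 block. The sketch below illustrates this with max pooling on a single feature map stored as a nested list; the choice of max pooling (rather than average pooling) and the function name are assumptions made for illustration.

```python
def max_pool_2x2(feature_map):
    """2 x 2 max pooling with stride 2 on one feature map (a list of rows).

    Assumes the height and width are even.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c],
                feature_map[r][c + 1],
                feature_map[r + 1][c],
                feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 6],
]
pooled = max_pool_2x2(feature_map)  # [[4, 5], [3, 7]]

values_in = len(feature_map) * len(feature_map[0])
values_out = len(pooled) * len(pooled[0])
discarded_fraction = 1 - values_out / values_in  # 0.75
print(pooled, discarded_fraction)
```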

Design of Network Structure
Group Convolution.
Traditional neural networks mix spatial and channel features after the convolution operation. Group convolution can separate this mixture and make the network pay attention to the feature relationships between different channels. As shown in Figure 2, compared with traditional networks, group convolution effectively reduces the number of parameters and greatly improves the segmentation speed, which helps to realize real-time segmentation.
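The parameter saving of group convolution can be checked with a short calculation. The sketch below counts the weights of a convolutional layer with and without grouping, ignoring bias terms; the channel counts, kernel size, and group number are illustrative values, not settings reported in the paper.

```python
def conv_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of weights in a 2D convolution layer, ignoring biases.

    With `groups` groups, each output channel only connects to
    in_channels / groups input channels.
    """
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (in_channels // groups) * out_channels * kernel_size * kernel_size


standard = conv_weight_count(64, 128, 3)           # 73728
grouped = conv_weight_count(64, 128, 3, groups=4)  # 18432
print(standard, grouped, standard // grouped)
```

With g groups the weight count drops by a factor of g, at the cost of removing the connections between channels in different groups.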

Improved ERFNet.
At present, several existing semantic segmentation networks, such as SegNet [30] and FCN [31], suffer from a large number of parameters and a long processing time. Although the lightweight real-time semantic segmentation network ENet [32] can solve this problem, its accuracy is unsatisfactory. ERFNet [33] effectively solves the problem of segmentation accuracy, but its residual module has a large number of parameters, which reduces the calculation speed of the model. Therefore, based on ERFNet, this paper uses asymmetric convolution modules to improve running efficiency. The asymmetric convolution module decomposes the d × d convolution kernel into d × 1 and 1 × d convolution kernels. In the convolution operation, the number of parameters per kernel is thereby reduced from d² to 2d, which removes a large number of parameters.
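The reduction from d² to 2d weights per kernel can be tabulated for a few kernel sizes. This is only an arithmetic sketch of the statement above; the function names and the chosen values of d are illustrative.

```python
def full_kernel_weights(d):
    """Weights in a single d x d convolution kernel."""
    return d * d


def asymmetric_kernel_weights(d):
    """Weights in a d x 1 kernel followed by a 1 x d kernel."""
    return 2 * d


savings = {}
for d in (3, 5, 7):
    full = full_kernel_weights(d)
    asym = asymmetric_kernel_weights(d)
    savings[d] = 1 - asym / full  # fraction of weights removed
    print(f"d={d}: {full} -> {asym} weights")
```

The saving grows with the kernel size: one third of the weights for d = 3 and 60% for d = 5.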
The network model adopts an end-to-end encoder-decoder structure, as shown in Figure 3. The network model is mainly composed of the downsampling block, the asymmetric residual block (ARB), the weak bottleneck block (non-bottleneck-1D), and the upsampling block.
In this model, modules 1 to 16 form the encoder, and modules 17 to 23 form the decoder. The encoder outputs low-resolution feature maps over several channels, and the decoder upsamples these feature maps to recover the initial output resolution. The specific network settings of ERFNet follow [34], as shown in Table 1.
Modules 1, 2, and 8 are downsampling modules. Since the initial input image is large and contains much redundant information, downsampling can significantly reduce the size of the feature maps and the computational complexity. However, frequent downsampling also causes a loss of precision in image edge segmentation and increases the cost of upsampling. To balance running speed and segmentation accuracy, this paper performs only three downsampling operations. The residual module and the bottleneck module are two basic structures proposed by Nestor et al. [17]. The network in this paper is built by stacking improved versions of them. To address the low efficiency of the residual module, this paper redesigns it with asymmetric convolution, which yields the asymmetric residual module and the weak bottleneck module. When the feature map to be calculated is large, using the asymmetric residual module and the weak bottleneck module effectively improves the running speed of the model. In addition, decomposing the convolutional layer into a combination of 1D filters effectively saves calculation cost.
Modules 17 and 20 are upsampling modules, which are used to alleviate the loss of spatial information and reduce the loss of image precision. In this paper, deconvolution is used for upsampling. Unlike traditional interpolation algorithms, deconvolution has weights that can be learned.

Attention Mechanism.
To improve the segmentation speed of the network, this paper uses an improved ERFNet as the feature extraction network. However, ERFNet incurs a certain loss of precision, so an attention mechanism is added to the network. The attention module considers the interdependence between the feature channels. Through the self-learning of the network, the weights of useless features are suppressed and the weights of useful features are enhanced, which improves the representational ability of the model. The operation of the attention module is divided into two main steps, squeeze and excitation, as shown in Figure 4.
In the attention module, global average pooling is used to shrink the input data from the previous layer. The feature map shrinks from M × N × C to 1 × 1 × C. The specific formula is as follows:

$$z_c = F_{GAP}(u_c) = \frac{1}{M \times N} \sum_{i=1}^{M} \sum_{j=1}^{N} u_c(i, j) \quad (1)$$

where F_GAP represents the global average pooling function, u_c is the c-th channel of the input feature map, and z_c is the global feature of that channel. This paper uses the Atrous Spatial Pyramid Pooling (ASPP) [35] algorithm as the global average pooling function. ASPP provides an effective mechanism to control the size of the receptive field. Meanwhile, ASPP finds the best balance between precise positioning (small field of view) and information restoration (large field of view).
In the attention module, the global features on each channel are obtained by formula (1). Then, the module performs an excitation operation to obtain the correlation between different channels. This step is mainly completed by two fully connected layers. To reduce the number of parameters, the first fully connected layer compresses the C channels into C/r channels, where r is the compression rate, as shown in formula (2). Then, the ReLU activation function increases the nonlinearity of the features. The second fully connected layer restores the C channels, and the Sigmoid activation function is used to generate a different weight for each feature channel. After this, the network is better able to distinguish the features of each channel. Useful features are enhanced and useless features are suppressed by multiplying each channel by its weight, where the weights are computed as

$$s = \sigma(W_2 \, \delta(W_1 z)) \quad (2)$$

where σ and δ represent the Sigmoid and ReLU activation functions, respectively, z is the vector of channel-wise global features from formula (1), and W_1 and W_2 are the weights of the first and second fully connected layers.
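The squeeze and excitation steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the squeeze step uses ordinary global average pooling rather than ASPP, bias terms are omitted, and the feature maps and weight matrices `w1` and `w2` are small made-up values with C = 4 channels and compression rate r = 2.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def squeeze(feature_maps):
    """Global average pooling: one scalar per channel (M x N x C -> C)."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]


def dense(vector, weights):
    """Fully connected layer without bias; each row of `weights` gives one output."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def excite(z, w1, w2):
    """Two FC layers: C -> C/r with ReLU, then C/r -> C with Sigmoid."""
    hidden = [relu(h) for h in dense(z, w1)]
    return [sigmoid(s) for s in dense(hidden, w2)]


def se_block(feature_maps, w1, w2):
    """Rescale every channel by its learned weight in (0, 1)."""
    weights = excite(squeeze(feature_maps), w1, w2)
    return [
        [[s * value for value in row] for row in channel]
        for s, channel in zip(weights, feature_maps)
    ]


# Four 2 x 2 channels (C = 4) and hypothetical weights for r = 2.
x = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.0, 2.0], [2.0, 2.0]],
    [[-1.0, 1.0], [-1.0, 1.0]],
]
w1 = [[0.5, -0.5, 0.25, 0.0], [-0.25, 0.5, 0.5, 0.1]]  # shape (C/r) x C
w2 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0], [0.0, 0.0]]  # shape C x (C/r)

z = squeeze(x)
s = excite(z, w1, w2)
y = se_block(x, w1, w2)
print(z, s)
```

The output `y` has the same shape as the input; only the relative scale of the channels changes.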

Network Parameter Setting.
The experimental model is based on the TensorFlow 1.9 framework and uses the cuDNN 7.5 kernel for calculation. The workstation is configured with a GTX 2080Ti graphics card and 512 GB of memory. Stochastic gradient descent was used as the optimization method. Its momentum parameter was set to 0.99, and the weight decay rate was 1 × 10⁻⁴. The initial learning rate was 8 × 10⁻³. In the experiment, the image size was set to 384 × 384 pixels.
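To make the role of these hyperparameters concrete, the following sketch performs a single update of stochastic gradient descent with momentum and L2 weight decay, using the values stated above as defaults. The update rule shown is one common formulation and is an assumption; the optimizer actually used in the experiments may combine momentum and weight decay differently.

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=8e-3, momentum=0.99, weight_decay=1e-4):
    """One SGD step with momentum and L2 weight decay (assumed formulation).

    v <- momentum * v - lr * (g + weight_decay * w)
    w <- w + v
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g_total = g + weight_decay * w        # add the weight decay term
        v_next = momentum * v - lr * g_total  # update the velocity
        new_velocity.append(v_next)
        new_weights.append(w + v_next)
    return new_weights, new_velocity


# Toy example with a single weight and a made-up gradient.
w, v = sgd_momentum_step([1.0], [0.5], [0.0])
print(w, v)
```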

Evaluation Index.
The loss value is a commonly used indicator for evaluating a model; it estimates the degree of inconsistency between the true value and the predicted value. The lower the loss value, the better the robustness of the model. The cross-entropy loss function was used to compute the loss value, as given in formula (3), where y represents the true classification value, a represents the predicted value, and c represents the loss value.

The precision P is expressed as follows:

$$P = \frac{TP}{TP + FP}$$

The recall R is expressed as follows:

$$R = \frac{TP}{TP + FN}$$

where TP represents the number of actual positive samples among all predicted positive samples, FP represents the number of actual negative samples among all predicted positive samples, and FN represents the number of actual positive samples that were mistaken for negative samples.

As shown in Figure 5, when the number of iterations reached 20, the accuracy on the training set tended to be stable, and the precision on the training set was around 90%. Figure 6 shows that the loss values of the validation set and the training set decreased rapidly in the first 20 iterations. After the 25th iteration, the loss value of the model dropped to a very low level and stabilized, with no further significant change.
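The evaluation indices of this section can be sketched as small helper functions. Precision and recall follow the definitions of TP, FP, and FN given above. The cross-entropy function assumes the categorical form c = −Σ y ln a over the classes of one sample, with y a one-hot label; this form is an assumption made for illustration, and all sample values are made up.

```python
import math


def precision(tp, fp):
    """P = TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN); returns 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy for one sample (assumed form).

    `y_true` is a one-hot label and `y_pred` a probability distribution;
    `eps` guards against log(0).
    """
    return -sum(y * math.log(max(a, eps)) for y, a in zip(y_true, y_pred))


p = precision(tp=8, fp=2)  # 0.8
r = recall(tp=8, fn=8)     # 0.5
loss = cross_entropy([0, 1, 0], [0.1, 0.8, 0.1])  # equals -ln(0.8)
print(p, r, loss)
```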

CK+ Dataset Experiment.
This paper used the CK+ facial expression dataset for experimental verification. The CK+ dataset contains 123 subjects, 593 expression sequences, and 951 image samples. Among the 593 image sequences, 327 have emotion labels. The emotion labels are happy, sad, angry, fear, surprised, disgust, and neutral. Some images from the dataset and their ground-truth category annotations are shown in Figure 7. To prevent overfitting caused by the small number of samples, this paper used data augmentation to expand the experimental dataset. By applying operations such as rotation, horizontal flipping, and vertical flipping to the facial expression images, the dataset was expanded to 12,000 images.
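The augmentation operations named above can be sketched on an image stored as a nested list of pixel values. The rotation angle is not stated in the text, so a 90° clockwise rotation is assumed here purely for illustration; the function names are hypothetical.

```python
def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Mirror the image top to bottom."""
    return [row[:] for row in image[::-1]]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise (assumed angle)."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image):
    """Return the original image together with its augmented variants."""
    return [
        image,
        horizontal_flip(image),
        vertical_flip(image),
        rotate_90_clockwise(image),
    ]


image = [[1, 2], [3, 4]]
variants = augment(image)
print(variants)
```

Each source image yields several training samples, which is how a small dataset can be expanded without collecting new data.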
To test the effect of the model, 9000 images were randomly selected from the expanded dataset as the training set, 1500 images as the test set, and 1500 images as the validation set. Figure 8 shows that the method in this paper achieves high precision. The results are 94.44%, 87.5%, 93.7%, and 93.08% for the four expressions happy, angry, surprised, and disgust. The three expressions sad, fear, and neutral had relatively lower precision values of 78.82%, 71.01%, and 66.67%, respectively. In the experiment, the improved method was more likely to misclassify sad expressions as fear expressions, which may be due to the similarity of the lips. The probability of misclassifying the fear expression as surprised was also higher, mainly because the fear and surprised expressions are highly similar at the level of organ features such as the eyes and nose. Neutral expressions were more likely to be misclassified as happy, disgust, or sad, possibly because some expression changes were not obvious.
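The random 9000/1500/1500 split can be sketched as follows. The fixed seed and the use of integer indices in place of image files are assumptions made so that the example is reproducible and self-contained.

```python
import random


def split_dataset(items, n_train, n_test, n_val, seed=0):
    """Shuffle `items` and split them into train, test, and validation sets."""
    if n_train + n_test + n_val != len(items):
        raise ValueError("split sizes must add up to the dataset size")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]
    return train, test, val


# Indices 0..11999 stand in for the 12,000 augmented images.
train, test, val = split_dataset(range(12000), 9000, 1500, 1500)
print(len(train), len(test), len(val))
```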

JAFFE Dataset Experiment.
The team of Michael Lyons created a database of Japanese women's facial expressions in the Department of Psychology at Kyushu University. The database contains a total of 10 participants, each of whom posed seven expressions: happy, sad, angry, disgust, surprised, fear, and neutral.
The JAFFE dataset consists of 213 facial images. In the experimental environment, participants made the corresponding facial expressions as requested by the collectors, who then captured the expressions with a camera. All images were collected under the same frontal illumination. The collectors re-cropped and adjusted the initially collected images so that the image sizes were basically the same, at 256 × 256 pixels. The position of the eyes in the images was also roughly the same.

Fer2013 Dataset Experiment.
The Fer2013 dataset consists of 35,886 facial expression images, including 28,708 training images, 3589 public validation images, and 3589 private validation images. The Fer2013 dataset also has 7 expressions. The dataset does not contain image files directly; the data are stored in csv files. After 10,000 iterations of training, the precision of the method in this paper on the Fer2013 dataset was 79.33%. Table 3 shows that the model had the highest precision for the sad expression, at 87.90%, and the lowest recognition precision for the disgust expression, at 65.08%. It is conjectured that the model extracted too few feature map parameters, so some expression features were ignored, which reduced the classification effect. The comparison results with traditional deep learning models on the Fer2013 dataset are shown in Table 4. Among them, ResNet [17] used the principle of

Table 3 (rows: actual expression; columns: predicted expression)

Actual      Angry  Disgust  Fear   Sad  Happy  Surprised  Neutral
Angry         711       12    58    32     76         16       53
Disgust        17       82     4     2      3          1        2
Fear           35       22   744    48     54         63       58
Happy          67        2    58    62    942         18       98
Sad            29        3    33  1555     67         29       58
Surprised      14        3    33    26     12        725       18
Neutral        41        2    35    44    156         20      935
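The reported Fer2013 figures can be checked against the confusion matrix values listed above. The sketch below assumes that rows are actual expressions and columns are predicted expressions, and that the columns are ordered Angry, Disgust, Fear, Sad, Happy, Surprised, Neutral. This column order is an assumption: it is the ordering under which the matrix reproduces the precision values stated in the text.

```python
ROW_LABELS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprised", "Neutral"]
# Assumed column order (not stated explicitly in the source).
COL_LABELS = ["Angry", "Disgust", "Fear", "Sad", "Happy", "Surprised", "Neutral"]

MATRIX = [
    [711, 12, 58, 32, 76, 16, 53],
    [17, 82, 4, 2, 3, 1, 2],
    [35, 22, 744, 48, 54, 63, 58],
    [67, 2, 58, 62, 942, 18, 98],
    [29, 3, 33, 1555, 67, 29, 58],
    [14, 3, 33, 26, 12, 725, 18],
    [41, 2, 35, 44, 156, 20, 935],
]


def true_positives(label):
    """Count of samples of class `label` that were predicted as `label`."""
    return MATRIX[ROW_LABELS.index(label)][COL_LABELS.index(label)]


def class_precision(label):
    """TP divided by the number of samples predicted as `label` (column sum)."""
    col = COL_LABELS.index(label)
    predicted = sum(row[col] for row in MATRIX)
    return true_positives(label) / predicted


def overall_precision():
    """Correctly classified samples divided by all samples."""
    correct = sum(true_positives(label) for label in ROW_LABELS)
    total = sum(sum(row) for row in MATRIX)
    return correct / total


print(f"Sad:     {100 * class_precision('Sad'):.2f}%")      # 87.90%
print(f"Disgust: {100 * class_precision('Disgust'):.2f}%")  # 65.08%
print(f"Overall: {100 * overall_precision():.2f}%")         # 79.33%
```

Under this reading, the matrix is consistent with the 87.90%, 65.08%, and 79.33% values quoted in the text.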

Conclusion
This paper proposes a facial expression recognition method combined with the attention mechanism. Group convolution is used to reduce network parameters. The improved ERFNet model is used to improve the running speed of the algorithm. The attention module is used in the feature extraction network to improve the recognition precision. Experiments show that this method improves the recognition precision. However, the model still has room for improvement in the recognition of fear and sad expressions. It is necessary to subdivide and extract facial features to improve the recognition accuracy.

Data Availability
The data used to support the findings of the study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.