Facial Expression Recognition Algorithm Based on Fusion of Transformed Multilevel Features and Improved Weighted Voting SVM

To address the shortcomings of traditional facial expression recognition (FER), which uses only a single feature and achieves a low recognition rate, a FER method based on fusion of transformed multilevel features and improved weighted voting SVM (FTMS) is proposed. The algorithm combines transformed traditional shallow features with convolutional neural network (CNN) deep semantic features and uses an improved weighted voting method to make a comprehensive decision on the results of the four trained SVM classifiers to obtain the final recognition result. The shallow features include local Gabor features, LBP features, and the joint geometric features designed in this study, which are composed of distance and deformation characteristics. The deep feature is the multilayer CNN feature fusion proposed in this study. Because facial expressions are poorly distinguished from one another, this study also proposes replacing Softmax with a better-performing SVM classifier on top of the CNN. Experiments on the FERPlus database show that the recognition rate of this method is 17.2% higher than that of the traditional CNN, which demonstrates the effectiveness of fusing multilayer convolutional features with SVM. FTMS-based facial expression recognition experiments are carried out on the JAFFE and CK+ datasets. Experimental results show that, compared with a single feature, the proposed algorithm has a higher recognition rate and robustness and makes full use of the advantages and characteristics of different features.


Introduction
FER refers to the use of computers to analyze human facial expressions and judge human psychology and emotions through pattern recognition and machine learning algorithms, thereby achieving intelligent human-computer interaction [1]. Traditional FER methods generally include three steps: face detection, feature extraction, and expression recognition [2,3]. The most important part is feature extraction, which directly affects the final recognition result.
Texture features commonly used in FER include Gabor and LBP. The Gabor filter has the same characteristics as the receptive field of visual cells and can analyze subtle changes in images at multiple scales and directions [4]. Texture features capture detailed expression features, but they are not robust. The relationship between geometric features and expression changes is more direct, easier to understand and analyze, and more robust under certain lighting conditions. However, the ability of geometric features to describe local expression information is weak, and the error is large. Traditional hand-designed shallow features can no longer adapt well to the various real-world interference factors that are unrelated to expression. Deep CNNs can mine the deep, latent distributed representations of data and are very effective when deeper layers are used to learn features with high-level abstractions [15,16].
In recent years, CNN [17][18][19] has been widely used in FER. A CNN maps the image layer by layer, and the output of the final mapping is the result of feature extraction. A traditional CNN usually uses only the features of the last convolutional layer for image classification. However, the features extracted from the intermediate convolutional layers also contain information and have a certain expressive power for the image [20][21][22]. Rashid [23] proposed a sustainable deep learning architecture for accurate object classification, which utilizes the fusion and selection of multilevel deep features. Ren [24] proposed a CNN-based cosaliency detection model, which consists of two key parts: the integration of multilayer convolutional features extracted from a set of images and the interimage saliency propagation. These results indicate that using the features of the intermediate convolutional layers can improve the feature representation of the image, thereby improving the accuracy of the CNN. In addition, CNNs usually use Softmax for classification, but experiments have shown that Softmax is not well suited to FER because of the low distinction between expressions [25,26]. Currently, many researchers combine the features extracted by a CNN with traditional classifiers and achieve better performance [27][28][29][30]. Liu [31] proposed a multilevel structured hybrid forest (MSHF) for joint head detection and pose estimation, which extends the hybrid framework of classification and regression forests. Touil [32] used convolutional features and an online-trained SVM classifier to detect targets and improve accuracy. Among traditional classifiers, the SVM classifier offers better classification accuracy and robustness. Pham [33] applied five machine learning methods and evaluated their performance using ROC curves and statistical indicators. The experimental results show that the SVM model has the best performance.
Whether a feature is reasonable and effective directly affects the final recognition rate. A single feature usually has deficiencies of some kind and cannot simultaneously meet the requirements of real-time performance, high precision, and robustness. In this study, multiple features are combined through decision-level fusion so that they complement each other's strengths, and a FER algorithm based on FTMS is proposed. For the shallow features, in addition to the simple processing of Gabor and LBP, this study proposes a joint geometric feature design method for facial expressions. For the deep CNN features, this study proposes using multilayer CNN feature fusion. Moreover, the Softmax classification of the traditional CNN is abandoned, and an SVM classifier is used to classify facial expressions. Finally, with the weighted voting method proposed in this study, the four classifiers trained on the four features are fused at the decision level to obtain the final recognition result, and the superiority of the proposed method is verified through experiments. The rest of the study is organized as follows. Section 2 covers basic work related to the rest of the study. Section 3 summarizes our new algorithm and describes feature fusion and the improved weighted voting method. Section 4 provides the experimental results. Section 5 concludes the study.

Related Basic Work
The images in the expression database are subject to interference from various sources, such as light intensity, noise, and size. At the same time, the original expression image also contains nonface parts, such as background, hair, and other redundant information. Therefore, it is necessary to reduce these interferences and eliminate redundant information through preprocessing. Face detection extracts the face part of the image and removes the nonface parts to ensure the effectiveness of subsequent feature extraction [34].
To facilitate unified processing, we use the grayscale formula (1) to process the expression dataset images, where R, G, and B represent the three color channels red, green, and blue. Graying reduces the number of image channels, that is, the data dimension, so that less storage space is occupied and image processing is faster. We use the Viola-Jones model [35] to detect the face in the grayscale image and save it, as shown in Figure 1. Finally, the size of the image after face detection is normalized, and bilinear interpolation is used to scale the image to the uniform resolution of each expression dataset, as shown in Figure 2.
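The graying step can be illustrated with a minimal sketch. Formula (1) is not reproduced in this text, so the channel weights below are an assumption: they are the widely used ITU-R BT.601 luma weights, which may differ from the weights the study actually used. The function name `to_gray` is hypothetical.

```python
import numpy as np

# Assumed weights (ITU-R BT.601); the study's formula (1) may use other values.
BT601_WEIGHTS = np.array([0.299, 0.587, 0.114])

def to_gray(rgb):
    """Collapse an (H, W, 3) RGB image to an (H, W) grayscale image.

    Each output pixel is a weighted sum of its R, G, and B values,
    so the data dimension drops from three channels to one.
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    return rgb @ BT601_WEIGHTS

# A pure red pixel maps to 0.299 * 255 under the assumed weights.
red_pixel = np.array([[[255, 0, 0]]])
gray = to_gray(red_pixel)
```

Because the assumed weights sum to 1, a uniformly white image stays at the same intensity after conversion.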
After face detection, a few nonface regions still remain, and this redundant information would reduce the final recognition rate. Then, feature points are labeled on the normalized facial expression image using an ensemble of regression trees [36]. After calibrating the 68 key feature points of the face, three feature regions (eyes, nose, and mouth) are obtained by cropping, as shown in Figure 3. To obtain the eye region, we find points 17, 19, 24, 26, and 28 near the eyes. Take the abscissa of point 17 as the vertex abscissa x_e = x_17 and the maximum of y_19 and y_24 as the vertex ordinate y_e. With (x_e, y_e) as the vertex, |x_26 − x_17| as the width, and |y_28 − y_e| as the height, draw a rectangle and crop it to obtain the eye region, which is then resampled to 40 × 20. In the same way, the nose and mouth regions are obtained and resampled to 20 × 10 and 40 × 20, respectively, which yields the three feature regions of the eyes, nose, and mouth.

Traditional Shallow Expression Features.
Gabor features are obtained from feature maps computed on important facial feature regions (such as the nose and mouth), which serve as the input images. The feature maps are produced by the 24 Gabor filters whose parameters are closest to those of the receptive field of visual cells.
LBP histogram features are obtained by concatenating 64 small histogram features in sequence. A circular LBP operator with a radius of 2 and 8 points in the neighborhood is used. The LBP feature map is obtained by selecting the uniform LBP mode; it is evenly divided into 64 small blocks, and the histogram feature of each block is extracted.
The joint geometric feature proposed in this study is based on the location of feature points. First, the distance feature between feature points is extracted to represent the overall information; then the deformation feature is extracted to represent the local information; finally, the distance feature is concatenated with the deformation feature. The distance feature represents the overall shape of the face and the distribution of the eyes, nose, and mouth. We directly calculate the geometric distance between all pairs of feature points:
\[ d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \tag{2} \]
where (x_i, y_i) and (x_j, y_j) are the coordinates of feature points i and j.
Figure 4 is a calibrated expression image with 68 feature points. Calculate the 67 distance features d_{1,2}, d_{1,3}, ..., d_{1,68} between feature point 1 and the other 67 feature points. Calculate the 66 distance features d_{2,3}, d_{2,4}, ..., d_{2,68} between feature point 2 and the remaining 66 feature points (the distance to feature point 1 is not computed again, to avoid repetition), and so on. Finally, the distance d_{67,68} between the 67th and the 68th feature points is calculated. If there are n feature points on the image, the number of all distance features is
\[ \frac{n(n-1)}{2} \tag{3} \]
The distance feature vector can be expressed as
\[ D = (d_{1,2}, d_{1,3}, \ldots, d_{67,68}) \tag{4} \]
The changes in the distance and position of the feature points mostly come from the eyebrows, eyes, mouth, and facial contours. In particular, when the mouth is open, the changes in facial contours that it drives cause significant changes in the distance features. Although the dimension of the extracted distance feature is not high, some feature redundancy remains, so we apply principal component analysis (PCA) [37] for dimensionality reduction. The idea is to map high-dimensional data to a low-dimensional space through a projection transformation and to use the minimum mean square error principle to obtain the most representative data.
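The pairwise distance extraction and the PCA step can be sketched as follows. This is a minimal illustration and not the study's implementation: the function names are hypothetical, the Euclidean distance is assumed for the "geometric distance", and the number of retained principal components `k` is not stated in the text, so it is left as a parameter.

```python
import numpy as np

def pairwise_distance_features(points):
    """Return distances d(1,2), d(1,3), ..., d(n-1,n) for n landmark points.

    Only pairs with i < j are kept, so n points give n*(n-1)/2 features.
    """
    pts = np.asarray(points, dtype=np.float64)
    i, j = np.triu_indices(len(pts), k=1)  # all index pairs with i < j
    return np.linalg.norm(pts[i] - pts[j], axis=1)

def pca_reduce(X, k):
    """Project the rows of X (samples x features) onto the top-k principal axes."""
    Xc = X - X.mean(axis=0)                      # center each feature
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T                         # k is an assumed choice

# 68 landmarks give 68 * 67 / 2 = 2278 distance features.
rng = np.random.default_rng(0)
landmarks = rng.uniform(0, 100, size=(68, 2))    # synthetic points for illustration
features = pairwise_distance_features(landmarks)
```

The ordering produced by `np.triu_indices` matches the enumeration in the text: all distances from point 1 first, then the remaining distances from point 2, and so on.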
We use indirect deformation features to characterize the deformation of local details in the eye, nose, and mouth regions. The local deformation of facial features caused by expressions changes the positions of the feature points in these areas. According to the characteristics of facial muscle movement and facial feature deformation, we use linear combinations of distance features between feature points in part of the facial feature area to define nine deformation features. The specific definitions of the nine deformation features are given in Table 1.
After these nine deformation features are obtained, they are concatenated with the direct distance features reduced by PCA to obtain the joint geometric features. The joint geometric features represent facial expressions from two aspects. At the overall level, the distance features describe the relative positional relationship of important feature points. At the local level, the deformation features describe the deformation of facial features caused by changes in expression.

Deep CNN Expression Features.
In a CNN, different convolution kernels have different sizes and receptive fields. A CNN can be regarded as a combination of a feature extractor and a classifier. From the perspective of the mapping performed by its layers, it resembles a feature extraction process: features of different levels are extracted through successive mappings and are finally mapped to several labels, which gives the network its classification function, as shown in Figure 5.
This research uses VGG-16 for feature extraction. After the convolutional layers are visualized through feature maps [38], the feature map of each channel can be obtained, and the channels are fused at a ratio of 1 : 1 to obtain the fused feature map. Figure 6 shows the convolutional layer maps after channel fusion. Through the visualization of the feature maps, it can be seen that the shallow features tend to detect the edges of the image and that the detected content is more comprehensive. As the hierarchy deepens, the feature map becomes more abstract, and the resolution of the image becomes smaller and smaller. In contrast, the deeper the layer, the more representative the extracted features. The traditional VGG model trained on ImageNet uses only the features derived from the last convolutional layer, that is, the output vector of the last fully connected layer FC3 before Softmax classification. But the intermediate feature information also has a certain expressive power for images.
This study proposes to fuse the features of the subdeep convolutional layer conv5_2 of the CNN with the features of the deepest convolutional layer conv5_3. Selecting subdeep features ensures that deeper features can be obtained while the original features remain relatively complete. The deeper the layer, the higher the level of semantic information in the extracted features and the richer that semantic information. In addition, Softmax is not well suited to FER because of the low discrimination between facial expressions. In this study, a better-performing SVM classifier is selected to improve the accuracy of recognition and the generalization ability of the model. The approach relies on the powerful learning ability of the CNN to learn deep feature representations and then uses the SVM for expression recognition.
This study established a multilayer CNN structure, as shown in Table 2. The output feature vectors of the VGG subdeep convolutional layer conv5_2 and the deepest layer conv5_3 are fused and sent to the network for training; the feature vector of conv8 is then extracted and sent to the SVM classifier for classification training. The multilayer CNN designed in this study does not use traditional pooling for downsampling; instead, it downsamples with convolutional layers, which can strengthen the learning ability of the network.

Table 1: The specific definitions of the nine deformation features.
Location | Calculation method
Curvature of the left eyebrow | (dis(18, line(17, 21)) + dis(19, line(17, 21)) + dis(20, line(17, 21))) / (3 × dis(17, 21))
Curvature of the right eyebrow |

The loss function of the SVM is given as formula (5):
\[ L_i = \sum_{j \neq y_i} \max(0, \omega_j x_i - \omega_{y_i} x_i + \Delta) \tag{5} \]
For a good model, the score of the correct category should be higher than the scores of the wrong categories; how much higher is set by the threshold Δ, which we choose. If the margin exceeds the threshold, we consider the correct category well distinguished from that wrong category and assign zero loss for the pair. Conversely, if a wrong category has a higher score than the correct category, the model distinguishes the two categories poorly.
Here, y_i is the label corresponding to sample x_i, j indexes a category, ω_j x_i is the score of a wrong category, and ω_{y_i} x_i is the score of the correct category. The network structure designed in this study is shown in Figure 7. The facial expression image is input to the VGG network for feature extraction. The subdeep feature vector and the deepest convolutional layer feature vector are extracted and then fused. The fused feature vectors are used as the input of the multilayer CNN established in this study (Table 2). The feature vectors of the conv8 layer are extracted and sent to the SVM classifier.
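The margin-based SVM loss described in this section can be sketched for a single sample as follows. This is a hedged illustration of the standard multiclass hinge loss, sum over j ≠ y_i of max(0, ω_j x_i − ω_{y_i} x_i + Δ), which matches the verbal description here; the function name, the weight matrix layout, and the default Δ = 1 are assumptions and are not taken from the study.

```python
import numpy as np

def multiclass_hinge_loss(W, x, y, delta=1.0):
    """Hinge loss of one sample x with true label y.

    W has one row of weights per category, so W @ x gives one score per
    category. Every wrong category whose score is not at least `delta`
    below the correct score contributes to the loss.
    """
    scores = W @ x
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0          # the correct category is excluded from the sum
    return float(margins.sum())

# Illustrative scores (identity weights, so scores equal the input).
W = np.eye(3)
x = np.array([3.0, 1.0, 2.5])
loss_when_correct_wins = multiclass_hinge_loss(W, x, y=0)   # only category 2 is within the margin
loss_when_correct_loses = multiclass_hinge_loss(W, x, y=1)  # both wrong categories score higher
```

In the first call the correct score 3.0 beats 1.0 by more than Δ (zero loss for that pair) but beats 2.5 by only 0.5, so the loss is 0.5. In the second call both wrong categories outscore the correct one and the loss grows to 5.5.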

Feature Fusion.
Feature fusion refers to extracting various single expression features independently, analyzing their advantages, disadvantages, and applicable environments, and then making a comprehensive decision to formulate the most reasonable recognition plan. According to the theory of information fusion, fusion can be realized at four levels: pixel level, feature level, matching level, and decision level [39], and each requires an effective fusion strategy.
We consider the extracted Gabor features, LBP features, joint geometric features, and deep CNN features together.
(1) From the perspective of feature categories, Gabor and LBP features are texture features, joint geometric features are geometric features, and CNN features are deep abstract features; these categories are relatively independent and have almost no correlation with each other.

Mathematical Problems in Engineering
(2) From the perspective of feature synthesis, although both Gabor and LBP features are texture features, their calculation methods are quite different. Gabor features exist in the form of directly expanded feature maps, while LBP features are histogram features extracted from LBP feature maps. There is no strong correlation between them, so fusing them at the feature level easily loses a lot of information.
(3) From the characteristics of the features, although we added local deformation features to the joint geometric features, their ability to describe local expression information is still weak and the error is large, while the Gabor and LBP features are highly descriptive and accurate but not robust, so merging them achieves a complementary effect.
(4) From the representativeness of the features, shallow features tend to detect the edges of the image; the detected content is comprehensive, and key information is also extracted. As the layers deepen, the feature map becomes more and more abstract, the resolution of the image becomes smaller and smaller, and much information is ignored. Relatively speaking, the deeper the layer, the more representative the extracted features. The extraction of deep features adds semantic information about the image to the Gabor, LBP, and joint geometric features.
In summary, we choose a higher level of fusion, that is, a decision-level fusion to address our four-feature fusion problem.

Multiclassifier Voting Mechanism.
After the features are extracted, the facial expressions are classified. This study uses SVM to complete the classification task. Decision-level fusion means training one SVM classifier on each of the four features and then combining the four classifiers.
This study proposes an improved weighted voting method to make a comprehensive decision across the four SVM classifiers and determine the final recognition result. Voting is a relatively simple and concrete way to realize a parallel combination. Its underlying principle is the "one person, one vote" mechanism. But such an overly simple voting rule does not take into account the characteristics of each classifier, which can make the classification result worse. From the above analysis, the features used to train each classifier differ in composition and characteristics, so their recognition capabilities differ; in many cases the classifiers themselves are not the same, that is, their principles and methods differ; and the classifiers may even be trained on different datasets. The recognition ability of each classifier is therefore bound to be different, and the "one person, one vote" mechanism is not reasonable enough. We adopt a "one person, multiple votes" mechanism, that is, each classifier is given a different weight.
The experimental results show that using the recognition accuracy of each single classifier as the basis for setting the weights can further improve the classification effect. The specific process of the proposed expression recognition algorithm, which uses mixed features for weighted voting SVM classification, is shown in Figure 8.
For each expression, calculate the proportion contributed by the recognition rate of each feature. For every expression i (Angry, Disgust, Fear, Happy, Neutral, Sad, and Surprise), the weights in formulas (6)-(12) share the form
\[ W_{mi} = \frac{R_{mi}}{\sum_{k} R_{ki}} \]
where m and k range over Gabor, LBP, UG, and MC, and R_{mi} denotes the recognition rate of feature m for expression i. That is, the weight of a feature is the proportion of its recognition rate, for a given expression in a given expression database, in the sum of the four feature recognition rates for that expression. Finally, the fusion strategy of the improved multiclassifier voting method is
\[ E = \arg\max_{1 \le i \le N} \sum_{m=1}^{L} W_{mi} \, \mathrm{vote}_{mi} \tag{13} \]
where E is the final recognition result, N is the number of expression categories, L is the number of classifiers, and W_{mi} is the weight of the i-th expression for the m-th classifier. The value of vote_{mi} is 0 or 1 and indicates whether the recognition result of the m-th classifier is the i-th expression.
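The weighting and voting rule can be sketched as follows, under the reading that each weight is a classifier's recognition rate for an expression divided by the sum of the four classifiers' rates for that expression, and that the final label is the expression with the largest weighted vote total. The function names and the recognition rates in the example are hypothetical and are not the study's values.

```python
import numpy as np

def vote_weights(rates):
    """rates[m, i] is the recognition rate of classifier m on expression i.

    Each column is normalized so the four weights for one expression sum to 1.
    """
    rates = np.asarray(rates, dtype=np.float64)
    return rates / rates.sum(axis=0, keepdims=True)

def weighted_vote(weights, predictions):
    """predictions[m] is the expression index chosen by classifier m.

    vote_mi is 1 only for the expression that classifier m predicted, so each
    classifier adds its weight for that expression to the running total.
    """
    totals = np.zeros(weights.shape[1])
    for m, i in enumerate(predictions):
        totals[i] += weights[m, i]
    return int(np.argmax(totals))

# Hypothetical rates: 4 classifiers (rows) and 2 expressions (columns).
rates = [[0.9, 0.5],
         [0.6, 0.5],
         [0.5, 0.5],
         [0.4, 0.5]]
weights = vote_weights(rates)
split_decision = weighted_vote(weights, [0, 1, 1, 0])   # two votes each
```

With "one person, one vote" the split above would be a tie. With weights, expression 0 wins because the first classifier is much more reliable on that expression than the others.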

Experiments
Our experimental results are presented in this section in two parts. Section 4.1 covers the experiment on the CNN deep features proposed in this study with the FERPlus [40] dataset. After the effectiveness of our proposed fusion of multilayer convolutional features as CNN deep features is demonstrated, Section 4.2 covers the FTMS-based expression recognition experiment carried out on the JAFFE [41] and CK+ [42] databases, where the results are compared and analyzed.
The FERPlus dataset was first sent directly to the VGG network for transfer learning, with only the last classification layer changed from the original 1000 categories to 8 categories. Figure 9 shows the accuracy and loss curves during this training. Then the fused conv5_2 and conv5_3 layer features were used as the input vectors of the multilayer CNN established in this study, and Softmax was used for classification training. Figure 10 shows the corresponding accuracy and loss curves. In these figures, orange represents training and blue represents validation.
It can be seen from Figure 9 that the accuracy on the final test set is 51.7% and the final average value of the loss function is 1.3. The accuracy is relatively low, the loss is relatively large, and the curve oscillates strongly and is unstable. It can be seen from Figure 10 that the accuracy on the final test set is 59.4% and the final average value of the loss function is 1.1. Compared with using only the deepest layer features, the accuracy is improved by 14.9%, the loss is reduced by 15.4%, the oscillation amplitude becomes smaller, and the curve is smoother. This demonstrates the effectiveness of our proposed multilayer convolutional feature fusion. Figure 11 shows the confusion matrix obtained by using VGG directly for expression classification. The recognition rates of fear, happy, and surprise are relatively high. The recognition rates of angry, disgust, neutral, and sad are all very low, below 50%. Among them, the recognition rates of disgust and neutral are lower than 45%. The classification results for some expressions are quite extreme, which indicates that the network is not stable enough. Figure 12 shows the confusion matrix for expression classification using our improved CNN. When a CNN trained with multilayer convolutional features is used to classify each type of expression, the average accuracy improves by 14.9%. Except for disgust, whose recognition rate just reaches 50%, the recognition rate of every expression is higher than 55%, and the classification result is more stable than that of the original neural network. Figure 13 shows the confusion matrix obtained using our improved deep CNN features and SVM classification.
When the features processed by the multilayer CNN are fed into the SVM to classify each type of expression, the average accuracy increases by 2% compared with Figure 12 and by 17.2% compared with Figure 11. Except for disgust, whose recognition rate is slightly lower, the recognition rates of the other expression categories are significantly higher than those of the original CNN, and the classification result is more stable.
The experimental results show that fusing multilayer convolutional features enriches the expression features and that our proposed multilayer CNN reduces the loss of features, thereby improving the accuracy of expression classification. The SVM classifier is more suitable for facial expression classification than the Softmax classifier and improves the robustness of the network. This demonstrates the effectiveness of our proposed improved deep CNN feature network.
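The percentages quoted in this subsection are relative changes, which can be checked with a short calculation. The sketch below uses only the figures reported above (51.7% and 59.4% accuracy, loss values 1.3 and 1.1); the function name is hypothetical. The SVM accuracy is not stated explicitly in the text, so the last two values are derived under the assumption that the 2% and 17.2% gains are also relative.

```python
def relative_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# Multilayer feature fusion versus deepest-layer features only.
accuracy_gain = relative_change(51.7, 59.4)    # about 14.9
loss_reduction = -relative_change(1.3, 1.1)    # about 15.4

# Implied SVM accuracy, derived two ways (an inference, not a reported value).
svm_accuracy_from_vgg = 51.7 * (1 + 17.2 / 100)
svm_accuracy_from_cnn = 59.4 * (1 + 2.0 / 100)
```

Both derivations of the SVM accuracy land near 60.6%, so the three reported gains are mutually consistent when read as relative improvements.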

Fusion of Transformed Multilevel Features and Improved Weighted Voting SVM

Database.
In the JAFFE dataset, one image of each expression from each person is selected for the training set, 70 images in total, and the remaining 143 images form the test set, which ensures that the test set contains enough samples. The selection is repeated 3 times to obtain the average recognition rate.
In the CK+ dataset, because the number of images of each type of expression is not balanced, we select 1-4 images with the peak expression (or close to the peak) from each labeled sequence as the experimental images, 736 images in total. The remaining 368 images are used as test images. The experiment is repeated 3 times to obtain the average value.

Experimental Steps
(1) First, the facial expression image is preprocessed, including graying, face detection, size normalization, and positioning of the key feature points of the face.
(2) Based on the localization of feature points, the three areas of the eyes, nose, and mouth are extracted. Gabor features are extracted from the divided areas, and expression classification is performed using SVM. This study selects the frequency bandwidths b = 1.4, 1.6, 2.0, which give the wavelengths λ = 2.4σ, 2.7σ, 3.2σ. With 8 directions, there are 24 filters in total. The specific experimental process is shown in Figure 14.
(3) Extract the LBP histogram features from the normalized expression images and use SVM to complete the expression classification. The specific experimental process is shown in Figure 15.
(4) Based on the location of feature points, the joint geometric features are extracted, and the expression recognition is completed by the SVM. The specific experimental process is shown in Figure 16.
(5) The size-normalized expression images are sent to VGG to extract deep CNN features, which are then sent to the SVM for expression recognition. The specific experimental network structure is shown in Figure 7. The SVM classifier has been trained in the experiment described in Section 4.1.
(6) According to the improved weighted voting method proposed in this study, decision-level fusion of the four classifiers trained on the four features is carried out to obtain the final classification result.
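The wavelengths in step (2) can be related to the bandwidths with a short calculation. The text does not state which relation was used, so the sketch below assumes the commonly used Gabor relation σ/λ = (1/π) · sqrt(ln 2 / 2) · (2^b + 1)/(2^b − 1), where b is the bandwidth in octaves. The function name is hypothetical. Under this assumption the three bandwidths reproduce the three quoted wavelength factors.

```python
import math

def wavelength_factor(b):
    """Return lambda / sigma for a Gabor filter with bandwidth b (in octaves).

    Assumes sigma / lambda = (1 / pi) * sqrt(ln 2 / 2) * (2**b + 1) / (2**b - 1).
    """
    sigma_over_lambda = (
        (1.0 / math.pi) * math.sqrt(math.log(2) / 2.0) * (2**b + 1) / (2**b - 1)
    )
    return 1.0 / sigma_over_lambda

bandwidths = (1.4, 1.6, 2.0)
factors = [round(wavelength_factor(b), 1) for b in bandwidths]

# Three wavelengths combined with eight directions give the filter bank size.
n_directions = 8
n_filters = len(bandwidths) * n_directions
```

The rounded factors are 2.4, 2.7, and 3.2, which agree with λ = 2.4σ, 2.7σ, 3.2σ in the text, and 3 wavelengths times 8 directions gives the 24 filters.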

Experimental Results.
The trained SVM classifiers based on the four features were evaluated on the test set. The final recognition rates of the four features on the two databases are shown in Figures 17 and 18. From the recognition rates, we can see the following.
(1) The average recognition rate of the facial expression recognition algorithm based on Gabor features on JAFFE and CK+ reached 88.49% and 92.86%, respectively. The algorithm based on LBP features reached 89.27% and 92.35% on JAFFE and CK+, respectively. The algorithm based on joint geometric features reached 80.49% and 92.49% on JAFFE and CK+, respectively. The algorithm based on deep CNN features has an average recognition rate of 92.7% and 95.61% on JAFFE and CK+, respectively. The feasibility of the recognition algorithms based on the four independent features is thus verified. The recognition rate on CK+ is relatively high. The reason is that CK+ has more pictures, a large sample size, and sufficient training, even though the CK+ database contains samples of different genders and skin colors.
(2) Compared with Gabor features, LBP features have better performance and a more balanced recognition ability on JAFFE; however, their recognition rate on CK+ is lower because CK+ is more complex and includes samples with different skin color brightness and poor clarity. It can be concluded that LBP features place higher requirements on image quality and have poor noise immunity. Compared with Gabor and LBP features, the recognition rate of joint geometric features and deep CNN features on JAFFE decreases, while the recognition rate on CK+ is still high. This shows that the joint geometric features and the deep CNN features have a poorer recognition effect when the sample size is small and perform better when the sample size is large; the deep convolutional features in particular may overfit. It is also confirmed that the deep CNN features are robust to small brightness changes.
According to the method of Section 3.4.2, through calculation, the optimal weight is finally selected as given in Tables 3 and 4.
According to formula (13), decision-level classification is performed. After testing, the final recognition rates on the two databases are shown in Tables 5-6. The comparison with the results of using the four features separately is shown in Figure 19.
It can be seen from the recognition rate in Tables 5-6 that the average recognition rate of the facial expression recognition algorithm based on FTMS proposed in this study on JAFFE and CK+ databases reached 94.95% and 96.68%, respectively, which verified the feasibility of the algorithm. From the comparison of different databases, CK+ still has the highest recognition rate due to its large sample size and sufficient training. Compared with the experimental results of the four features from Figure 19, on JAFFE, the proposed algorithm increases the recognition rate of Gabor features by 7.3%, the recognition rate of LBP features by 6.4%, the recognition rate of joint geometric features by 18.0%, and the recognition rate of deep CNN features by 2.4% and on CK+, the proposed algorithm increases the recognition rate of Gabor features by 4.1%, the recognition rate of LBP features by 4.7%, the recognition rate of joint geometric features by 4.5%, and the recognition rate of deep CNN features by 1.4%. It can be found that whether it is Gabor, LBP, joint geometric, or deep CNN features, the recognition rate of mixed features in all databases has been significantly improved. And the ability to recognize different expressions is more balanced.
This is because the weighted voting fusion strategy takes advantage of each feature and significantly improves the recognition ability. Tables 7 and 8 are the confusion matrices after three repeated experiments on the two expression databases. It can be seen that with the use of hybrid features, the degree of expression confusion decreases as the recognition rate increases. In JAFFE, sadness is easily misjudged as fear, and surprise is easily misjudged as happy. In CK+, apart from neutral expressions, anger and sadness are easily misjudged, and the degree of misjudgment of other expressions is not high. Taking JAFFE as an example, some misclassified expressions are shown in Figure 20. The first row below each image is the predicted expression category, and the second row is the correct label. It can be seen that some expressions are very complicated and difficult to distinguish. For example, people can be angry, happy, sad, disgusted, or afraid with a blank face. These will be classified as neutral expressions. In addition, people cry happily, cry in fear, or cry in anger. These are all classified as sad. Moreover, surprise and fear are often inseparable, surprise and happiness are often inseparable, and an exaggerated expression of disgust can easily be classified as sad. Overall, the use of mixed features improves the recognition rate and reduces the degree of misjudgment, which demonstrates the effectiveness of the proposed fusion features in facial expression recognition. Figure 21 compares our proposed fusion feature with other combinations of fusion features. When the features are fused, the weights are calculated according to the method in Section 3.4.2, so that the outputs of the respective trained SVM classifiers can be combined according to the weights to obtain the result. It can be seen that among these expression recognition methods, the fusion feature we propose performs best.
On the JAFFE database, compared with the fusion of Gabor and LBP features, the fusion of Gabor, LBP, and joint geometric features increases the expression recognition rate by 1.5%, while on CK+ it increases by 1.6%, which demonstrates the effectiveness of fusing joint geometric features. Furthermore, when the CNN deep features are added, the expression recognition rate improves by a further 1.9% on JAFFE and 1.3% on CK+ compared with the fusion of Gabor, LBP, and joint geometric features, which demonstrates the effectiveness of fusing CNN deep features. In general, the FTMS algorithm we propose improves the recognition rate of facial expressions and has practical engineering significance.

Summary and Discussion
The features are significantly improved, the recognition effect is excellent, and the ability to recognize different expressions is more balanced. The experimental results show that the algorithm has a higher recognition rate and robustness than a single feature and fully utilizes the advantages and characteristics of different features.
Although the algorithm proposed in this study has achieved good results in experiments, it still has certain shortcomings. The work that needs further improvement in future research includes the following.
(1) The expression recognition algorithm in this study is mainly for static images, but the change of facial expression is in fact a complex dynamic process. When the recognition object is an image sequence or a dynamic video, we must consider not only the static features but also how to extract effective features from the dynamic sequence, and the algorithm complexity will also increase greatly. Therefore, how to design an expression recognition system that handles both dynamic and static input is a problem worth exploring.
(2) The facial expression databases used in this study are commonly used databases. The expression images are taken in a specific experimental environment and may not capture the most real and natural expressions.
(3) This study covers only the seven basic expressions of the human face. These seven expressions have obvious characteristics; even so, confusion between expressions occurs easily. In real life, however, we encounter a wide variety of expressions, including mixed expressions of pain and happiness and microexpressions, which are not easy to distinguish. The study of these expressions will become a new research direction in the field of expression recognition.