Multiview Feature Fusion Attention Convolutional Recurrent Neural Networks for EEG-Based Emotion Recognition

Introduction
Emotion plays a significant role in daily life, affecting human cognition and decision-making [1]. It is also a relatively complex psychological state [2], and how to recognize emotions has become one of the open issues in the field [3]. At present, the mainstream ways of describing emotion are the two-dimensional valence-arousal coordinate system [4] and the discrete assessment method [5]. In the valence-arousal coordinate system, valence represents the positive or negative direction of the emotion, and arousal represents its intensity [6]. In discrete assessment, emotions are divided into multiple discrete categories. For example, Zheng and Lu classify emotions into positive, neutral, and negative categories [7], and Shanmugam and Padmanaban classify emotions into eight types: joy, trust, fear, surprise, sadness, disgust, anger, and expectation [8]. Emotion recognition is of great significance: it can help humans understand their own emotions, and it can help computers better understand human emotions [9] so that they can better serve humans.
With the development of computer science and information technology, human-machine interaction technology has attracted more attention [10]. As the cornerstone of human-machine interaction technology, emotion recognition has inevitably attracted the attention of the academic community [11]. Generally speaking, emotion recognition methods can be divided into two categories: one is based on external signals of the human body [12], such as expression, posture, and voice; the other is based on internal signals of the body [13], such as EEG, ECG, and EMG. Compared with external signals such as facial expression, posture, and voice, emotion recognition results based on internal signals such as EEG are more reliable because humans cannot control them intentionally [14].
In the traditional EEG emotion recognition method, handcrafted features that are most relevant to the emotion recognition task are first screened out [15] and then fed into a machine learning model for classification. However, because deep learning does not require manual feature engineering and learns more effectively [16], researchers in EEG emotion recognition have mostly used deep learning methods in recent years [17,18]. Depending on the characteristics of EEG signals, features can be extracted from the time domain, frequency domain, time-frequency domain, and nonlinear dynamical systems [19]. Differential entropy (DE) is a representative nonlinear dynamic feature commonly used in EEG emotion recognition tasks [20]. The research of Garcia-Martinez et al. confirmed the effectiveness and robustness of the DE feature in EEG emotion recognition tasks [21]. Zhu and Zhong [22] classified DE features with a 2DCNN-BiGRU network and achieved 87.89% and 88.69% on the arousal and valence classification tasks of the DEAP dataset, respectively. However, a single convolution scale limits this method in spatial feature extraction, resulting in feature loss. Yin et al. [23] used the ERDL model to combine frequency-domain and time-domain characteristics, and the classification accuracy on the DEAP dataset reached 90.45% and 90.6%, respectively. Shen et al. [24] proposed a four-dimensional convolutional recurrent neural network (4D-CRNN) that integrates the frequency-domain, spatial-domain, and time-domain information of multichannel EEG signals to improve the accuracy of EEG-based emotion recognition. The accuracy of arousal and valence classification on the DEAP dataset reached 94.22% and 94.58%, respectively. However, these two methods ignore the differences between features and the impact of different features on the classification results.
In the process of feature fusion, failing to distinguish between different features easily causes feature redundancy. Cui et al. [25] proposed a DE-CNN-bi-LSTM network that extracts DE features on different time slices in different frequency bands. A CNN and a bi-LSTM were then used to learn spatial and temporal information, and the classification accuracy on the DEAP dataset reached 94.86% and 94.02%, respectively. However, this method could not effectively deal with label noise, which affected the classification results.
To address the limitations of these methods, we propose a recurrent network model based on multiview feature fusion. Spatial features are extracted by convolutions of different scales, and the spatial features of different periods are weighted and fused through the multihead attention mechanism to amplify effective features and reduce the impact of invalid features; label smoothing is used to reduce the impact of label noise. The main contributions of this study are as follows:
(1) To address the noise in the labels of a single subject in EEG sentiment analysis, label smoothing is used to reduce the influence of label noise on the classification accuracy of the model and achieve better sentiment classification.
(2) Multiscale convolution makes the extracted spatial features more comprehensive, and convolution is closely combined with a bidirectional GRU (bi-GRU) so that the model can learn more comprehensive time-frequency features.
(3) This paper proposes a multiview feature fusion attention convolutional recurrent neural network, which integrates the weights of frequency-domain, spatial-domain, and time-domain features and effectively improves the classification accuracy of emotion recognition.
In this paper, Section 1 is the introduction, which introduces some basic concepts in the field of EEG emotion recognition, briefly summarizes previous research work, and explains how this paper builds on prior studies. Section 2 is the method part, which mainly introduces the dataset and the key concepts of the model. Section 3 presents the experimental results and analysis, introducing and analyzing the experimental process and results in detail. Section 4 discusses and compares the results with those of existing papers, reflecting the research significance and value of this paper. Section 5 is the conclusion, which reviews and summarizes this paper.

Dataset and Preprocessing
The DEAP dataset [26] is a multichannel dataset collected by Koelstra et al. to study human emotion. As summarized in Table 1, 32 subjects (16 males and 16 females) were invited to watch 40 music videos. The subjects were in good physical and mental health and could respond normally to the stimulation of the video material. Each music video lasts 1 minute and is regarded as one experiment. For each experiment, the first 3 seconds (3 s) are the video conversion time and the following 60 s are the music video play time, so each sample lasts 63 s, and the first 3 s serve as the baseline of the experiment. After each video, the subject scored it in valence, arousal, and other dimensions on a scale from 1 to 9. We selected 5 as the threshold and regarded the emotion recognition task on the DEAP dataset as two binary classification problems. EEG signals were recorded with 32 electrodes at a sampling frequency of 512 Hz, and the electrodes were placed according to the international 10-20 system. In addition to the 32 EEG channels, the experiment recorded 8 channels of ECG and EOG signals, for a total of 40 physiological signal channels. Only the first 32 (EEG) channels were used in this paper. The dataset website provides preprocessed data that have been downsampled from 512 Hz to 128 Hz and from which noise such as ocular artifacts has been removed.
A 63 s EEG signal was collected for each trial. First, we cut the 63 s signal into 0.5 s blocks. Each EEG block was then filtered to obtain the four frequency bands θ, α, β, and γ, and DE features were extracted from each frequency band. Since Yang et al. [27,28] have shown that considering baseline signals can improve the classification performance of the model, we carry out baseline correction according to their method. The DE features of each frequency band of the baseline signal were averaged, and this average was subtracted from the DE features of the stimulus signal in the corresponding frequency band to obtain the baseline-corrected DE features. The processed DE features of each frequency band were mapped into a two-dimensional map according to the electrode distribution. For each block, the 2D maps of the four frequency bands were concatenated to form a new feature matrix. Finally, the feature matrices of the blocks were grouped into segments (1 segment = 6 blocks), and each segment was sent into the model as a sample. The data preprocessing process is shown in Figure 1.
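The block-wise DE extraction and baseline correction described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the common Gaussian closed form DE = 0.5 ln(2πeσ²), takes a synthetic array in place of real band-filtered EEG, and skips the band-pass filtering step; the function names are hypothetical.

```python
import numpy as np

FS = 128          # sampling rate of the preprocessed DEAP signals (Hz)
BLOCK = FS // 2   # one 0.5 s block = 64 samples

def differential_entropy(x, axis=-1):
    """DE of a signal assumed to be Gaussian: 0.5 * ln(2 * pi * e * variance)."""
    return 0.5 * np.log(2.0 * np.pi * np.e * np.var(x, axis=axis))

def block_de(band_signal):
    """(bands, channels, samples) -> DE per 0.5 s block, shape (blocks, bands, channels)."""
    bands, channels, samples = band_signal.shape
    n_blocks = samples // BLOCK
    blocks = band_signal[..., : n_blocks * BLOCK].reshape(bands, channels, n_blocks, BLOCK)
    return np.moveaxis(differential_entropy(blocks), -1, 0)

def baseline_corrected_de(trial):
    """trial: (bands, channels, 63 * FS); the first 3 s are the baseline."""
    de = block_de(trial)                      # 126 blocks in total
    n_base = 3 * FS // BLOCK                  # 6 baseline blocks
    baseline_mean = de[:n_base].mean(axis=0)  # average baseline DE per band and channel
    return de[n_base:] - baseline_mean        # 120 corrected stimulus blocks

rng = np.random.default_rng(0)
trial = rng.standard_normal((4, 32, 63 * FS))  # synthetic stand-in for one filtered trial
features = baseline_corrected_de(trial)
print(features.shape)  # (120, 4, 32)
```

Each of the 120 rows is one 0.5 s block with 4 bands × 32 channels, ready to be mapped onto the electrode grid.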
The electrodes can be converted into two kinds of 2D maps: an 8 × 9 map, as shown in Figure 2(a), and a 9 × 9 map, as shown in Figure 2(b). Likewise, the 2D maps of the four frequency bands can be joined in two ways. One is stacking, which forms a three-dimensional matrix; in this paper, the 8 × 9 maps are stacked, as shown in Figure 2(c). The other is splicing the maps into one large picture, which is still a two-dimensional matrix; in this paper, the 9 × 9 maps are spliced into a large picture, as shown in Figure 2.
For a single subject, the 60 s of stimulus signal from each of the 40 music videos yields 40 × 60 × 2 = 4800 blocks of 0.5 s, which are divided into 800 samples; each sample contains 6 time periods, and each time period contains 4 frequency bands. Taking stacking as an example, the feature of each frequency band is an 8 × 9 mapping matrix, so each time period is a 4 × 8 × 9 feature matrix. With 32 subjects, there are 25600 samples in total. In the arousal classification, these comprise 10860 low-arousal samples and 14740 high-arousal samples; in the valence classification, they comprise 11440 low-valence samples and 14160 high-valence samples. Because subjects respond differently to emotional stimuli, the numbers of low and high samples for arousal and valence are not the same.
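The stacking and segmentation arithmetic above can be checked with a short sketch. The electrode positions used here are a hypothetical placeholder (the real row and column of each electrode come from Figure 2(a)), and the DE values are random; only the shapes and counts are meaningful.

```python
import numpy as np

def to_stacked_maps(de_block, positions, height=8, width=9):
    """(bands, channels) -> (bands, height, width); grid cells without an electrode stay 0."""
    maps = np.zeros((de_block.shape[0], height, width))
    for ch, (row, col) in enumerate(positions):
        maps[:, row, col] = de_block[:, ch]
    return maps

# HYPOTHETICAL layout: fills the grid row by row, not the real 10-20 arrangement.
positions = [(i // 9, i % 9) for i in range(32)]

rng = np.random.default_rng(0)
de = rng.standard_normal((40 * 60 * 2, 4, 32))   # 4800 blocks for one subject
maps = np.stack([to_stacked_maps(b, positions) for b in de])
samples = maps.reshape(-1, 6, 4, 8, 9)           # 6 consecutive blocks per sample

print(samples.shape)          # (800, 6, 4, 8, 9)
print(32 * samples.shape[0])  # 25600
```

Because each trial contributes 120 blocks and 120 is divisible by 6, no sample straddles two trials when the blocks are ordered trial by trial.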

Spatial Feature Extraction Based on 2D-CNN
Convolutional neural networks are often used to process 2D data and usually consist of three parts: convolutional layers, pooling layers, and activation function layers [29]. The convolution layer performs the inner product operation on the input data through the convolution kernel. By setting the size and number of the convolution kernels, the model can extract different types of data features. At the same time, in the convolutional layer, the number of parameters that need to be trained is reduced through "sparse connection" [30] and "weight sharing," thereby reducing the difficulty of training.

Journal of Sensors
In addition, the pooling layer further reduces the amount of data passed to the next layer of the network, reducing the difficulty of model training [31]. The activation function layer transforms the data nonlinearly to ease training and strengthen the correlation between data. Part of the convolutional neural network model used in this paper is shown in Figure 3.
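To make the effect of weight sharing and pooling concrete, the sketch below counts the trainable parameters of a convolution layer against a fully connected layer producing an output of the same size, and applies 2 × 2 max pooling to an 8 × 9 map. The choice of 64 output channels is a hypothetical value for illustration and is not taken from the paper.

```python
import numpy as np

def conv_params(in_ch, out_ch, k):
    """Weights plus biases of a conv layer: each k x k kernel is shared across all positions."""
    return out_ch * (in_ch * k * k + 1)

def dense_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return out_features * (in_features + 1)

def max_pool2d(x, size=2):
    """Non-overlapping max pooling over the last two axes (ragged edges are cropped)."""
    h = x.shape[-2] // size * size
    w = x.shape[-1] // size * size
    x = x[..., :h, :w].reshape(*x.shape[:-2], h // size, size, w // size, size)
    return x.max(axis=(-3, -1))

# A 4-band 8 x 9 input mapped to 64 feature maps of the same spatial size.
print(conv_params(4, 64, 5))                 # 6464
print(dense_params(4 * 8 * 9, 64 * 8 * 9))   # 1331712

pooled = max_pool2d(np.arange(72.0).reshape(1, 8, 9))
print(pooled.shape)                          # (1, 4, 4)
```

The convolution layer needs roughly 200 times fewer parameters here, and pooling reduces the 72 values of each map to 16.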

Time Series Feature Extraction Based on GRU
The RNN (recurrent neural network) has certain advantages when dealing with time series data. When an RNN processes the information at each moment, it effectively preserves the original timing of the data, and the number of training parameters does not increase with the sequence length. This paper uses an improved recurrent structure, the GRU (gated recurrent unit) model, as shown in Figure 4.
The GRU has a reset gate and an update gate. The reset gate determines the degree to which the input information is combined with the previously memorized information. The update gate determines how much of the previously memorized information is retained at the current time step. The specific formula is as follows:
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t * h_{t-1}) + b_h)$$
$$h_t = z_t * h_{t-1} + (1 - z_t) * \tilde{h}_t$$
where $h_t$ is the hidden state at time $t$, $x_t$ is the input at time $t$, $r_t$ and $z_t$ are the reset gate and update gate, respectively, $\tilde{h}_t$ is the candidate hidden state, $W$, $U$, and $b$ are learnable weights and biases, $\sigma$ is the sigmoid function, and $*$ is the Hadamard product.
In this paper, GRU is used to obtain the time series characteristics of data, and a comparative test is carried out for GRU and bi-GRU in Section 3.4.
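A minimal NumPy sketch of a GRU step and of the bidirectional variant is given below. It assumes the convention in which the update gate weights the previous hidden state (the same convention as PyTorch's GRU); the parameter names, the 16-dimensional input, and the hidden size of 8 are hypothetical choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU update; p holds the weight matrices W_*, U_* and biases b_*."""
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])            # reset gate
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])            # update gate
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])  # candidate state
    return z * h_prev + (1.0 - z) * h_cand

def init_params(n_in, n_hid, rng):
    p = {}
    for gate in "rzh":
        p[f"W_{gate}"] = rng.standard_normal((n_hid, n_in)) * 0.1
        p[f"U_{gate}"] = rng.standard_normal((n_hid, n_hid)) * 0.1
        p[f"b_{gate}"] = np.zeros(n_hid)
    return p

def run_gru(xs, p, n_hid):
    """Run over a sequence and return the last hidden state."""
    h = np.zeros(n_hid)
    for x_t in xs:
        h = gru_step(x_t, h, p)
    return h

rng = np.random.default_rng(0)
xs = rng.standard_normal((6, 16))   # 6 time periods, 16 features each
fwd, bwd = init_params(16, 8, rng), init_params(16, 8, rng)

# Bi-GRU: one pass in time order, one in reverse, last states concatenated.
h_bi = np.concatenate([run_gru(xs, fwd, 8), run_gru(xs[::-1], bwd, 8)])
print(h_bi.shape)  # (16,)
```

The parameter count depends only on the input and hidden sizes, not on the number of time steps, which is the property noted in the text.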

Feature Fusion Based on Multihead Attention
Usually, scaled dot-product attention consists of three parts: Q (query), K (key), and V (value). The structure is shown in Figure 5(a). Assume that the dimension of the input Q and K is $d_k$ and the dimension of V is $d_v$. Q is multiplied by the transpose of K, the product is divided by $\sqrt{d_k}$, the result is passed through the Softmax function to get the weights, and the weights are multiplied by V to get the output matrix. The specific formula is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
This paper uses multihead attention [32], and its structure is shown in Figure 5(b). Multihead attention can combine the information learned by different heads and can be regarded as the parallel processing of multiple scaled dot-product attentions. Q, K, and V are first subjected to a linear transformation and then input to the scaled dot-product attention. The scaled dot-product attention is computed once per head, and the obtained results are concatenated. The concatenated result is then subjected to a linear transformation to obtain the output of multihead attention.
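The attention computation can be sketched in NumPy as follows. This is an illustrative self-attention example over six time periods with 8 heads (the head count reported in the experiments); the 64-dimensional feature size and the random projection matrices are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Softmax(Q K^T / sqrt(d_k)) V, applied over the last two axes."""
    d_k = q.shape[-1]
    weights = softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k))
    return weights @ v, weights

def multihead_attention(x, w_q, w_k, w_v, w_o, heads):
    seq, d_model = x.shape
    d_head = d_model // heads

    def split(t):  # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq, heads, d_head).transpose(1, 0, 2)

    out, weights = scaled_dot_product_attention(split(x @ w_q), split(x @ w_k), split(x @ w_v))
    concat = out.transpose(1, 0, 2).reshape(seq, d_model)   # concatenate the heads
    return concat @ w_o, weights

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 64))   # 6 time periods, 64-dim features
w_q, w_k, w_v, w_o = (rng.standard_normal((64, 64)) * 0.1 for _ in range(4))
y, attn = multihead_attention(x, w_q, w_k, w_v, w_o, heads=8)
print(y.shape, attn.shape)   # (6, 64) (8, 6, 6)
```

Each row of `attn` sums to 1, so every head assigns a normalized weight to each of the six time periods.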

AdamW and Label Smoothing
2.5.1. AdamW Optimization Algorithm.
The Adam optimization algorithm has been widely used in deep learning models since its introduction, but experiments have revealed problems such as slow convergence and nonconvergence, which has led to various improved versions of Adam. In the Adam optimization algorithm, different parameters learn adaptively at different learning rates. The formula is as follows:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
where $g_t$ represents the gradient, the subscript $t$ represents time, $m_t$ is the first-order moment variable of the gradient, $v_t$ is the second-order moment variable of the gradient, and $\beta_1$ and $\beta_2$ are the exponential decay rates (decay factors) of the moment estimation. When the values of $m_t$ and $v_t$ are close to the zero vector, the result is biased. This problem is solved by performing bias correction on $m_t$ and $v_t$. The bias-corrected values $m_t'$ and $v_t'$ are as follows:
$$m_t' = \frac{m_t}{1 - \beta_1^t}, \qquad v_t' = \frac{v_t}{1 - \beta_2^t}$$
AdamW [33] adds a regularization term to Adam's loss function and includes the gradient of the regularization term when calculating the gradient, so that the model parameters are updated with the gradient of the overall loss function. In the AdamW parameter update, θ is a parameter in the model, η is the learning rate, α is 0.001, ξ is $10^{-8}$, and ω is a real number.
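As a rough illustration of the optimizer, the sketch below implements one Adam step with bias correction plus a decoupled weight-decay term, which is the form proposed in the AdamW reference [33]. Whether this matches the exact update used by the authors is an assumption, and the hyperparameter values and the toy quadratic objective are chosen only for the demonstration.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-2, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One update with bias-corrected moments and decoupled weight decay (assumed form)."""
    m = beta1 * m + (1 - beta1) * grad          # first-order moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-order moment
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

# Toy problem: minimize f(theta) = ||theta||^2 / 2, whose gradient is theta itself.
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
first = None
for t in range(1, 2001):
    theta, m, v = adamw_step(theta, theta.copy(), m, v, t)
    if t == 1:
        first = theta.copy()

print(first)   # after one step each coordinate has moved by about lr toward zero
print(theta)   # close to the minimum at the origin
```

At the first step the bias-corrected ratio is close to the sign of the gradient, so the step size is about `lr` regardless of the gradient scale.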

Label Smoothing.
There are usually some noisy labels in machine learning samples, and these labels have a certain impact on the prediction results. Label smoothing prevents the model from trusting the labels of the training samples too much by assuming during training that the labels may be wrong [34]. The smoothed label keeps most of the probability on the labeled category and spreads the remainder over the other categories. Here, ε is a predefined hyperparameter, which generally takes a value of 0.1, and K is the number of categories.
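A small sketch of label smoothing follows. It assumes one common variant, in which the labeled class receives 1 − ε and the remaining ε is split evenly over the other K − 1 classes; other variants distribute ε over all K classes, and the source does not settle which one was used. The predicted probabilities are made-up values.

```python
import math

def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Assumed variant: 1 - epsilon on the labeled class, epsilon / (K - 1) elsewhere."""
    off_value = epsilon / (num_classes - 1)
    return [1.0 - epsilon if k == true_class else off_value for k in range(num_classes)]

def cross_entropy(probs, target):
    return -sum(t * math.log(p) for p, t in zip(probs, target))

hard = [0.0, 1.0]                                   # one-hot label for class 1
soft = smooth_labels(true_class=1, num_classes=2)   # smoothed label for class 1

overconfident = [0.01, 0.99]   # hypothetical model output
loss_hard = cross_entropy(overconfident, hard)
loss_soft = cross_entropy(overconfident, soft)
print(soft)
print(loss_hard, loss_soft)
```

With the smoothed target, an overconfident prediction is penalized more than with the one-hot target, which discourages the model from fitting possibly wrong labels too closely.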

Multiview Feature Fusion Attention Convolutional Recurrent Neural Network Model
In this paper, after preprocessing the EEG time series data, the original sequence is divided into six data segments according to the period, thus effectively preserving the time order of the data. At the same time, to make the extracted features more comprehensive, multiscale convolution is used to extract spatial-domain features, and the extracted features are highly abstracted through convolution blocks. For the abstracted data, the extracted frequency-domain and spatial-domain features are weighted from the time series perspective through the attention mechanism, and the weighted data is classified through the bidirectional GRU model. The specific process is shown in Figure 6.
First, spatial feature extraction is performed on each of the six time periods at two convolution scales, and the two sets of spatial features are concatenated to obtain matrices $A_1'$, $B_1'$, $C_1'$, $D_1'$, $E_1'$, and $F_1'$, which are passed to the Conv block. The Conv block abstracts the spatial features through three convolution layers. The abstracted matrices are then passed through max pooling, flatten, and linear layers to obtain feature matrices $A_3'$, $B_3'$, $C_3'$, $D_3'$, $E_3'$, and $F_3'$, and the six matrices are concatenated to obtain matrix $G_1$. At the same time, a randomly initialized matrix $m$ is passed through the embedding layer and used as the initial attention weight matrix $W$. After $G_1$ is input, the final weight matrix $W'$ is obtained through multihead attention training, and the feature matrix $G_1^*$ is obtained after weighting. The weighted matrix is passed into the bidirectional GRU model to extract the time series features of the data. The forward and backward states at the last moment, $h_1(t)$ and $h_2(t)$, are concatenated to obtain the final output state matrix $h(t)$, and $h(t)$ is passed through a linear layer to achieve classification.
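The multiscale step at the start of this pipeline can be sketched as two convolutions with kernel sizes 5 and 7 whose outputs are concatenated along the channel axis. The sketch assumes zero padding so that both branches keep the 8 × 9 size (a requirement for concatenation that the text does not state explicitly), and the 16 filters per scale and the random weights are hypothetical.

```python
import numpy as np

def conv2d_same(x, kernels):
    """x: (in_ch, H, W); kernels: (out_ch, in_ch, k, k) with odd k.
    Zero-pads the input so the output keeps the H x W size."""
    out_ch, _, k, _ = kernels.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.zeros((out_ch, h, w))
    for i in range(h):
        for j in range(w):
            patch = xp[:, i:i + k, j:j + k]                    # (in_ch, k, k) window
            out[:, i, j] = np.tensordot(kernels, patch, axes=3)
    return out

def multiscale_features(x, k5, k7):
    """Concatenate the 5-scale and 7-scale feature maps along the channel axis."""
    return np.concatenate([conv2d_same(x, k5), conv2d_same(x, k7)], axis=0)

rng = np.random.default_rng(0)
block = rng.standard_normal((4, 8, 9))           # one time period: 4 bands, 8 x 9 map
k5 = rng.standard_normal((16, 4, 5, 5)) * 0.1
k7 = rng.standard_normal((16, 4, 7, 7)) * 0.1
fused = multiscale_features(block, k5, k7)
print(fused.shape)  # (32, 8, 9)
```

On an 8 × 9 map, a 7 × 7 kernel covers most of the electrode grid at once, while the 5 × 5 kernel keeps more local detail; concatenation lets later layers use both.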

Experiment and Analysis
The multi-aCRNN model is trained with a batch size of 128, a dropout rate of 0.5, a maximum of 500 epochs, a learning rate of 5 × 10^−5, and 8 attention heads. The model is implemented in PyTorch 1.11.0 with Python 3.7.0; the GPU environment reported by NVIDIA-SMI is driver version 460.67 with CUDA version 11.2.
Five-fold cross-validation was used for each experiment in this paper. The results are analyzed in terms of the average classification accuracy (ACC) and the number of subjects (Num) whose average classification accuracy is below 90%, for both arousal (a) and valence (v). The overall experimental process is shown in Figure 7. The experiments compare the number and scale of convolution layers, the use of attention, GRU versus bi-GRU, and the use of label smoothing.
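The evaluation protocol can be sketched as below, assuming a shuffled five-fold split over the 800 samples of one subject; how the authors actually partitioned the samples is not stated, and the per-subject accuracies in the usage example are made-up values.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for a shuffled k-fold split."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def summarize(subject_acc, threshold=90.0):
    """Return ACC (mean accuracy over subjects) and Num (subjects below the threshold)."""
    acc = sum(subject_acc) / len(subject_acc)
    num = sum(a < threshold for a in subject_acc)
    return acc, num

splits = list(k_fold_indices(800))
print(len(splits), len(splits[0][0]), len(splits[0][1]))   # 5 640 160

acc, num = summarize([95.0, 88.5, 97.2])   # hypothetical per-subject accuracies (%)
print(round(acc, 2), num)
```

Each sample appears in exactly one test fold, so every sample is used for testing once across the five runs.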

Comparison of the Number of Convolutional Layers.
Usually, in a convolutional neural network, the number of convolutional layers determines the degree of abstraction of the features, and a higher degree of abstraction can yield a more accurate prediction result. Therefore, this section compares the results of two-layer and four-layer convolution. The results are shown in Figures 8 and 9. It can be seen from Figures 8 and 9 that the deeper (four-layer) convolution is better than the two-layer convolution to a certain extent, especially for valence, where the average accuracy increases by 0.83%. Judging from the classification results of single subjects, with two-layer convolution the arousal classification results of 7 subjects and the valence classification results of 6 subjects were lower than 90%. In contrast, with four-layer convolution, only three subjects have arousal classification results below 90% and four subjects have valence classification results below 90%, which greatly reduces the number of poorly classified subjects.

Attention-Based Deep Convolution Classification Results.
For the features extracted by the convolution layer for different periods, the traditional Concat method cannot sufficiently distinguish the effectiveness of these features. Aiming at this problem, we use the attention mechanism for feature fusion so that the frequency-domain and spatial-domain features of different periods can be distinguished by increasing the weight to achieve a better classification prediction effect. The results are shown in Figure 10.
Comparing Figures 10 and 9, the overall classification accuracy improves to a certain extent after adding attention. It can be seen from Table 2 that the average accuracy of arousal classification increased by 0.71%, and the average accuracy of valence classification increased by 0.32%. For single subjects, weighted feature fusion brought a larger improvement for the second subject, who is more difficult to classify. Only two subjects had an arousal classification accuracy below 90%, but six subjects had a valence classification accuracy below 90%. Since the weighted fusion is carried out on the convolution results of the six time periods, it depends heavily on the convolution process. We conjecture that the locality of the initial convolution limits the attention of the model during learning, so although the overall accuracy of valence classification has improved, some single-subject results have declined.

A Comparative Study of Convolution at Different Scales.
Since the convolutional layer is limited by the size of the convolution kernel when performing feature extraction, the size of the convolution kernel of the model is explored experimentally in this section. It can be seen from Figures 11 and 12 that different convolution kernel sizes have a significant impact on classification accuracy and classification stability.
As shown in Table 3, the first convolution layer extracts features from the original data. Comparing first-layer kernel sizes of 5 and 7, the 7-scale kernel gives better classification results than the 5-scale kernel, and the accuracy of valence classification improves by 0.42%. This is because large-scale convolution has a more extensive perception range when extracting features from the original data, which can significantly reduce the limitations of spatial feature extraction. However, this benefit is limited to the extraction of initial features: using a larger-scale convolution in the middle layers significantly reduces performance. This is because the middle-layer convolution is a reabstraction of features, and a larger-scale convolution there affects the initial feature information and covers part of the effective information in the initial features, resulting in a decrease in the classification results.

3.4. GRU and Bi-GRU Comparative Experiment.
For time series, the time series features are extracted through a recurrent neural network to optimize the model. In this section, a comparative experiment is carried out on the GRU and bi-GRU networks, and the experimental results are shown in Figures 13 and 14. It can be seen from Table 4 that the classification accuracy results of GRU and bi-GRU are almost the same, and the classification results for a single subject are also relatively close. For this phenomenon, the use of GRU and bi-GRU will be further explored in Section 3.5.

Multiscale Fusion Model Based on Label Smoothing.
In Section 3.4, GRU and bi-GRU are explored, but the experimental results are relatively close and cannot clearly show the pros and cons of the two models. In Section 3.3, we explore convolutional networks of different scales and find that larger initial convolution kernels are more effective for extracting spatial features; whether also retaining the features of smaller-scale convolutions would further promote emotion recognition remains to be examined. This section therefore explores the multiscale fusion model based on label smoothing. The results are shown in Figures 15-17 and Table 5.
After label smoothing, the model's accuracy is significantly improved, and the results for single subjects are also more stable. In the single-scale Bi-GRU fusion model (7+4+4+1+Bi-GRU+lab), the accuracy of arousal and valence reached 96.09% and 96.02%, respectively, and there is only one subject with a classification accuracy below 90%. Since the data of some subjects contain certain errors, it is difficult to improve the classification accuracy for these subjects. This noise also affects the classification results of other subjects and can even cause the model's accuracy and training results to deteriorate. It can be seen that AdamW and label smoothing significantly promote the fitting of the model and the calibration of the network, which can dramatically reduce the impact of label noise.
It can be seen from Table 5 that the training results of bi-GRU are significantly better than those of GRU. In the multiscale network, bi-GRU (5&7+4+4+1+bi-GRU+lab) reaches 96.43% and 96.30%, compared with 96.02% and 95.74% for the GRU network (5&7+4+4+1+GRU+lab); using bi-GRU improves the results by 0.42% and 0.56%, respectively, so an overall improvement is achieved. This shows that in EEG time series data, the reverse time series information also has a certain effect and promotes the overall experimental results. At the same time, comparing the single-scale and multiscale results shows that using 5-scale and 7-scale convolution kernels together to extract features from the original data is significantly better than using 7-scale convolution kernels alone, indicating that the fusion of features at different scales helps the model learn more comprehensive and effective information and achieve better classification results.

Contrast of Splicing and Stacking Preprocessing
This section experimentally explores two different preprocessing methods, splicing and stacking. It can be seen from Table 6 that the stacking preprocessing method improves the arousal and valence classification results by 1.47% and 1.7%, respectively, compared with the splicing preprocessing method. For single subjects, the stacking results are also more stable, whereas the splicing method has two subjects with an accuracy below 90% in arousal classification and four subjects with an accuracy below 90% in valence classification. Therefore, the stacking preprocessing method is used in the experiments of this paper.

Discussion
As shown in Table 7, when the model learns spatiotemporal information on top of the extracted frequency information, it tends to obtain better experimental results. However, when the model does not distinguish between the pieces of information obtained and trains on all of them equally, the training effect suffers. We weight the extracted features through the attention mechanism to amplify effective information and reduce invalid information, and we use label smoothing to reduce the impact of label noise on the final classification result. Comparing papers that use the same feature information, multi-aCRNN (ours) outperforms the 4D-CRNN model by 2.21% and 1.72% on the arousal task and the valence task, respectively, and surpasses the DE-CNN-BiLSTM model by 1.57% and 2.28%. We can conclude that selectively training on frequency, spatial, and temporal information is more conducive to emotion recognition, and that reducing label noise positively affects emotion recognition.
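The margins quoted above follow directly from the accuracies reported in this paper, as the short check below shows. The baseline values are those cited in the Introduction, assuming each pair is given in arousal, valence order.

```python
ours = {"arousal": 96.43, "valence": 96.30}
baselines = {
    "4D-CRNN [24]": {"arousal": 94.22, "valence": 94.58},
    "DE-CNN-BiLSTM [25]": {"arousal": 94.86, "valence": 94.02},
}

# Difference in percentage points between multi-aCRNN and each baseline.
margins = {
    name: {task: round(ours[task] - acc[task], 2) for task in ours}
    for name, acc in baselines.items()
}

for name, diff in margins.items():
    print(name, diff)
```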

Conclusions
This paper proposes a multiview feature fusion attention convolutional recurrent neural network model for EEG sentiment analysis. This method extracts more comprehensive spatial feature information through multiscale convolution and combines the frequency-domain and spatial-domain features of EEG data. The weighted fusion is carried out from the time series perspective so that the model learns more accurate information for classification prediction. Through the comparison experiment between GRU and bi-GRU networks, the bi-GRU network is selected as the network layer for temporal feature extraction. At the same time, the labels are smoothed, which effectively reduces the impact of label noise on the classification results and supports emotion recognition in complex practical situations; the method is verified on the DEAP dataset. The multi-aCRNN model achieved 96.43% and 96.30% on the arousal task and valence task, respectively. This paper also conducts an experimental comparison of the stacking and splicing electrode conversion methods to understand the impact of different feature combination methods on sentiment analysis. These experiments can serve as a basis for further research on EEG characteristics and for better experimental research on emotion analysis.
Although the experiments in this paper effectively fuse multiview features and obtain high classification accuracy, the model still has shortcomings in classification tasks. In future work, we will extend multi-aCRNN to a multiclass classifier to locate emotional states more precisely and achieve more accurate classification results.

Data Availability
The experimental data in this paper comes from the public data set DEAP, and the data set source link is http://www.eecs.qmul.ac.uk/mmv/datasets/deap/.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Authors' Contributions
XF and RH conceived the project, designed the experiments, and drafted the manuscript. RH, FB, PC, and ZF collected the data and conducted the experiments. RH, FB, and XF proofed and polished the manuscript and organized this project.