Semisupervised Deep Features of Time-Frequency Maps for Multimodal Emotion Recognition

Traditional approaches for emotion recognition utilize unimodal physiological signals. The effectiveness of such systems is affected by several limitations. To overcome them, this paper proposes a new method based on time-frequency maps that extracts features from multimodal biological signals. At first, the fusion of the electroencephalogram (EEG) and peripheral physiological signals (PPSs) is performed, and then the two-dimensional discrete orthonormal Stockwell transform (2D-DOST) of the multimodal signal matrix is calculated to obtain time-frequency maps. A convolutional neural network (CNN) is then utilized to extract local deep features from the absolute output of the 2D-DOST. Since some deep features are uninformative, a semisupervised dimension reduction (SSDR) scheme reduces them by balancing generalization and discrimination. Finally, a classifier recognizes the emotion. A Bayesian optimizer finds the proper SSDR and classifier parameter values to maximize the recognition accuracy. The performance of the proposed method is evaluated on the DEAP dataset, considering two- and four-class scenarios, through extensive simulations. This dataset consists of EEG signals in 32 channels and PPSs in eight channels from 32 subjects. The proposed method reaches accuracies of 0.953 and 0.928 for the two- and four-class scenarios, respectively. The results indicate the efficiency of multimodal signals for detecting emotions compared to unimodal signals. Also, the results indicate that the proposed method outperforms recently introduced ones.


Introduction
Emotion recognition is widely used in healthcare, teaching, human-computer interaction, and other fields. Since physiological signals can reflect the real emotional state of an individual, they are widely used for emotion recognition. Single-modality approaches extract series of features from individual channels; this approach cannot make full use of the relevant information among channels. Multimodal emotion recognition is an emerging interdisciplinary field of research in affective computing and sentiment analysis. It aims at exploiting the information carried by signals of different natures to make emotion recognition systems more accurate. This is achieved by employing a powerful multimodal fusion method [1].

This paper proposes an emotion recognition scheme based on multimodal signals consisting of the electroencephalograph (EEG) and peripheral physiological signals (PPSs).
The proposed method utilizes the two-dimensional discrete orthonormal Stockwell transform (2D-DOST) to consider the intramodal and cross-modal correlations among the multimodal signals, including the EEG and PPS signals, and the relations between the samples of each signal. Then, a convolutional neural network (CNN) is used to extract the local deep features from the output of the 2D-DOST. Since there are several redundant features in the set of deep features, semisupervised dimension reduction (SSDR) is used, and a classifier recognizes the emotion. The feature reduction and classification performance depend on some parameters, which are obtained by a Bayesian optimization approach to maximize accuracy. We considered the binary and four-class scenarios on the database for emotion analysis using physiological signals (DEAP) to assess the performance of the proposed method. The results demonstrate that the proposed method outperforms the recently introduced methods. Hence, the contributions of this paper are as follows:
(i) Proposing a new method for multimodal emotion recognition using EEG and PPS
(ii) Using the 2D-DOST to analyze the intramodal and cross-modal correlations
(iii) Extracting deep features by the CNN and then reducing the number of deep features by a semisupervised method
(iv) Jointly optimizing the parameters of the SSDR and the classifier
(v) Performing extensive simulations to demonstrate the performance of the proposed method.
Following this introduction, Section 2 presents the related works on multimodal emotion recognition. Section 3 describes the dataset and gives a detailed description of the proposed method. Section 4 contains the results and discussion, and Section 5 concludes the paper.

Related Works
The EEG is the most used physiological signal in single-modal emotion recognition systems [2-6]. In multimodal systems, the EEG and other physiological signals, such as PPSs, are usually used together for emotion recognition. A hierarchical fusion based on the CNN was proposed in [7] to extract the potential information of multimodal signals, including the EEG and the PPS, and feature-level fusion was performed to merge the deep and statistical features. Binary classification scenarios based on the valence and arousal dimensions were considered on the DEAP and MAHNOB-HCI datasets. The method presented in [8] combines the EEG and PPS with eye movement signals, and the joint oscillation structure of the multichannel signals was analyzed by the multivariate synchrosqueezing transform (MSST); after that, a deep CNN extracts the local features from the MSST. Binary scenarios based on the arousal and valence dimensions were evaluated on the DEAP and MAHNOB-HCI datasets. An ensemble CNN was utilized in [9] to analyze the correlation between EEG and PPS signals from the DEAP dataset for multimodal emotion recognition. A multistage multimodal dynamical fusion network was proposed in [10] to analyze the unimodal, bimodal, and trimodal intercorrelations; it was shown that multistage fusion performs better than single-stage fusion on the DEAP dataset. A multiple-fusion-layer-based ensemble classifier of stacked autoencoders was proposed in [11] to recognize the emotions in the DEAP dataset. PPSs such as galvanic skin response (GSR), respiration patterns, and blood volume pressure were utilized in [12]; this method combines several continuous wavelet transforms (CWTs) and classifies them using a CNN, and the four-class scenario on the DEAP dataset was considered for performance evaluation. The EEG, pulse, skin temperature, and blood pressure are recorded by wearable sensor nodes in [13], and a fuzzy support vector machine (SVM) performs the emotion recognition.
Audio- and video-based signals are used separately or combined with physiological signals for multimodal emotion recognition [14-16]. The EEG and facial expressions were used in [17] for multimodal emotion recognition: a combination of the CNN and an attention mechanism extracts the essential features from facial expressions, and a CNN extracts the spatial features from the EEG signals. The features of the different modalities are merged at the feature level, and binary scenarios on the DEAP and MAHNOB-HCI datasets were considered for performance evaluation. Another method based on EEG signals and facial expressions was presented in [18]. The authors in [19] used facial expressions, GSR, and EEG signals with a hybrid fusion strategy; they considered three emotions on the LUMED-2 dataset and four classes on the DEAP dataset. In [20], a 3D-CNN extracts the spatiotemporal features from the EEG signals and the video. A hybrid multimodal data fusion method was presented in [21] to fuse the audio and video signals from the DEAP dataset using a latent space linear map. Principal component analysis (PCA) and a CNN were used for fusion and feature extraction from EEG and audio signals in [22], and then the grey wolf optimization algorithm was employed for selecting the combined features. Since the heart rate can be detected from the photoplethysmography (PPG) signal, some research used PPG. A method based on PPG and GSR signals was proposed in [23], which uses a 1D-CNN autoencoder model and a lightweight model obtained using knowledge distillation; the performance of the model is evaluated on the DEAP and MERTI-Apps datasets. The heart rate was extracted from PPG signals in [24], and then a combination of a 1D-CNN and long short-term memory (LSTM) was adopted for classification on MAHNOB-HCI. Features in the time and frequency domains were extracted from PPG and GSR signals in [25] for emotion recognition; it was shown that feature selection with random forest recursive feature elimination and classification by the SVM yields the highest accuracy.

Table 1 summarizes the recently introduced research on multimodal emotion recognition from biological signals. It is observed that DEAP is the most used dataset. Also, most works focus on time-domain features, while time-frequency analyses were adopted in [8, 12]. Feature concatenation was considered after feature extraction from each modality, and cross-modal correlation was not considered in the feature extraction process.

Proposed Method
3.1. Dataset. To evaluate the performance of the proposed method, we consider the DEAP dataset [26].

Proposed Method.
Here, the proposed method for multimodal emotion recognition from EEG and PPS signals is explained in detail. The general framework of the proposed method is shown in Figure 2 and consists of the following four main steps: data fusion, feature extraction, feature reduction, and classification.

Fusion.
Previous works based on multimodal signals usually extract the features from different modalities separately and then merge the extracted features. In this manner, the cross-modal correlation is not considered; also, many redundant features arise. To overcome this drawback, we propose to merge the multimodal signals before any feature extraction. Let X_EEG and X_PPS denote the matrices of size 32 × 384 and 8 × 384, respectively. After fusion, the matrix X_m of size 40 × 384 is obtained and considered for further processing.
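As a concrete illustration of this fusion step, the channel-wise concatenation of the two modality matrices can be sketched as follows (the arrays here are random placeholders, not actual DEAP signals):

```python
import numpy as np

# Hypothetical trial data: 32 EEG channels and 8 PPS channels,
# each with 384 samples (3 s at 128 Hz).
rng = np.random.default_rng(0)
X_eeg = rng.standard_normal((32, 384))
X_pps = rng.standard_normal((8, 384))

# EEG-PPS fusion: stack the modalities along the channel axis so that
# intramodal and cross-modal correlations can be analyzed jointly.
X_m = np.vstack([X_eeg, X_pps])
print(X_m.shape)  # (40, 384)
```

The PPS-EEG variant examined later in the paper is simply `np.vstack([X_pps, X_eeg])`.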

Feature Extraction.
The Stockwell transform was introduced to overcome the drawbacks of the short-time Fourier transform (STFT) and the wavelet transform while benefiting from their advantages and characteristics [27]; however, there are some differences. The STFT uses a fixed window size for signal analysis, resulting in a tradeoff between time and frequency resolution. In contrast, the Stockwell transform uses a variable-length window; hence, different frequency components can be analyzed with different time resolutions, which is necessary for transient and stationary signals. Since the Stockwell transform uses a Gaussian window, it provides a localized time-frequency map (TFM). In contrast, the STFT spreads the spectral energy over multiple time-frequency bins due to its use of rectangular windows. This characteristic of the Stockwell transform allows the time and frequency characteristics of signal components to be identified accurately. The STFT suffers from smearing due to its rectangular analysis window; the Stockwell transform mitigates this issue using a window that smoothly tapers off. Moreover, the Stockwell transform retains phase information, while the STFT distorts the phase due to the windowing process [28, 29].
For a continuous-time signal x(t), the continuous Stockwell transform S(τ, f) is computed as follows [30]:

S(τ, f) = ∫_{−∞}^{∞} x(t) (|f|/√(2π)) e^{−(τ−t)² f²/2} e^{−j2πft} dt = A(τ, f) e^{jθ(τ, f)},

where t and τ are the time variables, f denotes the frequency, and σ = 1/|f| is the scale factor of the Gaussian window. Also, A(τ, f) and e^{jθ(τ, f)} are the magnitude and phase of the Stockwell transform, respectively. The output of the Stockwell transform is a complex-valued matrix whose rows and columns correspond to time and frequency, respectively.
For the discrete signal x[k], k = 0, 1, ..., N − 1, obtained by sampling x(t), with the discrete Fourier transform (DFT) X[n], n = 0, 1, ..., N − 1, the discrete Stockwell transform S[k, n] for n ≠ 0 can be calculated by replacing τ → k and f → n/N as follows [30]:

S[k, n] = Σ_{m=0}^{N−1} X[m + n] e^{−2π²m²/n²} e^{j2πmk/N},  n ≠ 0.

For n = 0, the Stockwell transform equals the DC value of the DFT, S[k, 0] = (1/N) Σ_{m=0}^{N−1} x[m]. The 2D Stockwell transform of a 2D image f(x, y) is computed as follows [30]:

S(u, v, f_u, f_v) = ∫∫ f(x, y) (|f_u||f_v|/(2π)) e^{−((u−x)²f_u² + (v−y)²f_v²)/2} e^{−j2π(f_u x + f_v y)} dx dy.

The shift parameters u and v control the centre position of the Gaussian windows on the different axes. Also, f_u and f_v (f_u ≠ 0 and f_v ≠ 0) denote the frequencies. There is considerable redundancy in the time-frequency matrix provided by the Stockwell transform. The DOST was proposed in [31, 32] to overcome this drawback; it provides a spatial-frequency representation similar to the wavelet transform [32]. The 2D-DOST of an N × N image f(m, n), with 2D Fourier transform F(m, n), is defined as follows:

S(u, v; p_x, p_y) = (1/√(2^{p_x + p_y − 2})) Σ_{m=−2^{p_x−2}}^{2^{p_x−2}−1} Σ_{n=−2^{p_y−2}}^{2^{p_y−2}−1} F(m + v_x, n + v_y) e^{j2π(mu/2^{p_x−1} + nv/2^{p_y−1})}.

Here, v_x = 2^{p_x−1} + 2^{p_x−2} and v_y = 2^{p_y−1} + 2^{p_y−2} are the horizontal and vertical frequencies, respectively, and p_x, p_y = 0, 1, ..., log₂(N) − 1. For this image, there are N² DOST points. The 2D-DOST gives information about the frequencies S(u, v) in a bandwidth of 2^{p_x−1} × 2^{p_y−1} frequencies [30].
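The dyadic band-partitioning idea behind the DOST can be illustrated with a minimal one-dimensional sketch: each dyadic band of the orthonormal DFT is locally inverse-transformed to obtain time-localized coefficients. This is an illustrative simplification (negative-frequency bands are omitted for brevity), not the optimized 2D-DOST algorithm of [31, 32]:

```python
import numpy as np

def dost_1d_sketch(x):
    """Illustrative 1D DOST: split the orthonormal DFT into dyadic
    bands [2^p, 2^(p+1)) and inverse-transform each band locally,
    yielding coarse time resolution at low frequencies and fine
    time resolution at high frequencies."""
    N = len(x)                       # N must be a power of two
    X = np.fft.fft(x, norm="ortho")
    bands = [X[0:1]]                 # DC component
    p = 0
    while 2 ** p < N // 2:
        lo, hi = 2 ** p, 2 ** (p + 1)
        # a local inverse FFT localizes the band in time (orthonormal)
        bands.append(np.fft.ifft(X[lo:hi], norm="ortho"))
        p += 1
    return bands

x = np.cos(2 * np.pi * 8 * np.arange(64) / 64)
bands = dost_1d_sketch(x)
print(len(bands), bands[-1].size)  # 6 16
```

Note how the number of coefficients per band doubles with frequency, giving the wavelet-like tiling the text describes, with far fewer coefficients than the redundant Stockwell transform.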
As mentioned, the input of the 2D-DOST is an N × N image, and usually N is a power of two for computational efficiency. Hence, each X_m with the size of 40 × 384 is partitioned into six partitions, resulting in X_m^{(i)}, i = 1, ..., 6, each with the size of 40 × 64. Then, each X_m^{(i)} is resized to 64 × 64. After that, the 2D-DOST is computed for each X_m^{(i)} to obtain S_m^{(i)}. Finally, the time-frequency matrix S_m of trial X_m is computed by concatenating the magnitudes of the six partial transforms, S_m = [|S_m^{(1)}|, |S_m^{(2)}|, ..., |S_m^{(6)}|].

CNNs provide several benefits for analyzing TFMs. CNNs are particularly effective at capturing local patterns and features. TFMs contain localized structures; hence, CNNs can automatically learn and extract relevant local features from these maps. This enables the model to capture time-varying patterns and frequency-specific information. TFMs often exhibit hierarchical structures, where low-level features correspond to basic signal components and higher-level features capture more complex relationships and patterns. CNNs can learn these hierarchical representations by stacking multiple convolutional layers. This allows them to capture both low-level details, such as individual frequency components, and high-level features that represent more abstract signal characteristics. TFMs are susceptible to noise and variations introduced during signal acquisition or processing. CNNs have demonstrated robustness to noise and variations: by leveraging local receptive fields and pooling operations, CNNs can effectively suppress noise and capture invariant features in TFMs. This robustness enhances the model's ability to analyze the TFM in the presence of noise or variations [33-35].
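The partitioning and resizing of X_m described above can be sketched as follows; the paper does not state the interpolation method used for resizing, so the nearest-neighbour scheme below is an assumption:

```python
import numpy as np

# Hypothetical fused trial matrix: 40 channels x 384 samples.
X_m = np.arange(40 * 384, dtype=float).reshape(40, 384)

# Split the time axis into six 40 x 64 partitions (384 / 64 = 6).
partitions = np.split(X_m, 6, axis=1)

def resize_nn(a, shape):
    """Nearest-neighbour resize of a 2D array to the given shape
    (an assumption; any standard image resize would also work)."""
    rows = np.arange(shape[0]) * a.shape[0] // shape[0]
    cols = np.arange(shape[1]) * a.shape[1] // shape[1]
    return a[np.ix_(rows, cols)]

# Resize each 40 x 64 partition to the 64 x 64 input of the 2D-DOST.
resized = [resize_nn(p, (64, 64)) for p in partitions]
print(len(resized), resized[0].shape)  # 6 (64, 64)
```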
The CNN extracts multiscale localized spatial features from the input image using different layers, including image input, convolutional, batch normalization, rectified linear unit (ReLU), pooling, fully connected, and softmax layers. The convolutional layers generate high-level features by detecting local patterns such as lines and edges; small-sized filters, or kernels, are employed for this purpose. The minibatch process normalizes the output of the convolution layers to reduce the sensitivity to initialization and increase the training speed. After this layer there is a nonlinear activation function, called ReLU, with the input-output relation r_out = max(0, r_in). There are many highly correlated high-level features at the output of the ReLU layer, and training on such features requires more computational resources. Therefore, the pooling layer is employed to reduce the number of high-level features at the output of the ReLU layer. This layer generally performs downsampling with functions such as average pooling, global maximum pooling, maximum pooling, and global average pooling, of which max pooling is the most frequently used; this function selects the maximum value in the pooling window. The output of the last pooling layer is given to the flatten layer, which converts the feature maps from matrix form to vector form. The elements of this vector are the input of the fully connected and softmax layers that act as a traditional multilayer perceptron.
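The convolution, ReLU, and max-pooling operations described above can be demonstrated with a toy numpy sketch (a single hand-written 3 × 3 kernel on a small random "TFM"; a real CNN stacks many such layers with learned kernels):

```python
import numpy as np

def conv2d(x, k):
    """Valid 2D convolution (cross-correlation, as in CNN layers)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def relu(x):
    return np.maximum(0, x)  # r_out = max(0, r_in)

def max_pool(x, k=2):
    """k x k max pooling with stride k: keep the maximum per window."""
    h, w = x.shape
    return x[:h - h % k, :w - w % k].reshape(h // k, k, w // k, k).max(axis=(1, 3))

tfm = np.random.default_rng(0).standard_normal((6, 6))  # toy "TFM"
kernel = np.array([[1.0, 0.0, -1.0]] * 3)               # vertical-edge detector
feat = max_pool(relu(conv2d(tfm, kernel)))
print(feat.shape)  # (2, 2)
```

The 6 × 6 input shrinks to 4 × 4 after the valid convolution and to 2 × 2 after pooling, mirroring the progressive feature-map reduction described in the text.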
Designing new structures for the CNN and training them is time consuming and requires a huge number of labelled training samples. Transfer learning is utilized to solve this challenge. Generally, transfer learning means using a pretrained CNN for a new problem. To this end, only the number of neurons in the last dense layer is modified according to the number of classes of the new problem, and all or some weights of the pretrained network are refined considering the training data of the new scenario. Also, the training samples are resized considering the size of the input image layer. After training, the features at the output of the flatten layer are considered deep features and used for further processing.

Feature Reduction.
Some high-level deep features obtained from the flatten layer may be highly correlated, increasing the redundancy in the feature vector given to the classifier. The redundant features increase the training complexity and the probability of overfitting; hence, they should be removed from the feature vector. Semisupervised methods combine the efficiencies of both supervised and unsupervised methods and balance discrimination and generalization. This paper uses the semisupervised dimensionality reduction (SSDR) proposed in [33] for feature reduction.
Let n_t and n_0, respectively, denote the number of training samples and the number of extracted deep features. Accordingly, s_1, ..., s_{n_t} ∈ R^{n_0} are the training feature vectors and S_1 = [s_1, ..., s_{n_t}]. In this method, n_M pairs of training samples belonging to the same class and n_C pairs from different classes, respectively, construct the must-link constraints M and the cannot-link constraints C. SSDR obtains the new feature vector set G = W^T S_1, where W = [w_1, ..., w_{n_r}], with W^T W = I, is the projection matrix, and the new features should preserve the structure of the original features. To this end, the objective function J(w) is defined as follows:

J(w) = (1/(2n_t²)) Σ_{i,j} (w^T s_i − w^T s_j)² + (α/(2n_C)) Σ_{(s_i, s_j)∈C} (w^T s_i − w^T s_j)² − (β/(2n_M)) Σ_{(s_i, s_j)∈M} (w^T s_i − w^T s_j)².

The parameters α and β balance the cannot- and must-link constraints. The concise form of the objective function can be expressed as

J(w) = w^T S_1 L S_1^T w, (7)

where L = D − Y denotes the Laplacian matrix and D is the diagonal matrix obtained as D_ii = Σ_j Y_{i,j}. The elements of the matrix Y are obtained as follows: Y_{i,j} = 1/n_t² + α/n_C if (s_i, s_j) ∈ C; Y_{i,j} = 1/n_t² − β/n_M if (s_i, s_j) ∈ M; and Y_{i,j} = 1/n_t² otherwise.

It is observed that the performance of SSDR depends on the parameters α and β. Hence, Bayesian optimization is utilized to find their optimum values that maximize the accuracy.
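A minimal sketch of the SSDR projection under these definitions: a pairwise weight matrix rewards separating cannot-link pairs and penalizes separating must-link pairs, and the projection comes from the top eigenvectors of S_1 L S_1^T. The global 1/n_t² term is dropped for brevity, so the numerical details are a simplification of the method in [33]:

```python
import numpy as np

def ssdr(S, must, cannot, alpha=1.0, beta=10.0, n_dims=2):
    """Simplified SSDR: build pairwise weights Y (alpha for cannot-link,
    -beta for must-link), form the Laplacian L = D - Y, and project onto
    the top eigenvectors of S L S^T. S has one column per sample."""
    n = S.shape[1]
    Y = np.zeros((n, n))
    for i, j in cannot:
        Y[i, j] = Y[j, i] = alpha        # push cannot-link pairs apart
    for i, j in must:
        Y[i, j] = Y[j, i] = -beta        # pull must-link pairs together
    D = np.diag(Y.sum(axis=1))
    L = D - Y                            # Laplacian of the constraint graph
    M = S @ L @ S.T
    vals, vecs = np.linalg.eigh(M)
    W = vecs[:, np.argsort(vals)[::-1][:n_dims]]  # top eigenvectors
    return W.T @ S                       # reduced features G = W^T S

# Toy data: 20 samples of 5-dim features, two classes shifted in dim 0.
rng = np.random.default_rng(1)
S = rng.standard_normal((5, 20))
S[0, 10:] += 3.0
must = [(i, j) for i in range(10) for j in range(i + 1, 10)][:15]
cannot = [(i, j + 10) for i in range(5) for j in range(5)]
G = ssdr(S, must, cannot, n_dims=2)
print(G.shape)  # (2, 20)
```

In the paper, α and β are not fixed by hand as above but tuned by the Bayesian optimizer.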

Classification.
Here, several classifiers, including the SVM, kNN, ANN, decision tree, and random forest, are considered separately to obtain the performance of the proposed method. The performance of these classifiers depends on their parameters: for the SVM, the kernel type and box constraint; for the kNN, the number of neighbours, the distance metric, and the weighting scheme; for the decision tree, the maximum number of splits; and for the random forest, the minimum leaf size and the number of predictors to sample should be optimized. A joint Bayesian optimization finds their optimum values, as shown in Figure 3. It should be mentioned that the structure of the ANN is chosen according to the dense layers in the corresponding CNN.
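The parameter search for a classifier can be sketched with scikit-learn; a plain grid search is used below as a dependency-free stand-in for the paper's Bayesian optimizer (e.g., scikit-optimize's BayesSearchCV could replace GridSearchCV with the same interface), and the data are synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the reduced deep-feature vectors and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Search over the SVM's box constraint C and kernel width gamma
# (the two SVM parameters the text says must be optimized).
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

A Bayesian optimizer differs in that it models the accuracy surface and proposes promising parameter values sequentially instead of exhaustively scanning a grid.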

Results and Discussion
This section explains the simulations performed to assess the performance of the proposed method and the obtained results. The confusion matrix, accuracy (Acc), sensitivity (Sens), precision (Prec), kappa, and F_1 scores are calculated and reported. These metrics are calculated as follows:

Acc = (TP + TN)/(TP + TN + FP + FN),
Sens = TP/(TP + FN),
Prec = TP/(TP + FP),
F_1 = 2 · Prec · Sens/(Prec + Sens),
kappa = (Acc − A_r)/(1 − A_r),

where the numbers of correctly classified and correctly rejected multimodal signals are denoted by true positive (TP) and true negative (TN), respectively. Conversely, the numbers of incorrectly identified and incorrectly rejected multimodal signals are given by false positive (FP) and false negative (FN), respectively. Also, A_r = 1/N_c is the random accuracy, where N_c is the number of classes.
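These metrics can be computed directly from the confusion counts, for example (the counts below are hypothetical, not the paper's results):

```python
def metrics(tp, tn, fp, fn, n_classes=2):
    """Accuracy, sensitivity, precision, F1, and kappa from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sens = tp / (tp + fn)
    prec = tp / (tp + fp)
    f1 = 2 * prec * sens / (prec + sens)
    a_r = 1 / n_classes                 # random accuracy A_r = 1/N_c
    kappa = (acc - a_r) / (1 - a_r)
    return acc, sens, prec, f1, kappa

# Hypothetical two-class confusion counts.
acc, sens, prec, f1, kappa = metrics(tp=90, tn=85, fp=15, fn=10)
print(round(acc, 3), round(kappa, 3))  # 0.875 0.75
```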
4.1. Simulation Setup. We adopt the cross-subject validation protocol to determine the train and test data. Hence, the proposed method is subject independent: the data of one subject are held out for testing, and the data of the remaining subjects train the model. This validation scheme repeats the procedure with each subject as test data, and finally the results are averaged. This paper considers some frequently used pretrained CNNs for deep feature extraction from the 2D-DOST content, including AlexNet, VGG19, ResNet18, Inception-v3, and EfficientNet-B0. The results given in Table 3 show that the EEG-PPS fusion yields the highest accuracy, equal to 0.953 and 0.928 for the two- and four-class scenarios, respectively. It is observed that the random and PPS-EEG fusions have close accuracy, where the accuracy of the EEG-PPS scheme is slightly higher. This fusion scheme preserves the intramodal correlations among different channels and also considers the cross-modal correlations among the signals of different modalities. In contrast, a random placement cannot preserve the intramodal correlations among channels due to the random location of signals.
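The leave-one-subject-out protocol described in the simulation setup can be sketched with scikit-learn's LeaveOneGroupOut (toy data; the subject counts here are illustrative, not DEAP's 32 subjects):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical data: 4 subjects, 10 trials each, 5 features per trial.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5))
y = rng.integers(0, 2, size=40)
groups = np.repeat(np.arange(4), 10)  # subject id of each trial

logo = LeaveOneGroupOut()
n_folds = 0
for train_idx, test_idx in logo.split(X, y, groups):
    # train on all other subjects, test on the one held-out subject
    assert set(groups[test_idx]).isdisjoint(set(groups[train_idx]))
    n_folds += 1
print(n_folds)  # 4: one fold per held-out subject
```

The per-fold accuracies are then averaged, exactly as the cross-subject protocol in the text prescribes.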
Also, comparing the results of only-EEG and only-PPS signals indicates that the EEG signals are more informative than the PPSs; hence, their fusion reaches a higher accuracy than using either one alone. It should be noted that the maximum accuracy in both scenarios is obtained using the deep features extracted by the Inception-v3 CNN and the SVM classifier.
The structure of Inception-v3 [36] is given in Table 4. It should be noted that the output size of each module is the input size of the next one. The structure of the inception modules is given in Figure 4.
Tables 5 and 6 present the confusion matrices of the proposed method for the two- and four-class scenarios, respectively. It is observed that the accuracy of detecting negative emotions is slightly higher than that of positive ones in the two-class scenario. Notably, the minimum sensitivity is 94.7%, higher than in the recently introduced works. In the four-class scenario, the angry, happy, calm, and sad emotions are recognized most accurately, in that order. Also, the values of the kappa and F_1 scores indicate the efficiency of the proposed method.

Accuracy for Different Pairs of Classifiers and the CNN.
Tables 7 and 8 present the accuracy and kappa score of the proposed method for different pairs of CNN and classifier to find the pair that reaches the highest accuracy. Notably, each pair's reported accuracy is the maximum obtained by optimizing the SSDR and classifier parameters in the EEG-PPS fusion scheme. It is observed that in both scenarios, the combination of Inception-v3 and the SVM yields the highest accuracy. ResNet18 and EfficientNet-B0 have close performance, lower than Inception-v3 and higher than AlexNet and VGG19; also, VGG19 performs better than AlexNet. For all CNNs, the SVM with a Gaussian kernel reaches the highest accuracy, and after that, the ANN has the highest accuracy in most cases.
Table 9 shows the effect of feature reduction on the performance of the proposed method. We considered the proposed method without feature reduction, with unsupervised PCA, with supervised LDA, with the combination of PCA and LDA, with static SSDR, in which the parameters are not optimized, and with optimized SSDR. It is observed that, in general, using feature reduction increases the accuracy. Since LDA is supervised, it has higher accuracy than unsupervised PCA; however, the generalization of LDA is lower than that of PCA. To overcome this issue, a combination of them, PCA + LDA, can be used, which reaches a higher accuracy than either used alone. The parameters of static SSDR are set randomly, and it is observed that its performance is slightly lower than that of the hybrid PCA + LDA scheme.

Conclusion
This paper proposed a new method for emotion recognition from multimodal signals, including the EEG in 32 channels and the PPS in eight channels. The proposed method employs the 2D-DOST to analyze the relations between the multimodal signals. Then, a CNN was used to extract deep local features from the absolute values of the 2D-DOST. After feature reduction by SSDR, a classifier determines the emotion, with the parameters of the SSDR and the classifier tuned by solving an optimization problem. The results showed that the deep features extracted by the Inception-v3 network and their classification by the Gaussian SVM reached the highest accuracies, equal to 0.953 and 0.928 for the two- and four-class scenarios on the DEAP dataset, respectively. Several fusion schemes to combine the EEG and PPS signals were examined, and it was observed that the scheme [X_EEG; X_PPS] has the maximum accuracy. Also, it was shown that the optimized SSDR has higher accuracy than frequently used feature reduction schemes such as PCA and LDA. The results indicate the efficiency of multimodal emotion recognition compared to the unimodal approach. Also, the proposed method outperforms the recently introduced methods.
The DEAP dataset contains EEG and PPS signals. EEG signals were recorded using 48 electrodes. The PPSs are horizontal electrooculography (hEOG), vertical EOG (vEOG), zygomaticus major electromyography (zEMG), trapezius EMG (tEMG), galvanic skin response (GSR), respiration belt, plethysmograph, and temperature. All signals were downsampled to 128 Hz. EEG and PPS signals were passed through bandpass and lowpass filters, respectively. The middle 30 seconds of the 63 seconds of recorded data were considered for further processing, since it is generally assumed that each subject reaches a stable emotional state in the middle of the video. The selected part of the signals was partitioned into segments with a duration of three seconds such that consecutive segments have a 50% overlap with each other. Therefore, there are 40 trials for each subject, each trial with 19 segments of 384 samples. This paper considers two scenarios based on the valence and arousal ratings of the emotional signals. The binary scenario classifies the multimodal signals based on the valence rating into positive and negative emotions, as shown in Figure 1(a). Conversely, the four-class scenario considers the 2D valence-arousal model for classifying emotions into one of the following categories: sad, calm, happy, and angry, as shown in Figure 1(b).
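The segmentation numbers above (19 segments of 384 samples per 30-second trial) can be verified with a short sketch at 128 Hz sampling with 3-second windows and 50% overlap:

```python
import numpy as np

fs = 128                      # sampling rate after downsampling (Hz)
signal = np.zeros(30 * fs)    # middle 30 s of one channel (placeholder data)
seg_len = 3 * fs              # 3-second segments -> 384 samples
hop = seg_len // 2            # 50% overlap -> 192-sample hop

starts = range(0, len(signal) - seg_len + 1, hop)
segments = [signal[s:s + seg_len] for s in starts]
print(len(segments), len(segments[0]))  # 19 384
```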

Figure 2: General framework of the proposed multimodal emotion recognition.

Figure 3: The procedure used to optimize the parameters of the SSDR and the classifier.

Table 1: Summary of multimodal emotion recognition from biological signals.

There are several ways to combine the EEG and PPS signals to construct the matrix X_m, such as EEG-PPS, X_m = [X_EEG; X_PPS], and PPS-EEG, X_m = [X_PPS; X_EEG]. Another way is to locate the channels of the EEG and PPS signals randomly in the rows of the matrix X_m. There are several such placements; we examined several of them, and the highest accuracy is reported. Also, the results of using only EEG and only PPS signals are obtained.

Table 10 compares the performance of the recently introduced multimodal emotion recognition approaches. As observed, the EEG is the most frequently used modality in multimodal emotion recognition systems. Most multimodal schemes considered the EEG and other biological signals such as EOG, PPS, GSR, and facial expressions, with the EEG and PPS signals being used most. Generally, the EEG + PPS scheme reaches a higher accuracy than the other combinations of biological signals. It is observed that the proposed method achieves higher accuracy than the recently introduced works.

Table 2: Parameters used for tuning the deep feature extractors.

Table 3: Accuracy of different fusion models. The bold values represent the highest accuracies.

Table 9: The effect of feature reduction on the accuracy.

Table 5: Confusion matrix of the two-class scenario.

Table 6: Confusion matrix of the four-class scenario.