Multimodal and Multitask Learning with Additive Angular Penalty Focus Loss for Speech Emotion Recognition



Introduction
Speech not only explicitly expresses linguistic content but also implicitly conveys the speaker's emotional states such as sadness, happiness, and fear. Speech emotion recognition (SER) aims to automatically identify these emotional states [1][2][3] and has many applications, such as human-computer interaction, information recommendation, and health detection. Consequently, methods for SER have been deeply investigated. In addition to methods based on handcrafted features [4,5], many methods are based on deep learning [6]: they convert speech signals into spectrograms and then apply various deep learning models to them. Among these, deep convolutional neural networks (DCNNs) and recurrent neural networks are the most widely used [7,8]. Temporal convolutional networks are also popular for SER [9,10]. These works focus on the innovative design of neural network structures [11,12], for example by adding attention mechanisms [13] or applying transfer learning to SER [14].
The first problem stems from small training data [15]: a model trained on little data easily overfits, which in turn weakens its generalization ability. Data augmentation is an effective way to address this problem [16]. For example, generative adversarial networks (GANs) and their variants are often applied to generate new samples [17][18][19]. Alternatively, a larger dataset can be constructed directly from existing data with hand-crafted features [20]. Because a large amount of unlabeled data exists, transfer learning [15] and semisupervised methods [21] can also be applied to expand the training data.
Another problem comes from insufficient features for SER. In such cases, multimodal learning can be applied to learn features from different angles. These features are complementary, so they can describe the internal semantics of speech more completely and accurately. For example, speech and text can be integrated to extract features for SER [22][23][24]. In addition to speech and text, another method also considers facial expression and motion through a transformer encoder and then combines these features for classification [25]. These methods have achieved good results.
To mitigate overfitting, multitask learning can be applied to SER. For example, one method takes language classification as an auxiliary task and speech emotion recognition as the main task [26]. Another selects speech emotional features for every pair of classes, independent of the speaker, as multiple classification tasks and then ensembles their results [27]. Hierarchical multitask learning has been proposed that uses coarse classification and fine classification as two tasks [28]. Other methods use unsupervised reconstruction as an auxiliary task [29]. A more complicated method obtains a multiscale unified metric [30], where phone recognition and gender recognition are the auxiliary tasks. All of these methods are based on the single modality of speech.
The emotional labels of speech may be uncertain [1] for frame-based SER methods. When each speech sentence is segmented into frames, each frame inherits the label of the sentence, which easily leads to noisy labels [2]. New methods have been proposed to solve this problem, such as an iterative self-learning framework with four specific label-change rules [31] and a self-labeling method for each speech frame [2]. Multiclassifier mutual learning has also been proposed [1], where all classifiers classify each sample and their results are combined to construct its new label. Some special issues have been emphasized, such as the uneven length of input speech [32], which can be handled by DCNN and LSTM (long short-term memory). Besides, hand-crafted features of multivariate time series, bidirectional echo state networks, and sampling methods are used to address the class-imbalance problem [33]. The individual standardization network aims to reduce the emotional confusion caused by individual differences [34]. To extract and select optimal features, the cryptographic structure [35], sparse coding [36], and a hybrid network of capsule network and transfer learning-based mixed task net have been proposed [3]. In addition, ensemble deep learning [37] and supervised contrastive learning [38] have also been proposed.
A DCNN needs a loss function to guide its learning for SER, yet most loss functions are not specifically designed for speech emotion recognition [39,40]. Although multimodal learning and multitask learning have each been used independently for SER, they have not been combined. This paper proposes a new method that combines multimodal and multitask learning with a new additive angular penalty focus loss (MTAP) to recognize speech emotion. The main contributions are as follows: (1) to solve the problems of fuzzy decision boundaries and the imbalance between difficult and easy speech samples, a new additive penalty focal loss function (APFL) is proposed for SER; (2) a new method is proposed for SER that combines multimodal and multitask learning with APFL, where gender recognition is taken as an auxiliary task, and spectrogram, text, and audio are the different modalities of speech samples. Section 2 provides the related work, while the proposed method is introduced in Section 3. Experimental results and analysis are presented in Section 4. Section 5 presents conclusions.

Related Work
As our contributions concern the combination of our new loss function with multimodal and multitask learning for speech emotion recognition, related methods are compared and analyzed below.

Loss Function.
The loss function most widely used in speech emotion recognition is the cross-entropy loss (CEL) [2]. The center loss function [41] is also used for SER to pull features of the same emotional category toward its center [40]. However, it only improves intraclass compactness without enlarging the distance between classes. The triplet loss function [42] is also used for SER [39]; it aims to reduce the distance between samples of the same class and enlarge the distance between heterogeneous samples. Another class-specific angular Softmax loss is designed to train a time-frequency convolutional neural network [43]. In other fields, new loss functions have also been proposed, such as the face recognition loss ArcFace [44], which transforms Euclidean space into angle space and introduces an additive angular penalty on the target category to strictly control the boundary of each class. This loss reduces the intraclass distance and increases the interclass distance. The focal loss (FL) function [45] was proposed to solve the extreme imbalance between foreground and background in data; it adjusts the contribution of hard samples to the total loss by introducing modulation parameters. These losses have not been used for SER. In particular, different from these methods, our method combines ArcFace and FL in an innovative way to solve the problems of fuzzy decision boundaries and the imbalance between difficult and easy samples. Generally, a deep neural network determines the gradient through the loss function and then uses it to modify the network weights; our method works in the same way. GHM (gradient harmonizing mechanism) [46] is different: inspired by the gradient norm distribution, it first calculates the gradient density and then adds a harmonic parameter to the gradient of each sample according to that density. In practice, this modification of the gradient can be realized equivalently by reconstructing the loss function. GHM changes with the density, which may vary during training, whereas our method is a static loss function: it does not adapt to changes in the data distribution and does not change during training. However, GHM is currently used for object detection, not SER; its principle could also be introduced into our method to further improve performance.

Multimodal Learning.
Multimodal learning can learn features from different modalities of samples. These features can be complementary, describing the semantics of emotional speech more completely and accurately. For example, a method integrating speech and text modalities has been proposed that emphasizes the temporal relationship between different modalities [22]. Another method also uses speech and text but introduces an attention network to promote interaction and information fusion between them [23]. Alternatively, two different neural networks can be applied to extract features of speech and text, respectively, with their features then concatenated [24]. These methods do not consider the relationship between different modalities. In addition to speech and text, other modalities such as facial expression and motion have been considered, where the similarities between speech and text, and between speech and the other modal features, are learned through a transformer encoder before the features are combined [25]. Different from these methods, our method combines spectrogram features extracted by a deep neural network, text features extracted by the pretrained language model BERT, and audio features extracted by the pretrained VGGish sound model.

Multitask Learning.
Language and gender can affect the performance of speech emotion recognition [26], which can be exploited within the framework of multitask learning. For example, emotion classification can be the main task with language classification as an auxiliary task [26]. Taking gender recognition as an auxiliary task is conducive to extracting distinctive features and increasing the distinguishability between emotional categories [47][48][49]. Speakers are also used in multitask learning for speech emotion recognition, where each task selects features for a pair of classes independent of the speaker and the classification results are then ensembled [27]. A hierarchical multitask learning framework has also been proposed that takes coarse classification and fine classification as two tasks [28]. Data augmentation and unsupervised reconstruction can be taken as auxiliary tasks to avoid the difficulties caused by data annotation [29]. Another, more complicated method obtains a multiscale unified metric [30] by multitask learning, where the classification of both Emotion States Category and Emotion Intensity Scale is the main task and phone recognition and gender recognition are the auxiliary tasks.
One method combines multimodal learning and multitask learning [50]. However, it aims at speech recognition and identity recognition rather than speech emotion recognition, so the learned features may deviate from emotion recognition. Furthermore, it uses video, text, and audio; however, video data are difficult to prepare, as the speech sentences in video are not easy to determine.

Additive Angle Penalty Focal Loss
This section proposes a new additive penalty focal loss function. Although the ArcFace loss [44] and focal loss [45] have been proposed in computer vision, they have not been considered for SER. Furthermore, each of ArcFace and focal loss considers only one aspect of optimization, either the fuzzy decision boundary or class imbalance, resulting in limited improvements. Thus, we combine them in an innovative way to extract more discriminative features. The combination not only enhances intraclass compactness and interclass distance but also assigns more appropriate weights to hard examples, so it is clearly stronger at learning discriminative features than either ArcFace or focal loss alone. As it does not use domain knowledge of SER, it can generally be applied to other domains as well; in our case, we apply it to improve SER.

Additive Angle Penalty Focal Loss.
Fuzzy decision boundaries and class imbalance are two challenges faced by speech emotion recognition. To tackle these issues and improve recognition accuracy, APFL is devised as follows:

$$\mathcal{L}_{\mathrm{APFL}} = -\frac{1}{N}\sum_{i=1}^{N}\left(1 - p_{x_i,y_i}\right)^{\gamma}\log p_{x_i,y_i},$$

where

$$p_{x_i,y_i} = \frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)} + \sum_{j=1,\,j\neq y_i}^{n} e^{s\cos\theta_j}}$$

and $\cos\theta_j = W_j^{\mathsf{T}} x_i$. Here, $W_j$ denotes the $j$-th column of the weight matrix $W$ after L2 normalization, $x_i$ is the L2-normalized feature vector of the $i$-th sample corresponding to the ground-truth class $y_i$, and $\theta_j$ is the angle between $W_j$ and $x_i$. $p_{x_i,y_i}$ denotes the posterior probability of $x_i$ being classified into class $y_i$. $N$ is the number of training samples and $n$ is the number of classes. $s$ is a hyperparameter that should be adjusted carefully to obtain the optimal performance of the model. $m$ is the additive penalty on the angle between $x_i$ and the weight $W_{y_i}$ of its label $y_i$, providing additional guidance that synchronously enhances intraclass similarity and interclass difference; in this way, the issue of fuzzy decision boundaries is alleviated. $\gamma$ guides the model to pay more attention to hard examples through the modulating factor $(1 - p_{x_i,y_i})^{\gamma}$. The idea of APFL is simple to implement in any deep-learning framework: wherever a Softmax loss or similar loss function is used, it can be replaced with APFL to achieve better performance.
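To make the definition concrete, the following is a minimal PyTorch sketch of APFL under the formulas above; the class name and default hyperparameter values are our own choices for illustration, not from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class APFLoss(nn.Module):
    """Sketch of the additive penalty focal loss (APFL)."""
    def __init__(self, in_features, num_classes, s=10.0, m=0.3, gamma=2.0):
        super().__init__()
        self.W = nn.Parameter(torch.randn(in_features, num_classes))
        self.s, self.m, self.gamma = s, m, gamma

    def forward(self, x, labels):
        # cos(theta_j) = dot product of L2-normalized features and weights
        cos = F.normalize(x, dim=1) @ F.normalize(self.W, dim=0)
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the angular penalty m only to the ground-truth logit
        one_hot = F.one_hot(labels, cos.size(1)).bool()
        logits = self.s * torch.where(one_hot, torch.cos(theta + self.m), cos)
        # focal modulation: down-weight well-classified (high p) examples
        p = logits.softmax(dim=1).gather(1, labels.unsqueeze(1)).squeeze(1)
        # gamma = 0 recovers ArcFace; m = 0 recovers the focal loss
        return ((1 - p) ** self.gamma * -torch.log(p)).mean()
```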

Comparison with Different Loss Functions.
In this section, we compare APFL with some relevant loss functions, i.e., the Softmax loss (cross-entropy loss), focal loss, and ArcFace. For simplicity of analysis, we consider the binary classification case with classes $C_1$ and $C_2$.

Geometric Difference.
As illustrated in Figure 1, we compare the decision boundaries. Evidently, APFL imposes stricter decision conditions (for $C_1$ it requires $\theta_1 \leq \theta_2 - m$ rather than $\theta_1 \leq \theta_2$, and similarly for $C_2$), resulting in a clearer boundary with a margin of $\sqrt{2}\,m$ between different classes in the angular space.
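The margin width quoted above can be checked with a short calculation; this derivation is ours, consistent with the ArcFace-style boundary analysis.

```latex
% Decision boundaries of APFL in the (theta_1, theta_2) angle plane:
% choose C_1 when theta_1 + m <= theta_2, choose C_2 when theta_2 + m <= theta_1.
% These are the parallel lines theta_1 - theta_2 = -m and theta_1 - theta_2 = m,
% whose perpendicular distance gives the margin:
\[
\text{margin} \;=\; \frac{\lvert m - (-m)\rvert}{\sqrt{1^2 + (-1)^2}}
\;=\; \frac{2m}{\sqrt{2}} \;=\; \sqrt{2}\, m .
\]
```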

Impact from Examples.
Neither the Softmax loss nor ArcFace takes the influence of the data distribution into account. Multiplying by the modulating factor $(1 - p_{x_i,y_i})^{\gamma}$ alleviates this issue to an extent. Specifically, for examples that are easily misclassified, the factor is close to 1 because $p_{x_i,y_i}$ is small; hence, their loss is nearly unaffected. But for well-classified examples, the factor goes to 0 as $p_{x_i,y_i}$ tends to 1; thus, their loss is down-weighted. This prevents the model from being overwhelmed by easy examples. It is easy to see that these loss functions are in fact special cases of the proposed APFL: when $\gamma = 0$, APFL is equivalent to ArcFace, and when $m = 0$, it becomes the focal loss with L2-normalized features and weights.

Multimodal and Multitask Learning Framework with APFL
This section introduces our proposed multimodal and multitask learning framework with the new loss (MTAP) for speech emotion recognition, shown in Figure 2. It uses the spectrogram, text, and audio of the input speech sample, while gender recognition is taken as the auxiliary task.
4.1. Loss. Due to its effectiveness in speech emotion recognition [14], a DCNN is used to extract features for the classifier. The input speech signal is first converted into a spectrogram, which is then fed into the DCNN to extract features. As illustrated in Figure 3, the dot product between the extracted features and the weights of the last fully connected layer equals the cosine similarity when both are L2 normalized, where $W$ is the weight matrix of the fully connected layer, updated by error backpropagation. We use the arccos function to obtain the angle between them. Afterwards, we add an additive penalty to the angle and map it back to a target logit with the cosine function. Subsequently, we rescale all logits by the fixed feature norm, and the following steps are exactly the same as in the Softmax loss. Finally, we multiply the cross-entropy by the modulating factor to reduce the loss assigned to well-classified examples. Thus, the total loss for our framework is defined by $L_{\mathrm{total}} = L_{\mathrm{APFL}} + \lambda \times L_{\mathrm{gender}}$, where $L_{\mathrm{gender}}$ for the auxiliary task is defined by the Softmax loss, $L_{\mathrm{APFL}}$ is for speech emotion recognition, and $\lambda$ controls the influence of the auxiliary task on the model.
BERT.
BERT [51], a pretrained language model, is used to extract features of the texts. By combining the tasks of Masked Language Model (MLM) and Next Sentence Prediction (NSP), an embedded feature representation of language is learned by self-supervised learning on a large corpus, and the obtained features can be directly used as input for downstream tasks. BERT takes three parts as input: Token embedding, Segment embedding, and Position embedding. Token embedding is the feature representation of the input text, where a Token can be understood as a word in Chinese. Segment embedding handles paired sentences as input and has only two values, 0 and 1: for an input sentence pair, all Tokens of the first sentence are given 0 and all Tokens of the second sentence are given 1. As text is sequential, word order affects the meaning of sentences; however, the Transformer structure cannot capture this information by itself. Position embedding, which is learnable, is designed to make up for this defect. Due to the complexity, the smaller version of BERT, denoted BERT_BASE, is used to extract features of the text corresponding to the speech.
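As a sketch, sentence-level text features can be extracted with the pretrained BERT-base model via the Hugging Face transformers library; using the [CLS] token embedding as the sentence feature is our pooling choice here, not a detail stated in the paper.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def text_features(transcript: str) -> torch.Tensor:
    inputs = tokenizer(transcript, return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        out = bert(**inputs)
    # take the [CLS] token embedding as a fixed 768-d sentence feature
    return out.last_hidden_state[:, 0, :]

print(text_features("I am so happy to see you").shape)  # torch.Size([1, 768])
```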

VGGish.
VGGish is a small VGG network [52] trained on the large AudioSet dataset [53], which contains about 2.1 million videos with 527 sound categories. As a pretrained model, VGGish can be used as an audio feature extractor: it produces a 128-dimensional feature vector for each second of input audio, which can serve as the initial input of another model. Although there are differences between general audio and emotional speech, speech also contains some characteristics of general audio, and these features can be applied to SER [2]. As a DCNN cannot learn enough features from small emotional speech datasets, VGGish is used to extract audio features as complementary features. For each speech sample, these VGGish features are extracted and fused with the features of the other modalities.
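A sketch of such feature extraction follows; we assume the community PyTorch port of VGGish available through torch.hub (harritaylor/torchvggish), since the paper does not name a specific implementation.

```python
import torch

# load the pretrained VGGish audio model from torch.hub (assumed port)
vggish = torch.hub.load("harritaylor/torchvggish", "vggish")
vggish.eval()

def audio_features(wav_path: str) -> torch.Tensor:
    with torch.no_grad():
        # returns one 128-d embedding per second of input audio
        return vggish.forward(wav_path)
```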

Datasets and Evaluation Indicators.
Three benchmark databases are selected to evaluate our method: Interactive Emotional Dyadic Motion Capture (IEMOCAP) [54], Surrey Audio-Visual Expressed Emotion (SAVEE) [55], and the Berlin Emotional Speech Database (EMODB) [56]. IEMOCAP [54] consists of 5 sessions, each performed by a pair of speakers (male and female) in scripted and improvised scenarios. We choose 4 emotion types (angry, happy, neutral, and sad) from the improvised data only; thus, 2280 utterances are used.
We adopt 5-fold cross-validation using the Leave-One-Session-Out (LOSO) strategy: 4 sessions are used for training, while the remaining session is divided into two equal parts for validation and testing.
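A short sketch of this protocol is given below; `data_by_session`, mapping a session id to its utterances, is a hypothetical structure for illustration.

```python
def loso_folds(data_by_session):
    """Yield (train, val, test) splits, holding out one session per fold."""
    sessions = sorted(data_by_session)
    for held_out in sessions:
        train = [u for s in sessions if s != held_out
                 for u in data_by_session[s]]
        held = data_by_session[held_out]
        half = len(held) // 2
        # the held-out session is split into two equal parts
        yield train, held[:half], held[half:]

folds = list(loso_folds({f"Ses0{i}": [f"utt_{i}_{j}" for j in range(4)]
                         for i in range(1, 6)}))
assert len(folds) == 5  # one fold per held-out session
```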
SAVEE [55] is composed of recordings by four native English male actors in seven emotions. It includes 480 utterances in total: 60 anger, 60 disgust, 60 fear, 60 happiness, 120 neutral, 60 sadness, and 60 surprise. On these data, we conduct fourfold cross-validation Speaker-Independent (SI) and fivefold cross-validation Speaker-Dependent (SD) experiments.
EMODB [56] contains 535 emotional utterances performed by 10 actors with seven different emotions: anger, boredom, disgust, fear, happiness, neutral, and sadness. On these data, we adopt a 10-fold cross-validation strategy for both SI and SD experiments.
Generally, the performance of SER is evaluated by two widely used metrics [57][58][59]. One is Weighted Accuracy (WA), the classification accuracy over all test samples, also known as Overall Accuracy. The other is Unweighted Accuracy (UA), the average of the per-class accuracies, also known as Class Accuracy.
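Both metrics map directly onto standard scikit-learn functions, since UA equals the mean per-class recall; a minimal sketch:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

def wa_ua(y_true, y_pred):
    wa = accuracy_score(y_true, y_pred)            # Weighted/Overall Accuracy
    ua = balanced_accuracy_score(y_true, y_pred)   # mean per-class recall
    return wa, ua

# toy example: class 2 has three samples, one of them misclassified
print(wa_ua([0, 1, 2, 2, 2], [0, 1, 2, 2, 1]))  # (0.8, 0.888...)
```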

Experimental Results of Our Proposed New Loss Function
5.2.1. Data Preprocessing. We trim long audio utterances to a shorter duration that covers the 75th percentile of all audio samples in IEMOCAP [57]; thus, the maximum duration is restricted to six seconds. For utterances longer than six seconds, the head and tail are cut. Each trimmed sample is assigned the same emotion label as its utterance. Subsequently, we use the feature extraction method of [58] to obtain spectrograms, where a sequence of overlapping Hamming windows is applied with a frame length of 40 ms and a frame interval of 10 ms. For each frame, we calculate its discrete Fourier transform and then aggregate the short-time spectra to obtain a matrix of size T × F, where T ≤ 600 and F = 400. The last step uses zero padding to obtain a fixed number of time points. Thus, the spectrogram size is 600 × 400 for IEMOCAP, 500 × 400 for SAVEE, and 400 × 400 for EMODB.
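A sketch of this pipeline with librosa is shown below. The paper does not state how F = 400 frequency bins are obtained, so keeping the first 400 bins of a 1024-point FFT is our assumption, and for brevity the sketch trims only the tail of long utterances rather than both head and tail.

```python
import numpy as np
import librosa

def spectrogram(wav_path, sr=16000, max_frames=600, n_freq=400):
    y, sr = librosa.load(wav_path, sr=sr, duration=6.0)   # at most six seconds
    # 40 ms Hamming windows with a 10 ms hop, as described above
    stft = librosa.stft(y, n_fft=1024, win_length=int(0.04 * sr),
                        hop_length=int(0.01 * sr), window="hamming")
    spec = np.abs(stft)[:n_freq].T                        # (frames, 400)
    pad = max(0, max_frames - spec.shape[0])
    return np.pad(spec, ((0, pad), (0, 0)))[:max_frames]  # (600, 400)
```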

Experimental Settings.
DenseNet169 [60] pretrained on ImageNet [61] is used to extract features from the speech spectrograms. The parameter $\gamma \in \{0.10, 0.20, 0.50, 1.0, 2.0, 5.0\}$, the penalty $m \in [0.2, 0.5]$, and the feature scale $s$ is an empirical parameter that should be appropriately large; here, $s = 10$. All experiments use the cross-validation strategy. Besides, we run five times per fold and take the average as the final result of the fold to ensure reliability.
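The search over these hyperparameters together with the five-runs-per-fold averaging can be sketched as follows; `train_and_eval` is a hypothetical routine standing in for the full training loop.

```python
import itertools
import numpy as np

def train_and_eval(fold, gamma, m, s):
    """Placeholder: train on the given fold and return (WA, UA)."""
    return (0.0, 0.0)  # stub values; real training code goes here

gammas = [0.10, 0.20, 0.50, 1.0, 2.0, 5.0]   # candidate gamma values
margins = [0.2, 0.3, 0.4, 0.5]               # candidate penalties m in [0.2, 0.5]

for gamma, m in itertools.product(gammas, margins):
    fold_results = []
    for fold in range(10):                    # e.g., 10-fold CV on EMODB
        runs = [train_and_eval(fold, gamma=gamma, m=m, s=10.0) for _ in range(5)]
        fold_results.append(np.mean(runs, axis=0))   # average of five runs
    print(gamma, m, np.mean(fold_results, axis=0))   # mean (WA, UA) over folds
```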

Speaker Independent Experiments.
Under this strategy, we conduct experiments on EMODB and SAVEE using 10-fold and 4-fold cross-validation, respectively. The results are reported as the average value and standard deviation of WA and UA. Table 1 shows that our new APFL outperforms all compared methods in both WA and UA. In addition, its standard deviation is the smallest in most cases, indicating that it makes the model more stable.

Speaker Dependent Experiments.
The relevant experimental results are reported in Table 2. Our new APFL again clearly outperforms all compared methods in WA and UA, and it achieves more significant improvements than in the Speaker-Independent case. In particular, compared with the ArcFace loss on EMODB, APFL improves both WA and UA by nearly 2%. Similarly, the model using APFL is more stable overall in terms of standard deviation.

Leave-One-Session-Out Experiments.
As described earlier, we choose the improvised speech part of IEMOCAP because it comes from real cases. Experiments are conducted by LOSO with fivefold cross-validation. The optimal parameters and results are reported in Table 3 as means and standard deviations of WA and UA. The model with APFL performs best among all models with the compared loss functions.

Visual Analysis of Loss Functions.
To illustrate the advantages of our new loss function, we use the t-SNE (t-distributed stochastic neighbor embedding) method [62] to visualize the distribution of features extracted by DenseNet169 on test samples of IEMOCAP under the guidance of each compared loss function. The results are shown in Figure 4.
In Euclidean space, the category decision boundary of CE loss is very vague: the categories are basically mixed together and the overall distribution is very loose. Although focal loss performs slightly better, it is still messy. ArcFace shows a significant improvement, with three distinct clusters; however, the happy category (yellow) is largely confused with the other three categories. In contrast, the clusters formed by our APFL are clearly banded in four directions, corresponding to the four categories. In particular, the boundary between the happy category (yellow) and the angry category (blue) becomes clearer, indicating that some samples wrongly classified by the other loss functions are now correctly identified. Similar results can be observed in the angle space. There, the category boundary of CE loss is very vague and basically mixed, making good classification hard. Although focal loss obtains better results, its class boundaries still overlap. By comparison, ArcFace seems to form three separated clusters, but four clusters are still not clearly formed. In contrast, the clusters formed by our APFL are separated into four segments, corresponding to the four categories. The interval between categories is also obviously larger than for the other loss functions, while the arc length within each category becomes smaller. This means that both the intraclass compactness and the interclass difference have been improved by our APFL.
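For reference, such a plot can be produced with a few lines of scikit-learn and matplotlib; in practice `features` and `labels` would come from the trained DenseNet169, and random placeholders keep this sketch self-contained.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 128))   # placeholder for extracted features
labels = rng.choice(["angry", "happy", "neutral", "sad"], size=200)

emb = TSNE(n_components=2, random_state=0).fit_transform(features)
for cls, color in [("angry", "blue"), ("happy", "gold"),
                   ("neutral", "green"), ("sad", "red")]:
    mask = labels == cls
    plt.scatter(emb[mask, 0], emb[mask, 1], s=6, c=color, label=cls)
plt.legend()
plt.show()
```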

Experimental Results of Our MTAP.
As our MTAP consists of several components, ablation experiments are conducted to illustrate the necessity and superiority of each component.

Multimodal.
To understand the effectiveness of each modality more clearly, ablation experiments are conducted on the improvised part of IEMOCAP. The results are shown in Table 4. All modalities are necessary to improve the performance. MTAP achieves an obvious improvement of 2% on UA when the text modality is added to the spectrogram modality. However, the further improvement is limited when the audio modality is then added, indicating that there may be redundancy among the three modalities. To reflect the contribution of each modality to the features learned by MTAP more intuitively, we use t-SNE to visualize the distribution of test samples whose features are extracted by our model in Euclidean space. The results are shown in Figure 5. When only the spectrogram features are used, there are roughly three clusters, but they are basically mixed together without clear boundaries. By comparison, our results in Figure 5(d) show better separation among the three clusters of angry, neutral, and sad, indicating that multiple modalities improve the recognition performance of each emotion category and proving that our method is effective.

Gender Identification as Auxiliary Task.
In the proposed MTAP, gender recognition is taken as the auxiliary task, and experiments are conducted to illustrate its effectiveness. The results are shown in Table 5. The auxiliary task improves the performance of the model in every case, illustrating that it is effective for SER. In more detail, MTL-B performs best in three cases, even surpassing MTL-C, where the only difference is whether the text modality is used. This means that the text modality is not effective for the gender recognition task, which is reasonable because text features do not vary with gender but relate only to the content of the text itself.

Loss Function.
To validate that APFL is a necessary component of MTAP, experiments are conducted where CEL and APFL are each used as the loss function while everything else in MTAP remains unchanged. The results are shown in Table 6. MTAP with APFL outperforms MTAP with CEL, confirming the necessity of the new loss.

Comparison with Recent Methods.
To verify the superiority of our MTAP, several advanced methods from recent years are compared with it; these methods use backbone networks similar to ours, and the experimental settings are the same. The experimental data are the improvised part of IEMOCAP, as it is closer to real situations. The compared methods are basically developed from CNN structures, but they use only one or two modalities of speech; there appears to be no existing method that uses the three modalities of spectrogram, text, and audio. Table 7 shows that our method is optimal in both WA and UA, illustrating its superiority.

Conclusions
This paper proposes a new multimodal and multitask learning method for speech emotion recognition, together with a new additive angle penalty focus loss function to guide network learning. One advantage is that spectrogram, text, and audio features are extracted from different angles and then combined to enrich the features for speech emotion recognition. Another advantage is that the auxiliary task of gender identification improves the generalization ability and transfers complementary knowledge to SER. This is because voice signals have different properties when males and females express the same emotion. If gender is not specified, a neural network model must learn emotional features of both males and females simultaneously, so the feature space is not only large but also sparse; in such a case, many training samples are required, otherwise the model easily overfits. When the gender recognition task is introduced, the model can learn the emotional feature spaces of males and females separately, which is equivalent to learning two smaller subspaces. Generally, the dimension of a subspace is smaller than that of the whole space, so fewer training samples are needed. With the same training samples, the model with the gender recognition task therefore has a stronger learning ability, leading to higher accuracy of speech emotion recognition. Furthermore, the proposed APFL improves compactness within classes, enlarges the difference between classes, and focuses on difficult samples, guiding the network to learn more effective emotional features. Although our method achieves good performance, there is still room for further improvement. For example, more modalities such as the speaker's facial expression and body movements can be considered, and more complicated backbone networks, such as those with attention mechanisms, can also be explored. These will be investigated in future work, together with applications such as the recognition of Parkinson's disease from speech signals.

Figure 1: The comparison of decision margins of different loss functions in the binary classification case. The dashed line represents the decision boundary, and the gray areas are decision margins.

Figure 2: Framework of our MTAP for speech emotion recognition, which fuses multimodal features and uses our new additive angle penalty focus loss function. It also considers gender recognition as the auxiliary task in three different ways: (a) MTL-A uses only spectrogram features for the auxiliary task; (b) MTL-B uses both spectrogram and audio features; (c) MTL-C uses all features.

Figure 3: Framework of the DCNN with the new APFL loss for speech emotion recognition, where the feature $x_i$ and weights $W_j$ are L2 normalized and the logit $\cos\theta_j$ is computed for each class. For the ground-truth class $y_i$, an extra angular penalty $m$ is added to compute $\cos(\theta_{y_i} + m)$ as the new logit. All logits are then multiplied by the scaling parameter $s$ and passed through the Softmax to obtain the prediction probability $p_{x_i,y_i}$ for computing the total loss.

Figure 4: Visualization of the distribution of IEMOCAP samples whose features are extracted under different loss functions, where the first and second columns represent distributions in Euclidean space and angle space, respectively.

Figure 5: Visualization of the distribution of IEMOCAP test samples whose features are extracted with different combinations of modalities: (a) spectrogram, (b) spectrogram with text, (c) spectrogram with audio, and (d) spectrogram with text and audio.

Table 1: Experimental results by SI, where DenseNet169 is used to extract features for the spectrogram.

Table 2: Experimental results by SD, where DenseNet169 is used to extract features for the spectrogram.

Table 3: Experimental results by LOSO, where DenseNet169 is used to extract features for the spectrogram.

Table 4: Performance comparison of our MTAP with different combinations of modalities. Bold fonts indicate the best performance.

Table 5: Performance comparison of our MTAP where gender identification is the auxiliary task.

Table 6: Performance comparison of MTAP with different loss functions and backbone networks, where m = 0.3. Bold fonts indicate the best performance.

Table 7: Performance comparison of MTAP with recent methods on the improvised part of IEMOCAP.

Models                                     WA (%)   UA (%)
CNN-BLSTM [7]                              68.80    59.40
CNN-BLSTM with a two-step predictor [7]    67.30    62.00
Parallel CNN [57]                          71.20    61.90
CNN-GRU-SeqCaps [63]                       72.73    59.71
Variable-length CNN-GRU [64]               71.45    64.22
CNN-TF-GAP [43]                            72.43    64.80
End-to-end ASR and SER [65]                69.70    63.10
CNN-MHSA [59]                              72