Multimodal Sentiment Analysis Based on Interactive Transformer and Soft Mapping

Multimodal sentiment analysis aims to harvest people’s opinions or attitudes from multimedia data through fusion techniques. However, existing fusion methods do not fully exploit the correlation between multimodal data and tend to introduce interference factors. In this paper, we propose an Interactive Transformer and Soft Mapping based method for multimodal sentiment analysis. In the Interactive Transformer layer, an Interactive Multihead Guided-Attention structure composed of a pair of Multihead Attention modules is first utilized to find the mapping relationship between modalities. Then, the obtained results are fed into a Feedforward Neural Network. The Soft Mapping layer, which consists of stacked Soft Attention modules, is finally used to map the results to a higher dimension to realize the fusion of multimodal information. The proposed model can fully consider the relationship between multiple modal pieces of information and provides a new solution to the problem of data interaction in multimodal sentiment analysis. Our model was evaluated on the benchmark datasets CMU-MOSEI and MELD, and the accuracy is improved by 5.57% compared with the baseline standard.


Introduction
Sentiment analysis aims to detect affective states or subjective information from data. It is often used to understand or judge people's attitudes, opinions, and sentiment. Traditional sentiment analysis mainly focuses on text data, using statistical knowledge combined with natural language processing and machine learning techniques to study and analyze the sentiment polarity of sentences or documents [1]. In reality, human sentiment is expressed not only through language, but also through acoustic information (e.g., speakers' tone of voice) and visual information (e.g., speakers' facial expressions and body movements). For example, social media users are no longer satisfied with sharing feelings and emotions in the form of text but tend to use multimedia forms such as pictures or videos when sending blog posts. In this way, they can express their attitudes more abundantly.
Multimodal information at different granularity levels is spread by people, and traditional sentiment analysis methods cannot handle this problem well. On the basis of text information, multimodal sentiment analysis can use multimodal representation learning, multimodal alignment, and multimodal fusion technologies to combine acoustic and visual information to eliminate ambiguity caused by a single modality. At present, the research of multimodal sentiment analysis can be divided into two categories according to the number of talkers: multimodal narrative sentiment analysis and multimodal conversational sentiment analysis [2]. Multimodal narrative sentiment analysis usually transmits the author's personal attitude with narrative information. The expression of information is relatively independent and does not involve the interaction between multiple speakers. For example, in the analysis of public opinion, multimodal narrative sentiment analysis is used to analyze the information on social media platforms such as microblogs and Twitter to obtain users' attitudes towards a hot event. In contrast, there is more than one speaker in multimodal conversational sentiment analysis. The sentiment or attitude of each talker is transmitted in the form of dialogue or communication. In the process of interaction, the sentiment states of the speakers influence each other, jump, and are unstable. For example, in customer sentiment analysis, multimodal conversational sentiment analysis is used to obtain interaction clues among multiple customers and to predict the sentiment evolution trend in the interaction process. The research of multimodal sentiment analysis is not mature, and there are still some unsolved problems. Among them, the establishment of a multimodal information fusion mechanism has become the main bottleneck restricting the development of this field.
Multimodal information fusion aims to fuse the representations of different modalities and retain the key information in each modality during the fusion process. To solve this problem, there are several ideas: feature-level fusion, decision-level fusion, and hybrid fusion [3,4], as shown in Figure 1. The yellow part in Figure 1 shows the feature-level fusion mechanism, which extracts feature vectors from multimodal data, respectively, and fuses them into a multimodal feature vector. The fused vector is used to judge sentiment [5,6]. This fusion method can obtain the association among various modalities, but the features need to be mapped to a shared space for fusion because the semantic spaces of different modalities are different from each other. The blue section in Figure 1 shows the decision-level fusion mechanism, which first conducts single modality sentiment analysis independently and then fuses the results to obtain the final decision [7,8]. This fusion method can design a feature extraction method for the semantic space of each modality to obtain the optimal single modality decision, but independent learning makes the overall time cost too large. The green area in Figure 1 shows the hybrid fusion mechanism, which comprehensively uses the feature-level fusion and decision-level fusion methods to reduce the time cost on the basis of fully learning the associated information of each modality.
In addition, multimodal interaction is also a hot topic. Multimodal interaction aims to supplement the information of different modalities. When the information of one modality is missing, the data of another modality is used to make up the missing part. This kind of interaction is concealed, complex, and dynamic. It makes multiple modalities related to each other and affects the final sentiment judgment. How to accurately and comprehensively model the complex interaction in multimodal data is still an open problem in this field. Multimodal interaction mainly includes two situations: the interaction between features and the interaction between decisions. For the interaction between features, multimodal features in a shared space need to be aligned across the semantic space barriers of different domains to achieve semantic fusion. Several methods such as feature concatenation [9,10] and attention mechanisms [11] have been developed to solve this problem. For the interaction between decisions, it is necessary to understand the correlation between multimodal data, and this is a comprehensive cognitive decision-making process. The voting method and the linear weighting method have been proposed to solve this problem.
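As a small illustration of the two decision-level strategies named above, the following sketch fuses per-modality predictions by linear weighting and by majority voting. The function names, the weights, and the toy probabilities are assumptions made for illustration; they are not taken from the cited methods.

```python
import numpy as np

def linear_weighting(probs_per_modality, weights):
    """Fuse per-modality class probabilities with a normalized weighted average."""
    probs = np.asarray(probs_per_modality, dtype=float)  # shape: (modalities, classes)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                      # make the weights sum to 1
    return w @ probs                                     # shape: (classes,)

def majority_vote(labels_per_modality):
    """Return the label chosen by most modalities (ties go to the smallest label)."""
    values, counts = np.unique(np.asarray(labels_per_modality), return_counts=True)
    return int(values[np.argmax(counts)])

# Hypothetical unimodal outputs for one utterance (class 0 = negative, 1 = positive).
text_probs = [0.7, 0.3]
audio_probs = [0.4, 0.6]
fused = linear_weighting([text_probs, audio_probs], weights=[0.6, 0.4])
voted = majority_vote([1, 0, 1])
```

Both functions operate only on the unimodal decisions, which is why decision-level fusion cannot by itself recover correlations between the underlying features.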
In view of the above two problems, it can be seen that, in the process of multimodal information fusion, making full use of the correlation among different modalities so that the modalities learn from each other is the key to multimodal sentiment analysis. Based on the Transformer-Encoder framework [12], we propose a model to learn the information of different modalities. In this process, we utilize the Guided-Attention mechanism [13] to introduce the information of other modalities. Then, the unimodal results are mapped to higher dimensions for fusion, and the final decision is made according to the fusion results. The main contributions of this work can be summarized as follows: (i) We propose a multimodal sentiment analysis model based on Interactive Transformer and Soft Mapping. This model can achieve the optimal decision of each modality and fully consider the correlation information between different modalities. (ii) We propose the Interactive Transformer (IT) structure, which can mine the interactive information between modalities. (iii) We propose the Soft Mapping (SM) structure, which projects the representation of each modality to a new space for fusion. (iv) The experimental results on two benchmark datasets show that the proposed method can achieve better accuracy than existing methods while using only linguistic and acoustic information.
The remaining sections of this paper are arranged as follows. In Section 2, we will review the related work on CMU-MOSEI dataset and MELD dataset. In Section 3, we will describe our method and the core content of the model structure in detail. In Section 4, we will describe the datasets and the method of data preprocessing. In Section 5, we will provide experimental results and analysis. In Section 6, we will summarize the paper and discuss the potential future work.

Related Work
In the early studies, researchers focused on the problem of multimodal fusion, and they tried to solve the problem of multimodal sentiment analysis by improving the fusion method. For example, some models fuse the modality representations at different granularities.
Amir Zadeh et al. proposed the Memory Fusion Network (MFN) [14] to solve the problem of multimodal sequence modeling. The interaction problem is divided into view-specific interactions and cross-view interactions, which is implemented in three steps. Firstly, they use LSTM to learn view-specific interactions individually. Then, they learn the [...] This improvement enables the method to dynamically select the appropriate fusion graph according to its importance in the process of multimodal information fusion. The above methods provide ideas for the research in the field of multimodal sentiment analysis, but they all ignore the interaction between modality representations, such as supplementing by one modality when the information of another modality is scarce. This defect makes these methods unstable. With the development of deep learning, researchers have begun to pay attention to the interaction between modalities. Some researchers try to use the LSTM model to obtain the contextual information around each utterance from the perspective of time. Some researchers use a gating mechanism or an attention mechanism to couple features. These ideas promote the development of multimodal sentiment analysis research.
This mechanism considers that each modality is not of equal importance and needs to be dynamically adjusted according to the linguistic information, tone of the speaker, and facial expressions of the utterance. This method explains how multiple modalities contribute to sentiment, selectively learning cross fusion vectors to solve the noise problem in the process of multimodal information fusion. Sahay et al. proposed the Relational Tensor Network architecture [17]. Tensor fusion is applied to the modality features of each video clip, and an LSTM network is used to model the sequence between clips. In this method, the interaction between modalities is refined to a single time segment at the time level, and the operation can be performed frame by frame. Shenoy and Sardana proposed an end-to-end RNN architecture [18], named Multilogue-Net. They assume that the sentiment or emotion governing a particular utterance predominantly depends on 4 factors: interlocutor state, interlocutor intent, the preceding and future emotions, and the context of the conversation. Firstly, the model represents the speaker's state by learning multiple state vectors of a given utterance and then uses a pairwise attention mechanism to obtain the relationship among all modalities. This method starts with the emotional states of both sides of the conversation and fully obtains the relevance and context information between modalities to assist multimodal emotion recognition.
Transformer draws the global dependency between input and output completely based on the attention mechanism without recurrence and convolutions [12]. It utilizes Multihead Attention to replace the recurrent layers and achieves excellent results in translation tasks. At present, researchers have tried to apply it to multimodal sentiment analysis, providing new research directions for solving problems in this field. Delbrouck et al. describe a Transformer-based joint-encoding (TBJE) architecture for the task of Emotion Recognition and Sentiment Analysis [19]. The model uses the Transformer-Encoder framework for emotion recognition, and the proposed joint-coding framework can fuse any kind of modality information.
In summary, we believe that it is necessary to grasp the implicit correlation between modalities. For example, in the issue of sentiment ambiguity, because of the difference in intonation and context, the true meaning of the language can be very different. Implicit correlation is used to analyze the language environment. At the same time, acoustic information and visual information are used to analyze intonation and body language. After comprehensive analysis, we can get the real sentiment and emotion contained in the information. Finally, we choose to use the Transformer-Encoder framework to draw the internal dependence of a single modality, which is augmented with residual transformation and layer normalization to enhance the adaptability of the model. The Multihead Attention module in the coding framework is improved by the idea of the Guided-Attention mechanism, so that the information at a certain position can attend to the representation information in other modalities.

The Proposed Method
This section mainly introduces the framework structure of the model. The Interactive Transformer layer is based on the Transformer model, which uses the coding framework to learn the representation information of different modalities. It relies only on the attention mechanism combined with a feedforward neural network to achieve the effects of other network models. In fact, the model can obtain the global dependency between input and output without involving the recursive structure of sequence coding. In this process, the Interactive Multihead Guided-Attention (IMHGA) structure that we propose can introduce the information of other modalities to complete the interaction. The IMHGA structure is a combination of two improved Multihead Attention (MHA) modules; we will elaborate its principle in Section 3.2. Finally, Soft Mapping is used to map the local results of each modality to higher dimensions for fusion, and the final decision is based on the fusion results. The Soft Mapping layer consists of stacked Soft Attention (SA) modules and the output of the SA module.
After preprocessing, the data is transferred to the Interactive Transformer layer, in which the representation information of each modality is learned. Because the information of other modalities is introduced, the layer can superimpose the information from different representation subspaces while learning the representation of a single modality. Then, the result is passed into the FNN layer, which is composed of a fully connected layer and a nonlinear activation function. In each block, the two sublayers are finally subjected to a residual transformation and layer normalization (A & L), as shown in formula (1). Through our experiments, it is found that the best result can be obtained when the number of blocks is N = 4 and all blocks' parameters are independent. The output of the previous block is used as the input of the next block. Finally, the output of the coding module is input into the Soft Mapping layer, where it is mapped to a higher dimensional space for fusion in order to obtain the final result.
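The residual transformation and layer normalization step can be sketched as follows, assuming the standard Transformer-Encoder form LayerNorm(x + Sublayer(x)) from [12]. The ReLU activation, the toy sizes, the random weights, and the omission of learnable gain and bias in the normalization are all assumptions of this sketch, which shows only the FNN sublayer stacked over N = 4 independent blocks.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position to zero mean and unit variance (no learnable gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization (the A & L step)."""
    return layer_norm(x + sublayer(x))

def feed_forward(x, w1, b1, w2, b2):
    """Fully connected layer, nonlinear activation (ReLU assumed), then projection back."""
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, n_blocks = 5, 8, 16, 4   # toy sizes; the paper uses 512 and 1024
x = rng.normal(size=(seq_len, d_model))

# Each block has its own parameters; the output of one block feeds the next.
for _ in range(n_blocks):
    w1 = rng.normal(scale=0.1, size=(d_model, d_ff))
    b1 = np.zeros(d_ff)
    w2 = rng.normal(scale=0.1, size=(d_ff, d_model))
    b2 = np.zeros(d_model)
    x = add_and_norm(x, lambda h: feed_forward(h, w1, b1, w2, b2))
```

Because the residual path adds the input back before normalizing, each block's output keeps the shape of its input, which is what allows the blocks to be chained.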

Interactive Multihead Guided-Attention.
The Interactive Multihead Guided-Attention (IMHGA) structure is composed of a pair of improved Multihead Attention (MHA) modules, which we call Guided-Attention (GA). The core idea of the original attention mechanism is to determine the correspondence between two languages, and there is no dependency between positions during forward propagation. Therefore, the attention mechanism can execute operations in parallel and speed up the training of the model to reduce the time cost. We apply this idea to multimodal problems, hoping to find the mapping relationship between the information of two modalities. On this basis, with the help of the Multihead Attention mechanism, we use the information of the other modality as a guide when learning the information of one modality.
The query (Q), key (K), and value (V) in the IMHGA structure come from multiple modality data, unlike in the traditional MHA module, where they come from the same modality data. As shown in Figure 3, GA-x and GA-y learn modality-x and modality-y, respectively, where the vectors K and V come from the modality currently being learned and the vector Q comes from the other modality. Taking GA-x's learning of modality-x as an example, all vectors are subjected to a linear transformation. Then, the query ($Q_y$) from modality-y and the key ($K_x$) from modality-x are used to calculate the similarity weight by the dot product function, and the results are normalized by the Softmax function. Finally, the weight is used to perform a weighted summation of the value ($V_x$) from modality-x. The result of this step is the output of the Guided-Attention module. The specific process is shown in the following formula.
The above operations are performed a total of h times, and each time is regarded as a head module. The result of the IMHGA structure is obtained by concatenating the results of the h heads and applying a linear transformation. Note that, to keep the dot product from growing too large, the calculated similarity weight is scaled down according to the dimension of K (the Transformer [12] divides by the square root of this dimension), and the linear transformation parameters W are different in each head.
The specific process is shown in the following formulas.
Through a large number of experiments, we found that the model performs best when using the language and acoustic data. In this paper, we mainly use the fusion of two-modality data. Generally speaking, the language modality is used as the main information, and the acoustic modality is used as auxiliary information. However, some important information may then be ignored, such as mood and intonation in the acoustic modality. They can help us identify special situations such as polysemy and irony in the language modality. Therefore, we regard the status of the two as equal and make them modulate each other. This is the meaning of Interactive in the IMHGA structure. After the IMHGA structure, the two generated matrices are output in parallel through the FNN to the next Soft Mapping layer for fusion.
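The Guided-Attention computation described in this subsection can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes the square-root scaling of the standard Transformer [12], random weights, toy dimensions, and equal sequence lengths for the two modalities (so that the residual connection stays shape-compatible). All function and variable names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def guided_attention(x, y, wq, wk, wv):
    """One head of GA-x: the query comes from modality y, key and value from modality x."""
    q, k, v = y @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])    # scaled dot-product similarity (assumed scaling)
    return softmax(scores) @ v                 # weighted sum of the values of modality x

def multihead_guided_attention(x, y, heads, wo):
    """Concatenate the h head outputs and apply a final linear transformation."""
    outs = [guided_attention(x, y, wq, wk, wv) for (wq, wk, wv) in heads]
    return np.concatenate(outs, axis=-1) @ wo

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 8, 4            # the paper uses 4 heads; other sizes are toy values
d_head = d_model // n_heads
text = rng.normal(size=(seq_len, d_model))
audio = rng.normal(size=(seq_len, d_model))

def random_params():
    heads = [tuple(rng.normal(scale=0.1, size=(d_model, d_head)) for _ in range(3))
             for _ in range(n_heads)]
    wo = rng.normal(scale=0.1, size=(n_heads * d_head, d_model))
    return heads, wo

heads_x, wo_x = random_params()
heads_y, wo_y = random_params()

# The pair of GA modules: each modality is learned with the other one as the guide.
out_text = multihead_guided_attention(text, audio, heads_x, wo_x)
out_audio = multihead_guided_attention(audio, text, heads_y, wo_y)
```

Running the two modules side by side, with neither modality fixed as the primary one, mirrors the "Interactive" design choice described above.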

Soft Mapping.
So far, the model has learned the interaction information between the modalities. Before sentiment classification, the learning results of each modality need to be projected into a new representation space in SM for fusion, as shown in Figure 4. Specifically, we map the vectors $f_{text}$ and $f_{audio}$ from each FNN to a higher dimension, as shown in the following formula, where $w_i$ is a $2k \times 1$ transformation matrix and the vector $f_i$ is embedded into a higher dimension $2k \times k$. Then, we use the set $v_j$ of vectors of size $1 \times 2k$ to perform Soft Attention on the matrix in the high dimensional space. After the results are weighted and summed, they are integrated into the vector $m_i$ of size $k$. The calculation process is shown in the following formulas, where $m_i$ is the calculation result on a single node of the sequence. Therefore, we need to stack the results of all nodes in the whole sequence to get the Soft Mapping feature, as shown in the following formula. Note that, at the end of this process, a residual transformation and layer normalization are carried out to ensure that the input of the next round contains the result of the previous round, as shown in the following formula. The vector $s$ is the result of each modality. On this basis, the vectors obtained from the two modalities are summed element-wise, and the sum is classified and predicted according to the following formula.
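The final fusion step, in which the two modality vectors are summed element-wise and then classified, can be sketched as follows. The classifier form is an assumption: a single linear layer followed by Softmax is used here only for illustration, and the vector size, class count, and random weights are hypothetical. The Soft Attention mapping that produces each vector $s$ is not reproduced.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def fuse_and_classify(s_text, s_audio, w_cls, b_cls):
    """Sum the two modality vectors element-wise, then classify the fused vector."""
    fused = s_text + s_audio        # element-wise sum of the Soft Mapping outputs
    logits = fused @ w_cls + b_cls  # assumed linear classification layer
    return softmax(logits)

rng = np.random.default_rng(0)
k, n_classes = 8, 2                 # toy sizes; 2 classes as in 2-class-sentiment
s_text = rng.normal(size=k)
s_audio = rng.normal(size=k)
w_cls = rng.normal(scale=0.1, size=(k, n_classes))
b_cls = np.zeros(n_classes)

probs = fuse_and_classify(s_text, s_audio, w_cls, b_cls)
prediction = int(np.argmax(probs))
```

An element-wise sum requires both modality vectors to live in the same space, which is the role the Soft Mapping projection plays before this step.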

CMU-MOSEI Dataset and MELD
In addition, we use the MELD dataset [20] to evaluate the model. It selects the multiperson dialogue scenes in the TV series Friends as its material, which also contain linguistic, acoustic, and visual information. Its label is also multicategory, which divides emotions into anger, disgust, fear, joy, neutral, sadness, and surprise. In order to adapt to different needs, the above classification is also grouped into the coarser negative, positive, and neutral sentiments. The label distribution of the MELD dataset is shown in Figure 6.

Feature Extraction.
Next, the preprocessing process of modality data is introduced, and different methods are selected to extract features according to their respective characteristics.

Linguistic Feature.
Sentiment analysis has a long history of development in linguistics. It has its own characteristics, from semantic-based sentiment dictionary methods to machine learning-based sentiment classification methods. In this study, in order to enable the fusion of multimodal data, the above methods are no longer applicable, and the data should be processed from a more abstract point of view. For linguistic data, we need to transform it into a vector containing semantic and grammatical information to represent it. Firstly, the original linguistic data is analyzed to construct a cooccurrence matrix for words. Then, based on the distributed representation of the matrix, the cooccurrence matrix is decomposed by using the association between words to obtain the representation vector of words. Specifically, we process the text data to obtain valid words and then count the frequency of word occurrences and record them in the cooccurrence matrix X. The element of X is $x_{(i,j)}$, which indicates the number of times word-i and word-j appear in the same window. Because there are 14176 independent words, we create a cooccurrence matrix X with dimension 14176 × 14176. We use GloVe [21] to embed the matrix X, and each word is embedded into a vector of 300 dimensions.
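The construction of the cooccurrence matrix X can be illustrated with a short sketch. The window size, the toy sentences, and the function name are assumptions for illustration; the GloVe fitting step that turns X into 300-dimensional vectors is not shown.

```python
import numpy as np

def cooccurrence_matrix(sentences, window=2):
    """Count how often word-i and word-j appear within the same context window."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:                      # a word does not co-occur with itself
                    X[index[w], index[s[j]]] += 1
    return vocab, X

# Hypothetical tokenized utterances.
sentences = [["i", "love", "this", "movie"], ["i", "hate", "this", "movie"]]
vocab, X = cooccurrence_matrix(sentences, window=1)
this_movie = X[vocab.index("this"), vocab.index("movie")]
```

With a vocabulary of 14176 words the same procedure yields the 14176 × 14176 matrix described above; in practice a sparse representation would be preferable at that size.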

Acoustic Feature.
Audio is a way for human beings to express their emotions. Intuitively speaking, acoustic information is another form of linguistic information, and the emotion it carries is consistent with the expression of linguistic information. However, acoustic data is more complex than linguistic data and also contains a large amount of unique acoustic information, such as laughter, sighs, high intonation, and low intonation. Capturing these cues is the key task of acoustic data in multimodal emotion recognition. The most abundant emotions in acoustic data are contained in the human voice. When extracting speech features, it is necessary to remove irrelevant noise and focus on the human voice. The Mel scale reflects the sensitivity of the human ear to frequency, so in this study Mel-Frequency Cepstral Coefficients (MFCC) [22] are used to extract acoustic features. Specifically, a 40 ms time scale is used to group multiple sampling points of the continuous audio signal into a unit called a "frame." Then, the signal is pre-emphasized through a high-pass filter to compensate for the high frequency part of the speech signal, and the Fourier transform is used to transform it from the time domain to the frequency domain to observe the energy distribution. Next, the frequency spectrum obtained from each frame is filtered by Mel filters to remove the frequency information that cannot be distinguished by the human ear. After extracting the logarithmic energy on each Mel scale, the inverse discrete Fourier transform is performed. Finally, we obtain the acoustic features, represented as an 80-dimensional vector.
[...] entire video picture. They can play a certain role in emotion recognition. In this study, we use a pretrained CNN to extract visual features [23], which uses a two-dimensional convolution kernel to extract spatial information and a one-dimensional convolution kernel to extract temporal information.
Because the introduction of visual information brings more noise, we will not show the content of visual information in the follow-up experiments.
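Returning to the acoustic pipeline in the Acoustic Feature subsection, its first stages (pre-emphasis, 40 ms framing, and the transform to the frequency domain) can be sketched as follows. The pre-emphasis coefficient 0.97, the 16 kHz sample rate, the non-overlapping frames, and the synthetic sine input are assumptions of this sketch, and pre-emphasis is applied before framing, as is conventional. The Mel filtering and the final inverse transform are not shown.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """High-pass filter y[n] = x[n] - alpha * x[n-1] to boost high frequencies."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, sample_rate, frame_ms=40):
    """Split the signal into non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)

def log_power_spectrum(frames, eps=1e-10):
    """Move each frame to the frequency domain and take the log of its power."""
    spectrum = np.fft.rfft(frames, axis=-1)
    power = (np.abs(spectrum) ** 2) / frames.shape[-1]
    return np.log(power + eps)

sample_rate = 16000                              # assumed sample rate
t = np.arange(sample_rate) / sample_rate         # one second of audio
signal = np.sin(2 * np.pi * 440 * t)             # synthetic stand-in for speech
frames = frame_signal(pre_emphasis(signal), sample_rate)
features = log_power_spectrum(frames)
```

At 16 kHz a 40 ms frame holds 640 samples, so one second of audio yields 25 frames with 321 frequency bins each before any Mel filtering.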

Graph-MFN.
This method uses a new fusion model called the Dynamic Fusion Graph (DFG) to build the n-modal interactions and replace the original fusion component in MFN [15].

B2 + B4 W/Multimodal Fusion.
It utilizes self-attention to capture long term context and gating mechanism to selectively learn cross attended features [16].

Multilogue-Net.
The model focuses on effectively capturing the context of a conversation and treats each modality independently, taking into account the information a particular modality is capable of holding [18].

TBJE.
The approach relies on a modular coattention and a glimpse layer to jointly encode one or more modalities [19].

Text-CNN.
The method achieves excellent results by a simple CNN with little hyperparameter tuning and static vectors [24].

BcLSTM.
The model, designed based on LSTM, can capture the contextual information in the conversation [8].

DialogueRNN.
The method is based on recurrent neural networks that keep track of the individual party states throughout the conversation and use this information for emotion classification [25].

Experiments
In this section, we will report the result on CMU-MOSEI dataset and MELD dataset. It is worth mentioning that the model can achieve good results in the case of only using linguistic feature and acoustic feature.

Implementation Details.
In order to ensure both the training speed and the training results, the SWATS optimization method proposed by Keskar and Socher [26] was used in the experiment. Using the Adam optimizer in the early stage brings the advantage of fast convergence. Using the SGD optimizer in the later stage helps the model find the optimal solution in a small range. Specifically, while the Adam optimizer is in use, the corresponding learning rate of the SGD optimizer is calculated after each iteration. If this learning rate basically remains unchanged, the bottleneck has been reached and the optimizer can be switched. At this time, the orthogonal projection of the SGD step onto the descending direction of the Adam optimizer should be exactly equal to the descending direction of the Adam optimizer. It is shown in the following formula. Therefore, the initial learning rate of the SGD optimizer is shown in the following formula.
Then, we found that the number of blocks in the Interactive Transformer directly affects the final result. Considering that the change of structure will bring some influence, we find the best setting by comparing the change of 2-class sentiment accuracy under different N values.
The specific results are shown in Figure 7. The results are averaged over five runs, and we finally use N = 4 to ensure the best performance of the model; the hidden layer size of each coding block is 512. In addition, we set the Interactive Multihead Guided-Attention structure to 4 head modules and the hidden layer size of the Feedforward Neural Networks to 1024. In order to prevent overfitting, we set a dropout of 0.1 on the output of each FNN and of 0.5 on the input of the classification layer.
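The projection condition used by SWATS to obtain the SGD learning rate can be sketched as follows. The notation is borrowed from Keskar and Socher [26] as we understand it and is an assumption here: p denotes the Adam step (including its learning rate) and g the gradient, and the SGD learning rate gamma is chosen so that the projection of the SGD step -gamma * g onto p equals p, which gives gamma = (p · p) / (-(p · g)). The toy vectors and function names are hypothetical.

```python
import numpy as np

def sgd_learning_rate(adam_step, gradient):
    """Learning rate gamma such that projecting -gamma * g onto p gives back p."""
    p = np.asarray(adam_step, dtype=float)
    g = np.asarray(gradient, dtype=float)
    denom = -(p @ g)
    if np.isclose(denom, 0.0):
        raise ValueError("Adam step is orthogonal to the gradient; gamma is undefined")
    return (p @ p) / denom

def project(a, b):
    """Orthogonal projection of vector a onto vector b."""
    return (a @ b) / (b @ b) * b

# Toy values: the Adam step points roughly against the gradient.
p = np.array([-0.1, -0.2])
g = np.array([1.0, 1.0])
gamma = sgd_learning_rate(p, g)
projected = project(-gamma * g, p)   # should recover the Adam step p
```

In the switching rule described above, this per-iteration estimate is monitored, and the switch to SGD happens once it stops changing appreciably.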

The Result of CMU-MOSEI Dataset.
We compare the evaluation results of the model on the CMU-MOSEI dataset with Graph-MFN [14], B2 + B4 w/multimodal fusion [16], Multilogue-Net [18], and TBJE [19]. The results of 2-class-sentiment are shown in Table 1. It should be noted that our model does not use visual information, because integrating it introduces noise and makes the results unsatisfactory; addressing this is the direction of our next efforts. In general, using only the linguistic and acoustic features already allows our model to achieve good results, which improve on those of other methods. The results of 7-class-sentiment and 6-class-emotion are shown in Table 2. We show the average value of the classification results and only compare with models using the same calculation method. It can be seen that the effect of the model on the emotion multiclassification task is only slightly improved compared with other methods, which will be the direction of our future work. In addition, the performance of 7-class-sentiment classification is not as good as that of 6-class-emotion classification, which is in line with expectation. This is also a common weakness of all methods, because 7-class-sentiment deals with more fine-grained classification, with only very subtle differences between the categories.
We randomly selected 4662 samples from the test set to verify 2-class-sentiment results and calculated True Positive Rate (TPR) and False Positive Rate (FPR) by using the prediction results of the model and the real labels of the samples, so as to approximate the continuous Receiver Operating Characteristic (ROC) curve. As shown in Figure 8, we can see that our model has excellent performance.
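The way an ROC curve is approximated from predictions and true labels can be sketched as below: the decision threshold is swept over the predicted scores, and TPR and FPR are recorded at each threshold. The toy scores and labels are hypothetical, and the area computation is an added illustration that the text does not report.

```python
import numpy as np

def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold over the scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.sort(np.unique(scores))[::-1]   # from strictest to loosest
    pos, neg = labels.sum(), (1 - labels).sum()
    fpr, tpr = [0.0], [0.0]
    for th in thresholds:
        pred = scores >= th
        tpr.append((pred & (labels == 1)).sum() / pos)   # true positive rate
        fpr.append((pred & (labels == 0)).sum() / neg)   # false positive rate
    return np.array(fpr), np.array(tpr)

def area_under_curve(fpr, tpr):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

# Hypothetical positive-class scores and true labels for six samples.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_points(scores, labels)
auc = area_under_curve(fpr, tpr)
```

With a finite sample the curve is a step-like polyline, so a larger sample such as the 4662 test items mentioned above gives a closer approximation to the continuous curve.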
Model                        7-class-sentiment               6-class-emotion
                             Accuracy (%)    F1-score (%)    Accuracy (%)    F1-score (%)
Graph-MFN (T + A + V) [14]   45.00           /               /               /
TBJE (T + A) [19]            45.36           /               81.48           /
TBJE (T + A + V) [19]        44[...]

[...] to the emotion of a single person, and there are no other interference factors. The MELD dataset is mainly composed of multiperson dialogue scenes. Taking acoustic modality data as an example, there may be multiple people talking at the same time, which will cause interference to sentiment analysis. And the emotional states of multiple people will affect each other. When analyzing a person's emotional state, it is necessary to consider the influence of other people. Our model has not been improved for the multiperson dialogue scene, which is the direction of our next work.

Ablation Study.
In this study, we propose two structures: Interactive Transformer (IT) and Soft Mapping (SM). The Interactive Transformer layer lets the modalities interact with each other while each modality is being learned, which improves the learning effect, while the Soft Mapping layer fuses the learning results of each modality to better classify emotions. In order to verify the effectiveness of these two structures, we carried out an ablation experiment. The experiment covers four settings: (i) neither the Interactive Transformer layer nor the Soft Mapping layer is used; (ii) only the Interactive Transformer layer is removed; (iii) only the Soft Mapping layer is removed; (iv) the Interactive Transformer layer and the Soft Mapping layer are used at the same time.
When we do not use the Interactive Transformer, we choose the traditional Transformer to replace it, which means that we lose the ability of interaction between modalities. When SM is not used, we directly perform a weighted average calculation on the learning results of each modality and then perform sentiment classification. The results are shown in Figure 9. It can be seen that the improvement is obvious when using IT or SM alone, but the improvement is very limited when using both at the same time. The reason may be that the roles of IT and SM are duplicated, as they are both designed for modal interaction. So, when they are used alone, the information between modalities can be complementary, and the results are similar. When using them at the same time, the supplementary information that they mine is repeated, so the improvement is not obvious. In the follow-up, we will try to mine related information in different directions in a targeted manner to make the division of labor between IT and SM clearer.

Conclusions and Future Work
In this paper, we propose an Interactive Transformer and Soft Mapping based method for multimodal sentiment analysis. The proposed model can fully consider the relationship between multiple modality pieces of information, which is helpful for sentiment analysis after data fusion. Although our model has achieved competitive results on the CMU-MOSEI dataset, there are still some shortcomings. Our model does not make full use of the visual modality information, and it only uses data from the linguistic and acoustic modalities. Trying to add data from the visual modality makes the results unsatisfactory. In the next step, we will continue to look for a method of integrating visual data, because the expressions and body movements of characters in the visual data also contain rich and delicate emotions, which can be of great help to emotion recognition. In addition, the evaluation results on the MELD dataset also reflect some problems. Our model ignores that people's emotions affect each other in multiperson dialogue scenarios. For example, when a person expresses negative emotions externally, the emotional state of other people will also shift negatively. In the future, we will focus on the emotional analysis of multiperson dialogue from four aspects: role of context, interspeaker influence, emotion shifts, and contextual distance.

Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Ethical Approval
This paper does not contain any studies with animals performed by any of the authors.

Conflicts of Interest
The author(s) declare that there are no conflicts of interest with respect to the research, authorship, and/or publication of this paper.