Multimodal Sentiment Analysis Based on Cross-Modal Attention and Gated Cyclic Hierarchical Fusion Networks

Multimodal sentiment analysis has been an active subfield of natural language processing. The task is challenging because it must combine different sources to predict a speaker's sentiment. Previous research has focused on extracting contextual information within a single modality and on trying different modality fusion stages to improve prediction accuracy. However, such approaches do not consider the variability between modalities, which can degrade model performance. Furthermore, existing fusion methods tend to extract the representational information of individual modalities before fusion, ignoring the critical role of intermodal interaction information in model prediction. This paper proposes a multimodal sentiment analysis method based on cross-modal attention and a gated cyclic hierarchical fusion network (MGHF). MGHF builds on the idea of distribution matching, which enables the modalities to obtain, in the temporal interaction phase, representational information that acts synergistically on the overall sentiment orientation. We then design a gated cyclic hierarchical fusion network that takes the text-based acoustic representation, the text-based visual representation, and the text representation as inputs and eliminates redundant information through a gating mechanism to achieve effective multimodal representation interaction fusion. Extensive experiments on two popular, publicly available multimodal datasets show that MGHF has significant advantages over previous complex and strong baselines.


Introduction
Every day, a large and meaningful amount of information is generated around us. Most of this information is generated on the web, and social media is where much of it is concentrated. It covers many topics, opinions, sentiments, and emotions closely related to our lives. Multimodal sentiment analysis (MSA) has been an active subfield in natural language processing [1,2]. This is mainly due to its wide range of applications, such as government elections [3], intelligent healthcare [4], and chatbot recommendation systems for human-computer interaction [5]. Compared to traditional sentiment analysis, MSA uses multiple sources (raw text, acoustic, and visual information) to predict the sentiment expressed by a specific object over a specific period. One challenge of multimodal sentiment analysis is modelling the interactions between different modalities, because they contain supplementary and complementary information [6]. Another factor limiting the performance of multimodal sentiment analysis is data fusion. This is because the visual and auditory modalities suffer from recurring problems such as missing values and misalignment [7].
In recent years, researchers have designed sophisticated fusion models. Zadeh et al. [8] designed the tensor fusion network, which uses a Cartesian product to fuse the feature vectors of the three modalities; this provided a new idea for multimodal data processing. Tsai et al. [9] designed a multimodal transformer that processes all modalities together to obtain the predicted sentiment scores. Although these methods have achieved good results, they ignore the differences between modalities, which may cause the loss of crucial predictive information during the modal representation acquisition stage and thus hurt the final prediction. Hazarika et al. [10] designed modality-specific and modality-invariant feature spaces, combining the two types of representations with similarity, reconstruction, and dissimilarity losses to train the model. Yu et al. [11] used a multitask format and introduced an automatic modal label generation module in the training phase to assist the main task channel, saving manual labelling time and thus improving efficiency. Although these studies also achieved encouraging results, they lack intermodal information interaction during the modal fusion phase. As a result, redundant information from the earlier stages may be retained in the final prediction stage, degrading model performance. As shown in Figure 1, the same text yields two opposite predictions after interacting with different modalities. For example, an ordinary utterance with ordinary acoustic features is predicted as negative sentiment, while the same utterance with positive visual features is predicted as positive sentiment. This indicates that different modal combinations have a fundamental impact on sentiment prediction. Note that, in Figure 1, "?" indicates that the sentiment cannot be accurately identified, "−" represents negative sentiment, and "+" represents positive sentiment. The number of these symbols signifies the intensity of the sentiment.
To address these issues, inspired by cross-modal matching and interaction modelling, we propose a novel multimodal sentiment analysis framework, MGHF. It includes mid-term interactions performed in the modal representation phase and late interactions performed in the modal fusion phase. This approach allows the model to fully perceive each modality's potential representational sentiment information, which improves the fusion and prediction results. Although previous studies have shown that the text modality is the most critical [9,12], we believe that the information implied by every modality should be considered in the MSA task. Specifically, MGHF handles modality variability flexibly by using an appropriate neural network for each modality. In the medium-term interaction learning phase, MGHF performs cross-modal attention between the acoustic and text modalities and between the visual and text modalities to obtain a text-based acoustic representation and a text-based visual representation. Several past studies [13] have pointed out that task-related information is not evenly distributed across modalities, with the text modality contributing much more than the others. Other studies [8,9] fuse the text, video, and audio modalities as a ternary symmetric structure, which does not account for the variability of the modalities and thus fails to fuse them correctly. Drawing on this experience, and to give the text modality a higher weight than the other modalities in the later fusion stage, we combined the representations pairwise: the text-based acoustic representation with the text representation, the text-based visual representation with the text representation, and the text-based acoustic representation with the text-based visual representation.
We also design a gated recurrent hierarchical fusion network that dynamically exchanges and learns information representations between modal combinations to complement the information between combinations. Our extensive experiments on the publicly available and popular datasets CMU-MOSI [14] and CMU-MOSEI [15] show that MGHF is strongly competitive with previous complex interaction and fusion baselines. The contributions of this paper are summarized as follows: (i) A gated cyclic hierarchical fusion network for multimodal sentiment analysis is proposed. It dynamically exchanges information representations among three different modal pairs, enables sufficient interaction between each modal pair, eliminates redundant information between modal pairs, and maximizes the retention of representations valid for prediction. (ii) Inspired by distribution matching, we consider the interactions within different modalities. In the modal representation acquisition stage, we apply cross-modal attention between the nonverbal sequences and the text sequence, which captures potential representations within different modalities while bringing the modal representations closer to the real sentiment expressions. (iii) Experiments conducted on two publicly available multimodal datasets show that our model has significant advantages over previous advanced and complex baselines.

Related Work
This section introduces multimodal sentiment analysis, as well as related work on multimodal representation learning and data fusion.

Multimodal Sentiment Analysis.
Unlike traditional sentiment analysis, multimodal sentiment analysis often uses multiple sources (text, audio, video, and other information) to fully and accurately predict the speaker's sentiment orientation. Researchers deal with MSA tasks in various ways: one representative approach is the extraction of intramodal temporal information, and another is the extraction of intermodal interaction information. The former mainly uses neural networks such as the Long Short-Term Memory (LSTM) network [16] to extract modal contextual information [10,17]. The latter can be further divided into early, late, and hybrid fusion, depending on the fusion stage. Early fusion is performed in the pre-extraction phase of the data. Rozgic et al. [18] used early fusion to connect multimodal representations as input to an inference model, which provided a novel idea for modal fusion. Zadeh et al. [19] designed a memory fusion network (MFN) using multiview sequential learning, which explicitly models two kinds of interactions in the neural architecture. The late fusion approach performs the necessary within-modality processing first and fuses the intermodal data in the final stage. Liu et al. [20] proposed a low-rank multimodal fusion approach that reduces computational complexity and improves efficiency by using low-rank tensor fusion. Other researchers have used hybrid fusion to improve the performance of MSA tasks. Dai et al. [21] used a simple but very effective hybrid modal fusion approach with weakly supervised multitask learning to improve generalization performance.
We differ fundamentally from previous work in two respects. First, there is a modal divide between modalities, and applying the same neural network to all of them does not yield useful information; instead of considering only single contextual information, we use the most appropriate strategy for each modality's sequence characteristics. Second, after obtaining the initial representations, unlike previous work, our interaction fusion does not occur only in the final stage: useful potential information can be induced from the companion representations through the intermediate interaction stage. Similarly, the late interaction stage better retains information useful for prediction and eliminates redundant information. Notably, instead of the traditional approach of treating text, audio, and video equally, we flexibly utilize the information useful to the task based on each modality's contribution.

Representation Learning and Data Fusion.
Representation learning methods can also be applied to multimodal sentiment analysis and have achieved significant results. Wang et al. [22] proposed a recursive attentional change embedding network to generate multimodal shifts. Hazarika et al. [10] proposed a way to learn multimodal invariant and specific representations while combining four different losses to train the model. Yu et al. [11] proposed self-supervised multitask learning to learn modality-specific representations and introduced a unimodal annotation generation module to assist the main task channel. In the context of sentiment analysis, multimodal fusion is essential because sentiment cues are usually distributed over different modalities [23]. Xiangbo et al. [24] proposed an extended-squeezed-excitation fusion network (ESE-FN) that fuses multimodal features in the modal and channel directions; the network learns extended-squeezed-excitation (ESE) attentions in the modal and channel directions to effectively solve the elderly activity recognition problem. Shu et al. [25] proposed a weakly shared deep transfer network (DTN) for converting cross-domain information from text to images, which provides ideas for interconversion across modalities. Building on this, Tang et al. [26] proposed a generalized deep transfer network (DTN) for transmitting information across heterogeneous textual and visual domains by establishing parameter-sharing and representation-sharing layers.
In view of this, our model is based on the late fusion of representation learning. Unlike previous studies, we learn representations across intramodal interactions while employing different combinations of modal interactions to obtain intermodal representations.

Materials and Methods
In this section, we will detail the main components of our model and their specific roles.

Task Setup.
Multimodal data sequences in sentiment analysis consist of three main modalities: the text modality (t), the acoustic modality (a), and the visual modality (v). The goal of multimodal sentiment analysis (MSA) is to predict the speaker's emotional polarity from a segment of discourse, which is also the input to the model in this paper. Given the input discourse U_s, s ∈ {t, a, v}, this paper uses U_v to denote visual modal information, U_a to denote acoustic modal information, and U_t to denote textual modal information.

Overall Architecture.
Our multimodal sentiment analysis architecture consists of three primary and flexible modules, as shown in Figure 2: the feature extraction module for each modality, the (acoustic-text/visual-text) cross-attention module, and the gated recurrent hierarchical fusion network module. For the text channel, we use pretrained BERT for high-dimensional semantic extraction. For the acoustic and visual channels, we first feed the initial sequence into a 1D temporal convolution to obtain enough perceptual and temporal information. The obtained acoustic/visual representations are then learned cross-modally with the textual representations, which induces potential representational information for both the acoustic and visual modalities that is synergistic with the overall sentiment orientation. Notably, this cross-modal matching has been prominent in recent cross-modal learning approaches [27,28]. Afterward, we feed the outputs of the two cross-modal attention modules (the text-based acoustic representation and the text-based visual representation) and the extracted textual modal representation into a gated recurrent hierarchical fusion network, which eliminates redundant modal information to obtain the final information for prediction.
Of course, some of the modules in our model are flexible and can be reconfigured with any suitable baseline to accomplish different types of tasks.

Modality Representation.
The acquisition of representations in our model is divided into three channels: the text channel, the video channel, and the audio channel. In the following, we describe the essential details of how the model acquires these representations.

Text Channel.
For the text channel, we fine-tune the pretrained model BERT [29], which consists of 12 stacked transformer layers, as the extractor of text features. The input text is preprocessed and fed to BERT for embedding after adding the two special tags [CLS] and [SEP]. Consistent with recent work, the first word vector of the last layer, a 768-dimensional hidden state, is chosen in this paper as the overall text representation [30].
Here, t represents the initial sequence of text and θ_t^bert represents the hyperparameters of the pretrained BERT model.
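As a concrete illustration, the first-token selection described above can be sketched in PyTorch. The tensor shapes and the dummy hidden states below are assumptions for illustration; loading the actual pretrained BERT model is omitted.

```python
import torch

def cls_representation(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """Select the first ([CLS]) token vector of the last BERT layer.

    last_hidden_state: (batch, seq_len, 768) tensor, as would be returned
    by a pretrained BERT encoder (the encoder itself is not loaded here).
    """
    return last_hidden_state[:, 0, :]  # (batch, 768)

# Dummy hidden states standing in for real BERT output.
hidden = torch.randn(8, 50, 768)
f_t = cls_representation(hidden)
print(f_t.shape)  # torch.Size([8, 768])
```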

Audio and Video Channels.
For the audio and video channels, we designed two independent modal characterization modules for the nonverbal sequences; they operate before fusion. Following previous work [11], we processed the raw data using pretrained toolkits to obtain the initial vector features.
Temporal Convolutions. First, to make our modalities sufficiently perceptible, we pass the input sequence through a one-dimensional temporal convolution layer.
where Conv1D(·) is the one-dimensional temporal convolution function, k_m is the size of the convolution kernel used by modality m, U_m is the input sequence of modality m, d is the common dimension, and T_m denotes the discourse length of modality m; here, m ∈ {a, v}. Positional Embedding. To equip the sequences with temporal information, following Vaswani et al. [31], the position embedding (PE) is added to U_m^* as follows: where PE(T_m, d) ∈ R^{T_m × d} computes an embedding for each position index, PE(·) represents the position embedding function, and m ∈ {a, v}. Cross-Attention Transformers. We then perform cross-modal attention on the resulting sequences, which induces potential representational information for both the acoustic and visual modalities that is synergistic with the overall sentiment orientation. It is worth noting that our cross-modal attention occurs only between the text and acoustic modalities and between the text and visual modalities. This gives the text modality, which contributes most to the task, a higher weight than the other modalities and ensures the relative independence of the visual and acoustic channels. We justify this approach in Section 5.2.1.
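The temporal convolution and the sinusoidal position embedding can be sketched as follows. The kernel size, common dimension d, and sequence lengths are hypothetical choices for illustration, not the paper's actual settings.

```python
import math
import torch
import torch.nn as nn

def positional_embedding(T: int, d: int) -> torch.Tensor:
    """Sinusoidal position embedding PE(T, d) of shape (T, d), following Vaswani et al."""
    pe = torch.zeros(T, d)
    pos = torch.arange(T, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float) * (-math.log(10000.0) / d))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# Project an acoustic sequence (batch, T_a, feat_dim) to the common dimension d
# with a 1D temporal convolution; kernel size k_a is a hypothetical choice.
batch, T_a, feat_dim, d, k_a = 8, 40, 74, 30, 3
conv_a = nn.Conv1d(feat_dim, d, kernel_size=k_a, padding=(k_a - 1) // 2)

U_a = torch.randn(batch, T_a, feat_dim)
U_star = conv_a(U_a.transpose(1, 2)).transpose(1, 2)  # (batch, T_a, d)
U_star = U_star + positional_embedding(T_a, d)        # add PE(T_a, d)
print(U_star.shape)  # torch.Size([8, 40, 30])
```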
where Q_t represents the query vector of the text modality; K_a, V_a, K_v, and V_v denote the key and value vectors of the acoustic and visual modalities; softmax(·) represents the softmax function; d_h represents the dimensionality of the modality; and T denotes the transpose.
The transformer computes multiple parallel attentions, and the output of each attention is called a head. The i-th head is computed as above, where W_{Q_t}^i ∈ R^{d_t × d_q} is the weight matrix of Q_t when computing the i-th head for the text modality, W_m^O is the weight matrix applied after concatenating the heads of modality m, and n denotes the number of self-attention heads we use; here, n = 10, Concat(·) is the concatenation operation, and m ∈ {a, v}. Thus, the text-based acoustic representation f_a^t and the text-based visual representation f_v^t are obtained.
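A minimal sketch of the text-to-audio cross-modal attention with n = 10 heads, using PyTorch's built-in multi-head attention in place of the explicit per-head weight matrices above; the dimension d_h = 40 and the sequence lengths are assumed values.

```python
import torch
import torch.nn as nn

d_h, n_heads = 40, 10  # the paper uses n = 10 attention heads
batch, T_t, T_a = 8, 50, 40

# Cross-modal attention: queries come from text, keys/values from audio,
# so the output is a text-based acoustic representation.
cross_attn_ta = nn.MultiheadAttention(d_h, n_heads, batch_first=True)

text = torch.randn(batch, T_t, d_h)   # text sequence after BERT + projection
audio = torch.randn(batch, T_a, d_h)  # audio sequence after Conv1D + PE

f_at, attn_weights = cross_attn_ta(query=text, key=audio, value=audio)
print(f_at.shape)          # torch.Size([8, 50, 40]): one vector per text step
print(attn_weights.shape)  # torch.Size([8, 50, 40]): text steps attending over audio
```

The symmetric text-to-video attention would use the same module with video keys and values.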

Gated Cyclic Hierarchical Fusion Networks.
In previous studies [10,11], after obtaining valid representations, most modal representations are simply concatenated directly for the final prediction. This can inadvertently add redundant information. To remove such redundant information effectively, we designed a gated recurrent fusion network (see Figure 3). This module is flexible and can be paired with other baselines to enhance their effect. We also verified the effectiveness of the hierarchical fusion network.
We use the text-based acoustic representation f_a^t, the text-based visual representation f_v^t, and the text representation f_t as inputs to the gated recurrent hierarchical network. Previous experience [9,12] has shown that the text modality contributes much more to the task than the other modalities. Given this, we combined the text-based visual representation, the text-based acoustic representation, and the text representation pairwise to ensure that the text modality carries a high weight, which yields three combined representations.
where Concat(·) denotes the combination operation, f_a^t ⊕ t denotes the combination of the text-based acoustic representation with the text representation, f_v^t ⊕ t denotes the combination of the text-based visual representation with the text representation, and f_a^t ⊕ f_v^t denotes the combination of the text-based acoustic representation with the text-based visual representation.
After obtaining the three combinations, we feed them into a bi-directional gated recurrent network (Bi-GRU). This allows the information between different modalities to be fully perceived and effectively removes redundant and irrelevant information from the representations through the gating mechanism. We also tried a bi-directional long short-term memory (Bi-LSTM) network; by comparison, we found that the Bi-GRU has fewer parameters and faster training speed with comparable results.
where Bi-GRU(·) represents the bi-directional gated recurrent unit network and θ_gru represents its hyperparameters. After that, we combine the outputs of the gated cyclic hierarchical fusion network and feed them into a fully connected layer for the final prediction.
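The pairwise combination and Bi-GRU filtering described above might look as follows. The dimensions, the choice of a shared GRU for all three combinations, and the use of the last time step as a summary vector are assumptions of this sketch.

```python
import torch
import torch.nn as nn

d = 40
batch, T = 8, 50

# Text-based acoustic, text-based visual, and text representations (dummies).
f_at = torch.randn(batch, T, d)
f_vt = torch.randn(batch, T, d)
f_t = torch.randn(batch, T, d)

# The three pairwise combinations, two of which keep the text modality at high weight.
pairs = [torch.cat([f_at, f_t], dim=-1),
         torch.cat([f_vt, f_t], dim=-1),
         torch.cat([f_at, f_vt], dim=-1)]

# A shared Bi-GRU filters each combination; its gates are what discard
# redundant information (hidden size 64 is a hypothetical choice).
bi_gru = nn.GRU(input_size=2 * d, hidden_size=64,
                bidirectional=True, batch_first=True)

outputs = []
for p in pairs:
    out, _ = bi_gru(p)          # out: (batch, T, 2 * 64)
    outputs.append(out[:, -1])  # last time step as the summary vector
fused = torch.cat(outputs, dim=-1)
print(fused.shape)  # torch.Size([8, 384])
```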
where W_{l1}^s ∈ R^{(d_t + d_a + d_v) × d_s}, ReLU(·) is the ReLU activation function, and ⊗ represents the element-wise product.
Finally, f_s^* is used as the final representation for the prediction task.
where W_{l2}^s ∈ R^{d_s × 1}.
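The prediction head can be sketched as below. It keeps only the two linear projections and the ReLU; the element-wise gating product, whose operands are not fully specified in the equations above, is omitted, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

d_t = d_a = d_v = 128
d_s = 64
batch = 8

# Fused representation of dimension d_t + d_a + d_v from the fusion network.
f_s = torch.randn(batch, d_t + d_a + d_v)

W_l1 = nn.Linear(d_t + d_a + d_v, d_s)  # corresponds to W_l1^s
W_l2 = nn.Linear(d_s, 1)                # corresponds to W_l2^s

f_star = torch.relu(W_l1(f_s))  # ReLU projection down to d_s
y_hat = W_l2(f_star)            # (batch, 1) sentiment regression score
print(y_hat.shape)  # torch.Size([8, 1])
```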

Experiment
In this section, we will detail the specifics of our experiments.

Datasets.
CMU-MOSI [14]. The Multimodal Opinion-level Sentiment Intensity corpus is a collection of 2,199 opinion video clips and a popular benchmark for multimodal sentiment analysis. Each opinion video is annotated with a sentiment score in the range [−3, 3]. The dataset is strictly labelled with tags for subjectivity and emotional intensity, per-frame and per-viewpoint annotated visual features, and per-millisecond annotated audio features. CMU-MOSEI [15]. The Multimodal Opinion Sentiment and Emotion Intensity dataset is the largest multimodal sentiment analysis and recognition dataset and an improved version of CMU-MOSI. MOSEI contains more than 23,500 sentence expression videos from more than 1,000 online YouTube speakers. The dataset is gender-balanced; all sentences were randomly selected from videos of various topics and monologues, and the videos were transcribed and correctly punctuated. We give the detailed dataset settings used in the experiments in Table 1.

Modality Processing.
To ensure fair comparison with other baselines, we follow previous work [11] and process the three modalities as described below. Text Modality. Most previous studies have used GloVe [32] as the source of word embeddings and achieved good results. Considering the strong performance of pretrained models, we prefer the pretrained language model BERT [29]; for a fair and objective comparison, we adopt it as the processing tool for our text modality. Audio Modality. For audio data, the acoustic analysis framework COVAREP [33] is used to extract features including 12 Mel-frequency cepstral coefficients, pitch, and voiced/unvoiced segmenting features. All features are related to mood and intonation. It is worth noting that the acoustic features are processed to align with the text features.

Visual Modality. For video data, facial features are extracted using the toolkit of [34]; the process is repeated for each sampled frame within the vocalized video sequence.
Eventually, we align the initial modalities with the text. This allows our experiments to proceed appropriately and ensures fair experimental comparison.

Evaluation Metrics.
Again, for fairness, we split the MSA task into a regression task and a classification task. This paper uses five evaluation metrics: binary accuracy (Acc-2) and F1-score; Mean Absolute Error (MAE), which directly measures the error between the predictions and the true labels; and seven-class accuracy (Acc-7) and Pearson correlation (Corr), which measure the deviation from the human-annotated values. It is worth noting that binary accuracy and F1 scores are computed in two settings: negative versus non-negative (including neutral) sentiment, and negative versus positive sentiment. Except for MAE, higher scores imply better results.
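Under the conventions above, the metrics can be computed as in this sketch: MAE, Pearson correlation, seven-class accuracy by rounding and clipping scores to [−3, 3], and binary accuracy in the negative versus non-negative setting. The exact discretization used in the paper may differ.

```python
import numpy as np

def mosi_metrics(preds: np.ndarray, labels: np.ndarray) -> dict:
    """MAE, Pearson correlation, 7-class and binary (neg vs. non-neg)
    accuracy for sentiment scores in [-3, 3]."""
    mae = np.mean(np.abs(preds - labels))
    corr = np.corrcoef(preds, labels)[0, 1]
    # 7-class accuracy: round scores to the nearest integer class in [-3, 3].
    acc7 = np.mean(np.clip(np.round(preds), -3, 3) == np.clip(np.round(labels), -3, 3))
    # Binary accuracy: negative vs. non-negative (neutral counts as non-negative).
    acc2 = np.mean((preds >= 0) == (labels >= 0))
    return {"MAE": mae, "Corr": corr, "Acc-7": acc7, "Acc-2": acc2}

preds = np.array([2.1, -1.2, 0.3, -2.8])
labels = np.array([2.0, -1.0, 0.0, -3.0])
m = mosi_metrics(preds, labels)
print(round(m["MAE"], 3))  # 0.2
```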

Baseline.
We compared the performance of MGHF with several multimodal fusion frameworks, including state-ofthe-art models, as follows.

Previous Models
(i) TFN. The Tensor Fusion Network [8] uses a Cartesian product to compute a tensor over the modalities, capturing unimodal, bimodal, and trimodal interaction information. (ii) LMF. Low-rank Multimodal Fusion [20] improves on the tensor fusion network (TFN), reducing computational complexity and improving efficiency through low-rank tensor fusion. (iii) MFM. The Multimodal Factorization Model [35] demonstrates flexible generation capability by adjusting independent factors and reconstructs missing modalities. (iv) MULT. The Multimodal Transformer [9] extends the transformer architecture with directed pairwise cross-attention, which converts one modality to another. (v) ICCN. The Interaction Canonical Correlation Network [13] learns correlations between text, audio, and video through Deep Canonical Correlation Analysis (DCCA). (vi) MISA. Learning Modality-Invariant and Modality-Specific Representations (MISA) [10] combines distribution similarity, orthogonal loss, reconstruction loss, and task prediction loss to learn representations of the individual and fused modalities. (vii) MAG-BERT [36]. A multimodal adaptation gate is designed and inserted into the general BERT model to optimize the fusion process.

State-of-the-Art.
For sentiment analysis tasks, the results of Self-MM [11], a self-supervised multitask learning framework, on both the MOSI and MOSEI datasets represent the state-of-the-art (SOTA). Self-MM assigns a unimodal training task with automatically generated labels to each modality, allowing multimodal sentiment analysis to be performed in a multitask context.

Results and Discussion
In this section, the experimental results of the model are analysed and discussed in detail.

Quantitative Results.
We compared MGHF with currently popular benchmarks, including the state-of-the-art (SOTA) model (see Tables 2 and 3). For a fair comparison, we divided the models into two categories depending on the data setup: aligned and unaligned. In our experiments, compared with the aligned advanced models, our model achieves similar or even better results, and compared with the unaligned models, it achieves significant gains on all regression indicators as well as on some of the classification indicators. In addition, we reproduce two strong baselines, MISA and Self-MM, under the same conditions and find that MGHF outperforms them on most indicators. On the MOSI dataset, MGHF achieves competitive scores on both classification tasks; on the regression task, MGHF also improves on the SOTA model to various degrees. Our model also outperforms some complex fusion mechanisms, such as TFN and LMF. The above results show that our model can be applied to different data scenarios and achieve significant improvements. We visualized some of the metrics to illustrate how the model performs (see Figure 4).

Ablation Study.
We set up ablation experiments to verify the performance of our model, which is divided into the following main parts.

Representational
Interaction. First, for the cross-modal attention interactions, we conducted the following experiments. The first group examined the interaction between two modalities; we did not consider acoustic-based text features or visual-based text features because these would make the text modality so dominant that modal independence would be reduced or even disappear. Cross-modal attention between the nonverbal sequences is hardly satisfactory, probably due to the characteristics of nonverbal sequence data. Acoustic-text and visual-text cross-modal attention seems to play an important role, which is consistent with previous studies [9,12]. The second set of experiments was conducted after combining the cross-modal interaction representations obtained in the first set, which helps clarify which combination of cross-modal interactions is more beneficial for the MSA task. As Table 4 shows, the combination of text-based acoustic and text-based visual representations performs best.
Table 2: Results on MOSI. Note: (B) means the language features are based on BERT; a model with * represents the best result reproduced under the same conditions. ○ is from [10], and ◇ is from [11]. In the Acc-2 and F1-score columns, the left side of "/" is calculated for negative versus non-negative sentiment, and the right side for negative versus positive sentiment.
Table 3: Results on MOSEI. Note: same conventions as Table 2.
We believe this is partly because the text modality enhances the complementary acoustic and visual information, providing additional cues for semantic and affective disambiguation [37], and partly because this combination preserves the independence of the acoustic and visual modalities. We visualized these experimental scores for ten samples randomly selected from the MOSI test set (see Figure 5); similar results were observed on MOSEI.

Gated Recurrent Hierarchical Fusion Network
Effectiveness. To verify the reliability of our proposed gated cyclic hierarchical fusion network, we performed the multimodal sentiment analysis task under the same conditions without this fusion strategy. For visual comparison, two representative metrics each from the classification and regression tasks are selected, and the evaluation results are visualized. It is worth noting that among these metrics, higher scores imply better performance, except for MAE. The results are shown in Figure 6. Specifically, the gating mechanism effectively removes the redundant information carried over from the previous stage. This not only implies that the representations obtained by the model in the prediction stage include the potential representations of each modality but also clarifies the need for representation interaction learning in the later stage.
In addition, we conduct ablation experiments on the fusion strategy (see Table 5). In the first setting, we do not combine the resulting text-based visual modality, text-based acoustic modality, and text modality; this setting is marked as "a" in Figure 7 and MGHF_w/o(pc) in Table 5. In the second setting, we replace the Bi-GRU in the fusion network with a Bi-LSTM network; this setting is marked as "b" in Figure 7 and MGHF_LSTM in Table 5. As mentioned before (Section 3.4), the Bi-GRU achieves comparable or even better performance on some metrics.
As shown in Section 5.2.1, the combined contribution of the text-based acoustic representation f_a^t and the text-based visual representation f_v^t is the highest. We combined these two representations with the initial representations f_a, f_v, and f_t to evaluate which set performs best in the hierarchical fusion network (see Table 6). The combination of f_a^t and f_v^t with the textual representation f_t, which is the input to our gated recurrent hierarchical fusion network, performs best.

Conclusions
In this paper, we propose a complete solution for multimodal sentiment analysis, MGHF, which differs from previous work in two main parts: modal representation and modal fusion. By using distribution matching in the representation learning phase, neighbouring modalities are made to contain potential representations of their companion modalities, achieving modal information interaction in the temporal dimension. Meanwhile, we design a gated recurrent hierarchical fusion network for the fusion phase, which performs intermodal representation interactions in the late fusion stage. It eliminates redundant modal representations and retains those valid for prediction in the final stage, making the prediction results closer to the actual scores. Extensive experiments on two publicly available datasets show that our model is intensely competitive with previous complex baselines.

Data Availability
The data used to support the findings of this study are available at the following address: https://immortal.multicomp.cs.cmu.edu/raw_datasets/processed_data/.

Conflicts of Interest
The author(s) declare that there are no conflicts of interest regarding the publication of this paper.