Multimodal Fusion Method Based on Self-Attention Mechanism

Jiangsu Province Key Lab on Image Processing and Image Communication, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
R&D Center, China Academy of Launch Vehicle Technology, Beijing 100176, China
Bell Honors School, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
Department of Computer Science, Norwegian University of Science and Technology, Gjovik 2815, Norway
National Engineering Research Center of Communication and Network Technology, Nanjing University of Posts and Telecommunications, Nanjing 210003, China


Introduction
Multimodal integration has become a popular research direction in the field of artificial intelligence by virtue of its outstanding performance in various applications. Multimodal research has performed well in speech recognition [1], emotion recognition [2,3], emotion analysis [4], speaker feature analysis [5], and media description [6].
Multimodal fusion is a core technology and an important research direction in the multimodal field. Its aim is to exploit the complementary information present in multimodal data by combining multiple modalities. One of its main challenges is to extend fusion to many modalities while keeping the model and computational complexity reasonable.
Early methods fused different data by feature concatenation. These methods [7,8] take the concatenated input features as the model input, and some [9] even remove the temporal correlations within the modalities. Because fusion happens at the very beginning, the interactions within each modality are suppressed from the start, so each modality loses its overall correlation and even its temporal dependencies.
Other fusion methods [10,11] build a separate model for each modality and combine the outputs at a later stage, for example by weighted averaging or majority voting. These methods share an unavoidable shortcoming: because each modality is modeled separately, the interactions between modalities are lost.
More recent methods [12,13] use tensor representations to model the interactions between modalities and thereby address these shortcomings. However, the extremely high-dimensional tensor representation produced by the outer products puts heavy pressure on computation speed and memory usage. In [14], Liu et al. proposed the low-rank multimodal fusion method, which partially solves the problem of heavy computation and large parameter counts caused by the tensor representation but does not consider the correlation between the multiple unimodal inputs.
Attention mechanisms have been applied in various fields with satisfactory results. In [15], Wang et al. proposed the "Residual Attention Network," a convolutional neural network with an attention mechanism that can be incorporated into state-of-the-art feed-forward network architectures and trained end to end. Lin et al. proposed a novel structure-attention-based LSTM as a hierarchical structure model, which has an advantage in capturing the potential semantic structure. As for applications, Choi et al. [16] proposed a fine-grained attention mechanism for neural machine translation, while Ge et al. [17] leveraged an attention mechanism in video action recognition. Hsiao and Chen [18] integrated the attention mechanism into deep recurrent neural network models for speech emotion recognition. However, none of these works applied an attention mechanism to multimodal fusion.
In this paper, we propose a novel low-rank multimodal fusion model based on a self-attention mechanism, which combines the low-rank weight tensor with an attention mechanism to make multimodal fusion more efficient and more globally relevant. The overall framework of our model is shown in Figure 1. We evaluate our method through experiments on three multimodal fusion tasks on public data sets and compare the results with the latest models. While reducing the complexity and the number of parameters of the model, we also study how to improve its applicability and stability. To our knowledge, this is the first time that the self-attention mechanism has been applied to the low-rank factors of multimodal fusion. Compared with other tensor-based models, our model performs well in terms of both efficiency and task performance.
The main contributions of our paper are as follows:
(i) We propose low-rank multimodal fusion based on a self-attention mechanism, which can effectively improve the global correlation
(ii) While maintaining low parameter complexity and high calculation speed, our model has high adaptability and can be applied to various tasks
(iii) We report the performance of our model on three multimodal tasks evaluated on public data sets and compare it with other recent models

Figure 1: Overview of our multimodal fusion model based on a self-attention mechanism. The unimodal representations $z_v$, $z_a$, and $z_l$, obtained by passing the unimodal inputs $x_v$, $x_a$, and $x_l$ through three subnetworks $f_v$, $f_a$, and $f_l$, respectively, are the input to MF (multimodal fusion). In MF, $z_v$, $z_a$, and $z_l$ generate new unimodal representations $z_v'$, $z_a'$, and $z_l'$ through self-attention; then $z_v'$, $z_a'$, and $z_l'$ produce an output representation through low-rank multimodal fusion with modality-specific factors. The output is a multimodal representation that can be used for classification tasks.

Wireless Communications and Mobile Computing

2. Related Work

2.1. Tensor Representation Method.
The tensor representation method is one of the most successful methods for multimodal fusion. Its core idea is to convert the input representations into a high-dimensional tensor and then map that tensor to a lower-dimensional output vector space. The tensor is usually formed by taking the outer product of the input modality representations. The input tensor $Z$ is calculated from the unimodal representations:
$$ Z = \bigotimes_{n=1}^{N} z_n, \qquad z_n \in \mathbb{R}^{d_n} \qquad (1) $$
where $\bigotimes_{n=1}^{N}$ denotes the tensor outer product over a set of vectors indexed by $n$, and $z_n$ is the input representation.
The input tensor $Z \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_N}$ is passed through a linear layer $f(\cdot)$ to generate a vector representation:
$$ h = f(Z; W, b) = W \cdot Z + b \qquad (2) $$
where $W$ denotes the weight of this layer and $b$ represents the bias. Because $Z$ is an order-$N$ tensor, where $N$ is the number of input modalities, the weight $W$ is an order-$(N+1)$ tensor, $W \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_N \times d_h}$, whose $(N+1)$th dimension equals the size $d_h$ of the output representation. Since $W \cdot Z$ is a dot product, the weight $W$ can be regarded as $d_h$ order-$N$ tensors. Because of the high dimension of $Z$, the computational cost and model complexity of the tensor fusion method are greatly increased: the number of entries of $Z$ grows exponentially with the number of modalities. This makes it hard for the tensor fusion method to handle more tasks at the same time, which reduces the adaptability of the model.
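To make the growth in size concrete, the following minimal sketch (not the authors' implementation) builds the outer-product tensor for a few small vectors and applies a full weight tensor as described above. The function names, the dictionary-based tensor layout, and the example dimensions are assumptions made for illustration only.

```python
from itertools import product


def outer_product(vectors):
    """Outer product of N vectors, stored as {(i_1, ..., i_N): value}."""
    tensor = {}
    for idx in product(*(range(len(v)) for v in vectors)):
        value = 1.0
        for n, i in enumerate(idx):
            value *= vectors[n][i]
        tensor[idx] = value
    return tensor


def tensor_fusion(vectors, weight, bias):
    """Compute h = W . Z + b.

    weight is a list of d_h dictionaries (one order-N tensor per output
    dimension), each with the same index tuples as Z.
    """
    z = outer_product(vectors)
    return [
        sum(w_m[idx] * z[idx] for idx in z) + b_m
        for w_m, b_m in zip(weight, bias)
    ]


def weight_size(dims, d_h):
    """Number of entries in the full weight tensor W."""
    size = d_h
    for d in dims:
        size *= d
    return size


# Hypothetical sizes: each added 32-dimensional modality multiplies
# the number of weights by 32.
print(weight_size([32, 32, 32], 4))
print(weight_size([32, 32, 32, 32], 4))
```

Every additional modality multiplies the size of both $Z$ and $W$ by that modality's dimension, which is the exponential growth that the low-rank method is designed to avoid.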

Low-Rank Tensor Representation Method.
The low-rank multimodal fusion method aims to overcome the shortcomings of tensor-based multimodal fusion by decomposing the weight $W$ into a set of low-rank factors.
The low-rank method reduces the weights of tensor-based multimodal fusion by decomposing the weight $W$ into $N$ sets of modality-specific factors. Because $W \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_N \times d_h}$ can be regarded as $d_h$ order-$N$ tensors $\widetilde{W}_m \in \mathbb{R}^{d_1 \times \cdots \times d_N}$, $m = 1, \ldots, d_h$, the weight can be expressed as in formula (3). Each $\widetilde{W}_m$ contributes to one dimension of the vector $h$ in Equation (2), and each $\widetilde{W}_m$ has the following exact decomposition into vectors (Equation (5)):
$$ \widetilde{W}_m = \sum_{i=1}^{R} \bigotimes_{n=1}^{N} w_{n,m}^{(i)}, \qquad w_{n,m}^{(i)} \in \mathbb{R}^{d_n} $$
where $R$, the smallest value that makes the decomposition valid, is the rank of the tensor, and $\{\{w_{n,m}^{(i)}\}_{n=1}^{N}\}_{i=1}^{R}$ are the decomposition factors of the original weight tensor based on rank $R$.
For the decomposition factors $\{\{w_{n,m}^{(i)}\}_{n=1}^{N}\}_{i=1}^{R}$ above, we fix the rank $R$ to a value $r$, so that the model is parameterized by the $r$ decomposition factors $\{\{w_{n,m}^{(i)}\}_{n=1}^{N}\}_{i=1}^{r}$. We regroup these vectors as $w_n^{(i)} = [w_{n,1}^{(i)}, w_{n,2}^{(i)}, \ldots, w_{n,d_h}^{(i)}]$, so that $\{w_n^{(i)}\}_{i=1}^{r}$ are the low-rank factors corresponding to modality $n$. The weights of tensor-based multimodal fusion can therefore be transformed into a low-rank weight tensor:
$$ W = \sum_{i=1}^{r} \bigotimes_{n=1}^{N} w_n^{(i)} \qquad (6) $$
Substituting Equation (6) into Equation (2) gives a simplified low-rank tensor representation. After a series of derivation steps, a computation whose cost originally grew exponentially with the number of modalities becomes linear in it; in the resulting expression, $\Lambda_{n=1}^{N} x_n$ denotes the element-wise product over a sequence of tensors. Compared with the original tensor representation method, the low-rank multimodal fusion method improves the calculation speed and reduces the complexity of the model. However, only a simple outer product operation is performed for each modality, which largely ignores the correlation among the modalities and loses global consistency.
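The saving can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it contracts a low-rank weight tensor of the form in Equation (6) with $Z$ in two ways, once by building the full tensors explicitly and once using only per-modality projections. The bias is omitted, the function names and nested-list layout are hypothetical, and the factored version multiplies the per-modality projections element-wise within each rank and then sums over the rank index, which is the ordering for which the two computations agree exactly.

```python
from itertools import product


def project(factor, z):
    """Apply one factor of shape d_n x d_h (nested lists) to z in R^{d_n}."""
    d_h = len(factor[0])
    return [sum(factor[j][m] * z[j] for j in range(len(z))) for m in range(d_h)]


def low_rank_fusion(zs, factors):
    """Fusion without forming Z; factors[n][i] is the rank-i factor of modality n."""
    rank = len(factors[0])
    d_h = len(factors[0][0][0])
    h = [0] * d_h
    for i in range(rank):
        term = [1] * d_h
        for z, modality_factors in zip(zs, factors):
            proj = project(modality_factors[i], z)
            term = [t * p for t, p in zip(term, proj)]  # element-wise product
        h = [a + t for a, t in zip(h, term)]            # sum over the rank index
    return h


def full_tensor_fusion(zs, factors):
    """Reference: build every entry of W and Z explicitly and contract them."""
    rank = len(factors[0])
    d_h = len(factors[0][0][0])
    h = [0] * d_h
    for idx in product(*(range(len(z)) for z in zs)):
        z_entry = 1
        for n, j in enumerate(idx):
            z_entry *= zs[n][j]
        for m in range(d_h):
            w_entry = 0
            for i in range(rank):
                w_prod = 1
                for n, j in enumerate(idx):
                    w_prod *= factors[n][i][j][m]
                w_entry += w_prod
            h[m] += w_entry * z_entry
    return h


# Two modalities (d_1 = 2, d_2 = 3), rank 2, output size 2; values are arbitrary.
zs = [[1, 2], [3, 4, 5]]
factors = [
    [[[1, 0], [0, 1]], [[1, 1], [1, 1]]],
    [[[1, 2], [0, 1], [1, 0]], [[0, 1], [1, 0], [1, 1]]],
]
print(low_rank_fusion(zs, factors))     # [35, 44]
print(full_tensor_fusion(zs, factors))  # [35, 44]
```

The factored version stores only $r \cdot d_h \cdot \sum_n d_n$ weights instead of $d_h \cdot \prod_n d_n$, so its cost grows linearly rather than exponentially with the number of modalities.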

Attention Mechanism.
Neural networks equipped with attention have parallelizable computation, a lightweight structure, and the ability to capture both long-range and local dependencies. The core of the attention mechanism is to measure the correlation between $z_n$ and a query $q$. A compatibility function $g(z_n, q)$ generates a score that reflects the dependency between $z_n$ and $q$. The scores are converted into probabilities by the softmax function, and the probabilities are then used as weights:
$$ k = [g(z_i, q)]_{i=1}^{n}, \qquad p(y \mid z, q) = \mathrm{softmax}(k), \qquad s = \sum_{i=1}^{n} p(y = i \mid z, q)\, z_i $$
where $k$ is a vector of $n$ correlation scores. Applying the softmax function to $k$ gives the attention probability distribution $p(y \mid z, q)$, and $s$ is the output vector for the query $q$.
Different choices of the compatibility function $g(z_n, q)$ lead to different experimental results and define the various categories of attention mechanisms. In this paper, the attention mechanism of our method uses the dot-product compatibility function:
$$ g(z_n, q) = \langle w_{d_1} z_n, \; w_{d_2} q \rangle $$
where $w_{d_1}$ and $w_{d_2}$ are learnable parameters and $\langle \cdot, \cdot \rangle$ denotes the inner product.
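The three steps described in this section (score, softmax, weighted sum) can be sketched in a few lines. This is a minimal illustration, not the authors' code; the function names are hypothetical, and the learnable parameters are passed in as plain nested lists rather than trained.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    """Matrix-vector product for a nested-list matrix."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def dot_product_attention(zs, q, w_d1, w_d2):
    """Score each z with <w_d1 z, w_d2 q>, normalize, and average the z's."""
    proj_q = matvec(w_d2, q)
    scores = [
        sum(a * b for a, b in zip(matvec(w_d1, z), proj_q)) for z in zs
    ]
    probs = softmax(scores)
    dim = len(zs[0])
    output = [sum(p * z[j] for p, z in zip(probs, zs)) for j in range(dim)]
    return probs, output


identity = [[1.0, 0.0], [0.0, 1.0]]
probs, s = dot_product_attention([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0], identity, identity)
print(probs, s)  # the first input aligns with the query and gets the larger weight
```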

3.1. Overview.
The method proposed in this paper improves the low-rank multimodal fusion method, in particular the way it treats the input modalities. We propose a novel self-attention mechanism and apply it to the input modalities to improve the correlation and local dependence among them. Because we focus on improving each modality itself, we use self-attention rather than the traditional attention mechanism. Since our model does not introduce redundant parameters, it maintains a low complexity while improving accuracy. In addition, our self-attention module is computed in parallel, so it has lower complexity and runs considerably faster than a model using the traditional attention mechanism.

Network Architecture.
The overall framework of our network model is shown in Figure 1. The network is composed of three parts: the extraction module, the fusion module, and the classification module. The fusion module is the core of our model and is the focus of what follows. The extraction module transforms the unimodal inputs $x_v$, $x_a$, and $x_l$ into the unimodal representations $z_v$, $z_a$, and $z_l$ through the subnetworks $f_v$, $f_a$, and $f_l$. These unimodal representations are expressed as tensors, which is more convenient for the subsequent calculation. The fusion module contains a self-attention module for each unimodal representation: each representation enters the fusion module and is turned into a unimodal representation with new weights by the self-attention mechanism. In our network model we do not need to compute the input tensor $Z$ directly. Instead, we handle $z_v$, $z_a$, and $z_l$ separately in low-rank form, assign the corresponding weights $W_v$, $W_a$, and $W_l$ to each factor, and finally sum them with the weights, which greatly reduces the complexity of our model and the computational load. Finally, the representations that have passed through the self-attention module generate the output tensor of the fusion module, which is the final output that can be used for classification.

Self-Attention Module.
Since each modality carries different information, the purpose of multimodal fusion is to make full use of the complementary information in multimodal data. The self-attention module is able to capture both global and local connections, so the main contribution of this article is to introduce the self-attention module into multimodal fusion. In the self-attention module, we compute the output vector differently from the traditional attention mechanism. This approach meets our requirement of taking multiple tasks as input simultaneously and allows parallel computation. The self-attention model we propose is a scaled self-attention that includes multitask self-attention; its formula is given in Equation (12). The three parameters $q$, $k$, and $v$ of Equation (12) all satisfy $q, k, v = \varphi_{q,k,v}(z_n)$, which means that all three inputs come from the same source, where $v \in \mathbb{R}^{d_i \times n}$, $k \in \mathbb{R}^{d_n \times i}$, and $q \in \mathbb{R}^{d_i \times n}$. For the multitask attention mechanism, the input is projected into multiple subspaces, and a scaling parameter uniformly scales the dot-product attention embedded in each subspace.
Since $s = [s_1, s_2, \cdots, s_n] \in \mathbb{R}^{d_i \times n}$ is the sequence of output vectors computed from $q$, $k$, and $v$, and these all derive from $z_n$, $s$ is also a sequence of output vectors of $z_n$. From it we derive Equation (13), in which $z_n'$ is the new unimodal representation generated by the self-attention module. This completes the self-attention step for each modality. Substituting Equation (13) into Equation (7) and simplifying gives formula (14), which is consistent with the model described above: first sum the weighted factors, then take the element-wise product across the modalities. Because we merge multiple tasks at the same time, the formula can be expanded according to the actual situation; for example, when $n = 2$ it expands to formula (15). The proposed method is therefore highly adaptable and can be applied flexibly to various tasks. In our self-attention module, the new input representation covers the multiple tasks to which multimodal fusion is applied, and the module is computed in parallel, which improves the accuracy of the model while keeping the computation fast.
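As a rough illustration of the self-attention step, the sketch below implements a single-subspace scaled dot-product self-attention in which the query, key, and value are all projections of the same input. It is an assumption-laden sketch, not the authors' Equation (12): the scaling by one over the square root of the key dimension, the projection matrices `w_q`, `w_k`, and `w_v`, and the row-per-element input layout are all choices made here for illustration.

```python
import math


def matmul(a, b):
    """Product of an (m x p) and a (p x n) nested-list matrix."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(matrix):
    """Row-wise numerically stable softmax."""
    result = []
    for row in matrix:
        shift = max(row)
        exps = [math.exp(x - shift) for x in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def self_attention(z, w_q, w_k, w_v):
    """q, k, and v are all projections of the same input z (one row per element)."""
    q_mat, k_mat, v_mat = matmul(z, w_q), matmul(z, w_k), matmul(z, w_v)
    scale = 1.0 / math.sqrt(len(k_mat[0]))  # assumed scaling factor
    scores = [[scale * x for x in row] for row in matmul(q_mat, transpose(k_mat))]
    weights = softmax_rows(scores)
    return weights, matmul(weights, v_mat)


z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, z_new = self_attention(z, identity, identity, identity)
print(z_new)  # same shape as z; each row is a weighted mix of all rows
```

Because the attention weights for all rows are obtained from one matrix product, every output row can be computed in parallel, which is the property the section relies on. A multi-subspace variant would repeat this with separate projections per subspace and combine the results.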

Training Loss.
Our model adopts the mean absolute error (MAE) as its loss function. MAE is the average of the absolute errors; it reflects the actual classification error well and therefore also the classification performance of our model:
$$ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| h_i - h_i^{P} \right| $$
where $h_i$ is the output tensor obtained from our model, $h_i^{P}$ is the classified value, and $n$ is the total number of training samples.
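For scalar outputs, the loss described above reduces to a one-line computation. The following is a minimal sketch with a hypothetical function name; in a PyTorch training loop the same quantity would normally come from a built-in loss rather than hand-written code.

```python
def mean_absolute_error(outputs, targets):
    """Average absolute difference between model outputs and target values."""
    if len(outputs) != len(targets) or not outputs:
        raise ValueError("outputs and targets must be non-empty and equally long")
    return sum(abs(h - t) for h, t in zip(outputs, targets)) / len(outputs)


print(mean_absolute_error([1.0, 2.0, 4.0], [1.0, 3.0, 2.0]))  # 1.0
```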

Experimental Environment.
Our tensor representation method is broadly based on the tensor fusion network; the main difference is that our method uses a self-attention mechanism in the MF module. In the experiments, we compare our method with some of the latest multimodal fusion methods. Our experimental environment consists of two 2080Ti graphics processing units (GPUs), 32 GB of memory, and 12 Intel(R) Xeon(R) W-2133 CPU @ 3.60 GHz. Model training and testing are carried out with Conda 4.8.3, Python 3.7.7, and PyTorch 1.5.0.

Data sets.
We conduct our experiments on three multimodal data sets: CMU-MOSI [19], IEMOCAP [20], and POM [6]. These data sets provide data for sentiment analysis, speaker feature recognition, and emotion recognition. The goal of our experiments is to identify the speaker's emotions from verbal and nonverbal behaviors.

IEMOCAP.
The IEMOCAP data set is a collection of 151 recorded dialogue videos; each dialogue has two speakers, so the entire data set contains 302 videos. Each video is labeled with 9 emotions (angry, excited, fear, sad, surprised, frustrated, happy, disappointed, and neutral).

POM.
The POM data set consists of 903 movie review videos, and the speakers of each video are marked with confidence, enthusiasm, and other characteristics.

CMU-MOSI.
The CMU-MOSI data set consists of 93 movie review videos from YouTube. Each video contains multiple opinion segments, and each segment is labeled with the sentiment of the opinion.

Evaluation Metrics.
We report four evaluation metrics across our tasks: F1-emotion, accuracy Acc-k (where k is the number of classes), mean absolute error (MAE), and Pearson's correlation (Corr). F1-emotion is the F1 score of the model on each emotion; as a statistical measure of a binary classification model, it is the harmonic mean of the model's precision and recall, with a maximum of 1 and a minimum of 0. Accuracy (Acc) is the percentage of all samples that are classified correctly. Mean absolute error (MAE) reflects the classification performance, as described for the training loss. Pearson's correlation (Corr) measures the degree of correlation between variables.

4.6. Experimental Data Analysis.
We compare the performance of our method on the three tasks of sentiment analysis, speaker feature recognition, and emotion recognition with that of previous well-performing models. The results are shown in Table 1. On all data sets, our approach produces competitive and consistent results across the metrics F1, Corr, Acc, and MAE. On the emotion recognition task, our model obtains the highest F1 score on three emotions. The results verify that our method outperforms other traditional methods and is close to the state-of-the-art approaches.
On the multimodal personality trait recognition task, our model also achieves competitive results. Although LMF achieves a higher score on the Acc metric, our score is only 0.1 lower.
On the multimodal sentiment analysis task, our model performs very well on the Corr and Acc-2 metrics, and on MAE it is only 0.057 behind the best result. Overall, our method handles the multimodal sentiment regression task well.

4.7. Influence of Rank Setting.
In experiments, parameters that vary with the actual situation often have a large influence on the results, and the rank setting in our model does affect the experimental results. To show that our model performs well across tasks and is highly adaptable, we run an additional experiment in which we set different values for the rank and observe the effect on the results.
In the rank experiment, the other parameters are set as follows: the dropout rates for audio and video are both 0.2, and the text dropout rate is 0.5. In addition, the learning rate is 0.001, the batch size is 32, and the weight decay is 0.01.
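The settings listed above could be gathered into a single configuration that is reused while only the rank changes. The sketch below is illustrative: the key names and the `rank_sweep` helper are hypothetical, and the rank values in the usage line are placeholders, since the text does not list the ranks that were tested.

```python
# Fixed hyperparameters taken from the text; key names are hypothetical.
BASE_CONFIG = {
    "audio_dropout": 0.2,
    "video_dropout": 0.2,
    "text_dropout": 0.5,
    "learning_rate": 0.001,
    "batch_size": 32,
    "weight_decay": 0.01,
}


def rank_sweep(ranks, base=BASE_CONFIG):
    """Return one configuration per rank value, sharing all other settings."""
    return [dict(base, rank=r) for r in ranks]


# Placeholder rank values for illustration only.
for config in rank_sweep([1, 2, 4, 8]):
    print(config["rank"], config["learning_rate"], config["batch_size"])
```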
To evaluate the impact of different rank settings on our model, we measured the change in MAE on the CMU-MOSI data set while varying the rank. The results are shown in Figure 3. We observe that the results remain stable as the rank increases. Our model is therefore not sensitive to the rank: across the tested ranks, its performance stays stable, and the model can still be used when the rank is high.
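For reference, the classification and correlation metrics reported in the experiments can be computed as sketched below. These are the standard textbook definitions written with the Python standard library; the function names are hypothetical, and the sketch is not the evaluation code used for the reported results.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for one (binary) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pearson_corr(x, y):
    """Pearson's correlation coefficient between two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred), f1_score(y_true, y_pred))
print(pearson_corr([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```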

Conclusion
In this paper, we propose a multimodal fusion method based on a self-attention mechanism. The method uses a low-rank tensor representation and applies the attention mechanism within that representation to improve the correlation between the multiple unimodal representations. Our method achieves competitive results on different multimodal fusion tasks and data sets, and it reduces both the number of parameters and the computational complexity. It is a novel attempt to apply the attention mechanism to multimodal fusion, and it shows higher efficiency and better performance on different downstream tasks. The attention mechanism gives our model stronger classification ability while keeping the parameter count low and the efficiency high. In the experiments, our method performs better than the multimodal fusion method that uses the low-rank tensor representation alone.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.