Adding Visual Information to Improve Multimodal Machine Translation for Low-Resource Language

Machine translation makes it easy for people to communicate across languages. Multimodal machine translation is an important research direction within machine translation; it uses feature information such as images and audio to help translation models produce higher-quality target-language output. However, the vast majority of current research has been conducted on commonly used corpora such as English, French, and German, while low-resource languages have received far less attention, leaving their translation relatively behind. This paper selects the English-Hindi and English-Hausa corpora and studies low-resource language translation. We use different models to extract image feature information, fuse the image features with the text information during the text encoding stage of translation, and use the image features as additional information to assist the translation model. Compared with text-only machine translation, the experimental results show that our method improves by 3 BLEU on the English-Hindi dataset and by 0.47 BLEU on the English-Hausa dataset. In addition, we analyze the effect of image feature information extracted by different feature extraction models on the translation results. Different models pay different attention to each region of the image, and the ResNet model is able to extract more feature information than the VGG model, which is more effective for translation.


Introduction
Currently, machine translation is an important research direction.
The main goal of information-driven machine translation is to decipher the source language information into the expected target language. The framework requires learning from sentence-aligned bilingual parallel data. Early statistical machine translation (SMT) is an information-driven approach that uses a probabilistic model to capture the translation process. The model takes the word as its basic element and uses a discriminative model based on maximum entropy, drawing on the information obtained from the sentences [1].
Neural machine translation (NMT) models usually consist of two parts, an encoder and a decoder. The encoder's job is to encode a sentence into a fixed-length vector according to certain rules, called the word encoding vector; this encoding vector holds the essential information of the source sentence. The text encoded by the encoder is sent to the decoder side, where the vocabulary of the target language is gradually deduced from the encoding vector, and finally the target sentence is obtained; the model relies on the decoder to produce complete sentences for the source language. Since the source and target languages usually have different lengths, recurrent networks suffer from gradient problems on long sequences, which long short-term memory (LSTM) [2] and gated recurrent units (GRU) [3] were designed to alleviate; the gated recurrent unit (GRU) is a variant of the standard recurrent neural network. Sutskever et al. [4] proposed an encoder-decoder architecture for recurrent neural networks (RNNs). The attention mechanism was originally proposed by [5] in the paper "Neural Machine Translation by Jointly Learning to Align and Translate" as an extension of work on the sequence-to-sequence model. The encoder-decoder model encodes the sentence into a fixed-length vector and decodes an output at each time step; when the sequence is too long, the decoding results deteriorate, and attention was proposed to address this length limitation of the architecture. Attention was presented as a single approach for both word alignment and translation. Word alignment is a prominent issue in machine translation: it analyzes the relationship between the input sequence and the output target sequence, and the translation model selects the most appropriate result for output based on this relationship.
Multimodal machine translation (MMT) assists the system to translate higher quality target sequences by other information besides the text, such as images, audio, and video, which are incorporated into the translation model by different ways to assist the system. It has been shown that [6], by matching image information with text and performing training, produced better results than text-only translation.
For now, plain-text machine translation with an encoder-decoder architecture is a widely used technique.
This model accepts a sequence as input and encodes the information in the sequence into an intermediate representation; the decoder then decodes the intermediate representation into the target language. LSTM has been shown to be a good solution to the long-range sequence dependence problem. Sutskever et al. [4] proposed encoding the input sequence with one LSTM and decoding the resulting vector with another LSTM to obtain the output sequence. However, for long sentences, this encoder-decoder structure is unable to encode all the information; Bahdanau et al. [5] proposed the attention mechanism, which solved this problem. In low-resource language translation, model architectures based on the attention mechanism perform well; for example, the system of [7], with a bidirectional RNN (BRNN) encoder and a double-attention RNN decoder, performed well in the WAT2021 multimodal machine translation task.
In this paper, we investigate improving on a text-only machine translation model by adding image information as an auxiliary modality to improve the quality of the translated target sentences. In our experiments, we extract image features with different models, add the image feature information to the encoder side of the baseline, fuse it with the initial word embedding and position embedding information, and train an encoder that incorporates image features, followed by a standard decoder module, forming a complete encoder-decoder model and thus improving the quality of English-to-Hindi translation. Comparing the experimental results, our approach improves by 3 BLEU on the test set over the method of [7]. We also use different models to extract the image feature information and compare the impact of the features extracted by the different models on the translation results.

Related Work
For English-Hindi translation tasks, the literature shows that work is still largely text-only [8][9][10]. Reference [8] used synthetic data, following the multimodal NMT setting of [11], and obtained a BLEU score of 24.2 for Hindi-to-English translation. In the WAT2019 English-to-Hindi multimodal translation task, [12] obtained a BLEU score of 20.37 on the challenge test set. This score was improved again in the WAT2020 task [12], where a BLEU score of 33.57 was obtained on the challenge test set. In [7], a bidirectional RNN (BRNN) encoder and a double-attention RNN decoder with default settings were used, together with pretrained word embeddings trained on a monolingual corpus and the additional IITB parallel corpus. In the WAT2021 multimodal translation task, [13] attempted to employ phrase pairs to improve English-to-Hindi translation effectiveness and performance. Reference [14] proposed a reinforcement learning (RL) method that introduces sequence-level supervised signals as a systematic reward applied to the model when building the NMT system. Hausa, a Chadic language, is a member of the Afro-Asiatic language family. It is estimated that about 100 to 150 million people speak the language, with more than 80 million native speakers.
This is more than any other Chadic language. Despite its large number of speakers, Hausa is considered low-resource in natural language processing (NLP) [15].
Reference [16] proposes studying the effect of incorporating visual information into translation across datasets and suggests how to use such information to guide text translation. Reference [17] proposed a Sequence Adaptive Memory (SAM) based cognitive translation model, built on an improved version of the Cortical Learning Algorithm (CLA), for English-to-Hindi translation. The model is capable of learning to create word pairs, dictionaries, and translation rules on very small datasets; comparisons with traditional phrase-based approaches and current state-of-the-art methods show comparable results. Other scholars [18] used pretrained fine-tuning for translation tasks between Hindi and English: they used the pretrained mBART50 model [19] in a multitask setting, with translation as the primary task and self-supervised language modeling as a secondary task, and compared the performance with a conventionally fine-tuned mBART50 trained only for the translation task. Building on these research results, we integrate image modality information with text information on top of text-only machine translation and feed both into the neural network simultaneously to increase the information available in the encoding stage. The encoder adopts the encoder module of the transformer, which maps the input into three matrices: the query matrix Q, the key matrix K, and the value matrix V. The degree of influence of node i on node j is represented by the dot product of the query vector Q_i of node i and the key vector K_j of node j; after scaling and normalization, it becomes the weight with which node j aggregates the information of node i.

Visual Feature Extraction.
Matching images with text: in translation, the extracted image features and the text information are fed into the translation model simultaneously for fused encoding. For the image features, we choose three models, VGG11, VGG19, and ResNet50, for feature extraction, to analyze whether the features extracted by different models contain rich visual information and how much they contribute to the translation task. We fine-tune the pretrained VGG and ResNet models to obtain the image feature information we need.
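As a rough illustration of what a truncated CNN extractor yields, the sketch below global-average-pools a convolutional feature map into a fixed-length vector. The 2048 × 7 × 7 shape mimics a ResNet50 final feature map; the values are random stand-ins, not real model activations.

```python
import numpy as np

def pool_feature_map(fmap):
    """Collapse a (channels, height, width) conv feature map into a
    fixed-length vector by global average pooling over spatial positions."""
    return fmap.mean(axis=(1, 2))

# Random stand-in for a ResNet50-like final feature map (not real activations).
fmap = np.random.default_rng(2).normal(size=(2048, 7, 7))
feat = pool_feature_map(fmap)
print(feat.shape)  # (2048,): one value per channel
```

Pooling this way yields the same feature length for every image, which is what lets the features be fused with fixed-dimension text embeddings later.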

Neural Machine Translation.
Currently, most Neural Machine Translation (NMT) models are based on the transformer [20], whose core is the attention mechanism, used together with a multihead mechanism. Each attention head operates on an input sequence x = (x_1, ..., x_n) of n elements, where x_i ∈ R^d, and computes a new sequence z = (z_1, ..., z_n) of the same length, where z_i ∈ R^d:

z_i = Σ_j c_ij (x_j W^V),

where c_ij is the weight coefficient, calculated by the softmax function:

c_ij = softmax((x_i W^Q)(x_j W^K)^T / √d),

where W^Q, W^K, W^V ∈ R^{d×d} are the trainable parameter matrices of a particular layer. In traditional machine translation (MT) tasks, the data are text-only: the text is encoded and fed into the neural network, and the source sentence contains only textual information. As in Figure 1, we can divide the overall transformer process into several nodes at different stages, such as the node combining word vectors and positional encoding, the multihead attention node, and the feedforward node.
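The attention computation above can be sketched in numpy. This is a minimal single-head illustration; the dimensions and random weights are illustrative, not the model's actual parameters.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention over a sequence x of shape (n, d).

    Implements z_i = sum_j c_ij (x_j Wv), where c_ij is the softmax of
    the scaled dot products (x_i Wq)(x_j Wk)^T / sqrt(d).
    """
    n, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv              # each (n, d)
    scores = q @ k.T / np.sqrt(d)                 # (n, n) pairwise influence
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    c = np.exp(scores)
    c /= c.sum(axis=-1, keepdims=True)            # softmax weights c_ij
    return c @ v                                  # (n, d) new sequence z

rng = np.random.default_rng(0)
n, d = 5, 8                                       # toy sizes
x = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
z = self_attention(x, Wq, Wk, Wv)
print(z.shape)  # (5, 8): same length as the input, as the text states
```

A multihead layer applies this computation h times with separate parameter matrices and concatenates the results before a final linear projection.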
If we want to integrate information from other modalities, we need to add their node information to the overall process as additional information to assist the system in translation. Therefore, we extract feature information from the images and add the image features as pseudo-words to the multimodal self-attention layer.

Multimodal Self-Attention.
The authors of the transformer [20] proposed a model with attention that better captures the dependencies between the words in a sentence, called the self-attention mechanism; the structural design is shown in Figure 1. In their model, the hidden state is calculated by self-attention and feedforward layers, while positional encoding represents the positional relationships of the words in the sentence, and a multihead attention mechanism enables parallelized processing. The processing of MT is accelerated by this parallelism, leading to better results than models based on the recurrent baseline.
As mentioned earlier, in MMT the extracted image feature information is only used as additional information to guide the system during translation. Therefore, compared to the text, the visual feature information is not equally important: if it is fused directly with the position embedding and word embedding at the input, unnecessary noise is introduced. Following the multimodal self-attention mechanism proposed by [21], we add the image feature information to the encoder part of the overall architecture, as shown in Figure 2.
Our work improves the encoding stage of the transformer. The transformer is composed of several submodules, each containing multihead attention, a feedforward neural network, and a residual structure. Each submodule relates each token to the inputs that precede it. First, the self-attention mechanism is used to construct the multihead attention modules; the approach is to apply self-attention to the same input several times, using separate normalization parameters. With the attention parameters set once, they can be reused several times, enabling parallel processing.
In addition, by computing attention with multiple heads, the model easily learns the relationships captured by the different heads. During data processing, the self-attention module calculates and updates the weights according to the importance of each token j in the sequence. As mentioned earlier, the input sequence is mapped into three matrices (Q, K, V). The three projection matrices are split across the attention heads, with d denoting the embedding dimension and h the number of heads. The final outputs of all heads are integrated by a linear connection.
In the multimodal self-attention mechanism, the attention parameters, once set, can be reused several times to enable parallel processing, which provides a natural adaptation from text to image. We therefore take text and image as two parts of the input, x^txt ∈ R^{n×d} and x^img ∈ R^{p×d}, and compute the multimodal attention as

z_i = Σ_j c_ij ([x^txt; x^img]_j W^V),

where c_ij is the weight obtained from the softmax function. We thus obtain

z ∈ R^{(n+p)×d},

the joint hidden representation of text and image. The decoder side receives the z generated by the encoder side and then generates the target sequence. The extracted spatial image features are not encoded directly by the model; instead, they modulate the attention to compute a hidden representation of each word together with the image. In each encoder layer, we also use a residual connection between layers as well as a layer normalization. Finally, the standard transformer decoder is attached.
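A minimal numpy sketch of the multimodal variant, attending over the concatenation of n text tokens and p image pseudo-words; the sizes and random weights are illustrative assumptions, not the trained model's parameters.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def multimodal_self_attention(x_txt, x_img, Wq, Wk, Wv):
    """Attend over the concatenation of n text tokens and p image
    pseudo-words, producing z of shape (n + p, d)."""
    x = np.concatenate([x_txt, x_img], axis=0)        # (n + p, d)
    d = x.shape[1]
    c = softmax((x @ Wq) @ (x @ Wk).T / np.sqrt(d))   # weights c_ij
    return c @ (x @ Wv)                               # joint hidden representation

rng = np.random.default_rng(1)
n, p, d = 6, 4, 8                 # illustrative sizes: 6 words, 4 pseudo-words
x_txt = rng.normal(size=(n, d))
x_img = rng.normal(size=(p, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
z = multimodal_self_attention(x_txt, x_img, Wq, Wk, Wv)
print(z.shape)  # (10, 8) = (n + p, d)
```

Each text position can thereby attend to the image pseudo-words (and vice versa), which is how the visual features influence the word representations without being added directly to the embeddings.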

Dataset Analysis.
This experiment uses the Hindi Visual Genome 1.1 dataset [22] (https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3267) and the Hausa Visual Genome (https://github.com/hausanlp/HausaVisualGenome), which contain a training set (28930 text sentences and 28928 images), a validation set (988 text sentences and 998 images), a test set (1595 sentences and 1595 images), and a challenge set (1400 sentences and 1400 images). In the study, we found that some pictures in the dataset have no corresponding descriptive sentence, and some entries have duplicate sentences. Therefore, we performed additional processing on the experimental data to remove the duplicate sentences from the text and the images without descriptive sentences. At the same time, to guarantee the consistency of subsequent image feature extraction, we also removed the grayscale images and their corresponding descriptive sentences from the dataset. The final collated datasets for our experiments contain 28,891 texts and images in the training set, 995 texts and images in the validation set, 1595 texts and images in the test set, and 1595 texts and images in the challenge test set, as shown in Tables 1 and 2.
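The cleaning steps described above can be sketched as follows; the record fields and file names are hypothetical stand-ins for the actual dataset format.

```python
# Hypothetical records: (sentence, image_id, is_grayscale) tuples.
def clean_dataset(records):
    """Drop entries with no caption, grayscale images, and duplicate sentences."""
    seen, kept = set(), []
    for sentence, image_id, is_grayscale in records:
        if not sentence or image_id is None or is_grayscale:
            continue            # missing caption/image or grayscale image
        if sentence in seen:
            continue            # duplicate sentence
        seen.add(sentence)
        kept.append((sentence, image_id))
    return kept

records = [
    ("a woman sitting on a blue bench", "img_001.jpg", False),
    ("a woman sitting on a blue bench", "img_002.jpg", False),  # duplicate text
    ("", "img_003.jpg", False),                                  # no caption
    ("a dog on the grass", "img_004.jpg", True),                 # grayscale
    ("two men playing cricket", "img_005.jpg", False),
]
print(clean_dataset(records))
```

Running this on the toy records keeps only the first bench caption and the cricket caption, mirroring the three filters applied to the real corpus.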

Image Feature Extraction.
In this experiment, we used the VGG11 and VGG19 [12] models and the ResNet50 [23] model to extract image feature information. The VGG19 model consists of 16 connected convolutional layers, each using 3 × 3 kernels and followed by 2 × 2 max pooling operations, and then 3 fully connected layers; finally, the output of the fully connected layers is classified after softmax normalization. In our experiments, we modify the last two layers of the model as shown in Figure 3: the final softmax layer is discarded and the output of the fully connected layer is modified to match the data dimension of the experiment. The VGG11 model uses the same approach to extract image features. The other feature extraction model used is ResNet50, which passes through 4 blocks with 3, 4, 6, and 3 bottlenecks, respectively; in the final stage, normalization is achieved with softmax after average pooling and a fully connected layer. With modifications similar to those of the VGG19 model, the image feature information is output by modifying the final output layer of the model. All extracted image feature data are saved in NPY format, as shown in Figure 4.

Implementation.
Our experiments are divided into two parts: a text-only baseline and multimodal comparison experiments with image features extracted by the VGG11, VGG19, and ResNet50 models. In our experiments, we set the number of multihead attention heads to 6 and the learning rate lr to 0.007; for the optimizer, we use Adam [25] with β1 = 0.9 and β2 = 0.98; the training batch size is 64 and warmup updates are set to 8000; training is performed on an RTX3090 device with 24 GB of memory.
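A minimal sketch of the last step of the extraction pipeline above: projecting a pooled feature vector to the experiment's data dimension and saving it in NPY format. The d_model of 128 and the random projection standing in for the modified output layer are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
d_model = 128                              # assumed target data dimension
pooled = rng.normal(size=(2048,))          # stand-in for a truncated-CNN output
W_proj = rng.normal(size=(2048, d_model))  # stand-in for the modified final layer
feature = pooled @ W_proj                  # matches the model's data dimension

np.save("image_feature.npy", feature)      # saved in NPY format, as in the paper
loaded = np.load("image_feature.npy")
print(loaded.shape)  # (128,)
```

Saving the features once to disk lets the translation model load them directly at training time instead of re-running the CNN on every epoch.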
Meanwhile, in the experiment, we set the patience parameter to 20: once the model reached its best result, training continued for up to 20 more epochs, and the whole training process was terminated when the best result no longer improved. To reduce experimental error, we evaluate the translation results of every trained model and select the best model result.
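The patience-based stopping rule can be sketched as follows; the validation scores are toy values, not our experimental results.

```python
def train_with_patience(val_scores, patience=20):
    """Return (best_epoch, best_score): training stops once the best
    validation score has not improved for `patience` consecutive epochs."""
    best, best_epoch, waited = float("-inf"), 0, 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            best, best_epoch, waited = score, epoch, 0  # new best: reset counter
        else:
            waited += 1
            if waited >= patience:
                break                                   # early stop
    return best_epoch, best

# Toy validation curve: improvement stops after epoch 3.
scores = [10.0, 12.5, 13.1, 13.4] + [13.0] * 30
best_epoch, best = train_with_patience(scores, patience=20)
print(best_epoch, best)  # 3 13.4
```

The checkpoint from the best epoch, not the last one, is the model that is evaluated, which matches the selection procedure described above.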

Result Analysis
We extracted image features with different models and performed multimodal machine translation experiments separately; the experimental results were evaluated using BLEU [26] (https://github.com/moses-smt/mosesdecoder). We find that the multimodal translation results improve considerably over text-only, with a 3 BLEU improvement over [27] for English-Hindi and a 0.47 BLEU improvement for English-Hausa. The multimodal translation used in our experiments translates the target language well, which indicates that extracting the image feature information corresponding to the text can effectively assist the system in producing more accurate target sentences, as shown in Table 3.
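For illustration, a minimal sentence-level BLEU with clipped n-gram precisions and a brevity penalty can be written as below; this is a teaching sketch, not the mosesdecoder implementation used to produce our scores.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Minimal sentence BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty (a sketch, not the moses multi-bleu script)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())  # clipping
        total = max(sum(c_ngrams.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zero counts
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return 100 * bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "a woman sitting on a blue bench"
print(round(bleu("a woman sitting on a blue bench", ref), 2))  # 100.0
print(bleu("a woman sitting on a metal bench", ref) < 100.0)   # True
```

A perfect match scores 100, while the "metal"-for-"blue" substitution from the example in Table 6 lowers every n-gram precision and hence the score.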
At the same time, we found that the image features extracted by different models also influence the translation results to different degrees. Experimentally, the image features extracted by the ResNet50 model score 0.6 BLEU higher than those extracted by the VGG19 model. Analyzing their structures, VGG11, VGG19, and ResNet50 all use Convolutional Neural Networks (CNNs), but at different network depths. VGG11 has 11 parametric layers in 5 VGG blocks, where the first 2 blocks use a single convolutional layer and the next 3 blocks use double convolutional layers. The first module has 1 input channel and 64 output channels, and the number of output channels doubles in each subsequent block, giving a final output of 512 dimensions. At the end of the network are three fully connected layers and a softmax structure, with 3136, 512, and 512 neural units in the fully connected layers. VGG19 also consists of 5 VGG modules, but each module contains a different number of convolutional layers: 2, 2, 4, 4, and 4, plus the final fully connected and softmax layers, which constitute the entire VGG19 network. The VGG11 network only learns the contour features of the image and cannot clearly express the relationships between regions; compared with VGG11, the extra convolutional layers of VGG19 can learn the relationships between the parts of the image more accurately. The ResNet50 network has 4 large blocks with 3, 4, 6, and 3 submodules, respectively; the network starts with a single convolution and outputs the feature information through a fully connected layer, for a total of 50 layers.
Although ResNet is deeper than VGG11 and VGG19, it has fewer weight parameters because the ResNet network uses global average pooling, which reduces the number of parameters and, to some extent, reduces overfitting and computation; it is therefore able to capture image feature information better. We can infer that different network architectures pay different attention to the features of each part of the image and therefore extract different features, leading to the differences in the experimental results of the two models. Figure 5 shows that different models attend to different regions of the image when extracting features; Figure 5(a) shows the feature extraction heatmap of the VGG19 model. We find that it pays more attention to the area of the person, while the "benches" and other areas receive less attention. In contrast, the ResNet50 model is able to focus on some of the "benches", which may account for the difference in translation results between the two auxiliary systems. Figure 6 shows heatmaps of the word-level association between source and target sentences for the different feature extraction models, when the auxiliary model is trained with image features extracted by the VGG19 model and the ResNet50 model, respectively. The two heatmaps show no major difference; only for individual words does the degree of association differ, which shows that fusing the extracted image feature information allows it to assist the model in translation. As mentioned earlier, when extracting features with different models, the focus on each region of the image differs, so the extracted feature information is not exactly the same; the auxiliary information provided to the model therefore also differs, resulting in different degrees of word-to-word association.
As Tables 4 and 5 show, for the probability of each word in the target language being selected under the different models (results given in log form), we find that both models with added image features give each word a better selection probability than the text-only model. This is because the image feature information provides additional auxiliary information to the model and increases the weight of the query matrix Q (mentioned earlier), allowing each target word to be located more accurately; this is where multimodal machine translation outperforms text-only translation.
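The effect described above can be illustrated with a log-softmax over a toy vocabulary; the decoder scores below are invented for illustration and are not taken from Tables 4 and 5.

```python
import numpy as np

def log_softmax(scores):
    """Log-probabilities over a score vector (numerically stable)."""
    s = scores - scores.max()
    return s - np.log(np.exp(s).sum())

# Hypothetical decoder scores over a tiny vocabulary at one time step.
vocab = ["a", "woman", "blue", "metal", "bench"]
text_only_scores  = np.array([0.1, 0.2, 0.5, 0.9, 0.3])   # assumed values
with_image_scores = np.array([0.1, 0.2, 1.8, 0.4, 0.3])   # image boosts "blue"

lp_text = log_softmax(text_only_scores)[vocab.index("blue")]
lp_img  = log_softmax(with_image_scores)[vocab.index("blue")]
print(lp_img > lp_text)  # the correct word gets a higher log-probability
```

When the visual features raise the score of the correct word relative to its competitors, its log-probability rises, which is exactly the pattern reported in the tables.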
Comparing sample translation results, as shown in Table 6, the text-only translation result is "a woman sitting on a metal bench", which expresses the basic meaning of the original sentence but does not accurately translate the color of the bench, "blue". The multimodal translation result is "a woman sitting on a blue bench", which accurately translates the color "blue" of the bench in the source sentence thanks to the additional information provided by the image features. From this analysis, it is clear that the multimodal machine translation method we use can improve the quality of English-Hindi translation to some extent.

Conclusion
This experiment is based on an improved transformer architecture. The transformer performs well in text-only machine translation, and in recent years it has also been adopted in multimodal machine translation research to improve translation quality. Compared with models that use text-only information, multimodal machine translation can feed image feature information matched with the text into the encoder or decoder (this paper adds image feature information at the encoder side) as additional information on limited resources, helping the system obtain a better target language. The experiments show that the model effectively uses the information provided by images for translation, and that the image feature information extracted by different network models influences the experimental results to different degrees. In future work, we will try adding image feature information to different models to improve the translation of low-resource languages.
Data Availability

The code and data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Table 6: Sentence and matching image translation example results; the example image is shown in Figure 5 (transformer is the text-only model, h represents the translated target language, and e is the corresponding English target language). Source: a woman sitting on a metal blue bench.