Improving Neural Machine Translation with AMR Semantic Graphs

The Seq2Seq model and its variants (ConvSeq2Seq and Transformer) have emerged as promising solutions to the machine translation problem. However, these models focus only on exploiting knowledge from bilingual sentences without paying much attention to external linguistic knowledge sources such as semantic representations. Semantic representations not only help preserve meaning but also minimize the data sparsity problem. Nevertheless, to date, semantic information remains rarely integrated into machine translation models. In this study, we examine the effect of abstract meaning representation (AMR) semantic graphs in different machine translation models. Experimental results on the IWSLT15 English-Vietnamese dataset demonstrate the efficiency of the proposed models, expanding the use of external linguistic knowledge sources to significantly improve the performance of machine translation models, especially for low-resource language pairs.


Introduction
Neural machine translation (NMT) [1][2][3][4] has proven its effectiveness and has thus gained researchers' attention in recent years. In practical applications, the typical inputs to NMT systems are sentences in which words are represented as individual vectors in a word embedding space. This word embedding space does not capture connections among words within a sentence, such as dependency or semantic role relationships. Recent studies [5][6][7][8] found that semantic information is essential to generate concise and appropriate translations in machine translation. Although these models have made significant progress, their design and functions are limited to statistical machine translation systems only. Consequently, the tasks of surveying, analyzing, and applying additional semantic information to NMT systems have not received comprehensive attention.
In this study, we present a method of integrating abstract meaning representation (AMR) graphs (https://amr.isi.edu) as additional semantic information into currently popular NMT systems such as Seq2Seq, ConvSeq2Seq, and Transformer. AMR graphs are rooted, labeled, directed, and acyclic graphs representing the entire content of a sentence. They are also abstracted from the related syntactic representations in the sense that sentences with similar meanings share the same AMR graph, even if the words used in these sentences are different. Figure 1 illustrates an AMR graph in which the nodes (e.g., want-01 and girl) symbolize concepts, while the edges (e.g., ARG0 and ARG1) represent the relationships between the concepts they connect. Compared to semantic role graphs, AMR graphs contain more relationships (e.g., between boy and girl). Besides, AMR graphs directly hold entity relations while excluding inflectional variants (i.e., using lemmas) and function words. Therefore, AMR graphs can be combined with the input text to generate better contextual representations. Moreover, the structured information from AMR graphs can help minimize the problem of data sparsity in resource-poor settings. First, the AMR graph representations are combined with the word embeddings to create a better context representation for a sentence. Then, multihead attention can focus on all positions of the contextual features together with the outputs of the AMR graph representations.
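For reference, an AMR graph of this kind can also be written in the PENMAN notation used by AMR tools. The following is a sketch of that notation for "The boy wants the girl to believe him," following the public AMR annotation guidelines:

```
(w / want-01
      :ARG0 (b / boy)
      :ARG1 (b2 / believe-01
            :ARG0 (g / girl)
            :ARG1 b))
```

The reentrant variable b expresses that the wanter and the one believed are the same boy, which is exactly the kind of extra relationship that plain semantic role graphs miss.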
Integrating AMR graphs into NMT yields several benefits. First, this addresses the problems of data sparsity and semantic ambiguity. Second, structured semantic information constructed from AMR graphs could help complement the input text by providing high-level abstract information, thereby improving the encoding of the input word embedding. Last, multihead attention can also take advantage of semantic information to improve the dependency among words within a sentence.
Recent studies have applied semantic representations to NMT models. For instance, Marcheggiani et al. [9] exploited semantic role labeling (SRL) information for NMT, indicating that the predicate-argument structure from SRL can help increase the quality of an attention-based sequence-to-sequence model. Meanwhile, Song et al. [10] proved that semantic information structured from AMR graphs can complement input text by incorporating high-level abstract information. In this approach, a graph recurrent network (GRN) was utilized to encode AMR graphs without breaking the original graph structure, and a sequential long short-term memory (LSTM) was used to encode the source input. The decoder was a doubly attentive LSTM, taking the encoding results of both the graph encoder and the sequential encoder as attention memories. Song et al. also argued that the results of AMR integration are significantly better than those of SRL integration alone, because AMR graphs include both SRL and the relationships between the nodes (i.e., words). However, Song's approach has some drawbacks: it fails to address the correlation between nodes in AMR graphs, and it investigates only a machine translation system based on the recurrent neural network (RNN). The contributions of our work are as follows: (i) first, instead of adding a node to represent an edge in the graph and assigning the properties of the edge to that node, we extend the node embedding algorithm [11] to use direct edge information; (ii) second, instead of using the graph recurrent network in [10], we propose an architecture that binds an inductive graph encoder; (iii) finally, we examine and analyze the results on the English-Vietnamese bilingual dataset, which is considered a low-resource language pair. Through experiments, we demonstrate the effectiveness of integrating AMR into neural machine translation and draw insightful conclusions for future studies.
The organization of the remainder of this article is as follows. Section 2 introduces current popular machine translation architectures such as Seq2Seq, ConvSeq2Seq, and Transformer. Next, Section 3 presents the method of representing AMR graphs in vector form and proposes a method to integrate AMR graphs into different NMT models. Then, Sections 4 and 5 discuss the corpus used in the experiments and the experimental configuration of the models, respectively. Afterward, Section 6 presents the experimental results of the machine translation models with integrated AMR and analyzes the effect of AMR on the models, along with some translation errors generated by the models. Section 7 summarizes our work.

Neural Machine Translation
In this section, we provide a brief introduction about the Seq2Seq model and its variants such as ConvSeq2Seq and Transformer.
2.1. Seq2Seq. We take the attention-based sequence-to-sequence model of [1] as the baseline model, but we use LSTM [12] in both the encoder and the decoder.

[Figure 1: an AMR graph covering the sentence "The boy wants the girl to believe him" and its paraphrases "The boy desires the girl to believe him," "The boy has a desire to be believed by the girl," and "The boy is desirous of the girl believing him."]

Decoder.
The decoder predicts the next word y_t given the context vector c and all previously predicted words (y_0, y_1, ..., y_{t-1}). We used an attention-based LSTM decoder [1], with the attention memory being the concatenation of the attention vectors among all source tokens.

For each decoding step t, the decoder feeds the concatenation of the embedding of the current input e_{y_t} and the previous context vector c_{t-1} into the LSTM to update the hidden state:

s_t = LSTM(s_{t-1}, [e_{y_t}; c_{t-1}]).

Then, the new context vector is computed as

e_{t,i} = a(s_t, h_i),  α_{t,i} = exp(e_{t,i}) / Σ_j exp(e_{t,j}),  c_t = Σ_i α_{t,i} h_i,

where a is the alignment model, a feed-forward network that scores how well the inputs around position i and the input at position t match. The output probability over the target vocabulary is calculated as

p(y_t | y_{<t}, x) = softmax(W_o [s_t; c_t] + b_o),

where W_o and b_o are the model parameters.
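As an illustration only (not the authors' implementation), the alignment and context computation above can be sketched in NumPy. The parameter names W_a, U_a, and v_a for the feed-forward alignment model a are our own:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(s_prev, H, W_a, U_a, v_a):
    # additive alignment model a: e_i = v_a^T tanh(W_a s + U_a h_i)
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h) for h in H])
    alpha = softmax(scores)          # attention weights over source tokens
    return alpha @ H, alpha          # context c = sum_i alpha_i h_i

rng = np.random.default_rng(0)
d = 4
H = rng.normal(size=(5, d))          # encoder hidden states h_1..h_5
s_prev = rng.normal(size=d)          # previous decoder state
W_a = rng.normal(size=(d, d))
U_a = rng.normal(size=(d, d))
v_a = rng.normal(size=d)
c, alpha = attention_context(s_prev, H, W_a, U_a, v_a)
```

Each decoding step would then feed the returned context vector, concatenated with the next input embedding, back into the LSTM.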

ConvSeq2Seq
This architecture was proposed by Gehring et al. [2] to completely replace the RNN with the CNN. The ConvS2S model follows the encoder-decoder architecture. Both encoder and decoder blocks share an identical structure that computes hidden states based on a fixed number of input elements. To enlarge the context size, several blocks are stacked over each other. Each block comprises a one-dimensional convolution and a nonlinearity. Each convolution kernel has parameters W ∈ R^{2d×kd} and b_w ∈ R^{2d}. The input X ∈ R^{kd} is a concatenation of k input elements with dimension d, and the kernel maps it to a single output Y = WX + b_w ∈ R^{2d}, with twice the input dimension. The output elements are then fed into subsequent layers. We leverage the gated linear unit (GLU) as the nonlinearity, applied on the output of the convolution:

v([A; B]) = A ⊗ σ(B),

where A, B ∈ R^d are the inputs to the nonlinearity, ⊗ denotes element-wise multiplication, the output v([A; B]) ∈ R^d has half the size of Y, and σ(B) is the gate that controls which inputs A of the current context are relevant. To enable deep convolutional blocks, we adopt residual connections that connect the input of each convolutional layer with its output:

h_l = v(W_l h_{l-1} + b_l) + h_{l-1},

where h_l is the hidden state of the l-th layer.
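The GLU gating above is easy to check numerically. The following sketch (our illustration, with a toy 2d = 6 convolution output Y) splits Y into A and B and applies A ⊗ σ(B):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(Y):
    # Split the 2d-dim convolution output Y = [A; B] and gate: A * sigmoid(B)
    A, B = np.split(Y, 2)
    return A * sigmoid(B)

Y = np.array([1.0, -2.0, 0.5, 0.0, 10.0, -10.0])   # [A; B] with d = 3
out = glu(Y)   # A = [1, -2, 0.5], gated by sigmoid([0, 10, -10])
```

Note that the gate σ(B) near 0 suppresses the corresponding component of A, while a gate near 1 passes it through almost unchanged.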

LightConvSeq2Seq
LightConvSeq2Seq is based on a CNN variant called lightweight convolution [13], which allows computation with linear complexity O(n), where n is the length of the input string. The structure of LightConvSeq2Seq consists of elements similar to those of ConvSeq2Seq but uses the lightweight convolution operation rather than the standard convolution operation.
Depthwise Convolution (DConv). This performs a convolution operation independently over every channel; thereby, the number of parameters reduces significantly from d²k to dk, where k is the kernel width. In general, at position i and channel c, the output O_{i,c} is calculated as follows:

O_{i,c} = DConv(X, W_{c,:}, i, c) = Σ_{j=1}^{k} W_{c,j} · X_{(i + j − ⌈(k+1)/2⌉), c}.
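A naive NumPy sketch of depthwise convolution (our illustration, with zero padding at the borders) makes the dk parameter count explicit: each channel c has its own length-k kernel W[c] and is convolved independently of the other channels:

```python
import numpy as np

def depthwise_conv(X, W):
    # X: (n, d) input sequence; W: (d, k), one kernel per channel -> d*k params.
    # O[i, c] = sum_j W[c, j] * X[i + j - k//2, c], zero-padded at the borders.
    n, d = X.shape
    k = W.shape[1]
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    O = np.empty_like(X)
    for c in range(d):               # each channel convolved independently
        for i in range(n):
            O[i, c] = W[c] @ Xp[i:i + k, c]
    return O

X = np.ones((5, 2))
W = np.array([[0.25, 0.5, 0.25],     # smoothing kernel for channel 0
              [1.0, 0.0, 0.0]])      # shift kernel for channel 1
O = depthwise_conv(X, W)
```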

Transformer.
Transformer [4] also includes an encoder and a decoder. The encoder generates a vector representation of the input sentence. Given an input x = (x_1, x_2, ..., x_n) and its representation z = (z_1, z_2, ..., z_n), the decoder sequentially produces a translation y = (y_1, y_2, ..., y_m) based on z and the previous outputs.

The Encoder.
There are N stacked identical blocks. Each of these blocks consists of 2 subblocks: a self-attention mechanism and a feed-forward network. A residual connection surrounds each subblock, followed by layer normalization. The general representation formula for the encoder is as follows:

h = LayerNorm(x + SelfAttention(x)),
z = LayerNorm(h + FFN(h)).

The Decoder.
There are also N blocks. However, each block consists of 3 subblocks: a self-attention block, a feed-forward block, and an encoder-decoder attention block inserted between them. The residual connection and layer normalization are used in the same way as in the encoder. The decoder generates outputs step by step. The self-attention block only pays attention to the positions generated in the previous steps by using a mask. The mask prevents the decoder from attending to positions that have not been generated yet, so outputs can only be predicted based on the result z of the encoder and the previous outputs.

Self-Attention.
There are 3 components: query (Q), key (K), and value (V). Attention is defined as follows:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V,

where Q, K, and V are matrices whose dimensions are d_k, d_k, and d_v, respectively.
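The scaled dot-product form can be sketched as follows (our illustration; a real Transformer computes Q, K, and V through learned projections and runs several such heads in parallel):

```python
import numpy as np

def scaled_dot_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)   # row-wise softmax
    return w @ V

rng = np.random.default_rng(1)
Q = rng.normal(size=(3, 8))    # 3 queries, d_k = 8
K = rng.normal(size=(5, 8))    # 5 keys
V = rng.normal(size=(5, 16))   # 5 values, d_v = 16
out = scaled_dot_attention(Q, K, V)   # one output row per query
```

The √d_k scaling keeps the dot products from growing with the key dimension, which would otherwise push the softmax into a near one-hot regime.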

The Proposed Method
In this section, we present the graph embedding algorithm and propose our method to integrate the AMR graph embedding representation to various well-known NMT systems such as Seq2Seq, ConvSeq2Seq, and Transformer.

Graph-Level Information
Representation. Figure 2 depicts the graph encoder architecture based on the model of Xu et al. [11], with some enhancements to integrate more information about the edges of the graph. The directed graph G = (V, E), with a label on each edge e_{u,v} ∈ E, presents the relationship between the nodes u and v that the edge connects. The process of learning the representation of a node v ∈ V is as follows:

(1) We first transform the text attribute of node v into a feature vector a_v by looking up the embedding matrix W_E.
(2) Next, we categorize the neighbors of v into two subsets: forward neighbors N→(v) and backward neighbors N←(v). In particular, N→(v) returns the nodes that v directs to, and vice versa.
(3) The information about the edge e_{u,v} between the node v and an adjacent node u is combined with u's representation, and the forward representations of the neighbors are aggregated: h^k_{N→(v)} = AGG→({[h^{k−1}_{u→}; e_{u,v}] : u ∈ N→(v)}).
(4) The result is concatenated with the current forward representation of v and passed to a feed-forward layer, followed by a nonlinear activation function σ, which updates the forward representation of v, h^k_{v→}, to be used in the next iteration.
(5) Update the backward representation of v, h^k_{v←}, using a procedure similar to steps (3) and (4), but this time we utilize the backward representations rather than the forward representations and use AGG← to aggregate neighbor information.
As mentioned in steps (3) and (5), the aggregation of neighbor representations for node v is performed with one of the following aggregation functions:

(i) Mean aggregator: performs an element-wise average over the neighbor representations.
(ii) GCN aggregator: similar to the mean aggregator, except that the result is fed into a fully connected layer and a nonlinear activation function [14]: h = σ(W · MEAN({h_u : u ∈ N(v)})), with MEAN as the function returning the average value and σ as the nonlinear activation function.
(iii) Pooling aggregator: each neighbor embedding vector is passed through a feed-forward layer followed by a pooling operation (which can be max, min, or average): h = max({σ(W h_u + b) : u ∈ N(v)}), with max as the element-wise maximum operation and σ as the nonlinear activation function.
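The three aggregators can be contrasted in a few lines of NumPy (our sketch; σ is taken as ReLU and W, b are toy parameters, purely for illustration):

```python
import numpy as np

def sigma(x):                       # nonlinear activation (ReLU as an example)
    return np.maximum(x, 0.0)

def mean_aggregator(nbrs):
    return nbrs.mean(axis=0)        # element-wise average of neighbors

def gcn_aggregator(nbrs, W):
    # mean of neighbors, then a fully connected layer + nonlinearity
    return sigma(W @ nbrs.mean(axis=0))

def pooling_aggregator(nbrs, W, b):
    # feed-forward on each neighbor, then element-wise max pooling
    return np.max(sigma(nbrs @ W.T + b), axis=0)

nbrs = np.array([[1.0, 2.0],        # two neighbor embeddings, d = 2
                 [3.0, 4.0]])
W = np.eye(2)
b = np.zeros(2)
m = mean_aggregator(nbrs)           # element-wise mean
g = gcn_aggregator(nbrs, W)
p = pooling_aggregator(nbrs, W, b)  # element-wise max
```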

Graph Embedding.
Graph embedding Z contains all the information of the graph and is calculated by one of the following two methods: (i) Pooling based: the node embeddings z_v, v ∈ V, are passed through a linear transformation network, and pooling is performed on the results.
(ii) Adding a super node: a node v_s is pointed to by all nodes in the graph. Using the algorithm in Section 3.1, the representation z_{v_s} of v_s is computed. Since this representation contains the information of all nodes, it can be considered the representation of the graph, i.e., the graph embedding.

Dual Attention Mechanism.
The architecture of the AMR-integrated machine translation model is illustrated in Figure 3, with an English input sentence and a corresponding AMR graph. The proposed architecture consists of an encoder for the input sentence and a decoder whose input is the result of the encoder. The main difference from the traditional encoder-decoder model is that there is an additional graph encoder to process the information in the graph and to represent this information in vector form. This vector is then combined with the hidden states of the encoder and fed into the decoder to find the corresponding representation in Vietnamese.
We propose a specific integration method for the Seq2Seq model with sequential processing in Section 3.2.1 and focus on models with parallel processing such as ConvSeq2Seq, LightConvSeq2Seq, and Transformer in Section 3.2.2.

Seq2Seq Model with the Sequential Processing
Mechanism.
The model (Figure 4(a)) consists of two attention mechanisms operating independently: the original attention (left) learns the alignment between the result y_{i−1} and the hidden states h_j, j ∈ [1, n], of the encoder, while the graph attention learns to align the output with the nodes in the AMR graph, yielding a context vector c'_{i−1}. In particular, the computation of c'_{i−1} starts with e_{ij} = a(s_{i−1}, z_j), where a is a feed-forward network evaluating the match between the node representation z_j and the decoder state. These two context vectors are then combined with the decoder's state s_i and the embedding vector of y_{i−1} to calculate a probability distribution that determines y_i.
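A minimal sketch of the dual attention step (our illustration; dot-product scoring is used here for brevity, whereas the model uses the feed-forward alignment model a):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dual_context(s_prev, H, Z):
    # original attention over encoder states H, graph attention over nodes Z
    a_src = softmax(H @ s_prev)
    a_graph = softmax(Z @ s_prev)
    c_src = a_src @ H                 # sentence context vector
    c_graph = a_graph @ Z             # graph context vector
    return np.concatenate([c_src, c_graph])   # combined for the decoder

rng = np.random.default_rng(2)
d = 4
H = rng.normal(size=(6, d))   # encoder hidden states, n = 6
Z = rng.normal(size=(3, d))   # AMR node embeddings
s = rng.normal(size=d)        # previous decoder state
c = dual_context(s, H, Z)
```

The two attentions are computed independently, so the graph attention can pick out a relevant concept even when the sentence attention is focused elsewhere.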
Attending over the node embeddings alone, the model cannot effectively learn the connection between the input sentence, the output sentence, and the AMR graph, with only a small increase of about 0.2 BLEU (experiments with LightConvSeq2Seq and Transformer). Therefore, using the graph embedding Z should help the model obtain more information about the graph before the attention calculation. This has been confirmed experimentally, with an increase of the BLEU score by 0.6. Figures 4(b)-4(d) describe the proposed models that integrate AMR with a dual attention mechanism. Regarding the LightConvSeq2Seq-AMR and Transformer-AMR models, the self-attention mechanism for the graph is similar to the description in Section 2.3, with the input being the representations of the nodes z_v, ∀v ∈ V, instead of the states h_i, ∀i ∈ [1, n]. Regarding the ConvSeq2Seq-AMR model, experimental results show that utilizing Luong's attention mechanism to learn the alignment between the graph and the output produces better results than the multistep attention.

The Corpus
The corpus used to evaluate the models is IWSLT15 [15], which includes approximately 130,000 English-Vietnamese bilingual sentence pairs taken from TED Talks for the training set. For fine-tuning, we use the set called tst2012, which includes 1553 parallel sentence pairs. Besides, the test sets consist of tst2013 and tst2015, which include 1268 and 1080 English-Vietnamese bilingual pairs, respectively. The statistical information is given in Table 1.
For the preprocessing phase, byte-pair encoding (BPE) (https://github.com/rsennrich/subword-nmt) [16] with 8000 operations is utilized to deal with rare words and compound words for both English and Vietnamese, thereby significantly reducing the vocabulary size in English from 54111 to 5208 and in Vietnamese from 25335 to 3336.
For AMR parsing, we use the NeuralAmr toolkit (https://github.com/sinantie/NeuralAmr) [17], which implements sequence-to-sequence models for the tasks of AMR parsing and AMR generation. Their model achieves a competitive result of 62.1 SMATCH [18], the best score (at the time of doing this work, January 2020) reported without significant use of external semantic resources. This tool produces AMR graphs represented in the PENMAN notation (https://www.isi.edu/natural-language/penman/penman.html) and in a linear form, as demonstrated in the AMR preprocessing example.

Experimental Configuration
The models are implemented in Python 3 and use the Fairseq library (https://fairseq.readthedocs.io/en/latest/#) [19]. The configuration of the base models is as follows: (i) Seq2Seq: we investigate the MT model with two types of LSTM, uni-LSTM (unidirectional) and bi-LSTM (bidirectional). The word embedding dimension is 512, with 512 LSTM hidden units in both the encoder and the decoder.
(ii) ConvSeq2Seq: it comprises 4 convolutional blocks and 512 hidden units for both the encoder and the decoder. The kernel size is 3. (iii) LightConvSeq2Seq: it consists of 4 convolutional blocks with kernel sizes of 3, 7, 15, and 31, applied to both the encoder and the decoder. Self-attention is adopted with H = 8 heads. (iv) Transformer: it has N = 6 blocks for both the encoder and the decoder. The word embedding dimension is set to 512, and the feed-forward network dimension to 2048. Self-attention is used with 8 heads.
The proposed models have the same configuration as the base models. Besides, the graph encoder uses 128-dimensional embeddings for the representations of both edges and nodes. We stacked 2 layers of the graph encoder and aggregated information from neighboring nodes with the mean aggregator for LSTM and max pooling for the rest of the models.
During training, the Adam optimizer [20] is used with a fixed learning rate of 0.001 for LSTM and ConvSeq2Seq, 0.0002 for LightConvSeq2Seq, and 0.0005 for Transformer.
Besides the basic models presented above, the results of the proposed model are also compared with the method of Song et al. [10]. To make a fair comparison, we have retrained Song's model with the same preprocessed dataset and tuned hyperparameters.
After the models are trained, the BLEU score [21] is used to evaluate the translation quality. We also apply the bootstrap resampling method [22] to measure the statistical significance (p < 0.05) of BLEU score differences between the translation outputs of the proposed models and the baselines.
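A simplified sketch of paired bootstrap resampling over per-sentence scores (our illustration with toy numbers; the actual procedure in [22] resamples sentence indices and recomputes corpus-level BLEU on each resampled test set):

```python
import random

def paired_bootstrap(scores_sys, scores_base, trials=1000, seed=0):
    # Fraction of resampled test sets where the baseline matches or beats
    # the proposed system; small values indicate a significant gain.
    rng = random.Random(seed)
    n = len(scores_sys)
    worse = 0
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        if sum(scores_base[i] for i in idx) >= sum(scores_sys[i] for i in idx):
            worse += 1
    return worse / trials

# toy per-sentence quality scores; here the system beats the baseline everywhere
sys_scores = [0.6, 0.7, 0.8, 0.65, 0.9] * 20
base_scores = [0.5, 0.6, 0.7, 0.55, 0.8] * 20
p = paired_bootstrap(sys_scores, base_scores)
```

A value of p below 0.05 would be reported as a statistically significant improvement.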

Results and Discussion
In this section, we present our experimental results and our analyses of the results.

Results.
Once the models have been trained, a beam search with the size of 5 is utilized to find a translation that maximizes the conditional probabilities.
With both the test sets tst2013 and tst2015, the proposed models are proven to be superior to the corresponding base model. In particular, as given in Table 2, with uni-LSTM-AMR-F and bi-LSTM-AMR, the BLEU scores are 27.21 and 29.29, respectively, which are 1.09 and 3.17 higher than Song's method [10]. Similarly, with the set tst2015, bi-LSTM-AMR improved BLEU by 2.83, compared to Song's method.
This shows that, although both approaches use a dual attention mechanism, bi-LSTM-AMR and uni-LSTM-AMR integrate the information from AMR more effectively, thereby producing better translation results.
Meanwhile, when LightConvSeq2Seq is run on tst2013 and tst2015, the BLEU scores are only 27.47 and 25.09, respectively. However, when integrating AMR into the system, the BLEU score increases significantly, by 1.0 and 0.58 on tst2013 and tst2015, respectively. Besides, LightConvSeq2Seq-AMR-F and LightConvSeq2Seq-AMR-B, which integrate graph information from one direction only, also outperform LightConvSeq2Seq, as given in Table 3.
As given in Table 4, ConvSeq2Seq also shows an improvement in machine translation quality, with an increase in the BLEU score of about 0.3 for ConvSeq2Seq-AMR on tst2013, although there is a BLEU decrease of 0.08 on tst2015. Meanwhile, the ConvSeq2Seq-AMR-F model achieves the best results when integrating information from the forward neighbors only: an increase of 0.1 BLEU is observed on tst2013 and 0.5 on tst2015. Similarly, for Transformer, integrating information from both the forward and backward neighbors in Transformer-AMR is not effective, with only an increase of 0.09 over the base model on tst2013. Combining information from the forward neighbors only, Transformer-AMR-F achieves noticeable BLEU scores of 28.88 and 26.28 on tst2013 and tst2015, respectively, an increase of 0.28 and 0.52 over Transformer.

The Effect of AMR on the NMT Model.
According to the results presented in Section 6.1, the bi-LSTM-AMR and LightConvSeq2Seq-AMR models improve BLEU more than the other two models, ConvSeq2Seq-AMR and Transformer-AMR. Therefore, to analyze the impact of AMR on the machine translation system, the bi-LSTM-AMR and LightConvSeq2Seq-AMR models are selected for further training to examine graph elements such as information integration directions, graph encoding layers, and aggregators.

Bi-LSTM-AMR
(i) Direction and Depth. Figure 5 depicts the change in performance when adjusting the number of graph encoding layers. The mean aggregator is used to combine information from neighbors. In general, bi-LSTM-AMR and uni-LSTM-AMR-B show the highest translation quality across the examined numbers of layers. However, increasing the number of layers does not always help the model achieve a higher BLEU; decreases in BLEU scores are also observed. The more layers are stacked, the greater the amount of information the model could learn, which ultimately leads to overfitting due to saturated information. All models obtain their best results with only 2 or 3 graph encoding layers. As the number of layers increases further, the BLEU scores decrease. Nevertheless, the results are more consistent and less fluctuating with bi-LSTM than with uni-LSTM.
(ii) Aggregators. Three aggregators are used for aggregating information from neighboring nodes: the mean aggregator (MA), the max-pooling (MP) aggregator, and the GCN aggregator (GCN-A). The strategy of using information from one direction (forward or backward) is also considered to make more accurate statements about the effect of the aggregator on the effectiveness of the model. The results in Table 5 show that bi-LSTM-AMR-MA achieves the highest result on the two test sets, with BLEU scores of 29.29 and 26.41, respectively (the bold values in the tables are the highest results when evaluating each model on the tst2013 and tst2015 test sets). Meanwhile, uni-LSTM-AMR-MA, which uses information from both sides, achieves lower BLEU scores than the variants uni-LSTM-AMR-F and uni-LSTM-AMR-B, which only combine information from the forward and the backward neighbors, respectively. Moreover, bi-LSTM-AMR-MA outperforms bi-LSTM-AMR-F and bi-LSTM-AMR-B due to its ability to capture information from two directions during node embedding learning and to combine it with information from the bi-LSTM encoder. Therefore, the LSTM decoder can leverage information from the graph more efficiently to improve the machine translation quality. This shows that bidirectional aggregation is more useful when combined with a bidirectional LSTM encoder. Accordingly, uni-LSTM-AMR-F-MP and uni-LSTM-AMR-B-MP, which only combine information from one direction, achieve good results when used with a unidirectional LSTM encoder.

LightConvSeq2Seq-AMR.
Similar to bi-LSTM-AMR, the LightConvSeq2Seq-AMR model is also affected by different aggregators. In particular, as given in Table 6, the mean aggregator (MA) yields better results on average than the others. The results on tst2015 show that all three models with MA achieve much higher results than the rest of the models.
On the contrary, the GCN-A results are the lowest, similar to Seq2Seq.
This proves that the information combination of GCN-A is not as efficient as those of MA and MP. Figure 6 shows the change of BLEU when stacking convolutional blocks in the encoder and the decoder, together with the effect of the number of heads H in self-attention. On both test sets, the BLEU scores increase when the number of heads increases. In particular, for the LightConvSeq2Seq-AMR model, the (4, 4) configuration, which stacks 4 convolutional blocks at the encoder and 4 at the decoder, and the (6, 6) configuration yield the best results: the BLEU scores are approximately 28 and 27.6 with just 1 head and increase to 28.46 and 28.2 when H = 8. However, with an additional graph encoding layer, the (4, 4) configuration is inferior to the (6, 6) configuration, which yields the highest results in Figure 6, while the (4, 3) configuration yields the lowest.

Conclusions
We proposed a method to integrate AMR graphs into popular machine translation architectures such as Seq2Seq, ConvSeq2Seq, and Transformer. Structured semantic information from AMR graphs can supplement the context information in the translation model for a better representation of abstract information. Experimental results show that AMR graphs yield better results than other representations such as dependency trees or semantic roles.
For future studies, we plan to examine other methods to integrate more complex semantic graphs, such as Prague Semantic Dependencies, Elementary Dependency Structures, and Universal Conceptual Cognitive Annotation, and investigate different encoding methods suitable for a range of semantic graphs.

A. Error Analysis
This section presents some translation errors of the proposed models.
In the first example in Table 7, with bi-LSTM-AMR, the model incorrectly predicts the phrase "and in V Magazine" as "và V là Magazine." Although the translation is incorrect, the model still recognizes "V Magazine" as a proper noun and that V is a magazine ("V là Magazine"). Meanwhile, both ConvSeq2Seq-AMR and Transformer-AMR cannot recognize this pattern and omit the word "Magazine" when translating. LightConvSeq2Seq-AMR is the only model that provides a relatively complete translation.
Example 2 in Table 8 illustrates the case in which the model still understands the meaning but selects the wrong representation. The English word "internal" modifies the phrase "combustion engine," and the whole phrase translates to "động cơ đốt trong." In this case, ConvSeq2Seq-AMR and bi-LSTM-AMR have taken "internal" to mean "inside," as an adjective modifying the location of the engine, and ignore the word "combustion" when translating into Vietnamese. Meanwhile, LightConvSeq2Seq-AMR and Transformer-AMR prove better at capturing this information, as they produce accurate translations.

Table 9 describes the case in which the model preserves the meaning correctly, but the reference data are incorrect. The word "it" is translated to "những thông tin đó" in the data. This is an inaccurate translation because the word "it" refers to a singular entity, while the translation is in the plural form. Besides, there is only one sentence and no information about the surrounding context, so the results obtained from the proposed models are similar to one another. The Vietnamese word "nó" can be used to refer to previously mentioned things or events. It is thus highly ambiguous and difficult to interpret even for humans. Table 10 illustrates some sample translations of the models: Song's method, bi-LSTM (base model), and bi-LSTM-AMR (proposed model).

Data Availability
The datasets used to support the findings of this study are from https://wit3.fbk.eu/.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Authors' Contributions
Long H. B. Nguyen and Viet H. Pham contributed equally to this work.