An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention

Transformer-based models have gained significant advances in neural machine translation (NMT). The main component of the transformer is the multihead attention layer. In theory, more heads enhance the expressive power of the NMT model. But this is not always the case in practice. On the one hand, the computations of each head attention are conducted in the same subspace, without considering the different subspaces of all the tokens. On the other hand, the low-rank bottleneck may occur, when the number of heads surpasses a threshold. To address the low-rank bottleneck, the two mainstream methods make the head size equal to the sequence length and complicate the distribution of self-attention heads. However, these methods are challenged by the variable sequence length in the corpus and the sheer number of parameters to be learned. Therefore, this paper proposes the interacting-head attention mechanism, which induces deeper and wider interactions across the attention heads by low-dimension computations in different subspaces of all the tokens, and chooses the appropriate number of heads to avoid low-rank bottleneck. The proposed model was tested on machine translation tasks of IWSLT2016 DE-EN, WMT17 EN-DE, and WMT17 EN-CS. Compared to the original multihead attention, our model improved the performance by 2.78 BLEU/0.85 WER/2.90 METEOR/2.65 ROUGE_L/0.29 CIDEr/2.97 YiSi and 2.43 BLEU/1.38 WER/3.05 METEOR/2.70 ROUGE_L/0.30 CIDEr/3.59 YiSi on the evaluation set and the test set, respectively, for IWSLT2016 DE-EN, 2.31 BLEU/5.94 WER/1.46 METEOR/1.35 ROUGE_L/0.07 CIDEr/0.33 YiSi and 1.62 BLEU/6.04 WER/1.39 METEOR/0.11 CIDEr/0.87 YiSi on the evaluation set and newstest2014, respectively, for WMT17 EN-DE, and 3.87 BLEU/3.05 WER/9.22 METEOR/3.81 ROUGE_L/0.36 CIDEr/4.14 YiSi and 4.62 BLEU/2.41 WER/9.82 METEOR/4.82 ROUGE_L/0.44 CIDEr/5.25 YiSi on the evaluation set and newstest2014, respectively, for WMT17 EN-CS.


Introduction
Bahdanau et al. [1] were the first to introduce the attention mechanism to neural machine translation (NMT) along with recurrent neural networks (RNNs): the mechanism weighs the importance of each source token when producing each target token. By contrast, the traditional approach predicts the target token at every time step from a single fixed-length context vector [1]. Kalchbrenner et al. [2] and Gehring et al. [3,4] combined the attention mechanism with models based on the convolutional neural network (CNN) for NMT. Recently, transformer-based models have become popular solutions to sequence-to-sequence (seq2seq) problems like NMT [5][6][7], for they outperform RNN-based and CNN-based models. In multihead attention, the projection size for each head is commonly referred to as the head size [10].
However, this multihead attention mechanism has two problems. On the one hand, in theory, more heads make a model more expressive in natural language processing (NLP). However, some scholars demonstrated that more heads do not necessarily lead to better performance. The low-rank bottleneck may arise once the number of heads surpasses a certain threshold [10]. Namely, more heads generate redundant head information, increase the computational complexity of the model, cause feature redundancy, and reduce performance. Voita et al. [11] and Michel et al. [12] showed that only a small part of the heads is truly important for NMT, especially those in the encoder block. Important heads serve identifiable functions, such as handling morphology, syntax, and low-frequency words, while other heads only convey repeated and incomplete information. On the other hand, each head is independent, and the mutual relationship among the heads is not considered. The attention of each head is computed only within its own subspace, never across different subspaces. The multihead self-attention mechanism only concatenates all the results at the end.
To avoid the low-rank bottleneck brought by more heads, Bhojanapalli et al. [10] brought the parameters of the low-dimensional space close to the attention matrix by increasing the key size of each subhead to the sequence length. Shazeer et al. [13] argued that, when the dimension of the subhead becomes extremely small, the dot product between the query and the key no longer serves as an informative matching function. To address this issue, talking-head attention was proposed. Under this mechanism, the attention can attend to any query and key, regardless of the number and dimensions of the subheads, by learning linear projection matrices before and after the softmax function. However, both attention mechanisms are still computed within the same subspace. Besides, the former approach may not improve machine translation performance, because sequence lengths vary widely. For talking-head attention, more parameters have to be learned, as the attention head distribution becomes more complex. Therefore, it is necessary to determine the maximum number of heads that avoids the low-rank bottleneck and to make full use of the interactive information of all heads. To attend to all subqueries and subkeys and prevent the low-rank bottleneck, this paper proposes the interacting-head attention mechanism, based on the following intuitions: (1) when there are relatively few heads, the attention relationship between the head sizes among different subspaces increases with the head size; (2) when there are relatively many heads, the attention relationship between the head sizes among different subspaces decreases with the head size and may be ignored in the most extreme case; (3) the right number of heads must be selected, because it is computationally intensive to calculate the head attention of all tokens in all subspaces. The proposed interacting-head attention mechanism enables the heads to talk within the same subspace and to interact with each other across different subspaces.
Furthermore, a suitable threshold was defined for the number of heads to control the training time and decoding time, while avoiding the low-rank bottleneck and ensuring the head size.
Our model was compared to three baseline multihead attention models on three evaluation datasets. The comparison shows that the interacting-head attention mechanism improves the translation performance and enhances the expressive power.

Preliminaries
This section recaps the transformer architecture, which outshines RNNs and CNNs in seq2seq tasks, reviews the background of various forms of attention, especially the multihead attention used in the transformer [5], analyzes the low-rank bottleneck induced by multihead attention in the standard transformer, and introduces the two mainstream solutions to the low-rank bottleneck, as well as their problems in NMT.

Transformer.
The transformer architecture resolves NMT solely by relying on the attention algorithm [5]. It has been shown that transformer-based models are superior to the models using RNNs and CNNs [1-4, 8, 9]. Like RNNs and CNNs, the standard transformer-based model employs the encoder-to-decoder structure for NMT [14]. This structure maps the source sequence to a hidden state matrix as a natural language understanding (NLU) task and views the matrix elements as the context vectors or conditions for producing the target sequence. Encoder and decoder blocks are stacked in the encoder-to-decoder structure.
Each encoder block usually comprises a multihead self-attention layer and a feedforward layer with residual connection [15], followed by a normalization layer [16]. As the core component of the encoder, the multihead self-attention layer captures the hidden representations of all the tokens within the source sequence.
This operation mainly depends on the self-attention network (SAN), which learns the mutual attention score of any two tokens in the source sequence. It should be noted that the learned attention scores constitute an asymmetric square matrix, because of the learned parameters. For example, $a_{ij}$, the attention score from the i-th token to the j-th token, is not equal to $a_{ji}$, the attention score from the j-th token to the i-th token. Specifically, the SAN computes the attention scores by the scaled dot product attention algorithm. Since each token is visible to the others, the encoder can capture the feature of each token in two directions. There are two primary functions of the encoder: (1) learning the hidden representations of the input sequence as a condition for natural language generation (NLG) tasks, for example, NMT; (2) completing downstream NLP tasks, such as sentiment classification or labeling by transfer learning, after being trained independently as a masked language model (MLM) [17] and connected to specific networks. The decoder blocks have a structure similar to that of the encoder blocks. The only difference lies in an additional sublayer, which computes the attention scores between the representations of the source sequence given by the encoder and the current target token representation given by the multihead SAN of the decoder. This sublayer, known as the encoder-decoder attention layer, follows the multihead self-attention layer. In the decoder, two attention mechanisms, namely, multihead self-attention and encoder-decoder attention, are arranged to capture the hidden state of the target token in each block. Since a token is only visible to its leftward tokens, the self-attention scores form a lower triangular matrix. In other words, the multihead self-attention layer aims to focus the current target token only on the leftward tokens and mask the future tokens in the target sequence.
In addition, the decoder learns the leftward token representations to generate the token probability distribution in each time step. During training, the probability distribution of the target token is computed based on the ground-truth leftward target tokens or their representations. All the source representations are given by the encoder as the context for generating the target sequence. During inference, the current token probability distribution is computed based on the previously generated target tokens, and the source token representations are again given by the encoder. The decoder works in a teacher-forcing way during training and in an auto-regressive way during inference. The difference between the two stages is that the last token feature comes from the last ground-truth token during training and from the last token generated by the trained model during inference.
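The leftward-only visibility described above can be sketched as a lower triangular mask applied before the softmax. This is a minimal pure-Python illustration, not the paper's implementation: the helper names `causal_mask` and `masked_softmax` are hypothetical, and the large negative offset is the common masking trick, assumed here for illustration.

```python
import math

def causal_mask(n):
    """Lower triangular visibility mask: 1 where token i may see token j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax that gives hidden positions a weight of (numerically) zero."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        # Hidden positions receive a large negative offset before normalization.
        shifted = [s + (-1e9) * (1 - m) for s, m in zip(score_row, mask_row)]
        peak = max(shifted)
        exps = [math.exp(v - peak) for v in shifted]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
weights = masked_softmax(scores, causal_mask(3))
# Row i of `weights` is nonzero only in columns 0..i.
```

During teacher-forced training the whole target sequence can be processed at once with such a mask, whereas auto-regressive inference extends the sequence one token at a time.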
Because the attention mechanism is not order-aware, the transformer-based models add the positional information into the tokens, for example, absolute positional embedding.
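One well-known absolute scheme is the sinusoidal positional encoding of [5]. The sketch below illustrates that scheme only; whether the model in this paper uses sinusoidal or learned absolute embeddings is not stated, so the choice is an assumption, and the function names are hypothetical.

```python
import math

def sinusoidal_positions(length, d_model):
    """Absolute positional table: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(length):
        row = []
        for k in range(d_model):
            # Dimensions 2i and 2i+1 share the wavelength 10000^(2i / d_model).
            angle = pos / (10000 ** ((k // 2) * 2 / d_model))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the positional table element-wise to token embeddings (n x d)."""
    table = sinusoidal_positions(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, table)]
```

Because the table depends only on the position index, two identical tokens at different positions receive different inputs, which is what makes the otherwise order-agnostic attention order-aware.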

Attention.
For NMT, the translation performance hinges on the attention mechanism, in addition to the encoder-to-decoder structure. Bahdanau et al. [1] pioneered the use of the attention mechanism for NMT along with RNNs. Sutskever et al. [8] and Luong et al. [9] further advanced the implementation of the attention mechanism in NMT. After the introduction of the attention mechanism, a target token no longer depends only on the same context vector. Instead, the different roles of the source tokens in target token generation are reflected. Along with the appearance of the transformer, complicated attention algorithms have been developed for specific NLP tasks, such as single head attention and multihead attention. Apart from linking up the encoder with the decoder, these algorithms learn the relationships in an end-to-end way.

Dot Product Attention.
Luong et al. [9] explored methods for computing an attention score, examined their effectiveness, divided attention mechanisms into global attention and local attention (the former attends to all the source tokens, while the latter considers a subset of them), and designed three methods for computing the weight scores between two tensors or vectors along with RNNs for NMT. Here, some symbols used in Shazeer et al. [13] are adopted. In the three computing methods of [9] (dot, general, and concat), $m \in \mathbb{R}^{d}$ and $x \in \mathbb{R}^{d}$ are the matching and matched column vectors, respectively; $W \in \mathbb{R}^{d \times d}$ is a learned parameter matrix; and $\mathrm{score}(\cdot)$ is a real number. For instance, the dot form is $\mathrm{score}(m, x) = m^{\top} x$, and the general form is $\mathrm{score}(m, x) = m^{\top} W x$. The larger the score, the more important $x$ is to the generation of $m$. Dot-product attention is widely used for model implementation, by virtue of its fast speed and space efficiency [5]. In line with the notations given by Shazeer et al. [13], the attention between two sequences $X \in \mathbb{R}^{n \times d}$ and $M \in \mathbb{R}^{m \times d}$ is computed through a dot product operation,
$$O = \mathrm{softmax}(X M^{\top})\, M,$$
where $n$ and $m$ are the lengths of $X$ and $M$, respectively, which share the same dimension $d$. To keep the shape constant between the input and the output, $O \in \mathbb{R}^{n \times d}$ is regarded as the final output or mapped further to a lower or higher dimension with a linear projection matrix $W_o \in \mathbb{R}^{d \times d_o}$ to get the final output.
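As a minimal sketch of the dot product attention described above, the following pure-Python code scores a pair of vectors and attends from a sequence X to a sequence M. The dot and general score forms follow [9]; the function names and the list-of-rows matrix representation are assumptions made for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(a):
    """Numerically stable softmax applied to every row."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def score_dot(m, x):
    """Dot score between the matching vector m and the matched vector x."""
    return sum(mi * xi for mi, xi in zip(m, x))

def score_general(m, w, x):
    """General score m^T W x with a learned d x d matrix W."""
    wx = [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
    return score_dot(m, wx)

def dot_product_attention(x, m):
    """O = softmax(X M^T) M for X (n x d) and M (m x d); the result is n x d."""
    logits = matmul(x, [list(col) for col in zip(*m)])  # X M^T, shape n x m
    return matmul(softmax_rows(logits), m)

x = [[1.0, 0.0]]
m = [[1.0, 0.0], [0.0, 1.0]]
out = dot_product_attention(x, m)
```

Each output row is a convex combination of the rows of M, so the output keeps the feature dimension d of the inputs.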

Scaled Dot Product Attention.
Scaled dot product attention is referred to as single head attention in this research. This attention mechanism projects the input $X$ into $d_k$-dimensional queries $Q$ and projects the other input $M$ into $d_k$-dimensional keys $K$ and $d_v$-dimensional values $V$. An increase of $d_k$ enlarges the dot products, which in turn pushes the softmax function into regions where it has extremely small gradients [5]. Therefore, the attention score is scaled by $1/\sqrt{d_k}$. Consider first how single head attention computes the attention scores between two tensors $X \in \mathbb{R}^{n \times d_X}$ and $M \in \mathbb{R}^{m \times d_M}$, where a projection is needed to deal with the dimensional difference. The matrices of queries $Q \in \mathbb{R}^{n \times d_k}$, keys $K \in \mathbb{R}^{m \times d_k}$, and values $V \in \mathbb{R}^{m \times d_v}$ are obtained by applying the linear projection matrices $W_q \in \mathbb{R}^{d_X \times d_k}$, $W_k \in \mathbb{R}^{d_M \times d_k}$, and $W_v \in \mathbb{R}^{d_M \times d_v}$ to $X$, $M$, and $M$, respectively. The global computing can be defined as
$$Q = X W_q, \quad K = M W_k, \quad V = M W_v, \quad O = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \tag{3}$$
where $O \in \mathbb{R}^{n \times d_v}$ is the output. The final value of $O$ is obtained after the last linear projection $W_o$. If the self-attention scores are computed within a sequence, the linear projection matrices $W_q$, $W_k$, and $W_v$ act on the same tensor; namely, $X \equiv M$. If $X$ is different from $M$, the encoder-to-decoder attention scores are calculated by formula (3). Scaled dot product attention is applied in the SAN of the encoder and the decoder, as well as in the encoder-decoder attention layer. In fact, Vaswani et al. [5] used the transformer to capture the token dependencies, relying on multihead scaled dot product attention.
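The single head computation described above (project, scale by the square root of the key size, normalize, and weight the values) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: matrices are lists of rows, the function names are hypothetical, and the optional output projection is omitted.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Numerically stable softmax applied to every row."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def single_head_attention(x, m, w_q, w_k, w_v):
    """softmax(Q K^T / sqrt(d_k)) V with Q = X W_q, K = M W_k, V = M W_v."""
    q, k, v = matmul(x, w_q), matmul(m, w_k), matmul(m, w_v)
    scale = math.sqrt(len(w_q[0]))  # d_k is the number of columns of W_q
    logits = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(logits), v)

eye = [[1.0, 0.0], [0.0, 1.0]]
# Self-attention: the same tensor plays the roles of X and M.
out = single_head_attention(eye, eye, eye, eye, eye)
```

Passing the same tensor for `x` and `m` gives self-attention; passing decoder states for `x` and encoder states for `m` gives the encoder-decoder case.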

Multihead Attention.
In the standard transformer, it is beneficial to split the representations into multiple heads and concatenate the subresults of the heads in the end. This is because more heads elevate the expressive power and improve model performance. The mechanism operates on two tensors, $X \in \mathbb{R}^{n \times d_X}$ and $M \in \mathbb{R}^{m \times d_M}$, where $X$ represents the matching tensor and $M$ represents the matched objective. The dimensions of queries, keys, and values are then split into $h$ parts, where $h$ is the number of heads. Therefore, the two tensors can be projected into three low-dimensional matrices (subqueries, subkeys, and subvalues) with the corresponding low-dimensional parameter matrices $W_q^i \in \mathbb{R}^{d_X \times d_k^i}$, $W_k^i \in \mathbb{R}^{d_M \times d_k^i}$, and $W_v^i \in \mathbb{R}^{d_M \times d_v^i}$ for the $i$-th head. Under most circumstances, $d_k^i$ is equal to $d_v^i$, and both are set to $d/h$, with $d$ being the model dimension [5].
In the end, all the suboutputs $O_i \in \mathbb{R}^{n \times d_v^i}$ of the subheads $h_i$ are concatenated as the final result $O \in \mathbb{R}^{n \times d_v}$. The final result can be further mapped into a lower or higher dimension with a linear projection matrix $W_o$. In the standard transformer, the multihead attention mechanism is utilized in three sublayers: encoder SAN, decoder SAN, and encoder-decoder attention. During model implementation, all three sublayers adopt multihead dot product attention.
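The split-then-concatenate procedure described above can be sketched as follows. This is a minimal pure-Python illustration: slicing the columns of full-width projections is assumed to be equivalent to using separate per-head matrices, the function names are hypothetical, and the final projection is left out.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Numerically stable softmax applied to every row."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def split_heads(a, h):
    """Split an n x d matrix into h matrices of shape n x (d / h)."""
    size = len(a[0]) // h
    return [[row[i * size:(i + 1) * size] for row in a] for i in range(h)]

def multihead_attention(x, m, w_q, w_k, w_v, h):
    """Run scaled dot product attention per head and concatenate the suboutputs."""
    q, k, v = matmul(x, w_q), matmul(m, w_k), matmul(m, w_v)
    sub_outputs = []
    for q_i, k_i, v_i in zip(split_heads(q, h), split_heads(k, h), split_heads(v, h)):
        scale = math.sqrt(len(q_i[0]))  # head size d / h
        logits = [[s / scale for s in row] for row in matmul(q_i, transpose(k_i))]
        sub_outputs.append(matmul(softmax_rows(logits), v_i))
    # Concatenate the h suboutputs along the feature axis.
    return [[value for o in sub_outputs for value in o[r]] for r in range(len(x))]

eye4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
x = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
out = multihead_attention(x, x, eye4, eye4, eye4, h=2)
```

Note that each head only ever compares subqueries and subkeys taken from the same slice of the feature axis, which is the limitation the paper targets.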

Low-Rank Bottleneck.
More heads theoretically enhance the expressive power, and fewer heads mean weaker expressive ability. Nevertheless, Bhojanapalli et al. [10] found that, when the number of heads is greater than $d/n$ ($d$ and $n$ are the model dimension and the sequence length, respectively), a low-rank bottleneck appears, making the model unable to represent an arbitrary context vector. To remove the bottleneck, the model dimension $d$ can be increased together with the number of heads. This approach is obviously expensive, because the heavier computations of model training require more memory resources.
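The threshold above reduces to simple arithmetic. The sketch below checks it for a model dimension of 512 and a sequence length of 20, the values reported later for the IWSLT16 setup; the function names are hypothetical, and the list of head counts is chosen only for illustration.

```python
def head_size(d_model, num_heads):
    """Per-head dimension d / h under the standard split."""
    return d_model // num_heads

def has_low_rank_bottleneck(d_model, num_heads, seq_len):
    """True when h > d / n, i.e. when the head size drops below the sequence length."""
    return num_heads > d_model / seq_len

# d = 512 and n = 20 give d / n = 25.6, so 16 heads are safe and 32 are not.
report = {h: has_low_rank_bottleneck(512, h, 20) for h in (1, 2, 4, 8, 16, 32, 64)}
```

For a fixed model dimension, doubling the number of heads halves the head size, so the bottleneck is crossed abruptly rather than gradually.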

Increasing Key Size and Head Size.
$Q$, $K$, and $V$ are always set to the same dimension $d$. After determining the model dimension $d$ and the number of heads $h$, a subhead projects $Q$, $K$, and $V$ into the subspaces $Q_i \in \mathbb{R}^{n \times d_k^i}$, $K_i \in \mathbb{R}^{n \times d_k^i}$, and $V_i \in \mathbb{R}^{n \times d_v^i}$, using a series of projection matrices $W^i \in \mathbb{R}^{d \times d/h}$, where $n$ represents the length of the sequence, and $d_k^i = d_v^i = d/h$ is the subdimension. Then, the $i$-th head computes $\mathrm{Atten}_i = \mathrm{softmax}(Q_i K_i^{\top} / \sqrt{d_k^i})$ to produce a square self-attention matrix $\mathrm{Atten}_i \in \mathbb{R}^{n \times n}$. Finally, the suboutputs $O_i \in \mathbb{R}^{n \times d_v^i}$, each the product of $\mathrm{Atten}_i$ and $V_i$, are concatenated.
Nonetheless, projecting into a low-dimensional subspace is equivalent to representing an attention score matrix of $n^2$ entries with $2n \cdot d/h$ variables. As $h$ increases, $2n \cdot d/h \ll n^2$ results in a low-rank bottleneck. Neither reducing $h$ nor increasing $d$ is ideal: the former reduces the expressive power, and the latter adds to the computing load. Bhojanapalli et al. [10] presented a solution that breaks the constraint $d_k^i = d_v^i = d/h$: $2n d_k^i \to n^2$ is realized by increasing the key size $d_k^i$. This approach changes neither the shape of the attention head nor the computing process.

Talking-Head Attention.
According to Vaswani et al. [5], adequately increasing the size of heads could improve the expressive power, but this is not supported by any empirical evidence. In particular, the translation is rather poor when the token embedding is reduced to just one scalar. Under this circumstance, the dot product of the queries (one scalar) and keys (one scalar) cannot represent their subspace features. Shazeer et al. [13] put forward a variant of multihead attention called talking-head attention, which adds two linear transformation matrices before and after the softmax function to compute the attention weights of the $i$-th head. The addition enables the attention heads to talk with each other.
In talking-head attention, the attention score of the $i$-th head, $J_i \in \mathbb{R}^{n \times m}$, is calculated in the same way as in multihead attention. Before normalization with the softmax function, the first talk between all heads is established with the projection matrix $W_{t1} \in \mathbb{R}^{h \times h}$.
Then, normalization is performed with the softmax function to get the attention weights. After that, the second talk is established with another projection matrix $W_{t2} \in \mathbb{R}^{h \times h}$.
Finally, the output representations for $X$ are computed by the same method as in multihead attention.
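The two "talks" described above amount to mixing the per-head score matrices across the head axis, once before and once after the softmax. The following pure-Python sketch illustrates this under stated assumptions: the scores are given as a list of h matrices, the mixing convention `out[a] = sum_b w[b][a] * stack[b]` is one possible reading of the projection, and all function names are hypothetical.

```python
import math

def softmax_rows(a):
    """Numerically stable softmax applied to every row."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def mix_heads(stack, w):
    """Mix h matrices across the head axis: out[a] = sum_b w[b][a] * stack[b]."""
    rows, cols = len(stack[0]), len(stack[0][0])
    return [
        [[sum(w[b][a] * stack[b][r][c] for b in range(len(w))) for c in range(cols)]
         for r in range(rows)]
        for a in range(len(w[0]))
    ]

def talking_head_weights(logits, w_t1, w_t2):
    """logits: list of h score matrices J_i (n x m); returns h mixed weight matrices."""
    talked = mix_heads(logits, w_t1)                # first talk, before softmax
    normalized = [softmax_rows(j) for j in talked]  # softmax over the m keys
    return mix_heads(normalized, w_t2)              # second talk, after softmax

identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
logits = [[[0.0, 0.0]], [[0.0, math.log(3.0)]]]    # h = 2, n = 1, m = 2
plain = talking_head_weights(logits, identity, identity)
swapped = talking_head_weights(logits, identity, swap)
```

With identity mixing matrices the result reduces to ordinary per-head softmax weights; the two h x h matrices are the extra parameters that talking-head attention has to learn.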

Defects of the Two Solutions.
The first solution aims to make $2n \cdot d_k \to n^2$, or $d_k \to n$. The designers of the solution set the head size of a head attention unit to the input sequence length and defined it as independent of the number of heads. For NMT, however, the sequence length varies greatly. The second solution employs linear transformations to change the distribution of different subattention matrices, which significantly increases the number of trainable parameters. In addition, the increase of $h$ reduces the value of $d_k$ and weakens the features generated by the subspace. As a result, the second solution cannot improve the final translation performance. Overall, the low-rank bottleneck cannot be effectively solved unless more complex high-dimensional spatial transformations are introduced.

Theoretical Hypothesis.
In the original multihead attention, a subhead computes the dot product among the subembeddings (of head size) of the tokens in the same subspace. The subembeddings in different subspaces are also expected to be correlated. The correlation should be strong when the head size is large or the number of heads is small, and weak when the head size is small or the number of heads is large. In the latter case, the correlation across subspaces can be ignored, because each subembedding is very small. Obviously, when the head size reaches its limit of 1 and the number of heads equals the model dimension, the dot product of subembeddings in the same subspace is just the product of two scalars. This certainly cannot express the feature information of the same subspace. To calculate the correlation of the subembeddings in different subspaces, this paper proposes a novel attention mechanism called interacting-head attention. It is assumed that the head size is no smaller than the sequence length, aiming to prevent the low-rank bottleneck.
The effectiveness of our model was verified experimentally based on this hypothesis.
To clarify the composition, the associations between two adjacent tokens with different head sizes in different subspaces are displayed in Figure 1, where the red line indicates the association of the head size in the same subspace, and the blue, black, green, and brown lines specify the association of the head size in subspaces 1, 2, ..., (h − 1), and h, respectively.

Computational Intelligence and Neuroscience
In fact, there is an association between any two head sizes of the tokens in different subspaces.
Figure 1: Associations between the head size of different subspaces.

Graphical Representation.
As shown in Figure 2, the traditional multihead attention adopts the method of dividing before combining. Each subhead represents the matching between subembeddings in the same subspace. However, not all subheads are associated with each other. If the number of heads grows, the omission of the dependency among some heads will result in low performance. What is worse, only the partial attention among the corresponding queries and keys is considered, although the traditional mechanism covers the main matching information. In contrast, our mechanism considers the dependencies of all the attention among the queries and keys. In addition, it is assumed that the different dimensions of the head size of a token indicate morphology, syntax, and semantic information, respectively. The morphology of a token must have a close association with the morphology of other tokens (the more important attention score). Needless to say, it is also related to the syntax and semantic information of other tokens.
Figure 2: Comparison between multihead attention (a) and our mechanism (b) with 4 heads.
Our mechanism has the following advantages: (1) Compared with talking-head attention, our mechanism does not need to learn extra parameters and only adds some inner product computations. (2) Our mechanism learns subordinate information by interacting-head attention, in addition to the attention computation of all the tokens with talking-head attention in the same subspace. In this way, all parts can fully communicate with each other.

Sufficient Interactions between Heads.
To ensure that any attention head attends to all subqueries and subkeys, this paper further examines the relationship between any subquery from the matching tensor $X \in \mathbb{R}^{n \times d_X}$ and all the subkeys from the matched tensor $M \in \mathbb{R}^{m \times d_M}$, where $X$ and $M$ are the feature matrices of the source and target sequences, $n$ and $m$ are their lengths, and $d_X$ and $d_M$ are their dimensions, respectively. The number of heads is set to $h$. As in the original multihead attention, for the $i$-th subspace, both tensors are mapped to the subquery $Q_i$, the subkey $K_i$, and the subvalue $V_i$. For each $Q_i$, the attention scores between it and all the subkeys $K_j$ ($j = 1, \ldots, h$) are computed and then normalized by the softmax function, yielding $O_{ij} \in \mathbb{R}^{n \times d_v^i}$, the attention output between the fixed subquery $Q_i$ and the varying subkey $K_j$. For interacting-head attention on a single sequence, $M$ is simply replaced with $X$. Next, the final output $O$ is obtained through similar concatenations of the sub-sub-outputs and the sub-outputs, respectively. A minimal Python implementation is shown in Algorithm 1. In practice, the deep learning framework Keras is used for all our experiments.
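The pairing of every subquery with every subkey can be sketched as follows. This is a hedged illustration, not the authors' Algorithm 1: it assumes that each pair (i, j) is normalized separately, that the weights of pair (i, j) are applied to the subvalue of subspace j, and that plain concatenation is used at both levels. All function names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Numerically stable softmax applied to every row."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def interacting_head_outputs(q_heads, k_heads, v_heads):
    """Return the h x h grid with O[i][j] = softmax(Q_i K_j^T / sqrt(d_k)) V_j."""
    grid = []
    for q_i in q_heads:
        scale = math.sqrt(len(q_i[0]))
        row = []
        for k_j, v_j in zip(k_heads, v_heads):  # assumption: V_j is paired with K_j
            logits = [[s / scale for s in r] for r in matmul(q_i, transpose(k_j))]
            row.append(matmul(softmax_rows(logits), v_j))
        grid.append(row)
    return grid

def concat_features(mats):
    """Concatenate matrices with equal row counts along the feature axis."""
    return [[value for mat in mats for value in mat[r]] for r in range(len(mats[0]))]

def interacting_head_attention(q_heads, k_heads, v_heads):
    grid = interacting_head_outputs(q_heads, k_heads, v_heads)
    sub_outputs = [concat_features(row) for row in grid]  # one sub-output per subquery
    return concat_features(sub_outputs)                   # assumed final concatenation

q_heads = [[[0.0], [0.0]], [[0.0], [0.0]]]   # h = 2, n = 2, head size 1
k_heads = [[[1.0], [2.0]], [[3.0], [4.0]]]
v_heads = [[[1.0], [3.0]], [[10.0], [20.0]]]
out = interacting_head_attention(q_heads, k_heads, v_heads)
```

The diagonal entries of the grid (i = j) coincide with the heads of ordinary multihead attention; the off-diagonal entries are the additional inner products across subspaces, and they introduce no new learned parameters.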

Choosing the Suitable Number of Heads.
In Section 2.2.3, the dimensions of the $Q_i$, $K_i$, and $V_i$ matrices of the $i$-th head are subject to $d_q^i = d_k^i = d_v^i = d/h$, which can be written as $h = d/d_k^i$. According to the definition of head size in [10], this can be expressed as $h = d/d_k^i = d/\text{head size}$. As mentioned before, increasing the model dimensionality and the number of heads can enhance the expressive ability. But a heavy computing load and a large memory demand will ensue, and too many heads lead to a low-rank bottleneck. Our model initially adopts a fixed dimensionality $d$. Inspired by [10], to prevent the low-rank bottleneck, the sequence length is regarded as the minimum head size. Therefore, the mean sequence length of the training set should be computed to obtain the maximum number of heads. In our model, the maximum possible number of heads is computed by
$$h_{\max} = \frac{d}{n}, \tag{13}$$
where $d$ is the model dimensionality and $n$ is the mean sequence length of the training set.
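A small sketch of this head-count selection follows. Restricting the result to divisors of the model dimension, so that the head size stays an integer, is an assumption added here for illustration, and the function names are hypothetical.

```python
def mean_sequence_length(token_counts):
    """Mean number of tokens per sentence in the training set."""
    return sum(token_counts) / len(token_counts)

def max_heads(d_model, mean_len):
    """Largest h that divides d_model while keeping the head size d_model / h >= mean_len."""
    candidates = [h for h in range(1, d_model + 1)
                  if d_model % h == 0 and d_model / h >= mean_len]
    return max(candidates, default=1)

# With d = 512 and a mean length of 20, d / n = 25.6 and the largest divisor below it is 16.
heads = max_heads(512, mean_sequence_length([10, 20, 30]))
```

Under this divisor assumption, mean lengths of 20, 25, and 26 with a model dimension of 512 all lead to the same limit of 16 heads.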

Experiments
This section tests our model on three datasets, namely, IWSLT16 DE-EN, WMT17 EN-DE, and WMT17 EN-CS. All of them are widely used as NMT benchmarks. Before the experiments, the three datasets were preprocessed, and the hyperparameters were configured. Three classic and efficient models were selected as baselines to demonstrate the superiority of our model in translation quality. The experimental results were analyzed to verify our hypothesis and reveal the merits and defects of our model.

Datasets.
For the IWSLT16 DE-EN corpus, the experimental data were extracted from the evaluation campaign of the International Conference on Spoken Language Translation (IWSLT 2021) [18]. The extracted data consist of 181 k/12 k sentence pairs as training/evaluation sets. The concatenation of tst2010/2011/2012/2013/2014 was taken as the test set, including around 12 k sentence pairs.
For the WMT17 machine translation task, EN-DE and EN-CS MT tasks were chosen as our problems because of the limited memory resources [19,20]. For WMT17 EN-DE and EN-CS corpora, the training set consists of 5.85 million and 1 million sentence pairs, respectively. For the two corpora, newstest2013 of 3 k sequence pairs was treated as our evaluation set and newstest2014/2015/2016/2017 as the test set.
All datasets were preprocessed through data normalization and subword segmentation, using Moses, a de facto standard toolkit for statistical machine translation (SMT) [21]. Firstly, the sentence pairs of all datasets were tokenized, and those longer than 80/80/100 on the training sets of IWSLT16 DE-EN, WMT17 EN-CS, and WMT17 EN-DE, respectively, were discarded. After that, a truecase model was trained on the cleaned training set and applied to each subset. Secondly, all sequence pairs were encoded with byte pair encoding (BPE) [22], using the SentencePiece tool (https://github.com/google/sentencepiece) [23]. This step mitigates the influence of unknown (UNK), padding (PAD), and rare tokens. In the IWSLT16 DE-EN and WMT17 EN-DE translation tasks, the source and target languages (EN and DE) have similar alphabets. Therefore, a shared vocabulary with 40,000/80,000 tokens was learned on IWSLT16 and WMT17, respectively. In the WMT17 EN-CS translation task, a vocabulary was learned for English (EN) and Czech (CS) separately, because the two languages are distant from each other.
As can be seen from Table 1 and Figure 3, the sequence lengths of the languages in different datasets obeyed similar distributions and remained consistent with the mean sequence lengths. The sequence length ranged from 3 to 120, from 1 to 332, and from 1 to 316 for IWSLT16 DE-EN, WMT17 EN-DE, and WMT17 EN-CS, respectively. The mean sequence lengths of the three datasets were set as 20, 25, and 26, respectively.

Parameter Settings.
The settings of our experimental parameters follow those in [5], which first proposed the transformer architecture for NMT. Our experiments were arranged based on an appropriate setup of the optimizer, learning rate, and hyperparameters. Adam [24] was used as the optimizer, with $\beta_1 = 0.9$, $\beta_2 = 0.997$, and $\epsilon = 10^{-9}$. The learning rate was configured by the warm-up strategy [5] with warmup_steps = 8000. During training, the label smoothing rate [25] was set to 0.1, and the dropout was fixed at 0.1. Moreover, because of the limitation of GPU memory, the dimension of the hidden state for linear transformation was set to 1024, and the model dimension was set to 512. To avoid the low-rank bottleneck, the maximum number of heads was obtained by formula (13).
Algorithm 1 (fragment):
        if mask is not None then
            mmask = (-1e+9) * (1 - mask)
            a_ij ← K.Add([a_ij, mmask])
        end if
        a_ij ← K.expand_dims(a_ij, axis = 1)
        attn.append(a_ij)
    end for
Model cpt files can also be converted into PyTorch bin files with transformers [26]. All the experiments were run on two NVIDIA Tesla V100 GPUs with 32 GB memory. During inference, a beam search algorithm was used with beam size 4 and batch size 8 to decode all test sets. The length penalty was set to 1 and 0.6 for the IWSLT16 and WMT17 test sets, respectively.
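The warm-up strategy of [5] mentioned above raises the learning rate linearly for the first warm-up steps and then decays it with the inverse square root of the step number. The sketch below assumes the unscaled formula from [5] with the model dimension 512 and 8000 warm-up steps given in this section; any extra scaling factor used in the actual experiments is not stated, and the function name is hypothetical.

```python
def warmup_learning_rate(step, d_model=512, warmup_steps=8000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), following [5]."""
    step = max(step, 1)  # guard against division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises until step 8000 and decays afterwards.
schedule = [warmup_learning_rate(s) for s in (1000, 4000, 8000, 16000, 64000)]
```

The peak value is reached exactly at the last warm-up step, so a longer warm-up lowers the peak learning rate.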
(1) BLEU [27]. BLEU, one of the most widely used evaluation methods for machine translation, uses N-gram token matching to evaluate the similarity between the reference and the candidate. The quality is positively correlated with the proximity between the translations and the references.
(2) WER [28]. Similar to translation edit rate (TER) [33], WER computes the word error rate between the reference and the hypothesis translation. The word errors are the substitutions, insertions, and deletions needed to turn the translation into the reference. The rate is the ratio of the number of word errors to the length of the reference.
(3) METEOR [29]. Based on explicit word-to-word matches, METEOR matches identical words in surface form, morphological variants in stemmed form, and synonyms in meaning between the reference and the candidate.
(4) ROUGE [30]. ROUGE was introduced by Chin-Yew Lin for text summarization. It contains four different measures: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S. Here, ROUGE-L is selected as the metric to evaluate machine translation. Note that L is the abbreviation of the longest common subsequence (LCS) between the reference and the candidate.
(5) CIDEr [31]. CIDEr was originally used to evaluate generated image descriptions. It measures the similarity of a generated sequence against a set of ground truth sentences written by humans. This similarity reflects how well the generated descriptions capture the information of grammaticality, saliency, importance, and accuracy.
(6) YiSi [32]. YiSi is a family of quality evaluation and estimation metrics for semantic machine translation. In this paper, YiSi-1 is selected for its high average correlation with human assessment, thanks to the use of multilingual bidirectional encoder representations from transformers (BERT).
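Of the metrics listed above, WER is the simplest to state precisely: it is the word-level edit distance divided by the reference length. The following pure-Python sketch illustrates that definition; it is not the pyter implementation used in the experiments, whitespace tokenization is an assumption, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

rate = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Unlike BLEU, a lower WER is better, and the rate can exceed 1 when the hypothesis is much longer than the reference.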
BLEU, WER, METEOR, ROUGE_L, CIDEr, and YiSi were computed using multi-bleu.perl (https://github.com/moses-smt/mosesdecoder), pyter (https://pypi.org/project/pyter/), nlg-eval (METEOR, ROUGE_L, and CIDEr, using https://github.com/Maluuba/nlg-eval) [34], and YiSi (https://github.com/chikiulo/yisi).
Note. Train, Eval, and Test represent the number of sequence pairs of the different data subsets, respectively; length refers to the number of tokens in a sentence; the total length of the train set is the total number of tokens in the training set; the mean length of the train set is the ratio of the total length to the total number of sequence pairs in the training set.

Baselines
(1) Original multihead attention by Vaswani et al. [5]: the original transformer-based model is implemented based on multihead attention, which brings more expressive power than single-head attention. The model linearly projects the queries, keys, and values with different, learned projection matrices to $d_k$, $d_k$, and $d_v$ dimensions, respectively. Each head yields $d_v$-dimensional output values. The outputs of all attention heads are concatenated into the final values.
(2) Multihead attention (head size equaling sequence length) by Bhojanapalli et al. [10]: in the original multihead attention, the scaling between the number of heads and head size leads to a low-rank bottleneck. To overcome the problem, Bhojanapalli et al. set the head size to the input sequence length and keep it independent of the number of heads. In this way, each head acquires more expressive power. The effectiveness of their approach was verified through experiments on the two tasks of Stanford Question Answering Dataset (SQuAD) and Multigenre Natural Language Inference (MNLI).
(3) Talking-head attention by Shazeer et al. [13]: with the increase in the number of heads, the dimensionality of query vectors and key vectors becomes so low that the dot product between the two types of vectors no longer includes useful information. This is what is commonly called a low-rank bottleneck. To address the problem, talking-head attention inserts two learned linear projection matrices across the attention-head dimension of the attention-logits tensor, allowing each attention head to target any subquery vector and subkey vector. The feasibility of this attention mechanism was tested on several seq2seq NLP tasks, but Shazeer et al. did not test the mechanism on any NMT task. Therefore, this paper implements the mechanism on both evaluation benchmarks and compares it with our model.
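The head-mixing step of baseline (3) can be sketched in plain Python. The sketch assumes the formulation in which one learned matrix mixes the attention logits across heads before softmax and a second matrix mixes the attention weights across heads after softmax. The function names `mix_heads`, `softmax_rows`, and `talking_heads_weights`, as well as the toy shapes and values, are hypothetical. The final multiplication by the values is omitted, and this is an illustration rather than the implementation evaluated in this paper.

```python
import math

def mix_heads(tensor, proj):
    """Mix a [heads_in][n][m] tensor across heads with a [heads_in][heads_out] matrix."""
    heads_in, heads_out = len(proj), len(proj[0])
    n, m = len(tensor[0]), len(tensor[0][0])
    return [
        [
            [sum(tensor[i][q][k] * proj[i][j] for i in range(heads_in)) for k in range(m)]
            for q in range(n)
        ]
        for j in range(heads_out)
    ]

def softmax_rows(tensor):
    """Apply softmax over the last (key) axis of a [heads][n][m] tensor."""
    out = []
    for head in tensor:
        rows = []
        for row in head:
            peak = max(row)  # subtract the maximum for numerical stability
            exps = [math.exp(x - peak) for x in row]
            total = sum(exps)
            rows.append([e / total for e in exps])
        out.append(rows)
    return out

def talking_heads_weights(logits, proj_logits, proj_weights):
    """Mix logits across heads, apply softmax, then mix the weights across heads."""
    mixed_logits = mix_heads(logits, proj_logits)
    weights = softmax_rows(mixed_logits)
    return mix_heads(weights, proj_weights)

# Toy example (hypothetical values): 2 heads, 2 query positions, 3 key positions.
logits = [
    [[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]],
    [[0.0, 2.0, 0.0], [1.0, -1.0, 0.0]],
]
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
plain = talking_heads_weights(logits, identity, identity)  # heads stay independent
swapped = talking_heads_weights(logits, swap, identity)    # heads exchange their logits
```

With identity mixing matrices the result reduces to an independent softmax per head, which is the behaviour of the attention weights in baseline (1). Any non-diagonal mixing matrix lets information flow between heads.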

Results.
For the IWSLT2016 DE-EN translation task, almost all models reached peak performance at 16 heads, as shown in Table 3.

Analyses. Horizontally, a low-rank bottleneck inevitably occurs when the number of heads reaches a certain level. To some extent, the previous models address this problem at the cost of performance degradation. Machine translation is a generation task between different languages. Compared with the results of previous studies, our model brings significant performance improvement and reveals strong correlations between the subembeddings in different subspaces. Longitudinally, the expressive ability of the model increases with the number of heads until the latter reaches the bottleneck point $d/n$. The superiority of interacting-head attention over the original multihead attention results from the interactions among subembeddings in different subspaces.

Influencing Factor Analysis.
As shown in Tables 3 and 4, multihead attention with fixed head size and talking-head attention sacrifice performance to solve the low-rank bottleneck. The final performance is primarily affected by four factors: the dimensions of the queries, keys, and values, and the number of heads. The attention matrix is mainly affected by the dimensions of the queries and keys. In multihead attention with fixed head size, the attention matrix is determined by the dimensions of the queries and keys, both of which are equal to the mean sequence length. The model performance hinges on such factors as the dimension of the values, the number of heads, and the mean sequence length before and after the low-rank bottleneck point. In talking-head attention, the linear transformation has a greater impact on the attention matrix before softmax normalization than after it. In our experiments, the linear transformations were applied both before and after softmax normalization. The poor performance may be attributable to the use of masked multihead attention in the decoder.
Interacting-head attention is more effective than the original multihead attention, revealing a strong relationship between the head size of different subspaces. Specifically, when the number of heads is small, there is a strong relationship between different subembeddings in different subspaces. As the number of heads grows, this relationship gradually weakens. In particular, interacting-head attention degenerates into multihead attention once the number of heads surpasses $d/n$.

The tensor calculation of different tokens in different subspaces, in particular the inner product calculation of tensors in different subspaces, slows down the training process. The slowdown is acceptable, given the large translation improvement of our model. Besides, this problem can be addressed by setting the maximum number of heads to a fixed scalar.

Maximum Number of Heads.
To verify the suitability of the number of heads, our model was subjected to an ablation test, with the number of heads changing from 32 to 64. As shown in Table 9, a low-rank bottleneck occurred once the number of heads exceeded a threshold, although the performance of our model remained better than that of the original multihead attention. According to the performance variation, the maximum number of heads should be set to the threshold $d/n$. The test was only carried out on the IWSLT16 DE-EN dataset because the training time of our model grows exponentially after the number of heads surpasses the threshold.

Discussion.
In the original multihead attention, the translation performance is positively correlated with the number of heads when the number of heads is between 2 and 16 and negatively correlated with it when the number of heads surpasses 16. Within a certain range, more heads enhance the expressive power. Once the number exceeds a threshold, a low-rank bottleneck occurs, due to the very small dimensionality of each subspace. In the original multihead attention, when there are many heads, the dimensions of each subquery, subkey, and subvalue satisfy $d_q = d_k = d_v = d_{\text{model}}/h$. In this case, $d_q$ and $d_k$ are small, and the sequence length is usually greater than $d_{\text{model}}/h$.
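The link between this condition and the low-rank bottleneck can be made explicit with a standard rank argument. The sketch assumes a single head whose query and key matrices are $Q, K \in \mathbb{R}^{n \times d_k}$ for a sequence of length $n$; the symbols $Q$, $K$, and $n$ are introduced here for illustration.

```latex
% Assumption: one attention head, Q and K of shape n x d_k, with d_k = d_model / h.
\text{If } d_k = \frac{d_{\text{model}}}{h} < n, \text{ then }
\operatorname{rank}\left(QK^{\top}\right) \le \min(n, d_k) = d_k < n .
```

Under this assumption, the $n \times n$ logit matrix of a head cannot have full rank once the head size drops below the sequence length, that is, once $h > d_{\text{model}}/n$.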
In multihead attention with fixed head size, the low-rank bottleneck can be avoided by setting $d_q = d_k = n$ in the subspace, when there are many heads.
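The head-size arithmetic in this discussion can be checked with a few lines of Python. This is a sketch under stated assumptions: the helper names `head_size`, `has_low_rank_bottleneck`, and `max_heads` are hypothetical, $d$ is taken to be the model dimension and $n$ the sequence length, and the values used (a model dimension of 512 and a sequence length of 32) are illustrative only and are not statistics of the corpora used in this paper.

```python
def head_size(d_model: int, num_heads: int) -> int:
    """Per-head dimension in the original multihead attention: d_model / h."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    return d_model // num_heads

def has_low_rank_bottleneck(d_model: int, num_heads: int, seq_len: int) -> bool:
    """True when the head size is smaller than the sequence length."""
    return head_size(d_model, num_heads) < seq_len

def max_heads(d_model: int, seq_len: int) -> int:
    """Largest head count h with d_model / h >= seq_len, i.e. floor(d / n)."""
    return d_model // seq_len

d_model, seq_len = 512, 32  # illustrative values only
for h in (2, 4, 8, 16, 32, 64):
    print(h, head_size(d_model, h), has_low_rank_bottleneck(d_model, h, seq_len))
print("maximum number of heads:", max_heads(d_model, seq_len))
```

When `max_heads` returns a value that does not divide the model dimension, a practical choice would be the largest divisor below it, since the original multihead attention splits the model dimension evenly across heads.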
In talking-head attention, the independence of the subattention matrices is improved through linear transformation between the subhead attention matrices, which is actually performed between the $i$-th $d_q$ and the $j$-th $d_k$. However, the attention matrices in the sequences are actually sparse. The sparsity can be inferred through the syntactic dependency tree and Bayesian network, and can even be observed through visualization tools like Bertviz [35] (https://github.com/jessevig/bertviz). The irregular sparsity makes it difficult to learn the optimal weight coefficients from the overall perspective of the attention matrix.
In our model, two thorny problems are resolved. Firstly, the original multihead attention only calculates the hidden features of different tokens in the same space and concatenates all subfeatures into the final output. However, our experimental results show certain connections of different tokens in different subspaces. Secondly, our model adopts the solution of multihead attention with fixed head size and proposes a method for optimizing the maximum number of heads, thereby preventing the low-rank bottleneck induced by the low dimensionality of the subspaces. The disadvantage of our model is the requirement of many tensor calculations, which prolongs the training time. Future research will try to reduce the tensor calculations by capturing the key attention and ignoring the minor attention between the tokens.

Conclusion
Currently, transformer-based models employ the multihead attention mechanism for NMT, which computes the attention scores between the tokens themselves and among the tokens in the same subspaces. However, language is complex: it contains multidimensional information, such as lexical, syntactic, and semantic information, and there are relationships between the different dimensions of information.
Therefore, this paper proposes the interacting-head attention model, which boasts two advantages. On the one hand, our model confirms the attention relationship between different tokens in different subspaces and uses this relationship to improve translation performance. On the other hand, the model provides a new method for optimizing the maximum number of heads, which helps to prevent the low-rank bottleneck. Besides, a threshold was defined for the number of heads, aiming to avoid the exponential growth of training time. Under this premise, using our model can greatly improve the translation performance. In conclusion, the experimental research in this paper argues that the interacting-head attention mechanism is significantly effective for NMT. At the same time, the experimental results show that there is a strong interaction among the different dimensions of information of all the tokens within a sequence. However, this model has two disadvantages. On the one hand, the attention scores between sequence tokens are different, and some even tend to be 0. Therefore, the attention relationship between tokens should not be a fully connected network, but a sparse network, which can also reduce the time complexity of computing the attention matrix. On the other hand, considering the attention relationship between different tokens in different subspaces, it is necessary to perform many tensor inner product calculations, especially with more heads. As a result, the training and decoding times are extended to a certain extent. These defects of the proposed model will be addressed in future work.
Note. The unit of the performance is BLEU.

Data Availability
The data that support the findings of this study are publicly available from https://wit3.fbk.eu and https://www.statmt.org/wmt17/translation-task.html. If the IWSLT16 DE-EN corpus is used in your work, reference [18] should be cited. If the WMT17 EN-DE and EN-CS corpora are used in your work, references [19,20] should be cited.

Conflicts of Interest
The authors declare that they have no conflicts of interest.