Extracting Parallel Sentences from Nonparallel Corpora Using Parallel Hierarchical Attention Network

Collecting parallel sentences from nonparallel data is a long-standing research problem in natural language processing. In particular, parallel training sentences are very important for the quality of machine translation systems. While many existing methods have shown encouraging results, they cannot learn the various alignment weights in parallel sentences. To address this issue, we propose a novel parallel hierarchical attention neural network that encodes monolingual sentences and bilingual sentences at separate levels and constructs a classifier to extract parallel sentences. In particular, our attention mechanism can learn different alignment weights for words in parallel sentences. Experimental results show that our model obtains state-of-the-art performance on the English-French, English-German, and English-Chinese datasets of the BUCC 2017 shared task on parallel sentence extraction.


Introduction
Parallel sentences are a very important linguistic resource consisting of text paired with its translation in different languages. A large parallel corpus is crucial for training machine translation systems that produce good-quality translations. As is well known, the major bottleneck of statistical machine translation (SMT) and neural machine translation (NMT) is the scarcity of parallel sentences in many language pairs [1][2][3]. With an increasing amount of comparable corpora on the World Wide Web, a potential solution that alleviates the parallel data sparsity is to extract parallel sentences from comparable corpora, and previous research has shown that this approach relieves the bottleneck [4][5][6][7][8][9][10][11].
Because collecting parallel sentences is important for improving the quality of machine translation systems, many works over the last two decades have tried to mine parallel sentences from comparable corpora. Their success has contributed greatly to the development of this research area. Traditional systems developed to extract parallel sentences from comparable corpora typically rely on multiple features or on metadata from the comparable corpora structure. Bouamor and Sajjad [12] proposed a hybrid approach that pairs multilingual sentence-level embeddings with a supervised classifier to identify parallel sentence pairs. They used features such as source-target punctuation marks and morphosyntactic features to build a support vector machine binary classifier. Although feature engineering is an effective strategy for filtering parallel sentences, it usually suffers from the language diversity issue. For example, named entities are an important feature for measuring source-target candidate parallel sentences, but named entity processing differs across languages. For English, CoreNLP (https://stanfordnlp.github.io/CoreNLP/) can be used to extract persons, locations, and organizations, while there are no open-source tools for named entities in other languages such as Uyghur. To address those issues, many methods extract parallel sentences without feature engineering. More recent approaches use deep learning, such as convolutional neural networks [13] and recurrent neural networks based on long short-term memory (LSTM) [1,14,15], to learn an end-to-end network classifier that filters parallel sentences.
Although mining parallel sentences using neural-network-based approaches has been quite effective, in this paper we exploit the better representations that can be obtained by incorporating knowledge of context information into the sentence model architecture. Not all parts of a sentence are equally relevant for representing parallel sentences (in the example in Figure 1, the unmarked words do not affect the detection of parallel sentences). That is, different words carry different importance weights for detecting parallel sentences.
To address those issues, this paper proposes a parallel hierarchical attention network (PHAN) that learns parallel sentence representations. First, the PHAN avoids the large amount of manual work needed for feature engineering. At the same time, compared with current neural networks, the PHAN can effectively learn language differences and the various weights of alignments. As illustrated in Figure 2, the process is as follows: (1) The model first uses one-hot word representations as inputs without feature engineering. (2) Since parallel sentence pairs have different hierarchical components (words form sentences, and two monolingual sentences form a parallel sentence pair), the model first encodes monolingual contexts to learn language differences. (3) Then, it feeds those monolingual encodings into a top network to encode a parallel sentence representation. The reason for using this network is that different words in a sentence differ in importance. Moreover, the importance of words is highly context-dependent; that is, the same word may be differentially important in different contexts [2,16,17]. (4) Finally, we aggregate the outputs of the neural network into the classification layer to identify parallel sentences. The classification layer adopts the softmax function to implement a binary classification.
Our experimental results show that our method achieves significant and consistent improvements over all baseline methods in the task of filtering parallel sentences. Our approach requires neither feature engineering nor additional computing resources. In particular, we extract parallel sentences from Wikipedia articles. Then, we use the extracted parallel sentences to train machine translation systems and show that they improve machine translation.
This paper first introduces the main research content. Section 2 presents a detailed description of the model. Section 3 presents experiments and settings. Section 4 gives the detailed results of our experiments. Finally, we conclude the paper.

Parallel Hierarchical Attention Network
In this section, we propose a parallel hierarchical attention network (PHAN) to identify parallel sentence pairs. Figure 1 shows the structure of the PHAN. We consider a training parallel dataset $D = (S^s_i, S^t_i : l_i)$, $i = 1, \ldots, N$, made of $N$ pairs of sentences $(S^s_i, S^t_i)$ with labels $l_i \in \{0, 1\}$. If a pair of sentences is parallel, the label is 1; otherwise it is 0. For example, we set the label of the two sentences "I love the motherland" and "wo ai zuguo" to 1.
The network takes a pair of sentences $(S^s_i, S^t_i)$ as input and outputs a label $l_i$ for the pair. It has two levels: monolingual sentences and bilingual sentences. The monolingual level is made of a source language encoder and a target language encoder. The monolingual encoder is made of two bidirectional GRU (Gated Recurrent Unit) networks with parameters $H_w$ and an attention model with parameters $a_w$, while the bilingual encoder level similarly includes a network and an attention model. The monolingual level mainly encodes the context and dependencies of each monolingual sentence. The bilingual level mainly encodes the interactive context and dependencies of the parallel sentence pair. The classification layer uses the output $p(s|t)$ to determine a label $l_i$.

Word Layers.
In natural language processing, continuous word embeddings [18] are often used as the input of the neural network. In this task, however, we use one-hot vectors instead, because the one-hot representation is more suitable for capturing the context information of a sentence. In the first step, to compare source and target sentences in the mathematical sense, we project them into a one-hot $n$-dimensional space: each word is converted into a one-hot representation.
In order to get this one-hot vector, we define a lexicon $V = \{w_1, w_2, \ldots, w_m\}$, where $m$ is the number of words of the source or target sentences. The one-hot vector of the word $w_i$ is an array of the form $[0, 0, \ldots, 1, \ldots, 0]$, in which the entry at the index of the word in the lexicon is set to 1. For example, for the sentence "she is the king," the lexicon is [she, is, the, king]. Then, the one-hot vector of "the" is $[0, 0, 1, 0]$. The one-hot representation of the $j$th word in the $i$th sentence is obtained by applying $\mathrm{Embedding}(\cdot)$ to $w^s_{i,j}$, the $j$th word in the $i$th sentence. $E^T$ is a pretrained embedding matrix, and $\mathrm{Embedding}(\cdot)$ is a linear transformation that maps a word to a one-hot vector.
The source language has the same definition.
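The one-hot construction described above can be illustrated with a minimal sketch. The helper names `build_lexicon` and `one_hot` are hypothetical and chosen only for illustration; the lexicon here is built from a single sentence in order of first appearance, which is an assumption rather than the procedure used in the experiments.

```python
def build_lexicon(tokens):
    """Map each distinct word to an index, in order of first appearance."""
    lexicon = {}
    for tok in tokens:
        if tok not in lexicon:
            lexicon[tok] = len(lexicon)
    return lexicon


def one_hot(word, lexicon):
    """Return a vector of zeros with a single 1 at the word's lexicon index."""
    vec = [0] * len(lexicon)
    vec[lexicon[word]] = 1
    return vec


sentence = "she is the king".split()
lexicon = build_lexicon(sentence)
the_vec = one_hot("the", lexicon)
print(the_vec)  # [0, 0, 1, 0]
```

With a realistic vocabulary the vectors become very long and sparse, so in practice an implementation would usually store only the index of the nonzero entry.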

Encoder Layers.
In the above section, we converted words into one-hot word vectors that can be processed by the neural network. Next, we use a stream-dependent word encoder to encode each word representation and learn the nearby context information in a sentence.
Figure 1: Not all parts of a sentence are equally relevant for representing parallel sentences. (Example sentence from the figure: "Each year, millions of students sit the exam on June 7-8 in China.")
The traditional recurrent neural network (RNN) is affected by short-term memory. If a sequence is too long, it is difficult to carry information across many steps, so important information may be lost when processing a long text. For example, the day after we watch a movie, we may only remember words such as "amazing" and "excellent" and not care about words such as "this," "is," and "a." The GRU can effectively achieve this behavior: it keeps relevant information and forgets useless data when we obtain parallel sentences. At the monolingual level, in order to learn information from both directions of a sentence, this paper uses a bidirectional GRU to learn the context in a sentence. The GRU uses a gating mechanism to track the state of sequences without using separate memory cells. There are two types of gates: the reset gate $r_t$ and the update gate $z_t$. Together they control how information is updated to the state. At time $t$, the GRU computes the new state $h_t$ as a linear interpolation between the previous state $h_{t-1}$ and the state $\tilde{h}_t$ computed with new sequence information. We use the two states to learn the context information in monolingual sentences. The gate $z_t$ decides how much past context information is kept and how much new context information is added. This operation can effectively learn longer context information. The update gate $z_t$ is computed from $x_t$, the input sequence vector at time $t$.
The other state $\tilde{h}_t$ is computed in a similar way. $h_t$ is a corresponding weight that maintains a constant state.
The reset gate $r_t$ controls how much the past state information contributes to the sentence representation. If $r_t$ is zero, the network forgets the previous state. The reset gate is likewise updated at each time step. In this process, we use $w^s_{i,j}$ to represent a word in a source sentence, with $t \in [0, T]$. To encode the context information of a sentence, we compute the hidden representation state at the $t$th time step in the source language by concatenating the forward and backward GRU states, $h^s_{i,t} = [\overrightarrow{h}^s_{i,t}; \overleftarrow{h}^s_{i,t}]^T$, which summarizes information of the whole sentence. Target sentences are encoded like source sentences with an additional neural network layer, which helps the encoder to recognize the most relevant features by emphasizing critical points of the target sentence given by each source sentence.
[Figure: recoverable panel labels are "Attention mechanism," "(3) Bilingual encoder layers," and "(4) Classification layers."]
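The gating behavior described in this section can be sketched with the commonly used GRU formulation. This is an illustrative assumption, not necessarily the exact parameterization used in the PHAN: the parameter names (`W_z`, `U_z`, `b_z`, and so on), the random initialization, and the tiny dimensions are all hypothetical. The sketch uses only the Python standard library.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def add(*vecs):
    """Elementwise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vecs)]


def gru_step(x_t, h_prev, p):
    """One GRU step in the common formulation (parameter names are assumptions)."""
    # Update gate: z_t = sigmoid(W_z x_t + U_z h_{t-1} + b_z)
    z = [sigmoid(a) for a in add(matvec(p["W_z"], x_t), matvec(p["U_z"], h_prev), p["b_z"])]
    # Reset gate: r_t = sigmoid(W_r x_t + U_r h_{t-1} + b_r)
    r = [sigmoid(a) for a in add(matvec(p["W_r"], x_t), matvec(p["U_r"], h_prev), p["b_r"])]
    # Candidate state: tanh(W_h x_t + r_t * (U_h h_{t-1}) + b_h); r_t = 0 drops the past state
    gated = [ri * ui for ri, ui in zip(r, matvec(p["U_h"], h_prev))]
    h_tilde = [math.tanh(a) for a in add(matvec(p["W_h"], x_t), gated, p["b_h"])]
    # New state: linear interpolation between the previous and candidate states
    return [(1 - zi) * hp + zi * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]


def init_params(input_dim, hidden_dim, rng):
    """Small random weights and zero biases for the three GRU components."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    p = {}
    for g in ("z", "r", "h"):
        p["W_" + g] = mat(hidden_dim, input_dim)
        p["U_" + g] = mat(hidden_dim, hidden_dim)
        p["b_" + g] = [0.0] * hidden_dim
    return p


def run_gru(inputs, p, hidden_dim):
    """Run the GRU over a sequence and return the state at every time step."""
    h = [0.0] * hidden_dim
    states = []
    for x_t in inputs:
        h = gru_step(x_t, h, p)
        states.append(h)
    return states


def bidirectional_gru(inputs, p_fwd, p_bwd, hidden_dim):
    """Concatenate forward and backward states for each position."""
    fwd = run_gru(inputs, p_fwd, hidden_dim)
    bwd = run_gru(inputs[::-1], p_bwd, hidden_dim)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


rng = random.Random(0)
inputs = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # one-hot words
hidden_dim = 3
states = bidirectional_gru(
    inputs, init_params(4, hidden_dim, rng), init_params(4, hidden_dim, rng), hidden_dim
)
```

Each element of `states` has twice the hidden size because the forward and backward states are concatenated per word.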
From the example of Figure 2, we can observe that not all words contribute equally to the representation of the sentence meaning, especially when distinguishing whether two sentences are parallel. Therefore, we introduce an attention mechanism to learn that different words have different weights in distinguishing parallel sentences.
In the attention process, we first use a one-layer fully connected perceptron to learn $u^s_{i,t}$ as a hidden representation of $h^s_{i,t}$. Then, to learn the importance of a word in a sentence, we calculate the similarity of this hidden representation with a word-level context vector $u_w$. Next, we use a softmax function to obtain a normalized importance weight. Note that $u_w$ is a model parameter of the attention mechanism. The context vector $u_w$ can be seen as a high-level representation that selects which words are more important for a sentence. After that, we obtain a state $u^s$ as the weighted sum of the word annotations based on these weights. We obtain a target vector $u^t$ by the same method.
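The attention steps just described (hidden projection, similarity with a context vector, softmax normalization, weighted sum) can be sketched as follows. The `tanh` activation, the dot-product similarity, and the names `W`, `b`, and `attention_pool` are assumptions made for illustration; the toy values are not taken from the experiments.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def attention_pool(hidden_states, W, b, u_w):
    """Return (sentence vector, attention weights) for a list of word states."""
    # Hidden representation of each word state: u_t = tanh(W h_t + b)
    hidden = [
        [math.tanh(a + bi) for a, bi in zip(matvec(W, h), b)]
        for h in hidden_states
    ]
    # Similarity of each hidden representation with the context vector u_w
    scores = [sum(ui * ci for ui, ci in zip(u, u_w)) for u in hidden]
    # Softmax normalization (shifted by the maximum for numerical stability)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the word annotations
    dim = len(hidden_states[0])
    pooled = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return pooled, weights


word_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity projection, for readability only
b = [0.0, 0.0]
u_w = [1.0, 1.0]
pooled, weights = attention_pool(word_states, W, b, u_w)
```

In this toy input the third word state aligns best with `u_w`, so it receives the largest weight and dominates the pooled sentence vector.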
At the bilingual level, after combining the intermediate vectors $u^s$ and $u^t$, the bilingual network encodes the resulting sequence of vectors. We concatenate the forward GRU and backward GRU states to obtain the hidden state for each input vector.

Classification Parallel Sentence.
In this section, we detect whether a sentence pair is parallel or not from the output of the top neural network. To achieve this goal, we employ a softmax layer to classify parallel sentences. The softmax layer maps the multiple outputs of the encoder layer into the interval (0, 1). In this paper, we treat classifying parallel sentences as a binary classification problem. We input the source and target sentences into the encoder layer, which outputs a state vector $u$ to the classification layer. The classification layer maps this input into the interval (0, 1), so its output is a probability:
$$p(s|t) = \mathrm{softmax}(W_c u + b_c),$$
where $W_c$ is a weight matrix and $b_c$ is the bias term for the classification layer. For the classification problem, we use the cross-entropy as the loss.
We use $\phi$ to stand for the binary cross-entropy. Then, we use the gold label $l_i$ and the predicted label $l_i'$ for sentence pair $i$ to optimize the loss. The final objective can be minimized with stochastic gradient descent (SGD) or variants such as Adam to maximize classification accuracy.
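The classification step and its loss can be sketched as below. The two-class layout (index 1 meaning "parallel"), the toy weights, and the function names are assumptions for illustration only.

```python
import math


def softmax(logits):
    """Normalize logits into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(u, W_c, b_c):
    """Softmax over W_c u + b_c; index 1 is taken to mean 'parallel' (assumption)."""
    logits = [sum(w * x for w, x in zip(row, u)) + b for row, b in zip(W_c, b_c)]
    return softmax(logits)


def binary_cross_entropy(p_parallel, label):
    """Cross-entropy between the gold label (0 or 1) and the predicted probability."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p_parallel + eps)
             + (1 - label) * math.log(1 - p_parallel + eps))


u = [0.5, -1.0, 2.0]                            # toy encoder output
W_c = [[0.1, 0.2, -0.3], [-0.1, 0.4, 0.5]]      # toy weights, one row per class
b_c = [0.0, 0.0]
probs = classify(u, W_c, b_c)
loss_if_parallel = binary_cross_entropy(probs[1], 1)
loss_if_not_parallel = binary_cross_entropy(probs[1], 0)
```

Here the model assigns most of the probability to the "parallel" class, so the loss is small when the gold label is 1 and large when it is 0, which is the signal the optimizer uses.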

Experiments and Setup
In this section, we assess the effectiveness of our model. We compare our method with multiple settings. To improve the performance of our model, we artificially construct negative samples.

Negative Examples.
Hangya and Fraser [19] showed that training a model using only parallel sentences is not enough. There are many sentence pairs whose overall meaning is similar but that are not parallel sentences, so we need to generate negative examples with similar words but different meanings. Therefore, we generate synthetic noisy data from good parallel sentences, following [20].
Gregoire and Langlais [14] showed that obtaining parallel sentences from nonparallel corpora is in practice an unbalanced classification task in which nonparallel sentences represent the majority class. An unbalanced training set is usually undesirable, since a classifier trained on such data tends to predict the majority class and has poor precision, but its overall impact on the performance of our model is not clear. So, we train a total of 10 models with $k \in \{0, 1, \ldots, 9\}$ negative pairs per positive pair, such that with $k = 0$ and $k = 9$ the model is trained on a dataset in which positive sentence pairs make up 100% and 10% of the data, respectively.
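The relationship between $k$ and the share of positive pairs can be checked with a small sketch. Note the assumption: negatives here are produced by randomly pairing a source sentence with the target of a different pair, which is a simplification and not the similar-word noise generation of [20]. The function name `add_negatives` and the toy sentence identifiers are hypothetical.

```python
import random


def add_negatives(pairs, k, rng):
    """Label each parallel pair 1 and add k randomly mismatched pairs labeled 0."""
    examples = [(src, tgt, 1) for src, tgt in pairs]
    targets = [tgt for _, tgt in pairs]
    for i, (src, _) in enumerate(pairs):
        # Candidate negatives: targets belonging to any other pair
        candidates = [tgt for j, tgt in enumerate(targets) if j != i]
        for tgt in rng.sample(candidates, k):
            examples.append((src, tgt, 0))
    return examples


rng = random.Random(0)
pairs = [("src-%d" % i, "tgt-%d" % i) for i in range(10)]  # toy parallel pairs

positive_share = {}
for k in range(10):
    examples = add_negatives(pairs, k, rng)
    positives = sum(label for _, _, label in examples)
    positive_share[k] = positives / len(examples)

print(positive_share[0], positive_share[9])  # 1.0 0.1
```

Since every positive pair receives exactly $k$ negatives, the positive share is $1/(k+1)$, which gives 100% for $k = 0$ and 10% for $k = 9$.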

Data.
For our experiments, we use the BUCC'17 English-French, English-Chinese, and English-German parallel datasets (https://comparable.limsi.fr/bucc2017/cgi-bin/download-data.cgi) to train our model. For test sets, we use the BUCC'17 English-French, English-Chinese, and English-German datasets (https://comparable.limsi.fr/bucc2017/cgi-bin/download-test-data.cgi). Each test dataset contains two monolingual corpora. The monolingual corpora contain about 100k-550k sentences, of which 2,000-14,000 are parallel. For the convenience of researchers, BUCC 2017 provides an evaluation script and gold standard data to calculate precision, recall, and F-score. For Chinese, we use OpenCC (https://github.com/BYVoid/OpenCC) to normalize characters to simplified Chinese and then perform Chinese word segmentation and POS tagging with THULAC (http://thulac.thunlp.org). The preprocessing of English, French, and German involves tokenization, POS tagging, lemmatization, and lowercasing, which we carry out with the NLTK (http://www.nltk.org) toolkit. The statistics of the preprocessed corpora are given in Table 1.
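The precision, recall, and F-score computation against a gold standard can be sketched as follows. This is not the official BUCC evaluation script; the function name and the toy sentence identifiers are hypothetical, and pairs are compared by exact match of (source id, target id).

```python
def precision_recall_f1(predicted, gold):
    """Score extracted sentence pairs against gold pairs by exact match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {("en-1", "fr-7"), ("en-2", "fr-3"), ("en-5", "fr-9"), ("en-8", "fr-2")}
predicted = {("en-1", "fr-7"), ("en-2", "fr-3"), ("en-4", "fr-4")}
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

In this toy case two of the three predicted pairs are correct and two of the four gold pairs are found, so precision is 2/3, recall is 1/2, and the F-score is 4/7.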

Training Settings.
We use 256-dimensional GRUs for all RNNs in our model. To prevent the neural network from overfitting, we set the dropout rate to 0.5 for the last layer in each module. To strengthen our model, we add negative sentence pairs to the training data by sampling {0, 1, . . ., 9} negative sentence pairs for each parallel sentence pair. We implement our models in TensorFlow. All the parameters introduced above are based on manual analysis of the data and nonexhaustive tuning on the development set.

Baselines.
We compare our model to four baselines (the parameters of the baselines follow their authors):
(1) Maximum entropy classifier (ME) [3]
(2) Multilingual sentence embeddings (MSE) [12]
(3) Dual conditional cross-entropy (DCCE) [21]
(4) An LSTM recurrent neural network (LSTM) [14]
The first baseline (ME) is a traditional statistics-based approach that relies on alignment features between two sentences. The alignment features mainly include the number of connected words, the top three largest fertilities, and the length of the longest connected substring. We use those features to construct a maximum entropy classifier according to Munteanu et al. This method mainly relies on feature engineering, which usually suffers from the language diversity issue.
The second baseline (MSE) is an important contribution to this type of approach, described in [22]. First, they used a continuous vector representation of each source-target sentence pair, learned with a bilingual distributed representation model, to reduce the size and noise of the candidate sentence pairs. Then, they filtered source-target sentence pairs by feature engineering and built a support vector machine (SVM) binary classifier to identify parallel sentences. This method also relies on feature engineering.
The third baseline (DCCE) proposed dual conditional cross-entropy to extract parallel sentences. It uses cross-entropy scores computed by two inverse translation models trained on parallel sentences. This method requires additional computational resources to train the translation models.
The final baseline (LSTM) is based on bidirectional recurrent neural networks that learn sentence representations in a shared vector space by explicitly maximizing the similarity between parallel sentences. This method does not distinguish the various weights of words in detecting parallel sentences. Such end-to-end network models do not add attention to the encoder and do not learn complex mappings and alignments to quantify parallel information.
Compared to the baselines, the PHAN is, first, independent of feature engineering, which makes it universal and easy to apply to multiple languages. Moreover, the PHAN uses a parallel hierarchical attention mechanism to capture deep representations of monolingual and parallel bilingual sentences.

Model Evaluation.
In this section, we first give the overall performance of the different models. Table 2 shows precision, recall, and $F_1$ scores for the three language pairs. From Table 2, we can observe that ME and MSE perform much worse than our model, and this holds for English-French, English-Chinese, and English-German alike. Because ME and MSE rely on feature engineering, alignment and bilingual words need a lot of manual annotation. However, manual annotation covers only limited language information, and its high cost makes it difficult to obtain large-scale annotated corpora in many languages or domains. The work of [21] for the WMT18 task performed sentence pair extraction, was not feature-based, and gave very good results, so we also verify the performance of our method against it. Junczys-Dowmunt [21] trained a multilingual translation model to enforce the agreement of cross-entropy scores. However, this approach needs a good machine translation system, and the quality of the trained machine translation system heavily affects the quality of the extracted parallel sentences. From Table 2, we can also observe that the results for English-Chinese are not as good as those for English-French and English-German. It is well known that English-Chinese machine translation is not as good as English-French and English-German machine translation given the same corpus scale and translation method. The reason is that English-French and English-German are pairs of similar languages, whereas English and Chinese are distant languages. Compared with LSTM, which does not use a parallel attention mechanism, our proposed method shows a significant improvement: the PHAN outperforms LSTM in all three language pairs. Comparing the two, the main difference is that we account for the fact that the same word may be differentially important in different sentences, so we use two parallel networks and an attention mechanism to learn different context information.
However, LSTM does not learn this context information because it lacks an effective attention mechanism. Our model uses a parallel attention mechanism to mine more context information and improve performance. In the next section, we carry out two experiments to further analyze our model.

Qualitative Analysis.
We further analyze the performance of the PHAN to understand what makes it perform better than a model without the attention mechanism. Alignment is an important factor in identifying parallel sentences. If the weights of alignments were not important, a neural network without an attention mechanism could also detect parallel sentences effectively, since all alignments would contribute equally. However, alignment depends deeply on linguistics and context [23][24][25]. For example, the English word "bearing" corresponds to multiple Chinese words such as "chengzhou," "baochi," and "zhoucheng" in different contexts. We can visualize alignments for some sample sentences and observe translation quality as an indication of the quality of an attention model. To test that our model is able to mine various informative alignments in parallel sentences, we use this method for the analysis. To test whether our model captures alignments better than LSTM without a parallel attention mechanism, we plot the distribution of the attention weights of the words in bilingual sentences of the three language pairs. The results are shown in Figures 3 and 4. The two figures show that our attention model obtains a better visualized alignment and that it obtains various alignment weights in the three language pairs. For example, our model can distinguish one-to-many alignments in English-Chinese. In contrast, LSTM forces the alignment to be one-to-one; if a word does not capture an alignment, it is not aligned to any word. Looking at the alignments of the three language pairs, we find that one-to-many alignments occur more often in English-Chinese than in English-French and English-German. This may be the main reason why our model obtains a bigger improvement in English-Chinese than in English-French and English-German. To verify this hypothesis, we count the proportion of the number of words in the sentence pairs of the three language pairs. The results are shown in Figure 5.
We can observe that English sentences are often longer than Chinese sentences, whereas the other language pairs do not show this pattern. This makes one-to-many alignments occur often in English-Chinese, which causes semantic confusion and affects the classification of parallel sentences. This is also an important reason why different language pairs have different accuracies in the classification of parallel sentences.
We further explore language differences and their impact on detecting parallel sentences. We manually extract English-Chinese and English-French parallel sentences to discuss language differences. Example 1 is extracted by the PHAN, but the other baselines miss it. From Figure 6, we can observe that the English phrase "caught my eye" and the Chinese phrase "ying ru wo de yan lian" are not a suitable translation of each other when context information is ignored. According to the bilingual lexicon, "Zhua zhu wo de yan jing" is the right translation of the English phrase. However, if we use the translation "Zhua zhu wo de yan jing" to replace the phrase "ying ru wo de yan lian" in the Chinese sentence, the new sentence is wrong: although the translation is literally right, it is a wrong collocation in Chinese. ME, MSE, and DCCE need the lexicon to learn the bilingual signal, so word pairs that are not in the bilingual lexicon hinder the detection of parallel sentences. Because LSTM has no parallel attention mechanism to effectively encode monolingual information, it cannot encode a monolingual context to distinguish alignments. In fact, language differences and their impact are very important in machine translation, and many works add attention to improve machine translation systems [26]. Example 2 is obtained by all systems. The English phrase "caught my eye" and the French phrase "attiré mon attention" are direct translations of each other in the English-French lexicon. From the above, we can conclude that our method accounts for language differences by encoding the monolingual context, which leads to better results in detecting parallel sentences.

Performance in Machine Translation.
In this paper, we aim to obtain parallel sentences that improve the performance of machine translation systems. To train the baseline machine translation systems, we use the BUCC'17 English-French, English-Chinese, and English-German parallel datasets. We use our model to extract parallel sentences from the Wikipedia comparable corpus (https://linguatools.org/tools/corpora/wikipedia-comparable-corpora/). Then, we add the extracted parallel sentences to the three original training sets to form the new training sets for machine translation. To evaluate translation performance, we use the well-known BLEU score. For SMT, we use phrase-based systems trained with Moses. To train the NMT systems, we use OpenNMT (https://github.com/OpenNMT/OpenNMT-py).
We trained 48 machine translation systems for each of the SMT (http://www.statmt.org/moses/) and NMT (https://opennmt.net/) approaches. The baseline systems are trained with the BUCC'17 English-French, English-Chinese, and English-German parallel sentences. For the remaining systems, we sort the parallel sentence pairs extracted by an extraction system in descending order according to the threshold values and append the top {20000, 50000, . . ., 500000} extracted pairs to the original training dataset. We vary the number of extracted parallel sentences used to train the machine translation systems in order to test the stability of our model. Table 3 shows the BLEU scores of the SMT and NMT systems. We can observe that adding the parallel sentences extracted by our model leads to significant improvements over the baseline systems, which confirms that parallel training sentences heavily affect the performance of machine translation systems. This improvement can be observed in the machine translation systems of all three language pairs. The table shows the different gains in BLEU score compared to the baseline systems. With Top20K, adding the extracted parallel sentence pairs improves the BLEU scores of the SMT and NMT systems by 1.13 and 3.1 in English-French, and we find similar improvements in the other language pairs. With Top500K, the translation systems trained with the extracted parallel sentences obtain better BLEU scores than with Top20K.
This means that our model can effectively extract parallel sentences that improve BLEU. These results confirm the quality of the extracted sentence pairs and the effectiveness of our model. Hence, we can conclude that our approach can be applied to extract parallel sentences from comparable corpora and improve the performance of machine translation.
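The data augmentation procedure used in this experiment (rank extracted pairs, keep the top N, append them to the original training data) can be sketched as follows. The function name, the tuple layout `(source, target, score)`, and the toy sentences and scores are hypothetical; training of the SMT and NMT systems themselves is not shown.

```python
def augment_training_data(base_pairs, scored_pairs, top_n):
    """Append the top_n highest-scoring extracted pairs to the original data."""
    # Sort extracted pairs by classifier score, highest first
    ranked = sorted(scored_pairs, key=lambda item: item[2], reverse=True)
    extra = [(src, tgt) for src, tgt, _ in ranked[:top_n]]
    return list(base_pairs) + extra


base_pairs = [("hello", "bonjour"), ("thank you", "merci")]
scored_pairs = [
    ("good night", "bonne nuit", 0.9),
    ("the cat", "le chien", 0.2),
    ("see you soon", "à bientôt", 0.7),
]
augmented = augment_training_data(base_pairs, scored_pairs, top_n=2)
```

Repeating this for each value of `top_n` yields one training set per setting, so the effect of adding progressively lower-ranked pairs can be compared through BLEU.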

Conclusions
In this paper, we explore a new parallel hierarchical attention network to extract parallel sentences. Our system obtains state-of-the-art performance in filtering parallel sentences while using less feature engineering and preprocessing. Additionally, our model makes full use of monolingual and bilingual sentences. Moreover, we propose a parallel attention mechanism to learn various alignment weights in parallel sentences. In the experiments, we show that our model obtains state-of-the-art results on the BUCC 2017 shared task. We also demonstrate that the extracted parallel sentences are effective for machine translation tasks.
In the future, we will explore the following directions: (1) BPE and similar methods can effectively help solve the out-of-vocabulary issue, so we will use BPE to improve the performance of our model. (2) Our model needs parallel sentences for training, which can be problematic in low-resource language pairs. To lessen the need for parallel sentences, identifying parallel sentences with minimal supervision is a promising avenue, especially for low-resource language pairs.

Conflicts of Interest
The authors declare that they have no conflicts of interest.