Filtering Reordering Table Using a Novel Recursive Autoencoder Model for Statistical Machine Translation

In phrase-based machine translation (PBMT) systems, the reordering table and the phrase table are very large and redundant. Unlike most previous work, which aims to filter the phrase table, this paper proposes a novel deep neural network model to prune the reordering table. We cast the task as a deep learning problem in which we jointly train two models: a generative model that implements rule embedding and a discriminative model that classifies rules. The main contribution of this paper is that we optimize the reordering model in PBMT by filtering the reordering table with a recursive autoencoder model. To evaluate the proposed model, we applied it to public corpora and measured its reordering ability. The experimental results show that our approach improves the BLEU score with a smaller reordering table on two language pairs: English-Chinese (+0.28) and Uyghur-Chinese (+0.33).


Introduction
Recently, machine learning models based on deep neural networks (DNNs) have achieved great breakthroughs in many application fields. They are becoming the dominant method in both image recognition and automatic speech recognition [1]. Some DNN techniques, such as the autoencoder, long short-term memory (LSTM), and the convolutional neural network (CNN), have obtained satisfying results in the field of natural language processing (NLP) [2][3][4]. However, to the best of our knowledge, DNNs have not yet achieved comparable success in NLP. This is because, unlike images or speech, language has a more complex structure and its features are more difficult to extract.
As a part of NLP, the application of deep learning to machine translation (MT) can be divided into two types: Neural Machine Translation (NMT) and deep learning applied to PBMT [5,6]. PBMT, which is also called traditional machine translation, now faces competition from NMT, a new neural-network-based model of MT. With its good translation performance and simple structure, NMT draws most of the attention in applying neural networks to MT. Nevertheless, NMT relies on a very large corpus, and the debate about whether NMT or PBMT performs better continues. Furthermore, a large corpus is not available for some language pairs such as Uyghur-Chinese. In this study, we present a novel DNN model to optimize the reordering model in PBMT.
PBMT generally extracts phrases and reordering examples from the result of word alignment and then generates the phrase table and the reordering table, which are used in the decoding process. The former can be called the translation model and the latter the reordering model. With the addition of a language model [7], an integrated PBMT system can translate the input sentence by decoding. The language model is valuable to study because it is used not only in machine translation but also in other fields of NLP. Similarly, as a component of machine translation, the translation model has always been a research focus. From the original word-based translation model to the latest NMT model, the performance of MT is getting better and better, and MT is getting closer to automatic control [8,9]. Unlike most work, which aims to filter the phrase table, this paper focuses on optimizing the reordering model in PBMT by filtering the reordering table. Previous reordering models are useful, except in environments where memory and time constraints are imposed. The proposed model achieves better performance with less space.
In this paper, we propose a DNN model consisting of three parts: a generative model, a discriminative model, and a filtering strategy. The generative model is a recursive autoencoder that implements reordering rule embedding (compact vector representations of reordering rules). The discriminative model is a multilayer perceptron (MLP) that classifies these rule vectors. The filtering strategy, based on minimum difference, is designed to filter reordering rules. Our model is used to reconsider the reordering table and filter out its wrong and noisy rules. After this filtering process, the modified reordering table can accelerate decoding and eventually improve the quality of translation. Figure 1 shows a reordering table in the Moses system from English-Chinese MT.
This paper is organized as follows. Section 1 is the introduction, which mainly presents the background of our research. Section 2 reviews some representative work on reordering models. The details of our DNN-based filtering model are elaborated in Section 3. The settings and results of the experiments on this DNN model are given in Section 4. Section 5 presents conclusions and future work.

Related Work
Many methods have been proposed to filter the phrase translation table in Statistical Machine Translation systems. Yin et al. proposed a method to filter the phrase table based on virtual context [10]. Di et al. proposed -value and phrase sticky degree in this field [11]. Zens et al. use the basic principles of acoustics to filter the phrase table [12]. Zhang et al. take advantage of a bilingually constrained recursive autoencoder to learn semantic phrase embeddings and prune the phrase table through phrase semantic similarities [13]. Compared with the translation model, the reordering model is more independent, yet few researchers have presented methods to filter the reordering table.
In Statistical Machine Translation systems, reordering models range from a simple distance penalty model to complex machine learning models. The first type follows simple principles; researchers believe that the language model and the translation model are enough to accomplish the task of reordering. The representative work of this type is the simple distance penalty model proposed by Koehn et al. [14]. This model is simple to implement but effective in English-French MT. The second type is the current mainstream method; it has complex definitions of reordering orientations and methods for discriminating among them. The methods of predicting reordering orientations range from simple maximum likelihood to complex maximum entropy models [15,16]. Li et al. proposed a DNN reordering model to discriminate reordering orientations [17]. The third type of reordering model uses syntactic or grammatical information about different language pairs. This type of method is generally used in the decoding process and takes advantage of grammar rules to constrain the word order of translation results. It resembles the rule-based machine translation model. For example, both Xiao et al. and Wang et al. used the syntactic information of the Chinese language to direct reordering operations [18,19].
This paper presents a reordering table filtering model to improve the reordering ability of MT. Our method optimizes reordering models of the second type, which is the most popular among researchers [15][16][17]. This type of reordering model involves two factors: the reordering orientations and the scores of the reordering orientations. The reordering orientation describes the order in which two consecutive bilingual phrase pairs are joined. The common reordering orientations are monotone, swap, and discontinuous. The discontinuous orientation can be further divided into discontinuous monotone and discontinuous swap. Figure 2 is an example of reordering orientations with respect to the adjacent phrases. For example, the word "minister" is monotone with respect to its prior word. Formula (1) is the definition of the four orientations: monotone, swap, discontinuous monotone, and discontinuous swap: where  time consumption of decoding and the quality of translation in MT is improved.
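The four orientations can be illustrated with a small sketch. It assumes the convention commonly used for lexicalized reordering in phrase-based systems, in which the source-side spans of two phrase pairs that are adjacent on the target side are compared. The function name `orientation`, the inclusive span representation, and the mapping of a gap to the right or to the left onto "discontinuous monotone" and "discontinuous swap" are assumptions made for illustration and are not taken from Formula (1).

```python
from typing import Tuple

# Inclusive (start, end) positions of a phrase on the source side (assumed representation).
Span = Tuple[int, int]


def orientation(prev: Span, cur: Span) -> str:
    """Orientation of the current phrase pair relative to the previous one.

    `prev` and `cur` are the source spans of two phrase pairs that are adjacent
    on the target side, with `prev` translated first. Spans are assumed not to overlap.
    """
    prev_start, prev_end = prev
    cur_start, cur_end = cur
    if cur_start == prev_end + 1:
        return "monotone"                # current phrase directly follows the previous one
    if cur_end == prev_start - 1:
        return "swap"                    # current phrase directly precedes the previous one
    if cur_start > prev_end:
        return "discontinuous monotone"  # same direction as monotone, but with a gap
    return "discontinuous swap"          # same direction as swap, but with a gap


print(orientation((0, 1), (2, 3)))
print(orientation((4, 5), (0, 1)))
```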

Reordering Table Filtering Model Based on DNN
The workflow of our model is shown in Figure 3. First, we preprocess the reordering table to obtain a suitable dataset. Second, we use a generative model, a recursive autoencoder, to generate a continuous-space representation that treats each rule as a dense real-valued vector. Third, a discriminative model, a multilayer perceptron (MLP), is used to score the orientation of each rule. Finally, according to the orientation score of each rule, we select the final reordering rules through the minimum-difference strategy. We use stochastic gradient descent to adjust the parameters of the whole model [20][21][22]. Here, we first introduce the text preprocessing of the original reordering table and then describe the construction of the autoencoder-based generative model and the MLP-based discriminative model, as well as the filtering strategy based on minimum difference.

Text Preprocessing
The original reordering table has the following characteristics:
(1) Identical rules account for about 30 percent of the total number.
(2) There are many short rules, and most of them can be merged into their corresponding long rules; rules whose length is below five make up about eighty percent of the total number (the maximum rule length is 7).
(3) There is a lot of noisy and useless data.
(4) For more than 90 percent of the rules with only one word, the reordering orientation is ambiguous.
According to observations (1) to (4), this paper processes the reordering table as follows:
(1) Adding an attribute to every reordering rule to record the count of this rule, especially for rules with one word.
(2) Deleting redundant rules, keeping only one copy and recording the total count of this reordering rule.
(3) To accelerate training, combining some short rules into a long rule in the situations shown in Table 1 and then accumulating their counts.
For example, if rule 1 is ", , , " and rule 2 is ", , , ," which means the phrase pair ", " is monotone with respect to its prior and following phrases, and rule 2 reveals this situation, we combine them. Table 1 shows the rules in the reordering table that can be merged. In Table 1, " 1 " determines the orientation with respect to the prior phrase and " 2 " determines the orientation with respect to the following phrase. "TIPs" denotes which special orientations should appear. This paper focuses on how to filter reordering rules with wrong reordering orientations. The details of the DNN-based classifier, which is trained to reconsider the reordering score table, are described in the following sections. The aim of our model is to select high-quality rules to retrain the reordering model and improve the quality of translation.
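The first two preprocessing operations (recording a count for every rule and keeping a single copy of duplicated rules) can be sketched as follows. The tuple layout of a rule, the function name `deduplicate`, and the toy rules are hypothetical; merging short rules into long rules is not shown because it depends on the conditions listed in Table 1.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical rule layout: (source phrase, target phrase, orientation label).
Rule = Tuple[str, str, str]


def deduplicate(rules: Iterable[Rule]) -> List[Tuple[Rule, int]]:
    """Keep one copy of each rule and attach the number of times it occurred."""
    counts = Counter(rules)       # Counter keeps first-seen order of the keys
    return list(counts.items())


# Toy rules, invented for illustration only.
toy_rules = [
    ("a b", "x y", "monotone, monotone"),
    ("c", "z", "swap, monotone"),
    ("a b", "x y", "monotone, monotone"),
]

for rule, count in deduplicate(toy_rules):
    print(count, rule)
```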

The Classification Model of Reordering Rules Based on DNN.
The reordering table filtering algorithm consists of two components: a generative model and a discriminative model. In the generative model, we use an RAE (recursive autoencoder) to embed reordering rules. In the discriminative model, an MLP is used to score the orientation of each rule. The RAE provides a reasonable composition mechanism to embed each rule, and the MLP is a simple but effective classifier based on deep learning [23,24].
Scoring the orientations of rules in the reordering table is a classification problem. For example, if a reordering model has two reordering orientations, monotone and swap, there are four types in the reordering table: "swap, monotone," "monotone, swap," "swap, swap," and "monotone, monotone." In addition, the length of rules in the reordering table is generally less than ten, so filtering the reordering table can be seen as a short-text classification problem. Short-text classification is not a trivial problem. Since the feature vectors of text are always high-dimensional and sparse, the results of short-text classification are far from satisfactory.
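The count of four types in the two-orientation example comes from pairing the orientation with respect to the prior phrase with the orientation with respect to the following phrase. A minimal sketch of that enumeration is given below; the function name `rule_classes` is hypothetical, and applying the same pairing to three orientations is an extrapolation from the example rather than something the text states.

```python
from itertools import product
from typing import List, Sequence


def rule_classes(orientations: Sequence[str]) -> List[str]:
    """All (prior, following) orientation pairs, written as in the example above."""
    return [f"{prior}, {following}" for prior, following in product(orientations, repeat=2)]


two_way = rule_classes(["monotone", "swap"])
print(two_way)

# Extrapolation: three orientations would give 3 * 3 pairs under the same scheme.
three_way = rule_classes(["monotone", "swap", "discontinuous"])
print(len(three_way))
```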
The autoencoder can simulate the human brain and combine high-dimensional features in a nonlinear way to obtain low-dimensional abstract features [25][26][27]; it is an advanced model of machine learning. The MLP accepts input vectors and can easily enhance classification performance by adding hidden layers [28].

Word Embedding.
The reordering rules consist of source phrases, target phrases, and the reordering orientations. In the rule embedding process, the word vector is the basis and serves as the input to the generative model. After learning word embeddings, all vectors are stacked into an embedding matrix $L \in \mathbb{R}^{n \times |V|}$, and each word in our vocabulary $V$ corresponds to a vector $x \in \mathbb{R}^{n}$.
Given a reordering rule, which is an ordered list of $m$ words, each word has a column index $k$ in the embedding matrix $L$. The index $k$ is used to retrieve the word's vector representation by a simple multiplication with a one-hot vector $e_k$, which is zero in all positions except the $k$th:

$$x = L e_k \in \mathbb{R}^{n}. \quad (2)$$

According to previous research [13], $n$ is usually set empirically, for example $n$ = 50, 100, or 200.
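The lookup by one-hot multiplication can be checked numerically. The sketch below uses NumPy with a randomly initialized matrix in place of trained embeddings; the toy vocabulary and the names `L`, `one_hot`, and `embed` are assumptions for illustration.

```python
import numpy as np

n = 50                                   # embedding dimension, one of the values mentioned above
vocab = ["the", "minister", "said"]      # hypothetical toy vocabulary
rng = np.random.default_rng(0)
L = rng.normal(scale=0.1, size=(n, len(vocab)))  # one column per word (random stand-in)


def one_hot(k: int, size: int) -> np.ndarray:
    """Vector that is zero everywhere except at position k."""
    e = np.zeros(size)
    e[k] = 1.0
    return e


def embed(word: str) -> np.ndarray:
    """Retrieve a word vector as the product of L with a one-hot vector."""
    k = vocab.index(word)
    return L @ one_hot(k, len(vocab))


x = embed("minister")
print(x.shape)
print(np.allclose(x, L[:, 1]))  # the product simply selects column k of L
```

In practice the multiplication is replaced by direct column indexing, which gives the same vector without building the one-hot vector.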

Generative Model.
The generator is a semisupervised rule embedding model that learns vector representations and can be adapted to the given labels. Given a reordering rule, it is first projected into a list of vectors $(x_1, x_2, x_3, \ldots, x_m)$ by using formula (2). The RAE learns the rule vector representation by recursively combining two children word vectors in a bottom-up manner, as shown in Figure 4. The details are as follows:
(1) We apply a nonlinear transformation to the input vector: we choose an element-wise activation function such as $f = \tanh(\cdot)$ and obtain the encoding result $y$ through it. This step is called encoding, as shown in formula (3). Note that $x$ here means $[x_1; x_2] \in \mathbb{R}^{2n \times 1}$ and the matrix $W$ here means $W \in \mathbb{R}^{n \times 2n}$, so that $y$ is still a vector with the same dimension as an input word vector, and so is $b$:

$$y = f(Wx + b). \quad (3)$$

(2) The encoding result $y$ is reconstructed by the decoder, which outputs the corresponding vector $\hat{x}$. $W^{T}$ is the transpose of $W$; both $b$ and $b'$ are offsets. This step is called decoding, as shown in the following formula:

$$\hat{x} = f(W^{T} y + b'). \quad (4)$$

(5) $y$, which is extracted from the above four steps, is the rule feature vector; we then add some noise to $y$ and use it as the input vector for encoding. The deep recursive autoencoder is obtained by repeating the above four steps many times. To avoid too much computation, the number of iterations is set to 50-80.
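The encoding and decoding steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the decoder reuses the transposed encoder matrix, the decoder applies the same tanh activation, children are combined in a simple left-to-right chain, the reconstruction error is half the squared Euclidean distance, and the noise step is omitted. All names (`encode`, `decode`, `rule_vector`, `b_dec`) and the random toy parameters are hypothetical.

```python
import numpy as np


def encode(c1: np.ndarray, c2: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Combine two n-dimensional children into one n-dimensional parent."""
    return np.tanh(W @ np.concatenate([c1, c2]) + b)


def decode(y: np.ndarray, W: np.ndarray, b_dec: np.ndarray) -> np.ndarray:
    """Reconstruct the 2n-dimensional input from the parent with tied weights (assumed)."""
    return np.tanh(W.T @ y + b_dec)


def rule_vector(word_vectors, W, b, b_dec):
    """Left-to-right recursive composition (assumed order).

    Returns the final rule vector and the summed reconstruction error.
    """
    parent = word_vectors[0]
    total_error = 0.0
    for child in word_vectors[1:]:
        x = np.concatenate([parent, child])
        parent = encode(parent, child, W, b)
        x_hat = decode(parent, W, b_dec)
        total_error += 0.5 * float(np.sum((x - x_hat) ** 2))
    return parent, total_error


n = 4                                           # tiny dimension for the toy example
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n, 2 * n))      # encoder matrix, n x 2n
b = np.zeros(n)                                 # encoder offset
b_dec = np.zeros(2 * n)                         # decoder offset
words = [rng.normal(size=n) for _ in range(3)]  # stand-ins for three word vectors

vec, err = rule_vector(words, W, b, b_dec)
print(vec.shape, err >= 0.0)
```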
The deep network is good at abstraction and feature extraction. The above RAE is completely unsupervised and can only induce a general representation of the multiword rule. Here we add a softmax layer to extend the original RAE to a semisupervised model. At the last layer, the objective function includes the reconstruction error and the prediction error, as shown in the following formula:

$$E(x, t; \theta) = \alpha E_{\mathrm{rec}}(x, t; \theta) + (1 - \alpha) E_{\mathrm{pred}}(x, t; \theta).$$

Softmax is a multiclass generalization of logistic regression; formula (8) shows the definition of the softmax function:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}. \quad (8)$$

Every component of the output is a score corresponding to the probability of a reordering orientation for the input rule. After adding the softmax layer, we use the pretrained weights as initial weights and minimize the supervised cost between the output probability and the real reordering orientation probability to adjust the overall parameters of the network. Figure 5 is a flow diagram of the reordering classification model based on MLP.
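The weighted combination of the two error terms and the softmax output can be sketched as below. The use of cross-entropy as the prediction error, the weight name `alpha`, and the function names are assumptions for illustration; the text only states that the objective mixes a reconstruction error and a prediction error.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax: subtracting the maximum does not change the result."""
    shifted = z - np.max(z)
    exp = np.exp(shifted)
    return exp / exp.sum()


def joint_error(e_rec: float, probs: np.ndarray, target: np.ndarray, alpha: float) -> float:
    """alpha * reconstruction error + (1 - alpha) * prediction error.

    The prediction error is taken to be the cross-entropy between the target
    orientation distribution and the softmax output (an assumption).
    """
    e_pred = -float(np.sum(target * np.log(probs + 1e-12)))
    return alpha * e_rec + (1.0 - alpha) * e_pred


probs = softmax(np.array([1.0, 2.0, 3.0]))
target = np.array([0.0, 0.0, 1.0])          # toy one-hot orientation label
print(probs.sum())
print(joint_error(0.5, probs, target, alpha=0.2))
```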

Filtering Strategy.
After the above steps, we obtain a DNN-based classifier that outputs a score for each reordering orientation of a reordering rule. This paper defines a standard estimate to evaluate the quality of reordering rules; it is represented by the following formula:

$$\mathrm{Acc}_r = \max(\mathrm{score}_r) - \mathrm{score}_r(o_r), \quad (9)$$

where $\max(\mathrm{score}_r)$ refers to the maximal score in the distribution output by the DNN-based classifier, that is, the score of the most reasonable reordering orientation of the reordering rule, and $\mathrm{score}_r(o_r)$ refers to the probability that the DNN-based classifier assigns to the original reordering orientation. In other words, the accuracy of a reordering orientation is the difference in probability between the original reordering orientation and the most reasonable reordering orientation according to the classifier. When this value is equal to zero, the reordering rule is positive, because the reordering orientation in the original reordering table is the same as the most reasonable reordering orientation according to the classifier.
For example, if a reordering orientation in the original reordering table is "monotone, monotone," and the most reasonable reordering orientation according to the classifier is also "monotone, monotone," the accuracy of the reordering orientation is zero and this rule is positive.
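The minimum-difference strategy (compute the difference for every rule, rank the rules in ascending order, and keep a chosen fraction of the table) can be sketched as follows. The record layout, the function names, and the toy scores are hypothetical; real scores would come from the trained classifier.

```python
from typing import Dict, List, Tuple

# Hypothetical record: (rule id, classifier scores per orientation, original orientation).
ScoredRule = Tuple[str, Dict[str, float], str]


def accuracy(scores: Dict[str, float], original: str) -> float:
    """Difference between the best score and the score of the original orientation."""
    return max(scores.values()) - scores[original]


def filter_rules(rules: List[ScoredRule], keep_ratio: float) -> List[ScoredRule]:
    """Rank rules by ascending difference and keep the first `keep_ratio` of them."""
    ranked = sorted(rules, key=lambda r: accuracy(r[1], r[2]))  # stable sort keeps ties in order
    keep = int(round(keep_ratio * len(ranked)))
    return ranked[:keep]


MM, SS = "monotone, monotone", "swap, swap"
toy = [
    ("r1", {MM: 0.7, SS: 0.3}, MM),   # difference 0: original orientation agrees with classifier
    ("r2", {MM: 0.6, SS: 0.4}, SS),   # small disagreement
    ("r3", {MM: 0.9, SS: 0.1}, SS),   # strong disagreement
    ("r4", {MM: 0.5, SS: 0.5}, MM),   # difference 0
    ("r5", {MM: 0.2, SS: 0.8}, MM),   # disagreement
]

kept = filter_rules(toy, keep_ratio=0.6)
print([rule_id for rule_id, _, _ in kept])
```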

Experiments
We applied the proposed model to phrase-based machine translation systems to evaluate its performance. Our experiments include English-Chinese and Uyghur-Chinese translation. We first pretrain the word embeddings on the training set with the Gensim toolkit (https://is.muni.cz/publication/884893/en). We set the dimensionality to 50. Then the reordering rule representation is learned by an RAE model shared by Socher (https://github.com/jeremysalwen/ParaphraseAutoencoder-octave) on GitHub. We empirically set the learning rate to 0.01. The discriminative model that we chose is implemented in Theano (http://deeplearning.net/tutorial/mlp.html#mlp). The MLP is a deep learning network with four layers; the numbers of neurons in the layers are 50, 80, 80, and n, respectively. The number of neurons in the input layer is determined by the dimension of the word embedding; the number of neurons in the output layer is determined by the number of reordering orientation types. In addition, we set the learning rate to 0.1 and the weight penalty factor to 0.0002 in the stochastic gradient descent algorithm.
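The layer sizes 50, 80, 80, and n can be made concrete with a forward pass written in plain NumPy. This is a sketch of the shapes only, not the Theano implementation used in the experiments: the tanh hidden activation, the uniform initialization, the output size of 4, and the function names are assumptions, and training with stochastic gradient descent and the weight penalty is omitted.

```python
import numpy as np


def init_mlp(sizes, rng):
    """One (weight matrix, bias vector) pair per layer transition."""
    return [
        (rng.uniform(-0.1, 0.1, size=(n_out, n_in)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(x, params):
    """tanh hidden layers (assumed) followed by a softmax output layer."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(W @ h + b)
    W, b = params[-1]
    z = W @ h + b
    z = z - z.max()              # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()


n_orientation_types = 4          # hypothetical value of n
rng = np.random.default_rng(0)
params = init_mlp([50, 80, 80, n_orientation_types], rng)

rule_vec = rng.normal(size=50)   # stand-in for a 50-dimensional rule embedding
scores = forward(rule_vec, params)
print(scores.shape, scores.sum())
```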
The proposed method was run on a computer with Moses 2.1 (http://www.statmt.org/moses/), 4 GB of memory, and Ubuntu 12.04. The word alignment tool we selected is the open-source GIZA++, and we use the "grow-diag-final-and" strategy to obtain many-to-many word alignments. The maximal extracted phrase length is 7, and the reordering model is the variable across the experiments. For parameter tuning, we use the MERT method. In addition, we use SRILM to train two 5-gram language models on each Chinese corpus and estimate parameters with the Kneser-Ney smoothing algorithm. The evaluation metric of machine translation is the case-insensitive BLEU-4 score [29].
To compare various reordering methods, we consider two experimental tasks, English-Chinese and Uyghur-Chinese. The settings of both experiments are the same, and we set up five groups in each of these two experiments; the details are as follows.
Baseline. We use the default distance penalty model as our reordering model to train the translation model; this test has no reordering table.
MSD. We use the option "phrase-msd-bidirectional-fe" as the reordering model to train the translation model; "phrase" denotes that this is a phrase-based MT model; "msd" denotes that this model has three orientations: monotone, swap, and discontinuous; "bidirectional" denotes that this model determines the orientation with respect to both the following and the prior phrase; "fe" denotes that this model conditions on both the source and target languages.
MSD F. We first use the option "phrase-msd-bidirectional-f" as the reordering model to train the translation model and then use our proposed filtering model to select 80, 60, and 40 percent of the original reordering table, respectively. Finally, the filtered reordering table is used to retrain the reordering model, which is then used in decoding.
MSLR. We use the option "phrase-mslr-bidirectional-fe" as the reordering model to train the translation model. All parameters have the same meaning as above. In addition, "mslr" denotes that this reordering model has four orientations: monotone, swap, discontinuous-left, and discontinuous-right.
MSLR F. We use the option "phrase-mslr-bidirectional-f" as the reordering model to train the translation model and then use our proposed filtering model to select 80, 60, and 40 percent of the original reordering table, respectively. Finally, the filtered reordering table is used to retrain the reordering model, which is then used in decoding.

Results.
Tables 3 and 4 show the experimental results of the Uyghur-Chinese and English-Chinese machine translation systems, respectively. Figures 6 and 7 show the trend of the average BLEU score on Uyghur-Chinese and English-Chinese MT, respectively. In Figures 6 and 7, the numbers 1-9 denote the 9 groups in our experiments. According to the experimental results, we can draw the following conclusions.
The performance of machine translation systems improves when the DNN-based reordering table filtering model is applied. Compared with the original reordering model, the pruned one needs less space but is more useful. The BLEU score improves by 0.15 on average when the size of the reordering table is 80 percent of the original reordering table, and by 0.23 on average when the size of the reordering table is just 60 percent of the original. However, the BLEU score drops by 0.22 on average when the size of the reordering table is 40 percent of the original. In addition, the best Uyghur-Chinese machine translation system improves by 0.33 BLEU, and the best English-Chinese system improves by 0.28.
The influence of our model on machine translation varies across language pairs. In our experiments, the improvement on English-Chinese machine translation is smaller than that on Uyghur-Chinese. The reason is probably that the grammatical differences between Uyghur and Chinese are greater than those between English and Chinese, so the reordering problems in Uyghur-Chinese MT are more prominent than in English-Chinese MT. Therefore, our model performs better for MT between languages with greater grammatical differences.
Our model gets more improvement with more types of reordering orientations.
All in all, this model is suitable for machine translation systems for arbitrary language pairs, as long as the system generates a reordering table during training. Our model can improve the quality of machine translation while reducing the size of the reordering table and speeding up the decoding process.

Conclusion
This paper proposed a reordering table filtering model based on a deep neural network to address the reordering problems in Statistical Machine Translation. The proposed model is evaluated on Uyghur-Chinese and English-Chinese machine translation. The experimental results show that the translation quality for Uyghur-Chinese and English-Chinese improves when the filtered reordering table is used in the decoding process, and that the reordering ability is improved.
To enhance the speed and accuracy of decoding in SMT, we optimize the reordering model by pruning the reordering table. The reordering table consists of reordering rules and their corresponding orientations. Our method first filters the original reordering table with the DNN-based model and then uses the filtered reordering rules to retrain the reordering model.
The paper focuses on the reordering table, so the proposed method can be used in any machine translation system that generates a reordering table. However, not all machine translation systems generate a reordering table; syntax-based translation models, for example, do not. Meanwhile, our model is independent of the reordering model, and the reordering ability relies on the performance of the reordering model. In future work, we plan to integrate the DNN-based reordering model into PBMT as a feature function.

Figure 1: A part of reordering table in Moses system.

Figure 2: An example of reordering orientations with respect to the adjacent phrases.

Figure 3: The work flow of DNN-based reordering table filtering model.

Figure 4: An illustration of the generative model based on RAE.

Figure 5: A flow diagram of the reordering discriminative model based on MLP.

4.1. Settings.
The corpora come from the CWMT 2015 public evaluation datasets, and we use the English-Chinese and Uyghur-Chinese corpora in the news domain as our research objects. Since our model is used in machine translation decoding, we divided each corpus into three parts: training set, test set, and development set. The details of the corpora are shown in Table 2. The parallel English-Chinese training data from CWMT contains 77.8 M sentence pairs. The parallel Uyghur-Chinese training data from CWMT contains about 0.14 M sentence pairs. The development set of English-Chinese contains 1 K sentences. The development set of Uyghur-Chinese contains 1.1 K sentences. Both the Uyghur-Chinese and the English-Chinese test sets contain 1 K sentences.

Figure 6: The trend of the average BLEU score on Uyghur-Chinese MT.

Figure 7: The trend of the average BLEU score on English-Chinese MT.

Table 2: The corpus of our experiments.
The filtering strategy based on minimum difference uses the DNN-based classifier to calculate the accuracy of every rule in the reordering table and then reranks the reordering rules by accuracy score in ascending order. Finally, we choose the size of the reordering table according to the quality of the original training corpus. In the general case, we select sixty percent of the original reordering table, whose performance is comparable with that of the original reordering table.

Table 3: Experiment results of Uyghur-Chinese MT.

Table 4: Experiment results of English-Chinese MT.
We can see from Table 3 that the reordering model with four orientations gains 0.26 on average, and the reordering model with three orientations gains 0.225; the same holds in Table 4. The reason may be that orientation discrimination by the reordering model with four orientations depends more on the quality of the training set than that of the reordering model with three orientations. Filtering the reordering table with our DNN model helps enhance the accuracy of the classifier that discriminates the orientation during decoding.
In addition, our model is also influenced by the correlation between the training set and the test set. We found that the average BLEU score on the development set is higher than on the test set in English-Chinese MT; this situation is opposite to Uyghur-Chinese MT, which means the training data has more correlation with the test set than with the development set in English-Chinese MT. On the contrary, the BLEU score gains 0.19 on the development set and 0.16 on the test set on average in English-Chinese MT. The same situation happens in Uyghur-Chinese MT, whose development set has more correlation with the training data and gains less improvement than the test set. As a deep neural network is more powerful at filtering noisy data than traditional machine learning methods, our model benefits more from dirty data than from clean data. In other words, our model is more suitable for MT whose test data has less correlation with the training data.
Finally, we found that our model achieves the best performance when the size of the filtered reordering table is 60 percent of the original reordering table. The reason is that, with this proportion, the selected reordering rules can cover the original reordering table and yield more accurate probabilities of reordering orientations. When the reordering table is reduced to 40 percent of its original size, some reordering knowledge is lost. When 80 percent of the original reordering table is retained, the improvements are not so obvious because the difference between the original reordering table and the filtered reordering table is too small.