A Chinese-Naxi Tree-to-Tree Machine Translation Method Based on Subtree Alignment

To address the syntactic differences between Chinese and the Naxi language, this paper presents a tree-to-tree method for Chinese-Naxi machine translation based on subtree alignment. We define a subtree alignment model, give its inference probability, and solve the missing-alignment problem in Chinese-Naxi alignment by updating nodes (insertion or deletion). We then train the subtree alignment model with the EM algorithm and merge subtree alignment into the translation model. Finally, we extract Chinese-Naxi translation templates with a matrix-based extraction algorithm and implement Chinese-Naxi machine translation. Contrast experiments show that, compared with the tree-to-tree Chinese-Naxi machine translation method without subtree alignment, translation accuracy increases once subtree alignment is introduced.


Introduction
Naxi characters are the only pictographs in the world that are still in active use. Work on Chinese-Naxi machine translation can contribute effectively to the preservation of Naxi characters, as well as to bilingual learning of Chinese and Naxi. However, there are great differences between the Naxi language and Chinese [1][2][3][4]. Naxi has distinctive features such as "verbs at the end of sentences," "postpositional attributes," and "no auxiliary words," which cause missing alignments in Chinese-Naxi bilingual alignment. For example, the Chinese sentence "我 喜 欢 吃 黑 雪 山 的 松 果" (I like to eat pine cones from black snow mountain) is translated to the Naxi sentence " (I) (pine cones) (black snow mountain) (like) (to eat)." Because it is pictographic, Naxi has no auxiliary words like "过" and "的" in Chinese, which leads to missing alignments during the alignment procedure and to confusion about the original sentence.
Several solutions have been proposed for this problem. Xiao et al. presented a word realignment method for statistical machine translation [5], which improves translation accuracy by exploiting the inconsistency of bidirectional word alignment. Xiao also presented a subtree alignment method based on unsupervised learning [6,7], which extracts more rules than word alignment and works well on the missing-alignment problem. For Chinese-Naxi word alignment, Li and Yu et al. studied the dependency-tree-to-string translation model [8,9] and improved its translation accuracy. Li's Chinese-Naxi machine translation method based on the dependency-tree-to-string translation model made full use of Chinese syntactic information but gave no consideration to Naxi syntactic features. In fact, the great syntactic differences between Chinese and Naxi give syntax analysis an important role in bilingual machine translation. Therefore, building on Li's dependency-tree-to-string translation method, Gao et al. proposed a Chinese-Naxi machine translation method based on a Naxi dependency language model: they used dependency parsing information from the Naxi side to construct a Naxi dependency language model and fused it into the decoding process, which improved the accuracy of Chinese-Naxi machine translation to a certain degree [10]. However, this method did not address the missing alignments caused by another characteristic of Naxi, the absence of auxiliary words.
If we adopted statistical machine translation based on the dependency-tree-to-string model for this problem, a large number of alignments would be missing because Naxi syntactic features are not considered, and the information carried by the unaligned words would be lost in the translation result. In this paper, we propose a tree-to-tree Chinese-Naxi translation template based on subtree alignment. During extraction, we combine the syntactic features of both languages, extend the translation template with the unaligned words, and thereby translate the words left unaligned in bilingual alignment.

A Chinese-Naxi Tree-to-Tree Translation Template Extraction Method Based on Subtree Alignment
2.1. Definition of the Subtree Alignment Model.
Given a pair of syntax trees, one for the Chinese sentence and one for the Naxi sentence, we define the best inference as the inference that realizes the subtree alignment with the highest probability among all inferences in the inferential space of that tree pair.
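One way to write this definition formally is sketched below. All symbols are assumptions introduced here for illustration, since the source notation did not survive: $S$ and $T$ stand for the Chinese and Naxi syntax trees, $d$ for a single inference, $D(S, T)$ for the inferential space, and $P(d \mid S, T)$ for the inference probability. The original formula may condition or normalize differently.

```latex
% Assumed notation: S = Chinese tree, T = Naxi tree,
% D(S, T) = inferential space, d = one inference.
d^{*} = \operatorname*{arg\,max}_{d \in D(S, T)} P(d \mid S, T)
```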

2.2. Inference Probability of the Subtree Alignment Model.
The inference probability of the subtree alignment model is built from four factor probabilities: $P_{nt}(\cdot)$ is the mapping probability of nonterminal symbols between Chinese and Naxi; $P_{tree}(\cdot)$ is the generating probability of the subtree that is extended from a node on the Naxi side; $P_{lex}(\cdot)$ is the mapping probability of terminal symbols between Chinese and Naxi; and $P_{reorder}(\cdot)$ is the probability of the rule that encodes the reordering of frontier nonterminals. Figure 1 illustrates each of the four conditional probabilities with a bilingual aligned subtree instance. According to the alignment rules of the syntax trees with Chinese-Naxi subtree alignment in Figure 1, the four conditional probabilities can be expressed as shown in Table 1.
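The following sketch shows how four factor probabilities could be combined into one inference score. It assumes that the factors are multiplied, first within each aligned subtree pair and then across all pairs in an inference; the source names the four factors but the combining formula is not legible, so this product form is an assumption. The function names, dictionary keys, and probability values are hypothetical.

```python
import math

# Hypothetical keys for the four factors named in the text.
FACTOR_NAMES = ("nt", "tree", "lex", "reorder")


def rule_log_prob(factors):
    """Log-probability of one aligned subtree pair.

    Assumption: the four factors are multiplied, so their logs are summed.
    """
    return sum(math.log(factors[name]) for name in FACTOR_NAMES)


def inference_log_prob(rules):
    """Log-probability of a whole inference, assumed to be the product
    of the probabilities of its aligned subtree pairs."""
    return sum(rule_log_prob(rule) for rule in rules)


# Made-up factor values for an inference with two aligned subtree pairs.
example_inference = [
    {"nt": 0.5, "tree": 0.25, "lex": 0.5, "reorder": 1.0},
    {"nt": 1.0, "tree": 0.5, "lex": 0.25, "reorder": 0.5},
]

log_p = inference_log_prob(example_inference)
prob = math.exp(log_p)
```

Working in log space avoids numerical underflow when an inference contains many aligned subtree pairs, which is why the sketch sums logarithms instead of multiplying raw probabilities.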

2.3. Node Deletion and Insertion.
During Chinese-Naxi subtree alignment, if the subtree on the Naxi side is empty, the inference probability is changed so that the Chinese subtree is aligned with a special symbol that denotes the empty subtree. If the Chinese subtree is empty, the inference probability is changed correspondingly, again with a special symbol that denotes the empty subtree.

2.4. Training Subtree Alignment Model.
The EM algorithm [11] is used to train the model parameters and the four conditional probabilities; the training process for $P_{nt}(\cdot)$ is shown in Algorithm 1.

Algorithm 1: Function TrainModelWithEM.

In Algorithm 1, the two symbols subscripted nt denote the nonterminal on the source-language side and the one on the target-language side, respectively. EC(⋅ | ⋅) is the expected count of the corresponding nonterminal pair. The algorithm also takes the number of iterations as a parameter, and its two node variables (the second written V) denote a tree node on the source side and one on the target side, respectively.
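To make the expected-count idea concrete, here is a minimal EM sketch for a nonterminal mapping table. It is not the paper's Algorithm 1, whose body is not available here. It swaps in a simplified, IBM Model 1-style estimator that treats each tree as a bag of nonterminal labels and ignores tree structure, terminals, and reordering. The function name `train_nt_mapping`, the direction of the conditional (target label given source label), and the toy data are all assumptions.

```python
from collections import defaultdict


def train_nt_mapping(tree_pairs, iterations=10):
    """Estimate P(target_nt | source_nt) with EM.

    tree_pairs: list of (source_labels, target_labels), where each side is
    a list of nonterminal labels taken from one syntax tree.
    Simplification: alignment is a hidden link between labels only.
    """
    src_vocab = {s for src, _ in tree_pairs for s in src}
    tgt_vocab = {t for _, tgt in tree_pairs for t in tgt}

    # Start from a uniform mapping table.
    prob = {s: {t: 1.0 / len(tgt_vocab) for t in tgt_vocab} for s in src_vocab}

    for _ in range(iterations):
        # E-step: accumulate expected counts EC(t | s).
        expected = defaultdict(lambda: defaultdict(float))
        for src, tgt in tree_pairs:
            for t in tgt:
                norm = sum(prob[s][t] for s in src)
                for s in src:
                    expected[s][t] += prob[s][t] / norm

        # M-step: renormalize the expected counts for each source label.
        for s in src_vocab:
            total = sum(expected[s].values())
            if total > 0:
                prob[s] = {t: expected[s][t] / total for t in tgt_vocab}
    return prob


# Toy data: two tree pairs reduced to their nonterminal labels.
pairs = [(["NP", "VP"], ["NP", "VP"]), (["NP"], ["NP"])]
table = train_nt_mapping(pairs, iterations=10)
```

On this toy data the second pair can only link NP to NP, so the estimate for that mapping grows over the iterations, which in turn pushes the remaining probability mass for VP toward VP.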

A Chinese-Naxi Tree-to-Tree Translation Template Extracting Algorithm Based on Subtree Alignment
Before extracting translation templates, we must perform syntactic analysis on both the source-language side (Chinese) and the target-language side (Naxi) of the bilingual training corpus to determine the subtree alignment relation. We can then extract the templates with the matrix-based extraction algorithm [12]. From the subtree alignment relation we obtain the phrase tree shown in Figure 1.
The extraction algorithm is shown in Algorithm 2. It operates on a subtree alignment matrix for the pair of trees and uses an empirical minimum threshold to control how often rules are pruned; in this work, the threshold is $10^{-7}$ by default.
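The pruning step can be illustrated with a short sketch. It assumes that each cell of the subtree alignment matrix holds a probability for aligning one source node with one target node, and that pairs scoring below the threshold are discarded. That reading of the matrix, the function name `extract_aligned_pairs`, and the parameter name `p_min` are assumptions, and the sketch does not reproduce Algorithm 2 itself.

```python
def extract_aligned_pairs(matrix, src_nodes, tgt_nodes, p_min=1e-7):
    """Return (source_node, target_node, score) triples whose score in the
    alignment matrix is at least p_min, best-scoring first.

    matrix[i][j] is assumed to be the alignment score of src_nodes[i]
    with tgt_nodes[j].
    """
    kept = []
    for i, src in enumerate(src_nodes):
        for j, tgt in enumerate(tgt_nodes):
            score = matrix[i][j]
            if score >= p_min:  # prune pairs below the threshold
                kept.append((src, tgt, score))
    kept.sort(key=lambda item: item[2], reverse=True)
    return kept


# Toy 2x2 matrix with made-up scores.
toy_matrix = [
    [0.9, 1e-9],
    [0.0, 0.8],
]
candidates = extract_aligned_pairs(toy_matrix, ["NP", "VP"], ["NP", "VP"])
```

A lower threshold keeps more candidate pairs and therefore more rules, at the cost of admitting noisier alignments; the default of $10^{-7}$ only removes pairs with almost no support.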

Experiment and Result Analysis
4.1. Experimental Design.
Because research on machine translation of the Naxi language is less developed than that for English, public corpora are scarce. For our study, we collected 35,000 Chinese-Naxi parallel sentences from bilingual elementary-education textbooks and dialogues. After word segmentation, dependency parsing, and bilingual word alignment annotation, we built a Chinese-Naxi corpus for statistical machine translation, from which we selected 16,000 aligned parallel sentence pairs as the development set and 7,000 sentence pairs as the test set. The experimental corpus is shown in Table 2.
To validate the accuracy of the tree-to-tree Chinese-Naxi translation method based on subtree alignment, we designed contrast experiments on translation methods based on the tree-to-string model, the string-to-tree model, the tree-to-tree model, and the tree-to-tree model with subtree alignment [13,14].

4.2. Experimental Results and Analysis.
The results of the comparative experiments are shown in Table 3.
As Table 3 shows, the Chinese-Naxi syntax translation system based on the tree-to-tree model with subtree alignment scores 1.5 BLEU4 points higher than the system based on the tree-to-tree model with word alignment and 2.6 points higher than the system based on the tree-to-string model. The improvement is attributed to the fact that the tree-to-tree translation model with subtree alignment resolves missing alignments between Chinese and Naxi by deleting or inserting nodes. For example, when translating the Chinese sentence "我 喜 欢 吃 黑 雪 山 的 松 果" (I like to eat pine cones from black snow mountain), the proposed method can delete the null-aligned word "的" and thus obtains the better result " (I) (pine cones) (black snow mountain) (like) (to eat)." The experimental results show that the proposed tree-to-tree translation model with subtree alignment improves the accuracy of Chinese-Naxi machine translation. However, our translations still contain some errors. Detailed analysis shows that the main causes are characteristics of the Naxi language, such as sentence-final verbs, postpositional attributes, polysemy, and synonymy, together with the data sparseness caused by the small size of the corpus.
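The quoted gains can be checked directly against the scores reported in Table 3. The short sketch below does that arithmetic. It assumes that the gains in the text refer to the first of the two score columns in Table 3; the dictionary keys are shortened system names chosen here for illustration.

```python
# Scores copied from Table 3; each tuple holds the two reported columns.
scores = {
    "tree-to-string": (22.41, 22.53),
    "string-to-tree": (22.45, 22.58),
    "tree-to-tree": (23.59, 24.74),
    "tree-to-tree + subtree alignment": (25.05, 26.16),
}


def gain(system, baseline, column=0):
    """Absolute score difference between two systems in one column."""
    return round(scores[system][column] - scores[baseline][column], 2)


proposed = "tree-to-tree + subtree alignment"
gain_over_tree_to_tree = gain(proposed, "tree-to-tree")
gain_over_tree_to_string = gain(proposed, "tree-to-string")
```

Under that assumption the differences are 1.46 and 2.64, which round to the 1.5 and 2.6 quoted in the text; both are absolute differences in score, not relative percentages.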

Conclusions
This paper puts forward a Chinese-Naxi syntactic statistical machine translation method based on the tree-to-tree model with subtree alignment. The experimental results show that, compared with the tree-to-tree translation model with word alignment, the tree-to-tree machine translation model with subtree alignment gains 1.5 BLEU4 points. As a next step, we will expand the scale of the corpus. At the same time, given the serious structural differences between Chinese syntax and Naxi syntax, we will continue to work on the Chinese-Naxi tree-to-tree machine translation model and attempt to integrate semantic information into the translation model to improve the quality and performance of Chinese-Naxi syntactic statistical machine translation.

Algorithm 2: Matrix-based extraction of tree-to-tree translation templates.

Table 2: The experimental corpus.

Table 3: The results of comparative experiments.
System                                                                                      BLEU4 scores
Chinese-Naxi syntax translation system based on tree-to-string model                        22.41   22.53
Chinese-Naxi syntax translation system based on string-to-tree model                        22.45   22.58
Chinese-Naxi syntax translation system based on tree-to-tree model                          23.59   24.74
Chinese-Naxi syntax translation system based on tree-to-tree model with subtree alignment   25.05   26.16

For the sentences on the source side (Chinese), the syntax trees are generated by Berkeley Parser, a syntactic analyzer based on the Penn Treebank [15]; for the sentences on the target side (Naxi), the corresponding Naxi syntax trees are generated by a Naxi syntactic analyzer that we developed. A 3-gram language model was trained with the SRILM toolkit. The subtree alignment model was trained with the EM algorithm. The matrix-based extraction algorithm over subtree alignments extracted the tree-to-tree translation templates and yielded 512 template rules. Minimum Error Rate Training (MERT) was used to train our log-linear model and tune the weight parameters. BLEU4 [16] is selected as the evaluation measure.