Selection of In-Domain Bilingual Sentence Pairs Based on Topic Information

The performance of a machine translation system (MTS) depends on the quality and size of its training data. Effective methods for extending the training data of an MTS in specific domains therefore need to be explored. This paper proposes a method for selecting in-domain bilingual sentence pairs based on topic information. Using the topic relevance of bilingual sentence pairs to the target domain, subsets of sentence pairs related to the texts to be translated are selected from a large-scale bilingual corpus and used to train a domain-specific translation system, which improves the translation quality for in-domain texts. In the tests, an MTS trained on the sentence pairs selected with the proposed method achieves greatly enhanced translation performance.


Introduction
At present, the performance of a machine translation system (MTS) is determined by the quality and size of the training data. The larger and the higher in quality the training data are, the better the translation performance is. However, existing bilingual resources are distributed very unevenly across domains, and the bilingual resources for some domains are extremely scarce [1]. When the training corpora and the test texts belong to different domains, a translation system generally performs poorly. The main reason is that the test texts contain many technical terms, and the system fails to obtain effective translation knowledge about these terms from the training data. Research on methods for selecting bilingual sentence pairs aims to select subsets of sentence pairs related to in-domain texts from a large-scale bilingual parallel corpus drawn from many different sources. The goal is to extend the training dataset for an MTS in specific domains and thereby enhance the performance of machine translation. In addition, the bilingual parallel sentence pairs acquired by existing methods for mining bilingual resources generally carry no labels indicating their domains. Thus, determining how to automatically mine bilingual sentence pairs relevant to a specific domain from the bilingual resources becomes an effective approach to improving the performance of machine translation.

Related Work
Existing methods for selecting in-domain bilingual sentence pairs can be roughly divided into three types, based respectively on information retrieval, a language model, and the combination of a translation model and a language model. For the first type, Lu et al. [2] and Huang et al. [3] proposed selection methods for bilingual sentence pairs based on information retrieval. In these methods, the sentence pairs related to the test set are retrieved from a universal corpus by using the sentences in the test set as queries. Because the test sentences must be available as retrieval queries, such methods cannot support rapid, real-time translation.
As for the second type, Yasuda et al. [4] put forward a method for evaluating and selecting bilingual sentence pairs based on the language model perplexity in the target domain. In the method, language models are trained with monolingual corpora from the target domain, and the relevance of the candidate bilingual sentence pairs is evaluated according to the cross entropy of a language model. Duh et al. [5] assessed the quality of bilingual sentence pairs with the cross-entropy difference of neural network language models. Axelrod et al. [6] screened and explored in-domain data by means of part-of-speech tags and N-grams. Additionally, aiming at the selection of bilingual sentence pairs for a universal translation system, Yao et al. [7] proposed a method for selecting bilingual sentence pairs based on the quality and coverage of sentence pairs. The method evaluates the quality of sentence pairs by fusing the scores of multiple features with manually designed weights. The sentence pairs with a high score are selected, and redundant sentence pairs are further filtered out based on N-gram coverage. Wang et al. [8] proposed a classification-based method for selecting parallel corpora: a highly discriminative classifier is built from the features of a small number of sentence pairs to distinguish the quality of bilingual sentence pairs. Shah and Specia [9] experimentally evaluated multiple translation methods for sentence pairs. Although the above methods have achieved favourable results, they still show some limitations. First, these methods only take the domain relevance of sentence pairs into account while ignoring whether the two sides of a sentence pair are actually translations of each other. Thus, the bilingual sentence pairs selected by these methods may be of poor mutual translation quality, which brings noise into the subsequent training of the translation models. Second, the weights of the features for evaluating the quality of different sentence pairs are tuned by manual experience in existing methods, which fails to obtain the optimal performance [10].
As for the third type, the domain relevance and the mutual translation quality of a sentence pair are both taken into account [11]. However, methods that select sentence pairs based on a language model or a translation model evaluate the relevance of sentence pairs to the target domain using the co-occurrence statistics of words or N-grams [12]. Limited by the size of the monolingual or bilingual resources in a target domain, such methods are likely to suffer from data sparseness; moreover, the topic diversity of in-domain texts is ignored when a translation model or language model is trained on the entire dataset [13,14]. In addition, related work has also been conducted at the granularity of phrase pairs: phrase pairs of specific types, which can be formed by combining phrases in the phrase table for the target domain, are automatically screened from the phrase table. Furthermore, the phrase pairs with high domain relevance are screened and added to the phrase table [15,16]. Farhath et al. [17] evaluated the impact of different types of data sources in developing a domain-specific SMT system for the domain of official government letters.

Research Methods
The whole construction scheme is shown in Figure 1. The method for selecting in-domain bilingual sentence pairs based on topic information is used to train translation systems in specific domains to improve the translation quality. Using the topic relevance between the bilingual sentence pairs and the target domain, the method selects subsets of sentence pairs related to the text to be translated from a large-scale bilingual corpus. First, the topic vector of each phrase pair is learnt from the contextual words of the phrase pair in the bilingual corpora. Next, the topic vectors of the development set in the target domain and of the candidate bilingual sentence pairs are computed from the sets of phrase pairs extracted from them. Finally, the topic relevance between the candidate bilingual sentence pairs and the texts in the development set in the target domain is calculated. The highly relevant sentence pairs are preferentially selected as the training data in the target domain. Selecting the bilingual sentence pairs relevant to the target domain by the topic relevance of texts provides a new method for extending the training data of domain-specific MTSs and addresses the lack of training data in specific fields.

The Training Module for a Topic Model Based on Phrase Pairs.
The module learns the topic distribution of bilingual phrase pairs, which shows the occurrence probability of bilingual phrase pairs under different topics.
The specific steps are as follows:
(I) Based on the phrase extraction algorithm, phrase pairs are extracted from a word-aligned bilingual parallel corpus, and the IDs of the bilingual parallel sentence pairs containing each phrase pair are recorded.
(II) Some phrase pairs are randomly sampled from the extracted phrase pairs, and the contexts of the sentence pairs containing the current phrase pair are obtained according to the IDs recorded above. The words in these contexts, excluding the current phrase pair itself, are combined into a new document that serves as the pseudo-document of the phrase pair. The words frequently occurring in the contexts of a phrase pair can characterize its semantic meaning. Thus, the pseudo-document of a phrase pair is built from the contextual words of the phrase pair in the bilingual parallel corpus, and the topic distribution of the phrase pair is obtained from the topic distribution of its pseudo-document.
(III) Special characters, stop words, and low-frequency words are removed from the pseudo-document, and the document is used to train the LDA topic model to acquire the topic distribution of the document. The topic distribution of the document is then taken as that of the corresponding phrase pair.
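The pseudo-document construction in steps (II) and (III) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_pseudo_documents`, the data layout, the tiny stop-word list, and the `min_count` threshold are all hypothetical, and context words are excluded by token identity rather than by the exact span of the phrase pair. Training the LDA model on the resulting documents is left to any LDA implementation and is not shown.

```python
from collections import defaultdict

# Toy stop-word list (assumption); a real system would use full lists per language.
STOP_WORDS = {"the", "a", "to", "is", "of"}


def build_pseudo_documents(corpus, phrase_occurrences, min_count=1):
    """Build one pseudo-document per phrase pair from its sentence contexts.

    corpus: dict mapping sentence-pair ID -> (source_tokens, target_tokens)
    phrase_occurrences: dict mapping (source_phrase, target_phrase) -> list of
        sentence-pair IDs, where each phrase is a tuple of tokens
    min_count: words occurring fewer times in the corpus are dropped
    """
    # Corpus-wide word frequencies, used for the low-frequency filter.
    freq = defaultdict(int)
    for src, tgt in corpus.values():
        for word in src + tgt:
            freq[word] += 1

    docs = {}
    for (f, e), sentence_ids in phrase_occurrences.items():
        phrase_words = set(f) | set(e)
        doc = []
        for sid in sentence_ids:
            src, tgt = corpus[sid]
            for word in src + tgt:
                if word in phrase_words:
                    continue  # exclude the current phrase pair itself
                if word in STOP_WORDS or not word.isalnum():
                    continue  # drop stop words and special characters
                if freq[word] < min_count:
                    continue  # drop low-frequency words
                doc.append(word)
        docs[(f, e)] = doc
    return docs


# Toy usage with two spoken-language sentence pairs.
corpus = {
    0: (["where", "is", "the", "hotel", "?"], ["酒店", "在", "哪里", "?"]),
    1: (["the", "hotel", "is", "near", "the", "station", "."],
        ["酒店", "在", "车站", "附近", "。"]),
}
pair = (("hotel",), ("酒店",))
docs = build_pseudo_documents(corpus, {pair: [0, 1]})
print(docs[pair])
```

Each resulting token list would then be passed to the topic model as one document, so that the inferred document-topic distribution can stand in for the topic distribution of the phrase pair.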

Inference Module for Topic Vectors of the Bilingual Sentence Pairs and the Target Domain.
The module is used to obtain the topic representations of the bilingual sentence pairs and the target domain. The topic vectors of the bilingual sentence pairs and of the development set in the target domain are calculated from the sets of phrase pairs extracted from them. Taking the calculation of the topic vector of a bilingual sentence pair as an example, the specific steps are summarized as follows:
(I) Extraction of phrase pairs: the phrase pairs satisfying the consistency of word alignment are extracted from the bilingual parallel sentence pairs by the phrase extraction algorithm.
(II) Inference of the topic distribution of phrase pairs: all contexts of each phrase pair in the bilingual parallel corpus are obtained to construct a pseudo-document for the topic distribution.
(III) The topic vectors of the phrase pairs are weighted by the occurrence frequency of each phrase pair in the bilingual sentence pair to calculate the topic vector of the sentence pair:

$$V_s = (v_1, v_2, \ldots, v_K), \qquad v_k = \sum_{i=1}^{M} \sum_{j=1}^{N} c_s(f_i, e_j)\, t_k(f_i, e_j),$$

where $M$ and $N$ refer to the numbers of source and target language phrases in the bilingual sentence pair, respectively; $c_s(f_i, e_j)$, $t_k(f_i, e_j)$, $K$, and $V_s$ denote the number of occurrences of the phrase pair $(f_i, e_j)$ in the bilingual sentence pair, the value in the $k$th dimension of the topic vector of the phrase pair, the number of topics, and the topic vector of the bilingual sentence pair, respectively. Because the components of $V_s$ do not necessarily sum to 1, it is necessary to further normalize the topic vector calculated above:

$$p_i(f, e) = \frac{v_i}{\sum_{k=1}^{K} v_k},$$

where $p_i(f, e)$ stands for the probability of the bilingual sentence pair under the $i$th topic. The topic vector of the development set of the target domain is calculated in the same way.
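As a small worked illustration of step (III), the sketch below computes a frequency-weighted sum of phrase-pair topic vectors and normalizes it so that the result is a probability distribution over topics. The function name `sentence_topic_vector`, the dictionary layout, and the uniform fallback for sentence pairs with no known phrase pairs are assumptions made for the example; the source does not specify how such cases are handled.

```python
def sentence_topic_vector(phrase_counts, phrase_topics, num_topics):
    """Frequency-weighted sum of phrase-pair topic vectors, normalized to sum to 1.

    phrase_counts: dict mapping phrase pair -> occurrence count in the sentence pair
    phrase_topics: dict mapping phrase pair -> topic vector of length num_topics
    """
    vector = [0.0] * num_topics
    for pair, count in phrase_counts.items():
        topics = phrase_topics.get(pair)
        if topics is None:
            continue  # phrase pair unknown to the topic model: skipped (assumption)
        for k in range(num_topics):
            vector[k] += count * topics[k]

    total = sum(vector)
    if total == 0.0:
        # No usable phrase pairs: fall back to a uniform distribution (assumption).
        return [1.0 / num_topics] * num_topics
    return [value / total for value in vector]


# Toy example with K = 2 topics.
phrase_topics = {
    ("hotel", "酒店"): [0.8, 0.2],
    ("station", "车站"): [0.5, 0.5],
}
phrase_counts = {("hotel", "酒店"): 2, ("station", "车站"): 1}
p = sentence_topic_vector(phrase_counts, phrase_topics, num_topics=2)
print(p)  # approximately [0.7, 0.3]
```

In this toy case the unnormalized vector is (2 × 0.8 + 0.5, 2 × 0.2 + 0.5) = (2.1, 0.9), which normalizes to roughly (0.7, 0.3). The same function could be applied to the phrase pairs of the development set to obtain the topic vector of the target domain.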

Calculation Module for Topic Relevance.
The module is applied to calculate the topic relevance between bilingual sentence pairs and the target domain and finally to screen the subset of sentence pairs related to the target domain from a large-scale bilingual parallel corpus. The specific steps are described as follows:
(I) The similarities between the topic distributions of all candidate bilingual sentence pairs and that of the development set of the target domain are calculated by taking the Jensen-Shannon divergence as the evaluation index.
(II) According to the Jensen-Shannon divergence score, all candidate bilingual sentence pairs are sorted in ascending order. A parallel sentence pair with a lower score shows a higher relevance to the target domain. As a statistical measure of the similarity of two probability distributions, the Jensen-Shannon divergence is mathematically defined as follows:

$$\mathrm{JS}\bigl(p(f,e)\,\|\,p(\mathrm{dev})\bigr) = \frac{1}{2}\sum_{i}\left[p_i(f,e)\log\frac{2\,p_i(f,e)}{p_i(f,e)+p_i(\mathrm{dev})} + p_i(\mathrm{dev})\log\frac{2\,p_i(\mathrm{dev})}{p_i(f,e)+p_i(\mathrm{dev})}\right],$$

where the sum runs over all topics, $p_i(f, e)$ refers to the probability of the bilingual sentence pair under the $i$th topic, and $p_i(\mathrm{dev})$ represents the probability of the development set of the target domain under the $i$th topic.
(III) The bilingual sentence pairs ranked within TopN = {100 K, 200 K, 400 K, 600 K, 800 K, 1,000 K} are successively selected as the training data for the target domain to build a domain-specific MTS. The optimal value of N is determined according to the translation performance of the MTS on the development set.
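The ranking in steps (I) to (III) can be sketched as below. The code uses the standard definition of the Jensen-Shannon divergence with natural logarithms; the function names `js_divergence` and `select_top_n` and the candidate layout are hypothetical, and the toy distributions are invented purely for illustration.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing to the sum.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


def select_top_n(candidates, dev_topics, n):
    """Return the IDs of the n candidates closest to the development set.

    candidates: list of (sentence_pair_id, topic_distribution) tuples
    dev_topics: topic distribution of the target-domain development set
    """
    ranked = sorted(candidates, key=lambda c: js_divergence(c[1], dev_topics))
    return [pair_id for pair_id, _ in ranked[:n]]


# Toy example with two topics; lower divergence means higher relevance.
dev_topics = [0.7, 0.3]
candidates = [("a", [0.7, 0.3]), ("b", [0.1, 0.9]), ("c", [0.6, 0.4])]
print(select_top_n(candidates, dev_topics, 2))  # ['a', 'c']
```

To choose N as in step (III), one would call `select_top_n` for each candidate size, train a system on each subset, and keep the size with the best development-set score.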

Test Corpora.
The feasibility of the aforementioned sentence pair selection method is validated through tests on the English-Chinese translation task in the spoken language domain. The training corpora for the bilingual topic model involve two parts: (1) training corpora in the spoken language domain, which are taken from the official oral tourism parallel corpora (size: 50 K sentence pairs, 356 K English words, and 317 K Chinese words) offered by the Chinese Language Translation Task, China Workshop on Machine Translation (CWMT09) [18], and (2) training corpora in the universal domain, which consist of bilingual sentence pairs (size: 16 M sentence pairs, 396 M English words, and 358 M Chinese words) automatically acquired from the Web according to the method by Liu et al. [19].

Test Setting.
The test was conducted with the open source machine translation tool NiuTrans, which was produced by Northeastern University of China [20]. The system environment is set as follows: GIZA++ is used to realize the word alignment of bilingual sentence pairs; a trigram language model is used, and the parameters of the translation system are optimized by minimum error rate training [21]; moreover, the BLEU value is used as the evaluation criterion for the performance of the MTS [22]. Different translation system schemes are designed in the test to verify the translation effect under various schemes.
CWMT: the MTS [18] trained with bilingual corpora (with the size of 50 K) in the spoken language domain from the CWMT09
GE: the MTS [19] trained with large-scale universal training corpora (with the size of 16 M)
Duh_2013: the MTS [5] trained with TopN sentence pairs selected from a large-scale universal bilingual corpus by employing the method for selecting sentence pairs based on the neural network language model perplexity proposed by Duh et al.
Yao_2016: the MTS [1] trained with TopN sentence pairs selected from a large-scale universal bilingual corpus by utilizing the method for the selection of sentence pairs, combining a translation model with a language model proposed by Liu et al. [19]
TIM: the MTS trained with TopN sentence pairs selected from a universal bilingual corpus using the method for the selection of in-domain bilingual sentence pairs based on the topic information proposed in the study

Results and Analysis.
The performances of the CWMT and GE systems in the study are shown in Table 1.
It can be seen from Table 1 that the translation system GE, trained with large-scale universal bilingual corpora, is superior to the system trained with bilingual corpora in the spoken language domain (CWMT). The BLEU value of the GE system based on the same test set is improved by 13.72%. The reason is that the size of the training corpus of the GE system is larger than that of the CWMT system. The large-scale corpora of the GE system cover more translation knowledge and language phenomena, while the small-scale corpora of the CWMT system are likely to result in data sparseness, thus leading to poor translation performance.
However, large-scale universal bilingual corpora contain many bilingual sentence pairs that have little or no correlation with the target domain. These sentence pairs have an adverse effect on the translation performance. By applying different methods for selecting sentence pairs, the sentence pairs ranked within TopN = {100 K, 200 K, 400 K, 800 K, 1000 K} are extracted from the large-scale universal bilingual corpora. The extracted sentence pairs are then used as the training corpus to train the translation model of a translation system. The specific test results are shown in Figure 2.
The test results show that training the translation model of an MTS on the TopN sentence pairs extracted from a large-scale bilingual corpus with the proposed topic-based selection method effectively improves the training effect of the translation model while reducing the cost of model training. When the top 400 K sentence pairs related to the target domain are extracted from the universal bilingual corpus (BLEU = 37.25%), the translation effect of the system already exceeds that obtained with all corpora (BLEU = 35.62%). The reason is that the universal bilingual corpus contains training data from various domains, including many sentence pairs that are only weakly related to the target domain. As a result, the translation rules extracted from it contain considerable noise, which degrades the final translation performance.
As shown in Figure 2, the TIM method proposed in the study, which selects sentence pairs relevant to the target domain based on topic information, is superior to the other test methods. When TopN = 800 K sentence pairs are extracted, the TIM reaches its optimal BLEU value of 38.89%, which is 2.01% and 1.46% higher than those of Duh et al. [5] and Yao et al. [1], respectively. With TopN = 1000 K, the BLEU value of the TIM is 1.8% and 1.22% higher than those of Duh et al. [5] and Yao et al. [1], respectively. The other models mainly rely on statistical information (such as text co-occurrence and language models) to estimate the domain relevance of sentence pairs while ignoring the latent semantic information in sentence pairs. By contrast, the TIM selects bilingual sentence pairs based on topic information, which generalizes the in-domain translation knowledge at the level of topics. Thus, the information of the translation domain is matched more closely, and a better translation effect is attained.

Conclusions
From the perspective of the domain relevance of bilingual sentence pairs, the bilingual resources for specific translation tasks are selected and extended from a large-scale universal bilingual parallel corpus in order to improve the performance of domain-specific MTSs. The bilingual sentence pairs relevant to the target domain are selected based on the topic relevance of texts: a domain is described in terms of topics, and both the bilingual sentence pairs and the in-domain texts are characterized as probability distributions over topics. The sentence pairs relevant to the target domain are then selected by their topic relevance and used to train the domain-specific translation system, thus improving the translation quality of in-domain texts. The study provides a new method for extending the training data for the MTS in specific domains and addresses the lack of training data in some specific domains. The test results reveal that the translation performance is greatly enhanced when the bilingual sentence pairs are selected and the translation system is trained with the proposed method. In future work, more effective domain features will be employed to select the bilingual sentence pairs in specific domains, and the size of the bilingual corpora and the number of domains involved in the corpora will be extended.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.