A Dirichlet Process Mixture Based Name Origin Clustering and Alignment Model for Transliteration

In machine transliteration, it is common that the transliterated names in the target language come from multiple language origins. A conventional maximum likelihood based single model cannot deal with this issue very well and often suffers from overfitting. In this paper, we exploit a coupled Dirichlet process mixture model (cDPMM) to address overfitting and the multiorigin clustering of names simultaneously in the transliteration sequence alignment step over the name pairs. After the alignment step, the cDPMM automatically clusters name pairs into many groups according to their origin information. In the decoding step, in order to make full use of the learned origin information, we use a cluster combination method (CCM) to build cluster-specific transliteration models by combining small clusters into large ones based on the perplexities of the name language models and transliteration models, which ensures that each origin cluster has enough data for training a transliteration model. On three different Western-Chinese multiorigin name corpora, the cDPMM outperforms two state-of-the-art baseline models in terms of both top-1 accuracy and mean F-score, and the CCM further improves the cDPMM significantly on two of the three corpora.


Introduction
Machine transliteration translates a word or phrase from one writing system to another based on its pronunciation and is a major way of importing foreign words, such as proper nouns and technical terms, into a target language. It is also a major method for translating out-of-vocabulary words, which commonly consist of personal names, place names, company names, and product names. Machine transliteration is an essential component of many NLP applications such as automated question answering (QA), machine translation (MT), cross-language information retrieval (CLIR), and information extraction (IE).
Machine transliteration emerged many years ago as a part of machine translation. Early methods were often based on the phonetics of the source and target languages, but spelling-based statistical approaches now nearly dominate the field. Because spelling-based methods directly align characters in the training corpus using statistics and do not require language-specific phonetic knowledge, they are language independent and capable of achieving state-of-the-art performance [1]. However, there are still several common challenges [2] in spelling-based machine transliteration, such as script specifications, language of origin, missing sounds, and overfitting. In this paper, we concentrate on the language of origin and overfitting problems.
The origin problem of the language, also known as multiorigin identification, can affect both the alignment step and the decoding step of statistical transliteration. In Western-Chinese machine transliteration, the bilingual training corpus usually contains name pairs originating from more than one language. Names coming from different languages have their own transliteration pronunciation rules.
Advances in Artificial Intelligence
For example, the same Chinese character "金" should be aligned to different romanized character sequences: "金: kim," "金: kana," "金: king," and "金: jin." In this case, it is not easy for a transliteration model that does not take the origin of names into consideration to learn the right alignment for "金," and the same difficulty arises for the character "丁." Overfitting is another problem, which exists in many previous alignment models used in transliteration, such as GIZA++ [3], M2Maligner [4], and HMM [5]. These alignment models are optimized with expectation maximization and try to fit the training data by maximum likelihood (ML) estimation. In practice the training data usually has some kind of deficiency, such as containing noisy data or not being large enough, so it often does not reflect the real data distribution. When an ML model is trained on this kind of dataset, it often suffers from overfitting.
In this paper, we exploit a simple and fully unsupervised model to solve the two problems above, which is able to both cluster and align simultaneously. The coupled Dirichlet process mixture model (cDPMM) [6] integrates a Dirichlet mixture model for name pairs clustering and a set of Bayesian bilingual alignment models (BBAM) [7] for each cluster. The clustering and alignment models synergistically complement one another: the clustering model groups the training data into the right origin cluster so that self-consistent alignment models can be built on these data of the same type, and at the same time the alignment probabilities from the alignment models can direct the clustering process. Furthermore, based on the cluster and alignment results, we propose a cluster combination method (CCM) to build the cluster-specific transliteration model and language model for each cluster. In the decoding step, given a source name, we classify it into the most similar cluster based on the language model and transliterate it with the cluster-specific transliteration model. In this way, we use the origin information to direct the alignment and decoding step and obtain significant improvements.
Our major contributions in this paper are summarized as follows:
(i) We exploit a Dirichlet process mixture model for clustering name pairs, which is fully unsupervised and does not require setting the number of clusters beforehand. It is capable of discovering an appropriate number of clusters from the data automatically.
(ii) We apply the Bayesian segmentation alignment model in the alignment step over each cluster, which allows many-to-many monotonic alignment and can overcome overfitting.
(iii) The clustering and alignment models are coupled in a unified model, and they work synergistically to support each other.
(iv) By combining the small clusters into large ones based on the perplexities of name language models and transliteration models, we build a combined cluster-specific transliteration model and language model and can transliterate each source name based on its origin.
(v) We conduct experiments on three Western-to-Chinese multiorigin datasets, and the results show that our cDPMM and CCM methods are competitive with other methods in terms of accuracy and mean F-score.
The rest of the paper is organized as follows: we start by introducing related work in Section 2. Then we describe our coupled Dirichlet process mixture model in detail in Section 3. In Section 4 we introduce the cluster combination method, which provides enough training data for building a cluster-specific transliteration model and language model. Section 5 provides the experimental setup, results, and analysis. Finally, we conclude this paper and discuss future work.

Related Work
Machine transliteration methods can be categorized into phonetic-based models [8], spelling-based models [9], and hybrid models which utilize both phonetic and spelling information [10,11]. Among them, spelling-based models, which directly align characters in the training corpus by statistical learning, have been popular in transliteration because they are independent of both the language and phonetic knowledge, and their performance is often significantly higher than that of phonetic-based models.
In some cases, transliteration can be viewed as a special instance of machine translation, so the alignment and decoding methods of statistical machine translation (SMT) [12] can be used in transliteration. Many previous works [13,14] have built transliteration models based on GIZA++ [3] and the Moses decoder [12]. In these works, a single character or syllable in a name pair is treated as a word in a sentence pair, and a monotonic many-to-many alignment is carried out. As the alignment is monotonic and the decoding is simpler in machine transliteration, many transliteration-specific methods have been proposed, such as the Alpha-Beta Model [15], the Joint Source Channel Model [16], and M2Maligner [4]. These models are optimized with expectation maximization and are prone to overfitting.
Recently, nonparametric Bayesian models have been widely used in many natural language processing (NLP) tasks and have achieved competitive results. Several studies have applied this kind of model to transliteration. The Bayesian bilingual alignment model (BBAM) [7] segments and aligns the training name pairs; it uses a Dirichlet process to model the alignment sequence, treats basic transliteration units as samples, and applies a blocked version of a Gibbs sampler [17] to obtain each transliteration unit. The BBAM has been integrated into many transliteration methods to improve the transliteration task [18] and the transliteration mining task [19]. In [20], Huang et al. propose a nonparametric Bayesian method to train synchronous adaptor grammars for transliteration. In this paper, we also integrate the BBAM [7] to obtain transliteration alignments and ease the overfitting issue.
The multiorigin problem in transliteration was first raised by Huang et al. [20]. They choose an unsupervised bottom-up hierarchical clustering algorithm and use the language and translation models to calculate the similarity of every two clusters, and several cluster-specific translation and classification models are built based on the clustering results. Li et al. [21] propose a supervised classification model that classifies name pairs based on their language origins and genders and chooses the most similar cluster-specific model for a source name. Hagiwara and Sekine propose a latent semantic transliteration model (DM-LST) [22] to integrate the clustering and alignment models; it is an extension of their latent class transliteration model [23] and applies a Dirichlet mixture as the prior distribution over alignment units to address overfitting.

Coupled Dirichlet Process Mixture Model
In [6], Li et al. propose a transliteration model called the coupled Dirichlet process mixture model (cDPMM), which simultaneously clusters and bilingually aligns the training data. It is a fully unsupervised method that can discover an appropriate number of clusters for the training data. In this section we briefly describe this model. We first give the terminology used in the following sections. We then describe the cDPMM in detail: Section 3.2 presents the Bayesian bilingual alignment model (BBAM), and Section 3.3 describes the cDPMM for clustering and the strategy used to couple it with the BBAM.

Terminology.
In this work, we concentrate on solving the overfitting and multiorigin problems in transliteration. First, we denote the training set as a set of sequence-pairs $D = \{x_i\}_{i=1}^{I}$, where $I$ is the number of name pairs in the training data. In the transliteration alignment stage, the cDPMM clusters every bilingual name pair and segments it into bilingual character sequence-pairs. As in [6], we call every character sequence-pair a Transliteration Unit (TU). We denote the source side and target side of a TU as $s_1^m = \langle s_1, \dots, s_m \rangle$ and $t_1^n = \langle t_1, \dots, t_n \rangle$, respectively, where $s_i$ ($t_j$) is a single character in the source (target) name. So a TU can be denoted as $(\mathbf{s}, \mathbf{t}) = (\langle s_1, \dots, s_m \rangle, \langle t_1, \dots, t_n \rangle)$. We use the same notation to denote a transliteration pair $x = (\mathbf{s}, \mathbf{t})_1^K$, where $K$ is the number of TUs used to segment $x$. Our aim is to obtain a bilingual alignment $\langle (\mathbf{s}_1, \mathbf{t}_1), \dots, (\mathbf{s}_K, \mathbf{t}_K) \rangle$ for each name pair $x$, where each $(\mathbf{s}_k, \mathbf{t}_k)$ is a segment of the whole pair (that is, a TU). We use $\gamma = \langle (\mathbf{s}_1, \mathbf{t}_1), \dots, (\mathbf{s}_K, \mathbf{t}_K) \rangle$ to indicate one derivation for $x$, and for each $x$ there is a derivation set $\Gamma = \{\gamma_1, \gamma_2, \dots, \gamma_R\}$. For each cluster obtained by the cDPMM, we use $c$ to indicate the cluster ID, $D_c$ to indicate the set of name pairs in this cluster, $\theta_c$ to represent the alignment model parameters of cluster $c$, and $D = \{D_c\}_{c \in \{1, \dots, C\}}$ to represent the clustered name pair dataset.

Alignment Model.
In this work, we use a Bayesian model [7] in the alignment step, which performs bilingual segmentation and alignment with a forward filtering and backward sampling algorithm to obtain TUs and uses a multinomial Dirichlet process to model the alignment process.

Multinomial Dirichlet Process.
The alignment component of our cDPMM is a multinomial Dirichlet process. In our work, the Dirichlet process is a stochastic process defined over the set of all possible TUs in the training dataset, and its sample path is a probability distribution on that set. Our model has two basic components: one model for generating a TU which has been generated at least once and another for assigning a probability to a TU that has not yet been seen. The probability of generating a new TU should be considerably lower than that of generating an already seen TU, and the more TUs are generated, the more reliable and complete the prior distribution is. So the model prefers TUs which have been seen over unseen ones. Theoretically, our model can generate any bilingual sequence-pair in the training dataset. The Dirichlet process segmentation and alignment model can be written in the following form:
$$G \mid \alpha, G_0 \sim \mathrm{DP}(\alpha, G_0), \qquad (\mathbf{s}_k, \mathbf{t}_k) \mid G \sim G, \tag{1}$$
where $G$ is a sampled discrete probability distribution over the set of TUs, $G_0$ is the base measure, and $\alpha > 0$ is the concentration parameter for the distribution $G$. The larger $\alpha$ is, the more similar $G$ will be to $G_0$.

The Bayesian Bilingual Alignment Model.
In our work, the generation process of a TU sequence can be described by the Chinese Restaurant Process (CRP) [24]. Every TU corresponds to the dish served at its table, and the number of customers seated at each table represents the cumulative count of that TU. A new customer who comes to the restaurant can take a seat at an occupied table with a probability proportional to the number of customers at this table or choose to sit at a new table with a probability proportional to $\alpha G_0$.
In [7], Finch and Sumita propose the Bayesian bilingual alignment model (BBAM) and use a joint spelling model to assign probability to new sequence-pairs according to
$$G_0((\mathbf{s}, \mathbf{t})) = \frac{\lambda_s^{|\mathbf{s}|}}{|\mathbf{s}|!} e^{-\lambda_s} v_s^{-|\mathbf{s}|} \times \frac{\lambda_t^{|\mathbf{t}|}}{|\mathbf{t}|!} e^{-\lambda_t} v_t^{-|\mathbf{t}|}, \tag{2}$$
where $|\mathbf{s}|$ ($|\mathbf{t}|$) is the length of the source (target) side of the TU $(\mathbf{s}, \mathbf{t})$, $v_s$ ($v_t$) is the vocabulary (alphabet) size of the source (target) language, and $\lambda_s$ ($\lambda_t$) is the expected length of the source (target) side.
According to the definition of $G_0$, source and target sequences are generated independently. For a new TU, we first choose a length from a Poisson distribution for the source (target) sequence and then generate the sequence based on the vocabulary size of the source (target) language. Theoretically, $G_0$ can generate a TU of any length and favors shorter sequences.
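The base measure described above (independent Poisson-distributed lengths on each side, then uniformly chosen characters) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation; the function names and the vocabulary sizes used in the usage note are assumptions made for the example.

```python
import math


def poisson_pmf(length: int, expected: float) -> float:
    """Poisson probability of a sequence length given the expected length."""
    return expected ** length * math.exp(-expected) / math.factorial(length)


def base_measure(src: str, tgt: str, lam_s: float, lam_t: float,
                 v_s: int, v_t: int) -> float:
    """Joint spelling model: the two sides are generated independently.

    lam_s / lam_t are the expected source / target lengths and
    v_s / v_t the source / target alphabet sizes (hypothetical values
    are supplied by the caller).
    """
    p_src = poisson_pmf(len(src), lam_s) * v_s ** (-len(src))
    p_tgt = poisson_pmf(len(tgt), lam_t) * v_t ** (-len(tgt))
    return p_src * p_tgt
```

For instance, `base_measure("kim", "金", 4, 1, 26, 5000)` scores a three-letter source segment against one Chinese character; each extra source character multiplies in another factor of `1 / v_s`, which is what makes longer unseen sequences progressively less likely.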

Algorithm 1: The Blocked Gibbs Sampling Algorithm.
Input: Random initial corpus segmentation
Output: Unsupervised co-segmentation of the corpus according to the model
for each iteration do
    for each name pair $x$ in the corpus do
        for each co-segmentation $\gamma$ of $x$ do
            Compute probability $p(\gamma \mid h)$,
            where $h$ is the distribution of all TUs that have been generated before except those from $x$;
        end for
        Sample a co-segmentation $\gamma$ from the distribution $p(\gamma \mid h)$;
        Update counts;
    end for
end for
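We use (3) to generate the $k$th TU $(\mathbf{s}_k, \mathbf{t}_k)$. It assigns a probability to $(\mathbf{s}_k, \mathbf{t}_k)$ based on the property of the Dirichlet process and the other TUs seen in the history so far:
$$P((\mathbf{s}_k, \mathbf{t}_k) \mid (\mathbf{s}_{-k}, \mathbf{t}_{-k})) = \frac{N((\mathbf{s}_k, \mathbf{t}_k)) + \alpha G_0((\mathbf{s}_k, \mathbf{t}_k))}{N + \alpha}, \tag{3}$$
where $N$ is the total number of TUs generated so far and $N((\mathbf{s}_k, \mathbf{t}_k))$ is the number of times $(\mathbf{s}_k, \mathbf{t}_k)$ has been seen in the history. $(\mathbf{s}_{-k}, \mathbf{t}_{-k})$ denotes all the TUs generated so far except $(\mathbf{s}_k, \mathbf{t}_k)$. $G_0$ and $\alpha$ are the base measure and concentration parameter as before.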
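The count-based predictive probability described here (observed count plus the concentration parameter times the base measure, normalised by the total count plus the concentration parameter) can be sketched as a small class. The class and method names are hypothetical, and the base measure is passed in as an arbitrary callable so that the sketch stays self-contained.

```python
from collections import Counter


class TUModel:
    """Chinese-restaurant-style predictive model over transliteration units."""

    def __init__(self, alpha, base_prob):
        self.alpha = alpha          # concentration parameter
        self.base_prob = base_prob  # callable (src, tgt) -> base-measure probability
        self.counts = Counter()     # how often each TU has been generated
        self.total = 0              # total number of TUs generated so far

    def add(self, tu):
        self.counts[tu] += 1
        self.total += 1

    def remove(self, tu):
        # Used when a name pair is taken out of the model before resampling it.
        self.counts[tu] -= 1
        self.total -= 1
        if self.counts[tu] == 0:
            del self.counts[tu]

    def prob(self, tu):
        seen = self.counts[tu]  # 0 for a TU that has not been seen yet
        return (seen + self.alpha * self.base_prob(*tu)) / (self.total + self.alpha)
```

Removing the TUs of the current name pair with `remove` before scoring its candidate derivations is what lets the probability be conditioned on all other TUs only.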

The Blocked Gibbs Sampling.
In the BBAM the basic samples are TUs. However, the training data in our transliteration problem are name pairs, and a Blocked Gibbs Sampling [25] Algorithm is used here to obtain a whole TU sequence for a name pair. Algorithm 1 is our Blocked Gibbs Sampling Algorithm. We use the last updated alignment model, which does not include the current name pair $x$ to be sampled, to calculate the probability of each derivation $\gamma = \langle (\mathbf{s}_1, \mathbf{t}_1), (\mathbf{s}_2, \mathbf{t}_2), \dots, (\mathbf{s}_K, \mathbf{t}_K) \rangle$ of $x$ based on
$$P(\gamma) = \prod_{k=1}^{K} P((\mathbf{s}_k, \mathbf{t}_k)).$$
Finally, a certain $\gamma$ is chosen based on the derivation probability distribution of $x$. In this paper, a forward filtering and backward sampling (FFBS) dynamic program [25] with $n$-gram information is used to obtain a bilingual segmentation path. Then we sample a certain $(\mathbf{s}_k, \mathbf{t}_k)$ based on the probability distribution.
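A forward filtering and backward sampling pass over one name pair might look like the sketch below. It is a simplified illustration under stated assumptions: every TU must cover at least one character on each side (no NULL alignments), no n-gram context is used, and `tu_prob` is any caller-supplied function returning the probability of a candidate TU. The function name and its parameters are hypothetical.

```python
import random


def sample_segmentation(src, tgt, tu_prob, max_src=6, max_tgt=1, rng=random):
    """Sample one monotonic many-to-many segmentation of (src, tgt)."""
    m, n = len(src), len(tgt)
    # Forward filtering: fwd[i][j] sums the probability of every way of
    # segmenting the prefixes src[:i] and tgt[:j].
    fwd = [[0.0] * (n + 1) for _ in range(m + 1)]
    fwd[0][0] = 1.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            for a in range(1, min(max_src, i) + 1):
                for b in range(1, min(max_tgt, j) + 1):
                    fwd[i][j] += fwd[i - a][j - b] * tu_prob(src[i - a:i], tgt[j - b:j])
    if fwd[m][n] == 0.0:
        raise ValueError("no segmentation exists within the length limits")
    # Backward sampling: walk from the end, choosing the last TU in
    # proportion to its share of the forward probability mass.
    i, j, tus = m, n, []
    while i > 0 or j > 0:
        options, weights = [], []
        for a in range(1, min(max_src, i) + 1):
            for b in range(1, min(max_tgt, j) + 1):
                w = fwd[i - a][j - b] * tu_prob(src[i - a:i], tgt[j - b:j])
                if w > 0.0:
                    options.append((a, b))
                    weights.append(w)
        a, b = rng.choices(options, weights=weights)[0]
        tus.append((src[i - a:i], tgt[j - b:j]))
        i, j = i - a, j - b
    return tus[::-1]
```

Because the forward table already sums over all derivations, the sampled path is drawn from the full derivation distribution without enumerating every derivation explicitly.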

The Clustering Model.
The clustering component of our cDPMM is a Dirichlet process mixture model (DPMM). In our case, the DPMM is a stochastic process defined over a set of bilingual name pairs $D$, and a sample path is a probability distribution of name pair origins on $D$. The Dirichlet process mixture model is defined as follows:
$$H \mid \beta, H_0 \sim \mathrm{DP}(\beta, H_0), \qquad \theta_i \mid H \sim H, \qquad x_i \mid \theta_i \sim F(\theta_i).$$
The DPMM computes the distribution of a set of observed name pairs $x = x_1, x_2, \dots, x_I$ by using a set of latent parameters $\theta = \theta_1, \theta_2, \dots, \theta_I$. Each $\theta_i$ is drawn independently and identically from $H$; each $x_i$ is drawn independently from a distribution $F(\theta_i)$ parametrized by $\theta_i$. Like the Dirichlet process, the DPMM has two basic components: one model for generating an $x_i$ from clusters which have already been generated and one for assigning a probability to $x_i$ being generated by a new cluster. The probability that a cluster $c$ generates a name pair $x_i$ is decided by the similarity between $x_i$ and the name pairs which have been assigned to cluster $c$. The CRP can also be used to describe our DPMM. Every name pair corresponds to a customer, and the dish served on each table corresponds to an origin of name pairs. We use $z = (z_1, z_2, \dots, z_I)$, $z_i \in \{1, \dots, C\}$, to indicate the cluster ID of each transliteration pair in the training set that has been assigned to some cluster and $\theta = \theta_1, \theta_2, \dots, \theta_C$ to represent the parameters of the component associated with each cluster. The generating probability of a transliteration pair $x_i$ for cluster $c$ is calculated as follows:
$$P(z_i = c \mid D, \theta, z_{-i}) \propto \begin{cases} \dfrac{n_c}{I - 1 + \beta}\, P(x_i \mid \theta_c), & 1 \le c \le C, \\ \dfrac{\beta}{I - 1 + \beta}\, P(x_i \mid G_0), & c = C + 1, \end{cases} \tag{8}$$
where $n_c$ is the number of transliteration pairs in the existing cluster $c \in \{1, \dots, C\}$ and $C$ is the number of clusters obtained so far. $z_i$ is the cluster indicator for $x_i$, and $z_{-i}$ is the sequence of observed clusters up to $x_i$.
In our cDPMM, each mixture component is a multinomial DP model, and we use the alignment method introduced in Section 3.2 to train the alignment model for each cluster. Each component $c$ has its own distribution over TUs, whose parameters $\theta_c$ are not fixed but change as new transliteration pairs enter the cluster. For a new component we use (2) to calculate the probability of each TU. We decide the probability that a name pair belongs to a cluster based on the joint channel model. For each name pair $x$ there is a set of TU derivations $\Gamma = \{\gamma \mid \gamma = \langle (\mathbf{s}_1, \mathbf{t}_1), \dots, (\mathbf{s}_K, \mathbf{t}_K) \rangle\}$, where $x = \langle \mathbf{s}_1 \mathbf{s}_2 \cdots \mathbf{s}_K, \mathbf{t}_1 \mathbf{t}_2 \cdots \mathbf{t}_K \rangle$, and the similarity between $x$ and a cluster $c$ is calculated as follows:
$$P(x \mid \theta_c) = \sum_{\gamma \in \Gamma} \prod_{(\mathbf{s}, \mathbf{t}) \in \gamma} P((\mathbf{s}, \mathbf{t}) \mid \theta_c). \tag{9}$$
The input of our cDPMM is a dataset of name pairs $x = (s_1^m, t_1^n)$; the output is the clustering result of all $x$ and the alignment model in each cluster. In the sampling step, the cDPMM first samples a cluster for $x$ based on the cluster distribution given by (8), and then a derivation is sampled for $x$ based on the probability calculated by (9).
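The cluster-sampling step can be illustrated with a short sketch: each existing cluster is weighted by its size times the likelihood of the name pair under that cluster's alignment model, and a new cluster is weighted by the concentration parameter times the likelihood under the base measure. The function names and the symbol `beta` for the clustering concentration parameter are assumptions of this sketch, and the likelihoods are taken as precomputed inputs.

```python
import random


def cluster_posterior(sizes, likelihoods, new_likelihood, beta):
    """Normalised probabilities for each existing cluster plus one new cluster.

    sizes[c] is the number of name pairs already in cluster c and
    likelihoods[c] the probability of the current pair under cluster c.
    The shared normalising denominator cancels, so it is omitted.
    """
    weights = [n * lik for n, lik in zip(sizes, likelihoods)]
    weights.append(beta * new_likelihood)  # option of opening a new cluster
    total = sum(weights)
    return [w / total for w in weights]


def sample_cluster(sizes, likelihoods, new_likelihood, beta, rng=random):
    """Return an existing cluster index, or len(sizes) for a new cluster."""
    probs = cluster_posterior(sizes, likelihoods, new_likelihood, beta)
    return rng.choices(range(len(probs)), weights=probs)[0]
```

With `sizes=[3, 1]`, `likelihoods=[0.2, 0.1]`, `new_likelihood=0.05`, and `beta=2`, the unnormalised weights are 0.6, 0.1, and 0.1, so the large, well-matching cluster is chosen three times out of four.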

Cluster Combination Method and Source Name Classification Model
In [6], Li et al. use the cDPMM only to direct the alignment step. In order to make full use of the clustering and alignment results of the cDPMM, in this paper we train a transliteration model for each name origin cluster. However, an unsupervised clustering model tends to produce many clusters that contain only a small number of name pairs. So that each cluster has enough name pairs to train a reliable model, we propose a method that combines the clusters generated by the cDPMM and forces the size of each origin cluster to be at least one-tenth of the whole training set when decoding a source name. In addition, in order to classify a source name into the suitable origin cluster, an origin-specific language model is trained on each cluster and used for classification.

Cluster Combination Model (CCM).
We use the language model (LM) and transliteration model (TM) perplexities between two clusters, as used by Huang et al. [20], to decide their similarity. We use $D_c = \{(s^{(1)}, t^{(1)}), \dots, (s^{(n_c)}, t^{(n_c)})\}$ to denote the dataset in cluster $c$; $n_c$ is the size of dataset $D_c$. For a cluster $c$, we can train two language models, one on the source names and one on the target names, and we have an alignment model. We use $\theta_c = (\theta_c^{s}, \theta_c^{t}, \theta_c^{a})$ to indicate the model parameters; here $\theta_c^{s}$ ($\theta_c^{t}$) is the language model trained on the source (target) names, and $\theta_c^{a}$ is the alignment model based on our cDPMM. The perplexities between clusters $c_1$ and $c_2$ are computed with these models, and $d(c_1, c_2)$ denotes the resulting distance function between clusters $c_1$ and $c_2$. We greedily combine two clusters into a larger one according to this distance until the size of each cluster reaches one-tenth of the whole training dataset.
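The greedy combination loop can be sketched as follows. Two points are assumptions of this sketch rather than details given in the text: the smallest cluster is the one merged at each step, and it is merged into the cluster at the smallest distance. The distance itself is left as a caller-supplied function, standing in for the perplexity-based distance, and the function name is hypothetical.

```python
def combine_clusters(clusters, distance, min_size):
    """Greedily merge clusters until each holds at least min_size items.

    clusters: list of lists of name pairs.
    distance: callable (cluster_a, cluster_b) -> float, smaller = more similar.
    """
    clusters = [list(c) for c in clusters]  # work on copies
    while len(clusters) > 1 and min(len(c) for c in clusters) < min_size:
        # Take the smallest cluster and find its nearest neighbour.
        small = min(range(len(clusters)), key=lambda i: len(clusters[i]))
        nearest = min(
            (i for i in range(len(clusters)) if i != small),
            key=lambda i: distance(clusters[small], clusters[i]),
        )
        clusters[nearest].extend(clusters[small])
        del clusters[small]
    return clusters
```

Setting `min_size` to one-tenth of the training set size reproduces the stopping rule described above; the loop also stops if only a single cluster remains.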

Source Name Classification Model.
For a source name $s = \langle s_1, s_2, \dots, s_m \rangle$ we choose the most similar cluster. We use the language model trained on the source names of each cluster to classify the source names and choose the transliteration model of the cluster with the highest language model probability. The calculation is as follows:
$$\hat{c} = \arg\max_{c} P(c \mid s) = \arg\max_{c} P(s \mid \theta_c^{s})\, P(c), \qquad P(c) = \frac{n_c}{I}, \tag{12}$$
where $P(c)$ is the prior distribution of the cluster, $n_c$ is the number of name pairs in cluster $c$, and $I$ is the number of name pairs in the whole dataset.
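The classification rule (language model probability of the source name times the cluster prior) can be illustrated with a toy character model. The add-one smoothed bigram model below is an assumption chosen to keep the example short, not the language model used in the experiments, and the class name, function name, and `vocab_size` default are hypothetical.

```python
import math
from collections import Counter


class CharBigramLM:
    """Add-one smoothed character bigram model over source names."""

    def __init__(self, names, vocab_size):
        self.vocab_size = vocab_size
        self.bigrams = Counter()
        self.contexts = Counter()
        for name in names:
            padded = "^" + name + "$"  # start and end markers
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, name):
        padded = "^" + name + "$"
        return sum(
            math.log((self.bigrams[(p, c)] + 1) / (self.contexts[p] + self.vocab_size))
            for p, c in zip(padded, padded[1:])
        )


def classify(name, cluster_names, vocab_size=64):
    """Pick the cluster maximising log P(name | LM_c) + log(n_c / total)."""
    total = sum(len(names) for names in cluster_names.values())

    def score(cluster_id):
        names = cluster_names[cluster_id]
        lm = CharBigramLM(names, vocab_size)  # retrained per call for brevity
        return lm.log_prob(name) + math.log(len(names) / total)

    return max(cluster_names, key=score)
```

In practice the per-cluster models would be trained once and cached; retraining inside `classify` only keeps the sketch compact.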

Experiments
In this section, we conduct a set of experiments to validate the performance of our cDPMM and CCM. First, we choose several datasets with different name origin information to make sure our experimental results are convincing. Second, we analyze the clustering and alignment results of the cDPMM. Third, we use the results of the cDPMM in several ways to train two state-of-the-art decoders and transliterate the test sets. After comparing the cDPMM with GIZA++ and BBAM, we conduct experiments on the same datasets to test our CCM.

Corpora.
To empirically validate our approach, we conduct Western-Chinese name transliteration on three corpora that contain many name pairs and differ widely in their degree of mixed origin. The first two corpora were drawn from the Names of The World's Peoples dictionary (http://zh.wikipedia.org/wiki/世界人名翻译大辞典) published by XinHua Publishing House. The first corpus consists of name pairs originating only from English (EO), and the second consists of names originating from English, Chinese, and Japanese in equal proportions (ECJ-O). The third corpus was created by extracting name pairs from the LDC (Linguistic Data Consortium) Named Entity List (https://catalog.ldc.upenn.edu/LDC2005T34), which contains names from all over the world (Multi-O). We randomly split each corpus into training, development, and test sets with a ratio of 10 : 1 : 1. The details of the three corpora are summarized in Table 1.

Baseline System.
We compare our alignment model cDPMM with GIZA++ [3] and the Bayesian bilingual alignment model (BBAM) [7]. GIZA++ is one of the most popular alignment models; it has been used frequently in machine translation and transliteration with stable, competitive performance and is therefore a convincing baseline.
BBAM is an alignment model designed specifically for transliteration, and it is the alignment method used on each cluster in our cDPMM. Comparing with BBAM therefore isolates the effect of the clustering performed by our cDPMM. We employ two decoding models: a phrase-based machine translation decoder (specifically Moses [12]) and the DirecTL decoder [4]. They are based on different decoding strategies and optimization objectives, which makes our comparison more comprehensive.

Hyperparameters Setting.
In our model, there are several important hyperparameters to be set by hand: (i) $s_{\max}$, the maximum length of the source sequence of a TU; (ii) $t_{\max}$, the maximum length of the target sequence of a TU; (iii) nc, the initial number of clusters for the training data.
We set $s_{\max} = 6$, $t_{\max} = 1$, and nc = 5 empirically based on a small pilot experiment. The Moses decoder is used with default settings except for the distortion-limit, which is set to 0 to ensure monotonic decoding. For the DirecTL decoder the following settings are used: cs = 4, ng = 9, and nBest = 5. cs denotes the size of the context window for features, ng indicates the size of the $n$-gram features, and nBest is the size of the transliteration candidate list used for updating the model in each iteration. In the BBAM and cDPMM, the concentration parameter $\alpha$ is obtained by sampling its value. Following [17] we use a vague gamma prior $\Gamma(10^{-4}, 10^{4})$ and sample new values from a log-normal distribution whose mean is the current value of the parameter and whose variance is 0.3. We use the Metropolis-Hastings algorithm to determine whether a new sample is accepted. The parameters $\lambda_s$ and $\lambda_t$ in (2) are set to $\lambda_s = 4$ and $\lambda_t = 1$.

Table 1: The size of the three multiorigin corpora.
Corpora    Training set    Dev. set    Test set
EO         32,680          3,267       3,267
ECJ-O      32,500          3,250       3,250
Multi-O    33,290          3,328       3,328

Table 2 shows some details of the results obtained with the different alignment strategies. #(Targets) represents the average length of the English character sequences that are aligned to each Chinese sequence. In terms of the length of the alignment targets, the results show GIZA++ > cDPMM > BBAM. GIZA++ has considerably more targets than the other approaches, likely because it overfits the training data. The cDPMM can alleviate overfitting through its BBAM component and at the same time model the diversity caused by multiorigin sequences effectively. Table 3 shows some examples of TU segmentations from the alignments produced by BBAM and cDPMM on the Multi-O corpus. The information in brackets in Table 3 represents the ID of the cluster and the origin of the name pair; the symbol " " indicates a "NULL" alignment. We can see that the Chinese characters "丁 (ding)," "一 (yi)," and "东 (dong)" have different alignments for different origins and that the cDPMM obtains the correct alignments. Figure 2 is the distribution of the size of the clustered datasets in the three corpora. The $x$-axis is the range of the dataset size; the $y$-axis is the number of clusters. Each column represents the number of clusters whose number of name pairs falls in the range (a, b]. Figure 2 shows the cluster size distribution of each corpus.
We can see that most of the clusters have fewer than one thousand name pairs, which may be because some names are much more similar to each other than to other name pairs of the same origin. We can also see that many clusters have fewer than 10 name pairs because of our sampling strategy. In the following sections, we adopt a different strategy to make use of the clustering and alignment results for training cluster-specific transliteration models.

Evaluation Results of Different Alignment Methods.
We run GIZA++ with its default settings as in a standard SMT task. For the BBAM and cDPMM models, we use the alignment sampled after 100 iterations, a number determined by experience, where each iteration takes about 20 minutes, and combine the alignment tables of each cluster. The experiments therefore investigate whether the alignment is improved by the clustering process. The top-10 results are reported in Tables 4 and 5.
Our cDPMM model achieves the highest performance on all three datasets for all evaluation metrics by a considerable margin. Surprisingly, for dataset EO although there is no multiorigin factor, we still observe a notable improvement. This shows that although names may have monolingual origin, there are possible hidden factors (gender or convention) which can allow our model to fit. Other models based on supervised classification or clustering with fixed classes may fail to capture these characteristics.
To guarantee the reliability of the comparative results, we performed significance testing based on paired bootstrap resampling [26]. We found all differences significant ($p < 0.05$).
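Paired bootstrap resampling on top-1 correctness can be sketched as follows. This is a generic illustration of the technique, not the exact test configuration used in the experiments; the function name, the number of resamples, and the one-sided formulation are assumptions of the sketch.

```python
import random


def paired_bootstrap(correct_a, correct_b, num_samples=1000, rng=random):
    """Estimate how often system A fails to beat system B under resampling.

    correct_a / correct_b are per-test-item 0/1 indicators of top-1
    correctness for the two systems, aligned on the same test items.
    The returned fraction acts as a one-sided p-value estimate.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same items")
    n = len(correct_a)
    not_better = 0
    for _ in range(num_samples):
        # Resample test items with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = sum(correct_a[i] for i in idx)
        score_b = sum(correct_b[i] for i in idx)
        if score_a <= score_b:
            not_better += 1
    return not_better / num_samples
```

A returned value below 0.05 would correspond to the significance threshold quoted above.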

Evaluation Results of Cluster Combination Method.
In order to make the most of the clustering results of the cDPMM and exploit the origin information in the decoding step, we train a single transliteration model for each cluster. As Figure 2 shows, most of the clusters contain a small number of name pairs, far from enough to train a good transliteration model. We therefore use the cluster perplexities introduced in Section 4 to combine each small cluster with a larger one which has the highest perplexity, and we make sure each origin has at least one-tenth of the whole training set.
Based on our combination method, we finally obtained 3 clusters for the EO corpus, 6 clusters for the ECJ-O corpus, and 4 clusters for the Multi-O corpus. As we cluster the name pairs based on spelling similarity, and many factors such as gender and convention can cause spelling differences, we simply number the clusters in each corpus in ascending order of dataset size. Table 6 shows the detailed information of the combined results. Then, we classify the three development sets (in Table 1) for each cluster by using (10) to calculate the similarity probability, which is based on the language model and transliteration model. Each name pair is classified into the most similar cluster.
For CCM, we use only Moses as the decoder. After extracting the phrase table and tuning the parameters on each cluster-specific corpus, we obtain a series of origin-specific transliteration models.
The test source names are each assigned to an origin-specific transliteration model based on (12). Table 7 shows the top-1 accuracy of each origin-specific transliteration model. Table 8 compares the cDPMM and CCM methods using Moses as the decoder; here "CCM" represents the cluster combination method. From Tables 7 and 8, we can see that the CCM achieves a significant improvement in all metrics on the EO and Multi-O corpora. Although the CCM does not outperform the cDPMM on ECJ-O, its performance is still comparable.

Conclusions
In this paper we exploit a coupled Dirichlet process mixture model (cDPMM) to address the overfitting and multiorigin identification issues that arise when transliterating a Western foreign name into Chinese. Furthermore, according to the cluster information generated, we build cluster-specific transliteration models in the decoding step by combining small clusters into larger ones based on the perplexities of the name language models and transliteration models. Our experiments on three multiorigin datasets show that the cDPMM improves the performance of a transliteration generation system compared with two other state-of-the-art aligners, and that the cluster combination method (CCM) achieves a significant further improvement on two of the three corpora. In the future, we will try other target languages such as Japanese and Korean to test whether our method is language independent. Improving the sampling efficiency is another direction for future work.