Loanword identification has been studied in recent years to alleviate data sparseness in several natural language processing (NLP) tasks, such as machine translation and cross-lingual information retrieval. However, recent studies on this topic usually focus on high-resource languages (such as Chinese, English, and Russian); for low-resource languages such as Uyghur and Mongolian, loanword identification tends to perform worse due to limited resources and the lack of annotated data. To overcome this problem, we first propose a lexical constraint-based data augmentation method to generate training data for low-resource loanword identification; then, a loanword identification model based on a log-linear RNN is introduced to improve the performance of low-resource loanword identification by incorporating features such as word-level embeddings, character-level embeddings, pronunciation similarity, and part-of-speech (POS) tags into one model. Experimental results on loanword identification in Uyghur (in this study, we focus on Arabic, Chinese, Russian, and Turkish loanwords in Uyghur) show that our proposed method achieves the best performance compared with several strong baseline systems.
Bilingual data play a very important role in cross-lingual natural language processing (NLP) tasks, such as cross-lingual text classification, cross-lingual information retrieval, and neural machine translation. However, bilingual data are often difficult to obtain. Lexical borrowing happens in almost every language; Figure
Examples of loanwords in Uyghur.
Loanword identification is the task of finding loanwords from a specific language (the donor language) in texts of another language (the recipient language). There are three main kinds of loanword identification methods: (1) rule-based methods; (2) statistics-based methods; and (3) deep learning-based methods. Early studies on loanword identification were often based on rules. For example, McCoy and Frank [
As a commonly used way to alleviate data sparseness, data augmentation is one of the most popular approaches to this problem. For example, Liu et al. [
After investigation, we find that there are two important clues for loanword identification: semantic similarity and pronunciation similarity. To incorporate both clues into one feature, we propose to represent semantic similarity as a word-level feature and pronunciation similarity as a character-level feature, and we then fuse these two features into one. Meanwhile, we incorporate the fused feature, the pronunciation feature, and the POS feature into a log-linear RNN to achieve the best performance in loanword identification.
The main contributions of this study are as follows. First, a lexical constraint-based data augmentation method is proposed to generate more training data for the loanword identification task. Second, we incorporate multilevel features, a pronunciation similarity feature, and a POS feature into a log-linear RNN model to improve the performance of loanword identification for low-resource languages. Third, we conduct experiments on loanword (Arabic, Chinese, Russian, and Turkish) identification in Uyghur; experimental results show that our proposed model achieves the best performance compared with several strong baseline systems.
The rest of this paper is organized as follows. Section
In this section, we present some work related to our study.
Lexical borrowing has received relatively little attention in the natural language processing area. Tsvetkov and Dyer [
The main goal of data augmentation in NLP is to generate additional, synthetic data from the data at hand to alleviate data sparseness during model training [
There are two main types of sequence labeling methods in NLP: gradient-based methods and search-based methods [
In previous studies, large-scale annotated data are used to train a loanword identification model, treating loanword detection as a sequence labeling problem. However, annotated data for loanword identification are very difficult to obtain, so one contribution of this study is data augmentation for loanword identification: we propose to use a lexical constraint GAN to generate more sentences for training the loanword identification model. Another contribution of this paper is the combination of several features in the loanword identification model; we introduce three features: an embedding fusion feature (word level and character level), a pronunciation similarity feature, and a POS feature.
Our proposed method includes two parts: (1) data augmentation for loanword identification and (2) a log-linear RNN-based loanword identification model.
To generate more training data for loanword identification, we propose a lexical constraint GAN-based data augmentation model. Recent loanword identification methods are often trained on features such as pronunciation similarity and POS similarity. However, such methods usually suffer from data sparseness or a lack of semantic knowledge. To overcome this, we introduce a log-linear RNN-based loanword identification model which combines word-level and character-level embedding fusion features, pronunciation similarity, and POS features to predict Arabic, Chinese, Russian, and Turkish loanwords in Uyghur. The main idea of loanword identification in low-resource languages is as follows: we first use the data augmentation model to generate more training data for loanword identification in Uyghur; then, several features such as word- and character-level embedding features, pronunciation similarity, and POS features are used to build a multiple feature fusion-based loanword identification model (Figure
The framework of our proposed model.
Recent studies on the loanword identification task often suffer from a shortage of training data. In this study, we propose to use a lexical constraint GAN to generate more annotated data for the loanword identification task. As an extension of the traditional GAN, our data augmentation model also consists of two main parts: generators and a discriminator. The difference is that we use two generators and one discriminator to build the data augmentation model for low-resource loanword identification. We introduce the details of our proposed model in this section.
We follow the work of [
We can define the backward generator
The generator of the entire sentence can be defined as
The two generators have the same structure but distinct parameters. To improve the coherence of the constrained sentence, we employ an LSTM-based language model with a dynamic attention mechanism (called attRNN-LM) as the generator.
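To make the two-generator scheme concrete, the following sketch assembles a sentence around a lexical constraint (e.g., a known loanword): a backward generator samples the left context in reverse order, and a forward generator samples the right context. This is a minimal sketch under our own simplifying assumptions; it uses plain LSTM language models (the dynamic attention of attRNN-LM is omitted), and all class and function names are illustrative.

```python
import torch
import torch.nn as nn

class LSTMGenerator(nn.Module):
    """A minimal LSTM language model standing in for one generator
    (the attention mechanism of attRNN-LM is omitted for brevity)."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def step(self, token, state=None):
        emb = self.embed(token).unsqueeze(1)       # (1, 1, emb_dim)
        output, state = self.lstm(emb, state)
        return self.out(output.squeeze(1)), state  # logits: (1, vocab_size)

def sample_half(generator, constraint_id, eos_id, max_len=20):
    """Sample tokens one at a time, conditioned on the constraint word."""
    tokens, state = [], None
    token = torch.tensor([constraint_id])
    for _ in range(max_len):
        logits, state = generator.step(token, state)
        token = torch.multinomial(torch.softmax(logits, -1), 1).squeeze(1)
        if token.item() == eos_id:
            break
        tokens.append(token.item())
    return tokens

def generate_constrained(bwd_gen, fwd_gen, constraint_id, eos_id):
    """The backward half is generated in reverse order, then flipped."""
    left = sample_half(bwd_gen, constraint_id, eos_id)[::-1]
    right = sample_half(fwd_gen, constraint_id, eos_id)
    return left + [constraint_id] + right
```

By construction, the constraint word is guaranteed to appear in every generated sentence, which is exactly what the loanword identification task needs from the augmented data.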
Another important component in our proposed method is the discriminator, which takes sentence pairs as input and distinguishes whether a given sentence pair is real or generated. It guides the joint training of the two generators by assigning proper reward signals. This module can be a binary classifier or a ranker. Following previous methods [
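A minimal sketch of the binary-classifier variant of the discriminator (the ranker variant is not shown): it encodes a real or generated sentence and outputs the probability that it is real. The architecture details here are our own assumptions, not the exact configuration of the full model.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Binary classifier: is the input sentence real training data or generated?"""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, sentence):                    # (batch, seq_len) token ids
        _, (h_n, _) = self.encoder(self.embed(sentence))
        return torch.sigmoid(self.score(h_n[-1]))  # probability of "real"
```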
To train the data augmentation model effectively, we first pretrain the backward and forward generators by standard MLE loss. Different from [
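The sketch below shows the two training stages as described above: teacher-forced MLE pretraining of each generator on real sentences, followed by using the discriminator's output as a reward signal for the generators. The exact reward formulation of the full model may differ; this is illustrative only.

```python
import torch
import torch.nn.functional as F

def mle_pretrain_step(generator, sentence, optimizer):
    """Teacher-forced MLE: predict every next token of a real sentence."""
    inputs, targets = sentence[:, :-1], sentence[:, 1:]
    output, _ = generator.lstm(generator.embed(inputs))
    logits = generator.out(output)                  # (batch, seq-1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def generator_reward(discriminator, generated_sentence):
    """Reward for adversarial training: the discriminator's 'real' probability."""
    with torch.no_grad():
        return discriminator(generated_sentence).mean().item()
```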
Loanword identification can be defined as a sequence labeling problem. However, different from a traditional sequence labeling task, loanword identification can exploit additional knowledge such as semantic similarity, pronunciation similarity, and POS tags. As data augmentation provides more annotated data for model training, we propose to use a deep neural network model to identify loanwords in low-resource settings. The principal feature we use is the fusion of word- and character-level features, which combines word relations and pronunciation similarity for loanword identification. We also incorporate external features such as pronunciation similarity and POS information into our method. In this section, we first describe the features used in our proposed method and then define the details of the loanword identification method.
We use three kinds of features in our proposed method: the fusion feature, pronunciation similarity, and POS feature.
We use the dot-product attention in this study:
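As a reference for the formulation above, a generic (unscaled) dot-product attention can be sketched as follows: the attention weights are the softmax of query-key dot products, and the output is the weighted sum of the values. Whether scaling or masking is applied is an implementation detail not fixed here.

```python
import torch

def dot_product_attention(query, keys, values):
    """weights = softmax(q . K^T); output = weights . V"""
    scores = query @ keys.transpose(-2, -1)  # (batch, q_len, k_len)
    weights = torch.softmax(scores, dim=-1)  # attention distribution
    return weights @ values                  # (batch, q_len, v_dim)
```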
The most important feature in the loanword identification task is the pronunciation similarity between a word in the recipient language and its corresponding word in the donor language. As convolutional neural networks (CNNs) have been proven to capture character-level information in NLP tasks, CNNs can process the sequence in the current receptive field akin to the attention mechanism [
We follow the study of [
Max denotes a max pooling operation. We use it to capture the most significant features, i.e., those assigned the highest value for a given filter. Therefore, in the time step
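A minimal sketch of the character-level CNN with max pooling described above: a 1-D convolution slides over the character embeddings of a word, and max pooling keeps, for each filter, the position with the highest activation. Hyperparameters and names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character-level feature extractor: Conv1d over char embeddings + max pooling."""
    def __init__(self, n_chars, char_emb_dim=32, n_filters=64, kernel_size=3):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_emb_dim)
        self.conv = nn.Conv1d(char_emb_dim, n_filters, kernel_size, padding=1)

    def forward(self, char_ids):                         # (batch, max_word_len)
        emb = self.char_embed(char_ids).transpose(1, 2)  # (batch, emb_dim, len)
        feats = torch.relu(self.conv(emb))               # (batch, n_filters, len)
        return feats.max(dim=-1).values                  # strongest response per filter
```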
To fuse the word-level and character-level features, we propose to concatenate the two features with automatic adjustment (Figure
The multilevel feature fusion method used in our proposed loanword identification model. Character embeddings and word embeddings are taken as input for the feature selection layer.
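We read the "automatic adjustment" as a learned, per-dimension gate over the concatenated vectors; the sketch below reflects that reading and is an assumption rather than the exact mechanism of our model.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Concatenate word- and character-level vectors, then re-weight each
    dimension with a learned sigmoid gate (the 'automatic adjustment')."""
    def __init__(self, word_dim, char_dim):
        super().__init__()
        self.gate = nn.Linear(word_dim + char_dim, word_dim + char_dim)

    def forward(self, word_vec, char_vec):
        concat = torch.cat([word_vec, char_vec], dim=-1)
        g = torch.sigmoid(self.gate(concat))  # per-dimension weight in (0, 1)
        return g * concat                     # fused, re-weighted feature
```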
Log-linear models play a considerable role in statistics and machine learning. The most important reason we chose the log-linear model as the basic framework of our loanword prediction model is that features can be easily added to it. Additionally, the log-linear model has been widely used in NLP tasks such as SMT and NMT.
To adapt to the loanword prediction task and to include rich features such as the BiLSTM state, POS, and semantic features in the model, we use log-linear RNNs [
The hidden state at time
We assume that we have a priori fixed a certain background function
Therefore, the loanword label of
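A minimal sketch of a log-linear RNN tagger in this spirit: a BiLSTM score for each label is combined linearly with externally computed feature scores (pronunciation similarity, POS), and the result is normalized per token with a softmax. The exact feature functions and parameterization of the original log-linear RNN may differ; note how adding a feature only adds one weight.

```python
import torch
import torch.nn as nn

class LogLinearRNNTagger(nn.Module):
    """BiLSTM label scores plus a weighted sum of external feature scores,
    normalized per token into label probabilities."""
    def __init__(self, input_dim, n_labels, n_extra_feats, hid_dim=128):
        super().__init__()
        self.rnn = nn.LSTM(input_dim, hid_dim, bidirectional=True, batch_first=True)
        self.rnn_score = nn.Linear(2 * hid_dim, n_labels)
        # one learned weight per external feature (pronunciation similarity, POS, ...)
        self.feat_weights = nn.Parameter(torch.ones(n_extra_feats))

    def forward(self, fused_feats, extra_feats):
        # fused_feats: (batch, seq, input_dim) fused word/char embeddings
        # extra_feats: (batch, seq, n_labels, n_extra_feats) precomputed scores
        h, _ = self.rnn(fused_feats)
        scores = self.rnn_score(h)                          # (batch, seq, n_labels)
        scores = scores + (extra_feats * self.feat_weights).sum(-1)
        return torch.log_softmax(scores, dim=-1)            # per-token label log-probs
```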
During training of our proposed loanword identification model, we use the cross-entropy loss to optimize the performance of our model [
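A sketch of one training step with this objective; since the model above outputs log-probabilities, the negative log-likelihood loss is exactly the cross-entropy.

```python
import torch.nn.functional as F

def train_step(model, fused_feats, extra_feats, gold_labels, optimizer):
    """One optimization step; gold_labels: (batch, seq) label ids."""
    log_probs = model(fused_feats, extra_feats)
    loss = F.nll_loss(log_probs.reshape(-1, log_probs.size(-1)),
                      gold_labels.reshape(-1))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```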
In this section, we evaluate the effectiveness of our proposed method.
To fully evaluate the effectiveness of our proposed model, we conduct Arabic, Chinese, Russian, and Turkish loanword identification experiments in Uyghur. The datasets used in our experiments are listed in Table
Size of datasets.
| Data type | Arabic | Chinese | Russian | Turkish |
| --- | --- | --- | --- | --- |
| Sentences | 100,780 | 125,085 | 143,290 | 132,500 |
| Loanwords | 690 | 2,450 | 1,274 | 2,009 |
To train the data augmentation model, we also collect some monolingual data from the Internet for each language (Table
Size of monolingual data.
| Languages | Uyghur | Arabic | Chinese | Russian | Turkish |
| --- | --- | --- | --- | --- | --- |
| Size (words) | 0.32 B | 1.05 B | 1.70 B | 1.14 B | 1.49 B |
We train the data augmentation model on datasets described in Table
We implemented the log-linear RNNs ourselves. We also developed an extended version of the edit distance algorithm to adapt it to the loanword identification task. For the POS feature, we first pretrained a Uyghur POS tagging model; then, we tagged all Uyghur sentences with this model.
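We do not spell out the extension in full; as an illustration, the sketch below turns a weighted edit distance between two phonetic transcriptions into a similarity score, where a substitution-cost table can encode phone pairs that commonly alternate during borrowing. The cost table and the normalization are our own assumptions.

```python
def pronunciation_similarity(src, tgt, sub_cost=None):
    """Similarity in [0, 1] from a weighted edit distance over phone sequences.
    sub_cost: optional dict mapping (phone_a, phone_b) to a reduced cost."""
    sub_cost = sub_cost or {}
    n, m = len(src), len(tgt)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if src[i - 1] == tgt[j - 1] \
                else sub_cost.get((src[i - 1], tgt[j - 1]), 1.0)
            d[i][j] = min(d[i - 1][j] + 1.0,       # deletion
                          d[i][j - 1] + 1.0,       # insertion
                          d[i - 1][j - 1] + cost)  # (weighted) substitution
    return 1.0 - d[n][m] / max(n, m, 1)
```

For example, assigning a reduced cost to a vowel pair such as ('a', 'e') raises the similarity of word pairs that differ only by that alternation.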
We compared our method with several strong baseline systems: Rule [
Results on data augmentation and size of training data can be found in Tables
Evaluation of data augmentation methods.
| Donor | Metrics | B/F-LM | BF-MLE | Ours |
| --- | --- | --- | --- | --- |
| Arabic | BLEU-4 | 0.15 | 0.15 | 0.21 |
| | Self-BLEU | 64.32 | 64.58 | 63.46 |
| | TER | 66.19 | 66.44 | 65.82 |
| Chinese | BLEU-4 | 0.16 | 0.17 | 0.23 |
| | Self-BLEU | 64.05 | 64.30 | 63.78 |
| | TER | 64.23 | 65.02 | 63.98 |
| Russian | BLEU-4 | 0.18 | 0.18 | 0.23 |
| | Self-BLEU | 62.76 | 63.05 | 62.64 |
| | TER | 63.69 | 63.92 | 63.45 |
| Turkish | BLEU-4 | 0.19 | 0.20 | 0.25 |
| | Self-BLEU | 62.51 | 62.86 | 62.18 |
| | TER | 62.46 | 63.14 | 62.04 |
Size of training data generated in data augmentation (Uyghur sentences).
| Lang | Arabic | Chinese | Russian | Turkish |
| --- | --- | --- | --- | --- |
| Size | 302,480 | 325,790 | 314,208 | 336,852 |
The results of loanword identification with different methods can be found in Table
Loanword identification experimental results on different methods.
| Donor | Model | Loanword identification results (%) | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Russian | Rule (+) | 72.04 | 72.89 | 69.31 | 70.18 | 70.65 | 71.28 |
| | CRF (+) | 71.63 | 72.45 | 67.28 | 68.15 | 69.39 | 70.23 |
| | BLSTM-CNN (+) | 71.45 | 72.26 | 70.50 | 71.31 | 70.97 | 71.78 |
| | ClEmbedding (+) | 73.12 | 73.94 | 71.84 | 72.62 | 72.47 | 73.27 |
| | Ours (+) | 74.80 | 75.62 | 73.64 | 74.20 | 74.22 | 74.90 |
| Arabic | Rule (+) | 69.05 | 69.84 | 68.17 | 69.02 | 68.61 | 69.43 |
| | CRF (+) | 69.83 | 70.65 | 67.42 | 68.29 | 68.60 | 69.45 |
| | BLSTM-CNN (+) | 68.70 | 69.52 | 69.85 | 70.67 | 69.27 | 70.09 |
| | ClEmbedding (+) | 72.95 | 73.76 | 72.03 | 72.85 | 72.49 | 73.30 |
| | Ours (+) | 73.91 | 74.62 | 72.35 | 73.06 | 73.12 | 73.83 |
| Turkish | Rule (+) | 72.02 | 72.86 | 69.87 | 70.50 | 70.93 | 71.66 |
| | CRF (+) | 71.46 | 72.29 | 69.02 | 69.95 | 70.22 | 71.10 |
| | BLSTM-CNN (+) | 71.25 | 72.04 | 70.43 | 71.18 | 70.84 | 71.61 |
| | ClEmbedding (+) | 72.96 | 73.64 | 73.08 | 73.85 | 73.02 | 73.74 |
| | Ours (+) | 75.24 | 76.09 | 74.36 | 75.14 | 74.80 | 75.61 |
| Chinese | Rule (+) | 70.32 | 71.13 | 69.77 | 70.58 | 70.04 | 70.85 |
| | CRF (+) | 70.85 | 71.64 | 69.24 | 70.05 | 70.04 | 70.84 |
| | BLSTM-CNN (+) | 70.58 | 71.34 | 69.98 | 70.79 | 70.28 | 71.06 |
| | ClEmbedding (+) | 71.67 | 72.48 | 71.35 | 72.14 | 71.51 | 72.31 |
| | Ours (+) | 74.30 | 75.07 | 72.88 | 73.95 | 73.58 | 74.51 |
Table
The first part of Table
The second part of Table
Table
Loanword identification results on different features (Turkish and Chinese loanword identification as examples).
| Donor | Feature(s) | Loanword identification results (%) | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Turkish | +fusion | 74.14 | 74.95 | 73.28 | 74.16 | 73.71 | 74.55 |
| | +pronun | 73.96 | 74.68 | 73.02 | 73.94 | 73.49 | 74.31 |
| | +pos | 72.54 | 73.36 | 72.25 | 73.07 | 72.39 | 73.21 |
| | +fusion, pronun | 73.40 | 74.20 | 72.64 | 73.40 | 73.02 | 73.80 |
| | +fusion, pos | 74.63 | 75.42 | 73.70 | 74.52 | 74.16 | 74.97 |
| | +pronun, pos | 74.25 | 75.06 | 73.45 | 74.24 | 73.85 | 74.65 |
| | +all | 75.24 | 76.09 | 74.36 | 75.14 | 74.80 | 75.61 |
| Chinese | +fusion | 73.15 | 73.94 | 71.74 | 72.56 | 72.44 | 73.24 |
| | +pronun | 72.76 | 73.52 | 71.32 | 72.16 | 72.03 | 72.83 |
| | +pos | 71.30 | 72.09 | 70.58 | 71.25 | 70.94 | 71.67 |
| | +fusion, pronun | 72.43 | 73.25 | 71.02 | 71.84 | 71.72 | 72.54 |
| | +fusion, pos | 73.61 | 74.40 | 72.26 | 73.02 | 72.93 | 73.70 |
| | +pronun, pos | 73.25 | 74.03 | 71.97 | 72.89 | 72.60 | 73.46 |
| | +all | 74.30 | 75.07 | 72.88 | 73.95 | 73.58 | 74.51 |
In Table
The main goal of this study is to improve the performance of loanword identification for low-resource languages. Our contribution consists of two parts: (1) data augmentation for loanword identification and (2) loanword identification based on multiple feature fusion. In particular, data augmentation alleviates the data sparseness that occurs in training the loanword identification model, and we optimize the loanword identification model by introducing several features, such as the fusion feature of word- and character-level embeddings, pronunciation similarity, and the POS feature, into one model based on a log-linear RNN. To evaluate the effectiveness of our proposed method, we conduct experiments against several baseline models. Experiments show that our proposed loanword identification method achieves the best performance.
In our future work, we plan to improve the robustness of the loanword identification model by generating more diverse training data and incorporating richer contextual information into it.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This research was funded by the National Natural Science Foundation of China (no. 61906158).