Identification of Code-Switched Sentences and Words Using Language Modeling Approaches

Globalization and multilingualism contribute to code-switching, the phenomenon in which speakers produce utterances containing words or expressions from a second language. Processing code-switched sentences is a significant challenge for multilingual intelligent systems. This study proposes a language modeling approach to the problem of code-switching language processing, dividing the problem into two subtasks: the detection of code-switched sentences and the identification of code-switched words in sentences. A code-switched sentence is detected on the basis of whether it contains words or phrases from another language. Once the code-switched sentences are detected, the positions of the code-switched words in the sentences are identified. Experimental results show that the language modeling approach achieved an F-measure of 80.43% and an accuracy of 79.01% for detecting Mandarin-Taiwanese code-switched sentences. For the identification of code-switched words, the word-based and POS-based models achieved F-measures of 41.09% and 53.08%, respectively.


Introduction
Increasing globalism and multilingualism have significantly increased demand for multilingual services in current intelligent systems [1]. For example, an intelligent traveling system that supports multiple language inputs and outputs can assist travelers in booking hotels, ordering in restaurants, and navigating attractions. Multinational corporations would benefit from developing automatic multilingual call centers to address customer problems worldwide. In such multilingual environments, an input sentence may contain constituents from two or more languages, a phenomenon known as code-switching or language mixing [2][3][4][5][6]. Table 1 lists several definitions of code-switching described in previous studies.
A code-switched sentence consists of a primary language and a secondary language, and the secondary language is usually manifested in the form of short expressions, such as words and phrases. This phenomenon is increasingly common, with multilingual speakers often freely moving from their native dialect to subsidiary dialects to entirely foreign languages, and patterns of code-switching vary dynamically with different audiences in different situations. When dealing with code-switched input, intelligent systems such as dialog systems must be capable of identifying the various languages and recognizing the speaker's intention embedded in the input [7,8]. However, it is a significant challenge for intelligent systems to deal with multiple languages and unknown words from various languages.
In Taiwan, while Mandarin is the official language, Taiwanese and Hakka are used as a primary language by more than 75% and 10% of the population, respectively [9]. Moreover, English is the most popular foreign language, and compulsory English instruction begins in elementary school. The constant mixing of these languages results in various kinds of code-switching, such as Mandarin sentences mixed with words and phrases from Taiwanese, Hakka, and English. Such code-switching is not limited to everyday conversation but can frequently be heard in television dramas and even current events commentary programs. This paper takes a linguistic view towards the problem of code-switching language processing, focusing on code-switching between Mandarin and Taiwanese. We propose a language modeling approach which divides the problem into two subtasks: the detection of code-switched sentences followed by the identification of code-switched words within the sentences. The first step detects whether or not a given Mandarin sentence contains Taiwanese words. Once a code-switched sentence is detected, the positions of the code-switched words are identified within the sentence. These code-switched words can be used for lexicon augmentation to improve understanding of code-switched sentences.
The rest of this work is organized as follows. Section 2 presents related work. Section 3 describes the language modeling approach to the identification of code-switched sentences and words in the sentences. Section 4 summarizes the experimental results. Conclusions are finally drawn in Section 5, along with recommendations for future research.

Table 1 entries (definitions of code-switching):
Hymes et al. [2]: A common term for alternative use of two or more languages, varieties of a language, or even speech styles.
Hoffmann [3]: The alternate use of two languages or linguistic varieties within the same utterance or during the same conversation.
Myers-Scotton [4]: The use of two or more languages in the same conversation, usually within the same conversational turn or even within the same sentence of that turn.

Related Work
Research on code-switching speech processing mainly focuses on speech recognition [9][10][11][12][13][14], language identification [15,16], text-to-speech synthesis [17], and code-switching speech database creation [18]. Lyu et al. proposed a three-step data-driven phone clustering method to train an acoustic model for Mandarin, Taiwanese, and Hakka [9]. They also discussed the issue of training with unbalanced data. Wu et al. proposed an approach to segmenting and identifying mixed-language speech utterances [10]. They first segmented the input speech utterance into a sequence of language-dependent segments using acoustic features. The language-specific features were then integrated in the identification process. Chan et al. developed a Cantonese-English mixed-language speech recognition system, including acoustic modeling, language modeling, and language identification algorithms [11]. Hong et al. developed a Mandarin-English mixed-language speech recognition system for resource-constrained environments, which can be realized in embedded systems such as personal digital assistants (PDAs) [12]. Ahmed and Tan proposed a two-pass code-switching speech recognition framework: automatic speech recognition and rescoring [13]. Vu et al. recently developed a speech recognition system for code-switching in conversational speech [14]. For language identification, Lyu et al. proposed a word-based lexical model integrating acoustic, phonetic, and lexical cues to build a language identification system [15]. Yeong and Tan proposed the use of morphological structures and syllable sequences for language identification in Malay-English code-switching sentences [16]. For speech synthesis, Qian et al. developed a text-to-speech system that can generate Mandarin-English mixed-language utterances [17].
Research on code-switching and multilingual language processing includes applications of text mining [19][20][21][22], information retrieval [23][24][25], ontology-based knowledge management [26], and unknown word extraction [27]. For text mining, Seki et al. extracted opinion holders for discriminating opinions that are viewed from different perspectives (author and authority) in both Japanese and English [19]. Yang et al. used self-organizing maps to cluster multilingual documents [20]. A multilingual Web directory was then constructed to facilitate multilingual Web navigation. Zhang et al. addressed the problem of multilingual sentence categorization and novelty mining on English, Malay, and Chinese sentences [21]. They proposed to first categorize similar sentences and then identify new information from them. De Pablo-Sánchez et al. devised a bootstrapping algorithm to acquire named entities and linguistic patterns from English and Spanish news corpora [22]. This lightly supervised method can acquire useful information from unannotated corpora using a small set of seeds provided by human experts. For information retrieval, Gey et al. pointed out several directions for cross-lingual information retrieval (CLIR) research [23]. Tsai et al. used the FRank ranking algorithm to build a merge model for multilingual information retrieval [24]. Jung discovered useful multilingual tags annotated in social texts [25]. He then used these tags for query expansion to allow users to query in one language but obtain additional information in another language. For other application domains, Segev and Gal proposed an ontology-based knowledge management model to enhance portability and reduce costs in multilingual information systems deployment [26]. Wu et al. proposed the use of mutual information and entropy to extract unknown words from code-switched sentences [27].

Language Modeling Approach
Language modeling approaches have been successfully used in many applications, such as grammar error correction [28], code-switching language processing [29], and lexical substitution [30][31][32]. For our task, a code-switched sentence generally receives a higher probability from a code-switching language model than from a noncode-switching one. Thus, we built code-switching and noncode-switching language models and compared the probabilities they assign in order to identify code-switched sentences and the code-switched words within them. Figure 1 shows the system framework. First, a corpus of code-switched and noncode-switched sentences is collected to build the respective code-switching and noncode-switching language models. To identify code-switched sentences, we compare the probability of each test sentence output by the code-switching language model against the output of the noncode-switching one to determine whether or not the test sentence is code-switched. To identify code-switched words within the sentences, we select the $n$-gram with the highest probability output by the code-switching language model and then compare it against the output of the noncode-switching one to verify whether the $i$-th word in the given sentence is a code-switched word.

Corpus Collection.
A noncode-switching corpus refers to a set of sentences containing just one language. Because Mandarin is the primary language in this study, we used the Sinica corpus released by the Association for Computational Linguistics and Chinese Language Processing (ACLCLP) as the noncode-switching corpus. A code-switching corpus refers to a set of Mandarin sentences featuring Taiwanese words. However, it can be difficult to collect a large number of such sentences, and training a language model on insufficient data may incur the data sparseness problem. Therefore, we used the more common Mandarin-English sentences as the code-switching corpus, based on the assumption that the code-switching phenomenon in Mandarin-English sentences has a certain degree of similarity to that in Mandarin-Taiwanese sentences, because in Taiwan, both English and Taiwanese are secondary languages with respect to Mandarin. The Mandarin-English sentences were collected from a large corpus of web-based news articles, which were then segmented using the CKIP word segmentation system developed by the Academia Sinica, Taiwan (http://ckipsvr.iis.sinica.edu.tw/) [33,34]. The sentences containing words with the part-of-speech (POS) tag "FW" (i.e., foreign word) were selected as code-switched sentences.
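The selection step described above can be sketched as follows. This is a minimal illustration, assuming each segmented sentence is already available as a list of (word, POS tag) pairs; the type alias, the function names, and the toy sentences are hypothetical and are not part of the CKIP system.

```python
from typing import List, Tuple

# Assumed representation: one sentence = list of (word, POS tag) pairs.
TaggedSentence = List[Tuple[str, str]]


def has_foreign_word(sentence: TaggedSentence, foreign_tag: str = "FW") -> bool:
    """Return True if any word in the sentence carries the foreign-word tag."""
    return any(tag == foreign_tag for _, tag in sentence)


def select_code_switched(corpus: List[TaggedSentence]) -> List[TaggedSentence]:
    """Keep only the sentences that contain at least one foreign word."""
    return [sentence for sentence in corpus if has_foreign_word(sentence)]


# Toy corpus with romanized words and illustrative tags.
toy_corpus = [
    [("wo", "Nh"), ("xihuan", "VK"), ("pizza", "FW")],
    [("ta", "Nh"), ("kan", "VC"), ("shu", "Na")],
]
selected = select_code_switched(toy_corpus)
```

Only the first toy sentence is kept, because it is the only one with a word tagged "FW".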

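Detection of Code-Switched Sentences.
Generally, an $n$-gram language model is used to predict the $i$-th word based on the previous $n-1$ words using a probability function $P(w_i \mid w_{i-n+1} \cdots w_{i-1})$. Given a sentence $s = w_1 \cdots w_m$, the noncode-switching $n$-gram language model is defined as

$$P_{ncs}(s) = \prod_{i=1}^{m} P(w_i \mid w_1 \cdots w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1} \cdots w_{i-1}), \tag{1}$$

where

$$P(w_i \mid w_{i-n+1} \cdots w_{i-1}) = \frac{C(w_{i-n+1} \cdots w_{i-1} w_i)}{C(w_{i-n+1} \cdots w_{i-1})}, \tag{2}$$

where $C(\cdot)$ denotes the frequency counts of the $n$-grams retrieved from the noncode-switching corpus (i.e., Sinica corpus). Instead of estimating the surface form of the next word, the code-switching $n$-gram language model estimates the probability that the next word is a code-switched word, that is, $P(cs_i \mid w_1 \cdots w_{i-1})$, defined as

$$P_{cs}(s) = \prod_{i=1}^{m} P(cs_i \mid w_1 \cdots w_{i-1}) \approx \prod_{i=1}^{m} P(cs_i \mid w_{i-n+1} \cdots w_{i-1}), \tag{3}$$

where

$$P(cs_i \mid w_{i-n+1} \cdots w_{i-1}) = \frac{C(w_{i-n+1} \cdots w_{i-1}\, cs_i)}{C(w_{i-n+1} \cdots w_{i-1})}. \tag{4}$$

To estimate $P(cs_i \mid w_1 \cdots w_{i-1})$, the code-switching corpus is processed by replacing the code-switched words (i.e., the words with the POS tag "FW") in the Mandarin-English sentences with a special character $cs$. The frequency counts $C(w_{i-n+1} \cdots w_{i-1}\, cs_i)$ can then be retrieved from the code-switching corpus. This processing may also reduce the effect of the data sparseness problem in language model training.
Once the two language models are built, they can be compared to detect whether a given sentence contains code-switching. That is,

$$R(s) = \frac{P_{cs}(s)}{P_{ncs}(s)}. \tag{5}$$

The sentence $s$ is predicted to be a code-switched sentence if the probability of the sentence output by the code-switching language model is at least as large as that output by the noncode-switching one (i.e., $R(s) \geq 1$).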
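The comparison of the two models can be sketched in a few lines. This is an illustration under stated assumptions, not the setup used in the experiments (which relied on the SRILM toolkit): the `NGramModel` class, the `<cs>` placeholder token, the `<s>` padding token, and the add-one smoothing used to avoid zero probabilities are all choices made for this sketch, and the romanized toy sentences are invented.

```python
import math
from collections import Counter

CS = "<cs>"    # hypothetical placeholder that replaces code-switched words
PAD = "<s>"    # hypothetical sentence-start padding token


def contexts(sentence, n):
    """Yield (word, preceding n-1 words) for every position in the sentence."""
    padded = [PAD] * (n - 1) + list(sentence)
    for i in range(n - 1, len(padded)):
        yield padded[i], tuple(padded[i - n + 1:i])


class NGramModel:
    """Count-based n-gram model with add-one smoothing (an assumption here)."""

    def __init__(self, sentences, n=2):
        self.n = n
        self.ngram = Counter()
        self.context = Counter()
        self.vocab = set()
        for sentence in sentences:
            self.vocab.update(sentence)
            for word, ctx in contexts(sentence, n):
                self.ngram[ctx + (word,)] += 1
                self.context[ctx] += 1

    def prob(self, word, ctx):
        # +1 in the denominator reserves mass for unseen words.
        numerator = self.ngram[tuple(ctx) + (word,)] + 1
        return numerator / (self.context[tuple(ctx)] + len(self.vocab) + 1)


def log_p_ncs(sentence, model):
    """Log-probability of the word sequence under the noncode-switching model."""
    return sum(math.log(model.prob(w, ctx)) for w, ctx in contexts(sentence, model.n))


def log_p_cs(sentence, model):
    """Log-probability that each position is code-switched, given its context."""
    return sum(math.log(model.prob(CS, ctx)) for _, ctx in contexts(sentence, model.n))


def is_code_switched(sentence, cs_model, ncs_model):
    """Predict code-switching when the probability ratio is at least 1."""
    return log_p_cs(sentence, cs_model) >= log_p_ncs(sentence, ncs_model)


ncs_model = NGramModel([["wo", "xihuan", "chi", "fan"], ["ta", "xihuan", "kan", "shu"]])
cs_model = NGramModel([["wo", "xihuan", CS], ["ta", "xihuan", CS, "yinyue"]])

mixed = is_code_switched(["wo", "xihuan", "X"], cs_model, ncs_model)
plain = is_code_switched(["wo", "xihuan", "chi", "fan"], cs_model, ncs_model)
```

Working in log space avoids numerical underflow on long sentences; comparing the two log-probabilities is equivalent to testing whether the ratio of the two probabilities is at least 1.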

Identification of Code-Switched Words.
This step identifies the positions of the code-switched words within the sentences. To this end, the code-switching $n$-gram language model (3) is applied to each test sentence, and a probability of being a code-switched word is assigned to every word position in the sentence. Among all the $n$-grams in the sentence, the one with the highest probability indicates the most likely position of a code-switched word. That is,

$$cs^* = \arg\max_{cs_i,\; 1 \le i \le m} P(cs_i \mid w_{i-n+1} \cdots w_{i-1}), \tag{6}$$

where $cs^*$ denotes the best hypothesis of the code-switched word in the sentence. However, not all $n$-grams with the highest probability suggest correct positions. Therefore, we further propose a verification mechanism to determine whether to accept the best hypothesis. That is,

$$P(cs^* \mid w_{i-n+1} \cdots w_{i-1}) > P(w_i \mid w_{i-n+1} \cdots w_{i-1}), \tag{7}$$

where $i$ is the position of the best hypothesis, $P(cs^* \mid w_{i-n+1} \cdots w_{i-1})$ represents the probability of the best hypothesis in the code-switching corpus, and $P(w_i \mid w_{i-n+1} \cdots w_{i-1})$ represents its probability in the noncode-switching corpus. The best hypothesis $cs^*$ is accepted if its probability in the code-switching corpus is greater than that in the noncode-switching corpus.
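The select-then-verify procedure can be illustrated with a short sketch. It assumes the two models are exposed as plain functions, `p_cs(context)` and `p_ncs(word, context)`; these names, the `top_n` parameter, the `<s>` padding token, and the toy probability tables are hypothetical and stand in for trained models.

```python
from typing import Callable, List, Sequence, Tuple

Context = Tuple[str, ...]


def identify_code_switched_words(
    sentence: Sequence[str],
    p_cs: Callable[[Context], float],
    p_ncs: Callable[[str, Context], float],
    n: int = 2,
    top_n: int = 1,
) -> List[int]:
    """Return the accepted positions (0-based) of code-switched words."""
    padded = ["<s>"] * (n - 1) + list(sentence)
    scored = []
    for i in range(len(sentence)):
        ctx = tuple(padded[i:i + n - 1])  # the n-1 words preceding position i
        scored.append((p_cs(ctx), i, ctx))
    # Rank positions by the code-switching probability, highest first.
    scored.sort(key=lambda item: item[0], reverse=True)
    accepted = []
    for score, i, ctx in scored[:top_n]:
        # Verification: keep the candidate only if the code-switching model
        # scores it higher than the noncode-switching model scores the word.
        if score > p_ncs(sentence[i], ctx):
            accepted.append(i)
    return accepted


# Toy probability tables standing in for trained models.
CS_TABLE = {("<s>",): 0.125, ("wo",): 0.14, ("xihuan",): 0.375}
NCS_TABLE = {("chi", ("xihuan",)): 0.5}


def toy_p_cs(ctx: Context) -> float:
    return CS_TABLE.get(ctx, 0.05)


def toy_p_ncs(word: str, ctx: Context) -> float:
    return NCS_TABLE.get((word, ctx), 0.1)


found = identify_code_switched_words(["wo", "xihuan", "X"], toy_p_cs, toy_p_ncs)
rejected = identify_code_switched_words(["wo", "xihuan", "chi"], toy_p_cs, toy_p_ncs)
```

In the first toy sentence the position after "xihuan" is both the top-ranked candidate and passes verification; in the second, the same position is top-ranked but is rejected because the noncode-switching model assigns the observed word a higher probability.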

Experimental Results
This section first explains the experimental setup, including experiment data, implementation of language modeling, and evaluation metrics. We then present experimental results for the identification of both Mandarin-Taiwanese and Mandarin-English code-switched sentences and words within the sentences.

Experimental Setup.
The test set included 393 sentences, of which 131 were Mandarin only (i.e., noncode-switched), another 131 were Mandarin sentences containing Taiwanese words, and the remaining 131 were Mandarin sentences containing English words. For the evaluation of Mandarin-Taiwanese sentences, $n$-gram models for both code-switching and noncode-switching were trained using the SRILM toolkit [35] with $n = 2$ and $n = 3$ (i.e., bigram and trigram). For the evaluation of Mandarin-English sentences, the CKIP word segmentation system [33,34] was used because it can associate a POS tag "FW" with English words/characters within the sentences. The evaluation metrics included recall, precision, F-measure, and accuracy. The recall was defined as the number of code-switched sentences correctly identified by the method divided by the total number of code-switched sentences in the test set. The precision was defined as the number of code-switched sentences correctly identified by the method divided by the number of code-switched sentences identified by the method. The F-measure was defined as (2 × recall × precision)/(recall + precision). The accuracy was defined as the number of sentences correctly identified by the method divided by the total number of sentences in the test set.
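The four metrics defined above can be computed as in the following sketch. The function name `evaluate` and the toy labels are hypothetical and are not experimental data; `True` marks a code-switched sentence.

```python
from typing import Dict, Sequence


def evaluate(gold: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, float]:
    """Compute recall, precision, F-measure, and accuracy for binary labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # correctly detected
    fp = sum(1 for g, p in pairs if not g and p)      # false alarms
    fn = sum(1 for g, p in pairs if g and not p)      # missed
    tn = sum(1 for g, p in pairs if not g and not p)  # correctly left alone
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = recall + precision
    f_measure = 2 * recall * precision / total if total else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Toy labels: three code-switched and two noncode-switched sentences.
toy_gold = [True, True, True, False, False]
toy_pred = [True, True, False, True, False]
scores = evaluate(toy_gold, toy_pred)
```

For the toy labels, two of the three code-switched sentences are found and one noncode-switched sentence is wrongly flagged, so recall, precision, and F-measure are all 2/3 and accuracy is 3/5.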

Evaluation on Mandarin-Taiwanese Code-Switched Sentences.
To identify Mandarin-Taiwanese code-switched sentences, the code-switching and noncode-switching bigram/trigram language models were used to determine whether each test sentence was code-switched.
To identify code-switched words in Mandarin-Taiwanese code-switched sentences, all word bigrams and trigrams in each test sentence were first ranked according to their probabilities. The top $N$ word bigrams/trigrams were then selected as candidates for further verification using (7). For instance, top 1 means that the bigram/trigram with the highest probability in a given test sentence is considered a candidate. If the candidate $n$-gram is accepted by the verification method, then the position indicated by the $n$-gram is considered a foreign word. Similarly, top 2 means that the method can propose two candidates for verification. To examine the effect of the data sparseness problem, we used the part-of-speech (POS) tags of words to build additional POS bigram/trigram models from the code-switching corpus. In addition to the word/POS $n$-gram models, we also implemented a baseline system that randomly guesses the positions of code-switched words in the sentences; for this baseline, top $N$ means that the system can randomly propose $N$ candidate positions. Table 3 shows the results for the identification of code-switched words. The results show that the F-measure of the baseline system (Random) was only around 18% to 25%, indicating that identifying code-switched words is more difficult than identifying code-switched sentences. In addition, the proposed word/POS $n$-gram models significantly outperformed Random. For the word-based $n$-gram models, the word bigram model achieved an F-measure of around 41%, which was much better than that of both the word trigram model and Random. Once the POS tags were used to build the language models, both the POS bigram and trigram models outperformed their corresponding word-based models in terms of F-measure, as well as recall and precision. This finding indicates that training with the POS tags can reduce the impact of the data sparseness problem. In addition, as shown in Figure 2, the improvement obtained with POS tags was significantly greater for the trigram model than for the bigram model, because the trigram model tends to suffer from a more serious data sparseness problem than the bigram model when training data is insufficient. Overall, the best performance of the POS $n$-gram models was an F-measure of 53.08% (POS trigram, top 1).
Code-switched word identification can also be evaluated by allowing the methods to propose more than one candidate, that is, top 1 to top 3. Table 3 shows that, with more candidates included for verification, more code-switched words were correctly identified, thus dramatically increasing the recall of all methods, but at the cost of reduced precision. Overall, moving from top 1 to top 2 increased the F-measure for all methods except the POS trigram, whereas moving to top 3 increased the F-measure only for Random and the word trigram.
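The effect of POS tags on data sparseness can be illustrated with a small sketch. It simply counts bigrams over word sequences and over the corresponding POS sequences; the helper `ngram_counts`, the romanized toy sentences, and the tags attached to them are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngram_counts(sequences: List[Sequence[str]], n: int = 2) -> Counter:
    """Count all n-grams occurring in a list of token sequences."""
    counts: Counter = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return counts


# Toy tagged sentences: (word, POS tag) pairs with illustrative tags.
tagged: List[List[Tuple[str, str]]] = [
    [("wo", "Nh"), ("xihuan", "VK"), ("pizza", "FW")],
    [("ta", "Nh"), ("ai", "VK"), ("jazz", "FW")],
]

word_counts = ngram_counts([[word for word, _ in s] for s in tagged])
pos_counts = ngram_counts([[tag for _, tag in s] for s in tagged])
```

In this toy corpus every word bigram occurs exactly once, while the two POS bigrams each occur twice. The POS model therefore sees each event more often with the same amount of data, which is the intuition behind the reduced sparseness reported for the POS-based models.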

Evaluation on Mandarin-English Code-Switched Sentences.
To identify code-switched words in Mandarin-English code-switched sentences, the words associated with the POS tag "FW" (representing a foreign word) by the CKIP word segmentation system were proposed as the answers. The Random system was also implemented to guess the English words in the test sentences. Table 4 shows the comparative results. As expected, the CKIP word segmentation system can provide very precise information for identifying English words in sentences, thus yielding very good performance. Indeed, the CKIP system has been under development for over ten years and is still updated periodically. For the Random system, the F-measure was around 19% to 27%, which was similar to that (18% to 25%, Table 3) for code-switched word identification in Mandarin-Taiwanese code-switched sentences.

Conclusions
This work presents a language modeling method for identifying sentences featuring code-switching and for identifying the code-switched words within those sentences. Experimental results show that the language modeling approach achieved an F-measure of 80.43% and an accuracy of 79.01% for the detection of Mandarin-Taiwanese code-switched sentences. For the identification of code-switched words in Mandarin-Taiwanese code-switched sentences, the POS $n$-gram models outperformed the word $n$-gram models, mainly because of the reduced impact of the data sparseness problem. The highest F-measures (top 1) for the word-based and POS-based models were 41.09% and 53.08%, respectively. For code-switched word identification in Mandarin-English code-switched sentences, the CKIP word segmentation system achieved very high performance (95.02% F-measure).
Future work will focus on improving system performance by incorporating other effective machine learning algorithms and features, such as sentence structure analysis. The proposed method could also be integrated into practical applications such as a multilingual dialog system to improve effectiveness in dealing with the code-switching problem.

Figure 1: Framework of identification of code-switched sentences and words in the sentences.

Figure 2: Comparative results of top $N$ performance on code-switched word identification in Mandarin-Taiwanese code-switched sentences.

Table 1: Definitions of code-switching.

Table 2: Results of the identification of Mandarin-Taiwanese code-switched sentences.

Table 3: Results of code-switched word identification in Mandarin-Taiwanese code-switched sentences.

Table 4: Results of code-switched word identification in Mandarin-English code-switched sentences.