This paper investigates the recognition of unknown words in Chinese parsing. Two methods are proposed to handle this problem. The first is a modification of a character-based model: the emission probability of an unknown word is modeled using the first and last characters of the word, which reduces the POS tag ambiguity of unknown words and thereby improves parsing performance. In addition, a novel method using graph-based semisupervised learning (SSL) is proposed to improve the syntactic parsing of unknown words. Its goal is to discover additional lexical knowledge from a large amount of unlabeled data to help syntactic parsing. The method propagates lexical emission probabilities to unknown words by building similarity graphs over the words of labeled and unlabeled data, and the derived distributions are incorporated into the parsing process. The proposed methods are effective in dealing with unknown words. Empirical results on the Penn Chinese Treebank and the TCT Treebank demonstrate their effectiveness.
Parsing plays an important role in natural language processing. In recent years, Chinese parsing has received a great deal of attention, and many researchers have presented Chinese parsing models [
As far as we know, a large portion of the errors in Chinese parsing comes from unknown words. A robust parser must therefore have a mechanism for processing unknown words, one that discovers POS tag and feature information about them during parsing. A number of studies design hand-crafted rules or exploit rich morphological features to handle such words. However, Chinese words tend to have greater POS tag ambiguity than English words, and the morphological properties of Chinese words make it difficult to predict the POS type of unknown words. For this reason, we present a more effective character-based model to handle unknown words, following [
This paper is structured as follows. Section
The Berkeley parser [
In order to obtain a more refined and accurate grammar, Petrov et al. [
Since the lexical model can only generate words observed in the training data, a separate module is needed to handle the OOV words that appear in test sentences. There are two ways to estimate the emission probability of an OOV word
Nevertheless, the features applied to Chinese words are simpler than those for English: only the last character of a word is taken into account when estimating the emission probabilities of rare words. Before applying this model, OOV words are checked for membership in the temporal noun (NT) class (by checking if the word contains characters like “
In this study, we make this modification for two reasons. First, the Berkeley parser is designed for English, and only a limited number of classes of unknown words are handled for Chinese. In the parsing phase, if an unknown word belongs to the digit or date categories, the Berkeley parser has some built-in ability to handle it. For words outside these classes, the parser ignores character-level information and decides the word's category only from the rare-word POS tag statistics. Let
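To make the character-based modification concrete, the following is a minimal sketch of how an unknown word's emission probability can be estimated from the first and last characters of rare training words. The rare-word threshold, the additive-smoothing constant, and the independence assumption combining the two characters are our own illustrative choices, not the paper's exact formulation.

```python
from collections import defaultdict

class CharBasedOOVModel:
    """Sketch of a character-based emission model for unknown words:
    P(word | tag) is approximated from the first and last characters
    of rare words seen in training (illustrative assumptions)."""

    def __init__(self, rare_threshold=5, alpha=0.5):
        self.rare_threshold = rare_threshold  # assumed rare-word cutoff
        self.alpha = alpha                    # assumed smoothing constant
        self.first_counts = defaultdict(lambda: defaultdict(float))
        self.last_counts = defaultdict(lambda: defaultdict(float))
        self.tag_totals = defaultdict(float)

    def train(self, tagged_words, word_freq):
        # Collect first/last-character statistics from rare words only,
        # since rare words best approximate the behaviour of OOV items.
        for word, tag in tagged_words:
            if word_freq[word] <= self.rare_threshold:
                self.first_counts[tag][word[0]] += 1.0
                self.last_counts[tag][word[-1]] += 1.0
                self.tag_totals[tag] += 1.0

    def emission(self, word, tag):
        # Combine the two character distributions under an independence
        # assumption: P(word | tag) ~ P(first | tag) * P(last | tag),
        # with simple additive smoothing for unseen characters.
        total = self.tag_totals[tag] + self.alpha
        p_first = (self.first_counts[tag][word[0]] + self.alpha) / total
        p_last = (self.last_counts[tag][word[-1]] + self.alpha) / total
        return p_first * p_last
```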
As we can see in Table
The effect of the character-based model on TCT.
| Model | Length | LP | LR | F1 |
|---|---|---|---|---|
| Baseline | All | 80.97 | 80.99 | 80.98 |
| Baseline | ≤40 | 83.56 | 83.55 | 83.55 |
| Character-based | All | 82.76 | 82.47 | 82.83 |
| Character-based | ≤40 | 84.96 | 85.08 | 85.02 |
Graph-based label propagation, a critical subclass of semisupervised learning (SSL), has been widely used and shown to outperform other SSL methods [
The emphasis of this paper is on presenting a method to recognize Chinese unknown words by using two different kinds of data sources, namely labeled and unlabeled texts, to construct a specific similarity graph. In essence, the problem can be treated as incorporating useful information from unlabeled data, such as prior knowledge or label constraints, into the supervised model. In our approach, we employ a transductive graph-based label propagation method to obtain such information; that is, label distributions are inferred from a similarity graph constructed over labeled and unlabeled data. The derived label distributions are then regarded as “soft evidence” to augment the parsing of Chinese unknown words through a new learning objective function. The algorithm consists of the following two stages (see Algorithm
(Algorithm 1: procedures 1–3 construct the similarity graph and propagate label distributions; procedures 4–7 incorporate the propagated distributions into the parser.)
In this stage (corresponding to procedures 1–3 in Algorithm
In this stage, the vertices of the first graph are all word trigrams occurring in the labeled and unlabeled sentences. The graph construction is nontrivial; as Das and Petrov [
Features employed to measure the similarity between two vertices.
| Feature |
|---|
| Trigram + Context |
| Trigram |
| Left Context |
| Right Context |
| Center Word |
| Left Word + Right Word |
| Left Word + Right Context |
| Left Context + Right Word |
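To make the graph construction concrete, the sketch below extracts the feature set from the table above for each trigram vertex and links each vertex to its most similar vertices by cosine similarity over feature counts. The padding token, the k-nearest-neighbour cutoff, the quadratic pairwise loop, and the use of raw counts (rather than, e.g., PMI-weighted features) are all illustrative simplifications.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def trigram_features(words, i):
    """Features from the table above for the trigram centered at position
    i (words[i-1:i+2]), with one word of context on each side. "<PAD>" is
    an assumed sentence-boundary token."""
    w = lambda j: words[j] if 0 <= j < len(words) else "<PAD>"
    x1, x2, x3 = w(i - 1), w(i), w(i + 1)   # the trigram itself
    lc, rc = w(i - 2), w(i + 2)             # left / right context words
    return [
        ("trigram+context", (lc, x1, x2, x3, rc)),
        ("trigram", (x1, x2, x3)),
        ("left_context", (lc, x1)),
        ("right_context", (x3, rc)),
        ("center_word", x2),
        ("left_word+right_word", (x1, x3)),
        ("left_word+right_context", (x1, x3, rc)),
        ("left_context+right_word", (lc, x1, x3)),
    ]

def cosine(c1, c2):
    # Cosine similarity between two sparse feature-count vectors.
    dot = sum(v * c2[f] for f, v in c1.items())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def build_knn_graph(sentences, k=5):
    """Collect feature counts per trigram type (one vertex per type) and
    keep the k most similar neighbours of each vertex."""
    feats = defaultdict(Counter)
    for words in sentences:
        for i in range(1, len(words) - 1):
            feats[tuple(words[i - 1:i + 2])].update(trigram_features(words, i))
    edges = defaultdict(list)
    for u, v in combinations(feats, 2):   # all-pairs loop, for clarity only
        sim = cosine(feats[u], feats[v])
        if sim > 0:
            edges[u].append((sim, v))
            edges[v].append((sim, u))
    return {u: sorted(nbrs, reverse=True)[:k] for u, nbrs in edges.items()}
```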
To induce label distributions for unlabeled words over the entire graph from the labeled vertices, the label propagation algorithm with sparsity-inducing penalties (Sparsity) proposed by [
The estimated label distribution
Mathematically, the problem of label propagation is to get the optimal emission label distribution
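The precise objective depends on the propagation algorithm used; for concreteness, the graph objective of Das and Petrov (2011), on which sparsity-penalized variants build, has the following form (the notation here is ours, not reproduced from this paper):

$$
\mathcal{C}(q) \;=\; \sum_{i \in V_l} \lVert r_i - q_i \rVert^2 \;+\; \mu \sum_{(i,j) \in E} w_{ij}\,\lVert q_i - q_j \rVert^2 \;+\; \nu \sum_{i \in V} \lVert q_i - U \rVert^2
$$

subject to $\sum_{y} q_i(y) = 1$ and $q_i(y) \ge 0$, where $q_i$ is the POS label distribution at vertex $i$, $r_i$ the empirical distribution at a labeled vertex in $V_l$, $w_{ij}$ the edge weight, $U$ the uniform distribution, and $\mu$, $\nu$ hyperparameters. The first term keeps labeled vertices close to their empirical distributions, the second smooths distributions across similar vertices, and the third regularizes toward uniform; a sparsity-inducing penalty replaces this last regularizer so that each vertex concentrates its probability mass on few POS tags.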
Through the construction of the similarity graph and the propagation of labels in this stage, each unlabeled word obtains a POS tag distribution.
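A minimal iterative solver for an objective of the form above can be sketched as follows, assuming the graph format returned by build_knn_graph earlier; the hyperparameter values are illustrative, not tuned.

```python
def propagate(graph, seed_dist, tags, mu=0.5, nu=0.01, iters=10):
    """Jacobi-style updates for the propagation objective sketched above.
    graph: vertex -> [(weight, neighbour), ...] (assumed symmetric).
    seed_dist: labeled vertex -> empirical POS distribution.
    Returns a POS distribution for every vertex in the graph."""
    uniform = 1.0 / len(tags)
    q = {v: dict(seed_dist.get(v, {t: uniform for t in tags})) for v in graph}
    for _ in range(iters):
        new_q = {}
        for v, nbrs in graph.items():
            is_seed = 1.0 if v in seed_dist else 0.0
            denom = is_seed + mu * sum(w for w, _ in nbrs) + nu
            new_q[v] = {}
            for t in tags:
                # Closed-form coordinate update: seed evidence plus
                # weighted neighbour mass plus the uniform regularizer.
                num = is_seed * seed_dist.get(v, {}).get(t, 0.0)
                num += mu * sum(w * q.get(u, {}).get(t, uniform) for w, u in nbrs)
                num += nu * uniform
                new_q[v][t] = num / denom
        q = new_q
    return q
```

If the seed distributions and initial values are normalized, each update keeps the distributions on the probability simplex, so no explicit projection step is needed.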
In this stage (corresponding to procedures 4–7 in Algorithm
After the preceding steps, we obtain a lexicon of unlabeled words with label distributions. This lexicon is treated as an OOV lexicon, which covers most of the OOV words that appear in the test data but not in the training data. This OOV lexicon is then incorporated into the Berkeley parser. Our insertion strategy is as follows: when an OOV word is detected, we first check whether the OOV lexicon contains the word; if so, the corresponding estimate is used; otherwise, the built-in OOV word model (mentioned in Section
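The insertion strategy amounts to a lookup with a fallback. A minimal sketch, with illustrative names: oov_lexicon holds the propagated distributions from the previous stage, and builtin_model stands in for the parser's built-in rare-word module (e.g., the CharBasedOOVModel sketched earlier).

```python
def oov_emission(word, tag, oov_lexicon, builtin_model):
    """Prefer the propagated POS distribution from the graph-based OOV
    lexicon; otherwise fall back to the parser's built-in rare-word model."""
    if word in oov_lexicon:
        # Propagated POS distribution, e.g. {"NN": 0.7, "VV": 0.2, ...}
        return oov_lexicon[word].get(tag, 0.0)
    return builtin_model.emission(word, tag)
```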
The proposed approaches in this paper differ from previous OOV recognition models. Collins [
The experimental data are mainly taken from the Chinese Treebank (CTB-5.0) and TCT Treebank [
Summary statistics of the CTB-5.0 data.
| | Train | Unlabeled | Dev | Test |
|---|---|---|---|---|
| #Sentence | 17,785 | 19,075 | 352 | 348 |
| #Word | 485,230 | 1,110,947 | 6,821 | 8,008 |
| #OOV | — | — | 382 | 263 |
Summary statistics of the TCT data.
| | Train | Unlabeled | Dev | Test |
|---|---|---|---|---|
| #Sentence | 14,045 | 19,075 | 1,755 | 1,758 |
| #Word | 377,303 | 1,110,947 | 47,836 | 48,449 |
| #OOV | — | — | 1,928 | 1,916 |
We first run the experiment on the TCT Treebank with the character-based model. The model achieves an overall POS tagging accuracy of 94.80%, only slightly higher than the Berkeley baseline. This may be because the proposed model cannot extract sufficiently discriminative features from unknown words to improve POS tagging. However, the parsing result is 82.83%, a large improvement over the baseline of 80.98%. The detailed results are shown in Table
POS and parsing accuracy on TCT in character-based model.
| Model | Length | LP | LR | F1 | POS |
|---|---|---|---|---|---|
| Baseline | All | 80.97 | 80.99 | 80.98 | 94.51 |
| Baseline | ≤40 | 83.56 | 83.55 | 83.55 | 94.56 |
| Character-based | All | 82.76 | 82.47 | 82.83 | 94.80 |
| Character-based | ≤40 | 84.96 | 85.08 | 85.02 | 94.76 |
Next, we run experiments with the graph-based OOV model on the CTB-5.0 Treebank and the TCT Treebank separately. In our model, the parameter
POS and parsing accuracy on CTB in graph-based OOV model.
| Model | Length | LP | LR | F1 | POS |
|---|---|---|---|---|---|
| Baseline | All | 78.34 | 82.68 | 80.45 | 94.88 |
| Baseline | ≤40 | 81.78 | 85.63 | 83.66 | 95.58 |
| Graph-based | All | 78.90 | 83.20 | 80.99 | 95.77 |
| Graph-based | ≤40 | 82.38 | 86.34 | 84.31 | 96.31 |
POS and parsing accuracy on TCT in graph-based OOV model.
| Model | Length | LP | LR | F1 | POS |
|---|---|---|---|---|---|
| Baseline | All | 80.97 | 80.99 | 80.98 | 94.51 |
| Baseline | ≤40 | 83.56 | 83.55 | 83.55 | 94.56 |
| Graph-based | All | 81.30 | 81.32 | 81.31 | 95.51 |
| Graph-based | ≤40 | 83.92 | 83.91 | 83.92 | 95.60 |
In this paper, we use a modified character-based model to improve the performance of a PCFG-LA parser. We also show, for the first time, that graph-based semisupervised learning can improve the performance of a PCFG-LA parser on OOV words. The approach mainly uses a
The authors declare that there is no conflict of interest regarding the publication of this paper.
The authors would like to thank all reviewers for the very careful reading and helpful suggestions. The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for the funding support for their research, under Reference nos. MYRG076 (Y1-L2)-FST13-WF and MYRG070 (Y1-L2)-FST12-CS.