Semi-Supervised Learning of Statistical Models for Natural Language Understanding

Natural language understanding aims to specify a computational model that maps sentences to their semantic meaning representations. In this paper, we propose a novel framework to train statistical models without using expensive fully annotated data. In particular, the input of our framework is a set of sentences labeled with abstract semantic annotations. These annotations encode the underlying embedded semantic structural relations without explicit word/semantic tag alignment. The proposed framework can automatically induce derivation rules that map sentences to their semantic meaning representations. The learning framework is applied to two statistical models, conditional random fields (CRFs) and hidden Markov support vector machines (HM-SVMs). Our experimental results on the DARPA communicator data show that both CRFs and HM-SVMs outperform the baseline approach, the previously proposed hidden vector state (HVS) model, which is also trained on abstract semantic annotations. In addition, the proposed framework outperforms two other baseline approaches, a hybrid framework combining HVS and HM-SVMs and discriminative training of HVS, with relative error reductions of about 25% and 15% in F-measure, respectively.


Introduction
Given a sentence such as "I want to fly from Denver to Chicago," its semantic meaning can be represented as FROMLOC(CITY(Denver)) TOLOC(CITY(Chicago)).
Natural language understanding can be considered as a mapping problem where the aim is to map a sentence to its semantic meaning representation (or abstract semantic annotation) as shown above. It is a structured classification task which predicts output labels (semantic tag or concept sequences) from input sentences where the output labels have rich internal structures.
Early approaches rely on hand-crafted semantic grammar rules to fill slots in semantic frames using word pattern and semantic tokens [1,2]. Such rule-based approaches are typically domain-specific and often fragile. In contrast, statistical approaches are able to accommodate the variations found in real data and hence can in principle be more robust. They can be categorized into three types: generative approaches, discriminative approaches, and a hybrid of the two.
Generative approaches learn the joint probability model, $P(S, C)$, of an input sentence $S$ and its semantic tag sequence $C$, then compute $P(C \mid S)$ using Bayes' rule, and finally take the most probable semantic tag sequence $\hat{C}$. The hidden Markov model (HMM), a generative model, has been predominantly employed in statistical semantic parsing. It models sequential dependencies by treating a semantic parse sequence as a Markov chain, which leads to an efficient dynamic programming formulation for inference and learning. Discriminative approaches directly model the posterior probability $P(C \mid S)$ and learn mappings from $S$ to $C$. Conditional random fields (CRFs), as one representative example, define a conditional probability distribution over label sequences given an observation sequence, rather than a joint distribution over both label and observation sequences [3]. Another example is the hidden Markov support vector machines (HM-SVMs) [4], which combine the flexibility of kernel methods with the idea of HMMs to predict a label sequence given an input sequence.
Nevertheless, the statistical models mentioned above require fully annotated corpora for training, which are difficult to obtain in practical applications. This motivates the investigation of training statistical models on abstract semantic annotations without the use of expensive token-style annotations. This is a highly challenging problem because the derivation from each sentence to its abstract semantic annotation is not annotated in the training data and is considered hidden.
A hierarchical hidden state structure could be used to model embedded structural context in sentences, such as the hidden vector state (HVS) model [5], which learns a probabilistic pushdown automaton. However, it cannot incorporate a large number of correlated lexical or syntactic features in input sentences and cannot handle any arbitrary embedded relations since it only supports right-branching semantic structures.
In this paper, we propose a novel learning framework to train statistical models from unaligned data. Firstly, it generates semantic parses by computing expectations using initial model parameters. Secondly, parsing results are then filtered based on a measure describing the level of agreement with the sentence abstract semantic annotations. Thirdly, the filtered parsing results are fed into model learning. With the reestimated parameters, the learning of statistical models goes to the next iteration until no more improvements could be achieved. The proposed framework has two advantages: one is that only abstract semantic annotations are required for training without the explicit word/semantic tag alignment; and another is that the proposed learning framework can be easily extended for training any discriminative models on abstract semantic annotations.
We apply the proposed learning framework to two statistical models, CRFs and HM-SVMs. Experimental results on the DARPA communicator data show that the framework on both CRFs and HM-SVMs outperforms the baseline approach, the previously proposed HVS model. In addition, the proposed framework outperforms two other approaches, a hybrid framework combining HVS and HM-SVMs and discriminative training of HVS, with relative error reductions of about 25% and 15% in F-measure, respectively.
The rest of this paper is organized as follows. Section 2 gives a brief introduction of CRFs and HM-SVMs, followed by a review on the existing approaches for training semantic parsers on abstract annotations. The proposed framework is presented in Section 3. Experimental setup and results are discussed in Section 4. Finally, Section 5 concludes the paper.

Related Work
In this section, we first briefly introduce CRFs and HM-SVMs. Then, we review the existing approaches for training semantic parsers on abstract semantic annotations.

Conditional Random Fields (CRFs).
Linear-chain CRFs, as a discriminative probabilistic model over sequences of feature vectors and label sequences, have been widely used to model sequential data. This model is analogous to maximum entropy models for structured outputs. By making a first-order Markov assumption on states, a linear-chain CRF defines a distribution over a state sequence $C = \{c_1, c_2, \ldots, c_T\}$ given an input sequence $S = \{s_1, s_2, \ldots, s_T\}$ ($T$ is the length of the sequence) as

$$P(C \mid S) = \frac{\prod_{t=1}^{T} \Phi_t(c_{t-1}, c_t, S)}{Z(S)}, \tag{1}$$

where the partition function $Z(S)$ is the normalization constant that makes the probability of all state sequences sum to one and is defined as $Z(S) = \sum_{C} \prod_{t=1}^{T} \Phi_t(c_{t-1}, c_t, S)$. By exploiting the Markov assumption, $Z(S)$ can be calculated efficiently by variants of the standard dynamic programming algorithms used in HMMs instead of summing over the exponentially many possible state sequences $C$. $\Phi_t(c_{t-1}, c_t, S)$ can be factorized as

$$\Phi_t(c_{t-1}, c_t, S) = \exp\Big(\sum_{k} \theta_k f_k(c_{t-1}, c_t, S, t)\Big), \tag{2}$$

where $\theta_k$ is the real weight for each feature function $f_k(c_{t-1}, c_t, S, t)$. The feature functions describe some aspect of a transition from $c_{t-1}$ to $c_t$ as well as $c_t$ and the global characteristics of $S$. For example, $f_k$ may have value 1 when POS($s_{t-1}$) = DT and POS($s_t$) = NN, which means that the previous word $s_{t-1}$ has the POS tag "DT" (determiner) and the current word $s_t$ has the POS tag "NN" (noun, singular common). The final model parameters for CRFs are a set of real weights $\Theta = \{\theta_k\}$, one for each feature.
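The dynamic programming mentioned above can be illustrated with a minimal Python sketch that computes the log partition function of a linear-chain CRF with the forward recursion and checks it against brute-force enumeration. The label set, the two feature functions, and their weights are hypothetical toy values chosen for illustration and are not taken from the experiments; representing the start of the sequence by a `None` previous label is likewise an assumption.

```python
import itertools
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_potential(prev, cur, words, t, weights, features):
    """Log of the local potential: sum_k theta_k * f_k(prev, cur, words, t)."""
    return sum(w * f(prev, cur, words, t) for w, f in zip(weights, features))


def log_partition_forward(words, labels, weights, features):
    """Forward recursion; cost grows as O(T * |labels|^2)."""
    alpha = {c: log_potential(None, c, words, 0, weights, features) for c in labels}
    for t in range(1, len(words)):
        alpha = {
            c: logsumexp([alpha[p] + log_potential(p, c, words, t, weights, features)
                          for p in labels])
            for c in labels
        }
    return logsumexp(list(alpha.values()))


def sequence_log_score(tags, words, weights, features):
    """Unnormalized log score of one complete tag sequence."""
    prev, total = None, 0.0
    for t, cur in enumerate(tags):
        total += log_potential(prev, cur, words, t, weights, features)
        prev = cur
    return total


def log_partition_brute_force(words, labels, weights, features):
    """Enumerates all |labels|^T sequences; only feasible for tiny inputs."""
    return logsumexp([sequence_log_score(tags, words, weights, features)
                      for tags in itertools.product(labels, repeat=len(words))])


# Hypothetical toy model (not from the paper).
LABELS = ["FROMLOC", "TOLOC", "DUMMY"]
FEATURES = [
    lambda prev, cur, words, t: 1.0 if words[t] == "Denver" and cur == "FROMLOC" else 0.0,
    lambda prev, cur, words, t: 1.0 if prev == "DUMMY" and cur == "TOLOC" else 0.0,
]
WEIGHTS = [1.5, 0.7]
WORDS = ["from", "Denver", "to", "Chicago"]

log_z = log_partition_forward(WORDS, LABELS, WEIGHTS, FEATURES)
log_z_check = log_partition_brute_force(WORDS, LABELS, WEIGHTS, FEATURES)
```

The two values agree up to floating-point error, while the forward recursion avoids enumerating every state sequence.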

Hidden Markov Support Vector Machines (HM-SVMs).
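For HM-SVMs [4], the function $F(S, C)$ is assumed to be linear in some combined feature representation of $S$ and $C$: $F(S, C) := \langle w, \Phi(S, C) \rangle$. The parameters $w$ are adjusted so that the true semantic tag sequence $C_i$ scores higher than all other tag sequences $C \in \bar{\mathcal{C}}_i := \mathcal{C} \setminus C_i$ with a large margin. To achieve the goal, the following optimization problem is solved:

$$\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + \lambda \sum_i \xi_i \quad \text{s.t.} \quad F(S_i, C_i) - F(S_i, C) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; \forall i, \; \forall C \in \bar{\mathcal{C}}_i, \tag{3}$$

where the $\xi_i$ are nonnegative slack variables allowing one to increase the global margin by paying a local penalty on some examples, and $\lambda$ sets the trade-off between the two terms. The problem is solved in its dual form (4), where $\alpha_i(C)$ is the Lagrange multiplier of the constraint associated with example $i$ and tag sequence $C$.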

Training Statistical Models from Lightly Annotated Data.
Semantic parsing can be viewed as a pattern recognition problem and statistical decoding can be used to find the most likely semantic representation. The majority of statistical approaches to semantic parsing rely on fully annotated corpora. There have been some prior works on learning semantic parsers that map natural language sentences into a formal meaning representation such as first-order logic [6][7][8][9][10]. However, these systems either require a hand-built, ambiguous combinatory categorial grammar template to learn a probabilistic semantic parser [11] or assume the existence of an unambiguous, context-free grammar of the target meaning representations [6,7,9,12,13]. Furthermore, they have only been studied in two relatively simple tasks, GeoQuery [14] for US geography queries and RoboCup (http://www.robocup.org/), where coaching instructions are given to soccer agents in a simulated soccer field. He and Young [5] proposed the hidden vector state (HVS) model based on the hypothesis that a suitably constrained hierarchical model may be trainable without treebank data whilst simultaneously retaining sufficient ability to capture the hierarchical structure needed to robustly extract task domain semantics. Such a constrained hierarchical model can be conveniently implemented using the HVS model, which extends the flat-concept HMM by expanding each state to encode the stack of a pushdown automaton. This allows the model to efficiently encode hierarchical context, but because stack operations are highly constrained it avoids the tractability issues associated with full context-free stochastic models such as the hierarchical HMM. Such a model is trainable using only lightly annotated data and it offers considerable performance gains compared to the flat-concept model.
Conditional random fields (CRFs) have been extensively studied for sequence labeling. Most applications require the availability of fully annotated data, that is, an explicit alignment of sentence and word-level labels. There have been some attempts to train CRFs from a small set of labeled data and a large set of unlabeled data. In these approaches, the training objective is redefined to combine the conditional likelihood of labeled data and unlabeled data. Jiao et al. [15] extended the minimum entropy regularization framework to the structured prediction case, yielding a training objective that combines unlabeled conditional entropy with labeled conditional likelihood. Mann and McCallum [16] augmented the traditional conditional likelihood objective function with an additional term that aims to minimize the predicted label entropy on unlabeled data; that is, entropy regularization was employed for semisupervised learning. In [17], a training objective combining the conditional likelihood on labeled data and the mutual information on unlabeled data is proposed, based on rate distortion theory in information theory. Mann and McCallum [18] used labeled features instead of fully labeled instances to train linear-chain CRFs. Generalized expectation criteria are used to express a preference for parameter settings in which the model distribution on unlabeled data matches a target distribution. They tested their approach on the classified advertisements data set (Classified) [19], consisting of classified advertisements for apartment rentals in the San Francisco Bay Area with 12 fields labeled for each advertisement, including size, rent, neighborhood, and features. With only labeled features, their approach achieved a mediocre accuracy of 68.3%. With the additional inclusion of 100 labeled instances, the accuracy increased to 80%.
The DARPA communicator data used in our experiment appear to be more complex than the Classified data since semantic annotations in the DARPA communicator data describe embedded structural context in sentences while semantic labels in the Classified data do not represent any hierarchical relations.

The Proposed Framework
Given the training data $D = \{(S_1, A_1), \ldots, (S_N, A_N)\}$, where $A_i$ is the abstract annotation for sentence $S_i$, the parameters $\Theta$ are estimated through a maximum likelihood procedure. The log-likelihood $L(\Theta)$ with expectation over the abstract annotation is calculated as follows:

$$L(\Theta) = \sum_{i=1}^{N} \log \sum_{C_i^u} P(C_i^u \mid S_i; \Theta),$$

where $C_i^u$ is the unknown semantic tag sequence of the $i$th word sequence and the inner sum runs over the tag sequences consistent with $A_i$. To learn statistical models, we extend the use of the expectation maximization (EM) algorithm to estimate model parameters. The EM algorithm [20] is widely employed in statistical models for parameter estimation when the model depends on unobserved latent variables. Given a set of observed data and a set of unobserved latent data or missing values, the EM algorithm seeks the maximum likelihood estimate of the marginal likelihood by alternating between an expectation step and a maximization step.
(i) E-step: given the current estimate of the parameters, calculate the expected value for unobserved latent variables or data.
(ii) M-step: find the parameter that maximizes this quantity. These parameter estimates are then used to determine the distribution of the latent variables in the next E-step.
We propose a learning framework based on EM to train statistical models from abstract semantic annotations. At the initialization step, each abstract annotation $A_i$ is expanded to the flattened semantic tag sequence. Based on the flattened semantic tag sequences, the initial model parameters are estimated. After that, the semantic tag sequence $\hat{C}_i$ is generated for each sentence using the current model, giving $\hat{C} = \{\hat{C}_i, i = 1, \ldots, N\}$. Then, $\hat{C}$ is filtered based on a score function which measures the agreement of the generated semantic tag sequences with the actual flattened semantic tag sequences. In the maximization step, model parameters are reestimated using the filtered $\hat{C}$. The iteration continues until convergence. The procedure is outlined in Figure 1, and the details of each step are discussed below.
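The iterative procedure described above can be sketched as a short Python loop. This is only a schematic under stated assumptions: the four callables (`initialize`, `decode`, `score`, `fit`) are hypothetical placeholders for the model-specific steps, the stopping rule (decoded sequences no longer change) is one possible reading of "until convergence", and the toy word-to-tag model exists only to make the sketch runnable.

```python
from collections import Counter, defaultdict


def train_from_abstract_annotations(sentences, annotations, initialize, decode,
                                    score, fit, threshold=0.1, max_iterations=20):
    """Expectation / filtering / maximization loop (schematic)."""
    model = initialize(sentences, annotations)
    previous = None
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        # Expectation: decode one tag sequence per sentence under its annotation.
        decoded = [decode(model, s, a) for s, a in zip(sentences, annotations)]
        # Filtering: drop sentences whose agreement score is below the threshold.
        kept = [(s, c) for s, a, c in zip(sentences, annotations, decoded)
                if score(c, a) >= threshold]
        # Maximization: re-estimate the model on the surviving examples.
        model = fit(kept)
        if decoded == previous:  # assumed convergence test
            break
        previous = decoded
    return model, iteration


# --- Toy stand-ins (hypothetical, for illustration only) ---
def toy_initialize(sentences, annotations):
    return {}


def toy_decode(model, sentence, allowed_tags):
    # Use the model's tag if the annotation allows it, else the first allowed tag.
    return [model[w] if model.get(w) in allowed_tags else allowed_tags[0]
            for w in sentence]


def toy_score(tags, allowed_tags):
    return len(set(tags) & set(allowed_tags)) / len(set(allowed_tags))


def toy_fit(examples):
    counts = defaultdict(Counter)
    for sentence, tags in examples:
        for word, tag in zip(sentence, tags):
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


sentences = [["to", "Dallas"], ["from", "Denver"]]
annotations = [["TOLOC", "CITY"], ["FROMLOC", "CITY"]]
model, iterations = train_from_abstract_annotations(
    sentences, annotations, toy_initialize, toy_decode, toy_score, toy_fit)
```

In practice `decode` and `fit` would be the constrained inference and the standard training routine of the chosen model (CRFs or HM-SVMs).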

Preprocessing.
Given a sentence labeled with an abstract semantic annotation as shown in Table 1, we first expand the annotation to the flattened semantic tag sequence as in Table 1(a). The provision of abstract annotations implies that the semantics encoded in each sentence need not be provided in expensive token style. Obviously, there are some input words such as articles, which have no specific semantic meanings. In order to cater for these irrelevant input words, a DUMMY tag is introduced in the preterminal position. Hence, the flattened semantic tag sequence is finally expanded to the semantic tag sequence as in Table 1(b).
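As a rough illustration of the expansion step, the following Python sketch parses a bracketed annotation and emits one path-style tag per node, joined with `+` as in tags such as RETURN+TOLOC+CITY. The parsing code, the rule that treats all-uppercase tokens as semantic tags and anything else as a lexical item, and the output format are assumptions made for this example; the insertion of DUMMY tags is not covered.

```python
import re


def parse_annotation(annotation):
    """Parse e.g. 'A(B(x)) C' into a list of (label, children) tuples."""
    tokens = re.findall(r"[^\s()]+|[()]", annotation)
    position = 0

    def parse_siblings():
        nonlocal position
        nodes = []
        while position < len(tokens) and tokens[position] != ")":
            label = tokens[position]
            position += 1
            children = []
            if position < len(tokens) and tokens[position] == "(":
                position += 1  # consume "("
                children = parse_siblings()
                position += 1  # consume ")"
            nodes.append((label, children))
        return nodes

    return parse_siblings()


def _flatten(nodes, prefix):
    flattened = []
    for label, children in nodes:
        path = prefix + (label,)
        # Assumption: lexical items are the non-uppercase children of a tag.
        lexical = [child for child, _ in children if not child.isupper()]
        tag_children = [(child, sub) for child, sub in children if child.isupper()]
        flattened.append(("+".join(path), lexical[0] if lexical else None))
        flattened.extend(_flatten(tag_children, path))
    return flattened


def flatten_annotation(annotation):
    """Return (tag_path, bound_lexical_item_or_None) pairs in pre-order."""
    return _flatten(parse_annotation(annotation), ())


example = flatten_annotation("RETURN(TOLOC(CITY(Dallas)) ON(DATE(Thursday)))")
```

Here `example` lists RETURN, RETURN+TOLOC, RETURN+TOLOC+CITY (bound to Dallas), RETURN+ON, and RETURN+ON+DATE (bound to Thursday).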

Expectation with Constraints.
During the expectation step, that is, calculating the most likely semantic tag sequence given a sentence, we need to impose the following two constraints which are implied from abstract semantic annotations.
(1) Considering the calculated semantic tag sequence as a hidden state sequence, state transitions are only allowed if both current and next states are listed in the semantic annotation defined for the sentence.
(2) If a lexical item is attached to a preterminal tag of a flattened semantic tag, the semantic tag must appear bound to that lexical item in the training annotation.
To illustrate how these two constraints are applied, the sentence "I want to return on Thursday to Dallas" with its annotation "RETURN(TOLOC(CITY(Dallas)) ON(DATE(Thursday)))" is taken as an example. The transition from RETURN+TOLOC+CITY to RETURN is allowed since both states can be found in the semantic annotation and follows constraint 1. However, the transition from RETURN to FLIGHT is not allowed as it does not follow constraint 1 and FLIGHT is not listed in the semantic annotation. Also, for the lexical item Dallas in the training sentence, the only valid semantic tag is RETURN+TOLOC+CITY because to apply constraint 2 Dallas has to be bound with the preterminal tag CITY.
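The two constraints in this example can be expressed as simple membership checks. The sketch below is a hypothetical encoding, not the authors' implementation: `allowed_tags` is assumed to hold the flattened tags of the annotation, and `bindings` maps each lexical item in the annotation to the tag it is attached to. How unbound words are treated (here: any allowed tag) is also an assumption.

```python
def transition_allowed(current_tag, next_tag, allowed_tags):
    """Constraint 1: both states must be listed in the sentence's annotation."""
    return current_tag in allowed_tags and next_tag in allowed_tags


def candidate_tags(word, allowed_tags, bindings):
    """Constraint 2: a bound lexical item may only take its bound tag."""
    if word in bindings:
        return {bindings[word]}
    return set(allowed_tags)


allowed_tags = {
    "RETURN", "RETURN+TOLOC", "RETURN+TOLOC+CITY", "RETURN+ON", "RETURN+ON+DATE",
}
bindings = {"Dallas": "RETURN+TOLOC+CITY", "Thursday": "RETURN+ON+DATE"}

ok = transition_allowed("RETURN+TOLOC+CITY", "RETURN", allowed_tags)
blocked = transition_allowed("RETURN", "FLIGHT", allowed_tags)
dallas_tags = candidate_tags("Dallas", allowed_tags, bindings)
```

With these inputs, the first transition is accepted, the transition to FLIGHT is rejected, and Dallas has RETURN+TOLOC+CITY as its only candidate tag.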
We further describe how these two constraints can be imposed on two different models, CRFs and HM-SVMs.

Expectation in CRFs.
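The most probable labeling sequence in CRFs can be efficiently calculated using the Viterbi algorithm. Similar to the forward-backward procedure for HMMs, the marginal probability of states at each position in the sequence can be computed as

$$P(c_t = c \mid S) = \frac{\alpha_t(c \mid S)\, \beta_t(c \mid S)}{Z(S)},$$

where $\alpha_t$ and $\beta_t$ are the forward and backward values at position $t$ and $Z(S) = \sum_{c} \alpha_T(c \mid S)$.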
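Given the training data $D = \{(S_1, A_1), \ldots, (S_N, A_N)\}$, the parameters $\Theta$ can be estimated through a maximum likelihood procedure. The log-likelihood $L(\Theta)$ with expectation over the abstract annotation is calculated as follows:

$$L(\Theta) = \sum_{i=1}^{N} \Big[ \log \sum_{C_i^u} \exp\Big(\sum_{t} \sum_{k} \theta_k f_k(c_{t-1}, c_t, S_i, t)\Big) - \log Z(S_i) \Big],$$

where $\theta_k$ and $f_k$ are the feature weights and feature functions of the CRF, $C_i^u$ is the unknown semantic tag sequence of the $i$th word sequence (the inner sum runs over the sequences consistent with $A_i$), and $Z(S) = \sum_{C} \exp\big(\sum_{t} \sum_{k} \theta_k f_k(c_{t-1}, c_t, S, t)\big)$. It can be optimized using the same optimization method as in standard CRF training.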
To infer the word-level semantic tag sequences based on abstract annotations, (7) is modified as shown in (8), where the function $g(c, s_t, A)$ flags a violation of the constraints: it is nonzero when the tag $c$ is not in the allowable semantic tag list of $A$, equals 1 when $c$ is not of class type and the word $s_t$ is of class type, and equals 0 otherwise.
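One common way to impose such constraints during decoding is to exclude disallowed tags from the Viterbi search by giving them a score of minus infinity. The sketch below illustrates that idea only; it is not a reproduction of (8), and the scoring functions, tag names, and the `candidates` callback are hypothetical.

```python
import math


def constrained_viterbi(words, tags, emission, transition, candidates):
    """Best tag sequence where each word may only take tags in candidates(word)."""
    best = [{}]
    back = [{}]
    for tag in tags:
        allowed = tag in candidates(words[0])
        best[0][tag] = emission(words[0], tag) if allowed else -math.inf
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        allowed_here = candidates(words[t])
        for tag in tags:
            if tag not in allowed_here:
                best[t][tag] = -math.inf  # masked out by the annotation
                back[t][tag] = None
                continue
            prev = max(tags, key=lambda p: best[t - 1][p] + transition(p, tag))
            best[t][tag] = (best[t - 1][prev] + transition(prev, tag)
                            + emission(words[t], tag))
            back[t][tag] = prev
    last = max(tags, key=lambda tag: best[-1][tag])
    if best[-1][last] == -math.inf:
        raise ValueError("no tag sequence satisfies the constraints")
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))


# Hypothetical toy scores: the unconstrained model prefers RETURN for "Dallas".
TAGS = ["RETURN", "RETURN+TOLOC", "RETURN+TOLOC+CITY"]
SCORES = {("to", "RETURN+TOLOC"): 1.0, ("Dallas", "RETURN"): 2.0}


def emission(word, tag):
    return SCORES.get((word, tag), 0.0)


def transition(prev_tag, tag):
    return 0.0


def candidates(word):
    return {"RETURN+TOLOC+CITY"} if word == "Dallas" else set(TAGS)


free_path = constrained_viterbi(["to", "Dallas"], TAGS, emission, transition,
                                lambda word: set(TAGS))
constrained_path = constrained_viterbi(["to", "Dallas"], TAGS, emission,
                                       transition, candidates)
```

Without constraints the toy model tags Dallas as RETURN; with the annotation-derived candidates it is forced to RETURN+TOLOC+CITY.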

Expectation in HM-SVM.
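To calculate the most likely semantic tag sequence $\hat{C}$ for each sentence $S$, $\hat{C} = \arg\max_{C \in \mathcal{C}} F(S, C)$, we can decompose the discriminant function $F : \mathcal{S} \times \mathcal{C} \to \mathbb{R}$ into two components, $F(S, C) = F_1(S, C) + F_2(S, C)$, where

$$F_1(S, C) = \sum_{t} \delta(c_{t-1}, c_t), \qquad F_2(S, C) = \sum_{t} \phi(s_t, c_t).$$

Here, $\delta(\sigma, \tau)$ is considered as the coefficient for the transition from state (or semantic tag) $\sigma$ to state $\tau$, while $\phi(s_t, \sigma)$ can be treated as the coefficient for the emission of word $s_t$ from state $\sigma$. They are defined as follows:

$$\delta(\sigma, \tau) = \sum_{i} \sum_{C} \alpha_i(C) \sum_{m} [\![ c_{m-1} = \sigma \wedge c_m = \tau ]\!], \qquad \phi(s_t, \sigma) = \sum_{i} \sum_{C} \alpha_i(C) \sum_{m} [\![ c_m = \sigma ]\!] \, k(s_t, s^i_m),$$

where $k(s_t, s^i_m) = \langle \Psi(s_t), \Psi(s^i_m) \rangle$ describes the similarity of the input patterns $\Psi$ between word $s_t$ and word $s^i_m$, the $m$th word in the training example $i$, and $\alpha_i(C)$ is a set of dual parameters or Lagrange multipliers of the constraint associated with example $i$ and semantic tag sequence $C$ as in (4).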
Using the results derived in (13), Viterbi decoding can be performed to generate the best semantic tag sequence.
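To incorporate the constraints as defined in the abstract semantic annotations, the values of the transition coefficients $\delta(\sigma, \tau)$ and the emission coefficients $\phi(s_t, \sigma)$ are modified for each sentence using two functions, $g(\sigma, A)$ and $h(\sigma, s_t)$, defined in (15): $g(\sigma, A)$ flags the case in which $\sigma$ is not in the allowable semantic tag list of $A$ and is 0 otherwise, while $h(\sigma, s_t)$ flags the case in which $\sigma$ is not of class type and $s_t$ is of class type and is 0 otherwise. The functions $g$ and $h$ thus encode the two constraints implied by the abstract annotations.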

The Scientific World Journal

Filtering.
For each sentence, the semantic tag sequences generated in the expectation step are further processed based on a measure of the agreement of the semantic tag sequence $C = \{c_1, c_2, \ldots, c_T\}$ with its corresponding abstract semantic annotation $A$. The score of $C$ is defined as

$$\mathrm{Score}(C) = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$

where precision $= n/l$ and recall $= n/m$. Here, $n$ is the number of the semantic tags in $C$ which also occur in $A$, $l$ is the number of semantic tags in $C$, and $m$ is the number of semantic tags in the flattened semantic tag sequence for $A$. The score is similar to the F-measure, which is the harmonic mean of precision and recall. It essentially measures the agreement of the generated semantic tag sequence with the abstract semantic annotation. We filter out sentences whose score is below a predefined threshold, and the remaining sentences together with their generated semantic tag sequences are fed into the next maximization step. In our experiments, we empirically set the threshold to 0.1.
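A minimal Python sketch of this filtering score follows. It assumes that tags are counted as distinct items (sets), which the text does not specify, and the example tag sequences are made up for illustration.

```python
def agreement_score(generated_tags, annotation_tags):
    """Harmonic mean of precision (n/l) and recall (n/m) over distinct tags."""
    generated = set(generated_tags)   # assumption: duplicates are ignored
    reference = set(annotation_tags)
    n = len(generated & reference)
    if n == 0:
        return 0.0
    precision = n / len(generated)
    recall = n / len(reference)
    return 2 * precision * recall / (precision + recall)


def filter_examples(examples, threshold=0.1):
    """Keep (sentence, generated_tags, annotation_tags) triples scoring >= threshold."""
    return [ex for ex in examples if agreement_score(ex[1], ex[2]) >= threshold]


reference = ["RETURN", "RETURN+TOLOC", "RETURN+TOLOC+CITY",
             "RETURN+ON", "RETURN+ON+DATE"]
generated = ["RETURN", "RETURN+DUMMY", "RETURN+TOLOC", "RETURN+TOLOC+CITY"]
score = agreement_score(generated, reference)  # n = 3, l = 4, m = 5
```

For this example, precision is 3/4 and recall is 3/5, so the sentence would pass a threshold of 0.1.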

Maximization.
Given the filtered training examples from the filtering step, the parameters Θ are adjusted using the standard training algorithms.
For CRFs, the parameters $\Theta$ can be estimated through a maximum likelihood procedure. The model is traditionally trained by maximizing the conditional log-likelihood of the labeled sequences, which is defined as

$$L(\Theta) = \sum_{i=1}^{N} \log P(C_i \mid S_i; \Theta),$$

where $N$ is the number of sequences. The maximization can be achieved by gradient ascent, where the gradient of the likelihood is

$$\frac{\partial L}{\partial \theta_k} = \sum_{i=1}^{N} \sum_{t} f_k(c^i_{t-1}, c^i_t, S_i, t) - \sum_{i=1}^{N} \sum_{t} \sum_{c, c'} P(c_{t-1} = c, c_t = c' \mid S_i; \Theta)\, f_k(c, c', S_i, t).$$

For HM-SVMs, the parameters $\Theta = w$ are adjusted so that the true semantic tag sequence $C_i$ scores higher than all the other tag sequences $C \in \bar{\mathcal{C}}_i := \mathcal{C} \setminus C_i$ with a large margin. To achieve the goal, the optimization problem as stated in (3) is solved using an online learning approach as described in [4]. In short, it works as follows: a pattern sequence $S_i$ is presented and the optimal semantic tag sequence $\hat{C}_i = \arg\max_{C} F(S_i, C)$ is computed by employing Viterbi decoding. If $\hat{C}_i$ is correct, no update is performed. Otherwise, the weight vector $w$ is updated based on the difference from the true semantic tag sequence, $\Delta\Phi = \Phi(S_i, \hat{C}_i) - \Phi(S_i, C_i)$.

Experimental Results
Experiments have been conducted on the DARPA communicator data (http://www.bltek.com/spoken-dialog-systems/cu-communicator.html/), which were collected over 461 days. From these, 46 days were randomly selected for use as test set data and the remainder was used for training. After cleaning up the data, the training set consists of 12702 utterances while the test set contains 1178 utterances. The abstract semantic annotations used for training only list a set of valid semantic tags and the dominance relationships between them without considering the actual realized semantic tag sequence or attempting to identify explicit word/concept pairs. Thus, the need for expensive treebank-style annotations is avoided. For example, for the sentence "I wanna go from Denver to Orlando Florida on December tenth," the abstract annotation would be FROMLOC(CITY) TOLOC(CITY(STATE)) MONTH(DAY).
To evaluate the performance of the model, a reference frame structure was derived for every test set sentence consisting of slot/value pairs. An example of a reference frame is shown in Table 2.
Performance was then measured in terms of F-measure on slot/value pairs, which combines the precision ($P$) and recall ($R$) values with equal weight and is defined as $F = 2PR/(P + R)$.
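The following Python sketch shows how such an F-measure over slot/value pairs could be computed. Representing each frame as a dictionary from slot name to value, and the slot names themselves, are assumptions made for this example; the reference frame format of Table 2 may differ.

```python
def slot_value_f_measure(reference_frames, predicted_frames):
    """Micro-averaged precision, recall, and F over slot/value pairs."""
    correct = predicted_total = reference_total = 0
    for reference, predicted in zip(reference_frames, predicted_frames):
        reference_pairs = set(reference.items())
        predicted_pairs = set(predicted.items())
        correct += len(reference_pairs & predicted_pairs)
        predicted_total += len(predicted_pairs)
        reference_total += len(reference_pairs)
    precision = correct / predicted_total if predicted_total else 0.0
    recall = correct / reference_total if reference_total else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical frames: one of two predicted pairs matches the three reference pairs.
reference_frames = [{"FROMLOC.CITY": "Denver",
                     "TOLOC.CITY": "Orlando",
                     "TOLOC.CITY.STATE": "Florida"}]
predicted_frames = [{"FROMLOC.CITY": "Denver", "TOLOC.CITY": "Florida"}]
precision, recall, f_measure = slot_value_f_measure(reference_frames, predicted_frames)
```

In this toy case, $P = 1/2$ and $R = 1/3$, giving $F = 0.4$.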
We modified the open-source CRFsuite (http://www.chokkan.org/software/crfsuite/) and SVM^hmm (http://www.cs.cornell.edu/people/tj/svm_light/svm_hmm.html/) to implement our proposed learning framework. We employed two algorithms to estimate the parameters of CRFs, the stochastic gradient descent (SGD) iterative algorithm [21] and the limited-memory BFGS (L-BFGS) method [22]. For both algorithms, the regularization parameter was empirically set in the following experiments.

Overall Comparison.
We first compare the time consumed in each iteration using HM-SVMs or CRFs as shown in Figure 2. The experiments were conducted on an Intel(R) Xeon(TM) Linux server equipped with a 3.00 GHz processor and 4 GB RAM. It can be observed that, for CRFs, the time consumed by SGD in each iteration is almost double that of L-BFGS. However, since SGD converges much faster than L-BFGS, the total time required for training is almost the same. As SGD gives balanced precision and recall values, it is preferable to L-BFGS in our proposed learning procedure. On the other hand, as opposed to CRFs, which consume much less time after iteration 1, HM-SVMs take almost the same run time for all the iterations. Nevertheless, the total run time until convergence is almost the same for CRFs and HM-SVMs.
Figure 3 shows the performance of our proposed framework for CRFs and HM-SVMs at each iteration. At each word position, the feature set used for both statistical models consists of the current word and the current part-of-speech (POS) tag. It can be observed that both models achieve the best performance at iteration 8, with an F-measure of 92.95% and 93.18% being achieved using CRFs and HM-SVMs, respectively.

Results with Varied Features Set.
We employed word features (such as current word, previous word, and next word) and POS features (such as current POS tag, previous one, and next one) for training. To explore the impact of the choice of features, we experimented with feature sets composed of words or POS tags occurring before or after the current word within some predefined window size. Figure 4 shows the performance of our proposed approach with the window size varying between 0 and 3. Surprisingly, the model learned with the feature set chosen by setting the window size to 0 gives the best overall performance. Varying the window size between 1 and 3 only impacts the convergence rate and does not lead to any performance difference at the end of the learning procedure.
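To make the notion of window size concrete, the following Python sketch builds word and POS features for one position. The feature string format and the POS tags in the example are illustrative assumptions and are not tied to the input format of any particular toolkit.

```python
def window_features(words, pos_tags, t, window):
    """Word and POS features for position t, using offsets -window..+window."""
    features = []
    for offset in range(-window, window + 1):
        i = t + offset
        if 0 <= i < len(words):  # skip offsets that fall outside the sentence
            features.append(f"w[{offset}]={words[i]}")
            features.append(f"pos[{offset}]={pos_tags[i]}")
    return features


words = ["from", "Denver", "to", "Chicago"]
pos_tags = ["IN", "NNP", "TO", "NNP"]  # hypothetical POS tags

current_only = window_features(words, pos_tags, 1, window=0)
with_context = window_features(words, pos_tags, 1, window=1)
```

A window size of 0 yields only the current word and its POS tag, while a window size of 1 adds the immediate left and right neighbours.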

Performance with or without Filtering Step.
In a second set of experiments, we compare the performance with or without the filtering step as discussed in Section 3.3. Figure 5 shows that the filtering step is indeed crucial, as it boosts the performance by nearly 4% for CRFs with L-BFGS and 3% for CRFs with SGD and HM-SVMs.

Comparison with Existing Approaches.
We compare the performance of CRFs and HM-SVMs with HVS, all trained on abstract semantic annotations. While it is hard to incorporate arbitrary input features into HVS learning, both CRFs and HM-SVMs have the capability of dealing with overlapping features. Table 3 shows that they outperform HVS, with a relative error reduction of 36.6% and 43.3% being achieved, respectively. In addition, the superior performance of HM-SVMs over CRFs shows the advantage of HM-SVMs.
We further compare our proposed learning approach with two other methods. One is a hybrid generative/discriminative framework (HF) [23] which combines HVS with HM-SVMs so as to allow the incorporation of arbitrary features as in CRFs. The other is a discriminative approach (DT) based on a parse error measure to train the HVS model [24]. The generalized probabilistic descent (GPD) algorithm [25] was employed to adjust the HVS model to achieve the minimum parse error rate. Table 3 shows that our proposed learning approach outperforms both HF and DT. Training statistical models on abstract annotations allows the calculation of conditional likelihood and hence results in direct optimization of the objective function to reduce the error rate of semantic labeling. In contrast, the hybrid framework first uses the HVS parser to generate full annotations for training HM-SVMs. This process involves the optimization of two different objective functions (one for HVS and another for HM-SVMs). Although DT also uses an objective function which aims to reduce the semantic parsing error rate, it is in fact employed for supervised reranking, where the input is the $N$-best parse results generated from the HVS model.
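For readers unfamiliar with the metric, relative error reduction is commonly computed from two F-measures as in the Python sketch below. The text does not spell out the formula, so this definition is an assumption, and the input values are arbitrary illustrative numbers, not results from Table 3.

```python
def relative_error_reduction(baseline_f, new_f):
    """Fraction of the baseline error (1 - F) removed by the new system."""
    baseline_error = 1.0 - baseline_f
    new_error = 1.0 - new_f
    return (baseline_error - new_error) / baseline_error


# Illustrative values only: halving the error from 0.20 to 0.10.
reduction = relative_error_reduction(0.80, 0.90)
```

Under this definition, moving from an F-measure of 0.80 to 0.90 corresponds to a relative error reduction of about 50%.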

Conclusions
In this paper, we have proposed an effective learning approach which can train statistical models such as CRFs and HM-SVMs without using expensive treebank-style annotation data. Instead, it trains the statistical models from only abstract annotations in a constrained way. Experimental results show that, using the proposed learning approach, both CRFs and HM-SVMs outperform the previously proposed HVS model on the DARPA communicator data. Furthermore, they also outperform two other methods: one is the hybrid framework (HF) combining both HVS and HM-SVMs, and the other is discriminative training (DT) of the HVS model, with relative error reductions of about 25% and 15% being achieved when compared with HF and DT, respectively. In future work, we will explore other score functions in the filtering step to describe the precision of the parsing results. We also plan to apply the proposed framework in other domains such as information extraction and opinion mining.