A Semieager Classifier for Open Relation Extraction

A variety of open relation extraction systems have been developed in the last decade, and deep learning, especially with attention models, has achieved considerable success in the task of relation classification. Nevertheless, to our knowledge, no research has yet been reported on classifying open relation tuples. In this paper, we propose a novel semieager learning algorithm (SemiE) to tackle the problem of open relation classification. Unlike eager learning approaches (e.g., ANNs) and lazy learning approaches (e.g., kNN), SemiE offers the benefits of both categories of learning scheme at a significantly lower computational cost (O(n)). The algorithm can also be employed in other classification tasks. Additionally, this paper presents an adapted attention model that transforms relation phrases into vectors by using word embeddings. Experimental results on two benchmark datasets show that our method outperforms state-of-the-art methods in the task of open relation classification.


Introduction
Traditionally, Information Extraction (IE) is the task of automatically collecting structured information from large volumes of unstructured data by learning an extractor from labeled training examples for each target relation [1][2][3]. This approach to IE cannot scale to corpora with a large number of target relations or with no prespecified target relations. In response, researchers at the University of Washington pioneered a new paradigm of open relation extraction (ORE), which enables the extraction of arbitrary relations from sentences by automatic identification of relation phrases, obviating the restriction to a prespecified vocabulary.
In recent years, many ORE systems have been presented in the literature. The first ORE system is TEXTRUNNER [4], which learns a CRF on self-supervised training data constructed over the Penn Treebank. The CRF can work on corpora not seen at all in the training data, since it uses only unlexicalized features. Reverb improves over TEXTRUNNER via carefully designed linguistic patterns for relation phrases in English text [5,6]. Reverb reveals that about 85% of verb-based relation phrases can be expressed with a simple regular expression (see Box 1). OLLIE learns unlexicalized pattern templates on bootstrapped training data [7]. OLLIE can learn not only verb-based patterns but also noun-based and even some inferential relation patterns. RELNOUN is a rule-based open relation extraction system [8] aimed at extracting nominal relations. It encodes various noun-mediated patterns and pays special attention to compound relational nouns.
By themselves, the triples generated by ORE systems are of little interest unless they are placed in the context of particular downstream tasks, such as the semantic Web, event schema induction, sentence similarity, text comprehension, and Q/A systems [9]. In this respect, open relation extraction can be regarded as only a preliminary step of a semantic analysis process [10]. ORE systems are not restricted to predefined relations in their extraction process and can extract all types of relations found in a text. However, after a set of tuples has been generated by an ORE system, we can classify these tuples with a classifier that is pretrained on a labeled dataset. This is quite important to some downstream applications, since some applications are interested in only a few types of these relations (e.g., supplier relationship or members-employees relationship). In addition, some other applications require the relations to be mapped to the relations in a particular ontology. For example, for TAC KBP 2013 an NLP expert created a rule-based extractor for 41 relations of interest from open tuples [11]. The extractor obtains a precision of about 0.8 but low recall. All these applications suggest the necessity of classifying open tuples. This paper targets the problem of open relation classification, building on research in traditional relation classification.

V | V P | V W* P
V = verb particle? adv?
W = (noun | adj | adv | pron | det)
P = (prep | particle | inf. marker)
Box 1: A regular expression for the syntactic constraint on relation phrases.

A large number of approaches have been explored for traditional relation classification over the years. Recently, neural network-based approaches have achieved significant improvement over traditional methods based on human-designed features [12]. Earlier neural networks for relation classification are usually based on one-layer Convolutional Neural Networks or recurrent networks. They may fail to explore the potential representation space at different abstraction levels [13]. The performance of supervised approaches strongly depends on the quality of the designed features. With the recent improvement in deep neural networks (DNNs), many researchers have presented unsupervised methods for automatic feature learning. Gated recurrent networks with long short-term memory (LSTM) were introduced to relation classification by Xu et al. [14], and Convolutional Neural Networks (CNNs) were proposed by Zeng to deal with the problem [15]. Additionally, the common Softmax loss function was replaced with a ranking loss in the CNN model in [16]. A negative sampling method based on CNNs was designed in [17]. From the viewpoint of model ensembles, algorithms incorporating CNNs with Recurrent Neural Networks (RNNs) have been proposed for relation classification [18,19]. Recently, an attention mechanism based on LSTM and word vectors was presented by Zhang and Zheng [20]. Additionally, much effort has been invested in relational learning methods that can scale to large knowledge bases.
The best performing neural-embedding models are NTN [21] and Bordes models (TransE and TATEC) [22,23], which extend the traditional relation classification task to semantic relation classification. Different from those approaches built over lexical and distributional word vector features, Siamak B et al. proposed a model using the combination of large commonsense knowledge bases of binary relations for the composite semantic relation classification problem [24].
Nevertheless, these approaches for traditional relation classification are unsuitable for the variety and complexity of open relation types on the Web, since ORE systems place a premium on speed and these approaches are time consuming. Moreover, no research has been reported to date on classifying open tuples, although it is of great importance to the downstream applications. Thus, it is necessary to develop a methodology appropriate to the problem of open tuple classification.
The machine learning algorithms for training a classifier can be organized in two categories: eager learning and lazy learning. A well-known disadvantage of eager learning is the high time complexity of the training process, whereas lazy learning suffers from high space complexity, since it must store all the training examples. To overcome the disadvantages of eager and lazy learning but still preserve the benefits of the two models, this paper proposes a semieager learning algorithm to tackle the task of open tuple classification. The proposed algorithm is quite efficient. Its time complexity is O(n) and its space complexity is O(k), with n being the number of training instances and k being the number of categories.
Recently, attention mechanisms have found a wide range of applications. In NLP, they are capable of automatically concentrating on the words that are valuable for the targeted task. Inspired by this, we present a novel method to calculate representations for relation phrases via a word Vector-Sum operation weighted by attention weights. The attention weights are based on the information quantity of the words within the relation phrase.
In summary, the main contributions of this paper are as follows.
(1) A novel semieager learning algorithm is presented to classify open relation tuples. The proposed algorithm offers the benefits of both the eager learning and lazy learning scheme, with its significantly lower computational cost. This algorithm can also be employed in other classification tasks.
(2) Additionally, this paper presents an adapted attention model to transform relation phrases into vectors by using word embedding.
(3) Experiments show that the new algorithms achieve better performance compared to some recently presented methods.
The remainder of the paper is organized as follows. In Section 2, we review related work about open relation extraction and traditional relation classification. The derivation of the proposed semieager learning approach is presented in Section 3, as well as the description of the adapted attention model. In Section 4, we describe in detail the setup of experimental evaluation and the experimental results. Finally, we present the conclusion in Section 5.

Related Works
Reverb is a second-generation open relation extraction system. Given a POS-tagged and NP-chunked sentence as input, the algorithm returns an extraction triple of the form (NP, VP, NP). The VP must satisfy the lexical constraint and the syntactic constraint shown in Box 1, and the NPs are the nearest noun phrases around the VP [5].
In the research area of relation classification, deep neural networks have been widely used in recent years, since deep architectures can automatically learn underlying features. Because CNNs are not well suited to learning long-distance semantic information [15], Zhang and Wang proposed a bidirectional RNN [25] to learn patterns of relations from raw text data. To overcome the vanishing gradient problem in RNNs, the SDP-LSTM model was proposed by Xu et al. [14]. Recently, a novel neural network, Att-BLSTM, was proposed for relation classification [26]. This model utilizes a BLSTM with a neural attention mechanism to capture the most important semantic information in a sentence, without using extra knowledge or NLP systems. Att-BLSTM transforms all words to word vectors, forming a simple but competitive model. Similarly, EAtt-BiGRU presents an entity-pair-based attention mechanism for relation classification, which utilizes entity pair information as a priori knowledge to adaptively generate attention weights based on word vectors [27].
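The syntactic constraint of Box 1 can be illustrated with a minimal Python sketch. It assumes that the sentence has already been POS-tagged and that each tag has been mapped to one of the coarse classes named in Box 1; the one-letter codes, the `TAG_CODES` table, and the function name `find_relation_span` are hypothetical choices made for this illustration and are not part of Reverb. The lexical constraint is not modeled.

```python
import re

# Hypothetical one-letter codes for the coarse POS classes of Box 1
# (an assumption of this sketch, not Reverb's own tag set).
TAG_CODES = {
    "verb": "v", "particle": "t", "adv": "a", "noun": "n", "adj": "j",
    "pron": "r", "det": "d", "prep": "p", "inf": "i",
}

V = "vt?a?"      # V = verb particle? adv?
W = "[njard]"    # W = noun | adj | adv | pron | det
P = "[pti]"      # P = prep | particle | inf. marker

# V W* P also covers V P (zero W), so two alternatives suffice.
# The longer alternative comes first so that it is preferred over a bare V.
RELATION = re.compile(f"{V}{W}*{P}|{V}")


def find_relation_span(tags):
    """Return (start, end) token indices of the first relation phrase, or None."""
    encoded = "".join(TAG_CODES.get(tag, "x") for tag in tags)
    match = RELATION.search(encoded)
    return (match.start(), match.end()) if match else None
```

For the tag sequence of "The owl held the mouse in its claw" (det noun verb det noun prep pron noun), the sketch selects the tokens "held the mouse in" as the relation phrase.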
Distributed representations of words in a vector space have achieved considerable success in a wide range of NLP tasks [28,29], including applications to automatic speech recognition and machine translation [30,31].
Recently, Mikolov et al. introduced two novel algorithms for computing word vectors based on large unlabeled datasets [32,33]. The first architecture is the continuous bag-of-words model (CBOW), while the second one is named the Skip-gram model. Given the surrounding words, the CBOW model predicts the current word, and the Skip-gram model predicts the surrounding words based on the current word. Given a word sequence $w_1, w_2, w_3, \ldots, w_T$, which is a sentence or document, the training objective of the CBOW model can be presented as the maximum of the average log probability
$$\frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}). \tag{1}$$
Similarly, the objective of the Skip-gram model is
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t), \tag{2}$$
where $c$ is the size of the training context around the center word $w_t$. A larger $c$ means more training instances and thus can result in a higher accuracy, but at the cost of training time.
In order to classify the relations with the proposed semieager learning approach, phrase-level feature vectors of the same size must be generated in advance. In this paper, we employ Vector-Sum, Max-Pooling, and an adapted attention model to calculate feature vectors for relation phrases. The adapted attention model is one of our contributions in this paper. The novel model is based on word vectors and inspired by the attention mechanism in deep neural network architectures.
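The effect of the context size on the number of Skip-gram training instances can be seen in a small Python sketch. It only enumerates the (center word, context word) pairs over which the Skip-gram objective sums and does not train any model; the function name `skipgram_pairs` is a hypothetical choice for this illustration.

```python
def skipgram_pairs(words, c):
    """Return the (center, context) pairs for a symmetric context of size c."""
    pairs = []
    for t, center in enumerate(words):
        for j in range(-c, c + 1):
            # Skip the center word itself and positions outside the sequence.
            if j != 0 and 0 <= t + j < len(words):
                pairs.append((center, words[t + j]))
    return pairs
```

For a three-word sequence, a context size of 1 gives four pairs and a context size of 2 gives six, which illustrates why a larger context yields more training instances at a higher training cost.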

The Proposed Algorithms
Eager learning methods, such as Artificial Neural Networks, Support Vector Machines, and Conditional Random Fields, construct a general, explicit description of the target function when training examples are provided. The disadvantages of eager learning include the following: (a) the training process for these models usually requires a high time cost, up to several hours and even days. For instance, the computational cost for SVMs is $O(n^3)$, with n being the number of training instances; (b) the eager learning approach is bound to suffer from information loss, which may lead to a high potential risk of overfitting or underfitting; (c) these models are influenced mainly by the global distribution of the whole dataset rather than by the local behavior of unknown prediction targets. However, the local behavior is very important for the convergence of machine learning models [34].
In contrast to eager learning methods, a delayed, or lazy, learning algorithm simply stores the training examples. Generalizing beyond these examples is postponed until a new instance must be classified. A key advantage of this kind of method is that, instead of estimating the target function over the entire instance space, it can estimate the function locally and differently for each new instance to be classified. Typical lazy learning methods are k-Nearest Neighbor learning and locally weighted regression. In order to make a prediction for a new instance, lazy learning calculates its distance to each training example. Although lazy learning needs no training effort, the time complexity for prediction is O(n), where n is the number of training examples. This is the main disadvantage of lazy learning.
To sum up, the eager learning approach suffers from the problems of concept drift and information loss, since it computes a global model before seeing the prediction query. The lazy learning approach suffers from simplistic prediction methods, although it can commit to much richer sets of hypotheses (models) from the data. To overcome the disadvantages of eager and lazy learning but still preserve the benefits of the two models, a semieager learning algorithm is proposed in this paper. Unlike the lazy learning algorithm, the proposed model stores only a "center point" for each class after the training process. The training time complexity is O(n), and both the time complexity and the space complexity for prediction are O(k). Here n is the number of training instances (written N in the derivation below) and k is the number of categories. We call this new approach the "SemiE" learning approach.

3.1. The Semieager Learning Algorithm. (1) Consider the input space $X \subseteq \mathbb{R}^n$, a set of n-dimensional vectors, and the output space, a set of class labels $Y = \{c_1, c_2, \ldots, c_k\}$. Assume $x \in X$ is a feature vector and $y \in Y$ is the corresponding class label. Let $P(X, Y)$ be the joint probability distribution over $X$ and $Y$, and let the training dataset be $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i$ is an instance from $X$ and $y_i$ is the corresponding class label from $Y$.
Assume that $T_j = \{(x_i, y_i) \mid y_i = c_j\}$, $j = 1, 2, \ldots, k$, is a partition of the set T. The task of the learner is to learn the prior probability
$$P(Y = c_j), \quad j = 1, 2, \ldots, k, \tag{3}$$
and the conditional probability distribution
$$P(X = x \mid Y = c_j), \quad j = 1, 2, \ldots, k. \tag{4}$$
Thus, we can get the posterior probability
$$P(Y = c_j \mid X = x) = \frac{P(X = x \mid Y = c_j)\, P(Y = c_j)}{\sum_{l=1}^{k} P(X = x \mid Y = c_l)\, P(Y = c_l)}. \tag{5}$$

Mathematical Problems in Engineering
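During prediction, given a certain input $x$, the class label with the highest posterior probability is output:
$$y = \arg\max_{c_j} \frac{P(X = x \mid Y = c_j)\, P(Y = c_j)}{\sum_{l=1}^{k} P(X = x \mid Y = c_l)\, P(Y = c_l)}. \tag{6}$$
Notice that the denominator is a constant independent of $c_j$, so this term can be dropped, yielding
$$y = \arg\max_{c_j} P(Y = c_j)\, P(X = x \mid Y = c_j). \tag{7}$$

(2) Parameter Estimation. By using the maximum likelihood estimation method, the prior probability can be calculated as
$$P(Y = c_j) = \frac{\sum_{i=1}^{N} I(y_i = c_j)}{N}, \quad j = 1, 2, \ldots, k. \tag{8}$$
Here N is the number of samples in the training set, and the indicator function is
$$I(y_i = c_j) = \begin{cases} 1, & y_i = c_j, \\ 0, & y_i \neq c_j. \end{cases} \tag{9}$$
The conditional probability $P(X = x \mid Y = c_j)$ is estimated as follows.
According to the central limit theorem, when the number of samples is large enough (N>30), $x$ within the training dataset T obeys a normal distribution with mean $\mu$ and variance $\sigma^2$. Similarly, in each subset $T_j = \{(x_i, y_i) \mid y_i = c_j\}$, $x_i$ obeys a normal distribution with variance $\sigma^2$ centered around $\mu_j$, if $|T_j| > 30$. Thus,
$$P(X = x \mid Y = c_j) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu_j)^2}{2\sigma^2}\right). \tag{10}$$
Substituting the above equation into (7) and writing $P(c_j)$ for $P(Y = c_j)$ yields
$$y = \arg\max_{c_j} \frac{1}{\sqrt{2\pi\sigma^2}}\, P(c_j) \exp\left(-\frac{(x - \mu_j)^2}{2\sigma^2}\right). \tag{11}$$
The first term in the expression is a constant independent of $c_j$ and therefore can be discarded, yielding
$$y = \arg\max_{c_j} P(c_j) \exp\left(-\frac{(x - \mu_j)^2}{2\sigma^2}\right). \tag{12}$$
Because $\ln z$ is a monotonic function of $z$, maximizing $\ln z$ also maximizes $z$:
$$y = \arg\max_{c_j} \left[\ln P(c_j) - \frac{(x - \mu_j)^2}{2\sigma^2}\right] = \arg\min_{c_j} \left[(x - \mu_j)^2 - 2\sigma^2 \ln P(c_j)\right]. \tag{13}$$
Under the condition that the a priori probabilities of all classes are equal, that is, $P(c_i) = P(c_j)$ $(i \neq j)$, we get
$$y = \arg\min_{c_j} (x - \mu_j)^2, \tag{14}$$
where $\mu_j$ represents the j-th "class center". Thus, to classify an instance $x$, we should determine the class center of each class. Let $\mu_j = (\mu_j^{(1)}, \mu_j^{(2)}, \ldots, \mu_j^{(n)})$; the following holds:
$$\mu_j^{(l)} = \frac{\sum_{i=1}^{N} x_i^{(l)}\, I(y_i = c_j)}{\sum_{i=1}^{N} I(y_i = c_j)}, \quad l = 1, 2, \ldots, n, \tag{15}$$
where $I(\cdot)$ is the indicator function described in (9). The semieager algorithm is summarized in Algorithm 1. The inputs of this algorithm include a set of training instances and their class labels. The algorithm outputs frequencies, class centers, and regularization terms for each class, respectively.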
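The training and prediction steps derived above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a reproduction of Algorithm 1: the function names `train_semie` and `predict_semie` are hypothetical, feature vectors are plain lists of floats, and the shared variance is passed in as a parameter `sigma2` (default 1.0) because its estimation is not specified in the text above.

```python
import math
from collections import defaultdict


def train_semie(instances, labels):
    """Single pass over the data: class frequencies and class centers."""
    counts = defaultdict(int)
    sums = {}
    for x, y in zip(instances, labels):
        counts[y] += 1
        if y not in sums:
            sums[y] = [0.0] * len(x)
        for d, value in enumerate(x):
            sums[y][d] += value
    total = len(labels)
    # Prior of each class: its relative frequency in the training set.
    priors = {c: counts[c] / total for c in counts}
    # Class center: the per-dimension mean of the instances of the class.
    centers = {c: [s / counts[c] for s in sums[c]] for c in counts}
    return priors, centers


def predict_semie(x, priors, centers, sigma2=1.0):
    """Return the class minimizing (x - mu_j)^2 - 2 * sigma^2 * ln P(c_j)."""
    def score(c):
        dist2 = sum((a - b) ** 2 for a, b in zip(x, centers[c]))
        # sigma2 is an assumed, user-supplied shared variance.
        return dist2 - 2.0 * sigma2 * math.log(priors[c])
    return min(centers, key=score)
```

Training touches each instance once, and prediction compares the query with one center per class, in line with the O(n) and O(k) costs stated above. When a query is equidistant from two centers, the regularization term favors the more frequent class.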
Remarks on the semieager learning algorithm: (a) The training time complexity of the algorithm is O(n), and the prediction time complexity is O(k). It is quite efficient when it is provided with a sufficiently large set of training data.
(b) The SemiE supports incremental learning.
(c) $-2\sigma^2 \ln P(c_j)$ is a regularization term. If a data point to be classified has the same distance to all class centers, the class label with the highest frequency will be assigned to it.
(d) It is robust to noisy training examples and to the irrelevant attributes for classification. A small number of noisy instances in a class will not significantly influence the center of the class.
(e) The SemiE is suitable only for learning tasks with relatively few categories (k≪N). Otherwise, if k is of the same order of magnitude as N, SemiE degenerates into kNN.
(3) An Example of the SemiE. Figure 1 illustrates the semieager learning algorithm for the case where the instances are points in a two-dimensional space and where the target function is Boolean valued. The positive and negative training examples are shown by "+" and "-", respectively. Both class centers are marked in red and the noisy points are marked in blue. An individual noisy point has no significant effect on the position of the class center. In the figures, "f" represents a query data point to be classified. In order to classify the data point, the distance values to each class center, $(x - \mu_j)^2$, are calculated, along with the regularization term $-2\sigma^2 \ln P(c_j)$. Since the frequency values of the two classes are almost identical in Figure 1(a), the regularization term has no influence on the classification, which is completely determined by the distance values to the class centers. Since the query data point is closer to the center of the positive examples in Figure 1(a), the algorithm classifies the data point as a positive example. In Figure 1(b), there are many more negative examples than positive examples, and the query instance has the same distance to both class centers. Thus, the data point to be classified is assigned to the negative class according to the regularization term.
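Remark (b) can be illustrated with a running-mean update. The text does not spell out how incremental learning is realized, so the following Python sketch is only one plausible reading: a class center computed from a given number of instances is adjusted when one more instance of that class arrives, without revisiting the earlier data. The function name `update_center` is hypothetical.

```python
def update_center(center, count, x):
    """Fold one new instance x into a class center built from `count` instances."""
    new_count = count + 1
    # Running mean: move the center toward x by 1 / new_count of the gap.
    new_center = [m + (v - m) / new_count for m, v in zip(center, x)]
    return new_center, new_count
```

The class frequencies would be updated in the same step by recomputing the ratio of each class count to the new total.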
(1) Vector-Sum. Because relation phrases vary in length, a Vector-Sum operation is a reasonable solution. A relation phrase is represented as the normalized sum of the word vectors within the phrase.
(2) Max-Pooling. A pooling approach is often used in CNN-based models [15]. With the RNN structure, this approach has also been used for sentence-level semantic embedding [35]. The element $r^{(i)}$ of the phrase vector $r$ is obtained by using the Max-Pooling operation as follows:
$$r^{(i)} = \max_{1 \le j \le m} v_j^{(i)}, \quad i = 1, 2, \ldots, n.$$
Here $n$ is the dimension of the word vector $v_j$, $v_j^{(i)}$ is the i-th element of $v_j$, and $m$ is the number of words in the phrase.
(3) Attention Model. Recently, attention mechanisms have been introduced into deep learning and have achieved success in a wide range of tasks, such as question answering, machine translation, speech recognition [36], and image captioning [37]. When reading an article, people pay more attention to the words or segments that are valuable for comprehending the text. Similarly, since different words contribute differently to the ultimate objective in NLP tasks, attentive neural networks calculate a weight distribution over the word sequence in a text. Inspired by this, we present a novel approach to calculate the phrase representation based on the information quantity of the words within the phrase.
The information quantity of a word $w_j$ is measured by $-\log p(w_j)$, with $p(w_j)$ being the probability of the word $w_j$. Thus, a word with a higher probability will get a lower weight.
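The three phrase representation schemes can be compared in a short Python sketch. Several details are assumptions of this illustration and are not taken from the text: "normalization" is read as scaling to unit Euclidean length, the attention weights are taken to be the information quantities $-\log p(w)$ divided by their sum, word probabilities are assumed to lie strictly between 0 and 1, and the function names are hypothetical.

```python
import math


def vector_sum(vectors):
    """Sum the word vectors and scale the result to unit length (assumed L2)."""
    total = [sum(column) for column in zip(*vectors)]
    norm = math.sqrt(sum(v * v for v in total))
    return [v / norm for v in total] if norm > 0 else total


def max_pooling(vectors):
    """Take the element-wise maximum over the word vectors."""
    return [max(column) for column in zip(*vectors)]


def attention_sum(vectors, probabilities):
    """Weight each word vector by its share of the total information quantity."""
    info = [-math.log(p) for p in probabilities]  # requires 0 < p < 1
    total_info = sum(info)
    weights = [i / total_info for i in info]      # assumed normalization
    return [sum(w * v for w, v in zip(weights, column))
            for column in zip(*vectors)]
```

With two words of probability 0.5 and 0.25, the rarer word receives twice the weight of the more frequent one, so frequent function words contribute less to the phrase vector.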

Experimental Results and Analysis
To the best of our knowledge, this paper is the first to tackle the problem of open relation classification with a machine learning algorithm. The proposed semieager learning approach belongs to the class of supervised models. Consequently, one of the key challenges for this research is training data, since there is no large training set available for open relation classification. Though a multitude of ORE systems have been developed in recent years, a well-defined, generally accepted definition of this task is still missing. In this situation, it is difficult to create a large-scale annotated corpus serving as a gold standard dataset for an objective and reproducible cross-system comparison. As a consequence, ORE systems have predominantly been evaluated on small-scale corpora created by the researchers themselves. Although some of these corpora (e.g., the Wikipedia and Reverb datasets) are occasionally reused, none of the datasets for assessing the performance of different systems is widely agreed upon. Moreover, because ORE systems rely on unsupervised extraction strategies, these datasets generally consist only of unlabeled natural language sentences that cannot be used for training a supervised model such as the SemiE.
Fortunately, in the research area of traditional relation classification, there are several benchmarks that consist of sentences manually annotated with relation labels. As a first attempt to classify open relations, we adopt some of these datasets (with minor modifications) to evaluate the models presented in this paper.
The four subsections below describe the experiments in detail. Section 4.1 presents the datasets and evaluation metrics. The training process and the efficiency of our models are discussed in Section 4.2. Sections 4.3 and 4.4 present the experimental results and the error analysis.

4.1. Datasets and Evaluation Metrics. In SemEval-2010 Task 8, there are 9 relationships (with two directions) and an undirected "Other" class, resulting in 19 relation classes in total. This dataset contains 10,717 annotated examples, including 8,000 sentences for training and 2,717 for testing. To make the dataset more suitable for our task, we changed the annotated instances into the form (NP, VP, NP), as described in Section 2 and Box 1. For example, "The [owl]e1 held the mouse in its [claw]e2" was transformed into "[The owl]e1 held the mouse in [its claw]e2." The open relation phrase is "held the mouse in", and the relation class is Component-Whole(e2,e1). Since we aimed at classifying verb-mediated relations, the instances that cannot be converted into the form (NP, VP, NP) were not taken into consideration (Box 1). For example, the Component-Whole(e1,e2) instance "He decided to pad the [heel]e1 of [shoes]e2 with a shock absorbing insole or heel pad." was ignored.
The MIML-RE dataset consists of annotated sentences from the 2010 and 2013 KBP collections, along with a dump of Wikipedia from July 2013. A total of 33,811 sentences have been annotated. According to the description of the KBP task, there are 41 relations in total. We again took into consideration only the instances that can be converted into the form (NP, VP, NP), for example, the relation "per:city-of-death(e1,e2)".
The SemEval-2010 dataset defines 18 generally accepted relation types and "Other". In this task, the class Other is used to indicate that the relation between two nominals does not belong to any of the 18 relation classes of interest. Therefore, the SemEval-2010 dataset is "complete" for open relation classification. Although the MIML-RE dataset is not complete in this sense, it still defines up to 41 commonly accepted relation types, which is practically "sufficient" for our task, to some extent. Thus, we adopted these datasets to reduce the labor costs of manual annotation.
MIML-RE differs from SemEval-2010 Task 8 in two main aspects: (1) MIML-RE is dominated by entity names (pairs of nouns), which are sparser than those in SemEval-2010 Task 8, and more of its target nouns contain more than one word; (2) sentences in MIML-RE are on average much longer than those in SemEval-2010 Task 8, as can be seen in Table 1.
The SemEval-2010 Task 8 benchmark is one of the commonly used datasets in traditional relation classification. The performance is evaluated in terms of the F1 score defined by this task. Both the data and the evaluation tool are publicly available (http://semeval2.fbk.eu/semeval2.php?location=data). For evaluation on this dataset, we applied the official scoring script and report the macro F1 score which also served as the official result of the shared task. In the experiments that are conducted on MIML-RE dataset, we adopt three metrics, including precision, recall, and F1 score, to evaluate our models. This dataset is also publicly available (http://nlp.stanford.edu/software/mimlre.shtml) but does not provide official evaluation tools.
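For reference, a macro-averaged F1 score can be computed as in the Python sketch below. The sketch is not the official scoring script and does not reproduce its handling of relation directionality; leaving the Other class out of the average is an assumption of this illustration, and the function name `macro_f1` is hypothetical.

```python
def macro_f1(gold, predicted, ignore=("Other",)):
    """Average the per-class F1 scores over all gold classes except `ignore`."""
    labels = sorted(set(gold) - set(ignore))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Because every class contributes equally to the average, rare relation types weigh as much as frequent ones, which matters for unbalanced datasets.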
4.2. Model Training. In order to classify the open relations with SemiE, it is necessary to generate phrase-level feature vectors. Three schemes are adopted to calculate representations for the relation phrases within the annotated examples, as described in Section 3.2. In these schemes, we use the released word embedding set GoogleNews-vectors-negative300.bin, which was trained with Mikolov's word2vec tool (http://code.google.com/p/word2vec/), to produce phrase representations. For words that are not contained in the pretrained word embedding set, the vectors are initialized randomly in the range from -0.25 to 0.25. As detailed in Algorithm 1, the proposed SemiE learns only two parameters for each class during its training process. In our experiments, all the parameters are learned by making only a single pass over the training corpus. In contrast, the ANN-based models have a complex structure and a massive parameter set. Optimizing these parameters takes tens, if not hundreds, of epochs, with the potential risk of overfitting (see Tables 2 and 3 for details). Accordingly, the SemiE algorithm presented in this paper is significantly more efficient than all of the compared methods.
4.3. Experimental Results. Table 2 compares our proposed method with other state-of-the-art relation classification models on SemEval-2010 Task 8. SDP-LSTM [14] picks up heterogeneous information via the shortest dependency path between the entity pair and integrates external linguistic features via multichannel LSTM networks. On this basis, it obtains an F1-score of 83.7%. To get better performance, many relation classification models involve external lexical knowledge. Although the proposed SemiE model does not make use of any complicated human-designed features, it still achieves a superior F1-score of 85.1% when it works with the attention model. CR-CNN [16] presents a new ranking function to substitute for Softmax and pays more attention to the influence of the class "Other". This targeted modification achieves an F1-score of 84.1%. By contrast, our model does not use any ranking function for classification.
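The handling of out-of-vocabulary words can be sketched as follows. The Python example assumes that the pretrained embeddings have already been loaded into a plain dictionary from words to lists of floats; the function name `lookup_vector` is hypothetical, and caching the random vector so that a repeated unknown word keeps the same representation is a design choice of this sketch rather than a detail stated above.

```python
import random


def lookup_vector(word, embeddings, dim=300, rng=random):
    """Return the pretrained vector, or a random one in [-0.25, 0.25] if unseen."""
    if word not in embeddings:
        # Cache the random vector so the same unseen word maps to one vector.
        embeddings[word] = [rng.uniform(-0.25, 0.25) for _ in range(dim)]
    return embeddings[word]
```

The default dimension of 300 matches the GoogleNews-vectors-negative300 embeddings named above.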
All of the three models, BRNN, BLSTM, and Att-BLSTM, are based on the bidirectional RNN architecture. BRNN [25] utilizes the original RNN to extract sentence-level feature with the assistance of Position Indicator and Max-Pooling operation, which achieves F1-score of 82.8%. BLSTM [20] leverages bidirectional LSTM and a piece-wise Max-Pooling to generate sentence-level representation. This model achieves the performance of 84.3% with NLP tools and lexical resources. However, our result is yielded without any extra features. Att-BLSTM [26] employs attention mechanism for relation classification. EAtt-BiGRU [27] utilizes attention layer with reasonable a priori knowledge. This model generates attention weights depending on corresponding entity pair information and brings better performance than Att-BLSTM. However, our attention mechanism involves only information quantity of the words within the phrase. The proposed approach outperforms these models without any extra features.
The next experiment compares our method with some recent models proposed by Zhang and Wang [25] on the MIML-RE dataset. The experimental results are presented in Table 3.
Table 3 shows that, when it works with Vector-Sum or Max-Pooling, the proposed SemiE obtains F1 scores similar to those of the BRNN and CNN models but achieves a good balance of precision and recall. Compared with BRNN and CNN, however, the SemiE+attention model achieves a noticeable boost in precision, which leads to the best F1 score on the MIML-RE dataset.
A particular advantage of the RNN model is that it can tackle long-distance patterns more effectively than the CNN model. Nevertheless, Table 3 shows that the proposed SemiE+attention model significantly outperforms the RNN model. We attribute this to the large proportion of long contexts in the MIML-RE data. This suggests that the SemiE+attention model is more suitable for long-context relations than the RNN model.
4.4. Error Analysis. To better understand the merits and demerits of our models, we performed a detailed analysis of the classification errors occurring in the experiments.
The first type of error is caused by the SemiE itself. As discussed in Section 3, SemiE assumes that the training examples obey a normal distribution. If the number of samples is not large enough (e.g., fewer than 30), this assumption does not hold. By analysing the statistics of the annotated relation types of the SemEval-2010 Task 8 dataset, we found that the relation type with the largest number of instances is Cause-Effect (1003 instances) and the smallest one is Instrument-Agency (504 instances). Because the number of training examples in each relation type is large enough, the assumption of a normal distribution is reasonable. As a consequence, the proposed SemiE achieves better performance on this dataset. In contrast, the training instances are not balanced across relation types in the MIML-RE dataset. For instance, the relation type employee-of contains 1978 annotated training examples, but there are only 14 annotated examples for the relation type schools-attended. We find that 65% of the classification errors occurred in relations with few annotated examples, which leads to a dramatic drop in the F1 score on this dataset.
The second type of error is caused by the relation phrase embeddings. The three most basic characteristics of a sequence are arguably its length, the items within it, and their order [38]. Somewhat surprisingly, the simple vector sum model can encode a fair amount of information with regard to length, word content, and word order, although this model does not attempt to preserve word order information. The attention model proposed in this paper is based on the vector sum but additionally weights each word by its information quantity. Nevertheless, we note that when encoding longer phrases the proposed attention model (like the vector sum) tends to lose more order information, which may cause additional classification errors.

Conclusion
Open relation extraction has been a growing field of research in the last few years, but, to our knowledge, no research on open relation classification has been presented in the literature. In this paper, we proposed a semieager learning approach, named SemiE, to deal with the problem of open relation classification. Unlike eager learning approaches, which construct complex models, the proposed SemiE is much simpler and quite efficient but still preserves the benefits of both the eager learning and lazy learning approaches. To classify relations with SemiE, it is necessary to generate phrase-level feature vectors of the same size. Although Vector-Sum and Max-Pooling can be employed, we also presented an adapted attention model, inspired by the attention mechanism in deep neural network architectures.
Experimental results on two benchmark datasets demonstrated that the semieager learning approach can achieve better results than recently presented approaches. With the attention model in particular, the SemiE model exhibits clear advantages for sentences with long-distance relations.