Feature selection is of paramount importance for text-mining classifiers with high-dimensional features. The Turku Event Extraction System (TEES), the best-performing tool in the GENIA BioNLP 2009/2011 shared tasks, relies heavily on high-dimensional features. This paper describes research that, based on an implementation of an accumulated effect evaluation (AEE) algorithm applying a greedy search strategy, analyses the contribution of every single feature class in TEES with a view to identifying important features and modifying the feature set accordingly. With the updated feature set, a new system is obtained with enhanced performance, achieving an increased F-score.
Knowledge discovery based on text mining technology has long been a challenging issue for both linguists and knowledge engineering scientists. The application of text mining technologies to large collections of known texts, such as the MEDLINE database, has become especially popular in the area of biological and medical information processing [
While named entity recognition (NER) has been exploited as a powerful approach to automatic information retrieval, there has recently been increased interest in finding more complex structural information and richer knowledge in documents [
Designed by Bjorne et al. [
As a sophisticated text mining classifier, TEES uses enormous feature classes. For the BioNLP 2009 shared tasks, it produced 405,165 features for trigger detection and 477,840 features for edge detection from the training data, which not only consume large amounts of processing time but also introduce undesirable noise that affects system performance. To address this issue and achieve even better system performance in terms of both processing time and recognition accuracy, a natural starting point is an in-depth examination of the system's processes for feature selection (FS) and feature reduction (FR).
The integration of FS and FR has been demonstrated to hold good potential for enhancing an existing system [
This paper focuses on the feature selection rules of TEES with the aim of improving its performance on the GENIA task. By designing an objective function for the greedy search algorithm, we propose an accumulated effect evaluation (AEE) algorithm, which is simple and effective and can numerically evaluate the contribution of each feature class separately with respect to its performance in the combined test. Moreover, we make further changes to the core feature classes by incorporating new features and merging redundant features in the system. The updated system was evaluated and found to produce better performance in both Tasks 1 and 2. In Task 1, our system achieves a higher F-score.
Unlike the previous NER task, the GENIA task aims to recognize both the entities and the event relationships between such entities. Extending the idea of semantic networks, the recognition task includes the classification of entities (nodes) and their associated events (edges). Participants in the shared task were expected to identify nine events concerning given proteins, that is, gene expression, transcription, protein catabolism, localization, binding, phosphorylation, regulation, positive regulation, and negative regulation. The mandatory core task, Task 1, involves event trigger detection, event typing, and primary argument recognition [
Like other text mining systems, TEES is evaluated by precision, recall, and F-score. The first measure, precision, is defined as

Precision = TP / (TP + FP),

where TP is the number of true positives and FP the number of false positives; it is the fraction of retrieved items that are relevant. The second measure, recall, is defined as

Recall = TP / (TP + FN),

where FN is the number of false negatives. Recall is used to assess the fraction of the documents relevant to the query that are successfully retrieved. Precision and recall are well-known performance measures in text mining, while the F-score combines the two. The F-score is the harmonic mean of precision and recall:

F = 2 · Precision · Recall / (Precision + Recall).
The core of TEES consists of two components: classification-style trigger/event detection and rich features over a graph structure. By mapping tokenized words and entities to nodes, and event relations to edges between entities, TEES treats event extraction as a task of recognizing graph nodes and edges, as shown in Figure
The graphical representation of a complex biological event (refer to [
Generally, TEES first converts node recognition into a 10-class classification problem, corresponding to the 9 event types defined in the shared task plus one class for the negative case. This procedure is defined as trigger detection. Thereafter, edge detection recognizes the concrete relationship between entities, including the semantic direction and the theme/cause relation.
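As a minimal illustration of this framing (our own sketch, not TEES code; the event-type names follow the shared task, while the token and argument examples are invented):

```python
# The 9 GENIA event types; trigger detection adds a 10th class ("neg")
# for tokens that are not event triggers.
EVENT_TYPES = [
    "Gene_expression", "Transcription", "Protein_catabolism", "Localization",
    "Binding", "Phosphorylation", "Regulation", "Positive_regulation",
    "Negative_regulation",
]
NEGATIVE = "neg"

# Trigger detection: a 10-class decision per token (graph node).
triggers = {"expression": "Gene_expression", "of": NEGATIVE}

# Edge detection: classify the relation between a trigger node and an
# argument node, including its semantic direction and theme/cause type.
edges = {("expression", "IL-2"): "Theme"}

assert len(EVENT_TYPES) + 1 == 10  # 9 event types + negative class
```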
As in Figure
The data set used in TEES consists of four files in GENIA corpus [
The TEES system consists of three phases. First, linguistic features are generated in the feature generation phase. Second, in the training phase, train123.xml is used as the training data set and devel123.xml as the development set, and the optimal parameters are obtained. Then, in the third phase, everything123.xml (the union of train123.xml and devel123.xml) is used as the training set, and test.xml serves as the unknown data set for event prediction. Events are extracted from this unknown set and accuracy is subsequently evaluated. See Figure
Data set and pipeline of TEES, cited from TEES Toolkit. Location: TurkuEvent ExtractionSystem readme.pdf, p. 7 in the zip file,
Mainly, there are two parameters in grid searching at the training stage. The first parameter is
Features used for trigger detection are designed in a principled way, and abundant features are generated from the training data set.
For the training data set in GENIA format, the dependency relations of each sentence are output by the Stanford parser [
For a target word in a sentence, which represents a node in the graph, the purpose of trigger detection is to recognize the event type of the word. Meanwhile, edge detection recognizes the theme/cause type between two entities. Therefore, both detection steps can be treated as multiclass classification.
The features produced in trigger detection are categorized into six classes, as listed in Figure
Feature class in trigger detection and edge detection.
Here, the features are well structured from both macro- and microperspectives on the sentence. Basically, the generation rules of the "main feature" and "content feature" classes mostly rely on microlevel information about the word token itself, while the "sentence feature" and "linear order feature" classes rely on macrolevel information about the sentence. In contrast, the "attached edge" and "chain feature" classes rely on the dependency tree and graph information, especially the shortest dependency path (SDP).
Similarly, features used for edge detection can be classified into 8 classes, namely, entity feature, path length feature, terminus token feature, single element feature, path grams feature, path edge feature, sentence feature, and GENIA feature. Among the features above, the third feature is omitted, since it is considered part of the first feature.
A quantitative method is designed to evaluate the importance of feature classes. Here, the feature combination methods are ranked by F-score.
Here, AEE1(i) denotes the accumulated effect score of feature class i. Let c_k(i) be the number of feature combinations among the top-k ranked classifiers that contain class i; then AEE1(i) = c_1(i)/1 + c_2(i)/2 + ⋯ + c_K(i)/K, where K is the total number of ranked combinations.
The idea of AEE1 comes from the understanding that the top classifiers, with higher F-scores, tend to contain the more important feature classes, so a feature class that keeps appearing near the top of the ranking accumulates a larger score.
For better understanding, a simple case is considered. Assume there are two feature classes that could be used in the classifier and so there are three feature combinations for the classifier, namely, 1, 2, and 1&2. Without loss of generality, we assume that the best classifier uses the feature class 1&2, the second one uses the feature class 1, and the worst classifier uses the feature class 2. Here, we denote the rank list 1&2, 1, and 2 as Rank Result A. According to the third column in Table
Examples for AEEi algorithm.
AEEi result for Ranking Result A

| Rank | Result A | Cumulative count of 1 | Cumulative count of 2 | AEE1 term for 1 | AEE1 term for 2 | AEE2 term for 1 | AEE2 term for 2 |
|---|---|---|---|---|---|---|---|
| 1 | 1&2 | 1 | 1 | 1/1 | 1/1 | 1/2 | 1/2 |
| 2 | 1 | 2 | 1 | 2/2 | 1/2 | 2/3 | 1/3 |
| 3 | 2 | 2 | 2 | 2/3 | 2/3 | 2/4 | 2/4 |
| Total | | | | AEE1(1) = 2.667 | AEE1(2) = 2.167 | AEE2(1) = 1.667 | AEE2(2) = 1.333 |
AEEi result for Ranking Result B

| Rank | Result B | Cumulative count of 1 | Cumulative count of 2 | AEE1 term for 1 | AEE1 term for 2 | AEE2 term for 1 | AEE2 term for 2 |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 1/1 | 0/1 | 1/1 | 0/1 |
| 2 | 1&2 | 2 | 1 | 2/2 | 1/2 | 2/3 | 1/3 |
| 3 | 2 | 2 | 2 | 2/3 | 2/3 | 2/4 | 2/4 |
| Total | | | | AEE1(1) = 2.667 | AEE1(2) = 1.167 | AEE2(1) = 2.167 | AEE2(2) = 0.833 |
As another example, we assume a rank list 1, 1&2, and 2, which is denoted as Rank Result B and shown in Table
Therefore, an alternative algorithm, AEE2, is proposed by updating the normalisation: each term is divided not by the rank k but by s_k, the total number of feature-class slots used by the top-k combinations, that is, AEE2(i) = c_1(i)/s_1 + c_2(i)/s_2 + ⋯ + c_K(i)/s_K.
Considering AEE1(1) = 2.667 > AEE1(2) = 2.167 and AEE2(1) = 1.667 > AEE2(2) = 1.333 in Table
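The two worked examples above can be reproduced with a short sketch. This is a minimal re-implementation from the paper's description, not code from TEES; the function name is ours. Exact fractions are used so the accumulated terms match the tables digit for digit:

```python
from fractions import Fraction

def aee_scores(ranked_combos, features):
    """Accumulated effect evaluation over a ranked list of feature combinations.

    ranked_combos: feature-ID sets, best classifier first.
    AEE1(f) sums c_k(f)/k over ranks k, where c_k(f) counts how many of the
    top-k combinations contain f; AEE2(f) divides instead by the cumulative
    number of feature slots used by the top-k combinations.
    """
    aee1 = {f: Fraction(0) for f in features}
    aee2 = {f: Fraction(0) for f in features}
    count = {f: 0 for f in features}  # c_k(f), updated rank by rank
    slots = 0                         # cumulative feature slots s_k
    for k, combo in enumerate(ranked_combos, start=1):
        slots += len(combo)
        for f in combo:
            count[f] += 1
        for f in features:
            aee1[f] += Fraction(count[f], k)
            aee2[f] += Fraction(count[f], slots)
    return aee1, aee2

# Rank Result A: 1&2 ranked best, then 1, then 2.
a1, a2 = aee_scores([{1, 2}, {1}, {2}], [1, 2])
print(round(float(a1[1]), 3), round(float(a1[2]), 3))  # 2.667 2.167
print(round(float(a2[1]), 3), round(float(a2[2]), 3))  # 1.667 1.333
```

Running the same function on Rank Result B ([{1}, {1, 2}, {2}]) gives AEE1 scores 2.667 and 1.167 and AEE2 scores 2.167 and 0.833, matching the second table.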
Unlike a trial-and-error procedure, the scheme of this research is oriented towards better feature selection so that important feature classes are identified by evaluating the contribution of the individual features.
Accordingly, code is written to enhance the vital features. Thereafter, the classifiers with the newly updated features are tested, and better performance is expected. Compared with the previous TEES system, our main contribution is to introduce a feature selection strategy in Phase 1, that is, "linguistic feature selection," as shown in Figure
Flowchart of the research.
Based on the AEE algorithm, all of the combinations of the feature classes are used in classifiers and their corresponding F-scores are recorded.
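Assuming the combination experiment enumerates every non-empty subset of feature classes (the paper does not show this code; the helper below is our sketch), the search space is small enough to train exhaustively: 6 trigger feature classes give 63 combinations, and the 7 edge feature classes in use (IDs 1, 2, 4–8) give 127.

```python
from itertools import combinations

def all_feature_combinations(feature_ids):
    """Enumerate every non-empty combination of feature-class IDs."""
    combos = []
    for r in range(1, len(feature_ids) + 1):
        combos.extend(combinations(feature_ids, r))
    return combos

trigger_combos = all_feature_combinations([1, 2, 3, 4, 5, 6])
edge_combos = all_feature_combinations([1, 2, 4, 5, 6, 7, 8])
print(len(trigger_combos), len(edge_combos))  # 63 127
```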
Using the quantitative algorithm, the importance of feature classes is addressed in the two-phase TEES procedure that involves trigger detection and edge detection. For better understanding, the top ten classifiers with respective feature combinations are shown in Table
Top 10 best classifiers with corresponding feature classes in combination experiment.
| Rank | F-score | Trigger feature combination (edge features fixed) | F-score | Edge feature combination (trigger features fixed) |
|---|---|---|---|---|
| 1st | 51.34 | 1&2&3&4&5 | 52.16 | 1&2&4&5&6&7&8 |
| 2nd | 51.21 | 1&2&3&4&5&6 | 51.81 | 1&2&4&5&6&8 |
| 3rd | 50.71 | 1&2&4&5 | 51.29 | 1&2&5&6&7&8 |
| 4th | 50.19 | 1&2&3&4&6 | 51.18 | 1&2&5&6&8 |
| 5th | 49.9 | 1&2&4&5&6 | 50.33 | 1&2&4&6&7&8 |
| 6th | 49.74 | 1&3&4&5 | 50.23 | 1&2&4&6&8 |
| 7th | 49.44 | 1&4&5 | 49.26 | 2&4&5&6&8 |
| 8th | 49.16 | 1&2&3&4 | 49.02 | 1&2&4&5&6 |
| 9th | 48.82 | 1&2&4&6 | 48.32 | 2&4&5&6&7&8 |
| 10th | 47.82 | 1&2&4 | 47.42 | 2&5&6&8 |
Score of features in trigger detection.
| Feature ID | Feature name | AEE1 score | AEE2 score |
|---|---|---|---|
| 1 | Sentence feature | 40.83 | 10.80 |
| 2 | Main feature | 39.94 | 10.82 |
| 3 | Linear order feature | 32.99 | 8.99 |
| 4 | Content feature | 52.23 | 14.16 |
| 5 | Attached edge feature | 36.12 | 9.80 |
| 6 | Chain feature | 30.09 | 8.43 |
| — | Theoretical maximum | 53.43 | 20.52 |
| — | Theoretical minimum | 10.27 | 3.28 |
Score of features in edge detection.
| Feature ID | Feature name | AEE1 score | AEE2 score |
|---|---|---|---|
| 1 | Entity feature | 77.60 | 18.63 |
| 2 | Path length feature | 83.84 | 19.57 |
| 4 | Single element feature | 68.45 | 16.57 |
| 5 | Path grams feature | 74.88 | 17.51 |
| 6 | Path edge feature | 77.76 | 17.88 |
| 7 | Sentence feature | 56.09 | 13.40 |
| 8 | GENIA feature | 66.35 | 15.45 |
| — | Theoretical maximum | 107.61 | 35.80 |
| — | Theoretical minimum | 20.08 | 5.54 |
During trigger detection, the results show that the 4th feature class performs best and the 6th performs worst. By calculating the AEEi values of the features, Figures
AEE1 and AEE2 plots of the best/worst feature in trigger detection.
AEE1 and AEE2 plots of the best/worst feature in edge detection.
The comparison between the two features shows how much better the 4th feature performs than the 6th. The scores 52.23 and 30.09 also correspond to the areas below the respective curves. The AEE1 and AEE2 plots of the best and worst features in trigger and edge detection are shown in Figures
Combinations of features for trigger detection show that the “content feature” in trigger detection is the most important one, and the “chain feature” is the worst. This result shows that, in terms of identifying a target word token, the target itself provides more information than the neighboring tokens.
Considering the feature generation rules of the 4th feature class, the "content feature" class contains four features: "upper," "has," "dt," and "tt." Specifically, "upper" identifies the upper or lower case of letters, "has" marks the existence of a digit or hyphen, "dt" records consecutive two-letter substrings, and "tt" records consecutive three-letter substrings. Since the content feature is vital for trigger detection, its feature generation rule can be strengthened along the same lines. Accordingly, a new "ft" feature is inserted into the "content feature" class, which considers consecutive four-letter substrings of the word. Moreover, the 6th feature class is modified by merging similar features related to dependency trees in both the 5th and the 6th feature classes. For simplicity, the updated features are denoted as 4′ and 6′.
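As a sketch, the generation rules just described might look as follows. This is our hypothetical re-implementation, not TEES's actual code; the feature-name strings are illustrative.

```python
def content_features(token):
    """Content-feature sketch: 'upper', 'has', 'dt', 'tt' plus the new 'ft'."""
    feats = set()
    if token[:1].isupper():
        feats.add("upper")                  # upper/lower case of letters
    if any(c.isdigit() for c in token):
        feats.add("has_digit")              # 'has': token contains a digit...
    if "-" in token:
        feats.add("has_hyphen")             # ...or a hyphen
    low = token.lower()
    feats.update("dt_" + low[i:i + 2] for i in range(len(low) - 1))  # 2-letter
    feats.update("tt_" + low[i:i + 3] for i in range(len(low) - 2))  # 3-letter
    feats.update("ft_" + low[i:i + 4] for i in range(len(low) - 3))  # new 4-letter
    return feats

print(sorted(content_features("Binds"))[:3])  # ['dt_bi', 'dt_ds', 'dt_in']
```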
Furthermore, if we compare the best performance among classifiers whose trigger features comprise the original features, the 4′ features, the 6′ features, or the 4′&6′ features (all with the original edge features), the best trigger combinations are 1&2&3&4&5, 1&2&3&4′&5, 1&2&3&4&5&6′, and 1&2&3&4′&5&6′, respectively, while the
The complete combination experiment shows that the best combination of trigger feature is 1&2&3&4′&5, with an
Similarly, for feature selection in edge detection, various experiments are carried out based on the best combination of trigger features. Here, the best and worst features (the 2nd and 7th) are chosen to be modified in edge detection, and the new feature classes are denoted 2′ and 7′. With the fixed trigger feature set 1&2&3&4′&5, we test the classifier with the original edge features, the 2′ feature, the 7′ feature, and the 2′&7′ features, and obtain the best classifier in each combination experiment. The best classifier in each combination has the feature set 1&2&4&5&6&7&8, 1&2′&4&5&6&7&8, 1&2&4&5&6&7′&8, or 1&2′&4&5&6&7′&8, respectively, and the achieved
In the experiments above, we test the performance of the trigger feature classes by fixing the edge features; likewise, we test the edge features by fixing the trigger features. We observe that the feature modifications in this phase indeed achieve improvement: all of the best combinations perform better than the result of the best trigger. Finally, using the best trigger feature set (1&2&3&4′&5) and the best edge feature set (1&2′&4&5&6&7′&8), the best combination achieves the highest score of 53.27, which is better than the best performance of 51.21 previously reported for TEES 2.0. Therefore, we conclude that the best classifier has trigger feature set 1&2&3&4′&5 and edge feature set 1&2′&4&5&6&7′&8, where trigger-
The best feature combination after choosing the best trigger feature (1&2&3&4′&5) and best edge feature (1&2′&4&5&6&7′&8).
| Task 1 | Recall | Precision | F-score |
|---|---|---|---|
| Strict evaluation mode | 46.23 | 62.84 | 53.27 |
| Approximate span and recursive mode | 49.69 | 67.48 | 57.24 |
| Event decomposition in the approximate span mode | 51.19 | 73.21 | 60.25 |

| Task 2 | Recall | Precision | F-score |
|---|---|---|---|
| Strict evaluation mode | 44.90 | 61.11 | 51.77 |
| Approximate span and recursive mode | 48.41 | 65.81 | 55.79 |
| Event decomposition in the approximate span mode | 50.52 | 73.15 | 59.76 |
Recall = TP/(TP + FN), Precision = TP/(TP + FP), and F-score = 2 · Precision · Recall/(Precision + Recall), where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
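The F column of the table can be checked against the precision and recall columns with the harmonic-mean formula (a sketch; the function name is ours, and some rows may differ in the last digit because the published precision and recall values are themselves rounded):

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F column of the table)."""
    return 2 * precision * recall / (precision + recall)

# Reproduce the strict-evaluation F-scores for Tasks 1 and 2.
print(round(f_score(62.84, 46.23), 2))  # 53.27
print(round(f_score(61.11, 44.90), 2))  # 51.77
```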
Compared with the 24 participants in the GENIA task of BioNLP 2009 and the historical progress of Bjorne's work, a performance contour is given in Figure
As Figure
Comparison of F-scores with other systems on the GENIA task.

| Bjorne et al. 2009 [ | Bjorne et al. 2011 [ | Bjorne et al. 2012 [ | TEES 1.01 | Riedel et al. 2011 [ | Ours |
|---|---|---|---|---|---|
| 51.95 | 52.86 | 53.15 | 55.65 | 56.00 | 57.24 |
In this research, we designed a feature selection strategy, AEE, to evaluate the performance of individual feature classes to identify the best performing feature sets. An important finding is that the greatest contribution comes from the content feature class in trigger detection. In this section, a routine analysis of the contribution is shown, which yields the same finding and supports the same conclusion that the content feature class contributes the most towards event recognition and extraction.
First, retaining only one feature class in the classifier, we can get a separate F-score for each class.
Second, we observe all of the double combinations involving the
Third, a similar phenomenon occurs in the case of three-feature-combination experiment and four-feature-combination experiment. In all cases, when
The best feature combination after choosing features dynamically in trigger and edge detection.
| | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Feature in trigger | Sentence feature | Main feature | Linear order feature | Content feature | Attached edge feature | Chain feature |
| Feature size | 18998 | 24944 | 73744 | 8573 | 100561 | 178345 |
| (One-feature analysis) | | | | | | |
| F-score | 0 | 42.05 | 3.50 | 27.11 | 7.33 | 5.48 |
| Average contribution | 0 | 0.001 | — | 0.003 | — | — |
| (Double-feature combination analysis) | | | | | | |
| Best combination involving feature i | (+4) | (+4) | (+4) | (+1) | (+2) | (+4) |
| F-score | 45.39 | 36.65 | 22.29 | 45.39 | 28.27 | 23.48 |
| Worst combination involving feature i | (+2) | (+1) | (+1) | (+3) | (+1) | (+1) |
| F-score | 0 | 0 | 0 | 22.29 | 0.91 | 3.10 |
| (Three-feature combination analysis) | | | | | | |
| Best combination involving feature i | (+4, 5) | (+1, 4) | (+1, 4) | (+1, 5) | (+1, 4) | (+1, 4) |
| F-score | 49.44 | 47.82 | 45.13 | 49.44 | 49.44 | 45.51 |
| Worst combination involving feature i | (+2, 3) | (+1, 3) | (+1, 2) | (+3, 6) | (+1, 3) | (+1, 3) |
| F-score | 0.11 | 0.11 | 0.11 | 20.76 | 1.98 | 2.40 |
| (Four-feature combination analysis) | | | | | | |
| Best combination involving feature i | (+2, 4, 5) | (+1, 4, 5) | (+1, 4, 5) | (+1, 2, 5) | (+1, 2, 4) | (+1, 2, 4) |
| F-score | 50.71 | 50.71 | 49.74 | 50.71 | 50.71 | 48.82 |
| Worst combination involving feature i | (+2, 3, 6) | (+1, 3, 6) | (+1, 2, 6) | (+3, 5, 6) | (+1, 2, 3) | (+1, 2, 3) |
| F-score | 5.77 | 5.77 | 5.77 | 21.13 | 7.02 | 5.77 |
| (Five-feature combination analysis) | | | | | | |
| F-score without feature i | 34.22 | 47.1 | 49.90 | 16.37 | 50.19 | 51.34 |
Finally, in yet another analysis, we observe the cases where the
Through the routine analysis above, there is ample evidence supporting the importance of the 4th feature class in trigger detection. In numerical terms, the contribution value of the 4th feature is greater than that of the others, which confirms the judgment. Furthermore, we can sort these features according to their contribution values. These results further justify our decision to modify the 4th feature and thereby enhance system performance.
Interestingly, the routine analysis gives a substantially positive evaluation of the 4th trigger feature, which agrees with the results of the quantitative AEE analysis. This consistent tendency of feature importance in turn supports the reliability of the AEE algorithm. Since it is cumbersome to use routine analysis for all of the features, we expect the AEEi algorithm to be useful in more general circumstances.
The effectiveness of the 4th trigger feature motivated the feature class modification of inserting "ft" features. One should also note that all the features in the 4th feature class relate to the spelling of word tokens, which is similar to stems but carries richer information than stems. Besides, we can also analyze and ascertain the importance of other features, such as POS information, through feature combinations. Here, the edge features are fixed, and only a smaller subset of the trigger features is tested through combination experiments.
The full combinations are listed in Supplementary Material Appendix E and Table
Analysis of average contribution of lexical features.
| | Feature | Feature class | F-score contribution | Feature size | Average contribution |
|---|---|---|---|---|---|
| 1 | Nonstem | Main feature | 6.43 | 154 | 0.04175 |
| 2 | POS | Main feature | 1.52 | 47 | 0.03234 |
| 3 | dt | Content feature | 20.13 | 1172 | 0.01718 |
| 4 | tt | Content feature | 27.63 | 7395 | 0.00374 |
| 5 | Stem | Main feature | 35.45 | 11016 | 0.00322 |
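The "Average contribution" column divides each feature's F-score contribution by its feature size; a quick check of that arithmetic (our reconstruction, assuming this rule):

```python
# (F-score contribution, feature size) per lexical feature, from the table.
rows = {
    "Nonstem": (6.43, 154),
    "POS": (1.52, 47),
    "dt": (20.13, 1172),
    "tt": (27.63, 7395),
    "Stem": (35.45, 11016),
}
avg = {name: round(score / size, 5) for name, (score, size) in rows.items()}
print(avg["Nonstem"], avg["Stem"])  # 0.04175 0.00322
```

The reproduced values match the table for all five rows, so the small "dt" and "tt" feature sets deliver far more contribution per feature than the much larger "Stem" set.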
In fact, trigger features can be analyzed according to their generation rules, namely, sentence feature, main feature, linear order feature, content feature, attached edge feature, and chain feature. This is a state-of-the-art strategy in feature selection. TEES is a fine machine-learning-based system which, however, does not perform intensive feature selection. The absence of a feature selection strategy in previous research stems mainly from two reasons. The first is that the core idea of machine learning is simply to feed enough features into the classifier as a black box; the second is that the performance of a classifier with a huge feature set is usually better in accuracy and F-score.
The authors declare that there is no conflict of interests regarding the publication of this paper.
Research described in this paper was supported in part by grant received from the General Research Fund of the University Grant Council of the Hong Kong Special Administrative Region, China (Project no. CityU 142711), City University of Hong Kong (Project nos. 6354005, 7004091, 9610283, 7002793, 9610226, 9041694, and 9610188), the Fundamental Research Funds for the Central Universities of China (Project no. 2013PY120), and the National Natural Science Foundation of China (Grant no. 61202305). The authors would also like to acknowledge supports received from the Dialogue Systems Group, Department of Chinese, Translation and Linguistics, and the Halliday Center for Intelligent Applications of Language Studies, City University of Hong Kong. The authors wish to express their gratitude to the anonymous reviewers for their constructive and insightful suggestions.