A Novel Sample Selection Strategy for Imbalanced Data of Biomedical Event Extraction with Joint Scoring Mechanism

Biomedical event extraction is an important and difficult task in bioinformatics. With the rapid growth of biomedical literature, the extraction of complex events from unstructured text has attracted increasing attention. However, annotated biomedical corpora are highly imbalanced, which affects the performance of classification algorithms. In this study, a sample selection algorithm based on sequential patterns is proposed to filter negative samples in the training phase. Considering the joint information between the trigger and the arguments of multiargument events, we extract triplets of multiargument events directly using a support vector machine classifier. A joint scoring mechanism, based on sentence similarity and the importance of triggers in the training data, is used to correct the predicted results. Experimental results indicate that the proposed method can extract events efficiently.


Introduction
With the rapid growth of the amount of unstructured or semistructured biomedical literature, researchers need considerable time and effort to read and obtain relevant scientific knowledge. Event extraction from biomedical text is the task of extracting the semantic and role information of biological events, which are often complex structures, such as the relationship between the disease and the drug [1], the relationship between the disease and the gene [2], the interaction between drugs [3], and the interaction between proteins [4,5]. Automatic extraction of biomedical events can be applied to many biomedical applications. Therefore, biomedical text mining technology is useful for people to find biological information more accurately and effectively.
The official BioNLP challenges have been held for several years since 2009 [6][7][8]. The BioNLP shared task (BioNLP-ST) [9] aims to extract fine-grained biomolecular events. It includes a number of subtasks, such as GENIA Event Extraction (GE), Cancer Genetics (CG), Pathway Curation (PC), and Gene Regulation Ontology (GRO). Increasing attention has been given to the task of event extraction, of which the major task in BioNLP-ST is GE; it aims to extract structured events, such as event types, triggers, and arguments, from biomedical text. GE defines an event as a structure comprising an event trigger and one or several arguments. Nine types of events were defined in BioNLP-ST GENIA Event Extraction 2011 (GE'11) and extended to fourteen types in BioNLP-ST GENIA Event Extraction 2013 (GE'13). Because the newly defined event types have too few samples for good training, the study presented in this paper is still based on the nine types defined in GE'11. Table 1 shows the event types, which can be divided into three categories: the simple event class (SVT), the Binding event class (BIND), and the regulation event class (REG). There are five simple events, Gene expression, Transcription, Protein catabolism, Localization, and Phosphorylation; each has only one argument, that is, one theme. A Binding event comprises up to two theme arguments. The REG event class includes Regulation, Positive regulation, and Negative regulation; these are complex because they have two arguments, a theme and an optional cause. Figure 1 shows an example of an event where "IRF-4" and "IFN-alpha" are proteins, "expression" and "induced" are triggers, and two events can be expressed as {E1: Gene expression: "expression", Theme: "IRF-4"} and {E2: Positive regulation: "induced", Theme: E1, Cause: "IFN-alpha"}. We aim to extract these event structures from the text automatically.
Pattern-based methods are used in biomedical relation extraction [10,11] but are less used in biomedical event extraction. These methods mainly extract relations between entities through manually defined patterns and patterns automatically learned from the training data set. Rule-based methods [12][13][14][15] and machine learning-based methods [16][17][18] are the main approaches to the event extraction task. Rule-based methods are similar to pattern-based methods in that they manually define syntax rules and learn new rules from the training data. Machine learning-based methods regard the extraction task as a classification problem. The problem of highly unbalanced training data sets in biomedical event extraction is seldom addressed by most systems. Solutions with support vector machines (SVMs) usually use a simple class weighting strategy [19][20][21]. Other approaches, such as active learning [22,23] and semisupervised learning [24,25], address this problem by increasing the positive sample size. In this study, a sample selection method based on sequential patterns is proposed to solve the problem of imbalanced data in classification, and a joint scoring mechanism based on sentence semantic similarity and the importance of triggers is introduced to further correct false positive predictions.
The paper is organized as follows: related work is presented in Section 2. Our work, namely, the sequential pattern-based sample selection algorithm, the detection of multiargument events, and the joint scoring mechanism, is presented in Section 3. Section 4 describes the experimental results on the GE'11 and GE'13 test sets. Finally, a conclusion is presented in Section 5.

Related Work
Since the organizers of the BioNLP-ST held the first competition on the fine-grained information extraction task of biomedical events in 2009, a variety of methods have been proposed to solve the task. At present, the event extraction systems are mainly divided into two types: rule-based event extraction systems and machine learning-based event extraction systems. The overview papers of BioNLP-ST 2011 and 2013 [7,8] show that the results of machine learning-based methods are better than the results of rule-based methods.
Rule-based event extraction systems [26][27][28][29] are based on sentence structure, grammatical relations, and semantic relations, which makes them flexible. However, the results obtained by these methods have high precision and low recall, which is especially noticeable in simple event extraction. To improve recall, rule-based event extraction systems are forced to relax constraints when automatically learning rules.
The systems based on machine learning are generally divided into three groups. The first group is the pipeline model [30][31][32], in which the event extraction process is divided into three steps: the first step predicts the trigger; the second step detects edges and assigns arguments based on the first step; the final step detects the event elements. The pipeline model has achieved excellent results in the event extraction task, such as the champion of GE'09 [30] (Turku) and the champion of GE'13 [32] (EVEX). Zhou et al. [33] proposed a novel method based on the pipeline model for event trigger identification; they embed knowledge learned from a large text corpus into word features using neural language modeling. Experimental results show that the F-score of event trigger identification improves by 2.5% compared with the approach proposed in [34]. Campos et al. [35] optimized the feature set and training arguments for each event type but only predicted the events in the GE'09 test sets. A linear SVM with the "one-versus-the-rest" multiclass strategy is used to solve the multiclass and multilabel classification problems based on an imbalanced data set at each stage. Although the performance of the pipeline model is excellent, its time complexity is high, and each step is carried out based on the previous one, which makes its performance dependent on the first step of trigger detection. Thus, an error at the first step propagates to the following steps, causing a cascade of errors. The second group is called the joint model [16,17], which overcomes the problem mentioned previously. McClosky et al. [36] used the dual-decomposition method for detecting triggers and arguments and extracted the events using dependency analysis. Li et al. [37] integrated rich features and word embeddings based on dual decomposition to extract biomedical events.
However, finding the optimal state of this joint model requires considering the combination of every token, including unlikely tokens, in the search space, which makes its computation too complicated.
The third group is called the pairwise model [38,39], a combination of the pipeline and joint models that directly extracts trigger and argument instead of detecting the trigger and edge separately. Because it considers the relevance of triggers and arguments, the accuracy of the pairwise model is higher than that of the pipeline model, and it is faster than the joint model in execution time because it applies only a small amount of inference. However, the pairwise model still uses an SVM with the "one-versus-the-rest" multiclass strategy to solve multiclass and multilabel classification problems without dealing with the problem of data imbalance.

Methods
This section presents the major steps in the proposed system, which is based on the pairwise structure of the pairwise model. The event extraction process is summarized in Figure 2. First, sequential patterns are generated from the training data after text preprocessing; during the generation of candidate (trigger, argument) pairs, the unlabeled sample pairs are selected based on the sequential patterns and then trained together with the labeled samples. Second, triplets are extracted directly for multiargument events, and the predicted results for multiargument and single argument events are integrated. Finally, the joint scoring mechanism is applied in postprocessing, and the predicted results are optimized.

Text Preprocessing.
Text preprocessing is the first step in natural language processing (NLP). In the preprocessing stage, nonstandard symbols are removed by NLP tools. We use nltk (nltk.org) to split words and sentences and use the Charniak-Johnson parser with McClosky's biomedical parsing model (McClosky et al. [36]) to analyze the dependency path. After the sentences and words are split and the full dependency path is obtained, we use the four feature sets of TEES [30].

Sample Selection Based on Sequential Pattern.
Sequential pattern mining is one of the most important research subjects in the field of data mining. It aims to find frequent subsequences, that is, sequential patterns that satisfy a minimum support. Many efficient sequential pattern mining algorithms are widely used.
Given a sequence database DS, which is a set of different sequences, let DS = {s_1, s_2, ..., s_n}, where each sequence s = <i_1, i_2, ..., i_m> is an ordered list of items, i_k is an item, and m is the number of items, called the length of the sequence. A sequence s_1 is called a subsequence of s_2 (or s_2 contains s_1) if all items of s_1 appear in s_2 in the same order. The support of a sequence s_1 is the number of sequences in the sequence database DS containing s_1, denoted sup(s_1). Given a minimum support threshold minsup, if sup(s_1) is no less than minsup in DS, the sequence s_1 is called a frequent sequential pattern in DS. In this study, the sequential patterns are generated with the PrefixSpan algorithm [40]. PrefixSpan follows the "divide and conquer" principle: it generates a prefix pattern and then extends it with the suffix patterns found in the corresponding projected database to obtain the sequential patterns, thus avoiding the generation of candidate sequences.
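As a concrete illustration of these definitions, the following sketch mines frequent sequential patterns in the PrefixSpan style. The toy dependency-label sequences are hypothetical, and this minimal recursion omits the optimizations of the full algorithm [40]:

```python
from collections import defaultdict

def is_subsequence(pattern, seq):
    """True if `pattern` occurs in `seq` as an order-preserving subsequence."""
    it = iter(seq)
    return all(item in it for item in pattern)

def prefixspan(db, minsup, prefix=()):
    """Return all frequent sequential patterns in `db` with support >= minsup.

    `db` is a list of item sequences (the projected database for `prefix`).
    PrefixSpan idea: count items that can extend the prefix, keep the frequent
    ones, and recurse on the projected suffix database for each extension.
    """
    patterns = {}
    counts = defaultdict(int)
    for seq in db:
        for item in set(seq):          # count each item once per sequence
            counts[item] += 1
    for item, sup in counts.items():
        if sup < minsup:
            continue
        new_prefix = prefix + (item,)
        patterns[new_prefix] = sup
        # Project: keep the suffix after the first occurrence of `item`.
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        patterns.update(prefixspan(projected, minsup, new_prefix))
    return patterns
```

For example, with the database `[["nsubj", "dobj", "prep"], ["nsubj", "prep"], ["dobj", "prep"]]` and minsup = 2, the pattern `("nsubj", "prep")` is frequent with support 2.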

Extraction of Sequential Patterns in Texts.
A sequence database is constructed as follows. We denote C = {c_i}, i = 1, 2, ..., p, as the set of candidate triggers, which come from the trigger dictionary, and A = {a_j}, j = 1, 2, ..., q, as the set of candidate arguments, which come from the training corpus. The set of pairs (trigger, argument) is denoted as {(c_i, a_j)}. For each labeled candidate pair (c_i, a_j) in the training data, the dependency path between c_i and a_j is extracted; it consists of a typed dependency sequence (the dependency path refers to the sequence of typed dependencies from c_i to a_j). The dependency paths from all labeled candidate pairs make up the sequence database DS, of which s_1 is one of the sequences. Table 2 shows part of the sequences in DS and their frequent subsequences. The sequence s_3 is a subsequence of s_1 and s_5; therefore, sup(s_3) is 3 in DS. If we set minsup = 3, we obtain s_3 as a frequent sequential pattern in DS.
We select each unlabeled candidate pair (c_i, a_j) based on the frequent sequential patterns. The output set of frequent patterns is denoted as LS, and the typed dependency sequence of the pair (c_i, a_j) is denoted as s(c_i, a_j). Let N(c_i, a_j) denote the number of sequences in LS that s(c_i, a_j) contains; given a threshold Θ, if N(c_i, a_j) > Θ, the pair is selected. The choice of Θ thus controls how many unlabeled candidate pairs are kept. We select a suitable threshold with respect to the performance on the development set and discuss the threshold in more detail in the experiment section (Section 4.1.1). For example, let s_1, s_2, and s_3 be three frequent sequences in LS, and let s be the typed dependency sequence of the candidate pair (c_i, a_j), where all three are subsequences of s. Setting the threshold Θ to 2, we obtain N(c_i, a_j) = 3 > 2, so the candidate pair (c_i, a_j) is selected. Algorithm 1 (sample filter) summarizes the sample selection based on the sequential pattern.
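The selection rule can be sketched as follows; the candidate pairs, dependency paths, pattern set, and threshold below are hypothetical examples, not values from the corpus:

```python
def is_subsequence(pattern, seq):
    """True if `pattern` occurs in `seq` in order (not necessarily contiguously)."""
    it = iter(seq)
    return all(item in it for item in pattern)

def select_pairs(candidate_pairs, dep_path_of, frequent_patterns, theta):
    """Keep an unlabeled candidate (trigger, argument) pair only if more than
    `theta` frequent patterns are subsequences of its typed dependency path."""
    selected = []
    for pair in candidate_pairs:
        path = dep_path_of[pair]
        n = sum(1 for p in frequent_patterns if is_subsequence(p, path))
        if n > theta:
            selected.append(pair)
    return selected
```

For instance, with patterns `[("nsubj",), ("prep",), ("nsubj", "prep")]` and Θ = 2, a pair whose dependency path contains all three patterns is kept, while a pair matching none is filtered out.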

Detection of Multiargument Events.
The BIND and REG event classes are more complex because of the involvement of primary and secondary arguments. Among the primary arguments, some are unary and others can relate to two arguments. In this study, only the primary arguments (theme (protein/event), cause (protein/event), and theme (protein)+) are taken into account. To better handle multiargument events, which can be represented as a triplet (trigger, argument, argument2), we propose a method that extracts triplet relations directly.
For single argument events, the pairs (trigger, argument) are extracted directly. Multiargument events are usually detected on the basis of single argument event extraction: the second arguments are then assigned and reclassified for prediction. This approach results in cascading errors. Consider the Binding multiargument event in the pairwise model [32] as an example; the detection process mainly includes two phases: (1) detect pairs. For example, two pairs (t, p_1) and (t, p_2) are extracted from the same sentence with the same trigger and labeled as Binding type. (2) Based on the previous step, evaluate the potential triplet using a dedicated classifier. For example, the triplet (t, p_1, p_2) is evaluated as a potential Binding event. Here, t is a trigger labeled previously, and p_1 and p_2 are proteins labeled previously in pairs. The result of the first step affects the second step: if pair (t, p_1) or pair (t, p_2) is not labeled, the triplet (t, p_1, p_2) will not be detected either. Therefore, for events that include two arguments, our solution is to extract the triplet relations directly. This method uses a single dictionary and classifier for multiargument events. The details are as follows: (1) generate a trigger dictionary for the BIND event class and the REG event class from the training data. Here, E = {e_i} is the set of candidate entities and A = {a_j} is the set of candidate arguments in a sentence, consisting of labeled proteins and candidate entities from the training data.
For the Binding event, if the triplet (t, a_1, a_2) is predicted as true, the predicted single argument events e_1 = (t, theme(a_1)) and e_2 = (t, theme(a_2)) are removed in the step of integrating the single argument and multiargument predictions of a Binding event. For the REG event class, a_2 is output as the cause argument.
If a triplet (t, a_1, a_2) is irrelevant to any pair (t, a) from the same sentence, it is output directly. Compared with the pairwise model, this approach considers the joint information among the triplet (trigger, argument, argument2) from the start and performs better in multiargument event extraction.
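A minimal sketch of the integration step for Binding events, under the rule stated above that a predicted triplet subsumes the two single-argument pairs sharing its trigger; all prediction lists here are hypothetical:

```python
def integrate_binding(pairs, triplets):
    """Merge single-argument pair predictions with directly extracted triplets.

    `pairs` is a list of (trigger, theme) predictions; `triplets` is a list of
    (trigger, theme1, theme2) Binding predictions. When a triplet is predicted
    as true, the two single-argument events it subsumes are removed.
    """
    subsumed = set()
    for trig, a1, a2 in triplets:
        subsumed.add((trig, a1))
        subsumed.add((trig, a2))
    kept_pairs = [p for p in pairs if p not in subsumed]
    return kept_pairs + [tuple(t) for t in triplets]
```

For example, if the triplet ("binds", "p53", "MDM2") is predicted, the pairs ("binds", "p53") and ("binds", "MDM2") are dropped, while unrelated pairs are kept.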

Joint Scoring Mechanism.
Because the sequential pattern method is introduced to balance the training data, the recall is significantly improved. Meanwhile, to correct false positive examples, a joint scoring mechanism is proposed for the predicted results. The scoring mechanism considers two aspects, sentence similarity and the importance of the trigger; predicted results whose scores are below a threshold are treated as false positives.
Sentence similarity is widely used in online search and question answering systems and is an important research subject in NLP. Here, we use the tool sentence2vec, based on convolutional deep structured semantic models (C-DSSM) [41,42], to calculate the semantic relevance score.
Latent semantic analysis (LSA) is a well-known method for indexing and retrieval. Many newer methods extend LSA, and C-DSSM is one of them: it combines deep learning with LSA. C-DSSM is mainly used in web search, where it maps the query and the documents to a common semantic space through a nonlinear projection. The model uses a typical convolutional neural network (CNN) architecture to rank the relevant documents and is mainly divided into two stages.
(1) Map the word vectors to their corresponding semantic concept vectors. The CNN architecture has three hidden layers. The first layer is word hashing, which is mainly based on the letter n-gram method; word hashing reduces the dimension of the bag-of-words term vectors. After the word hashing layer, a convolutional layer extracts local contextual features, and max pooling integrates the local feature vectors into a global feature vector. A high-level semantic feature vector is produced at the final semantic layer. This structure makes the learning of the CNN effective. Figure 3 describes the architecture of the C-DSSM.
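For illustration, the letter n-gram word hashing layer (with n = 3 and '#' as the word-boundary marker, as commonly described for DSSM-family models) can be sketched as follows; the trigram vocabulary here is a hypothetical example:

```python
def letter_ngrams(word, n=3):
    """Break a word into letter n-grams with '#' boundary markers."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def hash_vector(word, trigram_index):
    """Map a word to a sparse bag-of-letter-trigrams count vector."""
    vec = [0] * len(trigram_index)
    for g in letter_ngrams(word):
        if g in trigram_index:
            vec[trigram_index[g]] += 1
    return vec
```

For example, "good" becomes the trigrams #go, goo, ood, od#, so a 500k-word vocabulary collapses to a much smaller trigram space, which is the dimensionality reduction the word hashing layer provides.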
Let x be the input term vector and y the output vector; h_i, i = 1, ..., N-1, are the intermediate hidden layers, W_i is the i-th weight matrix, and b_i is the i-th bias term. The forward computation then becomes

h_1 = W_1 x,
h_i = f(W_i h_{i-1} + b_i), i = 2, ..., N-1,
y = f(W_N h_{N-1} + b_N),

where f is a nonlinear activation function.

(2) Calculate the relevance score between the document and the query. The score is obtained by computing the cosine similarity of the semantic concept vectors of the ⟨query, document⟩ pair:

R(Q, D) = cos(y_Q, y_D) = (y_Q^T y_D) / (||y_Q|| ||y_D||).

Table 3: Statistics of the data sets.

          Full papers              Abstracts                Events
          Training  Devel  Test    Training  Devel  Test    Training  Devel  Test
GE'13     10        10     14      0         0      0       2817      3199   3348
GE'11     5         5      4       800       150    260     10310     4690   5301

Training is training data, Devel is development data, and Test is test data.

The process to calculate the joint score for each predicted result (t, a_1, a_2) is described as follows.
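The cosine relevance score between two semantic concept vectors can be computed as in this minimal sketch (pure Python, no external dependencies):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two semantic concept vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

The score ranges from -1 to 1, with 1 for parallel vectors and 0 for orthogonal ones.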
Step 1. Calculate the similarity between the sentence S in which the predicted result is located and all related sentences in the training data. Denote SS = {S_1, S_2, ..., S_k} as the set of training sentences that contain the same trigger as S, and obtain the maximum value

Sim(S) = max_{S_i in SS} cos(y_S, y_{S_i}) if SS is nonempty, and Sim(S) = 0 otherwise.   (4)
Step 2. Compute the importance of the trigger t, where w_1 and w_2 measure the importance of trigger t: n(t, v) refers to the number of occurrences of t as a trigger of event type v in the training data, n_1 is the number of occurrences of t belonging to type v in the predicted result set R, n_2 is the total number of occurrences of t in R, and v is an event type described in Table 1:

w_1 = n(t, v) / sum_{v'} n(t, v'),   w_2 = n_1 / n_2.   (5)
Step 3. Combine Sim(S) and the trigger importance to score the predicted result (t, a_1, a_2). The calculation formula is given as follows:

Score(t, a_1, a_2) = α · Sim(S) + (1 − α) · w_1 · w_2,   (6)

where α represents a weight. The sentence similarity computation is based on semantic analysis, which can correct false positive examples very well; therefore, the weight α in formula (6) is given a higher value.
Step 4. Given a threshold δ, if Score(t, a_1, a_2) < δ, the example is considered negative.
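Putting the four steps together, a sketch of the scoring pass might look as follows; the weighted-combination form and all inputs are illustrative assumptions rather than the exact published formulas:

```python
def joint_score(sim_max, w1, w2, alpha=0.7):
    """Combine sentence similarity and trigger importance into one score.

    `sim_max` is the maximum cosine similarity between the sentence of the
    predicted event and training sentences sharing its trigger; `w1` and `w2`
    are the trigger-importance terms; `alpha` weights similarity more heavily.
    """
    return alpha * sim_max + (1 - alpha) * w1 * w2

def filter_predictions(predictions, threshold):
    """Keep only predicted events whose joint score reaches the threshold.

    `predictions` maps an event id to its (sim_max, w1, w2) triple; events
    scoring below `threshold` are treated as false positives and removed.
    """
    return [eid for eid, (s, w1, w2) in predictions.items()
            if joint_score(s, w1, w2) >= threshold]
```

Because the similarity term is weighted more heavily (alpha above 0.5), a prediction made in a sentence very similar to known training sentences survives even when its trigger is rare.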

Experimental Setup.
Experiments are conducted on the GE'11 and GE'13 corpora. Nine types of events were defined in GE'11 and extended to fourteen types in GE'13; the study presented in this paper is still based on the nine types defined in GE'11. The data sets of GE'11 and GE'13 are different: no abstracts were included in GE'13, and GE'13 contains more full papers than GE'11. Table 3 shows the statistics on the different data sets. We merge the GE'11 and GE'13 training data and development data as the final training data. The final training data, after eliminating duplicate papers, contain 16375 events. All parameters of our system have been optimized on the development set. The approximate span/approximate recursive evaluation is reported using the online tool provided by the shared task organizers. Our method is mainly divided into three steps: sample selection based on the sequential pattern for imbalanced data; pair and triplet extraction for multiargument events and their integration; and a joint scoring mechanism based on sentence semantic similarity and trigger importance.

Filter Imbalanced Data.
In the sequential pattern sample selection stage, we optimize the parameters of the sequential pattern on the GE'11 development set, where different minimum support and threshold values result in different F-scores. We merge the GE'11 and GE'13 training data as the training data for parameter optimization. We aim to improve the recall through sequential pattern sample selection in the event extraction and then to improve the precision of each event while maintaining the recall, thus improving the final F-score. Table 4 shows the ratio of positive and negative samples (P : N) after selection with different minimum support and threshold values by the sequential pattern on the parameter training data; here, we use the number of events as the number of samples. The ratio of positive to negative samples is 1 : 13.163 in the annotated corpus. Reducing the negative samples too much or too little will bias the data and thus affect the classifier performance, which is not our intention. Therefore, we choose to reduce about 50% of the negative samples by setting the minimum support and threshold (minsup, Θ) with minsup ∈ {4, 5} and Θ ∈ {2, 3}. Figure 4(a) shows the F-scores of the four settings on the GE'11 development set; when the minimum support is 4 and the threshold is 2, the F-score of each event is significantly higher than with the other settings. Table 5 shows the ratio of positive and negative samples after selection with different minimum support and threshold values on the final training data. From Tables 4 and 5, the ratios of positive and negative samples are very close. Therefore, we use the setting with minimum support 4 and threshold 2 on the GE'11 and GE'13 test sets.
Figure 4(b) shows that, after the sample selection based on the sequential pattern, the F-score of almost every event is lower than that of the original model, the pairwise model, on the GE'11 development set. Given that reducing the negative samples results in higher recall but lower precision, we propose a joint scoring mechanism to improve the precision.

The Integration of Multiargument Events.
The results in Table 6 show that the recall and F-score are significantly improved by directly extracting the triplets of Binding events. The REG event class includes nested events, so extracting its multiargument events has high complexity; we only study the Binding events among the multiargument events in this paper. The method of extracting triplets directly for multiargument events does not cause cascading errors. Therefore, it is effective to extract the triplets of the events directly.

Result on BioNLP-ST GENIA 2011.
We evaluate the performance of the method and compare it with the results of other systems. Table 7 shows the results of the method using the official GE'11 online assessment tool. Given that the GE'11 corpus contains abstracts and full text, we evaluate the performance on the whole set, abstracts only, and full text only. The results on abstracts and full text, as well as the whole results, are reported to illustrate that the method classifies events well. Table 7 shows that the F-score on full text is higher than on abstracts for simple events, 81.32 versus 71.44. However, the F-score on abstracts is higher than on full text for the BIND event class, 54.17 versus 45.93, and likewise for the REG event class, 41.36 versus 41.10. The total F-score on full text is higher than on abstracts, 54.23 versus 53.64. Table 7 thus also illustrates that the method performs well on full text. Table 8 shows the comparison of the proposed method with other GE'11 systems. The results for the FAUST, UMass, UTurku, MSR-NLP, and STSS models are reproduced from [7,24]. Our approach obtains the best F-score on full text, 54.23, which is higher than that of the best GE'11 extraction system, FAUST (by 1.56 points), and than those of STSS and UTurku (by about 3.5 points). The precision and recall on full text are also better than those of the other systems. The precision and recall of the SVT and REG event classes are slightly lower than those of FAUST and UMass on abstracts, but they are higher for Binding events. The whole F-score is slightly lower than that of FAUST and UMass and higher than that of UTurku, STSS, and MSR-NLP. However, the recall achieves the highest score, which is mainly due to the sequential pattern sample selection on the unbalanced data.

Result on BioNLP-ST GENIA 2013.
The pipeline approaches are the best-performing methods on GE'13, where EVEX is the official winner. We train the model on the training and development sets and evaluate it on the test set using the official GE'13 online assessment tool. Table 9 shows the evaluation results. The GE'13 test data do not contain abstracts; therefore, we evaluate the performance on full papers only. Table 10 shows the comparison of our method with other GE'13 systems, including TEES 2.1 and EVEX, which belong to the pipeline model. We also add the BioSEM system to the table; it is a rule-based system and achieved the best results on Binding events. The results for TEES 2.1, EVEX, and BioSEM are reproduced from [8].

Conclusions
In this study, a new event extraction system was introduced.
Comparing our system with other event extraction systems, we obtained some positive results. First, we proposed a new method of sample selection based on the sequential pattern to balance the data set, which plays an important role in the process of biomedical event extraction. Second, taking into account the relevance of the trigger and the arguments of multiargument events, the system extracts the pair (trigger, argument) and the triplet (trigger, argument, argument2) at the same time. The integration of the pairs and triplets improves the performance of multiargument event prediction, which improves the F-score as well. Finally, a joint scoring mechanism based on C-DSSM and the importance of the trigger is proposed to correct the predictions. In general, the sample selection based on the sequential pattern achieved the desired effect, and combined with the joint scoring mechanism it further improved the performance of the system. The performance of this method was evaluated with extensive experiments. Although our method is a supervised learning method, it provides a new idea for constructing a good predictive model, because its high recall can be useful in applications such as disease gene discovery. Although numerous efforts have been made, the extraction of complex events is still a huge challenge. In the future, we will further optimize the joint scoring mechanism and integrate external resources into biomedical event extraction through semisupervised or unsupervised approaches.