A machine learning approach is useful for the automatic extraction of protein-protein interaction information from scientific articles. A classifier is generated from training data, represented using several features, to decide whether a protein pair in each sentence has an interaction. A specific keyword directly related to interaction, such as “bind” or “interact”, plays an important role in training classifiers. We call such a keyword, which strongly affects the capability of the classifier, a dominant keyword. Although it is important to identify dominant keywords, whether a keyword is dominant depends on the context in which it occurs. Therefore, we propose a method for predicting whether a keyword is dominant for each instance. In this method, a keyword that yields imbalanced classification results is initially assumed to be a dominant keyword. Then classifiers are separately trained from the instances with and without the assumed dominant keywords. The validity of the assumed dominant keyword is evaluated based on the classification results of the generated classifiers, and the assumption is updated according to the evaluation result. Repeating this process increases the prediction accuracy of the dominant keywords. Our experimental results using five corpora show the effectiveness of our proposed method with dominant keyword prediction.
Proteins and their interactions play a leading role in the most fundamental biological processes, including metabolic activity, signal transduction, and DNA replication and transcription. In general, proteins express their functions through interaction with other molecules, including other proteins.
Information on protein-protein interactions (PPIs) can be found in the scientific literature. Although many efforts have created databases that store PPIs in computer-readable form as structured data, extracting this valuable information from the scientific literature still takes much time and labor. As a result, in recent years, much research has addressed the automated extraction of PPI information from the biological literature.
For the automatic extraction of PPI, the machine learning technique is often utilized. In such approaches, classifiers are created to identify whether there is an interaction between two proteins appearing in a sentence. Many methods that apply the machine learning technique have been proposed, and it is very common to adopt supervised learning, which uses explicit PPI information as training data.
In most of these methods, a protein pair, which consists of two protein names appearing in a sentence, is regarded as an instance along with this sentence. Each instance is represented by many features including lexical features, word context features, and syntactic features derived from the sentence or its syntactic structure. A classifier is trained from instances given as a set of feature vectors. For example, the method proposed by Bunescu and Mooney learned extraction patterns for PPI with a generalized subsequence kernel that utilizes the following three patterns in a sentence: before the first protein, between two proteins, and after the second protein [
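Concretely, every co-occurring pair of protein names in a sentence yields one instance, and subsequence-kernel style methods such as Bunescu and Mooney's use the token spans before, between, and after the pair. A minimal sketch of this instance generation (the function name and token handling are illustrative, not from the paper):

```python
from itertools import combinations

def candidate_pairs(tokens, protein_names):
    """Enumerate candidate instances: every unordered pair of protein
    mentions in one sentence, with the before/between/after token spans
    used by subsequence-kernel style methods (hypothetical helper)."""
    positions = [i for i, t in enumerate(tokens) if t in protein_names]
    for i, j in combinations(positions, 2):
        yield {
            "pair": (tokens[i], tokens[j]),
            "before": tokens[:i],
            "between": tokens[i + 1:j],
            "after": tokens[j + 1:],
        }

tokens = "GerE binds to a site on cotX".split()
instances = list(candidate_pairs(tokens, {"GerE", "cotX"}))
```

Each yielded dictionary is then converted into a feature vector for the classifier.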
In such frameworks, certain feature values in some instances, which can very clearly determine whether an interaction is present, significantly impact the training of the classifiers. The presence of such keywords related to interaction as “
Dominant keywords are exceptionally effective in the process of training classifiers. However, note that the existence of some dominant keywords renders other features ineffective. In other words, dominant keywords can become double-edged swords and decrease the accuracy of the trained classifiers. Therefore, when dominant keywords greatly contribute to training classifiers, more focus should be placed on the keyword features; in the opposite case, it is important to give significant consideration to the other features. Moreover, beyond dominant keywords, other features also play important roles depending on the sentence structure.
In this paper, we propose a novel method in which a training set is divided into four subsets based on the dominant keywords and the sentence structure (the position of the keyword) in each instance, and four types of classifiers are generated, one from each subset, to improve the classification accuracy. If the training set covered all possible instances completely, a keyword that can determine whether an instance containing it belongs to the positive class (including a PPI) or the negative class (excluding a PPI) could be considered dominant. However, the training set is created manually and contains unbalanced data biased toward negative instances that include no PPI. Therefore, it is very difficult to determine whether the bias in the classification into classes is due to these keywords or whether the instances using these keywords are gathered into only one class by chance. Furthermore, whether a keyword is
Therefore, we introduce a mechanism that predicts whether a keyword is dominant for each instance. Initially, we assume that a keyword is dominant based on the bias of the classes of the instances that contain it. The training set is then divided into two subsets, one consisting of instances with one of the initially assumed dominant keywords and the other consisting of instances without them, and two classifiers are generated by training on these two subsets. Based on the classification results of these two classifiers, we verify the reasonableness of the previously assumed presence or absence of the dominant keyword and update the assumption. By repeating these verification and assumption processes, we obtain a more appropriate division into subsets. On the other hand, for the division of subsets based on differences in sentence structure, we do not use the verification and assumption processes used in the prediction of dominant keywords. Instead, we exploit sentence patterns provided beforehand. Since several features are useless for particular sentence patterns, they are removed by feature selection to improve the extraction accuracy.
The rest of this paper is organized as follows. In the next section, we show features that represent the instances in our proposed method. In Section
We consider a binary classification problem in which we deal with positive instances that include PPI and negative instances that do not include it. In the automatic PPI extraction approach, criteria for distinguishing between positive and negative instances, which are identified beforehand, are automatically found using the characteristics of the sentences containing the protein pairs. In this paper, we refer to these sentence characteristics as features.
The PPI extraction framework is as follows. The training data are given as a list of feature vectors and their known class labels. Based on the training data, we run the machine learning algorithm and train the classifiers. The test data, whose feature vectors (but not class labels) are known beforehand, are given to the classifiers. The classifiers output prediction results that identify whether a PPI exists between each protein pair.
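As a rough sketch of this train-then-predict framework (the feature values below are invented for illustration; a Random Forest is used here because that is the learning algorithm named in the conclusion):

```python
# Minimal sketch of the train/predict framework with scikit-learn.
from sklearn.ensemble import RandomForestClassifier

# Each row: [distance, infix_position, negative_word]; label: 1 = PPI.
X_train = [[1, 1, 0], [2, 1, 0], [8, 0, 1], [9, 0, 1]]
y_train = [1, 1, 0, 0]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

X_test = [[1, 1, 0]]          # unlabeled instance: features known, label not
pred = clf.predict(X_test)    # predicted class label for the protein pair
```

The real feature vectors are built from the sentence-level and parse-level features described in the next section.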
The features obtained from a sentence related to PPI may be a description that directly expresses PPI, the existence of words implying PPI, or a description that shows that no PPI exists. These features have been used in many studies of PPI information extraction [
The features used in this paper are broadly divided into three categories: features obtained directly from the sentence, those obtained from parsing information, and those using existing patterns. Next we describe them in detail. In the following tables,
Features that can be extracted from sentences in the text are summarized in Table
Features obtained directly from sentences.
Features | Definitions/remarks | Values | Examples
---|---|---|---
Keywords | Words representing the relationship between two proteins | One of the 180 kinds of words obtained by stemming 642 kinds of words such as … | …
Distance between protein pair and keyword: three types | The word distance defined by the number of words appearing between the keyword mentioned above and the protein names constituting the protein pair; Type 1 is the distance between … | Integer value | In sentence …
Position of keyword: three types | The word order of protein pair and keyword | “Infix” (the order of the sentence is […]), “prefix”, or “postfix” | In sentence …
Position of protein names | The value adding the word distance between the word at the beginning of the sentence and the protein name to one; positions 1 and 2 are defined for … | Integer value | In sentence …
Comma between keyword and protein pair: four types | Since the topic often changes before and after a comma, we use such information if there is any comma between the keyword and the protein pair | “yy”, “nn”, “yn”, or “ny” (e.g., “yy” means commas are observed between …) | In sentence …
Negative words | Whether any negative word such as … appears | “True” or “false” | In sentence …
Conjunctive words | Whether one of the following 16 kinds of words representing conjunctive relations appears: … | “True” or “false” | In sentence …
“Which” | Whether “which” appears; since “which” also represents the conjunctive relation but occurs more frequently than the 16 words mentioned above, we distinguish “which” from the above features | “True” or “false” | …
“But” | Whether “but” appears; in addition to “which”, “but” also frequently represents the conjunctive relation; however, “but” introduces negation to the context | “True” or “false” | …
Words representing assumptions or conditions | Whether “if” or “whether” appears between the protein names or between the keyword and a protein name | “True” or “false” | …
Preposition of keyword | The preposition following the keyword, provided that the word distance between the keyword and the preposition is within 3; if there are several prepositions, the one whose word distance from the keyword is smaller is used | One of the prepositions | In sentence “…
Multiple occurrences of keywords | Whether there is more than one keyword in a sentence | “True” or “false” | In sentence “…
Second keywords: seven kinds | Only one of seven particular words: “… | “True” or “false” for each of the seven words (if some of these seven words appear in the sentence and are not selected as a keyword, we use “true” as a feature value for them) | In sentence “…
Parallel expression of protein pair | Whether the protein names constituting the protein pair are adjacent (they are also considered adjacent even if “—”, “/”, “and”, “or”, “(” appears between them); if protein names are expressed in parallel in a sentence, an interaction between them is unlikely; we can easily determine the parallel expression of a protein pair by checking whether the protein names are adjacent in the word order of the sentence | “True” or “false” | In sentence “…
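Several of these surface features can be computed directly from token positions. A minimal sketch, assuming a tokenized sentence in which each name occurs once; the exact distance convention used in the table above is partly lost here, so the convention in this code is an assumption:

```python
def surface_features(tokens, p1, p2, keyword):
    """Sketch of a few surface features: word distances to the keyword,
    keyword position (infix/prefix/postfix), and protein-pair adjacency."""
    i, j = tokens.index(p1), tokens.index(p2)
    k = tokens.index(keyword)
    lo, hi = min(i, j), max(i, j)
    if lo < k < hi:
        position = "infix"
    elif k < lo:
        position = "prefix"
    else:
        position = "postfix"
    return {
        "dist_p1_kw": abs(i - k) - 1,   # words between protein 1 and keyword
        "dist_p2_kw": abs(j - k) - 1,   # words between protein 2 and keyword
        "position": position,           # word order of pair and keyword
        "parallel": hi - lo == 1,       # adjacent protein names
    }

f = surface_features("GerE binds cotX".split(), "GerE", "cotX", "binds")
```

Here `f["position"]` is `"infix"` because the keyword lies between the two protein names.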
The syntactic structure of the sentence is expressed by parse trees. From them, we can clarify such syntactic features of sentences as the structure of phrases and the structural relation of word pairs. The features obtained from a parse tree are shown in Table
Features obtained from parsing information.
Features | Definitions/remarks | Values | Examples
---|---|---|---
Height of protein pair and keyword: three types | The heights of the protein names constituting the protein pair and the keyword in the parse tree structure; these heights differ from word distances; features height_… | Integer value | In Figure …
Part-of-speech information of protein pair and keyword: three types | The part-of-speech information of PATH (the path from the root) in the parse tree structure for the protein names constituting the protein pair and the keyword; it makes it possible to represent the syntactic structure and train classifiers to learn a pseudo grammar structure; features POS_… | List of part-of-speech information of PATH | In Figure …
Example of parse tree.
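Both parse-tree features can be read off a tree once the path from the root to a word is known: the labels along the path give the part-of-speech PATH, and the path length gives the height. A small sketch using nested tuples as a stand-in parse tree (the tree encoding is illustrative, not the paper's parser output):

```python
def pos_path(tree, word, path=()):
    """Depth-first search for `word` in a parse tree given as nested
    tuples (label, child, ...); returns the labels from root to the
    word's preterminal. The path length is the word's height (depth)."""
    label, *children = tree
    for child in children:
        if child == word:                      # preterminal reached
            return path + (label,)
        if isinstance(child, tuple):
            found = pos_path(child, word, path + (label,))
            if found:
                return found
    return None

tree = ("S",
        ("NP", ("NN", "GerE")),
        ("VP", ("VBZ", "binds"), ("NP", ("NN", "cotX"))))
path = pos_path(tree, "binds")
# labels from root to the keyword's part-of-speech tag
```

For the keyword "binds" in this toy tree, the PATH feature is the label sequence S, VP, VBZ and the height feature is its length.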
Several structure patterns are related to the presence or absence of PPI [
Set of 13 PPI patterns.
Number | PPI pattern
---|---
Pattern 1 | …
Pattern 2 | …
Pattern 3 | …
Pattern 4 | …
Pattern 5 | …
Pattern 6 | …
Pattern 7 | …
Pattern 8 | complex between …
Pattern 9 | complex of …
Pattern 10 | …
Pattern 11 | …
Pattern 12 | …
Pattern 13 | between …
In Table
Among the various features mentioned in the preceding section, the feature
Note that the same keyword can be dominant in one context and not in another. For example, in sentence “GerE binds to a site on one of these promoters, cotX, that overlaps its -35 region,” the value of feature
As described in Section
For example, if the value of feature
As mentioned in Sections
Division of training set.
Subset | Dominant keyword | Position of keyword |
---|---|---|
II | Included | Infix |
IP | Included | Prefix/postfix |
NI | Not included | Infix |
NP | Not included | Prefix/postfix |
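Routing an instance to one of the four subsets (II, IP, NI, NP) then reduces to reading two flags, and one classifier is trained per subset. A minimal sketch (the field names are hypothetical):

```python
# Route each instance to one of the four training subsets from its
# dominant-keyword flag and its keyword position.
def subset_of(instance):
    dk = "I" if instance["dominant_keyword"] else "N"   # Included / Not
    pos = "I" if instance["position"] == "infix" else "P"  # Infix / Pre-post
    return dk + pos

train = [
    {"dominant_keyword": True,  "position": "infix"},
    {"dominant_keyword": True,  "position": "prefix"},
    {"dominant_keyword": False, "position": "infix"},
    {"dominant_keyword": False, "position": "postfix"},
]
subsets = {}
for inst in train:
    subsets.setdefault(subset_of(inst), []).append(inst)
# subsets now has keys "II", "IP", "NI", "NP"
```

At prediction time, an unlabeled instance is routed the same way and classified by the matching classifier.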
We also classify the unlabeled instances, for which the presence or absence of a PPI must be determined, into one of the four categories and use the corresponding classifier to predict the PPI. An overview of this process is shown in Figure
Overview of PPI prediction based on division of training set.
The division of a training set based on the appearance order of the protein names and the keyword in the sentence can be performed straightforwardly by referring to the feature
Next we outline our method for predicting the presence or absence of dominant keywords. Based on whether a keyword tends to become dominant, we tentatively assume, for each instance, whether its keyword is dominant. Next, the training set is divided into two subsets: instances assumed to possess dominant keywords and instances not assumed to possess them. These two subsets are used to generate two different classifiers. Then, by evaluating the successes and failures of the classification results of these two classifiers, we verify whether the assumed presence or absence of the dominant keywords is appropriate and update the assumption. We discuss the details of this update process later. By repeating the assumption and verification processes, we improve the accuracy of predicting the existence or nonexistence of dominant keywords for each instance.
The initial assumption about whether a certain keyword is dominant is given by the observation of the bias of the classes of instances when classifying them based on the presence or absence of this keyword. We define unbalance degree
For certain keyword
The training set is divided into two subsets based on the
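Since the exact formula for the unbalance degree is not reproduced above, the following sketch uses one plausible definition, the majority-class fraction among the instances containing a keyword, together with a purely illustrative threshold for the initial assumption:

```python
from collections import Counter

def unbalance_degree(labels):
    """Hypothetical unbalance degree for one keyword: the fraction of
    its instances that fall in the majority class. This is only one
    plausible choice, not the paper's exact formula."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def initially_dominant(labels, threshold=0.8):
    """Initial assumption: a keyword is tentatively dominant when the
    classes of the instances containing it are sufficiently biased
    (the threshold value here is illustrative)."""
    return unbalance_degree(labels) >= threshold

# "bind" appearing in 9 positive and 1 negative instance is biased
# enough to be assumed dominant under this definition.
labels_bind = [1] * 9 + [0]
flag = initially_dominant(labels_bind)
```

A keyword occurring with a near-even class split would not pass the threshold and would initially be assumed non-dominant.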
General flow of updating
To update the assumption, the following steps are repeated: (1) for each keyword, compute the class bias of the instances containing it; (2) for each instance, set the initial dominant-keyword flag from that bias; (3) divide the training set into the subset with assumed dominant keywords and the subset without them, and train one classifier per subset; (4) for each instance, classify it with both classifiers; (5) update the instance's dominant-keyword flag from the two classification results; (6) repeat from step (3) until the assumptions stabilize; (7) output the final flags.
The details of procedure
For every instance in the training set, one of the following three cases holds: (i) the prediction results of the two classifiers differ; (ii) both prediction results are correct; (iii) both prediction results are incorrect.
When the prediction results of the two classifiers differ, the presence or absence of the dominant keyword has a strong impact on classification. Therefore, we can determine that if a certain instance is predicted accurately by classifier
If both of the prediction results of the two classifiers are correct, we can only predict the class of the instance from features other than
On the other hand, after a certain number of iterations have updated the
If the prediction results of the two classifiers are both incorrect, the instance does not possess any valid dominant keyword, or the prediction from the features besides
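One round of this update can be sketched as follows. The sketch is simplified: when the two predictions agree (both correct or both incorrect), it simply keeps the current assumption, whereas the full method handles those cases as described above; the classifier choice and data layout are illustrative:

```python
# One round of the assumption-update loop: split on the current
# dominant-keyword flags, train a classifier per subset, then move each
# instance's flag toward the classifier that predicted it correctly.
from sklearn.tree import DecisionTreeClassifier

def update_round(X, y, dk):          # dk[i]: assumed dominant-keyword flag
    with_kw = [i for i in range(len(X)) if dk[i]]
    without = [i for i in range(len(X)) if not dk[i]]
    c_w = DecisionTreeClassifier(random_state=0).fit(
        [X[i] for i in with_kw], [y[i] for i in with_kw])
    c_wo = DecisionTreeClassifier(random_state=0).fit(
        [X[i] for i in without], [y[i] for i in without])
    new_dk = []
    for i in range(len(X)):
        p_w = c_w.predict([X[i]])[0]
        p_wo = c_wo.predict([X[i]])[0]
        if p_w == p_wo:              # both right or both wrong:
            new_dk.append(dk[i])     # keep the current assumption
        else:                        # exactly one classifier is right:
            new_dk.append(p_w == y[i])  # trust the successful one
    return new_dk
```

Iterating `update_round` until the flags stop changing yields the final per-instance prediction of dominant keywords.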
As explained in Section
In the above framework, because the original training set is divided into four training subsets, not all of the prepared features are valid for every training subset. Therefore, we select only the valid features for each training subset. Generally, such a process is called feature selection, and various methods for it have been proposed. Instead of applying such methods, however, we adopt a simple manual feature selection in which meaningless or redundant features are eliminated beforehand based on the sentence structure, focusing on the division of the original training set by the presence or absence of the dominant keywords and the word order. The features removed for each category of training subsets are listed in Table
Removed features for each training subset.
Subset | Removed features |
---|---|
II | Patterns 7, 8, 9, and 13 |
IP | Patterns 1, 2, 10, and 12 |
NI | Patterns 7, 8, 9, and 13 |
NP | Patterns 1, 2, 10, and 12 |
As shown in Table
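Applied to a feature dictionary, this manual selection is a simple lookup and filter. A minimal sketch (feature names such as `pattern_7` are hypothetical encodings of the PPI pattern features):

```python
# Pattern features removed per training subset, as listed in the table.
REMOVED = {
    "II": {7, 8, 9, 13},
    "IP": {1, 2, 10, 12},
    "NI": {7, 8, 9, 13},
    "NP": {1, 2, 10, 12},
}

def select_features(features, subset):
    """Drop the pattern features that cannot occur for this subset's
    word order; `features` maps hypothetical names to values."""
    removed = {f"pattern_{n}" for n in REMOVED[subset]}
    return {k: v for k, v in features.items() if k not in removed}

feats = {"pattern_1": 1, "pattern_7": 0, "keyword": "bind"}
# For subset "II", pattern_7 is dropped; pattern_1 and keyword remain.
```

Each of the four classifiers is then trained only on the features that survive this filter for its subset.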
We use five PPI corpora, LLL [
Furthermore, let threshold value
Let the number of folds,
We used evaluation data created from the above five corpora and divided them into training and test datasets to apply 10-fold CV. To evaluate the test data, we used the average Recall, Precision, and F-measure.
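This evaluation protocol can be sketched with scikit-learn's cross-validation utilities; synthetic imbalanced data stands in for the corpora here:

```python
# 10-fold cross-validation with averaged recall, precision, and F-measure.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic data skewed toward negatives, mimicking PPI corpora.
X, y = make_classification(n_samples=200, weights=[0.8], random_state=0)

scores = cross_validate(
    RandomForestClassifier(random_state=0), X, y, cv=10,
    scoring=("recall", "precision", "f1"))
mean_f = scores["test_f1"].mean()   # F-measure averaged over the 10 folds
```

The same averaging over folds produces the Recall, Precision, and F-measure values reported in the tables below.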
Table
Experimental results.
| Corpus | LLL | | | HPRD50 | | | IEPA | | | AImed | | | BioInfer | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (%) | R | P | F | R | P | F | R | P | F | R | P | F | R | P | F |
| SC | 85.4 | 79.1 | 82.1 | 70.1 | 75.2 | 72.3 | 63.6 | 68.9 | 66.1 | 49.5 | 67.9 | 57.3 | 67.7 | 74.8 | 71.1 |
| MC | 85.4 | 81.9 | 83.6 | 72.4 | 71.5 | 72.0 | 64.2 | 70.3 | 67.1 | 54.4 | 67.7 | 60.3 | 68.1 | 74.3 | 71.1 |
| DK-MC | 84.8 | 84.8 | 84.8 | 77.3 | 72.8 | 75.0 | 66.9 | 71.3 | 69.0 | 54.4 | 66.8 | 60.0 | 69.5 | 74.3 | 71.7 |
| FS-MC | 87.8 | 81.8 | 84.7 | 77.3 | 73.7 | 75.4 | 65.6 | 72.6 | 69.0 | 51.8 | 66.7 | 58.3 | 69.1 | 75.0 | 71.9 |
| DK-FS-MC | 86.6 | 83.5 | 85.0 | 77.9 | 76.0 | 76.9 | 67.2 | 71.4 | 69.2 | 55.0 | 66.0 | 60.0 | 70.8 | 74.8 | 72.7 |
Although
Predicting the presence or absence of the dominant keyword in each instance can improve the learning performance compared with uniformly treating a given keyword as dominant or not across all instances.
In the AImed corpus, although the
We explored the influence of the value of threshold
Influence of value of
| | LLL | HPRD50 | IEPA | AImed | BioInfer |
|---|---|---|---|---|---|
| 0.15 | | 77.0 | | | |
| 0.20 | 83.7 | | 68.4 | 59.4 | 71.6 |
| 0.25 | 81.6 | 74.6 | 67.7 | 58.9 | 71.6 |
| 0.30 | 83.3 | 72.0 | 67.5 | | 71.9 |
| 0.35 | 84.3 | 73.0 | 66.1 | 59.3 | 71.9 |
The comparison results of our proposed method (
Performance comparison of PPI extraction.
| Corpus | Method | R | P | F |
|---|---|---|---|---|
| LLL | Fundel et al. [ | 79.0 | | 82.0 |
| | Fayruzov et al. [ | 86.0 | 72.0 | 78.0 |
| | Van Landeghem et al. [ | 84.0 | 79.0 | 82.0 |
| | DK-FS-MC | 86.6 | 83.5 | 85.0 |
| HPRD50 | Van Landeghem et al. [ | 71.0 | 71.0 | 71.0 |
| | DK-FS-MC | 77.9 | 76.0 | 76.9 |
| IEPA | Van Landeghem et al. [ | | | |
| | DK-FS-MC | 67.2 | 71.4 | 69.2 |
| AImed | Giuliano et al. [ | | 64.5 | |
| | Mitsumori et al. [ | 53.6 | 55.7 | 54.3 |
| | Fayruzov et al. [ | 50.0 | 41.0 | 45.0 |
| | Van Landeghem et al. [ | 58.0 | 66.0 | 62.0 |
| | Edit of Erkan et al. [ | 43.5 | | 55.6 |
| | Cosine of Erkan et al. [ | 55.0 | 62.0 | 58.1 |
| | DK-FS-MC | 55.0 | 66.0 | 60.0 |
Fundel et al. [
Although the
It is not easy to compare the results on AImed with other related research due to differences in preprocessing and feature extraction. For example, although the method by Giuliano et al. utilized neither features obtained from parsing information, nor features using existing patterns, nor dominant keywords, their
Mitsumori et al. [
Erkan et al. [
Number of positive/negative pairs in AImed corpus applied in our work and existing works.
| | Our work | Mitsumori et al. [ | Giuliano et al. [ | Van Landeghem et al. [ | Erkan et al. [ | Fayruzov et al. [ |
|---|---|---|---|---|---|---|
| Positive pairs | 1,000 | 1,107 | 1,008 | 1,000 | 951 | 816 |
| Negative pairs | 4,834 | 4,369 | 4,634 | 4,670 | 3,075 | 3,204 |
In this paper we described our method for automatically extracting PPIs from scientific articles based on dominant keywords that considerably contribute to learning and classification. Based on the existence of a dominant keyword and the sentence structure, a training set is divided into four subsets, and four classifiers are generated, one from each training subset.
We introduced a mechanism that predicts whether a keyword is dominant for each instance. Initially, a particular keyword is assumed to be dominant based on the bias of the classes. Then two classifiers are generated by training on the two subsets into which the training set is divided. Based on the classification results, the previously assumed existence of a dominant keyword is verified and the assumption is updated. By repeating this process, we achieved more accurate predictions of dominant keywords. Moreover, we performed feature selection in which redundant features are removed beforehand based on the sentence structure to improve the extraction accuracy.
Through experimental results, we showed that dominant keyword prediction greatly improves the accuracy of PPI extraction. Moreover, the DK-FS-MC method shows good results of
Extraction accuracy is influenced by the imbalance of PPI data. Since we used the standard Random Forest algorithm, we could not address this challenge of unbalanced PPI data. In ongoing work, we will apply the Weighted Random Forest or the Balanced Random Forest [
The authors declare that there is no conflict of interests regarding the publication of this paper.
Shun Koyabu and Thi Thanh Thuy Phan contributed equally to this work.
This work was partly supported by JSPS KAKENHI Grant no. 24300056 (Grant-in-Aid for Scientific Research(B)).