A Shortest Dependency Path Based Convolutional Neural Network for Protein-Protein Relation Extraction

The state-of-the-art methods for protein-protein interaction (PPI) extraction are primarily based on kernel methods, and their performance strongly depends on handcrafted features. In this paper, we tackle PPI extraction by using convolutional neural networks (CNN) and propose a shortest dependency path based CNN (sdpCNN) model. The proposed method (1) takes only the sdp and word embedding as input and (2) avoids bias from feature selection by using CNN. We performed experiments on the standard Aimed and BioInfer datasets, and the experimental results demonstrated that our approach outperformed state-of-the-art kernel based methods. In particular, by tracking the sdpCNN model, we find that sdpCNN can extract key features automatically, and we verify that pretrained word embedding is crucial in the PPI task.


Introduction
Biomedical relations play an important role in biological processes and are widely researched in the field of biomedical natural language processing (BioNLP). The PPI task aims to extract protein interactions; for example, in the sentence "The distribution of actin filaments is altered by profilin overexpression," the interaction between the protein entities "actin" and "profilin" would be extracted. A number of databases, such as BIND [1], MINT [2], and IntAct [3], have been created to store structured interactions. However, the biomedical literature regarding protein interactions is expanding rapidly, making it difficult for these databases to keep up with the latest protein-protein interactions. Consequently, effective and automatic protein-protein relation extraction systems are becoming more important.
Previous research has illustrated the effectiveness of the shortest dependency path (sdp) between entities for relation extraction in many fields [4][5][6][7]. For example, in the PPI task, [8] proposed an edit-distance kernel based on sdp and classified the relations by SVM. Reference [9] made a detailed investigation into the relevant work on relation extraction and elaborated the important role of sdp in relation extraction. However, how to preprocess the sdp (e.g., using a variety of kernels) and how to combine different features (e.g., part-of-speech, n-grams, and parser tree) are still open problems. In this work, the proposed approach takes the raw sdp as its only input and learns features automatically. Thus, unlike previous research, our approach does not require manual feature selection or feature combination.
Much effort has been devoted to the PPI task, especially with kernel based methods. Most of these methods treat the PPI task as a binary classification problem by determining whether there is an interaction between the two entities. The kernels include the bag-of-words kernel [10], all-path kernel [11], subset-tree kernel [12], edit-distance kernel [8], and graph kernel [13], and they have proven effective in the PPI task. Considering that a single kernel captures only part of the similarity between two instances, hybrid kernels [14][15][16][17] have been proposed and have demonstrated much better performance than single kernels. Kernel methods are effective because they integrate a large number of manually selected features. The problem with existing kernel based methods is how to combine different features; in most cases, sophisticated design is required.
Deep learning methods have achieved remarkable results in computer vision [18] and speech recognition [19], and, building on effective work on neural network language models (NNLM) [20,21], some recent work has focused on neural networks, especially CNN, for natural language processing (NLP) problems. Using CNN to extract features for NLP was previously researched by the authors in [22]; they treated tasks including part-of-speech (POS) tagging, chunking, named entity recognition (NER), and semantic role labeling (SRL) as sequential labeling problems. In recent years, researchers have proposed the use of CNN to extract features for relation extraction. Reference [23] combined the word representation, lexical level features, and word features and used the CNN model to learn the sentence-level features; the features were then concatenated into a vector and fed to a Softmax layer to classify the relationship. Reference [24] shared a similar idea with [23]; the authors proposed a new logistic loss function and a pairwise method to train their CNN model. However, the CNN based methods described by [23,24] usually take the whole sentence or the context between the two target entities as input. The problem with these methods is that such representations fail to describe the relationship between two target entities that are far apart in the sentence, and irrelevant information may also be included because of the long distance. Considering these problems and the complexity of the PPI task, in this work we first use dependency parsing to analyze the sentence and generate the sdp, which captures semantic and syntactic features, and then send the sdp to sdpCNN for classification.
Compared with prior work, the contributions of our work can be summarized as follows:
(1) We propose a new model (sdpCNN) to tackle the PPI task and show that the sdpCNN model built on word embedding is effective in extracting protein-protein relations.
(2) We demonstrate that sdpCNN with pretrained word embedding performs much better than sdpCNN with randomly generated word embedding and than state-of-the-art kernel based methods. It can be concluded that well pretrained word embedding is important in the PPI task.
(3) The proposed model is able to extract key features automatically, so that the manual feature selection procedure can be avoided.

Material and Methods
In this section, we first introduce word embedding, and then we describe the proposed sdpCNN model in detail. The proposed model consists of three parts: sdp extraction, sdpCNN based feature extraction, and multilayer perceptron (MLP) based classification.

Introduction for Word Embedding.
Word embedding is a feature learning technique in NLP where words or phrases from the vocabulary are mapped to vectors of real numbers in a space whose dimension is low relative to the vocabulary size. Many methods have been proposed to train word embedding, but most of them are based on the distributional hypothesis: words that occur in similar contexts tend to have similar meanings. Given this hypothesis, trained word embeddings are close to each other in vector space when the words have similar meanings (Figure 1 shows a visualization of word embedding by t-SNE [25]).
Compared with the traditional "one-hot" representation, pretrained word embedding brings about three advantages. (1) It can capture semantic information and weaken the word gap problem. For example, in Figure 1, the interaction verbs "affects" and "enhance" are clustered together, whereas in the traditional "one-hot" representation the two verbs are completely different. (Interaction verbs usually indicate the relation among entities and are therefore important in the PPI task.)
(2) Data sparseness problem could be avoided since all words are mapped into low-dimensional vectors.
(3) Pretrained word embedding is trained on large unlabeled corpora, and thus it could enlarge the coverage of vocabulary and decrease the number of unknown words.

Shortest Dependency Path (sdp) Extraction.
Semantic dependency parsing has frequently been used to dissect sentences and to capture semantic relations between words that are close in context but far apart in the sentence. To extract the relationship between two entities, the most direct approach is to use the sdp. The motivation for using the sdp is the observation that the sdp between entities usually contains the information necessary to identify their relationship [9]. For example, in Figure 2, the word "affects" in the sdp provides useful information for classifying the two target proteins, and a dependency relation such as "nsubj" between the words "profilin" and "affects" adds supplemental information for classification. (The dependency relation "nsubj" represents "nominal subject," and the governor of this relation is usually a verb; because interaction verbs are crucial in the PPI task, this dependency relation is important as well. More detailed descriptions of "nsubj" can be found in [26].)
To reduce sparseness and ensure the generalization of features, we replace the two target proteins with the special symbols "Protein1" and "Protein2," respectively; thus we obtain the sdp "Protein1-nsubj-affects-dobj-properties-prep_of-Protein2" from Figure 2.
Figure 3 shows the architecture of the proposed sdpCNN model. In the first step, the model transforms an sdp into a matrix representation by looking up the pretrained word embedding; a convolution layer is then applied to this matrix to extract features automatically. The following max-pooling operation keeps the most useful local features. Finally, the extracted features are fed to a multilayer perceptron (MLP) with a hidden layer and a Softmax classifier.
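As a rough illustration of the sdp extraction step, the sketch below runs a breadth-first search over an undirected view of the dependency graph. The function name `shortest_dependency_path`, the hand-written edge list, and the use of words as node identifiers (which assumes each word occurs once in the sentence) are assumptions made for this example; in practice the edges would come from a dependency parser.

```python
from collections import deque

def shortest_dependency_path(edges, source, target):
    """BFS over the undirected dependency graph.

    Returns a list that alternates words and relation labels,
    or None when the two words are not connected.
    """
    graph = {}
    for governor, relation, dependent in edges:
        graph.setdefault(governor, []).append((dependent, relation))
        graph.setdefault(dependent, []).append((governor, relation))

    previous = {source: None}  # node -> (parent node, relation used to reach it)
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbour, relation in graph.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = (node, relation)
                queue.append(neighbour)

    if target not in previous:
        return None

    # Walk back from the target to the source, then reverse.
    path = [target]
    node = target
    while previous[node] is not None:
        parent, relation = previous[node]
        path.extend([relation, parent])
        node = parent
    path.reverse()
    return path

# Hypothetical (governor, relation, dependent) triples for the Figure 2 example,
# with the two target proteins already replaced by Protein1 and Protein2.
edges = [
    ("affects", "nsubj", "Protein1"),
    ("affects", "dobj", "properties"),
    ("properties", "prep_of", "Protein2"),
]
sdp = shortest_dependency_path(edges, "Protein1", "Protein2")
print("-".join(sdp))
```

Treating the graph as undirected is a design choice of this sketch: the path between two entities usually has to travel against the direction of some dependency arcs.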

sdpCNN Model for Feature Extraction.
For notation, we use $D \in \mathbb{R}^{|V| \times d}$ to represent the pretrained word embedding, where $V$ is the vocabulary of the corpora and $d$ is the dimension of the word embedding. Suppose $x = \{x_1, x_2, x_3, \ldots, x_n\}$ is an input sdp with length $n$ (we fix the length of the input path as $n$ by truncating it or padding it with the special symbol "PADDING"). When we assign each word in sdp $x$ the corresponding row vector from $D$, we get a matrix representation $P \in \mathbb{R}^{n \times d}$ for the input sdp (yellow part in Figure 3).

Figure 2 (caption): The words in blue are the two target proteins, and the sdp between the proteins is represented by the red arrows. Tags such as "nsubj" and "dobj" are the dependency relations between two words.
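The lookup that turns a fixed-length sdp into a matrix can be sketched as follows. This is a minimal illustration only: the function name `sdp_to_matrix`, the toy 4-dimensional vectors, and the choice of an all-zero vector for the "PADDING" symbol are assumptions, not details taken from the paper.

```python
PAD = "PADDING"

def sdp_to_matrix(tokens, embedding, n):
    """Truncate or pad the token list to length n, then stack one embedding row per token."""
    fixed = tokens[:n] + [PAD] * max(0, n - len(tokens))
    return [list(embedding[token]) for token in fixed]

# Toy embedding table (values are made up for illustration).
embedding = {
    "Protein1": [0.1, 0.2, 0.3, 0.4],
    "nsubj":    [0.0, 0.1, 0.0, 0.1],
    "affects":  [0.5, 0.4, 0.3, 0.2],
    "Protein2": [0.4, 0.3, 0.2, 0.1],
    PAD:        [0.0, 0.0, 0.0, 0.0],   # assumed zero vector for padding
}

tokens = ["Protein1", "nsubj", "affects", "Protein2"]
P = sdp_to_matrix(tokens, embedding, n=6)   # padded from 4 to 6 rows
short = sdp_to_matrix(tokens, embedding, n=2)  # truncated to 2 rows
print(len(P), len(P[0]))
```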
The convolutional operation applies a filter $W \in \mathbb{R}^{h \times d}$ to each $h$-word window in the input sdp $x$. An $h$-word window in the input sdp can be represented as $P_{i,i+h-1} \in \mathbb{R}^{h \times d}$ (yellow part surrounded with red rectangle in Figure 3) by connecting rows $i$ to $i+h-1$ of $P$. A feature $c_i$ can be generated by

$$c_i = f\left(W \odot P_{i,i+h-1} + b_1\right), \quad (1)$$

where $f$ is an activation function such as hyperbolic tangent (tanh), $b_1$ is the bias term, and $\odot$ is element-wise multiplication (the products are summed over the window to yield a scalar). By applying the filter to each word window of the input sdp, the model produces a new feature, which we call the feature map $c \in \mathbb{R}^{n-h+1}$:

$$c = [c_1, c_2, \ldots, c_{n-h+1}]. \quad (2)$$

The max-pooling operation (see (3)) takes the maximum value over all the word windows in feature map $c$, which brings about two advantages: (1) it extracts the most important local features and (2) it reduces the computational complexity by reducing the feature dimension. Hence,

$$c^{*} = \max(c). \quad (3)$$

As each filter produces a feature $c^{*}$, multiple filters generate multiple features. Suppose $m$ is the number of filters; the model then obtains fixed-size distributed features $r = [c^{*}_1, c^{*}_2, \ldots, c^{*}_m]$, where $c^{*}_j$ is the $j$th feature, generated by the $j$th filter.
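To make the convolution and max-pooling steps concrete, here is a small pure-Python sketch for a single filter. It assumes that the element-wise products within a window are summed to give one scalar per window and that tanh is the activation; the names `feature_map` and `max_pool` and the toy numbers are illustrative only.

```python
import math

def feature_map(P, W, b, f=math.tanh):
    """Slide an h-row filter W over the n x d matrix P; one scalar feature per window."""
    h = len(W)
    features = []
    for i in range(len(P) - h + 1):
        total = b
        for r in range(h):
            for j in range(len(W[r])):
                total += W[r][j] * P[i + r][j]  # element-wise product, summed
        features.append(f(total))
    return features

def max_pool(c):
    """Return the largest feature and the index of the window that produced it."""
    best = max(range(len(c)), key=lambda i: c[i])
    return c[best], best

# Toy input: n = 4 words, d = 2 dimensions, window size h = 2.
P = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
W = [[1.0, 0.0], [0.0, 1.0]]

c = feature_map(P, W, b=0.0)      # n - h + 1 = 3 features
c_star, window = max_pool(c)      # the first window gives the largest response
print(c, c_star, window)
```

With several filters, calling `feature_map` and `max_pool` once per filter and collecting the pooled values gives the fixed-size feature vector that is passed to the classifier.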

MLP for Classification.
An MLP model is employed to calculate the probability of each class. Given the distributed representation $r$, the full-connection weight matrix $W_2 \in \mathbb{R}^{k \times m}$ (where $m$ is the number of filters), the number of hidden layer units $k$, and the bias term $b_2$, the output of the full-connection layer $O \in \mathbb{R}^{k \times 1}$ is calculated by

$$O = f(W_2 \cdot r + b_2), \quad (4)$$

where $f$ is an activation function.

Figure 3 (caption): In this example, the input sdp has 7 words (dependency relations such as "nsubj" and "dobj" are also considered as words), each word embedding has 4 dimensions, and 5 filters are used. The yellow part is the matrix representation of an input sdp; each column in the green part represents the feature map generated by a filter through (1) and (2); and the red part represents the max-pooling results obtained by taking the maximum value over each column in the green part by (3). The arrows in red show the process of generating a feature $c^{*}$. The blue part is an MLP classifier with a full-connection layer and a Softmax layer.
Before applying the Softmax layer for classification, the original feature space is transformed into confidence space. The input of the Softmax layer $I \in \mathbb{R}^{o \times 1}$ is given by

$$I = W_3 \cdot O, \quad (5)$$

where $W_3 \in \mathbb{R}^{o \times k}$ is a transformation matrix, $k$ is the number of hidden layer units, and $o$ is the number of classes. This task is binary classification, so $o$ is 2.
Each value in $I$ represents the confidence that the current sample belongs to the corresponding class. A Softmax layer normalizes the confidence to [0, 1]. Given $I = [I_1, I_2, \ldots, I_o]$, the output of the Softmax layer is $S = [S_1, S_2, \ldots, S_o]$. The Softmax operation can be calculated by (6). Both $S_i$ and $p(i \mid x)$ represent the probability that sdp $x$ belongs to class $i$. Hence,

$$S_i = p(i \mid x) = \frac{e^{I_i}}{\sum_{j} e^{I_j}}. \quad (6)$$

Training Procedure.
There are several parameters that need to be updated during training: the multifilter $W$, the full-connection weight $W_2$, the transformation matrix $W_3$, and the bias terms $b_1$ and $b_2$. All of the parameters are represented by $\theta = (W, W_2, W_3, b_1, b_2)$. We apply the Negative Log-Likelihood (NLL) in (7) as the loss function, where $y_i \in \{0, 1\}$ is the annotated label for the input sdp $x_i$:

$$\text{loss} = -\sum_{i} \log p(y_i \mid x_i, \theta). \quad (7)$$

In order to minimize the loss function, we use a gradient descent (GD) based method to learn the network parameters. For each input pair $(x_i, y_i)$, we calculate the gradient of the loss with respect to each parameter (using the chain rule) and update each parameter with learning rate $\lambda$ by (8). It is notable that a fixed learning rate would lead to unstable loss in training. In this work, we use an improved GD based algorithm, Adadelta [27], to update the parameters in each training step. Adadelta is able to dynamically adjust the learning rate. Hence,

$$\theta = \theta - \lambda \frac{\partial\, \text{loss}}{\partial \theta}. \quad (8)$$
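The classification and loss computation can be illustrated with a minimal forward pass. This is a sketch under stated assumptions: tanh is assumed as the hidden-layer activation, the weights are arbitrary toy values, and the helper names (`matvec`, `forward`, `nll`) are hypothetical. It covers only the forward computation and the per-example NLL, not the Adadelta update.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def forward(r, W2, b2, W3):
    """Hidden layer (tanh assumed), linear map to class scores, then Softmax."""
    O = [math.tanh(z + b) for z, b in zip(matvec(W2, r), b2)]
    I = matvec(W3, O)
    top = max(I)                              # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in I]
    total = sum(exps)
    return [e / total for e in exps]

def nll(probabilities, label):
    """Negative log-likelihood of the annotated label for one example."""
    return -math.log(probabilities[label])

# Toy sizes: m = 3 pooled features, k = 2 hidden units, 2 classes.
r = [0.5, -0.2, 0.1]
W2 = [[0.1, 0.2, 0.3], [-0.3, 0.2, 0.1]]
b2 = [0.0, 0.1]
W3 = [[1.0, -1.0], [-1.0, 1.0]]

S = forward(r, W2, b2, W3)
loss = nll(S, label=1)
print(S, loss)
```

Summing `nll` over all training pairs gives the total loss that a gradient-based optimizer would minimize.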

Experimental Setup
Datasets.
Two standard datasets, Aimed and BioInfer [28], are used to evaluate our model (both datasets are available at http://corpora.informatik.hu-berlin.de/). Aimed, manually tagged by [9], includes about 200 medical abstracts with around 1900 sentences and is considered a standard dataset for the PPI task. BioInfer, developed by the Turku BioNLP group (see details at http://bionlp.utu.fi/), contains about 1100 sentences. If there is an interaction between the two entities, we consider the instance a positive one; otherwise, we consider it a negative one (see Table 1). Text preprocessing includes sentence splitting, word segmentation, and dependency parsing (the Stanford parser was used).

Word Embedding Initialization.
In experiments, we compare the performance of pretrained word embedding with that of randomly initialized word embedding. When words that appear in the datasets are not included in the pretrained word embedding, we follow [29] and initialize their embeddings by randomly sampling from $[-a, a]$, where $a$ is the variance of the pretrained word embedding trained by word2vec. For the random part, all of the words are initialized by sampling from $[-a, a]$.
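The initialization rule can be sketched as below: words found in the pretrained table keep their vectors, and all other words (for example dependency labels) are sampled uniformly from a symmetric interval. The function name `init_embeddings`, the fixed seed, and the bound 0.25 used in the example are illustrative assumptions rather than values reported here.

```python
import random

def init_embeddings(vocabulary, pretrained, dim, a, seed=0):
    """Copy pretrained vectors where available; sample the rest uniformly from [-a, a]."""
    rng = random.Random(seed)
    table = {}
    for word in vocabulary:
        if word in pretrained:
            table[word] = list(pretrained[word])
        else:
            table[word] = [rng.uniform(-a, a) for _ in range(dim)]
    return table

# Toy pretrained table with 3-dimensional vectors (made-up values).
pretrained = {"affects": [0.2, -0.1, 0.05], "properties": [0.0, 0.3, -0.2]}
vocabulary = ["Protein1", "nsubj", "affects", "properties"]

table = init_embeddings(vocabulary, pretrained, dim=3, a=0.25)

# Passing an empty pretrained table reproduces the fully random baseline.
random_table = init_embeddings(vocabulary, {}, dim=3, a=0.25)
```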

Model Hyperparameters Settings.
We experimentally choose the hyperparameters for the model on BioInfer and Aimed datasets shown in Table 2. The Discussion gives details on parameter selection as well as the impact of the parameters.

Evaluation Metrics.
We use precision ($P$), recall ($R$), and F-score ($F$) to evaluate the performance of our sdpCNN model. $F$ is the harmonic mean of recall and precision, defined by (9). 10-fold cross-validation (10-fold CV) is used to calculate the average F-scores. Hence,

$$F = \frac{2 \cdot P \cdot R}{P + R}. \quad (9)$$

Performance Comparison.
We evaluate our system and compare its performance with state-of-the-art kernel based methods. We start from a baseline model with randomly initialized word embedding, and then we evaluate our model with the pretrained word embedding. Table 3 shows the comparison results in detail. We first compare the performance with other sdp based methods, and then we compare the results with hybrid kernel based methods. The methods in Table 3 are described as follows:
Walk-Weighted Subsequence Kernel [30]. Generating the sdp first and then integrating the proposed e-walk and v-walk kernels for classification.
Graph Kernel [13]. Encoding the dependency parser results into a graph and proposing an all-path graph kernel by leveraging the sdp; finally, a least squares support vector machine is used for classification.
SDP-CPT [4]. Using both the sdp and a directed constituent parser tree for classification.
Tree Kernel [6]. On the basis of SDP-CPT, considering the modal verb phrases and appositive dependency features.
Edit-Distance Kernel [8]. A semisupervised machine learning approach (TSVM) with an edit-distance kernel based on the sdp.
Hybrid Kernel [14]. A combination of the bag-of-words (BOW) kernel, subset-tree (ST) kernel, and graph kernel.
Multiple Features and Parser [31]. A combination of rich features including bag-of-words features, sdp features, and graph features.
Multiple Kernel [32]. A weighted multiple kernel combining parser tree, graph features, POS, and sdp.

As we can see, the kernel methods listed in Table 3 usually require sophisticated design and complex feature combination, and feature engineering still accounts for a large proportion of these systems. In this work, we avoid manual feature selection and feature combination by using CNN. In addition, the features used in these kernel based methods are all discrete, so the "word gap" problem is inevitable; by leveraging word embedding and CNN, we can train our model in continuous space and avoid hard assignment.
The main differences among the sdp based methods listed in Table 3 are how the sdps are used and how the similarity functions are calculated. For example, the most direct way is to encode the sdp into a "one-hot" representation and use SVM for classification [4,6]. Another way is to use the edit-distance kernel [8] to calculate the similarity of two sdps through the Levenshtein distance. Compared with these sdp based methods in Table 3, even the baseline model achieved competitive results. Furthermore, the pretrained sdpCNN model improved the F-scores by 12.4 and 6.4 compared with the tree kernel [6] and the edit-distance kernel [8] on the BioInfer and Aimed datasets, respectively.
It has been verified that a combination of multiple kernels can improve the effectiveness of kernel based PPI extraction methods. Kernels such as the tree kernel, graph kernel, and bag-of-words kernel are commonly used in hybrid kernel based methods. Compared with the methods listed in Table 3, the baseline model alone yielded competitive results and improved the F-score by 5.3 on the BioInfer dataset when compared with [14]. By integrating pretrained word embedding, our pretrained sdpCNN model exceeded [14] and [32] by 7.1 and 1.6 in F-score on the BioInfer and Aimed datasets, respectively (Table 3). The experimental results show that, with an appropriate expression of the relationship (the sdp in this work), the sdpCNN model built on word embedding can achieve much better results than a combination of a variety of features (or kernels).
To better understand the features extracted by sdpCNN, Figure 3 illustrates how a feature $c^{*}$ is generated in the sdpCNN model. By following the red arrows in Figure 3 in the reverse direction, we can find which word window contributes most to the final classifier. In the example in Figure 3, the 3-word window ("Protein1 nsubj affects") circled with a red rectangle is the key item. We define the word in the middle of the key word window as the key-word; thus the word "nsubj" in the middle of the 3-word window "Protein1 nsubj affects" in Figure 3 is the key-word. Each filter produces a key-word; consequently, $m$ filters generate $m$ key-words. In our experiments, we noticed that interaction verbs such as "inhibits," "cause," and "bind" were often chosen as key-words by the sdpCNN model. Generally, constructing an interaction verb dictionary manually requires a great deal of time and effort, but our model can extract these verbs automatically.
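The key-word tracing described above can be sketched in a few lines: for each filter, locate the window with the largest feature and report its middle token. The function name `key_words` and the feature-map values are made up for illustration; real feature maps would come from the trained filters.

```python
def key_words(tokens, feature_maps, h):
    """For each filter, return the middle token of the window with the largest feature."""
    keys = []
    for c in feature_maps:
        i = max(range(len(c)), key=lambda idx: c[idx])  # index of the winning window
        keys.append(tokens[i + h // 2])                 # middle word of that window
    return keys

tokens = ["Protein1", "nsubj", "affects", "dobj", "properties", "prep_of", "Protein2"]

# Two hypothetical feature maps for window size h = 3 (7 - 3 + 1 = 5 windows each).
feature_maps = [
    [0.9, 0.1, 0.2, 0.0, 0.3],   # peaks at window 0: "Protein1 nsubj affects"
    [0.1, 0.8, 0.2, 0.4, 0.3],   # peaks at window 1: "nsubj affects dobj"
]

keys = key_words(tokens, feature_maps, h=3)
print(keys)  # ['nsubj', 'affects']
```

Counting how often each token is returned over a whole dataset would give the kind of key-word frequency list from which frequent interaction verbs can be read off.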
Moreover, the experimental results also showed that the proposed method achieved considerably higher precision (73.4 on BioInfer dataset and 64.8 on Aimed dataset) than the existing approaches.

Evaluation on Different Scales of Training Data.
In order to investigate the effect of different scales of training data, we split the original datasets by different ratios. Figure 4 shows how performance changes with the scale of the training data. As we can see, the performance varied significantly with the size of the training and test corpora: F-scores dropped from 75.1 to 48.2 on the BioInfer dataset and from 71.1 to 36.2 on the Aimed dataset as the proportion of test data increased from 0.1 to 0.9. With too little training data, information is lost; as a result, the trained sdpCNN model cannot generalize well, which leads to poor performance.
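The split-and-score protocol can be sketched as follows. Both helpers are illustrative assumptions: `split_by_test_ratio` shuffles with a fixed seed before splitting (the shuffling strategy is not specified in the text), and `precision_recall_f` computes the three metrics for the positive class from gold and predicted binary labels.

```python
import random

def precision_recall_f(gold, predicted):
    """Precision, recall and F-score for the positive class (label 1)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

def split_by_test_ratio(instances, test_ratio, seed=0):
    """Shuffle a copy of the data and hold out the given proportion for testing."""
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

# Toy labels: 2 true positives, 1 false positive, 1 false negative.
gold = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
p, r, f = precision_recall_f(gold, predicted)

train, test = split_by_test_ratio(list(range(10)), test_ratio=0.3)
print(p, r, f, len(train), len(test))
```

Repeating the split for test ratios from 0.1 to 0.9, training on `train`, and scoring on `test` would trace out a curve of the kind shown in Figure 4.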

Discussion.
In this section, we first investigate the impact of hyperparameters and provide general parameter settings for sdpCNN. After that, we compare the performance of the four proposed methods in Table 5. Finally, we manually analyze the errors of sdpCNN, along with possible solutions to these errors.

The Influence of Different Hyperparameters Settings.
Consider the following:
(1) Window size $h$: a 3-word window is commonly used in many related works [22][23][24]. We tested a 2-word window on both the Aimed and BioInfer datasets. On the Aimed dataset, the results remained essentially unchanged; however, on the BioInfer dataset, the F-score dropped by 5. We also tested a 4-word window; in this experiment, performance was markedly inferior on both datasets, which suggests that a 4-word window is too long to capture the structure information.
(2) The length of the fixed-size sdp $n$: the lengths of most paths (more than 95%) in the Aimed dataset are less than 20, while in the BioInfer dataset most of the path lengths (more than 95%) are less than 30. We therefore set $n$ to 20 and 30 on the Aimed and BioInfer datasets, respectively.
(3) The number of filters $m$: because of the limited size of the corpora, the model is prone to overfitting when the number of filters is too large; we heuristically set $m$ to 100 in our experiments.
(4) The number of full-connection layer units $k$: based on the idea of [33], an appropriate increase in full-connection layer units can improve performance. However, too many units also lead to overfitting, so we set $k$ to 500 in this experiment.
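One simple way to pick a fixed path length from a coverage target is sketched below. It is only an illustration of the idea of covering most paths: the function name `length_covering` and the toy list of path lengths are assumptions, and the values reported above were chosen from the actual datasets.

```python
def length_covering(lengths, coverage=0.95):
    """Smallest observed length L such that at least `coverage` of the paths have length <= L."""
    ordered = sorted(lengths)
    for rank, value in enumerate(ordered, start=1):
        if rank / len(ordered) >= coverage:
            return value
    return ordered[-1]

# Toy distribution of 20 sdp lengths with one very long outlier.
lengths = [3, 5, 5, 7, 7, 7, 9, 9, 11, 13, 15, 15, 17, 19, 19, 21, 23, 25, 27, 41]

n = length_covering(lengths, coverage=0.95)
print(n)  # 27: the outlier of length 41 would be truncated
```

Paths longer than the chosen length are truncated and shorter ones are padded, so a tighter bound trades a small loss of information on outliers for a smaller input matrix.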

Random sdpCNN Model versus Pretrained sdpCNN Model.
From Table 3, we can see that the pretrained sdpCNN model performed much better than the random sdpCNN model, improving the F-scores by 1.8 and 3.3 on the BioInfer and Aimed datasets, respectively. Intuitively, the pretrained word embedding captures the semantic information of words, which means that words with similar semantics are clustered together in the vector space (Figure 1). Table 4 shows examples of the neighboring words of target words based on cosine similarity; for example, the word "affect" shares a similar meaning with the words "impacting," "jeopardize," and so forth. However, when we randomly allocate the word embedding, semantic information among words is discarded; as a result, the random sdpCNN model might correctly classify the sentence "Protein1 affects Protein2" but fail on the sentence "Protein1 impacts Protein2," although both sentences indicate interactions. Random sdpCNN model is somewhat similar
To better learn the representation of the raw sdp input, we also proposed a model that combines the pretrained and random word embedding (see details in Table 5). The combined model improved the F-score by 0.6 on the Aimed corpus and maintained the performance on the BioInfer corpus when compared with the pretrained sdpCNN model. However, it is also notable that the combined model takes more than twice the training time. There is always a trade-off between time and performance. Among these four models, the pretrained sdpCNN model is more time-saving (relative to the combined model and the random (update) sdpCNN model), more robust (relative to the random (update) sdpCNN model), and more effective (relative to the random (update) sdpCNN model and the random sdpCNN model). In conclusion, a CNN model built on high-quality pretrained word embedding can be considered an effective alternative in the PPI task.
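The cosine-similarity neighbour lookup behind a table such as Table 4 can be sketched as follows. The 3-dimensional vectors are invented for the example and the helper names `cosine` and `nearest` are hypothetical; real neighbours would be computed over the full pretrained embedding table.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(word, embedding, k=2):
    """Return the k words whose vectors are most similar to the vector of `word`."""
    scores = [(cosine(embedding[word], vec), other)
              for other, vec in embedding.items() if other != word]
    return [other for _, other in sorted(scores, reverse=True)[:k]]

# Toy embedding: the three verbs point in similar directions, the noun does not.
embedding = {
    "affect":     [0.9, 0.1, 0.0],
    "impacting":  [0.8, 0.2, 0.1],
    "jeopardize": [0.7, 0.3, 0.0],
    "protein":    [0.0, 0.1, 0.9],
}

neighbours = nearest("affect", embedding, k=2)
print(neighbours)  # ['impacting', 'jeopardize']
```

With randomly initialized vectors the same lookup would return essentially arbitrary neighbours, which is the loss of semantic information discussed above.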

Errors Analysis.
Because of the complexity and diversity of biomedical expressions, extracting relations from biological articles remains a big challenge. In this subsection, we carefully analyze the errors of sdpCNN and list three typical errors as follows:
(1) When an input sentence is too long, the Stanford dependency analysis tool is prone to errors, and because our model is built on the sdp, the propagation of these errors leads to poor performance of sdpCNN.
(2) As mentioned before, interaction verbs strongly suggest interactions; as a result, when irrelevant interaction verbs are included in the sdp, the model makes mistakes.
(3) Randomly initialized word embedding also hurts the system's performance. In our system, dependency relations such as "nsubj" and "prep_of" are all treated as input words; such words are unlikely to be included in the pretrained word embedding, so they are assigned random vectors. As a result, "nsubj" and "prep_of" might be far from each other in vector space.
Possible solutions for these errors are as follows. The first error could be mitigated by integrating the context between the two target entities, because the context can provide supplementary information when standard tools fail to capture dependency relations among words. As for the second error, a possible solution is to introduce position information, because in most cases the relevant interaction verbs are located between the two target entities. For the randomly initialized word embedding problem, we might treat the word embedding as a trainable parameter and update it during training. In addition, the word embedding used in this work is trained on the large unlabeled Google News corpus; it would be better to train word embedding on a large collection of biological articles to enrich the semantic information.

Conclusion
In this paper, we have described an sdpCNN model built on word embedding for the PPI task. Experiments demonstrated that our method outperformed the state-of-the-art kernel based methods. The main contribution of the proposed method is the integration of word embedding, sdp, and CNN. Word embedding is able to capture semantic information and effectively weaken the word gap problem. By applying sdp and CNN, the proposed model can make full use of structure information and avoid manual feature selection. Our experimental results also indicated that (1) the raw sdp input is crucial for describing the protein-protein relationship in the PPI task; (2) the CNN model is useful for capturing local features and structure information; (3) high-quality pretrained word embedding is important in the PPI task. Through error analysis, we notice that there is still room for improvement. In future work, we would like to train our own word embedding and design our PPI system by making full use of context information, position information, and sdp.