Prediction of Protein-Protein Interaction Strength Using Domain Features with Supervised Regression

Proteins in living organisms express various important functions by interacting with other proteins and molecules. Therefore, many efforts have been made to investigate and predict protein-protein interactions (PPIs). Analysis of the strengths of PPIs is also important because such strengths are involved in the functionality of proteins. In this paper, we propose several feature space mappings from protein pairs using protein domain information to predict strengths of PPIs. Moreover, we perform computational experiments employing two machine learning methods, support vector regression (SVR) and relevance vector machine (RVM), on a dataset obtained from biological experiments. The prediction results showed that both SVR and RVM with our proposed features outperformed the best existing method.


Introduction
In cellular systems, proteins perform their functions by interacting with other proteins and molecules, and protein-protein interactions (PPIs) play various important roles. Therefore, revealing PPIs is a key to understanding biological systems, and many investigations and analyses have been done. In addition, a variety of computational methods to predict and analyze PPIs have been developed, for example, methods for predicting PPI pairs using only sequence information [1][2][3][4][5], for predicting amino acid residues contributing to PPIs [6][7][8], and for assessing PPI reliability in PPI networks [9,10]. As well as studies of PPIs, analyses of strengths of PPIs are important because such strengths are involved in the functionality of proteins. In terms of transcription factor complexes, if a constituent protein has a weak binding affinity, target genes may not be transcribed depending on intracellular circumstances. For example, it is known that the multi-subunit complex NuA3 in Saccharomyces cerevisiae consists of five proteins, Sas3, Nto1, Yng1, Eaf6, and Taf30, acetylates lysine 14 of histone H3, and activates gene transcription. However, only Yng1 and Nto1 are found solely in the complex, and the interaction strengths between the component proteins are thought to be different and transient. Hence, Byrum et al. recently proposed a biological methodology for identifying stable and transient protein interactions [11].
Although many biological experiments have been conducted for investigating PPIs [12,13], strengths of PPIs have not always been provided. Ito et al. conducted large-scale yeast two-hybrid experiments for whole yeast proteins. In their experiments, yeast two-hybrid experiments were conducted for each protein pair multiple times, and the number of experiments that observed interactions, or the number of interaction sequence tags (ISTs), was counted. Consequently, they decided that protein pairs having three or more ISTs should interact and reported interacting protein pairs.
The ratio of the number of ISTs to the total number of experiments for a protein pair can be regarded as the interaction strength between their proteins. On the basis of this consideration, several prediction methods for strengths of PPIs have been developed. LPNM [14] is a linear programming-based method; ASNM [15] is a modified method from the association method [16] for predicting PPIs. Chen et al. proposed association probabilistic method 2 The Scientific World Journal (APM) [17], which is the best existing method for predicting strengths of PPIs as far as we know.
These methods are based on a probabilistic model of PPIs and make use of protein domain information. Domains are known as structural and functional units in proteins and well-conserved regions in protein sequences. The information of domains is stored in several databases such as Pfam [18] and InterPro [19]. The same domain can be identified in several different proteins. In these prediction methods, interaction strengths between domains are estimated from known interaction strengths between proteins, and interaction strengths for target protein pairs are predicted from estimated strengths of domain-domain interactions (DDIs).
On the other hand, Xia et al. proposed a feature-based method using a neural network with features based on constituent domains of proteins [20], and they compared their method with the association method and the expectation-maximization method [21]. For the feature-based prediction of PPI strengths, we also utilize domain information and propose several feature space mappings from protein pairs. We use supervised regression and perform threefold cross-validation on a dataset obtained from biological experiments. This paper augments the preliminary work presented in conference proceedings [22]. Specifically, major augmentations of this paper and differences from the preliminary conference version are summarized as follows.
(i) We employ two supervised regression methods: support vector regression (SVR) and relevance vector machine (RVM). Note that we used only SVR with the polynomial kernel in the preliminary version [22].
(ii) The Laplacian kernel is used as the kernel function for SVR and RVM, and kernel parameters are selected via fivefold cross-validation.
(iii) We prepare the dataset from the WI-PHI dataset [23] with high reliability.
The computational experiments showed that the average root mean square error (RMSE) by our proposed method was smaller than that by the best existing method, APM [17].

Materials and Methods
In this section, we briefly review a probabilistic model and related methods, and propose several feature space mappings using domain information.

Probabilistic Model of PPIs Based on DDIs.
There are some computational prediction methods for PPI strengths, and they are based on the probabilistic model of PPIs proposed by Deng et al. [21]. This model utilizes DDIs and assumes that two proteins interact with each other if and only if at least one pair of the domains contained in the respective proteins interacts. Figure 1(a) illustrates an example of this interaction model. In this example, there are two proteins $P_1$ and $P_2$, which consist of domains $D_1$, $D_2$ and domains $D_2$, $D_3$, $D_4$, respectively. According to Deng's model, if $P_1$ and $P_2$ interact, at least one pair among $(D_1, D_2)$, $(D_1, D_3)$, $(D_1, D_4)$, $(D_2, D_2)$, $(D_2, D_3)$, and $(D_2, D_4)$ interacts. Conversely, if a pair, for instance, $(D_2, D_4)$, interacts, $P_1$ and $P_2$ interact. From the assumption of this model, treating the interactions of domain pairs as independent events, we can derive the following simple probability that two proteins $P_i$ and $P_j$ interact with each other:

$$\Pr(P_{ij} = 1) = 1 - \prod_{D_{mn} \in P_i \times P_j} \bigl(1 - \Pr(D_{mn} = 1)\bigr), \tag{1}$$

where $P_{ij} = 1$ indicates the event that proteins $P_i$ and $P_j$ interact (otherwise, $P_{ij} = 0$), $D_{mn} = 1$ indicates the event that domains $D_m$ and $D_n$ interact (otherwise, $D_{mn} = 0$), and $P_i$ and $P_j$ also represent the sets of domains contained in $P_i$ and $P_j$, respectively. Deng et al. applied the EM (expectation-maximization) algorithm to the problem of maximizing log-likelihood functions, estimated the probabilities that two domains interact, $\Pr(D_{mn} = 1)$, and proposed a method for predicting PPIs using the estimated probabilities of DDIs [21]. Actually, they calculated $\Pr(P_{ij} = 1)$ using (1) and determined whether or not $P_i$ and $P_j$ interact by introducing a threshold $\theta$; that is, $P_i$ and $P_j$ interact if $\Pr(P_{ij} = 1) \ge \theta$; otherwise, the proteins do not interact.
Like Deng's method, typical PPI prediction methods based on domains have the following two steps. First, the interactions between domains contained in interacting proteins are inferred from existing protein interaction data. Then, an interaction between new protein pairs is predicted on the basis of the inferred domain interactions using a certain model. Figure 1(b) illustrates the flow of this type of PPI prediction. Since interacting sites are not always included in a known domain region, prediction accuracy can decrease in this framework.
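The second step of this framework, predicting a PPI from inferred DDI probabilities with (1), can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the dictionary layout of `ddi_prob`, the treatment of domain pairs as unordered, and the default of probability 0 for unknown domain pairs are all assumptions made for the example.

```python
from itertools import product


def ppi_probability(domains_i, domains_j, ddi_prob):
    """Pr(P_ij = 1) = 1 - prod(1 - Pr(D_mn = 1)) over all domain pairs."""
    no_interaction = 1.0
    for d_m, d_n in product(domains_i, domains_j):
        # Assumption: domain pairs are unordered, unknown pairs have probability 0.
        p = ddi_prob.get(frozenset((d_m, d_n)), 0.0)
        no_interaction *= 1.0 - p
    return 1.0 - no_interaction


def predict_interaction(domains_i, domains_j, ddi_prob, theta=0.5):
    """Threshold rule: the proteins interact if Pr(P_ij = 1) >= theta."""
    return ppi_probability(domains_i, domains_j, ddi_prob) >= theta


# Domain compositions from the example of Figure 1(a); probabilities are made up.
ddi = {frozenset(("D2", "D4")): 0.6, frozenset(("D1", "D3")): 0.5}
p12 = ppi_probability(["D1", "D2"], ["D2", "D3", "D4"], ddi)
print(round(p12, 3))  # 1 - (1 - 0.5) * (1 - 0.6) = 0.8
```

The threshold value 0.5 is only a placeholder; the text does not state which value of the threshold was used.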

Association Method: Inferring DDI from PPI Data.
As described previously, the probability of a PPI can be predicted from the probabilities of DDIs. In this subsection, we briefly review related methods for estimating the interaction probability of a domain pair.

Association Method.
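Let $\mathcal{P}$ be a set of protein pairs that have been observed to interact or not. The association method [16] gives the following simple score for two domains $D_m$ and $D_n$ using the proteins that include these domains:

$$\mathrm{ASSOC}(D_m, D_n) = \frac{|\{(P_i, P_j) \in \mathcal{P} \mid D_m \in P_i,\ D_n \in P_j,\ P_{ij} = 1\}|}{|\{(P_i, P_j) \in \mathcal{P} \mid D_m \in P_i,\ D_n \in P_j\}|},$$

where $|S|$ indicates the number of elements contained in the set $S$. This score represents the ratio of the number of interacting protein pairs including $D_m$ and $D_n$ to the total number of protein pairs including $D_m$ and $D_n$. Hence, it can be considered as the probability that $D_m$ and $D_n$ interact.

ASNM [15] is a modification of the association method. This method takes strengths of PPIs as input data. Let $\rho_{ij}$ represent the interaction strength between $P_i$ and $P_j$, and suppose that $\rho_{ij}$ is defined for all $(P_i, P_j) \in \mathcal{P}$. Then, the ASNM score for domains $D_m$ and $D_n$ is defined as the average strength over protein pairs including $D_m$ and $D_n$ by

$$\mathrm{ASNM}(D_m, D_n) = \frac{\sum_{\{(P_i, P_j) \in \mathcal{P} \mid D_m \in P_i,\ D_n \in P_j\}} \rho_{ij}}{|\{(P_i, P_j) \in \mathcal{P} \mid D_m \in P_i,\ D_n \in P_j\}|}.$$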
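Both scores are averages over the protein pairs that contain the two domains, so they can share one helper. The sketch below is illustrative only: `pair_average`, the dictionary layouts, the protein and domain names, and the convention of returning 0 when no pair contains the domains are assumptions, and protein pairs are treated as ordered exactly as in the definitions above.

```python
def pair_average(d_m, d_n, values, domains):
    """Average `values` over pairs (P_i, P_j) with D_m in P_i and D_n in P_j.

    values:  dict mapping (P_i, P_j) -> 0/1 label or strength rho_ij
    domains: dict mapping protein name -> set of domain names
    """
    selected = [v for (p_i, p_j), v in values.items()
                if d_m in domains[p_i] and d_n in domains[p_j]]
    # Assumption: score 0 when no observed pair contains both domains.
    return sum(selected) / len(selected) if selected else 0.0


domains = {"P1": {"D1", "D2"}, "P2": {"D2", "D3"}, "P3": {"D1"}, "P4": {"D3"}}

# Association method: binary observations (1 = interacts, 0 = does not).
observed = {("P1", "P2"): 1, ("P3", "P4"): 0, ("P3", "P2"): 1}
assoc = pair_average("D1", "D3", observed, domains)  # 2 of 3 pairs interact

# ASNM: the same average, taken over interaction strengths instead of labels.
strength = {("P1", "P2"): 0.9, ("P3", "P4"): 0.0, ("P3", "P2"): 0.3}
asnm = pair_average("D1", "D3", strength, domains)  # about 0.4

print(assoc, asnm)
```

With 0/1 labels the ASNM average reduces to the association score, which is why a single function covers both.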

Association Probabilistic Method (APM).
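Although ASNM is a simple average of strengths of PPIs, Chen et al. proposed the association probabilistic method (APM) by replacing the strength with an improved strength [17]. It is based on the idea that the contribution of one domain pair to the strength of a PPI should vary depending on the number of domain pairs included in a protein pair. They assumed that the interaction probability of each domain pair is equivalent in a protein pair, and transformed (1) as follows:

$$\Pr(D_{mn} = 1) = 1 - \bigl(1 - \Pr(P_{ij} = 1)\bigr)^{\frac{1}{|P_i||P_j|}}.$$

Thus, by substituting this improved strength for the strength $\rho_{ij}$ in the numerator of ASNM, APM is defined by

$$\mathrm{APM}(D_m, D_n) = \frac{\sum_{\{(P_i, P_j) \in \mathcal{P} \mid D_m \in P_i,\ D_n \in P_j\}} \Bigl(1 - (1 - \rho_{ij})^{\frac{1}{|P_i||P_j|}}\Bigr)}{|\{(P_i, P_j) \in \mathcal{P} \mid D_m \in P_i,\ D_n \in P_j\}|}.$$

They conducted some computational experiments and reported that APM outperforms existing prediction methods such as ASNM and LPNM.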
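The improved strength and the resulting APM score can be sketched as below. The names `improved_strength` and `apm_score`, the dictionary layouts, and the use of the number of distinct domains for the size of a protein's domain set are assumptions made for illustration.

```python
def improved_strength(rho, n_i, n_j):
    """Per-domain-pair strength: 1 - (1 - rho)^(1 / (|P_i| |P_j|))."""
    return 1.0 - (1.0 - rho) ** (1.0 / (n_i * n_j))


def apm_score(d_m, d_n, strength, domains):
    """Average improved strength over pairs with D_m in P_i and D_n in P_j.

    strength: dict mapping (P_i, P_j) -> rho_ij in [0, 1]
    domains:  dict mapping protein name -> set of domain names
    """
    values = [
        improved_strength(rho, len(domains[p_i]), len(domains[p_j]))
        for (p_i, p_j), rho in strength.items()
        if d_m in domains[p_i] and d_n in domains[p_j]
    ]
    # Assumption: score 0 when no observed pair contains both domains.
    return sum(values) / len(values) if values else 0.0


domains = {"P1": {"D1"}, "P2": {"D2", "D3"}}
strength = {("P1", "P2"): 0.75}
print(apm_score("D1", "D2", strength, domains))  # 1 - 0.25 ** 0.5 = 0.5
```

As a consistency check, assigning probability 0.5 to both domain pairs of this protein pair and applying (1) gives 1 - 0.5 * 0.5 = 0.75, the original strength.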

Proposed Feature Space Mappings from Protein Pairs.
The association methods including ASNM and APM are based on the probabilistic model of PPIs defined by (1), and infer strengths of PPIs from estimated DDIs using given frequency of interactions or interaction strengths of protein pairs. On the other hand, we can also infer PPI strengths utilizing features obtained from given information such as sequence and structure of proteins with machine learning methods. Xia et al. proposed a method to infer strengths of PPIs using artificial neural network with features from constituent domains of proteins [20]. In this paper, for predicting strengths of PPIs, we propose several feature space mappings from protein pairs making use of domain information.

Feature Based on Number of Domains (DN).
As described above, information on constituent domains is useful for inferring PPIs and can also be used as a representation of each protein. Actually, Xia et al. represented each protein by binary numbers indicating whether or not the protein has each domain, and used them with an artificial neural network to predict PPI strengths [20]. Here, it can be considered that the probability that two proteins interact increases with the number of domains included in the proteins. Therefore, in this paper, we propose a feature space mapping based on the number of constituent domains (called DN) of two proteins. The feature vector of DN for two proteins $P_i$ and $P_j$ is defined by

$$f^{(ij)} = \bigl(N(D_1, P_i), \ldots, N(D_M, P_i), N(D_1, P_j), \ldots, N(D_M, P_j)\bigr),$$

where $M$ indicates the total number of domains over all proteins and $N(D_m, P_i)$ indicates the number of domains identified as $D_m$ in protein $P_i$.
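A minimal sketch of the DN mapping is given below. The function name `dn_feature` and the representation of a protein as a list of domain occurrences are assumptions for illustration; the ordering of `all_domains` fixes the meaning of each dimension.

```python
from collections import Counter


def dn_feature(domains_i, domains_j, all_domains):
    """Concatenate per-domain occurrence counts of the two proteins.

    domains_i, domains_j: lists of domain names, one entry per occurrence
    all_domains:          ordered list of the M domains over all proteins
    """
    count_i, count_j = Counter(domains_i), Counter(domains_j)
    # A Counter returns 0 for domains that do not occur in the protein.
    return ([count_i[d] for d in all_domains]
            + [count_j[d] for d in all_domains])


all_domains = ["D1", "D2", "D3", "D4"]
vec = dn_feature(["D1", "D2", "D2"], ["D3"], all_domains)
print(vec)  # [1, 2, 0, 0, 0, 0, 1, 0]
```

The vector has 2M dimensions, which agrees with the 654 dimensions reported for the 327 domains in the dataset used in the experiments.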

Feature by Restriction of Spectrum Kernel to Domain Region (SPD).
DN is based only on the number of constituent domains of each protein, while the amino acid sequences of domains are also considered useful for inferring strengths of PPIs. Therefore, we propose a feature space mapping by restricting the application of the spectrum kernel [24] to domain regions (called SPD). Let $\mathcal{A}$ be the alphabet of 21 letters representing the 20 types of amino acids and others. Although we used an alphabet of 20 letters to express the 20 types of amino acids in the preliminary conference version [22], we add one letter to take ambiguous amino acids such as X into consideration. Then, $\mathcal{A}^k$ ($k \ge 1$) means the set of all strings of length $k$ generated from $\mathcal{A}$. The $k$-spectrum kernel for sequences $x$ and $y$ is defined by

$$K_k(x, y) = \langle \Phi_k(x), \Phi_k(y) \rangle,$$

where $\Phi_k(x) = (\phi_s(x))_{s \in \mathcal{A}^k}$ and $\phi_s(x)$ indicates the number of times that $s$ occurs in $x$. To make use of domain information, we restrict the amino acid sequence to which the $k$-spectrum kernel is applied to the domain regions. Figure 2 illustrates the restriction. In this example, the protein consists of domains $D_1$, $D_2$, $D_3$, and each domain region is surrounded by a square. Then, the subsequence in each domain is extracted, and all the subsequences in the protein are concatenated in the same order as the domains. We apply the $k$-spectrum kernel to the concatenated sequence. Let $\phi_s^{(r)}(P_i)$ be the number of times that string $s$ occurs in the sequence restricted to the domain regions in protein $P_i$ in the above manner. The feature vector of SPD for proteins $P_i$ and $P_j$ is defined by

$$f^{(ij)} = \Bigl(\bigl(\phi_s^{(r)}(P_i)\bigr)_{s \in \mathcal{A}^k}, \bigl(\phi_s^{(r)}(P_j)\bigr)_{s \in \mathcal{A}^k}\Bigr).$$

It should be noted that $\phi_s^{(r)}(P_i)$ for proteins having the same composition of domains can vary depending on the amino acid sequences of the proteins. That is, even if $P_{i'}$ and $P_{j'}$ have the same domain compositions as $P_i$ and $P_j$, respectively, so that the feature vector of DN for $P_{i'}$ and $P_{j'}$ is the same as that for $P_i$ and $P_j$, the feature vector of SPD for $P_{i'}$ and $P_{j'}$ can be different from that for $P_i$ and $P_j$.
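The restriction and counting steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the 0-based end-exclusive domain coordinates, and the mapping of every non-standard letter to "X" as the 21st letter are choices made for the example.

```python
from itertools import product

# 20 standard amino acids plus "X" as the catch-all 21st letter (assumption).
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"


def restrict_to_domains(sequence, regions):
    """Concatenate domain subsequences in the order they occur.

    regions: list of (start, end) pairs, 0-based and end-exclusive (assumed).
    """
    return "".join(sequence[start:end] for start, end in sorted(regions))


def spectrum_counts(sequence, k):
    """Count occurrences of every length-k string over ALPHABET."""
    index = {"".join(s): n for n, s in enumerate(product(ALPHABET, repeat=k))}
    counts = [0] * len(index)
    # Map any letter outside the 20 standard amino acids to "X".
    cleaned = "".join(c if c in ALPHABET else "X" for c in sequence)
    for pos in range(len(cleaned) - k + 1):
        counts[index[cleaned[pos:pos + k]]] += 1
    return counts


def spd_feature(seq_i, regions_i, seq_j, regions_j, k):
    """SPD vector: k-mer counts of P_i followed by k-mer counts of P_j."""
    part_i = spectrum_counts(restrict_to_domains(seq_i, regions_i), k)
    part_j = spectrum_counts(restrict_to_domains(seq_j, regions_j), k)
    return part_i + part_j


feature = spd_feature("ACDWWACA", [(0, 3), (5, 8)], "GGG", [(0, 3)], k=1)
print(len(feature))  # 2 * 21 = 42
```

Because counting is done on the concatenated sequence, strings that span the junction between two consecutive domain regions are also counted, as the description above implies.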

Support Vector Regression (SVR).
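To predict strengths of PPIs, we employ support vector regression (SVR) [25] with our proposed features. In the case of linear functions, SVR finds parameters $w$ and $b$ for $f(x) = \langle w, x \rangle + b$ by solving the following optimization problem:

$$\begin{aligned} \text{minimize} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i} (\xi_i + \xi_i^*) \\ \text{subject to} \quad & y_i - \langle w, x_i \rangle - b \le \epsilon + \xi_i, \\ & \langle w, x_i \rangle + b - y_i \le \epsilon + \xi_i^*, \\ & \xi_i, \xi_i^* \ge 0, \end{aligned}$$

where $C$ and $\epsilon$ are positive constants, $\xi_i$ and $\xi_i^*$ are slack variables, and $(x_i, y_i)$ is a training data point. Here, the penalty is added only if the difference between $f(x_i)$ and $y_i$ is larger than $\epsilon$. In our problem, $x_i$ means a protein pair, and $y_i$ means the corresponding interaction strength.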
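The role of the tolerance can be made concrete with the epsilon-insensitive loss, which equals the total slack of a training point at the optimum. The sketch below only evaluates the objective for given parameters; it does not solve the optimization problem, and the function names and toy numbers are assumptions for illustration.

```python
def epsilon_insensitive(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def svr_objective(w, b, xs, ys, C, epsilon):
    """0.5 * ||w||^2 + C * sum of epsilon-insensitive losses for linear f."""
    regularizer = 0.5 * sum(w_d * w_d for w_d in w)
    penalty = 0.0
    for x, y in zip(xs, ys):
        prediction = sum(w_d * x_d for w_d, x_d in zip(w, x)) + b
        penalty += epsilon_insensitive(y, prediction, epsilon)
    return regularizer + C * penalty


# Toy data: the first point lies inside the tube, the second outside.
xs, ys = [[0.0], [1.0]], [0.05, 1.5]
value = svr_objective(w=[1.0], b=0.0, xs=xs, ys=ys, C=2.0, epsilon=0.1)
print(round(value, 6))  # 0.5 + 2.0 * (0.5 - 0.1) = 1.3
```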

Relevance Vector Machine (RVM).
In this paper, we also employ the relevance vector machine (RVM) [26] to predict strengths of PPIs. RVM is a sparse Bayesian model utilizing the same data-dependent kernel basis as the SVM. Its framework is almost the same as typical Bayesian linear regression. Given training data $\{x_i, t_i\}_{i=0}^{N}$, the conditional probability of $t$ given $x$ is modeled as

$$p(t \mid x, w, \beta) = \mathcal{N}\bigl(t \mid w^{\top}\phi(x), \beta\bigr),$$

where $\beta = \sigma^2$ is the noise parameter and $\phi(\cdot)$ is a typically nonlinear projection of input features. To obtain sparse solutions, in the RVM framework, the prior weight distribution is modified so that a different variance parameter is assigned to each weight as

$$p(w \mid \alpha) = \prod_{i=1}^{M} \mathcal{N}\bigl(w_i \mid 0, \alpha_i^{-1}\bigr),$$

where $M = N + 1$ and $\alpha = (\alpha_1, \ldots, \alpha_M)$ is a hyperparameter. RVM finds the hyperparameter $\alpha$ by maximizing the marginal likelihood $p(t \mid \alpha, \beta)$ via "evidence approximation." In the process of maximizing the evidence, some $\alpha_i$ approach infinity and the corresponding $w_i$ become zero. Thus, the basis functions corresponding to these parameters can be removed, which leads to sparse models. In many cases, RVM performs better than SVM, especially in regression problems.

Computational Experiments.
To evaluate our proposed method, we conducted computational experiments and compared it with the existing method, APM.

Data and Implementation.
It is difficult to directly measure actual strengths of PPIs for many protein pairs by biological and physical experiments. Hence, we used the WI-PHI dataset with 50000 protein pairs [23]. For each PPI, WI-PHI contains a weight that is considered to represent the reliability of the PPI and is calculated in a statistical manner from several different kinds of PPI datasets to rank physical protein interactions. As the strength of a PPI, we used the weight of the PPI divided by the maximum weight in WI-PHI. We used the dataset file "uniprot_sprot_fungi.dat.gz" downloaded from the UniProt database [27] to get amino acid sequences, information on domain compositions, and domain regions in proteins. In this experiment, we used 1387 protein pairs that could be extracted from the WI-PHI dataset with complete domain sequences via the UniProt dataset. The extracted dataset contains 758 proteins and 327 domains. Since this dataset does not include protein pairs with interaction strength 0, we randomly selected 100 protein pairs that do not have any weights in the dataset and added them as protein pairs with strength 0. Thus, 1487 protein pairs in total were used in this experiment. We used the "kernlab" package [28] for executing support vector regression and relevance vector machine and used the Laplacian kernel $k(x, y) = \exp(-\sigma \|x - y\|)$, where $\sigma$ is the kernel parameter. The dataset and the source code implemented in R are available upon request.
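Two of the preprocessing steps above, scaling weights into strengths and evaluating the Laplacian kernel, can be sketched in a few lines. This is a Python illustration rather than the authors' R code; the function names and the toy weights are hypothetical, and the sketch takes the maximum over the weights it is given, whereas the text divides by the maximum weight of the whole WI-PHI dataset.

```python
import math


def weights_to_strengths(weights):
    """Divide each weight by the maximum weight so strengths lie in (0, 1]."""
    w_max = max(weights.values())
    return {pair: w / w_max for pair, w in weights.items()}


def laplacian_kernel(x, y, sigma):
    """k(x, y) = exp(-sigma * ||x - y||) with the Euclidean norm."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return math.exp(-sigma * distance)


# Hypothetical weights for two protein pairs.
strengths = weights_to_strengths({("A", "B"): 2.0, ("A", "C"): 8.0})
print(strengths[("A", "B")])  # 2.0 / 8.0 = 0.25

# Feature vectors at Euclidean distance 5, so the kernel value is exp(-2.5).
print(laplacian_kernel([0.0, 0.0], [3.0, 4.0], sigma=0.5))
```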
To evaluate prediction accuracy, we calculated the root mean square error (RMSE) for each prediction. RMSE is a measure of differences between predicted values $\hat{y}_i$ and actually observed values $y_i$ and is defined by

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2},$$

where $n$ is the number of test data.

The APM score of a protein pair can also be used as an input feature. Therefore, we also used APM scores as inputs for SVR and RVM and compared the models using APM scores with the models using our proposed features to confirm the usefulness of the feature representation. Here, we used the candidate set $\sigma \in \{3.0, 3.1, 3.2, \ldots, 9.0\}$ for the kernel parameter $\sigma$ of the RVM + APM model because the model could not be trained with values smaller than 3. On the other hand, for $\sigma$ of the SVR + APM model, we used the same set as for the other models. Table 1 shows the average RMSEs by SVR and RVM with our proposed features (DN and SPD of $k = 1, 2$) and with the APM score, and by APM, for training and test datasets. For the training set, the average RMSEs by RVM with SPD of $k = 2$ were smaller than those by APM and the others. Moreover, for the test set, all the average RMSEs by RVM with SPD and DN were smaller than those by APM. The results suggested that supervised regression methods, SVR and RVM, with domain-based features are useful for the prediction of PPI strengths. Taking all results together, the model by RVM with SPD of $k = 2$ was regarded as the best for the prediction of PPI strengths.
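The RMSE used in the comparisons above can be computed as in the following sketch. The function name and the toy values are illustrative assumptions and are not taken from the experiments.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between predicted and observed strengths."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Toy values: errors of 0 and 2 give sqrt((0 + 4) / 2) = sqrt(2).
error = rmse([1.0, 2.0], [1.0, 4.0])
print(round(error, 4))  # 1.4142
```

In a threefold cross-validation, this quantity would be computed once per fold on the held-out pairs and then averaged.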

Results of Computational Experiments.
Since the average RMSEs of SVR with APM for both training and test datasets were smaller than those of the original APM, SVR has the potential to improve prediction accuracy. By contrast, the average RMSEs of RVM with APM were larger than those of the original APM, and all average RMSEs of the models with APM for the test set were larger than those of the models with DN and SPD. Accordingly, the results suggested that prediction accuracy was enhanced by the feature representations and that SPD is especially useful among them for predicting strengths of PPIs. Although DN and SPD of $k = 1$ have 654 and 42 dimensions for each protein pair, respectively, the average RMSEs with SPD of $k = 1$ for the training set were smaller than those with DN. This implies that information on the amino acid sequences in domain regions is more informative than information on domain compositions for fitting a model to the dataset.
In contrast, the RMSEs by SVR with DN were smaller than those by the others in some cases for the test set. Table 2 shows the numbers of relevance vectors and support vectors and the kernel parameter values selected by fivefold cross-validation in all cases. For the models with DN and APM scores, the numbers of relevance vectors were smaller than the numbers of support vectors. On the other hand, the numbers of relevance vectors were larger than the numbers of support vectors for the models with the SPD feature, in spite of the fact that RVM usually provides a sparser model than SVR. In the RVM framework, the sparsity of a model is caused by the distributions of the weights; that is, the number of relevance vectors is influenced by the values and variances of each dimension of the features rather than by the number of dimensions. Actually, each dimension of the SPD feature almost always has widely varying values. In contrast, the DN feature has many zeros, and the APM score is inferred from the training dataset and thereby has a similar distribution. Thus, it is considered that many weights corresponding to features in the RVM model did not become zero, and the RVM models with the SPD feature tended to be complex and to overfit the training data.

Conclusions
For the prediction of strengths of PPIs, we proposed the feature space mappings DN and SPD. DN is based on the number of domains in a protein. SPD is based on the spectrum kernel and is defined using the amino acid subsequences in domain regions. In this work, we employed support vector regression (SVR) and relevance vector machine (RVM) with the Laplacian kernel and conducted threefold cross-validation using the WI-PHI dataset. For both training and test datasets, the average RMSEs by RVM with the SPD feature were smaller than those by APM. The results showed that machine learning methods with domain information outperformed the existing association method that is based on the probabilistic model of PPIs, and implied that the information on amino acid sequences is more useful for prediction than information on domain compositions alone. However, the models with the SPD feature tended to be complex and overfitted the training data. Therefore, to further enhance the prediction accuracy, improving kernel functions by combining physical characteristics of domains and amino acids might be helpful.