Detecting Protein-Protein Interactions with a Novel Matrix-Based Protein Sequence Representation and Support Vector Machines

Proteins and their interactions lie at the heart of most biological processes. Consequently, correct detection of protein-protein interactions (PPIs) is of fundamental importance for understanding the molecular mechanisms of biological systems. Although advances in high-throughput experimental techniques make it possible to detect large numbers of PPIs, the data generated by these methods are unreliable and do not cover all possible PPIs. To address this problem, this study develops a novel computational approach for detecting protein interactions. The approach combines a novel matrix-based representation of the protein sequence, which takes into account the sequence order and dipeptide information of the protein primary sequence, with a support vector machine (SVM). When applied to a yeast PPI dataset, the proposed method reaches 90.06% prediction accuracy with 94.37% specificity at a sensitivity of 85.74%, indicating that the predictor is a useful tool for predicting PPIs. The results also demonstrate that our approach can be a helpful supplement to experimentally detected interactions.


Introduction
Since the detection of protein interactions is of fundamental importance for understanding the molecular mechanisms of biological systems, many researchers have focused on this area in the postgenomic era [1,2]. Over the past decades, high-throughput experimental techniques for genome-wide detection of PPIs, such as the yeast two-hybrid (Y2H) system [3,4] and mass spectrometry (MS), have been developed and have generated large amounts of interaction data. However, these experimental methods are time-consuming and expensive, especially at genome-wide scale. In addition, high-throughput biological experiments usually suffer from high rates of both false negatives and false positives [5]. Combining experimental techniques with computational models is a promising direction for better understanding the mechanisms of protein interactions at the molecular level and for unraveling the global picture of PPIs in the cell [6,7]. Hence, it is of great practical significance to build low-cost PPI detection systems and to establish reliable computational methods that facilitate the detection of PPIs.
So far, a variety of computational methods have been developed to predict protein interactions effectively and accurately [2,8-10]. The computational approaches for in silico prediction can be roughly categorized into genome-based approaches, network topology-based approaches, literature knowledge-based methods, and structure-based approaches [11]. In addition, there are also some approaches that integrate interaction information from several different biological data sources [9,10].
However, the aforementioned approaches cannot be applied if prior information about the proteins is not available [12]. Recently, sequence-based approaches, which derive information directly from the amino acid sequence of a protein, have attracted particular interest [13,14]. Predicting protein interactions from the protein sequence alone is a much more universal approach [15,16]. Previous work has demonstrated that RNA and protein sequences alone contain sufficient information to predict PPIs [17,18]. Although sequence-based approaches can yield prediction accuracies of 80% to 88%, novel approaches are needed to further improve prediction performance over the existing methods.
In recent years, many efforts have been made to develop accurate approaches for identifying PPIs based on protein sequence information [19,20]. Shen et al. built a prediction model by employing conjoint triad feature extraction and a support vector machine. When applied to predicting human PPIs, this method yields a prediction accuracy of about 84% [21]. Because the conjoint triad method does not take the neighboring effect into account, and protein interactions usually occur in discontinuous amino acid segments of the sequence, Guo et al. proposed an approach based on SVM and an autocovariance feature representation, which extracts the interaction information in the discontinuous amino acid segments of the sequence [22]. Their approach reached a prediction accuracy of 86.55% when applied to predicting Saccharomyces cerevisiae PPIs. More recently, You et al. developed a novel ensemble learning model to predict Saccharomyces cerevisiae PPIs directly from protein primary sequences [23]. In that study, the protein pairs retrieved from the database of interacting proteins (DIP) were encoded into feature vectors by using four kinds of protein sequence information. To reduce the dimension, principal component analysis (PCA) was then employed to construct the most discriminative new feature set. Finally, multiple extreme learning machines were trained and then aggregated into a consensus classifier by majority voting. The experimental results show that it is a very promising scheme for PPI prediction.
In this study, we report a novel sequence-based method for the prediction of interacting protein pairs using a matrix-based protein sequence descriptor combined with the support vector machine (SVM) algorithm. More specifically, we first represent each protein sequence as a feature matrix, from which a novel matrix-based protein descriptor is extracted to numerically characterize each protein sequence. Then we characterize a protein pair by a feature vector obtained by combining the vectors of the two proteins in the pair. Finally, an SVM model is established using these feature vectors of the protein pairs as input. To evaluate the prediction performance, the proposed method was applied to Saccharomyces cerevisiae and Helicobacter pylori PPI datasets. The experimental results show that our method can achieve 90.06% and 85.91% prediction accuracy with 94.37% and 83.33% specificity at the sensitivity of 85.74% and 85.27%, respectively. These results demonstrate that the approach can be a helpful supplement to experimentally detected interactions.

Materials and Methodology
In this section, we outline the main idea behind the proposed method. A schematic diagram showing how protein interactions are detected using experimental PPI data together with a computational model is given in Figure 1. Firstly, we briefly discuss the PPI datasets employed in the study (the source code and the datasets are freely available at http://sites.google.com/site/zhuhongyou/data-sharing/ for academic use). Next we propose the novel matrix-based protein representation method. Finally, we briefly describe the computational model, SVM, used in this study.

Golden Standard Datasets.
We evaluated the proposed method with two real PPI datasets. The first one was collected from the Saccharomyces cerevisiae core subset of the database of interacting proteins (DIP). After removing the protein pairs that contain a protein with fewer than 50 residues or that have ≥40% sequence identity, the remaining 5,594 protein pairs comprise the golden standard positive dataset. The selection of the golden standard negative dataset has an important impact on the prediction performance, which can be artificially inflated by a bias towards dominant samples in the positive data. For the golden standard negative dataset, we followed the previous work [22], assuming that proteins in different subcellular compartments do not interact with each other.
After strictly following the steps in Guo's work, we finally obtained 5,594 protein pairs as the golden standard negative dataset. By combining the golden standard positive and negative PPI datasets, the final whole PPI dataset consists of 11,188 protein pairs, of which half are from the positive dataset and half are from the negative dataset. The second one is a small-scale Helicobacter pylori PPI dataset, which is composed of 2,916 protein pairs (1,458 interacting pairs and 1,458 noninteracting pairs) as described by Martin et al. [24].

Representing Proteins with Descriptors from Primary Protein Sequences.
To successfully use a machine learning algorithm to detect PPIs from primary protein amino acid sequences, one of the computational challenges is to effectively characterize a protein sequence by a fixed-length feature vector in which the important information content of the protein is fully encoded [25]. In this study, we propose a novel matrix-based protein sequence representation approach for predicting PPIs. Firstly, the protein sequence is transformed into a sparse matrix, which considers the properties of each amino acid together with its vicinal amino acid and regards any two consecutive amino acids as a unit. Then the protein features are extracted from the obtained sparse matrix.
A protein sequence can be represented as a series of amino acids by their single-character codes A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, and V. Consider a protein sequence $P$ with $L$ amino acid residues:
$$P = R_1 R_2 R_3 \cdots R_L,$$
where $R_1$ denotes the amino acid at protein chain position 1, $R_2$ denotes the amino acid at protein chain position 2, and so forth, and $L$ denotes the length of the protein sequence. We scan the protein sequence from left to right, one pair of vicinal amino acids at a time, which considers the properties of one amino acid and its vicinal amino acid and regards any two consecutive amino acids as a unit. Here the number of all possible pairs of amino acids (dipeptides) that can be extracted from the protein sequence is $20 \times 20 = 400$, that is, AA, AR, AN, ..., YV, and VV.
For step $i$ ($i = 1, 2, 3, \ldots, L-1$), if "$R_i R_{i+1}$" is the $j$th type of dipeptide, then we set the element $m_{j,i} = 1$. The remaining steps are handled in the same manner, so that a protein sequence is transformed into a 400 by $L-1$ matrix (see Table 1), namely, $M$, as follows:
$$M = (m_{j,i})_{400 \times (L-1)}, \qquad m_{j,i} = \begin{cases} 1, & \text{if } R_i R_{i+1} = \text{dipeptide}(j), \\ 0, & \text{otherwise,} \end{cases}$$
where $L$ is the length of the protein sequence, $j = 1, 2, 3, \ldots, 400$, $i = 1, 2, 3, \ldots, L-1$, and dipeptide($j$) denotes the $j$th type of dipeptide listed in Table 1. Here, each column of the matrix $M$ is a unit vector, in which only one element is 1 and the others are all 0. We can see from Table 1 that the position at which each kind of dipeptide occurs along the protein sequence is recorded in the columns of the matrix $M$. Meanwhile, a 1 in the $j$th row and $i$th column of $M$ denotes the $j$th kind of dipeptide appearing at the $i$th position within the protein sequence. Generally speaking, the matrix $M$ transformed from the protein amino acid sequence embodies the essential information of the sequence, including its sequence order and sequence length. Thus, given a protein primary sequence, we can design a matrix-based protein descriptor to represent it, which is capable of facilitating PPI detection.
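The transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code: the function names are hypothetical, and the row ordering of the dipeptides (AA, AR, AN, ..., VV, following the amino acid order listed in the text) is an assumption, since Table 1 is not reproduced here.

```python
import numpy as np

# Amino acid order as listed in the text; the row order of Table 1 is
# assumed (not confirmed) to follow it: AA, AR, AN, ..., YV, VV.
AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"
AA_INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def dipeptide_index(first, second):
    """0-based row index of a dipeptide: AA -> 0, AR -> 1, ..., VV -> 399."""
    return 20 * AA_INDEX[first] + AA_INDEX[second]


def sequence_to_matrix(sequence):
    """Return the 400 x (L - 1) indicator matrix of a protein sequence."""
    length = len(sequence)
    if length < 2:
        raise ValueError("sequence must contain at least two residues")
    matrix = np.zeros((400, length - 1), dtype=np.int8)
    for i in range(length - 1):
        # Column i records which dipeptide starts at position i.
        matrix[dipeptide_index(sequence[i], sequence[i + 1]), i] = 1
    return matrix


M = sequence_to_matrix("ARNDA")
print(M.shape)        # (400, 4)
print(M.sum(axis=0))  # [1 1 1 1]: every column is a unit vector
```

Residues outside the 20 standard codes raise a `KeyError` in this sketch; how such residues should be handled is not stated in the text.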
Low-rank approximation (LRA) is an important matrix analysis method, in which the cost function measures the fit between a given matrix (here, the sparse feature matrix) and an approximating matrix (the optimization variable), subject to a constraint that the approximating matrix has reduced rank [26]. Here, applying LRA to the obtained protein feature matrix, we derive a matrix-based descriptor to represent the protein sequence. For a feature matrix $M$ of size $400 \times (L-1)$, the LRA of the data can be written as follows:
$$\min_{\hat{M}} \; \|M - \hat{M}\|_F \quad \text{subject to} \quad \operatorname{rank}(\hat{M}) \le r,$$
where $\|\cdot\|_F$ is the Frobenius norm. The above minimization problem has an analytic solution in terms of the singular value decomposition (SVD) of the data matrix $M$. Let $M = U \Sigma V^T \in \mathbb{R}^{400 \times (L-1)}$ be the SVD of $M$ and partition $U$, $\Sigma =: \operatorname{diag}(\sigma_1, \sigma_2, \sigma_3, \ldots, \sigma_{400})$, and $V$ as follows:
$$U = \begin{bmatrix} U_1 & U_2 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix}, \qquad V = \begin{bmatrix} V_1 & V_2 \end{bmatrix},$$
where $\Sigma_1$ is an $r \times r$ matrix, $U_1$ is $400 \times r$, and $V_1$ is $(L-1) \times r$. Then the rank-$r$ matrix is obtained as follows:
$$\hat{M}^* = U_1 \Sigma_1 V_1^T,$$
where
$$\|M - \hat{M}^*\|_F = \min_{\operatorname{rank}(\hat{M}) \le r} \|M - \hat{M}\|_F = \sqrt{\sigma_{r+1}^2 + \cdots + \sigma_{400}^2}.$$
Then we compute the square root of the reduced matrix $\Sigma_1$ to obtain $\Sigma_1^{1/2}$, with dimensions $r$-by-$r$. Finally, we obtain a $400 \times r$ matrix $U_1 \Sigma_1^{1/2}$, which contains the information of the protein sequence order. Note that the feature matrices $M$ of different protein sequences generally have different numbers of columns, because the protein primary sequences are of unequal length. However, $U_1 \Sigma_1^{1/2}$ is a $400 \times r$ matrix for every protein sequence. We build a vector (row matrix) from $U_1 \Sigma_1^{1/2}$ by concatenating all of its rows, from 1 to 400. Therefore, the matrix-based protein descriptor consists of a total of $400 \times r$ descriptor values; that is, a $400 \times r$ dimensional vector is built to represent the protein sequence. Considering the trade-off between the overall prediction accuracy and the computational complexity of extracting protein sequence descriptors, we set the rank to $r = 4$ in this study, so that each protein is described by $400 \times 4 = 1600$ values. The representation of an interaction pair is formed by concatenating the descriptors of the two protein sequences in the pair, which gives a 3200-dimensional vector.
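The descriptor extraction can be illustrated with NumPy's SVD routine. The sketch below is written under stated assumptions rather than taken from the authors' MATLAB code: the function names are hypothetical, and the input matrices are random 400-row indicator matrices standing in for real proteins.

```python
import numpy as np


def lra_descriptor(matrix, rank=4):
    """Flatten U_1 * Sigma_1^(1/2) of a 400 x (L - 1) matrix into a vector."""
    u, s, _ = np.linalg.svd(np.asarray(matrix, dtype=float), full_matrices=False)
    if len(s) < rank:
        raise ValueError("matrix has fewer than `rank` singular values")
    u1 = u[:, :rank]                       # 400 x rank
    sqrt_sigma1 = np.sqrt(s[:rank])        # diagonal of Sigma_1^(1/2)
    # Broadcasting scales column k of U_1 by sqrt(sigma_k); reshape(-1)
    # then concatenates the rows, from row 1 to row 400.
    return (u1 * sqrt_sigma1).reshape(-1)


def pair_descriptor(matrix_a, matrix_b, rank=4):
    """Concatenate the descriptors of the two proteins in a pair."""
    return np.concatenate(
        [lra_descriptor(matrix_a, rank), lra_descriptor(matrix_b, rank)]
    )


def random_indicator_matrix(n_columns, seed):
    """Hypothetical stand-in for a real 400 x (L - 1) protein matrix."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((400, n_columns))
    matrix[rng.integers(0, 400, size=n_columns), np.arange(n_columns)] = 1.0
    return matrix


protein_a = random_indicator_matrix(59, seed=0)     # sequence length 60
protein_b = random_indicator_matrix(119, seed=1)    # sequence length 120
print(lra_descriptor(protein_a).shape)              # (1600,)
print(pair_descriptor(protein_a, protein_b).shape)  # (3200,)
```

One caveat worth keeping in mind: singular vectors are determined only up to sign (and up to rotation when singular values repeat), so different SVD implementations may return descriptors that differ in sign. The text does not say whether a sign convention was fixed.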

Support Vector Machine.
Machine learning has been seen as useful and reliable in many applications. Various machine learning techniques can be employed to predict the PPIs. Among them, support vector machine (SVM) is one of the popular learning algorithms based on statistical learning theory [27]. Here we give a brief introduction to the basic idea of SVM.
The goal of the SVM algorithm is to find an optimal hyperplane that separates the training samples by a maximal margin, with all positive samples lying on one side and all negative samples lying on the other side. Suppose that we are given a training dataset of instance-label pairs $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ with input data $x_i \in \mathbb{R}^d$ and labeled output data $y_i \in \{+1, -1\}$. The SVM algorithm solves the following quadratic optimization problem:
$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i$$
subject to
$$y_i \left( w^T x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, 2, \ldots, n,$$
where $w$ is the normal vector of the hyperplane; $b$ is the bias of the hyperplane; $C$ is the penalty factor; $\xi_i$ is the slack variable.
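In real applications, the training samples are often not linearly separable in their original space. Usually, the training samples are mapped into a high-dimensional feature space through some nonlinear function $\phi$. Then SVM finds a linear separating hyperplane with the maximal margin in this higher-dimensional space. Furthermore, $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ is called the kernel function. The flexibility and classification power of SVM reside in its kernel functions, since they make it possible to discriminate within challenging datasets. The linear, polynomial, radial basis function (RBF), and sigmoid kernels are given by
$$K(x_i, x_j) = x_i^T x_j,$$
$$K(x_i, x_j) = \left( \gamma\, x_i^T x_j + r \right)^d,$$
$$K(x_i, x_j) = \exp\left( -\gamma \, \|x_i - x_j\|^2 \right),$$
$$K(x_i, x_j) = \tanh\left( \gamma\, x_i^T x_j + r \right);$$
here, $\gamma$, $r$, and $d$ are kernel parameters which are set a priori.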
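For concreteness, the four kernel types named in this paper (linear, polynomial, RBF, and sigmoid) can be written as plain NumPy functions. The parameterization below follows the convention used by LIBSVM, with `gamma`, `r`, and `d` as the kernel parameters; the function names and the default parameter values are illustrative assumptions, not values reported by the authors.

```python
import numpy as np


def linear_kernel(x, z):
    """K(x, z) = x . z"""
    return float(np.dot(x, z))


def polynomial_kernel(x, z, gamma=1.0, r=0.0, d=3):
    """K(x, z) = (gamma * x . z + r) ** d"""
    return float((gamma * np.dot(x, z) + r) ** d)


def rbf_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))


def sigmoid_kernel(x, z, gamma=1.0, r=0.0):
    """K(x, z) = tanh(gamma * x . z + r)"""
    return float(np.tanh(gamma * np.dot(x, z) + r))


x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(linear_kernel(x, z))      # 11.0
print(polynomial_kernel(x, z))  # 1331.0
print(rbf_kernel(x, x))         # 1.0
```

Each function returns a similarity score for a single pair of samples; an SVM solver evaluates such a function on all pairs of training samples to form the kernel matrix.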
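If we replace the samples with their mapping $\phi(x)$ in the feature space, the constraint of the optimization problem becomes
$$y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0,$$
and the decision function becomes
$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \right),$$
where $N$ is the number of support vectors (SV), $x = [x_1, x_2, x_3, \ldots, x_d]$ is the input sample, and $\alpha_i$ are the Lagrange multipliers.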
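How a trained kernel SVM classifies a new sample can be shown with a toy example. The sketch assumes the standard kernel-expansion form of the decision function (a weighted sum of kernel values over the support vectors plus a bias) and an RBF kernel; the support vectors, multipliers, bias, and `gamma` below are made-up values for illustration, not parameters of the authors' model.

```python
import numpy as np


def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))


def decision_value(x, support_vectors, alphas, labels, bias, gamma):
    """Sum over support vectors of alpha_i * y_i * K(x_i, x), plus the bias."""
    kernel_values = np.array(
        [rbf_kernel(sv, x, gamma) for sv in support_vectors]
    )
    return float(np.dot(alphas * labels, kernel_values) + bias)


def predict(x, support_vectors, alphas, labels, bias, gamma):
    """Return +1 (interacting) or -1 (noninteracting)."""
    value = decision_value(x, support_vectors, alphas, labels, bias, gamma)
    return 1 if value >= 0 else -1


# Hypothetical model with two support vectors, one per class.
support_vectors = np.array([[0.0, 0.0], [2.0, 2.0]])
alphas = np.array([1.0, 1.0])
labels = np.array([-1.0, 1.0])
bias = 0.0
gamma = 0.5

print(predict([1.8, 2.1], support_vectors, alphas, labels, bias, gamma))   # 1
print(predict([0.1, -0.2], support_vectors, alphas, labels, bias, gamma))  # -1
```

Only the support vectors enter the sum, which is why prediction cost scales with the number of support vectors rather than with the size of the training set.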

Results and Discussion
In this section, we describe our simulation methodology and present the experimental results that evaluate the effectiveness of our scheme. The proposed sequence-based PPI predictor was implemented on the MATLAB platform. For the SVM algorithm, the LIBSVM implementation available from http://www.csie.ntu.edu.tw/~cjlin/libsvm/, originally developed by Chang and Lin [28], was used. Four kinds of kernel functions, namely, the radial basis function (RBF), polynomial function, linear function, and sigmoid function, were selected for the experiments. The optimized parameters for the SVM were obtained with a grid search approach. All the experiments were carried out on a computer with a 3.1 GHz 2-core CPU, 12 GB memory, and the Windows operating system.

Measures for the Prediction Performance.
In this study, fivefold cross-validation is employed to evaluate the performance of the proposed model. The whole dataset is randomly divided into five subsets, each consisting of a nearly equal number of interacting and noninteracting protein pairs. Four subsets are used for training and the remaining subset for testing. This process is repeated five times so that each subset is used once for testing, and the performance of the method is the average over the five test sets. Several evaluation measures are used to assess the predictive ability of the proposed method:
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN},$$
$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP},$$
$$\text{PPV} = \frac{TP}{TP + FP}, \qquad \text{NPV} = \frac{TN}{TN + FN},$$
$$F_1\text{-score} = \frac{2 \times \text{PPV} \times \text{Sensitivity}}{\text{PPV} + \text{Sensitivity}},$$
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}},$$
where true positive (TP) is the number of true PPIs that are predicted correctly; false negative (FN) is the number of true PPIs that are predicted to be noninteracting pairs; false positive (FP) is the number of true noninteracting pairs that are predicted to be PPIs; and true negative (TN) is the number of true noninteracting pairs that are predicted correctly. The above-mentioned measures depend on the selected threshold. The area under the ROC curve (AUC), which is a threshold-independent measure of performance, can be calculated according to the following formula [29]:
$$\text{AUC} = \frac{S_0 - n_0 (n_0 + 1)/2}{n_0 n_1},$$
where $n_0$ and $n_1$ denote the number of positive and negative samples, respectively, and $S_0$ is the sum of the ranks of all positive samples in the list of all samples ranked in increasing order of their estimated probability of being positive. AUC values give a good insight into the performance comparison of different prediction methods. Although the AUC is threshold-independent, an appropriate threshold must still be selected for the final decision.
For a classifier that outputs a continuous numeric value representing the confidence or probability that a sample belongs to the predicted class, adjusting the classification threshold leads to different confusion matrices, each of which determines a different point on the ROC curve [21].
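The evaluation measures used in this section can be computed directly from the confusion-matrix counts, and the AUC from the ranks of the positive samples. The following sketch uses the standard definitions of these measures, which are assumed to match the ones used in the paper; the function names and the toy inputs are illustrative, and tied scores receive average ranks.

```python
import math


def confusion_metrics(tp, fn, fp, tn):
    """Threshold-dependent measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)),
    }


def auc_from_ranks(scores, labels):
    """AUC = (S0 - n0 * (n0 + 1) / 2) / (n0 * n1), with label 1 as positive."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n0 = sum(1 for y in labels if y == 1)
    n1 = len(labels) - n0
    s0 = sum(rank for rank, y in zip(ranks, labels) if y == 1)
    return (s0 - n0 * (n0 + 1) / 2) / (n0 * n1)


metrics = confusion_metrics(tp=90, fn=10, fp=20, tn=80)
print(metrics["accuracy"])                                  # 0.85
print(auc_from_ranks([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

In the toy AUC example, the positive samples hold ranks 2 and 4, so the rank sum is 6 and the AUC is (6 - 3) / 4 = 0.75.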

Prediction Performance of Proposed Model.
We evaluated the performance of the proposed model using the DIP PPI data investigated by Guo et al. [22]. To check that the experimental results are valid and generalize to new data, fivefold cross-validation is used to evaluate the performance of the proposed method. The whole PPI dataset is randomly divided into five subsets of roughly equal size, and each subset consists of a nearly equal number of interacting and noninteracting protein pairs. Four out of these five subsets are used for training and the remaining one for testing. This process is repeated five times such that each subset is used once and only once for testing. The results are then averaged over the five runs. The prediction performance of the SVM predictor with the matrix-based protein sequence representation across the five runs is shown in Table 2. It can be observed from Table 2 that a high prediction accuracy of 90.06% is obtained for the proposed model. To better investigate the prediction ability of our model, we also examined the variation across the five runs: it can be seen from Table 2 that the standard deviation of sensitivity, precision, accuracy, MCC, and AUC is as low as 0.0094, 0.0098, 0.0064, 0.0103, and 0.0064, respectively. We further compared our method with those of Guo et al. [22], Zhou et al. [30], and Yang et al. [31], where SVM, SVM, and KNN were used with the conventional autocovariance, local descriptor, and local descriptor representations as the input feature vectors, respectively. From Table 2, we can see that the performance of all of these methods, with their different machine learning models and sequence-based feature representations, is lower than ours, which indicates the advantage of our method. To sum up, the proposed approach generally outperforms the previous models, with higher discrimination power for predicting PPIs based on protein sequence information.
Therefore, our model appears to be a more appropriate method for predicting new protein interactions than the other methods compared here. These results suggest that the proposed method can help biologists in the design and validation of experimental studies and in the prediction of interaction partners.
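The fivefold cross-validation protocol used for this evaluation, in which each subset holds a nearly equal number of interacting and noninteracting pairs, can be sketched in plain Python. The function names and the random seed are hypothetical, and the authors' actual partition is not reproduced; the example only mirrors the class sizes of the yeast dataset (5,594 positive and 5,594 negative pairs).

```python
import random


def stratified_folds(labels, n_folds=5, seed=0):
    """Split sample indices into folds with balanced class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def cross_validation_splits(labels, n_folds=5, seed=0):
    """Yield (train_indices, test_indices); each fold is the test set once."""
    folds = stratified_folds(labels, n_folds, seed)
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# 1 = interacting pair, 0 = noninteracting pair.
labels = [1] * 5594 + [0] * 5594
splits = list(cross_validation_splits(labels))
print([len(test) for _, test in splits])  # [2238, 2238, 2238, 2238, 2236]
```

A model would be trained on each `train` list and scored on the matching `test` list, and the five scores would then be averaged.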

Comparison between the Proposed Model and AADC Method.
The amino acid dipeptide composition (AADC) is a representation method for protein sequences that counts the frequency of occurrence of adjacent pairs of amino acids. Like the proposed matrix-based protein sequence representation, AADC needs only the amino acid sequence of the protein; it does not use the physicochemical properties of amino acids or any other biological information about proteins. To assess the performance of the proposed model, we compared the proposed protein feature representation with the AADC method.
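A minimal sketch of the AADC representation is given below. It assumes that each dipeptide count is normalized by the number of dipeptides in the sequence (L - 1) and that the 400 dipeptides are ordered by the amino acid order listed earlier in the paper; both choices, and the function name, are assumptions rather than details stated in the text.

```python
from collections import Counter

AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"
# Assumed ordering of the 400 dipeptides: AA, AR, AN, ..., YV, VV.
DIPEPTIDES = [a + b for a in AMINO_ACIDS for b in AMINO_ACIDS]


def aadc(sequence):
    """Return the 400-dimensional dipeptide frequency vector of a sequence."""
    total = len(sequence) - 1
    if total < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = Counter(sequence[i:i + 2] for i in range(total))
    return [counts[dipeptide] / total for dipeptide in DIPEPTIDES]


vector = aadc("ARARA")
print(len(vector))                        # 400
print(vector[DIPEPTIDES.index("AR")])     # 0.5
print(vector[DIPEPTIDES.index("RA")])     # 0.5
```

Under this normalization, the AADC vector equals the row sums of the 400 by (L - 1) indicator matrix divided by L - 1, which makes explicit that AADC keeps how often each dipeptide occurs but discards where it occurs.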
The prediction performance of the SVM predictor with the aforementioned two protein sequence representations across five runs is shown in Table 3. It can be observed from Table 3 that a high prediction accuracy of 90.06% is achieved for the proposed model with the Gaussian (RBF) kernel function. To better investigate the prediction ability of our model, we also calculated the values of sensitivity, specificity, PPV, NPV, F1-score, MCC, and AUC. From Table 3, we can see that our model gives good prediction performance with an average sensitivity value of 85.74%, specificity value of 94.37%, PPV value of 93.84%, NPV value of 86.89%, F1-score value of 89.61%, MCC value of 82.03%, and AUC value of 95.28%. Further, it can also be seen from Table 3 that the standard deviation of accuracy, sensitivity, specificity, PPV, NPV, F1-score, MCC, and AUC is as low as 0.0064, 0.0094, 0.0095, 0.0098, 0.0048, 0.0076, 0.0103, and 0.0064, respectively. The performance of the proposed model with other kernel functions, including the sigmoid function, polynomial function, and linear function, is also given in Table 3.
In addition, the prediction performance of the AADC-based model is shown in Table 3. The AUC of the AADC model with the Gaussian kernel is 0.9292, which is lower than that of the proposed model. The overall accuracy, sensitivity, specificity, PPV, NPV, F1-score, and MCC of the AADC model are, respectively, 86.54%, 83.49%, 89.59%, 88.92%, 84.43%, 86.12%, and 76.66%, as shown in Table 3. Hence, it can be seen that almost all evaluation measures of the proposed model are better than those of the AADC method.
We also conducted experiments to characterize the sensitivity (i.e., the fraction of true positives that are detected by our method) and specificity (i.e., 1 − false positive rate) of the proposed approach for different kernel functions (see Figure 2). The results in Figure 2 are reported using receiver operating characteristic (ROC) curves, which plot the achievable sensitivity at a given specificity (1 − false positive rate). Good performance is reflected in curves with a stronger bend towards the upper-left corner of the ROC graph (i.e., high sensitivity is achieved with a low false positive rate). We found that the proposed method achieved a detection rate of over 89 percent with a false positive rate of less than 10 percent. The results demonstrate that the proposed matrix-based model can successfully separate positive and negative samples with all of the kernel functions that we investigated, correctly classifying most interacting and noninteracting protein pairs. To sum up, considering its efficiency as well as its good performance, the proposed approach generally outperforms the AADC model, with higher discrimination power for predicting PPIs based on protein sequence information.

Comparing the Prediction Performance between Our Method and Other Existing Methods.
To further assess our model, it was also tested on the Helicobacter pylori dataset. This dataset allows a comparison of the proposed method with several previous works, including phylogenetic bootstrap [32], signature products [24], HKNN [33], and boosting [34]. The phylogenetic bootstrap, signature products, and HKNN methods use an individual classifier to infer PPIs, while the boosting method is an ensemble-based classifier.
The average prediction results of 10-fold cross-validation for the five different approaches are given in Table 4. From Table 4, we can see that the average sensitivity, precision, accuracy, and MCC achieved by the proposed predictor are 85.27%, 83.33%, 85.91%, and 75.53%, respectively. This shows that our method outperforms the other individual classifier-based methods as well as the ensemble classifier system (i.e., boosting). These results demonstrate that the proposed method not only achieves accurate performance but also substantially improves precision in the prediction of PPIs.

Conclusions
In this paper, we proposed an efficient and accurate learning technique, which utilizes the order and distribution of dipeptides in the protein amino acid sequence, for accurate and fast identification of PPIs. The order and distribution of dipeptides carry more information than the amino acid dipeptide composition (AADC) alone, so the main advantage of the algorithm is that it can extract more of the information hidden in protein primary sequences than AADC can. The SVM predictor is then applied to classify the resulting descriptors. Experimental results demonstrated that the proposed method performed well in distinguishing interacting from noninteracting protein pairs, achieving a mean classification accuracy of 90.06% under fivefold cross-validation. A comparative study of the proposed method and other existing methods was also conducted, and the experimental results showed that our method outperformed these works in terms of classification accuracy.