Prediction of Protein-Protein Interaction By Metasample-Based Sparse Representation

Protein-protein interactions (PPIs) play key roles in many cellular processes such as transcription regulation, cell metabolism, and endocrine function. Understanding these interactions greatly advances the study of the pathogenesis and treatment of various diseases. A large amount of data has been generated by experimental techniques; however, most of these data are incomplete or noisy, and current biological experimental techniques are time-consuming and expensive. In this paper, we propose a novel method (metasample-based sparse representation classification, MSRC) for PPIs prediction. A group of metasamples is extracted from the original training samples, and the $\ell_1$-regularized least squares method is then used to express a new testing sample as a linear combination of these metasamples. PPIs prediction is achieved by using a discrimination function defined on the representation coefficients. Applied to a PPIs dataset, MSRC achieves 84.9% sensitivity and 94.55% specificity; its accuracy is slightly lower than that of the support vector machine (SVM) and much higher than that of naive Bayes (NB), neural networks (NN), and k-nearest neighbor (KNN). The result shows that MSRC is efficient for PPIs prediction.


Introduction
Protein-protein interactions are a hot research topic in bioinformatics. Proteins form protein-protein complexes and carry out different biological processes through their interactions. PPIs play important roles in most cellular processes, including regulation of transcription and translation, signal transduction, and recognition of foreign molecules [1]. So far, many experimental methods have been explored for detecting PPIs, including two-hybrid systems, which detect both transient and stable interactions [2,3]; mass spectrometry, which is used to identify components of protein complexes [4]; and protein chip technology [5], which immobilizes known proteins on a chip and then uses the chip to detect protein interactions. These methods are easy to manipulate, and their results are intuitive and authentic; however, they are impractical for high-throughput data.
Currently, a number of computational methods have been widely exploited for the prediction of PPIs. These computational methods [6] can be roughly divided into sequence-based [7][8][9], structure-based [10][11][12], and function annotation-based [13][14][15] methods. The advantage of sequence-based methods is that they do not require expensive and time-consuming processes to determine protein structures. Martin et al. [16] used a novel description of interacting proteins, obtained by extending the signature descriptor, to predict PPIs. Bock and Gough [17,18] attempted to solve the classification problem with an SVM using several structural and physicochemical descriptors. Chou and Cai [21] used the pseudo-amino acid composition approach [19,20] to predict PPIs in a hybridization space. Guo et al. [22] used the autocorrelation descriptor with SVM to predict PPIs; on the PPI data of yeast S. cerevisiae, it achieved a very promising prediction result. Zhang et al. [23] used a pairwise kernel support vector machine to predict PPIs. There are already many ways to predict PPIs, but these methods are, to a certain extent, not efficient and reliable. Moreover, most of them have not adequately taken the local environment of residues into account.
Mathematical Problems in Engineering
Sparse representation (SR), which is inspired by the recent progress of $\ell_1$-norm minimization based methods, is a powerful data processing method. The $\ell_1$-norm minimization based methods include basis pursuit [24], compressive sensing for sparse signal reconstruction [25][26][27], and the least absolute shrinkage and selection operator (LASSO) algorithm for feature selection [28]. The SR method represents a test sample in terms of the training samples of the same category. To find the SR coefficient vector, the $\ell_1$-regularized least squares method [29] is used. Common learning methods use a training procedure to create a classification model for testing. In contrast, the sparse representation approach does not include separate training and testing stages. The SR methods represent a PPIs test sample as a sparse linear combination of the original training samples, and the representation error over each class is regarded as an indicator of class membership. Nevertheless, because the original PPIs training samples do not capture the intrinsic structural information of the data, metasamples [30,31] are expected to be more effective for PPIs prediction than the original training samples.
Metasamples can capture the intrinsic structural information of the data and represent protein-protein interactions as a linear combination. They can be obtained from the original PPI data by singular value decomposition (SVD). The $\ell_1$-regularized least squares method is used to find the SR coefficient vector, and classification is achieved on the metasamples by using a discriminating function of the SR coefficient vector.
Here, we use the sparse representation classification (SRC) [32] method with metasamples for PPIs prediction; the approach is called metasample-based sparse representation classification (MSRC) [33].

Metasample of PPIs Data.
Normally, metasamples, which capture the inherent information of the data, are extracted from the original samples, and each is defined as a linear combination of several samples. As Figure 1 illustrates, matrix decomposition converts the original matrix into the following two matrices:
$$A \approx WH. \tag{1}$$
The PPIs data are represented as matrix $A$ by preprocessing. Each row represents a sample and each column represents a feature. The original matrix is converted into two matrices, where $W$ is of size $m \times p$ and $H$ is of size $p \times n$; each of the $p$ columns of $W$ defines a metasample. In this way, much information that may express the implicit characteristics of the data is obtained.
Metasamples can be extracted by SVD, a matrix decomposition that is expected to capture some implicit information of PPIs data for classification.
SVD is one of the important matrix decompositions in linear algebra. SVD converts the original matrix into feature matrices (the singular vectors) and a diagonal matrix consisting of the feature values (the singular values). By convention, the feature values are arranged on the diagonal from largest to smallest, and the leading columns of the feature matrix are the ones that are retained. In other words, for a matrix with high dimension, SVD performs a linear transformation on the matrix.
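The SVD-based extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses NumPy, the function name `extract_metasamples` and the toy data are hypothetical, and the data matrix is transposed so that samples become columns before factorizing.

```python
import numpy as np

def extract_metasamples(class_samples, p):
    """Return (W, H) with A ~ W @ H, where the p columns of W are metasamples.

    class_samples: array of shape (n_samples, n_features), one sample per row
    (hypothetical input layout, assumed for this sketch).
    """
    # Transpose so that each sample is a column of A.
    A = np.asarray(class_samples, dtype=float).T
    # Thin SVD: singular values in s are returned in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    p = min(p, U.shape[1])          # cannot keep more columns than exist
    W = U[:, :p]                    # leading left singular vectors
    H = s[:p, None] * Vt[:p, :]     # coefficients of each sample
    return W, H

# Toy data (invented for illustration): 20 samples with 6 features.
rng = np.random.default_rng(0)
toy = rng.random((20, 6))
W, H = extract_metasamples(toy, p=3)
```

Keeping all columns reproduces the data exactly, while a smaller `p` gives a low-rank approximation whose columns play the role of metasamples.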

Sparse Representation of Test PPI Samples.
In fact, PPIs prediction is a binary classification problem. Normally, the training dataset of PPIs is represented by an $n \times m$ matrix $A$, with each sample being a row and each feature being a column. In the equations below, each sample is written as a column vector.
Each class has its own matrix; the $n_i$ samples of the $i$th class form the matrix
$$A_i = [a_{i,1}, a_{i,2}, \ldots, a_{i,n_i}]. \tag{2}$$
Given the training samples of a class and a testing sample $y$ of PPIs that belongs to the same class, $y$ is represented as a linear weighted combination of those training samples:
$$y = x_{i,1} a_{i,1} + x_{i,2} a_{i,2} + \cdots + x_{i,n_i} a_{i,n_i}. \tag{3}$$
The class of a new test sample $y$ is unknown in the prediction of PPIs. When there are $k$ classes, we use matrix notation with $A = [A_1, A_2, \ldots, A_k]$, and any test sample $y$ is expressed as a linear combination of all the training samples:
$$y = A x_0. \tag{4}$$
Here $x_0$ is the weight vector whose nonzero weights correspond to the class of $y$; we can determine the class of the new test sample $y$ from $x_0$:
$$x_0 = [0, \ldots, 0, x_{i,1}, x_{i,2}, \ldots, x_{i,n_i}, 0, \ldots, 0]^T.$$
In order to determine the class that the test sample belongs to, $x_0$ must be evaluated. From the formula above, we can see that the representation of $y$ is naturally sparse. If $y$ belongs to one class, the nonzero elements of the vector $x$ are associated with that class, and the remaining elements, which are associated with the other classes, are zero; the more classes there are, the more zeros the vector contains. The problem thus becomes that of finding a vector $x$. In the following optimization problem, $\|x\|_0$ is the $\ell_0$-norm of $x$, that is, the number of nonzero elements in $x$:
$$\hat{x}_0 = \arg\min \|x\|_0 \quad \text{subject to } Ax = y. \tag{5}$$
The above problem is an optimization problem with an equality constraint. Since the problem is NP-hard, (5) is converted to the following $\ell_1$-minimization problem:
$$\hat{x}_1 = \arg\min \|x\|_1 \quad \text{subject to } Ax = y. \tag{6}$$
For a general matrix $A$, (6) may not have an exact solution, so (6) is converted to the following generalized version:
$$\hat{x} = \arg\min_x \|Ax - y\|_2^2 + \lambda \|x\|_1. \tag{7}$$
Equation (7) is an $\ell_1$-regularized least squares problem that tolerates a certain amount of noise; it is a generalized version of (6). The $\ell_1$-regularized least squares problem always has a solution.
$\ell_1$-regularized LS typically yields a sparse vector $x$ that has relatively few nonzero coefficients. Here, $\|x\|_1$ represents the $\ell_1$-norm of $x$ and $\lambda > 0$ is the regularization parameter [29]. Through (7), we expect $Ax$ to be as close as possible to $y$; that is, the $\ell_2$ distance between $Ax$ and $y$ should be as small as possible. The positive parameter $\lambda$ in (7) also helps prevent overfitting. In conclusion, the original problem is expressed as a sparse representation and then converted to optimization problem (7) by a series of transformations. This optimization problem can be solved by the truncated Newton interior-point method [29].
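To make the $\ell_1$-regularized least squares step concrete, the sketch below minimizes $\|Ax - y\|_2^2 + \lambda\|x\|_1$ with a plain iterative soft-thresholding (ISTA) loop. This is a deliberately simpler technique swapped in for illustration; it is not the truncated Newton interior-point method cited in the text. The function names and the toy inputs are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink every entry of v toward zero by t (proximal map of t * ||.||_1)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_regularized_ls(A, y, lam, n_iter=500):
    """Approximately minimize ||A x - y||_2^2 + lam * ||x||_1 by ISTA."""
    A = np.asarray(A, dtype=float)
    y = np.asarray(y, dtype=float)
    # Lipschitz constant of the gradient of ||A x - y||_2^2
    # (assumes A is not the zero matrix).
    L = 2.0 * np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - y)           # gradient of the smooth term
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Toy check with A = I: the minimizer is y shrunk toward zero by lam / 2,
# so the small middle entry is set exactly to zero.
x_hat = l1_regularized_ls(np.eye(3), np.array([3.0, 0.2, -1.0]), lam=1.0)
```

The toy case shows the sparsifying effect of the penalty: coefficients smaller than the threshold vanish, which is the behaviour the classification step relies on.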

Metasample-Based Sparse Representation Classification.
The metasamples contain the inherent structural information of the training samples. Each subdataset matrix $A_i$ can be factorized into two matrices as follows:
$$A_i \approx W_i H_i, \tag{8}$$
where the $p_i$ columns of $W_i$ are the metasamples of class $i$. After the metasamples $W_i$ of each class are computed, the metasamples from all $k$ classes are collected in the matrix
$$W = [W_1, W_2, \ldots, W_k]. \tag{9}$$
After converting $A$ into $W$, the SR of a given test sample $y$ is computed by solving
$$\hat{x} = \arg\min_x \|Wx - y\|_2^2 + \lambda \|x\|_1. \tag{10}$$
The optimization problem in (10) is solved using the truncated Newton interior-point method, as implemented in the l1_ls MATLAB package.
In the absence of noise and error, the nonzero entries of the vector $\hat{x}$ are all related to the columns of $W$ from a single class $i$; that is to say, the new test sample $y$ belongs to class $i$. If noise and error exist, however, a few nonzero entries may be related to multiple classes; to solve this problem, we use the coefficients of each class to observe how well the test sample can be reconstructed.
For each class $i$, the characteristic function $\delta_i$ selects the coefficients related to the $i$th class. We can reconstruct the given test sample $y$ as $\hat{y}_i = W\delta_i(\hat{x})$, then compute the $\ell_2$ distance between $y$ and $\hat{y}_i$, and finally assign $y$ to the class that minimizes this distance:
$$\operatorname{identity}(y) = \arg\min_i \|y - W\delta_i(\hat{x})\|_2. \tag{11}$$
The flow chart of the experiment is shown in Figure 2.
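The whole classification rule (per-class metasamples, sparse coding, and class-wise reconstruction residuals) can be sketched end to end. This is an illustrative toy under stated assumptions: the helper names are hypothetical, the data are invented, and the sparse coding step uses iterative soft-thresholding rather than the l1_ls package used in the paper.

```python
import numpy as np

def metasamples(X, p):
    """Leading p left singular vectors of one class (X: samples in rows)."""
    U, _, _ = np.linalg.svd(np.asarray(X, dtype=float).T, full_matrices=False)
    return U[:, :p]

def sparse_code(W, y, lam, n_iter=1000):
    """ISTA for min ||W x - y||_2^2 + lam * ||x||_1 (stand-in solver)."""
    L = 2.0 * np.linalg.norm(W, 2) ** 2
    x = np.zeros(W.shape[1])
    for _ in range(n_iter):
        z = x - 2.0 * W.T @ (W @ x - y) / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return x

def msrc_predict(class_data, y, p, lam=0.01):
    """Return (predicted class index, list of per-class residuals)."""
    blocks = [metasamples(X, p) for X in class_data]
    W = np.hstack(blocks)                    # metasamples of all classes
    x = sparse_code(W, y, lam)
    residuals, start = [], 0
    for block in blocks:
        stop = start + block.shape[1]
        delta = np.zeros_like(x)             # keep only this class's coefficients
        delta[start:stop] = x[start:stop]
        residuals.append(float(np.linalg.norm(y - W @ delta)))
        start = stop
    return int(np.argmin(residuals)), residuals

# Invented toy data: class 0 lives in the first two coordinates,
# class 1 in the last two.
class0 = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]])
class1 = np.array([[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, 1.0]])
label_a, _ = msrc_predict([class0, class1], np.array([1.0, 0.5, 0.0, 0.0]), p=2)
label_b, _ = msrc_predict([class0, class1], np.array([0.0, 0.0, 2.0, 1.0]), p=2)
```

In this toy setting each test vector is reconstructed almost perfectly by the metasamples of its own class and not at all by the other class, so the smallest residual identifies the class.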
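Prediction performance is measured by accuracy, sensitivity, specificity, and precision, which are defined as
$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \text{sensitivity} = \frac{TP}{TP + FN}, \quad \text{specificity} = \frac{TN}{TN + FP}, \quad \text{precision} = \frac{TP}{TP + FP},$$
where true positive (TP) is the number of interacting pairs predicted as interacting, true negative (TN) is the number of noninteracting pairs predicted as noninteracting, false positive (FP) is the number of noninteracting pairs predicted as interacting, and false negative (FN) is the number of interacting pairs predicted as noninteracting. All these indicators are obtained by 5-fold cross validation.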
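The four indicators follow directly from the confusion counts. The short sketch below uses their standard definitions; the function name and the counts are hypothetical and are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard accuracy, sensitivity, specificity, and precision."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # fraction of interacting pairs recovered
        "specificity": tn / (tn + fp),   # fraction of noninteracting pairs recovered
        "precision": tp / (tp + fp),     # fraction of predicted interactions that are real
    }

# Hypothetical counts on a balanced test set of 200 pairs.
metrics = classification_metrics(tp=85, tn=95, fp=5, fn=15)
```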

Generation of the Dataset.
A dataset of physical protein interactions [34] from Guo et al. [35] was used in our experiments. The data were downloaded from the S. cerevisiae core subset of the Database of Interacting Proteins (DIP) [36]. After removing proteins with fewer than 50 residues or with ≥40% sequence identity, 5594 protein pairs remained and formed the final positive dataset. The final negative dataset consists of noninteracting pairs, which were generated from pairs of proteins that have different subcellular localizations. The positive and negative datasets together form the final dataset, which consists of 11188 protein pairs. From the final dataset, 80% of the protein pairs were randomly selected as the training set, and the remaining protein pairs were used as the testing set.
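A random 80%/20% partition of the 11188 pairs can be sketched as below. The seed, the function name, and the choice of a plain (non-stratified) shuffle are assumptions made for illustration; the text does not say how the random split was drawn.

```python
import random

def split_indices(n_pairs, train_fraction=0.8, seed=0):
    """Shuffle pair indices and cut them into training and testing parts."""
    indices = list(range(n_pairs))
    random.Random(seed).shuffle(indices)      # fixed seed for reproducibility
    n_train = int(n_pairs * train_fraction)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(11188)
```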

Feature Representation.
Conjoint triad (CT) [37] is used as the feature representation method because of its prediction accuracy in a previous study. CT takes the properties of one amino acid and its vicinal amino acids into account and treats any three continuous amino acids as a unit. Triads can therefore be differentiated according to the classes of their amino acids. Here, we use a binary space $(V, F)$ to represent a protein sequence; $V$ is the vector of sequence features, and $F$ is the frequency vector corresponding to $V$. According to the dipoles and volumes of the side chains, the 20 amino acids are clustered into seven classes (listed in Table 1), so the size of $V$ is 7 × 7 × 7 = 343. Figure 3 shows the descriptors for $(V, F)$. Eventually, a 686-dimensional vector is built to represent each protein pair.
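The conjoint triad encoding can be sketched in a few lines. Because Table 1 is not reproduced in this text, the seven-class grouping below is an assumption taken from the commonly cited conjoint triad scheme; the function names are hypothetical, and raw triad counts are used without any normalization step.

```python
from itertools import product

# Assumed seven-class grouping of the 20 amino acids (not copied from Table 1).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: idx for idx, group in enumerate(CLASSES) for aa in group}
# All 7 * 7 * 7 = 343 class triads, each mapped to a position in the vector.
TRIAD_INDEX = {t: i for i, t in enumerate(product(range(7), repeat=3))}

def triad_frequencies(sequence):
    """Count each class triad over all windows of three consecutive residues."""
    counts = [0] * len(TRIAD_INDEX)
    classes = [AA_TO_CLASS.get(aa) for aa in sequence.upper()]
    for window in zip(classes, classes[1:], classes[2:]):
        if None in window:          # skip windows with unknown residues
            continue
        counts[TRIAD_INDEX[window]] += 1
    return counts

def pair_features(seq_a, seq_b):
    """Concatenate the two 343-dimensional vectors of a protein pair."""
    return triad_frequencies(seq_a) + triad_frequencies(seq_b)

features = pair_features("AGVC", "RKDE")
```

Concatenating the two per-protein vectors gives the 686-dimensional representation of a pair; the order of the two proteins matters in this simple sketch.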

Classification of PPIs Dataset.
The two-class classification experiment was carried out with the proposed method. Each experiment was repeated 5 times to obtain a reliable result. The mean classification accuracies of 5-fold cross validation on the PPI dataset are plotted in Figure 4; accuracy, sensitivity, specificity, and precision were all obtained from the experiment. In Figure 4, the x-axis shows the number of metasamples and the y-axis shows the classification accuracy. Figure 4 shows that the accuracy depends on the number of metasamples, with some fluctuation. As the number of metasamples increases from 0 to 840, the accuracy rises steadily, and at 840 metasamples it reaches its highest value of about 89.72%. When the number of metasamples exceeds 840, in the range from 840 to 1340, the accuracy begins to decline. In other words, if the number of metasamples is less than 840, the metasamples cannot capture sufficient inherent structural information of each class. In addition, the number of training samples available for metasample extraction cannot be too small, which is the main weakness of the proposed method.

Comparison with Other Methods.
For the comparison, the dataset is divided into an 80% part and a 20% part. The 80% part is the training set, on which 5-fold cross validation with SVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) is used to select the optimal parameters $C$ and $\gamma$ ($C = 8$, $\gamma = 0.001953125$). The optimal parameters are then applied to the remaining 20%, the testing set, on which the accuracy reaches 91.96%. Weka (http://www.cs.waikato.ac.nz/ml/weka/) is used to implement the KNN, NN, and NB algorithms and to obtain their accuracy, sensitivity, specificity, and precision. Comparing the performance of MSRC with that of SVM, KNN, NN, and NB reveals the advantages of MSRC. As can be seen from Table 2 and Figure 4 for the PPIs dataset, MSRC-SVD achieves better classification results than KNN, NN, and NB.

The Number of Samples for Metasample Training.
From the above experimental results, we can see that our method can effectively classify PPI data. The number of metasamples influences the result of MSRC. Because the metasamples are extracted by SVD, we must determine the number of metasamples of each class, which is the value of $p_i$ in (8). The PPIs dataset of this paper has only two classes of similar size, so we set $p_1 = p_2 = p$; the value of $p$ is determined by nested stratified 5-fold cross validation.
In the experiment, SVD is applied to extract metasamples from the original training samples, separately for each class. In detail, the method extracts an equal number of metasamples from each class and combines them. Because SVD first computes the eigenvalues and eigenvectors when reducing the dimension, the data of each class in the experiment yield a 686 × 686 eigenvector matrix. The metasamples of a class are then taken from this eigenvector matrix, so the number extracted for each class cannot exceed the size of the matrix; for this dataset, at most 686 can be extracted from each class. Because the same number is extracted from each class, up to a total of 1362 samples can be extracted.

Conclusion
PPIs prediction is currently one of the hot research areas. Here, a novel method based on SR was developed for PPIs prediction. Since the original training samples do not capture the intrinsic structural information of the data as well as metasamples do, MSRC uses SVD to extract a set of metasamples and represents each testing sample as a linear combination of them. The experimental results show that MSRC is efficient in PPIs prediction; it performs better than KNN, NN, and NB and is only slightly less accurate than SVM. Moreover, our method differs from common classification algorithms, which construct a model from training samples. In the future, we will investigate how to choose the number of metasamples so as to improve the classification accuracy.

Figure 1: The metasample model of protein-protein interactions.

Figure 3: Schematic diagram for constructing the vector space $(V, F)$ of protein sequence.

Table 1: Division of amino acids based on the dipoles and volumes of the side chains.

Table 2: Comparison of state-of-the-art methods on the PPIs dataset.

Table 2 shows the accuracy, sensitivity, specificity, and precision in prediction. The result demonstrates that our method correctly predicts PPIs with an accuracy of 89.72%, slightly lower than SVM and clearly higher than KNN, NN, and NB. NN and NB show distinctly worse sensitivity than MSRC. SVM may be the best classifier for predicting PPI data, so the sensitivity of MSRC is also lower than that of SVM. In terms of specificity, MSRC, at 94.55%, has a distinct advantage over SVM, KNN, NN, and NB. MSRC also performs best in terms of precision, at 93.97%, much better than the other four algorithms.