Using the Relevance Vector Machine Model Combined with Local Phase Quantization to Predict Protein-Protein Interactions from Protein Sequences

We propose a novel computational method, RVM-LPQ, that combines the Relevance Vector Machine (RVM) model with Local Phase Quantization (LPQ) to predict protein-protein interactions (PPIs) from protein sequences. The main improvements come from representing protein sequences with the LPQ feature representation computed on a Position Specific Scoring Matrix (PSSM), reducing the influence of noise with Principal Component Analysis (PCA), and using an RVM-based classifier. In 5-fold cross-validation experiments on the Yeast and Human datasets, the method achieves accuracies of 92.65% and 97.62%, respectively, which are higher than those reported in previous works. To further evaluate the proposed method, we compare it with the state-of-the-art support vector machine (SVM) classifier on the Yeast dataset. The experimental results show that RVM-LPQ outperforms the SVM-based method. These results indicate that the proposed method is efficient and simple and can serve as an automatic decision support tool for future proteomics research.


Introduction
Proteins are crucial molecules that participate in many cellular functions in an organism. Typically, proteins do not perform their roles individually, so the detection of protein-protein interactions (PPIs) is becoming increasingly important. Knowledge of PPIs can provide insight into the molecular mechanisms of biological processes and lead to a better understanding of practical medical applications. In recent years, various high-throughput technologies, such as yeast two-hybrid screening methods [1,2], immunoprecipitation [3], and protein chips [4], have been developed to detect interactions between proteins. A large quantity of PPI data for different organisms has been generated, and many databases, such as MINT [5], BIND [6], and DIP [7], have been built to store protein interaction data. However, these experimental methods are time-intensive and costly.
In addition, the aforementioned approaches suffer from high rates of false positives and false negatives. For these reasons, predicting unknown PPIs using only biological experimental methods is considered a difficult task.
As a result, a number of computational methods have been proposed to infer PPIs from different sources of information, including phylogenetic profiles, tertiary structures, protein domains, and secondary structures [8][9][10][11][12][13][14][15][16]. However, these approaches cannot be employed when prior knowledge about a protein of interest is not available. With the rapid growth of protein sequence data, sequence-based methods are becoming the most widely used tools for predicting PPIs, and a number of them have been developed. For example, Bock and Gough [17] used a support vector machine (SVM) combined with several structural and physicochemical descriptors to predict PPIs. Shen et al. [18] developed a conjoint triad method to infer human PPIs. Martin et al. [19] used a descriptor called the signature product of subsequences and an expansion of the signature descriptor based on the available chemical information to predict PPIs. Guo et al. [20] used the SVM model combined with an autocorrelation descriptor to predict Yeast PPIs. Nanni and Lumini [21] proposed a method based on an ensemble of K-local hyperplane distances to infer PPIs. Several other methods based on protein amino acid sequences have been proposed in previous work [22,23]. Nevertheless, there is still room to improve the accuracy and efficiency of the existing methods.
In this paper, we propose a novel computational method that predicts PPIs using only protein sequence data. The main improvements come from representing protein sequences with the LPQ feature representation computed on a Position Specific Scoring Matrix (PSSM), reducing the influence of noise with Principal Component Analysis (PCA), and using a Relevance Vector Machine (RVM) based classifier. More specifically, we first represent each protein using a PSSM. Then, an LPQ descriptor is employed to capture useful information from each protein PSSM and generate a 256-dimensional feature vector. Next, PCA is used to reduce the dimensions of the LPQ vector and the influence of noise. Finally, the RVM model is employed as the machine learning approach to carry out classification. The proposed method was evaluated on two different PPI datasets: Yeast and Human. The experimental results are superior to those of the SVM and other previous methods, which indicates that the proposed method performs well in predicting PPIs.

Dataset.
To verify the proposed method, two publicly available datasets, Yeast and Human, are used in our study. Both were obtained from the publicly available Database of Interacting Proteins (DIP) [24]. From the Yeast dataset, we selected 5594 positive protein pairs to build the positive dataset and 5594 negative protein pairs to build the negative dataset. Similarly, from the Human dataset, we selected 3899 positive protein pairs to build the positive dataset and 4262 negative protein pairs to build the negative dataset. Consequently, the Yeast dataset contains 11188 protein pairs and the Human dataset contains 8161 protein pairs.

Position Specific Scoring Matrix. A Position Specific Scoring Matrix (PSSM) is an $L \times 20$ matrix $M = \{m_{ij} : i = 1, \ldots, L,\ j = 1, \ldots, 20\}$ for a given protein, where $L$ is the length of the protein sequence and 20 represents the 20 amino acids [28][29][30][31][32][33]. A score $m_{ij}$ is allocated for the $j$th amino acid in the $i$th position of the given protein sequence in the PSSM. The score is expressed as

$$m_{ij} = \sum_{k=1}^{20} p(i,k) \times q(j,k),$$

where $p(i,k)$ is the ratio of the frequency of the $k$th amino acid appearing at position $i$ of the probe to the total number of probes and $q(j,k)$ is the value of Dayhoff's mutation matrix [34] between the $j$th and $k$th amino acids [35][36][37]. As a result, a high score represents a strongly conserved position and a low score represents a weakly conserved position [38][39][40].
PSSMs are used to predict protein folding patterns, protein quaternary structural attributes, and disulfide connectivity [41,42]. Here, we also use PSSMs to predict PPIs. In this paper, we used the Position Specific Iterated BLAST (PSI-BLAST) [43] to create a PSSM for each protein sequence. The e-value parameter was set to 0.001, and three iterations were selected to obtain broadly and highly homologous sequences. Each resulting PSSM is a matrix with 20 columns that is composed of $L \times 20$ elements, where $L$ is the total number of residues in the protein. The rows of the matrix represent the protein residues, and the columns of the matrix represent the 20 amino acids.
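The score definition above is a matrix product, which the following minimal sketch makes explicit. It is only an illustration: the function name `pssm_scores` and the random toy inputs are hypothetical, and in practice the PSSM is produced by PSI-BLAST rather than computed by hand.

```python
import numpy as np

def pssm_scores(p, q):
    """Return M with M[i, j] = sum_k p[i, k] * q[j, k].

    p : (L, 20) array, p[i, k] = relative frequency of amino acid k at position i
    q : (20, 20) array, q[j, k] = substitution score between amino acids j and k
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.ndim != 2 or p.shape[1] != 20 or q.shape != (20, 20):
        raise ValueError("expected p of shape (L, 20) and q of shape (20, 20)")
    return p @ q.T  # sums over k for every (i, j) pair

# Toy inputs (random placeholders, not real frequencies or Dayhoff scores)
rng = np.random.default_rng(0)
p = rng.random((5, 20))
p /= p.sum(axis=1, keepdims=True)        # each row becomes a frequency profile
q = rng.integers(-4, 5, size=(20, 20)).astype(float)
q = (q + q.T) / 2                        # make the toy score matrix symmetric
M = pssm_scores(p, q)                    # shape (5, 20): one row per residue
```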

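Local Phase Quantization. Local Phase Quantization (LPQ) has been described in detail in the literature [44]. The LPQ method is based on the blur invariance property of the Fourier phase spectrum [45][46][47]. It is an operator originally designed to handle spatial blur in the textural features of images. The spatially invariant blurring of an original image $f(\mathbf{x})$ that produces an observed image $g(\mathbf{x})$ can be expressed as a convolution, given by

$$g(\mathbf{x}) = (f * h)(\mathbf{x}),$$

where $h(\mathbf{x})$ is the point spread function of the blur, $*$ denotes two-dimensional convolution, and $\mathbf{x}$ is a vector of coordinates $[x, y]^T$. In the Fourier domain, this amounts to

$$G(\mathbf{u}) = F(\mathbf{u}) \cdot H(\mathbf{u}),$$

where $G(\mathbf{u})$, $F(\mathbf{u})$, and $H(\mathbf{u})$ are the discrete Fourier transforms (DFT) of the blurred image $g(\mathbf{x})$, the original image $f(\mathbf{x})$, and $h(\mathbf{x})$, respectively, and $\mathbf{u}$ is a vector of coordinates $[u, v]^T$. According to the properties of the Fourier transform, the phase relation can be expressed as

$$\angle G(\mathbf{u}) = \angle F(\mathbf{u}) + \angle H(\mathbf{u}).$$

When the point spread function $h(\mathbf{x})$ is centrally symmetric, meaning $h(\mathbf{x}) = h(-\mathbf{x})$, its Fourier transform $H(\mathbf{u})$ is always real valued. As a result, its phase can be expressed as a two-valued function, given by

$$\angle H(\mathbf{u}) = \begin{cases} 0, & H(\mathbf{u}) \ge 0, \\ \pi, & H(\mathbf{u}) < 0. \end{cases}$$

This means that

$$\angle G(\mathbf{u}) = \angle F(\mathbf{u}) \quad \text{for all } H(\mathbf{u}) \ge 0.$$

The shape of the point spread function $h(\mathbf{x})$ is similar to a Gaussian or sinc function. This ensures that $H(\mathbf{u}) \ge 0$, and therefore $\angle G(\mathbf{u}) = \angle F(\mathbf{u})$, at low frequencies, which means that the low-frequency phase is invariant to blur.

In LPQ, the local phase information is extracted using the two-dimensional DFT. More precisely, a short-term Fourier transform (STFT) computed over a rectangular $M \times M$ neighborhood $N_{\mathbf{x}}$ at each pixel position $\mathbf{x}$ of an image $f(\mathbf{x})$ is represented by

$$F(\mathbf{u}, \mathbf{x}) = \sum_{\mathbf{y} \in N_{\mathbf{x}}} f(\mathbf{x} - \mathbf{y})\, e^{-j 2\pi \mathbf{u}^T \mathbf{y}} = \mathbf{w}_{\mathbf{u}}^T \mathbf{f}_{\mathbf{x}},$$

where $\mathbf{w}_{\mathbf{u}}$ is the basis vector of the two-dimensional DFT at frequency $\mathbf{u}$ and $\mathbf{f}_{\mathbf{x}}$ is another vector containing all $M^2$ image samples from $N_{\mathbf{x}}$. Using LPQ, the Fourier coefficients of four frequencies are calculated:

$$\mathbf{u}_1 = [a, 0]^T, \quad \mathbf{u}_2 = [0, a]^T, \quad \mathbf{u}_3 = [a, a]^T, \quad \mathbf{u}_4 = [a, -a]^T,$$

where $a$ is a small enough number to satisfy $H(\mathbf{u}_i) \ge 0$. As a result, each pixel point can be expressed as a vector, given by

$$\mathbf{F}_{\mathbf{x}} = [F(\mathbf{u}_1, \mathbf{x}), F(\mathbf{u}_2, \mathbf{x}), F(\mathbf{u}_3, \mathbf{x}), F(\mathbf{u}_4, \mathbf{x})], \qquad \mathbf{G}_{\mathbf{x}} = [\operatorname{Re}\{\mathbf{F}_{\mathbf{x}}\}, \operatorname{Im}\{\mathbf{F}_{\mathbf{x}}\}]^T.$$

Then, using a simple scalar quantizer, the resulting vectors are quantized, given by

$$q_j = \begin{cases} 1, & g_j \ge 0, \\ 0, & \text{otherwise}, \end{cases}$$

where $g_j$ is the $j$th component of $\mathbf{G}_{\mathbf{x}}$.
After quantization, $\mathbf{G}_{\mathbf{x}}$ becomes an eight-bit binary vector, and its $j$th component $q_j$ is assigned a weight of $2^{j-1}$. As a result, the quantized coefficients are represented as integer values between 0 and 255 using the binary coding

$$b = \sum_{j=1}^{8} q_j\, 2^{j-1}.$$

Finally, a histogram of these integer values from all image positions is composed and used as a 256-dimensional feature vector in classification. In this paper, the PSSMs of each protein from the Yeast and Human datasets were converted to 256-dimensional feature vectors using this LPQ method.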
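The LPQ steps described above (local STFT at four frequencies, sign quantization of real and imaginary parts, 8-bit coding, and a 256-bin histogram) can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `lpq_histogram`, the window size of 3, the choice $a = 1/M$, and the restriction to positions whose window lies fully inside the matrix are all assumptions, and the decorrelation step found in some LPQ variants is omitted.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def lpq_histogram(image, win=3):
    """256-bin LPQ histogram of a 2-D array (for example, an L x 20 PSSM)."""
    image = np.asarray(image, dtype=float)
    if win % 2 == 0 or min(image.shape) < win:
        raise ValueError("win must be odd and no larger than the image")
    r = win // 2
    a = 1.0 / win                                  # assumed lowest frequency
    offsets = np.arange(-r, r + 1)
    dy, dx = np.meshgrid(offsets, offsets, indexing="ij")
    freqs = [(a, 0.0), (0.0, a), (a, a), (a, -a)]  # u1, u2, u3, u4
    # Patches are indexed as f(x + y); the positive exponent therefore
    # matches the sum over f(x - y) * exp(-j 2 pi u^T y) on a symmetric window.
    basis = np.stack([np.exp(2j * np.pi * (u * dx + v * dy)) for u, v in freqs])
    # Every win x win neighbourhood that fits entirely inside the image
    patches = sliding_window_view(image, (win, win))
    coeffs = np.einsum("hwmn,kmn->hwk", patches, basis)      # 4 STFT values
    g = np.concatenate([coeffs.real, coeffs.imag], axis=-1)  # 8 real values
    bits = (g >= 0).astype(np.int64)                         # scalar quantizer
    codes = bits @ (1 << np.arange(8))                       # weights 2^(j-1)
    return np.bincount(codes.ravel(), minlength=256)

# Toy PSSM-like matrix for a hypothetical protein of length 30
rng = np.random.default_rng(0)
toy_pssm = rng.integers(-5, 6, size=(30, 20))
features = lpq_histogram(toy_pssm)    # 256 counts, one per 8-bit code
```

The histogram holds raw counts; whether to normalize it by the number of positions (so that proteins of different lengths are comparable) is a design choice that the text does not specify.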

Principal Component Analysis. Principal Component Analysis (PCA) is widely used to process data and reduce the dimensions of datasets. In this way, high-dimensional information can be projected onto a low-dimensional subspace while retaining the main information. The basic principle of PCA is as follows.
A multivariate dataset can be expressed as the following matrix $X$:

$$X = \begin{bmatrix} x_1(1) & x_2(1) & \cdots & x_m(1) \\ x_1(2) & x_2(2) & \cdots & x_m(2) \\ \vdots & \vdots & \ddots & \vdots \\ x_1(n) & x_2(n) & \cdots & x_m(n) \end{bmatrix},$$

where $m$ is the number of variables and $n$ is the number of samplings of each variable. PCA is closely related to the singular value decomposition (SVD) of the matrix $X$, which is given by

$$X = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots + \sigma_m u_m v_m^T,$$

where $u_i$ is an eigenvector of $X X^T$, $v_i$ is an eigenvector of $X^T X$, and $\sigma_i$ is the corresponding singular value. If there are $r$ linear relationships among the $m$ variables, then $r$ singular values are zero. Any row of $X$ can be expressed in terms of the eigenvectors $(v_1, v_2, \ldots, v_m)$:

$$x^T(j) = s_1(j)\, v_1^T + s_2(j)\, v_2^T + \cdots + s_m(j)\, v_m^T,$$

where $s_i(j) = x^T(j)\, v_i$ is the projection of $x(j)$ on $v_i$, the eigenvectors $(v_1, v_2, \ldots, v_m)$ are the load vectors, and $s_i(j)$ is the score. When there is a certain degree of linear correlation among the variables of $X$, the projections onto the last several load vectors are small enough to be attributed to measurement noise. As a result, the principal decomposition of $X$ is represented by

$$X = s_1 v_1^T + s_2 v_2^T + \cdots + s_k v_k^T + E,$$

where $s_i = X v_i$ is the $i$th score vector, $k < m$ is the number of retained components, and $E$ is the error matrix, which can be ignored without an obvious loss of useful information. In this paper, to reduce the influence of noise and improve the prediction accuracy, we use PCA to reduce the dimensionality of the Yeast dataset from 256 to 180 and the dimensionality of the Human dataset from 256 to 172.
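As a rough illustration of the decomposition above, the sketch below obtains the load vectors from an SVD and keeps the first $k$ score vectors. The function name `pca_reduce` is hypothetical, and mean-centering the columns before the SVD is an assumption that the text does not state explicitly.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the first k load vectors.

    Returns (scores, loads, E) such that Xc = scores @ loads.T + E,
    where Xc is the column-centred data (centring is an assumption here).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the right singular vectors, i.e. eigenvectors of Xc^T Xc
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loads = Vt[:k].T              # (m, k) load vectors v_1 .. v_k
    scores = Xc @ loads           # (n, k) score vectors s_i = Xc v_i
    E = Xc - scores @ loads.T     # residual (error) matrix
    return scores, loads, E

# Toy data: 100 samples of 10 variables that really live in 3 dimensions
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 10))
scores, loads, E = pca_reduce(X_toy, k=3)
```

With real LPQ features the same call would be made with `k=180` or `k=172`, the values reported in the text, on a matrix whose rows are the 256-dimensional feature vectors.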

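Relevance Vector Machine. The characteristics of the Relevance Vector Machine have been described in detail in the literature [48]. For binary classification problems, assume that the training sample set is $\{x_n, t_n\}_{n=1}^{N}$, where $x_n \in R^d$ is a training sample, $t_n \in \{0, 1\}$ is its label, and $t_*$ is the label of a testing sample. The labels are modeled as $t_n = y_n + \varepsilon_n$, where

$$y_n = w^T \varphi(x_n) = \sum_{i=1}^{N} w_i K(x_n, x_i) + w_0$$

is the classification prediction model and $\varepsilon_n$ is additive noise with a mean value of zero and a variance of $\sigma^2$, so that $\varepsilon_n \sim N(0, \sigma^2)$ and $t_n \sim N(y_n, \sigma^2)$. Assuming that the training samples are independent and identically distributed, the observation vector $t = (t_1, \ldots, t_N)^T$ obeys the following distribution [49][50][51]:

$$p(t \mid w, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left(-\frac{\lVert t - \Phi w \rVert^2}{2\sigma^2}\right),$$

where $\Phi$ is defined as follows:

$$\Phi = \begin{bmatrix} 1 & K(x_1, x_1) & \cdots & K(x_1, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & K(x_N, x_1) & \cdots & K(x_N, x_N) \end{bmatrix}.$$

The RVM uses the sample labels $t$ to predict the testing sample label $t_*$, given by

$$p(t_* \mid t) = \int p(t_* \mid w, \sigma^2)\, p(w, \sigma^2 \mid t)\, dw\, d\sigma^2.$$

To make most components of the weight vector $w$ zero and to reduce the computational work of the kernel function, the weight vector is subjected to an additional constraint. Assume that each $w_i$ obeys a Gaussian distribution with a mean value of zero and a variance of $\alpha_i^{-1}$, that is, $w_i \sim N(0, \alpha_i^{-1})$ and $p(w \mid \alpha) = \prod_{i=0}^{N} p(w_i \mid \alpha_i)$, where $\alpha$ is the hyperparameter vector of the prior distribution of the weight vector $w$. Hence,

$$p(t_* \mid t) = \int p(t_* \mid w, \alpha, \sigma^2)\, p(w, \alpha, \sigma^2 \mid t)\, dw\, d\alpha\, d\sigma^2.$$

Because $p(w, \alpha, \sigma^2 \mid t)$ cannot be obtained analytically by integration, it is resolved using the Bayesian formula, given by

$$p(w, \alpha, \sigma^2 \mid t) = p(w \mid t, \alpha, \sigma^2)\, p(\alpha, \sigma^2 \mid t), \qquad p(w \mid t, \alpha, \sigma^2) = \frac{p(t \mid w, \sigma^2)\, p(w \mid \alpha)}{p(t \mid \alpha, \sigma^2)}.$$

The integral of the product of $p(t \mid w, \sigma^2)$ and $p(w \mid \alpha)$ is given by

$$p(t \mid \alpha, \sigma^2) = \int p(t \mid w, \sigma^2)\, p(w \mid \alpha)\, dw = (2\pi)^{-N/2} \lvert \Omega \rvert^{-1/2} \exp\left(-\frac{t^T \Omega^{-1} t}{2}\right),$$

with $\Omega = \sigma^2 I + \Phi A^{-1} \Phi^T$ and $A = \mathrm{diag}(\alpha_0, \alpha_1, \ldots, \alpha_N)$. The posterior $p(w \mid t, \alpha, \sigma^2)$ is Gaussian with covariance $\Sigma = (\sigma^{-2} \Phi^T \Phi + A)^{-1}$ and mean $\mu = \sigma^{-2} \Sigma \Phi^T t$. Because $p(\alpha, \sigma^2 \mid t) \propto p(t \mid \alpha, \sigma^2)\, p(\alpha)\, p(\sigma^2)$ and $p(\alpha, \sigma^2 \mid t)$ cannot be solved by means of integration, the solution is approximated using the maximum likelihood method, represented by

$$(\alpha_{MP}, \sigma^2_{MP}) = \arg\max_{\alpha, \sigma^2} p(t \mid \alpha, \sigma^2).$$

The iterative process for $\alpha_{MP}$ and $\sigma^2_{MP}$ is as follows:

$$\alpha_i^{new} = \frac{\gamma_i}{\mu_i^2}, \qquad (\sigma^2)^{new} = \frac{\lVert t - \Phi \mu \rVert^2}{N - \sum_{i=0}^{N} \gamma_i}, \qquad \gamma_i = 1 - \alpha_i \Sigma_{i,i},$$

where $\Sigma_{i,i}$ is the $i$th element on the diagonal of $\Sigma$. Starting from initial values of $\alpha$ and $\sigma^2$, the approximations of $\alpha_{MP}$ and $\sigma^2_{MP}$ are obtained by continuously updating using formula (21). After enough iterations, most of the $\alpha_i$ will be close to infinity and the corresponding weights in $w$ will be zero, while the other $\alpha_i$ remain finite. The training samples $x_i$ associated with the remaining nonzero weights are referred to as the relevance vectors.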
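The iterative re-estimation of the hyperparameters can be sketched in a few lines. The code below follows the Gaussian-noise (regression-style) formulation written in this section and is only a minimal illustration under stated assumptions: the names `gaussian_design` and `rvm_fit`, the kernel form $\exp(-\lVert x - x' \rVert^2 / \text{width}^2)$, the clipping constants, and the toy data are all hypothetical, and this is not the toolbox used in the experiments.

```python
import numpy as np

def gaussian_design(X, centers, width):
    """Design matrix Phi: a bias column followed by Gaussian kernel columns."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / width**2)          # assumed kernel parameterisation
    return np.hstack([np.ones((X.shape[0], 1)), K])

def posterior(Phi, t, alpha, sigma2):
    """Posterior covariance Sigma and mean mu of the weights."""
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha))
    mu = Sigma @ Phi.T @ t / sigma2
    return Sigma, mu

def rvm_fit(Phi, t, n_iter=200, alpha_max=1e9):
    """Iterate the alpha and sigma^2 updates; returns (mu, alpha, sigma2)."""
    N, M = Phi.shape
    alpha = np.full(M, 1.0 / N**2)      # initial alpha = 1 / N^2
    sigma2 = max(0.1 * np.var(t), 1e-6) # assumed initial noise variance
    for _ in range(n_iter):
        Sigma, mu = posterior(Phi, t, alpha, sigma2)
        gamma = np.clip(1.0 - alpha * np.diag(Sigma), 0.0, 1.0)
        # alpha_i <- gamma_i / mu_i^2, clipped to keep the numbers finite
        alpha = np.clip(gamma / (mu**2 + 1e-12), 1e-12, alpha_max)
        resid = t - Phi @ mu
        sigma2 = max(resid @ resid / max(N - gamma.sum(), 1e-6), 1e-10)
    _, mu = posterior(Phi, t, alpha, sigma2)
    return mu, alpha, sigma2

# Toy one-dimensional problem: noisy samples of a sine curve
rng = np.random.default_rng(0)
X_train = np.linspace(-5.0, 5.0, 40)[:, None]
t_train = np.sin(X_train[:, 0]) + 0.05 * rng.normal(size=40)
Phi_train = gaussian_design(X_train, X_train, width=1.0)
mu, alpha, sigma2 = rvm_fit(Phi_train, t_train)
relevant = alpha < 1e9                  # weights that were not pruned
pred = Phi_train @ mu
```

Basis functions whose `alpha` reaches the clipping value are effectively removed from the model; the training samples behind the remaining columns play the role of the relevance vectors.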

Procedure of the Proposed Method.
In this paper, the proposed method consists of three stages: feature extraction, dimensionality reduction using PCA, and sample classification. Feature extraction has two steps: (1) each protein from the datasets is represented as a PSSM and (2) the PSSM of each protein is expressed as a 256-dimensional vector using the LPQ method. The dimensionality of the original feature vector is then reduced using PCA. Finally, sample classification is carried out in two steps: (1) the RVM model is used to classify the Yeast and Human datasets whose features have been extracted and (2) the SVM model is employed to classify the Yeast dataset. The flow chart of the proposed method is displayed in Figure 1.
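The performance of the proposed method was measured using the accuracy, sensitivity, precision, and Matthews correlation coefficient (MCC), which are defined as

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}},$$

where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. True positives are the number of true interacting pairs correctly predicted. True negatives are the number of true noninteracting pairs correctly predicted. False positives are the number of true noninteracting pairs falsely predicted to be interacting pairs, and false negatives are the number of true interacting pairs falsely predicted to be noninteracting pairs. Moreover, a Receiver Operating Characteristic (ROC) curve was created to evaluate the performance of our proposed method.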
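For reference, the evaluation criteria can be computed from the four confusion-matrix counts as in the short sketch below. It uses the standard definitions of accuracy, sensitivity, precision, and MCC; the function name `ppi_metrics` and the example counts are hypothetical and are not taken from the reported experiments.

```python
import math

def ppi_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, precision, and MCC from confusion-matrix counts.

    Assumes at least one actual positive and one predicted positive,
    so that sensitivity and precision are well defined.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "mcc": mcc,
    }

# Made-up counts for illustration only
example = ppi_metrics(tp=90, tn=85, fp=15, fn=10)
```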

Performance of the Proposed Method.
To avoid overfitting of the prediction model and to test the reliability of our proposed method, we used 5-fold cross-validation in our experiment. More specifically, the whole dataset was divided into five parts; four parts were employed for training the model, and one part was used for testing. Five models were obtained in this way for each of the Yeast and Human datasets, and each model was evaluated separately in the experiment. To ensure fairness, the parameters of the RVM model were set to the same values for the two datasets. Here, the Gaussian function was selected as the kernel function with the following parameters: width = 0.6, initalpha = $1/N^2$, and beta = 0, where width represents the width of the kernel function, $N$ is the number of training samples, and beta is set to zero, which represents classification. The experimental results of the prediction models of the RVM classifier combined with Local Phase Quantization, the Position Specific Scoring Matrix, and Principal Component Analysis, based on the protein sequence information from the two datasets, are listed in Tables 1 and 2.
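The 5-fold partitioning described above can be sketched as follows. This is a minimal illustration: the function name `k_fold_indices`, the shuffling, and the fixed seed are assumptions, since the text does not say how the five parts were drawn.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)           # assumed random partition
    folds = [indices[i::k] for i in range(k)]      # k nearly equal parts
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# The Yeast dataset contains 11188 protein pairs
splits = list(k_fold_indices(11188, k=5))
```

Each of the five `(train, test)` pairs would be used to fit one model on four parts and evaluate it on the held-out part, and the reported figures are averages over the five test parts.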
Using the proposed method on the Yeast dataset, we achieved an average accuracy, sensitivity, precision, and MCC of 92.65%, 92.63%, 92.67%, and 87.27%, respectively. The standard deviations of these criteria were 0.95%, 0.55%, 1.40%, and 1.61%, respectively. Similarly, on the Human dataset we obtained an average accuracy, sensitivity, precision, and MCC of 97.92%, 99.187%, 96.77%, and 95.95%, with standard deviations of 0.81%, 0.21%, 1.57%, and 1.58%, respectively. It can be seen from Tables 1 and 2 that the proposed method is accurate, robust, and effective for predicting PPIs. This performance may be attributed to the feature extraction of the proposed method, which is novel and effective, and to an appropriate choice of classifier. The proposed feature extraction method contains three data processing steps. First, the PSSM not only describes the order information of the protein sequence but also retains sufficient prior information; thus, it is widely used in other proteomics research. We therefore converted each protein sequence to a PSSM that retains the useful information from that sequence. Second, because Local Phase Quantization has the advantage of blur invariance in image feature extraction, information can be effectively captured from the PSSMs using the LPQ method. Finally, while maintaining the integrity of the information in the PSSM, we reduced the dimensions of each LPQ vector and the influence of noise using Principal Component Analysis. Consequently, the sample information extracted using the proposed feature extraction method is well suited for predicting PPIs.

Comparison with the SVM-Based Method.
Although our proposed method achieved reasonably good results on the Yeast and Human datasets, its performance must be further validated against the state-of-the-art support vector machine (SVM) classifier. More specifically, we compared the classification performance of the SVM and RVM models on the Yeast dataset using the same feature extraction method. The LIBSVM tool (available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/) was employed to carry out classification with the SVM. The two corresponding parameters of the SVM, c and g, were optimized using a grid search method. In the experiment, we set c = 0.7 and g = 0.6 and used a radial basis function as the kernel function.
The prediction results of the SVM and RVM methods on the Yeast dataset are shown in Table 3, and the ROC curves are displayed in Figure 2. From Table 3, the SVM method achieved 85.34% average accuracy, 84.40% average sensitivity, 86.89% average specificity, and 74.97% average MCC, while the RVM method achieved 92.65% average accuracy, 92.63% average sensitivity, 92.67% average specificity, and 86.40% average MCC. From these results, we can see that the RVM classifier performs significantly better than the SVM classifier. In addition, the ROC curves in Figure 2 show that the ROC curve of the RVM classifier is significantly better than that of the SVM classifier. These results indicate that the RVM classifier of the proposed method is an accurate and robust classifier for predicting PPIs. The advantages of the RVM classifier over the SVM classifier can be explained by two reasons: (1) the computational work of the kernel function is greatly reduced in the RVM and (2) the RVM overcomes the shortcoming that the kernel function is required to satisfy Mercer's condition. Together with the results above, this shows that the proposed method can yield highly accurate PPI predictions.

Comparison with Other Methods. In addition, a number of PPI prediction methods based on protein sequences have been proposed. To assess the effectiveness of our proposed method, which uses an RVM model combined with a Position Specific Scoring Matrix, Local Phase Quantization, and Principal Component Analysis, we compared its prediction ability with that of existing methods on the Yeast and Human datasets. It can be seen from Table 4 that the average prediction accuracy of the five different methods is between 75.08% and 89.33% for the Yeast dataset. The prediction accuracies of these methods are lower than that of the proposed method, which is 92.65%. Similarly, the precision and sensitivity of our proposed method are also superior to those of the other methods. Table 5 compares the average prediction accuracy of six different methods with that of the proposed method on the Human dataset. From Table 5, the prediction accuracies yielded by the other methods are between 89.3% and 96.4%. None of these methods obtains a higher prediction accuracy than our proposed method. From Tables 4 and 5, it can be observed that the proposed method yielded better prediction results than other existing methods based on ensemble classifiers. All these results indicate that the RVM classifier combined with Local Phase Quantization, the Position Specific Scoring Matrix, and Principal Component Analysis can improve the prediction accuracy relative to current state-of-the-art methods. Our method improves predictions by using a suitable classifier and a novel feature extraction method that captures the useful evolutionary information.

Conclusion
Knowledge of PPIs is becoming increasingly important, which has prompted the development of computational methods. Though many approaches have been developed to solve this problem, the effectiveness and robustness of previous prediction models can still be improved. In this study, we explore a novel method using an RVM classifier combined with Local Phase Quantization and a Position Specific Scoring Matrix. The experimental results show that the prediction accuracy of the proposed method is higher than those of previous methods. The main improvements of the proposed method come from adopting an effective feature extraction method that can capture useful evolutionary information. Moreover, the results showed that PCA improves the prediction accuracy by integrating the useful information and reducing the influence of noise. In addition, the experimental results show that the RVM model is suitable for predicting PPIs. In conclusion, the proposed method is an efficient and reliable prediction model and can be a useful support tool for future proteomics research.