ACT-SVM: Prediction of Protein-Protein Interactions Based on a Support Vector Machine Model

The interactions between proteins play important roles in many organisms and are involved in almost all activities in the cell. Research on protein-protein interactions (PPIs) can make a huge contribution to the prevention and treatment of diseases. Currently, many prediction methods based on machine learning have been proposed to predict PPIs. In this article, we propose a novel method, ACT-SVM, that can effectively predict PPIs. The ACT-SVM model maps protein sequences to digital features, performs feature extraction twice on each protein sequence to obtain the vector A and the descriptor CT, and combines them into one vector. Then, the feature vectors of a protein pair are merged as the input of the support vector machine (SVM) classifier. We utilize nonredundant H. pylori and human datasets to verify the prediction performance of our method. The proposed method achieves a prediction accuracy of 0.727897 on the H. pylori dataset and 0.838799 on the human dataset. The results demonstrate that this method is a stable and reliable prediction model of PPIs.


Introduction
Proteins, composed of 20 types of amino acids, are the material basis of all life [1]. There are many kinds of proteins with different properties and functions, which play a pivotal role in the cells and tissues of various biological species. Proteins are not only an important part of living organisms but also participate in and carry out all important life activities. However, most proteins do not perform their functions alone; more commonly, two or more proteins work together by forming a protein complex, and a large protein-protein interaction network is finally built [2][3][4][5][6]. Obviously, PPIs play a key role in cellular processes and are involved in many important biological processes such as immune response, material transport, and gene expression regulation. Therefore, exploring the interactions between proteins has become one of the most important links in researching the function and mechanism of proteins [7][8][9]. In addition, PPIs are a major molecular mechanism of viral pathogenesis, which makes them one of the important research objects for disease discovery and treatment. The importance of researching PPIs has advanced the methods for predicting and identifying PPIs [10][11][12][13]. In recent years, some high-throughput laboratory biotechnologies have been widely utilized to detect PPIs, such as yeast two-hybrid (Sato et al.; Schwikowski et al.; Coates and Hall) [14][15][16] and coimmunoprecipitation (Free et al.) [17]. However, they all have some shortcomings, shared or individual. For example, some methods fail to overcome a high proportion of false negatives and false positives, and some methods require large amounts of sample material to extract proteins, which is prohibitively expensive. At the same time, methods such as protein phylogenetic profiles (Kim et al.) [18,19], natural language processing (Daraselia et al.) [20], and protein tertiary structure (Aloy and Russell) [21] have also been favored by researchers. However, without known protein-related biological knowledge, these kinds of methods are difficult to implement, and some of them cannot fully predict PPIs [22,23].
In addition, with the tireless efforts of researchers, it was found that PPIs can be predicted based on the amino acid sequence of the protein [24][25][26][27]. At the same time, machine learning has been widely utilized by researchers. Thus, a large number of prediction methods based on protein sequences and machine learning algorithms have appeared [13,[28][29][30][31][32]. For example, Cui et al. [33] utilized a support vector machine classifier to predict human proteins that interact with viral proteins [34][35][36][37]. The L1-logreg classifier proposed by Dhole et al. can effectively predict PPIs and advance related research such as drug design. Xia et al. [38] proposed a sequence-based multiclassifier system called Spinning Forest to infer PPIs [39]. The performance of their method on the Saccharomyces cerevisiae and H. pylori datasets is better than that of previously published methods. Deep learning, as an effective machine learning method, is also utilized in the prediction of PPIs (Du et al.) [40].
In this paper, we propose a novel prediction model named ACT-SVM, which is based on a support vector machine, to predict PPIs. Two different methods are utilized to extract features from protein sequences, and we finally reconstruct them into one feature vector. First, we extract an A vector for each protein sequence in the dataset. Thereafter, we construct composition (C) and transition (T) descriptors to describe protein sequences. Finally, we utilize their combination as the input of the classifier. The area under the ROC curve (AUC), accuracy (Acc), sensitivity (Sn), specificity (Sp), and Matthews correlation coefficient (Mcc) are utilized to evaluate the performance of our prediction method.
We additionally constructed 5 different classifiers for comparing predictive performance: k-nearest neighbor (KNN), artificial neural network (ANN), random forest (RF), naive Bayes (NB), and logistic regression (LR). We utilized the H. pylori and human datasets to evaluate our novel predictor. Experimental results demonstrate that the novel model based on a support vector machine proposed by us performs best.

Methods and Materials
In scientific research, it is extremely important to first define the workflow. Our workflow is demonstrated in Figure 1. First, we obtained the nonredundant H. pylori and human datasets. Then, we map each protein sequence to digital features by constructing the A vector and the composition and transition (CT) descriptors and combine them into one feature vector as the input of the classifier. The extracted digital features are then input into different classifiers to train different classification models, which are evaluated by 5-fold cross-validation, 8-fold cross-validation, and 10-fold cross-validation, respectively. Finally, on the independent test datasets, we sequentially verified the 6 trained models. In addition, we utilize the AUC, Acc, Sp, Sn, and Mcc indicators to evaluate the performance of our novel predictor and of the five models utilized for comparison.

Dataset.
As people pay increasing attention to PPIs, the number of databases utilized to research PPIs is increasing, such as BioGRID, GeneMANIA, and DIP. However, there is inevitable redundancy in the data in these existing databases. To make our prediction tool more effective, we adopted the nonredundant H. pylori and human PPIs datasets utilized by Kong et al. [41]. They downloaded the H. pylori and human PPIs datasets from the DIP database and utilized the cd-hit tool to construct nonredundant sequences for these two datasets. After removing redundancy, the H. pylori dataset contains 1458 interacting protein pairs and 1457 noninteracting protein pairs, while the human dataset has 3899 interacting protein pairs and 4262 noninteracting protein pairs.

A Vector.
The 20 amino acids are first divided into 6 categories. According to its category, we replace each amino acid in the sequence with the corresponding C1, C2, ..., C6. Then, we can obtain a simplified sequence. We utilize f_i to describe the frequency of occurrence of each element in the simplified sequence (i = 1, 2, ..., 6) and finally obtain the A vector. The detailed definitions of f_i and the A vector are given by equations (1) and (2).

f_i = m_i / l, i = 1, 2, ..., 6, (1)

A = (f_1, f_2, f_3, f_4, f_5, f_6), (2)

where l is the length of the protein sequence and m_i is the number of type-i amino acids in the protein sequence. For example, the sequence "MGPDDSKRYE" can be replaced with C1, C6, C6, C5, C5, C3, C4, C4, C2, C5. We can see that there are one C1, one C2, one C3, two C4, three C5, and two C6 in the simplified sequence. Thus, the A vector can be constructed as

A = (1/10, 1/10, 1/10, 2/10, 3/10, 2/10) = (0.1, 0.1, 0.1, 0.2, 0.3, 0.2).

Then, we obtain a 6-dimensional A vector describing the feature of the protein.
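The A-vector extraction above can be sketched in Python. Note that the paper's full 6-category amino acid classification is not reproduced here: the `CLASS_OF` mapping below is a hypothetical partial mapping that is only guaranteed to agree with the "MGPDDSKRYE" example in the text; the remaining amino acids would need to be assigned per the paper's scheme.

```python
# Partial, hypothetical category mapping recovered from the paper's example;
# the full 6-class scheme for all 20 amino acids is not shown in this text.
CLASS_OF = {
    "M": 1, "Y": 2, "S": 3,
    "K": 4, "R": 4,
    "D": 5, "E": 5,
    "G": 6, "P": 6,
    # ...remaining amino acids would be assigned per the paper's scheme
}

def a_vector(sequence):
    """Return the 6-dimensional A vector: f_i = m_i / l for i = 1..6."""
    l = len(sequence)
    counts = [0] * 6
    for aa in sequence:
        counts[CLASS_OF[aa] - 1] += 1  # tally occurrences of each category
    return [m / l for m in counts]

print(a_vector("MGPDDSKRYE"))  # -> [0.1, 0.1, 0.1, 0.2, 0.3, 0.2]
```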

Sparse Matrix and Descriptor.
First, we construct a 20 × n sparse matrix B, where n is the number of amino acids in the protein sequence. We assume that there is a protein sequence S = S_1, S_2, ..., S_n. At the same time, we put the 20 amino acids in E, E = {A, V, L, I, M, C, F, W, Y, H, S, T, N, Q, K, R, D, E, G, P}. When the i-th amino acid in E is the same as the j-th amino acid in S, the corresponding element b_ij in the sparse matrix takes 1; otherwise, it takes 0:

b_ij = 1 if E_i = S_j, and b_ij = 0 otherwise, i = 1, 2, ..., 20, j = 1, 2, ..., n. (3)

Next, we divide each of the 20 row vectors in the sparse matrix into P subvectors. The descriptor consists of composition (C) and transition (T), and they are extracted from each subvector. Among them, the composition (C) is composed of two parts: the frequency of 0 and the frequency of 1 in the subvector. The transition (T) consists of three parts: the sum of the numbers of "01" and "10" in the subvector, the number of "11," and the number of "111." Suppose P = 4, and the first subsequence of a protein sequence is "MYAHQAAA." Then, the first subvector of the first row vector in the sparse matrix is {0, 0, 1, 0, 0, 1, 1, 1}. Obviously, there are four "0," four "1," two "01," one "10," two "11," and one "111." Therefore, the five parts of the composition and transition (CT) are 4 × 100%/8 = 50%, 4 × 100%/8 = 50%, 3 (2 + 1 = 3), 2, and 1.
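A minimal sketch of the sparse-matrix and CT-descriptor computations described above, assuming that the "01"/"10"/"11"/"111" patterns are counted with overlaps (this assumption reproduces the paper's "MYAHQAAA" example):

```python
E = list("AVLIMCFWYHSTNQKRDEGP")  # the fixed amino acid ordering from the text

def sparse_matrix(seq):
    """20 x n binary matrix: b_ij = 1 iff amino acid E[i] equals seq[j]."""
    return [[1 if aa == e else 0 for aa in seq] for e in E]

def ct_features(subvector):
    """Five CT features of a 0/1 subvector: %0, %1, #01 + #10, #11, #111."""
    s = "".join(map(str, subvector))
    n = len(s)
    c0 = s.count("0") / n  # composition: frequency of 0
    c1 = s.count("1") / n  # composition: frequency of 1
    t1 = sum(1 for i in range(n - 1) if s[i:i + 2] in ("01", "10"))
    t2 = sum(1 for i in range(n - 1) if s[i:i + 2] == "11")   # overlapping "11"
    t3 = sum(1 for i in range(n - 2) if s[i:i + 3] == "111")  # overlapping "111"
    return [c0, c1, t1, t2, t3]

# The paper's example: subsequence "MYAHQAAA", row of amino acid "A" (E[0])
row_a = sparse_matrix("MYAHQAAA")[0]
print(row_a)               # -> [0, 0, 1, 0, 0, 1, 1, 1]
print(ct_features(row_a))  # -> [0.5, 0.5, 3, 2, 1]
```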

Reconstruction of Feature Vectors.
For each protein sequence, we extracted two feature vectors: the 6-dimensional vector A and the 400-dimensional CT descriptor.
Then, we combined them into a 406-dimensional vector as the feature vector of a protein. Finally, the feature vectors of two proteins are concatenated into an 812-dimensional feature vector describing the PPI between them.
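The vector assembly can be sketched as follows; `extract_features` is a hypothetical placeholder standing in for the A-vector and CT computations, shown only to make the dimensions concrete.

```python
def extract_features(sequence):
    """Placeholder for the real feature extraction (A vector + CT descriptor)."""
    a_vec = [0.0] * 6      # 6-dimensional A vector (placeholder values)
    ct_vec = [0.0] * 400   # 400-dim CT descriptor: 20 rows x 4 subvectors x 5 features
    return a_vec + ct_vec  # 406-dimensional protein feature vector

def pair_features(seq1, seq2):
    """Concatenate the two proteins' features into one 812-dim PPI vector."""
    return extract_features(seq1) + extract_features(seq2)

print(len(pair_features("MGPD", "YEKR")))  # -> 812
```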

Classifier Construction.
Our model is based on SVM. As a linear classifier, SVM is widely utilized in classification problems. Its learning strategy is to maximize the interval (margin): it finds the separating hyperplane with the largest geometric margin in the feature space. SVM is extremely stable and sparse. The separating hyperplane in the sample space can be described as

ω^T x + b = 0, (4)

where ω determines the direction of the hyperplane and b is the displacement term that determines the distance between the hyperplane and the origin. If the hyperplane classifies the samples correctly, the positive samples lie on one side and the negative samples on the other. Assume that the samples in the sample space are (x_i, y_i), y_i ∈ {+1, −1}; this can be expressed as

ω^T x_i + b ≥ +1, if y_i = +1; ω^T x_i + b ≤ −1, if y_i = −1. (5)

The distance from any point x in the sample space to the hyperplane can be described by equation (6):

r = |ω^T x + b| / ‖ω‖. (6)

The sample points closest to the hyperplane are called support vectors. The sum of the distance from the positive support vectors to the hyperplane and the distance from the negative support vectors to the hyperplane is called the interval, which can be expressed as

γ = 2 / ‖ω‖. (7)

The ultimate goal of the support vector machine is to find a hyperplane that maximizes the interval, so the support vector machine can be described as

max_{ω,b} 2 / ‖ω‖ (8)
s.t. y_i (ω^T x_i + b) ≥ 1, i = 1, 2, ..., m, (9)

where m is the number of samples. Formulas (8) and (9) can also be rewritten as

min_{ω,b} (1/2) ‖ω‖² (10)
s.t. y_i (ω^T x_i + b) ≥ 1, i = 1, 2, ..., m. (11)

Through repeated experimentation, we finally set the kernel function of the SVM classifier to a linear kernel. Combined with our proposed feature extraction method ACT, it showed superior prediction performance on the H. pylori and human datasets.
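A minimal sketch of the classifier stage, assuming scikit-learn's `SVC` as the SVM implementation; the toy 2-dimensional vectors below stand in for the real 812-dimensional PPI features.

```python
from sklearn.svm import SVC

# Toy, linearly separable data standing in for 812-dim PPI feature vectors
X = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
y = [0, 0, 1, 1]  # 0 = noninteracting pair, 1 = interacting pair

clf = SVC(kernel="linear")  # linear kernel, as chosen in the paper
clf.fit(X, y)
print(clf.predict([[0.05, 0.05], [0.95, 0.95]]))  # -> [0 1]
```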

Evaluation of the Predictor.
In order to verify the reliability and stability of our proposed predictor, we trained 6 models using the H. pylori and human datasets and performed 5-fold cross-validation, 8-fold cross-validation, and 10-fold cross-validation [42]. In actual training, a model usually fits the training data well but may not generalize to novel data outside the training set. k-fold cross-validation can be utilized to evaluate the generalization ability of models, so as to choose a better model and prevent the model from becoming too complex and overfitting. The basic idea of k-fold cross-validation is to divide the dataset into k parts of equal size. Each part of the data is then utilized in turn as the test dataset, while the other k−1 parts are utilized as training data; the k rounds of training ensure that each of the k parts has served once as test data. The k experimental results are then averaged to give the final score of the model. We set k to 5, 8, and 10, respectively, to verify the performance of our model.
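The k-fold splitting scheme described above can be sketched as a small index generator, assuming a plain in-order split with near-equal fold sizes (no shuffling):

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # the first `remainder` folds get one extra sample
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]  # the other k-1 parts
        yield train, test
        start = stop

# Every sample appears in exactly one test fold:
splits = list(k_fold_splits(10, 5))
print([test for _, test in splits])  # -> [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```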
In this paper, we employ four evaluation indicators to evaluate the predictive performance of our proposed method: accuracy (Acc), sensitivity (Sn), specificity (Sp), and Matthews correlation coefficient (Mcc). Among them, Acc reflects the proportion of all samples that the model classifies correctly; Sn measures the classifier's ability to recognize positive samples; Sp reflects the model's ability to recognize negative samples; Mcc returns a value between −1 and +1 and is an indicator often utilized to measure the performance of binary classification models. Their definitions are as follows:

Acc = (TP + TN) / (TP + TN + FP + FN),

Sn = TP / (TP + FN),

Sp = TN / (TN + FP),

Mcc = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),

where TP is the number of samples correctly classified as positive, FP is the number incorrectly classified as positive, FN is the number incorrectly classified as negative, and TN is the number correctly classified as negative. In addition, we utilize the AUC value to evaluate the performance of our proposed model. AUC is defined as the area under the ROC curve. In many cases, the ROC curve does not clearly indicate which classifier works better; as a single numerical value, a larger AUC indicates a better classifier. Thus, we utilize the AUC value as one of the evaluation criteria of the model.
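The four indicators defined above can be computed directly from the confusion-matrix counts:

```python
from math import sqrt

def metrics(tp, fp, fn, tn):
    """Return (Acc, Sn, Sp, Mcc) from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + fn + tn)  # overall proportion correct
    sn = tp / (tp + fn)                    # sensitivity (recall on positives)
    sp = tn / (tn + fp)                    # specificity (recall on negatives)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sn, sp, mcc

print(metrics(40, 10, 10, 40))  # -> (0.8, 0.8, 0.8, 0.6)
```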

Model Stability Analysis.
K-fold cross-validation is widely utilized to compare the performance of different machine learning models on a specific dataset. The principle of k-fold cross-validation is to divide the dataset into k equal parts for k rounds of training and finally take the average of the k results. However, there may be outliers among the k results, which means that a classifier may not be equally stable in its predictions for all samples. We utilized the H. pylori and human datasets to train 6 models and performed 5-fold cross-validation, 8-fold cross-validation, and 10-fold cross-validation to evaluate their performance. Six boxplots were drawn to describe the results of the 5-fold, 8-fold, and 10-fold cross-validation of the two datasets on the 6 classifiers. The ordinate of each boxplot is accuracy (Acc), and the abscissa is the 6 classifiers; that is, each boxplot has 6 boxes, and each box holds the Acc values from the k rounds of k-fold cross-validation for that classifier. The boxplots of the H. pylori dataset on the 6 classifiers for 5-fold, 8-fold, and 10-fold cross-validation are demonstrated in Figure 2(a), and the boxplots for the human dataset are demonstrated in Figure 2(b). The hollow dots in the boxplots are outliers, the size of a box reflects the degree of dispersion of the data, and the height of a box represents the accuracy value. From the 5-fold cross-validation boxplot in Figure 2(a), we can see that there are outliers among the 5 Acc values obtained by KNN, NB, and SVM in 5 trainings, while the box of the RF classifier is so large that its data are more dispersed.
The box sizes of the ANN and LR classifiers are similar, but from the box heights it can be seen that the accuracy of the ANN is higher. Therefore, on the H. pylori dataset, the best performing model under 5-fold cross-validation is ANN. Although the SVM classifier has an outlier in the 8-fold cross-validation, the impact is not significant: the outlier has a very small offset, and the box is small in size and high in position, so SVM still performs best. In the same way, we can see from Figure 2 that, on the H. pylori dataset, the best performing model under 10-fold cross-validation is SVM. On the human dataset, the most stable classifier under 5-fold, 8-fold, and 10-fold cross-validation is likewise SVM. This proves that the predictor we proposed, which is based on SVM, is the most stable under k-fold cross-validation.

Model Performances.
To verify the reliability of our proposed method, we constructed 5 traditional classifiers for comparison: KNN, RF, ANN, LR, and NB. We utilized the H. pylori and human datasets and chose 8-fold cross-validation to evaluate the classifiers we constructed. Finally, we utilize 5 evaluation indicators (AUC, Acc, Sn, Sp, and Mcc) to evaluate the predictive performance of each classifier. The experimental results demonstrate that the SVM classifier performs best, as demonstrated in Table 2.
In Table 2, the AUC, Acc, and Mcc values of the SVM classifier are the highest of the six classifiers, reaching 0.800963, 0.727897, and 0.455814, respectively, on the H. pylori dataset. The KNN classifier has the highest Sn value, 0.794168, while the RF classifier has the highest Sp value, 0.953052. Although the Sn and Sp values of the SVM classifier, 0.723842 and 0.731959, respectively, are not the highest, they are not much lower than the highest values. More importantly, the Sn and Sp values of the SVM classifier are the closest to each other, which means that its abilities to correctly predict positive and negative samples are similar. On the human dataset, the Acc value of the SVM classifier reached 0.838799, and its Mcc value was also the highest among the six classifiers. Although its AUC, Sn, and Sp are not the highest values, they are close to the highest values. As on the H. pylori dataset, the SVM classifier has the smallest difference between its abilities to identify positive and negative samples. From these data, it is clear that the SVM classifier has higher accuracy, better stability, and higher reliability than the other five classifiers. Thus, the superior performance of our proposed method is further verified.

[Figure 2(a) panels: Helicobacter pylori 5-fold, 8-fold, and 10-fold cross-validation boxplots.]
We also compared our feature extraction method with the FCTP method of Kong et al. [41]; the results are demonstrated in Table 3. The experimental results demonstrate that, on the H. pylori dataset, the five evaluation indexes of the six classifier models using our proposed feature extraction method are better than those using FCTP. On the human dataset, the performance of the models constructed by our method combined with SVM and LR is better than that of Kong et al.'s method. In particular, our proposed model ACT-SVM has an Acc value 0.08 higher than that of the model using FCTP. Although FCTP performs better with ANN, KNN, RF, and NB on the human dataset, our method also demonstrates good performance, with only a small gap in all indicators. Overall, FCTP performed well on the human dataset but poorly on the H. pylori dataset, whereas our feature extraction method demonstrates good prediction performance on both datasets and is relatively stable. Therefore, the method we proposed is further proved to be a reliable and stable prediction model for PPIs.

Conclusions
In recent years, the problem of identifying PPIs has received increasing attention and in-depth research. Several efforts to solve this problem have appeared one after another. Although machine learning methods are widely utilized in the prediction of PPIs, there is still a lack of predictors that can make predictions accurately and efficiently. Our proposed model ACT-SVM can effectively predict PPIs. We utilize a combination of the A vector and the composition and transition (CT) descriptors as the digital features of the amino acid sequence and utilize them as input to train the SVM model. We evaluate the performance of our proposed method by constructing multiple classifiers and using 5-fold, 8-fold, and 10-fold cross-validation. From these evaluations, we can conclude that the model we proposed has the best performance in the majority of situations. The prediction accuracy of our method reaches 0.727897 on the H. pylori dataset and 0.838799 on the human dataset. The experimental results demonstrate that our proposed SVM-based model can efficiently predict PPIs. It performs well on the H. pylori and human datasets and can be utilized as a research tool to support biomedical and other fields.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.