Protein-protein interactions (PPIs) are the basis of biological functions, and studying these interactions at the molecular level is of crucial importance for understanding the functionality of a living cell. During the past decade, biosensors have emerged as an important tool for the high-throughput identification of proteins and their interactions. However, high-throughput experimental methods for identifying PPIs are both time-consuming and expensive, and the resulting data are often associated with high false-positive and high false-negative rates. To address these problems, we propose a method for PPI detection that integrates biosensor-based PPI data with a novel computational model. The method is built on the extreme learning machine (ELM) algorithm combined with a novel protein sequence descriptor representation. When applied to a large-scale human protein interaction dataset, the proposed method achieved 84.8% prediction accuracy, with 84.08% sensitivity at a specificity of 85.53%. We also conducted extensive experiments comparing the proposed method with the state-of-the-art technique, the support vector machine (SVM). The results demonstrate that our approach is very promising for detecting new PPIs and can serve as a helpful supplement to biosensor-based PPI detection.
Proteins play crucial roles in cellular biology, including signaling cascades, metabolic cycles, and DNA transcription. Proteins rarely perform their functions alone; instead, they cooperate with other proteins by forming protein-protein interaction (PPI) networks. PPIs are responsible for the majority of cellular functions. Over the past decades, many innovative techniques and systems for identifying protein interactions have been developed [
A number of computational methods have been proposed for the prediction of PPIs based on different data types, including phylogenetic profiles, gene neighborhood, gene fusion, sequence conservation between interacting proteins, and literature mining knowledge [
Because the conjoint triad approach does not take the neighboring effect into account, and interactions usually occur in discontinuous amino acid segments of the sequence, Guo et al. subsequently developed a method based on SVM and autocovariance to extract interaction information from these discontinuous segments [
Current studies on PPI prediction have generally focused on achieving high accuracy but have not considered the running time required to train the classification model. This should be an important factor in developing a sequence-based method for predicting PPIs, because the total number of possible PPIs is very large. For example, if we assume that the
In the present work, we report a novel sequence-based method for predicting interacting protein pairs using ELM combined with local and global descriptors. More specifically, we first represent each protein sequence as a vector using a novel local and global protein sequence descriptor representation, which allows us to mine interaction information from multiscale amino acid segments simultaneously. Then we characterize each protein pair as a feature vector by combining the vectors of its two proteins. Finally, an ELM model is constructed using these protein-pair feature vectors as input. To evaluate the performance, the proposed method was applied to
In this section, we outline the main idea behind the proposed method. A flowchart showing how large-scale PPIs are mapped by integrating biosensor-based PPI data with the computational model is given in Figure
The schematic diagram for mapping large-scale protein-protein interactions by integrating biosensor data with ELM model.
We evaluated the proposed method with the
The choice of golden negative dataset has a considerable impact on prediction performance, which can be artificially inflated by a bias towards dominant samples in the positive data. For the golden negative set, we followed the previous work [
We also downloaded the golden negative dataset of human with experimental evidence used in the study of Smialowski et al. [
To successfully use machine learning methods to identify PPIs from primary protein amino acid sequences, one of the most important computational challenges is how to effectively represent a protein sequence by a fixed-length feature vector that fully encodes the important information content of the protein [
In order to extract local information, we first divided the entire protein sequence into seven equal-length fractions. A novel binary coding scheme was then adopted to construct a set of continuous regions on the basis of this partition. For example, consider a protein sequence “CCYGGGYYCYYYCGGCCYYCG” containing 21 residues. To represent the sequence by a feature vector, we first divide it into multiple regions; for simplicity, suppose it is divided into four equal-length segments (denoted S1, S2, S3, and S4). Each combination of segments is then encoded as a 4-bit binary string of 1’s and 0’s. In binary format, these combinations are written as
It should be noted that the proposed representation can be simply and conveniently adjusted to multiple scales, which offers a promising new approach for addressing these difficulties in a simple, unified, and theoretically sound way when representing a protein sequence. For a given number of bits, each protein sequence may take on only a finite number of continuous or discontinuous regions, which limits the resolution of the sequence; if more bits are used per protein sequence, a higher degree of resolution is obtained. In this study, each protein sequence is encoded in 7-bit binary form, so it may take on 126 (2^7 − 2) different regions. Higher-bit encodings require more storage and more computing resources to process. In this study, only the continuous regions are used; the discontinuous regions are discarded.
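To make the coding scheme concrete, the following is a minimal Python sketch (the authors' implementation was in MATLAB; the function name, the segment-boundary rounding, and the choice to skip only the all-zero code are our assumptions) that splits a sequence into equal-length segments and keeps only the regions whose selected segments are contiguous:

```python
def continuous_regions(seq, n_bits=7):
    """Enumerate the continuous regions of a sequence under the binary
    coding scheme: the sequence is split into n_bits roughly equal
    segments, each n_bits-bit code selects a subset of segments, and
    only codes whose set bits are contiguous are kept."""
    bounds = [round(i * len(seq) / n_bits) for i in range(n_bits + 1)]
    segments = [seq[bounds[i]:bounds[i + 1]] for i in range(n_bits)]
    regions = []
    for code in range(1, 2 ** n_bits):       # skip the all-zero code
        bits = format(code, f'0{n_bits}b')
        if '0' in bits.strip('0'):            # a gap between set bits
            continue                          # -> discontinuous, discard
        regions.append(''.join(s for s, b in zip(segments, bits) if b == '1'))
    return regions

# The 21-residue toy sequence from the text, with four segments:
print(continuous_regions("CCYGGGYYCYYYCGGCCYYCG", n_bits=4))
```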
For each continuous region, three types of descriptors, composition (
The three descriptors can be calculated as follows. First, to reduce the complexity inherent in representing the 20 standard amino acids, we clustered them into seven groups based on the volumes and dipoles of their side chains. Amino acids within the same group likely involve synonymous mutations because of their similar characteristics [
Division of amino acids into seven groups based on the dipoles and volumes of the side chains.
Group | Class | Dipole scale | Volume scale |
---|---|---|---|
1 | Ala, Gly, Val | Dipole < 1.0 | Volume < 50 |
2 | Ile, Leu, Phe, Pro | Dipole < 1.0 | Volume > 50 |
3 | Tyr, Met, Thr, Ser | 1.0 < dipole < 2.0 | Volume > 50 |
4 | His, Asn, Gln, Trp | 2.0 < dipole < 3.0 | Volume > 50 |
5 | Arg, Lys | Dipole > 3.0 | Volume > 50 |
6 | Asp, Glu | Dipole > 3.0 | Volume > 50 |
7 | Cys | 1.0 < dipole < 2.0 | Volume > 50 |
Then, every amino acid in each protein sequence is replaced by the index of its group. For example, the protein sequence “CCYGGGYYCYYYCGGCCYYCG” is replaced by 773111337333711773371 under this classification of amino acids (see Figure
Sequence of a hypothetic protein indicating the construction of composition, transition, and distribution descriptors of a protein region.
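As an illustration, a minimal Python sketch of this reduction (the dictionary layout and function name are ours; the grouping follows the table above) reproduces the encoding of the toy sequence:

```python
# Seven-class reduction of the 20 standard amino acids, following the
# dipole/volume classification in the table above.
AA_GROUP = {aa: g for g, aas in {
    1: "AGV", 2: "ILFP", 3: "YMTS", 4: "HNQW",
    5: "RK", 6: "DE", 7: "C",
}.items() for aa in aas}

def reduce_sequence(seq):
    """Replace every residue by the index of its group."""
    return ''.join(str(AA_GROUP[aa]) for aa in seq)

assert reduce_sequence("CCYGGGYYCYYYCGGCCYYCG") == "773111337333711773371"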
For distribution
The structure of extreme learning machine.
For each continuous local region, the three descriptors (
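For reference, the standard composition/transition/distribution (CTD) construction over the seven-group alphabet can be sketched in Python as follows. This is an illustrative implementation under common conventions (in particular, the 0%/25%/50%/75%/100% percentile scheme for the distribution descriptor is our assumption), not necessarily the authors' exact formulas:

```python
def ctd_descriptors(region, n_groups=7):
    """Composition (C), transition (T), and distribution (D) descriptors
    of a group-encoded region (a string of digit characters '1'..'7')."""
    n = len(region)
    # C: frequency of each group within the region
    comp = [region.count(str(g)) / n for g in range(1, n_groups + 1)]
    # T: frequency of adjacent transitions between each unordered group pair
    counts = {}
    for a, b in zip(region, region[1:]):
        if a != b:
            key = tuple(sorted((a, b)))
            counts[key] = counts.get(key, 0) + 1
    pairs = [(str(i), str(j)) for i in range(1, n_groups + 1)
             for j in range(i + 1, n_groups + 1)]
    tran = [counts.get(p, 0) / (n - 1) for p in pairs]
    # D: relative position of the first, 25%, 50%, 75%, and last
    # occurrence of each group along the region
    dist = []
    for g in (str(i) for i in range(1, n_groups + 1)):
        pos = [i + 1 for i, c in enumerate(region) if c == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not pos:
                dist.append(0.0)
            else:
                rank = max(1, round(frac * len(pos)))  # 1-based rank
                dist.append(pos[rank - 1] / n)
    return comp + tran + dist  # 7 + 21 + 35 = 63 values per region
```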
By virtue of their approximation capabilities for nonlinear mappings, feed-forward neural networks (FNNs) have become popular classifiers in many applications. Huang et al. proved that a single-hidden-layer FNN could exactly learn
The extreme learning machine (ELM) was originally developed for the single-hidden-layer feed-forward neural network (SLFNN) and was then extended to the generalized SLFNN, where the hidden layer need not be neuron-like [
The ELM algorithm transforms the learning problem into a simple linear system; that is, the output weights of ELM can be analytically determined through a generalized inverse operation on the hidden layer output matrix. Compared with traditional learning frameworks, such a learning scheme operates at extremely fast speed. The improved generalization performance of ELM with the smallest training error demonstrates its superior classification capability for real-time applications, without any learning bottleneck [
The basic idea behind the ELM algorithm is briefly described as follows: suppose learning
In summary, given a training dataset
Assign arbitrary input weight
Calculate the hidden layer output matrix
According to (
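Putting these steps together, a minimal NumPy sketch of ELM training and prediction (the sigmoid activation matches the best-performing configuration reported below; the function names and the random initialization scheme are our assumptions) could look like this:

```python
import numpy as np

def elm_train(X, T, n_hidden, seed=0):
    """Train an ELM: random input weights and biases, sigmoid hidden
    layer, output weights solved analytically by the Moore-Penrose
    pseudoinverse (no iterative tuning)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # step 1: random input weights
    b = rng.standard_normal(n_hidden)                # and hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))           # step 2: hidden layer output matrix
    beta = np.linalg.pinv(H) @ T                     # step 3: analytic output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```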
In this section, we describe our simulation methodology and present the experimental results that evaluate the effectiveness of our scheme. The proposed sequence-based PPI predictor was implemented on the MATLAB platform. For the ELM algorithm, the implementation by Zhu and Huang available from
In this study, the fivefold cross-validation technique was employed to evaluate the performance of the proposed model. In fivefold cross-validation, the whole dataset is randomly divided into five subsets, each consisting of a nearly equal number of interacting and noninteracting protein pairs. Four subsets are used for training and the remaining subset for testing; this process is repeated five times so that each subset is used once for testing. The reported performance is the average over the five test sets.
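As an illustration, a stratified fivefold evaluation could be written as follows, reusing the hypothetical elm_train/elm_predict sketch above (the feature matrix, labels, and hidden-layer size are stand-ins, not the authors' data or settings):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-ins for the real pair-feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 126))   # 500 pairs, illustrative descriptor size
y = rng.integers(0, 2, 500)           # 1 = interacting, 0 = noninteracting

# Stratified folds keep the interacting/noninteracting ratio nearly
# equal in every subset, as described in the text.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
accuracies = []
for train_idx, test_idx in skf.split(X, y):
    T_train = np.eye(2)[y[train_idx]]                 # one-hot targets
    W, b, beta = elm_train(X[train_idx], T_train, n_hidden=1000)
    pred = elm_predict(X[test_idx], W, b, beta).argmax(axis=1)
    accuracies.append((pred == y[test_idx]).mean())
print("mean ACC:", np.mean(accuracies))
```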
Seven metrics were used in this study to measure the predictive ability of the proposed method: (1) the overall prediction accuracy (ACC), the percentage of correctly identified interacting and noninteracting protein pairs; (2) the sensitivity (SN), the percentage of correctly identified interacting protein pairs; (3) the specificity (SP), the percentage of correctly identified noninteracting protein pairs; (4) the positive predictive value (PPV); (5) the negative predictive value (NPV); (6) the
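In terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), these quantities take their standard forms. The definitions below are the conventional ones, which we assume match the authors' usage; F1 and MCC, reported in the results table, are included as well:

```latex
\mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN}, \quad
\mathrm{SN} = \frac{TP}{TP + FN}, \quad
\mathrm{SP} = \frac{TN}{TN + FP}, \quad
\mathrm{PPV} = \frac{TP}{TP + FP},
\]
\[
\mathrm{NPV} = \frac{TN}{TN + FN}, \quad
F_1 = \frac{2\,\mathrm{PPV}\cdot\mathrm{SN}}{\mathrm{PPV} + \mathrm{SN}}, \quad
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}.
```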
The above-mentioned parameters depend on the selected threshold. The area under the ROC curve (AUC), which is threshold-independent, can be easily calculated according to the following formula [
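One common threshold-independent formulation, which we assume is the intended one, is the Wilcoxon-Mann-Whitney statistic, where $n_+$ and $n_-$ are the numbers of positive and negative samples and $R_i$ is the rank of the $i$th positive sample in the combined ranking of prediction scores:

```latex
\mathrm{AUC} = \frac{\sum_{i=1}^{n_+} R_i - n_+ (n_+ + 1)/2}{n_+ \, n_-}.
```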
The number of hidden nodes is a critical factor for the generalization ability of ELM. To determine this parameter, four-fifths of the whole dataset was randomly chosen to train ELM classifiers with different numbers of hidden nodes, while the remaining one-fifth was used as the validation set to compute the accuracy.
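A sketch of such a sweep, reusing the elm_train/elm_predict functions above (the search grid for the hidden-layer size and the variable names are our assumptions):

```python
import numpy as np

# Hold out one-fifth of the (hypothetical) data for validation.
n = X.shape[0]
perm = np.random.default_rng(1).permutation(n)
train_idx, val_idx = perm[: 4 * n // 5], perm[4 * n // 5 :]
T_train = np.eye(2)[y[train_idx]]

# Validation accuracy as a function of hidden-layer size.
for n_hidden in range(200, 4001, 200):   # assumed search grid
    W, b, beta = elm_train(X[train_idx], T_train, n_hidden)
    pred = elm_predict(X[val_idx], W, b, beta).argmax(axis=1)
    print(n_hidden, (pred == y[val_idx]).mean())
```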
Here the sigmoid function was used as the activation function of the ELM classifier. The results are plotted in Figure
The relationship between the prediction accuracy and the number of hidden neurons. The
The relationship between the consuming time and the number of hidden neurons. The
We evaluated the performance of the proposed model using the PPIs dataset as described in the aforementioned section. To guarantee that the experimental results are valid and can be generalized for making predictions regarding new data, we adopted the fivefold cross-validation in this study. The advantages of cross-validation are that the impact of data dependency is minimized and the reliability of the results can be improved.
The prediction performance of ELM predictor with novel representation of protein sequence across five runs is shown in Table
Comparison of the prediction performance by the proposed method and state-of-the-art SVM classifier on the human dataset.
Method | Kernel | Mean/variance | Time (s) | ACC | SN | SP | PPV | NPV | F1 | MCC | AUC |
---|---|---|---|---|---|---|---|---|---|---|---|
Testing | | | | | | | | | | | |
ELM | Sigmoid | Mean | 72.7901 | 0.8480 | 0.8408 | 0.8553 | 0.8547 | 0.8415 | 0.8477 | 0.7422 | 0.9232 |
 | | Variance | 1.9062 | 0.0022 | 0.0019 | 0.0028 | 0.0040 | 0.0038 | 0.0029 | 0.0030 | 0.0028 |
 | Hardlim | Mean | 77.4139 | 0.8206 | 0.8171 | 0.8242 | 0.8227 | 0.8185 | 0.8199 | 0.7056 | 0.9020 |
 | | Variance | 3.7710 | 0.0050 | 0.0040 | 0.0063 | 0.0088 | 0.0026 | 0.0063 | 0.0064 | 0.0031 |
 | Gaussian | Mean | 76.9615 | 0.7257 | 0.7328 | 0.7186 | 0.7232 | 0.7283 | 0.7279 | 0.6018 | 0.7624 |
 | | Variance | 4.1012 | 0.0036 | 0.0048 | 0.0054 | 0.0085 | 0.0077 | 0.0044 | 0.0033 | 0.0017 |
Training | | | | | | | | | | | |
ELM | Sigmoid | Mean | 1282.12 | 0.8887 | 0.8831 | 0.8944 | 0.8933 | 0.8843 | 0.8882 | 0.8022 | 0.9561 |
 | | Variance | 17.25 | 0.0006 | 0.0010 | 0.0018 | 0.0014 | 0.0001 | 0.0008 | 0.0010 | 0.0012 |
 | Hardlim | Mean | 1330.33 | 0.8668 | 0.8655 | 0.8682 | 0.8683 | 0.8654 | 0.8669 | 0.7691 | 0.9397 |
 | | Variance | 46.28 | 0.0027 | 0.0021 | 0.0033 | 0.0027 | 0.0027 | 0.0024 | 0.0039 | 0.0031 |
 | Gaussian | Mean | 1435.45 | 0.7824 | 0.7896 | 0.7753 | 0.7790 | 0.7860 | 0.7843 | 0.6595 | 0.8626 |
 | | Variance | 94.85 | 0.0033 | 0.0022 | 0.0053 | 0.0040 | 0.0026 | 0.0029 | 0.0037 | 0.0038 |
Testing | | | | | | | | | | | |
SVM | Sigmoid | Mean | 2794.29 | 0.8177 | 0.8119 | 0.8232 | 0.8215 | 0.8144 | 0.8165 | 0.7018 | 0.8878 |
 | | Variance | 16.71 | 0.0127 | 0.0266 | 0.0128 | 0.0067 | 0.0200 | 0.0155 | 0.0160 | 0.0143 |
 | Gaussian | Mean | 5237.89 | 0.6947 | 0.4714 | 0.9191 | 0.8535 | 0.6348 | 0.6064 | 0.5320 | 0.8997 |
 | | Variance | 67.82 | 0.0228 | 0.0412 | 0.0112 | 0.0178 | 0.0265 | 0.0340 | 0.0276 | 0.0364 |
 | Polynomial | Mean | 3612.98 | 0.8019 | 0.8219 | 0.7819 | 0.7903 | 0.8144 | 0.8057 | 0.6820 | 0.8838 |
 | | Variance | 20.16 | 0.0101 | 0.0126 | 0.0117 | 0.0165 | 0.0114 | 0.0125 | 0.0122 | 0.0138 |
To demonstrate the performance of the proposed model, we further compared our method with the state-of-the-art predictor SVM. From Table
We also conducted an experiment to characterize the sensitivity (i.e., the proportion of true positives detected by our method) and specificity (i.e., 1 − false positive rate) of the proposed approach for different activation functions (Figure
The ROC (receiver operator characteristic) curve illustrating the performance of different activation functions. The curve presents the true positive rate (sensitivity) against the false positive rate (1 − specificity).
In summary, considering its high efficiency as well as its good performance, we conclude that the proposed approach generally outperforms the state-of-the-art model, offering higher discriminative power for predicting PPIs from protein sequence information. Our model is therefore an appropriate method for predicting new protein interactions, and it can assist biologists in designing and validating experimental studies and in predicting interaction partners.
In this paper, we have developed an efficient and fast learning technique that utilizes global and local information of the protein amino acid sequence for accurate identification of PPIs at considerably high speed in both the training and testing phases. The first contribution of this work is a novel representation of the protein amino acid sequence that encodes both global and local sequence information. The application of the extreme learning machine then ensures reliable recognition with minimal error, at a learning speed approximately thousands of times faster than the state-of-the-art classification method SVM. Experimental results demonstrated that the proposed method performs well in distinguishing interacting from noninteracting protein pairs, achieving a mean classification accuracy of 84.8% under fivefold cross-validation. A comparative study of the proposed method and the state-of-the-art SVM showed that our method significantly outperforms SVM in terms of classification accuracy with a shorter running time.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work is supported by the National Science Foundation of China under Grants 61102119, 61373086, 61133010, U1201256, and 61171125. The authors would like to thank the guest editors and anonymous reviewers for their constructive advice.