The prediction of RNA-binding proteins is one of the most challenging problems in computational biology. Although some studies have investigated this problem, the accuracy of prediction is still not sufficient. In this study, a highly accurate method was developed to predict RNA-binding proteins from amino acid sequences using random forests with the minimum redundancy maximum relevance (mRMR) method, followed by incremental feature selection (IFS). We incorporated conjoint triad features and three novel features: binding propensity (BP), nonbinding propensity (NBP), and evolutionary information combined with physicochemical properties (EIPP). The results showed that these novel features play important roles in improving the performance of the predictor. Using the mRMR-IFS method, our predictor achieved the best performance (86.62% accuracy and 0.737 Matthews correlation coefficient). The high prediction accuracy suggests that our method can be a useful approach to identify RNA-binding proteins from sequence information.
RNA-binding proteins are important functional proteins that are pivotal to a cell’s function, such as in gene expression, posttranscriptional regulation, protein synthesis, and replication and assembly of many viruses [
Previous studies have investigated the mechanisms by which proteins bind to DNA; however, research on RNA-binding proteins lags behind. Methods to identify RNA-binding proteins could be divided into two categories: recognition from protein structure and prediction from amino acid sequences. The structure-based prediction approach usually produces a better performance; however, obtaining the protein structure is still costly and time consuming. Considering the theory that a protein’s amino acid sequence contains all the necessary information to predict its function [
Support vector machine (SVM) [
To obtain a good predictive model, two major problems should be considered. One is feature extraction and selection and the other is the selection of the classification algorithm. To solve the first problem, we proposed a novel feature called evolutionary information combined with physicochemical properties (EIPP). The results show that EIPP has a more powerful ability to distinguish RNA-binding proteins from nonbinding ones than PSSM, which dramatically improved the prediction of RNA-binding proteins compared with a previous work [
RNA-binding proteins and nonbinding proteins were obtained from release “2014_06” of the UniProtKB database (
As indicated by previous studies [
To evaluate the performance of our method in comparison with previously well-known studies, we used an independent test dataset (Testset). The Testset comprised 144 RNA-binding proteins and 144 nonbinding proteins obtained from MDset that had not been used in previous studies [
Prediction of RNA-binding residues was used to distinguish RNA-binding proteins from nonbinding ones. We had already developed an RNA-binding residue prediction model, PRBR [
RNA-binding proteins have many more binding residues than nonbinding proteins and RNA-binding residues tend to gather together spatially; therefore, two binding propensity measures were defined as follows:
We used predicted RNA-binding residues; therefore, the reliability index is applied in those two formulas. The
We also defined two nonbinding propensities for nonbinding proteins. The definitions of
Evolutionary information in the form of a position-specific scoring matrix (PSSM) has been used successfully to represent proteins in many applications, such as prediction of DNA-binding residues [
The physicochemical property feature has been used effectively in many fields, such as the identification of DNA/RNA-binding proteins [
Electrostatic and hydrophobic interactions influence protein-nucleic acid interactions and may be reflected by the dipoles and volumes of the side chains of amino acids, respectively. Based on the dipoles and volumes of the side chains, the 20 kinds of amino acids could be clustered into seven classes [
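The conjoint triad encoding can be illustrated with a minimal sketch. The class grouping below is the one commonly cited for the conjoint triad method and is an assumption; the grouping actually used in this work is not reproduced in this section and may differ. The function names and the plain frequency normalization are also illustrative choices, not the authors' implementation.

```python
from collections import Counter
from itertools import product

# ASSUMPTION: commonly cited seven-class grouping by side-chain dipole and volume.
# The source's own class table is not shown in this section and may differ.
ASSUMED_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]


def build_class_map(groups):
    """Map each residue letter to a class label 'a', 'b', 'c', ..."""
    return {aa: chr(ord("a") + i) for i, group in enumerate(groups) for aa in group}


def conjoint_triad(sequence, groups=ASSUMED_CLASSES):
    """Return {triad: frequency} for every possible triad of class labels."""
    class_map = build_class_map(groups)
    alphabet = [chr(ord("a") + i) for i in range(len(groups))]
    counts = Counter()
    seq = sequence.upper()
    # Slide a window of three consecutive residues along the sequence.
    for i in range(len(seq) - 2):
        window = seq[i:i + 3]
        # Skip windows that contain a residue outside the class map (e.g. 'X').
        if all(aa in class_map for aa in window):
            counts["".join(class_map[aa] for aa in window)] += 1
    total = sum(counts.values())
    # Simple frequency normalization (an illustrative choice).
    return {
        "".join(t): (counts["".join(t)] / total if total else 0.0)
        for t in product(alphabet, repeat=3)
    }


features = conjoint_triad("MKRKDEAGV")
```

With this assumed grouping, every protein maps to a fixed-length vector with one entry per class triad, regardless of sequence length.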
As mentioned above, for each query protein, the vector size of a feature is
The random forest (RF) algorithm [
To evaluate the performance of the classifier, a 5-fold cross-validation procedure on the training dataset was used in this research. The data instances were randomly divided into five parts. In each round, four of these parts were input into the RF to establish a classification model, and every instance of the remaining part was predicted by that model. The prediction performance of the classifier was then evaluated on the predictions for the held-out part.
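The splitting step of this procedure can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `k_fold_indices` and the fixed seed are hypothetical, and training the random forest on each split is left out.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # random division of the instances
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts
    for i in range(k):
        test = folds[i]
        # The other k - 1 parts together form the training set.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


splits = list(k_fold_indices(10, k=5, seed=0))
```

Each instance lands in the test part exactly once, so the held-out predictions together cover the whole dataset. Repeating the procedure with different seeds gives repeated cycles of cross-validation.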
To evaluate the performance of the RNA-binding proteins predictor, the accuracy, sensitivity, specificity, and Matthews correlation coefficient (MCC) were calculated as follows:
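The equations themselves do not appear in this section. Assuming the standard confusion-matrix definitions were used, with TP, TN, FP, and FN denoting the numbers of true positives, true negatives, false positives, and false negatives, the four measures can be written as:

```latex
\begin{align}
\mathrm{Accuracy}    &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\mathrm{Sensitivity} &= \frac{TP}{TP + FN}, \\
\mathrm{Specificity} &= \frac{TN}{TN + FP}, \\
\mathrm{MCC}         &= \frac{TP \times TN - FP \times FN}
                             {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.
\end{align}
```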
Considering the successful application on several classification researches [
The mRMR method was developed by Peng et al. [
In (
Let
To obtain the feature
In this study, after using the mRMR method, all of the 340 features were ordered as follows:
In (
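The greedy ranking idea behind mRMR can be sketched for discrete features as below. This is a simplified illustration, not the mRMR program used in this work: it assumes the common difference criterion (relevance to the class minus mean mutual information with the already selected features), and the names `mutual_information` and `mrmr_rank` are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information I(x; y) in bits for two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


def mrmr_rank(features, target):
    """Order feature names greedily by relevance minus mean redundancy.

    features: dict mapping feature name -> list of discrete values.
    target:   list of class labels, same length as each feature column.
    """
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    selected, remaining = [], list(features)
    while remaining:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "a" is strongly related to the class, "a_copy" duplicates it,
# and "b" is weakly related to the class but independent of "a".
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
target = [0, 0, 0, 1, 1, 1, 1, 1]
ranking = mrmr_rank({"a": a, "a_copy": list(a), "b": b}, target)
```

In the toy data, the duplicated feature is ranked last even though it is highly relevant, because it is fully redundant with a feature already selected.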
To select the optimal features, we used incremental feature selection (IFS) [
We then constructed 340 individual predictors for the 340 feature sets to predict RNA-binding proteins. Each predictor was constructed by the RF algorithm and evaluated by 5-fold cross-validation. The MCC values of all 340 predictors were then calculated to obtain the IFS curve with feature index
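The IFS loop can be sketched as follows. This is a minimal illustration: `evaluate` is a hypothetical callback standing in for "train an RF on this feature subset and return its cross-validated MCC", and the toy scoring function exists only to make the example runnable.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Score the nested subsets (top 1, top 2, ... features) and return the best.

    ranked_features: feature names in ranked (e.g. mRMR) order.
    evaluate: hypothetical callable mapping a list of feature names to a score.
    """
    curve = []
    for i in range(1, len(ranked_features) + 1):
        subset = ranked_features[:i]          # the top-i features
        curve.append((i, evaluate(subset)))   # one point on the IFS curve
    best_size, best_score = max(curve, key=lambda point: point[1])
    return ranked_features[:best_size], best_score, curve


# Toy stand-in for cross-validated MCC: reward useful features, penalize size.
useful = {"f1", "f3"}


def toy_score(subset):
    return sum(f in useful for f in subset) - 0.1 * len(subset)


best, score, curve = incremental_feature_selection(["f1", "f3", "f2", "f4"], toy_score)
```

The optimal feature set is the subset at the peak of the curve. In the toy run this is the top two features, since adding more only lowers the score.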
We explored the performance of RF-based predictors for predicting RNA-binding proteins by various features. The prediction results of the individual RF-based predictors using 10 cycles of 5-fold cross-validation over the MDset are shown in Table
The prediction performance of the RF model based on various features, evaluated by 10 cycles of 5-fold cross-validation on the MDset dataset.
Feature | Accuracy ± SD | Sensitivity ± SD | Specificity ± SD | MCC ± SD |
---|---|---|---|---|
PSSM-400 | 0.7967 ± 0.0062 | 0.7003 ± 0.0093 | 0.8894 ± 0.0075 | 0.620 ± 0.016 |
EIPP | 0.8311 ± 0.0105 | 0.7487 ± 0.0071 | 0.9107 ± 0.0129 | 0.662 ± 0.021 |
CT | 0.7482 ± 0.0092 | 0.6591 ± 0.0067 | 0.8406 ± 0.0153 | 0.5096 ± 0.015 |
EIPP + BP + NBP | 0.8428 ± 0.0038 | 0.7573 ± 0.0082 | 0.9367 ± 0.0043 | 0.704 ± 0.008 |
CT + BP + NBP | 0.7661 ± 0.0197 | 0.7034 ± 0.0132 | 0.8587 ± 0.0114 | 0.568 ± 0.026 |
EIPP + CT | 0.8317 ± 0.0139 | 0.7482 ± 0.0068 | 0.9202 ± 0.0127 | 0.671 ± 0.018 |
EIPP + BP + NBP + CT | 0.8573 ± 0.0117 | 0.7764 ± 0.0143 | 0.9424 ± 0.0062 | 0.729 ± 0.020 |
We ranked a list of 340 features for MDset dataset using the mRMR method, which was downloaded from
Based on the list of 340 features obtained from the mRMR method, we obtained 340 feature subsets. We then built 340 individual predictors, one for each feature subset, to predict RNA-binding proteins, and evaluated them by 5-fold cross-validation on the MDset dataset. As shown in Figure
Optimal 47 features for prediction of RNA-binding proteins.
Rank | Feature |
---|---|
1 | EIPP of ASP in protein sequence for the pKa values of amino group |
2 | EIPP of GLU in protein sequence for the Balaban index |
3 | BP(2) |
4 | EIPP of TYR in protein sequence for the pKa values of amino group |
5 | CT of class a, class b, and class e |
6 | CT of class d, class b, and class e |
7 | EIPP of HIS in protein sequence for the pKa values of amino group |
8 | EIPP of LYS in protein sequence for the pKa values of carboxyl group |
9 | CT of class b, class d, and class e |
10 | CT of class d, class c, and class e |
11 | EIPP of MET in protein sequence for the molecular mass |
12 | CT of class b, class e, and class a |
13 | EIPP of ARG in protein sequence for the pKa values of amino group |
14 | NBP(2) |
15 | CT of class c, class e, and class d |
16 | BP(1) |
17 | EIPP of TRP in protein sequence for the pKa values of amino group |
18 | CT of class d, class d, and class e |
19 | EIPP of LYS in protein sequence for the Balaban index |
20 | NBP(1) |
21 | CT of class c, class a, and class d |
22 | CT of class b, class e, and class d |
23 | CT of class e, class d, and class e |
24 | EIPP of HIS in protein sequence for the pKa values of carboxyl group |
25 | CT of class d, class c, and class f |
26 | CT of class e, class f, and class d |
27 | CT of class e, class b, and class d |
28 | CT of class d, class e, and class c |
29 | EIPP of GLY in protein sequence for the pKa values of carboxyl group |
30 | EIPP of THR in protein sequence for the molecular mass |
31 | CT of class c, class b, and class e |
32 | CT of class c, class e, and class a |
33 | EIPP of GLN in protein sequence for Wiener index |
34 | EIPP of SER in protein sequence for Wiener index |
35 | EIPP of ASN in protein sequence for the molecular mass |
36 | CT of class b, class a, and class c |
37 | CT of class e, class d, and class f |
38 | CT of class e, class b, and class a |
39 | EIPP of TRP in protein sequence for the pKa values of carboxyl group |
40 | CT of class a, class e, and class c |
41 | EIPP of ARG in protein sequence for the lowest free energy |
42 | CT of class e, class c, and class d |
43 | EIPP of LYS in protein sequence for the molecular mass |
44 | CT of class e, class e, and class d |
45 | EIPP of TYR in protein sequence for Wiener index |
46 | CT of class e, class c, and class b |
47 | CT of class f, class c, and class d |
The IFS curve showing MCC values against feature numbers. The maximum MCC value was 0.684 when the top 47 features were selected.
As described in Section
(a) Feature distribution for the 47 optimal features. (b) The selection proportion of each type of feature.
As shown in Figure
All four BP and NBP features in the original feature dataset were selected into the optimal feature set, which suggests that BP and NBP features contribute strongly to distinguishing RNA-binding proteins from nonbinding ones. We also calculated the
The strong performance of the BP and NBP features supports the way they were defined. The rationale is as follows. Compared with nonbinding proteins, RNA-binding proteins should contain a higher proportion of RNA-binding residues, and these residues should tend to gather together spatially on the protein surface. The two BP features capture these characteristics of RNA-binding proteins at the sequence level and the spatial level, respectively. By contrast, the proportion of nonbinding residues should be much higher in nonbinding proteins than in RNA-binding proteins, which is what the NBP features capture. Therefore, the BP and NBP features worked as we expected.
After applying the mRMR-IFS method, 19 EIPP features were selected in the optimal feature set. Because each EIPP feature combines the evolutionary information of one amino acid type in the sequence with one physicochemical property, we counted how often each amino acid type and each physicochemical property occurred among the 19 EIPP features. Figures
(a) Distribution of physicochemical properties among the 19 EIPP features selected in the optimal feature set. (b) Distribution of amino acid types among the 19 EIPP features selected in the optimal feature set.
As seen from Figure
Twenty-four CT features were selected in the optimal feature set. The number of occurrences of each class among these 24 CT features was analyzed and is shown in Figure
Distribution of classes among the 24 CT features selected in the optimal feature set.
To evaluate the effectiveness of our protocol, we compared the performance of our method with existing methods. Currently, there are two webservers for identifying RNA-binding proteins based on sequence information. One is SVMprot by Han et al. [
Comparison of the prediction results of our method and two webservers on the Testset.
Method | ACC | SE | SP | MCC |
---|---|---|---|---|
Our method | 0.7674 | 0.7222 | 0.8125 | 0.537 |
SVMprot | 0.5764 | 0.7639 | 0.3889 | 0.165 |
RNApred | 0.6111 | 0.6389 | 0.5833 | 0.223 |
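As a consistency check, the accuracy and MCC in the table above can be recomputed from the reported sensitivity and specificity. The sketch below assumes the standard definitions of the four measures and infers the confusion-matrix counts by rounding each rate times 144 (the Testset has 144 proteins per class); the function names are hypothetical.

```python
from math import sqrt


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and MCC from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def counts_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Infer (tp, tn, fp, fn) from reported rates; an assumption, not source data."""
    tp = round(sensitivity * n_pos)
    tn = round(specificity * n_neg)
    return tp, tn, n_neg - tn, n_pos - tp


# (sensitivity, specificity) as reported in the table.
reported = {
    "Our method": (0.7222, 0.8125),
    "SVMprot": (0.7639, 0.3889),
    "RNApred": (0.6389, 0.5833),
}
results = {
    name: binary_metrics(*counts_from_rates(se, sp, 144, 144))
    for name, (se, sp) in reported.items()
}
```

Under these assumptions, the recomputed accuracy and MCC agree with the tabulated values to the reported precision for all three methods.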
Accurate identification of new RNA-binding proteins is important for understanding RNA-protein interactions. In this study, an accurate method was developed to predict RNA-binding proteins using only sequence information. We proposed three novel features: binding propensity (BP), nonbinding propensity (NBP), and evolutionary information combined with physicochemical properties (EIPP). BP and NBP were constructed based on the prediction results of RNA-binding residues and nonbinding residues, respectively. The EIPP features improve on PSSM features by combining evolutionary information with physicochemical properties. The results showed that these novel features dramatically improved the prediction performance and were effective in distinguishing RNA-binding proteins from nonbinding ones. The mRMR-IFS feature selection method and the RF algorithm were then used to construct the prediction model. This is the first study in which the mRMR-IFS feature selection method has been successfully used to predict RNA-binding proteins. The prediction model achieved excellent performance, with 86.62% accuracy, 78.34% sensitivity, 94.91% specificity, and an MCC of 0.737. These results indicate that our predictor is a useful tool for predicting RNA-binding proteins.
The authors declare that there is no conflict of interest regarding the publication of this paper.
This work was supported by the National Natural Science Foundation of China under Grant no. 61305072 and the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant no. 14KJB520020 and was sponsored by the Qing Lan Project.