Identifying which proteins can interact with RNA is important for protein annotation, since interactions between RNA and proteins influence the structure of the ribosome and play important roles in gene expression. This paper identifies proteins that can interact with RNA by means of voting systems. First, 34 learning algorithms are chosen from Weka for investigation. A simple majority voting system (SMVS) is then used for the prediction of RNA-binding proteins, achieving an average ACC (overall prediction accuracy) of 79.72% and MCC (Matthews correlation coefficient) of 59.77% on the independent testing dataset. Next, the mRMR (minimum redundancy maximum relevance) strategy, originally a feature selection method, is applied to algorithm selection. In addition, the MCC value of each classifier is used as the weight of that classifier’s vote. As a result, the best average MCC value is attained when 22 algorithms are selected and integrated through weighted voting: 64.70% on the independent testing dataset, with a corresponding ACC of 82.04%.
Protein-RNA interactions play significant roles in a wide range of biological processes, including regulation of gene expression, protein synthesis and replication, and the assembly of many viruses [
The papers reviewed above applied a single classifier to determine the interactions between RNA and proteins. However, for a specific biological dataset, each individual classifier has its own strengths and weaknesses. Underfitting or overfitting of a single classifier will degrade either the accuracy or the generalization of the prediction. This motivates the integration of multiple classifiers [
Searching the SWISS-PROT database (version 54.2) with the keyword “RNA binding” retrieved 20132 proteins. This collection was designated the “positive” dataset.
A “contrast” set of 72331 proteins was retrieved from SWISS-PROT by searching with a list of keywords that possibly imply RNA/DNA-binding functionality, combined with “or” logic, as proposed by Cai and Lin [
The proteins in the “contrast” dataset were then removed from the SWISS-PROT database (which contains 232345 sequence entries), and the remaining 160014 proteins formed the “negative” dataset.
Protein sequences longer than 6000 aa or shorter than 50 aa were removed, since they might be protein complexes or protein fragments. Proteins containing irregular amino acid characters such as “x” and “z” were also removed. Moreover, redundancy among the sequences in the “positive” and “negative” datasets was removed using CD-HIT [
The distribution of proteins in the training and test datasets.
Dataset | Positive | Negative |
---|---|---|
Basic training dataset | 1376 | 1376 |
Independent test dataset | 687 | 687 |
A successful classification requires an effective way to represent a protein. Under current techniques, it is not possible to know every aspect of a protein from its sequence information. However, the biological properties of the amino acids that compose a protein are known, and they may reveal some properties of the whole protein sequence. Thus, in this paper a protein is represented by its amino acid composition and the biological properties of each amino acid [
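As a rough illustration of this representation, the following minimal Python sketch computes the 20-dimensional amino acid composition of a sequence (the full feature set also includes per-residue biological properties, which are omitted here; the function name and example sequence are purely illustrative):

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    counts = Counter(sequence.upper())
    total = len(sequence)
    return [counts[aa] / total for aa in AMINO_ACIDS]

# Example on a short fragment; real inputs are full protein sequences (50-6000 aa).
features = amino_acid_composition("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```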
34 machine learning algorithms in Weka [
BayesNet, DecisionTable, JRip, PART, Ridor, AttributeSelectedClassifier, Bagging, ClassificationViaRegression, Dagging, Decorate, END, EnsembleSelection, FilteredClassifier, LogitBoost, MultiClassClassifier, OrdinalClassClassifier, RacedIncrementalLogitBoost, RandomSubSpace, ClassBalancedND, ND, DataNearBalancedND, RandomCommittee, IB1, AdaBoostM1, KStar, MultilayerPerceptron, SimpleLogistic, SMO, J48, J48graft, NBTree, RandomForest, REPTree, SimpleCart.
Readers may refer to [
Four ensemble approaches, the simple majority voting system (SMVS), the weighted majority voting system (WMVS), SMVS with algorithm selection (SMVS_AS), and WMVS with algorithm selection (WMVS_AS), are introduced briefly here. Readers may refer to [
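As a rough sketch of the two voting rules (not the authors' implementation; the helper name and example weights are illustrative), SMVS counts every vote equally, while WMVS weights each classifier's vote, for example by its MCC:

```python
def majority_vote(predictions, weights=None):
    """Combine binary predictions (1 = RNA-binding, 0 = not) from several classifiers.

    If weights is None every vote counts equally (SMVS); otherwise each vote is
    weighted, e.g. by that classifier's MCC on the training data (WMVS).
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    positive = sum(w for p, w in zip(predictions, weights) if p == 1)
    return 1 if positive > sum(weights) / 2 else 0

# SMVS: three of five classifiers predict "RNA-binding".
smvs_label = majority_vote([1, 1, 0, 1, 0])
# WMVS: the same votes, each weighted by the classifier's MCC.
wmvs_label = majority_vote([1, 1, 0, 1, 0], weights=[0.58, 0.55, 0.61, 0.50, 0.59])
```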
The 34 algorithms were evaluated by tenfold cross-validation (10-CV) on both the basic training dataset and the independent testing dataset. The detailed 10-CV outputs for both datasets are listed in the Supplementary Material.
Figures
The standard deviation of the 34 algorithms.
Algorithm | Basic training ACC (%) | Basic training MCC (%) | Independent test ACC (%) | Independent test MCC (%) |
---|---|---|---|---|
AdaBoostM1 | 0.61 | 1.16 | 1.00 | 1.94 |
J48 | 0.88 | 1.76 | 1.42 | 2.84 |
IBk | 0.52 | 1.01 | 1.18 | 2.21 |
MultiClassClassifier | 0.60 | 1.21 | 1.04 | 2.09 |
PART | 0.55 | 1.25 | 1.26 | 2.54 |
MultilayerPerceptron | 1.26 | 2.52 | 2.22 | 3.04 |
KStar | 0.72 | 1.41 | 1.07 | 2.00 |
Bagging | 0.76 | 1.51 | 0.43 | 0.88 |
NBTree | 0.82 | 1.64 | 2.04 | 4.09 |
Decorate | 0.73 | 1.47 | 1.16 | 2.25 |
RandomForest | 0.67 | 1.32 | 0.62 | 1.25 |
JRip | 0.48 | 0.96 | 2.25 | 4.43 |
RandomCommittee | 0.51 | 0.99 | 1.23 | 2.59 |
FilteredClassifier | 1.11 | 2.22 | 1.16 | 2.32 |
ClassificationViaRegression | 0.96 | 1.91 | 0.80 | 1.57 |
Dagging | 0.70 | 1.38 | 1.00 | 2.00 |
AttributeSelectedClassifier | 0.85 | 1.71 | 0.66 | 1.40 |
REPTree | 0.71 | 1.46 | 1.32 | 2.66 |
SMO | 0.55 | 1.10 | 1.06 | 2.11 |
J48graft | 1.06 | 2.12 | 1.40 | 2.81 |
Ridor | 1.01 | 2.14 | 1.70 | 3.44 |
RandomSubSpace | 0.91 | 1.84 | 1.22 | 2.44 |
EnsembleSelection | 0.78 | 1.60 | 1.35 | 2.42 |
SimpleLogistic | 0.41 | 0.83 | 0.92 | 1.84 |
DecisionTable | 0.98 | 2.06 | 1.86 | 3.87 |
DataNearBalancedND | 0.88 | 1.76 | 1.42 | 2.84 |
RacedIncrementalLogitBoost | 0.63 | 1.59 | 1.68 | 3.61 |
SimpleCart | 0.63 | 1.26 | 1.13 | 2.25 |
LogitBoost | 0.43 | 0.87 | 1.23 | 2.47 |
ND | 0.88 | 1.76 | 1.42 | 2.84 |
BayesNet | 0.51 | 1.02 | 1.02 | 2.10 |
ClassBalancedND | 0.88 | 1.76 | 1.42 | 2.84 |
OrdinalClassClassifier | 0.88 | 1.76 | 1.42 | 2.84 |
END | 0.88 | 1.76 | 1.42 | 2.84 |
The average ACC values of the 34 algorithms on the basic training dataset.
The average MCC values of the 34 algorithms on the basic training dataset.
The average ACC values of the 34 algorithms on the independent test dataset (including the results of SMVS and WMVS_MCC).
The average MCC values of the 34 algorithms on the independent test dataset (including the results of SMVS and WMVS_MCC).
The Matthews correlation coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even when the classes are of very different sizes. The MCC can be calculated directly from the confusion matrix using the following formula:

$$\mathrm{MCC}=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
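In code form, this is a direct transcription of the formula above (the zero-denominator guard is a common convention rather than something stated in the paper):

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient computed from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```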
The average predicted results and standard deviations of SMVS and WMVS are shown in Table
The comparison of the predictors.
Predictor | Average ACC (%) | Average MCC (%) | Standard deviation of ACC (%) | Standard deviation of MCC (%) |
---|---|---|---|---|
Best individual algorithm | 79.29 | 58.58 | 1.06 | 2.11 |
SMVS | 79.72 | 59.77 | 0.76 | 1.49 |
WMVS | 80.82 | 61.94 | 0.68 | 1.32 |
SMVS_AS | 81.88 | 64.40 | 0.55 | 1.02 |
WMVS_AS | 82.04 | 64.70 | 0.42 | 0.81 |
Algorithms are added into the voting system one by one according to their mRMR order. The voting result after each algorithm is added is plotted in Figure
The average MCC values of SMVS_AS and WMVS_AS.
SMVS_AS and WMVS_AS achieve their highest average MCC values, 64.40% and 64.70%, respectively, when the 22nd algorithm is added. The curve in Figure
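A minimal sketch of this incremental selection procedure, assuming the per-classifier predictions are already sorted by mRMR and reusing the `majority_vote` and `mcc` helpers sketched above (all names are illustrative, not the authors' code):

```python
def incremental_voting_mcc(sorted_predictions, weights, true_labels):
    """Add classifiers one by one in mRMR order and record the ensemble MCC.

    sorted_predictions: per-classifier lists of 0/1 predictions, in mRMR order.
    weights: per-classifier vote weights; all 1.0 for SMVS_AS, or each
             classifier's MCC for WMVS_AS.
    true_labels: reference 0/1 labels of the test samples.
    """
    curve = []
    for k in range(1, len(sorted_predictions) + 1):
        # Ensemble prediction for every sample using only the first k classifiers.
        ensemble = [
            majority_vote([preds[i] for preds in sorted_predictions[:k]], weights[:k])
            for i in range(len(true_labels))
        ]
        tp = sum(p == 1 and t == 1 for p, t in zip(ensemble, true_labels))
        tn = sum(p == 0 and t == 0 for p, t in zip(ensemble, true_labels))
        fp = sum(p == 1 and t == 0 for p, t in zip(ensemble, true_labels))
        fn = sum(p == 0 and t == 1 for p, t in zip(ensemble, true_labels))
        curve.append(mcc(tp, tn, fp, fn))
    return curve  # the peak of this curve gives the best ensemble size (here, k = 22)
```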
In Weka version 3.5.7, the 34 algorithms are divided into Bayesian classifiers (Bayes), trees, rules, functions, metalearning algorithms (meta), and lazy classifiers (lazy). The numbers of algorithms of each type involved in the voting before and after algorithm selection are shown in Figure
Distribution of algorithms.
To predict interactions between proteins and RNA, we integrate a number of machine learning algorithms selected from Weka through four voting systems [
This work was supported by grants from the National Natural Science Foundation of China (20973108), the Key Research Program (CAS) (KSCX2-YW-R-112), the Shanghai Leading Academic Discipline Project (J50101) and the Systems Biology Research Foundation of Shanghai University, the National Natural Science Foundation of China (20902056), and the Science Foundation of Shanghai for Excellent Young Teachers (B.37010107716).