Prediction of RNA-Binding Proteins by Voting Systems

Identifying which proteins can interact with RNA is important for protein annotation, since interactions between RNA and proteins influence the structure of the ribosome and play important roles in gene expression. This paper identifies proteins that can interact with RNA using voting systems. First, 34 learning algorithms are chosen from Weka for investigation. A simple majority voting system (SMVS) is then used for the prediction of RNA-binding proteins, achieving an average ACC (overall prediction accuracy) value of 79.72% and an average MCC (Matthews correlation coefficient) value of 59.77% on the independent testing dataset. Next, the mRMR (minimum redundancy maximum relevance) strategy is adapted to algorithm selection. In addition, the MCC value of each classifier is used as the weight of that classifier's vote. As a result, the best average MCC value, 64.70% on the independent testing dataset, is attained when 22 algorithms are selected and integrated through weighted votes; the corresponding ACC value is 82.04%.


Introduction
Protein-RNA interactions play significant roles in a wide range of biological processes, including regulation of gene expression, protein synthesis and replication, and the assembly of many viruses [1][2][3][4]. A good knowledge of protein-RNA interactions is fundamentally important for understanding how proteins regulate gene expression. Machine learning and data mining methods have been widely applied in the fields of computational biology and bioinformatics [5][6][7][8][9], and the same principles are also applied to determine whether a protein participates in RNA binding [10][11][12][13][14][15][16]. Some investigations encode a protein using primary amino acid compositions [10,11,13,14], and some encode it with chemical or physical properties and structural information [10][11][12][14][15][16]. In terms of machine learning methods, support vector machines (SVM) [10,14], artificial neural networks [17], Naive Bayes [18], and so forth have all been used in the literature to uncover the interaction between proteins and RNA. A specific study [19] was carried out to determine the interaction sites between RNA and Rev proteins of HIV-1 and EIAV, in which both protein-protein interface residues and protein-RNA interface residues were predicted, by first training the predictors using known protein-protein and protein-RNA complexes and then using the trained predictors to predict the binding sites of HIV-1 and EIAV Rev proteins.
The papers reviewed above applied a single classifier to determine the interactions between RNA and proteins. However, for a specific biological dataset, an individual classifier has its own strengths and weaknesses. Underfitting or overfitting of a single classifier will affect the accuracy or the generalization of the prediction. Thus, researchers have been inspired to integrate multiple classifiers [20,21] in attempts to improve the prediction/classification performance. Recently, Chen et al. [21] proposed a few voting systems for the classification (prediction) of protein structural classes. Chen et al. [21] used an unprecedented number of machine learning algorithms from Weka (http://www.cs.waikato.ac.nz/~ml/weka/) for the voting systems and found that some of the classifiers may be redundant, since they could worsen the overall classification performance if included. Therefore, the mRMR (minimum redundancy maximum relevance) [22] strategy, which was originally developed for feature selection [23,24], was adapted to classifier selection. As a result, four voting systems were developed [21]: simple majority voting system (SMVS), weighted majority voting system (WMVS), SMVS with algorithm selection (SMVS AS), and WMVS with algorithm selection (WMVS AS). In this paper, these voting systems are adopted and applied to predict the interaction between proteins and RNA.

Data Preparation
(i) The Rough "Positive" Dataset: Using "RNA binding" as keywords to search the SWISS-PROT database (version 54.2), 20132 proteins were retrieved. This collection was designated as the "positive" dataset.
(ii) The "Contrast" Dataset: A "contrast" set of 72331 proteins was retrieved from SWISS-PROT by searching with a list of keywords that possibly imply RNA/DNA-binding functionality, combined using the "or" logic, as proposed by Cai and Lin [10].
(iii) The Rough "Negative" Dataset: The proteins in the "contrast" dataset were removed from the SWISS-PROT database (which has 232345 sequence entries), and the remaining 160014 proteins formed the "negative" dataset.
(iv) The RNA-Binding Protein Dataset: Protein sequences with length >6000 aa or <50 aa were removed since they might be protein complexes or protein fragments. Proteins containing irregular amino acid characters such as "x" and "z" were also removed. Moreover, the redundancy among the sequences in the "positive" and "negative" datasets was removed using the CD-HIT [25] and PISCES [26] programs, with a threshold of 40%. As a result, 2063 and 21562 proteins remained in the nonredundant RNA-binding and "negative" datasets, respectively. To achieve data balance, datasets were built in the following manner. First, all the proteins in the "positive" subset were selected as the first part. Second, the same number of proteins were randomly selected from the "negative" subset as the second part. Third, the first part and the second part were combined into the total dataset. Finally, one third of the total dataset was randomly drawn as the testing dataset, and the rest formed the training dataset. Consequently, an RNA-binding protein training dataset of 2752 proteins and an RNA-binding protein testing dataset of 1374 proteins were obtained (see Table 1, where "A" means RNA-binding protein and "B" means RNA-nonbinding protein). To ensure the stability of the built model, these steps were repeated ten times. That is, ten training datasets and ten testing datasets were built randomly, and all ACC (overall prediction accuracy) and MCC (Matthews correlation coefficient) values in this paper are averages over them.
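The filtering and balancing steps in (iv) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the redundancy removal by CD-HIT and PISCES is omitted, and the assumption that one third is drawn from each class separately (which happens to reproduce the reported sizes of 2752 and 1374) is not stated in the text.

```python
import random

# The 20 standard amino acids; anything else (e.g. "X", "Z") counts as irregular.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def keep_sequence(seq, min_len=50, max_len=6000):
    """Return True if the sequence passes the length and alphabet filters."""
    return min_len <= len(seq) <= max_len and set(seq.upper()) <= STANDARD_AA

def balanced_split(positives, negatives, rng):
    """Build one balanced train/test pair.

    All positives are kept, an equal number of negatives is sampled at random,
    and one third of each class is held out for testing (assumption: the split
    is done per class; the text only says one third of the total is drawn).
    """
    pos = list(positives)
    rng.shuffle(pos)
    neg = rng.sample(negatives, len(pos))  # already in random order
    n_test = len(pos) // 3
    test = [(p, "A") for p in pos[:n_test]] + [(n, "B") for n in neg[:n_test]]
    train = [(p, "A") for p in pos[n_test:]] + [(n, "B") for n in neg[n_test:]]
    return train, test

def repeated_splits(positives, negatives, repeats=10, seed=0):
    """Repeat the random construction, giving `repeats` train/test pairs."""
    rng = random.Random(seed)
    return [balanced_split(positives, negatives, rng) for _ in range(repeats)]
```

Averaging ACC and MCC over the ten pairs returned by `repeated_splits` corresponds to the averaged values reported in this paper.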

Feature Vector.
A successful classification requires an effective way to represent a protein. Under current techniques, it is not possible to know every aspect of a protein from its sequence information. However, the biological properties of the amino acids that compose a protein are known, and they may reveal some properties of the whole protein sequence. Thus, in this paper a protein is represented by its amino acid composition and the biological properties of each amino acid [14], which is one of the popular representation methods in the literature. The biological properties include hydrophobicity, predicted secondary structure, predicted solvent accessibility, normalized van der Waals volume, polarity, and polarizability. As a result, a total of 132 features are derived, of which 112 come from the biological properties and 20 from the amino acid composition. Detailed information about these features can be found in [14]. Readers may refer to [27] for a detailed introduction to these algorithms.
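As an illustration of the 20 composition features, the sketch below computes the fraction of each standard amino acid in a sequence. The function name and the use of fractions rather than percentages are assumptions; the 112 property-based features described in [14] are not reproduced here.

```python
# Fixed ordering of the 20 standard amino acids, so feature i always
# refers to the same residue type.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(seq):
    """Return a 20-element list with the fraction of each amino acid in seq."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
```

For a sequence that contains only standard residues, the 20 values sum to 1.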

Ensemble Approach.
Four ensemble approaches, simple majority voting system (SMVS), weighted majority voting system (WMVS), SMVS with algorithm selection (SMVS AS), and WMVS with algorithm selection (WMVS AS), are introduced briefly here. Readers may refer to [21] for detailed information about these voting systems. SMVS takes the class label that gains the majority of votes as the class of a sample. WMVS weighs each vote with the overall prediction accuracy of the corresponding classifier on a training dataset. SMVS AS first selects some classifiers using the mRMR method, and the selected algorithms are then integrated through SMVS. WMVS AS, like SMVS AS, first selects some classifiers using the mRMR method, but WMVS is then used instead of SMVS in the integration.
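The difference between simple and weighted voting can be made concrete with a short sketch. It assumes each classifier outputs one class label per sample; the function name and the tie-breaking rule (smallest label wins) are illustrative choices, not taken from [21].

```python
from collections import defaultdict

def weighted_vote(predictions, weights=None):
    """Combine one label per classifier into a single label.

    predictions: list of class labels, one from each classifier.
    weights: one non-negative weight per classifier; None gives every
             classifier weight 1, which is simple majority voting.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per prediction")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    best = max(totals.values())
    # Tie-break (an assumption): pick the smallest label among the winners.
    return min(label for label, total in totals.items() if total == best)
```

With labels `["A", "B", "B"]`, simple voting returns `"B"`, while weights `[0.9, 0.4, 0.4]` return `"A"` because the single strong classifier outweighs the two weaker ones. With an even number of classifiers, simple voting can tie, whereas weighted totals rarely do.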

Prediction Results of the 34 Algorithms.
The 34 algorithms were tested by tenfold cross-validation (10-CV) on both the basic training dataset and the independent testing dataset. The detailed outputs of 10-CV on the basic training dataset and the independent testing dataset are listed in the Supplementary Material. See Figures 1, 2 and Table 2; the results appear to be stable.
The Matthews correlation coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes. The MCC can be calculated directly from the confusion matrix using the following formula:
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
In this equation, TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
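Both evaluation measures used in this paper can be computed from the four confusion-matrix counts, as in the sketch below. The function names are illustrative, and returning 0 when the MCC denominator vanishes is a common convention assumed here, not something specified in the text.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Overall prediction accuracy (ACC): fraction of correct predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: MCC is 0 when a row or column of the
        # confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

The MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction correct); the percentages reported in this paper correspond to this value multiplied by 100.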

Results of SMVS and WMVS.
The average prediction results and standard deviations of SMVS and WMVS are shown in Table 3. SMVS and WMVS perform better than any individual algorithm selected from Weka, and WMVS performs slightly better than SMVS. This implies that, as a whole, the 34 algorithms collaborate to improve the prediction accuracy through voting. The standard deviations also decrease significantly through voting, which implies that the voting systems increase the stability of the prediction model.

Results of SMVS AS and WMVS AS.
Algorithms are added into the voting system one by one according to the mRMR order. The voting result after each added algorithm is plotted in Figure 5. SMVS AS and WMVS AS achieve their highest average MCC values, 64.40% and 64.70% respectively, when the 22nd algorithm is added. The curve in Figure 5 shows that WMVS AS performs better than SMVS AS in most cases, especially when the voting system involves an even number of algorithms. Voting systems with algorithm selection perform better than those without, indicating that some of the 34 algorithms have a negative effect or no effect and should be excluded. The selected classifiers are summarized by type in Figure 6 (the number of algorithms used by WMVS AS is the average value of 22 algorithms). In terms of proportion, all adopted lazy and rules classifiers are selected by the voting system, and around half of the functions and tree classifiers are selected, indicating that there is less redundancy among these types of classifiers. The Bayes classifier is excluded, indicating that it has a negative effect or no effect in the voting. Because metaclassifiers are the most numerous among all types of classifiers involved, many of them are redundant and are excluded from the voting. Nevertheless, more metaclassifiers remain in the voting than any other type of classifier after the algorithm selection. On the whole, the numbers of classifiers of different types become more even after the algorithm selection, indicating that classifiers of different types tend to collaborate better in the voting than those of the same type.
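The incremental procedure behind Figure 5 can be sketched as follows, under stated assumptions: the mRMR ranking is taken as given (it is not computed here), each classifier's predictions on the evaluation set are already available, and all function names are hypothetical. Passing equal weights corresponds to SMVS AS and passing per-classifier weights corresponds to WMVS AS.

```python
import math
from collections import defaultdict

def vote(labels, weights):
    """Weighted vote over one sample; ties go to the smallest label (assumption)."""
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    best = max(totals.values())
    return min(label for label, total in totals.items() if total == best)

def mcc(truth, predicted, positive="A"):
    """MCC of predicted labels against the true labels."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

def select_ensemble_size(ranked_predictions, ranked_weights, truth):
    """Add classifiers in ranked order and score every prefix of the ranking.

    ranked_predictions: one list of labels per classifier, in mRMR order.
    ranked_weights: one weight per classifier, in the same order.
    Returns (best_k, scores), where scores[k - 1] is the MCC of the first k
    classifiers voting together and best_k is the smallest k with the top MCC.
    """
    scores = []
    for k in range(1, len(ranked_predictions) + 1):
        members = ranked_predictions[:k]
        combined = [vote([preds[i] for preds in members], ranked_weights[:k])
                    for i in range(len(truth))]
        scores.append(mcc(truth, combined))
    best_k = max(range(len(scores)), key=scores.__getitem__) + 1
    return best_k, scores
```

Plotting `scores` against the number of classifiers gives a curve of the kind shown in Figure 5, and `best_k` plays the role of the 22 algorithms reported above.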

Conclusions
To predict the interaction between proteins and RNA, we integrate a number of machine learning algorithms selected from Weka using four voting systems [21]. As a result, voting systems perform better than any single classifier, voting systems with algorithm selection perform better than those without, and weighted voting systems perform better than those without weighting. Weighted voting systems with algorithm selection achieve the best prediction results with 82.04% (ACC value) and 64.70% (MCC value) on the independent dataset.