Protein-Specific Prediction of RNA-Binding Sites Based on Information Entropy

Understanding the mechanism of protein-RNA interaction helps us to further explore various biological processes. Experimental techniques still have limitations, such as high costs in money and time, so predicting protein-RNA-binding sites with computational methods is a valuable research tool. Here, we developed a universal method for predicting protein-specific RNA-binding sites: one general model for a given protein was constructed on a fixed dataset obtained by fusing the data of different experimental techniques. At the same time, information theory was employed to characterize the sequence conservation of RNA-binding segments. Conservation difference profiles between binding and nonbinding segments were constructed by information entropy (IE), and they indicate a significant difference. Finally, 19 protein-specific models based on random forest (RF) were built on the IE encoding. The performance on the independent datasets demonstrates that our method can obtain competitive results when compared with the current best prediction model.


Introduction
RNA-binding proteins (RBPs) play an important role in gene expression and regulation since they are highly involved in various biological processes such as mRNA stability [1], stress responses [2], and gene regulation at the transcriptional and posttranscriptional levels [3]. Understanding RNA-protein interactions can lead to further study on the mechanisms underlying these biological processes. Accurate identification of RNA-protein binding sites is very useful for studying these biological processes. In recent years, many high-throughput experimental methods, such as PAR-CLIP [4], have been developed, which can accurately determine the binding sites of RNA-protein interactions at the experimental level. However, these experimental methods are time-consuming and costly. It is therefore necessary to develop computational methods to predict the binding sites between RNAs and proteins.
At present, several researchers have developed computational methods for predicting RNA-protein binding sites. RNAcontext is a method that uses sequence and accessibility information to predict binding motifs [5], and Maticzka et al. presented GraphProt to predict binding preferences by using RNA sequence and secondary structural contexts [6]. Zhang et al. integrated RNA sequence, secondary structural contexts, and RNA tertiary structural information in a deep learning framework for modeling structural binding preferences and predicting binding sites of RBPs [7]. Strazar et al. developed an integrative orthogonality-regularized nonnegative matrix factorization (iONMF) to predict RBP interaction sites on RNAs [8]. In iONMF, the features of protein-RNA interactions are the positions of RNA structure and sequence motifs, RBP cobinding, and gene region types. Pan et al. proposed iDeep, iDeepS, and iDeepE to predict RNA-protein binding sites from RNA sequences [9]. iDeep is a hybrid convolutional neural network and deep belief network that uses RNA sequence, secondary structure, CLIP cobinding, region type, and motif features to predict the RBP interaction sites and motifs on RNAs. iDeepS can identify binding sequence and structure motifs from RNA sequences at the same time by using convolutional neural networks and a bidirectional long short-term memory network, with only the RNA sequence and the predicted secondary structure as input. iDeepE improves the network structure to predict RNA-protein binding sites and motifs by combining local and global deep convolutional neural networks. The above research has achieved satisfactory prediction results, and some methods, such as iDeepS, obtain high area under curve (AUC) values on most datasets. However, previous studies still have limitations that deserve improvement, including the use of different experimental datasets and the selection of features.
Since the experimental methods applied to the same protein usually differ, previous prediction methods are all built on separate datasets from different experimental techniques, so multiple models are usually constructed for a given protein. For example, the protein ELAVL1 has two datasets obtained with the different protocols of PAR-CLIP and CLIP-seq [10,11]. We know that each protein has a definite RNA-binding motif, but experimental errors may leave some binding events undetected. Different protocols can make up for the loss caused by experimental errors, so the different experimental datasets for a given protein can be merged into one general dataset to build a protein-specific model that is universal across experimental techniques. In addition, the current accuracy of predicted secondary structures is relatively low, so the reported performance of existing methods based on predicted secondary structures should be somewhat discounted.
In our study, we developed a simpler and universal model for predicting RNA-protein binding sites for different proteins based only on RNA sequence information. Firstly, for 19 proteins, 19 general protein-specific datasets were obtained by fusing different experimental data. Then, based only on the sequence information, k-mer-based features and IE profiles were respectively used to represent the binding segments. Finally, the RF model based on IE profiles proved to have the best performance compared with the 3-, 4-, and 5-mer features. The final model yields satisfactory results with an average AUC of 0.849 for the 19 proteins. In addition, a model that uses only RNA secondary structures was also built, but it gives the lowest prediction accuracy. Overall, our method, based only on sequence conservation information measured by IE, can obtain competitive results when compared with the current best prediction model.

Dataset.
We downloaded the datasets from iDeepS [12], which is available at https://github.com/xypan1232/iDeepS. iDeepS collected the data from iONMF [8], and it contains 31 CLIP-seq datasets covering 19 binding proteins. The original CLIP-seq data are from the servers iCount (http://icount.biolab.si) and DoRiNA [13]. In the iDeepS datasets, protein-RNA-binding sites are positive samples. Fifty bases were selected from each side of the binding site by a sliding window to form a sequence with a length of 101. Genes that had never been identified as interacting sites in any of the 31 experiments were used as negative samples.
In our work, in order to construct the protein-specific and experiment-universal model, we merged the 31 CLIP-seq datasets into 19 general datasets for 19 different proteins. Here, due to their different structures, Ago2-MNase and Ago2, as well as ELAVL1-MNase and ELAVL1, are treated as different proteins [14]. Finally, we separated each dataset into a training set and a testing set at a ratio of 5:1. The detailed information about the 19 datasets is shown in Table 1.
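The merging and 5:1 splitting step can be sketched as follows. This is only an illustration under stated assumptions: the function name `merge_and_split`, the random shuffling, and the fixed seed are hypothetical, since the text does not say how samples were assigned to the training and testing sets.

```python
import random

def merge_and_split(datasets, ratio=5, seed=0):
    """Merge the per-experiment sample lists of one protein, then split train:test = ratio:1.

    `datasets` is a list of lists of (sequence, label) pairs, one inner list per experiment.
    Random shuffling is an assumption; the source only states the 5:1 ratio.
    """
    merged = [sample for dataset in datasets for sample in dataset]
    rng = random.Random(seed)
    rng.shuffle(merged)
    n_test = len(merged) // (ratio + 1)  # one part in (ratio + 1) goes to the test set
    return merged[n_test:], merged[:n_test]

# Toy example: two experiments for the same protein, 6 samples each.
par_clip = [("A" * 101, 1)] * 6
clip_seq = [("C" * 101, 0)] * 6
train, test = merge_and_split([par_clip, clip_seq])
print(len(train), len(test))
```

A stratified split that preserves the positive-to-negative ratio in both parts would be a natural refinement, but nothing in the text confirms which variant was used.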

Feature Extraction.
Here, we tried to represent protein-binding RNAs from their primary sequences by IE profiles and K-mer features. In addition, since previous reports have used RNA secondary structures and demonstrated that they are useful for representing binding segments [15], secondary structures were also extracted for comparison with the features derived from sequence information.
(1) Information Entropy. Information entropy (IE) was proposed by Shannon [16] and has been regarded as one of the simplest and most common measures of conservation at a site on protein sequences in the field of bioinformatics [17]. IE describes the uncertainty in the occurrence of discrete random events, and it also reflects the evolutionary conservation of each position in the sequence. It has been successfully used in our previous research works by Shi et al. [18] and Wang et al. [19]. Shi et al. used IE to distinguish methylated from nonmethylated peptide segments, and Wang et al. used it to classify Type IV secreted effectors from negative samples based on the N-terminal 100 residues.
In our work, IE is used to measure evolutionary conservation differences between binding and nonbinding sites of RNA sequences. The information content of each nucleic base at each position can be calculated and used as the input features to predict protein-RNA interaction binding sites.
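A minimal sketch of a position-wise entropy profile is given here, assuming the standard Shannon definition H = -Σ p_b log2(p_b) over the four bases at each position. The function names are hypothetical, and the exact way the IE values are turned into per-sequence input features is not specified in the text, so this only illustrates the conservation profile itself.

```python
import math
from collections import Counter

BASES = "ACGU"

def position_entropy(segments):
    """Return the Shannon entropy (in bits) of the base distribution at each position.

    All segments are assumed to have the same length (101 in this study).
    """
    segments = [s.upper().replace("T", "U") for s in segments]
    length = len(segments[0])
    profile = []
    for pos in range(length):
        counts = Counter(seg[pos] for seg in segments)
        total = sum(counts[b] for b in BASES)  # ambiguous symbols such as N are ignored
        h = 0.0
        for b in BASES:
            if counts[b]:
                p = counts[b] / total
                h -= p * math.log2(p)
        profile.append(h)
    return profile

def entropy_difference(positive, negative):
    """Per-position IE of negatives minus IE of positives (larger = positives more conserved)."""
    pos_profile = position_entropy(positive)
    neg_profile = position_entropy(negative)
    return [n - p for p, n in zip(pos_profile, neg_profile)]

# Toy example with 2-base segments: position 0 is fully variable, position 1 fully conserved.
print(position_entropy(["AA", "CA", "GA", "UA"]))
```

A lower entropy at a position means that one base dominates there, which is the sense in which a low IE value indicates a more conserved site.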
(2) K-mer. K-mer is a common genomic feature in bioinformatics, and it has been widely used to identify regions in biological DNA or protein molecules. Zhang et al. used K-mer to predict piRNA [20], and Cao et al. used K-mer to predict the subcellular localization of lncRNA [21]. To characterize protein-binding RNA sequences, we used all the 3-5 nt strings, that is, the 3-mer, 4-mer, and 5-mer strings.
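The K-mer encoding can be sketched as below. The function name `kmer_features` is hypothetical, and normalizing the counts to frequencies is an assumption, since the text does not state whether raw counts or frequencies were used. Each of the 4^K possible strings over the four bases gets one feature.

```python
from itertools import product

def kmer_features(seq, k, alphabet="ACGU"):
    """Return the frequency of every possible k-mer in `seq`, in lexicographic k-mer order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper().replace("T", "U")
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # windows containing other symbols (e.g. N) are skipped
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]

features = kmer_features("ACGUACGU", 3)
print(len(features))  # 4**3 features for K = 3
```

Because a 101-base segment contains at most 101 - K + 1 windows, the share of zero-valued features grows quickly with K, which is consistent with the sparsity concern raised later for 5-mers.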
(3) RNA Secondary Structure. RNA secondary structure refers to the planar structure formed by various components, such as single-stranded regions, stem-loop structures, and double-stranded structures, which arise from complementary base pairing within an RNA molecule as it folds back on itself. RNA secondary structure is a kind of folding formed by RNA molecules under natural conditions. RNA secondary structures are widely used in protein-RNA-binding site prediction, for example, in iONMF [8] and iDeepS [22]. RNAshapes is a tool for predicting the secondary structure of RNA [23]. We used RNAshapes to obtain RNA secondary structure annotations by referring to Fukunaga et al. [24]. In this way, we can obtain the secondary structure of each position in the sequence (Figure 1). Six types of secondary structures were considered, including stems (S), multiloops (M), hairpins (H), internal loops (I), dangling end (T), and dangling start (F).

Random Forests.
The random forests (RFs) algorithm is an ensemble classification model [25]. Its construction process is mainly composed of three aspects: the generation of training sets, the construction of decision trees, and the generation of the algorithm. First, we need to generate training sets from the original data by sampling. Through the bagging algorithm, N samples are extracted from the original dataset. Each sample produces a decision tree, and the generated decision tree does not need pruning, so N decision trees are established to form the forest.
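The bagging idea described above can be illustrated with a deliberately simplified sketch. It is not the implementation used in this study: each "tree" here is a hypothetical one-split stump on a randomly chosen feature, standing in for a full unpruned decision tree, and all function names are illustrative.

```python
import random
from collections import Counter

def bootstrap_sample(X, y, rng):
    """Draw len(X) samples with replacement (the bagging step)."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def fit_stump(X, y, rng):
    """Simplified stand-in for a decision tree: one threshold on one random feature."""
    f = rng.randrange(len(X[0]))
    best = None
    for t in sorted({row[f] for row in X}):
        left = [lab for row, lab in zip(X, y) if row[f] <= t]
        right = [lab for row, lab in zip(X, y) if row[f] > t]
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
        errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
        if best is None or errors < best[0]:
            best = (errors, f, t, l_lab, r_lab)
    return best[1:]  # (feature index, threshold, left label, right label)

def fit_forest(X, y, n_trees=25, seed=0):
    """Fit one stump per bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        Xb, yb = bootstrap_sample(X, y, rng)
        forest.append(fit_stump(Xb, yb, rng))
    return forest

def predict(forest, row):
    """Majority vote over all trees."""
    votes = Counter(l if row[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]

X = [[0.1], [0.2], [0.8], [0.9]]
y = [0, 0, 1, 1]
forest = fit_forest(X, y)
print(predict(forest, [0.1]), predict(forest, [0.9]))
```

A real random forest additionally samples a random subset of features at every split and grows each tree to full depth; the sketch keeps only the bootstrap sampling and the majority vote.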
At present, the RF algorithm is one of the most popular machine learning algorithms, and it has been widely used to solve biological classification problems [26][27][28][29][30].
The RF model we used in this study was implemented with a random forest package in the R language.

Evaluation Indicators.
We used five performance evaluation indicators to evaluate the predictive ability of the model, namely, area under curve (AUC), sensitivity (SE), specificity (SP), accuracy (ACC), and Matthews correlation coefficient (MCC). AUC represents the area under the receiver operating characteristic (ROC) curve. When drawing the ROC curve, the true positive rate is taken as the ordinate and the false positive rate as the abscissa. The closer the AUC is to 1, the better the prediction will be. The other metrics are commonly defined as follows:
\[ \mathrm{SE} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP}, \]
\[ \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \]
\[ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \]
where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively.
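The usual definitions of these five indicators can be sketched in a few lines. The function names are hypothetical, and the AUC is computed with the rank-based (Mann-Whitney) identity, which is one standard way to obtain the area under the ROC curve and not necessarily the routine used in this study.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Return SE, SP, ACC, and MCC from the four confusion-matrix counts."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # convention: 0 when undefined
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc}

def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as 0.5. This pairwise form is quadratic in the sample size,
    which is fine for a sketch but slow for large test sets.
    """
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(confusion_metrics(tp=40, fp=10, tn=45, fn=5))
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

SE, SP, ACC, and MCC depend on the decision threshold used to turn scores into class labels, whereas AUC summarizes the ranking quality over all thresholds.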

Difference Analysis between Positive and Negative Samples.
The distribution of nucleic bases is the basic information for an RNA sequence. With the segment length of 101 bases, we plotted two-sample logos [15,31] of the 19 protein training sets to show the difference in base composition between positive and negative samples in each dataset. Here, we select two of them as examples. We can find that the positive and negative samples of the same protein have obvious differences in base composition. For each segment, the 51st base in the positive sample sequence is the protein-RNA interaction site. According to the literature [32], the binding motif of protein-RNA interaction is usually 6-8 bases. From Figure 2(a), we can find that the [...] (Figure 2).
We also plotted the two-sample logos of the secondary structure of the 19 protein training sets to show the difference in secondary structure between positive and negative samples for each dataset. When protein and RNA interact, they both have unique spatial structures [33]. Therefore, the secondary structures of positive and negative samples will be different. This is also confirmed by our two-sample logos. In Figure 2(b), we can find that positions 48-54 in the positive samples of the QKI dataset are usually multiloop (M) and hairpin (H). In the secondary structure of the negative samples, we find more stem (S) structures. In Figure 2(d), the positive samples of the MOV10 dataset show stem (S) structures at positions 48-54, whereas the negative samples do not show many stem (S) structures. Thus, the two-sample logos show that the positive and negative samples also have obvious differences in secondary structure.
Then, we calculated the information entropy of the 19 datasets. Information entropy can characterize the evolutionary conservation of each position in a sequence. We selected the two training datasets of PUM2 and TIA1 to show the IE difference between positive and negative samples in Figure 3. From Figure 3, we can find that the IE values of positive and negative samples have obvious differences. The lower the IE value of a given position is, the more conserved this position is. In general, we can easily see that all positions in the positive dataset are more conserved than those in the negative dataset. Moreover, there are significant conservation differences between positive and negative samples at the location of the 51st base and its adjacent bases, as marked by the red ovals in Figure 3. The large difference in sequence conservation between binding and nonbinding sequences means that IE will be an important feature for predicting potential protein-RNA interaction sites.

Results on the Original 31 Datasets and the Combined 19 Datasets.
According to the definition of IE and IG, we calculated the IG value of each position in the sequence. The IG value can quantitatively reflect the evolutionary conservation of each site in the sequence. We can find that the evolutionary conservation of positive and negative samples is obviously different. Firstly, we used IG alone as the feature to construct the classification model. Figure 4 shows the AUCs of the 31 original test sets. The average AUC value of the test datasets is 0.86, which indicates that IE is generally effective for predicting binding sites of protein-RNA interactions.
In order to get a more general model, we merged the datasets from different experiments for the same protein, so 19 protein-specific datasets were obtained. For each of the 19 proteins, a protein-specific model was constructed on the corresponding protein-specific dataset. Figure 5 shows the AUCs of the combined test datasets, and Table 2 lists the detailed information. We can find that the IE-based models also yield good prediction performance with an average AUC of 0.807. Moreover, 12 of the 19 models give AUC values higher than 0.8. The results on the test datasets indicate that the general prediction model for each protein is valid and that developing a protein-specific model from the data of different experiments is feasible.
Secondly, six protein training sets were selected from the 19 proteins according to the results of the model constructed with the information entropy feature alone. Table 3 shows the AUC comparison of three models based on K-mer features with K = 3, 4, and 5, respectively. The result of 4-mer is similar to that of 5-mer and better than that of 3-mer. When K = 5, the feature dimension is 1028, so many of the 1028 features have a value of 0. In order to reduce the noise of the model, we gave up 5-mer as a feature. Finally, we chose K = 4.
Previous research has selected secondary structures as a feature to predict protein-RNA interaction binding sites [7]. We also tried to introduce the secondary structure into our prediction model. Table 4 lists the test-set AUCs of the original 31 datasets obtained using the secondary structure alone. However, we find that the contribution of the secondary structure alone to the model is very small. The average AUC is only 0.55. There are specific secondary structures when proteins interact with RNA, but the RNA secondary structure used in current research is not the real secondary structure. It is derived from predictive tools such as RNAshapes [23] and RNAfold [34], so it will provide some wrong information to the model. Because the secondary structure cannot provide a positive impact on the prediction performance of the model, we chose not to add it to the final prediction model.

Comparison of Our Work and Others' Works.
Finally, we constructed a hybrid model to predict protein-RNA interaction binding sites by combining IE with 4-mer. Then, the two models constructed by us with IE and IE + 4-mer were compared with the reported tool iDeepS [12]. Table 5 shows the comparisons among the three models. On average, the hybrid model based on IE + 4-mer is superior to the model with IE only. For most datasets, the prediction performance of the hybrid model is almost equal to that of iDeepS. The average AUC of our hybrid model is 0.849, and that of iDeepS is 0.863. However, iDeepS is a tool based on sequence and secondary structure information, whereas our hybrid model uses only sequence information, so it is more concise and practical. The prediction performance of the models constructed with the information entropy and 4-mer features across the 31 original experimental datasets can be seen in Table 6. The detailed comparison results are shown in Table 5.

Conclusions
In this work, we have developed a simpler and more practical model for predicting protein-RNA interaction binding sites. For each binding protein, we merged the original datasets from different experiments, so the model constructed on the merged dataset is more general. We compared the performance of the prediction models built with IE alone, K-mer alone, and RNA secondary structures alone, and we found that the models based on IE alone and on 4-mer alone give satisfactory performance. Then, we constructed a hybrid model based on IE + 4-mer and compared it with the reported tool iDeepS. We find that our hybrid model gives a competitive performance. Moreover, this paper is the first to develop a general model for a specific protein, and our model is based only on sequence information, so it is more feasible than other tools based on RNA structures.
Data Availability
The dataset can be accessed upon request.