Utilizing Selected Di- and Trinucleotides of siRNA to Predict RNAi Activity

Small interfering RNAs (siRNAs) induce posttranscriptional gene silencing in various organisms. siRNAs targeted to different positions of the same gene show different effectiveness; hence, predicting siRNA activity is a crucial step. In this paper, we developed and evaluated a powerful tool named “siRNApred” with a new mixed feature set to predict siRNA activity. To improve the prediction accuracy, we proposed 2-3NTs as our new features. A Random Forest siRNA activity prediction model was constructed using the feature set selected by our proposed Binary Search Feature Selection (BSFS) algorithm. Experimental data demonstrated that the binding site of the Argonaute protein correlates with siRNA activity. “siRNApred” is effective for selecting active siRNAs, and the prediction results demonstrate that our method can outperform other current siRNA activity prediction methods in terms of prediction accuracy.


Introduction
RNA interference (RNAi) is a cellular process whereby double-stranded RNA (dsRNA) leads to posttranscriptional gene silencing through base-pairing interactions and is found in many eukaryotic systems, including plants, fungi, invertebrates, and mammals [1][2][3][4]. In mammalian cells, long dsRNA is processed into short 21-23 nucleotide (nt) dsRNAs known as small interfering RNA (siRNA) and induces instant target gene knockdown [3]. In functional genomic research, RNAi has become very helpful in drug and therapeutic applications [5]. Highly effective siRNAs can be synthesized to design novel drugs for influenza virus [6], HIV virus [7], and cancer [8]. However, Takayuki measured the RNAi activities of siRNAs targeting all positions of a single mRNA in human cells and found that few siRNAs show very high activities [9]. Therefore, predicting siRNA activity is a critical step for the successful implementation of RNAi.
Numerous siRNA-designing algorithms, which can be generally categorized as first-and second-generation algorithms, have been reported to date. The first-generation algorithms are based on small validated siRNA datasets and exploit multiple siRNA features, including GC content [10], base preferences at specific positions [11,12], thermodynamic stability [13], internal structure [14], and target mRNA secondary structure [15][16][17]. However, a large majority of siRNAs designed by the first-generation algorithms are not very effective [18]. The reason may be that the early datasets are too small to cover all the important features [19].
The second-generation algorithms were developed with the accumulation of validated siRNAs. Huesken developed "Biopredsi" [20] based on artificial neural network and built a major siRNA dataset including 2431 siRNAs through highthroughput analysis technology. A number of siRNA activity prediction algorithms based on machine learning models were built using Huesken's dataset. The algorithms Thermo-Composition21 [21], DSIR [22], -score [23], and Biopredsi were estimated as the best predictors [24]. In addition, Takayuki et al. proposed a complete dataset including the siRNAs targeting all positions of a single mRNA in human cells and developed an algorithm "siExplored." They found that specific residues at every third position of siRNAs greatly influenced its RNAi activity [9].
The performance of second-generation algorithms heavily depends on the selection of the included features [25]. Because the siRNA sequence is the most important factor that determines RNAi activity, more potential features embedded in siRNA sequences should be exploited to increase prediction accuracy. Takahashi found that when the 2-3 bp RNA at every position of a siRNA sequence were substituted by DNA, the RNAi activity changed [26]. Thus, we consider that the diand trinucleotides at certain positions of siRNA may correlate with its RNAi activity.
In this paper, we developed a powerful siRNA activity predictor by fusing multiple potential features. Our experimental results demonstrate that siRNA activity is significantly affected by its di-and trinucleotides; thus, we proposed 2-3NTs as our new features. In addition, a new mixed 230dimensional feature set was formed by combining 191 traditional features and 39 new features. To select the most relevant features, we proposed a Binary Search Feature Selection (BSFS) algorithm. Finally, a Random Forest predictor is constructed using the selected features. At the same time, a user-friendly web server named siRNApred is developed and is available for free at http://www.jlucomputer .com:8080/RNA/. siRNApred showed better performance compared with first-generation and second-generation algorithms. The result suggests that the di-and trinucleotides of siRNA can provide important information for prediction of active siRNAs.

The Importance of the Di-and Trinucleotides of siRNA.
In this section, we first elucidated the importance of our proposed di-and trinucleotides of siRNA on its activity. The di-and trinucleotides of siRNA can be defined as follows: The guide strand of siRNA = 1 , 2 , . . . , , . . . , 21 , where 1 ≤ ≤ 21.  All di-and trinucleotides at all positions of siRNA are obtained by a sliding window size of 2-3. Huesken's dataset is divided into two classes: 1218 potent siRNAs with activities greater than 0.7 and 1213 nonpotent siRNAs with activities less than 0.7.
There are 16 2-mer RNA subsequences, that is, AA, AU, etc., and the frequencies of all 2-mer RNA subsequences at positions 1 to 20 are calculated for the two classes. The significance level is calculated by Student's -test and the 2mer RNA subsequences with minimal value are shown in Table 1 ( -value < 0.05). Table 1 shows that the 2-mer RNA subsequences that appeared most often as potent were different than those that appeared most often as nonpotent siRNAs. We found that "UU" occurred more often than other 2-mer RNA subsequences in potent siRNAs, whereas "GG" and "CC" appeared most often in nonpotent siRNAs. Most of the "UU" 2-mers were found at positions 1, 4, 6, and 7 of potent siRNAs. In nonpotent siRNAs, "GG" often occurred at positions 1, 13, 14, 15, and 16 and "CC" often occurred at positions 3, 4, 5, 6, and 20.
There are 64 3-mer RNA subsequences, that is, AAA, AAU, etc. In addition, the frequencies of all 3-mer RNA subsequences at positions 1 to 19 are calculated for the two classes. The significance level is calculated by Student's -test and the 3-mer RNA subsequences with minimal value are shown in Table 2 ( value < 0.05).
The results demonstrate that di-and trinucleotides of siRNAs at certain positions can be used as indicators to distinguish between potent siRNAs and nonpotent siRNAs and can possibly be used as a potential feature for siRNA activity prediction.

Feature Extraction.
A total of 230 features are extracted in this section for siRNA activity prediction. These features include 2-3NTs, thermodynamic stability, nucleotide representation, and nucleotide compositions.
The categorical feature ( position position+1 ) is calculated using the following formula:   Computational and Mathematical Methods in Medicine Then, the feature vector 3NT , which includes 19 categorical features, is extracted from the trinucleotides of siRNA as follows: The categorical feature ( position position+1 position+2 ) is calculated using the following formula: 2.3.2. Thermodynamic Stability. The thermodynamic stability of siRNA may influence the strand selection in the process of RNAi; thus it would influence the RNAi activity [23]. Δ duplex is the sum of all the siRNA local duplex stability. The siRNA local duplex stability is calculated for every two base pairs along the siRNA duplex and the thermodynamic parameters for calculations were supplied by Xia et al. [29]. The ΔΔ is the Δ difference of duplex formation at the 5 and 3 ends of siRNA for 5 terminal nucleotides.

Nucleotide Representation.
Preferred nucleotides at specific positions are important indicators for activity prediction [21]. For example, the nucleotides at the first position of potent siRNAs were most often or , while often appeared at positions 7 and 11 in nonpotent siRNAs [11,20]. We defined the siRNA as a 21-dimensional vector and indicated the nucleotides at all positions. , , , and were digitized as 0.1, 0.2, 0.3, and 0.4.

Nucleotide
Compositions. The compositions of short motifs of 1-3 nt in siRNA and mRNA contained relevant information for activity prediction [30,31]. There are 4, 16, and 64 possible subsequences for all 1-mer, 2-mer, and 3mer RNAs, respectively. Thus, there are 168 features extracted from nucleotide compositions.

Model Construction.
Random Forest (RF) [32] is an ensemble learning method for classification and regression by growing a collection of trees. In the process of regression, the trees are constructed using a training set with variables. variables from these input variables are selected for the construction of an individual tree. The mean prediction of the individual tree will be output when the testing samples are pushed down these trees. Because the RF algorithm can randomly select features to build the ensemble of trees, it has stronger robustness than other methods. In this paper, the RF algorithm was used to develop siRNA activity prediction model.

Feature Selection.
We combined 39 2-3NTs, 2 thermodynamic stabilities, 21 nucleotide representations, and 168 nucleotide compositions to obtain a 230-dimensional feature vector. Since the contributions of these features are different, we proposed BSFS algorithm based on RF-variable importance to select the optimal feature set. The process of the algorithm is shown as follows.
Firstly, all features are ranked in descending order according to its -score. The -score is calculated by the RF algorithm to measure the feature importance [32]. To get the -score, Variable Importance (VI) should be first calculated. VI of the th variable was calculated according to the mean decrease in classification accuracy after permuting values of variable over all trees. The VI( ) of each tree is computed as follows: Input: A data set = {( ( ), )} 1 , where ( ) = { 1 , 2 , . . . , } is the feature set extracted from siRNA sequence and is the experimentally determined siRNA activities. The features of are first sorted by the variable importance -score in descending order. The initial value of min and max are 1 and , respectively. Output: optimal features ( ) = { 1 , 2 , . . . , }.
The dataset is divided into ten parts. Nine parts are used as the training set and the rest are used as a testing set. We build a Random Forest model using the feature set ( ) and the training set and then predict the testing siRNAs using the model. The correlation coefficient between the observed and predicted siRNA activities is Corr1. where ( ) is OOB samples of tree .
Please note that if is not in the tree , then VI ( ) ( ) = 0. Over all trees, VI( ) is defined as follows: where tree is the number of trees in the Random Forest. Finally, the -score of the th feature is defined as follows: wherêis the standard deviation of the raw importance. Secondly, the first features are selected as the optimal features. Set < and the calculation process of threshold is summarized in Algorithm 1.

Model Performance Evaluation.
As a validation step, we used the Pearson Correlation Coefficient (PCC) to describe the correlation between experimentally determined and predicted siRNA activity. It may be defined as follows: where is the sample size and and are the average value and standard deviation, respectively.
In addition, the Receiver Operating Characteristic (ROC) curve is applied to illustrate the performance of a binary classifier system by plotting sensitivity ( axis) against 1 − specificity ( axis) at various threshold settings.
where TN is the number of true negatives, FN is the number of false negatives, TP is the number of true positives, and FP is the number of false positives. The area under the ROC curve (AUC) is a single measurement of the algorithm's overall performance, and AUC of 1 and 0.5 represents perfect classification and random classification, respectively.

Performance of the 2-3NTs Features.
To investigate the importance of di-and trinucleotides of siRNA, we learn two RF regression models trained using Huesken train and tested on Huesken test. "model 1" is constructed with 2 thermodynamic stabilities, 21 nucleotide representations, and 168 nucleotide compositions, which are often used for siRNA activity prediction [24]. Then, "model 2" which extended "model 1" by considering 39 2-3NTs was constructed for comparisons.
The experimental prediction results are shown in Figure 1, and the PCC between the observed and predicted siRNA activities for model 1 and model 2 are 0.671 and 0.704, respectively. The prediction efficacy achieved 4.92% improvement after adding the new proposed features. It validates that 2-3NTs are important features for the prediction of siRNA activity.

Feature Selection Result.
The optimal feature set is obtained by our proposed BSFS algorithm. The details of this algorithm are shown in Section 2.5. Table 3 shows the threshold " " and the prediction accuracy "PCC" of our model with the top features for all steps. The results show that, when = 57, the PCC of our model reaches a maximum of 0.722. Thus, we choose = 57 as the threshold of the feature selection algorithm.
As shown in Figure 2, 57 features are selected by the BSFS algorithm and ranked in descending order according to -score. The higher the -score, the stronger the predictive ability of the feature. There are ten features proposed by our paper in the selective feature set, including the trinucleotides at positions 1, 2, 7, 18, and 19 and the dinucleotides at positions 1, 2, 8, and 19. Significantly, Takahashi noted the terminal bps of RNA (positions [19][20][21] provide Argonaute protein binding sites [26]. Our results show that "CUG" occurred most often at this position in potent siRNAs. The Argonaute protein is the endonuclease of RNA-induced silencing complexes (RISC) and cleaves the target mRNA whose sequence is complementary to the guide strand of siRNA [26]. We consider that, because the trinucleotide at position 19 is the binding site of the Argonaute protein, it will influence siRNA activity. However, further experiments are needed to validate if the Argonaute protein prefers to bind to potent siRNAs with specific trinucleotides at position 19. Some other features previously proven to be associated with silencing efficacy are selected, including the nucleotides at positions 1, 2, 7 and 19; thermodynamic stability Δ duplex and ΔΔ ; and U%, GGG%, C%, G%, CC%, GG%, GGC%, UGA%, CG%, GCC%, UC%, ACU%, UUC%, AA%, UU%, CGG%, AUG%, AG%, and AGA% of siRNA; AAU%, UUG%, GGG%, AAA%, ACA%, GU%, GCA%, CGU%, GCU%, CU%, GC%, CCG%, AGU%, CGA%, UA%, AU%, UAU%, UAA%, CUC%, GCG%, CUU%, AUU%, and CAU% of mRNA. Graphical boxplots are shown in Figure 3 to display the spread of potent and nonpotent siRNAs for the top 15 features.

Comparison of Algorithms.
After finding the optimal feature set, the final model, siRNApred, was created. The parameters and are the number of decision trees to be grown in the forest and the number of variables to split  at each node, respectively. The default and are 500 and /3. is the number of features. To find the optimal parameters, we used a grid search method with the step size of 100 and 1. The final results are = 1000 and = 24. The PCC between the observed and predicted siRNA activities of our model with these parameters is 0.722, which is a 1.7% improvement compared to the model with default parameters. However, the results are not sensitive to over the range 24-30 according to our experimental results.
To test the performance of siRNApred, we compared our model with the most state-of-the-art methods for siRNA activity prediction recently reported in the literature. Two experiments were carried out in the same conditions and the comparative evaluation is as follows.
First, our method was compared with Biopredsi [20], -score [23], ThermoComposition-21 [21], and DSIR [22]. All the algorithms were trained using Huesken train and tested on Huesken test. Table 4 shows that the PCC between observed and predicted siRNA activities of our model tested on Huesken test is 0.722, which is 9.39%, 10.39%, 9.56%, and 7.76% higher than the other four algorithms.
In addition, the ROC curves combining both sensitivity and specificity of the five methods are plotted (Figure 4). For ROC analysis, siRNAs that produce at least 70% target gene knockdown were accepted as active siRNAs, and those below 70% were considered inactive siRNA. We calculated an AUC of 0.898 for our model, which is better than those obtained from Biopredsi, -score, ThermoComposition-21, and DSIR.
In siRNA design, more inactive siRNAs predicted as active siRNAs will increase the experimental cost, so siRNA design tools are expected to be capable of rejecting as many false positives as possible and retain the maximum number of true positives. Consequently, we should focus on the area that has higher specificity and compare the sensitivities among different algorithms in this area. Figure 4 shows that in the higher specificity area, siRNApred outperforms all other algorithms. Table 5 shows two group sensitivities of all the algorithms. When the specificity of all algorithms is 96.5%, the sensitivity of our method is 51.9%. The value is higher than Biopredsi, -score, ThermoComposition-21, and DSIR, which is 16.3%, 24.4%, 28.9%, and 20%, respectively. Our model also performs best when the specificity of all the algorithms is 99.1%. The results demonstrate that our method had more advantages than the other four algorithms for siRNA design.
A second experiment was conducted to compare our model with the other nine models, including the firstgeneration siRNA design algorithms Reynolds [11], Ui-Tei [14], Amarzguioui [12], Katoh [9], Hsieh [33], and Takasaki [34] and the second-generation algorithms Biopredsi, -score, ThermoComposition-21, and DSIR. All the algorithms were trained on Huesken train and tested on the three independent datasets of Vickers, Reynolds, and Harborth. Figure 5 shows that siRNApred achieves the highest PCC compared to all nine models on all three independent testing datasets and obtained a higher AUC except when tested on Vickers' dataset. Otherwise, siRNApred produces more stable results across each of the independent siRNA datasets. In addition, the results show that both the PCC and AUC of the first-generation siRNA design algorithms are lower than the second-generation algorithms. It was found that siRNApred is more stable and effective than other models in the two experiments. The reason may be that our model takes account into the influence of di-and trinucleotides and removes several redundant features. The comparison results demonstrated that prediction accuracy can be improved significantly when considering the 2-3NTs of siRNA guide strand.

Conclusions
Activity prediction of siRNA is a critical step for the successful implementation of RNAi. In this study, we introduced 2-3NTs as our new features. A new mixed 230-dimensional feature set was formed by combining 191 traditional features and our 39 proposed features. Since there were many potential features, the BSFS method based on RF-variable importance was proposed to select the optimal feature set. A total of 57 features were selected as input vectors of the RF model to predict siRNA activity, and nine of our proposed features were included. Significantly, the trinucleotide motif at position 19 was included in the selected feature set, which is the binding site of the Argonaute protein. We found that "CUG" occurred most often at position 19 of potent siRNAs. Further experiments are needed to validate if the Argonaute protein prefers to bind to potent siRNAs possessing a specific trinucleotide at position 19. Finally, we describe a highly accurate and reliable tool called "siRNApred." It can design effective siRNAs for an input mRNA using an optimal feature set. The experimental comparative evaluation on commonly used datasets showed that siRNApred produced better results than first-generation and second-generation siRNA design methods. Consequently, we consider siRNApred a worthy tool for efficient siRNA design.