Deep Multiple Kernel Learning for Prediction of MicroRNA Precursors

MicroRNAs are a group of noncoding RNAs about 20–24 nucleotides in length. They are involved in the physiological processes of many diseases and regulate gene expression at the transcriptional and post-transcriptional levels. Therefore, the prediction of microRNAs is of great significance for basic biological research and disease treatment. MicroRNA precursors are a necessary stage of microRNA formation. RBF kernel support vector machines (RBF-SVMs) and shallow multiple kernel support vector machines (MK-SVMs) are often used to predict microRNA precursors. However, RBF-SVMs cannot represent richer sample features, and MK-SVMs use only a simple convex combination of a few base kernels. This paper proposes a localized multiple kernel learning model with a nonlinear synthetic kernel (LMKL-D). The nonlinear synthetic kernel was trained by a three-layer deep multiple kernel learning model. The LMKL-D model was tested on 2241 pre-microRNAs and 8494 pseudo hairpin sequences. The experiments showed that the LMKL-D model achieved 93.06% sensitivity, 99.27% specificity, and 98.03% accuracy on the test set. These results indicate that the LMKL-D model can increase the complexity of kernels and, compared with existing methods, predicts microRNA precursors with higher specificity and accuracy. The LMKL-D model provides a reference for further validation of potential microRNA precursors.


Introduction
MicroRNAs are a class of highly conserved endogenous noncoding RNAs with a length of about 20–24 nucleotides. They are single stranded and regulate gene expression at the post-transcriptional or translational level by binding specifically to target messenger RNA [1,2]. Studies have shown that some microRNAs regulate cell proliferation, cell migration, invasion, and immune response [3], and microRNAs also play an important role in inflammatory response [4], neural development, and other processes [5,6]. In organisms, a microRNA is first transcribed by RNA polymerase II into a long initial transcript, the primary microRNA, which is then processed by the Drosha enzyme into a precursor with a length of about 70 nucleotides, that is, the pre-microRNA [3,7]. The pre-microRNA is exported from the nucleus with the help of RanGTP/exportin 5 and is then processed into mature microRNA by the Dicer enzyme in the cytoplasm [8,9]. Mature microRNAs are incorporated into the RNA-induced silencing complex (RISC), which affects the protein abundance of target genes by inhibiting translation or degrading the mRNAs of the target genes.
MicroRNA precursors (pre-microRNAs) can fold into hairpin structures, which are considered the most important indicators of microRNA maturation [10]. Figure 1 shows a pre-microRNA sequence and its hairpin structure. However, many genomic regions contain a large number of nonprecursors with similar hairpin structures, which are called pseudo hairpin sequences [11]. Accurately and effectively identifying microRNA precursors among a large number of candidate hairpin sequences is a challenging task [12]. The methods for finding new microRNAs mainly include biological experimental methods and computational prediction methods [13]. Biological experimental methods are more direct and highly reliable, but the expression level of microRNAs is relatively low, and some microRNAs are only expressed under specific conditions, such as a particular cell type or physiological state of the body. Moreover, due to the high cost and long experimental cycle, it is difficult to replicate microRNAs expressed in a specific tissue and period. Computational prediction methods can identify new microRNAs more efficiently. Machine learning based prediction of microRNAs has been applied in bioinformatics; it can overcome various shortcomings of biological experimental methods, is not affected by expression time, tissue specificity, or expression level, and provides reliable candidates for subsequent biological experiments.
MicroRNA precursors have a unique hairpin structure and are easier to obtain than microRNAs. Thus, computational prediction methods mainly use machine learning to identify microRNA precursors among candidate hairpin sequences. The authors in [14] and [15] proposed a set of novel features and used a support vector machine (SVM) with only an RBF kernel to classify real and pseudo pre-microRNAs, resulting in Triplet-SVM and PremipreD. Because different kernels have different characteristics, a single RBF kernel cannot adequately map the pre-microRNAs to appropriate feature spaces. When the features of the input data contain heterogeneous information [16,17] or the data are nonflat in the high-dimensional feature space [18], it is not reasonable to use a single simple kernel to map all the input data. The authors in [19] used a random forest classifier on shallow features and proposed MiPred, which reached only 91.29% accuracy. The authors in [20] adopted a multiple kernel SVM with different weights and proposed LMKL-MiPred. It shows good accuracy, but only shallow features are used and no deep features of pre-microRNAs were explored. The authors in [21] used a simple three-layer backpropagation neural network and proposed MiRANN. However, when the number of candidate hairpin sequences is limited, three-layer backpropagation neural networks typically do not generalize well, and they can even increase the risk of over-fitting under some conditions. Multiple kernel methods have been successful on small data sets. By mapping the samples into a high-dimensional reproducing kernel Hilbert space, they need very few parameters to let a classifier learn a complex decision boundary. How to determine the basic kernel functions is the key difficulty of multiple kernel learning. Localized multiple kernel learning [22] uses different weights to combine simple basic kernels (linear kernel, polynomial kernel, and RBF kernel) but cannot obtain the deep features of the samples.
This paper presents a localized multiple kernel learning model with a nonlinear deep synthetic kernel (LMKL-D). The deep synthetic kernel was trained by a deep multiple kernel learning model with a tree structure [23]. Neural networks readily obtain deep features by gradient descent. Thus, we adopt the gradient descent approach and use a deep multiple kernel learning model with a tree structure to obtain a nonlinear deep synthetic kernel. We combine kernels at each layer and then optimize over an estimate of the leave-one-out error [23]. Starting from some simple basic kernels, a deep synthetic kernel is obtained after a learning process. The deep synthetic kernel is a complex combination of basic kernels. We then combine the deep synthetic kernel and the simple basic kernels by localized multiple kernel learning. Thus, the LMKL-D model can take advantage of both the shallow and deep features of the input data. As a result, the LMKL-D model can represent more features and obtain better performance than existing SVM methods. The rest of the paper is organized as follows. In Section 2, we introduce dataset selection and feature selection and then provide the background on SVMs, multiple kernel learning, and localized multiple kernel learning. Kernel selection and model selection are also covered in this section. In Section 3, we show the experimental results and comparisons with other methods. Finally, conclusions and future work are given in Section 4.

Biologically Relevant Datasets.
The LMKL-D model proposed in this paper should be able to correctly distinguish pre-microRNAs from pseudo hairpin sequences in the candidate hairpin sequence dataset. Thus, the candidate hairpin sequence dataset has two parts. One part is the positive set of real pre-microRNA sequences. We obtained a total of 4,028 annotated known pre-microRNA sequences spanning 45 species from miRBase 12 [24]. We removed sequences with homology greater than 90% from the original sequences, and finally 2,241 nonhomologous pre-microRNAs were selected as positive sequences. The other part is the negative set of pseudo hairpin sequences. The pseudo hairpin sequences were obtained from the UCSC refGene annotation list [25] and the human RefSeq genes [26]. Pseudo hairpin sequences are sequence fragments that have hairpin structures similar to those of pre-microRNAs but have not been reported as pre-microRNAs. Finally, we selected 8,494 pseudo hairpin sequences from the protein coding region. These sequences are required to be around 90 ribonucleotides long, with a minimum of −15 kcal/mol and a maximum of 18 kcal/mol free energy.
To select a better model, we randomly selected seventy percent of the candidate hairpin sequences as the training set and the remaining thirty percent as the test set. Thus, we randomly selected 1,500 pre-microRNAs and 6,000 pseudo hairpin sequences as the training set. For the test set, 700 of the remaining positive real pre-microRNAs and 2,400 of the remaining negative pseudo hairpin sequences were randomly selected. Both the training set and the test set are normalized.

Feature Selection.
There are many methods to select pre-microRNA features. Traditionally, sequence, secondary structure, and thermodynamic properties are considered. In this paper, we use the dinucleotide frequencies proposed in [27] to characterize sequence and secondary structure properties. Thermodynamic characteristics are also included.
The LMKL-D model assumes that the hairpin structure of each sequence can be individually characterized as a feature vector containing 29 global and intrinsic folding properties [27,28]. Seventeen attributes are the 16 dinucleotide frequencies over A, C, G, and U together with the G + C ratio; three attributes are folding measures, namely, the base pairing propensity, base pair distance, and Shannon entropy; three are thermodynamic properties, namely, the minimum free energy (MFE) of folding, MFE index 1 (MFEI1), and MFE index 2 (MFEI2); one is a topological attribute; and five are normalized variants of the folding measures. The sequence properties and thermodynamic properties can be calculated by ViennaRNA Package 2.0 [29].
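As a rough illustration of the seventeen sequence attributes described above, the following Python sketch computes the 16 dinucleotide frequencies and the G + C ratio of one sequence. The function name `sequence_features`, the percentage scaling, and the conversion of T to U are assumptions made for this example only; the folding, thermodynamic, and topological attributes require a folding tool such as the ViennaRNA Package and are not covered here.

```python
from itertools import product

NUCLEOTIDES = "ACGU"


def sequence_features(seq):
    """Return 16 dinucleotide frequencies (%) and the G+C ratio (%) of an RNA sequence.

    Hypothetical helper: scaling to percentages is an assumption of this sketch.
    """
    seq = seq.upper().replace("T", "U")
    dinucleotides = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]
    counts = dict.fromkeys(dinucleotides, 0)
    total = 0
    # Slide a window of width 2 over the sequence and count valid pairs.
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    features = {
        d: (100.0 * counts[d] / total if total else 0.0) for d in dinucleotides
    }
    gc_count = sum(1 for c in seq if c in "GC")
    features["G+C"] = 100.0 * gc_count / len(seq) if seq else 0.0
    return features


example = sequence_features("GGCUAGCC")
print(len(example), example["GC"], example["G+C"])
```

The resulting dictionary has 17 entries, matching the count of sequence attributes given in the text.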

Kernels and Support Vector Machine.
A kernel is the inner product of mapped inputs. It can be described by the dot product of its two mapping functions as follows [23]:

$$K(x, y) = \langle \phi(x), \phi(y) \rangle, \tag{1}$$

where K(x, y) represents a kernel and ϕ(x) and ϕ(y) represent the mapping functions. The mapping functions ϕ(x) and ϕ(y) are hard to find, but the dot product of the two mapped inputs can be easily calculated through the kernel matrix [30]. We can use the properties of kernels to construct a new kernel that can represent richer features. The new kernel can then map the input data from a low-dimensional, linearly inseparable feature space to a high-dimensional, linearly separable feature space. Synthetic kernels can create different representations of the data using basic kernels.
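The relation between a kernel and its mapping function can be checked numerically on a small case. The sketch below uses the homogeneous degree-2 polynomial kernel on two-dimensional inputs, for which an explicit mapping is known; the kernel choice and all function names are illustrative assumptions and are not the kernels tuned in this paper.

```python
import math


def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: K(x, y) = (<x, y>)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def phi(x):
    """Explicit feature map for poly2_kernel on 2-D inputs."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, y = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly2_kernel(x, y)      # computed without the feature map
explicit_value = dot(phi(x), phi(y))   # computed in the mapped space
print(kernel_value, explicit_value)
```

Both values agree up to floating-point rounding, which is the point of the kernel trick: the inner product in the mapped space is obtained without constructing ϕ explicitly.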
Kernels are usually associated with SVMs. The basic principle of a single kernel SVM is that, for a given dataset with corresponding labels $y_i$ ($y_i = +1$ or $-1$), the SVM finds the separating hyperplane with the maximum margin in the feature space induced by the mapping function ϕ. Equation (2) is the SVM decision function [31]:

$$f(x) = \sum_{i} \alpha_i y_i K_\theta(x_i, x) + b, \tag{2}$$

where $\alpha_i$ are the coefficients to be learned, b is the bias, and $K_\theta(x_i, x)$ is a kernel that depends on a set of parameters θ. Traditionally, the parameters $\alpha_i$ are trained by maximizing the dual objective function [31]:

$$\max_{\alpha} \; \sum_{i} \alpha_i - \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j K_\theta(x_i, x_j) \quad \text{subject to} \quad \sum_{i} \alpha_i y_i = 0, \; 0 \le \alpha_i \le C. \tag{3}$$
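To make the role of the learned coefficients concrete, the following sketch evaluates a kernel SVM decision function of the standard form, a weighted sum of kernel values between the support vectors and the query point plus a bias. The support vectors, coefficients, and bias are hand-picked toy values rather than the output of a trained model, and the function names are hypothetical.

```python
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


def svm_decision(x, support_vectors, labels, alphas, bias, kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(
        a * y * kernel(sv, x)
        for sv, y, a in zip(support_vectors, labels, alphas)
    ) + bias


def svm_predict(x, support_vectors, labels, alphas, bias, kernel):
    score = svm_decision(x, support_vectors, labels, alphas, bias, kernel)
    return 1 if score >= 0 else -1


# Toy model (assumed values): two support vectors on the x-axis.
support_vectors = [(0.0, 0.0), (2.0, 0.0)]
labels = [-1, 1]
alphas = [0.5, 0.5]   # satisfies sum_i alpha_i * y_i = 0
bias = -1.0

print(svm_decision((3.0, 0.0), support_vectors, labels, alphas, bias, linear_kernel))
print(svm_predict((0.5, 0.0), support_vectors, labels, alphas, bias, linear_kernel))
```

With these toy values the decision function reduces to the first coordinate minus one, so the two support vectors sit exactly on the margins at scores −1 and +1.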

Multiple Kernel Learning.
The multiple kernel learning model is a kernel-based learning model with more flexibility. Recent theories and applications have shown that using multiple kernels instead of a single kernel can enhance the interpretability of the decision function and obtain better performance than the single kernel model [31,32]. Multiple kernels can create different representations of the input data using basic kernels. When we combine multiple kernels, for example by taking their sum, we obtain a new kernel that is different from each of them. Moreover, the new kernel has more complicated representations that could not be well captured by a single kernel [33,34].
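In the multiple kernel learning model, $K_\theta$ is considered as a linear convex combination of multiple basic kernels [31]:

$$K_\theta = \sum_{i=1}^{m} \theta_i K_i, \quad \theta_i \ge 0, \quad \sum_{i=1}^{m} \theta_i = 1, \tag{4}$$

where $K_i$ are the basic kernels and m is the total number of basic kernels.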
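A convex combination of kernels can be formed directly on precomputed kernel (Gram) matrices. The sketch below assumes the matrices are given as nested Python lists of equal size; the function name `combine_kernels` and the example matrices are hypothetical.

```python
def combine_kernels(gram_matrices, weights, tol=1e-9):
    """Return the convex combination sum_i theta_i * K_i of Gram matrices."""
    if len(gram_matrices) != len(weights):
        raise ValueError("one weight per kernel is required")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be nonnegative")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must sum to 1")
    n = len(gram_matrices[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, gram_matrices)) for j in range(n)]
        for i in range(n)
    ]


# Two toy Gram matrices on two samples (assumed values).
K_a = [[1.0, 0.0], [0.0, 1.0]]
K_b = [[1.0, 1.0], [1.0, 1.0]]
K_combined = combine_kernels([K_a, K_b], [0.25, 0.75])
print(K_combined)
```

Because each input matrix is positive semidefinite and the weights are nonnegative, the combined matrix is positive semidefinite as well, so it remains a valid kernel matrix.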

Deep Multiple Kernel Learning.
The traditional multiple kernel learning method is just a simple linear combination of a set of basic kernels and cannot represent the deep features of the samples. Thus, we adopt a three-layer multiple kernel learning model to represent the deep features of the samples [23]. The complex combination of basic kernels still satisfies Mercer's condition. A deep multiple kernel model is an n-layer multiple kernel model with m kernels at each layer:

$$K^{(n)} = \sum_{i=1}^{m} \theta_i^{(n)} K_i^{(n)}\left(K^{(n-1)}\right), \tag{5}$$

where $K_m^{(n)}$ represents the mth kernel at layer n with an associated weight parameter $\theta_m^{(n)}$ and $K^{(n)}$ represents the synthetic kernel at layer n. A deep multiple kernel learning model with n layers is shown in Figure 2.
Although the increased complexity of the kernels can increase the risk of over-fitting, Strobl et al. [23] proved that, under some conditions, the upper bound of the generalization error for deep multiple kernels is significantly lower than that for deep feedforward neural networks.
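The layered construction can be sketched on Gram matrices as follows. This is only one possible reading of a deep multiple kernel and not necessarily the exact formulation of [23]: each layer applies a linear, a polynomial, and a Gaussian kernel to the feature space induced by the previous layer's kernel and sums them with weights. The kernel forms, the default parameter values, the fixed weights, and all function names are assumptions of this example; in the paper the weights are learned rather than fixed.

```python
import math


def linear_gram(X):
    """Gram matrix of the plain inner product, used as the input to layer 1."""
    return [[sum(a * b for a, b in zip(x, z)) for z in X] for x in X]


def layer(K_prev, theta, q=2, alpha=1.0, beta=1.0, s=1.0):
    """Weighted sum of three kernels evaluated on the previous layer's kernel."""
    n = len(K_prev)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            lin = K_prev[i][j]
            poly = (alpha * lin + beta) ** q
            # Squared distance in the feature space induced by K_prev.
            sq_dist = K_prev[i][i] - 2.0 * K_prev[i][j] + K_prev[j][j]
            gauss = math.exp(-sq_dist / (s * s))
            K[i][j] = theta[0] * lin + theta[1] * poly + theta[2] * gauss
    return K


def deep_kernel(X, thetas):
    """Stack one layer per weight vector in thetas."""
    K = linear_gram(X)
    for theta in thetas:
        K = layer(K, theta)
    return K


X = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.5)]
thetas = [(1 / 3, 1 / 3, 1 / 3)] * 3   # three layers, fixed toy weights
K_deep = deep_kernel(X, thetas)
print(K_deep[0][1])
```

Applying the first layer to the linear Gram matrix reproduces the ordinary linear, polynomial, and Gaussian kernels on the raw inputs, so a one-layer model with these choices reduces to shallow multiple kernel learning.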

Scientific Programming
The leave-one-out error has shown better accuracy in multiple kernel learning [31]. To determine the weight parameters $\theta_m^{(n)}$, we adopted the concept of the span of support vectors [35]. The main idea of SVMs is to map the input data into a high-dimensional feature space in which a hyperplane can separate them. The hyperplane is constructed by maximizing the margin. It is well known that the error rate of an SVM is bounded by a quantity proportional to $R^2/M^2$, where R is the radius of the smallest ball containing the training data in the feature space and M is the margin. The smaller this bound, the better the SVM performance. However, traditional SVM methods only maximize M, while R may still be very large. To address this problem, we minimize an upper bound on the leave-one-out error, in which R is expressed through the span. The span bound of the leave-one-out error is given in equation (6) [31], where L is the leave-one-out error and $S_p$ is the distance between the point $\phi_{K_\theta}(x_p)$ and the set $\Gamma_p$. $\Gamma_p$ is the set of linear combinations of the support vectors mapped into the feature space. The bound in equation (6) is used to evaluate the performance of the model. In equation (7), c and d are nonnegative arguments, and a regularization term is added to prevent over-fitting [23]. The span bound $T_{span}$ is optimized using the gradient descent method, which gives the deep multiple kernel learning algorithm with the derivative $\partial T_{span} / \partial \theta$. The parameters θ and α are solved alternately: θ is fixed while solving for α, and then α is fixed while solving for θ.
In the deep synthetic kernel proposed in this paper, the number of layers was set to three and each layer contained three kernel functions. The kernel functions of the first layer were the linear kernel, the polynomial kernel, and the Gaussian kernel.

Localized Multiple Kernel Learning.
While a single kernel function has only one characteristic, multiple kernel learning (MKL) gains flexibility by choosing a combination of basic kernels. However, multiple kernel learning assigns each kernel a single weight that is the same over the whole input space. The localized multiple kernel learning (LMKL) algorithm uses a gating model to locally select the appropriate weight for each basic kernel. Compared with MKL, LMKL can therefore select weights that suit different regions of the dataset. Experimental results on bioinformatics datasets show that LMKL with the gating model has better accuracy than the model with a single kernel [22]. Equation (8) gives the decision function for LMKL [22]:

$$f(x) = \sum_{i=1}^{N} \sum_{m=1}^{P} \alpha_i y_i \eta_m(x) K_m(x, x_i) \eta_m(x_i) + b, \tag{8}$$

where N is the number of samples, P is the number of basic kernels, b is the bias, $K_m$ is the mth basic kernel, and $\eta_m(x)$ is the gating model. For an input sample x, the gating model chooses feature space m as a function of x. $\eta_m(x)$ can be learned from the sample datasets. The gating model $\eta_m(x)$ is defined as follows [22]:

$$\eta_m(x) = \frac{\exp\left(\langle v_m, x \rangle + v_{m0}\right)}{\sum_{k=1}^{P} \exp\left(\langle v_k, x \rangle + v_{k0}\right)}, \tag{9}$$

where $v_m$ and $v_{m0}$ are the parameters of the gating model and the softmax guarantees nonnegativity. By modifying equation (8) with the selection function, we get the following optimization problem [23]:

$$\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K_\eta(x_i, x_j) \quad \text{subject to} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \; 0 \le \alpha_i \le C, \tag{10}$$

where $K_\eta(x_i, x_j)$ is defined as

$$K_\eta(x_i, x_j) = \sum_{m=1}^{P} \eta_m(x_i) K_m(x_i, x_j) \eta_m(x_j). \tag{11}$$

Here, $K_\eta$ is positive semidefinite [23].
The derivatives of equation (10) are taken with respect to $v_m$ and $v_{m0}$, and gradient descent is then used to train the gating model. At each step, we first fix $\eta_m(x)$ and solve a canonical multiple kernel SVM dual problem, and then we update the parameters of the gating model with gradient descent.
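The gating model and the locally combined kernel described above can be sketched in a few lines. The example assumes a softmax gating function that is linear in the input and a locally weighted kernel in which each basic kernel is multiplied by the gate values of both samples; the parameter values and function names are hypothetical, and the training of the gating parameters by gradient descent is not shown.

```python
import math


def gating(x, V, v0):
    """Softmax gate: one nonnegative weight per kernel, summing to 1."""
    scores = [sum(vk * xk for vk, xk in zip(v, x)) + b for v, b in zip(V, v0)]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_kernel(xi, xj, kernels, V, v0):
    """K_eta(xi, xj) = sum_m eta_m(xi) * K_m(xi, xj) * eta_m(xj)."""
    eta_i = gating(xi, V, v0)
    eta_j = gating(xj, V, v0)
    return sum(ei * k(xi, xj) * ej for ei, k, ej in zip(eta_i, kernels, eta_j))


def linear(x, z):
    return sum(a * b for a, b in zip(x, z))


def constant(x, z):
    return 1.0


# With all gating parameters at zero, every kernel receives weight 1/P everywhere.
V = [(0.0, 0.0), (0.0, 0.0)]
v0 = [0.0, 0.0]
value = local_kernel((1.0, 2.0), (3.0, 4.0), [linear, constant], V, v0)
print(gating((1.0, 2.0), V, v0), value)
```

Nonzero gating parameters make the weights depend on the sample, which is what distinguishes the localized model from a single global convex combination.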

Kernels Selection and Model Selection.
Traditional multiple kernel learning methods only select a few simple basic kernels, such as the linear kernel ($K_L$), the polynomial kernel ($K_P$), and the Gaussian (or RBF) kernel ($K_G$). In our proposed model, we selected these three simple basic kernels together with the deep synthetic kernel proposed above ($K_D$) as the localized multiple kernel learning combination. The resulting LMKL-D model is shown in Figure 3. $K_D$ was obtained by the deep multiple kernel learning (DMKL) model. The formulas for the three simple basic kernel functions follow [20] and are given in equation (12). We used grid search to find the parameters of the simple basic kernels, and the parameters with the highest accuracy were adopted. Finally, the Gaussian kernel parameter s was set to 1 and the polynomial kernel exponent q was set to 2, while α and β were both set to 1.
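For orientation, the three simple basic kernels can be written in the commonly used forms below with the reported parameter values s = 1, q = 2, and α = β = 1. The exact forms used in the paper are those of [20]; in particular, the placement of α and β inside the polynomial kernel and the scaling of the Gaussian kernel by s squared are assumptions of this sketch, as are the function names.

```python
import math


def k_linear(x, z):
    return sum(a * b for a, b in zip(x, z))


def k_poly(x, z, q=2, alpha=1.0, beta=1.0):
    # Assumed form: (alpha * <x, z> + beta) ** q
    return (alpha * k_linear(x, z) + beta) ** q


def k_gauss(x, z, s=1.0):
    # Assumed form: exp(-||x - z||^2 / s^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (s * s))


def gram(kernel, X):
    """Kernel matrix of all pairs of samples in X."""
    return [[kernel(x, z) for z in X] for x in X]


X = [(1.0, 0.0), (0.0, 1.0), (1.0, 2.0)]
for name, kernel in [("K_L", k_linear), ("K_P", k_poly), ("K_G", k_gauss)]:
    print(name, gram(kernel, X)[0])
```

Each Gram matrix produced this way can serve as one of the basic kernels in the localized combination.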
We used multiple kernel learning models to obtain $K_D$. Multiple kernel learning models often use linear kernels, Gaussian kernels, and polynomial kernels to map input data into feature spaces. Since the DMKL model should try to maximize the upper bound of the final kernel to increase its richness with each layer, we combined the linear kernel, polynomial kernel, and Gaussian kernel into one set of kernels. Following [23], the number of layers for the DMKL was set to 3. $K_D$ was trained on the training set and tested on the test set. We used leave-one-out validation and the minimum value of the span to evaluate $K_D$, and the $K_D$ with the minimum span value was adopted. To find better performance, the penalty parameter C was searched over $\{10^{-6}, 10^{-5}, \ldots, 10^{-1}, 1, 10\}$ and the learning rate was searched over the same set $\{10^{-6}, 10^{-5}, \ldots, 10^{-1}, 1, 10\}$. After training and testing, we obtained $K_D$. In the end, we chose four kernels, $K_L$, $K_P$, $K_G$, and the best $K_D$, as the final basic kernels of localized multiple kernel learning.
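The search over the penalty parameter and the learning rate amounts to enumerating a grid of powers of ten and keeping the pair with the lowest criterion value. The sketch below shows only this bookkeeping; `toy_span` is a made-up stand-in for the span value of a trained model, which in practice would require training the DMKL model for each pair.

```python
import math
from itertools import product

# Powers of ten from 1e-6 to 10, as described in the text (8 values each).
grid_values = [10.0 ** k for k in range(-6, 2)]
param_grid = list(product(grid_values, grid_values))   # (C, learning_rate) pairs


def select_best(grid, criterion):
    """Return the parameter pair with the smallest criterion value."""
    return min(grid, key=lambda pair: criterion(*pair))


def toy_span(C, learning_rate):
    # Hypothetical criterion with its minimum at C = 1e-2, learning_rate = 1e-3.
    return abs(math.log10(C) + 2.0) + abs(math.log10(learning_rate) + 3.0)


best_C, best_lr = select_best(param_grid, toy_span)
print(len(param_grid), best_C, best_lr)
```

With eight candidate values for each parameter, the full grid contains 64 combinations.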
For model selection, the dataset selection operations were repeated three times, and the average value of the results on the test set was taken as the final performance of the model. Thus, for each training and test run, the training set had 7,500 samples in total and the test set had 3,100 samples in total. For the DMKL model, we used the LIBSVM [36] package to solve the SVM optimization problem. For localized multiple kernel learning, we used SMO to speed up the SVM optimization [20].
We compare the LMKL-D model with triplet-SVM [14], miPred [27], MiPred [37], and a three-layer backpropagation neural network (BPNN). The results are shown in Figure 4 and Table 1. Ultimately, the LMKL-D model obtained an accuracy of 98.03% on the test set, while triplet-SVM, miPred, MiPred, and the three-layer BPNN obtained accuracies of 83.90%, 93.50%, 91.29%, and 95.18%, respectively.

Comparison with Other Classification Methods.
As shown in Figure 4 and Table 1, LMKL-D has the best specificity (SP, 99.27%), the best geometric mean (GM, 96.11%), and the best accuracy (ACC, 98.03%), which means it can better distinguish real pre-microRNAs from pseudo hairpin sequences. Since a large number of gene sequences with hairpin structures have to be screened, higher specificity helps filter out the pseudo hairpin sequences. LMKL-D also achieved the highest geometric mean (96.11%) among these methods, which means that the LMKL-D model can achieve high performance while maintaining stability. The AUC (area under the curve) of 0.9611 (see Figure 5) indicates that the LMKL-D model can predict pre-microRNAs accurately. Since the ratio of pre-microRNA sequences to pseudo hairpin sequences is about 1 to 4, the sensitivity (SE) of the LMKL-D model may be lowered by this imbalance. In future work, we need to find new methods to deal with class imbalance. These data indicate that the proposed localized multiple kernel learning with the deep synthetic kernel can increase classification accuracy with a low risk of over-fitting and has a more accurate and stable ability to identify new microRNA precursors in many species.
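The four reported measures follow from the confusion matrix in the usual way: sensitivity and specificity are the recall on positive and negative samples, the geometric mean is the square root of their product, and accuracy is the fraction of all samples classified correctly. The sketch below uses made-up confusion counts for illustration and then checks that the reported geometric mean is consistent with the reported sensitivity and specificity; the function name is hypothetical.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, geometric mean, and accuracy."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    gm = math.sqrt(se * sp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    return {"SE": se, "SP": sp, "GM": gm, "ACC": acc}


# Hypothetical confusion counts, not taken from the paper.
toy = classification_metrics(tp=90, fn=10, tn=380, fp=20)
print(toy)

# Consistency check on the reported values: GM = sqrt(SE * SP).
reported_gm = math.sqrt(0.9306 * 0.9927)
print(round(100 * reported_gm, 2))
```

The geometric mean is useful on imbalanced data such as this because, unlike accuracy, it drops sharply when either class is poorly recognized.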

Comparison with Localized Multiple Kernel Learning.
To better evaluate our LMKL-D model, we also compared LMKL-D with LMKL. The LMKL-D model used four basic kernels, $K_L$, $K_P$, $K_G$, and $K_D$. The LMKL model used three basic kernels, $K_L$, $K_P$, and $K_G$. $K_L$, $K_P$, and $K_G$ had the same parameters in the two models. The penalty parameter C was fixed at 0.035. The results on the test set are shown in Figure 6. The performance of the two models on the training and test sets is shown in Table 2.
As Figure 6 and Table 2 show, although LMKL-D has slightly lower specificity than LMKL, LMKL-D is always better than LMKL on sensitivity, accuracy, and AUC. On geometric mean, LMKL-D is also always higher than LMKL, which means that LMKL-D is more stable than LMKL. We can conclude from Figure 6 and Table 2 that LMKL-D has better sensitivity, geometric mean, and accuracy than LMKL. For specificity, the two models have similar performance. The results show that in terms of correctly and stably identifying pre-microRNA, the LMKL-D is more

[Figure: diagram labels: Input Data; Layer 1, Layer 2, ..., Layer n; Localized multiple kernel learning model]

Conclusions
In this work, we have proposed a localized multiple kernel learning model with a three-layer deep synthetic kernel to improve on the pre-microRNA prediction accuracy of existing methods. The experiments show that our proposed model yields better predictive performance and is more stable than existing classifiers for identifying known pre-microRNAs. After being trained on the hairpin sequence training set, the LMKL-D model obtains 93.06% sensitivity, 99.27% specificity, 96.11% geometric mean, and 98.03% accuracy on the test set. By applying a deep architecture to localized multiple kernel learning, we found that the LMKL-D model is both useful and reliable, as demonstrated in the results above.
The LMKL-D model was examined by comparison with triplet-SVM, miPred, MiPred, and a three-layer backpropagation neural network. We also compared the LMKL-D model with the LMKL model. Our results show that it is more effective than the multiple kernel learning model with simple basic kernels. The three-layer deep synthetic kernel can indeed increase the richness of kernels and represent deep features, while the LMKL-D model as a whole uses both shallow and deep features. The number of pseudo hairpin sequences in nature is much larger than the number of known pre-microRNAs, so there are always more negative samples than positive samples. Solving the problem of sample imbalance and exploring more classification methods remain challenging tasks for future work.
Data Availability
The known pre-microRNA sequence data can be downloaded from the miRBase website at http://www.mirbase.org; the UCSC refGene annotation list and the human RefSeq genes are available through https://www.ncbi.nlm.nih.gov/refseq/.

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of this paper.