NOVOMIR: De Novo Prediction of MicroRNA-Coding Regions in a Single Plant-Genome

MicroRNAs (miRNA) are small regulatory, noncoding RNA molecules that are transcribed as primary miRNAs (pri-miRNA) from eukaryotic genomes. At least in plants, their regulatory activity is mediated through base-pairing with protein-coding messenger RNAs (mRNA) followed by mRNA degradation or translation repression. We describe NOVOMIR, a program for the identification of miRNA genes in plant genomes. It uses a series of filter steps and a statistical model to discriminate a pre-miRNA from other RNAs and does rely neither on prior knowledge of a miRNA target nor on comparative genomics. The sensitivity and specificity of NOVOMIR for detection of premiRNAs from Arabidopsis thaliana is ~0.83 and ~0.99, respectively. Plant pre-miRNAs are more heterogeneous with respect to size and structure than animal pre-miRNAs. Despite these difficulties, NOVOMIR is well suited to perform searches for pre-miRNAs on a genomic scale. NOVOMIR is written in Perl and relies on two additional, free programs for prediction of RNA secondary structure (RNALFOLD, RNASHAPES).


Introduction
MicroRNAs (miRNAs) are genome-encoded single-stranded RNA molecules of ∼22 nt in length, which play a significant role in regulation of gene expression in eukaryotes. Many details on biogenesis and interactions of miRNAs are known (see recent reviews, e.g., [1,2]). Briefly, miRNAs can be encoded by miRNA genes, but also be generated from different RNA transcripts (e.g., from introns of protein-coding genes). Plant and animal miRNAs differ to some extent with respect to biogenesis and structural characteristics but also in their mode of action. In plants, most if not all miRNAs are transcribed from genes by RNA-dependent RNA polymerase II (polII) into primary transcripts called pri-miRNA; these transcripts fold into (possibly imperfect) stem-loop structures. From the pri-miRNA Dicer-like (DCL) enzymes process the stem-loop structure (pre-miRNA), which is usually longer (∼130 nt; see below) than nonplant pre-miRNA (∼86 nt), and finally a miRNA/miRNA * duplex. In the cytoplasm, the miRNA is incorporated into the RNA-induced silencing complex (RISC), and base-pairing of the miRNA with complementary messenger RNA (mRNA) regions leads to mRNA degradation or to inhibition of mRNA translation. Most plant miRNAs base-pair with their respective target mRNAs in the coding region with perfect or near-perfect complementarity leading to cleavage (and degradation) of the mRNAs; animal miRNAs usually base-pair with 3 untranslated regions through imperfect complementarity leading to translation repression.
Finding of miRNA genes either needs costly experimental approaches-for example, genetics, which led to the detection of the first animal miRNAs [3,4], cloning and sequencing of cDNA, or deep sequencing-or computational prediction methods, which facilitate subsequent experimental verification or falsification. The different properties of miRNAs in plants and animals gave rise to different computational approaches (for reviews see [5][6][7]). Most of these tools, however, rely on the following features: the miRNA resides in a stem-loop structure, which possess a high thermodynamic stability and does not contain large internal loops or asymmetric bulges at least in the region of the mature miRNA [8]. In addition, many tools take into account a phylogenetic conservation of the pre-miRNA structure and miRNA sequence, which limits the chance to 2 Journal of Nucleic Acids detect non-conserved, evolutionary new miRNA genes. For example, Dezulian et al. [9] identify plant miRNA homologs in a set of sequences, given a query miRNA, by a sequence similarity search step and a set of structural filters; Pfeffer et al. [10] identify DNA-viral pre-miRNAs, which show neither detectable conservation to other viral pre-miRNAs nor to host pre-miRNAs, by a search for stable stem-loops and scoring of these according to free energy of folding, base composition, and number of base pairs; Wang et al. [11] as well as Jones-Rhoades and Bartel [12] search for putative miRNA/miRNA * complexes in the intergenic regions of Arabidopsis thaliana and filter these according to GC content, mismatches in the stem, conservation in the rice genome, and the characteristic stem-loop structure.
To our knowledge, the only tools for de novo prediction of pre-miRNAs in plants are HHMMiR [13] and triplet-SVM [14]. HHMMiR calculates first the mfe structure of sequence regions (using RNAfold in a scanning window approach with window length of less than 500 nt), extracts stem-loops that possess at least 10 base pairs, a minimum length of 50 nt, a loop of less than 20 nt and no multiloop(s), and finally classifies via a hierarchical hidden Markov model (HHMM). The sensitivity of HHMMiR is published to be 0.865 for Oryza sativa (96 sequences taken from miRBase 5) and 0.973 for A. thaliana (75 sequences). triplet-SVM calculates by RNAfold the mfe structure of sequences, rejects those with junction(s), too few base pairs, and a high free energy (i.e., low structural stability), parses the remaining structures in "triplets" (type of nucleotide plus paired or unpaired state of the nucleotide and its two neighbors), and finally classifies these features with a support vector machine (SVM). The sensitivity of triplet-SVM is published to be 0.948 for Oryza sativa and 0.92 for A. thaliana using the same sequences from miRBase 5 as in the test with HHMMiR.
In the following, we describe our tool, called novoMIR, to detect pre-miRNA and miRNA/miRNA * sequences in a plant genome. For this purpose novoMIR uses a series of filter steps, similar to those mentioned above, followed by a statistical model to discriminate a pre-miRNA from all other RNAs and by another statistical model to locate the miRNA/miRNA * complex in a putative pre-miRNA. Thresholds and statistical values are learned from sets of true positive sequences (plant pre-miRNAs taken from miRBase; [15]) and true non-miRNA sequences (tRNAs, 5 S rRNA, 5.8 S rRNA, mRNAs, etc.). For detection, novoMIR relies neither on comparative genomics nor on prior knowledge of a miRNA target; thus novoMIR allows for searches in single plant genomes as well as in viral or viroid genomes.

Features of Plant
Pre-miRNA. Sequences of plant pre-miRNAs were obtained from different versions of miRBase [15,16]: version 10.0 contains 1,247 sequences; the recent version 14 contains 2,030 sequences. The mean and median length of plant sequences are about (150 ± 73) nt and 130 nt, respectively (see Figure S1 in Supplementary Material available online at doi:10.4061/2010/495904 ); the shortest pre-miRNA is 54 nt in length (miRBase ID: gma-MIR2107) and the longest is 932 nt (cre-MIR916). The mean and median length of nonplant sequences are about (88 ± 14) nt and 86 nt, respectively; the shortest pre-miRNA is 44 nt in length (hsa-mir-1973) and the longest is 215 nt (dmemir-997). That is, most plant pre-miRNAs are longer than animal pre-miRNAs and their size range is more diverse. The sequences of pre-miRNAs and mature miRNAs are slightly enriched in U [17] and U plus G, respectively (see Figure  S2). The four nucleotides are not equally distributed at each position along the miRNA sequences (see Figure S3): for example, a U is the preferred 5 nucleotide ( f 1,U = 0.65), a G on position 8 ( f 8,G = 0.44), and a C on position 19 ( f 19,C = 0.52). The minimum free energy ΔG 0 37 • C of the secondary structures of pre-miRNAs, as calculated by RNAfold [18] using default parameters, is in a wide range due to the different lengths L and G+C contents f GC of the sequences (see Figure S4); normalization of ΔG 0 37 • C to length and f GC [17] results in ΔG 0 37 • C /L = (−0.45 ± 0.12) kcal/mol/nt and ΔG 0 37 • C /L/ f GC = (−1.02 ± 0.26) kcal/mol/nt; the latter value is significantly lower than that of other RNA according to Zhang et al. [17].

Training Data.
We used the 184 pre-miRNAs and mature miRNAs of A. thaliana as listed in miRBase version 10 as the true-positive data set for establishing all thresholds and parameters of novoMIR. Sequences containing nucleotides other than A, C, G, U(T) were discarded. For evaluation of sensitivity we used in addition the plant pre-miRNAs and mature miRNAs from miRBase version 14 (190 from A. thaliana and 1,853 from other plants). The sensitivity of novoMIR was nearly identical for both data sets (and also with sequences from version 14 minus those from version 10; see supplemental Table S1); thus we refrained from training with different data sets.

Test Data.
As the true-negative data set, we assembled RNA sets from the following sources: (vi) 2,760 shuffled pre-miRNA sequences (each of the 184 A. thaliana sequences from miRBase 10 was shuffled 5 times using shuffle [20] preserving (a) the mononucleotide content, (b) mono-and dinucleotide content, and (c) mononucleotide content in a window of 20 nt, resp.) (vii) repetitive genomic elements from A. thaliana from the RepeatMasker library [21] (in total 134,000 nt) (viii) 8,000 pseudohairpin sequences from Homo sapiens [22] Journal of Nucleic Acids 3 (ix) 10,000 pseudohairpin sequences from A. thaliana; these were selected using RNALfold from the TAIR cDNA library [23] to have a minimum stem-loop length of 50 nt in a base pair span of 400 nt (x) 10 × 5, 000 sequences of a length between 80 and 800 nts randomly selected from the five chromosomes of A. thaliana.

Availability and Requirements. novoMIR is written in
Perl and was tested under Linux. It relies on RNAshapes [24,25] and RNALfold [26] (which is part of the Vienna RNA package [18]) for secondary structure calculations.
RNALfold finds subsequences of a long RNA sequence that fold into locally stable (i.e., thermodynamically favorable) RNA secondary structures; the computational effort is O(NL 2 ) with length N of the long RNA sequence and maximal base-pair separation L of the subsequences. For an RNA sequence, RNAshapes computes shapes, which are classes of similar secondary structures, and a representative structure ("shrep") of minimal free energy within each shape.

Algorithm
In the following, we describe the workflow of novoMIR (see supplemental Figure S5).
(1) A typical plant pre-miRNA consists of a relatively short sequence (with median length ∼130 nt and mean length ∼ 150 ± 73 nt) that is able to fold into a stable stem-loop structure. Thus, we search in the genomic sequence for subsequences with locally stable secondary structure(s) via RNALfold. In case the genomic sequence is longer than 1000 nt, we subdivide it into 1000 nt fragments overlapping by 400 nt. We choose a maximal base pair separation L = 400 nt. This limit excludes only a few exceptionally long pre-miRNAs; that is, only 8 of 1356 plant pre-miRNA sequences in miRBase 10 and 14 of 2030 in miRBase 14, respectively, are dismissed due to this restriction for the sake of a fast first step. From the output of RNALfold, the five subsequences with best locally stable structures are treated further as individual sequences.
(2) The original sequence (with length ≤1000 nt) or a subsequence (with length ≤400 nt) selected by RNALfold is discarded if the sequence has a base composition not typical for pre-miRNAs; that is, the sequence is only retained if the fraction of each nucleotide is above 0.1. This filter rejects 9 and 21 plant pre-miRNA sequences from miRBase 10 and 14, respectively.
(3) RNAshapes is used to predict the thermodynamically optimal secondary structure (minimum free energy (mfe) structure with ΔG 0 mfe ) and the optimal secondary structure of up to three shapes with energies less favorable than that of the mfe shape class by 0.1 kcal/mol. The shapes have to differ in their nesting pattern for all loop types but positions of unpaired regions are not of relevance (RNAshapes's option −t 3). In general, it is assumed that the mfe structure of pre-miRNAs is the conformation adequate for further processing by Dicer. In our case, however, we do not know the true 5 and 3 ends; thus, the unrelated termini of the respective sequence, which do not belong to the true pre-miRNA, might cause the pre-miRNA structure to be thermodynamically suboptimal. Moreover, the restriction by RNAshapes to the shrep prediction avoids prediction (and further processing) of the immense number of suboptimal structures.
(4) Any sequence that is not able to fold into a structure (as predicted in step (3)) with ΔG 0 (5) Next, each retained secondary structure is reformatted from the bracket-dot notation used by RNAshapes into an alignment-like format [27] (for an example see Figure 1), which eases handling during the following steps: at each multiloop, the structure is divided into the respective stem-loop structures, which are separately processed further; 5 and 3 dangling ends are removed; a hairpin loop is removed; and asymmetric loops are made symmetric by introduction of gap symbols. Afterwards each (sub)structure consists of the following states: base pairs (match states M symbolized by + + ), loop "pairs" (mismatched states N, − − ), and insertion (I) and deletion (D) states ( − | and | − , resp.).
(6) A stem-loop shorter than 30 states in the alignmentlike format is deleted. For efficiency of this filter, see Figure 2(a).
(7) Next, a window of length 25 states is moved (in steps of (1) state) along the structure in the alignmentlike format, and the fraction of base-paired states is determined for each window. A stem-loop is deleted unless at least a mean fraction of 0.65 basepaired states is present in five different windows, which might overlap. For efficiency of this filter see Figure 2(b).
(8) A stem-loop is deleted if it does not contain a helix with at least 8 consecutive base pairs. For efficiency of this filter see Figure 2(c).
(9) A stem-loop is deleted if the ratio of its sequence length (as predicted by RNALfold) and the length of the stem-loop in the alignment-like format is above 6; that is, the structure contains too many junctions and/or large, unstructured hairpin loops. For efficiency of this filter see Figure 2(d).
(10) If a sequence (and structure) remains after the filter steps, novoMIR decides on its possibility to be a pre-miRNA using a paired Hidden-Markov model identical to that described by Nam et al. [27]. Briefly, a a g a g a a a c g c a a a − g a a a c u g a c a g − a a g a g − a g u g a g c a c a c a a − − − a g g c a a u u u g c a u u c u c u a g u c g u g g c c u u a g a c u g u c u u u c u c g u c a c u c g u g c g u u c u c u u c g u u c − a c g u + + + + + + − − − + + + − − | + + + − + + + + + + + | + + + + + | + + + + + + + + + − + + + | | | + + + + + + − − + + + + + + + + + + − − − + + + − − − + + + − + + + + + + + − + + + + + − + + + + + + + + + − + + + − − − + + + + + + − | + + + +  the joint probability P(x, π) of an observed sequence x and a state sequence π is with transition probabilities T kl = P(π i = l | π i−1 = k) between the four states k, l ∈ {M, N, I, D}, emission probabilities E k (b) = P(x i = b | π i = k) of the different nucleotide and gap pairs b, window size L = 21, and the probability of starting in state k defined as T 0π1 . In contrast to Nam et al. [27], we use four hidden states (is miRNA, is miRNA → is not miRNA,is not miRNA → is miRNA, is not miRNA; see Figure S6). For the decision that the sequence is a pre-miRNA or not, the values for the j ∈ {is miRNA, is not miRNA} states are normalized and summed up The squared ratio as well as the mean of the nine highest values of the difference are compared to thresholds for the pre-miRNA decision.
(11) In case of a positive decision in the previous step, the values that lead to the six highest values of R 2 are predicted as positions of probable miRNA/miRNA * duplices (see Figure 1(c)).

Results and Discussion
Our program novoMIR uses a set of heuristic filters and a statistical model to discriminate a miRNA precursor from all other RNAs (see Figure S5). The data for this model are collected based on a set of true positive sequences (miRNA precursors from A. thaliana as in miRBase 10) and a set of true non-miRNA sequences (for details, see Section 2.3 ).   Table 1. The sensitivity values of novoMIR for A. thaliana pre-miRNA sequences of both miRBase versions are very close to each other (0.837 and 0.832, resp.). The values for all plant pre-miRNA sequences are slightly lower (0.791 and 0.792, resp.), but show no clear trend that sequences of miRBase 14 (not present in miRBase 10) are different from those of miRBase 10 or that sequences from a certain taxonomic group might be different from those of others (see supplemental Table S1).

Test on miRBase
The sensitivity of novoMIR in predicting the position of the miRNA/miRNA * complex is also high (0.73 for A. thaliana and 0.82 for all plants; see Table 1). For this, a position is counted as correctly predicted if it matches exactly the annotated mature miRNA or overlaps by five or fewer nucleotides.

Comparison with Other Tools.
We tested HHMMiR [13] and triplet-SVM [14] for sensitivity with the sequences from miRBase 10 and 14 (see Table 1). Their sensitivity is at maximum 0.15 and 0.45, respectively. The filtering steps of both tools reject already many sequences (HHMMiR more than 80% and triplet-SVM more than 22%). For the sequences remaining after the filtering steps, the sensitivity of the HHMM and SVM is at maximum 0.79 and 0.60, respectively, which is also lower than that of novoMIR with a sensitivity of at least 0.80 (using all filter steps).

Tests on Specificity.
We assembled different data sets to test the specificity of novoMIR. These data sets should not contain any true (pre-)miRNA. For example, we used well-annotated RNAs (mRNA, noncoding RNA) and sets of Table 1: Sensitivity of novoMIR, triplet-SVM [14], and HHMMiR [13] in pre-miRNA prediction for different versions of miRBase. The row "14-10" shows values for sequences from miRBase 14 which are not present in miRBase 10.  Table 2). With all other data sets specificity was from 0.98 up to 1.00.

A Search for Pre-miRNAs in the Genome of Arabidopsis
Thaliana. We wanted to test the program with a more realistic scenario, given the satisfying sensitivity and specificity values of novoMIR with our test data (see Tables 1  and 2). We selected all intergenic and intronic regions of the A. thaliana genome from "The Arabidopsis Information Resource" (TAIR), removed all pre-miRNA sequences, and searched within the remaining sequences for potential pre-miRNAs via novoMIR. novoMIR classified 828 sequences from the 30,413 intergenic sequences and 649 sequences from the 148,558 intronic sequences, respectively, as potential pre-miRNAs. Despite this pleasingly low numbers of hits, however, an interpretation of this outcome is not easy. To get an impression on the hits, we searched with these potential pre-miRNA sequences with BLAST for any annotation and for the miRNA-typical expression pattern in the "Arabidopsis Small RNA Project Database" (ASRP) [28,29]; such a typical expression pattern of a pre-miRNA includes sequences for the miRNA as well as for the miRNA * (for an example see supplemental Figure S7). To our surprise, we detected that some of the predicted candidates are already described as true pre-miRNAs. An example of such a sequence, predicted by novoMIR as a potential pre-miRNA, is located on A. thaliana chromosome 3 in the region between genes At3G09280 and At3G09290. Its secondary structure and its support by expressed small RNAs are shown in Figure 3 and Figure S8, respectively. It is already known as pre-miR2111a [30,31], but not present in miRBase 14. The sequences of the mature miR2111a and of miR2111a * predicted by novoMIR also coincide with the sequences given in [30].
In the following, we mention shortly three further candidate hits, for which we found some support by small-RNA expression in the ASRP but no explicit annotation. One novoMIR hit is located on chromosome 4 between At4G22760 and At4G22770 close to the 3 terminus of the latter, but on the opposite strand; for further details, see Figure 3 and Figure S9. The next hit (see Figure 3 and Figure  S10) is located in between At5G52689 and At5G52690. The last mentioned hit is located in an intron of AT1G01650, which encodes for an aspartic-type endopeptidase/peptidase; the structure of this sequence is shown in Figure 3 and the expression pattern of the genomic region in Figure S11.
Several candidate hits have no support by small RNAs in the ASRP. It is known that many miRNAs are induced by biotic and abiotic stress [36][37][38]. Thus, a lack of small RNAs might either point to a false-positive prediction or to a stress condition not analyzed for expression of small RNAs. Further candidate hits are located in regions showing expression patterns similar to those of repetitive elements. A recently published review [39] discussed the possibility that some miRNAs could be evolved from repetitive genomic elements and/or duplication of genomic regions.

Viroids as Pre-miRNAs?
Viroids are plant-infectious, noncoding, unencapsidated, circular RNAs that are transcribed in a rolling-circle mechanism either in nuclei (Pospiviroidae) or in chloroplasts (Avsunviroidae) of infected plants. Viroids cause the production of viroid-specific small RNAs (vsRNA) similar in size to small interfering (siRNA) and miRNAs, but they do escape the cytoplasmic silencing mechanism. A positive (or negative) novoMIR prediction of viroids as potential pre-miRNAs would point to the genesis of vsRNAs. For further details, see recent reviews [40][41][42][43].
Potato spindle tuber viroid (PSTVd) is the type strain of Pospiviroidae. Because of its high self-complementarity the circular PSTVd RNA folds into a rod-like secondary structure of high thermodynamic stability (see Figure 4). This structure can be divided into five structural domains on the basis of homology between different pospiviroids [34]. Most sequence variants or strains of PSTVd differ by mutations in the pathogenicity-modulating (P) domain and/or variable (V) domain. Only a few nucleotide changes in the P domain are sufficient to exhibit remarkably different symptoms in infected tomato plants Solanum lycopersicon cv Rutgers. If this P domain would be the source of miRNA-like    Journal of Nucleic Acids  Figure 4: Secondary structure of PSTVd and location of miRNA/miRNA * complexes as predicted by novoMIR. The structure scheme is based on a consensus structure of 45 (+)-stranded circular PSTVd sequence variants [32]; the sequence is given for the PSTVd variant Intermediate [33]. The five homology domains of pospiviroids are marked as proposed by Keese and Symons [34]: terminal left and right (TL,TR), pathogenicity-modulating (P), central conserved (C), and variable (V) domain. The transcription start site for (−)-strand synthesis is marked by "polII" [35]. The predicted miRNA/miRNA * complexes are shown as larger italic characters. The complex in the P domain is predicted using default parameters; the complex in the TR domain is additionally predicted with a normalized energy threshold t ΔG/L/ f GC = 0.69 (instead of 0.75).
vsRNAs, these could interfere somehow with the host's metabolism leading to symptom production. For an RNA with PSTVd sequence from positions 263-359/1-96, which is one of the structural elements present during processing of (+)-strand replication intermediates to circles [44], novoMIR predicted miRNA/miRNA * complexes in the P domain of PSTVd; for an RNA from positions 103-255, which is also a structural elements during processing, novoMIR predicted a further miRNA/miRNA * complex in the TR domain, but only after lowering the normalized energy threshold from the default value t ΔG/L/ fGC = 0.75 to 0.69. Both regions are marked by italic characters in Figure 4. novoMIR predicted identical positions for complexes in a full-length, linear PSTVd (1-359). Especially the prediction of vsRNAs derived from the P domain supports an involvement of vsRNAs in symptom production via vsRNA-induced (mis)regulation of plant-endogenous RNAs like mRNAs coding for transcription factors. This hypothesis is supported by deep-sequencing of PSTVd-derived vsRNAs in PSTVd-infected tomato plants (Diermann, Matoušek, Teune, Riesner and Steger, submitted) and sequencing of vsRNAs produced in vitro by DCL processing of PSTVd [45] which showed clusters of vsRNAs derived from the P domain. In contrast, [45,46] found only vsRNAs in PSTVdinfected tomato plants that clustered in regions outside of the P domain. This discrepancy is unresolved but might be based for example on different purification procedures of the vsRNAs.

Conclusion
Plant pre-miRNAs are more heterogeneous in size and structure than animal pre-miRNAs but still show sufficient characteristic features-such as relative thermodynamic stability of their structure, length of helices, and number and size of loops-to be differentiated from other RNAs. Based on several of these features, we developed a series of filter steps and a statistical model that together are able to detect pre-miRNAs with a sensitivity of about 0.8 and a specificity of about 0.99. Thus, the program, which we call novoMIR, is well suited to search on a genomic scale for new pre-miRNAs that are not necessarily evolutionarily conserved. As an example, we searched with novoMIR for pre-miRNAs in nontranslated regions of the A. thaliana genome and detected among the high-scoring sequences experimentally verified pre-miRNAs, which were not annotated in the recent version of miRBase. Additionally, novoMIR recognizes viroids as pre-miRNAs, which supports the hypothesis that viroid-specific small RNAs are generated in a miRNA-like pathway.

Funding
The project was supported by a grant from the German Science Foundation to Detlev Riesner and G. Steger.