MicroRNAs (miRNA) are small regulatory, noncoding RNA molecules that are transcribed
as primary miRNAs (pri-miRNA) from eukaryotic genomes. At least in plants, their
regulatory activity is mediated through base-pairing with protein-coding messenger RNAs
(mRNA) followed by mRNA degradation or translation repression.
We describe novoMIR, a program for the identification of miRNA genes in plant
genomes. It uses a series of filter steps and a statistical model to discriminate a pre-miRNA
from other RNAs and relies neither on prior knowledge of a miRNA target nor on
comparative genomics. The sensitivity and specificity of novoMIR for detection of pre-miRNAs
from Arabidopsis thaliana are ~0.83 and ~0.99, respectively. Plant pre-miRNAs
are more heterogeneous with respect to size and structure than animal pre-miRNAs. Despite
these difficulties, novoMIR is well suited to perform searches for pre-miRNAs on a
genomic scale.
novoMIR is written in Perl and relies on two additional, free programs for prediction
of RNA secondary structure (RNALfold, RNAshapes).
1. Introduction
MicroRNAs (miRNAs) are genome-encoded single-stranded RNA molecules of ~22 nt in length, which play a significant role in regulation of gene expression in eukaryotes. Many details on biogenesis and interactions of miRNAs are known (see recent reviews, e.g., [1, 2]). Briefly, miRNAs can be encoded by miRNA genes, but can also be generated from other RNA transcripts (e.g., from introns of protein-coding genes). Plant and animal miRNAs differ to some extent with respect to biogenesis and structural characteristics but also in their mode of action. In plants, most if not all miRNAs are transcribed from genes by DNA-dependent RNA polymerase II (polII) into primary transcripts called pri-miRNA; these transcripts fold into (possibly imperfect) stem-loop structures. Dicer-like (DCL) enzymes process the pri-miRNA first into the stem-loop structure (pre-miRNA), which is usually longer (~130 nt; see below) than nonplant pre-miRNA (~86 nt), and finally into a miRNA/miRNA* duplex. In the cytoplasm, the miRNA is incorporated into the RNA-induced silencing complex (RISC), and base-pairing of the miRNA with complementary messenger RNA (mRNA) regions leads to mRNA degradation or to inhibition of mRNA translation. Most plant miRNAs base-pair with their respective target mRNAs in the coding region with perfect or near-perfect complementarity, leading to cleavage (and degradation) of the mRNAs; animal miRNAs usually base-pair with 3′ untranslated regions through imperfect complementarity, leading to translation repression.
Finding miRNA genes requires either costly experimental approaches (for example, genetics, which led to the detection of the first animal miRNAs [3, 4], cloning and sequencing of cDNA, or deep sequencing) or computational prediction methods, which facilitate subsequent experimental verification or falsification. The different properties of miRNAs in plants and animals gave rise to different computational approaches (for reviews see [5–7]). Most of these tools, however, rely on the following features: the miRNA resides in a stem-loop structure, which possesses a high thermodynamic stability and does not contain large internal loops or asymmetric bulges, at least in the region of the mature miRNA [8]. In addition, many tools take into account a phylogenetic conservation of the pre-miRNA structure and miRNA sequence, which limits the chance to detect nonconserved, evolutionarily new miRNA genes. For example, Dezulian et al. [9] identify plant miRNA homologs in a set of sequences, given a query miRNA, by a sequence similarity search step and a set of structural filters; Pfeffer et al. [10] identify DNA-viral pre-miRNAs, which show detectable conservation neither to other viral pre-miRNAs nor to host pre-miRNAs, by a search for stable stem-loops and scoring of these according to free energy of folding, base composition, and number of base pairs; Wang et al. [11] as well as Jones-Rhoades and Bartel [12] search for putative miRNA/miRNA* complexes in the intergenic regions of Arabidopsis thaliana and filter these according to GC content, mismatches in the stem, conservation in the rice genome, and the characteristic stem-loop structure.
To our knowledge, the only tools for de novo prediction of pre-miRNAs in plants are HHMMiR [13] and triplet-SVM [14]. HHMMiR first calculates the minimum free energy (mfe) structure of sequence regions (using RNAfold in a scanning window approach with a window length of less than 500 nt), extracts stem-loops that possess at least 10 base pairs, a minimum length of 50 nt, a loop of less than 20 nt, and no multiloop(s), and finally classifies them via a hierarchical hidden Markov model (HHMM). The reported sensitivity of HHMMiR is 0.865 for Oryza sativa (96 sequences taken from miRBase 5) and 0.973 for A. thaliana (75 sequences). triplet-SVM calculates the mfe structure of sequences with RNAfold, rejects those with junction(s), too few base pairs, or a high free energy (i.e., low structural stability), parses the remaining structures into “triplets” (type of nucleotide plus paired or unpaired state of the nucleotide and its two neighbors), and finally classifies these features with a support vector machine (SVM). The reported sensitivity of triplet-SVM is 0.948 for Oryza sativa and 0.92 for A. thaliana, using the same sequences from miRBase 5 as in the test with HHMMiR.
In the following, we describe our tool, called novoMIR, to detect pre-miRNA and miRNA/miRNA* sequences in a plant genome. For this purpose novoMIR uses a series of filter steps, similar to those mentioned above, followed by a statistical model to discriminate a pre-miRNA from all other RNAs and by another statistical model to locate the miRNA/miRNA* complex in a putative pre-miRNA. Thresholds and statistical values are learned from sets of true positive sequences (plant pre-miRNAs taken from miRBase; [15]) and true non-miRNA sequences (tRNAs, 5 S rRNA, 5.8 S rRNA, mRNAs, etc.). For detection, novoMIR relies neither on comparative genomics nor on prior knowledge of a miRNA target; thus novoMIR allows for searches in single plant genomes as well as in viral or viroid genomes.
2. Methods
2.1. Features of Plant Pre-miRNA
Sequences of plant pre-miRNAs were obtained from different versions of miRBase [15, 16]: version 10.0 contains 1,247 sequences; the recent version 14 contains 2,030 sequences. The mean and median length of plant sequences are about (150±73) nt and 130 nt, respectively (see Figure S1 in Supplementary Material available online at doi:10.4061/2010/495904); the shortest pre-miRNA is 54 nt in length (miRBase ID: gma-MIR2107) and the longest is 932 nt (cre-MIR916). The mean and median length of nonplant sequences are about (88±14) nt and 86 nt, respectively; the shortest pre-miRNA is 44 nt in length (hsa-mir-1973) and the longest is 215 nt (dme-mir-997). That is, most plant pre-miRNAs are longer than animal pre-miRNAs and their size range is more diverse. The sequences of pre-miRNAs and mature miRNAs are slightly enriched in U [17] and U plus G, respectively (see Figure S2). The four nucleotides are not equally distributed at each position along the miRNA sequences (see Figure S3): for example, a U is the preferred 5′ nucleotide ($f_{1,\mathrm{U}} = 0.65$), a G is preferred at position 8 ($f_{8,\mathrm{G}} = 0.44$), and a C at position 19 ($f_{19,\mathrm{C}} = 0.52$). The minimum free energy $\Delta G^0_{37^\circ\mathrm{C}}$ of the secondary structures of pre-miRNAs, as calculated by RNAfold [18] using default parameters, covers a wide range due to the different lengths $L$ and G+C contents $f_{\mathrm{GC}}$ of the sequences (see Figure S4); normalization of $\Delta G^0_{37^\circ\mathrm{C}}$ to length and $f_{\mathrm{GC}}$ [17] results in $\Delta G^0_{37^\circ\mathrm{C}}/L = (-0.45 \pm 0.12)$ kcal/mol/nt and $\Delta G^0_{37^\circ\mathrm{C}}/L/f_{\mathrm{GC}} = (-1.02 \pm 0.26)$ kcal/mol/nt; the latter value is significantly lower than that of other RNA according to Zhang et al. [17].
2.2. Training Data
We used the 184 pre-miRNAs and mature miRNAs of A. thaliana as listed in miRBase version 10 as the true-positive data set for establishing all thresholds and parameters of novoMIR. Sequences containing nucleotides other than A, C, G, U(T) were discarded. For evaluation of sensitivity we used in addition the plant pre-miRNAs and mature miRNAs from miRBase version 14 (190 from A. thaliana and 1,853 from other plants). The sensitivity of novoMIR was nearly identical for both data sets (and also with sequences from version 14 minus those from version 10; see supplemental Table S1); thus we refrained from training with different data sets.
2.3. Test Data
As the true-negative data set, we assembled RNA sets from the following sources:
710 mRNA sequences randomly selected from A. thaliana
631 tRNA sequences from A. thaliana
63 5.8 S rRNA sequences from Rfam version 7.0 [19]
602 5 S rRNA sequences from Rfam version 7.0
one randomly selected RNA sequence from each of the 455 noncoding RNA families from Rfam version 7.0 (except miRNA families);
2,760 shuffled pre-miRNA sequences (each of the 184 A. thaliana sequences from miRBase 10 was shuffled 5 times using shuffle [20] preserving (a) the mononucleotide content, (b) mono- and dinucleotide content, and (c) mononucleotide content in a window of 20 nt, resp.)
repetitive genomic elements from A. thaliana from the RepeatMasker library [21] (in total 134,000 nt)
8,000 pseudohairpin sequences from Homo sapiens [22]
10,000 pseudohairpin sequences from A. thaliana; these were selected using RNALfold from the TAIR cDNA library [23] to have a minimum stem-loop length of 50 nt in a base pair span of 400 nt
10×5,000 sequences of a length between 80 and 800 nt randomly selected from the five chromosomes of A. thaliana.
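The shuffled negative controls listed above can be illustrated with a short Python sketch. It is not the `shuffle` program from SQUID [20]; the function names are hypothetical, and the use of non-overlapping 20-nt windows for the regional shuffle is an assumption. The variant that preserves dinucleotide content needs a dedicated algorithm and is not shown.

```python
import random


def shuffle_mononucleotide(sequence, rng=random):
    """Return a permutation of `sequence`; mononucleotide content is preserved."""
    nucleotides = list(sequence)
    rng.shuffle(nucleotides)
    return "".join(nucleotides)


def shuffle_in_windows(sequence, window=20, rng=random):
    """Shuffle each consecutive, non-overlapping window separately.

    Assumption: windows do not overlap, so the local composition of every
    `window`-nt block is preserved exactly.
    """
    return "".join(
        shuffle_mononucleotide(sequence[start:start + window], rng)
        for start in range(0, len(sequence), window)
    )
```

Passing a seeded `random.Random` instance as `rng` makes a set of shuffled controls reproducible.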
2.4. Availability and Requirements
novoMIR is written in Perl and was tested under Linux. It relies on RNAshapes [24, 25] and RNALfold [26] (which is part of the Vienna RNA package [18]) for secondary structure calculations. RNALfold finds subsequences of a long RNA sequence that fold into locally stable (i.e., thermodynamically favorable) RNA secondary structures; the computational effort is $\mathcal{O}(N L^2)$ with length $N$ of the long RNA sequence and maximal base-pair separation $L$ of the subsequences. For an RNA sequence, RNAshapes computes shapes, which are classes of similar secondary structures, and a representative structure “shrep” of minimal free energy within each shape.
3. Algorithm
In the following, we describe the workflow of novoMIR (see supplemental Figure S5).
A typical plant pre-miRNA consists of a relatively short sequence (with median length ~130 nt and mean length ~150±73 nt) that is able to fold into a stable stem-loop structure. Thus, we search in the genomic sequence for subsequences with locally stable secondary structure(s) via RNALfold. In case the genomic sequence is longer than 1000 nt, we subdivide it into 1000 nt fragments overlapping by 400 nt. We choose a maximal base pair separation L=400 nt. This limit excludes only a few exceptionally long pre-miRNAs; that is, only 8 of 1356 plant pre-miRNA sequences in miRBase 10 and 14 of 2030 in miRBase 14, respectively, are dismissed due to this restriction for the sake of a fast first step. From the output of RNALfold, the five subsequences with best locally stable structures are treated further as individual sequences.
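The subdivision of a long genomic sequence into overlapping fragments can be sketched in Python as follows. This is only an illustration of the windowing described above, not novoMIR's Perl code; the function name and the generator interface are assumptions.

```python
def overlapping_fragments(sequence, fragment_length=1000, overlap=400):
    """Yield (start, fragment) pairs that cover `sequence`.

    Consecutive fragments share `overlap` nucleotides, so every subsequence
    of at most `overlap` nt lies completely inside at least one fragment.
    """
    if not 0 <= overlap < fragment_length:
        raise ValueError("overlap must be non-negative and smaller than fragment_length")
    if len(sequence) <= fragment_length:
        # Short sequences are processed as a whole.
        yield 0, sequence
        return
    step = fragment_length - overlap
    start = 0
    while True:
        yield start, sequence[start:start + fragment_length]
        if start + fragment_length >= len(sequence):
            break  # the last fragment reached the end of the sequence
        start += step
```

With the default values, an overlap of 400 nt matches the maximal base pair separation, so a locally stable structure spanning up to 400 nt is never cut by a fragment border.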
The original sequence (with length ≤1000 nt) or a subsequence (with length ≤400 nt) selected by RNALfold is discarded if the sequence has a base composition not typical for pre-miRNAs; that is, the sequence is only retained if the fraction of each nucleotide is above 0.1. This filter rejects 9 and 21 plant pre-miRNA sequences from miRBase 10 and 14, respectively.
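A minimal Python sketch of this base-composition filter is given below; the function name is hypothetical, and treating T as U is an assumption made so that genomic DNA input and RNA input behave identically.

```python
from collections import Counter


def has_typical_base_composition(sequence, min_fraction=0.1):
    """Return True if each of A, C, G, U makes up more than `min_fraction`."""
    seq = sequence.upper().replace("T", "U")  # assumption: T and U are equivalent
    if not seq:
        return False
    counts = Counter(seq)
    return all(counts[nt] / len(seq) > min_fraction for nt in "ACGU")
```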
RNAshapes is used to predict the thermodynamically optimal secondary structure (minimum free energy (mfe) structure with $\Delta G^0_{\mathrm{mfe}}$) and the optimal secondary structure of up to three shapes with energies less favorable than that of the mfe shape class by 0.1 kcal/mol. The shapes have to differ in their nesting pattern for all loop types but positions of unpaired regions are not of relevance (RNAshapes’s option -t 3). In general, it is assumed that the mfe structure of pre-miRNAs is the conformation adequate for further processing by Dicer. In our case, however, we do not know the true 5′ and 3′ ends; thus, the unrelated termini of the respective sequence, which do not belong to the true pre-miRNA, might cause the pre-miRNA structure to be thermodynamically suboptimal. Moreover, the restriction by RNAshapes to the shrep prediction avoids prediction (and further processing) of the immense number of suboptimal structures.
Any sequence that is not able to fold into a structure (as predicted by RNAshapes in the preceding step) with $\Delta G^0_{37^\circ\mathrm{C}}/L/f_{\mathrm{GC}} \le -0.75$ kcal/mol/nt is rejected.
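The normalized energy criterion can be written down in a few lines of Python. The sketch assumes that the free energy has already been computed by an external folding program and is passed in as a number in kcal/mol; the function names are hypothetical.

```python
def normalized_mfe(delta_g, sequence):
    """Return delta_g / L / f_GC for a free energy given in kcal/mol."""
    seq = sequence.upper()
    length = len(seq)
    gc_count = seq.count("G") + seq.count("C")
    if length == 0 or gc_count == 0:
        raise ValueError("sequence must be non-empty and contain G or C")
    gc_fraction = gc_count / length
    return delta_g / length / gc_fraction


def passes_energy_filter(delta_g, sequence, threshold=-0.75):
    """Keep a structure only if its normalized energy is at or below the threshold."""
    return normalized_mfe(delta_g, sequence) <= threshold
```

For example, a 100-nt sequence with a G+C fraction of 0.5 and a free energy of -60 kcal/mol gives -60/100/0.5 = -1.2 and is retained, whereas -30 kcal/mol gives -0.6 and is rejected.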
Next, each retained secondary structure is reformatted from the bracket-dot notation used by RNAshapes into an alignment-like format [27] (for an example see Figure 1), which eases handling during the following steps: at each multiloop, the structure is divided into the respective stem-loop structures, which are separately processed further; 5′ and 3′ dangling ends are removed; a hairpin loop is removed; and asymmetric loops are made symmetric by introduction of gap symbols. Afterwards each (sub)structure consists of the following states: base pairs (match states M symbolized by ++), loop “pairs” (mismatched states N, --), and insertion (I) and deletion (D) states (-| and |-, resp.).
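The conversion of a single stem-loop from bracket-dot notation into a string of states can be sketched in Python as follows. The sketch uses the one-letter states M, N, I, and D of Figure 1(b) and expects a structure without multiloop; the function names are hypothetical, and placing the gap states after the symmetric part of an asymmetric loop is an assumption.

```python
def pair_table(structure):
    """Return a list mapping each position to its pairing partner (or None)."""
    partner = [None] * len(structure)
    stack = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            opening = stack.pop()
            partner[opening], partner[pos] = pos, opening
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError("unbalanced structure")
    return partner


def stem_loop_states(structure):
    """Convert one stem-loop into a string of M, N, I, and D states.

    Dangling ends and the hairpin loop are dropped. M is a base pair, N a
    pair of opposing unpaired nucleotides, I an extra unpaired nucleotide in
    the 5' (top) strand, and D an extra one in the 3' (bottom) strand.
    """
    partner = pair_table(structure)
    paired = [pos for pos, p in enumerate(partner) if p is not None]
    if not paired:
        return ""
    i = paired[0]      # first paired position; the 5' dangling end is skipped
    j = partner[i]
    if j != paired[-1]:
        raise ValueError("more than one stem-loop; split the structure first")
    states = []
    while True:
        p = i
        while p <= j and partner[p] is None:
            p += 1
        if p > j:
            break      # only the hairpin loop is left
        q = partner[p]
        if q < p or q > j or any(partner[k] is not None for k in range(q + 1, j + 1)):
            raise ValueError("multiloop found; split the structure first")
        top, bottom = p - i, j - q          # unpaired nucleotides on each strand
        states.append("N" * min(top, bottom))
        states.append("I" * (top - bottom) if top > bottom else "D" * (bottom - top))
        states.append("M")
        i, j = p + 1, q - 1
    return "".join(states)
```

For the structure `..((.(((...)))..)).`, the sketch returns `MMNDMMM`: two base pairs, an internal loop with one nucleotide in the top and two in the bottom strand, and three further base pairs.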
A stem-loop shorter than 30 states in the alignment-like format is deleted. For efficiency of this filter, see Figure 2(a).
Next, a window of length 25 states is moved (in steps of one state) along the structure in the alignment-like format, and the fraction of base-paired states is determined for each window. A stem-loop is deleted unless at least a mean fraction of 0.65 base-paired states is present in five different windows, which might overlap. For efficiency of this filter see Figure 2(b).
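One possible reading of this window filter is sketched below in Python. The input is assumed to be a string of one-letter states (M for a base pair; N, I, D otherwise) as in the fifth line of Figure 1(b); the function names are hypothetical. Taking the five windows with the highest fractions and comparing their mean to the threshold is an interpretation of the description, not necessarily the exact rule used in novoMIR.

```python
def window_fractions(states, window=25):
    """Fraction of base-paired states (M) in every window of `window` states."""
    if len(states) < window:
        return []
    paired = [1 if state == "M" else 0 for state in states]
    count = sum(paired[:window])
    fractions = [count / window]
    for end in range(window, len(paired)):
        # Slide by one state: add the entering state, drop the leaving one.
        count += paired[end] - paired[end - window]
        fractions.append(count / window)
    return fractions


def passes_window_filter(states, window=25, n_windows=5, threshold=0.65):
    """Assumed rule: the mean of the `n_windows` best windows must reach `threshold`."""
    fractions = sorted(window_fractions(states, window), reverse=True)
    if len(fractions) < n_windows:
        return False
    return sum(fractions[:n_windows]) / n_windows >= threshold
```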
A stem-loop is deleted if it does not contain a helix with at least 8 consecutive base pairs. For efficiency of this filter see Figure 2(c).
A stem-loop is deleted if the ratio of its sequence length (as predicted by RNALfold) and the length of the stem-loop in the alignment-like format is above 6; that is, the structure contains too many junctions and/or large, unstructured hairpin loops. For efficiency of this filter see Figure 2(d).
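The three simpler filters (minimum stem-loop length, minimum helix length, and length ratio) can be combined in a short Python sketch. It assumes the stem-loop is available as a string of one-letter states (M, N, I, D) as in Figure 1(b), with a helix being an uninterrupted run of M states; the function names are hypothetical.

```python
import re


def longest_helix(states):
    """Length of the longest uninterrupted run of base pairs (M states)."""
    return max((len(run) for run in re.findall(r"M+", states)), default=0)


def passes_simple_filters(states, sequence_length,
                          min_states=30, min_helix=8, max_ratio=6.0):
    """Apply the length, helix, and ratio filters with the default thresholds."""
    if len(states) < min_states:
        return False                      # stem-loop too short
    if longest_helix(states) < min_helix:
        return False                      # no sufficiently long helix
    # Too many junctions or large unstructured loops inflate this ratio.
    return sequence_length / len(states) <= max_ratio
```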
If a sequence (and structure) remains after the filter steps, novoMIR decides whether it is likely to be a pre-miRNA using a paired hidden Markov model that closely follows the one described by Nam et al. [27]. Briefly, the joint probability $P(x,\pi)$ of an observed sequence $x$ and a state sequence $\pi$ is
$$P(x,\pi) = T_{0\pi_1} \prod_{i=1}^{L} E_{\pi_i}(x_i)\, T_{\pi_i,\pi_{i+1}},$$
with transition probabilities $T_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$ between the four states $k, l \in \{M, N, I, D\}$, emission probabilities $E_k(b) = P(x_i = b \mid \pi_i = k)$ of the different nucleotide and gap pairs $b$, window size $L = 21$, and the probability of starting in state $k$ defined as $T_{0\pi_1}$. In contrast to Nam et al. [27], we use four hidden states (is_miRNA, is_miRNA→is_not_miRNA, is_not_miRNA→is_miRNA, is_not_miRNA; see Figure S6). For the decision that the sequence is a pre-miRNA or not, the values for the $j \in \{\text{is\_miRNA}, \text{is\_not\_miRNA}\}$ states are normalized and summed up
$$P_j = \sum_{i=1}^{L} \frac{E_{\pi_i,j}(x_{i,j})\, T_{\pi_i,j,\pi_{i+1},j}}{\sum_{j'=1}^{4} E_{\pi_i,j'}(x_{i,j'})\, T_{\pi_i,j',\pi_{i+1},j'}}.$$
The squared ratio
$$R_1 = \left(\frac{P_{\text{is\_miRNA}}}{P_{\text{is\_not\_miRNA}}}\right)^{2},$$
as well as the mean of the nine highest values of the difference
$$R_2 = \max \sum_{k=l}^{l+20} \left(P_{\text{is\_miRNA}} - P_{\text{is\_not\_miRNA}}\right),$$
are compared to thresholds for the pre-miRNA decision.
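The joint probability of an observed sequence and a state path, and the squared ratio used for the decision, can be illustrated with a small Python sketch. All names and all probability values in the usage example are hypothetical and are not novoMIR's trained parameters; omitting a transition after the last position is an assumption, and the computation is done in log space to avoid numerical underflow.

```python
import math


def log_joint_probability(observations, path, start, transition, emission):
    """Return log P(x, pi) for observations x and a state path pi.

    `start[k]` is the probability of starting in state k, `transition[k][l]`
    the probability of moving from k to l, and `emission[k][b]` the
    probability of emitting symbol b in state k.
    """
    if not path or len(observations) != len(path):
        raise ValueError("observations and path must have the same non-zero length")
    log_p = math.log(start[path[0]])
    for i, (symbol, state) in enumerate(zip(observations, path)):
        log_p += math.log(emission[state][symbol])
        if i + 1 < len(path):  # assumption: no transition after the last position
            log_p += math.log(transition[state][path[i + 1]])
    return log_p


def squared_ratio(p_is_mirna, p_is_not_mirna):
    """Squared ratio of the two normalized state values."""
    return (p_is_mirna / p_is_not_mirna) ** 2
```

A toy call with two states and invented probabilities, for which the product of the individual factors is 0.8 · 0.6 · 0.9 · 0.4 · 0.1 · 1.0:

```python
start = {"M": 0.8, "N": 0.2}
transition = {"M": {"M": 0.9, "N": 0.1}, "N": {"M": 0.5, "N": 0.5}}
emission = {"M": {"GC": 0.6, "AU": 0.4}, "N": {"AA": 1.0}}
log_p = log_joint_probability(["GC", "AU", "AA"], ["M", "M", "N"],
                              start, transition, emission)
```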
In case of a positive decision in the previous step, the window positions that lead to the six highest values of $R_2$ are predicted as positions of probable miRNA/miRNA* duplexes (see Figure 1(c)).
Figure 1: Example of a reformatted pre-miRNA structure and predicted localization of a miRNA/miRNA* complex in the pre-miRNA structure. (a) Secondary structure of the pre-miRNA of ath-mir156a in standard representation. The mature miRNA is shown in larger italic characters. (b) Section of the pre-miRNA in an alignment-like format. First and fourth line show the sequence except the sequence region of the hairpin loop; lines 2 and 3 describe the state of the opposing nucleotides: a base pair is marked by “+”, nucleotides of an internal loop by “-” in both lines, and nucleotides that are part of an asymmetrical internal or a bulge loop by “|”. In the fifth line a base pair is marked by “M”, a non-pair by “N”, a deletion in the top strand by “D”, and an insertion in the top strand by “I”. (c) Positions predicted as miRNA/miRNA* complexes are marked by “x” and their relative positions in the sequence.
Figure 2: Efficiency of filtering steps and paired HMM. Receiver operating characteristic (ROC) curves for filters on (a) minimum length of stem-loop region, (b) fraction of base pairs in a sliding window of 25 states, (c) number of consecutive base pairs, and (d) ratio of sequence and stem-loop length. The area under the curve (AUC) is (a) 0.94, (b) 0.97, (c) 0.93, and (d) 0.87. The dots (at max(sensitivity + specificity - 1)) denote the value pair of sensitivity and false positive rate that optimally discriminates between miRNA and non-miRNA sequences. The data set consisted of all plant miRNA sequences from miRBase 10 and 455 non-miRNA sequences from Rfam 7.
4. Results and Discussion
Our program novoMIR uses a set of heuristic filters and a statistical model to discriminate a miRNA precursor from all other RNAs (see Figure S5). The data for this model are collected based on a set of true positive sequences (miRNA precursors from A. thaliana as in miRBase 10) and a set of true non-miRNA sequences (for details, see Section 2.3).
4.1. Test on miRBase Version 14
All thresholds for the filter steps and the probabilities for the Hidden-Markov model were selected on the basis of “receiver operating characteristic” (ROC) curves like those shown in Figure 2. For these, the set of true positive A. thaliana pre-miRNA sequences was taken from miRBase version 10. Sensitivity values for the enlarged set of pre-miRNAs from miRBase version 14 (190 A. thaliana and 1,840 sequences from other plants) are compared to those obtained from miRBase version 10 (184 A. thaliana and 1,063 sequences from other plants) in Table 1. The sensitivity values of novoMIR for A. thaliana pre-miRNA sequences of both miRBase versions are very close to each other (0.837 and 0.832, resp.). The values for all plant pre-miRNA sequences are slightly lower (0.791 and 0.792, resp.), but show no clear trend that sequences of miRBase 14 (not present in miRBase 10) are different from those of miRBase 10 or that sequences from a certain taxonomic group might be different from those of others (see supplemental Table S1).
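Selecting a threshold from a ROC curve at the point of maximal (sensitivity + specificity - 1), as marked by the dots in Figure 2, can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes a filter in which higher values are more miRNA-like, so that a sequence passes if its value is at or above the threshold.

```python
def best_threshold(positive_scores, negative_scores):
    """Return (threshold, sensitivity, specificity) maximizing sens + spec - 1.

    Sensitivity is TP/(TP+FN) on the positive set and specificity is
    TN/(TN+FP) on the negative set. Ties keep the lowest threshold.
    """
    best = None
    for threshold in sorted(set(positive_scores) | set(negative_scores)):
        true_pos = sum(score >= threshold for score in positive_scores)
        true_neg = sum(score < threshold for score in negative_scores)
        sensitivity = true_pos / len(positive_scores)
        specificity = true_neg / len(negative_scores)
        gain = sensitivity + specificity - 1
        if best is None or gain > best[0]:
            best = (gain, threshold, sensitivity, specificity)
    return best[1:]
```

For a filter in which lower values are more miRNA-like (such as the ratio of sequence and stem-loop length), the scores can be negated before calling the function.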
Table 1: Sensitivity (1) of novoMIR, triplet-SVM [14], and HHMMiR [13] in pre-miRNA prediction for different versions of miRBase. The row “14–10” shows values for sequences from miRBase 14 which are not present in miRBase 10.

| miRBase version | # sequences | Source | novoMIR (2), pre-miRNA | novoMIR (2), miRNA/miRNA* | HHMMiR (3), pre-miRNA, all | HHMMiR (3), pre-miRNA, after preprocessing | triplet-SVM (3), pre-miRNA, all | triplet-SVM (3), pre-miRNA, after preprocessing |
|---|---|---|---|---|---|---|---|---|
| 10 | 184 | A. th. | 0.84 | 0.73 | 0.15 | 0.75 | 0.45 | 0.60 |
| 14 | 190 | A. th. | 0.83 | 0.75 | 0.10 | 0.79 | 0.44 | 0.59 |
| 10 | 1247 | plant | 0.79 | 0.82 | 0.04 | 0.58 | 0.39 | 0.51 |
| 14 | 2030 | plant | 0.79 | 0.83 | 0.04 | 0.64 | 0.38 | 0.50 |
| 14–10 | 788 | plant | 0.80 | — | 0.04 | 0.73 | 0.38 | 0.48 |

(1) Sensitivity is calculated as TP/(TP+FN).
(2) Note that novoMIR’s thresholds and probabilities were learned only from A. thaliana sequences in miRBase version 10.
(3) The left column gives sensitivity for all sequences; the right column gives sensitivity for those sequences left after the preprocessing step(s) of HHMMiR and triplet-SVM, respectively.
The sensitivity of novoMIR in predicting the position of the miRNA/miRNA* complex is also high (0.73 for A. thaliana and 0.82 for all plants; see Table 1). For this, a position is counted as correctly predicted if it matches exactly the annotated mature miRNA or overlaps by five or fewer nucleotides.
4.2. Comparison with Other Tools
We tested HHMMiR [13] and triplet-SVM [14] for sensitivity with the sequences from miRBase 10 and 14 (see Table 1). Their sensitivity is at maximum 0.15 and 0.45, respectively. The filtering steps of both tools already reject many sequences (HHMMiR more than 80% and triplet-SVM more than 22%). For the sequences remaining after the filtering steps, the sensitivity of the HHMM and SVM is at maximum 0.79 and 0.60, respectively, which is also lower than that of novoMIR with a sensitivity of at least 0.80 (using all filter steps).
4.3. Tests on Specificity
We assembled different data sets to test the specificity of novoMIR. These data sets should not contain any true (pre-)miRNA. For example, we used well-annotated RNAs (mRNA, noncoding RNA) and sets of “pseudohairpins” from H. sapiens and A. thaliana. Similarly, the chance is negligible that the data set of 10×5,000 sequences randomly selected from the A. thaliana genome contains a true miRNA. The most difficult data set consisted of A. thaliana mRNAs; with these novoMIR reached a specificity of 0.975 (see Table 2). With all other data sets, specificity ranged from 0.98 to 1.00.
Table 2: Specificity of novoMIR.

| Data set | # sequences | Specificity (5) |
|---|---|---|
| A. thaliana mRNAs | 710 | 0.975 |
| noncoding RNAs (1) | 1,296 | 1.000 |
| noncoding RNAs (2) | 455 | 0.982 |
| shuffled A. thaliana pre-miRNAs | 2,760 | 0.998 |
| A. thaliana repetitive elements (3) | 56 | 0.983 |
| H. sapiens pseudohairpins | 8,000 | 0.990 |
| A. thaliana pseudohairpins | 10,000 | 0.991 |
| A. thaliana (4) | 50,000 | 1.000 |

(1) 631 A. thaliana tRNAs, 63 5.8 S rRNAs, 602 5 S rRNAs.
(2) Noncoding RNAs from Rfam.
(3) In total 134,000 nt.
(4) 10×5,000 sequences of a length between 80 and 800 nt randomly selected from the five chromosomes of A. thaliana.
(5) Specificity is calculated as TN/(TN+FP).
4.4. A Search for Pre-miRNAs in the Genome of Arabidopsis thaliana
We wanted to test the program with a more realistic scenario, given the satisfying sensitivity and specificity values of novoMIR with our test data (see Tables 1 and 2). We selected all intergenic and intronic regions of the A. thaliana genome from “The Arabidopsis Information Resource” (TAIR), removed all pre-miRNA sequences, and searched within the remaining sequences for potential pre-miRNAs via novoMIR. novoMIR classified 828 sequences from the 30,413 intergenic sequences and 649 sequences from the 148,558 intronic sequences, respectively, as potential pre-miRNAs.
Despite these pleasingly low numbers of hits, an interpretation of this outcome is not easy. To get an impression of the hits, we used BLAST to search with these potential pre-miRNA sequences for any annotation and for the miRNA-typical expression pattern in the “Arabidopsis Small RNA Project Database” (ASRP) [28, 29]; such a typical expression pattern of a pre-miRNA includes sequences for the miRNA as well as for the miRNA* (for an example see supplemental Figure S7). To our surprise, we found that some of the predicted candidates are already described as true pre-miRNAs. An example of such a sequence, predicted by novoMIR as a potential pre-miRNA, is located on A. thaliana chromosome 3 in the region between genes At3G09280 and At3G09290. Its secondary structure and its support by expressed small RNAs are shown in Figure 3 and Figure S8, respectively. It is already known as pre-miR2111a [30, 31], but not present in miRBase 14. The sequences of the mature miR2111a and of miR2111a* predicted by novoMIR also coincide with the sequences given in [30].
Figure 3: Secondary structure and features of sequences classified by novoMIR as pre-miRNAs. Positions in the A. thaliana chromosomes are given above the structures. The predicted miRNA/miRNA* complexes are shown in larger italic characters. The table contains features of the sequences and their structures; the filter values are fraction of base pairs in five windows (with default threshold $t_{\mathrm{WD}} = 0.65$), stem-loop length ($t_{\mathrm{HP}} = 30$), maximal helix length ($t_{\mathrm{BP}} = 8$), and ratio of sequence and stem length ($t_{\mathrm{R}} = 6$). For expression pattern of small RNAs at the genomic location of these sequences, see supplemental Figures S8–S11.
In the following, we briefly mention three further candidate hits, for which we found some support by small-RNA expression in the ASRP but no explicit annotation. One novoMIR hit is located on chromosome 4 between At4G22760 and At4G22770 close to the 3′ terminus of the latter, but on the opposite strand; for further details, see Figure 3 and Figure S9. The next hit (see Figure 3 and Figure S10) is located between At5G52689 and At5G52690. The last hit mentioned here is located in an intron of AT1G01650, which encodes an aspartic-type endopeptidase/peptidase; the structure of this sequence is shown in Figure 3 and the expression pattern of the genomic region in Figure S11.
Several candidate hits have no support by small RNAs in the ASRP. It is known that many miRNAs are induced by biotic and abiotic stress [36–38]. Thus, a lack of small RNAs might point either to a false-positive prediction or to a stress condition not analyzed for expression of small RNAs. Further candidate hits are located in regions showing expression patterns similar to those of repetitive elements. A recently published review [39] discussed the possibility that some miRNAs could have evolved from repetitive genomic elements and/or duplication of genomic regions.
4.5. Viroids as Pre-miRNAs?
Viroids are plant-infectious, noncoding, unencapsidated, circular RNAs that are transcribed in a rolling-circle mechanism either in nuclei (Pospiviroidae) or in chloroplasts (Avsunviroidae) of infected plants. Viroids cause the production of viroid-specific small RNAs (vsRNA) similar in size to small interfering (siRNA) and miRNAs, but they do escape the cytoplasmic silencing mechanism. A positive (or negative) novoMIR prediction of viroids as potential pre-miRNAs would point to the genesis of vsRNAs. For further details, see recent reviews [40–43].
Potato spindle tuber viroid (PSTVd) is the type strain of Pospiviroidae. Because of its high self-complementarity the circular PSTVd RNA folds into a rod-like secondary structure of high thermodynamic stability (see Figure 4). This structure can be divided into five structural domains on the basis of homology between different pospiviroids [34]. Most sequence variants or strains of PSTVd differ by mutations in the pathogenicity-modulating (P) domain and/or variable (V) domain. Only a few nucleotide changes in the P domain are sufficient to produce remarkably different symptoms in infected tomato plants (Solanum lycopersicum cv Rutgers). If the P domain were the source of miRNA-like vsRNAs, these could interfere with the host's metabolism, leading to symptom production.
Figure 4: Secondary structure of PSTVd and location of miRNA/miRNA* complexes as predicted by novoMIR. The structure scheme is based on a consensus structure of 45 (+)-stranded circular PSTVd sequence variants [32]; the sequence is given for the PSTVd variant Intermediate [33]. The five homology domains of pospiviroids are marked as proposed by Keese and Symons [34]: terminal left and right (TL, TR), pathogenicity-modulating (P), central conserved (C), and variable (V) domain. The transcription start site for (-)-strand synthesis is marked by “polII” [35]. The predicted miRNA/miRNA* complexes are shown as larger italic characters. The complex in the P domain is predicted using default parameters; the complex in the TR domain is additionally predicted with a normalized energy threshold $t_{\Delta G/L/f_{\mathrm{GC}}} = 0.69$ (instead of 0.75).
For an RNA with PSTVd sequence from positions 263–359/1–96, which is one of the structural elements present during processing of (+)-strand replication intermediates to circles [44], novoMIR predicted miRNA/miRNA* complexes in the P domain of PSTVd; for an RNA from positions 103–255, which is also a structural element during processing, novoMIR predicted a further miRNA/miRNA* complex in the TR domain, but only after lowering the normalized energy threshold from the default value $t_{\Delta G/L/f_{\mathrm{GC}}} = 0.75$ to 0.69. Both regions are marked by italic characters in Figure 4. novoMIR predicted identical positions for complexes in a full-length, linear PSTVd (1–359). In particular, the prediction of vsRNAs derived from the P domain supports an involvement of vsRNAs in symptom production via vsRNA-induced (mis)regulation of plant-endogenous RNAs like mRNAs coding for transcription factors. This hypothesis is supported by deep sequencing of PSTVd-derived vsRNAs in PSTVd-infected tomato plants (Diermann, Matoušek, Teune, Riesner and Steger, submitted) and by sequencing of vsRNAs produced in vitro by DCL processing of PSTVd [45], which showed clusters of vsRNAs derived from the P domain. In contrast, the studies in [45, 46] found only vsRNAs in PSTVd-infected tomato plants that clustered in regions outside of the P domain. This discrepancy is unresolved but might be due, for example, to different purification procedures of the vsRNAs.
5. Conclusion
Plant pre-miRNAs are more heterogeneous in size and structure than animal pre-miRNAs but still show sufficient characteristic features (such as relative thermodynamic stability of their structure, length of helices, and number and size of loops) to be differentiated from other RNAs. Based on several of these features, we developed a series of filter steps and a statistical model that together are able to detect pre-miRNAs with a sensitivity of about 0.8 and a specificity of about 0.99. Thus, the program, which we call novoMIR, is well suited to search on a genomic scale for new pre-miRNAs that are not necessarily evolutionarily conserved. As an example, we searched with novoMIR for pre-miRNAs in nontranslated regions of the A. thaliana genome and detected among the high-scoring sequences experimentally verified pre-miRNAs, which were not annotated in the recent version of miRBase. Additionally, novoMIR recognizes viroids as pre-miRNAs, which supports the hypothesis that viroid-specific small RNAs are generated in a miRNA-like pathway.
Funding
The project was supported by a grant from the German Science Foundation to Detlev Riesner and G. Steger.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
Acknowledgment is given to Dr. M. Schmitz and Dr. L. Nagel for critical reading of the paper. The package and supplementary information can be downloaded at http://www.biophys.uni-duesseldorf.de/novomir/.
References
[1] Ramachandran V., Chen X. Small RNA metabolism in Arabidopsis. 2008;13(7):368–374. doi:10.1016/j.tplants.2008.03.008
[2] Voinnet O. Origin, biogenesis, and activity of plant microRNAs. 2009;136(4):669–687. doi:10.1016/j.cell.2009.01.046
[3] Ambros V. A hierarchy of regulatory genes controls a larva-to-adult developmental switch in C. elegans. 1989;57(1):49–57.
[4] Reinhart B. J., Slack F. J., Basson M., Pasquinelli A. E., Bettinger J. C., Rougvie A. E., Horvitz H. R., Ruvkun G. The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans. 2000;403(6772):901–906. doi:10.1038/35002607
[5] Brown J. R., Sanseau P. A computational view of microRNAs and their targets. 2005;10(8):595–601. doi:10.1016/S1359-6446(05)03399-4
[6] Mendes N. D., Freitas A. T., Sagot M.-F. Current tools for the identification of miRNA genes and their targets. 2009;37(8):2419–2433. doi:10.1093/nar/gkp145
[7] Yousef M., Showe L., Showe M. A study of microRNAs in silico and in vivo: bioinformatics approaches to microRNA discovery and target identification. 2009;276(8):2150–2156. doi:10.1111/j.1742-4658.2009.06933.x
[8] Ritchie W., Legendre M., Gautheret D. RNA stem-loops: to be or not to be cleaved by RNAse III. 2007;13(4):457–462. doi:10.1261/rna.366507
[9] Dezulian T., Remmert M., Palatnik J. F., Weigel D., Huson D. H. Identification of plant microRNA homologs. 2006;22(3):359–360. doi:10.1093/bioinformatics/bti802
[10] Pfeffer S., Sewer A., Lagos-Quintana M., Sheridan R., Sander C., Grässer F. A., van Dyk L. F., Kiong Ho C., Shuman S., Chien M., Russo J. J., Ju J., Randall G., Lindenbach B. D., Rice C. M., Simon V., Ho D. D., Zavolan M., Tuschl T. Identification of microRNAs of the herpesvirus family. 2005;2(4):269–276. doi:10.1038/nmeth746
[11] Wang X. J., Reyes J. L., Chua N. H., Gaasterland T. Prediction and identification of Arabidopsis thaliana microRNAs and their mRNA targets. 2004;5(9):R65.
[12] Jones-Rhoades M. W., Bartel D. P. Computational identification of plant microRNAs and their targets, including a stress-induced miRNA. 2004;14(6):787–799. doi:10.1016/j.molcel.2004.05.027
[13] Kadri S., Hinman V., Benos P. V. HHMMiR: efficient de novo prediction of microRNAs using hierarchical hidden Markov models. 2009;10(1), article S35. doi:10.1186/1471-2105-10-S1-S35
[14] Xue C., Li F., He T., Liu G.-P., Li Y., Zhang X. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. 2005;6, article 310. doi:10.1186/1471-2105-6-310
[15] Griffiths-Jones S., Grocock R. J., van Dongen S., Bateman A., Enright A. J. miRBase: microRNA sequences, targets and gene nomenclature. 2006;34:D140–D144.
[16] Griffiths-Jones S., Saini H. K., Van Dongen S., Enright A. J. miRBase: tools for microRNA genomics. 2008;36(1):D154–D158. doi:10.1093/nar/gkm952
[17] Zhang B. H., Pan X. P., Cox S. B., Cobb G. P., Anderson T. A. Evidence that miRNAs are different from other RNAs. 2006;63(2):246–254. doi:10.1007/s00018-005-5467-7
[18] Hofacker I. L. Vienna RNA secondary structure server. 2003;31(13):3429–3431. doi:10.1093/nar/gkg599
[19] Griffiths-Jones S., Moxon S., Marshall M., Khanna A., Eddy S. R., Bateman A. Rfam: annotating non-coding RNAs in complete genomes. 2005;33:D121–D124. doi:10.1093/nar/gki081
[20] Eddy S. R. SQUID: C function library for sequence analysis. 2008. http://selab.janelia.org/software.html#squid
[21] Smit A. F. A., Hubley R., Green P. RepeatMasker Open-3.0. 2004. http://www.repeatmasker.org/
[22] Ng K. L. S., Mishra S. K. De novo SVM classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures. 2007;23(11):1321–1330. doi:10.1093/bioinformatics/btm026
[23] Garcia-Hernandez M., Berardini T. Z., Chen G., Crist D., Doyle A., Huala E., Knee E., Lambrecht M., Miller N., Mueller L. A., Mundodi S., Reiser L., Rhee S. Y., Scholl R., Tacklind J., Weems D. C., Wu Y., Xu I., Yoo D., Yoon J., Zhang P. TAIR: a resource for integrated Arabidopsis data. 2002;2(6):239–253. doi:10.1007/s10142-002-0077-z
[24] Giegerich R., Voß B., Rehmsmeier M. Abstract shapes of RNA. 2004;32(16):4843–4851. doi:10.1093/nar/gkh779
[25] Steffen P., Voß B., Rehmsmeier M., Reeder J., Giegerich R. RNAshapes: an integrated RNA analysis package based on abstract shapes. 2006;22(4):500–503. doi:10.1093/bioinformatics/btk010
[26] Hofacker I. L., Priwitzer B., Stadler P. F. Prediction of locally stable RNA secondary structures for genome-wide surveys. 2004;20(2):186–190. doi:10.1093/bioinformatics/btg388
[27] Nam J.-W., Shin K.-R., Han J., Lee Y., Kim V. N., Zhang B.-T. Human microRNA prediction through a probabilistic co-learning model of sequence and structure. 2005;33(11):3570–3581. doi:10.1093/nar/gki668
[28] Gustafson A. M., Allen E., Givan S., Smith D., Carrington J. C., Kasschau K. D. ASRP: the Arabidopsis Small RNA Project database. 2005;33:D637–D640. doi:10.1093/nar/gki127
[29] Backman T. W. H., Sullivan C. M., Cumbie J. S., Miller Z. A., Chapman E. J., Fahlgren N., Givan S. A., Carrington J. C., Kasschau K. D. Update of ASRP: the Arabidopsis Small RNA Project database. 2008;36(1):D982–D985. doi:10.1093/nar/gkm997
[30] Pant B. D., Musialak-Lange M., Nuc P., May P., Buhtz A., Kehr J., Walther D., Scheible W.-R. Identification of nutrient-responsive Arabidopsis and rapeseed microRNAs by comprehensive real-time polymerase chain reaction profiling and small RNA sequencing. 2009;150(3):1541–1555. doi:10.1104/pp.109.139139
[31] Fahlgren N., Sullivan C. M., Kasschau K. D., Chapman E. J., Cumbie J. S., Montgomery T. A., Gilbert S. D., Dasenko M., Backman T. W. H., Givan S. A., Carrington J. C. Computational and analytical framework for small RNA profiling by high-throughput sequencing. 2009;15(5):992–1002. doi:10.1261/rna.1473809
[32] Steger G., Riesner D. Properties of viroids: molecular characteristics. In: Hadidi A., Flores R., Randles J. W., Semancik J. S., editors. Melbourne, Australia: CSIRO Publishing; 2003. pp. 15–29.
[33] Gross H. 
J.DomdeyH.LossowC.Nucleotide sequence and secondary structure of potato spindle tuber viroid197827356592032082-s2.0-0017821271KeeseP.SymonsR. H.Domains in viroids: evidence of intermolecular RNA rearrangements and their contribution to viroid evolution19858214458245862-s2.0-0022344804KolonkoN.BannachO.AschermannK.HuK.-H.MoorsM.SchmitzM.StegerG.RiesnerD.Transcription of potato spindle tuber viroid by RNA polymerase II starts in the left terminal loop200634723924042-s2.0-3374481658910.1016/j.virol.2005.11.039LiB.YinW.XiaX.Identification of microRNAs and their targets from Populus euphratica200938822722772-s2.0-6924920963810.1016/j.bbrc.2009.07.161ShuklaL. I.ChinnusamyV.SunkarR.The role of microRNAs and other endogenous small RNAs in plant stress responses20081779117437482-s2.0-5504913997410.1016/j.bbagrm.2008.04.004JagadeeswaranG.SainiA.SunkarR.Biotic and abiotic stress down-regulate miR398 expression in Arabidopsis20092294100910142-s2.0-6134917123710.1007/s00425-009-0889-3AxtellM. J.BowmanJ. L.Evolution of plant microRNAs and their targets20081373433492-s2.0-4344911803810.1016/j.tplants.2008.03.009TsagrisE. M.de AlbaÁ. E. M.GozmanovaM.KalantidisK.Viroids20081011216821792-s2.0-5404909879710.1111/j.1462-5822.2008.01231.xDingB.ItayaA.Viroid: a useful model for studying the basic principles of infection and RNA biology20072017202-s2.0-3384572317010.1094/MPMI-20-0007SchmitzM.StegerG.Potato spindle tuber viroid (PSTVd)20071106115DaròsJ.-A.ElenaS. F.FloresR.Viroids: an Ariadne's thread into the RNA labyrinth2006765935982-s2.0-3374481920510.1038/sj.embor.7400706BaumstarkT.SchröderA. R. W.RiesnerD.Viroid processing: switch from cleavage to ligation is driven by a change from a tetraloop to a loop E conformation19971635996102-s2.0-1924437561310.1093/emboj/16.3.599ItayaA.ZhongX.BundschuhR.QiY.WangY.TakedaR.HarrisA. R.MolinaC.NelsonR. 
S.DingB.A structured viroid RNA serves as a substrate for dicer-like cleavage to produce biologically active small RNAs but is resistant to RNA-induced silencing complex-mediated degradation2007816298029942-s2.0-3394737298210.1128/JVI.02339-06Di SerioF.De AlbaA.-E. M.NavarroB.GiselA.FloresR.rflores@ibmcp.upv.esRNA-dependent RNA polymerase 6 delays accumulation and precludes meristem invasion of a viroid that replicates in the nucleus2010845247724892-s2.0-3374583746510.1128/JVI.02336-09