MicroRNAs (miRNAs) are nonprotein coding RNAs between 20 and 22 nucleotides long that attenuate protein production. Different types of sequence data are being investigated for novel miRNAs, including genomic and transcriptomic sequences. A variety of machine learning methods have successfully predicted miRNA precursors, mature miRNAs, and other nonprotein coding sequences. MirTools, mirDeep2, and miRanalyzer require “read count” to be included with the input sequences, which restricts their use to deep-sequencing data. Our aim was to train a predictor using a cross-section of different species to accurately predict miRNAs outside the training set. We wanted a system that did not require read-count for prediction and could therefore be applied to short sequences extracted from genomic, EST, or RNA-seq sources. A miRNA-predictive decision-tree model has been developed by supervised machine learning. It only requires that the corresponding genome or transcriptome is available within a sequence window that includes the precursor candidate so that the required sequence features can be collected. Some of the most critical features for training the predictor are the miRNA:miRNA
MicroRNAs (miRNAs) are nonprotein coding RNAs of between 20 and 22 nucleotides that attenuate protein production by cleavage, translational inhibition, or sequestering of mRNA in P bodies [
Understanding miRNA biogenesis is important when developing predictive models. The mature miRNA originates from an expressed RNA precursor. The precursor folds back to base pair with itself to form a characteristic stem-loop structure. However, not all stem-loop structures are miRNA precursors. The dicer protein cuts a short, double-stranded RNA (miRNA:miRNA* duplex) from the precursor. This double-stranded RNA associates with the RISC complex, where the mature miRNA is retained while the miRNA* is assumed to degrade [
Many methods for predicting miRNAs from sequence data have been reported [
Xue et al. [
MiPred employs the random forest machine learning method [
Approaches designed for deep sequencing data, such as MirTools [
Of the three
One can ask what data are necessary to correctly predict miRNAs from sequences and, specifically for deep sequencing data, whether an accurate predictor can be developed that does not require read count or rely on the presence of the miRNA*. The predictive model described here is one of the most specific predictors (98.53% specificity), yet it has the fewest requirements: the candidate sequence need only occur in a sequence region large enough to encompass the putative precursor. Fulfilling this single requirement allows the set of quantitative sequence properties (attributes) for the candidate sequence to be calculated and passed to the predictor.
Here, positive and negative controls from 18 plant species were used for training and model evaluation. Positive controls were taken from miRBase [
Wet-lab validation is costly and time consuming; therefore, accurately predicting that the candidate RNA is truly a miRNA is important. Our aim was to limit the number of false positives by accurately recognizing sequences that are not miRNAs, even at the expense of missing a few true miRNAs. To demonstrate the quality of any predictor, rigorous statistical testing is required.
Training of the predictive model requires controls of known outcome. The measured characteristics of the controls used in machine learning are called attributes. The positive controls are known miRNAs and non-miRNA sequences form the negative controls. Known miRNAs from 18 plant species were taken from miRBase Release 18. The negative controls were picked from ESTs of the respective species downloaded from TIGR Plant Transcript Assemblies [
Plant species from which miRBase miRNAs were used.
Taxonomic group | Species |
---|---|
Brassicaceae | |
Caricaceae | |
Embryophyta | |
Lycopodiophyta | |
Euphorbiaceae | |
Fabaceae | |
Rutaceae | |
Salicaceae | |
Solanaceae | |
Vitaceae | |
Poaceae | |
Panicoideae | |
Pooideae | |
Data processing flow chart for collection of controls, training, and statistical validation. Known miRNAs from miRBase are used as positive controls. Negative controls are short segments randomly picked from ESTs to match the quantity and length distribution of known miRNAs from each plant species in the test set. All controls are aligned to their respective genome, and the alignment location allows attributes to be collected. For positive controls, the alignment position also determines whether the miRNA lies upstream or downstream of the miRNA* within the precursor. Only attributes from the correct location are valid for training positive controls. Upstream and downstream attributes are equally valid for negative controls; therefore, only one is selected at random. Positive controls were aligned using needleall to determine the similarity between all miRNAs. Negative controls were also aligned. The similarity values from the alignment were used to determine whether other highly similar sequences should also be excluded when the sequence in the leave-one-out set was excluded from training. Each sequence in a leave-one-out set was tested for correct classification by the model just trained using the controls in the inclusion set. Counts of correct and incorrect classifications were used to calculate sensitivity and specificity.
The known precursors from miRBase are only used to confirm two things about the miRNAs used as positive controls: (i) the miRNA location on the chromosome is inside a known precursor and not just a random match and (ii) the sequence is upstream or downstream of the miRNA* (in the 5′ half or the 3′ half of the precursor illustrated in Figures
(a) Example of an upstream miRNA with downstream miRNA*. (b) Example of a downstream miRNA with upstream miRNA*. The mature miRNA (in red) may exist within the precursor upstream of the miRNA* (a) or downstream (b). Determining this location is critical for collecting the appropriate attribute set.
For any short sequence from the above three types, the precursor candidate was defined by first locating the miRNA* candidate with the strongest duplex binding. Next, the region between and including the sequence and the miRNA* candidate, along with 15 nt on both sides, was extracted. This region defines the operative precursor region (OPR), which in real miRNAs should form a stem-loop structure. The search for the miRNA:miRNA* candidate duplex with the strongest binding was limited to a 300 nt window. The 300 nt window was used because 95.80% of known plant miRNA precursors are less than 300 nt (Table
Range of stem-loop lengths for plant miRNAs in miRBase.
Statistic | Value |
---|---|
Total count of stem loops | 3284 |
Max stem loop length (nt) | 938 |
Min stem loop length (nt) | 53 |
Average stem loop length (nt) | 153 |
Median stem loop length (nt) | 132 |

Measure | Count > 300 nt | Count < 300 nt | Count < 350 nt |
---|---|---|---|
Counts | 135 | 3146 | 3194 |
% | 4.11% | 95.80% | 97.26% |
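The OPR construction described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' script: the function name `extract_opr`, the 0-based end-exclusive coordinates, and the clamping at the window edges are assumptions. Only the 15 nt flank on both sides comes from the text.

```python
def extract_opr(window, seq_span, star_span, flank=15):
    """Return the operative precursor region (OPR) from a sequence window.

    window    -- sequence surrounding the candidate (e.g. a 300 nt window)
    seq_span  -- (start, end) of the candidate sequence, 0-based, end-exclusive
    star_span -- (start, end) of the miRNA* candidate, same convention
    flank     -- nucleotides added on both sides (15 nt in the text)
    """
    # The region between and including the candidate and its miRNA* candidate.
    start = min(seq_span[0], star_span[0])
    end = max(seq_span[1], star_span[1])
    # Assumption: the flank is clamped so it never runs past the window.
    left = max(0, start - flank)
    right = min(len(window), end + flank)
    return window[left:right]


if __name__ == "__main__":
    # Hypothetical 300 nt window with an upstream candidate and downstream miRNA*.
    window = "".join("ACGT"[i % 4] for i in range(300))
    opr = extract_opr(window, seq_span=(100, 121), star_span=(160, 181))
    print(len(opr))
```

The function treats upstream and downstream candidates identically, because it takes the outermost coordinates of the two spans regardless of their order.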
We use the OPR instead of the precursor region reported by miRBase to ensure equal treatment of all controls and, later, of sequences of unknown class. Put simply, because unknowns and negative controls do not come with precursors, the OPR must be defined in the same way for all sequences, including the positive controls used for training. Figure
A comparison between precursors from miRBase and the corresponding OPR predicted from the EST data. miRNA lja-miR167 MIMAT0010087 (in red) is found in precursor MI0010580. No genome is listed in miRBase for this miRNA, and it does not align to any chromosome for that species. It does, however, align to EST [GenBank: BW598483]. The EST lja-miR167 is correctly classified as a miRNA with a predicted precursor highly similar to the one in miRBase.
The computationally estimated MFE (in kcal/mol) for both the miRNA:miRNA* duplex and the OPR structure are required attributes for training. The RNAduplex function from the Vienna RNA package [
Base set of attributes.
Attribute | Description of the attribute in relation to the control or candidate sequence |
---|---|
chromLen/position | The ratio of the length of the chromosome over the position on that chromosome |
ShannonEntropyNorm | Shannon entropy normalized to the sequence length |
G% | Percentage of G base composition |
C% | Percentage of C base composition |
T% | Percentage of T base composition |
A% | Percentage of A base composition |
DuplexEnergy | The duplex energy of the miRNA:miRNA* duplex |
DuplexEnergyNorm | The duplex energy normalized to the length of the duplex structure |
MaxMismatch | Maximum number of mismatches in the duplex structure, taken over both sides of the structure |
minMatchPercent | Minimum % match relative to the length of the duplex structure, taken over both sides of the structure |
DeltaG | Minimum free energy for the stem loop |
DeltaGnorm | Minimum free energy normalized to the length of the stem loop |
longestDotSet | Longest run of mismatches in the stem loop |
longestBracketSet | Longest run of matches in the stem loop |
loopCountNorm | Number of loop heads normalized to the length of the stem loop |
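Several of the base attributes can be computed directly from the sequence and from a dot-bracket folding string. The sketch below is an illustration under stated assumptions, not the authors' code: "normalized to the sequence length" is taken to mean division by the length, a "loop head" is taken to be a hairpin loop (an opening bracket, one or more dots, then a closing bracket), and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def base_percentages(seq):
    """Percentage of G, C, T, and A in the sequence (U is treated as T)."""
    seq = seq.upper().replace("U", "T")
    counts = Counter(seq)
    return {f"{base}%": 100.0 * counts[base] / len(seq) for base in "GCTA"}


def shannon_entropy_norm(seq):
    """Shannon entropy (bits) of the base frequencies, divided by the length.

    Dividing by the length is an assumed reading of 'normalized'.
    """
    seq = seq.upper().replace("U", "T")
    n = len(seq)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())
    return entropy / n


def longest_run(pattern, text):
    """Length of the longest match of a regular expression, 0 if none."""
    return max((len(m) for m in re.findall(pattern, text)), default=0)


def structure_features(dot_bracket):
    """Run-length and loop features from a dot-bracket structure string."""
    hairpins = re.findall(r"\(\.+\)", dot_bracket)
    return {
        "longestDotSet": longest_run(r"\.+", dot_bracket),
        "longestBracketSet": longest_run(r"\(+|\)+", dot_bracket),
        "loopCountNorm": len(hairpins) / len(dot_bracket),
    }
```

The energy attributes (DuplexEnergy, DeltaG, and their normalized forms) are not covered here because they depend on an external folding program.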
The attribute
Extended attribute set from combinations of base attributes.
Attribute | Description of the attribute in relation to the control or candidate sequence |
---|---|
G + T:= G% + T% | Sum of G% and T% |
G/T:= G%/T% | Ratio from G% to T% |
G + C:= G% + C% | Sum of G% and C% |
G/C:= G%/C% | Ratio from G% to C% |
A + C:= A% + C% | Sum of A% and C% |
A/C:= A%/C% | Ratio from A% to C% |
T + A:= T% + A% | Sum of T% and A% |
T/A:= T%/A% | Ratio from T% to A% |
G%/ShannonEntropyNorm := G%/ShannonEntropyNorm | Ratio of G% over normalized Shannon entropy |
C%/ShannonEntropyNorm := C%/ShannonEntropyNorm | Ratio of C% over normalized Shannon entropy |
T%/ShannonEntropyNorm := T%/ShannonEntropyNorm | Ratio of T% over normalized Shannon entropy |
A%/ShannonEntropyNorm := A%/ ShannonEntropyNorm | Ratio of A% over normalized Shannon entropy |
NormEnergyRatio := DeltaGnorm/DuplexEnergyNorm | Ratio of the normalized DeltaG of the stem loop to the normalized miRNA:miRNA* duplex energy |
longestBracket/longestDot := longestBracketSet / longestDotSet | Ratio of longest match over the longest mismatch normalized counts |
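The extended attributes are simple sums and ratios of the base attributes, so they can be derived mechanically. The following is a minimal sketch, assuming the base attributes are held in a dictionary keyed by the names in the tables above. The helper `safe_ratio` and the choice of `None` for an undefined ratio are assumptions, since the text does not say how a zero denominator was handled.

```python
def safe_ratio(numerator, denominator):
    """Return numerator/denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def extended_attributes(base):
    """Derive the 14 combined attributes from a dict of base attributes."""
    ext = {}
    # Sums and ratios of base-composition pairs.
    for x, y in (("G", "T"), ("G", "C"), ("A", "C"), ("T", "A")):
        ext[f"{x} + {y}"] = base[f"{x}%"] + base[f"{y}%"]
        ext[f"{x}/{y}"] = safe_ratio(base[f"{x}%"], base[f"{y}%"])
    # Each base percentage over the normalized Shannon entropy.
    for b in "GCTA":
        ext[f"{b}%/ShannonEntropyNorm"] = safe_ratio(
            base[f"{b}%"], base["ShannonEntropyNorm"]
        )
    # Stem-loop energy relative to duplex energy, and match/mismatch runs.
    ext["NormEnergyRatio"] = safe_ratio(
        base["DeltaGnorm"], base["DuplexEnergyNorm"]
    )
    ext["longestBracket/longestDot"] = safe_ratio(
        base["longestBracketSet"], base["longestDotSet"]
    )
    return ext
```

Together with the 15 base attributes, these 14 derived values give the 29 attributes per case mentioned in the description of the training set.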
Negative controls are short sequences, randomly picked from the central regions of ESTs. Attributes for negative controls were collected in a similar way as those for positive controls. Sequence lengths were chosen to resemble the length distribution of the positive controls. The randomly picked negative controls only qualified if they contained no ambiguity codes and had a length-normalized Shannon entropy consistent with that of the positive controls. The latter is used to avoid low complexity regions and to ensure that strong negative controls are collected. The attribute
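The selection of negative controls described above can be sketched as a rejection-sampling loop. This is a hypothetical illustration, not the authors' script: the function names, the definition of the "central region" as the middle half of the EST (`margin=0.25`), the retry limit, and the entropy bounds passed by the caller are all assumptions.

```python
import math
import random
from collections import Counter


def entropy_norm(seq):
    """Shannon entropy (bits) of base frequencies divided by the length."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values()) / n


def pick_negative(est, length, entropy_range, rng, margin=0.25, max_tries=100):
    """Pick one candidate negative control from the central part of an EST.

    Returns None if the EST is too short or no segment qualifies.
    """
    first = int(len(est) * margin)
    last = int(len(est) * (1 - margin)) - length
    if last < first:
        return None
    for _ in range(max_tries):
        start = rng.randint(first, last)
        candidate = est[start:start + length].upper()
        if set(candidate) - set("ACGT"):
            continue  # reject segments containing ambiguity codes
        low, high = entropy_range
        if low <= entropy_norm(candidate) <= high:
            return candidate
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    positive_lengths = [20, 21, 21, 22]      # hypothetical length distribution
    est = "ACGTTGCAAGCT" * 20                # hypothetical EST
    length = rng.choice(positive_lengths)    # mimic the positive-control lengths
    print(pick_negative(est, length, (0.05, 0.15), rng))
```

Drawing the length from the list of positive-control lengths is one simple way to make the negative set resemble the length distribution of the known miRNAs.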
Once attributes have been collected for both positive and negative controls, validation sets can be produced. The model was validated by calculating sensitivity and specificity based on leave-one-out cross-validation [
MiRNAs found in different locations may be similar or even identical to others. It is important that sequences similar to the ones left out are also excluded to ensure the rigour of the leave-one-out approach. A study on precursor prediction used the BLASTclust program to identify sequences of high similarity for exclusion from the training set [
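The exclusion of similar sequences from each leave-one-out training set can be sketched as follows. The sketch assumes that pairwise similarity scores (for example, from needleall) have already been computed and stored in a dictionary. The function name, the data layout, and the threshold value are hypothetical, as the cutoff is not stated in this section.

```python
def leave_one_out_splits(names, similarity, threshold):
    """Yield (held_out, training_names) for every control.

    names      -- list of control identifiers
    similarity -- dict mapping frozenset({a, b}) to a pairwise similarity
    threshold  -- controls at or above this similarity to the held-out
                  control are also dropped from the training set
    """
    for held_out in names:
        training = [
            other
            for other in names
            if other != held_out
            and similarity.get(frozenset((held_out, other)), 0.0) < threshold
        ]
        yield held_out, training


if __name__ == "__main__":
    # Hypothetical controls and percent-similarity scores.
    names = ["miR-a", "miR-b", "miR-c"]
    similarity = {
        frozenset(("miR-a", "miR-b")): 95.0,
        frozenset(("miR-a", "miR-c")): 40.0,
        frozenset(("miR-b", "miR-c")): 35.0,
    }
    for held_out, training in leave_one_out_splits(names, similarity, 90.0):
        print(held_out, training)
```

Each yielded training list would then be used to train a model, which is tested on the single held-out control.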
The C5.0 program from RuleQuest incorporates a decision-tree machine learning method that we have used to train a miRNA predictor using controls of known outcome. C5.0 and the Windows version See5 are improved versions of C4.5 that in turn descended from a program called ID3. The training data used by C5.0 can be any combination of nominal attributes (e.g., the letter of the first nucleotide in the sequence) and numeric attributes (e.g., MFE). The data is evaluated for patterns that discriminate between the training classes. The output is a model in the form of if-then rules or decision trees for classifying cases of unknown outcome. The emphasis is on producing models that are accurate and easy to understand. Producing models that are easy to understand can be useful when the goal is to discover the biological relationships between the attributes and class, rather than to classify unknown cases. In some investigations the goal is to determine the attributes and cutoff values critical for separating the classes. C5.0 can be applied to training data containing many thousands of cases with hundreds of attributes each. The typical size of our training set is ~5294 cases, each with only 29 attributes.
The misclassification cost can also be adjusted when training. Because validating a miRNA in the wet lab is relatively expensive, our aim is to train a classifier with a low false positive rate. If, after training, the model had a higher than acceptable false positive rate, a misclassification cost penalizing false positives would be applied when retraining. If the opposite were true (i.e., validation cheap and miRNAs rare), the misclassification cost could be adjusted in the other direction, penalizing false negatives to avoid missing true miRNAs.
C5.0 supports a technique called boosting [
After training, the attribute usage information demonstrates the discriminative importance of each attribute. Table
Example of attribute usage from one representative training run.
Attribute usage | Attribute |
---|---|
100% | G% |
100% | C% |
100% | T% |
100% | DuplexEnergy |
100% | minMatchPercent |
100% | DeltaGnorm |
100% | G + T |
100% | G + C |
98% | DuplexEnergyNorm |
86% | NormEnergyRatio |
85% | MaxMismatch |
82% | ShannonEntropyNorm |
74% | G/T |
51% | A% |
51% | A + C |
28% | chromLen/position |
Excluding an attribute from training reveals the discriminative importance to the classifier. Although this is an imperfect method, collecting data from all leave-one-out training runs can provide an overall view of an attribute’s discriminative power. C5.0 has a built-in function called winnowing that, when applied during training, returns the percentage of increase in error if an attribute is removed from training. Table
Attributes with an average decline of 1% or greater when excluded.
Attribute | Average percentage decline in training accuracy when the attribute is removed |
---|---|
DuplexEnergy | 53% |
T% + A% | 20% |
DeltaGnorm | 14% |
longestBracketSet | 10% |
minMatchPercent | 7% |
G% + C% | 5% |
loopCountNorm | 4% |
MaxMismatch | 4% |
DuplexEnergyNorm | 1% |
DeltaG | 1% |
longestBracket/longestDot | 1% |
Sensitivity and specificity are critical values for assessing classifier accuracy; values as high as 84.08% for sensitivity and 98.53% for specificity have been obtained. The question remains whether this high specificity and sensitivity can also be achieved when the predictor is trained with different species than are used for prediction.
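For reference, the two measures are computed from the counts of correct and incorrect classifications collected over the leave-one-out runs. The sketch below uses the standard definitions. The counts in the usage example are hypothetical and are not the counts behind the values reported here.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of true miRNAs that are classified as miRNA."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-miRNAs that are classified as non-miRNA."""
    return true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(f"sensitivity = {100 * sensitivity(84, 16):.2f}%")
    print(f"specificity = {100 * specificity(985, 15):.2f}%")
```

A high specificity corresponds to a low false positive rate, which is the property prioritized here because of the cost of wet-lab validation.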
If all miRNAs in each taxonomic category listed in Table
Results from exclusion of each of the 13 taxonomic groups.
Taxonomic grouping | Error count | Total count of miRNAs | % correctly classified | % of full set excluded | Notes |
---|---|---|---|---|---|
Embryophyta | 12 | 190 | 94 | 9.16 | |
Lycopodiophyta | 0 | 55 | 100 | 2.65 | |
Brassicaceae | 0 | 440 | 100 | 20.22 | Dicot, two |
Caricaceae | 0 | 1 | 100 | 0.05 | Dicot, papaya |
Euphorbiaceae | 0 | 7 | 100 | 0.34 | Dicot, castor oil plant |
Fabaceae | 0 | 560 | 100 | 27.00 | Dicot, three legumes |
Salicaceae | 16 | 73 | 78 | 3.52 | Dicot, poplar tree |
Solanaceae | 1 | 14 | 93 | 0.68 | Dicot, tomato plant |
Vitaceae | 9 | 89 | 94 | 4.29 | Dicot, common grape |
Rutaceae | 0 | 9 | 100 | 0.43 | Dicot, two citrus trees |
Panicoideae | 8 | 166 | 95 | 8.00 | Monocot, |
Poaceae | 0 | 404 | 100 | 19.48 | Monocot, rice and sorghum |
Pooideae | 13 | 66 | 80 | 3.18 | Monocot, |
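The taxonomic exclusion test holds out every positive control from one group, trains on the rest, and then classifies the held-out group. The sketch below shows the split and the two percentage columns of the table. The function names and the `(name, group)` data layout are assumptions, and the training step itself is omitted.

```python
def split_by_group(controls, held_out_group):
    """Split (name, taxonomic_group) pairs into training and held-out lists."""
    training = [c for c in controls if c[1] != held_out_group]
    held_out = [c for c in controls if c[1] == held_out_group]
    return training, held_out


def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


if __name__ == "__main__":
    # Salicaceae row of the table: 16 errors among 73 miRNAs.
    errors, total = 16, 73
    full_set = 2074  # sum of the 'Total count of miRNAs' column
    print(round(percent(total - errors, total)))   # % correctly classified
    print(round(percent(total, full_set), 2))      # % of full set excluded
```

For the Salicaceae row this gives 78% correctly classified and 3.52% of the full set excluded, matching the table.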
Results from exclusion of each of the four taxonomic groups.
Taxonomic grouping | Error count | Total count of miRNAs | % correctly classified | % of full set excluded | Notes |
---|---|---|---|---|---|
Embryophyta | 13 | 190 | 93 | 9.16 | |
Lycopodiophyta | 0 | 55 | 100 | 2.65 | |
Monocotyledons | 0 | 636 | 100 | 30.67 | Four species |
Dicotyledons | 0 | 1193 | 100 | 57.52 | Twelve species |
Results from exclusion of each of the three taxonomic groups.
Taxonomic grouping | Error count | Total count of miRNAs | % correctly classified | % of full set excluded | Notes |
---|---|---|---|---|---|
Primitive | 0 | 389 | 100 | 11.81 | |
Monocotyledons | 0 | 1775 | 100 | 30.67 | Four species |
Dicotyledons | 0 | 946 | 100 | 57.52 | Twelve species |
Identical miRNA sequences are found in multiple plant species, often on multiple chromosomes, where they are part of identical or different precursors. Although the miRNA sequences can be identical, some of their attributes are not; at a minimum, the location within the chromosome differs. Leave-one-out sets are required for statistical validation because they model the predictive accuracy for a control sequence that is unseen during training. For validation, excluding identical or near-identical miRNAs is therefore critical. When creating a predictor for unknown sequences, by contrast, the classifier can be trained using all plant miRNAs to ensure the best predictor possible. The taxonomic exclusion tests demonstrate that training with the maximum number of known miRNAs produces a classifier with the best cross-species recognition of miRNAs. In one training example, the set included 2278 positive controls, some of which are the same miRNA sequence in different precursors and species. The contrasting negative controls comprised 5680 short sequences randomly picked from ESTs. For this training set, from 3 to 300 boosting trials were run. With close to 300 boosting trials, only five negative controls were misclassified as miRNAs during training. Although only wet-lab validation can confirm that a sequence is a miRNA, two of the five are found in what look like very obvious miRNA precursors. The possible miRNA precursors for these two EST segments are shown in Figures S3 and S4 of the supplementary data. They were classified as miRNAs despite being used as negative controls during training. The folding patterns were produced using QuikFold from the UNAFold Web Server [
One method of demonstrating that the predictor can be used for
In some cases, the known miRNA does not match a location on the genome. Across the 18 plant species used for training, only 17 known miRNAs, from seven species, did not align to the respective genome. When these miRNAs can be found in an EST, the EST sequence can be used in place of a genomic sequence to collect attributes for prediction. An example of this is lja-miR167 in
During development, short sequences were randomly picked from ESTs in several collections. These negative controls were later pooled, and the predictor was applied to determine the number of potential miRNAs in random samples of ESTs. Out of 58,443 sequences, 633 (1.08%) were classified as miRNA. Nine of these (0.02% of all sequences) had the maximum confidence value of 1.00. As described in the methods section, only one miRNA location (either upstream or downstream of the miRNA*) was kept. It is conceivable that if both were kept, the rate of miRNA detection in randomly picked ESTs could double. If some of these negative controls are true miRNAs, they are counted as false positives during validation, and the measured specificity would be higher if they were removed from the negative controls. Specificity is already high, and this suggests that the predictor may be more accurate than the value reported here.
The true sensitivity may also be higher than the reported value. Although miRNAs reported in miRBase were used for training, the model does not classify every positive control as a miRNA. Fifty miRNAs across 10 species that were not classified as miRNAs were collected in a set. These were treated as unknowns, and attributes were collected for reclassification. Of the original 50, twelve miRNAs across four species were classified as miRNAs when attributes from both the upstream and downstream locations were tested. Closer inspection of the differences between the attributes for the same miRNA in the two sets revealed that in some cases the relative position of the miRNA and miRNA* was different. In these cases the precursor in miRBase was asymmetrical, and the automated location-picking script returned the wrong location within the precursor. When the correct location and attributes were tested by the predictor, the sequence was classified as a miRNA. An example is gma-MIR5034 MI0017906, where the miRNA is clearly located in the 3′ end. However, it has 69 bases on its downstream side and 60 bases on the upstream side, putting the miRNA predominantly in the upstream half of the precursor. Although a small set of erroneous attributes based on the wrong locations was included in the training dataset, the classifier correctly classified these sequences as miRNAs when given both upstream and downstream attributes. Including attributes from incorrect locations in training produced a lower sensitivity value than if correct locations had been used but did not diminish the predictor’s ability to correctly classify true miRNAs. This suggests that the predictor is more accurate than the reported sensitivity indicates.
We have shown that a highly accurate universal plant miRNA predictor can be produced by machine learning using C5.0. This predictor can be applied to any short sequence that aligns to a precursor candidate in a genome or transcriptome. The source of the sequence for testing can be short reads from deep sequencing, or short segments taken from chromosomes or EST sliding windows. Along with miRNA prediction and prediction confidence level, the putative precursor is also produced ready for folding and visualization. If used to scan a chromosome region, this predictor will reveal areas of high or low miRNA density. If applied to deep sequencing data the predictor reveals how often in a genome the short read exists, how many are predicted miRNAs, and how many unique precursors contain that short-read miRNA.
P. H. Williams provided the initial conception, the design of the machine learning approach, data acquisition, and statistical validation. R. Eyles was involved with analysis and interpretation of data, including critical review of results. G. F. Weiller contributed to methodological tool selection for data analysis and critical manuscript revisions.
The authors thank Dr C. Weiller for critical reading of the paper. They acknowledge the Australian Research Council for funding the work, specifically grants CE0348212 (R. Eyles and G. F. Weiller) and DP0879308 (P. H. Williams and G. F. Weiller). Scripts and links to binaries can be found at