Eukaryotic regulatory small RNAs (sRNAs) play significant roles in many fundamental cellular processes. As such, they have emerged as useful biomarkers for diseases and cell differentiation states. sRNA-based biomarkers outperform traditional messenger RNA-based biomarkers by testing fewer targets with greater accuracy and providing earlier detection for disease states. Therefore, expression profiling of sRNAs is fundamentally important to further advance the understanding of biological processes, as well as diagnosis and treatment of diseases. High-throughput sequencing (HTS) is a powerful approach for both sRNA discovery and expression profiling. Here, we discuss the general considerations for sRNA-based HTS profiling methods from RNA preparation to sequencing library construction, with a focus on the causes of systematic error. By examining the enzymatic manipulation steps of sRNA expression profiling, this paper aims to demystify current HTS-based sRNA profiling approaches and to aid researchers in the informed design and interpretation of profiling experiments.
RNA in eukaryotic cells can be classified into five categories: ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), messenger RNAs (mRNAs), long noncoding RNAs (lncRNAs), and small RNAs (sRNAs). Over 90% of the total RNA molecules present in a cell are rRNA and tRNA, while sRNAs account for ~1% or less. Eukaryotic regulatory sRNAs are a subset of sRNAs ranging in size from ~20 to 30 nt and include microRNAs (miRNAs), small interfering RNAs (siRNAs), and piwi-interacting RNAs (piRNAs). The functions of these regulatory sRNAs are conserved from plants to animals, which implies their involvement in fundamental cellular processes [
High-throughput sequencing (HTS) has revolutionized the study of sRNAs by simultaneously accelerating their discovery and revealing their expression patterns. As we have learned from microarray-based sRNA expression profiling [
In this paper, we discuss preparation of sRNAs for profiling by HTS and enzymatic manipulation upstream of sequencing library preparation. The purpose of enzymatic manipulation is either to improve representation and reduce bias or to specifically focus on subsets of sRNAs based on end modifications. Furthermore, we review the activities of the enzymes directly involved in common HTS library preparation methods and discuss their relative strengths and weaknesses to facilitate choosing suitable protocols and interpretation of the results.
Although small in size, eukaryotic regulatory sRNAs are diverse in their sequences, modifications, biogenesis, expression patterns, and functions [
miRNAs are a class of 21 to 24 nt sRNAs in most eukaryotes that regulate gene expression at the transcriptional or posttranscriptional level [
Classes of small RNAs and their 5′- and 3′-end modifications.
Class | Organism | 5′-end modification | 3′-end modification |
---|---|---|---|
miRNA | Mammals | Monophosphate | 2′-OH |
miRNA | Nematodes | Monophosphate | 2′-OH |
miRNA | Insects | Monophosphate | 2′-OH |
miRNA | Plants | Monophosphate | 2′-O-methyl |
siRNA | Mammals | Monophosphate | 2′-OH |
siRNA | Nematodes | Monophosphate | 2′-OH |
siRNA | Insects | Monophosphate | 2′-O-methyl |
siRNA | Plants | Monophosphate | 2′-O-methyl |
Secondary siRNA | Nematodes | Polyphosphate | 2′-OH |
Secondary siRNA | Plants | Monophosphate | 2′-O-methyl |
piRNA | Mammals | Monophosphate | 2′-O-methyl |
piRNA | Nematodes | Monophosphate | 2′-O-methyl |
piRNA | Insects | Monophosphate | 2′-O-methyl |
For references, see text and [
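The table can also be read as a lookup for library design. The following is a minimal Python sketch, not a validated tool. It transcribes the table and flags classes that, under standard ligation-based protocols, would need conversion of the 5′-end (a 5′-polyphosphate is not a ligatable 5′-monophosphate) or would be expected to ligate with reduced efficiency at the 3′-end (2′-O-methyl). The names `END_MODS` and `ligation_flags` are illustrative assumptions and are not taken from any published software.

```python
# End modifications transcribed from the table above:
# (class, organism) -> (5'-end modification, 3'-end modification)
END_MODS = {
    ("miRNA", "Mammals"): ("monophosphate", "2'-OH"),
    ("miRNA", "Nematodes"): ("monophosphate", "2'-OH"),
    ("miRNA", "Insects"): ("monophosphate", "2'-OH"),
    ("miRNA", "Plants"): ("monophosphate", "2'-O-methyl"),
    ("siRNA", "Mammals"): ("monophosphate", "2'-OH"),
    ("siRNA", "Nematodes"): ("monophosphate", "2'-OH"),
    ("siRNA", "Insects"): ("monophosphate", "2'-O-methyl"),
    ("siRNA", "Plants"): ("monophosphate", "2'-O-methyl"),
    ("Secondary siRNA", "Nematodes"): ("polyphosphate", "2'-OH"),
    ("Secondary siRNA", "Plants"): ("monophosphate", "2'-O-methyl"),
    ("piRNA", "Mammals"): ("monophosphate", "2'-O-methyl"),
    ("piRNA", "Nematodes"): ("monophosphate", "2'-O-methyl"),
    ("piRNA", "Insects"): ("monophosphate", "2'-O-methyl"),
}


def ligation_flags(srna_class, organism):
    """Return coarse warnings for a standard ligation-based protocol.

    Hypothetical helper: it only encodes two qualitative rules and does not
    predict actual ligation yields.
    """
    five_prime, three_prime = END_MODS[(srna_class, organism)]
    return {
        # Anything other than a 5'-monophosphate must be converted first.
        "needs_5p_conversion": five_prime != "monophosphate",
        # A 2'-O-methyl 3'-end ligates less efficiently under standard conditions.
        "reduced_3p_ligation": three_prime == "2'-O-methyl",
    }
```

For example, `ligation_flags("Secondary siRNA", "Nematodes")` flags only the 5′-end, whereas `ligation_flags("piRNA", "Mammals")` flags only the 3′-end.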
Mature miRNAs are present in the cytoplasm, mainly within cytosolic P bodies, stress granules, and in association with polyribosomes [
miRNAs are now considered to be key regulators of gene expression in higher eukaryotes with estimates that at least 20–30% of human protein-coding genes are regulated by miRNAs [
A second major class of sRNAs is endogenous small interfering RNAs (siRNAs). They are 21 to 23 nt in length and originate from endogenous double-stranded RNAs (dsRNAs) that are either synthesized by RNA-dependent RNA polymerase or that originate from annealed regions within or between endogenous transcripts [
Piwi-interacting RNAs (piRNAs) are a class of sRNAs that are 26 to 30 nt long and are speculated to be generated from long single-stranded RNA precursors [
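The size ranges quoted above for miRNAs (21 to 24 nt), siRNAs (21 to 23 nt), and piRNAs (26 to 30 nt) overlap, so read length alone cannot assign a sequence to a class. The short Python sketch below illustrates this point by listing every class whose range contains a given length. The names `LENGTH_RANGES_NT` and `candidate_classes` are hypothetical, and the function is only a first-pass filter, not a classifier.

```python
# Size ranges (in nt) as quoted in the text for the three major classes.
LENGTH_RANGES_NT = {
    "miRNA": (21, 24),
    "siRNA": (21, 23),
    "piRNA": (26, 30),
}


def candidate_classes(length_nt):
    """Return all classes whose quoted size range includes length_nt."""
    return [
        name
        for name, (shortest, longest) in LENGTH_RANGES_NT.items()
        if shortest <= length_nt <= longest
    ]
```

A 22 nt read is compatible with both miRNA and siRNA, while a 25 nt read falls outside all three quoted ranges, which is why end modifications and genomic origin are also needed for classification.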
Besides miRNAs, siRNAs, and piRNAs, many other classes of sRNAs exist and novel classes continue to be discovered in many organisms. For example, tasiRNA, natsiRNA, tncRNA, hcRNA, rasiRNA, scnRNA, and 21U-RNA have been identified recently [
With the employment of new technologies, such as HTS, discovering new classes of sRNAs is more feasible. Understanding their biological roles in various aspects of cellular processes and disease states is an important and exciting scientific frontier.
Due to their conserved functions in gene regulation, miRNAs have become valuable biomarkers for many diseases and cell differentiation states [
During their biogenesis, miRNAs can be subjected to various editing events, such as 3′ to 5′ exonucleolytic processing [
Though focused on HTS-based expression profiling, the methods and principles for preparing samples upstream of sequencing library construction discussed here are also applicable to sample preparation for other RNA expression profiling methods. To profile sRNA expression, it is desirable to avoid introducing systematic error from the sample acquisition, RNA extraction, and preparation. It is also critical that these procedures are thoughtfully considered to ensure reproducibility, valid interpretation, and comparative analysis of profiling results.
Clinical research-related sRNA profiling commonly deals with human samples. Age, sex, race, background comorbidity, anesthesia processes, state of consciousness, and circadian rhythms are potentially relevant to miRNA expression profiling [
When studying sRNAs from tissues, care must be taken in the tissue processing, which includes tissue procurement, fixation, and embedding. miRNAs appear to be more stable in FFPE tissue than mRNAs, probably due to their small size and reduced likelihood of remaining cross-linked with proteins after proteinase K digestion [
sRNAs are often isolated or enriched from extracted total RNA in profiling workflows. Although larger RNAs will eventually be excluded from sRNAs during library preparation, it is critical to maintain the integrity of total RNA to avoid contamination by degraded large RNAs, especially rRNA. To extract total RNA, routine methods consist of two steps: deproteinizing RNA in biological samples and precipitating the RNA. Deproteinizing RNA can be achieved by SDS solubilization followed by phenol extraction or TRIzol extraction [
Ethanol precipitation of sRNAs is commonly used to recover RNAs from ~20 nt to several kilobases in length. When possible, adding a nucleic acid carrier, such as glycogen, linear polyacrylamide, or tRNA, to the sample or prior to the extraction will increase the yield of extraction and precipitation [
Many column-based RNA isolation kits are commercially available. A key consideration in choosing a kit for sRNA profiling experiments is whether sRNAs are retained with high yield during purification. Many of these kits isolate RNA based on the affinity of nucleic acids for silica-based materials in the presence of chaotropic salts, such as guanidinium isothiocyanate, while proteins and other cellular components pass through. Chaotropic salts can carry over through purification and impair downstream enzymatic reactions. Therefore, thorough column washing is advised.
After RNA extraction, removal of residual genomic DNA using DNase I is necessary to ensure the purity of total RNA. It is also highly recommended to check the integrity of total RNA before isolating sRNAs. Total RNA quality and quantity can be determined by gel electrophoresis or with a microfluidics-based instrument, such as the Agilent 2100 Bioanalyzer (Agilent Technologies Inc., Santa Clara, CA, USA) [
Though it adds hands-on labor and time, enrichment of sRNAs may be desirable for sRNA library construction because the high abundance of rRNA, tRNA, and mRNA may overwhelm the representation of sRNAs in HTS. sRNAs can be separated from other RNAs using polyacrylamide gel electrophoresis (PAGE). After excising gel pieces in the desired size range, sRNAs can be eluted by crushing and soaking in solution with constant rotation (passive diffusion) or can be more efficiently eluted using an electroelution approach with tubes, such as Mini GeBAflex-tubes (Gene Bio-Application Ltd, Yavne, Israel). Gel extraction allows for the tightest control of RNA size range to be analyzed in downstream procedures. A variation of PAGE fractionation is the FlashPAGE Fractionator (Life Technologies) which is a minielectrophoresis device that runs small scale polyacrylamide tube gels for isolating RNAs below a threshold length [
Due to their different origins and biogenesis pathways, sRNAs differ from each other in their modifications at the 5′- and 3′-termini (Table
Mature miRNAs and siRNAs from mammals have a monophosphate at their 5′-ends and 2′-, 3′-hydroxyl groups at their 3′-ends [
Some sRNA 5′- or 3′-end modifications are not reactive or have reduced reactivity for enzymatic manipulation in expression profiling protocols. For example, the commonly used T4 RNA ligases can efficiently catalyze the formation of a 3′- to 5′-phosphodiester bond between a 3′-hydroxyl group and a 5′-phosphate group [
Enzymatic manipulation of RNAs with modifications at their 5′- or 3′-ends. Black lines represent RNA with the left and right ends representing the 5′- and 3′-ends, respectively. One, two, or three grey circles represent mono-, di-, or triphosphate at the 5′-end. “A” and “mG” represent a 3′ to 5′ AMP and cap structure at RNA 5′-end. “2′-OH” or “2′, 3′-OH” represents RNAs with no modification at the 3′-end. “2′-OCH3” and “2′, 3′-CP” represent 2′-O-methylation and 2′, 3′-cyclic phosphate at the 3′-end, respectively. Dashed lines represent degraded RNA. The nucleotide “N” in grey color represents the nucleotide removed during the
sRNA 5′-ends can have a 5′-hydroxyl group or contain a mono-, di-, or triphosphate group, or a cap structure. In order to convert sRNAs to have ligatable 5′-monophosphates, a number of enzymes can be utilized, and the choice of enzyme depends on the starting modification and desired enrichment or depletion of different substrates. To capture sRNAs with a 5′-triphosphate, such as secondary siRNAs, the 5′-triphosphate can be removed by alkaline phosphatase to yield a 5′-hydroxyl group. The removal of 5′-phosphate groups to yield a 5′-hydroxyl group has the advantage of preventing RNA self-ligation to form circles and concatemers. This has the net result of improving the yield of properly ligated products when ligating an adapter to the RNA 3′-end [
Instead of 5′-phosphorylated DNA adapters, adenylated DNA adapters are widely ligated to RNA 3′-hydroxyl ends since preadenylation allows for the exclusion of ATP in ligation reactions when using T4 RNA ligases. This leads to decreased formation of self-ligated adapter or adapter concatemers [
It remains to be determined whether there are significant amounts of sRNA species that contain 5′-adenylated ends
TAP hydrolyzes the phosphoric acid anhydride bonds in the triphosphate bridge of the cap structure, releasing the cap nucleoside and generating a 5′-monophosphate terminus on the RNA molecule [
Due to the presence of 5′-monophosphate groups in sRNAs, such as miRNAs and siRNAs, one can selectively degrade these sRNAs using XRN1, a 5′ to 3′ exoribonuclease [
3′-ends of sRNAs can also be differentially modified during biogenesis. piRNAs, for instance, are methylated at the 2′-position of the 3′-terminal ribose. RNAs with a 3′-end 2′-O-methyl group are ligatable by T4 RNA ligases but with significantly decreased efficiency under standard conditions. Ligation reactions using a mutant variant of T4 RNA ligase 2 (T4 Rnl2), T4 RNA ligase 2 truncated (T4 Rnl2tr), at an optimal PEG concentration can significantly improve 3′-adapter ligation efficiency of RNAs with a 2′-O-methyl 3′-end to a level equivalent to that of unmodified RNAs. As a result, their representation in sRNA quantification experiments will be increased [
To selectively capture sRNAs with a 2′-O-methyl at the 3′-end in HTS libraries, such as piRNAs, RNAs can be treated with oxidation followed by
2′-, 3′-cyclic phosphate at RNA 3′-ends can also arise from enzymatic or chemical processing of RNA. In contrast to DNA, the reactive 2′-hydroxyl group on the ribose ring in RNA can promote a nucleophilic attack and breakage of the 5′-, 3′-phosphodiester bond, forming 2′-, 3′-cyclic phosphate ends. RNAs fragmented by treatment with divalent cations or ribozyme-mediated cleavage have a 2′-, 3′-cyclic phosphate at the 3′-end that arises by this mechanism [
Converting the RNA 3′-ends from 2′-, 3′-cyclic phosphate, or 2′-hydroxyl, 3′-phosphate to 2′-, 3′-hydroxyl groups is necessary prior to ligation reactions. This can be achieved by treatment with wild-type T4 PNK with 3′-phosphatase activity, though the pH optimum for the resolution and repair reaction of 2′-, 3′-cyclic phosphate ends is more acidic than for the traditional kinase reaction [
In sRNA expression profiling workflows, RNA extraction, enrichment, and enzymatic treatment are potential sources of systematic error upstream of HTS library construction. To ensure representation and accurate quantification of sRNAs, these early steps should be thoughtfully considered and explicitly documented. The full extent of RNA-end modifications is not yet established, and, as novel modifications are discovered, new approaches to prepare RNAs containing these modifications will need to be developed. This will enable realistic interpretation of sRNA profiling data and allow for potential future comparisons.
HTS approaches have been rapidly adopted for use in sRNA expression profiling. Quantification based on counting sequenced sRNA species provides a dynamic range that is orders of magnitude greater than traditional microarray approaches, and HTS analyzes orders of magnitude more targets than qPCR. In addition, HTS allows for the identification of new sRNAs with yet-undescribed functions.
HTS sRNA profiling methods generally consist of adding adapters to both ends of sRNAs through various enzymatic reactions and sequencing the resulting sRNA libraries on next-generation sequencers. The idea that HTS can be used for sRNA expression profiling is based on the concept that the relative frequency of sRNAs sequenced correlates to their relative abundance in the sample. However, correlation may be imperfect due to systematic errors in the sRNA preparation protocols. Multiple sources of such bias can be introduced during library preparations including adapter ligation bias from T4 RNA ligases and RNA secondary structures, PCR amplification bias, and bias from sequencing platforms. Building upon the review of RNA modifications and activities of important enzymes used for sRNA profiling in the previous section, we will now examine widely used library construction methods and discuss the potential sources of bias and possible solutions to minimize such bias.
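The counting concept described above can be made concrete with a small normalization step. The following Python sketch converts raw read counts per sRNA into reads per million (RPM), a common way of expressing relative frequency within a library. It is an illustration under simple assumptions (counts are already assigned to sRNAs, and no bias correction is attempted); the function name `reads_per_million` and the sRNA labels are hypothetical.

```python
def reads_per_million(counts):
    """Scale raw read counts so that the library total equals one million.

    counts: dict mapping an sRNA identifier to its raw read count.
    The result reflects relative frequency among sequenced reads only; it
    inherits any ligation, amplification, or sequencing bias in the counts.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library contains no counted reads")
    return {name: n * 1_000_000 / total for name, n in counts.items()}


# Toy library with two illustrative sRNAs.
rpm = reads_per_million({"sRNA_A": 750, "sRNA_B": 250})
```

Because RPM values are ratios within one library, they are only comparable between libraries prepared with the same protocol, which is the caveat developed in the remainder of this section.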
Attaching adapters at sRNA 5′- and 3′-ends is required for downstream cDNA synthesis, amplification, and sequencing in HTS. Figure
Small RNA high-throughput sequencing library construction methods. The 5′- and 3′-adapters are shown as blue and red lines, respectively. sRNAs are depicted as grey lines. After sRNAs are converted into DNA, the sequences are shown as black lines. The asterisks represent steps suitable for introducing barcodes in each method. The dashed lines with arrows illustrate cDNA synthesis. At the bottom of each schematic diagram, RNA 5′-end requirement and sensitivity to 2′-O-methyl modification at the 3′-end for each method are noted.
The hybridization-based ligation method (SREK kit for the SOLiD sequencing platform developed by Life Technologies) uses two double-stranded adapters that contain degenerate 5′- or 3′-end overhangs. These degenerate overhanging sequences allow the adapters to anneal to the unknown sRNA ends. After annealing, the nicks between sRNAs and adapters are sealed using T4 RNA ligases. After ligation, the reaction products are reverse transcribed into cDNA by extending the bottom strand of the 3′-adapter and further amplified using primers annealing to both adapter sequences [
A second method utilizes polyadenylation. Multiple A residues are added to the 3′-end using Poly(A) polymerase, creating a 3′ polyA tail [
A third method uses sequential adapter ligations and is widely used for sequencing on the Illumina platform. The method sequentially ligates 3′- and 5′-adapter oligonucleotides directly to the unknown sRNA pools [
Under standard library construction protocols, sRNAs with 2′-O-methyl modifications at their 3′-ends tend to be underrepresented in HTS-based expression profiling experiments due to the effect of the modification on enzymatic reactions. Both the polyadenylation and ligation efficiencies of RNAs with a 2′-O-methyl group at the 3′-end are lower than those of RNAs with unmodified 3′-ends [
The reverse transcriptase used for cDNA synthesis is also known to be sensitive to 2′-O-methyl residues in RNA templates [
In order to capture sRNAs with a 5′-triphosphate, such as secondary siRNAs, in HTS libraries, the sRNAs can either be enzymatically treated to convert the 5′-triphosphate to 5′-monophosphate as shown in Figure
As discussed above, sRNA HTS library construction is achieved through a series of enzymatic reactions. T4 RNA ligases are the key enzymes commonly used in all current library construction protocols. Here, we focus on the enzymatic properties of T4 RNA ligases, including T4 Rnl1 and T4 Rnl2, in the context of each library construction protocol.
In the hybridization-based ligation method, ligation is dependent on the annealing of the degenerate region of the adapter to RNAs in the sample. The annealing step itself could potentially introduce bias. We detected significant sequence bias in experiments that used degenerate stem-loop RT primers to sequence random oligonucleotide pools. In the experiments, partially double-stranded stem-loop oligos with 3′-overhanging degenerate regions were designed to hybridize with the 3′-end of sRNAs. HTS data from libraries prepared with this approach showed significant bias toward GC sequences in the hybridizing region. Although the study did not involve use of T4 RNA ligase, it is illustrative of the potential bias in sequence composition when degenerate oligos are used for hybridization [
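One simple way to look for this kind of composition bias is to compare the GC content of the hybridizing region of sequenced inserts against the value expected for the input pool. The Python sketch below is a minimal illustration, assuming that inserts are given as plain sequence strings and that the hybridizing region is the last `window_nt` nucleotides of each insert. The window size of 6 and the function names are assumptions made here for illustration, not parameters from the study described above. For a random oligonucleotide pool synthesized with equal base proportions, the expected GC fraction would be about 0.5.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (DNA or RNA alphabet)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def mean_terminal_gc(inserts, window_nt=6):
    """Mean GC fraction over the last window_nt bases of each insert.

    Inserts shorter than window_nt are skipped. window_nt is a hypothetical
    stand-in for the length of the degenerate hybridizing region.
    """
    fractions = [gc_fraction(s[-window_nt:]) for s in inserts if len(s) >= window_nt]
    if not fractions:
        raise ValueError("no insert is long enough for the chosen window")
    return sum(fractions) / len(fractions)
```

A mean clearly above the expected value for the input pool would be consistent with the GC bias described above, although a proper analysis would also need a statistical test and a matched control library.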
Another concern in the hybridization-based ligation method arises from the junction substrate specificity of T4 Rnl2. During the annealing process, various types of junctions between RNA and adapter termini can form depending on positioning of the RNA and adapter. Annealing may result in a nick, a flap of one or more extra nucleotides at the RNA 5′- or 3′-terminus, or a gap. It is known that T4 Rnl2 only promotes the formation of phosphodiester bonds between 3′-hydroxyl ends and 5′-phosphate ends in the nicked arrangement [
In the sequential adapter ligation protocol, T4 Rnl1 and T4 Rnl2 are used to attach sequence-specific adapters to the 5′- and 3′-end of sRNAs. Although the preferred substrates of both enzymes are RNAs, T4 Rnl1 and T4 Rnl2 are capable of using 5′-phosphate DNA ends as donors, while T4 Rnl1 is also capable of using a DNA acceptor (3′-hydroxyl) [
T4 Rnl2tr, a C-terminal truncated T4 Rnl2, has desirable features for use in sRNA library construction. The C-terminal domain of T4 Rnl2 is implicated in transferring AMP from ligase to 5′-PO4 to form an adenylated RNA intermediate and thus T4 Rnl2tr requires preadenylated donor molecules for ligation [
A recent study using a pool of synthetic miRNAs showed that the inconsistencies in miRNA quantitation in HTS are mainly derived from the adapter ligation steps [
cDNA synthesis by reverse transcriptase (RT) is a common step in all RNA HTS library construction methods. Both fidelity and the ability of RT to synthesize full-length cDNA can potentially impact sRNA profiling by HTS.
In terms of base misincorporation rates, the fidelity of RTs is lower than that of modern proofreading DNA-dependent DNA polymerases used in HTS library construction. While potentially problematic for sRNA variant discovery, we would argue that the contribution of base misincorporation to systematic error in sRNA profiling by HTS is insignificant. The base misincorporation rates of AMV and MLV RTs are ~1/17,000 and ~1/30,000, respectively, as reviewed in [
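To put these rates in perspective, the following Python sketch estimates the probability that a single sRNA-length cDNA carries at least one misincorporated base. It is a back-of-the-envelope calculation that assumes errors occur independently at each position with the per-nucleotide rates quoted above, and it assumes an insert length of 22 nt (a value chosen here from within the miRNA size range, ignoring adapter sequence). The function name is hypothetical.

```python
def p_any_misincorporation(error_rate_per_nt, length_nt):
    """Probability of at least one misincorporation in length_nt positions,
    assuming independent errors at a constant per-nucleotide rate."""
    return 1.0 - (1.0 - error_rate_per_nt) ** length_nt


AMV_RATE = 1 / 17_000   # per-nucleotide rate quoted in the text
MLV_RATE = 1 / 30_000   # per-nucleotide rate quoted in the text
SRNA_LENGTH_NT = 22     # assumed insert length for illustration

p_amv = p_any_misincorporation(AMV_RATE, SRNA_LENGTH_NT)
p_mlv = p_any_misincorporation(MLV_RATE, SRNA_LENGTH_NT)
print(f"AMV RT: {p_amv:.4%} of cDNAs with at least one error")
print(f"MLV RT: {p_mlv:.4%} of cDNAs with at least one error")
```

Under these assumptions, roughly 0.13% of cDNAs for AMV RT and roughly 0.07% for MLV RT would contain an error, which supports the view that misincorporation contributes little to quantification error compared with the ligation biases discussed earlier.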
Insertion and deletion errors by RT are less well characterized than misincorporations, and their impact on sRNA profiling by HTS remains to be determined. Similarly, further elucidation of the causes of RT mutational hotspots [
Untemplated 3′-end nucleotide addition by RT is disadvantageous for protocols synthesizing cDNAs prior to ligation steps. This activity might be problematic for experiments where the precise determination of the sRNA 5′-end is required as discussed in Section
RT primer extension of a panel of 5′- and 3′-ligated synthetic miRNAs showed no differences in the yields of cDNA synthesized [
The relatively low number of miRNAs per genome (1520 human miRNAs in miRBase version 18) [
Considering the observed bias originating from T4 RNA ligases, introducing barcodes during or prior to any of the ligation steps seems potentially problematic from the viewpoint of sRNA profiling. The different barcode sequences can influence the RNA and adapter cofold structures, likely resulting in barcode-dependent changes in ligation efficiency of sRNAs in the sample. Similarly, changing the adapter sequences can also be expected to change ligation efficiency for specific sRNAs. The net effect of these changes could confound the interpretation of expression profiling results.
Introducing barcodes in the reverse transcription or PCR steps seems less likely to cause biases in estimation of sRNA levels. However, this approach is not without caveats. It is known, for instance, that multiple-template PCR amplification can result in a sequence-dependent amplification bias due to sequence differences [
From the perspective of RNA ligase-dependent sRNA HTS library construction for expression profiling, it seems clear that library construction protocols, including ligation enzyme reaction conditions and adapter sequences, warrant careful consideration for the interpretation of results. When the libraries are prepared with the same protocol, comparisons of individual miRNA levels between libraries are likely valid and reproducible. Quantification of different sRNAs within a library or quantitation of specific sRNAs between samples prepared with different protocols may be influenced by the protocols themselves. The inclusion of spike-in external standards and careful secondary validation are critical to accurate interpretation of profiling results.
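As one illustration of how spike-in external standards can expose protocol-dependent bias, the Python sketch below compares the share of reads each spike-in receives with the share expected from its known input amount. A ratio of 1 means the spike-in is represented in proportion to its input; values above or below 1 indicate over- or underrepresentation. This is a minimal sketch with hypothetical names (`spike_in_deviation`, `spike_1`, `spike_2`) and toy numbers; it is not a published normalization method.

```python
def spike_in_deviation(input_amounts, read_counts):
    """Observed read share divided by expected share for each spike-in.

    input_amounts: dict of spike-in name -> known input amount (any unit,
                   as long as the same unit is used for all spike-ins).
    read_counts:   dict of spike-in name -> reads assigned to that spike-in.
    """
    total_input = sum(input_amounts.values())
    total_reads = sum(read_counts[name] for name in input_amounts)
    if total_input <= 0 or total_reads <= 0:
        raise ValueError("need positive total input and positive spike-in reads")
    return {
        name: (read_counts[name] / total_reads) / (input_amounts[name] / total_input)
        for name in input_amounts
    }


# Two spike-ins added in equal amounts but recovered at a 1:3 read ratio.
deviation = spike_in_deviation(
    {"spike_1": 1.0, "spike_2": 1.0},
    {"spike_1": 100, "spike_2": 300},
)
```

In this toy case the ratios are 0.5 and 1.5, meaning that equimolar standards were recovered unequally. Comparing such ratios between protocols, or between barcoded adapters, is one way to document the biases that would otherwise confound cross-protocol comparisons.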
Validating expression profiles using alternative methods is essential due to the limitations and systematic error that may exist in any profiling method. Quantitative PCR, Northern Blot hybridizations, and microarrays are widely used methods for sRNA expression profiling [
Other emerging alternative sRNA profiling methods based on electrochemical, bioluminescence, Raman, and surface plasmon resonance signals are well discussed in a recent review [
Given the biases discussed here for current HTS-based profiling, the successful development of an amplification-free, direct RNA sequencing platform is particularly attractive for obtaining comprehensive and bias-free profiles of the transcriptome [
Profiling techniques that provide sensitive detection with large dynamic range but do not require the modification of sRNAs seem ideal. As we look toward the future, new technologies such as nanopores may be able to satisfy some of these criteria. For example, a recent report demonstrated proof of principle for nanopore detection of sRNAs using specific hybridization probes and the viral siRNA binding protein p19 to analyze specific miRNA in a total RNA sample [
This paper has discussed many of the sources of inaccuracy and bias that can arise in expression profiling of sRNAs. We focused on the implications of enzymatic manipulation of sRNAs using HTS library construction as an example. Although there are numerous steps where systematic error can be introduced in these workflows, HTS remains the most powerful current method for expression profiling and sRNA discovery. Enzymatic manipulation of nucleic acids in expression profiling will continue to be important, even as hardware platforms change. Thus, it is important that experimental design and interpretation of expression profiling experiments thoughtfully consider the capabilities of enzymes used as tools in order to produce high-quality data sets and to generate valid comparisons between them.
sRNAs: Small RNAs
HTS: High-throughput sequencing
rRNA: Ribosomal RNA
tRNA: Transfer RNA
mRNA: Messenger RNA
lncRNA: Long noncoding RNA
miRNA: MicroRNA
siRNA: Small interfering RNA
piRNA: Piwi-interacting RNA
pri-miRNA: Primary miRNA
pre-miRNA: Precursor miRNA
RISC: RNA-induced silencing complex
FFPE: Formalin-fixed paraffin-embedded tissue
snRNA: Small nuclear RNA
hnRNA: Heterogeneous nuclear RNA
TAP: Tobacco acid pyrophosphatase
T4 PNK: T4 polynucleotide kinase
XRN1: A 5′ monophosphate dependent 5′ to 3′ exoribonuclease
MthRnl: Methanobacterium thermoautotrophicum RNA ligase
T4 Rnl1: T4 RNA ligase 1
T4 Rnl2: T4 RNA ligase 2
T4 Rnl2tr: T4 RNA ligase 2 truncated
T4 Rnl2tr K227Q: T4 RNA ligase 2 truncated with mutation K227Q
AMV RT: Avian myeloblastosis virus reverse transcriptase
MLV RT: Murine leukemia virus reverse transcriptase.
The authors thank Bill Jack, Sriharsa Pradhan, and Larry McReynolds for their critical reading and discussion of the paper, and Brad Langhorst and Jennifer Ong for helpful discussions.