Technical Considerations for Reduced Representation Bisulfite Sequencing with Multiplexed Libraries

Reduced representation bisulfite sequencing (RRBS), which couples bisulfite conversion and next generation sequencing, is an innovative method that specifically enriches genomic regions with a high density of potential methylation sites and enables investigation of DNA methylation at single-nucleotide resolution. Recent advances in the Illumina DNA sample preparation protocol and sequencing technology have vastly improved sequencing throughput capacity. Although the new Illumina technology is now widely used, the unique challenges associated with multiplexed RRBS libraries on this platform have not been previously described. We have made modifications to the RRBS library preparation protocol to sequence multiplexed libraries on a single flow cell lane of the Illumina HiSeq 2000. Furthermore, our analysis incorporates a bioinformatics pipeline specifically designed to process bisulfite-converted sequencing reads and evaluate the output and quality of the sequencing data generated from the multiplexed libraries. We obtained an average of 42 million paired-end reads per sample for each flow-cell lane, with a high unique mapping efficiency to the reference human genome. Here we provide a roadmap of modifications, strategies, and trouble shooting approaches we implemented to optimize sequencing of multiplexed libraries on an a RRBS background.


Introduction
The emergence of next generation sequencing (NGS) in recent years has revolutionized genetic researchers' ability to interrogate genomes on an unprecedented scale. The sequencing-by-synthesis NGS technologies are evolving rapidly and throughput capacity is increasing exponentially [1,2]. The Illumina sequencing platform, on which we focus here, involves clonal amplification of adaptor-ligated fragments of genomic DNA that are sequenced using reversible terminator-based chemistry. NGS has a broad scope of applications including de novo genome assembly, transcriptomics, SNP discovery and clinical diagnostics [3][4][5]. Of particular interest is the potential that NGS has for the field of epigenetics, specifically genome-wide DNA methylation analysis. Catalyzed by DNA methyltransferases (DNMTs), DNA methylation involves the covalent addition of a methyl group (CH 3 ) to the cytosine (C) to make 5-methylcytosine (5mC) [6]. In mammals, this stable modification almost exclusively occurs in the context of CpG dinucleotides [7,8]. DNA methylation has been associated with gene silencing, tissue differentiation, genomic imprinting, X chromosome inactivation, phenotypic variation, and possibly disease susceptibility [9][10][11][12][13]. Aberrant DNA methylation is implicated in several diseases and has a well-established role in tumorigenesis [14].
Recently, several NGS-based methods have been developed to profile the DNA methylation status of a genome. Among them, reduced representation bisulfite sequencing (RRBS) can be used to generate comprehensive methylomes [15]. Compared to whole genome methylation sequencing, RRBS provides an alternative sequencing technology at a reduced cost [16] and it is used by many groups worldwide [17][18][19][20][21][22]. By representing a small fraction of the genome (∼2.5% of the human genome), it reduces the amount of sequencing required, while enriching for promoter associated CG-rich regions. The method involves bisulfite conversion, which converts unmethylated cytosines to uracil, while leaving methylated cytosines unchanged, thereby generating base-pair resolution DNA methylation profiles.
Alignment of bisulfite-converted sequence reads to a large reference genome brings computational challenges. A thymine (T) in the sequenced read could reflect either an unmethylated C or a T, but not vice versa. Therefore, the asymmetric mapping increases the chances of falsepositive matches and the search space for mapping [23]. Nevertheless, several tools and pipelines have been developed that enable molecular biologists to analyze the large volume of methylation data with increasing ease [16,24,25].
Recent advances in the Illumina sequencing platform (see Table S1 in Supplementary Material available online at doi: 10.1155/2012/741542) have increased the capacity from 20-30 million reads per lane (Genome Analyzer/GAII) to ∼200 million reads per lane (HiSeq 2000). In the case of smaller or reduced representation genomes, such a high number of reads provides ample coverage. Indeed, multiplexed sequencing of these smaller or reduced representation genomes permits considerable cost savings. DNA libraries for multiplexed sequencing are prepared by ligating adaptors containing different 6 bp index sequences to each fragmented DNA sample (Supplementary Figure 1). This allows multiple samples to be combined into a single sequencing reaction and then individually identified and demultiplexed during base-calling analysis.
Meissner's group has previously described the RRBS protocol in several articles [15,20,26,27]. In each of these articles, the protocol was described for the Illumina Genome Analyzer sequencer, where one RRBS library was sequenced per lane. However, due to the increased capacity of the new HiSeq 2000 sequencer (almost 8-fold increase in terms of read numbers), it became inevitable to sequence multiple RRBS libraries in one single flow-cell lane. The library preparation method for multiplexed runs and the sequencing pipeline significantly differ from the described Genome Analyzer workflow. Here we describe the required modifications of the original protocol and the strategies we employed for successful sequencing. Specifically, we demonstrate improved strategies for bisulfite conversion, library purification, and PCR amplification.
Further, the previous protocols focused only on library preparation methods, but the downstream data processing and bioinformatics were not described. We show the steps and challenges involved in base calling of multiplexed RRBS libraries due to its unique base composition. We describe a pipeline to evaluate and improve the quality of the data obtained and increase the output from indexed sequencing runs. Additionally, we comment on the different data formats generated by different versions of the sequencing chemistry and the sequence alignments with high mapping efficiency, which demonstrates the effectiveness of the method described. This paper provides a complete workflow starting from library preparation to base calling and to successful mapping of the obtained library.

Methods
Previous papers have described in detail the process of generating reduced representation libraries for methylation analysis [15,26,27]. Here, we briefly describe the protocol with emphasis on the modifications required to make successful multiplexed libraries for sequencing. Key differences between the previously used paired-end library preparation protocol and the TruSeq (version 2) multiplexing protocol are outlined in Supplementary Table S2.

Library Preparation.
Although the TruSeq protocol recommends 1 µg of input DNA for normal genomic DNA library preparation, we found that for RRBS this amount was not sufficient. For RRBS, 2.5 µg of genomic DNA was digested overnight with MspI (New England Biolabs, Ipswich, MA) using 20 units of enzyme per µg of DNA to ensure complete digestion. Digested DNA was purified on a QIAquick spin column (Qiagen, Hilden, Germany). Then, end repair and addition of 3 A overhangs were performed using the TruSeq DNA kit (Illumina, San Diego, CA). Indexed TruSeq adaptors were ligated according to the manufacturer's protocol and purified with AMPure beads (Agencourt Bioscience, Beverly, MA). For detailed methodology of multiplexed adaptor ligation, see Supplementary  Table S3. DNA fragments ranging from 160 to 340 bp (40-220 bp of DNA plus 120 bp of adaptors) were excised from a 3% (w/v) NuSieve GTG agarose gel (Lonza, Basel, Switzerland) and purified using the QIAquick gel extraction protocol.

Bisulfite Conversion.
Previously published methods recommend two rounds of bisulfite conversion with the Qiagen EpiTect kit for complete bisulfite conversion of human RRBS libraries [27]. However, we found that two rounds of conversion and purification resulted in significant loss of the template in the bisulfite-converted library leading to a requirement for high cycle number for library amplification. Also, 5-methylcytosine can undergo deamination, at a slow rate, during prolonged bisulfite treatment [28]. Based on a small number of samples, we achieved more consistent bisulfite conversion of size-selected libraries using the EZ DNA methylation kit (Zymo Research, Irvine, CA). Although the manufacturer's instructions are to incubate DNA samples with bisulfite reagent for 12-16 hours at 50 • C, we incubated them for 18-20 hours. This longer incubation period was associated with consistent conversion of human genomic DNA and minimal loss during the process.

Amplification.
The previously described paired-end protocol used the forward and reverse primers P.E. 1.0 and P.E. 2.0 to amplify libraries [27]. The TruSeq protocol has streamlined the library amplification workflow by providing two master-mixed reagents, a PCR master mix and a PCR primer cocktail. However, we found that in contrast to a normal genomic DNA library used as a positive control, this protocol did not amplify bisulfite-converted libraries within 20 or 30 cycles of PCR. We optimized the library amplification step by incorporating the TruSeq primer cocktail into the PCR protocol used by Smith et al. [26]. For amplification of the bisulfite-converted DNA, we used PfuTurbo Cx DNA polymerase (Stratagene, La Jolla, CA). It has been proposed that a high frequency of uracils in bisulfite-converted DNA results in uracil stalling by the DNA polymerase [26]. PfuTurbo efficiently reads through uracils in the template strand, but its enhanced proofreading activity prevents PCR-induced point mutations. The final libraries (1.44 µL of bisulfite-converted DNA per 12 µL PCR reaction) were amplified using 1.45 U PfuTurbo Cx DNA polymerase, 0.3 mM dNTP stock, and 1.44 µL TruSeq PCR primer cocktail (see Supplementary Table S4) with the following thermocycler conditions: 95 • C for 2 min, n × (95 • C for 30s, 65 • C for 30 s, and 72 • C for 45 s), 72 • C for 7 min. Initially, analytical PCR reactions were performed to determine the optimal number of cycles for amplification of libraries (a higher number of cycles can induce amplification bias). Libraries were amplified with 15-18 cycles of amplification and then visualized on a 4-20% Criterion gradient polyacrylamide TBE gel (BioRad Laboratories, Hercules, CA) stained with SYBR green nucleic acid gel stain (Life Technologies, Grand Island, NY).

Assessment of of Library Quality.
To verify fragment size and quality of the amplified libraries, individual aliquots were run on a 2100 Bioanalyzer (Supplementary Figure S2) using a high-sensitivity DNA chip (Agilent Technologies, Santa Clara, CA). Libraries were quantified using a Qubit 2.0 fluorometer (Life Technologies, Grand Island, NY).

Sequencing.
Libraries were pooled at equimolar concentrations based on Qubit measurements and sequenced on a single flow cell lane of an Illumina HiSeq 2000 sequencer with a paired end, 100 bp run. The initial forward read is termed Read1 and the reverse read is termed Read2. 5% phiX genomic DNA (control) was spiked into the lane. In between these 2 reads, a 6-cycle read is made of the adaptor index sequence to allow for demultiplexing of the pooled samples.

Base Calling.
Illumina base calling is usually performed by the real-time analyzer (RTA) running on the sequencer, but in some cases, as described in Section 3, it was necessary to repeat the base-calling step afterwards. The Illumina Off line Basecaller (OLB) v 1.9.4, running on a Redhat Linux server, was used for this. The application applies the same algorithm as the original RTA application, using the compressed image files generated by the machine, but allows a wider range of processing options to be employed [29].

2.7.
Generating FASTQ File. FASTQ sequence file generation from the raw base-call data was performed using Illumina's CASAVA v1.8.2 package [30]. CASAVA scripts use the details of each sample, including its lane and index sequence, to identify the contents of each flow-cell lane and to partition multiplexed reads into index-specific FASTQ files. The amplification steps used in flow cell preparation generate a series of clusters of identical sequences at a specific pixel location on the flow-cell surface. Since the position is conserved through the reads (Read1, Read2, and the 6 bp index read), individual reads are identified by the pixel coordinates of the cluster on the flow-cell surface. CASAVA permits various processing options but, for the work described here, the output becomes a directory for each index containing a single gzip-compressed FASTQ file for each of the forward and reverse reads of a paired-end run. The logistics of running a flow cell require that the pairedend chemistry is applied to all samples if any on the cell needs that processing; CASAVA can be directed to ignore the reverse read or the extra data can be disregarded if not required. In the event that CASAVA finds reads for which the index sequence cannot be identified, through sequencing errors, the reads are saved in to a directory of "undetermined indices."

Data
Processing. Quality evaluation of the sequenced reads, filtering, processing and alignment of each dataset have been performed as previously published [24]. Briefly, the quality of the reads was checked using the FASTQC (version 0.10) program and then adaptor contamination was removed using cleanadaptors [24]. The alignment was performed using Bismark v0.6.4 [25] against the GRCh37.65 build of the human genome. The key aspects of the data processing are described in Section 3.

Results and Discussion
We have optimized the RRBS library preparation protocol and typically sequenced five multiplexed libraries on a single flow-cell lane of the Illumina HiSeq 2000. We obtain an average of 42 million paired-end reads per sample, with a high unique mapping efficiency to the reference human genome. Here we discuss the critical aspects of the DNA sample preparation protocol and data processing and comment on relevant troubleshooting approaches on an RRBS background.
The TruSeq protocol recommends the use of AMPure beads for purification of DNA fragments. However, this method is optimized for fragments >100 bp length. Consequently fragments shorter than 100 bp will be lost during the purification process, resulting in a much lower coverage of the genome (see Supplementary Figure S3). For the RRBS protocol, selection of 40-220 bp fragments (preligation) is necessary and therefore, prior to adaptor ligation, we recommend that DNA is purified using columns designed to retain DNA fragments within this size range. After the adaptors are ligated the fragment sizes are modified to 160-340 bp in length.
TruSeq adaptor ligation at a 2.5 : 1 (adaptor: DNA) ratio resulted in a distinct band at 125 bp in all our libraries following PCR amplification (Figure 1), which we had not observed in library preparations using the paired-end sample preparation kit that preceded the TruSeq protocol. Since the band is smaller than the DNA fragments excised from the gel, we concluded this band was a product of PCR amplification. Although we suspected the 125 bp band was due to adaptoradaptor dimerization, reducing the adaptor, DNA ratio did not eliminate these bands. However, we successfully bypassed this issue by performing a second round of gel size selection following PCR amplification, removing the dimer from the final library. In addition to this residual adaptor contamination, sequencing of size-selected RRBS libraries (160-340 bp) can result in adaptor sequence reads at the 3 end of the shorter fragments. These adaptor reads will interfere with alignment and possibly contribute to misalignment events and false methylation calls. We recommend removal of potential adaptor contamination by in silico tools (e.g., cleanadaptors) and assessment of the quality of the sequenced reads prior to mapping and further analysis [24].
RRBS libraries are generated after digestion with the methylation insensitive enzyme MspI. Due to the directionality of the Illumina platform and the protocol used, we obtained reads exclusively from MspI digested 5 CGG strands (the recognition motif of MspI is C CGG). The result of this protocol is that our reads will always start with CGG (when the first C is methylated) or TGG (when the first C is unmethylated and is converted to T after bisulfite conversion and PCR). This nonrandom base composition is a unique property of RRBS libraries, which significantly differs from normal genomic libraries. The Illumina RTA algorithm uses the first 4 sequencing cycles of Read1 to set internal standards for fluorescence throughout the run, a system based on the assumption that a relatively random distribution of bases will be found in at least two of those cycles. We were concerned that MspI fragmentation and the bisulfite chemistry together might confound this scheme as each of our fragment begins with CGG or TGG and therefore cause suboptimal standardization for the run, potentially leading to less accurate base calling.
The base composition features for our libraries are illustrated in Figure 2, which shows the intensities for each base channel over all machine cycles for two lanes: 2A (lane 4) contains regular multiplexed human genomic libraries and 2B (lane 8) contains our multiplexed bisulfite-converted libraries (note that the spikes present at 100 cycles in both lanes correspond to the 6 bp index sequence read). The biased frequencies of C and T in the RRBS samples are evident in Read1, as are the corresponding A and G biases in Read2. This skewed base composition is a specific feature of RRBS libraries as the other non-RRBS multiplexed genomic libraries have a normal distribution of all four bases (Figure 2(a)).
In order to counteract the possible bias one could include an internal standard (genomic DNA that can be expected to have random base composition in the first 4 cycles) at a sufficiently high level in the same lane or, as a subsequent operation, to set the standards from a lane containing clusters that are expected to be random. Although the latter operation is an option of the RTA system, we did not perform this step since it is not necessarily obvious in advance which other lanes might contain appropriately random samples or generate good-quality sequence data. For that reason, postrun standardization using a designated standard lane with the Illumina off-line base-calling application (OLB) was performed. OLB applies the same basecalling algorithm as RTA. We used the OLB program to repeat base calling using intensity data derived from genomic DNA (Figure 2(a)) as the standard. Detailed comparisons of base calling performed by RTA and OLB for each sample are illustrated in Table 1. For Read1, OLB showed an increase, ranging from 1.8% to 3.5%, in the proportion of sequence retained after quality trimming to a Phred score of 30 (fastq quality trimmer -t 30) compared to RTA. In contrast, Read2 showed more variable results with some libraries giving a decrease in retained bases: the difference ranged from −1.6% to 0.9%. Further, OLB showed an increase in the number of reads returned by CASAVA for both Read1 and Read2 for all RRBS libraries (Table 1). We then checked the quality of the sequencing reads generated by RTA and OLB with FASTQC (version 0.10) and found that the average Phred score value is marginally higher for the reads derived from OLB than RTA (Figure 3).
After base calling we obtained, for example, a total 55.3 Gb of sequence from one lane and 32.8, 49.4, 60.0, 27.5, and 41.7 million paired-end sequence reads for each of the five libraries, which ensures high coverage of CpG sites in the reduced representation genome ( Table 2). The variation in read numbers led us to investigate the number of undetermined indices during base calling. We found that a total of 13.2 million reads had undetermined indices (for both Read1 and Read2), being 10.9% of the total reads obtained. The genomic DNA control lane used by OLB had a total of 5.5% undetermined reads, which could suggest that the RRBS library contributes to more misreads during the index sequence read cycles, perhaps through more  mispriming. The extent to which index sequence misreading occurs with different libraries may vary and may contribute to variable numbers of reads generated for each. Moreover, accurate quantification of the libraries and pooling the different libraries in equimolar ratio could play an important role in multiplexed runs to achieve similar sequencing yields for each indexed sample.
Previously, we described a pipeline for the efficient processing of RRBS data for methylation analysis [24]. Following a similar strategy, we preprocessed the data and aligned the reads against the complete human reference genome (NCBI GRCh37 build) using the bisulfite aligner Bismark [25] and contrasted the mapping performance of both RTA and OLB derived data (Table 3). We did not observe major    differences in mapping efficiencies between the datasets, except decreased mapping for sample 3 on the OLB data. This analysis shows that it is worthwhile assessing the benefits of running OLB for each sequencing run for libraries that have significantly nonrandom base composition (e.g., RRBS) to ensure the quality of the data obtained. However, the choice of which dataset will be processed for further analysis will be based on the extent of improvement in the number of highquality reads retained and the mapping efficiency. Our results suggest that comparing the performance of each sample (on OLB and RTA datasets) and then choosing the samples with higher quality and better mapping in combination from both datasets will maximize the data returned from an individual library after demultiplexing. Progressive refinements in the instruments and sequencing chemistry have extended the lengths of reads and these changes have been accompanied by updates to the analytical software. Consequently changes have been made in the header lines of the FASTQ output files: the tile numbering has been changed and the quality codes have switched from an Illumina-defined scheme to that of Sanger [31]. Examples of FASTQ files from two different generations of machines, chemistry, and software are shown in Supplementary Figure S4. While these changes may seem minor, they can pose issues with specific tools for further analysis of the data, for example, graphical quality checks according to tiles (SolexaQA [32], or the methylation analysis of CpG sites. Differences in header lines between different files can confound programs used in further data processing. Hence, it becomes necessary that appropriate changes are made in scripts that interface with programs or that adequate flexibility is written into programs or scripts in order that they work with the data generated from the different versions of the chemistry and software.

Conclusions
Although the original RRBS methodology is well described, multiplexing such libraries is relatively new and the unique challenges associated with it have not been previously described. It will become increasingly desirable to do multiplexed (indexed) runs in the future for reduced representation and smaller genomes as the coverage and output from the sequencing run is increasing rapidly. However, methylation sequencing is more complex compared to other methods, so modification of the standard protocol is necessary. We have illustrated a successful strategy to generate high-quality methylation data from multiplexed runs. The challenge will however remain for molecular biologists to analyze and interpret the data as the volume of the data will also increase exponentially and computational tools will need to be updated in parallel with the advances in sequencing platforms and chemistry.