Distribution of Genes and Repetitive Elements in the Diabrotica virgifera virgifera Genome Estimated Using BAC Sequencing

Feeding damage caused by the western corn rootworm, Diabrotica virgifera virgifera, is destructive to corn plants in North America and Europe where control remains challenging due to evolution of resistance to chemical and transgenic toxins. A BAC library, DvvBAC1, containing 109,486 clones with 104 ± 34.5 kb inserts was created, which has an ~4.56X genome coverage based upon a 2.58 Gb (2.80 pg) flow cytometry-estimated haploid genome size. Paired end sequencing of 1037 BAC inserts produced 1.17 Mb of data (~0.05% genome coverage) and indicated ~9.4 and 16.0% of reads encode, respectively, endogenous genes and transposable elements (TEs). Sequencing genes within BAC full inserts demonstrated that TE densities are high within intergenic and intron regions and contribute to the increased gene size. Comparison of homologous genome regions cloned within different BAC clones indicated that TE movement may cause haplotype variation within the inbred strain. The data presented here indicate that the D. virgifera virgifera genome is large in size and contains a high proportion of repetitive sequence. These BAC sequencing methods that are applicable for characterization of genomes prior to sequencing may likely be valuable resources for genome annotation as well as scaffolding.


Introduction
Bacterial artificial chromosome (BAC) libraries are composed of physical constructs that contain large genomic DNA inserts and provide a tool for the molecular genetic research of organisms of interest. For instance, anonymous genetic markers linked to genes that control insecticide resistance traits have been identified on BAC clones, and, following subsequent sequencing of cloned inserts, allowed the characterization of gene(s) that influence the expression of these traits [1]. Furthermore, sequence data from BAC inserts provide a means to evaluate genome structure, including the estimation of repetitive element densities [2] and the relative gene content of a species [3]. BAC clones are also useful for the construction of physical maps that represent contiguous sequence from an entire genome or genomic regions, and these assemblies have proven useful for determination of minimum tiling paths prior to BACby-BAC sequencing of large or highly repetitive genomes [4]. Scaffolding takes advantage of paired BAC end sequence (BES) data which provide direct physical linkages between sequence tags [5] and may assist in the scaffolding of contigs assembled from mate paired reads from next generation sequencing technologies.
The western corn rootworm, Diabrotica virgifera virgifera (Coleoptera: Chrysomelidae), is a beetle native to North America, which is adapted to feeding on a limited number of grasses including corn [6]. The ancestral geographic range of D. virgifera virgifera extended from present day Mexico into the Southwest United States and Great Plains, but an eastward range expansion began in the 1940s that coincided with the widespread cultivation of continuously planted corn in the central United States [7,8]. D. virgifera virgifera was accidentally introduced into Central Europe in the early 1990s [9], and subsequent transatlantic and intra-European introductions have contributed to its contemporary geographic range in Europe [10,11]. D. virgifera virgifera has one generation per year. Individuals overwinter as diapausing eggs which hatch in the spring, and subterranean larvae feed on corn roots [12]. Root damage caused by D. virgifera virgifera reduces the plant's ability to absorb soil nutrients and compromises structural stability [13]. Upon pupation and emergence from the soil, adult corn rootworm beetles can persist in fields for up to 4 weeks, can reduce seed pollination rates through feeding damage to corn silks (stamen) and can vector maize chlorotic mottle virus [14] and stalk rot fungus [15].
Larval feeding damage can be suppressed by systemic seed treatments, soil-applied insecticides, or transgenic corn hybrids that express Bacillus-thuringiensis-(Bt) derived insecticidal proteins. However, resistance to both chemical and Bt toxins are documented [16][17][18][19]. D. virgifera virgifera populations have also been managed by an alternating cornsoybean crop rotation (grass-legume rotation) [20], which negates the need for insecticide applications. This control strategy is based on a strong female preference to oviposit in soil at the base of corn plants, and the cospecialization of larvae for feeding on grass roots. However, female D. virgifera virgifera phenotypes have evolved that are no longer specifically attracted to cornfields but will lay eggs near other plants [21][22][23]. In the subsequent crop production year, progeny of these variant D. virgifera virgifera females will emerge in and damage first-year corn crops. This adaptive loss of adult fidelity in oviposition behavior has defeated the use of corn-soybean rotation as an effective control practice in many corn growing regions of the United States [22].
The propensity for corn rootworm to adapt to control measures has raised concern among producers, scientists, and regulatory agencies, and the need to investigate the underlying genetic mechanisms for adaptation is critical to developing sustainable pest management approaches [7,24,25]. In anticipation of a recently initiated whole genome sequencing (WGS) effort for D. virgifera virgifera that aims to build a foundation for future genetic and genomics research [25], we have determined the haploid genome size and have estimated gene and repetitive fraction densities from BAC sequencing data. These data and resources will facilitate annotation and contig scaffolding efforts of the upcoming WGS project.

BAC Library Construction, End
Sequencing, and Annotation. Genomic DNA was extracted from ∼100 individuals of the D. virgifera virgifera nondiapausing strain, pooled, and fractionated by partial digestion with HindIII, fragments between ∼100 and 150 kb were excised, and these inserts were ligated into the pCC1 BAC vector (Epicentre, Madison, Wl, USA). Constructs were used to transform the Escherichia coli strain DH10B T1 by electroporation. Transformants were plated on LB agar (12.5 µg mL −1 chloramphenicol, 80 µg mL −1 X-Gal, and 100 0.5 mM IPTG) and a total of 110,592 BAC clones were arrayed on 288 individual 384-well plates to comprise the DvvBAC1 library. The mean insert size within DvvBAC1 was estimated by contour-clamped homogeneous electric field (CHEF) electrophoresis of NotI digested BAC DNA from 96 clones on a 0.9% agarose in 0.5X TAE buffer gel ramp run with a pulse time 5-15 s at 5 V/cm for 24 hrs and 4 • C. Insert size estimates were made by comparison to the MidRange II PFG Marker (New England Biolabs, Ipswich, MA, USA), and the fold genome coverage of DvvBAC1 was estimated according to Clark and Carbon [29]. BAC DNA from 1152 DvvBAC1 clones (plates 217, 218, and 227) were purified and sequenced and annotated as described by Coates et al. [2], and sequence data deposited into the GenBank genome survey sequence (GSS) database (accession numbers JM104642-JM106797).

BAC Screening and Full Insert
Sequencing. DNA from DvvBAC1 clones were pooled into matrix, row, and column pools according to Yim et al. [30] and used in PCR reactions as described by Coates et al. [31]. DNA from DvvBAC1 clones was purified using the Large Construct Purification Kit (Qiagen, Valencia, CA, USA) according to the manufacturers instructions, and DNA preparations run on 0.8% agarose gel electrophoresis. BAC DNA was used to create individual mid-tagged libraries (RL1 to RL10) and each was sequenced on Roche GS-FLX at the William H. Keck Center for Comparative and Functional Genomics at the University of Illinois.
Cross-match (http://www.phrap.org/), Roche-provided sff tools (http://454.com/products/analysis-software/index.asp) and custom Java scripts were used to identify trim sequencing adaptors within sff file data and remove sequences of <50 nucleotides or with homopolymer stretches ≥60% of the raw read length. Processed sequence data was assembled into contigs using the Roche GS De Novo Assembler v 1.1.03 using default parameters (seed step: 12, seed length: 16, min overlap length: 40, min overlap identity: 90%, alignment identity score: 2, and alignment difference score: 23).
Cloning vector sequence was identified using VecScreen (http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html) and masked using Maskseq [32]. Contigs assembled from contaminating Escherichia coli DNA were identified by querying against the K-12 reference genome (GenBank accession NC 000913) using the blastn algorithm, and contigs that produced E-values ≤10 −15 were removed manually. The remaining filtered BAC contigs were annotated using the MAKER 2 genome annotation pipeline [33] using coding sequence evidence from 17,778 D. virgifera virgifera ESTs (GenBank dbEST accessions EW761110.1-EW777358.1 [34] and CN497248.1-CN498776.1 [35]), protein homology by blastx searches of the UniProt/Swiss-Prot databases, and T. castaneum gene models using the AUGUSTUS web server [36]. Prior to any annotation, RepeatMasker and RepeatRunner were used to identify retroelement-like regions within the BAC full inserts by running against predefined RepBase and RepeatRunner te proteins provided by MAKER 2 [33]. MAKER 2 output was imported into the Apollo Genome Annotation and Curation Tool [37], where additional annotations were performed via blastx searches of the NCBI nr protein database (E-values ≤ 10 −15 ).
Contigs from clones 142B02 and 156M20 represent partially overlapping homologous sequence and were assembled into a single reference using CAP3 [38] (default parameters). Processed 454 read data from libraries RL003 and RL007 were mapped to this assembled reference using the program LASTZ [39] (default parameters). LASTZ output was made in Sequence Alignment/Map (SAM) format which was used to create an indexing sorted alignment file (.BAI file) with the command line index tool from SAMtools [40]. Mapped read data was visualized using BAMview in the Artemis Genome Viewer [41].  [42]. Splign tab-delimited output was used to estimate mean intron and exon size. Additionally, a de novo prediction of D. virgifera virgifera repetitive sequence was made by assembling BAC end sequence (BES) data using CAP3 [38] (default parameters), and subsequently used to query the D. virgifera virgifera cadherin genomic DNA sequence (EF541349.1) for putative repetitive sequence using the blastn algorithm. Blastn output was filtered for E-values ≥10 −40 . De novo prediction of D. virgifera virgifera repetitive sequences were also made within our assembled BAC full inserts (GenBank accessions JQ581035-JQ581043) by querying accession EF541349.1 using identical parameters. BAC insert regions with similarity to the EF541349.1 sequence were excised from BES contigs and BAC full insert sequences (using a custom PERL script), mapped to EF541349.1 using LASTZ [39] (default parameters) and output handled as described previously.

Comparative Genomics and Annotation of Repetitive
A computational prediction of short repetitive DNA elements known as miniature inverted repeat transposable elements (MITEs) was made for D. virgifera virgifera BES contigs and singletons as well as GenBank accession EF541349. 1

Genome Size Estimation.
The D. virgifera virgifera haploid genome size was estimated at 71, 144 ± 537 fluorescent units from propidium iodide (PI) stained nuclei, which compared to 69, 319 ± 491 and 35, 631 ± 687 units for the internal standards of known genome size, Zea mays (2.50 Gbp) and Glycine max (1.115 Gbp), respectively. Populations of nuclei from Z. mays and D. virgifera virgifera produced overlapping PI signals on a flow cytometer, but the size scatter component (SSC-A) indicative of nucleosome densities was used to separate the signals of independent PI readings ( Figure 1). Subsequent calculations of PI to genome size ratios indicate an estimated D. virgifera virgifera haploid genome size of ∼2.58 Gb or 2.80 pg.

BAC Screening and Full Insert Sequencing.
Screening of DvvBAC1 identified clones containing sequence from eight EST markers (5.29 ± 2.98 hits; range of 1 to 9 hits per marker; data not shown). Eight of the 9 BAC inserts (88.9%) were successfully sequenced on the Roche GS-FLX. After raw data filtering, a total of 240,586 reads were assembled into 39 contigs that contained 642.0 kb of sequence (16.5±18.9 kb per contig; Table 2). The annotation of BAC inserts using MAKER 2 predicted 37 putative genes and 48 retrotransposon-like protein coding intervals with 3 and 31 of these sequences supported by EST evidence, respectively.
Contigs from clones 142B02 and 156M20 represent homologous genomic regions within different clones and provide a measure of haplotype variation within the library. Sequences from these two clones shared 11 endogenous and 5 retroelement-like protein coding sequences, which represent homologous genome intervals from unique BAC inserts. Six contigs totaling 31.9 kb were aligned (Figure 3), and haplotype variation between inserts was shown via 3 SNPs within the 22.5 kb of CAP3 aligned sequence (SNP frequency ∼1.3 × 10 −4 ), protein coding sequences were 100% conserved, and no indels were present. Compared to the consensus, 2564 and 5467 bp regions were not represented within clones 142B02 and 156M20, respectively, and was verified by mapping reads to the CAP3 scaffolds ( Figure 3). HindIII restriction site

D.virgifera virgifera D.virgifera virgifera
Percentage of parent mapping showed that cut sites used in cloning may not have been the cause of sequence disparity. Additionally, the entire pCC1 cloning vector sequence was sequenced and masked from both clone 142B02 and 156M20 assemblies, indicating that insert boundaries did not give rise to the two gaps. Retroelement-like sequences were annotated within the two haplotype sequence gaps. These results also suggest that structural variation based on the integration/excision or random deletion of repetitive DNA elements may exist among D. virgifera virgifera haplotypes.     (Table S4), where 22, 18, and 11 of the DRs involved AT/TA, AA, and TT dinucleotides. Putative MITEs that occupy a total of 2.4 kb (mean = 278.9 ± 124.5 bp) are composed of 65.7 ± 0.1% A or T nucleotides and have predicted terminal inverted repeat (TIR) lengths of 11.9 ± 5.5 bp. Positions of MITE-like inserts were predicted to be within intron regions (Figure 4). Comparatively, the T. castaneum cadherin gene contained 12 putative MITE-like elements that were all predicted within intron regions (Table  S5).  [45]. Genome size heterogeneity among beetle species does not appear to be correlated with organism "complexity" (Cvalue paradox) [47], specialization [48], or increased gene content [49]. The relation between repetitive DNA content and genome size in Coleoptera is only available for the model species Tribolium castaneum (Coleoptera: Tenebrionidae), where the ∼0.200 Gb genome has an estimated 5110 repetitive elements [30] which comprise ∼13 of the 0.160 Gb assembled sequence [49]. In contrast, our data suggest that the proportion of the D. virgifera virgifera genome consisting of repetitive DNA is much higher.

BAC Library Construction, End
Sequencing, and Annotation. BAC libraries are genomic tools that are useful for the isolation of genes linked to a trait [50] as well as the generation of end sequences that provide estimates of genome structure and TE densities [2,51,52]. Despite their utility in genomics research, only one coleopteran BAC library has previously been reported, for T. castaneum [53]. The prediction of gene-coding regions from BAC end sequence from nonmodel species rely on functional annotation by homology-based identification with related genes in model organisms. This can result in vague or inaccurate gene definitions for nonmodel species [54], such as our D. virgifera virgifera dbGSS dataset. Despite the relative dearth in gene discovery by D. virgifera virgifera BES, 179 novel protein coding regions were annotated which will provide a resource for annotation of future WGS efforts. Studies with similarly low genome sequence coverage have been useful for initial descriptions of functional and repetitive elements [55].  Proteins encoded by DNA-based TEs and retrotransposons totaled ∼16.0% of BES reads and outnumbered endogenous genes by ∼1. 8-fold. Extrapolation suggests that retroelement-like TE genes might occupy ∼0. 41 Gb of the 2.58 Gb genome. Compared to T. castaneum which has ∼3.7% of the genome assembly occupied by LTR-and non-LTR-retrotransposons [30], the D. virgifera virgifera genome may have an ∼4.3-fold higher retroelement content. Our investigations also indicate that small nonautonomous miniature inverted repeat transposable elements (MITEs) are present within the D. virgifera virgifera genome.

BAC Screening and Full Insert
Sequencing. The Roche-454 GS-FLX provides a robust method for rapid sequence generation, from which single end read data were sufficient to assemble 8 of the 9 BAC plasmids we sequenced. Assembly of D. virgifera virgifera BAC inserts into an average of ∼5 contigs per clone and encompassing 80.3 kb of total sequence was greater than that obtained following assembly of BACs from barley [23]. Annotations indicated that the number of TE-derived genes in assembled contigs were 1.3-fold higher than endogenous protein coding genes. This result differs from our estimate from BES data but may be influenced by sample number or by the effect of large TE-derived gene sizes on the probability of sampling from BES data. Regardless, full BAC insert sequences indicate that the D. virgifera virgifera genome is comprised of a high proportion of TEderived sequence but also suggests that DNA-based and retroelement-like TEs are localized within intergenic space. This preliminary genome sequencing evidence suggests that genic regions of the D. virgifera virgifera genome can be assembled from short single-end NGS read data, but the use of longer read lengths and paired-end or mate-pair NGS strategies may result in increased contig size and/or scaffolding by the spanning of repetitive elements.
Comparison of the homologous regions within contigs from clones 142B02 and 156M20 provided a direct measure of haplotype variation within DvvBAC and also within the D. virgifera virgifera nondiapause strain. SNP variation between haplotypes was low, which may be the result of a genetic bottleneck and subsequent inbreeding within the colony. These results are consistent with a microsatellite markerbased estimate of 15-39% allele diversity reduction in the nondiapause colony compared to wild populations [56]. Comparison of D. virgifera virgifera haplotypes suggested that local genome variation based upon insertion/deletion of large DNA regions may occur. Evidence suggests that these variations are not likely due to differences in read depth or effects of cloning due to variation in HindIII restriction sites. Interestingly, retroelement-like sequences were annotated within regions of haplotype variation and may indicate that microsynteny is altered through TE integration. Analogous haplotype variation was caused by movement of Helitronlike TEs in maize and SINEs in canine genomes. Our results similarly suggest that retroelement movement may be a source of haplotype variation in the D. virgifera virgifera genome but will require further investigation to realize the extent to which movements affect genome structure and function.

Comparative Genomics and Annotation of Repetitive Elements.
Compared to T. castaneum, the orthologs of intronless histone encoding genes show no size increase within the D. virgifera virgifera genome, although intron-containing genes tend to show a dramatic increased size in D. virgifera virgifera. For example, the 94.6 kb D. virgifera virgifera cadherin gene is ∼13.3-fold larger than the T. castaneum ortholog despite the coding regions being approximately the same length. Mapping of BES reads and computational prediction of MITE-like elements within the D. virgifera virgifera cadherin gene indicated that TEs and other repetitive elements have inserted within intron regions and are the cause of the comparative increase in gene size. TE integrations within introns are known to affect splicing efficiencies [57], but this remains to be investigated in D. virgifera virgifera. As stated previously, the insertion of large retroelements within gene coding regions was not predicted. The insertion of a repetitive DNA in the D. virgifera virgifera cadherin 5 -UTR suggests that the movement of TEs within the genome could alter gene expression and regulation. TE integrations are also known to cause chromosomal changes that alter gene expression [58]. The accumulation of these changes across the genome can lead to differential selection among local environments [59] or even contribute to the evolution of new species [60]. Knowledge of TE composition within a genome is a fundamental step in the study of relationships between structure and function that may form a basis for future comparative studies. We defined 296 small repetitive DNA elements and 48 large retroelement-like coding sequences within the D. virgifera virgifera genome. Although these elements were defined from only 1.15 Mb of genomic sequence, these predictions represent an initial resource for understanding the proliferation and phenotypic effects of repetitive DNA. The DvvBAC1 library has proven useful for the description of gene and repetitive element densities in the D. virgifera virgifera genome and will be a tool for the investigation of the genetic basis of problematic insecticide resistance and behavioral traits expressed by this crop pest species.