The Draft Genome and Transcriptome of the Atlantic Horseshoe Crab, Limulus polyphemus

The horseshoe crab, Limulus polyphemus, exhibits robust circadian and circatidal rhythms, but little is known about the molecular mechanisms underlying those rhythms. In this study, horseshoe crabs were collected during the day and night as well as high and low tides, and their muscle and central nervous system tissues were processed for genome and transcriptome sequencing, respectively. The genome assembly resulted in 7.4 × 105 contigs with N50 of 4,736, while the transcriptome assembly resulted in 9.3 × 104 contigs and N50 of 3,497. Analysis of functional completeness by the identification of putative universal orthologs suggests that the transcriptome has three times more total expected orthologs than the genome. Interestingly, RNA-Seq analysis indicated no statistically significant changes in expression level for any circadian core or accessory gene, but there was significant cycling of several noncircadian transcripts. Overall, these assemblies provide a resource to investigate the Limulus clock systems and provide a large dataset for further exploration into the taxonomy and biology of the Atlantic horseshoe crab.


Introduction
The Atlantic horseshoe crab, Limulus polyphemus, is an important species for a variety of reasons. First, it is considered a keystone species within many estuaries and bays along the Atlantic coast of the United States. Moreover, in these habitats their foraging behavior releases trapped nutrients into their local environment [1,2]. Second, their eggs are a source of food for endemic and endangered long distance avian migrants. Third, they are used as bait for the eel and conch fisheries [3]. Fourth, their blood is the source of the most sensitive assay for gram-negative bacteria contamination of medical products [4][5][6]. Finally, Limulus have long been used as model systems in neurobiological research. For example, lateral inhibition, one of the underlying principles of visual physiology, was first demonstrated in Limulus by Haldan Keffer Hartline, and as a result he won the Nobel Prize in Physiology or Medicine in 1967 [7]. More recently, this species has been used as a model for the investigation biological rhythms, for instance, circadian rhythms of visual sensitivity [8][9][10].
Circadian rhythms are produced by a combination of endogenous clocks and cues from the environment [11,12]. The identity of several core circadian genes that are responsible for producing these rhythms has been revealed in a number of invertebrates and vertebrates [13] and includes period, clock, cryptochrome, cycle, timeless, and other accessory genes. Although these genes are labeled as circadian, due to their critical function within the circadian clock mechanism, they may also play a role in other types of biological rhythms including those that regulate seasonal activity [14]. It has also been proposed, but not demonstrated, that they might be involved in shorter (∼12.5 hr) circatidal rhythms [9,15,16]. One of the overall goals of this study was to test this hypothesis in horseshoe crabs, which express both a circadian rhythm of lateral eye sensitivity [17] and a circatidal rhythm of locomotion [10,18]. In this study we developed draft genomic and transcriptomic assemblies for Limulus polyphemus and then compared the genes expressed during high and low tides and during the day versus the night. Particular attention was paid to putative core and accessory circadian genes. We identified these and then compared their expression, using RPKM values, across the different conditions (tides and light : dark (L : D)). Because no clear differences in the expression of putative circadian genes were apparent, we further examined some of the transcripts that did exhibit significant day/night or high/low tide differences as a first step towards the identification of potential proteins involved in the temporal control of the behavior and physiology in this species.

Animals and Environmental
Conditions. For genomic sequencing, an individual horseshoe crab was wild-caught from Great Bay Estuary in Durham, NH (43 ∘ 05 30 N and 70 ∘ 51 55 W). Leg skeletal muscle tissue was removed and placed in liquid nitrogen for immediate DNA extraction (described in the following).
For transcriptome sequencing, four animals were captured from Great Bay Estuary in Durham, NH, and sacrificed at four different times: day high tide (DHT, 0800), night high tide (NHT, 2030), evening low tide (ELT 1800), and during the day at low tide (DLT, 1530). DLT was collected while still being active (during high tide), placed into a natural water flow-through tank located next to the bay with open exposure, and sacrificed, while being inactive (buried), at the time of low tide (1530). DHT and NHT were used to compare the expression of genes during the day versus the night, and DHT and DLT were used to compare expression during high and low tides. Tissues from ELT were sequenced and used to increase the overall depth of the combined transcriptome dataset. Animals were dissected and their entire central nervous system tissue (protocerebrum, subesophageal ganglia, ventral nerve cord, and ganglia) was snap frozen on dry ice.

Benchmarking Universal Single-Copy Orthologs (BUSCO)
Analysis. Genome and transcriptome completeness's were assessed using BUSCO v1.1 using the eukaryotic linage for both and default parameters for the genome and transcriptome analyses, respectively.
2.9. mtDNA Analysis. The previously published Limulus mitochondrial genome (NC 003057.1) was blasted against the genomic assembly and used to identify Limulus genomic contig 669. Contig 669 was then analyzed for coding regions and fully annotated using NC 003057.1 as a reference and visualized ( Figure 1) using Organellar Genome DRAW [19]. Limulus mtDNA was then blasted against the Limulus genome assembly to look for nuclear mitochondrial (NUMT) sequences. To validate potential NUMT sequences, genomic contigs that contained homologous regions of mtDNA were extracted and compared for similarity.

Transcriptome Analysis.
Individual read sets from the four samples were mapped to the overall transcriptome assembly, with each contig given a unique identifying number. Reads per kilobase per million mapped reads (RPKMs) were used to determine relative fold change between day/night and high/low tides. Blast2GO [20] was performed on each of the four RNA-Seq mappings to look for expression variation in molecular groupings. Transcripts between conditions were normalized by identifying the condition with the highest number of sequences across all defined gene ontologies (GO) and dividing that condition's GO sequence numbers by the corresponding GOs of another condition marked for comparison. The average difference between GOs of the two individuals (to be compared) was then used as a normalization value for the actual sequence number of the individual with less sequences per GO.

Limulus Genome Sequencing and Assembly.
Genomic sequences yielded a total of 620,743,244 one hundred base pair (paired-end) reads ( Table 1). The Limulus genome size is estimated to be 2.74 Gb based on biochemical analysis [21], and, based on that estimation, these sequences should result in an approximate 23x coverage of the genome (Table 1). De novo assembly of the genome resulted in N50 of 4.7 kb with the largest contig being 68.9 kb. The total length of all contigs ≥ 500 bp was 1.4 Gb. To evaluate the completeness of the genome assembly with respect to protein coding genes, BUSCO v1.1 genome assessment was used to look for evidence of 429 putative universal orthologs and yielded the following results: complete genes found (C), 12.9%; duplicated genes found (D), 1.6%; fragmented genes found (F), 10.7%; genes  [23]. The majority of extra mtDNA appears to be in the noncoding origin of the mtDNA genome. However, there are a number of polymorphisms, insertions, and deletions between the mitochondrial genome sequences, many of which occur in the coding regions (Table 2).

Limulus NUMT Sequences.
Two NUMT sequences were identified as genomic contigs 269235 and 5819, which were 9,955 bp and 12,620 bp in length, respectively. These two nuclear contigs each had significant stretches of sequence that were clearly homologous to the extant mitochondrial genome. Specifically, nuclear contig 269235 contained a contiguous region of 2,008 bp that includes sequences homologous to the mitochondrial genes for nad2, methionyl-tRNA, lysidine synthase, glutamate-tRNA ligase, valine-tRNA ligase, and 12s ribosomal RNA and contig 5819 includes 1,065 bp homologous to nad1. To confirm that the NUMT sequences were not simply errors in assembly, we compared the NUMTs and extant mitochondrial sequences. Neither of the NUMT sequences in genomic contigs 269235 or 5819 should have high levels of sequence identity expected if they were assembled from extant mitochondrial genome reads. Specifically NUMT sequences in genomic contig 269235 had an 84.4% average percent identity to mtDNA sequences and those in genomic contig 5819 had a 79.4% identity to the homologous mtDNA sequences ( Figure 2).

Limulus Transcriptome Sequencing and Assembly.
The combined dataset had a total of 484,378,406 reads at a size of 100 bp and was de novo assembled in CLC to generate a reference assembly for analysis. The resulting transcriptomic assembly contained 92,348 contigs with a maximum contig size of 23.85 kb. Transcripts mapped back to the genome with 88.3% of the total transcripts, equating to 109.76 Mb, suggesting that the transcriptome assembly represents most of the exons present in our genome assembly ( Figure 3). The completeness of the transcriptome was evaluated using Other genes Transfer RNAs Ribosomal RNAs Figure 1: Gene map of the Limulus polyphemus mitochondrial genome. Arrows indicate strand direction with the inner circle representing genes on the light strand while the outer circle represents genes on the heavy strand. ND1-6 represents nicotinamide adenine dinucleotide dehydrogenase (NADH) subunit variants. tRNA ("amino acid") represents transfer RNA variants. COX1-3 represents the cytochrome c oxidase subunits. CYTB represents cytochrome b oxidase. ATP6 and ATP8 represent adenosine triphosphate variants 6 and 8. Rrn12 and rrn16 represent ribosomal subunits 12 and 16. Innermost dark grey circle represents distribution of GC content. BUSCO v1.1 transcriptome assessment to look for evidence of 429 putative universal orthologs, which yielded the following results: complete genes found (C), 56.6%; duplicated genes found (D), 10.7%; fragmented genes found (F), 13.3%; genes missing (M), 19.3%. Overall the transcriptome was found to have 80.7% of the total number of genes looked for in the BUSCO analysis, more than three times that of the genome [22]. Sequencing and assembly statistics obtained for the genomic and transcriptomic libraries are summarized in Table 1.

Effects of Photoperiod and Tidal Phase on Limulus
Transcript Expression. The primary purpose of the RNA-Seq data was to generate a central nervous system transcriptome, while a secondary purpose was to generate a set of pilot data for potential differences with regard to photoperiod and tidal phase. Of the four conditions used for transcriptomic analysis (DHT, NHT, ELT, and DLT), three were used for the following comparisons: DHT/NHT and NHT/DLT. Total read counts for each condition were 1.5 × 10 8 for DHT, 1.3 × 10 8 for NHT, 5.8 × 10 7 for ELT, and 1.4 × 10 8 for DLT. To compare gene expression across the different treatments we mapped the reads from each dataset to the combined reference assembly using CLC Genomics Workbench's (v5.1.2) RNA-Seq analysis toolkit. Volcano plot analysis (CLC v5.1.2) was used to analyze variation between night and day samples as well as high and low tide samples (Figures 4 and 5). As expected, the majority of sequences for both conditions fell below the thresholds of statistical significance (below horizontal red line) and a minimum fold change of 2 (in between vertical red lines). However, when the number of contigs that varied significantly was compared between the two treatments ( Figure 6), there was a larger number of contigs with a minimum fold change of ≥2 that varied significantly in the high/low tide comparison than in the day/night comparison.

Expression of Gene Types with respect to Photoperiod and
Tidal Phase. The majority of transcripts that were different in the photoperiod and tidal phase comparisons were novel [24]. Therefore, only the top four identifiable genes with the lowest values and highest fold changes are shown for both comparisons in Table 3. Of the genes that could be identified when BLASTed against NCBI, twice as many were affected by tidal versus photoperiodic conditions (e-values <1.0 − 4;  Table 3). Additionally, genes with -values <1.0 − 22 were primarily developmental and regulatory in nature (Table 3). All transcript contigs from photoperiodic and tidal conditions were also annotated using the BLAST2GO pipeline, which uses gene ontology (GO) to define genes based on common functions and relationships. The GO terms "multiorganism process" (Figure 7), "molecular transducer activity," "receptor activity," "structural molecule activity" (Figure 8), and "translation regulator activity" (Figure 9) all varied greater than 40% when looking at the comparison between high and low tides. Only 1 gene ontology term, "structural molecule activity," appeared to change more than 40% when looking at night versus day (Figure 8).   Relative RPKM values were also calculated for a number of circadian accessory genes [25][26][27][28] and potential control genes [29][30][31][32], as shown in Table 4 Table 3. Notably, six of seven actin variations, a gene traditionally used for normalization in QPCR analysis, exhibit an absolute fold change ≥1.3-fold over tidal conditions and four of seven display a variation ≥1.1-fold over photoperiodic conditions. The neuropeptide, pigment-dispersing hormone (PDH), was also investigated as it has been described as a potential downstream regulator of circadian rhythms in arthropods [33,34]. A PDH receptor-like homolog was identified in the Limulus transcriptome dataset and found to have an absolute fold change of 3.0-fold for conditions of tidal phase, whereas the absolute fold change for conditions of photoperiod was only 1.5-fold (Table 4).

High tide versus low tide Day versus night
Absolute

Discussion
Based on the fossil record, the Atlantic horseshoe crab, Limulus polyphemus, is one of the most ancient living organisms [35] and therefore the genome and transcriptome presented here will likely both provide clues to evolutionary divergence and lead to the identification of novel genes [36]. The draft sequences for the Limulus genome, transcriptome, and the addition of a second mitochondrial sequence will also provide a dataset for mining biomarkers and other molecular tools with which diversity and population studies can be conducted [37]. Thus, these data may also help to provide genetic markers that can be used to identify local subpopulations and lead to improved management of this important species. Finally, the subphylum Chelicerata is underrepresented when it comes to annotated genomes. Currently there is only one other representative from the Chelicerata subphylum which has had its genome sequenced (Ixodes scapularis [38]) and so adding a draft genome and transcriptome at 23x coverage and 38.5 Gb, respectively, will greatly increase the genetic reference for all Chelicerata. Limulus has at least two distinct endogenous clocks [8,9]. There is a circadian clock, which controls day/night sensitivity to light [39] and a clock system that controls circatidal rhythms of locomotor activity [10]. While several core and accessory genes are known to be part of the molecular structure of the circadian clock in model organisms such as Drosophila [26][27][28], the molecular workings of the clock controlling circatidal rhythms are completely unknown [40,41]. While investigations into the molecular mechanics of this clock in the intertidal mangrove cricket, Apteronemobius asahinai, have shown that the circadian genes period and clock are likely not involved in circatidal rhythms [42,43], our working hypothesis is that circatidal rhythms evolved from 8 International Journal of Genomics Table 3: Top four contigs of day/night and high/low tide conditions from RNA-Seq volcano plot analysis. Contigs chosen were furthest away from the origin while still having an associated gene name in GenBank (Figures 4 and 5). Contig ID, unique identifier for the Limulus transcriptome dataset; top hit ID, accession ID for the top BLAST result for each contig; predicted gene/protein, BLAST result gene or protein description. Top hit -value, -value for top BLAST result; value, probability of statistical significance; absolute fold change, difference in expression between tested conditions; unique gene reads, total depth of reads found for a particular contig within a particular condition; RPKM, reads per kilobase per million mapped reads. existing circadian mechanisms. Indeed, the findings that circatidal rhythms appear to be controlled by two clocks ("circalunidian clocks" [41,44]), each of which has an endogenous period (∼24.8 h) very close to circadian clocks (∼24 h), support this idea. Homologs for all of the core circadian genes have been identified within Limulus (Table 4 [ 45]), two of which (cycle and timeless variants) seem to vary by time of day and tide (Table 4), supporting the idea that some of core circadian genes are somehow involved in modulating the circadian rhythm as well as the circatidal rhythm.
Overall however, RNA-Seq analysis of the majority of core circadian transcript expression across conditions of photoperiod and tidal phase indicated only slight variation in expression (<2-fold for nearly all) for those conditions. Additionally, QPCR data of the core circadian mRNA within the protocerebrum has shown no statistically significant variation of expression for clock, cycle 1, or timeless (Simpson, unpublished results). When combined, these data suggest a relatively constitutive expression profile when looking at daily and tidal differences of most core circadian clock gene transcripts within the central nervous system as a whole. However, both variants of the core clock gene cycle seem to vary moderately (>3-fold) with respect to tidal phase with only one variant (cycle 2) varying moderately (>3-fold) with respect to photoperiod. Additionally, investigations into downstream circadian regulators like the homolog of the neuropeptide PDH identified a transcript that is closely related to a PDH receptor but shares a higher percent identity with the vertebrate receptor for corticotropin-releasing hormone (CRH). Changes in RPKM values for the transcript exhibited a much greater change (3-fold) from conditions of tidal phase rather than from conditions of photoperiod (Table 4). Further BLAST investigations of PDH, CRH, and precursors involved with their synthesis yielded no reads [45]. With the majority of investigated circadian transcript expression exhibiting strong tendencies towards tidal phase rather than photoperiod, these findings may provide evidence of genes involved with the Limulus circatidal timing system.
When the top 20 most significant transcript contigs were pulled out of the experimental groupings (high tide versus low tide and night versus day) and blasted against the NCBI database, only 20% came back with an identified hit (Table 3). Of the genes that were expressed differentially during conditions of photoperiod, only two had -values < 1.0 − 3. The exact function of these genes is not known; however, "DMX-like protein 2" seems to be involved in synaptic processes and "protein trapped in endotherm 1" may play a critical role in germ cell migration. The genes that exhibited the largest changes over conditions of tidal phase all had -values < 1.0 − 31. "Thyroid peroxidaselike" and "3 beta-hydroxysteroid dehydrogenase" seem to be involved in hormone production; "ALX homeobox protein 1" is involved with neural development in mice and "ovostatinlike" is similar to a proteinase inhibitor found in chicken eggs ( Table 3). The lack of definitive gene identification and the small percentage of transcripts that returned with BLAST evidence of homology suggests that the majority of transcripts found to respond to photoperiod and tidal phase in the Limulus transcriptome are novel and these findings support similar studies that found genes involved in ecological adaptation tend to be novel [24,46].
Overall trends in transcript expression indicate a higher degree of difference in transcription occurring during high versus low tide periods than night versus day time periods. For example, there are nearly twice as many transcripts that are differentially expressed during high/low tide versus night/day ( Figure 6). When comparing gene ontology term values (GO;  there are five terms that vary with a minimum percent change ≥5%. When comparing these terms across conditions of night versus day we see a minimum percent change of 5% and maximum percent change of 52%. However, when looking at those same terms over high and low tidal conditions there is a minimum percent change of 36% and a maximum percent change of 198% (Figures 7-9). Taken together the results suggest a transcription profile that is dominated by tides more than by photoperiod. Deeper sequencing and statistical power will perhaps lead to the teasing apart of subtle and potentially significant changes in transcript expression.
The overall analysis of genome completeness using BUSCO was much less than that of the transcriptome (25.2% and 80.7% orthologs identified, resp.) suggesting recalcitrance by the genome and/or genome assembly to the BUSCO analysis [22]. If the genome assembly was causing the low completeness score we would expect a mapping of the transcriptome assembly to the genome assembly to be poor. However, the transcriptome maps to the genome with 88.3% coverage ( Figure 3). Additionally, when the genome assembly was compared to previously reported Limulus genome sizes, it was found to be about half of the expected size, indicating a large, highly repetitive genome [21,47]. Specifically, our assembly appeared to include highly frequent repetitive elements that represent a large factor of the genome. We hypothesize that it is those repeats in the genes (introns) that reduce the success of the findings universal orthologs. By contrast, the transcriptome cuts away these highly repetitive elements. Additionally, while not contributing to the low BUSCO score, other complexities with the genome structure, like the integration of mtDNA elements, may have facilitated increased collapse within the genome assembly.
Nuclear incorporation of mtDNA (NUMT) sequences can cause problems when conducting mtDNA-based studies because they can be confused with the extant version of the mitochondrial genome. To identify potential NUMTs, the Limulus genome assembly was searched for sequences highly similar to the extant version of the mitochondrial genome using BLASTN, which yielded two contigs. In both contigs these putative NUMT sequences were flanked by nonmitochondrial sequences including hypothetical Zinc Finger MYM-type 1 coding sequences, which are not present in any known animal mtDNA genome. The presence of these Zinc Finger sequences suggested either that the mitochondrial-like sequences were NUMTs or that there was an error which is the genome assembly causing the mtDNA sequences to be combined with the gDNA sequences. Results indicate that these sequences seem to be true NUMTs as the percent identity between the suspected NUMT sequences and the mtDNA sequences was less than 85% in either case (Figure 2).  For each of these two putative NUMTs the nuclear and extant mitochondrial sequences are collinear, suggesting that these could have been derived from single translocation events and although there are two distinct mtDNA loci incorporated into two separate genomic contigs their similar levels of divergence from the extant versions may suggest a single common origin for both putative NUMTs.
The draft genome and transcriptome will begin to provide a molecular guide for a host of environmental and experimental studies. This will allow for a deeper, more thorough understanding of the behavioral and environmental impacts and challenges these animals face. These datasets are an important addition to an evergrowing aggregation of genomic and transcriptomic databases.

Conclusions
Endogenous biological rhythms are the result of complex and highly dynamic interactions between many different molecular components. When more than one rhythmic system is present in an organism, such as Limulus, it can be much more difficult to elucidate the most important molecules and their interactions. Here we show that Limulus have many genes that are considered central to other model organism's circadian clocks, but their transcripts do not seem to cycle with respect to time of day. However, there does seem to be some degree of transcript regulation with respect to tidal phase. The draft Limulus genome and transcriptome reported should facilitate future molecular and genetic investigations concerning biological clocks and other processes in chelicerates and other related organisms. Table 4: Reads per kilobase per million mapped reads (RPKM) for common circadian and control genes. The absolute fold change is shown for two conditions, day/night and high/low tide. Gene name, either homologous gene title identified through BLAST or designated name given based on percent identity to similar gene variants; contig ID, contiguous sequence number given to a particular gene during assembly; RPKM values given for four individual animals (begin day high tide, day low tide, day between tides, and night high tide); absolute fold change day/night, the RPKM difference between begin day high tide and begin night high tide; absolute fold change high/low, difference between begin day high tide and begin day low tide.