High-throughput sequencing is a common approach to discover SNP variants, especially in plant species. However, methods to analyze predicted SNPs are often optimized for diploid plant species whereas many crop species are allopolyploids and combine related but divergent subgenomes (homoeologous chromosome sets). We created a software tool, SNiPloid, that exploits and interprets putative SNPs in the context of allopolyploidy by comparing SNPs from an allopolyploid with those obtained in its modern-day diploid progenitors. SNiPloid can compare SNPs obtained from a sample to estimate the subgenome contribution to the transcriptome or SNPs obtained from two polyploid accessions to search for SNP divergence.
The advent of high-throughput sequencing technologies is revolutionizing our ability to discover and exploit single-nucleotide polymorphisms (SNPs). Polyploidy occurs in many animals and plants but is particularly widespread in flowering plants, including many major crops. However, most methods used to discover and validate predicted SNPs are optimized for diploid species, so specific challenges related to polyploidy remain to be addressed.
Many polyploid plants including coffee (
The sequencing of transcripts using high-throughput sequencing methods (RNA-Seq) can provide fresh insights into polyploid biology [
Here we present a new tool, SNiPloid, that can tackle the many aspects involved in the analysis of SNPs in the context of allopolyploidy. Based on the coassembly of homoeologs, SNiPloid compares either putative SNPs detected from an allopolyploid to those obtained in its parental genomes, or putative SNPs derived from two allopolyploid accessions to search for polymorphism. SNiPloid web server and source code (downloadable under the CeCILL public license) are accessible at
Before interpreting the results of RNA-Seq data using SNiPloid, data preprocessing is required. Biologists can preprocess their data through the Galaxy public server (
Data preprocessing. Before launching SNiPloid, each individual sample needs to be preprocessed by successively running mapping alignments and SNP calling.
SNiPloid assumes that short-read datasets (i.e., samples), each derived from a single genotype or from distinct accessions (diploid or polyploid), are separately aligned against a single diploid transcriptome reference corresponding to one of the parental diploids using dedicated mapping software such as BWA [
Mapping alignment is a key step in data preprocessing, and mapping parameters need to be adjusted and optimized to best fit the single diploid genome used as reference. Because the reference diploid transcriptome is more closely related to one of the two subgenomes of the tetraploid, mapping efficiency may differ between subgenomes. This can indirectly bias the interpretation of SNPs, notably when analyzing relative homoeologous gene expression, that is, the contribution of each subgenome to total gene expression.
The SNiPloid utility uses the power of the Variant Call Format (VCF) which lists SNP variations and assigns alleles for each sequenced sample, by comparison with a reference sequence [
Inputs to the SNiPloid software consist of two different GATK outputs for each sample: (i) a VCF file listing putative SNPs and (ii) a coverage depth file (Figure
SNiPloid comprises three main steps (Figure
(a) SNiPloid procedure. For each reference sequence or gene of a diploid genome G2, SNiPloid extracts intervals that meet a minimal coverage depth threshold for each sample (1a) and identifies overlapping intervals between samples (1b). It then extracts putative SNPs in both samples within these common regions (2) and compares the differences observed between samples in order to interpret the situation (3). (b) Phylogenetic contexts within a polyploid genome and assignment of SNP categories.
In the second step, also for each sample, SNiPloid extracts alleles from the VCF file for SNP positions within the defined common regions. In the third step, the differences observed between samples are compared and the situation is interpreted.
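The first two steps described above can be illustrated with a minimal Python sketch. It is not the SNiPloid implementation: the function names, the per-position depth dictionaries, and the 1-based inclusive interval convention are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # assumed convention: 1-based, inclusive (start, end)


def covered_intervals(depths: Dict[int, int], min_depth: int) -> List[Interval]:
    """Step 1a: merge consecutive positions whose depth meets the threshold."""
    intervals: List[Interval] = []
    start = prev = None
    for pos in sorted(depths):
        if depths[pos] < min_depth:
            continue
        if prev is not None and pos == prev + 1:
            prev = pos  # extend the current interval
        else:
            if start is not None:
                intervals.append((start, prev))
            start = prev = pos  # open a new interval
    if start is not None:
        intervals.append((start, prev))
    return intervals


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Step 1b: regions covered in both samples (inputs sorted, non-overlapping)."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def snps_in_regions(snp_positions: List[int], regions: List[Interval]) -> List[int]:
    """Step 2: keep only SNP positions that fall inside the common regions."""
    return [p for p in snp_positions if any(s <= p <= e for s, e in regions)]


# Hypothetical depths for one reference sequence in two samples
sample1 = {1: 10, 2: 12, 3: 2, 4: 15, 5: 20, 6: 11}
sample2 = {1: 3, 2: 15, 3: 15, 4: 15, 5: 15, 6: 4}
common = intersect(covered_intervals(sample1, 10), covered_intervals(sample2, 10))
print(common)                                # [(2, 2), (4, 5)]
print(snps_in_regions([1, 2, 5, 6], common)) # [2, 5]
```

Restricting the comparison to positions covered in both samples avoids interpreting a missing SNP call as an absence of variation when the position was simply not sequenced deeply enough.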
Using its main functionality (“ Patterns 1 and 2 correspond to interspecific SNPs and are assigned if an allele is specific to one of the parental genomes. The mutation occurred after the polyploidization event (e.g., diploid1 A/A, diploid2 G/G, and tetraploid G/G). Pattern 5 corresponds to putative homoeoSNPs because the same variation is observed in tetraploids and between parental genomes (e.g., diploid1 A/A, diploid2 G/G, and tetraploid A/G). With this pattern, SNiPloid identifies in which subgenome the homoeoallele resides by using diploid sequence alleles. In the second step, by retrieving and combining allelic depths for the reference and alternate alleles provided in the VCF format, it can estimate the subgenome contribution to the transcriptome for each homoeologous gene. Patterns 3 and 4 are attributed when the variation observed in the tetraploid is not identified between parental genomes (e.g., diploid1 A/A, diploid2 A/A, and tetraploid A/G). The mutation may have occurred in one of the subgenomes of the allotetraploid after the polyploidization event. Because the mapping of an allotetraploid contains a mixture of reads originating from two subgenomes, pattern 3 cannot be distinguished from pattern 4 without haplotype information, and a pattern “3 or 4” is assigned. In addition, SNiPloid can benefit from the phasing information included in the VCF file derived from the allotetraploid to infer the origin of an allele and distinguish between a hypothetical evolution pattern 3 or 4. Indeed, the VCF format provides for the coding of allele phasing information (allele pairs specified by 0|1 instead of 0/1 if phased with the previous polymorphism) in order to define haplotype blocks. Thus, if provided in the VCF, the phasing information can link such a SNP to a pattern 5 SNP whose subgenome origin is known and thus distinguish between patterns 3 and 4. Basically, this haplotype-based process makes it possible to identify putative subgenome-specific SNPs.
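The categorization described above can be sketched in Python as follows. This is an illustrative reading of the text, not SNiPloid's own code: the function names are hypothetical, genotypes are passed as VCF-style strings such as "A/G", and because the text does not state which of patterns 1 and 2 corresponds to which parent, the sketch returns "1 or 2" for that case. Pooling allelic depths by simple summation is likewise an assumption, since the text only says that depths are combined.

```python
from typing import List, Optional, Tuple


def alleles(genotype: str) -> frozenset:
    """Set of alleles in a genotype string, phased ('|') or unphased ('/')."""
    return frozenset(genotype.replace("|", "/").split("/"))


def is_phased(genotype: str) -> bool:
    """In VCF, '|' marks a genotype phased with the previous polymorphism."""
    return "|" in genotype


def classify(diploid1: str, diploid2: str, tetraploid: str) -> str:
    """Assign a SNP category from the three genotypes at one position."""
    d1, d2, t = alleles(diploid1), alleles(diploid2), alleles(tetraploid)
    if len(d1) != 1 or len(d2) != 1:
        return "unclassified"  # assumption: heterozygous parents are not handled
    if d1 != d2:
        if t == d1 | d2:
            return "5"         # putative homoeoSNP
        if t == d1 or t == d2:
            return "1 or 2"    # interspecific SNP, parent not resolved here
    elif len(t) == 2 and d1 < t:
        return "3 or 4"        # variation seen only in the tetraploid
    return "unclassified"


def subgenome_contribution(
    depths: List[Tuple[int, int]]
) -> Optional[Tuple[float, float]]:
    """Pool allelic depths over the pattern 5 SNPs of one gene.

    Each tuple is (reads carrying the subgenome 1 allele,
                   reads carrying the subgenome 2 allele).
    """
    s1 = sum(a for a, _ in depths)
    s2 = sum(b for _, b in depths)
    total = s1 + s2
    if total == 0:
        return None
    return s1 / total, s2 / total


print(classify("A/A", "G/G", "A/G"))   # 5
print(classify("A/A", "G/G", "G/G"))   # 1 or 2
print(classify("A/A", "A/A", "A|G"))   # 3 or 4
print(subgenome_contribution([(30, 10), (45, 15)]))  # (0.75, 0.25)
```

In this sketch a phased genotype is only detected, not exploited; resolving "3 or 4" would additionally require following the haplotype block to a neighboring pattern 5 SNP.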
SNiPloid is a component of the South Green Bioinformatics Platform (
Alternatively, SNiPloid can be downloaded as a component of the Galaxy project [
The Web application allows the export of the detailed list of classified SNPs in a tabulated format. At the end of the process, the program summarizes the analysis by counting the different SNP classes for each gene/contig of the reference dataset and by reporting the results in a dynamic sortable table (Figure
SNiPloid outputs. (a) SNiPloid produces HTML outputs showing the number of predefined SNP categories and an approximate ratio of subgenome contribution to the transcriptome for each reference sequence. (b) SNiPloid is also able to generate a graphic image that shows the overall distribution of SNP categories and of subgenome contributions along the chromosomes.
In addition, when the objective is to calculate general statistics or SNP frequencies along the transcriptome, the counts of SNP categories can be normalized by the number of positions taken into account in the analysis, that is, positions that met the minimum coverage depth threshold defined by the user.
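As a minimal illustration of this normalization, the Python sketch below counts SNP categories per gene and divides each count by the number of positions that met the coverage threshold. The function name, the input layout, and the gene identifiers are hypothetical and do not reflect SNiPloid's actual output format.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def summarize(
    snps: List[Tuple[str, str]],
    covered_positions: Dict[str, int],
) -> Dict[str, Dict[str, Tuple[int, Optional[float]]]]:
    """Per gene and category: (SNP count, SNPs per covered position).

    snps: (gene, category) pairs for every classified SNP.
    covered_positions: gene -> number of positions meeting the depth threshold.
    """
    counts: Dict[str, Counter] = {}
    for gene, category in snps:
        counts.setdefault(gene, Counter())[category] += 1

    summary = {}
    for gene, per_category in counts.items():
        n = covered_positions.get(gene, 0)
        summary[gene] = {
            category: (k, k / n if n else None)  # no frequency without coverage
            for category, k in per_category.items()
        }
    return summary


# Hypothetical classified SNPs and coverage figures
snps = [("gene1", "5"), ("gene1", "5"), ("gene1", "3 or 4"), ("gene2", "1 or 2")]
covered = {"gene1": 1000, "gene2": 500}
for gene, row in summarize(snps, covered).items():
    print(gene, row)
```

Dividing by covered positions rather than by the full reference length keeps frequencies comparable between genes with very different expression levels.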
Basically, the second option “
Finally, SNiPloid includes a viewer that allows a graphical overview of the distribution of SNP categories and of subgenome contributions along the chromosomes (Figure
This functionality can only be applied to species for which a complete and fully annotated reference genome sequence is available. It requires a structural genome annotation in General Feature Format (GFF) as additional input, which supplies the viewer program with the genomic coordinates of the gene models used as reference. The aim is to rapidly localize regions with potentially highly biased expression, introgressed genes, or homogenized regions within the genome.
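To make the role of the annotation concrete, the following Python sketch extracts gene coordinates from GFF lines. It assumes GFF3-style records (nine tab-separated columns, with an ID=... entry in the attributes column) and features of type "gene"; the function name and the sample records are hypothetical, and the viewer's actual parsing may differ.

```python
from typing import Dict, Iterable, Tuple


def gene_coordinates(gff_lines: Iterable[str]) -> Dict[str, Tuple[str, int, int]]:
    """Map each gene ID to (chromosome, start, end), 1-based inclusive."""
    genes: Dict[str, Tuple[str, int, int]] = {}
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # keep only well-formed gene records
        attributes = dict(
            item.split("=", 1) for item in cols[8].split(";") if "=" in item
        )
        gene_id = attributes.get("ID")
        if gene_id:
            genes[gene_id] = (cols[0], int(cols[3]), int(cols[4]))
    return genes


# Hypothetical annotation records
example = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t1000\t2500\t.\t+\t.\tID=gene1;Name=abc",
    "chr1\tsrc\tmRNA\t1000\t2500\t.\t+\t.\tID=gene1.1;Parent=gene1",
    "chr2\tsrc\tgene\t300\t900\t.\t-\t.\tID=gene2",
]
print(gene_coordinates(example))
```

With such a mapping, per-gene SNP category counts and subgenome contributions can be placed at their chromosomal positions for plotting.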
A whole transcriptome analysis was conducted on the allotetraploid
An example dataset sampled from this study is provided on the SNiPloid Web server to familiarize users with the correct input and expected results.
The main functionality of SNiPloid is dedicated to RNA-Seq data and to polyploid species for which a diploid transcriptome reference is available for at least one of the parents.
One limitation of using RNA-Seq for SNP detection and subsequent interpretation is that transcript sequences represent only the expressed part of the genome and that sequencing depth varies considerably across the genome because of differences in gene expression levels. Thus, only SNPs in well-expressed genes can be detected, and allele or homoeolog expression bias can make certain SNPs difficult to detect because of their low frequency in the transcriptome. However, NGS technologies and the use of appropriate read cutoffs make it possible to detect and interpret SNPs for a large number of genes distributed across the genome.
Theoretically, a genome-wide analysis of genomic data would also be possible, although allele expression could not be quantified. From a technical point of view, however, whole-genome analysis would be difficult to perform through our Web server, since it requires uploading sizeable VCF and depth files. Such analyses should instead be run from the command line after downloading the SNiPloid package, or through Galaxy.
In terms of performance, in our practical experience, two RNA-Seq samples derived from a polyploid and a diploid species, each first mapped against a complete reference transcriptome and each yielding 600,000 putative SNPs, can be compared by the SNiPloid Web server in less than five minutes.
Even though numerous SNP bioinformatics tools or pipelines exist for SNP calling (GATK [
An example of pipeline reported by Hand et al. [
This approach is relevant and more advanced but can be somewhat more tedious to operate. The main advantage of SNiPloid is its ease of use: it does not require the preliminary, potentially time-consuming work of establishing a homoeoSNP database, and it offers non-bioinformaticians a ready-to-use Web server that rapidly provides subgenome attribution through a “one click” analysis.
In addition, our approach seems to be more appropriate for allopolyploid species in which the polyploidization event is evolutionarily recent, such as Coffea or Spartina.
To our knowledge, SNiPloid is the first Web tool dedicated to and optimized for the SNP analysis of RNA-Seq data obtained from an allopolyploid species. By exploiting the well-organized information stored in the standard VCF format, SNiPloid helps to interpret putative SNPs detected in a whole transcriptome by means of a comprehensive SNP categorization. SNiPloid is appropriate for allotetraploids and opens new prospects for investigating allopolyploid genome structure or expression.