Next-generation sequencing (NGS) technology has given researchers the opportunity to study the genome in unprecedented detail. In particular, NGS is applied to disease association studies. Unlike genotyping chips, NGS is not limited to a fixed set of SNPs. Prices for NGS are now comparable to those of SNP chips, although for large studies the cost can still be substantial. Pooling techniques are often used to reduce the overall cost of large-scale studies. In this study, we designed a rigorous simulation model to test the practicability of estimating allele frequency from pooled sequencing data. We took crucial factors into consideration, including pool size, overall depth, average depth per sample, pooling variation, and sampling variation. We used real data to demonstrate and measure reference allele preference in DNA-seq data and implemented this bias in our simulation model. We found that pooled sequencing can introduce a high relative error rate (defined as the error rate divided by the targeted allele frequency) and that the error is more severe for SNPs with low minor allele frequency than for SNPs with high minor allele frequency. To overcome the error introduced by pooling, we recommend a large pool size and a high average depth per sample.
Over the last decade, large-scale genome-wide association studies (GWAS) based on genotyping arrays have helped researchers to identify hundreds of loci harboring common variants that are associated with complex traits. However, several disadvantages limit the ability of genotyping arrays to detect disease associations. A major disadvantage is their limited power for detecting rare disease variants. Rare variants with minor allele frequency (MAF) less than 1% are not sufficiently captured by GWAS [
Most of the above limitations can be overcome by using high throughput NGS technology [
The concept of pooling in genetic studies began in 1985 with the first genetic study to apply a pooling strategy [
Two kinds of pooling paradigms are commonly used. The first is multiplexing (also known as barcoding). On an Illumina HiSeq 2000 sequencer, one lane generates, on average, 100 to 150 million reads per run. For exome sequencing, 30 to 40 million reads per sample are needed to generate reliable coverage of the exome for variant detection. Thus, the common practice is to multiplex 3 to 4 samples per lane to reduce cost. With barcodes, we are able to identify the sample of origin of each read. The disadvantage of multiplexing with barcoding is the extra cost of barcodes and labor. The cheaper alternative is pooling without multiplexing, which prevents us from identifying the origin of each read.
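The multiplexing arithmetic can be checked with a short sketch. The function name `samples_per_lane` is hypothetical, and the 120 million reads per lane used in the example is the typical HiSeq 2000 figure quoted later in the text.

```python
def samples_per_lane(reads_per_lane: int, reads_per_sample: int) -> int:
    """Number of barcoded samples that fit on one lane, rounded down."""
    return reads_per_lane // reads_per_sample

# Typical lane output of about 120 million reads, with 30 to 40 million
# reads required per exome sample.
at_30m = samples_per_lane(120_000_000, 30_000_000)  # 4 samples
at_40m = samples_per_lane(120_000_000, 40_000_000)  # 3 samples
```

With a typical lane, the result falls in the range of 3 to 4 samples per lane described above.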
In this study, we focused on pooling without multiplexing. Using comprehensive simulations, we sought to determine how effectively allele frequency can be estimated from pooled sequencing data. Our simulation model considers important factors of pooled sequencing, including overall depth, average depth per sample, pooling variation, sampling variation, and targeted minor allele frequency (MAF). Another important issue we addressed is reference allele preferential bias, a phenomenon in which alignment favors reads carrying the reference allele. We used real data to show the effect of reference allele bias and adjusted our simulation model accordingly. We describe our simulation model in detail and present the results from the simulation.
We designed a thorough simulation model to closely reflect the real-world pooled sequencing situation. Our simulation model includes notations which we have defined as follows: let
In our study we estimated the average depth for the exome regions as follows:
In general, the read output of one lane on an Illumina HiSeq 2000 sequencer is around 120 million reads. The most popular exome capture kits, including Illumina TruSeq, Agilent SureSelect, and NimbleGen SeqCap EZ, capture almost 100 percent of all known exons (about 30 million base pairs). Most capture kits claim a capture efficiency of at least 70 percent, but, in practice, it has been shown that the capture efficiency of all these kits is only around 50 percent [
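As a rough illustration of how these figures combine into an average exome depth, the sketch below uses a common back-of-the-envelope estimate: on-target bases divided by target size. It is not necessarily the expression used in this study. The function name `mean_exome_depth` is hypothetical, and the read length of 100 bp is an assumption that is not stated in the text.

```python
def mean_exome_depth(reads: int, read_length_bp: int,
                     capture_efficiency: float, target_bp: int) -> float:
    """Approximate average depth: bases that land on target / target size."""
    on_target_bases = reads * read_length_bp * capture_efficiency
    return on_target_bases / target_bp

# 120 million reads per lane, 50% capture efficiency, 30 Mbp exome target.
# The 100 bp read length is an assumed value for illustration only.
depth_one_lane = mean_exome_depth(120_000_000, 100, 0.5, 30_000_000)
```

Under these assumptions, one lane yields an average exome depth of about 200, which would be shared among all samples in a pool.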
To measure the accuracy of the allele frequency
Reference allele preferential bias is a phenomenon during alignment when there is preference toward the reference allele. Degner et al. described such bias in RNA-seq data [
Allele balance for 3 independent datasets.
Dataset | Sample | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | Mean 95% CI low | Mean 95% CI high |
---|---|---|---|---|---|---|---|---|---|
SureSelect | 1055QC0003 | 0.091 | 0.423 | 0.48 | 0.48 | 0.536 | 0.862 | 0.476 | 0.483 |
SureSelect | 1055QC0004 | 0.1 | 0.427 | 0.477 | 0.48 | 0.53 | 0.826 | 0.477 | 0.483 |
SureSelect | 1055QC0005 | 0.046 | 0.429 | 0.481 | 0.482 | 0.536 | 0.939 | 0.479 | 0.486 |
SureSelect | 1055QC0006 | 0.1 | 0.418 | 0.478 | 0.481 | 0.542 | 0.909 | 0.477 | 0.485 |
SureSelect | 1055QC0007 | 0.156 | 0.417 | 0.476 | 0.475 | 0.536 | 0.879 | 0.472 | 0.479 |
SureSelect | 1055QC0008 | 0.148 | 0.421 | 0.481 | 0.482 | 0.542 | 0.905 | 0.479 | 0.486 |
SureSelect | 1055QC0009 | 0.148 | 0.422 | 0.478 | 0.48 | 0.536 | 0.963 | 0.476 | 0.483 |
SureSelect | 1055QC0011 | 0.1 | 0.421 | 0.481 | 0.48 | 0.538 | 0.952 | 0.477 | 0.484 |
SureSelect | 1055QC0012 | 0.095 | 0.429 | 0.478 | 0.48 | 0.531 | 1 | 0.477 | 0.483 |
SureSelect | 1055QC0013 | 0.165 | 0.424 | 0.482 | 0.482 | 0.541 | 0.9 | 0.479 | 0.486 |
SureSelect | 1055QC0014 | 0.103 | 0.429 | 0.481 | 0.483 | 0.538 | 0.818 | 0.48 | 0.487 |
SureSelect | 1055QC0016 | 0.13 | 0.425 | 0.48 | 0.482 | 0.54 | 0.909 | 0.478 | 0.485 |
SureSelect | 1055QC0017 | 0.136 | 0.422 | 0.481 | 0.48 | 0.536 | 0.9 | 0.477 | 0.483 |
SureSelect | 1055QC0018 | 0.182 | 0.424 | 0.48 | 0.48 | 0.537 | 0.987 | 0.477 | 0.483 |
SureSelect | 1055QC0020 | 0.2 | 0.432 | 0.483 | 0.485 | 0.536 | 0.815 | 0.482 | 0.488 |
SureSelect | 1055QC0021 | 0.12 | 0.429 | 0.481 | 0.484 | 0.538 | 1 | 0.48 | 0.487 |
SureSelect | 1055QC0022 | 0.091 | 0.424 | 0.478 | 0.479 | 0.533 | 0.905 | 0.476 | 0.482 |
SureSelect | 1055QC0024 | 0.077 | 0.422 | 0.478 | 0.478 | 0.535 | 0.857 | 0.474 | 0.481 |
SureSelect | 1055QC0025 | 0.13 | 0.429 | 0.481 | 0.484 | 0.54 | 0.897 | 0.481 | 0.488 |
SureSelect | 1055QC0026 | 0.13 | 0.42 | 0.478 | 0.479 | 0.539 | 0.793 | 0.476 | 0.482 |
SureSelect | 1055QC0028 | 0.039 | 0.419 | 0.477 | 0.476 | 0.531 | 0.938 | 0.472 | 0.479 |
TruSeq | 10009 | 0.044 | 0.447 | 0.5 | 0.499 | 0.55 | 1 | 0.496 | 0.501 |
TruSeq | 10244 | 0.091 | 0.444 | 0.5 | 0.497 | 0.55 | 0.909 | 0.495 | 0.499 |
TruSeq | 10290 | 0.065 | 0.444 | 0.5 | 0.497 | 0.55 | 0.917 | 0.495 | 0.499 |
TruSeq | 20007 | 0.077 | 0.447 | 0.5 | 0.498 | 0.55 | 0.923 | 0.496 | 0.5 |
TruSeq | 20017 | 0.044 | 0.447 | 0.5 | 0.498 | 0.55 | 0.921 | 0.496 | 0.5 |
TruSeq | 20301 | 0.077 | 0.449 | 0.5 | 0.499 | 0.55 | 0.967 | 0.497 | 0.501 |
Array based | ERR004043 | 0.04 | 0.376 | 0.44 | 0.447 | 0.511 | 0.986 | 0.44 | 0.453 |
Array based | ERR004047 | 0.125 | 0.391 | 0.447 | 0.451 | 0.503 | 1 | 0.446 | 0.457 |
Array based | SRR013908 | 0.081 | 0.37 | 0.475 | 0.481 | 0.584 | 0.977 | 0.472 | 0.489 |
Array based | SRR013909 | 0.071 | 0.372 | 0.476 | 0.484 | 0.591 | 0.95 | 0.476 | 0.492 |
Array based | SRR015428 | 0.093 | 0.389 | 0.488 | 0.49 | 0.586 | 0.909 | 0.483 | 0.498 |
Array based | SRR015429 | 0.1 | 0.426 | 0.496 | 0.497 | 0.564 | 0.913 | 0.491 | 0.503 |
All | Mean | 0.103 | 0.421 | 0.482 | 0.483 | 0.543 | 0.919 | 0.479 | 0.487 |
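The per-sample columns of this table could be produced along the following lines. This is a minimal sketch, not the authors' pipeline. It assumes that allele balance is the fraction of reads carrying the non-reference allele at a heterozygous site and that the confidence interval of the mean uses a normal approximation; neither is stated explicitly in the text. The function names and read counts are hypothetical.

```python
import math
import statistics

def allele_balance(ref_reads: int, alt_reads: int) -> float:
    """Assumed definition: non-reference reads / all reads at a heterozygous site."""
    return alt_reads / (ref_reads + alt_reads)

def summarize(values):
    """Summary statistics matching the table columns (normal-approximation CI)."""
    n = len(values)
    mean = statistics.fmean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(n)
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values), "q1": q1, "median": median, "mean": mean,
        "q3": q3, "max": max(values),
        "ci_low": mean - half_width, "ci_high": mean + half_width,
    }

# Hypothetical (ref, alt) read counts at four heterozygous sites of one sample.
sites = [(10, 10), (12, 8), (11, 9), (9, 11)]
summary = summarize([allele_balance(ref, alt) for ref, alt in sites])
```

Under this definition, a mean allele balance below 0.5, as in most rows of the table, corresponds to a preference for the reference allele.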
Three simulations were conducted to evaluate the accuracy of allele frequency estimation from pooled sequencing data. The detailed descriptions of the three simulations are as follows.
The goal of Simulation
The goal of Simulation
The goal of Simulation
We assume that each sample’s DNA contribution to the pool follows a normal distribution
Relative RMSE for different pool sizes and MAFs under different standard deviations.
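One simulated measurement of a SNP in a pool with pooling variation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' simulation code. It assumes that each sample's relative DNA contribution is drawn from a normal distribution with mean 1 (truncated at zero), that overall depth equals pool size times average depth per sample, that the minor allele is the non-reference allele, and that reference bias reweights the two alleles by the expected allele balance `alt_read_fraction`. The function name and all parameter names are hypothetical.

```python
import random

def simulate_pooled_af(maf, pool_size, depth_per_sample,
                       contribution_sd=0.0, alt_read_fraction=0.5, rng=None):
    """Return one estimate of the minor allele frequency from a simulated pool."""
    rng = rng or random.Random()
    # Minor-allele count (0, 1, or 2) for each diploid individual in the pool.
    genotypes = [sum(rng.random() < maf for _ in range(2)) for _ in range(pool_size)]
    # Pooling variation: unequal DNA contribution per sample (assumed Normal(1, sd)).
    weights = [max(rng.gauss(1.0, contribution_sd), 0.0) for _ in range(pool_size)]
    total_weight = sum(weights)
    if total_weight == 0.0:  # degenerate draw; fall back to equal contributions
        weights, total_weight = [1.0] * pool_size, float(pool_size)
    pool_af = sum(w * g / 2 for w, g in zip(weights, genotypes)) / total_weight
    # Reference bias (assumed form): reweight alleles so that a 50/50 site
    # yields minor-allele reads with probability alt_read_fraction.
    b = alt_read_fraction
    denom = pool_af * b + (1 - pool_af) * (1 - b)
    p_read = pool_af * b / denom if denom > 0 else 0.0
    # Sampling variation: each read independently shows the minor allele or not.
    total_depth = max(1, round(pool_size * depth_per_sample))
    minor_reads = sum(rng.random() < p_read for _ in range(total_depth))
    return minor_reads / total_depth

rng = random.Random(0)
estimates = [simulate_pooled_af(0.10, 100, 20, contribution_sd=0.3, rng=rng)
             for _ in range(500)]
mean_estimate = sum(estimates) / len(estimates)
```

Repeating the call many times at a fixed MAF and pool size, and varying `contribution_sd`, gives the kind of comparison across standard deviations that the figure summarizes.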
In this simulation, the goal was to examine the relationship between average depth per sample
In our study, we performed simulations at each MAF 10,000 times. However, in practice, we do not have the resources to measure a SNP 10,000 times and then take the average. In real exome sequencing, each SNP is only measured one time. Table
Summary statistics of the estimated allele frequency over 10,000 simulations at each MAF.
MAF (%) | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | Var. | Relative RMSE |
---|---|---|---|---|---|---|---|---|
0.5 | 0.0000 | 0.0036 | 0.0049 | 0.0050 | 0.0064 | 0.0162 | 0.0000 | 0.5037 |
1 | 0.0000 | 0.0075 | 0.0098 | 0.0100 | 0.0124 | 0.0264 | 0.0000 | 0.3552 |
5 | 0.0256 | 0.0448 | 0.0500 | 0.0500 | 0.0551 | 0.0795 | 0.0001 | 0.1540 |
10 | 0.0615 | 0.0928 | 0.0999 | 0.1000 | 0.1071 | 0.1401 | 0.0001 | 0.1070 |
20 | 0.1444 | 0.1904 | 0.2000 | 0.2001 | 0.2098 | 0.2558 | 0.0002 | 0.0716 |
30 | 0.2449 | 0.2889 | 0.2997 | 0.2998 | 0.3106 | 0.3619 | 0.0003 | 0.0537 |
40 | 0.3397 | 0.3879 | 0.3998 | 0.4000 | 0.4118 | 0.4707 | 0.0003 | 0.0442 |
50 | 0.4348 | 0.4877 | 0.5000 | 0.4998 | 0.5116 | 0.5675 | 0.0003 | 0.0359 |
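The last column can be computed from repeated estimates as in the sketch below. It assumes that relative RMSE is the root mean squared error of the estimates divided by the targeted allele frequency, in line with the relative error rate defined in the abstract. The function name `relative_rmse` and the example values are hypothetical.

```python
import math

def relative_rmse(estimates, target_af):
    """Root mean squared error of the estimates, divided by the target frequency."""
    mse = sum((e - target_af) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / target_af

# Two hypothetical estimates of a SNP whose true frequency is 0.05.
example = relative_rmse([0.04, 0.06], 0.05)  # errors of 0.01 on a 0.05 target
```

This definition is roughly consistent with the table: at a MAF of 50%, a variance of about 0.0003 corresponds to a standard deviation of about 0.017, or about 0.035 relative to the target frequency of 0.5.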
Relative RMSE for different pool sizes and MAFs under different average depths per sample.
In this simulation, we simulated the scenario of pooled exome sequencing. Using genotype data from 1092 individuals in the 1000 Genomes Project as prior information, we built an empirical distribution of MAF (Figure
1000 Genomes MAF distributions.
Median error rates for simulating 1000 exome sequences using different numbers of lanes. The simulation with 2 lanes shows an error rate of nearly 30%, whereas the simulation with 16 lanes shows an error rate of only around 5%.
Pooled and individual sequencing pricing.
Sequencing per pool | 200 samples | 400 samples | 600 samples | 800 samples | 1000 samples |
---|---|---|---|---|---|
2 lanes | $3,650 | $4,050 | $4,450 | $4,850 | $5,250 |
4 lanes | $6,650 | $7,050 | $7,450 | $7,850 | $8,250 |
6 lanes | $9,650 | $10,050 | $10,450 | $10,850 | $11,250 |
8 lanes | $12,650 | $13,050 | $13,450 | $13,850 | $14,250 |
10 lanes | $15,650 | $16,050 | $16,450 | $16,850 | $17,250 |
12 lanes | $18,650 | $19,050 | $19,450 | $19,850 | $20,250 |
16 lanes | $24,650 | $25,050 | $25,450 | $25,850 | $26,250 |
Individual prep. | $125,000 | $250,000 | $375,000 | $500,000 | $625,000 |
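Every entry in this table is reproduced by a simple linear rule, sketched below. The coefficients are inferred from the table itself rather than stated in the text, and reading them as a fixed cost, a per-lane cost, and a per-sample cost is an assumption. The function names are hypothetical.

```python
def pooled_cost(lanes: int, samples: int) -> int:
    """Pooled sequencing cost in dollars, as a linear fit to the table."""
    return 250 + 1500 * lanes + 2 * samples

def individual_cost(samples: int) -> int:
    """Individual preparation cost in dollars: $625 per sample, from the last row."""
    return 625 * samples

# Example: 1000 samples pooled on 16 lanes versus prepared individually.
pooled = pooled_cost(16, 1000)       # 26250
individual = individual_cost(1000)   # 625000
```

Under this reading, the pooled cost is driven almost entirely by the number of lanes, while the individual cost scales with the number of samples.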
Our simulation showed that there are several important factors to consider when designing a pooling study. Those factors include sample size, targeted MAF, and, most importantly, depth. The sample size directly affects the ability to detect rare SNPs. A larger pool size increases the accuracy of MAF estimation at the same per-sample depth but has little effect at the same overall depth. Similarly, at the same pool size, increasing depth decreases relative RMSE. Our simulation also showed that pooled sequencing is not ideal for estimating the MAF of rare SNPs. The relative RMSE is much higher for SNPs with MAF < 1% than for SNPs with MAF > 5% (Figure
Sequencing pooled DNA will ease financial burdens and make large association studies possible. At the same time, however, pooling introduces additional errors. The majority of the errors are caused by unequal representation of each sample’s DNA in the pool. This unequal representation could be due to human or machine error, which we have considered in our simulation. Other factors can also cause unequal representation, such as a sample’s DNA quality and variation introduced in the PCR/amplification stage. Unfortunately, such errors and variation can only be minimized by using more sophisticated lab techniques. Even if every sample is equally represented in the pool, the sequenced data still do not exactly reflect that equality because of sampling variance. Based on our simulation results, we recommend the following when designing a pooling study: a larger pool size is better, and a higher depth is better. More specifically, pool size and depth should be kept in balance. We recommend an average depth per sample of at least 10 if rare SNPs are not of interest; otherwise, an average depth per sample of at least 20 is highly recommended.
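The effect of sampling variance on rare SNPs can be illustrated with a simplified calculation. As an assumption not taken from the text, suppose the pool is perfectly balanced, there is no reference bias, and the $D$ reads covering a SNP are independent draws, so that the number of minor-allele reads $X$ is binomial with the true frequency $p$. The symbols $D$, $X$, $p$, and $\hat{p}$ are introduced here for illustration only.

$$
X \sim \mathrm{Binomial}(D, p), \qquad \hat{p} = \frac{X}{D}, \qquad \mathrm{Var}(\hat{p}) = \frac{p(1-p)}{D}, \qquad \frac{\sqrt{\mathrm{Var}(\hat{p})}}{p} = \sqrt{\frac{1-p}{pD}}
$$

Under this idealization, the relative error grows as $p$ shrinks for a fixed $D$: at the same depth, a SNP with $p = 0.005$ has about $\sqrt{199/19} \approx 3.2$ times the relative error of a SNP with $p = 0.05$. Pooling variation and reference bias would add to this baseline.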
The authors would like to thank Peggy Schuyler and Margot Bjoring for their editorial support.