Evaluation of Allele Frequency Estimation Using Pooled Sequencing Data Simulation

Next-generation sequencing (NGS) technology has provided researchers with opportunities to study the genome in unprecedented detail. In particular, NGS is applied to disease association studies. Unlike genotyping chips, NGS is not limited to a fixed set of SNPs. Prices for NGS are now comparable to those of SNP chips, although for large studies the cost can still be substantial, so pooling techniques are often used to reduce the overall cost of large-scale studies. In this study, we designed a rigorous simulation model to test the practicability of estimating allele frequency from pooled sequencing data. We took crucial factors into consideration, including pool size, overall depth, average depth per sample, pooling variation, and sampling variation. We used real data to demonstrate and measure reference allele preference in DNA-seq data and implemented this bias in our simulation model. We found that pooled sequencing data can introduce a high relative error rate (defined as the error rate divided by the targeted allele frequency) and that the error is more severe for SNPs with low minor allele frequency than for those with high minor allele frequency. To overcome the error introduced by pooling, we recommend a large pool size and a high average depth per sample.


Introduction
Over the last decade, large-scale genome-wide association studies (GWAS) based on genotyping arrays have helped researchers identify hundreds of loci harboring common variants associated with complex traits. However, multiple disadvantages limit genotyping arrays' usefulness for disease association detection. A major disadvantage of genotyping arrays is their limited power for detecting rare disease variants. Rare variants with minor allele frequency (MAF) less than 1% are not sufficiently captured by GWAS [1], even though such low-MAF variants may have substantial effect sizes without showing Mendelian segregation. The lack of a functional link between the majority of the putative risk variants and the disease phenotypes is another major drawback of genotyping array-based GWAS [2]. The most popular genotyping chip, the Affymetrix 6.0 array, contains nearly 1 million SNPs, yet only one-third of these SNPs reside in coding regions. Even though many GWAS-identified statistically significant SNPs lie in intron or intergenic regions [3][4][5], their biological function remains difficult to explain. Another limitation of genotyping arrays is that, because the SNPs are predetermined on the array, no discovery of novel SNPs is possible.
Most of the above limitations can be overcome by using high-throughput NGS technology [6]. NGS can target a specific region of interest, such as the exome or the mitochondria. Often, the functions of variants identified in coding regions of interest are much easier to explain than those of variants identified in intron or intergenic regions. Also, by targeting the exome, we can effectively examine nearly 30 million base pairs in the coding region rather than just 0.3 million SNPs on the Affymetrix 6.0 array. Sequencing technology has been used to detect rare variants, defined as 1%-5% frequency, in many studies [7][8][9][10]. Due to the large sample sizes needed to detect such low-frequency variants, detecting variants rarer than 1% can still pose a significant challenge for NGS technology. One way to overcome this limitation is to build a massive genotyping catalogue such as the 1000 Genomes Project [11], but researchers are often too limited financially to conduct genotyping on such a large scale. DNA pooling is a strategy often used to reduce the financial burden in such cases.
Pooling in genetic studies began in 1985, when the first genetic study applied a pooling strategy [12]. Since then, pooling has been extensively applied in linkage studies in plants [13], allele frequency measurements of microsatellite markers and single nucleotide polymorphisms (SNPs) [10,[14][15][16][17][18], homozygosity mapping of recessive diseases in inbred populations [19][20][21][22], and mutation detection [23]. Even though pooling has also been used widely with NGS technology [24][25][26], the effectiveness of the pooling strategy has long been debated. On the one hand, several studies have claimed that data generated from pooling studies are accurate and reliable. For example, Huang et al. claimed that the minor allele odds ratio estimated from pooled DNA agreed fairly well with the minor allele odds ratio estimated from individual genotyping [27]. Docherty et al. demonstrated that pooling can be effectively applied to the genome-wide Affymetrix GeneChip Mapping 500 K Array [28]. Some studies have even found that pooling designs have an advantage in the detection of rare alleles and mutations, such as the study by Amos et al., which suggested that mutations in individuals could be more efficiently detected using pools [23]. On the other hand, several studies have argued that, when compared with individual sequencing, pooled sequencing can generate variant calls with high false-positive rates [29]. Other studies also found that the ability to accurately estimate the allele frequency from pooled sequencing is limited [30,31].
Two different pooling paradigms are commonly used. The first is multiplexing (also known as barcoding). On an Illumina HiSeq 2000 sequencer, one lane generates, on average, 100 to 150 million reads per run. For exome sequencing, 30 to 40 million reads per sample are needed to generate reliable coverage in the exome for variant detection. Thus, the common practice is to multiplex 3 to 4 samples per lane to reduce cost. With barcode technology, we can identify the origin of each read; the disadvantage of multiplexing is the extra cost of barcoding and labor. The cheaper alternative is pooling without multiplexing, which prevents us from identifying the origin of each read.
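The lane arithmetic above can be sketched as a quick back-of-the-envelope calculation; the read counts below are simply the midpoints of the ranges quoted in the text, not instrument specifications:

```python
# Rough lane budgeting for multiplexed exome sequencing on a HiSeq 2000.
reads_per_lane = 120_000_000   # midpoint of the 100-150 million reads per lane
reads_per_sample = 35_000_000  # midpoint of the 30-40 million reads per exome

# How many samples can share one lane and still meet the per-sample target?
samples_per_lane = reads_per_lane // reads_per_sample
print(samples_per_lane)  # 3
```

This matches the common practice of multiplexing 3 to 4 samples per lane.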
In this study, we focused on pooling without multiplexing. By using comprehensive and thorough simulations, we tried to determine the effectiveness of estimating allele frequency from pooled sequencing data. In our simulation model we considered important factors of pooled sequencing, including overall depth, the average depth per sample, pooling variation, sampling variation, and targeted minor allele frequency (MAF). Another important issue we addressed in our simulation is the reference allele preferential bias, which is a phenomenon during alignment when there is preference toward the reference allele. We used real data to show the effect of reference allele bias and adjusted our simulation model accordingly. We describe our simulation model in detail and present the results from the simulation.

Materials and Methods
We designed a thorough simulation model to closely reflect the real-world pooled sequencing situation. The notation is defined as follows: let p̂ be the allele frequency estimated from pooled sequencing data, and let p be the true allele frequency in the pool. Under the ideal assumption, all samples contribute equally to the pool. In practice, however, factors such as human error and library preparation variation affect a sample's contribution to the pool; very likely, each time a sample is added to the pool, an error is introduced. Let ε_i denote the error of sample i during the pooling process; ε_i follows a normal distribution N(μ, σ²), where μ = 0 and σ² denotes the variance of error in the pool. We assume that the amount of DNA added to the pool for sample i, c + ε_i, follows a normal distribution N(c, σ²), where c is a constant denoting the ideal contribution to the pool by each sample. The probability that a read is contributed by sample i can then be represented as w_i = (c + ε_i) / Σ_{j=1}^{N} (c + ε_j), where N denotes the number of samples in the pool. The contribution of each sample in the pool to a SNP can be modeled as a multinomial distribution Multinomial(D, w), where D equals the depth at this SNP and D_i represents the reads contributed by sample i for this SNP. The depth D follows a Poisson distribution Poisson(λ), where λ equals the average depth for the exome regions. For sequencing data, the reads at a heterozygous SNP should have an allele balance of 50%, meaning that half of the reads support the reference allele while the other half support the alternative allele. Thus the reads that support the alternative allele should follow a binomial distribution Binomial(D_i, 0.5). In our study we estimated the average depth λ for the exome regions as follows: λ = (lanes × reads per lane × capture efficiency) / number of exons. (1) In general, the read output for one lane on an Illumina HiSeq 2000 sequencer is around 120 million reads.
The most popular exome capture kits, including Illumina TruSeq, Agilent SureSelect, and NimbleGen SeqCap EZ, capture almost 100 percent of all known exons (about 30 million base pairs). Most capture kits claim a capture efficiency of at least 70 percent, but in practice the capture efficiency of all these kits has been shown to be only around 50 percent [32], which implies that if a sample is sequenced for 120 million reads, only around 60 million reads will be aligned to exome regions. After filtering for mapping quality, the number of reads aligned to exome regions will be even smaller; to simplify, however, we ignored the reads that failed the mapping quality filter. There are about 180,000 exons [33]. Based on (1), for exome sequencing on 1 Illumina HiSeq lane, the average depth is expected to be roughly 400.
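Expression (1) can be evaluated directly. Plugging in the figures quoted here (one lane, 120 million reads, an effective capture efficiency of 50 percent, and about 180,000 exons) gives a few hundred reads per exon, in the range stated above:

```python
# Average depth per expression (1), using the figures quoted in the text.
lanes = 1
reads_per_lane = 120_000_000
capture_efficiency = 0.5   # effective capture efficiency observed in practice
n_exons = 180_000

avg_depth = lanes * reads_per_lane * capture_efficiency / n_exons
print(round(avg_depth))  # 333
```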
To measure the accuracy of the allele frequency p̂ estimated from pooled sequencing data, we computed the relative root mean square error (RMSE) as follows: relative RMSE = (1/p) √((1/n) Σ_{k=1}^{n} (p̂_k − p)²), (2) where n is the number of simulations we performed to estimate the target allele frequency p. In our simulations, we set n = 10,000. Unlike the traditional RMSE, we divided by the target allele frequency to make the result relative to the allele frequency being simulated, so that we could compare RMSEs for allele frequencies as small as 0.5% and as large as 50%.
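As a concrete illustration, the relative RMSE of a set of repeated estimates can be computed with a few lines of NumPy (the function name is ours):

```python
import numpy as np

def relative_rmse(estimates, p):
    """RMSE of the estimates of p, divided by the target frequency p."""
    estimates = np.asarray(estimates, dtype=float)
    return np.sqrt(np.mean((estimates - p) ** 2)) / p

# Two estimates each off by 0.001 from a 1% target give a relative RMSE of 0.1.
print(relative_rmse([0.009, 0.011], 0.01))
```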
Reference allele preferential bias is a phenomenon in which alignment favors reads carrying the reference allele. Degner et al. described such bias in RNA-seq data [34]. To examine whether this bias also exists in DNA-seq data, we measured the allele balance (defined as the number of reads supporting the alternative allele divided by the total number of reads) in three independent DNA sequencing datasets. The three datasets were sequenced at different facilities (Broad Institute, HudsonAlpha, Illumina), at different time points, and using different capture methods (Agilent SureSelect, Illumina TruSeq, and array-based capture). The theoretical allele balance for heterozygous SNPs should be around 50%. In the real data, we observed that the mean allele balance across all heterozygous SNPs for all samples is 0.483 (range: 0.447-0.499) (Table 1). Thus, we modified our previously defined read distribution at heterozygous sites from Binomial(D_i, 0.5) to Binomial(D_i, b), where D_i is the number of reads contributed by sample i and b follows a normal distribution N(μ_b, σ_b²) whose parameters are estimated from the empirical allele balances we observed in the real data.
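A minimal sketch of drawing alternative-allele read counts under this bias is shown below, assuming NumPy; the mean matches the empirical allele balance reported above, while the standard deviation is a hypothetical placeholder (the study estimates both moments from its real data):

```python
import numpy as np

rng = np.random.default_rng(1)

def alt_read_count(depth_i, mean_balance=0.483, sd_balance=0.01):
    """Draw alternative-allele reads at a heterozygous site with reference bias.

    The per-site allele balance b is drawn from N(mean_balance, sd_balance^2);
    sd_balance here is a placeholder value, not the study's estimate.
    """
    b = float(np.clip(rng.normal(mean_balance, sd_balance), 0.0, 1.0))
    return rng.binomial(depth_i, b)
```

At a depth of 400, the counts center slightly below 200 rather than at 200, reflecting the preference toward the reference allele.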
Three simulations were conducted to evaluate the accuracy of allele frequency estimation from pooled sequencing data. The detailed descriptions of the three simulations are as follows.
Simulation 1. The goal of Simulation 1 was to study the relationship between different levels of pooling error variance and relative RMSE under different pool sizes (N = 200, 400, 800, and 1600) and different MAFs (0.5%, 1%, 5%, 10%, 20%, 30%, 40%, and 50%). Each sample's DNA contribution c + ε_i to the pool follows a normal distribution N(c, σ²), where c is the ideal per-sample contribution and ε_i is the pooling error of sample i. For simulation purposes, we set an arbitrary value c = 10 units; the actual value of c does not affect the outcome of the simulations, because the simulation merely scales around it. To best represent practical scenarios, we used several different standard deviations for the distribution of sample contributions to the pool. For the ideal situation, we set σ to a very small number (10⁻⁵); we then increased σ to 1, 2, and 4 (10%, 20%, and 40% of c) to see the effect of larger error variance on the accuracy of allele frequency estimation using pooled sequencing data. Each allele frequency was simulated 10,000 times.
Simulation 2. The goal of Simulation 2 was to examine the relationship between the average depth per sample and the pool size, with N = 200, 400, 800, and 1600 and MAF = 0.5%, 1%, 5%, 10%, 20%, 30%, 40%, and 50%. Each allele frequency was simulated 10,000 times.
Simulation 3. The goal of Simulation 3 was to determine the overall performance of a pooled exome sequencing study. In practice, we cannot measure a SNP 10,000 times and then compute the average allele frequency as we did in Simulations 1 and 2; we are limited to a single measurement at a given SNP, so it is important to look at the overall performance rather than just a single SNP. Based on the released data of the 1000 Genomes Project, we built an empirical MAF distribution, which should represent an overall picture of the MAF distribution in the population. A typical exome study will yield 10,000-100,000 SNPs after filtering, with the number of SNPs heavily dependent on the number of samples sequenced in the study.
Following the empirical distribution of the MAF, we randomly drew 10,000 SNPs from this distribution to simulate an exome sequencing dataset and computed an overall error rate, defined as |p̂ − p| / ((p̂ + p)/2), where p̂ is the estimated and p the true allele frequency. We further repeated this simulation 1000 times and computed the median error rates.
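Putting the pieces of the model together, a single pooled measurement of one SNP could be simulated roughly as follows; this is an illustrative sketch of the generative model described above, and all parameter defaults are our own choices, not the study's settings:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_pooled_snp(n_samples=400, maf=0.05, lam=400, c=10.0, sigma=1.0,
                        balance=0.5):
    """One simulated pooled-sequencing measurement of a SNP's allele frequency."""
    # Alternative-allele counts (0/1/2) per diploid sample at the target MAF.
    genotypes = rng.binomial(2, maf, size=n_samples)
    # Per-sample DNA contribution c + eps_i with eps_i ~ N(0, sigma^2);
    # normalize to get each sample's probability of contributing a read.
    contrib = np.clip(rng.normal(c, sigma, size=n_samples), 0.0, None)
    w = contrib / contrib.sum()
    # Total depth at the SNP ~ Poisson(lam), split across samples.
    depth = rng.poisson(lam)
    reads_per_sample = rng.multinomial(depth, w)
    # Probability that a read supports the alternative allele: 0, balance, or 1
    # for genotypes 0, 1, 2 (balance < 0.5 would model reference allele bias).
    p_alt = np.where(genotypes == 1, balance, genotypes / 2.0)
    alt_reads = rng.binomial(reads_per_sample, p_alt)
    # Estimated pool allele frequency.
    return alt_reads.sum() / depth if depth > 0 else float("nan")
```

Averaging many such draws recovers an estimate close to the target MAF, while the spread of the draws yields the relative RMSE studied in the simulations.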

Simulation 2.
In this simulation, the goal was to examine the relationship between the average depth per sample and the pool size N. We found that, at the same average depth per sample, larger pool sizes generate lower relative RMSEs. Also, as the MAF increases, the relative RMSE decreases. For example, for MAF = 50% and an average depth per sample of 1, the relative RMSEs for N = 200, 400, 800, and 1600 are 0.071, 0.050, 0.036, and 0.025, respectively. If either the pool size or the average depth per sample could be increased without bound while the other is held fixed, the relative RMSE would approach zero. The results of Simulation 2 can be viewed in Figure 2.
In our study, we performed 10,000 simulations at each MAF. In practice, however, we do not have the resources to measure a SNP 10,000 times and then take the average; in real exome sequencing, each SNP is measured only once. Table 2 shows the quantile information for simulating MAF = 0.5%, 1%, 5%, 10%, 20%, 30%, 40%, and 50% 10,000 times. The mean and median of the estimated MAF are very close to the targeted MAF value. As the MAF increases, the variance also increases; in terms of relative RMSE, however, the simulations still produced more accurate results for larger MAFs.
Simulation 3. In this simulation, we simulated the scenario of pooled exome sequencing. Using data from the 1000 Genomes Project, which contains genotyping data from 1092 individuals, as prior information, we built an empirical distribution of MAF (Figure 3). Based on this empirical distribution, we simulated pooled exome sequencing with pool size N = 1092 1000 times and computed the median error rate for each simulation (Figure 4). The results clearly indicate that higher depth is required to produce an acceptable error rate (below 5%). For standard exome sequencing, pooled DNA from 1000 subjects will require roughly 16 Illumina HiSeq lanes to produce results with an acceptable error rate.
Financial Implication. The ultimate goal of pooling is to ease the financial burden of large association studies. Based on the most up-to-date pricing information for NGS, we compared the total cost of conducting association studies using pooling at different pool sizes with that of individual sequencing on an Illumina HiSeq 2000 sequencer, which holds 2 flow cells of 8 lanes each. Table 3 shows the price difference between pooling and individual sequencing. The savings from pooling are more substantial when the pooled sample size is large: when using all 16 lanes, the saving for 200 samples is roughly fivefold relative to individual sequencing and, for 1000 samples, roughly 23-fold.

Discussion
Our simulation showed that there are several important factors to consider when designing a pooling study. Those factors include sample size, targeted MAF, and, most importantly, the depth. The sample size directly affects the ability to detect rare SNPs. Larger pool size will increase the accuracy of MAF estimation with the same per sample depth but will not have much effect with the same overall depth. Similarly, with the same pool size, increasing depth will decrease relative RMSE. Our simulation also showed that pooled sequencing is not ideal for estimating the MAF of rare SNPs. The relative RMSE is much higher for SNPs with MAF < 1% compared to SNPs with MAF > 5% (Figure 1).
Sequencing pooled DNA will ease financial burdens and make large association studies possible. At the same time, however, pooling introduces additional errors. The majority of these errors are caused by the unequal representation of each sample's DNA in the pool. This unequal representation can be due to human or machine error, which we have considered in our simulation. Other factors can also cause unequal representation, such as a sample's DNA quality and the variation introduced at the PCR/amplification stage; such errors and variation can only be minimized through more sophisticated lab techniques. Even if every sample is equally represented in the pool, the sequenced data still will not truly reflect that equality, due to sampling variance. Based on our simulation results, when designing a pooling study we recommend the following: a larger pool size and a higher depth are both better, and, more precisely, it is best to keep a balance between pool size and depth. We recommend keeping the average depth per sample at a minimum of 10 if rare SNPs are not of interest; otherwise, an average depth per sample of at least 20 is highly recommended.