Design and Statistical Analysis of Pooled Next Generation Sequencing for Rare Variants

Next generation sequencing NGS is a revolutionary technology for biomedical research. One highly cost-efficient application of NGS is to detect disease association based on pooled DNA samples. However, several key issues need to be addressed for pooled NGS. One of them is the high sequencing error rate and its high variability across genomic positions and experiment runs, which, if not well considered in the experimental design and analysis, could lead to either inflated false positive rates or loss in statistical power. Another important issue is how to test association of a group of rare variants. To address the first issue, we proposed a new blocked pooling design in which multiple pools of DNA samples from cases and controls are sequenced together on same NGS functional units. To address the second issue, we proposed a testing procedure that does not require individual genotypes but by taking advantage of multiple DNA pools. Through a simulation study, we demonstrated that our approach provides a good control of the type I error rate, and yields satisfactory power compared to the test-based on individual genotypes. Our results also provide guidelines for designing an efficient pooled.


Introduction
An understanding of the role of genetic variants in human diseases provides valuable insights into the etiology of diseases.Next generation sequencing NGS , also known as massively parallel sequencing, is a revolutionary technology for biomedical research 1 .The production of large numbers of low-cost reads makes NGS useful for many applications.incorporated into statistical models of testing disease association.It was shown by us and others that disease association can be validly and efficiently examined when the sequencing error parameters can be correctly specified 16, 17 .However, current statistical approaches often assume that the sequencing error parameters in the statistical models are known 16 or able to be estimated by using an internal control, such as a segment of plasmid DNA, in the pool 13, 17 .Because of the high variation in the error rates between genomic positions as well as different runs/lanes of NGS instruments, it is not adequate to use the average error rate for adjusting for the bias, potentially causing either inflated type I error rates or loss in statistical power 16, 17 .How to accurately estimate position-and lane-specific error rates relies on efficient pooled sequencing designs.
One major advantage of the sequencing technology over the SNP array is that it can ascertain novel rare variants that are not present on the array panels.However, it is well known that the power of testing association of rare variants individually is very limited due to the low occurrence of the rare alleles.To improve the statistical power, many statistical approaches have been proposed to simultaneously test a group of rare variants over recent years.Among them, the "collapsing" approach defines a score for each individual by the unweighted or weighted sum of the rare variant alleles of multiple positions in the targeted region.This approach essentially increases the "allele frequency" by pooling multiple variants, and hence improves the power, but the power of this approach relies on the assumption that all rare variant alleles have effects in the same direction 18-23 .To avoid such an assumption that is often not realistic, other approaches such as the test statistic based on the genomic distance and C-alpha test were also proposed 24, 25 .Nevertheless, these approaches all require individual genotypes for accounting for the linkage disequilibrium LD among multiple variants.Because LD information is largely lost in pooled sequencing, how to test disease association of a group of rare variants is still an open question.
In this paper, we proposed blocked pooling design combining bar-coding and pooling sequencing, along with a new multivariant testing procedure, for testing disease association of rare variants.We conducted a simulation study to examine the performance of the new approach under various situations.

Blocked Pooling Design
Sequencing error is the major concern of pooled sequencing because it has a significant impact on the validity and efficiency of testing disease association 17 .Because sequencing errors and bases of true variant alleles confound each other in a single DNA pool, it is often too difficult to differentiate them to obtain an accurate estimate of the sequencing error rate and the allele frequency.Nevertheless, if the sequencing error rates across multiple pools are consistent, a more accurate estimate may be obtained by combining data of multiple DNA pools.To understand sequencing error rates across multiple pools, we have conducted a study that used the GA II system to sequence the pooled mitochondria DNA mtDNA from 20 subjects, whose mtDNA had been sequenced previously using Sanger dideoxy sequencing on an ABI3730XL 15 .The pooled mtDNA samples were multiplexed by bar-coding at 2 pools per lane and replicated in another lane on a different flow cells.Using the results of Sanger sequencing as the reference, the data suggested that locus-specific base-calling error rates are quite consistent between two pools multiplexed in one lane, but vary between two lanes in different flow cells.In addition, sequencing error rates across genomic positions have a significant variation.Although the majority of positions have an error rate lower than 1%, it can be as high as 20%, suggesting that the use of the average error rate of all genomic positions to account for sequencing error is not adequate in testing disease association, even when such an error rate is estimated from a segment of plasmid DNA as the internal control 13, 16, 17 .Based on data from the pooled mtDNA sequencing study, we propose to combine pooling and bar-coding approaches to sequence multiple DNA pools of cases and controls in one lane with each pool indexed.This experiment design can be looked upon as blocked design, which is known to improve the statistical validity as well as power, in particular, when a large variability between blocks here lanes is present 26 .With multiple pools indexed in one lane, the sequencing errors are largely consistent across multiple indexed pools, while bases of the true variant alleles may vary because different numbers of alleles are likely sampled in different pools.The idea of blocked pooling design is that each pool can serve as the control of other pools in the same block to eliminate effects of sequencing errors, and eventually improve the validity and efficiency of testing disease association.Furthermore, an unbalanced pooling design different sizes of pools could be considered to obtain an even more accurate estimation of the sequencing error rate.For example, a pool with a single individual and another pool with a large number of individuals can be multiplexed in one lane.In this design, the pool with one individual serves as the control for accurately estimating the sequencing error rate, while the pool with a large number of individuals provides the data for accurately estimating the allele frequency.The pool with a small number of individuals could provide a more accurate estimate of the sequencing error rate because of the large difference between the allele frequency e.g., 0, 0.5, or 1 for one individual and the sequencing error rate.In the ideal situation in which there are no sequencing errors, the balanced pooling design provides the most efficient estimate of the allele frequency because of the consistent depth of coverage for each individual.However, in the presence of sequencing errors it is necessary to balance between estimating the allele frequency and estimating the sequencing error rate to obtain an optimal association result.We empirically evaluated the importance of parameters of blocked designs in terms of the bias and standard error SE of the estimate of the sequencing error rate and the allele frequency.

Estimating the Sequencing Error Rate and Testing Association of Single Variants
For a case-control study, let the phenotype of a subject be denoted by i 1, 0 for cases or controls, respectively.We are interested in the question of whether the variant allele is associated with disease.Let θ i be the allele frequency of the group i.The statistical hypothesis of association can be tested by examining if cases have a different frequency of the variant allele from controls, which could be written as H 0 : θ 1 θ 0 versus H 1 : θ 1 / θ 0 .
Let n ij be the total number of chromosomes and let v ij be the number of the variant alleles at a locus of interest for the jth pool of ith group.For a pooled sequencing, v ij is unknown and has to be estimated from sequencing reads.We assume that cases and controls are assigned in L 1 and L 0 pools, respectively, indexed in a single sequencing lane.After sequencing, m ij sequencing bases at the locus are observed, and x ij out of m ij bases report the variant allele for the jth pool.To estimate the sequencing error rate e and the allele frequency θ , we consider a simple EM algorithm, given by 0 Initial θ 0 and e 0 , 1 E step where 3 Iteratively update θ and e until converge.
For testing disease association of a rare variant, we have proposed a simple testing procedure based on a parametric bootstrap PB , which is defined by the following steps:

Testing Association of Multiple Rare Variants
Because the statistical power to detect disease association of rare variants individually is often limited, it is useful to jointly test association of a group of rare variants, for example, rare variants in an exon or a gene.Our test statistic is based on P values of individual variants.Let p r r 1, . . ., R be the P value for variant r.The test statistic is defined by When multiple rare variants are in linkage equilibrium no correlation , the statistic follows a standard normal distribution.The question is that, when multiple rare variants are in LD, the P value cannot be obtained based on a standard distribution.The permutation procedure randomly shuffling the disease status is often used to account for the correlation among genetic variants.However, such a procedure requires individual genotypes that are not available in pooled sequencing.Instead, we can take a Monte Carlo approach by simulating the test statistics of individual variants under the null hypothesis from multivariate normal distribution to evaluate the P value.In this approach, we simulate the multivariate normally distributed vector with mean 0 and covariance Σ, the R × R matrix of pair-wise correlations.To do this, we use the Cholesky decomposition: a vector of R independent, standard normally distributed random variables is first generated; then it is multiplied by the Cholesky decomposition matrix of Σ.The simulated test statistic is calculated based on the multivariate normally distributed vector.A large number of multivariate normal vectors are simulated, and the empirical P value is defined by the proportion of simulated test statistics that exceed the observed test statistic .
The statistical challenge is how to estimate the covariance matrix Σ without individual genotypes available.By treating single pools as the sample unit, we estimate the covariance matrix based on the number of variant alleles of pools, instead of the number of alleles of individuals.One option is the standard unbiased empirical covariance matrix S with entries defined as s rr , which v jr is the estimated number of variant alleles of the rth variant in the jth pool.However, this unbiased estimate is known to be inefficient, particularly because the number of the pools is often relatively small.Because rare mutations usually occur on different haplotypes within a target region 32 , and therefore their correlations are often low.This motivated us to use an empirical Bayesian shrinkage estimate of the covariance, which may provide better balance between efficiency and bias 33 .The proposed shrinkage estimate is in the following form: where λ r / r v ar s rr r v ar s rr / r / r s 2 rr r s rr − 1 2 is the shrinkage intensity.The idea of this empirical Bayesian estimate is that, when the data do not provide evidence of correlation of variants, the estimate is shrunk toward an identity matrix, the possibly efficient estimator under the assumption of independency of variants.Of note, this estimate is essentially equivalent to that proposed by Schafer 34 .

Simulations
We conducted a simulation study to examine the impact of varied parameters of pooled designs on the estimation of the sequencing error rate and the allele frequency as well as the test of disease association in terms of validity and efficiency.For each replicate, the pooled sequencing reads of each pool were simulated in the following two-steps: the individual genotypes were first generated under Hardy-Weinberg equilibrium; the sequencing reads of each pool were then generated independently.The sample size was set at 500 cases and 500 controls; individuals were included in different numbers of DNA pools under either the balanced or the unbalanced designs.For the unbalanced designs, one half of the pools included single subjects, and the remaining individuals were evenly assigned to the other half of pools.We set the numbers of reads were consistent cross pools.The type I error rate and power will be evaluated by the proportion of replicates having a P value that is less than a significant level of 0.05.For each simulated situation, the process was repeated for 1,000 replicates.
The performance of the PB test was examined for testing single variants for different types of designs under different allele frequencies 1% and 5% , sequencing error rates 0.5% and 1% , depths of coverage per chromosome 5, 10, and 20× , and numbers of pools 2, 10 and 40 .To evaluate the type I error rate, we simulated the sequencing reads under the null hypothesis of no association, in which circumstance the cases and controls have the same allele frequency.For comparison, we also considered a Naïve Fisher's FN exact test that is based on the estimated allele frequency without taking sequencing errors into account, Fisher's exact test based on the estimated allele frequency with taking sequencing error into account FE , and Fisher's exact test based on the true individual genotypes FT .For the FN test, the number of variant alleles is directly estimated by the proportion of reads that report the variant allele.For the FE test, the number of the variant alleles is based on the allele frequency estimated by the EM algorithm; and the FT test assumes that the genotype of each individual is known and hence the number of the variant alleles can be simply counted.To evaluate the power, we fixed the allele frequency in controls, but allowed the allele frequency in cases to vary in order to yield different effect sizes.
The performance of the PB test was then examined for testing multiple variants.Different numbers of variants and varied correlations were considered.To simulate correlated variants, a set of variables was sampled from multivariate normal distribution with mean 0 and covariance Σ, which had equal pair-wise correlations ρ 0 or 0.5 .The haplotype was generated by dichotomizing the normal variables based on the allele frequencies of cases and controls.The genotypes in each DNA pool were randomly sampled from a large number of haplotypes, and reads for each variant were then sampled independently.We examined multivariant tests based on three different estimates of covariance matrix the unbiased empirical covariance estimate E , the independent matrix I , and the shrinkage estimate S and compared them to the single-variant test with Bonferroni correction minP .

Estimating the Sequencing Error Rate and the Allele Frequency
Table 1 presents results for estimating the sequencing error rate and the allele frequency for balanced and unbalanced pooled sequencing designs under different sequencing depths of coverage, numbers of pools, allele frequencies, and sequencing error rates.As expected, the unbalanced pooling design had smaller bias of the estimate of the sequencing error rate than the balanced design.For example, when the allele frequency and the sequencing error rate were both 1% and the number of pools was 10, the bias of the unbalanced design was <0.0001, while the bias of the balanced design was −0.0074.In addition, the SE of the sequencing error rate of the unbalanced design was comparable to that of the balanced design.Interestingly, the bias of the allele frequency of the unbalanced design was also smaller than the balanced design and their SEs were comparable.Surprisingly, the bias and SE of both the sequencing error rate and the allele frequency were not significantly improved by increasing the number of pools from 2 to 40.As expected, the bias and SE of the sequencing error rate and the allele frequency tended to decrease with an increasing sequencing coverage.

Type I Error Rate
The empirical type I error rate at a significance level of 0.05 is shown in Table 2.In general, the FT test tended to be overconservative when the allele frequency was low.When the depth of coverage is relatively low 5× , the FE test often had a very poor control of type I error rate for both the unbalanced and balanced designs, partially because the variance of the estimate of the number of variant alleles is not negligible due to the low depth of coverage.The FN test was either overliberal or overconservative because it ignores both the sequencing error and the variation of the estimate of the number of variant alleles.Table 2 indicates that the type I error rate of the PB test was consistently close to the nominal level of 0.05 for the unbalanced design, while it could be either liberal or conservative for the balanced design, which is likely due to that the balanced design could not provide an accurate estimate of the sequencing error rate and the allele frequency under low depths of coverage.With an increased depth of sequencing coverage 10× and 20× , the FE test had an improved control of the Type I error rate for the unbalanced design, while it was still a little conservative for the balanced design.
The FN test tended to be more conservative for both the balanced and unbalanced designs with an increasing depth of coverage.The PB test consistently kept a good control of the type I error for the unbalanced design.

Power
We only evaluated the power of the PB test for the unbalanced design because the balanced design did not provide a good control of the type I error rate.As the reference, the FT test that assumes individual genotypes are observed was compared.
Figure 1 shows the empirical power of the PB test for testing association of single variants.Because of the confounding effect of the sequencing error as well as the uncertainty of the estimate of the number of variants allele in a pool, the PB test was generally less powerful than the FT test.However, the loss in power was reduced with a decreasing sequencing error rate or an increasing sequencing depth of coverage.The power of the PB test was not significantly different between various numbers of pools, in particular for the numbers of pools were 10 and 40.The difference in power between the PB test and the FT test seemed more obvious for a more common variant, which could be due to the conservativeness of the FT test for testing relatively rare variants.The results of two versions of Fisher's exact test based on the estimated number of the variant alleles were not presented here, because they generally have a poor control of the type I error rate.Nevertheless, after adjusting for the inflated type I rate they tend to be less powerful than the proposed PB test, Figure 1: Empirical power at the a level of 5% for the parametric bootstrap PB test as a function of the difference of the allele frequency between cases and controls under various sequencing error rates, numbers of pools, and depths of sequencing coverage for testing association.Sample size was set at 500 cases and 500 controls.The minor allele frequencies MAF of controls were set at 0.01 and 0.5, sequencing error rates e were set at 0.005 and 0.01; and numbers of pools were set at 2, 10, and 40.Lines with different colors indicate the power of the PB test under different depths of coverage, which are compared to that of the Fisher's exact test read line based on the true individual genotypes FT .
in particular for rare variants, because of the tendency of conservativeness of Fisher's exact test itself in particular for rare variants data not shown .

Type I Error Rate
The empirical type I error rate at a significant level of 0.05 for testing association of multiple rare variants is shown in Table 3.The multi-variant PB test based on the empirical unbiased estimate of the covariance had the worst performance, it was too liberal when multiple rare variants were in linkage equilibrium, while it was overconservative when variants were in LD.This was more obvious when the sequencing error rate was high 1% .As expected, the test based on an identity covariance matrix had a good control of the type I error rate when multiple variants were uncorrelated, but it tended to be liberal when variants were in LD.The single-variant test based on Bonferroni correction was consistently conservative when variants were in either LD or linkage equilibrium.Compared to other tests, the multivariants PB test based on a shrinkage estimate had the best performance.The results were similar for different numbers of pools for an unbalanced design.As expected, the type I error rate was improved for the test based on the empirical estimate of the covariance with an increasing number of pools.The PB test based on the shrinkage estimate kept a good control of the type I error rate.

Power
Figure 2 shows the empirical power of different tests for testing association of multiple variants under various numbers of pools, numbers of variants, sequencing error rates, depths of sequencing coverage, and correlation structures.In general, the single-variant test with Bonferroni correction had the worst performance in terms of power, which may be due to two reasons: first, it does not make use of the accumulated effects from all variants; second, it has a conservative type I error rate.Among different multi-variant tests, the test based on the unbiased estimate of the covariance was consistently less powerful than the other two tests, even though it had a liberal type I error rate when variants were in LD data not shown .The power of the tests based on a shrinkage estimate and an identity covariance matrix was comparable when variants are in linkage equilibrium Figure 2 a , but the identity covariance matrix seemed slightly more powerful than the shrinkage estimate in particular when the variants were in LD, which may be due to the fact that the test based on an identity covariance matrix had a liberal type I error rate in this case data not shown .

Discussion
In this paper, we addressed two important questions of testing disease association of rare variants by pooled sequencing.One critical issue is that the sequencing error rate is high and has a significant variability across genomic positions.Ignoring the position-specific sequencing error could lead to a biased estimate of the allele frequency, and eventually a biased association result that can be either conservative or liberal, which was shown in our simulations.Another important issue is that the pooling procedure introduces an extra variance of the estimated number of variant alleles in a pool.Ignoring the uncertainty of the number of variant alleles could result in an inflated type I error rate, in particular in the case that the sequencing depth of coverage is low.This problem was indicated by the simulation results of the FE test which is directly based on the estimated number of variant alleles.To tackle these two questions, we proposed to use blocked pooling design to efficiently estimate the position-specific sequencing error rate and the allele frequency, along with a parameter bootstrap testing procedure to account for the extra variance of the estimate of the number of variant alleles in a pool.
We have proposed blocked pooling design to address the above two questions.Although blocked design in this paper was discussed based on lanes of flow cell, the similar idea could be extended to flow cells to take into account two sources of variation: variation between lanes within a flow cell and variation between flowcells.Based on blocked pooling design, an EM algorithm was used for estimating the position-specific sequencing error rate by making use of data from multiple pools.We examined the bias and standard error of the estimate of the sequencing error rates of different pooling designs under various situations through simulations.Intuitively, the EM algorithm should have a better performance when the number of pools is large and the number of individuals in a single pool is small because of the large difference between the minimal allele frequency of a pool and the sequencing error rate.As the result, we found the unbalanced design in which one half of pools included single individuals could provide a much more accurate estimate of the sequencing error rate as well as the allele frequency, while it does not sacrifice much on the variance of these estimates.Previously, we found that misspecification sequencing error has much more important impact on the statistical power than other parameters of pooled sequencing, for example, the depth of coverage and the number of pools 17 .Because the unbalanced design could provide more accurate estimates of the error rate and the allele frequency, the proposed PB test based on the unbalanced design not only consistently maintained a good control of the Type I error rate, but also provided higher power than the balanced design under various situations, even when the depth of coverage was low 5× .For balanced design, however, the proposed PB test tended to be anticonservative for low coverage data.As such, we suggest that the unbalanced blocked design, rather than the more commonly-used balance design, should be used in practice.Before a pooled sequencing study, it may be a good strategy to perform a simulation study to obtain the optimal unbalanced design based on the size of sequencing region and total depth of coverage.Under our simulated situations, for the given number of subjects, depth of coverage and type of design, the number of pools ranging from 10 to 40 did not significantly improve the estimate of the sequencing error rate and the allele frequency, and hence it was not a significant parameter for the statistical power.This result could be important, because it suggested that the pooled sequencing can be very cost-effective by including a small number of large pools with many individuals and small  Empirical power at a level of 5% as a function of the difference of the allele frequency between cases and controls for the proposed PB test based on various estimates of the covariance matrix for testing multiple rare variants under the unbalanced design.The allele frequency of controls was 0.01; the sample size was set at 500 cases and 500 controls; the error rates e were set at 0, 0.005, and 0.01; the depths of coverage were set at 10× and 20×, and the numbers of pools r were set at 5 and 10. Figure a shows the power of the PB test based on an identity covariance matrix I , the PB test based on the shrinkage estimate of the covariance matrix S and the single-variant test with Bonferroni correction the number of variants for independent variants minP .Figure b shows the power of the PB test based on the empirical estimate of the covariance matrix E , the PB test based on the shrinkage estimate of the covariance matrix S and the single-variant test with Bonferroni correction minP the number of variants for independent variants.pools with single individuals in an unbalanced design, which is able to achieve adequate power.
As a single rare variant is likely to have a low marginal effect on disease risk, particularly in the presence of genetic heterogeneity, it is beneficial to jointly test a group of rare variants in a functional unit, such as genes or pathways.We extended the PB method for multiple rare variants.As with other multivariant tests based on individual genotypes, the multivariant PB test is designed for situations in which many rare variants present in the target region.Because our multivariant test is defined by the sum of Z scores transformed from single P values, it does not rely on the assumption on the direction of effects.Even if the effects of rare allele are uniformly in one direction, such as increasing risk, the proposed test can easily incorporate such information by using one-sided single P values to define the test statistic.Its another advantage is that the power is not primarily driven by more common variants when variants with different allele frequencies present in the target region.Because individual genotypes are not available in pooled sequencing, permutation testing is not an option for accurate significance estimation in scenarios where LD is present.We proposed a Monte Carlo approach by simulating the null distribution of the test statistic based on the estimate covariance between variants.The validity and efficiency of this approach rely on how well the covariance can be estimated.Because of the limit number of pools, the test based on the empirical unbiased covariance estimate did not have a good control of the type I error rate and often led to loss in power.However, the test based on the shrinkage estimate could provide a more satisfactory control on the type I error rate.Yet, it maintained comparable power to the test based on the unknown true covariance.One concern of the proposed approach is that the simulation procedure may lead to significant computational time for large-scale sequencing-based studies.To reduce computational burden, more effective approaches could also be obtained based on the shrinkage estimate of the covariance matrix 35 .
The test procedure relies on several assumptions for the different steps of resequencing.The first step of resequencing is typically pulldown of the target genomic region and amplification.We assumed that the targeted genomic regions of subjects in a pool are amplified independently with an equal probability.One concern about this is the presence of heterogeneity in DNA amount in a pool.In this case, individuals are not evenly represented in the pool, and hence the assumption of the resampling approach that alleles of different subjects are drawn with the same probability is not valid.Indeed, the presence of heterogeneity in DNA amount was found to inflate the variance of the test statistic and hence lead to an inflated type I error rate data not shown .However, if multiple independent markers ≥30 are sequenced, it may be possible to use an approach similar to the genomic control to adjust for the inflated variance 36, 37 .
In summary, our results suggest that pooled next-generation sequencing with the unbalance blocked design and the appropriate analytic approach could be a valid and costeffective tool for screening the association of rare variants with diseases.Compared with individual sequencing, it is beneficial in terms of the reduction in cost and time but does not sacrifice much in statistical efficiency.

Figure 2 :
Figure2: Empirical power at a level of 5% as a function of the difference of the allele frequency between cases and controls for the proposed PB test based on various estimates of the covariance matrix for testing multiple rare variants under the unbalanced design.The allele frequency of controls was 0.01; the sample size was set at 500 cases and 500 controls; the error rates e were set at 0, 0.005, and 0.01; the depths of coverage were set at 10× and 20×, and the numbers of pools r were set at 5 and 10.Figureashows the power of the PB test based on an identity covariance matrix I , the PB test based on the shrinkage estimate of the covariance matrix S and the single-variant test with Bonferroni correction the number of variants for independent variants minP .Figurebshows the power of the PB test based on the empirical estimate of the covariance matrix E , the PB test based on the shrinkage estimate of the covariance matrix S and the single-variant test with Bonferroni correction minP the number of variants for independent variants.

Table 1 :
Estimated bias and standard error SE of the sequencing error rate and the allele frequency for the unbalanced and balanced designs under different sequencing depths of coverage, numbers of pools, allele frequencies, and sequencing error rates.Bias θ SE θ Bias e SE e Bias θ SE θ

Table 2 :
Type I error rates at a level of 5% for the PB test and Fisher's exact tests under various depths of coverage, numbers of pools, allele frequencies, and the error rates for testing association.Sample size was set at 500 cases and 500 controls.

Table 3 :
Type I error rates at a level of 5% for multivariant tests under various allele frequencies, error rates, depths of coverage, and numbers of variants for testing association.Sample size was set at 500 cases and 500 controls.Simulations were based on the unbalanced design with 10 and 20 pools.