Power Estimation for Gene-Longevity Association Analysis Using Concordant Twins

Statistical power is a major concern in genetic association studies. Related individuals such as twins are valuable samples for genetic studies because of their genetic relatedness. Phenotype similarity in twin pairs provides evidence of genetic control over phenotypic variation in a population. Genetic association studies of human longevity, a complex trait under the control of both genetic and environmental factors, have been hampered by the small sample sizes of long-lived subjects, which limit statistical power. Twin pairs concordant for longevity have an increased probability of carrying beneficial genes and are thus useful samples for gene-longevity association analysis. We conducted a computer simulation to estimate the power of association studies using longevity-concordant twin pairs. We observed remarkable power increases when singletons from longevity-concordant twin pairs were used as cases in comparison with sporadic probands. Fraternal twins concordant for longevity required about double the sample size of identical twins to reach similar power, suggesting that longevity-concordant identical twins are more efficient samples than fraternal twins. We also observed an approximately 2- to 3-fold increase in the sample size needed for a longevity cutoff at age 90 as compared with age 95. Overall, our results show the high value of twins in genetic association studies of human longevity.


Introduction
Complex phenotypes such as human longevity are associated with multiple genetic and environmental factors, with perhaps the majority of them having low to modest effects [1]. As such, power has been a crucial concern in genetic association studies. Although a desired statistical power can always be achieved by increasing the sample size, many factors, including laboratory cost, can easily limit the scale of a study. This is especially true for genomic analyses that are currently still expensive, for example, genome sequencing. Twins are special samples that have made remarkable contributions to human genetic studies because of their genetic and environmental sharing. In genetic epidemiology, the classical twin design has been widely used to estimate the genetic and environmental components in the variation of disease phenotypes or traits [2]. For example, using Danish twins, the genetic contribution to human longevity has been estimated at about 25% [3,4]. The low heritability and the complex nature of human longevity leave genetic association studies of the trait underpowered. In the literature, the search for genes associated with longevity has continued over many decades with only one gene, APOE, being conclusively confirmed.
Because of their genetic relatedness, twin pairs concordant for longevity are enriched for beneficial genes, and thus association studies using singletons from longevity-concordant twin pairs should have increased power in comparison with studies using sporadic long-lived individuals. This paper assesses and explores the power advantage of using longevity-concordant twin pairs by computer simulation. The simulation is based on a proportional hazard assumption and makes use of recent life table data of the Danish population. Lifespan data will be generated for identical or …

Figure 1: The concordant design and the ordinary case-control design for gene-longevity association study. The red frame represents concordant twin design in which singletons from longevity twin pairs (blue) are collected as cases. In contrast, the red frame is the ordinary case-control design in which sporadic longevity cases (blue) are sampled. Both designs use young subjects in the middle (orange) as controls.

Experiment Design.
The most popular experiment design for genetic association studies of human longevity is the case-control design, which samples long-lived individuals (e.g., centenarians and nonagenarians) as cases and young or middle-aged individuals as controls [5]. The power issue for the case-control design has been investigated by Tan et al. [6].
The current simulation study focuses on the power advantage of using singleton twins from twin pairs concordant for longevity as cases (Figure 1). That is, from each concordant twin pair reaching a certain threshold for longevity (e.g., 90 or 95 in this simulation), only one twin is taken as a case for genotyping. The controls are collected from unrelated individuals as in ordinary case-control studies. With this design, the final data for analysis contain unrelated cases and controls, with cases collected as singletons from longevity-concordant twin pairs (one from each pair) and controls as unrelated individuals aged 40-50 years. The study design draws equal numbers of cases and controls in the final sample.

… years for females. The simulation takes the mean survival rate over the two sexes. The use of the observed population survival rate avoids imposing any parametric form on the survival function in the simulation and thus ensures that our simulated survival data follow the survival distribution of a real population (i.e., the Danish population).

The Proportional Hazard Model.
For a given genetic variant, for example, a single nucleotide polymorphism (SNP), we assign a frequency parameter $p$ and a relative risk parameter $r$ for the minor allele of the SNP. Then the observed population survival rate at any age is a weighted average of three subpopulations carrying 0, 1, and 2 copies of the minor allele, respectively [7], that is,

$$S(x) = (1-p)^2 S_0(x) + 2p(1-p)\, S_1(x) + p^2 S_2(x). \tag{1}$$

In (1), $S(x)$, $S_0(x)$, $S_1(x)$, and $S_2(x)$ are the survival rates for the total population and for the three subpopulations at age $x$; $(1-p)^2$, $2p(1-p)$, and $p^2$ are the frequencies of the corresponding genotypes following the binomial distribution of the minor allele frequency (MAF), $p$. With the proportional hazard assumption, the hazards of death corresponding to $S_1(x)$ and $S_2(x)$ can be expressed as $\mu_1(x) = r\,\mu_0(x)$ and $\mu_2(x) = r^2 \mu_0(x)$, so that

$$S_1(x) = S_0(x)^{r}$$

and likewise, $S_2(x) = S_0(x)^{r^2}$. With these relationships and for given MAF $p$ and relative risk $r$, (1) can be solved numerically to obtain a nonparametric estimate of the baseline survival $S_0(x)$, and then $S_1(x)$ and $S_2(x)$ can be calculated and used for generating individual lifespans. In the simulation, we introduced a heterogeneity model for the baseline survival function in order to take into account the unobserved factors that also affect individual survival [8].
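The numerical solution of (1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the additive-effect relations $S_1 = S_0^{r}$ and $S_2 = S_0^{r^2}$, uses plain bisection, and leaves out the heterogeneity model. The function names and the survival values in the usage loop are hypothetical placeholders, not Danish life table data.

```python
def mixture_survival(s0, p, r):
    """Right-hand side of (1) with S1 = S0**r and S2 = S0**(r*r)."""
    return ((1 - p) ** 2 * s0
            + 2 * p * (1 - p) * s0 ** r
            + p ** 2 * s0 ** (r * r))


def baseline_survival(s_pop, p, r, tol=1e-12):
    """Solve mixture_survival(s0, p, r) == s_pop for s0 by bisection.

    The mixture is increasing in s0 on [0, 1], equals 0 at s0 = 0 and
    1 at s0 = 1, so a single root exists for any s_pop in [0, 1].
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mixture_survival(mid, p, r) < s_pop:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical population survival rates by age (illustrative only).
population_survival = {50: 0.95, 90: 0.20, 95: 0.06}
p, r = 0.3, 0.9  # assumed MAF and relative risk

for age, s_pop in population_survival.items():
    s0 = baseline_survival(s_pop, p, r)
    s1, s2 = s0 ** r, s0 ** (r * r)
    print(age, round(s0, 4), round(s1, 4), round(s2, 4))
```

Solving age by age keeps the baseline nonparametric, since no functional form is imposed on the population survival curve.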

Simulating Lifespan.
In order to simulate lifespan using the genotype-specific survival functions $S_0(x)$, $S_1(x)$, and $S_2(x)$, a genotype was randomly assigned to each individual using the binomial probability of the MAF. For MZ twin pairs, this was first done for one singleton and then the same genotype was copied to the cotwin. For DZ twin pairs, we started by independently simulating genotypes for each parent of a twin pair and assigned a genotype to each singleton in a DZ pair by randomly taking one allele from each parent. This ensures that the two twins within a pair have a 50% chance of inheriting an allele identical by descent (IBD). For unrelated controls, a genotype was randomly assigned to each control subject using the binomial distribution of the minor allele and a lifespan was then simulated. Subjects aged 40-50 years were selected as controls. We simulated power for cases from concordant MZ and DZ twin pairs separately in order to compare power between zygosities.
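The genotype and lifespan steps described above can be sketched as follows, under loudly stated simplifications. A Gompertz baseline hazard with made-up parameters `A` and `B` stands in for the life-table baseline, the heterogeneity model is omitted, and the twins' lifespans are drawn independently given their genotypes. All function names are hypothetical.

```python
import math
import random

# Hypothetical Gompertz baseline hazard mu0(x) = A * exp(B * x); a stand-in
# for the nonparametric life-table baseline, chosen only for illustration.
A, B = 2e-5, 0.1


def draw_genotype(p, rng):
    """Number of minor alleles (0, 1, or 2) for MAF p."""
    return int(rng.random() < p) + int(rng.random() < p)


def mz_genotypes(p, rng):
    """MZ pair: simulate one twin and copy the genotype to the cotwin."""
    g = draw_genotype(p, rng)
    return g, g


def dz_genotypes(p, rng):
    """DZ pair: each twin takes one random allele from each parent."""
    parents = [(int(rng.random() < p), int(rng.random() < p))
               for _ in range(2)]

    def child():
        return sum(rng.choice(alleles) for alleles in parents)

    return child(), child()


def draw_lifespan(g, r, rng):
    """Inverse-transform draw from S_g(x) = S_0(x) ** (r ** g)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    k = r ** g              # hazard multiplier for g minor alleles
    return math.log(1.0 - B * math.log(u) / (A * k)) / B


def concordant_cases(n_pairs, p, r, cutoff, zygosity, rng):
    """Genotypes of one twin from each pair in which both reach the cutoff."""
    draw_pair = mz_genotypes if zygosity == "MZ" else dz_genotypes
    cases = []
    for _ in range(n_pairs):
        g1, g2 = draw_pair(p, rng)
        if (draw_lifespan(g1, r, rng) >= cutoff
                and draw_lifespan(g2, r, rng) >= cutoff):
            cases.append(g1)  # either twin would do, by symmetry
    return cases


rng = random.Random(2024)
cases = concordant_cases(20000, p=0.3, r=0.8, cutoff=90, zygosity="MZ", rng=rng)
print(len(cases), sum(cases) / (2 * len(cases)))  # case count, case MAF
```

For a protective allele (relative risk below 1), the minor allele frequency among the selected cases should exceed the population MAF, which is the enrichment the concordant design relies on.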

Statistical Testing and Power
… where $I(\cdot)$ is an indicator function for a logical expression, equal to 1 if true and 0 if false.
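The indicator function suggests that power is estimated as the proportion of simulated replicates in which the association test is significant. A minimal sketch under that assumption is given here. The allelic chi-square test, the significance level `alpha`, and both function names are illustrative stand-ins and are not necessarily the test or threshold used in this study.

```python
import math


def allelic_p_value(case_genotypes, control_genotypes):
    """P value of a 1-df allelic chi-square test on minor-allele counts.

    Genotypes are coded as the number of minor alleles (0, 1, or 2).
    """
    a = sum(case_genotypes)               # minor alleles in cases
    b = 2 * len(case_genotypes) - a       # major alleles in cases
    c = sum(control_genotypes)            # minor alleles in controls
    d = 2 * len(control_genotypes) - c    # major alleles in controls
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # no allelic variation, so no evidence of association
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Upper tail of chi-square with 1 df: P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def empirical_power(p_values, alpha=0.05):
    """Share of replicates with p < alpha, i.e. the mean of I(p_i < alpha)."""
    return sum(1 for pv in p_values if pv < alpha) / len(p_values)
```

In use, each simulated replicate would contribute one P value from its case and control genotypes, and `empirical_power` would be applied to the list of P values across replicates.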

Results
In Table 1, we show the power estimates for additive effects of SNP alleles under different combinations of genetic parameters (MAF, relative risk $r$), different sample sizes from concordant MZ twins, and different cutoffs for longevity. With 800 cases aged 95+, the design is able to identify a common SNP (MAF = 0.5) with a small effect of only a 5% reduction in the rate of death ($r$ = 0.95). For a small sample size of 200 cases aged over 95 years, the study design has good power (over 0.8) to capture a common SNP with MAF = 0.5 that reduces the hazard of death by 10% ($r$ = 0.9); a SNP with lower MAF = 0.1 and a hazard reduction of 15% ($r$ = 0.85); and a rare SNP with MAF = 0.05 and a hazard reduction of 20% ($r$ = 0.8). A small sample of 100 cases aged 95+ is able to detect a common SNP (MAF = 0.2) with a risk reduction of 15% ($r$ = 0.85). When the longevity cutoff is set to 90 years, a sample size of 500 to 800 cases is required to achieve comparable power, an increase of about 3-fold. The power for detecting dominant effect SNPs using MZ cases (Table 2) is almost comparable with that for additive effect SNPs with low MAF in Table 1, although the difference increases with increasing MAF. Note that, for dominant effect SNPs, the power starts to decline when MAF approaches 0.5. The statistical power is largely reduced for recessive effect SNPs (Table 3). However, for high MAF SNPs (MAF = 0.5), the design has good power with 500 cases aged over 95 for $r$ = 0.9. Comparable power can be achieved with only 200 cases for $r$ = 0.85. Interestingly, when comparing power estimates between the two longevity cutoffs (90 and 95 years) for defining cases, we see that the cutoff of 90+ needs 2 to 3 times larger sample sizes to obtain power comparable with that of 95+, a conclusion that applies to Tables 1 to 3.
Tables 4, 5, and 6 carry power estimates for parameter settings corresponding to Tables 1-3, except that we added a bigger sample size of 1500 cases considering the relative ease of sampling DZ compared with MZ concordant twin pairs. The power estimates for DZ twins exhibit a similar pattern to those for MZ twins, but a 2- to 3-fold increase in sample size is required to obtain comparable power for the corresponding settings in Tables 1-3.

Discussions
Using computer simulation, we have estimated statistical power for a case-control design using singleton cases from twin pairs concordant for longevity. Unlike ordinary case-control studies that collect sporadic centenarians as cases, we limit our cutoffs for longevity to 90 and 95 years considering the rarity of twin pairs concordant for extreme longevity. It is interesting to compare our power estimates with those from our previous simulation study on the ordinary case-control design with sporadic nontwin centenarians as cases [6]. Even with a lowered threshold for longevity at age 95, the concordant MZ twin design is able to achieve power equivalent to that of the ordinary case-control design with centenarian cases for similar or even smaller sample sizes (comparing Tables 1-3 with Tables 1-3 in Tan et al. [6]). With an age cutoff at 90 years, our power estimates can be compared with those in our previous simulation, which also simulated power for using nonagenarians as cases (Tables 4-6 in Tan et al. [6]). For comparable power, the case-control design with sporadic nonagenarians as cases would need much larger sample sizes than a design using nonagenarian cases from concordant twin pairs (3- to 4-fold for MZ and about 2-fold for DZ twins). Overall, our results reveal a remarkable power advantage in using longevity-concordant twins over the ordinary case-control design.
Comparing the power estimates in Tables 1-3 with those in Tables 4-6, we observe a power advantage in using MZ twin cases over DZ twin cases, with the latter requiring almost double the sample size to reach equivalent power. Although relatively lower powered, DZ twin pairs are the same as sibling pairs in genetic sharing, meaning that, in practice, concordant DZ twins can be replaced by concordant sibling pairs, making sample collection easier and more feasible. On the other hand, when the laboratory cost of genotyping is a major concern (such as in genome-wide analysis or next-generation sequencing), MZ cases are the best choice as they help to maintain good power with the lowest sample sizes.
Another interesting finding is the power difference between the two age cutoffs. For both MZ and DZ twins, approximately 2 to 3 times larger sample sizes are needed for cases aged 90+ as compared with cases aged 95+. For example, for an additive effect allele with MAF = 0.5 and $r$ = 0.9, the power for 300 cases aged 95+ is equivalent to that for 800 cases aged 90+ in both Tables 1 and 2. The large difference in power is understandable considering the very high selection pressure during this age interval. Survival data from the Danish 1905 birth cohort showed equal chances of surviving from birth to age 92 and from age 92 to 100 [11]. As a trade-off for the power advantage, the extremely high survival selection also adds difficulty in collecting concordant twin pairs aged 95+. The study design should always balance sampling feasibility, power, and age cutoff.
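The statement about the Danish 1905 birth cohort can be written as a relation between survival probabilities. As a short worked step, writing $S(x)$ for the probability of surviving from birth to age $x$ (notation assumed here for illustration), equal chances of surviving from birth to age 92 and from age 92 to 100 mean that

```latex
S(92) = \frac{S(100)}{S(92)}
\quad\Longrightarrow\quad
S(100) = S(92)^{2},
```

so the eight years after age 92 remove the same proportion of survivors as the preceding 92 years.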
The power advantage of using singleton cases from longevity-concordant twin pairs is purely due to their increased likelihood of carrying longevity-linked genetic variants. For the same sample size, this study design has the same genotyping cost as an ordinary case-control study but much higher power. In other words, acceptable power can be achieved at lower cost. This is especially important because current techniques for genomic analysis, for example, microarray and next-generation sequencing techniques, are still expensive. Although this study focuses only on the power advantage of using twins in longevity studies, our results should also reflect a similar situation in human disease studies, that is, using disease-concordant twins as cases. Moreover, although our current study focuses on the advantage of concordant MZ twins in gene-longevity association studies, MZ twin pairs discordant for longevity or diseases are also useful samples for identifying environmental factors. Taken together, the unique samples of twins have good potential to contribute to molecular genetic studies of human complex diseases and traits.