The Transmission Disequilibrium / Heterogeneity Test with Parental-Genotype Reconstruction for Refined Genetic Mapping of Complex Diseases

In linkage analysis for mapping genetic diseases, the transmission/disequilibrium test (TDT) uses the linkage disequilibrium (LD) between marker and trait loci for precise genetic mapping while avoiding confounding due to population stratification. The sib-TDT (S-TDT) and combined-TDT (C-TDT) proposed by Spielman and Ewens can combine data from families with and without parental marker genotypes (PMGs). For some families with missing PMGs, the reconstruction-combined TDT (RC-TDT) proposed by Knapp may be used to reconstruct missing parental genotypes from the genotypes of their offspring, to increase power and to correct for potential bias. In this paper, we propose a further extension of the RC-TDT, called the reconstruction-combined transmission disequilibrium/heterogeneity (RC-TDH) test, which takes into account identical-by-descent (IBD) sharing information in addition to the LD information. It can effectively utilize families with missing or incomplete parental genetic marker information. An application of the proposed method to Genetic Analysis Workshop 14 (GAW14) data sets and extensive simulation studies suggest that this approach may further increase statistical power, which is particularly valuable when LD is unknown and/or when some or all PMGs are not available.


Introduction
Genetic linkage analysis is an important step in localizing and identifying genes in the chromosomes that underlie many human diseases and other traits of interest. A brief overview of commonly used statistical methods for linkage analysis, including recently developed model-free and model-based methods for mapping qualitative- and quantitative-trait loci, can be found in Shao [1]. For more extensive discussions of linkage analysis, readers can consult Ott [2]. In practice, parental marker genotypes are often incomplete in genetic studies, particularly for late-onset diseases. Using only families with complete parental marker genotype information would discard a large portion of the useful data and can also lead to biases. It is thus crucially important to make the TDH test applicable to families with missing or incomplete parental marker genotype information. In this paper, we develop a transmission disequilibrium/heterogeneity test with parental-genotype reconstruction, which utilizes both the LD information and the IBD-sharing information and can combine families with or without PMG information.
The transmission disequilibrium/heterogeneity test with parental-genotype reconstruction (RC-TDH) will be introduced in the next section. In Section 3, the RC-TDH test is applied to a data set from GAW14, and the results are compared with those of the RC-TDT. Finally, simulation studies that use common genetic models [5, 15] are carried out to compare the power and the true size of the RC-TDT and the RC-TDH test. The numerical results suggest that the RC-TDH test may greatly increase statistical power, which is particularly valuable whenever LD levels are unknown and/or whenever PMG information is missing, as in the study of a disease with late age of onset.
It should be pointed out that the main comparison made in this paper is between the RC-TDT and the RC-TDH. We will not formally compare them with the classical IBD-based linkage tests such as those implemented in Genehunter and other software. The main rationale is as follows. We are mainly interested in fine mapping of genetic variants that underlie complex diseases, where the classical linkage tests are known to have low power because they do not utilize LD information effectively. With the rapid advancement of biotechnology, it is now feasible and affordable to use dense genetic markers, for example, single nucleotide polymorphisms (SNPs), for genomewide linkage scans. With a large number of dense genetic markers (e.g., SNPs), some of the markers can be expected to fall into the LD block of the causal genetic variants; thus LD would generally exist to some degree for many markers. The TDT and TDH tests would therefore have a power advantage over classical linkage tests, which effectively utilize only the IBD information.

Notation
It will be assumed that there are two alleles A and B at the marker locus, and allele A is of particular interest. Let $n_{ai}$ denote the number of affected children, let $n_{ui}$ denote the number of unaffected children, and let $n_{ci} = n_{ai} + n_{ui}$ denote the size of the sibship for family $i$. In each family, all children have been typed at the marker locus, but the PMG may or may not be available. Let $N^{g}_{ai}$ ($N^{g}_{ui}$) be random variables denoting the number of affected (unaffected) children with genotype $g$ in family $i$. Small letters (i.e., $n^{g}_{ai}$ and $n^{g}_{ui}$) denote the corresponding observed numbers.

The TDH Test with Complete PMG
For completeness, we first consider the case when PMGs are observed along with the children's marker genotypes. Let $x_i$ be the number of alleles A transmitted by the $i$th marker-heterozygous parent to the affected children. When the exact number $x_i$ of marker alleles A transmitted to affected children cannot be determined, as might happen in families with two heterozygous parents, $T_i$ can be used in place of $x_i$. Using $T_i$ in families with ambiguous transmissions, the TDT statistic can be written as $\mathrm{TDT} = T_d^2$, and the transmission heterogeneity test (THT) statistic is denoted as $\mathrm{THT} = T_h^2$, where $T_d$ and $T_h$ are defined in terms of the moments of $T_i$ under $H_0$ given the parental marker genotypes (PMGs), which are summarized in Table 1.
The transmission disequilibrium/heterogeneity (TDH) test is based on the test statistic of [4], which combines these two components. In terms of statistical optimality, it can be shown that the TDH test is the efficient score test from the mixture likelihood function under transmission disequilibrium and heterogeneity [4]. In theory, the efficient score test is known to be locally most powerful.

The Reconstruction-Combined TDH (RC-TDH) Test
When at least one parent has a missing PMG, the reconstruction-combined TDT (RC-TDT) proposed by Knapp [10] can be used to reconstruct the PMG from the genotypes of the offspring and to correct for the biases resulting from using reconstructed PMGs. To improve the power to detect linkage, we propose the reconstruction-combined TDH test (RC-TDH). Its test statistic is built from $T_i$, the number of marker alleles A in affected children, together with the appropriate null expectation and variance of $T_i$, which can be found in Tables 1 and 2 of Knapp [10]. In the RC-TDH statistic, the first term is the RC-TDT statistic of Knapp [10] and the second term is the RC-THT statistic with the restriction.
To obtain the appropriate null moments, we need to derive the conditional distribution of $T_i$ given the constraint for reconstruction, $R$.
When one parental genotype is missing and reconstructible, the conditional probabilities of $T_i$ are listed in Table 2. Note that the family index $i$ has been dropped in the formulas in Table 2. In the first column, the first parental genotype is typed and the second one is reconstructed. The second column presents a necessary and sufficient condition on the observed marker genotypes in the offspring to allow reconstruction of the parental genotypes. The details of the derivation are provided in Han [16].
When both parental genotypes are missing, the reconstruction condition and the conditional probabilities of $T_i$ are the same as in the case where one parental genotype is missing and the known parental genotype is AB.
When at least one parental genotype is missing and cannot be reconstructed, but the condition for the S-TDT is satisfied (i.e., there is at least one affected and at least one unaffected child in the family, and not all of the children possess the same genotype), the distribution of $T_i$ can be calculated from the genotypes of the affected and unaffected children by the hypergeometric distribution. The details are provided in the Appendix.
As in C-TDT and RC-TDT, families not belonging to the previous categories will be ignored.

Application to Genetic Analysis Workshop 14 Data
The proposed RC-TDH test was applied to a Genetic Analysis Workshop 14 (GAW14) data set to compare its power with that of the RC-TDT. The GAW14 simulated data were generated by Dr. David Greenberg. A behavioral disorder was simulated in multiple replicates of four different populations/groups. There are 100 families in the Aipotu, Karnagar, and Danacaa data sets, and there are 100 replicates for each data set. The power comparison of the RC-TDH with the RC-TDT for the linkage between the trait b disease allele and the marker B01T0561 is presented in Table 3. This trait has incomplete penetrance, with $f_{DD} = 30\%$. Application of the RC-TDH is illustrated in Table 3 with 50% and 100% missing parental genotypes. Power is computed at a type I error level of 0.05.

Simulation Set-Up
Simulation studies are conducted to compare the power of the proposed RC-TDH test with that of the RC-TDT. To attain the correct type I error rates, we directly simulated the critical values under the null hypothesis of no linkage, in which the recombination frequency is $\theta = 0.5$. In the simulations for the null distribution, 1,000,000 replicate samples of nuclear families are generated and the empirical critical values are obtained. Based on 500 independent replicates and the empirical critical values, we estimate the power of each test as the relative frequency with which the simulated test statistic exceeds the empirical critical value.
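The two-step procedure described above (empirical critical values from null replicates, then power as an exceedance frequency) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the simulation code used in the paper: the function names are hypothetical, and the squared-normal draws merely stand in for simulated test statistics, since generating nuclear-family data is outside the scope of the sketch.

```python
import random


def empirical_critical_value(null_stats, alpha):
    """Upper-alpha empirical quantile of statistics simulated under H0."""
    ordered = sorted(null_stats)
    # Choose the order statistic that leaves roughly a fraction alpha
    # of the null statistics strictly above it.
    k = int((1.0 - alpha) * len(ordered))
    k = min(max(k, 0), len(ordered) - 1)
    return ordered[k]


def empirical_power(alt_stats, critical_value):
    """Relative frequency of replicates whose statistic exceeds the critical value."""
    return sum(s > critical_value for s in alt_stats) / len(alt_stats)


rng = random.Random(0)
# Stand-in statistics (assumption: NOT the RC-TDH statistic itself).
null_stats = [rng.gauss(0.0, 1.0) ** 2 for _ in range(100_000)]
alt_stats = [rng.gauss(2.0, 1.0) ** 2 for _ in range(500)]

crit = empirical_critical_value(null_stats, alpha=0.05)
power = empirical_power(alt_stats, crit)
print(f"critical value: {crit:.3f}, estimated power: {power:.3f}")
```

Using an empirical quantile rather than an asymptotic one keeps the type I error at its nominal level even when the null distribution of the statistic is not well approximated by a standard reference distribution.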
To generate the family-based data, as in earlier work [5], we consider two biallelic loci: one disease locus with disease allele D and normal allele d, and one marker locus with alleles A and B. The frequency of disease allele D is $p_D$ and that of marker allele A is $p_A$. The linkage disequilibrium is the deviation of the frequency of the DA haplotype from its equilibrium value (expected by chance), and the LD parameter $\Delta$ measures this deviation. In our simulations, we assume A is the allele in LD with D. Thus, the LD parameter $\Delta$ lies in $[0, 1]$, where 0 indicates linkage equilibrium. There are three penetrance parameters, $f_{DD}$, $f_{Dd}$, and $f_{dd}$, corresponding to the three possible disease genotypes. In the study with 100% of PMGs missing, we ignore all the parental marker genotypes. In the study with 50% of PMGs missing, we use 50% of families with parental marker genotypes and 50% of families without parental marker genotypes.
Simulation study 1 closely followed the approach used by Boehnke and Langefeld [15]. For each model, a disease prevalence $K_p$ of 5% was assumed. The disease allele frequency $p$ implied by each disease model can be calculated by solving $K_p = p^2 f_{DD} + 2p(1 - p) f_{Dd} + (1 - p)^2 f_{dd}$.
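As an illustration of how the disease allele frequency follows from the prevalence, the sketch below solves the standard relation $K_p = p^2 f_{DD} + 2p(1-p) f_{Dd} + (1-p)^2 f_{dd}$ for $p$ by bisection. Several things here are assumptions rather than details taken from the paper: the relation presumes Hardy-Weinberg genotype proportions, the multiplicative model is taken to mean the geometric mean $f_{Dd} = \sqrt{f_{DD} f_{dd}}$, the function names are hypothetical, and the penetrance values in the usage lines are illustrative only.

```python
def heterozygote_penetrance(model, f_DD, f_dd):
    """Penetrance f_Dd of the heterozygote under a named disease model."""
    if model == "dominant":
        return f_DD
    if model == "additive":
        return (f_DD + f_dd) / 2.0
    if model == "multiplicative":
        # Assumption: geometric mean of the two homozygote penetrances.
        return (f_DD * f_dd) ** 0.5
    if model == "recessive":
        return f_dd
    raise ValueError(f"unknown disease model: {model}")


def prevalence(p, f_DD, f_Dd, f_dd):
    """Population prevalence under Hardy-Weinberg proportions (assumption)."""
    q = 1.0 - p
    return p * p * f_DD + 2.0 * p * q * f_Dd + q * q * f_dd


def disease_allele_frequency(K_p, model, f_DD, f_dd, tol=1e-12):
    """Solve prevalence(p) = K_p for p in [0, 1] by bisection.

    Requires f_dd <= K_p <= f_DD, so that a root exists and the
    prevalence is non-decreasing in p for the four models above.
    """
    if not (f_dd <= K_p <= f_DD):
        raise ValueError("need f_dd <= K_p <= f_DD")
    f_Dd = heterozygote_penetrance(model, f_DD, f_dd)
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if prevalence(mid, f_DD, f_Dd, f_dd) < K_p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Illustrative values only: prevalence 5%, full penetrance, no phenocopies.
for model in ("dominant", "additive", "multiplicative", "recessive"):
    p = disease_allele_frequency(0.05, model, f_DD=1.0, f_dd=0.0)
    print(f"{model:>14}: p = {p:.4f}")
```

Bisection is used instead of the closed-form quadratic root because it behaves the same way for all four models, including the additive model, where the quadratic coefficient vanishes.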

Simulation Results
Table 6 presents estimates of the critical values for the RC-TDH at significance levels of .05, .01, and .001. Table 7 presents the estimates of the true type I error rate at nominal significance levels of .05, .01, and .001. The simulations support the validity of approximating the null distribution with a standard normal distribution for the RC-TDT.
The results of simulation study 1 are shown in Table 8. The disease models are denoted by "D," "A," and "R" for the mode of inheritance (i.e., dominant, additive, and recessive) and by "1" and "2" for the value of $f_{DD}$ (i.e., 1.0 and 0.5). The presented results come from the simulations with 4 sibs in each family; those with 2 or 6 sibs in each family show the same trend. In instances for which there is no parental genotype information available, application of the RC-TDH instead of the RC-TDT results in a consistent gain of power, especially when linkage disequilibrium is weak.
We conducted simulation study 2 to compare the power of the proposed RC-TDH test with that of the RC-TDT as a function of linkage disequilibrium in different scenarios based on Table 5, such as tight linkage versus weak linkage and full penetrance versus incomplete penetrance. Each simulated sample consists of families with an identical number of sibs $n_c$ in each family (with $n_c = 3$), which are ascertained on the basis of the presence of an affected child. Each sample consists of a total of 600 children. Half of the 200 families have complete PMGs, and the other half have no PMGs. To assess the power of the tests, 500 replicate samples are generated under each simulation scenario. For each replicate sample, the RC-TDH and RC-TDT statistics were calculated.
To compare the power of the RC-TDH with that of the RC-TDT at different LD levels, we let LD range between 0 and 1 and set the recombination fraction at 0.01, the frequency of allele D at 0.1, the frequency of allele A at 0.5, the penetrance for genotype DD at full penetrance (1), and the penetrance for genotype dd at 0.01; the penetrance for genotype Dd is then determined by the mode of inheritance. The results in Table 9 and Figure 1 show that the power increases with LD and that the proposed RC-TDH is more powerful than the RC-TDT, especially when LD is weak, as in scenario 1 of Table 5. Penetrance is the conditional probability of observing a phenotype given a specified disease genotype. In scenario 1, we set $f_{DD}$ (the penetrance for a subject whose disease genotype is DD) at 1, which is an idealized penetrance. To compare the power of the proposed RC-TDH with that of its competitor under different penetrances, $f_{DD}$ is varied from full penetrance to an incomplete penetrance of 0.5, which is more realistic. The results in Table 9 and Figure 2 show that the proposed RC-TDH has better power than the RC-TDT with half penetrance for genotype DD individuals, as in scenario 5 of Table 5.
In summary, our simulation results show that the proposed RC-TDH is generally more powerful than the RC-TDT across a broad range of LD levels, degrees of linkage tightness, and disease models.

Discussion
For mapping complex diseases, it is common that the transmission probabilities of a marker allele of interest vary across heterozygous parents, owing to locus heterogeneity, etiological heterogeneity, and many other complexities and/or combinations of them [3, 4]. Under such transmission heterogeneity, the transmission likelihood generally has the form of a mixture model with many parameters, and the efficient score test has two parts, in the form of a TDH test [4]. This paper studies a TDH test which allows the inclusion of reconstructed parental marker genotype data and extends the RC-TDT of Knapp [10, 11]. The proposed new approach was validated by simulation studies and GAW14 data sets, and the results indicate that the new approach might improve the power of family-based linkage analysis for a broad range of LD. Moreover, the simulation studies also indicate that the systematic power advantage of the RC-TDH test over the RC-TDT holds regardless of the underlying genetic model (e.g., recessive, dominant, additive, multiplicative). Similar to the RC-TDT, the new approach can utilize the missing parental information that can be reconstructed from the child genotypes, including in some families with genotype-concordant or phenotype-concordant sibs. In addition, the proposed test is a sibship-oriented method which does not require specification of the underlying genetic model; it naturally uses multiple siblings by considering the sibship as a whole. The second part of the RC-TDH statistic, the THT part, is based on information from IBD. This is most apparent in the situation of affected sib-pairs, where the THT is essentially equivalent to the so-called mean test [4, 13].

Note to Table 8: D (dominant), R (recessive), A (additive); $f_{DD}$: 1 (1.0), 2 (0.5); type I error rate .05, based on 500 independent replicates of 150 nuclear families. $\Delta$ is the measure of linkage disequilibrium; when $\Delta = 0$, there is no linkage disequilibrium. In this simulation study, all the parental marker genotypes are missing.
Many other linkage analysis tests, such as those implemented in Genehunter, have relatively low power compared with the TDT or TDH when LD is present. In reality, some degree of LD is often present, particularly when dense genetic markers (e.g., SNPs) are used along the genome, and such markers are already very affordable and becoming cheaper. With a large number of dense genetic markers, some markers may be expected to fall into the LD block of the causal variants. When using these dense markers along the genome or in candidate gene regions, we believe that the RC-TDH will have a better chance of success than the classical IBD-based linkage methods in detecting linkage signals.
As high-density SNP arrays become increasingly affordable to researchers, genomewide linkage studies are becoming common. Our TDH test has a simple closed-form test statistic, which is computationally easy to evaluate, in addition to good overall power across a broad range of LD. Thus the proposed method is potentially useful for genomewide linkage analysis. In contrast, the likelihood ratio test for the mixture likelihood is generally computationally demanding.

There are three cases for the calculation of $P_{H_0}(\{T = c\} \cap R)$: $c = n_a$, $n_a < c < 2n_a$, and $c = 2n_a$. Therefore the distribution of $T$ conditioned on $R$ is obtained by dividing these probabilities by $P_{H_0}(R)$.

A.2. At Least One Parental Genotype Is Missing and Cannot Be Reconstructed, but the Condition for the S-TDT Is Satisfied
In a sibship with $a$ affected and $u$ unaffected sibs, the total number of sibs is $t = a + u$. Suppose that in this sibship the number of sibs who are of genotype AA is $r$ and the number of sibs who are of genotype AB is $s$. Let $x$ be the number of AA sibs and let $y$ be the number of AB sibs who are classified as affected. As discussed in Spielman and Ewens [9], given the totals $r$, $s$, $a$, $u$, and $t$, the numbers $x$, $y$ can be regarded as two entries in a 2 × 3 contingency table with marginal totals $a$, $u$, $r$, $s$, and $t - r - s$. Therefore, the distribution of $T = 2x + y$ can be obtained from the generalized hypergeometric distribution [18, page 47]. More specifically, we have

$$P(x, y) = \frac{\binom{r}{x}\binom{s}{y}\binom{t-r-s}{a-x-y}}{\binom{t}{a}}, \qquad P(T = c) = \sum_{(x, y):\, 2x + y = c} P(x, y). \tag{A.4}$$

More formulas for parental marker genotype reconstruction probabilities under various missing-genotype types and constraints, as well as detailed derivations of these formulas, can be found in Han [16].
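The hypergeometric calculation described in this subsection can be sketched in Python as follows. The sketch assumes only what the text states: the affected sibs are treated as a random draw of $a$ sibs from a sibship containing $r$ AA, $s$ AB, and $t - r - s$ BB sibs, and $T = 2x + y$. The function names are hypothetical and the code is an illustration, not the implementation used in the paper.

```python
from math import comb


def s_tdt_distribution(a, u, r, s):
    """Null distribution of T = 2x + y given the margins of the 2 x 3 table.

    a, u: numbers of affected and unaffected sibs
    r, s: numbers of AA and AB sibs (the remaining t - r - s sibs are BB)
    Returns a dict mapping each attainable value of T to its probability.
    """
    t = a + u
    n_bb = t - r - s
    if min(a, u, r, s, n_bb) < 0:
        raise ValueError("margins must be non-negative and satisfy r + s <= a + u")
    total = comb(t, a)  # ways to choose which a sibs are affected
    dist = {}
    for x in range(min(r, a) + 1):          # affected AA sibs
        for y in range(min(s, a - x) + 1):  # affected AB sibs
            z = a - x - y                   # affected BB sibs
            if z > n_bb:
                continue
            prob = comb(r, x) * comb(s, y) * comb(n_bb, z) / total
            dist[2 * x + y] = dist.get(2 * x + y, 0.0) + prob
    return dist


def mean_and_variance(dist):
    """First two central moments of a discrete distribution given as a dict."""
    mean = sum(c * p for c, p in dist.items())
    var = sum((c - mean) ** 2 * p for c, p in dist.items())
    return mean, var


# Example: 2 affected and 2 unaffected sibs; genotypes AA, AB, AB, BB.
dist = s_tdt_distribution(a=2, u=2, r=1, s=2)
print(dist, mean_and_variance(dist))
```

A quick consistency check is that the mean equals $a(2r + s)/t$, the expected number of A alleles among $a$ sibs drawn at random from the sibship.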

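Table 2 (excerpt). Parental genotypes (typed × reconstructed): AA × AB. Condition for reconstruction: $N^{AA} > 0$ and $N^{AB} > 0$. Probability entry: $\left[(1/2)^{n_a} - (1/2)^{n_c}\right] / \left[1 - 2(1/2)^{n_c}\right]$.

Capital letters and small letters denote the random variable and the observed number of children with genotype $g$ in family $i$, respectively. $T_i$ denotes the number of A alleles in affected children, i.e., $T_i = 2N^{AA}_{ai} + N^{AB}_{ai}$. The notation introduced here is consistent with Knapp [10, 11] and Han [16].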

Figure 2: Power of RC-TDH (solid) and RC-TDT (dashed) in Table 5 (scenario 5). This figure is based on scenario 5: $\theta = 0.01$, $p_D = 0.1$, $p_A = 0.5$, $f_{DD} = 0.5$, and $f_{dd} = 0.01$. The type I error rate is 0.001, based on 500 independent replicates of 200 nuclear families, 50% of which are without parental information. Every family contains 3 sibs, at least one of whom is affected. LD is the measure of linkage disequilibrium as defined by $\Delta$ in Section 4.1. When LD = 0, there is no linkage disequilibrium.

Case 1: $c = n_a$, $P_{H_0}(\{T = c\} \cap R) = \binom{n_a}{c - n_a} (1/2)^{n_a} - (1/2)^{n_c}$.
Case 2: $n_a < c < 2n_a$, $P_{H_0}(\{T = c\} \cap R) = \binom{n_a}{c - n_a} (1/2)^{n_a}$.
Case 3: $c = 2n_a$, $P_{H_0}(\{T = c\} \cap R) = \binom{n_a}{c - n_a} (1/2)^{n_a} - (1/2)^{n_c}$.
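The three cases above can be turned into the conditional distribution of $T$ given $R$, and its null mean and variance, with a few lines of Python. The sketch rests on assumptions that should be read as such: it treats the mating as AA × AB with the reconstruction condition $R = \{N^{AA} > 0 \text{ and } N^{AB} > 0\}$, it requires at least one affected child and at least two children, and the function names are hypothetical. It is an illustration of the case formulas, not the paper's own code.

```python
from math import comb


def joint_prob_T_and_R(c, n_a, n_c):
    """P_H0({T = c} and R) from the three cases, for an AA x AB mating (assumption)."""
    if not (n_a <= c <= 2 * n_a):
        return 0.0
    prob = comb(n_a, c - n_a) * 0.5 ** n_a
    if c == n_a or c == 2 * n_a:
        # Remove the sibship in which all children share one genotype,
        # because such a sibship does not allow reconstruction.
        prob -= 0.5 ** n_c
    return prob


def conditional_dist_T_given_R(n_a, n_c):
    """P_H0(T = c | R) for c = n_a, ..., 2 * n_a."""
    if n_a < 1 or n_c < max(n_a, 2):
        raise ValueError("need n_a >= 1 and n_c >= max(n_a, 2)")
    joint = {c: joint_prob_T_and_R(c, n_a, n_c) for c in range(n_a, 2 * n_a + 1)}
    p_R = sum(joint.values())  # equals 1 - 2 * (1/2) ** n_c
    return {c: p / p_R for c, p in joint.items()}


def null_mean_and_variance(n_a, n_c):
    """Null expectation and variance of T conditional on R."""
    dist = conditional_dist_T_given_R(n_a, n_c)
    mean = sum(c * p for c, p in dist.items())
    var = sum((c - mean) ** 2 * p for c, p in dist.items())
    return mean, var


# Example: 2 affected children in a sibship of 3.
print(conditional_dist_T_given_R(2, 3))
print(null_mean_and_variance(2, 3))
```

Normalizing by the sum of the joint probabilities, instead of hard-coding $P_{H_0}(R)$, gives a built-in check that the three cases are mutually consistent.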

Table 1: Moments of $T_i$ under $H_0$.

Table 2: Distribution of $T_i$ when one PMG is missing but reconstructible.

Table 3: Power comparison of the RC-TDH test with the RC-TDT using GAW14 data.

A summary of the parameters used in simulation study 1 is in Table 4. A summary of the parameters used in simulation study 2 is in Table 5. Four commonly used disease models are used here: dominant ($f_{Dd} = f_{DD}$), additive ($f_{Dd} = (f_{DD} + f_{dd})/2$), multiplicative ($f_{Dd} = \sqrt{f_{DD} \cdot f_{dd}}$), and recessive ($f_{Dd} = f_{dd}$) models.

Table 4: Parameters used in simulation study 1.

Table 5: Parameters used in simulation study 2.

Table 6: Simulated critical values for RC-TDH. Note: determined on the basis of the dominant model with $f_{DD} = 0.2$ (scenario 4 in Table 4).

Table 7: Simulated true type I error rates of the RC-TDT and of the RC-TDH. Determined on the basis of the dominant model with $f_{DD} = 0.2$ (scenario 4 in Table 4).

Table 8: Powers of the RC-TDT and RC-TDH in simulation study 1.

Table 9: Powers of the RC-TDT and RC-TDH in simulation study 2.

Figure 1: Power of RC-TDH (solid) and RC-TDT (dashed) in Table 5 (scenario 1). This figure is based on scenario 1: $\theta = 0.01$, $p_D = 0.1$, $p_A = 0.5$, $f_{DD} = 1$, and $f_{dd} = 0.01$. The type I error rate is 0.001, based on 500 independent replicates of 200 nuclear families, 50% of which are without parental information. Every family contains 3 sibs, at least one of whom is affected. LD is the measure of linkage disequilibrium as defined by $\Delta$ in Section 4.1. When LD = 0, there is no linkage disequilibrium.