Robust Association Tests for the Replication of Genome-Wide Association Studies

In genome-wide association study (GWAS), robust genetic association tests such as maximum of three CATTs (MAX3), each corresponding to recessive, additive, and dominant genetic models, the minimum p value of Pearson's Chi-square test with 2 degrees of freedom, and CATT based on additive genetic model (MIN2), genetic model selection (GMS), and genetic model exclusion (GME) methods have been shown to provide better power performance under wide range of underlying genetic models. In this paper, we demonstrate how these robust tests can be applied to the replication study of GWAS and how the overall statistical significance can be evaluated using the combined test formed by p values of the discovery and replication studies.


Introduction
With the advance of biotechnology and substantial reduction of genotyping costs, a genome-wide association study (GWAS) using hundred thousand markers in several thousand individuals is now increasingly utilized and has been successful in detecting genetic associations across the entire genome with complex human traits [1][2][3][4][5][6]. Among many challenges this application holds; development of more efficient and robust statistical methodologies with higher power to detect an association with a single marker has been one of the most important statistical issues, given that effects of individual markers are usually characterized as being small to moderate. One attempt to overcome this challenge is focused on developing efficient tests that are robust against underlying genetic model misspecification.
Two most frequently used association tests are the allelebased test (ABT) and the genotype-based test (GBT). ABT compares the allele frequencies between cases and controls, while GBT compares the genotype distributions of cases and controls. The Cochran-Armitage trend test (CATT) [7,8] is a popular GBT which takes into account the underlying genetic model. It is well known, however, that the ABT may inflate type I error when Hardy-Weinberg equilibrium (HWE) does not hold in the samples [9]. Even under HWE, when the genetic model is recessive or dominant, the ABT may suffer from serious power loss. On the other hand, the CATT does not depend on HWE, but to apply the CATT the choice of scores optimal for the underlying genetic model needs to be specified. For complex diseases, the genetic model is usually unknown and robust tests such as the maximum of three CATTs (MAX3) [10] and the maximum efficiency robust test (MERT) [11,12] are preferable. Alternatively, Zheng and Ng [13] and Joo et al. [14] proposed a two-phase analysis based on the genetic model selection (GMS) and genetic model exclusion (GME). Moreover, an alternative approach was proposed by the Wellcome trust case-control consortium (WTCCC) [5] which used a minimum value of Pearson's Chi-square test and additive CATT, and the asymptotic properties of this approach were studied in detail by Joo et al. [15]. These methods provide better or comparable power performance than some of the robust tests such as MAX3.

BioMed Research International
In this paper, we illustrate how these robust tests can be applied to a replication study of GWAS and how overall statistical significance can be evaluated using the combined test formed by values of the discovery and replication studies. The importance of replication or validation in GWAS has been well recognized [16,17], and joint analysis in a two-stage design of GWAS has been proved to be more powerful than replication-based analysis and has been widely conducted in GWAS with a variety of phenotypes of interest [18,19].
The paper is organized as follows. We first describe the data structures and notation and review existing robust association tests for a single data set. Then we describe how to obtain the value for the replication data set, given the significant result of the discovery stage, using robust tests. In the next section, a combined test of the values of the discovery and replication data sets is proposed, together with the way to evaluate the statistical significance for the combined test. Simulation studies are conducted to compare the type I error rates and powers of various analytical strategies. For illustration purposes, the summarized methods are applied to a non-small-cell lung cancer data set and at the end there is a discussion.

Review of Association Tests for a Single Data Set.
The association in case-control studies can be tested using various methods which have been extensively studied. The general association between the disease status and the SNP can be tested using Pearson's Chi-square test which has an asymptotic Chi-square distribution with 2 degrees of freedom under 0 . The test is given by where = + for = 0, 1, 2 and = + . Under Hardy-Weinberg equilibrium (HWE), an allele-based test (ABT) and CATT with scores (0, , 1) for ( 0 , 1 , 2 ), where 0 ≤ ≤ 1, are given by where ( 0 , 1 , 2 ) = (0, , 1) [9]. The optimal choices of for the recessive (REC), additive/multiplicative (ADD/MUL), and dominant (DOM) models are = 0, 1/2 and 1, respectively [9,20]. Both and ABT asymptotically follow a standard normal distribution under 0 .
can be used even when HWE does not hold. However, without the HWE assumption, ABT does not follow a standard normal distribution due to the correlation between two alleles.
A robust test, MAX3 proposed by Friedlin et al. [10], can be obtained by taking the maximum of three CATTs under the three genetic models as MAX3 = max(| 0 |, | 1/2 |, | 1 |). Parametric bootstrap or permutation methods can be used to find the value of MAX3 [4].
Let the values of Pearson's Chi-square test and CATT under the additive genetic model 1/2 be chi2 and 1/2 , respectively. WTCCC [5] proposed an alternative robust test MIN2 = min( chi2 , 1/2 ). Joo et al. [15] derived the asymptotic null distribution of MIN2 and using their result the value of MIN2 can be obtained as where 1 and 2 are the cumulative distributions of Chisquare distributions with 1 and 2 degrees of freedom.
While studying the properties of GMS, Joo et al. [14] noticed that the probability of selecting the true recessive or dominant models using is very low especially for low to moderate GRRs, but the unlikely genetic model can be successfully excluded. This led to genetic model exclusion method GME which is the same as the GMS described above except for = 0, 1/2, 1 is replaced by * where * = ( + 1/2 )/{2(1 +̂, 1/2 )} 1/2 . And the value of GME can be obtained as where = {2(1 +̂, 1/2 )} 1/2 − 1/2 for = GME .

Value of Replication Data Using the Robust Method.
In the discovery stage, the value of robust association tests, including MAX3, MIN2, GMS , and GME , can be obtained as described in Section 2.2. For the value of replication data using the robust method, we use the same analytic method that was used for discovery and the risk allele identified by it [16]. This means that when the best test statistic or genetic model is selected in the discovery stage, the replication stage will adopt the discovery stage selection and the direction of association. Suppose that, for simplicity of notation, our interest is in GWAS with two stages, one for discovery and the other for replication, although the methodology described below can be extended to multistages for replication. Let ( ) for = 0, 1/2, 1 be the CATT optimal for recessive, additive, and dominant models and let ( ) be corresponding value for th stage ( = 1 for discovery and = 2 for replication stages). Also, denote ( ) * = ( ( ) + ( ) 1/2 )/{2(1 +̂( ) ,1/2 )} 1/2 for = 0, 1/2, 1. Then, for CATT with a preselected genetic model, (2) value given the direction of association from the discovery stage, and (2) * = 1−Φ(sign( (1) * ⋅ (2) * )⋅| (2) * |). Moreover, denote the test statistics and values using Pearson's Chisquare test from the th stage as ( ) chi2 and ( ) chi2 . Further, let HWDTT from the th stage be ( ) . Then, the second stage values, using MAX3, MIN2, GMS , and GME , denoted as (2) MAX3 , (2) MIN2 , (2) GMS , and (2) GME , can be obtained as follows: It is important to note that even though the direction of the test statistics and the selected genetic models are used to obtain the second stage values, the values from the two stages are independent under the null hypothesis. This is because, under the null hypothesis, the probability of 1/2 being positive or negative is simply 1/2, and the probability of the selection of a certain genetic model is also a constant ( for the recessive and dominant models and 1 − 2 for the additive model).

Combined Test Using Values and Its Statistical Significance.
For a given robust test, we can consider the joint analysis by combining values from the discovery and replication stages of GWAS. We consider using values rather than the test statistics because test statistics can have Table 1: Type I error rates of three approaches-replication-based (REP) test, Fisher's combination ( FC ), and linear combination of test ( LC )-based on the CATT with an additive model ( 1/2 ), 2 , MAX3, MIN2, GMS, and GME. The disease prevalence = 0.1, = 10 markers, = 1, 500 cases, and = 1, 500 controls are considered based on 20,000 simulations. There are several methods for combining test statistics from two stages [22], and two most commonly used forms are based on Fisher's combination and a linear combination after inverse normal transformation [23]. Fisher's combination (FC) directly sums values after −2 log transformation; that is, FC = −2 1 log( (1) ) − 2 2 log( (2) ), where ( ) is value from = 0 for discovery and = 1 for replication stages using a given robust test. A specification of 1 = 2 = 1 gives the same weight for discovery and replication stages, and one can consider 1 = 2 and 2 = 2(1 − ) where = /( + ), and and are sample sizes of the discovery and replication data sets. A linear combination of two values after taking the inverse of the standard normal cumulative distribution is given by LC with a natural choice of 1 = √ and 2 = √1 − . Let the significance level of the discovery stage be , which means that markers with (1) < are selected and replicated in the replication stage. The value of combined test can then be obtained as where the observed value of FC is FC . The FC are calculated as − FC /2 (1+ FC /2+log ) for equal weights where FC > −2log and for unequal weights where FC > −2 1 log . Detailed derivations are described in the Appendix. Equivalently, for an overall type I error threshold for a single marker of , one may obtain the threshold FC of FC that satisfies Similarly, for the LC , the value is calculated where the observed value of LC = LC . Table 1 provides the type I errors under different scenarios. A disease prevalence of 10% is assumed, and a total of 1500 cases and 1500 controls were divided into two stages. The proportions of samples in the first stage ( ) of 0.5 and 0.6 were considered for the minor allele frequency (MAF) of 0.3 and 0.4. We considered = 10 markers to control the genome-wide false positive rate at = 0.05 with the Bonferroni correction. We did not consider a larger BioMed Research International 5 such as 300,000 or 500,000 because this will require more than 10 million simulations to show a stable estimate of the type I error rate. With = 10, we performed 20,000 simulations which result in less than 10% of a coefficient of variation for a significance level 0.05/ = 0.005 for each marker [24]. The test statistics considered are 1/2 , Pearson's Chi-square test, MIN2, MAX3, GMS, and GME. For the second stage analysis, we considered a replication-based analysis, FC , and LC as proposed above. The results are based on the situation under HWE (HWE coefficient = 0). As expected, all tests control the type I error reasonablly well, and similar results were obtained when a slight deviation from HWE is present with = 0.05 (results not shown).

Empirical Power.
We examined the empirical powers of different tests considered above. In Figure 1, we considered = 10 markers, a disease prevalence of 10%, the same genotype relative risk for two stages ( 1 = 1.4 and 2 = 1.4), and 1,000 cases and 1,000 controls. 2,000 simulations were performed under HWE ( = 0) to control the genomewide false positive rate at = 0.05. The recessive, additive, and dominant models were assumed for the first, second, and third rows. Both joint analyses showed better power performances compared to the replication-based analysis (up to 15.9% in scenarios considered in Figure 1), and LC and FC have comparable powers with less than 2% difference. The power gain of using the joint analysis is not as much as that observed in Skol et al. [18]. However, as reported by Skol et al. [18], when the between-stage heterogeneity exists and the risk allele has a larger effect in the first stage than that in the second stage, much improved power is observed by using the joint test. Figure 2 shows results under this scenario with 1 = 1.6 and 2 = 1.4, and the observed increase in power using the joint test is as high as 33.9%. Again, the difference between LC and FC is minor with less than 3% difference. As for comparison between different robust methods, MAX3, GMS, and GME perform well under the recessive model, while 1/2 , 2 , and MIN2 are less powerful. Under the additive model, 1/2 is most powerful, as expected, and 2 is least powerful. Other robust methods perform well with a slight decrease in power compared to 1/2 . Under the dominant model, MAX3, GMS, and GME perform the best even though all tests show good power performances, and the difference is minor. Similar patterns were observed when a slight deviation from the HWE is present (results not shown).

Real Data Application
The GWAS on non-small-cell lung cancer (NSCLC) by Yoon et al. [25] studied 621 NSCLC patients and 1541 control subjects in the discovery stage. After stringent quality control steps, a total of 246,758 SNPs were tested for the association with NSCLC based on 1/2 . In the replication stage, 168 SNPs with value less than 1 × 10 −4 in the first stage based on 1/2 were tested using 804 patients and 1470 control samples. We identified additional 234 SNPs using MIN2 in the first stage which could be studied in the replication stage if MIN2 was used instead of 1/2 since MIN2 produces stronger evidence for the additional SNPs than 1/2 does. The Manhattan plots of using MIN2 and 1/2 are presented in Figure 3. One example is 385272 located in chromosome 2, which had a value of 1.37 × 10 −7 which reached significance level at Bonferroni correction in discovery samples alone, whereas 1/2 yielded a value greater than 1×10 −4 . Even though there is possibility of false positive findings, these SNPs could have been selected for replication if robust methods were used.
Since we do not have replication data for these additional SNPs selected using MIN2 because the first stage selection was based on 1/2 in Yoon et al. [25], just for illustration purpose of the proposed methods, we present the results of three SNPs including 2131877 that was reported by Yoon et al. [25]. When the significance level in the discovery stage is set at = 5 × 10 −5 so that all these exemplary SNPs can be selected in the discovery stage; the value of combined test based on four robust methods (MAX3, MIN2, GMS, and GME) as well as 1/2 and Pearson's Chi-square test is presented in Table 2. Fisher's combination was used for the joint test in the second stage. Only 2131877 was found to be significant with Bonferroni correction ( value < 2.03 × 10 −7 ) by all except MAX3 method.

Discussion
In genetic association studies, efficiency robust tests whose performance does not depend on the underlying genetic model have been extensively studied, and their power benefit over a wide range of genetic models has been well recognized. In this paper, we described how the idea of these robust association tests can be applied to the replication studies and further how overall statistical significance can be evaluated using the combined test formed by values of the discovery and replication studies.
When the robust tests are used, the test statistic of each stage can have a complex form and thus dealing with the distribution of the joint test can be difficult, whereas calculating the value of each stage might be relatively simple. Because the asymptotic distribution of the value under the null hypothesis of no association is easy to handle, the combined test using values rather than the test statistics themselves can provide computational convenience.
There are several methods for combining test statistics from two stages and Won et al. [22] compared the performances of various choices. Two most commonly used forms are based on Fisher's combination and the linear combination after the inverse normal transformation [23], and we presented the test statistics and values of these two methods. In our limited experience, the linear combination and Fisher's combination are fairly comparable. Fisher's combination seems to perform slightly better than the linear combination when there exists some heterogeneity between stages in terms of the genotype relative risk, while the linear combination seems to perform slightly better in most of other situations. However, the difference is extremely minor. Further research is required for the thorough comparison of various methods of combining values in the application of efficiency robust tests to the replication of genetic association studies.

MAX3
Replication-based test

Replication-based test
Replication-based test

Replication-based test
Replication-based test  In a genetic study where the purpose of considering a replication stage is to validate or replicate the genetic findings from the discovery stage, which is the case considered in this paper, the analysis in the replication stage utilized the test statistic or genetic model that is selected as being the best in the discovery stage and also the direction of the risk allele, following guidelines for exact replication in genetic association studies. If the purpose is to simply combine the evidence from different data sources such as in meta-analysis, other strategies may be devised. Further research, again, is required to provide fully detailed properties of such methods.
Power gain of a joint analysis over the conventional replication-based analysis was thoroughly studied by Skol et al. [18,19]. In our simulation, the amount of power increase using a joint test compared to the replication-based analysis was much minor than what was observed by Skol et al. [18,19]. The exact reason is not known, but we suspect this might be due to the power advantages of robust methods and also due to the fact that the optimal choice from the first stage is used when calculating the second stage values.
However, even though it was minor in some situations, the joint anlysis presented better power performance than the replication-based analysis in our study. This type of joint analysis raised concerns about the exact meaning of replication [17]. However, McCarthy et al. [26] mentioned that joint analyses "blur the boundaries of where exactly replication starts, but whichever analytical approach is taken, confirmation in many independent samples is important and it is the overall strength of the evidence of association that matters. " Purpose of the current study was to present how the overall strength of the evidence of association can be evaluated when robust tests are used in GWAS replication studies.
We illustrated how the proposed methods can be applied in the real data that studied the association of SNPs with nonsmall-cell lung cancer (NSCLC) in discovery and replication stages. In the original study reported by Yoon et al. [25], SNPs were selected in the discovery data set not based on the robust tests but based on additive CATT. Therefore, we found that some SNPs could have been selected by one of the robust methods but they were not included in the replication data set. For these SNPs, we were not able to perform the joint analysis that we propose, and it was not possible to examine whether there are other SNPs that could have been found to be associated with NSCLC by proposed methods in the replication study. For this reason, we merely presented how many additional SNPs could have been further followed in the replication stage when robust methods were used. In many GWASs, it is a common practice to report the summary test statistics and values of the SNPs under a specific genetic model, usually an additive model, which were further genotyped in the replication stage and were finally defined to be significantly associated with a phenotype of interest. As emphasized in this paper, one may have a better chance of finding many missing SNPs by applying more powerful and robust methods that consider different genetic models simultaneously. Therefore, we urge the community to share test results under not only an additive model but also other genetic models, although they were not significant at a stringent significance level, so that future research may have enriched data resources, to which robust tests can be applied in association studies.
exp (− (2 2 ) ) , and the probability distribution function is (A.3) Using the previous notation, we have the following form of value: where FC / 1 > .