Robust Joint Analysis with Data Fusion in Two-Stage Quantitative Trait Genome-Wide Association Studies

Genome-wide association studies (GWASs) in identifying the disease-associated genetic variants have been proved to be a great pioneering work. Two-stage design and analysis are often adopted in GWASs. Considering the genetic model uncertainty, many robust procedures have been proposed and applied in GWASs. However, the existing approaches mostly focused on binary traits, and few work has been done on continuous (quantitative) traits, since the statistical significance of these robust tests is difficult to calculate. In this paper, we develop a powerful F-statistic-based robust joint analysis method for quantitative traits using the combined raw data from both stages in the framework of two-staged GWASs. Explicit expressions are obtained to calculate the statistical significance and power. We show using simulations that the proposed method is substantially more robust than the F-test based on the additive model when the underlying genetic model is unknown. An example for rheumatic arthritis (RA) is used for illustration.


Introduction
Genome-wide association studies (GWASs) have identified a large number of genomic regions (especially single-nucleotide polymorphisms (SNPs)) with a wide variety of complex traits/diseases. In a GWAS, two most common types of data, qualitative (or binary) and quantitative (or continuous) traits, are analyzed and two contentious points are often faced; one is how to construct the test statistic considering the genetic model uncertainty and the other is how to evaluate the statistical significance for controlling the false positive rates efficiently (e.g., [1,2]). Considering these issues, a lot of work has been done on the binary trait in the past 10 years (e.g., [3][4][5][6][7]). Computer algorithms have also been developed to calculated the significance level of robust tests in GWASs, taking into account the genetic model uncertainty [8]. However, few work has been done on continuous traits, only recently So and Sham [9] proposed a MAX3 based on score test statistics, and Li et al. [10] gave a MAX3 based on -test statistics. Note that these tests just focus on single-marker analysis in one-stage analysis.
Although the costs of whole-genome genotyping are decreasing with the high-throughput biological technology, the total costs for a GWAS are still very expensive due to the thousands of sampling units and huge amounts of singlenucleotide polymorphisms. In order to save the costs, the two-stage design and the corresponding statistical analysis where all the SNPs are genotyped in Stage 1 on a portion of the samples and the promising SNPs with small -values (e.g., <0.001) based on some efficient tests are further screened on the remaining subjects, are often adopted in practice (e.g., [11][12][13][14][15]).
In genetic association studies, especially GWASs, genetic markers are routinely tested under the assumption of additive effects. Although convenient to use, those tests are optimal only when the true underlying genetic model is additive so that they are not robust against the genetic model misspecification. To our best knowledge, few work has been done on the two-stage joint analysis for quantitative trait GWASs allowing for genetic model uncertainty. Here, we attempt to develop a joint analysis method with data fusion in the two-stage design using -statistic, since -test is commonly employed from the linear regression model for quantitative trait, and Li et al. [10] show that MAX3 based on -statistics is more powerful than So and Sham's method by extensively numerical simulation.
The content of this paper is organized as follows. In Section 2, we give some notations and the proposed robust joint test statistics. Further, we derive the asymptotic distribution of the test statistics under the null and the alternative hypotheses. In Section 3, we show that the proposed joint analysis method is substantially more robust than the additive-modelbased -test from the numerical results of power comparison when the real genetic model is unknown. After that, an illustrative example for rheumatic arthritis (RA) is presented. Finally, we give some discussion of this paper in Section 4.

Notations. Assume that
individuals are randomly selected to be genotyped in a two-staged GWAS for a certain quantitative trait and that is the sampling proportion in Stage 1. Let 1 = and 2 = (1 − ) be the sample sizes for Stages 1 and 2, respectively. Consider a biallelic marker with two alleles G and g. Without loss of generality, we assume that G is the minor or high-risk allele. We suppose that the total SNPs are genotyped on the samples of Stage 1, and SNPs with -values less than in Stage 1 will be further genotyped and tested in Stage 2. Let the significance level be , and then the genome-wide significance level per SNP is / with the Bonferroni adjustments. Let Y 1 = ( 1 , 2 , . . . , 1 ) and Y 2 = ( 1 +1 , 1 +2 , . . . , ) be the observed quantitative outcome vectors for Stage 1 and Stage 2, respectively. Without loss of generality, we assume that the first 10 individuals in Stage 1 have the genotype gg, the second 11 individuals in Stage 1 have the genotype Gg, and the last 12 subjects in Stage 1 possess the genotype GG. Similarly, the first 20 subjects in Stage 2 have the genotype gg, the second 21 individuals in Stage 2 have the genotype Gg, and the last 22 subjects in Stage 2 possess the genotype GG. Let 0 = (0, 0, . . . , 0) ×1 and 1 = (1, 1, . . . , 1) ×1 , and let O × be the × matrix with all its entries being zero and I be the × identity matrix.

-Statistic-Based Robust Joint Analysis.
We firstly briefly introduce -statistic-based MAX3 by Li et al. [10] just using the data from Stage 1. Consider the following linear regression model: where 0 is the nuisance parameter for the intercept, 1 is the parameter of interest for genetic effect, and is the genotype value, which takes 0, 1, or 2 corresponding to the count of G at a marker locus for the th subject, = 1, 2, . . . , 1 . The hypotheses of interest are The variable in the previously stated equation is coded differently for the three common genetic models. Let , and X 1 = (1 1 , G 1 ) be the design matrices under three commonly used genetic models, where G 1 = (0 10 + 11 , 1 12 ) corresponds to the recessive model, G 1 = ( 1 , 2 , . . . , 1 ) corresponds to the additive model, and G 1 = (0 10 , 1 11 + 12 ) is for the dominant model.
, where x 11 = (0 10 , 1 11 , 0 12 ) and x 12 = (0 10 , 0 11 , 1 12 ) . The modified -test statistics under the recessive, additive, and dominant models for Stage 1 are given by The robust test statistic in Stage 1 is We now give the proposed robust joint analysis. In the framework of two-stage design GWAS of quantitative traits, the SNPs with -values less than will be genotyped on the remaining 2 subjects in Stage 2. Following the previous notation for Stage 1, corresponding to the recessive, additive, and dominant models, the genotype data in Stage 2 are denoted by G 2 = (0 20 + 21 , 1 22 ) , G 2 = ( 1 +1 , 1 +2 , . . . , ) , and G 2 = (0 20 , 1 21 + 22 ) , respectively, and the design matrices are , where x 21 = (0 20 , 1 21 , 0 22 ) and x 22 = (0 20 , 0 21 , 1 22 ) . Then, we can obtain three modified -test statistics under the recessive, additive, and dominant models for Stage 2 similarly, and denote them by 2 , 2 , and 2 . Let Y = (Y 1 , Y 2 ) , G = (G 1 , G 2 ) , G = (G 1 , G 2 ) , and G = (G 1 , G 2 ) . Denote 0 = 10 + 20 , 1 = 11 + 21 , and 2 = 12 + 22 for the combined sample sizes from two stages, corresponding to three genotypes. Then the proposed -test statistics under three genetic models on the basis of the combined data are as follows: where X = (1 , G ), X = (1 , G ), X = (1 , G ), and W = ( ) , Furthermore, we propose the joint testing statistic as In order to calculate the power of the proposed joint analysis, we have to get the thresholds, which is determined by the significance level. Denote the threshold for choosing the promising SNPs in Stage 1 by 1 , which is the solution of Since the genome-wide significance level is / , in order to control the false positive rate, we have where is the cut-off point for the joint statistic. Once we have 1 and , the power is calculated by We now give the detail to calculate the cut-off point and power above. The left side of (10) can be further expressed as For controlling the type I error rate and calculating the power, we need to know the distribution or the asymptotic distribution of ( 1 , 1 , 1 , , , , RSS 1 , RRS ) under both 0 and 1 .

Power Comparison.
We conduct simulation studies to evaluate the performance of the proposed method under three commonly used genetic models (recessive, additive, and dominant models). We mainly compare the power of two approaches; one is the proposed method in this paper, and the other is the joint analysis based on the -test statistics 1 and . For convenience, we refer to the proposed method as MAXFJ and AFJ for the other one. We choose the sample size = 2000, and = 5 × 10 5 . The proportion of subjects genotyped in Stage 1 has three levels = 0.3, 0.4, 0.5. We set the genome-wide significance level as = 0.05 and that the significance level per SNP as / = 1 × 10 −7 . In Stage 1, the -value threshold for SNPs selected for followup is set to be 1 × 10 −4 and 2 × 10 −4 . We assume that the Hardy-Weinberg   Tables 1 and 2 for = 1 × 10 −4 and = 2 × 10 −4 , respectively. They indicate that MAXFJ is more efficiency robust than AFJ across various inheritance models. As expected, AFJ is more powerful than MAXFJ under the additive model. However, MAFJ performs much more powerful than AFJ when the true genetic model is recessive. For instance, in Table 2, with = 0.4 and MAF = 0.3, the powers of AFJ and MAXFJ are 0.101 and 0.529, respectively. In summary, MAXFJ is substantially more powerful than AFJ in two-staged GWAS of quantitative traits, when the model for AFJ is misspecified.

An Illustration Example: Rheumatoid Arthritis.
Rheumatoid arthritis (RA) is an autoimmune disease (resulting in a chronically systemic inflammatory disorder) which mainly attacks synovial joints. About 1% of the common adult population worldwide is affected by RA [16]. It has been pointed out that the genetic variants might play a major role in RA susceptibility [17]. Genetic Analysis Workshop 16 (GAW16) based on the North American Rheumatoid Arthritis Consortium (NARAC) is a GWAS testing association with RA using about 5 × 10 5 SNPs [18][19][20]. It included 868 individuals who were RA positive (cases) and also had continuous trait anticyclic citrullinated peptide (anti-CCP) measures and 1194 controls sampled from the New York Cancer Project (NYCP) without RA which had no anti-CCP measures. Huizinga et al. [21] pointed out that a greater anti-CCP would be linked to better prediction of increased risk developing RA. Chen et al. [22] showed that SNP rs2476601 located in PTPN22 had the most significant association with RA. Here, we only focus on SNP rs2476601 and apply two joint analysis methods (AFJ and MAXFJ) to evaluate its statistical significance. The minimum of anti-CCP among 868 cases was affected to each control, and a log transformation of anti-CCP was applied in the analysis. Then, we considered = 0.3, 0.4, 0.5 three simulation circumstances. For = 0.3, thirty percent of individuals were randomly sampled from all cases and controls and were used as the data from Stage 1, and the rest of individuals were treated as the data of Stage 2.
The -values of AFJ and MAXFJ were calculated, respectively. We repeated the above procedure 1,000 times and saved the corresponding -values. A base-10 logarithm transformation and an opposite transformation were successively applied to these -values, and the histogram and density of these transformed data were obtained (Figure 1). Similarly, we conducted the simulation and calculation for = 0.4 and 0.5, and the corresponding histogram and density were presented in Figures 2 and 3. Examination of Figures 1-3 showed that the -values of MAXFJ are more stable than those of AFJ and the estimated density curves of MAXFJ are more closer to the symmetrical normal distribution while the estimated density curves of AFJ are rather skewed, which indicated that MAXFJ possesses more robust performance when the real genetic models are unknown.

Discussion
We have developed a feasible two-stage design and the corresponding robust joint analysis approach for quantitative trait GWASs. The method is based on the -statistics over three different genetic models. The denominator of the usedstatistic, which is constructed without assuming any genetic model, is different from the commonly used one. This adoption can reduce the computation intensity. Taking advantage of an ingenious design matrix, we successfully construct the common denominator of three -test statistics for the joint analysis with combined raw data from both stages. The statistical significance ( -value) for the proposed joint analysis method can be calculated with the derived analytic expressions on the basis of the asymptotic distributions, which greatly reduce the complexity and computational intensity compared with the resampling-type permutation and bootstrap procedures. Our numerical results demonstrate  Computational and Mathematical Methods in Medicine that this novel approach has the greater efficiency robustness for genetic model uncertainty than the -statistic-based joint analysis which assumes the additive genetic model. In this work, we did not investigate the power of joint analysis based on other existing robust association methods for quantitative traits such as So and Sham's method. We find that it is very difficult to extend So and Sham's method (score test-based MAX3) to two-staged GWASs with quantitative outcomes, since it is almost impossible to derive the joint distribution of score tests from two stages.
For simplicity, here we do not take into account the effects of covariates in the considered two-stage design. However, in real application, the proposed method can be easily applied to the situation including one or more covariates as shown by the original MAXF by Li et al. [10]. It is important to stress that we combine the raw data from two stages to construct the joint statistic, unlike the joint analysis for binary traits using the weighted sum of two statistics in Stages 1 and 2 [12]. Furthermore, one basic assumption in this paper is that the effect sizes of genetic variants between two stages are identical (i.e., no heterogeneity exists), which is the natural and reasonable precondition for the data fusion strategy. In addition, the population-based genetic association studies may be affected by the population stratification, and this needs future research to examine it. Based on the design matrix for the expanded full model above, we can get the ordinary least square estimator of bŷ= (W W) − On the one hand, according to it follows that So, we can get that Denote 0 = √( 10 + 11 ) 12 /( 10 ( 11 + 4 12 ) + 11 12 ) and 1 = √ 10 ( 11 + 12 )/( 10 ( 11 + 4 12 ) + 11 12 ). For a given > 0, is the bivariate normal density function for ( 1 / , 1 / ) with mean 0 2 and variance-covariance matrix Σ 0 = ( 1 V 13 V 13 1 ). Taking advantage of the symmetry of bivariate normal distribution, the above twofold integration can be only calculated at the right half space of Ω 0 and then multiplied by 2, which is Based on the property of conditional distributions of the multivariate normal distribution, we have where ( 0 ) is the probability density function of (0, 1) and ( 1 | 0 ; V 13 ) is the density function of the conditional normal distribution (V 13 0 , 1 − V 2 13 ). That is, ) .