Mixed Modeling with Whole Genome Data

Objective. We consider the need for a modeling framework for related individuals and various sources of variations. The relationships could either be among relatives in families or among unrelated individuals in a general population with cryptic relatedness; both could be refined or derived with whole genome data. As with variations they can include oliogogenes, polygenes, single nucleotide polymorphism SNP , and covariates. Methods. We describe mixed models as a coherent theoretical framework to accommodate correlations for various types of outcomes in relation to many sources of variations. The framework also extends to consortium meta-analysis involving both population-based and family-based studies. Results. Through examples we show that the framework can be furnished with general statistical packages whose great advantage lies in simplicity and exibility to study both genetic and environmental effects. Areas which require further work are also indicated. Conclusion. Mixed models will play an important role in practical analysis of data on both families and unrelated individuals when whole genome information is available.


Introduction
Genomewide association studies GWASs have successfully identified many genetic variants consistently associated with human diseases or other traits.Both unrelated individuals in a population or related individuals in families have been involved in such studies.There is a variety of issues which merit further consideration.
Our concern here is on correlations among individuals, which are "the central piece of information" 1 in detection and characterization of gene-trait association.Consideration of these correlations has traditionally limited to family data whose critical role in genetic epidemiological study ranges from familial aggregation, segregation, linkage to association 2 , and special attention is required in the analysis compared to unrelated individuals from a population.Correlations arise naturally among relatives but can be relevant to

GLM
We start from the usual GLM disregarding familial correlations.Let the phenotypes of n individuals in a family be y 1 , . . ., y n , its distribution is exponential where b • and c • are known functions, ϕ a scale or dispersion parameter.Furthermore, let E y i μ i and let this be connected to a linear predictor using link function g • by η i g μ i X i β, where X i is a vector of covariates and β the regression coefficient s .For simplicity, only canonical link is used so that θ i μ i .It can be shown 27 that the expectation E y i μ i b θ i and variance V y i ϕb θ i .Some special cases as with their properties are well-recognized 28 , for which models involving continuous and binary outcomes are most common. Normal: and an identity link.Binomial: y i ∼ Binom n, μ i , θ μ i ln μ i / 1 − μ i , b θ i ln 1 exp θ i , ϕ 1/n, b θ i exp θ i / 1 exp θ i , ϕb θ i μ i 1 − μ i /n, and a logit link g μ i ln μ i / 1 − μ i .Analysis of censored survival data can be molded into the framework 29 .Let t i denote the event time, c i the censoring time and δ i I t i ≤ c i the event indicator for unit i, i 1, . . ., n; the basic Cox model with vector of explanatory variables X i is specified via a hazard function λ i t λ 0 t exp X i β , where λ 0 t is the baseline hazard function.The partial likelihood PL for the standard Cox model can be expressed as follows: where n failure times have been ordered such that t 1 < • • • < t n and R t i is the "risk set," the number of cases that are at risk of experiencing an event at time t i .
Although GLM lays the foundation in many applications of general statistics, it largely serves a motivating role for models that are capable to account for familial correlations.As shown below, this is achieved with introduction of correlated random effects as in GLMM, but it is also linked with other models.

GLMM
We now consider model involving individual i, i 1, . . ., N, where N is the total number of individuals in our sample.

Polygene
Let P denote the polygene representing independent genes of small effect, which follows a multivariate normal distribution with covariance matrix g μ i X i β P i .

2.3
The likelihood for all relatives is furnished with specification of the distribution of P P 1 , . . ., P N with covariance where Φ ≡ {φ ij } n×n and φ ij is the kinship coefficient, defined such that, given two individuals, one with genes g i , g j and the other with genes g k , g l , the quantity is 1/4 P g i ≡ g k P g i ≡ g l P g j ≡ g k P g j ≡ g l , where ≡ represents probability that two genes sampled at random from each individual are IBD.The kinship coefficients for MZ twins, DZ twins/fullsibs, parent-offspring, half-sibs, and unrelated individuals are 0.

Oligogene
Suppose that a major gene M is also involved, independently and normally distributed with mean 0 and variances σ 2 M , then the covariance matrix has the form where Π ≡ {π ij } N×N in which π ij is the proportion of alleles shared IBD at the major gene between relatives i and j which can be estimated from a multipoint data, so that when it acts additively with polygene P , the likelihood is furnished with an extended covariance Σ M,P Σ M Σ P .

2.7
For a test of a strictly positive variance associated with a polygene versus polygene and an oligogene, the log likelihood ratio test statistic is referred to 0.5χ 2 1 0.5χ 2 2 30 .

Multiple Random Effects
The framework in 2.3 includes the common distributions such as normal, gamma, binomial and Poisson as special cases.For simplicity, we consider a quantitative trait, whose probability density function is normal and a statistical model is as follows: and U ∼ N 0, Σ , ∼ N 0, σ 2 , Cov U, 0. The expression of Σ −1 relative to the precision 1/σ 2 of as a Cholesky factorization Δ Δ, that is, Δ Δ led to the term relative precision factor for Δ 31 .Note that the partition of effects as being fixed and random H A : genetic effect can be compared to a sporadic model H 0 : no genetic effect y X 1 β 1 X 2 β 2 e, where both β 1 and β 2 are fixed effects, the involvement of Σ or more specifically Σ −1 as a "ridge factor" creates shrinkage in the random effects solutions to the normal equations, that is, "regression towards the mean." We will see an example from the GAW17 data below that a quantitative trait Q1 is influenced by polygenic background and specific gene VEGFC as captured by kinship or relationship matrix and IBD matrix, respectively.This prompts the need to consider multiple random effects.We therefore pursue 2.8 further.As in 32 , write y Xβ Z 1 a 1 • • • Z k a k with the usual assumption that y is N × 1 vector of observations, X an N × p known matrix, not necessarily of full column rank, β a vector of fixed effects, Z i a known N × r i matrix of rank r i , a i random effects with E a i 0, cov a i σ 2 i I r i , cov a i , a j 0, i / j, cov a i , 0, i, j 1, . . ., k, an N × 1 vector of errors with E 0, cov σ 2 I N .Then E y Xβ and cov y Σ σ 2 I N k j 1 σ 2 j Z j Z j .This turns out to be critical to explore the covariance structure involving more k parameters σ 2  1 , . . ., σ 2 k in the form where Σ i σ 2 i has the form of σ 2 i H i , i 1, . . ., k with σ 2 i being the unknown parameter and H i a known coefficient matrix.It will also hold when different variance components such as multiple major genes of interest, gene-gene, gene-environment interactions, common shared environment are to be modeled.For significance test, Case 4 in 30 serves as a general guideline.
A closely related model is the so-called marginal or population-average model whereby familial relationship can be specified for e, namely, generalized estimating equations GEEs 12, 33 .Given for which only link function and variance need to be specified.Parameter estimates are consistent even when variance structure is misspecified, but the ability to use 2.9 is an apparent advantage.We now turn to the Cox model.First, the consideration of an unobserved family specific random effect is often termed as frailty model, such that families with a larger value of the frailty will experience the event at earlier times and most "frail" individuals will fail early 34 .Now we allow for correlated frailty and, in analogy to model 2.3 and 22 , the appropriate model with random effect U i becomes λ i t λ 0 t exp X i β U i .Assuming the parameters of interest are β and σ 2 we have

2.11
The so-called integrated log likelihood is derived as A more tractable solution is via a Laplace approximation for an approximate marginal log likelihood that can be maximized by a penalized partial likelihood PPL for parameters β, σ 2 , PPL β, U log PL β, U − U T Σ −1 U/2, followed by a profile likelihood function involving only σ 2 .
Furthermore, we can take advantage of the generic form of covariance in other types of models as well.A straightforward yet remarkably useful extension is the multivariate model.For instance, consider 2.8 with m phenotypes.Let y y 11 , . . ., y 1N , . . ., y mN T be a vector of m multivariate phenotypes for N individuals.Let β be a vector of dimension mp of the regression coefficients for the p covariates including a vector of 1's corresponding to the overall mean, X I m X N,p , an mN ×mp known matrix of covariate values.An analogy to 2.7 and 2.8 lead to the variance-covariance matrix of the m phenotypes with dimension mN × mN is where R is the N × N matrix of the coefficients of relationship, Π is an N × N matrix of estimated proportion of alleles IBD, and A, B, C are oligogenic, polygenic, and residual variance-covariance matrices each with dimension m × m.

Meta-Analysis
One indispensable element in current GWASs is meta-analysis, typically involving findings from both unrelated individuals in a population and those from family data.While we have seen that mixed models are appropriate for a variety of traits in family-based association studies, broadly models for meta-analysis also fall into the same framework as described above.One can imagine a meta-analysis involving individual participant data IPD .A good summary of approaches for IPD meta-analysis is available 35 .
In the two-step approach, the individual participant data are first analysed in each separate study independently by using a statistical method appropriate for the type of data being analysed; for example, a linear regression model might be fitted for continuous responses such as blood pressure, or Cox regression might be applied for time to event data.This step produces aggregate data for each study including effect estimate and its standard error .These data are then synthesised in the second step using a suitable model for meta-analysis of aggregate data, such as one that weights studies by the inverse of the variance while assuming fixed or random effects across studies.In the one-step approach, the individual participant data from all studies are modelled simultaneously while accounting for the clustering of participants within studies.This approach again requires a model specific to the type of data being synthesised, alongside appropriate specification of the assumptions of the meta-analysis e.g., of fixed or random effects across studies .
The two-step approach is the usual one used in various GWAS consortia while a one-step approach for all studies in our context could involve unrelated populationbased samples and family data in the meta-model as long as the correlation structure is appropriately specified.The practicality of both approaches has been illustrated in the literature 36, 37 but, in view of the complexity involving in such a framework, and the practical difficulty that a researcher may not have access to individual data from all studies, we refrain ourselves from such a consideration for now but remain focusing on family data as illustrated with both simulated and real data.

Related Results and Implementations
There have been concerns in the literature regarding large number of units each with bounded size 38 and a large number of random effects 39 .In our context large number of families, each with bounded members, consistent estimate of the random effect is difficult to obtain though fixed effects and variance components will be consistent.However, Type I error rate and power have been explored before 19,22,26,40 , so there will be more on specific examples.
Instead of using purposely written programs, we chose to use R, for its wide availability and many other features 41 , and in particular procedures to fit models described earlier are to a great extent available, including generic procedures from nlme, lme4, and gee, among others, but package designed for family data is pedigreemm with lmekin for linear mixed models available from coxme.We will also compare them to SAS, due to its ability to deal with large data, and great flexibility in model specification.

Examples
We consider two examples from GAWs 17 and 16, which involve simulated and real data widely available and allow for a lot of experiments to be done.

GAW17 Data
Data distributed by GAW17 were based on a collection of unrelated individuals and their genotypes were generated from the 1000 Genomes Project see http://www.1000genomes.org/, from which a sample of 697 individuals in 8 extended families and their genotypes and phenotypes was available.A total of 202 founders in the family data set were chosen at random from the set of unrelated individuals.Replicates of the trait were generated 200 times, but the simulated genotypes remain constant over replicates.The traits made available were Q1, Q2, Q4, and AFFECTED coded 0 no 1 yes with covariates AGE and SMOKE.The variables describing family structures were ID, FA, MO, SEX 1 men, 2 women .Fully informative IBD information was available for 3205 genes.
We chose to examine traits Q1, Q2, and AFFECTED as representatives of quantitative and qualitative traits.According to 42 , vascular endothelial growth factor VEGF pathway was enriched and here vascular endothelial growth factor C VEGFC see http://en.wikipedia.org/wiki/Vascularendothelial growth factor C was chosen as a causal variant associated with Q1 but not Q2.Q1 also increased with age, and the fact that AFFECTED is a function of Q1 offers the possibility to furnish a logistic regression model and explore age at onset via a Cox model.For illustration, we used age as surrogate for age onset.Being aware of the fact that this was only an approximation, whenever multipe affected individuals within a sibship are available, their average age was used.Causal variants and associate genes provide information on power of association testing statistics while the noncausal counterparts provide analogous results on Type I error rate.
The statistical significance was assessed according to log likelihood ratio tests between models using relationship only versus using both relationship and IBD information.The computation for this is relatively fast; results for all 200 replicates took 1 hour and 48 minutes on our 20-node Linux clusters each with 16 GB RAM and 4 CPUs using Sun grid engines.The nominal significance levels are shown in Table 1, which reveal that the tests are both close to the expected levels under H 0 and H A .
Gene-based analysis was also conducted for Q1 involving all 3205 genes and the results are shown with selected candidates highlighted in Figure 1, which agree with the simulated model in which the significant regions were in VEGFC/VEGFA.
As one would be keen to see various parameter estimates in a real analysis, we also provide results associated with replicate one.Q1 as based on restricted maximum likelihood REML is shown in Table 2.The models with relationship only and with both relationship and IBD information have −2 Res tricted log likelihood being 1789.5 and 1775.2,respectively while Akaike Information Criteria AIC being 1793.5 and 1781.2,respectively so that using IBD information improved fit for Q1 smaller AIC .For AFFECTED the results based on maximum pseudolikelihood are shown in Table 3 and those from Cox model in Table 4.Note that the improvement in terms of −2 log pseudolikelihood from 3434.4 to 3445.7 was also substantial.To explore the multivariate model 2.13 involving the polygenic effects for Q1, Q2, and Q4, the six parameters σ 11 , σ 21 , σ 22 , σ 31 , σ 32 , σ 33 in the variance-covariance matrix have been expressed according to 2.9 .The appropriate matrices associated with all parameters are constructed a priori.These are then subject to procedures such as PROC MIXED and lmekin.The joint model of Q1, Q2, Q4 is shown in Table 5.
The implementations are provided in Supplementary information available online at doi: 10.1155/2012/485174.While code blocks shown there are appropriate for one instance, it is preferable to use SAS's output delivery system ODS to save various results into databases.

The Framingham Heart Study
The Framingham Heart Study is under the direction of National Heart, Lung, and Blood Institute NHLBI which began in 1948 with the recruitment of adults from the town of Framingham, Massachusetts.Data available for GAW16 were 7130 individuals from the original cohort 373 , the first generation cohort 2760 , and the third generation cohort 3997 with sex, age, height, weight, blood pressure, lipids, smoking, and drinking.Data as outlined in 43 was used here, where 6848 had genotype data for at least one of the four specified SNPs rs1121980, rs9939609, rs17782313, and rs17700633 .Data for 96 individuals without any phenotype data but with genotype data and an additional 227 individuals without being assigned a family ID were excluded from analyses.Additionally, four individuals had no data on weight; 86 observations were measured at <18 years of age, and therefore were excluded.The 6,520 remaining individuals were part of 962 families, among which 2073 individuals had completed four visits.Meanwhile, there were also 365 cases of diabetes with their ages of onset.
Kinship information was obtained from family structure and used for genotype-trait association.Computer program PLINK 44 with the −genome option was also used to infer correlations π using whole genome data.A total of 8485 SNPs on Affymetrix 500 K chips were derived from a panel of 45620 informative autosomal SNPs used in our consortium analysis.This led to estimates for 6520 6520 − 1 /2 21251940 pairs of relationship.The genetic distance according to |π − π| 45 , that is, sum abs(EZ-PI HAT), na.rm=TRUE), is 3421.724.Approximately half 10478474 had π of 0.01 or more.Although there was a good agreement between kinship according to the specified family structures and π, 11207 pairs of individuals deemed to be unrelated had π between 0.1-0.3 and 12 of which were greater than 0.3.
Both types of relationship matrices were used for the Cox model via kinship and bdsmatrix.ibdfunctions in R. The frailty and polygenic models had log likelihoods of −1788.53,−1791.93 with variance estimates 0.10 2 and 0.02 2 , respectively.However, with inferred relationship the log likelihood turned out to be −1762.69and variance estimate 0.24 2 .Similar model for BMI at wave 1 was also fitted; a family specific random intercept model yielded log likelihood of −19273.26and variance 3.44, while a correlated random intercept model gave log likelihood −19379.3 and variance 0.01 2 with comparable results from inferred relationship though with a smaller residual error.The results on diabetes might have suggested a substantial genetic effect while for BMI the use of inferred relationship performed equally well with a model using explicit family structures.

Discussion
The models we have considered extend counterparts for unrelated sample by taking into account correlation within and heterogeneity between families.To a large extent, we have presented an appreciation of models and implementations for related individuals using mixed models.At the meantime, we have envisaged a whole range of analyses that can be put in the framework.However, compared to 13 and especially 19 , our development is more incremental and helps to gain insight into more complicated models.As a key feature of the model specification, oligogenes, polygenes, common environment, geneenvironment interaction, and multivariate data are accommodated in a coherent framework via appropriate covariance structure.The generic nature has enabled a range of genetic association studies.Our interpretation of the model also naturally extends the model for quantitative traits outlined by 19, 46 .It has been recognized that for longitudinal data some commonly used covariance structures, such as compound symmetry, can be expressed as "linear covariance of dimension k" 47, page 258 .Although it could be more involved, it may be possible in our context.Data as in consortium meta-analysis analysis is also perceived in broader framework consisting of both unrelated and related individuals.
We should be aware that mixed models are quite general and may well be linked to other models.For instance, we noticed that model 2.10 is reminiscent of an approach proposed for generalized method of moments 48 .An example as with its link with individual empirical Bayes estimates has been provided by 49,50 .A reviewer has brought to our attention recent work on nonparametric methods for longitudinal data 51 and the utility of mixed models in controlling for bias of population stratification e.g., 52 .This paper has limited coverage of literature on longitudinal analysis of family data, mainly owing to the fact that there is greater difficulty in implementation via general software package.However, this is expected to change.To our knowledge, little work has been done on joint analysis of individual data in the GWAS meta-analysis context.In view of the popularity of consortium data analysis, it will be appealing to have the appropriate mechanism to make it possible.
The models and their implementations are connected with whole genome data in several ways.First, the transition from the variance components models in earlier literature becomes more explicit.More specifically, the models described here are appropriate for GWAS where genetic variants coupled with a high resolution map are available.In general, the variance component associated with a major gene as in 2.7 is a function of the recombination rate r 12 , that is, σ 2 M f r, π ij , where π ij represents identity-by-descent sharing between a pair of individuals i, j for the marker locus; with dense marker, we can assume that r 0 which is also true with 2.9 .Second, as in the Framingham data there is a further benefit with dense genetic markers such that they can be used to infer family structure 53 or global IBD information 54 .The availability of the deep sequencing data and a long list of established genes are likely to give greater weight on use of family data 55 .It is also desirable that cryptic relatedness in population-based sample can be appropriately taken into account in association analysis.In our own EPIC-Norolk GWAS, samples with cryptic relatedness have been excluded at the quality control stage 56 .It is interesting to note that coxme was developed for handling large pedigrees involving sparse matrices; the availability of whole genome data will alter the scenario slightly but nevertheless remain in the same framework.Third, more work is required to shorten computing time.In the literature, it has been proposed to absorb the relationship in the model for quantitative trait by multiplying inverse of the kinship matrix followed by a linear regression, or using residuals from a phenotype-covariate only regression as outcome in a model including SNPs as in GenABEL.In principle one can extend the idea to multivariate or longitudinal models where the residuals are obtained only once for GWAS or incorporating regional information before turning to SNP-specific analysis.There are also alternative approaches such as retrospective methods found in Merlin.With its greater requirement in computation the "measured genotype" approach here remains intuitive especially for gene-environment characterization.To this point, associate projects such as BORDICEA see http://www.srl.cam.ac.uk/genepi/boadicea/boadicea home.htmland BayesMendel see http://bcb.dfci.harvard.edu/BayesMendel/have contributed to the success of work on R described here.
A reviewer has expressed interest regarding the Type I error linking to results shown in Table 1.We believe that data as distributed by GAW17 as they were 200 replicates are not ideal for assessing Type I error and possibly require a bootstrap procedure.In general, from our experience and personal communications with Profs.Douglas Bates and Terry Therneau , this is a difficult issue and possibly problem specific.In fact, in the recent implementation of GLMM in lme4, the associate p values for fixed effects are not shown which nevertheless may leave users with temptation to employ normal approximation.Although we have not conducted extensive numerical experiments, results from GAW17 and the Framingham Study have indicated good performance of these models, and that of the inferred relationship based on whole genome data is impressive.Since only directly  genotyped Affy500K SNPs were used, the addition of imputed genotypes, say based on the HapMap, should help to improve the inference.Its use in the usual genomewide association analysis should be considered.
Our attention lies on the implementation by taking advantages of the available implementation in general statistical computing environment.The clarification of the implementation in these should facilitate practical analysis of family data.Although these models are conceptually simple, availability of their implementation vary, notably the ability to allow for both oligogenes and polygenes in a GLMM framework.For R, these are at least possible with nlme, lme4, and additionally coxme.At the moment, applications of packages in R are often restricted with lmekin in coxme offering outcomes only on continuous outcome but for pedigreemm it is unable to handle complex covariance structure.It is desirable that a function called nlmekin can be developed as with pedigreemm expanded to incorporate   When the interest is on correlation between multiple traits, the use of nlme for multivariate longitudinal data in unrelated individuals has been described 57 .In general, this could be complicated with longitudinal familial data without 58 or with 59 consideration of relationship.In study of obesity-related traits, FTO has been shown to be strongly associated with BMI and supported by cross-sectional data as in 14 , longitudinal data as in 43 and data across life span as in 60 .Our previous attempt 43 , was based on a three-level model and it would be of interest to use kinship information as well.
While the framework we have outlined is comprehensive, we feel that our "proof of concepts" here awaits for extensive testing.It is also desirable that the current implementation can be optimized in computing time.A lot of work has been done for quantitative genetics in plants and animals.Our experience indicated that the running time with SAS was longer time than R.However, in an analysis of longitudinal lung function data in the EPIC-Norfolk study, we have shown that although an individual analysis could be slow, it is possible to perform an analysis for GWAS using SAS and Linux clusters so that ∼2.5 M SNPs would finish within 14 hours when running each chromosome on a separate node.It is likely that it benefited from SAS caching frequently-used instructions.Greater proportion of coding in C/C++ should also be helpful.Given the utility of the popular environments can be shown, their take-up in genomewide association studies will be quick and it is very much in line with efforts in other disciplines where large volume of data is involved.

Figure 1 :
Figure 1: Manhattan plot of Q1 and IBD information where the true loci are highlighted.
only involve with random effects, noting that it is assumed that, given random effects in the model, the phenotypic values among n relatives are independent and that the parameters of interest in 2.4 are the variances involving polygene σ 2 P .Regarding the statistical inference of random effects, since the parameter under the null hypothesis is on the boundary of the parameter space, the test for a specific σ 2 N i 1 f y i | P and L P 2π|Σ P | −1 exp −P Σ −1 P P/2

Table 2 :
Q1 and VEGFC under a linear model.
† z is for variance components while t for fixed effects.

Table 3 :
AFFECTED and VEGFC under a logistic model.

Table 4 :
AFFECTED and VEGFC under a Cox model.

Table 5 :
Q1, Q2, and Q4 under a multivariate polygenic model.SAS, MIXED, GLIMMIX, and NLMIXED together provide a rich source of practical modeling functionality though the Cox model counterpart is not available.The tackling of various issues has led to efficient algorithm 25 .