A Bayesian Hierarchical Model for Relating Multiple SNPs within Multiple Genes to Disease Risk

Duan, Lewei; Thomas, Duncan C.; Gutierrez, Soraya E.

Division of Biostatistics, Department of Preventive Medicine, University of Southern California (USC), 2001 N. Soto Street, Los Angeles, USA

Research Article. International Journal of Genomics (IJG), 2013, Article ID 406217, doi:10.1155/2013/406217. ISSN 2314-436X, 2314-4378. Hindawi Publishing Corporation.

Article dates: 30 May 2013; 3 September 2013; 9 September 2013; 31 December 2013.

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

A variety of methods have been proposed for studying the association of multiple genes thought to be involved in a common pathway for a particular disease. Here, we present an extension of a Bayesian hierarchical modeling strategy that allows for multiple SNPs within each gene, with external prior information at either the SNP or gene level. The model involves variable selection at the SNP level through latent indicator variables and Bayesian shrinkage at the gene level towards a prior mean vector and covariance matrix that depend on external information. The entire model is fitted using Markov chain Monte Carlo methods. Simulation studies show that the approach is capable of recovering many of the truly causal SNPs and genes, depending upon their frequency and size of their effects. The method is applied to data on 504 SNPs in 38 candidate genes involved in DNA damage response in the WECARE study of second breast cancers in relation to radiotherapy exposure.

1. Introduction

The Women’s Environment, Cancer And Radiation Epidemiology (WECARE) study [1] is aimed at a comprehensive examination of genes involved in particular functional pathways. The study is a population-based nested case-control study of 708 contralateral breast cancers (CBC) within a notional cohort of about 65,000 survivors of a first breast cancer, 1401 of whom serve as controls, and the primary exposure of interest is ionizing radiation dose to the contralateral breast from radiotherapy for treatment of the first cancer. Ionizing radiation is known to cause double strand breaks (DSBs) in DNA, which can invoke any of several DNA damage response mechanisms, primarily DSB repair via either homologous recombination or nonhomologous end joining, cell cycle checkpoint regulation, or apoptosis. The original study focused on mutations in the ATM gene, which plays a central role in the recognition of DSBs. The study was then extended to include BRCA1, BRCA2, and CHEK2, which are all involved in homologous recombination repair (HRR), and later still to include a broader set of 38 candidate genes involved in this and other pathways for DSB damage response. We have previously reported on the main effects of ionizing radiation [2, 3], ATM [4–6], BRCA1/2 [7–12], CHEK2 [13], and the interactions of radiation with ATM [14] and BRCA1/2 [15] as well as with other treatments and reproductive factors [16, 17], amongst other risk factors. The aim of this paper is to provide a comprehensive modeling strategy for examining the effects of all genes in a pathway and to apply the approach to candidate genes for CBC risk in the WECARE study.

There is a growing literature on methods for pathway modeling, motivated in large part by an interest in mining GWAS data for commonalities across related genes that individually may not achieve genomewide significance but in the aggregate may point to novel pathways (see [18] for a review of gene set enrichment analysis and alternatives). Our goal here is more modest, guided by an a priori selection of strong candidate genes [19]. Like other methods of pathway analysis, however, we aim to exploit external knowledge about the biological function of each gene and the relationships between them [20].

Our starting point is a model for multiple variants proposed by Quintana et al. [11], which collapses a subset of the variants within a gene into a single “burden” type index, similar to a number of other recent rare variant proposals (see Basu and Pan [21] for a review and comparison by simulation). Their model allows for both deleterious and protective effects and uses Bayesian model averaging to account explicitly for uncertainty about which variants to include in the model (and in which direction for those that are included). This approach was further extended to incorporate prior covariates in the probabilities of SNP inclusion [12, 22]. Hoffman et al. [23] introduced a step-up variable selection approach that allows for deleterious and protective effects but considers model uncertainty only in the form of a permutation procedure for the overall significance test, so it is unable to assess the importance and direction of particular variants or alternative models. Chen et al. [24] describe a somewhat similar model that combines variable selection at the SNP level with shrinkage at the gene level. In the current paper, we extend this approach to multiple genes, incorporating prior covariates and prior gene-gene similarity information in a hierarchical modeling framework.

2. Model Specification

We have information on $i=1,\dots,N_I$ individuals with binary outcomes $Y_i$, a vector of fixed effects $X_i$ (age, family history, etc.), and a vector of SNP genotypes $S_{ig}=(S_{igs})$, $s=1,\dots,N_{Sg}$, within multiple genes $g=1,\dots,N_G$ for each individual. We propose a novel model based on a hierarchical Bayes framework with three levels: (i) a subject-level model for the association between genes and disease, (ii) a gene-level model for the regression coefficients in the gene-disease association model, and (iii) a SNP-level model describing which variants contribute to each gene and the direction of their effects. (These submodels are described by (1), (2), and (4)-(5), respectively, below and in the surrounding text.) The general framework is similar to one recently proposed by Quintana et al. [12, 22] but differs in a number of details. The overall model is represented as a directed acyclic graph in Figure 1, where boxes represent observed data and circles represent latent variables or model parameters; single arrows denote stochastic links, while double arrows denote deterministic links. The three dotted rectangles enclose the covariates and parameters included in each level of the model and their relations.

Figure 1

Directed acyclic graph describing the structure of the model. Boxes describe observed data; circles represent latent variables or model parameters. Single arrows denote stochastic relationships, while double arrows denote deterministic relationships. The first rectangle illustrates the relations of disease status and genes at the subject ($i$) level; the second rectangle illustrates the relations of external information and the first-level coefficient $\beta_g$ at the gene ($g$) level; the third rectangle illustrates the relations of weighted SNP effects and the gene burden index at the SNP ($s$) level.

The subject-level model is specified in terms of a burden index for each gene, a deterministic function equal to the number of positively associated SNP alleles minus the number of negatively associated SNP alleles; however, the choice of whether a SNP is included and, if included, its direction is stochastic, governed by prior probabilities that could in principle vary across genes or across SNPs within genes. The gene-level model has means and covariances for each ln RR (relative risk on the log scale) coefficient that can depend upon external information (“prior covariates” and prior “gene-gene connections”). In principle, the SNP-level model could also include prior covariates [22], although that is not considered here. For the simulations and the analysis of the real WECARE data, we used the Gene Ontology (GO [26], a pathway ontology database, http://www.geneontology.org/) for the 38 WECARE candidate genes to construct the prior covariate and connection information, as described in more detail in the simulation section.

Level 1. The subject-level model for case-control data uses a conditional logistic regression model to relate burden indexes $G_{ig}=G(W_g,S_{ig})$ for genes $g=1,\dots,N_G$ to a binary outcome variable $Y_i$, the disease status for individual $i$. Here, $G$ denotes a deterministic function of the SNP genotypes $S_{igs}$ for SNP $s$ in gene $g$ with corresponding weights $W_g=(W_{gs})$, $W_{gs}\in\{-1,0,+1\}$, defined in the level 3 model. Thus, the first level model is of the following form:

$$\operatorname{logit}\Pr(Y_i=1)=X_i'\alpha+\sum_{g=1}^{N_G}e^{\beta_g}G(W_g,S_{ig})+\mathrm{offset}_i,\tag{1}$$

where $X_i$ denotes a vector of fixed covariates (confounders) with coefficient vector $\alpha$. The offset term is needed to account for the counter-matched design in the WECARE study [1].

Each gene burden index has a regression coefficient $e^{\beta_g}$ (so that $\beta_g$ is its logarithm) describing its contribution to risk, the interpretation of which will depend upon the current assignment of weights. A unit change in the genotype of a single included SNP changes $G_{ig}$ by one unit and is therefore reflected by a change of magnitude $e^{\beta_g}$ on the logit scale. The index is based on all SNPs tested in the gene, but each SNP has a different weight $W_{gs}$ with different prior probabilities; the details are explained in level 3 of the model. The exponentiation of the $\beta$s ensures that the effect of each gene's index will be positive, thereby avoiding the label-switching problem that would arise if the signs of $\beta_g$ and all the $W_{gs}$ were reversed for a given gene. This also avoids having to deal with truncated normal distributions, as would be required if $\beta_g$ were not exponentiated but instead constrained to be positive. (We call (1) Model I and briefly describe this alternative possibility (Model II) in Section 7.)
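As an illustration of (1), the following minimal Python sketch evaluates the linear predictor for one subject. The function names and the toy numbers are assumptions made here for illustration only; they are not taken from the authors' software.

```python
import math

def burden_index(weights, genotypes):
    """G(W_g, S_ig): sum over SNPs of weight (-1, 0, +1) times genotype."""
    return sum(w * s for w, s in zip(weights, genotypes))

def logit_risk(x, alpha, betas, weights_by_gene, genotypes_by_gene, offset=0.0):
    """Linear predictor of Model I: X'alpha + sum_g exp(beta_g) * G_ig + offset."""
    fixed = sum(xj * aj for xj, aj in zip(x, alpha))
    genetic = sum(
        math.exp(b) * burden_index(w, s)
        for b, w, s in zip(betas, weights_by_gene, genotypes_by_gene)
    )
    return fixed + genetic + offset

# Toy subject with one covariate and two genes (illustrative values only).
eta = logit_risk(
    x=[1.0], alpha=[0.2],
    betas=[0.0, math.log(0.5)],               # gene coefficients exp(beta) = 1 and 0.5
    weights_by_gene=[[+1, 0, -1], [+1, +1]],  # current SNP weights W_gs
    genotypes_by_gene=[[2, 1, 1], [0, 1]],    # minor-allele counts S_igs
)
```

Because the gene coefficient enters as exp(beta_g), reversing the sign of a gene's contribution requires reversing its SNP weights rather than the sign of beta_g, which is the point of the exponentiation described above.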

Level 2. The regression coefficients $\beta_g$ in the first level logistic regression model are given by the gene level of the hierarchical model:

$$\beta_g=Z_g'\pi+b_g+e_g,\tag{2}$$

where

$$\pi=(\pi_0,\dots,\pi_{N_Z})\sim N(0,V_\pi I),\qquad b=(b_1,\dots,b_{N_G})\sim N(0,\tau^2A),\qquad e=(e_1,\dots,e_{N_G})\sim N(0,\sigma^2I).\tag{3}$$

The level 2 model uses a simple linear regression to relate the regression coefficients $\beta$ from the level 1 model to external information on the genes’ involvement in certain pathways and the similarity of their effects. We incorporate information regarding prior predictions of the effects of each gene into the design matrix $Z$, here structured as a gene-by-pathway matrix of binary values, each indicating whether a gene is in a particular pathway. In other words, $Z$ contains second-stage covariates for each of the genetic factors. $\pi$ is a column vector of coefficients corresponding to these higher-level effects and is assigned an independent normal prior with mean 0 and covariance $V_\pi I$, where $I$ is the identity matrix. Prior information about gene-gene connections is incorporated in the $A$ matrix for the random effects $b$, which have a multivariate normal distribution centered at zero with covariance $\tau^2A$. The term $e$ is included as a residual error, also given a zero-mean independent normal distribution, with $\sigma^2$ specifying the residual variance not explained by the second-stage covariates.
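The structure of (2) and (3) can be sketched in a few lines of Python. Given $\pi$, the gene coefficients have mean $Z\pi$ and marginal covariance $\tau^2A+\sigma^2I$. The matrices below are small made-up examples (three genes, two pathway indicators plus an intercept column), not the GO-derived inputs used in the paper, and the function names are hypothetical.

```python
def prior_mean(Z, pi):
    """Second-stage mean Z_g' pi for every gene (Z includes an intercept column)."""
    return [sum(z * p for z, p in zip(row, pi)) for row in Z]

def prior_covariance(A, tau2, sigma2):
    """Marginal covariance of beta given pi: tau^2 * A + sigma^2 * I."""
    n = len(A)
    return [
        [tau2 * A[g][h] + (sigma2 if g == h else 0.0) for h in range(n)]
        for g in range(n)
    ]

# Hypothetical inputs: intercept plus two binary pathway indicators per gene.
Z = [[1, 1, 0],
     [1, 1, 1],
     [1, 0, 0]]
pi = [-2.0, 0.25, 0.5]
# Hypothetical gene-gene similarity matrix: genes 1 and 2 are closely related.
A = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

mean = prior_mean(Z, pi)                           # [-1.75, -1.25, -2.0]
cov = prior_covariance(A, tau2=0.25, sigma2=0.25)  # SDs of 0.5 for b and e
```

In this toy example the first two genes share a pathway and a large entry in $A$, so their coefficients are shrunk towards similar means and are positively correlated a priori, while the third gene is shrunk only towards the intercept.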

Level 3. The SNP-level model defines the deterministic functions $G(W_g,S_{ig})$, so that the burden index of each gene is uniquely determined by the SNP inclusion indicator variables $W_{gs}$. The $G_{ig}$ serve as a design matrix of genetic factors for the individuals within the study. In other words, the function serves as a risk index for each gene, computed as a weighted sum of SNP genotypes within the gene:

$$G(W_g,S_{ig})=\sum_{s=1}^{N_{Sg}}W_{gs}S_{igs},\tag{4}$$

where the weights $W_{gs}=-1$, $0$, or $+1$ have prior probabilities

$$\Pr(W_{gs}=d)=\begin{cases}\varphi_-\,\dfrac{\bar N_S+c}{N_{Sg}+c}, & d=-1,\\[2ex] 1-(\varphi_-+\varphi_+)\,\dfrac{\bar N_S+c}{N_{Sg}+c}, & d=0,\\[2ex] \varphi_+\,\dfrac{\bar N_S+c}{N_{Sg}+c}, & d=+1.\end{cases}\tag{5}$$

Here, $N_{Sg}$ denotes the number of SNPs in gene $g$ and $\bar N_S$ the average number of SNPs across all genes; we assigned $c$ to be the minimum number of SNPs within any gene. $\varphi_+$ and $\varphi_-$ represent the parameters of the prior probabilities for deleterious and protective SNP effects, respectively. This form of prior probabilities for the SNP indicator variables keeps the expected number of SNPs included in the model roughly similar across genes, so that genes with more SNPs have probabilities of being included similar to those of genes with fewer SNPs. For now, we treat the $\varphi$s as fixed parameters, but these too could be given hyperpriors and estimated.
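A short sketch may help to show how the prior in (5) behaves across genes of different sizes. It reads the scaling factor in (5) as the ratio of (average number of SNPs + c) to (number of SNPs in the gene + c), which is an interpretation of the printed formula; the function names and the gene sizes are illustrative assumptions.

```python
def weight_prior(n_snps_gene, n_snps_mean, c, phi_minus, phi_plus):
    """Prior probabilities of W_gs = -1, 0, +1 for one SNP, following (5)."""
    scale = (n_snps_mean + c) / (n_snps_gene + c)
    p_minus = phi_minus * scale
    p_plus = phi_plus * scale
    p_zero = 1.0 - (p_minus + p_plus)
    if p_zero < 0.0:
        raise ValueError("phi too large for this gene: probabilities exceed 1")
    return {-1: p_minus, 0: p_zero, +1: p_plus}

def expected_included(n_snps_gene, n_snps_mean, c, phi_minus, phi_plus):
    """Prior expected number of SNPs with nonzero weight in one gene."""
    prior = weight_prior(n_snps_gene, n_snps_mean, c, phi_minus, phi_plus)
    return n_snps_gene * (prior[-1] + prior[+1])

# Illustrative comparison of a small and a large gene (average size assumed 13).
small = expected_included(5, 13.0, 1, 0.05, 0.05)    # about 1.17 SNPs
large = expected_included(50, 13.0, 1, 0.05, 0.05)   # about 1.37 SNPs
```

Although the large gene has ten times as many SNPs, its prior expected number of included SNPs is only slightly higher, which is the balancing behavior described in the text.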

The posterior estimates for the association parameters resulting from the three-level hierarchical Bayesian analysis are an inverse-variance weighted average of the conventional estimates from the logistic regression alone and the estimated conditional second-stage means, $Z_g'\pi$. Between the maximum likelihood first-stage estimates and the second-stage prior estimates, the weights favor the one with the smaller variance. This intuitive weight adjustment is one of the important differences between the Bayesian hierarchical approach and single-stage logistic regression analysis.
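To make the weighting concrete, consider a single gene under a normal approximation. This is a simplification made here only for illustration, and the symbols $\hat\beta_g$, $v_g$, $m_g$, and $t_g^2$ are not part of the original notation. If the first-stage estimate $\hat\beta_g$ has variance $v_g$ and the second-stage prior has mean $m_g=Z_g'\pi$ and variance $t_g^2$, the standard normal-normal result gives

$$E(\beta_g\mid D)\approx\frac{\hat\beta_g/v_g+m_g/t_g^2}{1/v_g+1/t_g^2}.$$

For example, with $\hat\beta_g=0.8$, $v_g=0.4$, $m_g=0.2$, and $t_g^2=0.1$, the weights are $1/v_g=2.5$ and $1/t_g^2=10$, so the posterior mean is approximately $(2+2)/12.5=0.32$, which lies closer to the prior mean because the prior has the smaller variance.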

Finally, the variance components are given standard conjugate inverse gamma hyperprior distributions:

$$\sigma^2\sim IG(df_e,E),\qquad \tau^2\sim IG(df_b,B),\qquad V_\pi\sim IG(1,P).\tag{6}$$

3. Fitting the Model

The full model is fitted in a sequence of Markov chain Monte Carlo (MCMC) steps described in detail in the Appendix. Briefly, the weights $W_{gs}$ that determine which SNPs are included in each gene are sampled from their full conditional distributions one at a time; this involves an approximation to the change in the corresponding estimate of $\beta_g$, and hence in the likelihood, that would result from adding or deleting that SNP. The gene-level regression coefficients $\beta_g$ and correlated random effects $b_g$ are updated by Metropolis-Hastings moves for the entire $\beta$ and $b$ vectors, conditional on the current SNPs in the model, the prior covariates $Z_g$, and the gene-gene correlation matrix $A$, using a multivariate normal proposal. The second-level coefficients $\pi$ and the independent and correlated variances $\sigma^2$ and $\tau^2$ are then updated in further sampling steps. Updating the coefficients $\alpha$ of the fixed covariates involves only a standard update for logistic regression.

4. Posterior Summarization

Instead of parameter estimation, we focus primarily on hypothesis testing and model selection. We use Bayes factors (BF) at both the SNP level and the gene level to compare the posterior odds of a pair of hypotheses, given the data, with their prior odds. Kass and Raftery [27] suggest a qualitative interpretation of BF > 3 (or, roughly equivalently, $2\ln(\mathrm{BF})>2$) as providing “positive” evidence, >20 as “strong” evidence, and >150 as “very strong” evidence.

We tabulate the following quantities, where $D$ denotes the ensemble of all the data.

(i) For each SNP, the posterior probability of $W_{gs}=-1,0,+1$ and the Bayes factor

$$BF_{gs}=\frac{\Pr(W_{gs}\neq0\mid D)}{\Pr(W_{gs}=0\mid D)}\div\frac{(\varphi_-+\varphi_+)/(N_{Sg}+c)}{1-(\varphi_-+\varphi_+)/(N_{Sg}+c)},\tag{7}$$

where the first factor is the ratio of posterior probabilities that SNP $s$ in gene $g$ has any effect (positive or negative) versus no effect given the data $D$ and the second factor is the corresponding ratio of prior probabilities.

(ii) For each gene, the Bayes factor for the probability that at least one SNP is included in the model is

$$BF_g=\frac{1-\Pr(W_g\equiv0\mid D)}{\Pr(W_g\equiv0\mid D)}\div\frac{1-\left(1-\dfrac{\varphi_-+\varphi_+}{N_{Sg}+c}\right)^{N_{Sg}}}{\left(1-\dfrac{\varphi_-+\varphi_+}{N_{Sg}+c}\right)^{N_{Sg}}}.\tag{8}$$

We also tabulate the posterior means and standard deviations of each, along with the mean number of SNPs included in the model.

(iii) For the other parameters, $\alpha$, $\beta$, $\pi$, $\sigma^2$, and $\tau^2$, we simply tabulate the posterior means and SDs.

(iv) Finally, we tabulate the posterior distributions of numbers of SNPs and numbers of genes with at least one SNP included in the model.
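The SNP-level Bayes factor in (7) can be computed directly from the retained MCMC draws of a weight $W_{gs}$. The sketch below assumes the draws are available as a plain list of values in {-1, 0, +1}; the function names and the example draws are hypothetical, and the evidence labels follow the Kass and Raftery thresholds quoted in Section 4.

```python
def snp_bayes_factor(w_samples, n_snps_gene, c, phi_minus, phi_plus):
    """BF_gs of (7): posterior odds of W_gs != 0 divided by the prior odds."""
    post_incl = sum(1 for w in w_samples if w != 0) / len(w_samples)
    if post_incl in (0.0, 1.0):
        raise ValueError("posterior odds undefined: SNP never or always included")
    prior_incl = (phi_minus + phi_plus) / (n_snps_gene + c)
    post_odds = post_incl / (1.0 - post_incl)
    prior_odds = prior_incl / (1.0 - prior_incl)
    return post_odds / prior_odds

def evidence_label(bf):
    """Qualitative reading of a Bayes factor using the thresholds 3, 20, and 150."""
    if bf > 150:
        return "very strong"
    if bf > 20:
        return "strong"
    if bf > 3:
        return "positive"
    return "below positive"

# Hypothetical draws for one SNP in a gene with 9 SNPs: included in 20 of 100 scans.
draws = [0] * 80 + [+1] * 15 + [-1] * 5
bf = snp_bayes_factor(draws, n_snps_gene=9, c=1, phi_minus=0.05, phi_plus=0.05)
label = evidence_label(bf)
```

A SNP that is never (or always) included in the retained scans has an undefined posterior odds, so in practice longer chains or a small continuity correction would be needed before reporting its Bayes factor.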

5. Simulation Studies

We conducted simulation studies based on the structure of the real WECARE study data described below. Specifically, we used the real SNP, covariate, and counter-matching offset data for each risk set and reassigned case/control status in each risk set based on an assumed relative risk model. We used the estimated values of the coefficients $\alpha$ for the fixed covariates and randomly assigned weights $W_{gs}$ to SNPs and log relative risk coefficients $\beta_g$ to each gene under the models described above. There were a total of 504 SNPs in 38 genes (ranging from 1 to 51 SNPs per gene) involved in DNA damage response pathways (DNA repair, cell cycle checkpoint control, and apoptosis). Using the Gene Ontology, we extracted 860 terms relating to biological process or molecular function annotated to any of these 38 genes and selected four of these GO terms as prior covariates in the $Z$ matrix (specifically, DNA damage checkpoint, MRE11 complex, double-strand break repair via nonhomologous end joining, and negative regulation of cell cycle), with $\pi=0.25$, $0.5$, $0.75$, and $1$, respectively; the intercept $\pi_0$ was set to $-2$. All 860 GO terms were used to construct a correlation matrix $A$ for the similarity in the ways each pair of genes was described in the GO (Figure 2). The log relative risk coefficients $\beta_g$ were assigned with mean $Z_g'\pi$ and with SDs of $b_g$ and $e_g$ of $\tau=\sigma=0.5$. SNP weights $W_{gs}$ were assigned with $\varphi_-=\varphi_+=0.05$ and $c=1$. The resulting gene indices $G_g(W,S)$ and the corresponding $\beta_g$, along with the real $X_i$ and estimated $\alpha$ coefficients and offset terms, were then used to compute each subject’s relative risk and randomly assign which member of each risk set would be designated as the case. The estimates are based on 10 replicate data sets for each of 10 realizations of the $W_{gs}$ and $\beta_g$ from these model parameters, using 1000 MCMC scans for tabulation after a burn-in of 500 scans. This yielded a total of 32 causal SNPs in 24 of the genes on average.

Table 1 summarizes the posterior probabilities for SNP and gene inclusion, along with the proportion of SNPs and genes with BFs greater than 3, 20, and 150. Although the differences between null and causal SNPs and genes are somewhat modest, there is a clear shift in both the posterior probabilities and the Bayes factors in the appropriate directions.
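The case reassignment step of the simulation can be sketched as follows. The sketch assumes that, within each risk set, the probability of being designated the case is proportional to the subject's relative risk exp(eta_i), where eta_i is the linear predictor including the offset; this is the usual conditional-logistic reading of the description above, and the function names are hypothetical.

```python
import math
import random

def case_probabilities(linear_predictors):
    """Probability that each risk-set member is the case, proportional to exp(eta_i)."""
    top = max(linear_predictors)                     # subtract the max for stability
    rr = [math.exp(eta - top) for eta in linear_predictors]
    total = sum(rr)
    return [r / total for r in rr]

def assign_case(linear_predictors, rng):
    """Randomly pick the index of the case within one risk set."""
    probs = case_probabilities(linear_predictors)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy risk set of two subjects whose relative risks differ by a factor of 3.
rng = random.Random(2013)
probs = case_probabilities([0.0, math.log(3.0)])     # approximately [0.25, 0.75]
case_index = assign_case([0.0, math.log(3.0)], rng)
```

Repeating the assignment over all risk sets, with the real genotypes and covariates held fixed, gives one simulated data replicate.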

Table 1

Simulation analysis based on 10 parameter replicates with 10 data replicates per parameter replicate.

(a)

True SNP weight	Average count^a	Posterior Pr(W = −1)^b	Posterior Pr(W = 0)^b	Posterior Pr(W = +1)^b	BF > 3^c	BF > 20^c	BF > 150^c
−1	17.5	24.14%	71.75%	4.11%	25.54%	17.49%	12.46%
0	348.1	3.19%	93.76%	3.05%	3.90%	0.68%	0.19%
1	18.4	3.88%	70.19%	25.94%	28.15%	19.13%	15.54%

(b)

True gene status	Average count^d	Posterior: not included^e	Posterior: included^e	BF > 3^f	BF > 20^f	BF > 150^f
Not included	13.9	55.95%	44.05%	3.67%	0.58%	0.15%
Included	24.1	36.55%	63.45%	27.71%	20.01%	17.14%

a Average counts of simulated SNP inclusion indicators based on 10×10 replicates.

b Average row percentages of the distribution of posterior SNP inclusion indicators based on 10×10 replicates.

c Average row percentages of the SNP counts among the range of the indicated Bayes factors based on 10×10 replicates.

d Average counts of simulated gene inclusion indicators based on 10×10 replicates.

e Average row percentages of the distribution of posterior gene inclusions based on 10×10 replicates.

f Average row percentages of the gene counts among the range of the indicated Bayes factors based on 10×10 replicates.

Figure 2

Graphical representation of the A matrix derived from the Gene Ontology. The lower levels of the graph indicate sets of genes with high correlations across the 860 GO terms.

6. Application to the WECARE Study Data

Using the same settings as for the simulation studies, we analyzed the real WECARE study data, except that 10,000 scans were retained after a burn-in of 4,000 iterations. The posterior distributions of numbers of genes with at least one SNP included and numbers of SNPs included are shown in Figures 3(a) and 3(b). An average of 10 SNPs in 9 genes was included in the model. Figure 4 shows the posterior probabilities (a) and Bayes factors for each of the genes (b) and SNPs (c). At the gene level, only MDC1 and RAD51 were included with substantial Bayes factors of 20.71 (“strong evidence”) and 3.51 (“positive evidence”), respectively, while ATM and NBN were identified only with BFs between 1 and 3. In this analysis, the known deleterious variants in ATM, BRCA1, BRCA2, and CHEK2 were treated as fixed covariates rather than being lumped in with the other tag SNPs. None of the four GO terms selected as prior covariates contributed significantly to the model, the strongest being DNA damage checkpoint ($\pi=-0.15$, SE = 0.27). The correlated variance was $\tau^2=0.25$ and the independent variance was $\sigma^2=0.16$, suggesting moderately strong residual gene-gene similarities (spatiality $\tau^2/(\sigma^2+\tau^2)=0.25/0.41=61\%$) defined by the ensemble of all GO terms and not explained by the regression of the $\beta$s on the subset of selected GO terms.

Figure 3

Posterior distributions of numbers of genes (a) and numbers of SNPs (b) included in the analysis of the WECARE study data.

Figure 4

Posterior probabilities (a) and Bayes factors for gene inclusion (b) and SNP inclusion (c) in the model for the real WECARE study data.

Table 2 lists the genotype counts (homozygous for the reference allele, heterozygous, and homozygous for the risk allele) for cases (CBC) and controls (UBC), respectively, for all the SNPs identified by our models and by a previous WECARE publication [25]. We also report the estimated ln RRs from simple logistic regression for each selected SNP, adjusted for the same set of covariates (age, menarche, menopause, family history, pregnancy, histology, treatment, the FGFR2 GWAS-identified SNP, deleterious variants in ATM, BRCA1, BRCA2, and CHEK2, and the offset term) as in our model. The logistic regression found SNPs rs4713354 and rs2269705 in MDC1 to be strongly associated with CBC risk (P<0.001); SNPs rs1800057, v_IVS14 m55, rs13447682, rs3736640, and rs1801320 had statistically significant protective effects (P<0.05), and rs6005861 and rs9297757 had marginally significant protective effects (P<0.1).

Table 2

Association between selected variants in DNA-damage response genes and CBC risk in the WECARE study.

Gene	rs number	Hom. reference, case (CBC)	Hom. reference, control (UBC)	Heterozygous, case (CBC)	Heterozygous, control (UBC)	Hom. risk allele, case (CBC)	Hom. risk allele, control (UBC)	ln RR^c (95% CI)	P value^d	BF SNP	BF gene
ATM	rs1800057^a	680	1322	28	76	0	1	−0.47 (−0.95, −0.01)	0.046	4.58	1.41
ATM	rs4987951^a	674	1278	34	121	0	0	−0.66 (−1.32, −0.25)	0.002	9.04	1.41
CHEK2	rs6005861^a,b	680	1311	27	86	1	2	−0.40 (−0.85, 0.06)	0.086	7	0.36
MDC1	rs4713354^a,b	535	1116	157	267	16	16	0.47 (0.26, 0.68)	<0.001	9.72	20.71
MDC1	rs2269705^a	589	1220	113	175	6	4	0.50 (0.25, 0.76)	<0.001	15.91	20.71
MRE11A	rs13447682^a,b	690	1343	18	54	0	2	−0.56 (−1.12, −0.01)	0.046	5.7	0.52
NBN	rs14448^b	640	1215	60	171	8	13	−0.11 (−0.40, 0.18)	0.447	0.2	2.62
NBN	rs9297757^a,b	651	1233	52	148	5	18	−0.26 (−0.58, 0.05)	0.097	27.33	2.62
NBN	rs3736640^a,b	676	1288	32	107	0	4	−0.64 (−1.27, −0.21)	0.003	4.14	2.62
RAD51	rs1801320^a	646	1209	58	186	4	4	−0.31 (−0.62, 0.00)	0.048	21.38	3.51

a SNPs identified by Model I based on Bayes factors. Only those SNPs with BF exceeding 3 are listed.

b SNPs identified by Brooks et al. 2012 [25] based on per-allele RR. Only those SNPs with P value for trend <0.05 are listed.

c ln RR: regression coefficients of each SNP from simple logistic regression, adjusted for age, menarche, menopause, family history, pregnancy, histology, treatment, the FGFR2 GWAS-identified SNP, deleterious variants in ATM, BRCA1, BRCA2, and CHEK2, and the offset term.

d P values from the Wald z test for the ln RR estimates from simple logistic regression, adjusted for the fixed covariates listed in footnote c.

Table 2 also shows the SNP Bayes factors, based on which our model selected a total of nine SNPs with positive to strong evidence for disease association. Two SNPs (one in NBN and one in RAD51) were identified with strong evidence (BF > 20) and seven SNPs from five genes (ATM, CHEK2, MDC1, MRE11A, and NBN) with positive evidence (BF > 3). In a prior study by the WECARE Study Collaborative Group, 134 common variants in six DNA damage response genes (CHEK2, MRE11A, MDC1, NBN, RAD50, and TP53BP1) were tested separately or within haplotypes for association with CBC risk [25]. Six SNPs were reported to be associated with CBC risk with P<0.05, but none remained statistically significantly associated after correction for multiple comparisons. Five of those six SNPs reported by Brooks et al. (rs6005861 in CHEK2, rs4713354 in MDC1, rs13447682 in MRE11A, and rs9297757 and rs3736640 in NBN) were selected by our model as showing positive or strong evidence of association with CBC risk. The remaining SNP (rs14448 in NBN) reported by Brooks et al. was not statistically significantly associated with CBC in the logistic regression (P=0.447). All the SNPs reported by Brooks et al. except rs4713354 in MDC1 were found to have protective effects in the log-additive model. The same direction of effect was also found for each SNP in the logistic regression. In addition, our model shows positive evidence of association with CBC risk for SNP rs1800057, a variant in ATM, which was previously shown to be associated with a statistically significant reduction in CBC risk [28] in the WECARE study. Its protective effect was also found in the logistic regression (ln RR = −0.47, P=0.046).

Seven of the nine SNPs selected by our model have been found associated with breast cancer risk in previous investigations. Besides the six SNPs reported in the previous WECARE study, rs1801320 (135G > C), a SNP in the 5′-untranslated region (UTR) of the RAD51 gene, was found with mixed results for its role in breast cancer risk from other breast cancer risk studies [29–31]. In addition to those previously reported SNPs, our model selected rs4987951 in ATM and rs2269705 in MDC1, about which we found no previous reports of association with breast cancer.

7. Discussion

Our model is motivated in part by ongoing work on methods for testing associations with multiple rare variants in next generation sequencing data [12, 22], for which it is obvious that attaining statistically significant results for any single variant is difficult because of their rarity and the enormous multiple comparisons penalty. This motivates our choice of a burden index for gene-level associations comprising simple −1/0/+1 weights with model averaging across their uncertainty distribution. For common variants with minor allele frequencies (MAF) >5% (and perhaps in candidate gene studies for uncommon variants with 1% < MAF < 5%), it may be possible to allow each SNP to have its own regression coefficient from some continuous distribution, but constraints would be needed to ensure identifiability if both SNP- and gene-level parameters were to be estimated.

As a compromise, we have treated the known deleterious variants in ATM, BRCA1/2, and CHEK2 as fixed covariates, along with age, treatment, reproductive variables, and so forth, since it seems unreasonable to consider these variants as exchangeable with the tagging SNPs. Unfortunately, this precludes borrowing strength across all the variants within these genes: given that we know that some variants in these genes are deleterious, it would seem more likely that there would be other causal variants in the same genes. Furthermore, if these four genes have similar prior covariate values $Z_g$, that should inform the estimation of the corresponding $\pi$ coefficients and draw the estimates of the $\beta$s for other genes that are highly correlated with them in the $A$ matrix towards the $\beta_g$ values for these genes.

We have included prior information only on genes, not SNPs, in our model, since the GO does not provide any annotation of specific variants within genes. However, there are many ways of classifying SNPs a priori, such as simple indicators for whether they are coding or noncoding variants or the predictions of programs like SIFT [32] and PolyPhen [33] based on predicted effects on protein conformation or evolutionary conservation. Such information could easily be incorporated into a multinomial logistic or probit model for the inclusion probabilities $\varphi_s$ [12, 22]. The current version of our program treats $\varphi_+$ and $\varphi_-$ as fixed constants, but these could simply be assigned prior Beta distributions, subject to the constraint that $\varphi_++\varphi_-<1$.

In addition to the model described above (Model I), we considered an alternative Model II with a similar structure, except that the gene log RR coefficients $\beta_g$ are not exponentiated:

$$\operatorname{logit}\Pr(Y_i=1)=X_i'\alpha+\sum_{g=1}^{N_G}\beta_gG(W_g,S_{ig})+\mathrm{offset}_i,\qquad\beta_g\geq0.\tag{9}$$

To ensure that the $\beta_g$ are nonnegative, the second level of the hierarchical model takes the following form:

$$\Pr(\beta_g)=\begin{cases}\phi\!\left(\dfrac{\beta_g-Z_g'\pi}{\sigma}\right), & \beta_g>0,\\[2ex]\Phi\!\left(-\dfrac{Z_g'\pi}{\sigma}\right), & \beta_g=0,\end{cases}\tag{10}$$

where $\phi$ denotes the probability density function and $\Phi$ the cumulative distribution function of the normal distribution. This is a proper distribution for $\beta_g$, since it integrates to one. The third level of Model II remains the same as in Model I. Model fitting is similar to that for Model I except for some details in updating the $\beta_g$s and $\pi$s.
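One way to read (10) is as a normal distribution censored at zero: a point mass at $\beta_g=0$ equal to the normal probability of a nonpositive value, and the normal density for $\beta_g>0$. The sketch below simulates from that reading and checks the point mass by Monte Carlo. The censoring interpretation and the function names are assumptions made for illustration, not a description of the authors' sampler.

```python
import math
import random

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_beta_zero(prior_mean, sigma):
    """Point mass at beta_g = 0 in (10): Phi(-Z_g'pi / sigma)."""
    return normal_cdf(-prior_mean / sigma)

def draw_beta(prior_mean, sigma, rng):
    """Draw beta_g by censoring a N(prior_mean, sigma^2) variate at zero."""
    return max(0.0, rng.gauss(prior_mean, sigma))

# Monte Carlo check with a hypothetical prior mean of 0.5 and sigma of 0.5.
rng = random.Random(1)
samples = [draw_beta(0.5, 0.5, rng) for _ in range(20000)]
frac_zero = sum(1 for b in samples if b == 0.0) / len(samples)
exact_zero = prob_beta_zero(0.5, 0.5)   # Phi(-1)
```

Under this reading, a gene whose prior mean is well below zero has most of its prior mass at $\beta_g=0$, so Model II performs a form of gene-level selection in addition to the SNP-level selection.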

In the simulations, Model II yielded a total of 47 causal SNPs in 25 of the genes on average. Model I showed higher sensitivity and specificity for SNP selection (Table 2) than Model II based on both posterior SNP inclusion and SNP BFs. Model II showed a higher sensitivity for gene selection than Model I based on the posterior gene inclusion, but a lower specificity. In addition, Model I showed a higher sensitivity based on gene BFs.

In the application to WECARE data, Model II identified 5 SNPs in genes MDC1, NBN, and RAD51, with positive evidence for disease association (BF > 3). Four (rs4713354, rs2269705, rs9297757, and rs1801320) of the five selected SNPs are in common with Model I, two (rs4713354, rs9297757) are in common with Brooks et al. [25], and one (rs11620361) is not in common with previous methods. One gene (MDC1) was selected with positive association based on gene-level Bayes factors (BF = 6). Both the simulation study and real data application suggested that Model I performs better than Model II in terms of selecting causal variants.

We have extended the model to incorporate gene-environment (G×E) interactions with radiotherapy or radiation dose, since the focus of the WECARE study is on these genes acting in response to the DSB damage induced by radiotherapy exposure. Extending the model to incorporate G×E interactions is straightforward: the main effect of E and an additional vector of interaction terms (with coefficients $\delta$) are added to the subject-level model, and a similar prior is placed on the interaction coefficients. For the time being, we have treated the $\beta$s and $\delta$s as independent, but a more appealing approach would be to treat them as having bivariate normal distributions depending on $Z$ and $A$. No significant G×E interactions were found in this model (results not shown).

It remains to be seen whether this approach is scalable to GWAS data. As currently implemented with MCMC methods, the approach would not be computationally feasible, even with parallel implementations on high-performance computing environments. However, work in progress (Quintana et al. [11, 12, 22]) suggests that analytic approximations may be possible that would obviate the need for MCMC methods.

Appendix Model Fitting

At each iteration, the following updates are performed.

Selection of SNPs to include in the model involves evaluating the three posterior probabilities for $d\in\{-1,0,+1\}$ and sampling $W_{gs}$ with the corresponding probability

$$[W_{gs}=d\mid Y,S,W;\varphi]\propto\big[Y\mid\{G_g(W_{gs}=d,W_{g(-s)},S_g),G_{-g}\};\{\beta_{gsd},\beta_{-g}\}\big]\times[W_{gs}=d\mid\varphi_d,N_{Sg}],\tag{A.1}$$

where $\beta_{gsd}$ is a single Newton step iteration towards the maximum likelihood estimate (MLE) of $\beta_g$ if $W_{gs}$ were set to $d$.
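The sampling step in (A.1) amounts to normalizing likelihood times prior over the three candidate weights. The sketch below takes the three log-likelihood values as given inputs; in the actual algorithm they would be evaluated at the Newton-step estimate of the gene coefficient for each candidate weight, which is not reproduced here. The function names and numbers are hypothetical.

```python
import math
import random

def weight_conditional(log_liks, priors):
    """Normalize likelihood x prior over d in (-1, 0, +1), as in (A.1)."""
    log_post = {d: log_liks[d] + math.log(priors[d]) for d in (-1, 0, 1)}
    top = max(log_post.values())                 # subtract the max for stability
    unnorm = {d: math.exp(v - top) for d, v in log_post.items()}
    total = sum(unnorm.values())
    return {d: u / total for d, u in unnorm.items()}

def sample_weight(log_liks, priors, rng):
    """Draw a new W_gs from its (approximate) full conditional distribution."""
    probs = weight_conditional(log_liks, priors)
    return rng.choices([-1, 0, 1], weights=[probs[d] for d in (-1, 0, 1)], k=1)[0]

# Hypothetical log-likelihoods: including the SNP with weight +1 fits best.
log_liks = {-1: -105.0, 0: -103.0, 1: -100.0}
priors = {-1: 0.05, 0: 0.90, 1: 0.05}
probs = weight_conditional(log_liks, priors)
new_weight = sample_weight(log_liks, priors, random.Random(7))
```

In this toy example a gain of three log-likelihood units is just enough to outweigh the prior preference for exclusion, so the weights +1 and 0 receive comparable conditional probability.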

Update the vector of regression coefficients $\beta$ using a multivariate Metropolis-Hastings move with proposal $\beta'\sim\mathrm{MVN}(\beta,\delta_\beta I)$ and acceptance probability

$$\min\left\{\frac{p(Y\mid G(S,W),\beta')\,p(\beta'\mid Z\pi+b,\sigma^2I)}{p(Y\mid G(S,W),\beta)\,p(\beta\mid Z\pi+b,\sigma^2I)},1\right\}.\tag{A.2}$$

Update the vector of random effects $b$ with a similar Metropolis-Hastings move with acceptance probability

$$\min\left\{\frac{p(\beta\mid Z\pi+b',\sigma^2I)\,p(b'\mid\tau^2A)}{p(\beta\mid Z\pi+b,\sigma^2I)\,p(b\mid\tau^2A)},1\right\}.\tag{A.3}$$

Note that an alternative possibility would be to sample $\beta$ from its marginal distribution

$$[\beta\mid Z,\pi,A,Y,S,W,\sigma^2]\propto[Y\mid G(W,S);\beta]\times[\beta\mid Z\pi,\sigma^2I+\tau^2A]\tag{A.4}$$

and omit the update of the $b$s.
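A random-walk Metropolis-Hastings move of the kind used for $\beta$ in (A.2) can be sketched generically. Because the normal proposal is symmetric, the acceptance probability reduces to the ratio of target densities. The target below is a stand-in (independent normal terms) rather than the logistic likelihood times the level 2 prior, and the function names and tuning values are assumptions for illustration.

```python
import math
import random

def mh_update(beta, log_target, step_sd, rng):
    """One random-walk Metropolis-Hastings move for the whole beta vector.

    The proposal beta' ~ N(beta, step_sd^2 I) is symmetric, so the move is
    accepted with probability min{target(beta') / target(beta), 1}.
    """
    proposal = [b + rng.gauss(0.0, step_sd) for b in beta]
    log_ratio = log_target(proposal) - log_target(beta)
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return proposal, True
    return beta, False

def toy_log_target(beta, mean=1.0, sd=0.5):
    """Stand-in for log-likelihood + log-prior: independent N(mean, sd^2) terms."""
    return -0.5 * sum(((b - mean) / sd) ** 2 for b in beta)

rng = random.Random(0)
beta = [0.0, 0.0]
draws, accepted = [], 0
for scan in range(5000):
    beta, ok = mh_update(beta, toy_log_target, step_sd=0.5, rng=rng)
    accepted += ok
    if scan >= 500:                 # discard burn-in scans before tabulating
        draws.append(beta[0])
posterior_mean = sum(draws) / len(draws)
acceptance_rate = accepted / 5000
```

In the full model, `log_target` would be replaced by the sum of the log-likelihood in (1) and the log of the normal prior density for $\beta$ given $Z\pi+b$ and $\sigma^2$, and the step size would be tuned to give a reasonable acceptance rate.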

Update the prior regression coefficients $\pi$ by a simple linear regression, sampling from a multivariate normal distribution centered at its MLE,

$$[\pi \mid \beta, Z, b, \sigma^2] \propto [\beta \mid Z'\pi + b, \sigma^2 I]\,[\pi \mid 0, V_\pi I]. \tag{A.5}$$

Update the variances $\sigma^2$ and $\tau^2$ using a Metropolis-Hastings move with proposals $\ln(\sigma') \sim N(\ln(\sigma), \delta_\sigma)$ and similarly for $\tau$, with acceptance probabilities based on

$$[\sigma, \tau \mid \beta, Z, \pi, A] \propto [\beta \mid Z'\pi, \sigma^2 I + \tau^2 A]\,[\sigma^2]\,[\tau^2]. \tag{A.6}$$

As noted above, we treat the $\varphi$s as fixed, but these too could be given prior distributions and estimated.
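The log-scale proposal for the standard deviation can be illustrated with a simplified sketch. To avoid matrix algebra, it conditions on the random effects, so that the residuals of the coefficients about their prior mean are independent normal with common variance. This is a simplification of the marginal form in (A.6), not the form itself. The prior on the standard deviation is passed in as a hypothetical function, `delta_sigma` is interpreted as the proposal variance, and all names are assumptions.

```python
import math
import random


def log_target_log_sigma(theta, resid, log_prior_sigma):
    """Log density of theta = ln(sigma), up to a constant, when the residuals
    are independent N(0, sigma^2)."""
    sigma = math.exp(theta)
    n = len(resid)
    log_lik = -n * theta - 0.5 * sum(r * r for r in resid) / sigma ** 2
    # The trailing '+ theta' is the log Jacobian of the map sigma = exp(theta).
    return log_lik + log_prior_sigma(sigma) + theta


def mh_update_sigma(sigma, resid, delta_sigma, log_prior_sigma, rng):
    """One Metropolis-Hastings step with a normal random walk on ln(sigma)."""
    theta = math.log(sigma)
    theta_new = rng.gauss(theta, math.sqrt(delta_sigma))
    log_ratio = (log_target_log_sigma(theta_new, resid, log_prior_sigma)
                 - log_target_log_sigma(theta, resid, log_prior_sigma))
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return math.exp(theta_new), True
    return sigma, False


def flat_prior(sigma):
    """Hypothetical flat prior on sigma, chosen only for illustration."""
    return 0.0


# Invented residuals of the coefficients about their prior mean.
resid_toy = [0.3, -0.2, 0.5, -0.4]
rng_toy = random.Random(3)
sigma_chain = [1.0]
for _ in range(200):
    next_sigma, _ = mh_update_sigma(sigma_chain[-1], resid_toy, 0.09, flat_prior, rng_toy)
    sigma_chain.append(next_sigma)
```

Proposing on the log scale keeps the standard deviation positive without rejection at a boundary. The Jacobian term is what makes the symmetric random walk on the log scale target the intended density on the original scale.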

The coefficients ($\alpha$) of subject-level confounders are updated using a single Newton-Raphson iteration towards the MLE of $\alpha$, followed by a random multivariate normal draw to sample the new $\alpha$. The procedure is based on the approximation that the log-likelihood for $\alpha$ is quadratic, with flat priors.
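For a single confounder, the Newton-Raphson step followed by a normal draw can be sketched as below. The scalar case, the logistic likelihood, and the `offset` argument (standing in for the rest of the linear predictor) are assumptions made to keep the example free of matrix algebra. In the multivariate case the information would be a matrix and the draw multivariate normal.

```python
import math
import random


def sigmoid(eta):
    """Numerically stable logistic function."""
    if eta >= 0.0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def newton_then_sample_alpha(alpha, x, offset, y, rng):
    """Take one Newton-Raphson step towards the MLE of a scalar alpha, then draw
    the new alpha from a normal centred at that step with variance 1/information."""
    score = 0.0
    info = 0.0
    for xi, oi, yi in zip(x, offset, y):
        p = sigmoid(oi + alpha * xi)
        score += xi * (yi - p)           # first derivative of the log-likelihood
        info += xi * xi * p * (1.0 - p)  # minus the second derivative
    alpha_star = alpha + score / info
    alpha_new = rng.gauss(alpha_star, math.sqrt(1.0 / info))
    return alpha_new, alpha_star, info


# Invented toy data: an intercept-only confounder with no offset.
x_toy = [1.0, 1.0, 1.0, 1.0]
offset_toy = [0.0, 0.0, 0.0, 0.0]
y_cc = [1, 0, 1, 0]
alpha_draw, alpha_center, alpha_info = newton_then_sample_alpha(
    0.0, x_toy, offset_toy, y_cc, random.Random(4))
```

Under a flat prior and a quadratic log-likelihood, the normal centred at the Newton step with variance equal to the inverse information is the conditional posterior, so the draw needs no accept/reject step under that approximation.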

Acknowledgments

The authors greatly appreciate valuable methodological suggestions from Melanie Quintana and David Conti. This work was supported by NIH Grants R01-ES019876, P30-ES07048, R01-CA112450, R01-ES016813, R01-CA129639, R01-MH084678, and R01-HG005927. The authors are grateful to the WECARE investigators for providing the data used for the application.

Bernstein

J. L.

Langholz

Haile

R. W.

Bernstein

Thomas

D. C.

Stovall

Malone

K. E.

Lynch

C. F.

Olsen

J. H.

Anton-Culver

Shore

R. E.

Boice

J. D.

Jr. Berkowitz

G. S.

Gatti

R. A.

Teitelbaum

S. L.

Smith

S. A.

Rosenstein

B. S.

Børresen-Dale

Concannon

Thompson

W. D.

Study design: evaluating gene-environment interactions in the etiology of breast cancer-the WECARE study

Breast Cancer Research 2004 6 3 R199 R214

2-s2.0-3142733451

Langholz

Thomas

D. C.

Stovall

Smith

S. A.

Boice

J. D.

Jr. Shore

R. E.

Bernstein

Lynch

C. F.

Zhang

Bernstein

J. L.

Statistical methods for analysis of radiation effects with tumor and dose location-specific information with application to the wecare study of asynchronous contralateral breast cancer

Biometrics 2009 65 2 599 608

2-s2.0-66949117323

10.1111/j.1541-0420.2008.01096.x

Stovall

Smith

S. A.

Langholz

B. M.

Boice

J. D.

Jr. Shore

R. E.

Andersson

Buchholz

T. A.

Capanu

Bernstein

Lynch

C. F.

Malone

K. E.

Anton-Culver

Haile

R. W.

Rosenstein

B. S.

Reiner

A. S.

Thomas

D. C.

Bernstein

J. L.

Dose to the contralateral breast from radiotherapy and risk of second primary breast cancer in the WECARE study

International Journal of Radiation Oncology Biology Physics 2008 72 4 1021 1030

2-s2.0-54049101989

10.1016/j.ijrobp.2008.02.040

Bernstein

J. L.

Teraoka

Southey

M. C.

Jenkins

M. A.

Andrulis

I. L.

Knight

J. A.

John

E. M.

Lapinski

Wolitzer

A. L.

Whittemore

A. S.

West

Seminara

Olson

E. R.

Spurdle

A. B.

Chenevix-Trench

Giles

G. G.

Hopper

J. L.

Concannon

Population-based estimates of breast cancer risks associated with ATM gene variants c.7271T > G and c.1066-6T > G (IVS10-6T > G) from the breast cancer family registry

Human Mutation 2006 27 11 1122 1128

2-s2.0-33750904243

10.1002/humu.20415

Concannon

Haile

R. W.

Børresen-Dale

A. L.

Rosenstein

B. S.

Gatti

R. A.

Teraoka

S. N.

Diep

A. T.

Jansen

Atencio

D. P.

Langholz

Capanu

Liang

Begg

C. B.

Thomas

D. C.

Bernstein

Olsen

J. H.

Malone

K. E.

Lynch

C. F.

Anton-Culver

Bernstein

J. L.

Variants in the ATM gene associated with a reduced risk of contralateral breast cancer

Cancer Research 2008 68 16 6486 6491

2-s2.0-53049104973

10.1158/0008-5472.CAN-08-0134

Langholz

Bernstein

J. L.

Bernstein

Olsen

J. H.

Børresen-Dale

Rosenstein

B. S.

Gatti

R. A.

Concannon

On the proposed association of the ATM variants 5557G>A and IVS38-8T>C and bilateral breast cancer

International Journal of Cancer 2006 119 3 724 725

2-s2.0-33745474962

10.1002/ijc.21876

Begg

C. B.

Haile

R. W.

Borg

Å.

Malone

K. E.

Concannon

Thomas

D. C.

Langholz

Bernstein

Olsen

J. H.

Lynch

C. F.

Anton-Culver

Capanu

Liang

Hummer

A. J.

Sima

Bernstein

J. L.

Variation of breast cancer risk among BRCA1/2 carriers

Journal of the American Medical Association 2008 299 2 194 201

2-s2.0-38049171118

10.1001/jama.2007.55-a

Borg

Haile

R. W.

Malone

K. E.

Capanu

Diep

Törngren

Teraoka

Begg

C. B.

Thomas

D. C.

Concannon

Mellemkjaer

Bernstein

Tellhed

Xue

Olson

E. R.

Liang

Dolle

Børresen-Dale

Bernstein

J. L.

Reiner

A. S.

Layne

T. M.

Donnelly-Allen

Olsen

J. H.

Andersson

Bertelsen

Guldberg

Epstein

Boice

J. D.

Jr. Seminara

Shore

R. E.

Jansen

Anton-Culver

Largent

Lynch

C. F.

DeWall

Langholz

B. M.

Zhou

Diep

A. T.

Ter-Karapetova

Thompson

W. D.

Stovall

Smith

Ramchurren

Characterization of BRCA1 and BRCA2 deleterious mutations and variants of unknown clinical significance in unilateral and bilateral breast cancer: the WECARE study

Human Mutation 2010 31 3 E1200 E1240

2-s2.0-77149138300

10.1002/humu.21202

Capanu

Concannon

Haile

R. W.

Bernstein

Malone

K. E.

Lynch

C. F.

Liang

Teraoka

S. N.

Diep

A. T.

Thomas

D. C.

Bernstein

J. L.

Begg

C. B.

Assessment of rare BRCA1 and BRCA2 variants of unknown significance using hierarchical modeling

Genetic Epidemiology 2011 35 5 389 397

2-s2.0-79958170981

10.1002/gepi.20587

Figueiredo

J. C.

Brooks

J. D.

Conti

D. V.

Poynter

J. N.

Teraoka

S. N.

Malone

K. E.

Bernstein

Lee

W. D.

Duggan

D. J.

Siniard

Concannon

Capanu

Lynch

C. F.

Olsen

J. H.

Haile

R. W.

Bernstein

J. L.

Risk of contralateral breast cancer associated with common variants in BRCA1 and BRCA2: potential modifying effect of BRCA1/BRCA2 mutation carrier status

Breast Cancer Research and Treatment 2011 127 3 819 829

2-s2.0-79958272589

10.1007/s10549-010-1285-1

Quintana

M. A.

Bernstein

J. L.

Thomas

D. C.

Conti

D. V.

Incorporating model uncertainty in detecting rare variants: the Bayesian risk index

Genetic Epidemiology 2011 35 7 638 649

2-s2.0-80054758152

10.1002/gepi.20613

Quintana

M. A.

Schumacher

F. R.

Casey

Bernstein

J. L.

Conti

D. V.

Incorporating prior biologic information for high-dimensional rare variant association studies

Human Heredity 2012 74 184 195

10.1159/000346021

Mellemkjær

Dahl

Olsen

J. H.

Bertelsen

Guldberg

Christensen

Børresen-Dale

A.-L.

Stovall

Langholz

Bernstein

Lynch

C. F.

Malone

K. E.

Haile

R. W.

Andersson

Thomas

D. C.

Concannon

Capanu

Boice

J. D.

Jr. Bernstein

J. L.

Olsen

J. H.

Borg

Å.

Bertelsen

Mellemkjær

Guldberg

Liang

Wolitzer

Seminara

Haile

R. W.

Diep

A. T.

Zhou

Liu

Ter-Karapetova

Hernandez

Orlow

Bernstein

Donnelly-Allen

Lynch

C. F.

DeWall

Malone

K. E.

Epstein

Anton-Culver

Largent

Stovall

Smith

Shore

R. E.

Boice

J. D.

Jr. Langholz

B. M.

Thomas

D. C.

Begg

Thompson

W. D.

Risk for contralateral breast cancer among carriers of the CHEK2*1100delC mutation in the WECARE Study

British Journal of Cancer 2008 98 4 728 733

2-s2.0-39449090355

10.1038/sj.bjc.6604228

Bernstein

J. L.

Haile

R. W.

Stovall

Boice

J. D.

Jr. Shore

R. E.

Langholz

Thomas

D. C.

Bernstein

Lynch

C. F.

Olsen

J. H.

Malone

K. E.

Mellemkjaer

Børresen-Dale

Rosenstein

B. S.

Teraoka

S. N.

Diep

A. T.

Smith

S. A.

Capanu

Reiner

A. S.

Liang

Gatti

R. A.

Concannon

Radiation exposure, the ATM gene, and contralateral breast cancer in the women's environmental cancer and radiation epidemiology study

Journal of the National Cancer Institute 2010 102 7 475 483

2-s2.0-77950574815

10.1093/jnci/djq055

Bernstein

J. L.

Thomas

D. C.

Shore

R. E.

Robson

Boice

J. D.

Stovall

Andersson

Bernstein

Malone

K. E.

Reiner

A. S.

Contralateral breast cancer after radiotherapy among BRCA1 and BRCA2 mutation carriers: a WECARE study report

European Journal of Cancer 2013 49 14 2979 2985

10.1016/j.ejca.2013.04.028

Poynter

J. N.

Langholz

Largent

Mellemkjær

Bernstein

Malone

K. E.

Lynch

C. F.

Borg

Å.

Concannon

Teraoka

S. N.

Xue

Diep

A. T.

Törngren

Begg

C. B.

Capanu

Haile

R. W.

Bernstein

J. L.

Reproductive factors and risk of contralateral breast cancer by BRCA1 and BRCA2 mutation status: results from the WECARE study

Cancer Causes and Control 2010 21 6 839 846

2-s2.0-77955663595

10.1007/s10552-010-9510-0

Reding

K. W.

Bernstein

J. L.

Langholz

B. M.

Bernstein

Haile

R. W.

Begg

C. B.

Lynch

C. F.

Concannon

Borg

Teraoka

S. N.

Törngren

Diep

Xue

Bertelsen

Liang

Reiner

A. S.

Capanu

Malone

K. E.

Adjuvant systemic therapy for breast cancer in BRCA1/BRCA2 mutation carriers in a population-based study of risk of contralateral breast cancer

Breast Cancer Research and Treatment 2010 123 2 491 498

2-s2.0-77956186324

10.1007/s10549-010-0769-3

Wang

Bucan

Pathway-based approaches for analysis of genomewide association studies

American Journal of Human Genetics 2007 81 6 1278 1283

2-s2.0-36249029788

10.1086/522374

Thomas

Methods for investigating gene-environment interactions in candidate pathway and genome-wide association studies

Annual Review of Public Health 2010 31 21 36

2-s2.0-77951497455

10.1146/annurev.publhealth.012809.103619

Thomas

Gene-environment-wide association studies: emerging approaches

Nature Reviews Genetics 2010 11 4 259 272

2-s2.0-77949772292

10.1038/nrg2764

Basu

Pan

Comparison of statistical tests for disease association with rare variants

Genetic Epidemiology 2011 35 7 606 619

2-s2.0-80054728031

10.1002/gepi.20609

Quintana

M. A.

Conti

D. V.

Integrative variable selection via Bayesian model uncertainty

Statistics in Medicine 2013

10.1002/sim.5888

Hoffmann

T. J.

Marini

N. J.

Witte

J. S.

Comprehensive approach to analyzing rare genetic variants

PLoS ONE 2010 5 11

2-s2.0-78149479773

10.1371/journal.pone.0013584

e13584

Chen

L. S.

Hutter

C. M.

Potter

J. D.

Liu

Prentice

R. L.

Peters

Hsu

Insights into colon cancer etiology via a regularized approach to gene set analysis of GWAS data

American Journal of Human Genetics 2010 86 6 860 871

2-s2.0-77953123145

10.1016/j.ajhg.2010.04.014

Brooks

J. D.

Teraoka

S. N.

Reiner

A. S.

Satagopan

J. M.

Bernstein

Thomas

D. C.

Capanu

Stovall

Smith

S. A.

Wei

Shore

R. E.

Boice

J. D.

Jr. Lynch

C. F.

Mellemkjær

Malone

K. E.

Liang

Haile

R. W.

Concannon

Bernstein

J. L.

Begg

Orlow

Klein

Offit

Woods

John

E. M.

Wang

Olsen

J. H.

Epstein

Seminara

Knight

Chiarelli

Duggan

DeWall

Stram

Diep

A. T.

Xue

Zhou

Ter-Karapetova

Smith

Teraoka

Olson

E. R.

Morrison

V. A.

Navarro

Cerosaletti

K. M.

Wright

Variants in activators and downstream targets of ATM, radiation exposure, and contralateral breast cancer risk in the WECARE study

Human Mutation 2012 33 1 158 164

2-s2.0-84857682603

10.1002/humu.21604

Gene Ontology Consortium

The Gene Ontology in 2010: extensions and refinements

Nucleic Acids Research 2010 38 supplement 1 331 335

10.1093/nar/gkp1018

Kass

Raftery

Bayes factors

Journal of the American Statistical Association 1995 90 430 773 795

10.1080/01621459.1995.10476572

Concannon

Haile

R. W.

Børresen-Dale

A. L.

Rosenstein

B. S.

Gatti

R. A.

Teraoka

S. N.

Diep

A. T.

Jansen

Atencio

D. P.

Langholz

Capanu

Liang

Begg

C. B.

Thomas

D. C.

Bernstein

Olsen

J. H.

Malone

K. E.

Lynch

C. F.

Anton-Culver

Bernstein

J. L.

Variants in the ATM gene associated with a reduced risk of contralateral breast cancer

Cancer Research 2008 68 16 6486 6491

2-s2.0-53049104973

10.1158/0008-5472.CAN-08-0134

Antoniou

A. C.

Sinilnikova

O. M.

Simard

Léoné

Dumont

Neuhausen

S. L.

Struewing

J. P.

Stoppa-Lyonnet

Barjhoux

Hughes

D. J.

Coupier

Belotti

Lasset

Bonadona

Bignon

Rebbeck

T. R.

Wagner

Lynch

H. T.

Domchek

S. M.

Nathanson

K. L.

Garber

J. E.

Weitzel

Narod

S. A.

Tomlinson

Olopade

O. I.

Godwin

Isaacs

Jakubowska

Lubinski

Gronwald

Górski

Byrski

Huzarski

Peock

Cook

Baynes

Murray

Rogers

Daly

P. A.

Dorkins

Schmutzler

R. K.

Versmold

Engel

Meindl

Arnold

Niederacher

Deissler

Spurdle

A. B.

Chen

Waddell

Cloonan

Kirchhoff

Offit

Friedman

Kaufmann

Laitman

Galore

Rennert

Lejbkowicz

Raskin

Andrulis

I. L.

Ilyushik

Ozcelik

Devilee

Vreeswijk

M. P. G.

Greene

M. H.

Prindiville

S. A.

Osorio

Benítez

Zikan

Szabo

C. I.

Kilpivaara

Nevanlinna

Hamann

Durocher

Arason

Couch

F. J.

Easton

D. F.

Chenevix-Trench

Chompret

Bressac-de-Paillerets

Byrde

Capoulade

Lenoir

Uhrhammer

Gauthier-Villars

De Pauw

Sinilnikova

Giraud

Hardouin

Berthet

Sobol

Bourdon

Eisinger

Coulet

Colas

Soubrier

Peyrat

Fournier

Vennin

Adenis

Nogues

Lidereau

Muller

Fricker

Longy

Toulas

Guimbaud

Gladieff

Feillel

Leroux

Dreyfus

Rebischung

Olivier-Faivre

Prieur

Frénay

Mazoyer

Yannoukakos

Engel

Haites

Gregory

Morrison

Cole

McKeown

Donaldson

Paterson

Gray

Daly

Barton

Porteous

Steel

Brewer

Rankin

Davidson

Murday

Izatt

Pichert

Trembath

Bishop

Chu

Ellis

Evans

Lalloo

Shenton

Mackay

Robinson

Ritchie

Douglas

Burn

Side

Durell

Eeles

Cook

Quarrell

Hodgson

Eccles

Lucassen

RAD51 135 G→C modifies breast cancer risk among BRCA2 mutation carriers: results from a combined analysis of 19 studies

American Journal of Human Genetics 2007 81 6 1186 1200

2-s2.0-36749002743

10.1086/522611

Antoniou

A. C.

Sinilnikova

O. M.

Simard

Léoné

Dumont

Neuhausen

S. L.

Struewing

J. P.

Stoppa-Lyonnet

Barjhoux

Hughes

D. J.

RAD51 135 G→C modifies breast cancer risk among BRCA2 mutation carriers: results from a combined analysis of 19 studies

American Journal of Human Genetics 2007 81 6 1186 1200

10.1086/522611

K. D.

Yang

Fan

Chen

A. X.

Shao

Z. M.

RAD51 135 G>C does not modify breast cancer risk in non-BRCA1/2 mutation carriers: evidence from a meta-analysis of 12 studies

Breast Cancer Research and Treatment 2011 126 2 365 371

2-s2.0-79958716871

10.1007/s10549-010-0937-5

P. C.

Henikoff

SIFT: predicting amino acid changes that affect protein function

Nucleic Acids Research 2003 31 13 3812 3814

2-s2.0-0043122919

10.1093/nar/gkg509

Jones

I. M.

Mohrenweiser

H. W.

Many amino acid substitution variants identified in DNA repair genes during human population screenings are predicted to impact protein function

Genomics 2004 83 6 970 979

2-s2.0-2642527702

10.1016/j.ygeno.2003.12.016