We propose a two-stage penalized logistic regression approach to case-control genome-wide association studies. This approach consists of a screening stage and a selection stage. In the screening stage, main-effect and interaction-effect features are screened by using
The case-control genome-wide association study (GWAS) with single-nucleotide polymorphism (SNP) data is a powerful approach to research on common human diseases. GWAS has two goals: (1) to identify suitable SNPs for the construction of classification rules and (2) to discover SNPs which are etiologically important. The emphasis is on the prediction capacity of the SNPs for the first goal and on the etiological effect of the SNPs for the second goal. The phrase “an etiological SNP” is used in the sense that either the SNP itself is etiological or it is in high linkage disequilibrium with an etiological locus. Well-developed classification methods in the literature can be used for the first goal. These methods include classification and regression trees [
The approach of multiple testing based on single or paired SNP models is commonly used for the detection of etiological SNPs. Either the Bonferroni correction is applied for the control of the overall Type I error rate, see, for example, Marchini et al. [
It is natural to seek alternative methods that overcome the drawback of multiple testing. Such methods must have the nature of considering many loci simultaneously and assessing the significance of the loci by their synergistic effect. When the synergistic effect is of concern, adding loci spuriously correlated to an etiological locus does not contribute to the synergistic effect while the etiological locus has already been considered. Thus the drawback of multiple testing can be avoided. In this paper, we propose a method of the abovementioned nature: a two-stage penalized logistic regression approach. In the first stage of this approach,
The two-stage strategy has been considered by other authors. For example, J. Fan and Y. Fan [
Logistic regression models with various penalties have been considered for GWAS by a number of authors. Park and Hastie [
The two-stage penalized logistic regression approach is described in detail in Section
We first give a brief account of the elements required by the approach: the logistic model for case-control studies, the penalized likelihood, and the EBIC.
Let
Penalized likelihood makes the fitting of a logistic model with small-
In small-
We now describe the two-stage penalized logistic regression (TPLR) approach as follows.
Let
In the main-effect screening step, only the main-effect features are considered. Let
The interaction screening is similar to the main-effect screening step. However, the main-effect features retained in the main-effect screening step are built in the models for interaction screening. Let
The selection stage consists of a ranking step and a model selection step. In the ranking step, the retained features (main-effect and interaction) are ranked together by a penalized likelihood with the SCAD penalty plus an additional Jeffreys prior penalty. In the model selection step, a sequence of nested models is formed and evaluated by the EBIC.
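The model selection step can be illustrated with a minimal sketch. It assumes the generic form of the EBIC, $-2\log L + k\log n + 2\gamma\log\binom{p}{k}$, which may differ from the exact variant used in this paper for main-effect and interaction features. The function names `ebic` and `select_nested_model` and the log-likelihood values are hypothetical and serve only as an illustration.

```python
from math import lgamma, log


def log_binom(p, k):
    """Natural log of the binomial coefficient C(p, k)."""
    return lgamma(p + 1) - lgamma(k + 1) - lgamma(p - k + 1)


def ebic(loglik, k, n, p, gamma):
    """Generic EBIC (assumed form): -2 log L + k log n + 2 gamma log C(p, k)."""
    return -2.0 * loglik + k * log(n) + 2.0 * gamma * log_binom(p, k)


def select_nested_model(logliks, n, p, gamma):
    """Pick the size of the nested model with the smallest EBIC.

    logliks[j] is the maximized log-likelihood of the model that contains
    the top j + 1 ranked features (hypothetical input).
    """
    scores = [ebic(ll, j + 1, n, p, gamma) for j, ll in enumerate(logliks)]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best + 1, scores


# Illustrative values only: the fit improves sharply up to two features,
# then flattens, so the penalty terms dominate for larger models.
logliks = [-100.0, -80.0, -79.5, -79.4]
size, scores = select_nested_model(logliks, n=800, p=1000, gamma=0.5)
print(size, [round(s, 2) for s in scores])
```

With `gamma=0` the criterion reduces to the ordinary BIC; larger values of `gamma` penalize model size more heavily when the number of candidate features is large.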
For convenience, let the retained interaction features be referred to by a single index. Let
The choice of
A final issue on the two-stage logistic regression procedure is how to determine
The CGEMS data portal of National Cancer Institute, USA, provides public access to the summary results of approximately 550,000 SNPs genotyped in the CGEMS prostate cancer whole genome scan, see
The application of the screening stage to all the 294,179 SNPs directly is not only time consuming but also unnecessary. Therefore, we did a preliminary screening by using single-SNP logistic models. For each SNP, a logistic model is fitted and the
Because of the sheer number of features, 17,387 main features and
The features selected by EBIC with
Features associated with prostate cancer from the analysis of CGEMS data (“rsXXX” denotes SNP reference).
Chromosome | Feature | Maximum | Significance Level |
---|---|---|---|
6, 7 | rs1885693-rs12537363 | 0.80 | 1.824985e-11 |
8, 13 | | 0.80 | 1.824985e-11 |
1, 21 | rs1721525-rs2243988 | 0.80 | 1.824985e-11 |
10, 16 | rs11595532-rs8055313 | 0.77 | 9.64352e-11 |
12, 12 | rs10842794-rs10848967 | 0.77 | 9.64352e-11 |
9, 12 | rs3802357-rs10880221 | 0.77 | 9.64352e-11 |
1, 2 | rs3900628-rs642501 | 0.77 | 9.64352e-11 |
1, 16 | rs10518441-rs2663158 | 0.77 | 9.64352e-11 |
3, 13 | rs1880589-rs1999494 | 0.77 | 9.64352e-11 |
5, 18 | rs6883810-rs11874224 | 0.77 | 9.64352e-11 |
13, 19 | rs4274307-rs3745180 | 0.73 | 2.740672e-10 |
5, 19 | rs672413-rs3915790 | 0.73 | 2.740672e-10 |
An older and slightly different version of the CGEMS prostate data has been analyzed by Yeager et al. [
In our analysis, we identified rs7837688 but not rs1447295. This is because the penalized likelihood tends to select only one feature among several highly correlated features, in contrast to multiple testing, which selects all the correlated features if any of them is associated with the disease status. We also failed to identify rs6983267, possibly because its effect is masked by other, more significant features identified in our analysis. We also carried out the selection procedure with only the 100 main-effect features retained from the screening stage. We found that rs6983267 is among the top 20 selected main-effect features with a significance level
We present results of two simulation studies in this section. In the first study, we compare the two-stage penalized logistic regression (TPLR) approach with the paired-SNP multiple testing (PMT) approach of Marchini et al. [
The comparison of TPLR and PMT is based on four models. Each model involves two etiological SNPs. In the first model, the effects of the two SNPs are multiplicative both within and between loci; in the second model, the effects of the two SNPs are multiplicative within but not between loci; in the third model, the two SNPs have threshold interaction effects; in the fourth model, the two SNPs have an interaction effect but no marginal effects. The first three models are taken from Marchini et al. [
Marchini et al. [
In the first three models, the marginal effects of both loci are nonnegligible and can be picked up by the single-SNP tests at the relaxed significance level. In this situation, the second strategy has an advantage over the first in terms of detection power and false discovery rate. We therefore compare our approach with the second strategy of PMT under the first three models. In the fourth model, neither locus has a marginal effect, so the second strategy of PMT cannot be applied: it would fail to pick up any loci at the first step. Hence, under the fourth model we compare our approach with the first strategy of PMT. However, the first strategy involves an amount of computation that exceeds our computing capacity. To circumvent this difficulty, we consider an artificial version of the first strategy in which we only consider the pairs that involve at least one of the etiological SNPs. This artificial version has the same detection power as the full version but a lower false discovery rate. It cannot be implemented with real data, since it requires knowledge of the etiological SNPs, but it can be implemented with simulated data and serves the purpose of comparison.
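To give a sense of the computational saving, the following sketch counts the SNP pairs examined by the full first strategy and by its artificial version. It assumes two etiological SNPs, as in the four models, and takes 1000 and 5000 as the total numbers of SNPs; reading these values from the second entry of the settings (800,1000) and (800,5000) in the simulation tables is an assumption. The function name `pair_counts` is hypothetical.

```python
from math import comb


def pair_counts(num_snps, num_etiological=2):
    """Return (all pairs, pairs involving at least one etiological SNP)."""
    all_pairs = comb(num_snps, 2)
    # Pairs formed only from non-etiological SNPs are skipped
    # by the artificial version.
    null_pairs = comb(num_snps - num_etiological, 2)
    return all_pairs, all_pairs - null_pairs


for num_snps in (1000, 5000):
    full, restricted = pair_counts(num_snps)
    print(num_snps, full, restricted)
```

The restricted count grows linearly in the number of SNPs, whereas the full count grows quadratically.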
Each simulated dataset contains
The
The simulated average PDR and FDR under Model 1: multiplicative effects both within and between loci.
| | | | PDR (TPLR) | PDR (MT) | FDR (TPLR) | FDR (MT) |
|---|---|---|---|---|---|---|
| (800,1000) | 0.8 | 0.1 | 0.610 | 0.780 | 0.358 | 0.996 |
| (800,1000) | 0.9 | 0.1 | 0.850 | 0.900 | 0.320 | 0.998 |
| (800,1000) | 1.0 | 0.1 | 0.960 | 1.000 | 0.219 | 0.999 |
| (800,5000) | 0.8 | 0.1 | 0.470 | 0.660 | 0.405 | 0.999 |
| (800,5000) | 0.9 | 0.1 | 0.750 | 0.870 | 0.380 | 0.999 |
| (800,5000) | 1.0 | 0.1 | 0.890 | 0.930 | 0.233 | 0.999 |
The simulated average PDR and FDR under Model 2: multiplicative effects within loci but not between loci.
| | | | PDR (TPLR) | PDR (MT) | FDR (TPLR) | FDR (MT) |
|---|---|---|---|---|---|---|
| (800,1000) | 0.5 | 0.1 | 0.265 | 0.175 | 0.086 | 0.352 |
| (800,1000) | 0.5 | 0.2 | 0.650 | 0.550 | 0.071 | 0.763 |
| (800,1000) | 0.7 | 0.1 | 0.790 | 0.710 | 0.048 | 0.758 |
| (800,1000) | 0.7 | 0.2 | 0.950 | 1.000 | 0.050 | 0.954 |
| (800,5000) | 0.5 | 0.1 | 0.175 | 0.085 | 0.079 | 0.595 |
| (800,5000) | 0.5 | 0.2 | 0.610 | 0.405 | 0.077 | 0.928 |
| (800,5000) | 0.7 | 0.1 | 0.720 | 0.480 | 0.062 | 0.776 |
| (800,5000) | 0.7 | 0.2 | 0.940 | 0.930 | 0.051 | 0.980 |
The simulated average PDR and FDR under Model 3: two-locus threshold interaction effects.
| | | | PDR (TPLR) | PDR (MT) | FDR (TPLR) | FDR (MT) |
|---|---|---|---|---|---|---|
| (800,1000) | 0.8 | 0.1 | 0.530 | 0.455 | 0.086 | 0.884 |
| (800,1000) | 0.9 | 0.1 | 0.730 | 0.695 | 0.052 | 0.965 |
| (800,1000) | 1.0 | 0.1 | 0.810 | 0.840 | 0.047 | 0.970 |
| (800,5000) | 0.8 | 0.1 | 0.350 | 0.270 | 0.028 | 0.800 |
| (800,5000) | 0.9 | 0.1 | 0.620 | 0.490 | 0.101 | 0.999 |
| (800,5000) | 1.0 | 0.1 | 0.712 | 0.657 | 0.060 | 0.982 |
The simulated average PDR and FDR under Model 4: significant interaction effect but zero marginal effects.
| | | | PDR (TPLR) | PDR (MT) | FDR (TPLR) | FDR (MT) |
|---|---|---|---|---|---|---|
| (800,1000) | 1.9 | 0.1 | 0.828 | 0.702 | 0.012 | |
| (800,1000) | 2.0 | 0.1 | 0.945 | 0.860 | 0.026 | |
| (800,1000) | 2.1 | 0.1 | 0.965 | 0.915 | 0.015 | |
| (800,5000) | 1.9 | 0.1 | 0.555 | 0.460 | 0.009 | |
| (800,5000) | 2.0 | 0.1 | 0.730 | 0.710 | 0.014 | |
| (800,5000) | 2.1 | 0.1 | 0.885 | 0.795 | 0.006 | |
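
The PDR and FDR reported in the tables above are averages over simulation replicates. The following minimal sketch shows how such averages could be computed. It assumes that the PDR is the proportion of causal features that are selected and that the FDR is the proportion of selected features that are not causal. These definitions, the function names, and the toy replicates are assumptions made for illustration and are not taken from the simulation code.

```python
def pdr_fdr(selected, causal):
    """PDR and FDR of one replicate (assumed definitions, see lead-in)."""
    selected, causal = set(selected), set(causal)
    true_pos = len(selected & causal)
    pdr = true_pos / len(causal)
    # Convention assumed here: an empty selection has FDR 0.
    fdr = (len(selected) - true_pos) / len(selected) if selected else 0.0
    return pdr, fdr


def average_pdr_fdr(replicates, causal):
    """Average PDR and FDR over a list of selected-feature sets."""
    rates = [pdr_fdr(sel, causal) for sel in replicates]
    n = len(rates)
    return sum(r[0] for r in rates) / n, sum(r[1] for r in rates) / n


# Toy example: two causal SNPs (labelled 1 and 2) and three replicates.
causal = {1, 2}
replicates = [{1, 2}, {1, 7}, set()]
avg_pdr, avg_fdr = average_pdr_fdr(replicates, causal)
print(avg_pdr, avg_fdr)
```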
The results presented in Tables
The data for this simulation study are generated to mimic the structure of the CGEMS prostate cancer data. The cases and controls are generated using a logistic model with the following linear predictor:
In the TPLR approach, 50 main effect features and 50 interaction features are selected in the screening stage using the tournament screening strategy. In the selection stage, EBIC(
In the LASSO-patternsearch approach, at the screening stage, 0.05 and 0.002 are used as thresholds for the
Since in the TPLR approach there is not a definite choice of
Comparison of TPLR approach and LASSO-patternsearch (the PDR and FDR with subscript
TPLR Approach | ||||||
0.0 | 0.964 | 0.902 | 0.884 | 0.907 | 0.947 | 0.855 |
0.2 | 0.768 | 0.926 | 0.884 | 0.900 | 0.947 | 0.852 |
0.4 | 0.714 | 0.505 | 0.748 | 0.677 | 0.803 | 0.494 |
0.6 | 0.692 | 0.199 | 0.680 | 0.413 | 0.730 | 0.200 |
0.8 | 0.684 | 0.112 | 0.642 | 0.331 | ||
1.0 | ||||||
1.2 | ||||||
1.4 | ||||||
1.6 | ||||||
1.8 | ||||||
2.0 | ||||||
LASSO-patternsearch | ||||||
0.882 | 0.445 | 0.710 | 0.967 | 0.847 | 0.940 | |
0.816 | 0.332 | 0.696 | 0.957 | 0.827 | 0.926 | |
0.786 | 0.283 | 0.664 | 0.929 | 0.774 | 0.885 | |
0.718 | 0.241 | |||||
The ROC curves of the LASSO-patternsearch and the TPLR approach for identifying etiological SNPs.
To investigate the effect of the choice of
We also investigated whether the ranking step in the TPLR approach really reflects the actual importance of the features. The average ranks of the ten causal features over the 100 simulation replicates are given in Table
Features | 1 | 2 | 3 | 4 | 5 | (6,7) | (8,9) | (10,11) | (2,12) | (13,14) |
---|---|---|---|---|---|---|---|---|---|---|
Avg. ranks | 4.7 | 2.0 | 7.2 | 6.1 | 5.4 | 7.6 | 6.8 | 9.2 | 3.0 | 1.1 |
On average, the causal features are all ranked among the top ten, which justifies the ranking step in the selection stage of the TPLR approach.
It is a common understanding that individual SNPs are unlikely to play an important role in the development of complex diseases, and, instead, it is the interactions of many SNPs that are behind disease developments, see Garte [
The analysis of the CGEMS prostate cancer data can be refined by replacing the binary logistic model with a polytomous logistic regression model taking into account that the genetic mechanisms behind aggressive and nonaggressive prostate cancers might be different. Accordingly, the penalty in the penalized likelihood can be replaced by some variants of the group LASSO penalty considered by Huang et al. [
The authors would like to thank the National Cancer Institute of the USA for granting access to the CGEMS prostate cancer data. The research of the authors is supported by Research Grant R-155-000-065-112 of the National University of Singapore, and the research of the first author was done when she was a Ph.D. student at the National University of Singapore.