Genome-wide association studies (GWAS) have extensively analyzed single SNP effects on a wide variety of common and complex diseases and found many genetic variants associated with diseases. However, a large portion of the genetic variation remains unexplained. This missing heritability problem might be due to the analytical strategy that limits analyses to single SNPs. One possible approach to the missing heritability problem is to identify multi-SNP effects or gene-gene interactions. The multifactor dimensionality reduction (MDR) method has been widely used to detect gene-gene interactions. It is based on constructive induction, classifying high-dimensional genotype combinations into a one-dimensional variable with two attributes, high risk and low risk, for case-control studies. Many modifications of MDR have been proposed, and MDR has also been extended to the survival phenotype. In this study, we propose several extensions of MDR for the survival phenotype and compare them with earlier MDR methods through comprehensive simulation studies.
In early genome-wide association studies (GWAS), a massive number of results were reported on the associations between single-nucleotide polymorphisms (SNPs) and diseases. By now, 2,051 studies and 14,836 causal variants (
Traditional statistical methods are not well suited for detecting such interactions since the number of possible interactions increases exponentially as the number of SNPs grows. To address this issue, many bioinformatics methods for identifying gene-gene interactions have been proposed, and one such method is multifactor dimensionality reduction (MDR) [
In this study, we focus on gene-gene and/or gene-environment interactions associated with the survival phenotype. In a prospective cohort study, survival time has been one of the important phenotypes in studies of associations with gene expression levels measured by high-throughput microarray technology. Similarly, it has been important to identify the effect of SNPs on the survival phenotype in GWAS. A series of extensions of MDR to the survival phenotype has recently been proposed, which includes Surv-MDR [
Recently, a simple approach to MDR analysis of gene-gene interactions for quantitative traits, called QMDR, has been proposed [
We compare the power of the proposed methods for various parameters including heritability, minor allele frequency (MAF), and censoring proportion, with and without adjustment for covariates. We find that the improved versions of AFT-MDR are less sensitive to the censoring fraction than the original AFT-MDR but tend to lose power as the covariate effect increases. In contrast, the improved version of Cox-MDR is relatively robust to the censoring fraction and tends to have reasonable power across many combinations of parameters.
The MDR method was originally proposed for a binary phenotype in case-control studies and has since been extended to quantitative traits and various sampling designs. Among those, Surv-MDR was first proposed [
To overcome the drawback of Surv-MDR, the Cox-MDR method was proposed [
Similarly, the AFT-MDR method has also been proposed by using the standardized residual as a new classifier under the accelerated failure time model [
As mentioned in the previous section, the improvement of AFT-MDR is needed to make it more robust to the fraction of censoring. Based on the simulated data in [
First, we transform the continuous standardized residual into a binary variable instead of summing the residuals as done in AFT-MDR. In other words, an individual with a positive standardized residual is regarded as a control, whereas an individual with a negative standardized residual is regarded as a case. As a result, every observation is coded as 0 or 1 and the original MDR algorithm is then applied; we call this method dAFT-MDR (discretized AFT-MDR). Although dAFT-MDR, like the original MDR, is based on a binary value, it can adjust for covariate effects through the standardized residual of the AFT model, whereas the original MDR cannot.
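The discretization step and the subsequent MDR classification can be sketched as follows for a single pair of SNPs. This is only an illustration under stated assumptions: the standardized residuals are taken as already computed from a fitted AFT model, the function names `discretize_residuals` and `mdr_balanced_accuracy` are hypothetical, a residual of exactly zero is treated as a control, and a genotype cell is labeled high risk when its case-to-control ratio exceeds the overall ratio, which is the usual MDR rule. Cross-validation is omitted.

```python
from collections import Counter


def discretize_residuals(std_resid):
    """Code each standardized residual as 1 (case, negative) or 0 (control).

    Assumption: a residual of exactly zero is coded as a control.
    """
    return [1 if r < 0 else 0 for r in std_resid]


def mdr_balanced_accuracy(snp_a, snp_b, labels):
    """Classify two-SNP genotype cells as high/low risk and score the result.

    Returns the balanced accuracy and the set of high-risk genotype cells.
    """
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    if n_case == 0 or n_ctrl == 0:
        raise ValueError("need at least one case and one control")

    # Count cases and controls in each observed genotype combination.
    cases, ctrls = Counter(), Counter()
    for a, b, y in zip(snp_a, snp_b, labels):
        if y == 1:
            cases[(a, b)] += 1
        else:
            ctrls[(a, b)] += 1

    cells = set(cases) | set(ctrls)
    # High risk: cell case/control ratio exceeds the overall ratio
    # (written as a product to avoid division by zero).
    high_risk = {c for c in cells if cases[c] * n_ctrl > ctrls[c] * n_case}

    sensitivity = sum(cases[c] for c in high_risk) / n_case
    specificity = sum(ctrls[c] for c in cells - high_risk) / n_ctrl
    return 0.5 * (sensitivity + specificity), high_risk


# Toy usage with made-up residuals and genotypes (coded 0, 1, 2).
labels = discretize_residuals([-1.2, 0.5, -0.3, 2.0])
ba, high = mdr_balanced_accuracy([0, 1, 0, 1], [0, 0, 0, 0], labels)
```

Because the labels are derived from residuals of a model that may include covariates, the covariate adjustment enters only through the residuals, while the classification itself is the plain binary MDR step.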
Second, we specify lower and upper bounds for the standardized residuals and replace values beyond these bounds with the corresponding bound. We then apply the AFT-MDR algorithm; we call this method rAFT-MDR (restricted AFT-MDR). Replacing the extreme values with the prespecified thresholds may weaken the effect of outliers on the standardized residual when its distribution is extremely skewed under heavier censoring. However, the choice of the lower and upper bounds is somewhat arbitrary and should take into account the behavior of the standardized residuals.
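The restriction step amounts to clipping the residuals, as in the minimal sketch below. The function name `restrict_residuals` and the default bounds of -2 and 2 are illustrative assumptions only; the text does not fix particular thresholds.

```python
def restrict_residuals(std_resid, lower=-2.0, upper=2.0):
    """Replace residuals beyond [lower, upper] with the nearest bound.

    The default bounds are placeholders chosen for illustration.
    """
    if lower >= upper:
        raise ValueError("lower bound must be smaller than upper bound")
    return [min(max(r, lower), upper) for r in std_resid]


# Toy residuals for one genotype cell, with two extreme values.
raw = [-5.0, -1.0, 0.5, 7.0]
restricted = restrict_residuals(raw)

# The cell sum is dominated by the extreme values before restriction.
raw_sum = sum(raw)
restricted_sum = sum(restricted)
```

In this toy cell the sum of the residuals changes from 1.5 to -0.5 after restriction, which shows how a single extreme value can drive a sum-based classification.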
Recently, a simple MDR approach called QMDR for the quantitative trait has been proposed [
For Cox-MDR, we obtain the mean value of the martingale residual for each genotype combination and then compare it with the overall mean of the martingale residual. If the mean martingale residual for a specific genotype combination is greater than the overall mean, the corresponding genotype combination is considered a high-risk group; otherwise, it is considered a low-risk group, since a larger martingale residual indicates higher risk than expected. Once all of the genotype combinations are classified as high risk or low risk, a new binary attribute is created by pooling the high-risk genotype combinations into one group and the low-risk combinations into another group. Then we use a
We propose various improvements of AFT-MDR and Cox-MDR to increase the power for detecting gene-gene interactions with the survival phenotype. We conduct comprehensive simulation studies to compare the power of these improvements with that of the original AFT-MDR and Cox-MDR.
For the simulation studies, the two disease-causal SNPs are considered among 20 unlinked diallelic loci with the assumption of Hardy-Weinberg equilibrium and linkage equilibrium. For the covariate adjustment, we consider only one covariate which is associated with the survival time but has no interactions with any SNPs. The simulation datasets are generated from different penetrance functions which define a probabilistic relationship between a status of high or low risk groups and SNPs. We consider eight different combinations of two minor allele frequencies of 0.2 and 0.4 and the four different heritabilities of 0.1, 0.2, 0.3, and 0.4. For each of the eight heritability-MAF combinations, a total of 5 models are generated, which yield 40 epistatic models with various penetrance functions, as described in [
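The genotype part of this design can be sketched as follows. The sketch covers only the enumeration of the heritability-MAF settings and the generation of unlinked genotypes under Hardy-Weinberg equilibrium; it does not reproduce the penetrance functions or the survival times. The function name `simulate_genotypes`, the sample size, and the seed are hypothetical choices for illustration.

```python
import random
from itertools import product

MAFS = (0.2, 0.4)
HERITABILITIES = (0.1, 0.2, 0.3, 0.4)
MODELS_PER_SETTING = 5

# Eight heritability-MAF combinations, five models each.
settings = list(product(MAFS, HERITABILITIES))
n_models = len(settings) * MODELS_PER_SETTING


def simulate_genotypes(n_subjects, n_snps=20, maf=0.2, seed=0):
    """Draw genotypes (0, 1, or 2 minor alleles) for unlinked diallelic loci.

    Under Hardy-Weinberg equilibrium the two alleles at a locus are drawn
    independently with probability `maf` of being the minor allele; drawing
    each locus independently gives linkage equilibrium.
    """
    rng = random.Random(seed)
    return [
        [(rng.random() < maf) + (rng.random() < maf) for _ in range(n_snps)]
        for _ in range(n_subjects)
    ]


# Hypothetical sample size, used only to show the call.
genotypes = simulate_genotypes(n_subjects=2000, maf=0.2)
```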
Suppose that SNP1 and SNP2 are the two disease-causal SNPs and let
To generate the survival time, we consider three different models: log-normal, Weibull, and Cox model. For each model, the effect size of the genetic factor is fixed as 1.0 and the effect sizes of adjusted covariate are given as
First, we check whether the false detection rate is close to the expected value when there is no gene-gene interaction effect because the best model is selected using the maximum balanced accuracy in the algorithm of MDR. To do this, we generate 100 datasets from each of the 40 models, which is a total of 4000 null datasets. Here the false detection rate is estimated as the percentage of times that the method randomly chooses the two disease-causal SNPs as the best model out of each set of 100 datasets for each model. Table
The false detection rate of AFT-MDR, dAFT-MDR, rAFT-MDR, Cox-MDR, qCox-MDR, and qAFT-MDR for the log-normal distribution with
MAF | | AFT-MDR | dAFT-MDR | rAFT-MDR | Cox-MDR | qCox-MDR | qAFT-MDR |
---|---|---|---|---|---|---|---|
0.2 | 0 | 0.008 | 0.004 | 0.006 | 0.006 | 0.008 | 0.003 |
0.2 | 0.3 | 0.002 | 0.005 | 0.008 | 0.006 | 0.007 | 0.005 |
0.2 | 0.5 | 0.007 | 0.005 | 0.004 | 0.005 | 0.004 | 0.003 |
0.4 | 0 | 0.003 | 0.006 | 0.006 | 0.004 | 0.008 | 0.006 |
0.4 | 0.3 | 0.004 | 0.004 | 0.005 | 0.003 | 0.006 | 0.003 |
0.4 | 0.5 | 0.007 | 0.006 | 0.005 | 0.006 | 0.005 | 0.008 |
MAF: minor allele frequency;
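As a rough point of comparison for the tabulated rates, the short calculation below gives the chance that one particular pair is selected if the best model were picked uniformly at random among all two-way models of the 20 SNPs. Treating this as the expected value under the null is an assumption made here for illustration.

```python
from math import comb

n_snps = 20

# Number of distinct two-way interaction models among 20 SNPs.
n_pairs = comb(n_snps, 2)

# Chance of selecting one specific pair under uniform random selection.
chance_rate = 1 / n_pairs
```

With 190 candidate pairs, the chance rate is about 0.0053.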
For the power, we consider 100 simulated datasets for each of the 40 models, each including two disease-causal SNPs, and we select the best model over all possible two-way interaction models, without and with adjustment for covariates, respectively. The power of dAFT-MDR is estimated as the percentage of times dAFT-MDR correctly chooses the two disease-causal SNPs as the best model out of each set of 100 datasets for each model. The power of the other improvements is defined in the same way.
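The power estimate described above can be sketched as follows. The helper names `select_best_pair` and `estimate_power` are hypothetical, and `score_fn` stands for whatever criterion ranks a SNP pair, for example the balanced accuracy of the corresponding MDR model.

```python
from itertools import combinations


def select_best_pair(n_snps, score_fn):
    """Return the SNP pair with the highest score among all two-way models."""
    return max(combinations(range(n_snps), 2), key=score_fn)


def estimate_power(best_pairs, causal_pair):
    """Fraction of datasets in which the selected pair is the causal pair."""
    if not best_pairs:
        raise ValueError("need at least one selected pair")
    causal = tuple(sorted(causal_pair))
    hits = sum(tuple(sorted(pair)) == causal for pair in best_pairs)
    return hits / len(best_pairs)


# Toy usage: selected pairs from four datasets, causal pair (0, 1).
power = estimate_power([(0, 1), (1, 0), (2, 3), (0, 1)], causal_pair=(0, 1))

# Toy scoring function that favors the pair (1, 2).
best = select_best_pair(4, lambda pair: 1.0 if pair == (1, 2) else 0.0)
```

The same counting applied to null datasets gives the false detection rate, since in that case any selection of the designated pair occurs by chance.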
Figures
Comparison of the power of AFT-MDR, dAFT-MDR, and rAFT-MDR for the log-normal distribution when
Comparison of the power of AFT-MDR, dAFT-MDR, and rAFT-MDR for the log-normal distribution when
On the other hand, the power of AFT-MDR, dAFT-MDR, and rAFT-MDR behaves similarly when the effect of the covariate increases from
Figures
Comparison of the power of Cox-MDR, qCox-MDR, AFT-MDR, and qAFT-MDR for a Cox model when
Comparison of the power of Cox-MDR, qCox-MDR, AFT-MDR, and qAFT-MDR for a log-normal distribution when
Comparing the simulation results shown in Figures
On the other hand, for a log-normal model, the power of Cox-MDR decreases from 0.650 to 0.458 as the censoring fraction increases to 0.3 when the MAF is 0.2 and the heritability is 0.2, whereas the power of qCox-MDR changes from 0.958 to 0.960. In addition, the power of Cox-MDR decreases to 0.360 as the censoring fraction increases to 0.5, but the power of qCox-MDR is 0.95, which implies that qCox-MDR is very robust to the censoring fraction. Under the same setting, however, the power of AFT-MDR decreases from 0.738 to 0.302 and the power of qAFT-MDR decreases from 0.998 to 0.564 as the censoring fraction increases to 0.3. As the censoring fraction increases to 0.5, the power of AFT-MDR and qAFT-MDR decreases to 0.098 and 0.232, respectively. This result is consistent for both the Cox model and the log-normal model, which implies that only the power of qCox-MDR is robust to heavy censoring, though the power of qAFT-MDR is higher for the log-normal model than for the Cox model. These trends are similar for the Weibull distribution.
In summary, the simulation results show that AFT-MDR, dAFT-MDR, rAFT-MDR, and qAFT-MDR are more sensitive to heavy censoring (more than 0.5) than Cox-MDR and qCox-MDR across various situations. However, for moderate censoring (less than 0.3), dAFT-MDR, rAFT-MDR, and qAFT-MDR perform much better than the original AFT-MDR.
Although many findings from GWAS have been published over the last decade, the missing heritability problem remains. To address the missing heritability, we focus on gene-gene interactions because most common diseases may be due to complex gene-gene and/or gene-environment interactions rather than a single gene effect. Many plausible approaches have been developed by extending existing methods into a more general framework.
In this paper, we propose various improvements to AFT-MDR and Cox-MDR, which include dAFT-MDR, rAFT-MDR, qAFT-MDR, and qCox-MDR. The motivation for dAFT-MDR and rAFT-MDR is to improve the power of AFT-MDR, whose performance is poor when the censoring fraction exceeds 0.3. To reduce the effect of heavily censored observations, we discretize the standardized residual into a binary value, which yields dAFT-MDR. Alternatively, we truncate the extreme values and replace them with specified lower and upper bounds, which leads to rAFT-MDR. As shown in the simulation results, both dAFT-MDR and rAFT-MDR have larger power than the original AFT-MDR under moderate censoring but still have low power under heavy censoring.
In addition, we considered the improvement of QMDR, which has been recently proposed in [
In conclusion, the improvement of Cox-MDR, namely qCox-MDR, has reasonable power and is robust to heavy censoring, whereas the improvements of AFT-MDR, namely dAFT-MDR, rAFT-MDR, and qAFT-MDR, perform better than AFT-MDR but are not robust to heavy censoring. More studies on the behavior of the standardized residuals are needed to improve the power of AFT-MDR under heavier censoring.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This research was supported by Basic Science Research Program through the National Research Foundation (NRF) funded by the Ministry of Education, Science and Technology of Korea (MEST) (NRF: 2013R1A1A3010025 and 2013M3A9C4078158).