One of the main objectives of a genome-wide association study (GWAS) is to develop a prediction model for a binary clinical outcome using single-nucleotide polymorphisms (SNPs), which can be used for diagnostic and prognostic purposes and for better understanding of the relationship between the disease and SNPs. Penalized support vector machine (SVM) methods have been widely used toward this end. However, investigators often ignore the genetic models of SNPs, and the final model then loses efficiency in predicting the clinical outcome. To overcome this problem, we propose a two-stage method in which the genetic model of each SNP is identified using the MAX test and a prediction model is then fitted using a penalized SVM method. We apply the proposed method to various penalized SVMs and compare the performance of SVMs using various penalty functions. The results from simulations and real GWAS data analysis show that the proposed method performs better than prediction methods that ignore the genetic models in terms of prediction power and selectivity.
1. Introduction
We consider a genome-wide association study (GWAS) on a complex disease. A popular objective of such a study is to predict a binary clinical outcome, such as benign versus malignant or response versus no response to a specific regimen, based on single-nucleotide polymorphism (SNP) data. A fitted prediction model is then used to predict the diagnostic or prognostic outcomes of future patients. Recently, penalization approaches incorporating logistic models or support vector machines have been actively proposed for fitting prediction models with binary outcomes. These approaches are well known to achieve predictive accuracy and variable selection simultaneously.
By introducing shrinkage priors of the normal exponential-gamma (NEG) distribution family, Hoggart et al. [1] suggested a stochastic search method for penalized logistic regression models with SNPs. Ayers and Cordell [2] showed through simulations that the NEG priors perform better than other competing penalized methods, although producing the results is computationally very intensive. Wu et al. [3] considered lasso-penalized logistic regression [4] with a large number of SNPs and proposed a cyclic coordinate descent algorithm [5] to implement the computation. Kooperberg et al. [6] removed SNPs that had a Hardy-Weinberg P value smaller than $10^{-5}$ and applied logistic regression models with lasso and Elastic Net [7] penalties using a set of SNPs preselected by a cross-validation procedure. On the other hand, Wei et al. [8] proposed selecting SNPs using the EigenStrat algorithm [9] and applying the SVM and logistic regression as predictive models. Abraham et al. [10] showed that the two penalized methods, l1 and Elastic Net SVM, were robust in case/control predictive performance based on simulation studies and real data analyses. These simultaneous analysis methods ignored the genetic models of SNPs [6] or assumed the additive model for all SNPs [6, 8, 10].
The statistical tests such as the Pearson’s chi-squared test or the Cochran-Armitage trend test (CATT) are frequently used to test if an SNP is associated with a binary outcome by assuming a specific genetic model. Oftentimes, however, the true genetic model is unknown. We can improve the testing power if we know the true genetic model of an SNP [11]. Toward this end, the test based on the maximum over the three CATT statistics (MAX test) has been presented by several authors [12, 13]. Kim et al. [14] recently proposed a prediction method for time-to-event traits using SNPs and showed that a prediction model based on the best fitting genetic models of SNPs can improve the prediction efficiency. We extend their approach to the prediction of binary outcomes using SVMs.
In this paper, we propose a prediction method that combines the MAX test and a penalized SVM to predict a binary outcome using SNPs. The proposed method is a two-phase procedure: (i) select candidate prognostic SNPs and identify their genetic models using the MAX test, and (ii) fit a prediction model using the penalized SVM with appropriate scores for the selected SNPs based on their genetic models. We compare the performance of the proposed method under different penalized SVM methods through simulations and a real GWAS data analysis. Each SVM method is combined either with the MAX test or with the general practice of ignoring the genetic models of the SNPs.
To facilitate the use of the MAX test, we provide an R package called SNPselect at http://datamining.dongguk.ac.kr/Rlib/SNPselect, which uses the penalizedSVM R package [15] to implement the SVM with SCAD, l1, and Elastic Net penalties.
2. Methods
2.1. Penalized Support Vector Machine
Suppose that there are n subjects. For subject i (=1,…,n), we have an input vector $x_i \in \mathbb{R}^p$ and a class label $y_i \in \{-1, 1\}$. The SVM [16, 17] finds the optimal hyperplane that separates the data points into two classes with the largest margin.
Wahba et al. [18] and Hastie et al. [19] found that the optimization problem of the SVM can be represented as a penalized optimization problem:
\[ \min_{\beta_0,\beta} \sum_{i=1}^{n} \left[1 - y_i\left(\beta_0 + \beta^{T} x_i\right)\right]_{+} + p_{\lambda}(\beta), \tag{1} \]
where $[1-yf]_{+} = \max(1-yf, 0)$ is called the hinge loss and $p_{\lambda}$ is a penalty function with regularization parameter λ. The SVM using the squared l2-norm, $p_{\lambda}(\beta) = \|\beta\|_2^2$, as the penalty function is called the standard SVM or l2-SVM.
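To make the objective in (1) concrete, the following minimal Python sketch evaluates the hinge loss and the penalized objective for a given hyperplane. The function names and the toy data are hypothetical and chosen only for illustration; the sketch evaluates the objective and does not perform the minimization.

```python
def hinge_loss(y, f):
    """Hinge loss [1 - y*f]_+ for a label y in {-1, +1} and a score f."""
    return max(1.0 - y * f, 0.0)


def l2_penalty(beta):
    """Squared l2-norm penalty of the standard SVM."""
    return sum(b * b for b in beta)


def penalized_objective(beta0, beta, X, y, penalty):
    """Sum of hinge losses over all subjects plus a penalty on beta, as in (1)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        score = beta0 + sum(b * v for b, v in zip(beta, x_i))
        total += hinge_loss(y_i, score)
    return total + penalty(beta)


# Hypothetical toy data: three subjects with two inputs each.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
y = [1, 1, -1]
objective = penalized_objective(0.0, [0.5, 0.5], X, y, l2_penalty)
```

Here the first two subjects lie inside the margin and each contribute a loss of 0.5, the third is classified with margin exactly 1 and contributes nothing, and the penalty adds 0.5.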
The l2-SVM has been successfully applied to classification with high-dimensional data such as gene microarrays and SNPs, but it does not select the variables affecting the response class label. For feature selection with the l2-SVM, Guyon et al. [20] proposed the SVM-RFE procedure, which combines recursive feature elimination (RFE) with the l2-SVM. This is a two-step procedure that relies on an external gene selection method.
In order to achieve classification accuracy and feature selection simultaneously, variants of the SVM have been proposed by replacing the penalty function in (1) with other types of penalty functions, for example, the SVM with 1-norm [21, 22], adaptive lasso [23], or smoothly clipped absolute deviation (SCAD) [24, 25] penalties. The SVM with 1-norm (or l1-SVM) adopts the lasso (or l1) penalty, $p_{\lambda}(\beta) = \lambda\|\beta\|_1$, originally proposed by Tibshirani [4] as a practical alternative to the l2 penalty. Due to the l1 penalty, the l1-SVM automatically selects variables by shrinking the small coefficients of the hyperplane to exactly zero.
A major drawback of the l1 penalty is that it tends to select only one variable when there are many correlated input variables in the data. To overcome this limitation of the lasso, Zou and Hastie [7] proposed the Elastic Net penalty, which combines the l1 and l2 penalties:
\[ p_{\lambda}(\beta) = \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2. \tag{2} \]
The Elastic Net penalty performs variable selection owing to the l1 penalty, while tending to select highly correlated variables together, which is called the grouping effect. Wang et al. [26] applied the Elastic Net penalty to SVM classification problems.
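Fan and Li [24] proposed the smoothly clipped absolute deviation (SCAD) penalty given as
\[ p_{\lambda}(\beta) = \sum_{j=1}^{p} p_{\lambda}(\beta_j; a), \tag{3} \]
where
\[ p_{\lambda}(\beta; a) = \begin{cases} \lambda|\beta| & \text{if } |\beta| < \lambda, \\ -\dfrac{|\beta|^2 - 2a\lambda|\beta| + \lambda^2}{2(a-1)} & \text{if } \lambda \le |\beta| \le a\lambda, \\ \dfrac{(a+1)\lambda^2}{2} & \text{if } |\beta| \ge a\lambda. \end{cases} \tag{4} \]
Here, a (>2) and λ (>0) are tuning parameters. Fan and Li [24] showed that prediction with the SCAD penalty is not sensitive to the tuning parameter a and recommended using a=3.7.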
The SCAD yields the same behavior as l1 for small coefficients βj, j=1,…,p, but assigns a constant penalty for large coefficients. This property reduces the estimation bias. Fan and Li [24] demonstrate more desirable theoretical properties of the SCAD penalty compared to the l1 penalty. Later, Zhang et al. [25] proposed the SVM with the SCAD penalty for feature selection.
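The three penalties discussed in this section can be compared numerically with a short Python sketch. The function names are hypothetical; the formulas follow the l1 penalty, the Elastic Net penalty (2), and the SCAD penalty (3)-(4), with a=3.7 as the default.

```python
def l1_penalty(beta, lam):
    """Lasso penalty: lam times the sum of absolute coefficients."""
    return lam * sum(abs(b) for b in beta)


def elastic_net_penalty(beta, lam1, lam2):
    """Elastic Net penalty (2): l1 part plus squared l2 part."""
    return lam1 * sum(abs(b) for b in beta) + lam2 * sum(b * b for b in beta)


def scad_penalty_single(b, lam, a=3.7):
    """SCAD penalty (4) for one coefficient; assumes a > 2 and lam > 0."""
    b = abs(b)
    if b < lam:
        return lam * b  # behaves like the l1 penalty for small coefficients
    if b <= a * lam:
        return -(b * b - 2.0 * a * lam * b + lam * lam) / (2.0 * (a - 1.0))
    return (a + 1.0) * lam * lam / 2.0  # constant for large coefficients


def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty (3): sum of the single-coefficient penalties."""
    return sum(scad_penalty_single(b, lam, a) for b in beta)
```

With λ=1, for example, the SCAD penalty equals the l1 penalty for |β|<1 and stays at (a+1)/2 = 2.35 for every |β|≥3.7, which illustrates why large coefficients are not shrunk further.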
2.2. Genetic Models for SNPs
Let AA, AB, and BB be three possible genotypes where B is a risk allele for a given SNP. We denote the number of B alleles in a genotype by k; that is, k=0,1, or 2 if the genotype is AA, AB, or BB, respectively. For a given SNP, the data from n patients are summarized in Table 1.
Table 1: Genotype frequencies.

| | AA | AB | BB | Total |
|---|---|---|---|---|
| Response | r0 | r1 | r2 | r |
| No response | s0 | s1 | s2 | s |
| Total | n0 | n1 | n2 | n |
Let pk denote the response probability given a genotype k=0,1,2. If B is the response allele, the response probability increases as the number of B alleles in the SNP increases; that is, p0≤p1≤p2. In this paper, we will consider three popular genetic models satisfying this assumption:
recessive model: p0=p1<p2;
dominant model: p0<p1=p2;
additive model: p0<p1=(p0+p2)/2.
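As an illustration of how the three genetic models translate into covariate values, the sketch below maps a genotype count k (0, 1, or 2 copies of allele B) to a score. The scores (0, c1, 1) with c1 = 0, 1/2, and 1 are the trend-test scores given in Section 2.3; the dictionary and function names are hypothetical.

```python
# Scores (c0, c1, c2) for genotypes (AA, AB, BB) under each genetic model.
SCORES = {
    "recessive": (0.0, 0.0, 1.0),
    "additive": (0.0, 0.5, 1.0),
    "dominant": (0.0, 1.0, 1.0),
}


def encode_genotypes(genotypes, model):
    """Replace each B-allele count (0, 1, or 2) by its score under the given model."""
    scores = SCORES[model]
    return [scores[k] for k in genotypes]


recessive_scores = encode_genotypes([0, 1, 2], "recessive")
dominant_scores = encode_genotypes([0, 1, 2], "dominant")
additive_scores = encode_genotypes([0, 1, 2], "additive")
```

Under the recessive coding only BB carriers differ from the rest, under the dominant coding AB and BB carriers are pooled, and the additive coding places AB halfway between the two homozygotes.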
2.3. Trend Test and MAX Test
For testing association between an SNP and a clinical outcome in case-control studies, the statistical tests such as the Pearson’s chi-squared test or CATT are frequently used when the true genetic model is known. In this case, the CATT is usually more powerful than Pearson’s chi-squared test when p0≤p1≤p2 [12]. For a single SNP, borrowing the notations of Table 1, the CATT statistic can be written as
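\[ T_c = \frac{n^{1/2} \sum_{k=0}^{2} c_k \left(s r_k - r s_k\right)}{\sqrt{r s \left\{ n \sum_{k=0}^{2} c_k^2 n_k - \left( \sum_{k=0}^{2} c_k n_k \right)^2 \right\}}}, \tag{5} \]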
where (c0, c1, c2) is a set of scores assigned to the genotypes (AA, AB, BB) according to a specific genetic model. The trend test is invariant under a linear transformation of the scores with c0≤c1≤c2, so the typical choice is c0=0 and c2=1, while c1 takes a different value according to the genetic model. From the results of Sasieni [27] and Zheng et al. [12, 28], the optimal choices of c1 are 0, 1/2, and 1 for the recessive, additive, and dominant models, respectively. Recall that pk denotes the response probability for genotype group k=0,1,2. Under the null hypothesis of no association, H0: p0=p1=p2, Tc approximately follows N(0,1) for large n.
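The CATT statistic in (5) can be computed directly from the counts in Table 1. The following Python sketch is one possible implementation; the function name and the example counts are hypothetical, and returning 0 when the variance term vanishes is an assumption made here to avoid division by zero.

```python
import math


def catt_statistic(r, s, c):
    """CATT statistic (5) from responder counts r = (r0, r1, r2),
    non-responder counts s = (s0, s1, s2), and scores c = (c0, c1, c2)."""
    r_tot, s_tot = sum(r), sum(s)
    n = r_tot + s_tot
    n_k = [rk + sk for rk, sk in zip(r, s)]
    numerator = math.sqrt(n) * sum(
        ck * (s_tot * rk - r_tot * sk) for ck, rk, sk in zip(c, r, s)
    )
    spread = n * sum(ck * ck * nk for ck, nk in zip(c, n_k)) - sum(
        ck * nk for ck, nk in zip(c, n_k)
    ) ** 2
    variance = r_tot * s_tot * spread
    if variance <= 0:
        return 0.0  # assumption: degenerate table (e.g., one genotype group only)
    return numerator / math.sqrt(variance)


# Hypothetical counts: responders (5, 10, 25), non-responders (25, 10, 5).
t_additive = catt_statistic((5, 10, 25), (25, 10, 5), (0.0, 0.5, 1.0))
t_rescaled = catt_statistic((5, 10, 25), (25, 10, 5), (0.0, 1.0, 2.0))
```

The two calls give the same value, which illustrates the invariance of the trend test under a linear transformation of the scores.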
When the true genetic model is unknown, the test based on multiple CATTs for different genetic models can lead to substantial reduction in statistical power [11] or inflated type I error rate. To address this issue, the test based on the maximum over the three CATT statistics (MAX test) has been proposed by several authors [12, 13]. Let TR, TA, and TD denote the CATT statistics using the scores for recessive, additive, and dominant models, respectively. Based on the three CATT statistics, the MAX test statistic is defined as
\[ T_{\max} = \max\left(|T_R|, |T_A|, |T_D|\right). \tag{6} \]
The MAX test has robust properties [29] and is more powerful than the Pearson’s chi-squared test [12] when the underlying genetic model is unknown.
Even though one can easily calculate the MAX test statistic from (5) and (6), it is not simple to compute its P value. One approach of obtaining the P value is based on a Monte-Carlo simulation. Under H0, Zheng et al. [12] showed that (TR,TD,TA) is asymptotically normal with covariances
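\[ \begin{aligned}
\operatorname{cov}(T_R, T_A) &= \frac{f_2(f_1 + 2f_0)}{\sqrt{f_2(1-f_2)}\,\sqrt{f_0(f_1+2f_2) + f_2(f_1+2f_0)}}, \\
\operatorname{cov}(T_R, T_D) &= \frac{f_0 f_2}{\sqrt{f_0(1-f_0)}\,\sqrt{f_2(1-f_2)}}, \\
\operatorname{cov}(T_A, T_D) &= \frac{f_0(f_1 + 2f_2)}{\sqrt{f_0(1-f_0)}\,\sqrt{f_0(f_1+2f_2) + f_2(f_1+2f_0)}},
\end{aligned} \tag{7} \]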
where fk denotes the relative frequency of genotype k=0,1,2 (f0+f1+f2=1). Thus we can approximate the P value of the MAX test using Monte-Carlo samples from the multivariate normal distribution with estimated variance-covariance matrix $\hat{\Sigma}$, which is obtained by replacing fk in the above covariances with $\hat{f}_k = n_k/n$ for k=0,1,2.
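A minimal Python sketch of this Monte-Carlo approximation is given below. It is an illustration under stated assumptions rather than the implementation used in this paper: the function name is hypothetical, all three genotype counts are assumed to be positive, and the draws are generated by a shortcut. Because the additive scores are the average of the recessive and dominant scores, the additive statistic can be written as a fixed linear combination of the other two, so only (T_R, T_D) are sampled and T_A is derived from them; this reproduces the covariances in (7).

```python
import math
import random


def max_test_pvalue(t_max, genotype_counts, n_draws=100000, seed=0):
    """Monte-Carlo P value of the MAX test for genotype counts (n0, n1, n2), all > 0."""
    n = sum(genotype_counts)
    f0, f1, f2 = (nk / n for nk in genotype_counts)  # estimated genotype frequencies
    sd_r = math.sqrt(f2 * (1.0 - f2))
    sd_d = math.sqrt(f0 * (1.0 - f0))
    sd_a = math.sqrt(f0 * (f1 + 2.0 * f2) + f2 * (f1 + 2.0 * f0))
    rho_rd = f0 * f2 / (sd_r * sd_d)  # cov(T_R, T_D) in (7)
    resid = math.sqrt(max(0.0, 1.0 - rho_rd ** 2))
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        t_r = e1
        t_d = rho_rd * e1 + resid * e2
        # Assumption of this sketch: T_A as the linear combination of T_R and T_D.
        t_a = (sd_r * t_r + sd_d * t_d) / sd_a
        if max(abs(t_r), abs(t_a), abs(t_d)) >= t_max:
            exceed += 1
    return exceed / n_draws
```

For an observed MAX statistic of 2.0 with genotype counts (25, 50, 25), the approximate P value lies between the single-test two-sided P value (about 0.046) and the Bonferroni bound for three tests (about 0.137), as expected for positively correlated statistics.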
There have been some studies on variants of MAX test for binary clinical outcomes. Zheng et al. [12] developed a robust ranking method, called MAX-rank test. Conneely et al. [30] proposed an efficient P value computation method that is shown to be more accurate than that using permutations by adjusting for correlated test statistics. Li et al. [31] proposed the P-rank test approximating the P value for the MAX test with or without covariate adjustment. Li et al. [32] compared the performance of the MAX-rank and P-rank tests. For more detailed discussions on MAX test, see [11] or [32].
2.4. Classification via SVM with MAX Test
For patient i=1,…,n, let yi denote the binary clinical outcome, taking the value 1 if the patient responded and -1 otherwise, and let (ki1,…,kim) denote the encoded data on m SNPs; that is, kij=0,1,2 is the number of risk alleles for SNP j (=1,…,m). To build a classification model with this data set, we propose a method combining a penalized SVM and the MAX test. Our method is a two-phase procedure: (i) prescreening SNPs and identifying the genetic models for the selected SNPs using the MAX test and (ii) applying the penalized SVM to fit a classification model. Our method can be summarized as follows.
1. Read in the clinical outcomes (y1,…,yn) and SNP data {(ki1,…,kim), i=1,…,n}.
2. For SNP j (=1,…,m),
   (a) using the original data, calculate the test statistics (Tj,R, Tj,A, Tj,D), their two-sided P values (pj,R, pj,A, pj,D), and the MAX test statistic Tj,max = max(|Tj,R|, |Tj,A|, |Tj,D|);
   (b) compute the approximate P value of the MAX test by Monte-Carlo simulation:
       (i) estimate the variance-covariance matrix $\hat{\Sigma}_j$;
       (ii) generate $(t_{j,R}^{(b)}, t_{j,A}^{(b)}, t_{j,D}^{(b)})$ from $N(0, \hat{\Sigma}_j)$ for b=1,…,B (=100,000, say);
       (iii) approximate the P value for the MAX test by
       \[ p_j = B^{-1} \sum_{b=1}^{B} I\left(t_{j,\max}^{(b)} \ge T_{j,\max}\right), \tag{8} \]
       where $t_{j,\max}^{(b)} = \max\left(|t_{j,R}^{(b)}|, |t_{j,A}^{(b)}|, |t_{j,D}^{(b)}|\right)$.
3. SNP screening: select J (≪m) SNPs with pj<α for a prespecified α value, such as 0.01.
4. For each selected SNP j, identify the genetic model by the smallest P value among pj,R, pj,A, and pj,D.
5. Assign covariate values (zi1,…,ziJ) using the score corresponding to the identified genetic model.
6. Standardize the covariates; that is,
   \[ z'_{ij} = \frac{z_{ij} - \bar{z}_j}{s_j}, \tag{9} \]
   where $\bar{z}_j = n^{-1} \sum_{i=1}^{n} z_{ij}$ and $s_j^2 = n^{-1} \sum_{i=1}^{n} (z_{ij} - \bar{z}_j)^2$.
7. Apply the penalized SVM to the response data (y1,…,yn) and the standardized covariates {(zi1′,…,ziJ′), i=1,…,n}.
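The screening, model identification, scoring, and standardization steps of the procedure above can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the SNPselect implementation: all function names are hypothetical, the two-sided P values use the N(0,1) approximation of the CATT, the MAX test P values passed to the screening function are assumed to have been computed already, and the final penalized SVM fit is left to a dedicated library.

```python
import math

# Scores (c0, c1, c2) for each genetic model.
SCORES = {
    "recessive": (0.0, 0.0, 1.0),
    "additive": (0.0, 0.5, 1.0),
    "dominant": (0.0, 1.0, 1.0),
}


def catt(r, s, c):
    """CATT statistic (5); assumes a non-degenerate 2x3 table."""
    r_tot, s_tot = sum(r), sum(s)
    n = r_tot + s_tot
    n_k = [rk + sk for rk, sk in zip(r, s)]
    num = math.sqrt(n) * sum(ck * (s_tot * rk - r_tot * sk) for ck, rk, sk in zip(c, r, s))
    spread = n * sum(ck * ck * nk for ck, nk in zip(c, n_k)) - sum(
        ck * nk for ck, nk in zip(c, n_k)
    ) ** 2
    return num / math.sqrt(r_tot * s_tot * spread)


def screen_snps(max_pvalues, alpha=0.01):
    """Keep the indices of SNPs whose MAX test P value is below alpha."""
    return [j for j, p in enumerate(max_pvalues) if p < alpha]


def identify_model(r, s):
    """Pick the genetic model whose CATT has the smallest two-sided P value."""
    pvalues = {
        model: math.erfc(abs(catt(r, s, c)) / math.sqrt(2.0))
        for model, c in SCORES.items()
    }
    return min(pvalues, key=pvalues.get), pvalues


def standardized_covariate(genotypes, model):
    """Score the genotypes under the model, then standardize as in (9)."""
    z = [SCORES[model][k] for k in genotypes]
    mean = sum(z) / len(z)
    sd = math.sqrt(sum((v - mean) ** 2 for v in z) / len(z))
    return [(v - mean) / sd for v in z]
```

For example, with responder counts (10, 10, 30) and non-responder counts (20, 20, 10), the recessive statistic is the largest in absolute value, so `identify_model` returns the recessive model for that SNP.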
3. Results
3.1. Simulation Studies
First, we generate IID N(0,1) random variables ϵi1,…,ϵim and, for ρ∈[0,1), set
\[ x_{ij} = \begin{cases} \epsilon_{ij}, & j=1, \\ \rho x_{i,j-1} + \sqrt{1-\rho^2}\,\epsilon_{ij}, & j=2,\ldots,m. \end{cases} \tag{10} \]
Note that xi1,…,xim have an AR(1) correlation structure with autocorrelation coefficient ρ as in [14]. Correlated SNP data are generated by
\[ z_{ij} = \begin{cases} 0, & x_{ij} < u_{f_0}, \\ 1, & u_{f_0} \le x_{ij} < u_{f_0+f_1}, \\ 2, & \text{otherwise}, \end{cases} \tag{11} \]
where $u_q$ denotes the qth quantile of the standard normal distribution. The binary clinical outcome of patient i is generated using the response probability pi, which is related to the covariates by
\[ \operatorname{logit}(p_i) = \sum_{j=1}^{m} \beta_j z_{ij}. \tag{12} \]
To consider the cases of uncorrelated and moderately correlated SNPs in our experiment, we set ρ=0 or 0.3. We generate m=1000 encoded SNPs with (f0,f1)=(1/4,1/2) for j=1,…,6 and (f0,f1)=(1/3,1/3) for j=7,…,1000. SNPs 1 and 2 have recessive models, SNPs 3 and 4 have dominant models, and SNPs 5 and 6 have additive models. The regression coefficients for these six prognostic SNPs are set at β1=β2=β3=β4=β5=β6=0.8. According to this data generation scheme, we generate simulation data sets of size 200, and each data set is partitioned into a training set (2/3) and a test set (1/3). To fit a classification model, the SVM with one of the three penalty functions, SCAD (SCAD-SVM), l1 (l1-SVM), or Elastic Net (Enet-SVM), is applied to the SNPs selected using α=0.01. To choose the final classification model, we use 5-fold cross-validation to select the tuning parameters. A standard practice in fitting classification models to SNP data is to assume the same genetic model for all SNPs. To evaluate the performance of the model fitting methods combined with the MAX test, we therefore also fit classification models assuming one genetic model for all SNPs.
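The data generation scheme in (10)-(12) can be sketched in Python as follows. The function names are hypothetical, and one point is an assumption of this sketch rather than something stated above: the covariate entering (12) for each prognostic SNP is taken to be the genotype score under that SNP's genetic model (0, c1, 1 with c1 = 0, 1/2, or 1).

```python
import math
import random
from statistics import NormalDist

SCORES = {
    "recessive": (0.0, 0.0, 1.0),
    "additive": (0.0, 0.5, 1.0),
    "dominant": (0.0, 1.0, 1.0),
}


def make_thresholds(freqs):
    """Normal quantiles (u_{f0}, u_{f0+f1}) for each SNP, given (f0, f1) pairs."""
    std_normal = NormalDist()
    return [(std_normal.inv_cdf(f0), std_normal.inv_cdf(f0 + f1)) for f0, f1 in freqs]


def simulate_genotypes(thresholds, rho, rng):
    """One subject's genotypes via the AR(1) latent variables (10) and cut-offs (11)."""
    genotypes, x_prev = [], 0.0
    for j, (lower, upper) in enumerate(thresholds):
        eps = rng.gauss(0.0, 1.0)
        x = eps if j == 0 else rho * x_prev + math.sqrt(1.0 - rho * rho) * eps
        genotypes.append(0 if x < lower else (1 if x < upper else 2))
        x_prev = x
    return genotypes


def simulate_outcome(genotypes, effects, rng):
    """Outcome in {-1, +1} from (12); effects is a list of (snp_index, model, beta)."""
    eta = sum(beta * SCORES[model][genotypes[j]] for j, model, beta in effects)
    prob = 1.0 / (1.0 + math.exp(-eta))
    return 1 if rng.random() < prob else -1


# Small illustration with the six prognostic SNPs only (m = 6 instead of 1000).
rng = random.Random(0)
thresholds = make_thresholds([(0.25, 0.5)] * 6)
effects = [(0, "recessive", 0.8), (1, "recessive", 0.8), (2, "dominant", 0.8),
           (3, "dominant", 0.8), (4, "additive", 0.8), (5, "additive", 0.8)]
subjects = [simulate_genotypes(thresholds, 0.3, rng) for _ in range(2000)]
outcomes = [simulate_outcome(g, effects, rng) for g in subjects]
```

Because each latent variable is marginally N(0,1), the genotype frequencies of every SNP match (f0, f1, 1-f0-f1) regardless of ρ.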
For each model fitting method, we calculate three performance measures: the number of selected SNPs, the number of selected prognostic SNPs, and the misclassification error. Here, the selected SNPs are those chosen by the penalized SVM among the SNPs that pass the prescreening step, and the selected prognostic SNPs are the prognostic ones among the selected SNPs. The misclassification errors are estimated using the test data set; that is,
\[ \frac{1}{n} \sum_{i=1}^{n} I\left(y_i \ne \operatorname{sign} \hat{f}(z_i)\right), \tag{13} \]
where I(·) is the indicator function, n here denotes the number of subjects in the test set, $\hat{f}(z) = \hat{\beta}_1 z'_1 + \cdots + \hat{\beta}_J z'_J$ denotes the predicted response score for the test set, and the z′j are the covariates in the test set standardized using the means and standard deviations calculated from the training set. To assess the variability of the experiments, we replicate the whole process 100 times. Table 2 summarizes the three averaged performance measures from our simulations.
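The evaluation step can be illustrated with a short Python sketch: the standardization constants are estimated on the training set only and then applied to the test set before the error in (13) is computed. The function names are hypothetical, and treating a score of exactly zero as the positive class is an assumption of this sketch.

```python
import math


def fit_standardizer(train_column):
    """Mean and standard deviation (divisor n) of one covariate in the training set."""
    n = len(train_column)
    mean = sum(train_column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in train_column) / n)
    return mean, sd


def apply_standardizer(column, mean, sd):
    """Standardize a (test-set) covariate with training-set constants."""
    return [(v - mean) / sd for v in column]


def predicted_score(beta_hat, z_row):
    """Linear score f(z) = sum_j beta_j * z'_j for one subject."""
    return sum(b * z for b, z in zip(beta_hat, z_row))


def misclassification_error(y_true, scores):
    """Proportion of test subjects with y_i != sign(f(z_i)), as in (13)."""
    wrong = sum(1 for y, f in zip(y_true, scores) if y != (1 if f >= 0 else -1))
    return wrong / len(y_true)


error = misclassification_error([1, -1, 1, -1], [0.3, -0.2, -0.1, 0.4])
```

In this toy example the last two subjects are misclassified, so the estimated error is 0.5.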
Table 2: Results of simulations with 100 replications. "Selected SNPs" and "Prognostic SNPs" are the average numbers of selected SNPs and selected prognostic SNPs, respectively, in the fitted models; standard errors are given in parentheses.

| ρ | Genetic model | Selected SNPs: l1 | Selected SNPs: Enet | Selected SNPs: SCAD | Prognostic SNPs: l1 | Prognostic SNPs: Enet | Prognostic SNPs: SCAD | Misclassification error: l1 | Misclassification error: Enet | Misclassification error: SCAD |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Proposed | 43.10 (0.54) | 48.66 (0.70) | 20.31 (0.65) | 5.11 (0.09) | 5.46 (0.07) | 3.31 (0.14) | 0.1766 (0.0048) | 0.1567 (0.0054) | 0.2736 (0.0062) |
| 0 | Recessive | 40.50 (0.56) | 41.38 (1.11) | 25.28 (0.66) | 4.62 (0.10) | 4.74 (0.10) | 3.66 (0.14) | 0.2408 (0.0054) | 0.2518 (0.0065) | 0.2912 (0.0047) |
| 0 | Additive | 42.71 (0.49) | 45.58 (0.87) | 24.12 (0.91) | 5.23 (0.08) | 5.35 (0.08) | 4.07 (0.18) | 0.2161 (0.0048) | 0.2118 (0.0064) | 0.3272 (0.0076) |
| 0 | Dominant | 41.35 (0.58) | 43.46 (0.98) | 23.99 (0.70) | 4.70 (0.10) | 4.86 (0.09) | 3.38 (0.12) | 0.2457 (0.0056) | 0.2347 (0.0063) | 0.2995 (0.0042) |
| 0.3 | Proposed | 42.92 (0.50) | 45.68 (0.80) | 19.56 (0.48) | 5.12 (0.08) | 5.20 (0.08) | 3.40 (0.10) | 0.1690 (0.0049) | 0.1541 (0.0047) | 0.2833 (0.0060) |
| 0.3 | Recessive | 39.49 (0.58) | 41.09 (0.87) | 27.23 (0.59) | 4.34 (0.10) | 4.47 (0.11) | 3.03 (0.03) | 0.2383 (0.0057) | 0.2368 (0.0057) | 0.2741 (0.0019) |
| 0.3 | Additive | 42.07 (0.52) | 43.90 (0.96) | 21.89 (0.67) | 5.06 (0.08) | 5.04 (0.08) | 3.74 (0.08) | 0.2126 (0.0052) | 0.2074 (0.0057) | 0.3338 (0.0056) |
| 0.3 | Dominant | 39.97 (0.62) | 38.56 (1.03) | 24.97 (0.53) | 4.56 (0.10) | 4.29 (0.11) | 2.04 (0.03) | 0.2502 (0.0065) | 0.2338 (0.0059) | 0.2607 (0.0039) |
Comparing the numbers of selected SNPs in Table 2, we observe that Enet-SVM tends to select the most SNPs and SCAD-SVM the fewest, except for the case of ρ=0.3 with the dominant model, where l1-SVM selects slightly more SNPs than Enet-SVM. With respect to the genetic models, the proposed method selects more SNPs than the fixed-model alternatives when l1-SVM or Enet-SVM is applied. However, the combination of the proposed method and SCAD-SVM selects far fewer SNPs than the other combinations. Comparing the numbers of prognostic SNPs, Enet-SVM and l1-SVM perform better than SCAD-SVM, and the proposed method and the additive model show good selectivity for the true prognostic SNPs. With correlated SNPs (ρ=0.3), Enet-SVM and l1-SVM with the proposed method achieve better selectivity for the true prognostic SNPs than with the additive model. However, the proposed method can be the worst when SCAD-SVM is used for uncorrelated SNP data. We also compare the misclassification errors. Although the differences between Enet-SVM and l1-SVM are small, Enet-SVM performs best in most cases. SCAD-SVM produces the worst misclassification errors in all cases. We also find that the proposed method has the lowest misclassification error whichever penalized SVM method is used, except when SCAD-SVM is applied for ρ=0.3. Based on these simulation results, the proposed method combined with Enet-SVM or l1-SVM can improve both the selectivity for true prognostic SNPs and the prediction ability relative to methods that use a prefixed genetic model.
3.2. Real Data Analysis Example
Kim et al. [33] performed a GWAS using Affymetrix Genome-Wide Human SNP Arrays 6.0 (San Diego, CA, USA) on 190 patients with chronic myelogenous leukemia (CML). After excluding the SNPs with one missing case and those with the same genotype for all 190 patients, we use 330,353 autosomal SNPs in the further data analysis. The clinical endpoint is the achievement of major molecular response to an induction chemotherapy by 18 months. BCR/ABL transcript levels were measured to determine molecular response to imatinib therapy as described previously by Kim et al. [34] and are presented using the international scale. Major molecular response (MMR) was defined as <0.1% of the BCR/ABL fusion gene transcript level on the international scale by quantitative PCR. Among the 190 patients, 115 responded.
We randomly partition the CML data into 126 training samples and 64 test samples and then calculate the predictive performance measures for the methods over 100 random partitions. Table 3 summarizes the average number of selected SNPs and the mean misclassification errors, with their standard errors in parentheses, over the 100 random partitions. Similar to the simulation results, l1-SVM with the MAX test selects somewhat more SNPs but produces a lower misclassification error than with any fixed genetic model. For Enet-SVM, the proposed method gives a misclassification error (0.0590) comparable to the best fixed-model result (0.0562 under the recessive model) while selecting fewer SNPs than under the recessive or additive model. Among the three penalized methods, Enet-SVM selects the largest number of SNPs but has the lowest misclassification error regardless of the use of the MAX test. In contrast, SCAD-SVM selects the fewest SNPs but has poor prediction performance under every assumption about the genetic models, which agrees with the simulation results.
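The repeated random partitioning described above can be sketched in Python as follows. The function names and the seed are hypothetical, and computing the standard error from the sample standard deviation (divisor k-1) over the partitions is an assumption of this sketch; the model fitting inside each partition is omitted.

```python
import math
import random


def random_partition(n, n_train, rng):
    """Split subject indices 0..n-1 at random into training and test index lists."""
    indices = list(range(n))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def mean_and_standard_error(values):
    """Average of a performance measure over partitions and its standard error."""
    k = len(values)
    mean = sum(values) / k
    variance = sum((v - mean) ** 2 for v in values) / (k - 1)
    return mean, math.sqrt(variance / k)


rng = random.Random(2013)  # arbitrary seed, used only for reproducibility
partitions = [random_partition(190, 126, rng) for _ in range(100)]
```

Each element of `partitions` holds 126 training indices and 64 test indices; a performance measure computed on each test set can then be summarized with `mean_and_standard_error`.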
Table 3: Results of the CML data. The number of selected SNPs and the misclassification error are averaged over 100 random partitions; standard errors are given in parentheses.

| Genetic model | Selected SNPs: l1-SVM | Selected SNPs: Enet-SVM | Selected SNPs: SCAD-SVM | Misclassification error: l1-SVM | Misclassification error: Enet-SVM | Misclassification error: SCAD-SVM |
|---|---|---|---|---|---|---|
| Proposed | 70.38 (1.29) | 99.80 (4.10) | 55.90 (0.52) | 0.0737 (0.0036) | 0.0590 (0.0062) | 0.1098 (0.0013) |
| Recessive | 55.24 (1.19) | 120.46 (4.73) | 27.82 (2.33) | 0.1184 (0.0048) | 0.0562 (0.0051) | 0.2003 (0.0044) |
| Additive | 66.32 (1.12) | 120.76 (5.00) | 43.50 (0.27) | 0.1063 (0.0051) | 0.0667 (0.0061) | 0.1530 (0.0026) |
| Dominant | 51.90 (0.89) | 91.92 (4.81) | 50.90 (1.30) | 0.1013 (0.0062) | 0.0702 (0.0069) | 0.1663 (0.0044) |
Table 4 lists the 51 SNPs selected in common by the three penalized methods from the 126 training samples of one of the 100 random partitions. Among the 51 SNPs, rs420549 is located in the 3′UTR region of the TGFBR1 gene at 9q22. TGFBR1 (transforming growth factor beta receptor 1) interacts with TGF beta 1 [35, 36] and TGF beta receptor 2 [37, 38]. TGF beta plays an important role in maintaining the balance between growth and differentiation of hematopoietic cells [39, 40] and is known to have bidirectional tumor-suppressing and tumor-promoting functions [41]. The TGF-β-FOXO signaling pathway is involved in the maintenance of leukemia-initiating cells in CML, contributing to the intrinsic resistance of CML LSCs to tyrosine kinase inhibitors [42, 43]. Accordingly, an intrinsic difference in receptor affinity for TGF-β might lead to different sensitivities to TGF-β; thus, it is plausible that the response to imatinib therapy depends on the TGFBR1 genotype.
Table 4: List of SNPs selected commonly by the three penalized methods (genetic model: A, additive; D, dominant; R, recessive).

| RS ID | Genetic model | P value | RS ID | Genetic model | P value | RS ID | Genetic model | P value |
|---|---|---|---|---|---|---|---|---|
| rs3750551 | D | 0.000510 | rs9289221 | R | 0.000160 | rs6621316 | A | 0.000890 |
| rs3886721 | A | 0.000040 | rs16972014 | A | 0.000170 | rs9890262 | R | 0.000210 |
| rs2938451 | A | 0.000000 | rs3013492 | R | 0.000760 | rs6779769 | A | 0.000510 |
| rs6429646 | R | 0.000050 | rs7095688 | A | 0.000920 | rs9502826 | D | 0.000690 |
| rs6426870 | R | 0.000230 | rs1439691 | R | 0.000100 | rs9896683 | R | 0.000850 |
| rs4784924 | R | 0.000100 | rs7123207 | R | 0.000490 | rs12907966 | D | 0.000220 |
| rs8075266 | R | 0.000190 | rs16830058 | A | 0.000830 | rs5979009 | D | 0.000150 |
| rs4851920 | R | 0.000130 | rs10484180 | R | 0.000930 | rs17157980 | D | 0.000730 |
| rs9809817 | R | 0.000190 | rs1952096 | A | 0.000250 | rs2865510 | R | 0.000160 |
| rs342735 | A | 0.000180 | rs2842068 | D | 0.000600 | rs12457620 | D | 0.000810 |
| rs17066311 | D | 0.000790 | rs420549 | D | 0.000440 | rs4510937 | R | 0.000390 |
| rs6627852 | A | 0.000470 | rs16822723 | A | 0.000590 | rs8073928 | R | 0.000510 |
| rs11841074 | D | 0.000130 | rs2492664 | A | 0.000270 | rs10409991 | R | 0.000290 |
| rs9447907 | R | 0.000650 | rs2029866 | R | 0.000730 | rs1871332 | A | 0.000150 |
| rs16873423 | D | 0.000360 | rs764515 | A | 0.000030 | rs1264547 | D | 0.000670 |
| rs315025 | A | 0.000390 | rs11197596 | A | 0.000240 | rs2016016 | A | 0.000360 |
| rs2355615 | A | 0.000130 | rs9344734 | D | 0.000690 | rs6605081 | R | 0.000150 |
4. Conclusions
Although penalized methods have been regarded as successful for prediction in GWAS, they remain subject to high misclassification error when the genetic models of prognostic SNPs are ignored. In this paper, we proposed a two-phase procedure: (i) carry out the MAX test to screen out noncandidate SNPs and identify the genetic models of the selected SNPs at the first stage and then (ii) apply a penalized SVM to the selected SNPs to fit a classification model at the second stage. We compared the performance of the proposed method with that of conventional methods that ignore the genetic models of prognostic SNPs through simulations and a real data example. In the simulations, we observed that, among the three penalized SVM methods, Enet-SVM and l1-SVM select more SNPs but have higher selectivity for true prognostic SNPs and lower misclassification errors. When combined with the proposed method, which selects candidate SNPs and estimates their genetic models, the penalized SVMs other than SCAD-SVM improved in terms of both the selection of true prognostic SNPs and misclassification error. Furthermore, the differences in misclassification error among the three methods become much smaller with the proposed method. Hence, whichever penalized SVM is used for model fitting, combining it with the MAX test to identify the genetic models of candidate prognostic SNPs could help to improve its performance. We made similar observations in the real data example. Even so, the selection of candidate SNPs can vary with the choice of the prespecified α; thus, the prescreening by the MAX test may fail to select some of the true prognostic SNPs. We will consider this point in future work.
Authors’ Contribution
Jinseog Kim and Insuk Sohn contributed equally to this work.
Acknowledgments
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (no. 2010-0023302).
References
[1] Hoggart C. J., Whittaker J. C., De Iorio M., Balding D. J. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. 2008;4(7):e1000130. doi:10.1371/journal.pgen.1000130
[2] Ayers K. L., Cordell H. J. SNP Selection in genome-wide and candidate gene studies via penalized logistic regression. 2010;34(8):879–891. doi:10.1002/gepi.20543
[3] Wu T. T., Chen Y. F., Hastie T., Sobel E., Lange K. Genome-wide association analysis by lasso penalized logistic regression. 2009;25(6):714–721. doi:10.1093/bioinformatics/btp041
[4] Tibshirani R. Regression shrinkage and selection via the lasso: a retrospective. 2011;73(3):273–282. doi:10.1111/j.1467-9868.2011.00771.x
[5] Friedman J., Hastie T., Hofling H., Tibshirani R. Pathwise coordinate optimization. 2007;1(2):302–332.
[6] Kooperberg C., LeBlanc M., Obenchain V. Risk prediction using genome-wide association studies. 2010;34(7):643–652. doi:10.1002/gepi.20509
[7] Zou H., Hastie T. Regularization and variable selection via the elastic net. 2005;67(2):301–320. doi:10.1111/j.1467-9868.2005.00503.x
[8] Wei Z., Wang K., Qu H.-Q., Zhang H., Bradfield J., Kim C., Frackleton E., Hou C., Glessner J. T., Chiavacci R., Stanley C., Monos D., Grant S. F. A., Polychronakos C., Hakonarson H. From disease association to risk assessment: an optimistic view from genome-wide association studies on type 1 diabetes. 2009;5(10):e1000678. doi:10.1371/journal.pgen.1000678
[9] Price A. L., Patterson N. J., Plenge R. M., Weinblatt M. E., Shadick N. A., Reich D. Principal components analysis corrects for stratification in genome-wide association studies. 2006;38(8):904–909. doi:10.1038/ng1847
[10] Abraham G., Kowalczyk A., Zobel J., Inouye M. Performance and robustness of penalized and unpenalized methods for genetic prediction of complex human disease. 2013;37(2):184–195. doi:10.1002/gepi.21698
[11] Freidlin B., Zheng G., Li Z., Gastwirth J. L. Trend tests for case-control studies of genetic markers: power, sample size and robustness. 2002;53(3):146–152.
[12] Zheng G., Freidlin B., Gastwirth J. L. Comparison of robust tests for genetic association using case-control studies. 2006;49:253–265.
[13] Sladek R., Rocheleau G., Rung J., Dina C., Shen L., Serre D., Boutin P., Vincent D., Belisle A., Hadjadj S., Balkau B., Heude B., Charpentier G., Hudson T. J., Montpetit A., Pshezhetsky A. V., Prentki M., Posner B. I., Balding D. J., Meyre D., Polychronakos C., Froguel P. A genome-wide association study identifies novel risk loci for type 2 diabetes. 2007;445(7130):881–885. doi:10.1038/nature05616
[14] Kim J., Sohn I., Son D., Kim D. H., Ahn T., Jung S. Prediction of a time-to-event trait using genome wide SNP data. 2013;14:58. doi:10.1186/1471-2105-14-58
[15] Becker N., Werft W., Toedt G., Lichter P., Benner A. PenalizedSVM: a R-package for feature selection SVM classification. 2009;25(13):1711–1712. doi:10.1093/bioinformatics/btp286
[16] Vapnik V. New York, NY, USA: Springer; 1996.
[17] Scholkopf B., Smola A. MIT Press; 2002.
[18] Wahba G., Lin Y., Zhang H. Gacv for support vector machines. In: Smola A. J., Bartlett P. L., Scholkopf B., Schuurmans D., editors. 2000:297–211.
[19] Hastie T., Tibshirani R., Friedman J. New York, NY, USA: Springer; 2001.
[20] Guyon I., Weston J., Barnhill S., Vapnik V. Gene selection for cancer classification using support vector machines. 2002;46(1–3):389–422. doi:10.1023/A:1012487302797
[21] Bradley P., Mangasarian O. Feature selection via concave minimization and support vector machines. Morgan Kaufmann (ICML '98); 1998.
[22] Zhu J., Rosset S., Hastie T., Tibshirani R. 1-norm support vector machines. MIT Press; 2003.
[23] Zou H. An improved 1-norm SVM for simultaneous classification and variable selection. Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics. 2007;2:675–681.
[24] Fan J., Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. 2001;96(456):1348–1360.
[25] Zhang H. H., Ahn J., Lin X., Park C. Gene selection using support vector machines with non-convex penalty. 2006;22(1):88–95. doi:10.1093/bioinformatics/bti736
[26] Wang L., Zhu J., Zou H. The doubly regularized support vector machine. 2006;16(2):589–615.
[27] Sasieni P. D. From genotypes to genes: doubling the sample size. 1997;53(4):1253–1261. doi:10.2307/2533494
[28] Zheng G., Freidlin B., Li Z., Gastwirth J. L. Choice of scores in trend tests for case-control studies of candidate-gene associations. 2003;45(3):335–348. doi:10.1002/bimj.200390016
[29] Gastwirth J. L. The use of maximin efficiency robust tests in combining contingency tables and survival analysis. 1985;80(390):380–384. doi:10.1080/01621459.1985.10478127
[30] Conneely K. N., Boehnke M. So many correlated tests, so little time! Rapid adjustment of P values for multiple correlated tests. 2007;81(6):1158–1168. doi:10.1086/522036
[31] Li Q., Yu K., Li Z., Zheng G. MAX-rank: a simple and robust genome-wide scan for case-control association studies. 2008;123(6):617–623. doi:10.1007/s00439-008-0514-8
[32] Li Q., Zheng G., Li Z., Yu K. Efficient approximation of P-value of the maximum of correlated tests, with applications to genome-wide association studies. 2008;72(3):397–406. doi:10.1111/j.1469-1809.2008.00437.x
[33] Kim D. Genome-wide genotype-based prognostic stratification of treatment outcomes following Imatinib therapy in chronic myeloid leukemia in chronic phase. In submission, 2013.
[34] Kim D. H., Kong J. H., Byeun J. Y., Jung C. W., Xu W., Liu X., Kamel-Reid S., Kim Y.-K., Kim H.-J., Lipton J. H. The IFNG (IFN-γ) genotype predicts cytogenetic and molecular response to imatinib therapy in chronic myeloid leukemia. 2010;16(21):5339–5350. doi:10.1158/1078-0432.CCR-10-1638
[35] Ebner R., Chen R.-H., Lawler S., Zioncheck T., Derynck R. Determination of type I receptor specificity by the type II receptors for TGF-β or activin. 1993;262(5135):900–902.
[36] Oh S. P., Seki T., Goss K. A., Imamura T., Yi Y., Donahoe P. K., Li L., Miyazono K., Ten Dijke P., Kim S., Li E. Activin receptor-like kinase 1 modulates transforming growth factor-β1 signaling in the regulation of angiogenesis. 2000;97(6):2626–2631. doi:10.1073/pnas.97.6.2626
[37] Razani B., Zhang X. L., Bitzer M. Caveolin-1 regulates transforming growth factor (TGF)-β/SMAD signaling through an interaction with the TGF-β type I receptor. 2001;276(9):6727–6738. doi:10.1074/jbc.M008340200
[38] Kawabata M., Chytil A., Moses H. L. Cloning of a novel type II serine/threonine kinase receptor through interaction with the type I transforming growth factor-β receptor. 1995;270(10):5625–5630. doi:10.1074/jbc.270.10.5625
[39] Kim S.-J., Lettirio J. Transforming growth factor-β signaling in normal and malignant hematopoiesis. 2003;17(9):1731–1737. doi:10.1038/sj.leu.2403069
[40] Fortunel N. O., Hatzfeld J. A., Monier M.-N., Hatzfeld A. Control of hematopoietic stem/progenitor cell fate by transforming growth factor-β. 2002;13(6–10):445–453.
[41] Bierie B., Moses H. L. Tumour microenvironment: TGFβ: the molecular Jekyll and Hyde of cancer. 2006;6(7):506–520. doi:10.1038/nrc1926
[42] Naka K., Hoshii T., Muraguchi T., Tadokoro Y., Ooshio T., Kondo Y., Nakao S., Motoyama N., Hirao A. TGF-β-FOXO signalling maintains leukaemia-initiating cells in chronic myeloid leukaemia. 2010;463(7281):676–680. doi:10.1038/nature08734
[43] Miyazono K. Tumour promoting functions of TGF-β in CML-initiating cells. 2012;152(5):383–385. doi:10.1093/jb/mvs106