High Dimensional Variable Selection with Error Control

Background. The iterative sure independence screening (ISIS) is a popular method in selecting important variables while maintaining most of the informative variables relevant to the outcome in high throughput data. However, it not only is computationally intensive but also may cause high false discovery rate (FDR). We propose to use the FDR as a screening method to reduce the high dimension to a lower dimension as well as controlling the FDR with three popular variable selection methods: LASSO, SCAD, and MCP. Method. The three methods with the proposed screenings were applied to prostate cancer data with presence of metastasis as the outcome. Results. Simulations showed that the three variable selection methods with the proposed screenings controlled the predefined FDR and produced high area under the receiver operating characteristic curve (AUROC) scores. In applying these methods to the prostate cancer example, LASSO and MCP selected 12 and 8 genes and produced AUROC scores of 0.746 and 0.764, respectively. Conclusions. We demonstrated that the variable selection methods with the sequential use of FDR and ISIS not only controlled the predefined FDR in the final models but also had relatively high AUROC scores.


Introduction
Prognosis will continue to play a critical role in patient management and decision making in 21st century medicine. Advanced technologies for genomic profiling are now available and they include millions of sets of molecular data in these assays. A critical element of personalized medicine is utilizing and implementing validated diagnostic signatures (or classifiers) for diagnosing or treating cancer patients. These signatures are built and validated utilizing common statistical methods and machine learning tools. For example, the Decipher signature has been developed as a prognostic model to predict metastasis after radical prostatectomy in patients with prostate cancer [1]. The Decipher score is a 22-feature genomic classifier that has been used to predict metastasis and has been independently validated for prediction of prostate metastasis [2][3][4][5]. Another example is oncotypeDx that has been used to stratify randomization and guide treatment in women with breast cancer [6].
A vital step in model building is data reduction. It is assumed that there are several variables that are associated with the clinical outcome in the large dimensional data. The main purpose of the variable selection is to detect only those variables related to the response. Variable selection is composed of two steps: screening and model building. The screening step is to reduce the large number of variables into moderate size while maintaining most of the informative variables relevant to the clinical response. In contrast, in the model building step, investigators develop a single best model utilizing a proper evaluation criterion.
Penalized variable selection methods have played a key role in identifying important prognostic models in several areas in oncology [7][8][9]. Many articles focused on the development of methodologies related to "small N and large P" with the advent of high throughput technology in cancer. The sure independence screening (SIS) was introduced to reduce the high dimension to below the sample size to efficiently select the best subset of variables to predict clinical responses [10]. Although this approach is popular, it does not perform well under some situations. First, unimportant variables that are heavily correlated with important variables are more highly likely to be selected than important variables that are weakly associated with the response. Second, important variables that are not marginally significantly related to the response are screened out. Finally, there may be collinearity between variables that may impact the calculations of the individual predictors.

BioMed Research International
The iterative sure independence screening (ISIS) was proposed to overcome the above issues. The procedure is to apply iteratively high dimensional variable screening followed by the proper scale of variable selection until the best subset of variables with high predictive accuracy is obtained. ISIS screening, however, is also computationally intensive and leads to high false discovery rate (FDR) in ultra-high dimensional setting ( ≫ 1 mils).
The oncology literature is rich in articles related to the use of validated signatures. Despite their abundance, comparisons and the performance of these various methods have not been studied. We propose to use the false discovery rate (FDR) of the multiple testing correction methods as a screening method to reduce the high dimension to lower dimension as well as controlling the false discovery rate in the final model. We investigate the feasibility of the sequential use of FDR screening method with the ISIS and utilize three popular variable selection methods: LASSO [11], SCAD [12,13], and MCP [14], through the extensive simulation studies. To the best of our knowledge, this is the first paper that thoroughly analyzes and compares the performance of the variable selection methods with the sequential use of FDR and ISIS screening methods. We use a prostate cancer signature as an example [1] where the number of probes is around 1.4 million and the clinical outcome is binary in nature: presence of metastasis (presence of metastasis = 1, no metastasis = 0) by fitting models based on the simulation results.
In addition, we provide a broad review of the existing penalized variable selection methods with screening methods. The remainder of this paper is organized as follows. In Section 2, we provide general details of the screening methods of FDR [15] and ISIS [10] and the variable selection methods with the penalized logistic regression. In Section 3, we describe the simulation studies and in Section 4, we summarize the results of the simulations. We then apply the best screening methods from the simulation studies to the real data in Section 5. Finally in Section 6, we discuss our findings.

Methods
We divide this section into several subsections describing the methods used in our paper. The screening section briefly discusses commonly used methods that reduce high dimensionality: false discovery rate (FDR) and iterative sure independence screening (ISIS). We then describe the methods needed to assess variable selection models. The final section considers three existing popular variable selection methods with the logistic regression. All simulations and calculations were carried out using glmnet and ISIS packages in the R library, and the code is available at https://www.duke.edu/halab001/FDR.

Benjamini and Hochberg False Discovery Rate (FDR).
The false discovery rate is defined as the expected proportion of incorrectly rejected null hypotheses. That is, where is the number of falsely rejected hypotheses and is the total number of rejected hypotheses. We focus on the Benjamini and Hochberg FDR [15] method as a screening method in the simulation studies and application. Briefly, the procedure works as follows. Let denote the FDR, where ∈ (0, 1).

Iterative Sure Independence Screening (ISIS).
The ISIS method was proposed to overcome the difficulties caused by the sure independence screening [16]. Briefly, the algorithm works in the following way: (1) The likelihood of marginal logistic regression (LMLR) is computed for every ∈ = {1, 2, . . . , }. Then which is /4 log( ) of the top ranked variables of the descending order list of the LMLR is selected to obtain the index set̂1.
(2) Apply those variables in̂1 to the penalized logistic models to obtain a subset of indiceŝ1.
(3) For every variable ∈ { −̂1}, the likelihood of the marginal logistic regression condition on the variables in̂1 is solved. Then the likelihood estimators are sorted in descending order and then the top ranked variables are selected to get the index set̂2.
(4) Apply those variables in̂2 ∪̂1 to the penalized logistic models to obtain a new index set̂2.

Regularizing Methods with Penalized Logistic Regression.
The logistic regression is one of the most commonly used methods for assessing the relationship between a binary outcome and a set of covariates and building prognostic models of clinical outcomes. In addition, it is widely used in the classification of two classes such as the development of metastasis in prostate cancer [1]. The purpose of variable selection with the logistic regression model in high dimensional setting is to select the optimal subset of variables that will improve the prediction accuracy [17]. Variable selection in high dimensional setting is composed of two components: a likelihood function and a penalty function in order to obtain better estimates for prediction. Let the covariates of th individual be denoted as = ( 1 , . . . , ) for = 1, . . . , and is the total number of covariates. The penalized logistic regression is as follows: BioMed Research International   3 where ( ), a penalty, is function and is 1 for cases and 0 for controls. The probability that th individual is a case based on covariates' information is expressed as The regression coefficients are obtained by minimizing the objective function (2). One of the most popular penalty functions is the least absolute shrinkage and selection operator (LASSO) [11]. It forces the coefficients of unimportant variables to be set to 0 and then the LASSO has sparsity property. The LASSO estimates are obtained by minimizing the above penalized logistic regression form (2). It has a satisfactory performance in identifying a small number of representative variables. Though LASSO is widely used in most applications [18][19][20][21], its robustness is open to question as it has the tendency to randomly select one of the variables with high correlation and exclude the rest of the predictors [22]. Another disadvantage of LASSO is that it always chooses at most (sample size) number of predictors even though there are more than variables with true nonzero coefficients [23]. The coefficients estimates are obtained by minimizing the following objective function based on the likelihood function of logistic regression:̂l Another method commonly employed is the smoothly clipped absolute deviation (SCAD) with a concave penalty function that overcomes some of the limitations of the LASSO [12]. The coefficients from SCAD are solved by minimizing the following objective function: The SCAD penalty function, , ( ), is defined by with ≥ 0 and > 2.
The minimum concave penalty (MCP) is also a recognized method with SCAD, where the coefficients are estimated via minimization of the following objective function: The MCP penalty function, , ( ), is defined by for ≥ 0 and > 1.

Simulation Setup.
We performed extensive simulation studies to explore the performance of three popular variable selection methods: LASSO, SCAD, and MCP in high dimensional setting. We employed 10-fold cross validation to tune the regularization parameter for the methods. Figure 1 describes the schema of the simulation procedures. Based on the logistic regression model, we generated the binary outcome and covariates for each simulation as follows. First, we generated 1 , 2 , . . . , independently from (0, 1), and each of is an × 1 vector. We defined 1 = 1 , . . , , so that correlation of and was | − | for some ∈ [0, 1). That is, the covariates were generated with serialized correlation structure (AR (1)). Next, we specified the true regression coefficients . We fixed all of 's except the first 25 's to be 0. The true nonzero 's were sampled independently from uniform distribution [−1.5, 2]. We considered 25 true effects of the regression coefficients since several classifiers including the Decipher score had selected 20-25 genes [1,2] and because that number predicted reasonably well the outcome. The number of variables was fixed at = 100,000, and the  sample size was set at = 900. Finally, the corresponding binary response was simulated based on the Bernoulli distribution with the following: where = ( ,1 , ,2 , . . . , , ) and = ( 1 , 2 , . . . , ) . Covariates were generated until the target number of 450 cases and 450 controls was reached.
We considered different simulation scenarios for the correlation matrix Σ × , = {0, 0.1, 0.4} among variables. Each simulation scenario was composed of the nine different models with the combination of the FDR, the ISIS, and the random filtering (RF 1000 ). The RF 1000 selected 1,000 variables with the smallest unadjusted values obtained from the marginal logistic regression with the three variable selection methods (LASSO, SCAD, and MCP). The reason we used RF 1000 from the 100,000 potential variables was that the number of false discovery rates is low relative to the other random filtering (such as 2,000 or higher). Therefore, we considered the top 1,000 variables to be a reasonable number of variables screened as reference to be compared with our proposed methods.
We then simulated the data 500 times because of computational intensity. In each simulation, we randomly divided the data into two parts: the training set ( = 600) for model selection and the testing set ( = 300) for validation.

Metrics of Performance.
We calculated the true positive rate (TP), the false positive rate (FP), the false discovery rate (FDR), the average number of false positives in the final model, the average model size, the average of the area under receiver operating characteristic (AUROC), and the number of screened true important variables through the FDR and the RF 1000 to assess the impact of the FDR-ISIS screening method with the three variable selection methods.
The true positive rate (TP), also called sensitivity, is the proportion of positives that are identified correctly given true positives: where TP is the number of the true positives and FN is the number of false negatives. The false positive rate is the BioMed Research International 5 proportion of incorrect identification as a true positive given true negatives. That is, where the FP is the number of false positives and the TN is the number of true negatives. In addition, the average number of false positives (ANFP) was computed as the number of false positives that were selected in the final model out of 500 simulations. Furthermore, the average model size was computed as the number of variables selected in the final model out of 500 simulations. Finally, the AUROC was utilized as a measure of the performance of the logistic regression and is the proportion of the time which a model predicts correctly given observations of a random positive and negative. A perfect model produces an AUROC = 1 whereas a random model has an AUROC = 0.5.

Simulation Results
We summarized the simulation results for = 0.1, one of three correlation structures in Table 1, where all 25 important covariates were assumed to have linear effects. Table 1 presents the performance of the nine different models with the FDR, ISIS, and random filtering based on 500 simulations. The average true positive rates (TP) were 0.223 and 0.268 for the three variable selection methods using FDR .05 − ISIS and FDR .20 − ISIS. The average true positive rates (TP) of the LASSO, SCAD, and MCP with RF 1000 − ISIS were 0.46013, 0.46365, and 0.46739, respectively. These values were much higher than the two FDR screening methods which were below 0.30. On the other hand, the three variable selection methods with RF 1000 − ISIS selected several of the false positive variables that consequently increased the false positive rate (FP). LASSO, SCAD, and MCP with RF 1000 − ISIS included a higher average number of the false positives of 12.364, 12.260, and 12.170, respectively. Although the FDR filtering method did not select a higher number of true important variables, this screening method reduced the false positive rates below the predefined target .
The average numbers of the false positives in the final models with the FDR − ISIS methods were much smaller than that of using RF 1000 − ISIS (Table 1). Specifically, the average numbers of the false positives in the final models with the LASSO, SCAD, and MCP with FDR .05 − ISIS were 0.216, 0.214, and 0.214 with the corresponding standard deviations 0.0219, 0.0215, and 0.0215. As expected, the three variable selection methods with RF 1000 − ISIS had selected a higher average model size of 22.8 than the FDR methods. Similar results were observed for FDR .20 − ISIS. We also calculated the false discovery rate. The variable selection models with the FDR at the target = 0.05 and = 0.20 controlled the false discovery rate below whereas over 40% of the finally selected variables were incorrectly selected using the random filtering methods. The average AUROC scores with RF 1000 − ISIS were relatively higher than the FDR .05 − ISIS and FDR .20 − ISIS. Similar results were noted for independent and moderate correlation = {0, 0.4} as presented in Tables S1 and S2 in Supplementary Material available online at http://dx.doi.org/10.1155/2016/8209453. have any counts and thus were not selected in the simulation. The variables with the highest selection frequencies had true regression coefficients that were strongly associated with the clinical response. These variables were g07, g03, g19, g23, g01, g24, g12, and g06 and were selected over 100 times out of 500 simulations with average corresponding regression coefficients of 1.65, −1.45, 1.57, −1.17, 1.12, 0.968, 0.963, and −1.01 (see Table S3 in Supplementary File). The coefficients of the eight variables were ranked the highest among the 25 absolute values of the true regression coefficients which had To gain more insights into the comparisons of the methods, we present the plots of the AUROC scores and the corresponding false discovery rate under = 0.1 in Figure 3. (a), (c), and (e) in Figure 3 represent the AUROC scores whereas (b), (d), and (f) represent the false discovery rates using three different screening methods. The variable selection methods with random filtering screening had relatively higher AUROC scores compared to the FDR methods. However, there were a number of false positive in the final models as seen in Figure 3(f). It is noteworthy that the variable selection methods using the FDR not only controlled the FDR below the target = 0.05 and = 0.20 but also had AUROC scores that were relatively high (Figures 3(d)  and 3(e)). Similar patterns were observed for independent and moderate correlation = {0, 0.4} ( Figures S3 and S4 in Supplementary File).
Therefore, the FDR − ISIS screening method is preferred to RF 1000 −ISIS since it allowed the variable selection methods to obtain the proper AUROC scores while controlling the false discovery rate at the nominal level of . As a result of the simulation studies, we applied the three variable selection methods with the FDR and ISIS screening to the high dimensional data of the prostate cancer in the following section.

Real Data Analysis
We analyzed the prostate cancer data from the public domain (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc= GSE46691: GSE46691). The dataset has 1.4 million probes and the primary outcome is presence of metastasis (yes or no) by fitting the LASSO, SCAD, and MCP methods using the FDR and ISIS screenings suggested from the simulation studies with the sequential filtering of both FDR and ISIS. In the prostate cancer application, we considered the false discovery rate (FDR) at = 0.01 as the screening method. Figure 4 describes the schema of the prognostic model building for the prostate cancer. We utilized the training set that was obtained from the random split and was composed of 359 individuals (140 cases and 219 controls) with 1.4 million probes to build each of the three models. We then estimated the AUROC scores with the validation set with 186 individuals (72 cases and 114 controls). We used 10-fold cross validation for each of the variable selection models to tune the parameters after the screening. We obtained 39 variables with FDR at = 0.01. We repeated each of the three models 100 times to improve the AUROC with those screened variables. Figure 5 shows the AUROC plots of the three models.  Table S4 for more details in Supplementary File). It is noteworthy to note that MCP selected the same set of genes as SCAD and LASSO and the 95% confidence intervals were overlapping. On the other hand, using the FDR at = 0.05, LASSO, SCAD, and MCP selected 15, 13, and 15 gene models out of 565 potential genes with corresponding AUROC scores of 0.697 (95% CI = 0.619-0.775), 0.714 (95% CI = 0.637-0.791), and 0.683 (95% CI = 0.603-0.763), respectively. It is worthwhile to note that MCP had the highest AUROC score (FDR-ISIS at = 0.01 and AUROC = 0.764) followed by LASSO (FDR-ISIS at = 0.01 and AUROC = 0.746) although the results were not consistent with the FDR at = 0.05. This could be due to the larger number of potential variables (565 variables) when using FDR at a higher level. Nevertheless because our interest was to use FDR at = 0.01, MCP and LASSO methods were used for the variable selection in our real example. Table 2 presents the selected probes and their corresponding genes from the two models that had two highest AUROC scores among the six models. LASSO and MCP identified each of the 12 and 8 genes that were associated with developing prostate cancer metastasis. The four genes (ANO7, UBE2C, MYBPC1, and CAM2KN1) associated with developing prostate cancer metastasis were detected in both models. These four genes were a subset of the 22 biomarkers for the Decipher PCa classifier [1]. MYBPC1 (Myosin-Binding Protein C) on chromosome 12 and ANO7 (Anoctamin 7) on chromosome 2 were only   Figure S5.

Discussion
This paper explored the feasibility of using the false discovery rate (FDR) followed by ISIS as screening methods in conjunction with three popular variable selection methods in ultrahigh dimensional data for the purpose of controlling FDR and improving AUROC scores.
Our simulation studies demonstrated that the variable selection methods with FDR − ISIS not only controlled the false discovery rate below the target , but also produced high AUROC scores. Furthermore, the results showed that the false discovery rate was controlled conservatively even with the increased correlation structures. As demonstrated in the simulation studies, if the truly prominent variables have not passed through the screening, they would lose the opportunity to be selected to the final model. Thus, the prediction accuracy may be relatively reduced. Currently, most multiple testing correction methods underscore the priority of identifying prominent variables. Therefore, effective filtering techniques are ultimately needed for the situation when there are weak effects among the important variables.
Although RF 1000 − ISIS produced the highest AUROC through the simulation studies, it also had the highest false discovery rate. It is reasonable to expect that if one variable is not selected during the screening step, then the other variables that were correlated with the unselected variable have a tendency not to be chosen in the final model. As expected, the true positive rates of RF 1000 − ISIS were relatively higher than those of using FDR − ISIS. This is because random filtering had more opportunity to select the true important variables. Due to the relatively high number of true important variables selected, the number of unimportant variables highly correlated with the true important variables was also high. This may explain why the average AUROC scores were highest in our simulation studies for RF 1000 − ISIS.
There are some caveats to the sequential use of FDR with ISIS. First, the total number of true important variables was restricted to 25 variables in the simulation studies. This may explain why the three variable selection methods, LASSO, SCAD, and MCP, performed similarly in the simulations. Second, the computational time for 500 simulations with ISIS alone was 52,500 minutes (around 36.45 days). Although ISIS is computationally intensive, FDR with ISIS performed very well and it took 9,625 minutes to perform the 500 simulations (7 days).
Turning back to our motivation example of prostate cancer, LASSO and MCP under the FDR 0.01 − ISIS screening methods produced the best AUROC scores. We also present the results for other LASSO and MCP methods selected 12 and 8 genes out of 39 screened probes using the FDR at the target = 0.01 and had the AUROC scores of 0.7462 and 0.7644, respectively. The AUROC score of the MCP was 0.144 points higher than what was reported by Erho et al. ( [1]; AUC = 0.75). Although the authors did not report a 95% confidence interval for the AUROC scores, it is most likely that the 95% confidence intervals for the AUROC scores of Erho et al. and the MCP were overlapping.
In summary, based on our extensive simulations, FDR with ISIS seems to be superior to random filtering in terms of error control and is less computationally intensive compared with ISIS only. We also showed that the classifier based on 8 genes detected by the MCP had similar performance to the prognosis for early clinical metastasis prostate cancer model. To our knowledge, this is the first paper that systematically compared the performance of high dimensional methods with screening methods. Based on the extensive simulation studies, effective screening procedures with penalized logistic regression methods would not only lead to controlling the FDR but also produce high area under receiver operating characteristic curve.