Variable Selection in ROC Regression

Regression models are introduced into receiver operating characteristic (ROC) analysis to accommodate the effects of covariates, such as genes. When many covariates are available, variable selection becomes an issue. The traditional induced methodology models the outcomes of the diseased and nondiseased groups separately; applying variable selection to the two models separately therefore hinders interpretation, because the selected models may differ. Furthermore, in ROC regression, the focus should be the accuracy of the estimated area under the curve (AUC) rather than model selection consistency or good prediction performance. In this paper, we obtain a single objective function with the group SCAD penalty to select grouped variables, which accommodates popular model selection criteria, and propose a two-stage framework to apply the focused information criterion (FIC). Some asymptotic properties of the proposed methods are derived. Simulation studies show that grouped variable selection is superior to separate model selection. Furthermore, the FIC improves the accuracy of the estimated AUC compared with other criteria.


Introduction
In modern medical diagnosis or genetic studies, the receiver operating characteristic (ROC) curve is a popular tool to evaluate the discrimination performance of a certain biomarker on a disease status or a phenotype. For example, in a continuous-scale test, the diagnosis of a disease depends on whether a test result is above or below a specified cutoff value. Also, genome-wide association studies in human populations aim at creating genomic profiles which combine the effects of many associated genetic variants to predict the disease risk of a new subject with high discriminative accuracy [1]. For a given cutoff value of a biomarker or a combination of biomarkers, the sensitivity and the specificity are employed to quantitatively evaluate the discriminative performance. By varying the cutoff value throughout the entire real line, the resulting plot of sensitivity against 1 - specificity is a ROC curve. The area under the ROC curve (AUC) is an important one-number summary index of the overall discriminative accuracy of a ROC curve, taking the influence of all cutoff values into account. Let $Y_D$ be the response of a diseased subject, and let $Y_{\bar{D}}$ be the response of a nondiseased subject; then, the AUC can be expressed as $P(Y_D > Y_{\bar{D}})$ [2]. Pepe [3] and Zhou et al. [4] provided broad reviews of many statistical methods for the evaluation of diagnostic tests.
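The probabilistic form of the AUC can be illustrated with a minimal sketch. The function name `empirical_auc` and the toy data are hypothetical; the estimator is the usual proportion of concordant diseased/nondiseased pairs, with ties counted one half, which is an assumption of this sketch rather than something stated in the text.

```python
import numpy as np

def empirical_auc(y_diseased, y_nondiseased):
    """Estimate AUC = P(Y_D > Y_Dbar) by the proportion of concordant pairs.

    Tied pairs count one half (Mann-Whitney convention, assumed here).
    """
    yd = np.asarray(y_diseased, dtype=float)[:, None]     # column vector
    yn = np.asarray(y_nondiseased, dtype=float)[None, :]  # row vector
    greater = (yd > yn).mean()   # fraction of pairs with Y_D > Y_Dbar
    ties = (yd == yn).mean()     # fraction of tied pairs
    return float(greater + 0.5 * ties)

# Toy example: 3 diseased and 3 nondiseased responses.
auc_example = empirical_auc([2.0, 3.0, 4.0], [1.0, 2.0, 3.5])
```

Comparing all pairs is quadratic in the sample sizes, which is acceptable for a sketch but may be replaced by a rank-based computation for large samples.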
Traditional ROC analyses do not consider the effect of characteristics of study subjects or operating conditions of the test, so test results may be affected in the way of influencing distributions of test measurements for diseased and/or nondiseased subjects. Additionally, although the number of genes is large, there may be only a small number of them associated with the disease risk or phenotype. Therefore, regression models are introduced into the ROC analysis. Chapter Six in Pepe [3] offered a wonderful introduction to the adjustment for covariates in ROC curves. As reviewed in Rodríguez-Álvarez et al. [5], there are two main methodologies of regression analyses in ROC: (1) "induced" methodology, which firstly models outcomes of diseased and nondiseased subjects separately and then uses these outcomes to induce ROC and AUC and (2) "direct" methodology, which directly models the AUC on all covariates. In this paper, we focus on the induced methodology, to which current model selection techniques may be extended.
If there are many covariates, the variable selection issue arises in terms of the consideration of model interpretation and estimability. There are two main groups of variable selection procedures. One is the best-subset selection associated with criteria such as cross-validation (CV, [6]), generalized cross-validation (GCV, [7]), AIC [8], and BIC [9]. The other is based on regularization methods such as LASSO [10], 2 Computational and Mathematical Methods in Medicine SCAD [11], and adaptive LASSO [12], with tuning parameters selected by the same criteria such as CV and BIC. Procedures in the second group have recently become popular because they are stable [13] and applicable for high-dimensional data [14].
So far, little attention has been paid to variable selection in ROC regression. Two possible reasons may account for this situation. Firstly, if we model outcomes of diseased and nondiseased subjects separately, the selected submodels may differ. This difference makes interpretation difficult, because it is natural to expect that the same set of variables contributes to discriminating diseased from nondiseased subjects. Secondly, most current criteria for variable selection focus on prediction performance or variable selection consistency. However, in ROC regression, our focus is the precision of the estimated AUC rather than prediction or model selection, which means that most popular criteria may not be appropriate. Claeskens and Hjort [15] argued that these "one-fit-all" model selection criteria aim at selecting a single model with good overall properties. Alternatively, they developed the focused information criterion (FIC), which focuses on a parameter singled out for interest. The insight behind this criterion is that a model that gives good precision for one estimand may be worse when used in inference for another estimand. Wang and Fang [16] successfully applied the FIC to variable selection in linear models and demonstrated that the FIC indeed improved the estimation performance for singled-out parameters. This "individualized" criterion fits ROC regression well.
The remaining parts of this paper are organized as follows. In Section 2, we rewrite the ROC regression into a grouped variable selection form so that current criteria can be applied. Then, a general two-stage framework with a BIC selector for the group SCAD under the local model assumption is proposed in Section 3. Simulation studies and a real data analysis are given in Sections 4 and 5. A brief discussion is provided in Section 6. All proofs are presented in the Supplement; see Supplementary Materials available online at http://dx.doi.org/10.1155/2013/436493.

ROC Regression
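In this section, we rewrite the penalized ROC regression with induced methodology as a grouped variable selection problem with SCAD. Initially, we require that all covariates be centered at 0 for comparability. Also, for notational simplicity, response variables are centered. If not, we can center the responses to carry out the model selection and then add the centers back to evaluate the AUC. Following the notation of the local model, which generalizes the commonly used sparsity assumption, homoscedastic regression models for diseased and nondiseased subjects are assumed as follows:
$$Y_D = x^T \beta_{D0} + z^T \gamma_D + \sigma_D \varepsilon_D, \qquad Y_{\bar{D}} = x^T \beta_{\bar{D}0} + z^T \gamma_{\bar{D}} + \sigma_{\bar{D}} \varepsilon_{\bar{D}}, \tag{1}$$
where $x$ includes variables added always, $z$ includes variables which may or may not be added, $w = (x^T, z^T)^T$, $\beta_{D0}$ and $\beta_{\bar{D}0}$ are $p$-dimensional vectors, $\gamma_D = \delta_D/\sqrt{n_D}$ and $\gamma_{\bar{D}} = \delta_{\bar{D}}/\sqrt{n_{\bar{D}}}$ are $q$-dimensional vectors with $n_D$ and $n_{\bar{D}}$ as sample sizes for diseased and nondiseased groups, respectively, $\theta_D = (\beta_{D0}^T, \gamma_D^T)^T = (\theta_{D1}, \ldots, \theta_{Dd})^T$ and $\theta_{\bar{D}} = (\theta_{\bar{D}1}, \ldots, \theta_{\bar{D}d})^T$ are $d \triangleq p + q$ dimensional vectors, and $\varepsilon_D$ and $\varepsilon_{\bar{D}}$ independently follow $N(0, 1)$. In particular, if $\delta_D = \delta_{\bar{D}} = 0$, a sparse model is given. Then, the AUC given $w$ can be written as
$$\mathrm{AUC}(w) = P(Y_D > Y_{\bar{D}} \mid w) = \Phi\left(\frac{w^T(\theta_D - \theta_{\bar{D}})}{\sqrt{\sigma_D^2 + \sigma_{\bar{D}}^2}}\right), \tag{2}$$
where $\Phi(\cdot)$ is the cumulative distribution function of a standard normal distribution. Clearly, the narrow model is $S_0 = \{1, \ldots, p\}$. Instead of selecting separate models, we consider the following single objective function with a group penalty, given a tuning parameter $\lambda$:
$$\frac{1}{2}\sum_{i=1}^{n_D}\left(Y_{D,i} - w_{D,i}^T\theta_D\right)^2 + \frac{1}{2}\sum_{i=1}^{n_{\bar{D}}}\left(Y_{\bar{D},i} - w_{\bar{D},i}^T\theta_{\bar{D}}\right)^2 + n\sum_{j=1}^{d} p_\lambda(\|\theta_j\|), \tag{3}$$
where $n = n_D + n_{\bar{D}}$, $\theta_j = (\theta_{Dj}, \theta_{\bar{D}j})^T$ is a 2-dimensional vector, with $\theta_{Dj}$ the $j$th component of $\theta_D$ and $\theta_{\bar{D}j}$ the $j$th component of $\theta_{\bar{D}}$, and $p'_\lambda(t)/\lambda = I(t \le \lambda) + \max(0, a\lambda - t)\, I(t > \lambda)/((a - 1)\lambda)$ with $a = 3.7$. More generally, instead of the $L_2$ norm for $\theta_j$, we can define $\|\theta_j\|_K = \sqrt{\theta_j^T K \theta_j}$ with a positive definite $2 \times 2$ matrix $K$. Then, given $\lambda$, the minimizer of (3) can be obtained as an estimate of $(\theta_D, \theta_{\bar{D}})$. The motivation for penalizing $\theta_j$ jointly rather than separately is that the inclusion or exclusion of the effect of a certain variable should be simultaneous for both diseased and nondiseased groups. It may not be appropriate to include only one of $\theta_{Dj}$ and $\theta_{\bar{D}j}$ in the model, which would complicate interpretation of the resulting model. 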
This is exactly the motivation of the group LASSO method by Yuan and Lin [17] to handle categorical variables, and the group SCAD by Wang et al. [18] to address spline bases.
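The SCAD penalty described above can be sketched numerically. The closed-form penalty below is the standard antiderivative of the SCAD derivative of Fan and Li [11] with a = 3.7; the function names and the pairing of diseased and nondiseased coefficients into 2-dimensional groups under the Euclidean norm are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

A_SCAD = 3.7  # value of the constant a used in the text

def scad_derivative(t, lam, a=A_SCAD):
    """SCAD derivative: lam * {I(t <= lam) + max(0, a*lam - t) I(t > lam) / ((a-1) lam)}."""
    t = np.abs(np.asarray(t, dtype=float))
    inner = (t <= lam).astype(float)
    outer = np.maximum(0.0, a * lam - t) * (t > lam) / ((a - 1.0) * lam)
    return lam * (inner + outer)

def scad_penalty(t, lam, a=A_SCAD):
    """Closed-form SCAD penalty: linear, then quadratic, then constant."""
    t = np.abs(np.asarray(t, dtype=float))
    linear = lam * t
    quadratic = -(t ** 2 - 2.0 * a * lam * t + lam ** 2) / (2.0 * (a - 1.0))
    constant = (a + 1.0) * lam ** 2 / 2.0
    return np.where(t <= lam, linear, np.where(t <= a * lam, quadratic, constant))

def group_scad_penalty(theta_d, theta_dbar, lam, a=A_SCAD):
    """Sum of SCAD penalties on the Euclidean norms of the pairs (theta_Dj, theta_Dbar_j)."""
    norms = np.sqrt(np.asarray(theta_d, float) ** 2 + np.asarray(theta_dbar, float) ** 2)
    return float(np.sum(scad_penalty(norms, lam, a)))
```

Because the penalty acts on the norm of each pair, a pair is shrunk to zero or kept as a whole, which is the simultaneous inclusion or exclusion that motivates the group penalty.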
Note that there are two separate sums of squared residuals in (3). In order to comply with the framework of selecting grouped variables, a modified version of the objective function (3) is
$$\frac{1}{2}\Big\|Y - \sum_{j=1}^{d} X_j\theta_j\Big\|^2 + n\sum_{j=1}^{d} p_\lambda(\|\theta_j\|), \tag{4}$$
where $Y$ is an $n \triangleq n_D + n_{\bar{D}}$ dimensional vector with components $Y_i$, $i = 1, \ldots, n$, and $X$ is an $n \times 2d$ dimensional matrix. Clearly, there are $d$ grouped variables, and $X$ can be split into $d$ submatrices $X = (X_1, \ldots, X_d)$, each of which includes two consecutive columns of $X$ in turn. Similarly, $\theta = (\theta_1^T, \ldots, \theta_d^T)^T$ with $\theta_j = (\theta_{Dj}, \theta_{\bar{D}j})^T$, $j = 1, \ldots, d$. Additionally, due to the different variances of healthy and diseased subjects, weighted least squares should be applied. Let $W$ be a diagonal matrix, with each diagonal entry
$$W_{ii} = \begin{cases} 1/\sigma_D^2, & \text{if subject } i \text{ is diseased}, \\ 1/\sigma_{\bar{D}}^2, & \text{if subject } i \text{ is nondiseased}. \end{cases} \tag{5}$$
Then, the objective function (3) is written as
$$\frac{1}{2}\Big(Y - \sum_{j=1}^{d} X_j\theta_j\Big)^T W \Big(Y - \sum_{j=1}^{d} X_j\theta_j\Big) + n\sum_{j=1}^{d} p_\lambda(\|\theta_j\|). \tag{6}$$
Furthermore, in order to facilitate computation with current R packages, we define transformed observations by weighting. Simply put $\tilde{Y} = W^{1/2}Y$ and $\tilde{X} = W^{1/2}X$. Therefore,
$$\frac{1}{2}\Big\|\tilde{Y} - \sum_{j=1}^{d} \tilde{X}_j\theta_j\Big\|^2 + n\sum_{j=1}^{d} p_\lambda(\|\theta_j\|). \tag{7}$$
Finally, the penalized ROC regression (3) has been written as a group SCAD-type problem (7). Then, current model selection criteria, like CV, GCV, AIC, and BIC, can be applied to select a final model. For this specific ROC regression problem, where the AUC is the focus, these criteria may not be appropriate. Therefore, as argued by Claeskens and Hjort [15], the FIC can play a role here. Under the local model assumption, a novel procedure for applying the FIC to grouped variable selection is developed, motivated by Wang and Fang [16]. Briefly, the procedure consists of two steps. Firstly, a narrow model, containing the variables that are always added, is identified through the objective function (7). Secondly, the FIC is applied to select a subgroup of the remaining variables. As a consequence, the final model is the combination of the variables selected in the two steps. Details are provided in the following section. 
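The stacking and weighting step can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `stack_roc_design` is hypothetical, the two columns of each group are taken to be the diseased-group and nondiseased-group copies of one covariate, the weights are inverse variances, and the standard deviations are treated as known.

```python
import numpy as np

def stack_roc_design(w_d, y_d, w_dbar, y_dbar, sigma_d, sigma_dbar):
    """Stack both groups into one weighted design with two columns per covariate.

    Column 2j multiplies theta_Dj and column 2j+1 multiplies theta_Dbar_j
    (an assumed layout), so each covariate forms one group of size two.
    """
    w_d = np.asarray(w_d, dtype=float)
    w_dbar = np.asarray(w_dbar, dtype=float)
    n_d, d = w_d.shape
    n_dbar = w_dbar.shape[0]
    x = np.zeros((n_d + n_dbar, 2 * d))
    x[:n_d, 0::2] = w_d       # diseased rows only touch diseased coefficients
    x[n_d:, 1::2] = w_dbar    # nondiseased rows only touch nondiseased coefficients
    y = np.concatenate([np.asarray(y_d, float), np.asarray(y_dbar, float)])
    # Square root of the diagonal weight matrix (inverse standard deviations).
    root_w = np.concatenate([np.full(n_d, 1.0 / sigma_d),
                             np.full(n_dbar, 1.0 / sigma_dbar)])
    groups = np.repeat(np.arange(d), 2)   # group label of each column
    return x * root_w[:, None], y * root_w, groups

x_t, y_t, groups = stack_roc_design([[1, 2], [3, 4]], [1, 2], [[5, 6]], [3],
                                    sigma_d=2.0, sigma_dbar=1.0)
```

The returned group labels are in the form that group-penalized regression routines typically expect, one label per column.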
In terms of the FIC, the natural focus parameter is the AUC at a given $w_0$; that is, $\mu(\theta_D, \theta_{\bar{D}}) = \Phi\big(w_0^T(\theta_D - \theta_{\bar{D}})/\sqrt{\sigma_D^2 + \sigma_{\bar{D}}^2}\big)$. Later, in the simulation studies, separate variable selection for the diseased and nondiseased models is also carried out for comparison. We expect the group selection to be superior to the separate selection.
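As a rough illustration of the focus parameter, the covariate-specific AUC under the induced normal models can be evaluated directly. The sketch assumes centered responses (so no intercept difference appears) and known standard deviations; the function name `induced_auc` is hypothetical.

```python
import math
import numpy as np

def induced_auc(w0, theta_d, theta_dbar, sigma_d, sigma_dbar):
    """AUC at covariate value w0: Phi(w0'(theta_D - theta_Dbar) / sqrt(sigma_D^2 + sigma_Dbar^2))."""
    diff = float(np.dot(w0, np.asarray(theta_d, float) - np.asarray(theta_dbar, float)))
    z = diff / math.sqrt(sigma_d ** 2 + sigma_dbar ** 2)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

When the two coefficient vectors coincide, the sketch returns 0.5, the value for a test with no discriminative ability at that covariate value.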

A BIC Selector for Group SCAD under the Local Model Assumption
This section follows the notation used in the two fundamental papers on the FIC: Hjort and Claeskens [19] and Claeskens and Hjort [15]. Furthermore, we allow grouped variables, each of which stands for a factor, such as a series of dummy variables coded from a multilevel categorical variable. The starting assumption of the FIC is that some variables are always added to the regression model and the others may or may not be added; that is,
$$Y = X\beta + Z\gamma + \varepsilon, \tag{8}$$
where $X$ includes variables which are added always, $Z$ includes variables which may or may not be added, and $\varepsilon \sim N(0, \sigma^2)$. Without loss of generality, both $X$ and $Z$ are standardized to remove the intercept term. Furthermore, we assume that $Z$ actually consists of $q$ factors, that is, $Z = (Z_1, \ldots, Z_q)$. For simplicity, assume that the residual variance $\sigma^2$ is estimated based on the full model and is not considered as a parameter.
In the literature of the variable selection, in order to show the selection consistency of a variable selection procedure, usually, the true model is assumed to be sparse. Thus, the sparsity assumption plays a critical role in the current model selection literature. Many procedures have been shown to be selection consistent under this sparsity assumption [20]. For example, the SCAD with tuning parameter selected via BIC has been shown to be selection consistent by Wang et al. [21,22], and Zhang et al. [23].
However, it is questionable, or too strict, to assume that the true model is sparse. It is more reasonable and flexible to consider the local model (8) with $\theta_{\text{true}} = (\beta_0, \gamma)$ and $\gamma = \gamma_0 + \delta/\sqrt{n}$ as the true model, where $\gamma_0 = 0$ for the purpose of variable selection, under which the FIC is developed. This model is close to the sparse model but differs from it by $\gamma - \gamma_0 = \delta/\sqrt{n}$. The sparsity assumption, in the notation of this paper, is equivalent to assuming that $\delta = 0$ and $\theta_{\text{true}} = (\beta_0, \gamma_0)$. Therefore, the local model assumption used here is a natural extension of the sparsity assumption. All "consistency" results obtained in this paper still apply to sparse models with grouped variables.
The FIC centers on inference for a certain estimand or focus, denoted by $\mu_{\text{true}} = \mu(\theta_{\text{true}})$. It is well known that using a bigger model typically means smaller bias but bigger variance. Therefore, the FIC tries to balance the bias and the variance of estimating a certain estimand. To be specific, like any existing criterion, the FIC starts, among a possible range of models, with a narrow model that includes only the variables in $X$ and searches over submodels including some factors in $Z$. The whole process leads to a total of $2^q$ submodels, one for each subset of $\{1, \ldots, q\}$.
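In this framework, various estimators of the focus parameter range from $\hat{\mu}_{\text{full}} = \mu(\hat{\beta}_{\text{full}}, \hat{\gamma}_{\text{full}})$ to $\hat{\mu}_{\text{narr}} = \mu(\hat{\beta}_{\text{narr}}, \gamma_0)$. In general, the FIC attempts to select a subset $\hat{S}$ associated with the smallest mean squared error (MSE) of $\hat{\mu}_S = \mu(\hat{\beta}_S, \hat{\gamma}_S, \gamma_{0,S^c})$, where $S^c$ is the complement of $S$ and the subscript $S$ denotes the subvector of the corresponding vector indexed by $S$.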
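To make the size of the search space concrete, the following sketch enumerates every candidate subset of the optional factors, from the narrow model (empty subset) to the full model. The generator name `all_submodels` is hypothetical.

```python
from itertools import combinations

def all_submodels(q):
    """Yield every subset of {1, ..., q} as a tuple, ordered by subset size."""
    indices = range(1, q + 1)
    for size in range(q + 1):
        for subset in combinations(indices, size):
            yield subset

# With q = 3 optional factors there are 2**3 = 8 candidate submodels.
submodels = list(all_submodels(3))
```

The count doubles with each additional factor, which is why an exhaustive search quickly becomes impractical as q grows.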

Stage 1: Consistent Selection of the Narrow Model.
Assuming the true model (8) with $\theta_{\text{true}} = (\beta_0, \gamma)$ and $\gamma = \delta/\sqrt{n}$, as well as grouped variables, the first important question is whether we can select the narrow model $S_0 = \{1, \ldots, p\}$ consistently. A similar question has been addressed by Wang and Fang [16], who considered nongrouped variables. In the following, we show that the group SCAD with a tuning parameter selected via BIC can consistently select the narrow model.
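Wang et al. [18] extended the SCAD, proposed by Fan and Li [11], to grouped variables and established its oracle property, following an elegant idea of the group LASSO [17]. The group SCAD generates an estimate via the following penalized least squares:
$$\hat{\theta}_\lambda = \arg\min_{\theta}\Big\{\frac{1}{2}\|Y - X\beta - Z\gamma\|^2 + n\sum_{j} p_\lambda(\|\theta_j\|)\Big\},$$
where $\theta = (\beta^T, \gamma^T)^T$ and $\theta_j$ is the coefficient subvector of the $j$th factor. Under the local model assumption with no grouped variables, Wang and Fang [16] showed that, with a tuning parameter selected via BIC, the SCAD is selection consistent; that is, with probability tending to one, the narrow model can be identified. Similarly, a BIC selector can be defined based on the group SCAD as follows:
$$\hat{\lambda}_B = \arg\min_{\lambda} \mathrm{BIC}_\lambda, \qquad \mathrm{BIC}_\lambda = \log(\hat{\sigma}_\lambda^2) + \mathrm{df}_\lambda \frac{\log n}{n},$$
where $\hat{\sigma}_\lambda^2 = \|Y - \hat{Y}_\lambda\|^2/n$ and $\mathrm{df}_\lambda = \sum_{j \in \hat{S}_\lambda} d_j$, with $\hat{S}_\lambda$ the set of factors with nonzero estimated coefficients and $d_j$ the number of variables in the $j$th factor. We expect that the group SCAD is still selection consistent in the sense that $\Pr(\hat{S}_{\hat{\lambda}_B} = S_0) \to 1$ as $n \to \infty$, provided that $S_0$ is the narrow model.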
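A BIC selector over a solution path can be sketched as below. The form log(sigma-hat squared) + df log(n)/n is an assumption of this sketch (a common BIC for penalized least squares), as are the function names and the representation of the path as tuples of tuning parameter, fitted values, and sizes of the selected groups; fitting the group SCAD itself is left to an external routine.

```python
import numpy as np

def bic_value(y, fitted, selected_group_sizes):
    """Assumed BIC: log(RSS / n) + df * log(n) / n, with df the total size of selected groups."""
    y = np.asarray(y, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    n = y.size
    sigma2 = np.sum((y - fitted) ** 2) / n
    df = int(np.sum(selected_group_sizes))
    return float(np.log(sigma2) + df * np.log(n) / n)

def select_lambda_by_bic(y, path):
    """path: list of (lam, fitted_values, selected_group_sizes); returns best lam and all scores."""
    scores = [bic_value(y, fitted, sizes) for _, fitted, sizes in path]
    best = int(np.argmin(scores))
    return path[best][0], scores

# Toy path with two tuning parameters: a null fit and a fit using one group of size 2.
y_toy = [1.0, 2.0, 3.0, 4.0]
toy_path = [(1.0, [2.5, 2.5, 2.5, 2.5], []),
            (0.1, [1.1, 1.9, 3.1, 3.9], [2])]
best_lam, bic_scores = select_lambda_by_bic(y_toy, toy_path)
```

In this toy path the richer fit reduces the residual sum of squares by enough to outweigh its degrees-of-freedom charge, so the smaller tuning parameter is chosen.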
Formally, within the framework of FIC, assuming that the local model (8) is the true model and that S 0 is the narrow model, we show the following theorem. Proofs can be found in the Supplement.

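Theorem 1. Under some mild conditions (see the Supplement for details), one has that
$$\Pr(\hat{S}_{\hat{\lambda}_B} = S_0) \to 1 \quad \text{as } n \to \infty,$$
provided that model (8) with $\theta_{\text{true}} = (\beta_0, \gamma)$ and $\gamma = \delta/\sqrt{n}$ is the true model.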

Remark 2.
If we assume that $\delta = 0$, that is, the model is sparse, then Theorem 1 provides a BIC selector for the tuning parameter in the group SCAD, which can consistently identify nonzero effects. In other words, we extend the BIC selector for the SCAD proposed by Wang et al. [21] to the group SCAD.
Theorem 1 also implies both advantages and disadvantages of the BIC, which have been discussed by Wang and Fang [16]. Briefly, the BIC sacrifices prediction consistency [24], in the sense of filtering out all of the variables whose effect sizes are of order $O(1/\sqrt{n})$, to achieve model selection consistency. The previous theorem provides a data-driven method to consistently specify a narrow model, which is critical before applying the FIC. In the following subsection, we suggest a two-stage framework to apply the FIC based upon a narrow model selected via the BIC, in order to recover part of the variables filtered out by the BIC.

Stage 2: FIC.
In Stage 1, a narrow model, $\hat{S}_0 = \{1, \ldots, \hat{p}\}$, has been identified via the group SCAD with a tuning parameter selected via BIC. In Stage 2, any subset of $\hat{S}_0^c = \{\hat{p} + 1, \ldots, d = \hat{p} + \hat{q}\}$ can be added to $\hat{S}_0$. A direct application of the FIC proposed by Claeskens and Hjort [15] is not feasible even for a moderate size of $\hat{q}$, because there are $2^{\hat{q}}$ subsets of $\hat{S}_0^c$. Furthermore, best-subset selection is unstable [13]. Therefore, similar to Wang and Fang [16], and without the double minimization over both subsets and tuning parameters proposed by Claeskens [25], we suggest limiting the search domain to those subsets on the solution path of a group regularization procedure such as the group LASSO or the group SCAD.
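With a selected narrow model $\hat{S}_0 = \{1, \ldots, \hat{p}\}$, let $\tilde{X}$ collect the factors indexed by $\hat{S}_0$ and $\tilde{Z}$ the remaining $\hat{q}$ factors, with corresponding coefficient vectors $\tilde{\beta}$ and $\tilde{\gamma} = (\tilde{\gamma}_1^T, \ldots, \tilde{\gamma}_{\hat{q}}^T)^T$. Then, a solution path is generated from the following group LASSO procedure (or group SCAD):
$$\min_{\tilde{\beta}, \tilde{\gamma}} \Big\{\frac{1}{2}\|Y - \tilde{X}\tilde{\beta} - \tilde{Z}\tilde{\gamma}\|^2 + \lambda\sum_{j=1}^{\hat{q}} \|\tilde{\gamma}_j\|\Big\}, \tag{12}$$
where the tuning parameter $\lambda$ controls the grouped variables included in the subset $\hat{A}_\lambda = \{j : \hat{\gamma}_j \ne 0\}$. As the tuning parameter $\lambda$ varies from some large value to 0, $\hat{A}_\lambda$ increases from the empty set to the "full" set $\{1, \ldots, \hat{q}\}$. Then, we utilize the FIC to guide the selection of $\lambda$ in (12) over the resulting $\hat{A}_\lambda$'s, which constitute the search domain. Now, Stage 2 of the FIC for a certain focus $\mu_{\text{true}} = \mu(\theta_{\text{true}})$ is summarized as follows. For a given $\lambda$, a subset $\hat{A}_\lambda$ is provided by the indices of nonzero factors from (12). Then, based on the submodel $S = \hat{S}_0 \cup \hat{A}_\lambda$, the FIC is evaluated according to a formula developed in Claeskens and Hjort [15, formula (3.3)], which is essentially a parametric estimate of the MSE of the estimator of $\mu_{\text{true}}$ under model $S$. Consequently, $\lambda$ is selected as $\hat{\lambda}_F = \arg\min_{\lambda} \mathrm{FIC}(\hat{S}_0 \cup \hat{A}_\lambda)$, and the final submodel is selected as $\hat{S}_F = \hat{S}_0 \cup \hat{A}_{\hat{\lambda}_F}$.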
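The restricted search of Stage 2 can be outlined as a loop over the subsets visited by the solution path. In this sketch, `criterion` is a hypothetical user-supplied callable standing in for the FIC formula of Claeskens and Hjort [15], which is not reproduced here; the path is assumed to be given as pairs of a tuning parameter and the indices of the active optional factors.

```python
def select_along_path(narrow_model, path_subsets, criterion):
    """Return (score, lam, submodel) minimizing `criterion` over submodels on the path.

    narrow_model: factor indices always included.
    path_subsets: list of (lam, active optional factor indices at lam).
    criterion: hypothetical callable mapping a frozenset submodel to an estimated
               MSE of the focus estimator; smaller is better.
    """
    best = None
    seen = set()
    for lam, active in path_subsets:
        submodel = frozenset(narrow_model) | frozenset(active)
        if submodel in seen:      # different lam values can give the same subset
            continue
        seen.add(submodel)
        score = criterion(submodel)
        if best is None or score < best[0]:
            best = (score, lam, submodel)
    return best

# Toy usage with a made-up criterion that prefers submodels of size four.
toy_path = [(1.0, []), (0.5, [5]), (0.2, [5, 4]), (0.1, [5, 4])]
best_score, best_lam, best_model = select_along_path(
    {1, 2, 3}, toy_path, lambda s: abs(len(s) - 4))
```

Only as many submodels as there are distinct subsets on the path are evaluated, instead of all subsets of the optional factors.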

Simulation
Simulated data are generated under models (1)  Clearly, the narrow model of the first two settings is {1, 2, 3}, whereas, for the third one, no clear boundary is specified between big effects and small effects.
Corresponding to each setting, test points $w_{01}$, $w_{02}$, and $w_{03}$ are selected to generate AUC around 0.6, 0.8, and 0.95 to accommodate low-, moderate-, and high-accuracy cases, respectively. Consider the following: Besides the proposed two-stage framework (FIC) with group SCAD, for comparison purposes, four popular variable selection criteria, including 5-fold CV, GCV, AIC, and BIC, are also employed. Additionally, the SCAD penalty is applied to the diseased and healthy groups separately to show the gain from applying the group SCAD.
Two popular measurements, $\mathrm{MSE} = (\mu(\hat{\theta}_{\hat{S}_F}) - \mu(\theta_{\text{true}}))^2$ and the mean absolute error (MAE), defined by $|\mu(\hat{\theta}_{\hat{S}_F}) - \mu(\theta_{\text{true}})|$, are utilized to evaluate the prediction performance of the models selected by different criteria, where $\hat{\theta}_{\hat{S}_F}$ is an estimate of $\theta$ based on the final model $\hat{S}_F$ selected by a certain selection criterion. Due to the limited range of the AUC and the skewed distributions of AUC estimates, especially at the boundaries, the MAE is expected to be more appropriate.
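A minimal sketch of how the two error measures might be summarized across simulation repetitions is given below. Averaging over repetitions is an assumption of this sketch, and the function name `mse_mae` is hypothetical.

```python
import numpy as np

def mse_mae(auc_estimates, auc_true):
    """Average squared and absolute errors of AUC estimates around the true AUC."""
    err = np.asarray(auc_estimates, dtype=float) - auc_true
    return float(np.mean(err ** 2)), float(np.mean(np.abs(err)))

# Two toy repetitions with true AUC 0.85.
mse_toy, mae_toy = mse_mae([0.80, 0.90], 0.85)
```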
In this paper, a composite measurement, the F-measure, is employed to evaluate how well the various methods select the narrow model; it summarizes the commonly used proportions of selecting underfitting, correct, and overfitting models. As noted by Lim and Yu [26], a high F-measure means that both false-positive and false-negative rates are low. Define Precision = true positivity, Recall = true discovery and then, F-measure ≜ (2 ⋅ Precision ⋅ Recall)/(Precision + Recall). All results are summarized based on 500 repetitions according to simulation settings in Tables 1, 2, and 3. Table 1 indicates that the BIC performs best at identifying the narrow model, compared with the others. Also, if there are more weak signals, as in Setting 2, the performance is not as good as that in Setting 1. This is reasonable, because, with an increasing number of variables for a given sample size, it is more challenging to filter out weak signals, even under the sparsity assumption. From Table 2, we can see that, in all three settings, these five methods perform well. Specifically, for moderate and large AUC cases, the FIC performs slightly better, providing smaller MAE. Additionally, in these cases, the FIC improves on the BIC substantially, which once again indicates that the BIC tends to filter out weak signals.
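The F-measure can be computed from a selected index set and the true narrow model as sketched below. The definitions used here, precision as the share of selected variables that are truly in the narrow model and recall as the share of the narrow model that is selected, are the conventional ones and are an assumption of this sketch; the function name `f_measure` is hypothetical.

```python
def f_measure(selected, truth):
    """F-measure = 2 * precision * recall / (precision + recall) for index sets."""
    selected, truth = set(selected), set(truth)
    true_pos = len(selected & truth)
    if true_pos == 0:          # nothing correct was selected
        return 0.0
    precision = true_pos / len(selected)   # correct selections among all selections
    recall = true_pos / len(truth)         # correct selections among true variables
    return 2.0 * precision * recall / (precision + recall)
```

An exactly recovered narrow model gives the value 1, while both overfitting (lower precision) and underfitting (lower recall) pull the value below 1.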
In order to show the benefit of grouped variable selection, separate model selections for diseased and healthy subjects are also considered, and the results are summarized in Table 3. Comparing Tables 2 and 3, in most cases the group penalty provides smaller MSE and MAE for every criterion. Due to the limited range of the AUC, all MSE and MAE values in Tables 2 and 3 are small, but the group selection can improve on separate selections by as much as 25%. It is not surprising to see that, in high-AUC situations, the differences are small and separate selections with BIC are better. Possible reasons are the following: (1) there is not much room for an estimated AUC to vary when it is close to 1; (2) separate selections with BIC offer greater flexibility to obtain a sparse model.

Real Data Analysis
In this section, we demonstrate the proposed procedure by the audiology data reported by Stover et al. [27], which has  ) for each element. Former studies on this dataset showed that −SNR provided quite high discriminative performance and that had a small effect. In order to avoid specifying inappropriate covariates, we randomly select three centered observations from the whole dataset as focused subjects. Table 4 shows AUC values of models selected by each method as well as corresponding model sizes. CV, AIC, and GCV tend to select a full model. On the contrary, BIC tends to select a sparse model, only containing . The full model may not provide the largest AUC, because a large model will bring instability and ruin the AUC. As indicated in the table, for the second test point, both BIC and FIC provide a higher AUC than the full model. But a single variable selected by the BIC seems to be too strict. By focusing on the precision of

Discussion
In this paper, we rewrite the model selection problem of the ROC regression into a grouped factor selection form with induced methodology. Also, we develop a two-stage framework to apply the FIC to select a final model with group SCAD under the local model assumption. Specifically, if the true model is sparse, our framework naturally accommodates current model selection criteria. Furthermore, the BIC selector is proved to be model selection consistent if either a sparse or a local model is assumed, in the sense of selecting a sparse model or a narrow model.
Most current model selection criteria aim at the prediction performance or model selection consistency; thus, in the ROC regression where the AUC is a focus parameter, they may not be appropriate. This observation motivates an application of FIC, which is shown to perform well through simulation studies. Therefore, our method has a potential application in genetic studies, where the number of gene arrays is always large, compared with the sample size.
For the direct methodology, there is an extensive literature based on generalized estimating equations, motivated by the fact that the AUC, like the probability of a binary random variable, takes values in [0, 1]. Our future work will extend the framework developed here to generalized estimating equations and apply it to ROC regression with the direct methodology.
As discussed by one referee, it is possible that some coefficients are the same for both the diseased and nondiseased models. As in (1), modeling them separately will increase the degrees of freedom in (3), especially when a large number of genes are covariates. If shrinkage of a coefficient that is known a priori to be the same in both diseased and healthy groups is not necessary, then it is natural for the FIC to include it in the narrow model with a single coefficient. With the proposed objective function, a fused LASSO type of penalty may be applied to obtain this kind of structure, in addition to the group LASSO/SCAD. Friedman et al. [30] provided a note on the group LASSO and the sparse group LASSO, which could shed light on this question. This will also be an interesting topic for future work.

Conflict of Interests
There is no conflict of interests regarding the publication of this article.