Classification and Selection of Biomarkers in Genomic Data Using LASSO

High-throughput gene expression technologies such as microarrays have been utilized in a variety of scientific applications. Most of the work has been done on assessing univariate associations between gene expression profiles with clinical outcome (variable selection) or on developing classification procedures with gene expression data (supervised learning). We consider a hybrid variable selection/classification approach that is based on linear combinations of the gene expression profiles that maximize an accuracy measure summarized using the receiver operating characteristic curve. Under a specific probability model, this leads to the consideration of linear discriminant functions. We incorporate an automated variable selection approach using LASSO. An equivalence between LASSO estimation with support vector machines allows for model fitting using standard software. We apply the proposed method to simulated data as well as data from a recently published prostate cancer study.


INTRODUCTION
DNA microarrays simultaneously gauge the expression of thousands of genes in clinical samples. In this paper, we focus on cancer studies, where gene expression technologies have been applied extensively (Alizadeh et al [1]; Khan et al [2]; Dhanasekaran et al [3]). Obtaining large-scale gene expression profiles of tumors should theoretically allow for the identification of subsets of genes that function as prognostic disease markers or biologic predictors of therapeutic response. Because the data are highly multivariate and complex, it is important to develop automated statistical methods to detect systematic signals in gene expression patterns.
In cancer studies, analyses have typically focused on one of three problems. First, investigators have looked for genes that discriminate neoplastic from benign tissue. Statistically, this is the problem assessing differential expression of genes and has been studied by several authors; see, for example, Efron et al [4]. A second problem is clustering the samples to find subtypes of disease using algo-Correspondence and reprint requests to D. Ghosh, Department of Biostatistics, University of Michigan, 1420 Washington Heights, Ann Arbor, MI 48109-2029, USA, E-mail: ghoshd@umich.edu This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. rithms such as those in [5]. The final class of problems is classification or supervised learning, which involves using the profile to predict some clinical outcome, such as the stage of disease. Suppose that in this instance, we treat the gene expression profile as the independent variables and tissue type as the response. A particular feature of microarray experiments is that the dimension of the predictor space (number of genes) is typically larger than the number of samples. This is known as the "large p, small n" paradigm (West [6]), so classification methods must take this into account.
One method to do this is apply prefiltering criteria in which the candidate number of genes for building a classifier is smaller than the number of samples. For example, Dudoit et al [7] performed a systematic comparison of several discrimination methods for the classification of tumors based on microarray experiments. However, they must perform an initial reduction in the number of predictors before building the classifier.
We wish to consider the joint effects of genes in determining classification rules for discriminating tumors. There are two assumptions that drive our proposed methodology. First, we assume that the joint effects of multiple genes must be considered in discriminating classes of disease. Recently, much attention has been given to the finding that a 70-gene signature can predict breast cancer survival (van't Veer et al [8]; van de Vijver et al [9]). However, most such gene signatures have been constructed using univariate methods. It seems reasonable to consider joint models, as genes are correlated because of their mutual involvement in disease pathways. 2005:2 (2005) The second assumption is that there are individual genes that can discriminate classes. This is different from the latent factor and partial least squares proposals put forth by other authors (West [6]; Nguyen and Rocke [10]), where linear combinations of all available genes are used to predict the outcome. We seek to develop interpretable models for classification; for this purpose, using individual genes for predictors rather than linear combinations of genes seems reasonable.
In this paper, we develop classification rules based on the consideration of measures of diagnostic accuracy. In particular, we are interested in finding gene expression profiles that can discriminate between two populations. A unique challenge is posed because of the large p, small n problem. Our solution is to combine the problems of variable selection and classification. We suggest an approach for classification using the LASSO approach (Tibshirani [11]). An advantage of this approach is that some of the effects of the variables in these models are estimated to be exactly zero. These will represent genes that have no discriminatory power between the two classes, while those with nonzero coefficients will represent genes that can separate classes of tumors successfully. Thus, a byproduct of the approach is the generation of a gene list. We exploit an equivalence between LASSO and support vector machines (SVMs) in order to fit the proposed classifier. The structure of the paper is as follows. In "materials and methods," we provide background on the data structures observed and the motivation based on biomarker combinations, which leads to the use of linear discriminant functions. We also provide a review of LASSO estimation (Tibshirani [11]) in this section. The latter two techniques are then involved in the proposed estimation procedure, described in "results and discussion." There, we also describe how to implement the proposed method using software for SVMs. Issues of model selection are also discussed. We describe the application of the proposed methodologies to simulated data and data from a recent cancer profiling study (Dhanasekaran et al [3]) in "prostate cancer gene expression data." Finally, some concluding remarks are made in "conclusion."

MATERIALS AND METHODS
Let a T denote the transpose of the vector a. For the ith sample (i = 1, . . . , n), we let X i = [X i1 · · · X ip ] T denote the p × 1 gene expression profile vector (ie, X i j is the gene expression measurement of the jth gene, j = 1, . . . , p). We suppose that the data have already been preprocessed and normalized. In addition, it is assumed that the gene expression data are standardized so that for each gene, the mean is zero and standard deviation one. Let g i denote the tumor class for the ith sample (i = 1, . . . , n); we assume that there are two classes so that g i takes values g ∈ {0, 1}. Here and in the sequel, we will refer to g = 1 as the diseased class and g = 0 as the healthy class; however, the methods proposed here are applicable to any two-class setting. In "LASSO estimation," we assume the existence of a continuous response variable Y i for the ith sample (i = 1, . . . , n).

ROC curves and optimal biomarker combinations
Our approach is to consider each measurement from a microarray for a single gene as a diagnostic test. Thus, for each subject, we have a high-dimensional vector of diagnostic test results. We then want to utilize this information in a way to separate the two populations of patients. This issue of finding combinations of biomarkers to accurately classify patients has been considered by Su and Liu [12], Baker [13], and Pepe and Thompson [14] in the statistical literature.
To combine information across the high-dimensional vector of gene expression profiles, we consider linear combinations of the form β T 0 X i , i = 1, . . . , n. Without loss of generality, we will also assume that larger values of this linear combination corresponding to increasing likelihood of having g = 1. While the method can be easily extended to incorporate interactions between gene expression measurements, we focus on consideration of the main effects for purposes of exposition.
Suppose X D represents the gene expression profile for a typical cancer specimen (ie, g = 1), and XD is the corresponding profile for a randomly chosen benign specimen. Note that in our situation, the diagnostic test is the linear combination β T 0 X. One relevant quantity is the false positive rate based on a cutoff c, defined to be FP(c) = P(β T 0 X > c|g = 0). Similarly, the true positive rate is TP(c) = P(β T 0 X > c|g = 1). The true and false positive rates can be summarized by the receiver operating characteristic (ROC) curve, which is a graphical presentation of {FP(c), TP(c) : −∞ < c < ∞}. The ROC curve shows the tradeoff between increasing true and false positive rates. Tests that are have {FP(c), TP(c)} values close to (0, 1) indicate perfect discriminators, while those with {FP(c), TP(c)} values close to the 45 • line in the (0, 1) × (0, 1) plane are tests that are unable to discriminate between the diseased and healthy populations. Examples of ideal and noninformative ROC curves are given in Figures 1a and 1b. While the specificity and sensitivity of a diagnostic test depend on the cutoff value chosen, a useful summary measure to consider is the area under the ROC curve. It can be shown mathematically that the area under curve is [15]). Under a binormal probability model, Su and Liu [12] showed that this quantity is optimized using the linear discriminant function. This motivates our choice of consideration of these variables. We next present an algorithm for estimation of these functions.

Linear discriminant functions by optimal scoring
While linear discriminant analysis (LDA) is typically calculated using matrix algebra techniques, an alternative method of calculating them is through the use of optimal scoring (Hastie et al [16,17]). In this method, the problem of classification into two groups is reexpressed as a regression problem based on quantities known as optimal scores.
The point of optimal scoring is to turn the categorical class labels into quantitative variables. Let θ(g) = [θ(g 1 ), . . . , θ(g n )] T be the n × 1 vector of quantitative scores assigned to g for the kth class. The optimal scoring problem involves finding the vector of coefficients η ≡ (η 1 , η 2 , . . . , η p ) and the scoring map θ : {0, 1} → R that minimize the following average squared residual: Let Z be an n×2 matrix with the ith row equal to (1, 0) if g i = 1 and (0, 1) if g i = 0 (i = 1, . . . , n). The optimal scores are assumed to be mutually orthogonal and normalized with respect to an inner product. Thus, the minimization of (1) is subject to the constraint N −1 ZΘ 2 = 1, where Θ = [θ(0) θ(1)] T is a 2 × 1 vector of the optimal scores. Hastie et al [16] state that the minimization of this constrained optimization problem leads to estimates of η that are proportional to the discriminant variables (ie, the discriminant function) in LDA. In particular, they propose the following algorithm for the estimation of the LDA functions (1) Choose an initial score matrix Θ 0 satisfying (4) Define f opt (x) = Φ T f(x).
As mentioned before, a problem with attempting to apply standard linear discriminant function methods to the data here is that there is not a numerically unique solution because p is larger than n. Thus, some type of regularization is needed. Our approach is based on the LASSO, which is described in the next section.

LASSO estimation
We suppose that our data are (Y i , where β = (β 1 , . . . , β p ) and λ ≥ 0 is a penalty term. Thus, the constraint that is utilized is an L 1 constraint. An alternative way of formulating (2) Note that in the absence of the constraint, the solution is given by the ordinary least squares (OLS) estimator. If the usual OLS estimator satisfies the constraint, then the LASSO and OLS estimates of β coincide. However, for smaller values of t, some of the components of β are estimated to be zero. In the linear regression setting, LASSO estimation has been considered by Tibshirani [11].
For a given value of t, minimization of n i=1 (Y i − β T X i ) 2 subject to an L 1 constraint on the components of β is a quadratic programming problem with 2 p linear equality constraints. A sequential algorithm is given by Tibshirani [11] to solve the optimization problem.
While Tibshirani [11] considered estimating coefficients in regression models using LASSO, our interest is in using gene expression data to classify tumors. In particular, we seek to extend the LDA approach advocated by Dudoit et al [7] to handle the case where p is larger than n. We outline the proposed method in the next section.

Estimation methods
We propose to use an optimal scoring procedure for classification, where LASSO estimation is incorporated. In the notation of the previous section, we wish to solve the following optimization problem. Minimize subject to the constraint N −1 ZΘ 2 = 1. Here is the outline for our procedure.
(1) Choose an initial score matrix Θ 0 satisfying Θ T 0 D p Θ 0 = I, and let Θ 0 = ZΘ. (2) Fit a linear regression model of Θ 0 on X subject to an L 1 constraint on the parameters. Define the fitted values Θ * 0 . Let f(X) be the vector of fitted regression functions. Note that we are incorporating the LASSO estimation procedure in step (2) of the algorithm. We cannot use the algorithm of Tibshirani [11] because it is too computationally intensive for large p (number of genes). However, it turns out that the algorithm can be fit using standard software for SVMs, which we will now describe.

Support vector machines
An excellent descriptions of SVMs for classification can be found in [18]. We provide an overview of the method here. We assume that the data are {x i , y i } (i = 1, . . . , n), where x i is a d-dimensional vector and y i ∈ {−1, +1} is the class label. The goal of SVMs is to find an optimal separating hyperplane between the observations with y = −1 and those with y = 1. This problem can be expressed as minimizing w 2 subject to the following constraints: Details on how to solve the optimization problem can be found in [18, chapter 7]. In the unregularized case, fitting the LASSO model is equivalent to fitting an SVM classifier with the following 2p × 1 n-dimensional vectors as the inputs: g, Y k and −Y k (k = 1, . . . , p), defined to be the sample labels, gene expression values and their negative values for the kth gene across the n samples. The label is the vector y 0 , defined to be −1 for the first entry and 1 for the other entries. The proof of the equivalence is given in the "appendix." We have created a macro in R (R foundation) that implements the proposed method and can be obtained from the first author.
As mentioned earlier, an advantage of this approach is that most of the gene effects are estimated to be exactly zero. The method can also identify the genes associated with each of the two classes. Genes whose coefficients are negative are associated with the class g = −1, while those with positive estimated coefficients are associated with g = 1.
As is evident in the algorithm from the previous section or in (3), the parameter λ needs to be estimated. We use fivefold cross-validation for this.

Simulated data
We first performed a set of simulations to determine how well the proposed methods were at classification. We generated p = 1000 dimensional vectors for two populations. We considered the following sample size combinations (n 0 , n 1 ) = (15,15), (20,10), (50, 50), and (70, 30), where n k is the number of samples in the group with g = k (k = 0, 1). All the genes were assumed to be independent with a normal distribution and variance 1. We assumed a model in which a fraction π of the genes was differentially expressed between the two classes, π = 0.05 and π = 0.5 were considered. We examined two scenarios. For the first scenario, there was a big change in differential expression in the differentially expressed genes, a shift of 5 units in the mean. In the second scenario, the fold change was only a 1.5 unit difference in mean. For each simulation setting, 100 datasets were generated, and the classification error rates were estimated using three-fold cross-validation. No optimization was performed; we set λ = 10. The results are summarized in Table 1. Based on the table, we find that for larger sample sizes and larger effect sizes, as well as larger numbers of effects, the error rates are smaller.
However, in our simulations (data not shown), we found that the method had difficulty in selecting the correct variables when p is larger than n. This attests to the fact that variable selection in the situation of large p and small n is quite difficult. We discuss this situation in the "conclusion."

Prostate cancer gene expression data
The example we consider is from a prostate cancer study; a subset of the samples was considered by Dhanasekaran et al [3]. We focus here on noncancer versus cancer tissues. The samples are profiled using spotted cDNA (ie, red/green) microarrays; there are initially 101 samples profiled using 10 K chips (9984 genes). We have taken the following preprocessing steps: (1) remove genes that are reported as missing in more than 10% of the samples; (2) remove genes that have a variance less than 0.05 in all samples; (3) impute measurements for missing genes using the median.
This leaves a total of 4880 genes for analysis. We first performed an estimation of the error rate using fivefold cross-validation. This generally gave an error rate between 15-20% for various choices of λ, suggesting that the classifier is not sensitive to the choice of the smoothing parameter.
One of the by-products of the procedure is a list of genes that are estimated to have non-zero effects. We present the gene lists for λ = 1 in Table 2. Out of the 4880 genes, only 21 are estimated to have nonzero effects. Of the genes that are overexpressed in prostate cancer relative to benign prostate tissue, the early growth response (Hs. 326035/301865), feline sarcoma viral oncogene homolog (Hs.81665), T-cell receptor gamma locus (Hs. 112259), and fatty acid synthase (Hs.83190) have been seen by other investigators to be upregulated in prostate cancer, as in Table 3. The other genes on the list could represent false positives or genes whose joint effect is predictive of cancer status.

Conclusion
In this paper, we have introduced a new approach to the joint problems of classification and variable selection in the analysis of microarray data. These problems have been treated as separate problems in the previous literature. Our approach is combine the two problems by use of the LASSO. This work has opened the way for several future avenues of research that we are currently investigating. First, a popular alternative to LDA in classification problems is logistic regression. It has been recently motivated by ROC considerations (McIntosh and Pepe [19]). While it is possible to formulate a LASSO estimation for logistic regression models, adapting the LASSO-SVM equivalence to this situation requires new algorithms. It will also be important to compare the performance of the two L 1regularized procedures (LDA and logistic regression) on real and simulated microarray datasets.
In this paper, we focused on the two-class problem. While LDA and logistic regression can be extended to accommodate multicategorical responses, the ROC arguments that motivated the method here only exist for two populations. We are currently exploring theoretical frameworks for generalizing ROC ideas for multiple disease states.
The estimation procedure described in this paper allows the joint estimation of multivariate gene effects on the response (class label). The approach described here could be generalized by fitting more nonlinear gene effects in the estimation algorithm or by including higherorder interactions between genes. Another generalization is to perform a clustering of the genes and to enter the cluster averages as covariates in the model. Such an approach was taken by Hastie et al [20] and Tibshirani et al [21].
It is also of current interest to incorporate biological knowledge into microarray data analyses. In many instances, scientists are interested in the effects of a particular gene or pathway on genetic expression. In this context, approaches have been suggested by Zien et al [22] and Pavlidis et al [23] in which biological knowledge as represented by pathway scores or functional annotation status are correlated with gene expression. However, their approaches were univariate. There would be potential gains in efficiencies of analyses by considering joint models for pathways. We are currently studying the applicability of the joint estimation procedure described here to that setting.