Enhancing the Lasso Approach for Developing a Survival Prediction Model Based on Gene Expression Data

In the past decade, researchers in oncology have sought to develop survival prediction models using gene expression data. The least absolute shrinkage and selection operator (lasso) has been widely used to select genes that truly correlated with a patient's survival. The lasso selects genes for prediction by shrinking a large number of coefficients of the candidate genes towards zero based on a tuning parameter that is often determined by a cross-validation (CV). However, this method can pass over (or fail to identify) true positive genes (i.e., it identifies false negatives) in certain instances, because the lasso tends to favor the development of a simple prediction model. Here, we attempt to monitor the identification of false negatives by developing a method for estimating the number of true positive (TP) genes for a series of values of a tuning parameter that assumes a mixture distribution for the lasso estimates. Using our developed method, we performed a simulation study to examine its precision in estimating the number of TP genes. Additionally, we applied our method to a real gene expression dataset and found that it was able to identify genes correlated with survival that a CV method was unable to detect.


Introduction
In the past decade, researchers have predicted survival in a cancer patient based on gene expression data [1][2][3][4]. Revealing the relationship between gene expression profiles and the time to an event of interest (e.g., overall survival, metastasisfree survival) can improve treatment strategies and establish accurate prognostic markers. The Cox proportional hazard model is the most popular method for relating covariates to survival times [5]. However, due to the high dimensionality of gene expression data (i.e., the number of genes expressed exceeds the number of patients), it is not possible to take an estimation approach based on the Cox log partial likelihood. To overcome this problem, a penalized estimation approach, which includes a shrinkage estimation of coefficients, is frequently taken [6][7][8].
In penalized estimation approaches, the least absolute shrinkage and selection operator (lasso) [9,10] is often used because of its attractive ability to simultaneously select the genes correlated with survival and estimate the coefficients in the Cox model. The lasso shrinks most of the coefficients towards zero exactly by adding 1 norm to the Cox log partial likelihood, and the amount of shrinkage is dependent on the tuning parameter. The value of the tuning parameter is often determined by a cross-validation (CV), which maximizes the out-of-data prediction accuracy [11].
Several researchers have investigated the operating characteristics of the lasso. Goeman [12] used the lasso to analyze a publicly available gene expression dataset, obtained from the articles of van't Veer et al. [2] and van de Vijver et al. [3] in which a 70-gene signature for prediction of metastasisfree survival in breast cancer patients had been established. This data included 295 patients with 4919 genes that were prescreened from 24,885 genes based on the quality criteria in van't Veer et al. 's work [2]. The lasso selected 16 genes with which to develop a prediction model of overall survival when using the tuning parameter that was determined using a CV. Goeman [12] also conducted ridge regression using all 4919 genes to develop a model by adding 2 norm to the Cox log partial likelihood. The prediction accuracy of the lasso and ridge regression were compared, and the ridge regression with 4919 genes slightly outperformed the lasso with 16 genes. Goeman [12] concluded that the lasso potentially passes over genes that are correlated with survival in order to develop a simple prediction model. Bøvelstad et al. [7] reached the same conclusion in a review of the survival prediction methods available for analyzing breast cancer gene expression datasets. Table 1 summarizes a typical result of gene selection by the lasso.
The CV method determines the value of the tuning parameter by considering the trade-off between the number of true positives (TP) and false positives (FP), and so the possibility of identifying false negatives (FN) cannot be eliminated. One solution for identifying more outcomepredictive genes is to monitor the number of TP in several values of the tuning parameter and, subsequently, determine its final value. In this study, we developed a method for estimating the number of TP for a series of values of the tuning parameter. We assumed a mixture distribution with components of TP and FP for the lasso estimates, and these could be used to estimate the number of TP and FP. It is possible to generate the solution path that includes the lasso estimates for a series of values of the tuning parameter using the methods developed by Goeman [12]. Here, we proposed an algorithm to sequentially fit the mixture distribution for this solution path, and we used a simulation study to test the precision of the algorithm when estimating the number of TP. We further demonstrated the proposed algorithm using a well-known diffuse large B-cell lymphoma (DLBCL) dataset comprising overall survival of 240 DLBCL patients and gene expression data of 7399 genes [1].

Lasso in the Cox Proportional Hazard
Model. The Cox proportional hazard model is the most popular method for evaluating the relationship between gene expression and time to an event of interest [5]. The hazard function of an event at time for a patient ( = 1, . . . , ) with the gene expression levels x = ( 1 , . . . , ) T is given by where = ( 1 , . . . , ) T is a parameter vector and ℎ 0 ( ) is the baseline hazard, which is the hazard for the respective individual when all variable values are equal to zero. In the general setting where > , the coefficients are estimated by maximizing Cox log partial likelihood as follows: where is an indicator, which is 1, if the survival time is observed, or 0, if censored. ( ) is the risk set of the individuals at . In the lasso for the high-dimensional setting where < , the coefficients are estimated by maximizing the following penalized likelihood function [9,10]: where is the tuning parameter, which determines the amount of shrinkage.

Solution Path of the Lasso Estimates.
Goeman [12] introduced a method to calculate the solution path of the lasso estimates as a function of ,̂( ), which is based on the algorithm developed by Park and Hastie [13]. The method maximizes ( , ) at a fixed based on a combination of gradient ascent optimization with the Newton-Raphson algorithm.̂( ) are calculated for 0 > ⋅ ⋅ ⋅ > > ⋅ ⋅ ⋅ > > 0 successively, starting from 0 = max / | =0 (which giveŝ( 0 ) = 0 because the value has zero gradients). is chosen arbitrarily but is often set to 0.05 × 0 in analyses of gene expression data [14]. The lasso estimates at a current step are set to initial values for calculation of the subsequent step.

Mixture Distribution for Estimating the Number of TP in the Lasso Estimates.
To estimate the number of TP in the lasso estimates at a fixed value of , we assumed a mixture distribution developed in our previous study [15]. We introduced the mixture distribution based on the two features of the lasso: (i) the lasso selects at most genes because of the nature of the convex optimization problem when < [16,17] and (ii) in the Bayesian paradigm the lasso estimates are the posterior mode with the independent Laplace prior distribution ( ; 0, 1/ ) = ( /2) exp(− | |), where ( ; , ) = 1/2 exp(−| − |/ ) is the probability density function of Laplace distribution with location parameter and scale parameter [9]. Therefore, the mixture distribution assumed for the lasso estimates at was as follows: where 0 and are mixed proportions ( 0 + ∑ =1 = 1); (̂( ); , 2 ) is the probability density function of the normal distribution with mean ( ̸ =0) and variance 2 in component ; is the number of components, which is determined by model selection criteria; and is the constant value, which is boundlessly close to 0; for example, = 10 −8 . The unknown parameters, 0 , , , , and , are estimated by maximizing the log-likelihood function of (4) by using the Newton-Raphson method.
The mixture distribution defined in (4) is formulated on the basis of the following concepts: since the lasso selects a maximum of genes when > , the coefficients for − genes are exactly zero; therefore, (4) consists of 2 terms ( / term and 1 − / term). In the / term, the Laplace distribution with location parameter 0 and scale parameter 1/ was assumed to be the distribution for the FP on the basis of the lasso feature (ii) discussed above, while the component normal distribution with location parameter and scale parameter 2 was assumed as the distribution for the TP. In the 1 − / term, the Laplace distribution with location parameter 0 and scale parameter was assumed as the distribution of − genes based on the aforementioned lasso feature (i).
The with location parameter 0 and scale parameter 1/ was assumed to be the distribution for the FP on the basis of lasso feature (i), discussed above. The with location parameter and scale parameter 2 was assumed as the distribution for the TP. The of the (1 − / ) term was assumed as the distribution of − genes based on the aforementioned lasso feature (ii). Given a cut-off value (>0), the estimated proportions of the FP and TP are the area under the estimated Laplace and normal distribution in the / term of (4), respectively, and can be written as follows:   (5) when the number of components, , is 1. Using (5), the number of TP and FP was estimated byF

Algorithm for Estimating Number of TP in a Series of
Values . Here, we propose an algorithm to sequentially fit the mixture distribution in (4) to the solution path of the lasso estimates, which was described in Section 2.2. In this algorithm, we assumed that the number of TP changed when the newly selected or excluded gene from to +1 was truly correlated to survival, based on the maximum log-likelihood of (4). First, we approximated̂F P ≈̂0 and̂T P ≈ ∑ =1î n (5) by assuming a suitably small cut-off value (≈0). We then obtained̂0 =FP/ and̂=TP / ( = 1, . . . , ) from (6) and (7), respectively, whereTP is an estimate of the number of TP in component . For = 1, . . . , , the proposed algorithm was as follows.
Step 3. In this step, we determined whether the newly selected or excluded gene from to +1 was TP or FP based on the maximum log-likelihood which was calculated in Steps 1.2 and 2.3. If ( +1) 0 was the largest in ( +1) ( = 0, . . . , ), we assumed that the newly selected or excluded gene was FP; if not, we assumed that it was TP. Therefore, calculate max = argmax ∈{0,1,..., } ( +1) . If max = 0, updateFP ( ) as follows:F If max > 0, updateTP ( ) max as follows: Here, calculate the estimated TP at + 1 byTP

Simulation Study.
We performed a simulation study to examine the precision of our estimated TP. In this study, the number of patients, , was set to 200. The number of genes, , was set to 1000, which included the 1 (=5 or 30) outcomepredictive genes that are randomly chosen from genes in each simulation. The coefficient for gene ( = 1, . . . , ), , was set to 1.5 for the 1 outcome-predictive genes and 0 for the remaining − 1 none-outcome-predictive genes. We set to 5 and the number of components, , to 1 throughout (although was determined using a model selection criterion in practice). The gene expression levels for patient , x , were generated from the multivariate normal distribution with mean vector 0 and covariance matrix Σ so that the variance was 1 and the correlation ( , ) = 0 or 0.5 | − | [18]. The survival time for patient was generated based on the exponential model = − log( )/ exp(x T ) where is the uniform random variable between 0 and 1 [19]. In order to evaluate the precision of the estimated TP for various values of , we report a number of selected genes, including true TP, and estimated TP and FP, for ( = 5, 10, 50, 100, 150). Table 2 shows the average of , a number of selected genes, true TP, and estimated TP and FP, through 1000 repeats. We observed that the precision of estimated TP varied depending on the value of both 1 and (see Table 2). When 1 = 5, the precision of the estimates was sufficient for = 10, 50, 100, and 150, while TP was slightly underestimated for = 5. However, when 1 = 30, the precision of the estimates was sufficient for = 5, 10, and 150, while TP was overestimated for = 50 and 100. For example, when 1 = 30, = 0.5, and = 100, the average number of true and estimated TP was 29.9 and 35.3, respectively. The values of did not greatly affect the accuracy of the estimated TP.

Real Data Analysis.
To illustrate how our proposed algorithm could be used to determine , we applied it to the DLBCL dataset, comprising survival of 240 DLBCL patients and gene expression data from 7399 genes [1]. In the gene expression data from the 240 patients, we identified 434 genes with complete sets of gene expression values; all other genes had missing expression values, with an average of 24.7 missing values per gene. Here, we used 0.0 as the missing expression value for descriptive purposes. Similar to Rosenwald et al. [1], we divided the data into two: training data consisting of 160 patients and validation data consisting of 80 patients.
We applied our proposed algorithm to the obtained solution path. We assumed three mixture distributions on the lasso estimates with = 1, 2, or 3 and compared their goodness of fit for thê( ) ( = 0, 1, . . . , ) by the Akaike information criterion (AIC). As a result, we chose = 1 because it had the best AIC for all ( = 0, 1, . . . , ). Figure 2 shows the estimated number of TP in a series of values of . We found that the lasso selected at most 42 TP, with the number of selected genes at 96, when = 7.19 (=0.86 as log 10 ). Therefore, we selected = 7.19 as the optimum , and the estimated mixture distribution for the value of was as follows:     In order to identify the 42 TP from the 96 selected genes, we arranged the 96 in descending order of |̂| and identified the first 42 listed genes with a cut-off value = 0.084. Subsequently, the model that included these 42 genes is identified as the "42 TP-model. " In comparison to the 42 TP-model, we performed CV. Briefly, the -fold CV was given by where (− ) ( ) and̂( − ) are the log partial likelihood and the lasso estimate with left th fold out, respectively. The optimal value of was obtained by maximizing CV( ). On the basis of 5-fold CV, 12 genes were selected with = 27 (=1.43 as log 10 ). Subsequently, the model including these 12 genes is identified as the "CV-model. " Notably, both the 42 TP-model with 42 genes and the CV-model with 12 genes selected 4 genes in common. Table 3 shows the GenBank accession number and description for each of the 4 genes selected by both models. We compared the prediction accuracy of the 42 TPmodel and the CV-model using validation data consisting  values for the log-rank test and prognostic index and the deviance. The 80 patients were categorized into 2 groups, the "better" and "worse" prognostic groups, using the boundary of the median of prognostic index̂= x T . The Kaplan-Meier curves between the 2 groups were compared with a log-rank test. Next, we calculated the value for the parameter multiplied by the prognostic index̂in the Cox proportional hazard model ℎ( | x) = ℎ 0 ( ) exp(̂). Finally, the deviance was calculated by −2{ (validation) (̂t raining ) − (validation) (0)}, where (validation) (̂t raining ) and (validation) (0) are the Cox log partiallikelihood function for the estimated coefficients by using the training data and zero vector 0, respectively. For each criterion, the lower value suggested better prediction accuracy. Table 4 shows the values of the 3 criteria for each model. We found that the values of all 3 criteria for the 42 TP-model were lower than those for the CV-model, suggesting that the model based on the proposed method was more accurate (see Table 4). Additionally, Figure 3 shows that the Kaplan-Meier curves for the 42 TP-model distinguished the "better" and "worse" prognostic groups more definitely than those for the CV-model (42 TP-model, < 0.001; CV-model, = 0.007). Therefore, by using our proposed algorithm, we determined and were able to select important genes, likely to be correlated with survival, in which the CV was unable to select.

Discussions
In this study, we proposed an algorithm for estimating the number of TP on the solution path of lasso estimates. Monitoring and determining the number of TP for a series of values are important because they can increase the probability of uncovering all outcome-predictive genes. The number of TP should be estimated with appropriate accuracy. To confirm the accuracy of our TP, we conducted a simulation study using a typical gene expression dataset. We found that the precision of our algorithm for estimating the number of TP was adequate, although an overestimation occurred with some values of . However, the overestimation occurred when the true number of TP was saturated, and so it may not cause a problem by passing over genes that truly correlated with survival. In the simulation study where 1 = 30 and = 0.5, the maximum average estimated number of TP was 35.3 at = 12.4 (see Table 2). Using this to select TP, an average selection of 29.9 TP within 30 outcome-predictive genes can be made, with the number of TP genes that are passed over being negligible in practice.
The data that have been provided in Table 2 showed that the number of false positives increased, while the number of true positives increased and then plateaued as the tuning parameter decreased. To decrease the number of FP identified while maintaining an adequate number of TP, we should determine the value of by monitoring both the number of Computational and Mathematical Methods in Medicine 7 TP and the false positive rate (=FP/(TP+FP)) in the proposed method.
Additionally, our proposed algorithm was applied to DLBCL data. We determined the value of the tuning parameter based on the maximum number of estimated TP uncovered by the algorithm. We identified 42 TP genes among 96 selected genes based on the ranking of the absolute values of the lasso estimates. We can also identify TP based on model evaluation criteria such as AIC among all possible combinations of 42 genes from 96, that is, 96 C 42 (>10 27 ) combinations in total; however, calculation of AIC for all possible gene combinations is a distant approach. To evaluate the efficiency of the approach using the ranking of the lasso estimates, we calculated the AIC for 10,000 randomly chosen models among all the possible models and subsequently compared it with the AIC of our approach. From 10,000 models, the AIC of 425 models (4.25%) was better than that of our approach. This result indicated that our rankingbased approach has a satisfactory performance in practice with respect to the identification of 42 genes. Although investigation of all possible gene combinations is ideal, our approach is a good alternative.
In the application to DLBCL data, in comparison to a CV method by which 12 genes were identified, we identified 42 TP genes with our algorithm, and we improved the prediction accuracy of the model. In practice, some researchers might be satisfied with identifying a few promising genes and would not be unduly worried about passing over others. In such a situation, the CV would be preferable because it developed the model to uncover a few genes with just a small loss of prediction accuracy. However, genes that are selected by the lasso are often investigated with greater scrutiny by genetic researchers, and so passing over outcome-predictive genes by the lasso could represent a major problem. Indeed, if the lasso passes over outcome-predictive genes, some genetic research may not take place. Therefore, when identifying all outcomepredictive genes is a priority, our proposed algorithm will be most useful.

Conclusions
We developed a method for estimating the number of true positives for a series of values of a tuning parameter in the lasso. We demonstrated the utility of the developed method through a simulation study and an application to a real dataset. Our results indicated that our developed method was useful for determining a value for the tuning parameter in the lasso and reducing the probability of passing over genes that are truly correlated with survival.