Novel Harmonic Regularization Approach for Variable Selection in Cox's Proportional Hazards Model

Variable selection is an important issue in regression and a number of variable selection methods have been proposed involving nonconvex penalty functions. In this paper, we investigate a novel harmonic regularization method, which can approximate nonconvex Lq  (1/2 < q < 1) regularizations, to select key risk factors in the Cox's proportional hazards model using microarray gene expression data. The harmonic regularization method can be efficiently solved using our proposed direct path seeking approach, which can produce solutions that closely approximate those for the convex loss function and the nonconvex regularization. Simulation results based on the artificial datasets and four real microarray gene expression datasets, such as real diffuse large B-cell lymphoma (DCBCL), the lung cancer, and the AML datasets, show that the harmonic regularization method can be more accurate for variable selection than existing Lasso series methods.


Introduction
One of the most important objectives for survival analysis is to select a small number of key risk factors from many potential predictors [1]. Commonly, the Cox proportional hazards model [2,3] is used to study the relationship between predictor variables and survival time. Suppose a dataset has a sample size of to study the survival time on covariate ; we use the data form of ( 1 , 1 , 1 ), . . . , ( , , ) to represent the individual's sample, where the survival time being complete if = 1 and right censored if = 0. As in regression, = ( 1 , 2 , . . . , ) is a potential prediction vector.
By Cox's proportional hazards model, the hazard function is given as where the baseline hazard function ℎ 0 ( ) is unspecified and = ( 1 , 2 , . . . , ) is the regression coefficient vector of variables. Cox's partial log-likelihood is expressed as where denotes the set of indices of the survival individuals at time .
In practice, only a small number of the predictor variables actually affect the hazard rate. The goal of variable selection in Cox's proportional hazards model is to select the key risk factors. Recently a series of penalized partial likelihood methods, such as the 1 [4][5][6][7], (0 < < 1) [8] and 1/2 [9,10] penalized methods were proposed for Cox's proportional hazards model. These penalized partial likelihood methods find important risk factors by shrinking some regression coefficients to zero.

Computational and Mathematical Methods in Medicine
The standard penalized methods cannot directly be applied to the nonlinear Cox model to obtain parameter estimates. Therefore, Tibshirani [11] proposed an iterative procedure to transform the Cox's partial log-likelihood function (2) to linear regression problem. Let = ( 1 , . . . , ) , = 1, . . . , , and denote the × predictor variable matrix, = , = −( / ), = −( 2 / ), and = + − , where − is a generalized inverse of . Since the general quadratic programming cannot be directly solved to the cases with ≫ , Gui and Li [12] applied the Choleski decomposition to obtain = 1/2 such that = , = , and̂= . By the Taylor expansion, the partial log-likelihood ( ) is approximated by the quadratic form: Thus, the regularization methods can directly solve the penalized regression problem: where is the tuning parameter. Tibshirani [5] proposed the Lasso (least absolute shrinkage and selection operator) method, which has 1 penalty ( ) = | |, which shrinks small coefficients to zero and hence results in a sparse representation of the solution. Fan and Li [4] proposed the smoothly clipped absolute deviation (SCAD) penalty, which avoids excessive penalties on large coefficients and enjoys the oracle properties. Zhang [7] proposed the minimax concave plus (MCP) method, which is a continuous and nearly unbiased approach in highdimensional linear regression. Zhang and Lu [6] suggested an adaptive Lasso method with an adaptively 1 penalty estimate the parameters, which uses the penalty ( ) = | |/| |, where the weights 1/| | are chosen adaptively by the data. Zou and Hastie [13] proposed an elastic net method that combines the 1 and 2 ( ( ) = | | 2 ) penalties.
The above mentioned series of regularized regression methods were based on the 1 penalty. Recently, several works on learning sparse models have stressed the need of other penalties for achieving better sparsity profile. For instance, Rosset and Zhu [14] suggested the use of a penalty, which simply consists in replacing the 1 norm with nonconvex norm (0 < < 1). Zhang [15] presented a multistage convex relaxation scheme, which can be relaxed to a smoothed regularization. Mazumder et al. [16] pursued a coordinate-descent approach with nonconvex penalties (SparseNet) and study its convergence properties. Xu et al. [9,10] further explored the properties of the (0 < < 1) penalty and revealed the extreme importance and special role of the 1/2 regularization. In our previous work [17,18], we developed several fast algorithms using the 1/2 penalty to solve the logistic regression model and the Cox model. Our computational results showed that 1/2 regularization outperforms some 1 regularization methods. In this paper, we propose a novel harmonic regularization method which approximates to the (1/2 < < 1) penalties. We also investigate the fast harmonic regularization algorithm to solve the Cox model for the high dimension low sample size problem ("large small problem"). The rest of the paper is organized as follows. Section 2 describes the harmonic regularization method. Section 3 gives a harmonic regularization algorithm to obtain estimates form Cox model. Section 4 evaluates our method by simulation studies and application to four real microarray datasets, such as the diffuse large B-cell lymphoma (DLBCL) datasets with the survival times and gene expression data. Section 5 concludes the paper with some useful remarks.

Harmonic Regularization
In general, a united framework of the regularization in machine learning has a form: where ( ) is a loss function, ( ) is a penalty function, and is a tuning parameter. Different here is in correspondence with different penalized constraint to the model, so different solution is to be got, respectively. The penalized constraint is the weakest when = 0 and becomes stronger as increases.
Obviously a regularization (5) can be divided by two elements, the loss function ( ) and the penalty function ( ). Moreover, different loss function and different penalty will result in different algorithm. For example, when the loss function is hinge loss and the penalty ( ) = ‖ ‖ 2 , the result is a support vector machine algorithm. Let the loss function be square loss and using ( ) = ‖ ‖ denote the regularization methods, if = 2, it is the ridge regression [19] and can be used to solve the ill-posed problem. If = 0, it is the subsets regression [20], which applies 0 regularization with the penalty function ( ) = (1/2) (| ̸ =0|) . When = 1, it is the Lasso algorithm [21], which applied 1 regularization. Lasso and its variations (or the Lasso type algorithms), such as elastic net [13], SCAD [4], MCP [7], adaptive Lasso [6], and stage-wise Lasso [22] are extensively studied and applied in recent years in the fields of statistics and machine learning.
It is well known that 0 regularization is ideal sparsest for variable selection. Unfortunately, 0 regularization is a combinatorial optimization problem, which is difficult to be solved. In contrast, 1 regularization leads to a convex optimization problem and easy to be solved, but it does not yields sufficiently sparse variable selection. Donoho et al. [23,24] had shown that 0 regularization is equivalent to 1 regularization under certain conditions. These imposed conditions therefore characterize those problems for which no matter what 1 or 0 regularization is applied, the same sparse solutions will be produced. However, for many practical problems, the sparsity of solutions yielded through 1 and 0 regularization is far from being equivalent. Particularly, the solutions found with 1 regularization is very often less as sparse as the solutions found with 0 regularization.
In fact, when 0 ≤ ≤ 1, the regularization automatically performs variable selection by removing predictors with very small nonzero estimated coefficients. The smaller the is, the sparser the solutions found with regularization will be. This leads researchers to study regularization with 0 < < 1 because it can find the more sparse solutions than those found with 1 regularization and easier to be solved than 0 regularization. For example, Zhang [15] presented a multistage convex relaxation scheme for solving problems with nonconvex objective functions. For learning formulations with sparse regularization, they analyzed the behavior of a specific multistage relaxation scheme.
Nevertheless, the applications to the penalty function with 0 < < 1 not often attracts much attention done mainly due to the reason that when 0 < < 1, the penalty function changes from a convex function to a nonconvex one and so the corresponding optimization problem is not easy to solve. Meanwhile, another difficulty in fact is that the differential quotient of the penalty function at origin is +∞ which results in the invalidation of the ordinarily optimization algorithms.

The Harmonic Regularization Algorithm for the Cox Model
In this section, we propose a generalized path seeking algorithm of the harmonic regularization for Cox's model. As mentioned in the last section, when (0 < < 1) regularization is to be applied, an inevitable difficulty is how to efficiently solve the nonconvex optimization problem caused by the (0 < < 1) regularization (It is easy to see that in the case of ( > 1) regularization is applied, the penalty term becomes convex). Fortunately, direct path seeking makes it possible to overcome that difficulty. Direct path seeking, which sequentially constructs a path directly in the parameter space, closely approximates that for a penalty function without having to repeatedly solve numerical optimization problems. Popular path seeking based on squarederror includes partial least squares regression (PLS, [23]), forward stepwise regression [22], least angle regression [25], piecewise linear path [14], and gradient boosting. Friedman [26] proposed the generalized path seeking, which can produce solutions that closely approximate those for any convex loss function and nonconvex constraints. The advantages of path seeking methods provide us a new way to solve the problem of regularization with nonconvex penalty. We will propose a new generalized path seeking method to solve the harmonic regularization.
We let where which shows that each additive term ( ) is a monotonically increasing function of absolute value of its argument. This implies that the net regularization penalty function we have suggested meets the validity of the general path seeking algorithm [11]. Let V measure length along the path and ΔV > 0 a small increment. Define and let Then, we give the harmonic regularization algorithm procedures for the Cox model as follows: (2) computêand̂based on (3) using the current value (V), = 1, . . . , ; (12) ← +1, and go back to Step 2 until the convergence criterion is met.
In the above algorithm, after Step 2, at each step those coefficients (V) with sign opposite to that of the corresponding (V) are identified. When the set is empty, the coefficient corresponding to the largest component of { (V)} =1 , an absolute value is selected at Step 6. And when there are one or more elements in the set , the coefficient with corresponding largest (V) within this subset is instead selected. The selected coefficient is the increments by a small amount in the direction of the sign of its corresponding * (V) while all other coefficients remain unchanged, yielding the solution for the next path point V+ΔV. The iterations continue until all components of (V) = 0 and the algorithm then reaches a regularized solution for the harmonic regularized Cox model.

Selection of the Shrinkage Parameter and the Tuning
Parameter . To select the shrinkage parameter a and the tuning parameter , we use the maximization of the crossvalidation partial likelihood (CVPL) method proposed by van Houwelingen et al. [27], which is defined as where (− ) ( , ) represents the estimation of based on the harmonic regularization procedure with the parameters and from the data without the th subject. The terms ( ) and (− ) ( ) are the log partial likelihoods with all the subjects and without the th subject, respectively. The optimal value of the parameters and are chosen to maximize the sum of the contributions of each subject to the log partial likelihood over a grid of ( , ). CVPL is the special case of a more general cross-validated likelihood approach for model selection and has been demonstrated to perform well in prediction in the context of the penalized Cox regression.

Model Validation Measures.
The performance measures of censored survival data is more complicated: the measure can only be computed if the case is not right censoring. Thus, several specially designed measure method have been proposed in the literatures. In this paper, we employ the integrated brier score (IBS) [28] and the concordance index (CI) [29] to evaluate the prediction ability of the regularization methods.
Integrated Brier Score (IBS). The brier score (BS) is defined as a function of time > 0 by wherê(⋅) denotes the Kaplan-Meier estimation of the censoring distribution and̂(⋅ | ) stands to estimate survival for patient . Note that the BS( ) is dependent on the point in time , and its values are between 0 and 1. Good predictions at time result in small values of BS. The integrated brier score (IBS) is given by The IBS is used to assess the goodness of the predicted survival functions of all observations at every time between 0 and max( ).
Concordance Index (CI). The concordance index (CI) can be interpreted as the fraction of all pairs of subjects which predicted survival times are correctly ordered among all subjects that can actually be ordered. By the CI definition, we can determine > when > and = 1, where (⋅) is survival function. The pairs for which neither > nor < can be determined are excluded from the calculation of CI. Thus, the CI is defined as Note that the values of CI are between 0 and 1, the perfect predictions of the building model would lead to 1 while have the CI value of 0.5 at random.

Analyses of the Simulated Data.
In this section, we evaluate the performance of the harmonic regularization method for the Cox model in simulation study. We generate high-dimensional and low sample size data which contain many irrelevant features. Six methods are compared with our proposed harmonic regularization approach (HRA): the Lasso penalty ( 1 ), the smoothly clipped absolute deviation penalty (SCAD), the minimax concave penalty (MCP), the adaptive Lasso (A-Lasso), the elastic net ( en ), and the 1/2 penalty ( 1/2 ). We adopted the Cox model simulation scheme in Bender's work [30]. The data generation procedure is as follows.
Step 1. We generated the vectors 0 , 1 , . . . , ( = 1, . . . , ) independently from a standard normal distribution and the Computational and Mathematical Methods in Medicine 5 Step 2. The survival time ( = 1, . . . , , indicates the sample size) is constructed from a uniformly distributed variable by = (1/ ) log(1 − ( × log( ))/( exp( + × ))), where is the shape parameter, is the scale parameter, is the ground-true regression coefficients, is the independent random error generated from (0, 1), and is the parameter which controls the signal to noise.
In every simulation, the dimension of the predictor vector is 1000, and the first five true coefficients are nonzero: 1 = 1, 2 = 0.8, 3 = −1, 4 = −0.8, 5 = 1, and = 0 (6 ≤ ≤ 1000). About 25% of the data are right censored. We consider the cases with the training sample sizes = 100, 150, 200, the correlation coefficient = 0.1, 0.5, and the noise control parameter = 0.2, 0.5, respectively. To assess the variability of the experiment, each method is evaluated on a test set including 100 random generated samples.
The estimation of the optimal tuning parameter in the regularization models can be done in many ways and is often done by -fold cross-validation (CV). Note that the choice of will depend on the size of the training set. In our experiments, we use 10-fold cross-validation ( = 10). The elastic net method has two tuning parameters; we need to cross-validate on a two-dimensional surface. Table 1 shows the average number of variable selected and the recovery rate by each regularization method in 500 runs. The recovery rate is defined as the ratio of the average number of the selected relevant variables ( 1 -5 ) to the average number of the selected variables [9]. As shown in Table 1, when the sample size increases, the prediction 6 Computational and Mathematical Methods in Medicine performances of all the seven methods are improved. For example when = 0.1 and = 0.2, the average of the variables selected by the harmonic regularization method decreased from 6.6 to 5.8 and its recovery rate is improved from 0.72 to 0.86 with the sample sizes increased from 100 to 200. When the correlation parameter and the noise parameter increase, the variable selection performances of all the seven methods are decreased. For example, when = 0.1 and = 200, the average of the recovery rate from the harmonic method decreased from 0.86 to 0.71, in which increased from 0.2 to 0.5. When = 0.5 and = 150, the average of the recovery rate of the harmonic method decrease from 0.63 to 0.50, in which increased from 0.1 to 0.5. Moreover, in our simulation, the influence of the noise may be slightly larger than that of the variable correlation for the prediction performance of all the seven methods. On the other hand, at the same parameter setting case, the recovery rates of the harmonic method and 1/2 penalty are almost better than the results of the other five methods. For example, when = 0.1, = 0.2 and = 100, the recovery rate of the harmonic method is 0.72 much better than 0.14, 0.43, 0.54, 0.28, and 0.13 got by the Lasso, SCAD, MCP, adaptive Lasso, and elastic net, respectively, slight better than 0.71 got by 1/2 penalty method.
To evaluate prediction performance of the seven regularization methods for the Cox model, we presented their average IBC and CI values on the simulated datasets among 500 times in Table 2.
In terms of IBC and CI, for different parameters' settings, no methods almost performed better than others, but their prediction performances are only small differences. For example, when = 0.1, = 0.2, and = 150, the average of IBS from the harmonic method is 0.079, better than 0.081, 0.087, 0.084, 0.08, 0.08, and 0.084 got by Lasso, SCAD, MCP, adaptive Lasso, and elastic net and 1/2 penalty. When = 0.1, = 0.2, and = 100, the average of CI from the harmonic method is 0.845, better than 0.749, 0.788, 0.822, 0.757, and 0.838 got by Lasso, SCAD, MCP, adaptive Lasso, and 1/2 , but slight worse than 0.851 got by elastic net penalty method.
Combined with the results reported in Table 1, we concluded that the harmonic penalized method showed better or equivalent predictive performance than the other regularization methods.  In bold is the best performance.

Analysis of the Real Microarray Datasets.
In this section, we evaluated the performance of the harmonic regularization methods on the real survival gene expression datasets. Four publicly available datasets are used in this part. A brief description of these datasets is given below and summarized in Table 3. AML Dataset. The AML dataset is from Bullinger et al. [34]. It contains the expression profiles of 6283 genes for 116 patients, and the number of censored cases is 49.
We evaluated the prediction accuracies of the seven estimated regularization methods using random partition: a training set of about 2/3 of the patients used for estimation and a test set of about 1/3 of the patients used for testing of the prediction capability. For estimating , we employed the five-fold cross-validation scheme using the training set. We repeated each procedure 200 times. Table 4 reports the average number of genes selected by each method. The harmonic regularization method performs better than those of 1 type methods (Lasso, SCAD, MCP, adaptive Lasso, and elitist net), and slightly better than that of 1/2 penalty. As shown in Table 4, for DLBCL (2002) dataset, the harmonic penalized methods selected about 71 genes, compared to about 174, 129, 129, 146, and 180 about five Lasso, SCAD, MCP, adaptive Lasso and elitist net, slightly better than 76 got by 1/2 penalty. For DLBCL (2003) and AML datasets, the best one is 1/2 penalty and the second is harmonic methods.
To assess predictive performance, we summarize the results of IBS and CI obtained by the seven methods, respectively, in Table 5. Both the results of IBS and CI, the results of all regularization methods, were not much different and the elitist net and harmonic penalized method almost 8 Computational and Mathematical Methods in Medicine outperforms than other five penalized methods. Combined with the results reported in Tables 4 and 5, we concluded that the harmonic penalized method selected the smaller subset of the key genes while give best or equivalent predictive performance.

Conclusion
Variable selection is a fundamental problem in statistics and machine learning, and the regularization method is one of the ways to solve this problem. Generally speaking, a regularization algorithm is always a combination of a loss function and a penalty function in the past research and applications. Particularly, in the procedure of variable selection, the harmonic regularization is like a net which can always catch the correct model. This demonstrates the stronger sparsity and better correctness of the harmonic regularization. We have provided a serous of simulations to demonstrate that 1 type regularization methods are inefficient; the harmonic regularization and 1/2 penalty methods proved are efficient and effective.
In the simulation part, we use four real datasets. There are the DLBCL (2002), the DLBCL (2003), the Lung cancer, and the AML. Results indicate that our harmonic regularization algorithm is very competitive in analyzing high dimensional survival data in terms of sparsity. Simulation results indicate that the harmonic penalized Cox model is very competitive in analyzing high dimensional survival data, because it was able to reduce the size of the predictor even further at moderate costs for the prediction accuracy [8]. The harmonic penalized Cox model will provide an efficient tool in building a prediction model for survival time based on high dimensional biological data.