Iterative Reweighted Noninteger Norm Regularizing SVM for Gene Expression Data Classification

Support vector machine is an effective classification and regression method that uses machine learning theory to maximize the predictive accuracy while avoiding overfitting of data. L2 regularization has been commonly used. If the training dataset contains many noise variables, L1 regularization SVM will provide a better performance. However, neither L1 nor L2 is the optimal regularization method when the data contain a large number of redundant values and only a small number of data points are useful for machine learning. We have therefore proposed an adaptive learning algorithm using the iterative reweighted p-norm regularization support vector machine for 0 < p ≤ 2. A simulated data set was created to evaluate the algorithm. It was shown that a p value of 0.8 was able to produce a better feature selection rate with high accuracy. Four cancer data sets from public data banks were also used for the evaluation. All four evaluations show that the new adaptive algorithm was able to achieve the optimal prediction error using a p value smaller than 1, that is, below the L1 norm. Moreover, we observe that the proposed Lp penalty is more robust to noise variables than the L1 and L2 penalties.


Introduction
Support vector machine (SVM) has been shown to be an effective classification and regression method that uses machine learning theory to maximize the predictive accuracy while avoiding overfitting of data [1]. The L2 regularization method is usually used in the standard SVM. It works well, especially when the dataset does not contain too much noise. If the training data set contains many noise variables, L1 regularization SVM will provide a better performance. Since the penalty functions are predetermined before training, SVM algorithms sometimes work very well but at other times are unsatisfactory. In many potential applications, the training data set also contains a large number of redundant values and only a small number of data points are useful for machine learning. This is particularly common in bioinformatics applications.
In this paper, we propose a new algorithm for supervised classification using SVM. The algorithm uses an iterative reweighting framework to optimize the penalty function for which the norm is selected between 0 and 2, that is, 0 < p ≤ 2. We call it the iterative reweighted p-norm regularization support vector machine (IRWP-SVM). The proposed algorithm is simple to implement, converges quickly, and has improved stability. It has been applied to the diagnosis and prognosis of bladder cancer, lymphoma, melanoma, and colon cancer using publicly available data sets and evaluated by a cross-validation arrangement. The proposed method yields more accurate classification functions than the rules obtained with classical methods such as the L1- and L2-norm SVM. The simulation results also reveal several interesting properties of the p-norm regularization behavior. The rest of this paper is organized as follows. The motivation for variable selection with the p-norm is formally introduced, followed by the development of the IRWP-SVM algorithm. Simulation results and results using real patient data sets are discussed in Section 4. Finally, Section 5 provides a brief conclusion.
Each training sample consists of an n-dimensional input vector x_i ∈ R^n and a corresponding target y_i ∈ {−1, +1}. Large-margin classifiers typically involve the optimization of an objective function of the form (1).
Equation (1) can be rewritten in the constrained form (2), which involves a user-selected limit. Equations (1) and (2) are asymptotically equivalent. The standard SVM classifier can be considered as another approach to solving problem (3), which includes a bias term b. Oftentimes, the target value is determined by only a few elements of a high-dimensional input vector. In other words, the dimension of a sample is significantly larger than the number of key input features that are useful for identifying the target. The weight vector w will then be a sparse vector with many zeros. In this situation, the optimization problems in (1) and (2) should search for a sparse vector w that still allows for an accurate correlation between the target and the inputs. A simple way of measuring the sparsity of a vector is to count the number of nonzero elements of w. In other words, the actual objective function being minimized is the 0-norm of w. Therefore, the objective in (3) should be replaced with the 0-norm of w, which gives problem (4). This optimization problem is known as the L0-norm regularization SVM, where the complexity of the model is related to the number of variables involved in the model. Amaldi and Kann show that this problem is NP-hard [2]. To overcome this issue, several modifications have been proposed to relax the problem in machine learning and signal processing [3][4][5]. Instead of the 0-norm, (4) is modified to a convex optimization problem based on the L1-norm. It turns out that for linear constraints satisfying certain modest conditions, L0-norm minimization is equivalent to L1-norm minimization, which leads to a convex optimization problem for which there exist practical algorithms [6]. The presence of the L1 term encourages small components of w to become exactly zero, thus promoting sparse solutions [7,8].
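For orientation, the problems referred to as (1) to (4), together with the L1 relaxation, can be sketched in standard large-margin notation. This is an assumed form, not necessarily the exact typesetting of the original equations: the loss L, the penalty J, the trade-off λ, the limit s, and the sample count N are illustrative symbols, and the margin constraint is written for a decision function of the form sign(w·x + b).

```latex
\begin{align*}
&\text{penalized form, cf. (1):} &&
  \min_{w}\ \sum_{i=1}^{N} L\bigl(y_i, f(x_i)\bigr) + \lambda\, J(w) \\
&\text{constrained form, cf. (2):} &&
  \min_{w}\ \sum_{i=1}^{N} L\bigl(y_i, f(x_i)\bigr)
  \quad \text{s.t. } J(w) \le s \\
&\text{standard SVM, cf. (3):} &&
  \min_{w,b}\ \tfrac{1}{2}\lVert w \rVert_2^2
  \quad \text{s.t. } y_i\,(w^{\top} x_i + b) \ge 1,\ i = 1,\dots,N \\
&\text{0-norm SVM, cf. (4):} &&
  \min_{w,b}\ \lVert w \rVert_0
  \quad \text{s.t. } y_i\,(w^{\top} x_i + b) \ge 1,\ i = 1,\dots,N \\
&\text{L1 relaxation:} &&
  \min_{w,b}\ \lVert w \rVert_1
  \quad \text{s.t. } y_i\,(w^{\top} x_i + b) \ge 1,\ i = 1,\dots,N
\end{align*}
```

In this notation, a large λ in the penalized form plays the same role as a small limit s in the constrained form, which is the sense in which the two formulations correspond.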
Another interesting possibility is to minimize the Lp-norm, where 0 < p < 1, which should yield sparser solutions than with p = 1 and p = 2. Such an optimization problem is nonconvex and likely has many local solutions, which makes its use technically more challenging than that of the more common L1 or L2 norm. However, there may be an advantage in the case of data inconsistencies caused by noise. Despite the difficulties raised by the optimization problem, good empirical results were reported in signal reconstruction [9], SVM classification [10], and logistic regression [11]. Figure 1 provides an illustration of the penalty functions for p = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.5, 2.0}, with the other parameter of the penalty set to 1. When 0 < p < 1, the p-norm is known as the bridge penalty function. This type of penalty has been used in signal processing [12,13] and popularized further in the statistical community [14,15]. The special case of the p-norm where 0 < p < 1 can be considered a quasi-smooth approximation of the L0-norm.
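The effect of the exponent can be checked numerically. The following is a minimal sketch, assuming the penalty takes the usual form of the sum of |w_j|^p over the components of w; the function name `lp_penalty` and the two example vectors are illustrative and do not come from the paper.

```python
def lp_penalty(w, p):
    """Return sum_j |w_j|**p, i.e. the p-th power of the Lp (quasi-)norm."""
    if p <= 0:
        raise ValueError("p must be positive")
    return sum(abs(wj) ** p for wj in w)


# Two weight vectors with the same L2 norm (equal to 1).
sparse = [1.0, 0.0]                 # one nonzero component
dense = [0.5 ** 0.5, 0.5 ** 0.5]    # weight spread over two components

# For each p, store (penalty of the sparse vector, penalty of the dense vector).
penalties = {p: (lp_penalty(sparse, p), lp_penalty(dense, p))
             for p in (0.5, 1.0, 2.0)}

for p, (sparse_value, dense_value) in penalties.items():
    print(f"p = {p}: sparse {sparse_value:.4f}, dense {dense_value:.4f}")
```

With p = 2 the two vectors are penalized (almost exactly) equally, while for p = 1 and even more so for p = 0.5 the dense vector costs more, which is why small p favors sparse solutions.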
Meanwhile, several works have provided theoretical guarantees on the use of the Lp penalty, which justify its use for variable selection [16][17][18][19]. Chartrand and Yin [20,21] and Candès et al. [22] proposed algorithms that were applied in the context of compressive sensing and share the same idea of solving a nonconvex problem using an iterative reweighted scheme until convergence.

Iterative Reweighted p-Norm Regularization SVM
In this section, we propose our iterative reweighted p-norm regularization algorithm. Given a set of samples X = {x_1, ..., x_N} and their labels Y = (y_1, ..., y_N) ∈ {−1, +1}^N, the goal of binary classification in SVM is to learn a model that assigns the correct label to the test samples. This can be thought of as learning a function f = sign(w · x + b): X → Y which maps each instance x ∈ X to an estimated value ŷ. In this paper, for simplicity and brevity, only two-class classification problems are considered. The data set is assumed to be linearly separable. Then, the hard-margin support vector machine with p-norm regularization can be represented by the optimization problem (7), where w ∈ R^n and 0 < p ≤ 2. Rearranging the constraints in (7) gives the optimization problem (8). Substituting the definition (9) into (8) rewrites the minimization in (8), from which the Lagrangian function is obtained, leading to (12). After two auxiliary variables are defined, (12) can be rewritten in matrix and vector notation. The corresponding dual is found by differentiating with respect to the primal variables w and b.

Computational and Mathematical Methods in Medicine
Substituting the result into the Lagrangian function yields the Wolfe dual problem. This is a QP problem in the variable α, and it takes a form similar to the dual optimization problem for training support vector machines. The corresponding minimization problem is (18). Let S denote the set of indices of the support vectors, for which α_i ≠ 0; |S| is the cardinality of S. According to the Karush-Kuhn-Tucker (KKT) conditions, for each i = 1, ..., N either α_i = 0 or the margin constraint of sample i holds with equality. The bias term b therefore follows from the support vectors in S, and the final discriminant function is obtained from w and b.
Implementation of the IRWP-SVM.
There exists a large body of literature on solving QP Wolfe dual problems of the form (18). Several commercial software programs are also available for QP optimization. However, these general mathematical programming approaches and software are not well suited to SVM problems. Fortunately, the iterative nature of the current SVM optimization problem allows us to derive tailored algorithms that converge faster and require little memory, even for problems with large dimensions. Currently, the following four types of implementation have been proposed.
Iterative Chunking. In 1982, Vapnik proposed an iterative chunking method, that is, a working set method, which makes use of the sparsity and the KKT conditions. At every step, the chunking method solves the problem containing all nonzero α_i plus some of the α_i violating the KKT conditions.
Decomposition Method. The decomposition method has been designed to overcome the problem in which the full kernel matrix is not available. Each iteration of the decomposition method optimizes a subset of coefficients and leaves the remaining coefficients unchanged. Iterative chunking is a particular case of the decomposition method.
Sequential Minimal Optimization. The sequential minimal optimization algorithm proposed by Platt selects working sets using the maximum violating pair scheme, that is, the working set always contains two elements.
Coordinate Descent Method. This method iteratively updates a block of variables. During each iteration, a nonempty subset is selected as a block and the corresponding optimization subproblem is solved. If the subproblem has a closed-form solution, the method neither uses any mathematical programming package nor needs any matrix operations.
In our study, we have applied Platt's sequential minimal optimization learning procedure to solve the QP Wolfe dual problems in (18). Sequential minimal optimization is a fast and simple training method for support vector machines. The pseudocode is given in Algorithm 1. Specifically, given an initial point (w^(0), b^(0)), the IRWP-SVM computes (w^(k+1), b^(k+1)) from (w^(k), b^(k)) by cycling through the training data and iteratively solving the problem in (18) for only two elements at a time, namely the maximum violating pair.
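The outer reweighting loop can be illustrated with a small, self-contained sketch. It is not the authors' Algorithm 1: it assumes the common reweighting identity |w_j|^p = |w_j|^(p−2) · w_j^2, so that freezing γ_j = |w_j|^(p−2) at the previous iterate turns the p-norm penalty into a weighted L2 penalty, and it replaces the SMO solver for the dual with a plain subgradient descent on a soft-margin primal. All function names, the smoothing constant `eps`, the step size, and the toy data are hypothetical.

```python
def fit_weighted_svm(X, y, gamma, c=1.0, epochs=200, lr=0.01):
    """Linear soft-margin SVM with penalty 0.5 * sum_j gamma[j] * w[j]**2.

    Per-sample subgradient descent is used as a simple stand-in for SMO.
    The penalty is applied as an implicit shrinkage step, which stays
    stable even when gamma[j] becomes very large.
    """
    n_samples, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # hinge loss is active for this sample
                w = [wj + lr * c * yi * xj for wj, xj in zip(w, xi)]
                b += lr * c * yi
            # implicit step for the weighted quadratic penalty
            w = [wj / (1.0 + lr * g / n_samples) for wj, g in zip(w, gamma)]
    return w, b


def irwp_svm(X, y, p=0.8, outer_iters=10, eps=1e-3):
    """Iteratively reweighted p-norm SVM sketch for 0 < p <= 2."""
    gamma = [1.0] * len(X[0])  # first pass is an ordinary L2-penalized SVM
    for _ in range(outer_iters):
        w, b = fit_weighted_svm(X, y, gamma)
        # freeze |w_j|^(p-2) as the weight for the next pass (eps avoids 0**negative)
        gamma = [(abs(wj) + eps) ** (p - 2) for wj in w]
    return w, b


def predict(w, b, x):
    """Sign of the linear decision value w.x + b."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy data: feature 0 carries the label, feature 1 is a small nuisance value.
X_toy = [[2.0, 0.3], [1.5, -0.4], [2.5, 0.1], [1.8, -0.2],
         [-2.0, 0.2], [-1.5, -0.3], [-2.5, 0.4], [-1.8, -0.1]]
y_toy = [1, 1, 1, 1, -1, -1, -1, -1]

w_hat, b_hat = irwp_svm(X_toy, y_toy, p=0.8)
print("weights:", w_hat, "bias:", b_hat)
```

With p = 2 the weights γ_j stay at 1 and the loop reduces to repeated L2-penalized fits; for p < 1 a small |w_j| receives a large γ_j in the next pass, which pushes that component further toward zero.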

Experiments and Discussion
Both simulation data and clinical data have been used to illustrate the IRWP-SVM. In particular, the results to follow will show that the IRWP-SVM is able to remove irrelevant variables and identify relevant (sometimes correlated) variables when the dimension of the samples is typically larger than the number of training points.

IRWP-SVM for Feature Selection in Simulation.
We start with an artificial problem taken from the work by Weston et al. [23]. We generated artificial data sets as in [23] and followed the same experimental protocol in the first experiment. All samples were drawn from a multivariate normal distribution, and the probabilities of y = 1 and y = −1 were equal. One thousand samples with 100 features were generated. Six dimensions out of 100 were relevant. These features are composed of three basic classes. We used IRWP-SVM for the feature selection. To find out how the prediction error rate and feature selection error rate are affected by different training and validation set sizes, we conducted three sets of experiments on the data sets with the following training + validation combinations: 250 + 750, 500 + 500, and 750 + 250. All results are averages over at least 100 independent trials. One may expect that a smaller p should work best in a setting where the number of relevant features is very small.
When the training + validation set size is 250 + 750, the prediction error rate ranges from 0.38% to 0.47% and the feature selection error ranges from 2% to 8.33%. When the training + validation set size is 500 + 500, the prediction error rate ranges from 0.28% to 0.37% and the feature selection error ranges from 0% to 2.67%; for p = 0.7, 0.8, 0.9, and 1.0, the feature selection error is 0%. When the training + validation set size is 750 + 250, the prediction error rate ranges from 0.26% to 0.37% and the feature selection error ranges from 0% to 0.50%; for p = 0.7, 0.8, 0.9, and 1.0, the feature selection error is again 0%. The prediction accuracy rate is between 99.5% and 99.8%. One can see that the feature selection error is sensitive to changes in the p value. When 0 < p < 1, the p-norm regularization SVM is a sparse model, and the feature selection error is sensitive enough to guide the choice of the p-norm SVM model that improves the prediction accuracy.
Figure 2 represents the error rate of feature selection for different p-norm values. Each subfigure consists of three data points which represent, respectively, the feature selection error rate when the training + validation set size is 250 + 750, 500 + 500, and 750 + 250. The error bars are 2 times the standard deviation. As the ratio of training to validation set size increases, the average feature selection error rate first decreases and then becomes stable. When the training + validation set size reached 500 + 500, the feature selection error rate reached its lowest point. To sum up, the sensitivity of the feature selection error rate of the IRWP-SVM algorithm decreases when more training samples are used.
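The evaluation protocol described above (three training + validation splits of the 1000 simulated samples, repeated trials, error bars of two standard deviations) can be sketched as follows. This is an illustrative outline only: the helper names are hypothetical, the split is assumed to be a uniform random partition, and the sample standard deviation is assumed for the error bars.

```python
import random
import statistics


def train_validation_split(n_samples, n_train, rng):
    """Randomly partition sample indices into training and validation sets."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return indices[:n_train], indices[n_train:]


def summarize_trials(error_rates):
    """Return (mean, error-bar half-width) with the bar set to 2 * std."""
    mean = statistics.mean(error_rates)
    bar = 2 * statistics.stdev(error_rates)  # sample std, an assumption
    return mean, bar


rng = random.Random(0)  # fixed seed so the splits are reproducible
split_sizes = {}
for n_train in (250, 500, 750):
    train_idx, val_idx = train_validation_split(1000, n_train, rng)
    split_sizes[n_train] = (len(train_idx), len(val_idx))

print(split_sizes)
```

In a full experiment, each trial would draw a fresh split, train on `train_idx`, record the error on `val_idx`, and pass the list of recorded errors to `summarize_trials`.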
Figure 3 shows another perspective of the error rate trends. In summary, considering both the feature selection error rate and the prediction error rate, p = 0.8 appears to be the most suitable value for these data. Our IRWP-SVM algorithm is highly accurate and stable, is able to remove irrelevant variables, and is robust in the presence of noise.

IRWP-SVM for Four Clinical Cancer Datasets.
A major weakness of the L2-norm in SVM is that it only predicts a class label and does not select relevant genes, whereas the IRWP-SVM can automatically select the p value and identify the relevant genes for the classification. Table 2 shows the information of the four real cancer datasets and the training + validation set sizes used in our evaluation. The bladder cancer dataset consists of 42 training and 15 validation samples (http://www.ihes.fr/∼zinovyev/ princmanif2006/), a total of 57 samples. The dimension of each sample vector is 2215. The melanoma cancer dataset consists of 58 training and 20 validation samples (http://www.cancerinstitute.org.au/cancer inst/nswog/ groups/melanoma1.html), a total of 78 samples. The dimension of each sample vector is 3750. The lymphoma cancer dataset consists of 72 training and 24 validation samples (http://llmpp.nih.gov/lymphoma/data/rawdata/), a total of 96 samples. The dimension of each sample vector is 4026. The colon cancer dataset consists of 46 training and 16 validation samples (http://perso.telecom-paristech.fr/∼gfort/GLM/Programs.html), a total of 62 samples. The dimension of each sample vector is 2000. Table 3 lists the prediction error rate and the number of selected features for p = 0.25, 0.5, 0.75, 1.0, and 2.0. All results reported here are averages over at least 100 independent trials.
For the bladder cancer data (Figure 4), p = 0.5 resulted in the minimum prediction error rate. As the p value increases, the number of features gradually increases. The upper limit is the maximum number of features in the original data.
For the melanoma data (Figure 5), the predicted error rate at first increases and then decreases. The value p = 0.25 provides the minimum average error rate. As p increases, the number of features gradually increases (the upper limit is the maximum number of features in the original data, i.e., 3750). For p = 0.25, the average number of selected features is 256.5 and the average prediction error rate is 12.11%, which is the value at the optimal point. The selected features are only 6.84% of the total number of features.
For the lymphoma data (Figure 6), the predicted error rate at first decreases and then increases, and p = 0.75 provides the minimum average error rate. The upper limit is the maximum number of features in the original data set, that is, 4026. For p = 0.75, the average number of selected features is 2734.6, and the average prediction error rate is only 5% at the optimal value. The average predicted error rate is 5.83% at p = 1, slightly higher than 5%. The data set also has a number of outliers. The predicted error is less stable at p = 1 than at p = 0.75. The average number of selected features at p = 1 is 3426.3, significantly higher than 2734.6. Therefore, the IRWP-SVM algorithm, which selected p = 0.75 for the p-norm regularization, is better than the L1-norm SVM.
For the colon cancer data (Figure 7), as p increases, the predicted error rate at first decreases and then increases. The average error rate achieved a minimum at p = 0.5. As p increases, the number of features gradually increases. The upper limit is the maximum number of features in the original data, that is, 2000. For p = 0.5, the average number of selected features is 1067.5, and the results do not contain any outliers. The selected features are only 53.4% of the total number of features, and thus the prediction time is significantly reduced.
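The reported percentages of selected features follow directly from the average number of selected features and the dimension of each dataset. The short sketch below reproduces that arithmetic; the helper names and the tolerance used to decide whether a weight counts as nonzero are illustrative assumptions, and the dataset labels are matched to the figures through the stated dimensions.

```python
def count_selected(w, tol=1e-8):
    """Number of features whose weight magnitude exceeds a small tolerance."""
    return sum(1 for wj in w if abs(wj) > tol)


def selected_percentage(n_selected, n_total):
    """Selected features as a percentage of all features."""
    return 100.0 * n_selected / n_total


# (average number of selected features, total number of features) as reported
reported = {
    "melanoma, p = 0.25": (256.5, 3750),
    "lymphoma, p = 0.75": (2734.6, 4026),
    "colon, p = 0.5": (1067.5, 2000),
}
percentages = {name: selected_percentage(k, d) for name, (k, d) in reported.items()}

for name, value in percentages.items():
    print(f"{name}: {value:.2f}% of features selected")
```

The melanoma and colon values agree with the 6.84% and 53.4% quoted in the text; the lymphoma value, which the text does not state, comes out a little under 68%.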

Comparison.
In this section, we compare the L0-norm regularized SVM (L0-SVM), the L1-norm regularized SVM (L1-SVM), the L2-norm regularized SVM (L2-SVM), random forest, and the IRWP-SVM. For the experimental comparison, we use the random forest implementation in WEKA 3.5.6, developed by the University of Waikato. Each experiment is repeated 60 times. For L0-SVM, L1-SVM, and L2-SVM, the tuning parameters are chosen by 10-fold cross validation, and then the final model is fitted to all the training data and evaluated on the validation data. The feature selection error is the minimum error when choosing gene subsets of different sizes. The means of the prediction error and feature selection error are summarized in Table 4. As one can see in the table, the IRWP-SVM seems to have the best prediction performance.
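Choosing a tuning parameter by 10-fold cross validation can be sketched in a few lines. This is a generic outline rather than the exact procedure used in the comparison: the folds are contiguous and unshuffled for simplicity, and `cv_error` is a hypothetical user-supplied callback that trains on the given training indices and returns the error on the held-out indices.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def choose_by_cross_validation(candidates, cv_error, n_samples, k=10):
    """Return the candidate with the lowest mean held-out error over k folds."""
    folds = list(k_fold_indices(n_samples, k))

    def mean_error(candidate):
        return sum(cv_error(candidate, tr, te) for tr, te in folds) / len(folds)

    return min(candidates, key=mean_error)


# Example with a dummy error function whose minimum is at the value 1.0.
best = choose_by_cross_validation(
    candidates=[0.1, 1.0, 10.0],
    cv_error=lambda value, train, test: abs(value - 1.0),
    n_samples=42,
)
print("selected tuning parameter:", best)
```

After the parameter is chosen, the final model would be refitted on all training samples and evaluated once on the separate validation set, as described in the text.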

Conclusions
We have presented an adaptive learning algorithm using the iterative reweighted p-norm regularization support vector machine for 0 < p ≤ 2. The proposed regularization algorithm has been shown to be effective and able to significantly improve the classification performance on simulated and clinical data sets. Four cancer data sets were used for the evaluation. Based on the clinical data sets, we have found the following.
(i) The IRWP-SVM is a sparse model; the smaller the p value, the sparser the model.
(ii) The experiments show that the prediction error of the IRWP-SVM algorithm is small and the algorithm is robust.
(iii) Different data require different p values for optimization. The IRWP-SVM algorithm can automatically select the p value in order to achieve high accuracy and robustness.
The IRWP-SVM algorithm can easily be used to construct an arbitrary p-norm regularization SVM algorithm (0 < p ≤ 2). It can be used as a classifier for many different types of applications.