Feature Scaling via Second-Order Cone Programming

Feature scaling has attracted considerable attention during the past several decades because of its important role in feature selection. In this paper, a novel algorithm for learning scaling factors of features is proposed. It first assigns a nonnegative scaling factor to each feature of data and then adopts a generalized performance measure to learn the optimal scaling factors. It is of interest to note that the proposed model can be transformed into a convex optimization problem: second-order cone programming (SOCP). Thus the scaling factors of features in our method are globally optimal in some sense. Several experiments on simulated data, UCI data sets, and the gene data set are conducted to demonstrate that the proposed method is more effective than previous methods.


Introduction
Selecting relevant and important features has been an active research area in statistics and data mining [1][2][3][4][5][6]. In some real-world applications, one is often confronted with high-dimensional data such as text data and gene data. One important characteristic of these data sets is that some features may contain irrelevant or redundant information. It has been shown that a large number of irrelevant or redundant features degrades the performance of classifiers. To improve the comprehensibility of classification models and their classification performance, it is of interest to explore how to reduce or remove irrelevant features.
Owing to their good generalization performance, support vector machines (SVMs) have become an effective tool for selecting relevant features in the past several years. In [1], SVMs are used as a subroutine in the feature selection process and the SVM accuracy is optimized on the resulting subset of features. In [7], the gradient descent method is used to obtain the weights of features in terms of the SVM criterion. Moreover, the relevance of features can also be measured by their scaling factors, so feature scaling is performed to measure the importance of features. In [8], scaling factors of features are tuned by minimizing the standard SVM empirical risk. Further, some estimates of the generalization error of SVMs [9] are used to tune the scaling factors automatically with the gradient descent algorithm. In [10], the smooth leave-one-out error is optimized to obtain the scaling factors of features, while an iterative feature scaling method for linear SVMs is proposed in [11]. In addition, some methods depend on other criteria to perform feature selection. For example, Maji [12] proposed a rough hypercuboid approach in approximation spaces to select relevant features of data. Liu et al. [13] combined global and local structures of data to perform feature selection. Li et al. [14] developed a stable feature selection algorithm. Li and Tang [15] combined nonnegative spectral analysis and redundancy control to select relevant features in the unsupervised case. Wang et al. [16] minimized the global redundancy of data to obtain the optimal features. To improve the discriminant power, Tao et al. devised an effective discriminative feature selection method in [17]. Although these algorithms continue to contribute to the development of feature selection, some of them depend on the gradient descent method, which may make them sensitive to the choice of the gradient step and prone to falling into local minima.
In this paper, we propose a novel method for feature scaling based on the SVM criteria that prevents the scaling factors of features from falling into locally optimal solutions. The proposed method first introduces the scaling factors of features into linear support vector machines and then uses a generalized performance measure to learn them. To make the generalized performance measure suitable for feature scaling, the measure is modified and formulated as a second-order cone programming problem. Finally, the scaling factors of features are obtained by solving a convex optimization problem. In addition, we carry out experiments on several data sets to evaluate the proposed method.

Linear Support Vector Machines (LSVMs)
In this section, we briefly recall the basic idea of LSVMs. This class of algorithms, introduced in [11], has been shown to perform well in real applications.
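Given the training data $\{(x_i, y_i)\}_{i=1}^{n}$ with input data $x_i \in \mathbb{R}^d$ and the corresponding binary class labels $y_i \in \{-1, 1\}$, the SVM seeks an optimal hyperplane that separates the two classes such that the hyperplane has the greatest distance from the closest training vectors of each class. Often, the hyperplane is obtained by solving the following optimization problem:
$$\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \ge 1, \quad i = 1, \ldots, n, \tag{1}$$
where $(w, b) \in \mathbb{R}^d \times \mathbb{R}$ and $\langle \cdot, \cdot \rangle$ is the inner product of two vectors. When the data cannot be perfectly separated, a penalty term $C \sum_{i=1}^{n} \xi_i$ is added to the objective function in (1), where $C$ is a positive number. Accordingly, the following optimization problem is constructed:
$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n. \tag{2}$$
Problem (2) can be solved in the dual space of the Lagrange multipliers $\alpha_i \ge 0$, $i = 1, \ldots, n$. In such a case, (2) can be formulated as the following optimization problem:
$$\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, n. \tag{3}$$
After $\alpha_i$ and $b$ are obtained, the following decision function is used to classify samples:
$$f(x) = \operatorname{sgn}\Big(\sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b\Big). \tag{4}$$
Note that although there are $n$ training samples in (4), only the samples with $\alpha_i > 0$ play a role in the decision function. The samples with $\alpha_i > 0$ are called support vectors.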
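The role of the support vectors in the decision function can be illustrated with a minimal sketch. The function name and the toy values below are hypothetical and chosen only so that they satisfy the optimality conditions of a separable two-point problem; they are not taken from the experiments in this paper.

```python
def decision_function(alphas, labels, train_x, bias, x):
    """Return sgn(sum_i alpha_i * y_i * <x_i, x> + b) for a single sample x."""
    score = bias
    for alpha, y, xi in zip(alphas, labels, train_x):
        if alpha > 0.0:  # samples with alpha_i = 0 do not contribute
            score += alpha * y * sum(a * b for a, b in zip(xi, x))
    return 1 if score >= 0.0 else -1


# Hypothetical toy solution: the first two samples are support vectors,
# the third lies outside the margin and has a zero multiplier.
train_x = [[1.0, 0.0], [-1.0, 0.0], [3.0, 2.0]]
labels = [1, -1, 1]
alphas = [0.5, 0.5, 0.0]
bias = 0.0

print(decision_function(alphas, labels, train_x, bias, [2.0, 5.0]))
print(decision_function(alphas, labels, train_x, bias, [-0.5, 1.0]))
```

Dropping the third training sample leaves every prediction unchanged, which is the sense in which only support vectors matter.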

Feature Scaling Using Second-Order Cone Programming (SOCP)
In this section, we propose a feature scaling method based on the generalized performance measure. We first introduce the generalized performance measure.
Generalized Performance Measure.
The generalized performance measure [18] is optimized not only over the function space but also over a convex cone of positive semidefinite matrices. In the case of the SVM criteria, the following generalized performance measure is used to choose proper kernel parameters:
$$\min_{K} \max_{\alpha} \ 2\alpha^T e - \alpha^T \operatorname{diag}(y) K \operatorname{diag}(y) \alpha \quad \text{s.t.} \quad 0 \le \alpha \le C e, \quad \alpha^T y = 0, \quad \operatorname{trace}(K) = c, \quad K \succeq 0, \tag{5}$$
where $K$ is a linear combination of different kernel matrices such as Gaussian kernels and polynomial kernels, $\operatorname{diag}(y) = \operatorname{diag}(y_1, y_2, \ldots, y_n)$, $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)^T$, $e = (1 \cdots 1)^T$, and $c$ is a positive constant. Instead of adopting multiple kernels, we consider the linear kernel in this paper. The Gram matrix $K$ in linear support vector machines can be written as a sum over the feature dimensions. In other words, the following equation holds:
$$K = \sum_{k=1}^{d} v_k v_k^T. \tag{6}$$
In [19], each feature of the data is associated with an indicator value. Based on a similar idea, we introduce the scaling factors $\mu_k$ ($k = 1, \ldots, d$) of the features into the Gram matrix $K$. Accordingly, one obtains
$$\mathbf{K} = \sum_{k=1}^{d} \mu_k v_k v_k^T, \tag{7}$$
where $v_k = (x_{1k}, \ldots, x_{nk})^T$. From (7), it is observed that the $k$th feature is removed if $\mu_k = 0$. Further, if the scaling factors $\mu_k$ ($k = 1, \ldots, d$) are sparse, this corresponds to selecting part of the features. If the scaling factors are not sparse, a large value of $\mu_k$ indicates a useful and important feature. As a result, feature selection can be performed by removing the features that correspond to small scaling factors [8,9]. In addition, the condition $\mu_k \ge 0$ should be imposed so that the matrix $\mathbf{K}$ is positive semidefinite.
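The effect of the scaling factors on the Gram matrix can be checked numerically. The sketch below assumes the scaled Gram matrix is the weighted sum of rank-one terms, one per feature column, with nonnegative weights (written `mu` here); the helper names and the toy data are hypothetical. It verifies two facts: the weighted matrix equals the ordinary linear Gram matrix of the data after multiplying feature k by the square root of its weight, and a zero weight is equivalent to deleting the feature.

```python
import math


def weighted_gram(data, mu):
    """Sum over features k of mu[k] * v_k v_k^T, with v_k the k-th data column."""
    n = len(data)
    return [[sum(m * data[i][k] * data[j][k] for k, m in enumerate(mu))
             for j in range(n)] for i in range(n)]


def linear_gram(data):
    """Ordinary linear-kernel Gram matrix of the rows of data."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in data]
            for row_i in data]


def rescale(data, mu):
    """Multiply feature k of every sample by sqrt(mu[k])."""
    return [[math.sqrt(m) * value for value, m in zip(row, mu)] for row in data]


def same_matrix(a, b, tol=1e-12):
    return all(math.isclose(p, q, abs_tol=tol)
               for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b))


data = [[1.0, 2.0, 3.0], [0.0, 1.0, -1.0], [2.0, 0.0, 1.0]]  # 3 samples, 3 features
mu = [4.0, 0.0, 0.25]                                        # second feature switched off

k_weighted = weighted_gram(data, mu)
reduced = [[row[0], row[2]] for row in data]                 # second feature deleted

print(same_matrix(k_weighted, linear_gram(rescale(data, mu))))
print(same_matrix(k_weighted, weighted_gram(reduced, [4.0, 0.25])))
```

The square-root rescaling also shows why the weights must be nonnegative: a negative weight has no real-valued rescaling and can make the matrix indefinite.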
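If the kernel in (5) takes $\mathbf{K}$ in (7), then one has
$$\min_{\mu} \max_{\alpha} \ 2\alpha^T e - \alpha^T \operatorname{diag}(y) \Big(\sum_{k=1}^{d} \mu_k v_k v_k^T\Big) \operatorname{diag}(y) \alpha \quad \text{s.t.} \quad 0 \le \alpha \le C e, \quad \alpha^T y = 0, \quad \operatorname{trace}\Big(\sum_{k=1}^{d} \mu_k v_k v_k^T\Big) = c, \quad \mu_k \ge 0, \quad k = 1, \ldots, d. \tag{8}$$
Applying an idea in [18], one can recast (8) as a semidefinite programming (SDP) problem, which can be solved using a general-purpose program such as SeDuMi [20] or SDPT3 [21]. Note that $\mathbf{K}$ is a linear combination of the rank-one matrices $v_k v_k^T$ with weights $\mu_k$. Hence (8) can also be formulated as a quadratically constrained quadratic programming (QCQP) problem, which is a special form of SDP.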
As will be shown in the following, directly solving (8) may not be suitable for feature scaling. To see why, we first analyze the characteristics of (8). To this end, we start with the following definition. Let $f(\alpha, \mu)$ denote the objective function of (8). A feasible pair $(\alpha^*, \mu^*)$ is called a saddle point of $f$ if
$$f(\alpha, \mu^*) \le f(\alpha^*, \mu^*) \le f(\alpha^*, \mu) \tag{9}$$
holds for all feasible $\alpha$ and $\mu$.
The right-hand side of (9) says that $\mu^*$ minimizes $f(\alpha^*, \mu)$ for the given $\alpha^*$. The left-hand side of (9) says that $\alpha^*$ maximizes $f(\alpha, \mu^*)$ for the given $\mu^*$. Accordingly, $(\alpha^*, \mu^*)$ is a saddle point if and only if $\max_{\alpha} f(\alpha, \mu^*) = f(\alpha^*, \mu^*) = \min_{\mu} f(\alpha^*, \mu)$ [22]. If saddle points of (8) exist, strong duality holds and the optimal values of $\alpha_i$ ($i = 1, \ldots, n$) and $\mu_k$ ($k = 1, \ldots, d$) are obtained simultaneously. One can also observe that (8) is a linear programming (LP) problem with respect to $\mu$ if the optimal value $\alpha^*$ is given. This shows that the optimal value of $\mu$ can be obtained by searching the extreme points of an LP problem if $\alpha^*$ is known. Although nonextreme points may also be optimal for an LP, there always exists an extreme point that attains the optimal value. Based on these facts, one may obtain an extreme point of the LP as the optimal $\mu$. The solution $\mu$ will then be overly sparse, since (8) contains only one equality constraint on $\mu$. In other words, only a few scaling factors are nonzero in such a case, and one cannot evaluate the importance of most features.
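The sparsity argument can be made concrete with a small sketch. It assumes that, once the multipliers are fixed, the problem in the scaling factors reduces to a linear program with nonnegative variables and a single equality constraint with positive coefficients; the cost vector `g`, the constraint weights `t`, and the constant `c` below are hypothetical numbers, not values from the paper. Under that assumption an optimal extreme point puts all of its mass on the single coordinate with the smallest cost-to-weight ratio.

```python
def lp_vertex_solution(g, t, c):
    """Minimize sum_k g[k] * mu[k] subject to sum_k t[k] * mu[k] == c, mu >= 0.

    Assumes every t[k] > 0 and c > 0. Substituting nu[k] = t[k] * mu[k] / c
    turns the feasible set into a simplex, so an optimal vertex is the
    coordinate with the smallest ratio g[k] / t[k].
    """
    best = min(range(len(g)), key=lambda k: g[k] / t[k])
    mu = [0.0] * len(g)
    mu[best] = c / t[best]
    return mu


def objective(g, mu):
    return sum(gk * mk for gk, mk in zip(g, mu))


g = [-3.0, -1.0, -2.0]   # hypothetical linear costs for three features
t = [2.0, 1.0, 1.0]      # hypothetical weights of the equality constraint
c = 4.0

mu_vertex = lp_vertex_solution(g, t, c)
mu_dense = [1.0, 1.0, 1.0]   # also feasible: 2 + 1 + 1 == 4

print(mu_vertex, objective(g, mu_vertex))
print(mu_dense, objective(g, mu_dense))
```

The vertex solution keeps one feature and attains a lower objective than the dense feasible point, which is the over-sparsity described above.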
Hence, in some cases only a few scaling factors are nonzero, which makes (8) unsuitable for feature selection because only a few features are chosen. To deal with this difficulty, we add the L2-norm of the scaling factors $\mu_k$ ($k = 1, \ldots, d$) to the objective function of (8). We refer to the resulting problem as (10). Likewise, one can transform (10) into a semidefinite programming (SDP) problem.

Second-Order Cone Programming (SOCP) Problems.
Solving SDP problems remains computationally expensive even with the advances in interior point methods. One way to reduce this computational complexity is to transform (10) into a second-order cone programming (SOCP) problem. Second-order cone programming problems are convex optimization problems in which a linear function is minimized over the intersection of an affine linear manifold with the Cartesian product of second-order cones. Interior point methods that solve SOCP problems directly have a much better worst-case complexity than those for SDP. As a result, solving SOCP problems is more efficient than solving SDP problems in practice. SOCP has also found many applications in machine learning in recent years [23,24]. Miyashiro and Takano [25] used mixed integer second-order cone programming formulations for variable selection in linear regression. More interestingly, the SOCP technique can be used to solve the problems of support vector machines [26,27].
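For reference, a commonly used standard form of an SOCP problem is written out below. The symbols $z$, $f$, $A_j$, $p_j$, $q_j$, $r_j$, $F$, and $g$ are generic placeholders introduced only for this illustration and are not the notation of the problem solved in this paper.

```latex
\begin{aligned}
\min_{z} \quad & f^{T} z \\
\text{s.t.} \quad & \| A_j z + p_j \|_2 \le q_j^{T} z + r_j, \quad j = 1, \ldots, m, \\
& F z = g .
\end{aligned}
```

Each inequality requires an affine image of $z$ to lie in a second-order cone. When $A_j = 0$ the constraint reduces to a linear inequality, so linear programs are a special case, and a bound of the form $\| \mu \|_2 \le s$ on a vector of scaling factors is itself a constraint of this type.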
Applying the techniques in [28,29], one can formulate (10) as an SOCP problem, referred to as (11), in which each feature $k$ contributes the rank-one matrix $v_k (v_k)^T$ and a scalar variable $h_k \in \mathbb{R}^1$, $k = 1, \ldots, d$. One can solve (11) by using the MOSEK optimization software (http://www.mosek.com/). Problem (11) contains the L1-norm in its constraints, so we refer to (11) as L1L2SOCP for clarity. In fact, the constraint $\operatorname{trace}(\sum_{k=1}^{d} \mu_k v_k v_k^T) = c$ may be removed from (11), since (11) still contains the constraint arising from the L2-norm term. In such a case, we refer to (11) as L2-norm SOCP (L2SOCP). The computational complexity of solving (11) depends on the number of features $d$ and the number of samples $n$: the method is efficient when the training set contains a large number of samples and becomes less efficient when the number of features is very large. One generally obtains nonsparse optimal solutions $\mu_k$ ($k = 1, \ldots, d$) when the L2-norm of the scaling factors is used. Because (11) is a convex problem, the scaling factors obtained by solving it are globally optimal, which differs from previous methods, where the scaling factors are often only locally optimal. To obtain the decision function, the Lagrange multipliers $\alpha_i \ge 0$ and the bias $b$ are recovered from the solution of (11) by the equations in (12), where $(\cdot)^{+}$ denotes the pseudoinverse of a matrix and $I_n$ is the $n \times n$ identity matrix.

Experimental Results
In this section, we carry out experiments on simulated data, UCI data sets, and a gene data set to evaluate the proposed optimization model.

Effect of Irrelevant Features.
To evaluate the proposed method when irrelevant features are present, we generate $d$-dimensional Gaussian data from two classes with means $m_1 = [1, 1, 0, \ldots, 0]^T$ and $m_2 = [-1, 1, 0, \ldots, 0]^T$, where both covariance matrices are identity matrices. We compare the proposed method with classical support vector machines (CSVMs), SVMs with multiple parameters based on the radius-margin bound (RW), and SVMs with multiple parameters based on the span bound (Span). All the algorithms are trained on a training set consisting of 100 samples of each class and are tested on an independent test set of 1000 samples. To reduce variations in performance, the experimental results are averaged over 20 runs. Figure 1(a) shows the error rate of each method as the number of irrelevant features increases. Figure 1(b) shows the scaling factors of our method.
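The simulated data described above can be generated along the following lines. This is only a sketch under stated assumptions: the function name, the seed handling, and the assignment of label +1 to the class with mean $[1, 1, 0, \ldots, 0]^T$ are choices made here for illustration, and the standard library is used to keep the example dependency-free.

```python
import random


def make_toy_data(n_per_class, dim, seed=0):
    """Draw n_per_class samples per class from unit-covariance Gaussians.

    Only the first coordinate separates the classes; the second has the same
    mean in both classes and the remaining dim - 2 coordinates are pure noise.
    Requires dim >= 2.
    """
    rng = random.Random(seed)
    mean_pos = [1.0, 1.0] + [0.0] * (dim - 2)
    mean_neg = [-1.0, 1.0] + [0.0] * (dim - 2)
    samples, labels = [], []
    for label, mean in ((1, mean_pos), (-1, mean_neg)):
        for _ in range(n_per_class):
            # Identity covariance: independent coordinates with unit variance.
            samples.append([rng.gauss(m, 1.0) for m in mean])
            labels.append(label)
    return samples, labels


train_x, train_y = make_toy_data(n_per_class=100, dim=25, seed=1)
print(len(train_x), len(train_x[0]), train_y.count(1), train_y.count(-1))
```

Because the two class means differ only in the first coordinate, a well-behaved scaling method should assign that feature a clearly larger factor than the others.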
From Figure 1(a), one can see that the proposed method is superior to CSVM, SVM (RW), and SVM (Span) as the number of features increases. This may be because the scaling factors in SVM (RW) and SVM (Span) are locally optimal due to gradient descent, whereas the scaling factors in our method are globally optimal. Note that, in this experiment, we do not carry out feature selection in either SVM (RW) or SVM (Span). These two methods do perform better than CSVM but are not better than the proposed method (L2SOCP or L1L2SOCP). This shows that feature scaling indeed improves the performance of classifiers when irrelevant features are present. From Figure 1(b), it is found that the scaling factors that correspond to the relevant features are significantly larger than those corresponding to the irrelevant features. In other words, one scaling factor is clearly larger than the others. In such a case, one can select only the feature with the largest scaling factor. This confirms that it is reasonable to use scaling factors to select relevant features. Overall, the experiments show that feature scaling can improve the classification performance in the presence of irrelevant features and that scaling factors can be used to select the relevant features.

Experimental Results on Benchmark Data Set.
To further test the performance of the proposed method, we compared it with CSVM, SVM (RW), SVM (Span), the iterative method (IM) in [11], and the alternative second-order cone programming (ASOCP) formulations of SVMs in [23] on a collection of benchmark data sets obtained from the UCI Machine Learning Repository [30]. These data sets have been widely used in testing and evaluating the performance of learning algorithms. Table 1 shows the statistics of the data sets we use. The attributes of each data set are normalized to the interval $[-1, 1]$. The parameters of all the support vector machines are estimated using tenfold cross validation on the training set. There are three parameters in L1L2SOCP. For simplicity, we first set one parameter to 1 and perform a grid search over the two-dimensional space of the other two, where one ranges from $2^{-10}$ to $2^{6}$ and the other ranges from $2^{-9}$ to $2^{9}$. Then we fix one of these two parameters and perform a grid search over the remaining two-dimensional parameter space. To evaluate the performance, tenfold cross validation is performed and the average error rate of each method is reported in Table 2.
As can be seen from Table 2, our method is superior to SVM (RW), SVM (Span), and IM in most cases. This may come from the fact that the scaling factors in our method are globally optimal, whereas the scaling factors in the other methods are locally optimal. Moreover, L2SOCP is better than CSVM on the Ionosphere and Diabetes data sets and L1L2SOCP is better than CSVM on the Heart and Liver data sets. This shows that the proposed method can use feature scaling to improve the classification performance to some degree. ASOCP obtains performance similar to that of CSVM, since these two methods solve SVMs with different optimization algorithms. In addition, we performed a two-tailed t-test with a significance level of 0.05 to determine whether there is a significant difference between the proposed method and the other methods. The results show that there is no significant difference among these methods on these data sets. This may be because these data sets have few completely irrelevant features. However, we may set a threshold on the scaling factors and treat the features whose scaling factors exceed the threshold as the most relevant ones. This strategy can effectively reduce the number of features by performing feature selection.

Gene Expression Data.
DNA microarrays [31] measure the expression levels of thousands of genes simultaneously. In general, gene expression data contains a large number of irrelevant and redundant features. Feature scaling techniques can be used to select relevant genes and contribute to discriminating different genes. In the following experiments, we test our method on the colon data set. This data set contains 62 samples: 22 normal and 40 cancerous colon examples. These two classes are discriminated by the expression profiles of 2000 genes.
In the first set of experiments, the proposed method is evaluated using leave-one-out cross validation (LOOCV). Table 3 shows the LOOCV results of the proposed method, SVM, RVM [32], logistic regression, JCFO [32], and mixed integer SOCP (MISOCP) [25] for variable selection. Note that only the linear kernel is used in these methods.
From Table 3, it is found that the classification performance of the L2SOCP or L1L2SOCP method is superior to that of SVM, RVM, and logistic regression. This also shows that feature scaling can improve the performance of classifiers in the presence of irrelevant features. It is also found that our method is superior to JCFO. This may be because JCFO performs feature selection, while our method does not. MISOCP is worse than our method, since MISOCP is only used to select relevant features. Overall, using scaling factors to choose the effective features in the presence of redundant features can improve the performance of algorithms.
In the second set of experiments, we select the features in terms of scaling factors. The larger the scaling factor is, the more important the feature is, so the features can be sorted by their scaling factors. For comparison purposes, we also perform the recursive feature elimination (RFE) method using linear SVMs [2] and MISOCP [25]. Table 4 shows the number of misclassified samples of each method as the number of selected features changes. For L2FS + L2SOCP and L1L2FS + L1L2SOCP in Table 4, we first use L2SOCP and L1L2SOCP to obtain the scaling factors, then select the features in terms of their scaling factors, and finally apply L2SOCP and L1L2SOCP again to the selected features. For L2FS + SVM and L1L2FS + SVM in Table 4, we first use L2SOCP and L1L2SOCP to obtain the scaling factors, then select the features in terms of their scaling factors, and finally apply SVM to the selected features. That is, the results are obtained by using classical SVMs on a subset of genes selected by the scaling factors from L2SOCP or L1L2SOCP.
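The selection step in this two-stage procedure amounts to ranking features by their scaling factors and keeping the top ones. A minimal sketch is given below; the function names and the example scaling factors are hypothetical, and the classifier that is retrained on the reduced data is left out.

```python
def top_k_features(scaling_factors, k):
    """Return the indices of the k largest scaling factors, largest first."""
    order = sorted(range(len(scaling_factors)),
                   key=lambda j: scaling_factors[j], reverse=True)
    return order[:k]


def keep_columns(data, columns):
    """Restrict every sample to the selected feature columns."""
    return [[row[j] for j in columns] for row in data]


scaling_factors = [0.1, 2.0, 0.0, 0.7]        # hypothetical learned factors
data = [[5.0, 1.0, 9.0, 3.0],
        [6.0, 2.0, 8.0, 4.0]]

selected = top_k_features(scaling_factors, k=2)
reduced = keep_columns(data, selected)
print(selected)
print(reduced)
```

The reduced data can then be passed either to the same scaling method again or to a classical SVM, which corresponds to the two variants compared in this set of experiments.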
From Table 4, it is found that feature scaling can further improve the performance of classifiers if the number of selected features is large. To be specific, when the number of selected features is bigger than 16, the performance of L2FS + L2SOCP or L1L2FS + L1L2SOCP is superior to that of RFE + SVM. However, as the number of selected features decreases, feature scaling results in performance deterioration. That is, when the number of selected features is smaller than 16, the performance of L2FS + L2SOCP or L1L2FS + L1L2SOCP is the same as that of RFE + SVM. If we instead train SVMs on the features selected by L2SOCP or L1L2SOCP, the L2FS + SVM or L1L2FS + SVM method is better than the RFE + SVM or MISOCP + SVM method, since our method can obtain the globally optimal scaling factors. This shows that it is better to select the features in terms of L2SOCP or L1L2SOCP and then to perform SVM for classification tasks. It is also observed that the best performance of our method is superior to that of the JCFO method in the first set of experiments if the scaling factors are used to select the features. Overall, the experiments show that it is effective to select the features in terms of scaling factors obtained by L2SOCP or L1L2SOCP.

Conclusions and Further Work
When real-world data contains redundant features, it is important to select the effective features in order to enhance the classification performance of classifiers. Different from previous methods, in this paper we assign a weight to each feature of the data and then use the generalized performance measure to construct the optimization model. The proposed optimization model can be transformed into an SOCP problem, so the weights, or scaling factors, of features can be obtained in a globally optimal way. Since the scaling factors are nonnegative, the optimal features can be easily obtained from them. A number of experiments on a toy example, UCI data sets, and a gene data set show that the proposed method obtains better performance than previous methods. Our method is only suitable for linear kernels. It is of interest to extend our optimization model to a nonlinear kernel version, which is left for future work.

Figure 1: Experimental results on simulated data. (a) Performance comparisons of several feature scaling methods. (b) Scaling factors with the dimension $d = 25$.

Table 1: Statistics for the data sets we deal with.

Table 2: The error rates (%) of each method on eight binary data sets.

Table 4: Performance comparisons on different selected features.