Feature scaling has attracted considerable attention over the past several decades because of its important role in feature selection. In this paper, a novel algorithm for learning the scaling factors of features is proposed. It first assigns a nonnegative scaling factor to each feature of the data and then adopts a generalized performance measure to learn the optimal scaling factors. Notably, the proposed model can be transformed into a convex optimization problem, namely second-order cone programming (SOCP); the scaling factors obtained by our method are therefore globally optimal in some sense. Experiments on simulated data, UCI data sets, and a gene data set demonstrate that the proposed method is more effective than previous methods.
Selecting relevant and important features has been an active research area in statistics and data mining [
Owing to the good generalization performance of support vector machines (SVMs), they have become an effective tool for selecting relevant features over the past several years. In [
In this paper, we propose a novel method for feature scaling based on the SVM criteria which prevents the scaling factors of features from falling into locally optimal solutions. The proposed method first introduces scaling factors of features into linear support vector machines and then uses a generalized performance measure to learn them. To make the generalized performance measure suitable for feature scaling, the measure is modified and formulated as a second-order cone programming problem. Finally, the scaling factors of features are obtained by solving a convex optimization problem. In addition, we carry out experiments on several data sets to evaluate the proposed method.
In this section, we briefly recall the basic idea of LSVMs. This class of algorithms was introduced in [
Given the training data
It is found that (
After
Note that although there are
In this section, we propose a feature scaling method based on the generalized performance measure. First, it would be helpful to introduce the generalized performance measure.
Note that the generalized performance measure [
Instead of adopting multiple kernels, we consider the linear kernel in this paper. It is clear that the Gram matrix
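To make the per-feature scaling of the linear kernel concrete, the following sketch (our own illustration, not the paper's code) computes the Gram matrix of a linear kernel with nonnegative scaling factors, i.e., k(x, z) = Σ_d θ_d x_d z_d, so the Gram matrix is X diag(θ) Xᵀ. A feature whose scaling factor is zero is effectively removed from the kernel:

```python
import numpy as np

def scaled_linear_gram(X, theta):
    """Gram matrix of the linear kernel with per-feature scaling factors.

    k(x, z) = sum_d theta_d * x_d * z_d = x^T diag(theta) z,
    so the full Gram matrix is X diag(theta) X^T.
    """
    theta = np.asarray(theta, dtype=float)
    assert np.all(theta >= 0), "scaling factors must be nonnegative"
    return (X * theta) @ X.T  # row-wise broadcasting applies diag(theta)

# A feature with scaling factor 0 drops out of the kernel entirely.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
theta = np.array([1.0, 0.5, 0.0])  # third feature switched off
G = scaled_linear_gram(X, theta)
```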
In [
Applying an idea in [
As will be shown in the following, directly solving (
If a pair of variables (
The right-hand side of (
In some cases, however, only a few scaling factors are nonzero, which makes the model unsuitable for feature selection because too few features are retained. To deal with this difficulty, we add the L2-norm of
Likewise, one can transform (
Solving an SDP remains computationally expensive even with advances in interior-point methods. One way to reduce this computational cost is to transform (
Applying the techniques in [
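A toy illustration (ours, with hypothetical values c, lam, mu) of why the added L2 term discourages overly sparse scalings: if two identical features must share a total scaling of c, every split gives the same data fit and the same L1 penalty, so a pure L1 model may put all weight on one feature; the L2 term is uniquely minimized by the even split, spreading weight across the duplicated features:

```python
import numpy as np

# Two identical features share a total scaling of c. The fraction t of c
# assigned to the first feature does not change the fit or the L1 penalty,
# but the L2 penalty is uniquely minimized at the even split t = 0.5.
c, lam, mu = 1.0, 0.1, 0.1
t = np.linspace(0.0, 1.0, 101)               # fraction of c on feature 1
s1, s2 = t * c, (1.0 - t) * c
l1_penalty = lam * (np.abs(s1) + np.abs(s2))  # constant in t
l2_penalty = mu * (s1**2 + s2**2)             # strictly convex in t

best_t = t[np.argmin(l1_penalty + l2_penalty)]
```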
In this section, we carry out the experiments on simulated data, UCI data sets, and the gene data set to evaluate the proposed optimization model.
To evaluate the proposed method in the presence of irrelevant features, we generate
Experimental results on simulated data. (a) Performance comparisons of several feature scaling methods. (b) Scaling factors with the dimension
From Figure
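One common way to build such simulated data (a sketch under our own assumptions; the paper's exact generator is not reproduced here) is to let the label depend only on a few leading features and pad the rest with pure noise, so that a good feature scaling method should drive the noise features' scaling factors toward zero:

```python
import numpy as np

def make_data(n=100, d_relevant=2, d_irrelevant=50, seed=0):
    """Binary data whose label depends only on the first d_relevant features.

    The remaining d_irrelevant features are pure Gaussian noise.
    (Illustrative sketch only, not the paper's generator.)
    """
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)
    # Relevant features: class mean +/-1 plus moderate noise.
    relevant = y[:, None] * 1.0 + 0.5 * rng.standard_normal((n, d_relevant))
    # Irrelevant features: noise, independent of the label.
    irrelevant = rng.standard_normal((n, d_irrelevant))
    X = np.hstack([relevant, irrelevant])
    return X, y

X, y = make_data()
```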
To further test the performance of the proposed method, we compare it with CSVM, SVM (RW), SVM (Span), and the iterative method (IM) in [
Statistics of the data sets used in the experiments.
Data sets | Number of samples | Number of dimensions | Number of classes |
---|---|---|---|
Australian | 690 | 14 | 2 |
Breast | 683 | 10 | 2 |
Diabetes | 768 | 8 | 2 |
German | 1000 | 24 | 2 |
Heart | 270 | 13 | 2 |
Ionosphere | 351 | 34 | 2 |
Sonar | 208 | 60 | 2 |
Liver | 345 | 6 | 2 |
The error rates (%) of each method on eight binary data sets.
Data set | L2SOCP | L1L2SOCP | CSVMs | ASOCP | SVM (Span) | SVM (RW) | IM |
---|---|---|---|---|---|---|---|
Australian | 14.21 | – | 15.05 | 15.01 | 14.09 | 14.36 | 14.28 |
Breast | 3.06 | – | – | 2.92 | 2.98 | 3.02 | 2.93 |
German | – | 23.90 | 23.55 | 23.57 | 23.70 | 23.76 | 23.32 |
Heart | 17.04 | – | 17.41 | 17.48 | 17.31 | 17.26 | 17.38 |
Ionosphere | – | 11.39 | 11.85 | 11.84 | 11.29 | 11.56 | 11.78 |
Diabetes | – | 22.79 | 22.79 | 22.69 | 22.68 | 22.54 | 22.80 |
Liver | 30.67 | – | 31.81 | 31.79 | 30.55 | 31.28 | 31.13 |
Sonar | – | 23.01 | 23.48 | 23.55 | 22.35 | 23.58 | 23.27 |
As can be seen from Table
DNA microarrays [
In the first set of experiments, the proposed method is evaluated using the leave-one-out cross validation (LOOCV). Table
LOOCV accuracy (%) of the colon data set.
Methods | Accuracy |
---|---|
L1SOCP | – |
L1L2SOCP | – |
SVM | 77.4 |
RVM | 80.6 |
Logistic regression | 71.0 |
JCFO | 86.8 |
MISOCP | 86.3 |
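Leave-one-out cross validation itself is straightforward: each sample is held out once, the model is trained on the remaining samples, and the held-out sample is predicted. The sketch below (ours) uses a deliberately simple nearest-centroid classifier as a stand-in for the paper's models:

```python
import numpy as np

def loocv_accuracy(X, y, fit_predict):
    """Leave-one-out CV: hold out each sample once, fit on the rest,
    and score the prediction on the held-out sample."""
    n = len(y)
    hits = 0
    for i in range(n):
        mask = np.arange(n) != i
        hits += fit_predict(X[mask], y[mask], X[i]) == y[i]
    return hits / n

def nearest_centroid(X_tr, y_tr, x):
    """Simple stand-in classifier (not the paper's model): predict the
    class whose training centroid is closest to x."""
    classes = np.unique(y_tr)
    cents = np.array([X_tr[y_tr == c].mean(axis=0) for c in classes])
    return classes[np.argmin(np.linalg.norm(cents - x, axis=1))]

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([-1, -1, 1, 1])
acc = loocv_accuracy(X, y, nearest_centroid)
```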
In the second set of experiments, we select features according to their scaling factors: the larger a scaling factor is, the more important the corresponding feature is, so the features can be ranked by their scaling factors. For comparison purposes, we also perform the recursive feature elimination (RFE) method using linear SVMs [
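RFE repeatedly fits a linear model and removes the feature with the smallest absolute weight. The sketch below (ours) uses regularized least squares as a generic linear stand-in rather than the SVM-RFE implementation compared in the paper:

```python
import numpy as np

def rfe_ranking(X, y, fit_weights):
    """Recursive feature elimination: refit a linear model and drop the
    feature with the smallest |weight| until one remains. Returns feature
    indices in elimination order (last entry = most important)."""
    remaining = list(range(X.shape[1]))
    order = []
    while len(remaining) > 1:
        w = fit_weights(X[:, remaining], y)
        worst = int(np.argmin(np.abs(w)))
        order.append(remaining.pop(worst))
    order.append(remaining[0])
    return order

def ridge_weights(X, y, lam=1e-3):
    """Regularized least-squares weights as a simple linear stand-in."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sign(X[:, 2] + 0.1 * rng.standard_normal(200))  # only feature 2 matters
ranking = rfe_ranking(X, y, ridge_weights)
```

On this synthetic example the informative feature should survive every elimination round and appear last in the ranking.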
Performance comparisons on different selected features.
Entries are the number of misclassified samples.

Selected features | 1024 | 512 | 256 | 128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
---|---|---|---|---|---|---|---|---|---|---|---|
RFE + SVM | 22 | 24 | 22 | 13 | 9 | 4 | 0 | 1 | 1 | 2 | 4 |
L2FS + L2SOCP | 22 | – | – | – | – | – | – | – | – | – | – |
L1L2FS + L1L2SOCP | 22 | 22 | 20 | 9 | 8 | 0 | 0 | 1 | 1 | 2 | 4 |
L2FS + SVM | 22 | 15 | 10 | 8 | 3 | 0 | 0 | 1 | 1 | 3 | 4 |
L1L2FS + SVM | 22 | 20 | 18 | 8 | 8 | 0 | 0 | 1 | 1 | 2 | 4 |
MISOCP + SVM | 22 | 20 | 19 | 8 | 5 | 1 | 0 | 1 | 2 | 2 | 4 |
From Table
When real-world data contain redundant features, it is important to select the effective ones in order to enhance the classification performance. Unlike previous methods, in this paper we assign a weight to each feature of the data and then use the generalized performance measure to construct the optimization model. Conveniently, the proposed model can be transformed into an SOCP problem, so the weights, or scaling factors, of features are obtained in a globally optimal way. Since the scaling factors are nonnegative, the optimal features can be easily identified. A number of experiments on a toy example, UCI data sets, and a gene data set show that the proposed method achieves better performance than previous methods. Note that our method is only suitable for linear kernels; extending the optimization model to its nonlinear kernel version is our future work.
The author declares that there are no competing interests regarding the publication of this paper.
This work is partially supported by the FRF for the Central Universities (2015XKMS084).