A Semisupervised Feature Selection with Support Vector Machine

Feature selection has proved to be a beneficial tool in learning problems, with the main advantages of interpretability and generalization. Most existing feature selection methods do not achieve optimal classification performance, since they neglect the correlations among highly correlated features that all contribute to classification. In this paper, a novel semisupervised feature selection algorithm based on the support vector machine (SVM), termed SENFS, is proposed. In order to solve SENFS, an efficient algorithm based on the alternating direction method of multipliers is then developed. One advantage of SENFS is that it encourages highly correlated features to be selected or removed together. Experimental results demonstrate the effectiveness of our feature selection method on simulated data and benchmark data sets.


Introduction
Feature selection, with the purpose of selecting relevant feature subsets among thousands of potentially irrelevant and redundant features, is a challenging topic of pattern recognition research that has attracted much attention over the last few years. A good feature selection method offers several advantages for a learning algorithm, such as reducing computational cost, increasing classification accuracy, and improving result comprehensibility [1].
Considering the use of class label information, feature selection methods can be classified into supervised, unsupervised, and semisupervised methods. Supervised feature selection methods usually use only information from labeled data to find relevant feature subsets [2][3][4]. However, in many real-world applications, labeled data are very expensive or difficult to obtain, which makes it hard to build a large training set. This situation arises naturally in practice, where large amounts of data can be collected automatically and cheaply, while manual labeling of samples remains difficult, expensive, and time consuming. Unsupervised feature selection methods can be an alternative in this case, exploiting the information conveyed by the large amount of unlabeled data [5, 6]. However, because these unsupervised algorithms ignore label information, important hints from labeled data are left out, which generally degrades their performance. Semisupervised approaches [7][8][9][10] combine both strategies and exploit the information of both labeled and unlabeled data. A good survey of semisupervised feature selection approaches can be found in [9].
The performance of most existing semisupervised feature selection methods is insufficient when there are several highly correlated features that are all relevant to classification, since the way such features interact can help with the interpretability of the target problem [11]. Given these premises, this paper provides two main contributions.
(i) We present a novel semisupervised feature selection scheme based on the support vector machine (SVM) and the elastic net penalty proposed by Zou and Hastie [12], combining ℓ_1 and ℓ_2 regularizations, termed SENFS.
(ii) In order to solve SENFS despite the nondifferentiability of both the loss function and the ℓ_1-norm regularization term, an efficient algorithm based on the alternating direction method of multipliers (ADMM) is developed.

Methodology
Assume that all samples, drawn from the same population generated by the target concept, consist of p features. Given a set of samples X = (x_1, ..., x_n)^T, in which n is the number of samples, the i-th sample (input vector) x_i is denoted by x_i = (x_{i1}, x_{i2}, ..., x_{ip}).
The set X can be divided into two parts: a labeled set X_l = (x_1, ..., x_l)^T, for which labels y_l = (y_1, ..., y_l)^T with y_i ∈ {−1, 1} are provided in the binary problem, and an unlabeled set X_u = (x_{l+1}, ..., x_{l+u})^T whose labels are not given, where l and u are the numbers of labeled and unlabeled samples, respectively, and n = l + u. The generic goal of semisupervised feature selection is then to find a feature subset with q (q < p) features which contains the most informative features, using the data information of both X_l and X_u. In other words, the samples represented in the q-dimensional space should well preserve the information of the samples X = (x_1, ..., x_n)^T represented in the original p-dimensional space. We begin our discussion with binary supervised feature selection based on the elastic net penalty. Wang et al. [11] proposed a supervised feature selection method based on SVM with the elastic net penalty, named the doubly regularized support vector machine (DrSVM), which for binary classification solves the optimization of the following generic objective function over the hyperplane parameters (w, b_0):

min_{w, b_0}  C Σ_{i=1}^{l} L(y_i f(x_i)) + λ_1 ‖w‖_1 + (λ_2/2) ‖w‖_2^2,   (1)

where the decision function is defined as f(x) = w^T x + b_0, λ_1 and λ_2 are tuning parameters, C is the regularization parameter, and L is the margin loss function, for example, the hinge loss L(z) = H_1(z) = max(1 − z, 0). The role of the ℓ_1-norm penalty is to allow feature selection, and the role of the ℓ_2-norm penalty is to help groups of highly correlated features get selected or removed together, which is referred to as the grouping effect [11].
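As a concrete illustration, the DrSVM objective just described can be evaluated directly; the following Python sketch is not the authors' code, and the helper names and notation for the tuning parameters are ours (assuming the hinge loss):

```python
import numpy as np

def hinge(z):
    """Hinge loss H_1(z) = max(1 - z, 0), applied elementwise."""
    return np.maximum(1.0 - z, 0.0)

def drsvm_objective(w, b0, X, y, C, lam1, lam2):
    """Doubly regularized SVM objective (hypothetical helper):
    C * sum_i H_1(y_i * (w.x_i + b0)) + lam1*||w||_1 + (lam2/2)*||w||_2^2."""
    margins = y * (X @ w + b0)
    return C * hinge(margins).sum() + lam1 * np.abs(w).sum() + 0.5 * lam2 * (w @ w)

# toy check: a perfectly separated pair of points incurs zero hinge loss,
# so only the two penalty terms contribute
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
w = np.array([2.0, 0.0])
obj = drsvm_objective(w, 0.0, X, y, C=1.0, lam1=0.1, lam2=0.1)
```

Here the objective reduces to 0.1·‖w‖_1 + 0.05·‖w‖_2² = 0.2 + 0.2 for the toy weights, isolating the two penalty terms.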
As for semisupervised feature selection, considering both X_l and X_u and inspired by the semisupervised learning algorithm TSVM [14], we apply the elastic net penalty to semisupervised feature selection (SENFS), which solves the following optimization task over both the hyperplane parameters (w, b_0) and the unlabeled label vector y_u = (y_{l+1}, ..., y_n)^T:

min_{w, b_0, y_u}  C Σ_{i=1}^{l} L(y_i f(x_i)) + C* Σ_{j=l+1}^{n} U(y_j f(x_j)) + λ_1 ‖w‖_1 + (λ_2/2) ‖w‖_2^2,
s.t.  (1/u) Σ_{j=l+1}^{n} f(x_j) = (1/l) Σ_{i=1}^{l} y_i.   (2)

The constraint in (2) is called the balancing constraint and is necessary to avoid the trivial solutions in which all unlabeled samples are assigned to the same class. This constraint enforces a manually chosen constant r, and an approximation of it writes (1/u) Σ_{j=l+1}^{n} f(x_j) = 2r − 1 [15]; choosing r as the fraction of positive labeled samples, the constraint can be rewritten as (1/u) Σ_{j=l+1}^{n} f(x_j) = (1/l) Σ_{i=1}^{l} y_i. L and U employ the same loss, for example, the hinge loss L(z) = H_1(z) = max(1 − z, 0).
Obviously, the difficulty of the above optimization task lies in finding the optimal assignment of the unlabeled label vector y_u together with the hyperplane parameters (w, b_0), which is a mixed-integer programming problem [16]. As described in [15], for a fixed (w, b_0), arg min_y L(y f(x)) = sign(f(x)). So problem (2) can be seen equivalently as

min_{w, b_0}  C Σ_{i=1}^{l} L(y_i f(x_i)) + C* Σ_{j=l+1}^{n} L(|f(x_j)|) + λ_1 ‖w‖_1 + (λ_2/2) ‖w‖_2^2,
s.t.  (1/u) Σ_{j=l+1}^{n} f(x_j) = (1/l) Σ_{i=1}^{l} y_i.   (3)

On the other hand, an effective approximation of the symmetric loss L(|f(x)|) is its clipped variant [17], which can be expressed as

L(|f(x)|) ≈ R_s(f(x)) + R_s(−f(x)),   (4)

where R_s(z) is the ramp loss defined as

R_s(z) = H_1(z) − H_s(z), with H_s(z) = max(s − z, 0) and s < 1.   (5)

From (5), we know that solving the optimization problem (3) with the clipped symmetric hinge loss is equivalent to solving a classical SVM in which the unlabeled samples are counted twice, with artificial labels y_j = 1 when l + 1 ≤ j ≤ l + u and y_j = −1 when l + u + 1 ≤ j ≤ l + 2u. Problem (3) therefore becomes (6). As can be seen from (6), when C ≠ 0 and C* = 0, SENFS reduces to a supervised feature selection algorithm, and when C = 0 and C* ≠ 0, it becomes an unsupervised model. In the following, we illustrate how SENFS attains the grouping effect for correlated features; Theorem 1 describes this point. The term |β_i − β_j| in (7) measures the difference between the coefficient paths of features i and j. If the two features are highly correlated, that is, ρ = 1, Theorem 1 says that the difference between their coefficient paths is almost 0, in which case both features will be selected or removed together. The upper bound in (7) or (8) thus provides a quantitative description of the grouping effect of SENFS.
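The ramp-loss construction can be sketched numerically; the following is a minimal Python illustration under our reading of the clipped symmetric loss (the function names are ours, not the paper's):

```python
import numpy as np

def hinge(z, s=1.0):
    """Shifted hinge H_s(z) = max(s - z, 0)."""
    return np.maximum(s - z, 0.0)

def ramp(z, s=-1.0):
    """Ramp loss R_s(z) = H_1(z) - H_s(z): a hinge clipped at the value 1 - s,
    so very wrong predictions incur only bounded loss."""
    return hinge(z, 1.0) - hinge(z, s)

def unlabeled_loss(fx, s=-1.0):
    """Clipped symmetric loss for an unlabeled sample: it is counted twice,
    once with artificial label +1 and once with artificial label -1."""
    return ramp(fx, s) + ramp(-fx, s)
```

With s = −1, the ramp loss is flat (equal to 2) for z ≤ −1 and zero for z ≥ 1, matching the clipping behaviour described above.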

Algorithm for SENFS
The alternating direction method of multipliers (ADMM) was developed in the 1970s and is well suited to distributed convex optimization, in particular to large-scale problems arising in statistics, machine learning, and related areas. The method is closely related to many other algorithms, such as the method of multipliers [18], Douglas-Rachford splitting [19], and Bregman iterative algorithms [20] for ℓ_1 problems.
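To fix ideas, the generic ADMM pattern (alternate minimization over the primal blocks followed by a dual ascent step) can be illustrated on the simpler lasso problem; this is a sketch of the pattern only, not the SENFS solver:

```python
import numpy as np

def admm_lasso(A, b, lam, rho=1.0, n_iter=200):
    """ADMM for the lasso: minimize 0.5*||Ax - b||^2 + lam*||z||_1 s.t. x = z.
    Illustrates the alternating x-update / z-update / dual-update pattern."""
    n = A.shape[1]
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)  # scaled dual variable
    # the x-update solves the same SPD linear system each round: factor once
    L = np.linalg.cholesky(A.T @ A + rho * np.eye(n))
    Atb = A.T @ b
    for _ in range(n_iter):
        rhs = Atb + rho * (z - u)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))               # x-update
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)  # z-update
        u = u + x - z                                                    # dual ascent
    return z

# noiseless sparse recovery: only the first 3 coefficients are nonzero
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
x_true = np.zeros(10)
x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true
z = admm_lasso(A, b, lam=0.1)
```

The z-update is exactly the soft-thresholding step that also appears in the SENFS derivation below; the x-update is a fixed linear system, mirroring the structure exploited in (15).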
In this section, we first propose an efficient algorithm to solve SENFS based on ADMM by introducing auxiliary variables and reformulating the original problem. We then prove its convergence property and derive the adjustment principle for the penalty parameters. Finally, we describe the stopping criterion and the computational cost.

Deriving ADMM for SENFS.
It is hard to solve the model (6) directly due to the nondifferentiability of the three loss terms and the ℓ_1-norm term. In order to derive an ADMM algorithm, we introduce auxiliary variables to handle these nondifferentiable terms.
Let X_L = {(x_{ij})}, i = 1, ..., l, j = 1, ..., p, denote the labeled data, let X_U = {(x_{ij})}, i = l + 1, ..., l + 2u, j = 1, ..., p, denote the unlabeled data (counted twice, as above), and let Y_L and Y_U be diagonal matrices whose diagonal elements are the vectors y_l = (y_1, ..., y_l)^T and y_u = (y_{l+1}, ..., y_{l+2u})^T, respectively. The constrained problem in (6) can then be reformulated into an equivalent form (9) by introducing the auxiliary variables h = (h_1, ..., h_l)^T, a = (a_1, ..., a_{2u})^T, b = (b_1, ..., b_{2u})^T, and t, where 1_l is an l-column vector of 1s, 1_{2u} is a 2u-column vector of 1s, and r = (1/l) Σ_{i=1}^{l} y_i. The Lagrangian function of (9) is given in (10), where Θ = {w, b_0, h, a, b, t, u_h, u_a, u_b, v, η}; here u_h, u_a, and u_b are dual variables corresponding to the constraints h = 1_l − Y_L(X_L w + b_0 1_l), a = 1_{2u} − Y_U(X_U w + b_0 1_{2u}), and b = 1_{2u} − (X_U w + b_0 1_{2u}), respectively, v corresponds to the constraint t = w, and η is a scalar corresponding to the balancing constraint. As in the method of multipliers, we form the augmented Lagrangian (11), where β_1, β_2, β_3, β_4 > 0 are penalty parameters. Problem (11) is in the form of ADMM, which consists of the iterations in (12). The efficiency of the iterative algorithm (12) rests on whether the first equation of (12) can be solved quickly. According to the theory of ADMM, the variables (w, b_0), h, a, b, and t are updated in an alternating or sequential fashion, which accounts for the term alternating direction, giving the updates in (13). The first equation in (13) is equivalent to the convex optimization (14). The objective function in this minimization problem is quadratic and differentiable, and since (w^{k+1}, b_0^{k+1}) minimizes it by definition, the optimal solution can be found by solving the set of linear equations (15). In (15), I is a p × p identity matrix and the coefficient matrix is a (p + 1) × (p + 1) matrix, independent of the optimization variables. For the large-p, small-n setting, the term X_L^T X_L in the coefficient matrix is a positive semidefinite low-rank matrix with rank at most l, while the term X_U^T X_U is a positive semidefinite low-rank matrix with rank at most 2u. Therefore, the coefficient matrix is itself a low-rank perturbation of the identity, with rank of the perturbation at most (2n + 1), and if CG is used to solve (15), it converges in fewer than (2n + 1) steps [21].
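A plain conjugate gradient routine of the kind alluded to here can be sketched as follows; the identity-plus-low-rank test matrix mirrors the structure of the coefficient matrix discussed above (the routine itself is a generic textbook CG, not the paper's implementation):

```python
import numpy as np

def conjugate_gradient(A, rhs, tol=1e-10, max_iter=None):
    """Conjugate gradient for a symmetric positive definite system A x = rhs.
    For A = I + (low-rank part), convergence takes only a few iterations."""
    n = rhs.shape[0]
    x = np.zeros(n)
    r = rhs - A @ x          # residual
    p = r.copy()             # search direction
    rs = r @ r
    for _ in range(max_iter or n):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# identity plus a rank-one term: CG needs only about two distinct eigenvalues' worth of steps
rng = np.random.default_rng(0)
v = rng.standard_normal(50)
A = np.eye(50) + np.outer(v, v)
rhs = rng.standard_normal(50)
x = conjugate_gradient(A, rhs)
```

Because A here has only two distinct eigenvalues, CG terminates after roughly two iterations, which is the same rank-based argument made for (15).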
For the second equation in (13), it is equivalent to solving (16). In order to solve (16), we need Proposition 2 [22], stated for a constant κ > 0. Combined with Proposition 2, we can update h^{k+1} according to Corollary 3.
For the third equation in (13), it is equivalent to solving (20). Combined with Proposition 2, we can update a^{k+1} according to Corollary 4.
For the fourth equation in (13), it is equivalent to solving (23). In order to solve (23), we need the following proposition.
For the fifth equation in (13), it is equivalent to solving (29). Solving (29) can be done efficiently using the soft-thresholding operator, and we can update t^{k+1} according to Corollary 7.
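The soft-thresholding operator referred to here is the standard proximal operator of the ℓ_1 norm; a one-line Python sketch:

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1, applied elementwise:
    S_kappa(v)_i = sign(v_i) * max(|v_i| - kappa, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

t = soft_threshold(np.array([3.0, -0.5, 0.2]), 1.0)
```

Components smaller than the threshold in magnitude are set exactly to zero, which is what produces sparsity in the t-update.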

Convergence Analysis and Computational Cost.
The convergence property of Algorithm 8 can be derived from the theory of the alternating direction method of multipliers. According to the standard convergence theory of ADMM, Algorithm 8 satisfies dual variable convergence [24], so Theorem 9 holds.

Theorem 9. Suppose that (w*, b_0*) is a solution of the SENFS problem. Then the convergence property stated in (32) holds.

As for the computational issue, it is hard to predict the computational cost exactly because it depends on all the penalty parameters. In our experience, a few hundred iterations suffice to reach a reasonable result. On the other hand, the efficiency of Algorithm 8 lies mainly in how quickly the linear system (15) can be solved, whose cost grows with l, u, and p.

Varying Penalty Parameter.
In order to make the performance less dependent on the initial choice of the penalty parameters, it is necessary to vary them. According to our experimental experience, the penalty parameters β_1, β_2, β_3, and β_4 have a strong influence on both the performance and the number of iterations, so they are selected adaptively. For β_1, the associated constraints are h = 1_l − Y_L(X_L w + b_0 1_l) and the constraint of (3); the optimization task (6) is then equivalent to (33). The necessary optimality conditions for problem (6) are the dual feasibility conditions (34). Since w^{k+1} minimizes the augmented Lagrangian by definition, we obtain (35). Compared with (34), (35) means that the corresponding quantity can be viewed as a residual for (34). For β_2, the associated constraints are a = 1_{2u} − Y_U(X_U w + b_0 1_{2u}) and the constraint of (3); through the same process as for β_1, we obtain its residual. With these residuals, we obtain a simple scheme to update β_1, β_2, β_3, and β_4, respectively, according to Corollary 10.
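A common residual-balancing rule from the ADMM literature gives the flavor of such an adaptive scheme; the constants below (mu = 10, tau = 2) are conventional defaults, and the paper's Corollary 10 may use a different rule:

```python
def update_penalty(beta, primal_res, dual_res, mu=10.0, tau=2.0):
    """Residual-balancing update for one ADMM penalty parameter:
    grow beta when the primal residual dominates (tighten the constraint),
    shrink it when the dual residual dominates."""
    if primal_res > mu * dual_res:
        return beta * tau
    if dual_res > mu * primal_res:
        return beta / tau
    return beta
```

Keeping the two residuals within a factor of mu of each other tends to make convergence insensitive to the initial penalty choice, which is exactly the goal stated above.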

Experimental Evaluation
This section examines the performance of SENFS with respect to feature selection and test error on simulated data and six benchmark data sets. In order to evaluate the effectiveness of SENFS, we compare it with an existing semisupervised feature selection algorithm, Spectral [10], and a supervised feature selection algorithm, DrSVM [11], which also exhibits the grouping effect. In addition, to evaluate the quality of the selected features, SVM is trained on these selected features. The experiments are run on a desktop with a Pentium(R) 2.0 GHz CPU and 1.99 GB main memory, using Matlab R2009a on Windows. The limited number of samples prohibits having enough independent training and testing data for performance evaluation, so it is common to apply cross-validation (CV) in this scenario. We used 5-fold CV: we partitioned the data set into five complementary subsets of equal size; four subsets were used as training data and the remaining subset served as test data. We repeated this process five times so that each of the five subsets was used exactly once as test data. To get a more reliable estimate, we performed the 5-fold CV 10 times, and the reported results are averages over the test sets. Moreover, finding an appropriate value of the tuning parameter pair (λ_1, λ_2) is essential for the performance of SENFS; we employed 10-fold CV over a large grid.
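The repeated 5-fold protocol can be sketched as follows (a hypothetical helper, independent of the actual experiment code):

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Partition indices 0..n-1 into k complementary folds of (nearly) equal size."""
    idx = rng.permutation(n)
    return np.array_split(idx, k)

# repeated 5-fold CV skeleton: every sample serves as test data exactly once per repeat
rng = np.random.default_rng(0)
n, k, repeats = 100, 5, 10
counts = np.zeros(n)
for _ in range(repeats):
    for fold in kfold_indices(n, k, rng):
        counts[fold] += 1  # here one would train on the other folds, test on `fold`
```

After the loop, each sample has been used as test data exactly `repeats` times, matching the "5-fold CV performed 10 times" protocol described above.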
4.1. Simulation. We evaluate the performance of SENFS with respect to three factors: the correlation between relevant features, denoted by ρ; the number of labeled samples; and the degree of overlapping among classes, denoted by μ. Consider a 2-class problem in which the samples lie in a p-dimensional space, with the first 10 dimensions being relevant to classification and the remaining features being noise, where the pairwise correlation between the first 10 features is ρ. The number of samples is 300 with p = 500. The samples from the +1 class are drawn from a normal distribution whose mean has μ in the first 10 entries and 0 elsewhere, and whose covariance restricted to the first 10 features is Σ*, where the diagonal elements of Σ* are 1 and the off-diagonal elements are all equal to ρ. The −1 class has a similar distribution except that its mean is −μ in the first 10 entries and 0 elsewhere.
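Under our reading of this setup (mean ±μ on the first 10 coordinates and independent standard normal noise on the remaining coordinates; these are assumptions where the extracted text is ambiguous), the simulated data can be generated as:

```python
import numpy as np

def simulate(n, p, rho, mu, rng):
    """Hypothetical re-creation of the simulated data: the first 10 features
    are relevant, with pairwise correlation rho and mean mu; the remaining
    p - 10 features are independent standard normal noise."""
    cov = np.full((10, 10), rho)
    np.fill_diagonal(cov, 1.0)
    relevant = rng.multivariate_normal(np.full(10, mu), cov, size=n)
    noise = rng.standard_normal((n, p - 10))
    return np.hstack([relevant, noise])

rng = np.random.default_rng(0)
Xpos = simulate(150, 500, 0.9, 1.0, rng)   # +1 class
Xneg = simulate(150, 500, 0.9, -1.0, rng)  # -1 class, mirrored mean
```

With ρ = 0.9 the first 10 columns are nearly interchangeable, which is precisely the regime in which the grouping effect matters.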
To evaluate the effect of the correlation between relevant features, SENFS is compared with Spectral and DrSVM, measured by the number of selected features, with two labeled samples and μ = 1. The results are summarized in Table 1. As shown in Table 1, on this simulated data, when the relevant features are highly correlated (e.g., ρ = 0.9), Spectral and DrSVM tend to keep only a small subset of the relevant correlated features and overlook the others, while SENFS tends to identify all of them, owing to the grouping effect. All three methods seem to work well in removing irrelevant features.
The effects of the number of labeled samples on test error over the top 10 selected features are summarized in Figure 1, with μ = 1 and ρ = 0.9. As can be seen, the test errors of SENFS, Spectral, and DrSVM all decrease as the number of labeled samples increases, but SENFS achieves the best classification performance across the range, which suggests that SENFS makes better use of the labeled samples than Spectral and DrSVM. The supervised feature selection method DrSVM achieves the worst results because it relies only on the few labeled samples and discards the large amount of unlabeled samples.
In Table 2, the effect of the degree of overlapping among classes on test error over the top 5 selected features is evaluated with two labeled samples and ρ = 0.9, also reporting the typical computational time of our experimental campaign. As we can see, SENFS has the best prediction performance. When μ is small, the two classes overlap heavily, and in this case the other methods achieve notably worse performance than SENFS. However, SENFS, which solves the programming problem (6) by an iterative procedure, requires more computational time than the other methods, as shown in Table 2. Note that the absolute times are less important than the relative differences between the individual methods.

Application to Benchmark Data Sets.
Several benchmark data sets are selected to test the performance of SENFS; they were used as benchmarks in [7, 8] to evaluate semisupervised algorithms. These benchmarks consist of nine semisupervised learning data sets. We did not test the SSL6, SSL8, and SSL9 data sets, since SSL6 includes six classes, SSL8 contains too many samples (n is over one million), and SSL9 has too many dimensions (p is over ten thousand). The names and characteristics of the remaining six data sets are given in Table 3.
In this study, performance is evaluated through 5-fold cross-validation: we randomly select four fifths of the unlabeled samples, plus all the labeled samples, for SENFS, Spectral, and DrSVM to select optimal feature subsets, while the remaining one fifth is used to estimate the test error on the selected features using SVM, where all the labeled samples are used for training the SVM. The results, measured by test error, are reported in Table 4. As can be seen, SENFS outperforms the other semisupervised and supervised feature selection methods on all six data sets for both l = 10 and l = 100. When l = 10, Spectral performs second best on the USPS, COIL2, and BCI data sets, while DrSVM performs second best on the Digit1, BCI, g241c, and g241n data sets when l = 100.

Conclusion
This paper has proposed a novel semisupervised feature selection algorithm based on SVM and the elastic net penalty.The whole methodology of SENFS and the solution path based on ADMM have been described in detail in this paper.The experimental results illustrate that SENFS can identify the relevant features and encourage highly correlated features to be selected or removed together.
Future work will address how the selected features can be interpreted in terms of their semantic relationship with the data they are selected from, which can be used for unknown data analysis, and will extend SENFS to the multiclass case.

Table 1 :
Comparisons of selected features and their standard errors (in parentheses). n_relevant is the number of selected relevant features and n_noise is the number of selected noise features.

Table 2 :
Comparisons of test errors, computational time needed, and their standard errors (in parentheses).

Table 3 :
Data sets used in the experiments. n, p, and l are the number of samples, features, and labeled samples, respectively.

Table 4 :
Comparisons of test errors and their standard errors (in parentheses) on benchmark data sets.