Semisupervised Feature Selection with Universum

Universum data, defined as a set of unlabeled examples that do not belong to any class of interest, have been shown to encode prior knowledge by representing meaningful information in the same domain as the problem at hand. Universum data have proven effective in improving learning performance in many tasks, such as classification and clustering. Inspired by this favorable performance, we address a novel semisupervised feature selection problem in this paper, called semisupervised feature selection with Universum, that can simultaneously exploit the unlabeled data and the Universum data. Experiments on several UCI data sets show that the proposed algorithms can achieve superior performance over conventional unsupervised and supervised methods.


Introduction
With the rapid accumulation of high-dimensional data such as financial time series, gene expression microarrays, and digital images, feature selection has become a significant preprocessing step for data mining and machine learning. In many real-world applications, feature selection has been very effective in reducing the dimension of the feature space, removing irrelevant and redundant features, and improving learning performance [1,2].
1.1. Background. Typically, according to whether supervised information is used, feature selection methods can be divided into two categories: unsupervised and supervised. Supervised feature selection methods evaluate feature relevance by the correlation between features and class labels, while unsupervised methods measure feature relevance by the capability of keeping certain properties of the data, for example, the locality or sparsity preserving ability [3,4]. In general, supervised feature selection methods need the label information of the training set. However, in some cases labeling is time consuming and expensive, and the amount of labeled data is often quite limited. Most conventional supervised feature selection methods cannot work in such situations. To deal with this problem, semisupervised feature selection is a common option when unlabeled data are available, since unlabeled data help model the distribution of the whole data set. The supervised information most commonly used in semisupervised feature selection is class labels.
In fact, besides class labels, there exist other forms of supervised information, for example, pairwise constraints or Universum data. The Universum learning concept was first proposed to improve binary classification with the help of Universum data, that is, data that do not belong to either target class but belong to the same domain as the classification problem at hand [5,6]. Universum data can carry additional valuable prior knowledge from the domain of the problem into the training process. Moreover, Universum-based learning can better model the whole data set, since Universum data stay in the same domain as the learning problem of concern, while unlabeled data may be too general and fall outside the domain. The proposal of Universum learning provided a new way to alleviate the problem of insufficient labeled samples.
Universum learning has attracted many scholars since it was proposed; it has been used for classification, clustering, and other machine learning scenarios and has obtained favorable improvements with the help of Universum data [5-12]. In [5], Universum was first proposed to enhance the performance of the support vector machine; that paper introduced USVM (Universum Support Vector Machine) to leverage the Universum by maximizing the number of observed contradictions. Experiments showed that USVM outperformed SVM. The results also confirmed that the Universum can be an important instrument for boosting performance, especially in the small sample size regime. In 2012, Universum was introduced to improve the classification performance of TSVM (Twin Support Vector Machine); the results demonstrate that Universum samples help improve the generalization ability of the model and that the Universum-based TSVM trains faster than USVM [7]. Universum learning was extended to dimensionality reduction by incorporating it into linear discriminant analysis; the proposed method, termed Universum linear discriminant analysis (ULDA), aims to find discriminant directions by maximizing the distance between the two target classes and simultaneously minimizing the distance between the Universum and the mean of the target classes [8]. A novel semisupervised classification problem, called semisupervised Universum, that can simultaneously utilize the labeled data, unlabeled data, and Universum data to improve classification performance was addressed in [9]. In 2012, Weston's principle of maximal contradiction on Universum data was extended to boosting; however, as pointed out in that paper, in some scenarios poorly generated Universum data may not help [12]. A maximum margin clustering method was proposed to model both target samples and Universum samples for document clustering, and the method performed substantially better than state-of-the-art methods in most cases [10]. Universum was also introduced to multiview learning and obtained satisfactory performance.

Mathematical Problems in Engineering
In terms of data acquisition, Universum data can be obtained relatively easily by several methods [5-7, 9]. The generation methods fall into three categories: $U_{\text{rest}}$ (samples that are not included in the learning task serve as Universum data), $U_{\text{mean}}$ (each Universum sample is generated by randomly selecting two samples from two different classes and combining them with a specific combination coefficient), and $U_{\text{gene}}$ (Universum samples are generated according to the statistics of the labeled and unlabeled samples). In addition, an algorithm for evaluating and selecting informative and useful Universum data was proposed in [13].
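The $U_{\text{mean}}$ strategy can be illustrated with a minimal sketch. The function name `make_umean`, its signature, the seeding, and the toy data are assumptions made for illustration; only the idea of combining two samples from different classes with a combination coefficient comes from the description above.

```python
import random


def make_umean(samples, labels, n_universum, coeff=0.5, seed=0):
    """Generate Universum points as combinations of cross-class sample pairs.

    `coeff` is the combination coefficient. The function name, signature,
    and seeding are illustrative assumptions, not part of the original method.
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))
    if len(classes) < 2:
        raise ValueError("need at least two classes")
    by_class = {c: [x for x, y in zip(samples, labels) if y == c] for c in classes}
    universum = []
    for _ in range(n_universum):
        c1, c2 = rng.sample(classes, 2)  # two different classes
        a = rng.choice(by_class[c1])
        b = rng.choice(by_class[c2])
        # Weighted combination of the two selected samples, feature by feature.
        universum.append([coeff * ai + (1.0 - coeff) * bi for ai, bi in zip(a, b)])
    return universum


# Toy two-class data (hypothetical values).
X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]]
y = [0, 0, 1, 1]
U = make_umean(X, y, n_universum=5)
```

With `coeff=0.5`, every generated point is the midpoint of one sample from each class, so it lies between the two classes and belongs to neither.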
1.2. Contributions and Novelty. However, to the best of our knowledge, there is little work on introducing Universum learning to feature selection. Inspired by the favorable performance of Universum data in guiding learning tasks, we address a novel semisupervised feature selection problem in this paper. The main contributions are as follows:
(i) A semisupervised feature selection technique with Universum is proposed, which can simultaneously exploit the unlabeled data and the Universum data.
(ii) We integrate Universum into Variance Score and introduce a Universum-based Variance Score algorithm that selects features with larger variances while maximizing the margins between Universum and target samples and minimizing the margins on the Universum.
(iii) An improved Laplacian Score named ULS is proposed to select features with stronger locality preserving ability while exploiting the supervised information provided by the Universum data.
(iv) We add the Universum data to the Sparsity Score in order to combine the sparse structure of the data with the prior knowledge encoded by the Universum data.
The three improved semisupervised methods inherit the merits of traditional unsupervised methods as well as the valuable prior information carried by the Universum data, and thus select more discriminative features. Experiments carried out on several UCI data sets validate the effectiveness of Universum in enhancing the performance of feature selection.

Traditional Feature Selection Algorithms
In this section, we briefly present several algorithms popularly used in feature selection, including Variance Score [14], Laplacian Score [15], Sparsity Score [16], and Fisher Score [14]. Among them, the first three are unsupervised, while the last one is supervised.
Variance Score uses the variance along a feature dimension to evaluate its representative power, and the features with the largest variance are selected. The score of the $r$th feature is computed as follows [14]:

\[ V_r = \frac{1}{n} \sum_{i=1}^{n} (f_{ri} - \mu_r)^2, \]

where $f_{ri}$ is the $r$th feature of the $i$th sample $\mathbf{x}_i$ and $\mu_r = \frac{1}{n} \sum_{i=1}^{n} f_{ri}$ is the mean of the $r$th feature, $i = 1, \ldots, n$, $r = 1, \ldots, m$.
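As a concrete illustration, Variance Score and the subsequent feature ranking can be sketched in a few lines of plain Python. The helper names `variance_scores` and `top_k_features` and the toy data are hypothetical.

```python
def variance_scores(samples):
    """Per-feature variance (normalized by the number of samples)."""
    n = len(samples)
    m = len(samples[0])
    scores = []
    for r in range(m):
        column = [x[r] for x in samples]
        mean_r = sum(column) / n
        scores.append(sum((v - mean_r) ** 2 for v in column) / n)
    return scores


def top_k_features(scores, k, largest=True):
    """Indices of the k best features; `largest` says whether high scores win."""
    order = sorted(range(len(scores)), key=lambda r: scores[r], reverse=largest)
    return order[:k]


# Toy data: three samples, three features (hypothetical values).
X = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.5], [5.0, 0.0, 3.0]]
scores = variance_scores(X)
selected = top_k_features(scores, k=2)
```

The constant second feature receives a score of zero and is therefore never selected ahead of the other two.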
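Laplacian Score aims to select features not only with larger variances but also with stronger locality preserving ability. It rests on the assumption that data with the same label are close to each other. The score of the $r$th feature, which should be minimized, is computed as follows [15]:

\[ L_r = \frac{\sum_{i,j} (f_{ri} - f_{rj})^2 Q_{ij}}{\sum_{i} (f_{ri} - \mu_r)^2 D_{ii}}. \]

Here $\mathbf{D}$ is a diagonal matrix with $D_{ii} = \sum_{j} Q_{ij}$, and $Q_{ij}$ is defined as follows:

\[ Q_{ij} = \begin{cases} \exp\left(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / t\right), & \text{if } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are neighbors}, \\ 0, & \text{otherwise}, \end{cases} \]

where $t$ is a constant to be set. In this paper, $t$ is set to its default value 1 according to [15].

Another unsupervised feature selection algorithm, Sparsity Score, is based on sparse representation; it aims to identify an optimal feature subset that is most useful in capturing the intrinsic sparse structure of the data. The objective function of Sparsity Score is defined as follows [16]:

\[ S_r = \sum_{i=1}^{n} \Bigl( f_{ri} - \sum_{j \neq i} \hat{s}_{ij} f_{rj} \Bigr)^2, \]

where $\hat{s}_i$ is the estimated value of $s_i$ and $s_{ij}$ indicates the contribution of the $j$th sample to the reconstruction of $\mathbf{x}_i$; thus $\mathbf{x}_i$ can be reconstructed by the other samples in the training data as

\[ \mathbf{x}_i = \sum_{j \neq i} s_{ij} \mathbf{x}_j. \]

An optimal feature for $\mathbf{x}_i$ implies the following equality:

\[ f_{ri} = \sum_{j \neq i} \hat{s}_{ij} f_{rj}. \]

With full class labels, the supervised Fisher Score prefers features with the best discriminant ability. The Fisher Score of the $r$th feature, $FS_r$, which should be maximized, is computed as follows [14]:

\[ FS_r = \frac{\sum_{c=1}^{C} n_c (\mu_r^c - \mu_r)^2}{\sum_{c=1}^{C} n_c (\sigma_r^c)^2}. \]

Here $C$ is the number of classes and $n_c$ is the number of samples in class $c$; $\mu_r^c$ and $(\sigma_r^c)^2$ denote the mean and variance of class $c$ corresponding to the $r$th feature.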
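A minimal sketch of Laplacian Score and Fisher Score follows, under stated assumptions: the neighbor graph is a symmetric k-nearest-neighbor graph with heat-kernel weights, the plain feature mean is used in the Laplacian Score denominator, and a zero denominator is mapped to infinity. The function names, the parameter `k`, and the toy data are hypothetical. Sparsity Score is left out because it needs an L1 solver for the reconstruction weights.

```python
import math


def laplacian_scores(samples, k=2, t=1.0):
    """Laplacian Score per feature (smaller is better).

    Assumes a symmetric k-nearest-neighbor graph with weights exp(-d^2 / t).
    """
    n, m = len(samples), len(samples[0])
    dist2 = [[sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))
              for j in range(n)] for i in range(n)]
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: dist2[i][j])[:k]
        for j in neighbours:
            # Treat i and j as neighbors if either is in the other's k-NN list.
            Q[i][j] = Q[j][i] = math.exp(-dist2[i][j] / t)
    D = [sum(row) for row in Q]  # degree of each sample in the graph
    scores = []
    for r in range(m):
        f = [x[r] for x in samples]
        mu = sum(f) / n
        num = sum((f[i] - f[j]) ** 2 * Q[i][j] for i in range(n) for j in range(n))
        den = sum((f[i] - mu) ** 2 * D[i] for i in range(n))
        scores.append(num / den if den > 0 else float("inf"))
    return scores


def fisher_scores(samples, labels):
    """Fisher Score per feature (larger is better)."""
    n, m = len(samples), len(samples[0])
    classes = sorted(set(labels))
    scores = []
    for r in range(m):
        f = [x[r] for x in samples]
        mu = sum(f) / n
        between = within = 0.0
        for c in classes:
            fc = [v for v, y in zip(f, labels) if y == c]
            mu_c = sum(fc) / len(fc)
            var_c = sum((v - mu_c) ** 2 for v in fc) / len(fc)
            between += len(fc) * (mu_c - mu) ** 2
            within += len(fc) * var_c
        scores.append(between / within if within > 0 else float("inf"))
    return scores


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 1.0], [0.2, 5.0], [1.0, 1.2], [1.2, 4.8]]
y = [0, 0, 1, 1]
ls = laplacian_scores(X, k=1, t=1.0)
fs = fisher_scores(X, y)
```

On this toy set the two criteria disagree: Fisher Score favors feature 0 because it uses the labels, while Laplacian Score favors feature 1 because the nearest neighbors in the full space have similar values on it.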

Semisupervised Feature Selection with Universum
Here we formulate Universum-guided feature selection as follows. Given a set of data samples $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]$, where $n$ is the number of samples, the Universum data are denoted as $\mathbf{U} = [\mathbf{x}^u_1, \mathbf{x}^u_2, \ldots, \mathbf{x}^u_q]$, where $q$ is the number of Universum samples.
Based on Variance Score, we are now in a position to derive a semisupervised Variance Score feature selection algorithm, called Variance Score with Universum (UVS). The intuition is that the Universum data can serve as supervised information, since they are known not to belong to any target class. UVS prefers features with larger variances that also maximize the margins between Universum and target samples while minimizing the margins on the Universum. The score of the $r$th feature under UVS, which should be maximized, is defined in (6), where $\alpha$ and $\beta$ are two scaling parameters that balance the contributions of the three terms. The first term of (6) constrains the selected features to maximize the margins between Universum and target samples. In contrast, the second term aims to minimize the margins among the Universum samples. The last term expresses the variance of the feature, which is equivalent to the Variance Score criterion.

We now give the formal representation of Laplacian Score with Universum (ULS). ULS selects features with stronger locality preserving ability while exploiting the supervised information provided by the Universum data. The objective function of ULS, which should be maximized, is defined in (7). The motivation of the first two terms of (7) is to use the Universum to enhance the performance of feature selection. The last term aims to improve the locality preserving ability of the selected features.
Similarly, we add the Universum data to the Sparsity Score and propose a semisupervised Sparsity Score with Universum, called USS. USS combines the sparse structure of the data with the prior knowledge encoded by the Universum data. The score of the $r$th feature under USS is defined in (8). The motivation of the first two terms of (8) is similar to that of (6) and (7). The third term is used to preserve the sparse structure of the data.
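To make the three-term structure concrete, the sketch below scores features with one possible additive combination: a target-to-Universum spread weighted by `alpha`, minus a within-Universum spread weighted by `beta`, plus the feature variance. This additive form, the averaging over pairs, the function name `uvs_scores`, and the toy data are all assumptions made for illustration; the sketch is not the published objective (6).

```python
def uvs_scores(samples, universum, alpha=1.0, beta=1.0):
    """Illustrative Universum-augmented variance score (larger is better).

    Assumed form, per feature r:
        alpha * mean_{i,j} (f_ri - u_rj)^2    (target-to-Universum margin)
      - beta  * mean_{j,k} (u_rj - u_rk)^2    (margin among Universum samples)
      + variance of feature r over the target samples
    This is a sketch under stated assumptions, not the published equation.
    """
    n, q, m = len(samples), len(universum), len(samples[0])
    scores = []
    for r in range(m):
        f = [x[r] for x in samples]
        u = [z[r] for z in universum]
        mu = sum(f) / n
        variance = sum((v - mu) ** 2 for v in f) / n
        target_to_universum = sum((a - b) ** 2 for a in f for b in u) / (n * q)
        within_universum = sum((a - b) ** 2 for a in u for b in u) / (q * q)
        scores.append(alpha * target_to_universum
                      - beta * within_universum
                      + variance)
    return scores


# Toy data (hypothetical values): feature 1 has tiny variance on the targets
# but separates them well from the Universum samples.
X = [[0.0, 0.0], [2.0, 0.1], [4.0, 0.0]]
U = [[2.0, 5.0], [2.0, 5.2]]
scores = uvs_scores(X, U, alpha=1.0, beta=1.0)
plain_variance = uvs_scores(X, U, alpha=0.0, beta=0.0)
```

Setting both parameters to zero recovers the plain Variance Score, which ranks feature 0 first; with the Universum terms switched on, feature 1 ranks first because it keeps the target samples far from the Universum.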
To sum up, the proposed three methods have clear advantages. First, they inherit the merits of traditional unsupervised methods as well as the valuable prior information carried by the Universum data, and thus select more discriminative features. Second, no label information is needed, which saves the considerable cost of labeling. In addition, Universum samples are easy to obtain, which gives the proposed methods great potential in applications where labeled data are rare. The disadvantage is that the proposed methods take more running time than the corresponding unsupervised methods, since the Universum constraint terms are added to the objective functions.

Experiments and Results Analysis
To evaluate the performance of the proposed algorithms, we apply them to five UCI data sets: ionosphere, sonar, wine, zoo, and vehicle. Summary statistics of the data sets can be found in Table 1.
The LibSVM package and a 10-fold cross validation strategy are adopted to perform classification and compute the average classification accuracy, respectively. Each classification experiment is repeated 10 times and the average accuracy is the final result. In this paper, we consider two ways to generate the Universum samples. For the two-class data sets, each Universum sample is generated by randomly selecting two samples from two different categories and combining them with a specific combination coefficient (here set to 0.5). The generated Universum plus 50% of the data set is used in the feature selection process, while the other 50% serves as the classification samples. For the multiclass data sets, each time one class of samples is treated as Universum; of the remaining samples of each data set, 50% is employed in the classification task, and the other 50% together with the selected Universum is used to select features. The proposed algorithms have two parameters, $\alpha$ and $\beta$. We search one parameter while fixing the other. The ranges of the parameters are set as follows: $\alpha \in \{0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100\}$ and $\beta \in \{0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100\}$.
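The multiclass split and the one-parameter-at-a-time search described above can be sketched as follows. The function names, the starting point of the search, the single sweep per parameter, and the toy objective are assumptions; in practice `evaluate` would return the cross-validated accuracy obtained with the features selected under the given parameter pair.

```python
import math
import random

# Candidate values shared by both scaling parameters.
GRID = [0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]


def split_with_class_as_universum(samples, labels, universum_class, seed=0):
    """Treat one class as Universum and halve the remaining samples."""
    rng = random.Random(seed)
    universum = [x for x, y in zip(samples, labels) if y == universum_class]
    rest = [(x, y) for x, y in zip(samples, labels) if y != universum_class]
    rng.shuffle(rest)
    half = len(rest) // 2
    # Returns: Universum, feature selection half, classification half.
    return universum, rest[:half], rest[half:]


def search_one_at_a_time(evaluate, grid=GRID, start=(1, 1)):
    """Sweep the first parameter with the second fixed, then the reverse."""
    alpha, beta = start
    alpha = max(grid, key=lambda a: evaluate(a, beta))  # beta held fixed
    beta = max(grid, key=lambda b: evaluate(alpha, b))  # alpha held fixed
    return alpha, beta


def toy_accuracy(alpha, beta):
    """Toy stand-in for cross-validated accuracy (not data from the paper)."""
    return -(abs(math.log10(alpha / 0.5)) + abs(math.log10(beta / 5)))


samples = [[float(i)] for i in range(6)]
labels = [0, 0, 1, 1, 2, 2]
universum, selection_half, classification_half = split_with_class_as_universum(
    samples, labels, universum_class=2)
best = search_one_at_a_time(toy_accuracy)
```

A single sweep per parameter needs 18 evaluations instead of the 81 of a full grid, at the cost of possibly missing an optimum that depends on both parameters jointly.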

Experimental Results.
To show how the accuracy changes with the parameters, Figure 1 plots the accuracy of UVS on the ionosphere data set for different parameter values. The accuracy first rises and then declines as $\alpha$ and $\beta$ increase. The accuracy peaks when $\alpha$ is around 1 and $\beta$ is around 10.
Tables 2 and 3 give the best accuracy of different algorithms on ionosphere, sonar, and wine. For ionosphere and sonar, we add 150 and 200 constructed Universum samples, respectively. For wine, we select each class in turn as Universum. The numbers in parentheses after the accuracy represent the optimal number of selected features. The averaged accuracy for different numbers of selected features and different classes of samples used as Universum on vehicle is summarized in Table 4. The three tables indicate that the improved algorithms perform significantly better than the unsupervised algorithms and even outperform the supervised Fisher Score. This verifies that Universum is very useful in feature selection.

We offer the following reasons. First, the proposed methods not only exploit the prior information carried by the Universum but also preserve the structure (such as the local or sparse structure) of the data, whereas the supervised Fisher Score only utilizes the label information of the data. Second, no single feature selection algorithm is best for all data sets; the most suitable method should be selected according to the properties of the data.
To investigate the effectiveness of Universum in guiding feature selection, we compare each original algorithm with its corresponding improved algorithm on zoo in Figure 2. As shown in the figure, the improved algorithms are significantly superior to the original algorithms in most cases. This again verifies the usefulness of Universum in feature selection.
Figure 3 plots the accuracy versus the number of selected features on zoo. Here we randomly select the 1st- and 6th-class samples as Universum. The proposed methods obtain satisfactory accuracy and reach their peaks with only a few features selected.
To uncover the influence of the Universum data on feature selection, we plot the distribution of the first six selected features and the feature score curves of different algorithms in Figures 4 and 5. Surprisingly, the selected feature distributions of the improved algorithms (UVS, ULS, and USS) are similar to each other and dissimilar to those of the original algorithms (VS, LS, SS, and FS). The underlying reason is that the prior knowledge encoded by the Universum reflects a certain kind of structure information of the data and dominates the feature selection process. The feature score curves of the improved algorithms are also similar to each other. This phenomenon reveals that the constraint information of the Universum plays a major role in feature selection, while the other constraint information (the sample variance, locality structure, or sparse structure) plays a secondary role. Meanwhile, the feature score curves of the original algorithms (VS, LS, SS, and FS) differ from one another, because they use different criterion functions to compute the feature scores.

Conclusions
In this paper, we address a novel semisupervised feature selection problem, called semisupervised feature selection with Universum. Three new score functions are presented to evaluate features based on Universum. The proposed algorithms inherit the merits of traditional unsupervised methods and exploit the valuable prior information carried by the Universum data, and thus select more discriminative features from high-dimensional data. The experiments on five UCI data sets demonstrate that the improved algorithms achieve accuracy similar to or higher than that of the supervised Fisher Score and significantly outperform the unsupervised methods. Finally, because in many real applications generating Universum data is much easier than obtaining labeled or unlabeled data, the improved algorithms have great potential in those applications.
In the future, how to generate more discriminative Universum data will be an interesting research direction. Our future work will also include Universum-motivated active learning and metric learning. In addition, how to quantitatively evaluate the effectiveness of Universum data remains to be solved.

Figure 1: Accuracy versus different values of parameters.

Figure 2: Accuracy versus different class of samples used as Universum.

Figure 3: Accuracy versus different numbers of selected features on zoo.

Figure 4: The distribution of the first six selected features.

Figure 5: The feature score curves of different algorithms.

Table 1: Statistics of the data sets.

Table 2: The best accuracy of different algorithms.

Table 3: The best accuracy on wine data set.

Table 4: Averaged accuracy on different numbers of selected features.