Unsupervised Optimal Discriminant Vector Based Feature Selection Method

An efficient unsupervised feature selection method based on the unsupervised optimal discriminant vector is developed to find the important features without using class labels. Features are ranked according to a feature importance measurement based on the unsupervised optimal discriminant vector in the following steps. First, the fuzzy Fisher criterion is adopted as the objective function to derive the optimal discriminant vector in the unsupervised pattern. Second, a feature importance measurement based on the elements of the unsupervised optimal discriminant vector is defined to determine the importance of each feature. The features with little importance are removed from the feature subset. Experiments on a UCI dataset and on fault diagnosis are carried out to show that the proposed method is efficient and delivers reliable results.


Introduction
Feature selection (FS) has become an active research topic in pattern recognition, machine learning, data mining, intelligent fault diagnosis, and related areas. It is performed to choose a subset of the original features by removing redundant and noisy features from high-dimensional datasets, in order to reduce computational cost, increase classification accuracy, and improve result comprehensibility.
In supervised FS algorithms, since class labels are available, various feature subsets are evaluated using some function of prediction accuracy so as to select only those features which are related to, or lead to, the decision classes of the data under consideration. There are numerous supervised feature selection methods [1][2][3][4][5][6][7], such as the Fisher criterion [1,2], Relief [3], and Relief-F [4].
However, for many existing datasets, class labels are often unknown or incomplete, because large amounts of data make it difficult for humans to manually label the category of each instance. Moreover, human labeling is expensive and subjective. This indicates the significance of unsupervised dimensionality reduction. Principal component analysis (PCA) [8] is often used in the unsupervised pattern. However, PCA creates new features, or principal components, which are functions of the original features, and it is difficult to obtain an intuitive understanding of the data from the new features alone. Some unsupervised feature selection methods [8][9][10][11][12][13][14] have been proposed, such as SUD [9]. SUD, a sequential backward selection algorithm that determines the relative importance of variables for Unsupervised Data, uses an entropy similarity measurement to determine the importance of features with respect to the underlying clusters.
It is well known that the famous Fisher criterion, which can derive the optimal discriminant vector, is commonly used to realize feature dimension reduction in the supervised pattern. In the unsupervised pattern, how to overcome the lack of class information to realize feature selection is a worthy topic.

An Overview of Optimal Discriminant Vector
The Fisher criterion is a discriminant criterion function that was first proposed by Fisher. It is based on the between-class scatter and the within-class scatter. By maximizing this criterion, one can obtain an optimal discriminant vector.
After the samples are projected onto this vector, the within-class scatter is minimized and the between-class scatter is maximized [15]. Given $c$ pattern classes $X^{(i)} = [x_1^{(i)}, x_2^{(i)}, \ldots, x_{n_i}^{(i)}]$, $i = 1, 2, \ldots, c$, in a pattern set containing $N$ $d$-dimensional patterns, where $n_i$ is the number of patterns in the $i$th class so that $N = n_1 + n_2 + \cdots + n_c$, the Fisher criterion is defined as follows:

$$J_F(\omega) = \frac{\omega^T S_b\, \omega}{\omega^T S_w\, \omega}, \quad (1)$$

where $S_b$ is the between-class scatter matrix, denoted by

$$S_b = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T, \quad (2)$$

and $S_w$ is the within-class scatter matrix, denoted by

$$S_w = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_j^{(i)} - m_i)(x_j^{(i)} - m_i)^T, \quad (3)$$

where $m_i$ denotes the mean of the $i$th class and $m$ denotes the mean of all patterns in the pattern set.
In order to seek an optimal discriminant vector $\omega$ by maximizing the Fisher criterion, the optimal discriminant vector $\omega^*$ can be obtained by solving the following eigensystem equation:

$$S_b W = S_w W \Lambda, \quad (4)$$

where $\Lambda$ is diagonal and consists of the corresponding eigenvalues. When the inverse of $S_w$ exists, $\omega^*$ can be obtained as the eigenvector of $S_w^{-1} S_b$ associated with its maximum eigenvalue.
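In the supervised case this computation takes only a few lines. The following NumPy sketch (function and variable names are ours, not from the paper) builds $S_b$ and $S_w$ from labeled data and extracts the leading eigenvector of $S_w^{-1} S_b$:

```python
import numpy as np

def fisher_discriminant(X, y):
    """Optimal discriminant vector maximizing (w^T Sb w) / (w^T Sw w)."""
    N, d = X.shape
    m = X.mean(axis=0)                       # mean of all patterns
    Sb = np.zeros((d, d))                    # between-class scatter
    Sw = np.zeros((d, d))                    # within-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                 # mean of the c-th class
        Sb += len(Xc) * np.outer(mc - m, mc - m)
        Sw += (Xc - mc).T @ (Xc - mc)
    # Eigenvector of Sw^{-1} Sb associated with the largest eigenvalue.
    vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    w = np.real(vecs[:, np.argmax(np.real(vals))])
    return w / np.linalg.norm(w)
```

For two classes separated along the first feature, the returned vector concentrates its weight on that feature, as the criterion predicts.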

Unsupervised Optimal Discriminant Vector Based Feature Selection Method
The Fisher criterion mentioned above can only be used in the supervised pattern. This means that the traditional optimal discriminant vector cannot be calculated directly from unlabeled samples. Cao et al. [16] introduce fuzzy theory into the Fisher criterion and define the fuzzy Fisher criterion. Maximizing this criterion can not only realize clustering but also obtain the optimal discriminant vector. Suppose that the membership function $\mu_{ij} \in [0, 1]$ satisfies $\sum_{i=1}^{c} \mu_{ij} = 1$ for all $j$, and that the fuzzy index $m > 1$ is a given real value, where $\mu_{ij}$ denotes the degree of the $j$th $d$-dimensional pattern belonging to the $i$th class. We can define the following fuzzy within-class scatter matrix $S_{fw}$:

$$S_{fw} = \sum_{i=1}^{c} \sum_{j=1}^{N} \mu_{ij}^{m} (x_j - m_i)(x_j - m_i)^T, \quad (5)$$

and the following fuzzy between-class scatter matrix $S_{fb}$:

$$S_{fb} = \sum_{i=1}^{c} \sum_{j=1}^{N} \mu_{ij}^{m} (m_i - m)(m_i - m)^T. \quad (6)$$

Thus, we can derive the fuzzy Fisher criterion as follows:

$$J_{FFC} = \frac{\omega^T S_{fb}\, \omega}{\omega^T S_{fw}\, \omega}. \quad (7)$$

It is obvious that maximizing $J_{FFC}$ directly in (7) is not a trivial task due to the existence of its denominator. However, we can reasonably relax this problem by applying the Lagrange multipliers $\lambda$ and $\beta_j$ ($j = 1, 2, \ldots, N$) together with the constraint $\sum_{i=1}^{c} \mu_{ij} = 1$ to (7):

$$L = \omega^T S_{fb}\, \omega - \lambda\, \omega^T S_{fw}\, \omega + \sum_{j=1}^{N} \beta_j \Big( \sum_{i=1}^{c} \mu_{ij} - 1 \Big). \quad (8)$$

Setting $\partial L / \partial \omega$ to zero, we have

$$S_{fb}\, \omega = \lambda\, S_{fw}\, \omega, \quad (9)$$

where $\omega$ is the eigenvector belonging to the largest eigenvalue $\lambda$ of $S_{fw}^{-1} S_{fb}$. Setting $\partial L / \partial m_i$ to zero, we have

$$m_i = \frac{\sum_{j=1}^{N} \mu_{ij}^{m} (\lambda x_j - m)}{(\lambda - 1) \sum_{j=1}^{N} \mu_{ij}^{m}}. \quad (10)$$

Here, $m_i$ is a local maximum of $L$ [17], as proved in the Appendix.
Setting $\partial L / \partial \mu_{ij}$ to zero, we have

$$\mu_{ij} = \frac{(1/d_{ij})^{1/(m-1)}}{\sum_{k=1}^{c} (1/d_{kj})^{1/(m-1)}}, \quad d_{ij} = \lambda\, \omega^T (x_j - m_i)(x_j - m_i)^T \omega - \omega^T (m_i - m)(m_i - m)^T \omega. \quad (11)$$

When (11) is used, as stated previously, $\mu_{ij}$ should satisfy $\mu_{ij} \in [0, 1]$; hence, in order to satisfy this constraint, we let $\mu_{ij} = 1$ and $\mu_{kj} = 0$ for all $k \neq i$ if

$$\lambda\, \omega^T (x_j - m_i)(x_j - m_i)^T \omega \le \omega^T (m_i - m)(m_i - m)^T \omega. \quad (12)$$

With the above discussion, we can obtain the optimal discriminant vector $\omega$ in the unsupervised pattern and then perform feature selection based on $\omega$. Now, let us illustrate this with the following experiment on a 2-dimensional artificial dataset.
Figure 1 contains 168 2-dimensional samples. By maximizing the fuzzy Fisher criterion, we obtain the 2-class clustering result, shown as red points and blue points, respectively, and also the vector $\omega = (\omega_1, \omega_2)^T = (0.4562, -0.8899)^T$, shown as a line in Figure 2. We project all samples onto the $x$-axis and the $y$-axis. It is obvious that the projected points on the $x$-axis from different classes overlap, while those on the $y$-axis are well separated. This means that the $y$ feature is more important than the $x$ feature for leading to the decision classes. This is consistent with $|\omega_2| > |\omega_1|$, which suggests that the vector $\omega$ can be applied to feature selection.
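Conceptually, the derivation above alternates between solving the eigenproblem (9) and updating the means and memberships. As a rough illustration only, the following sketch replaces the exact updates (10)-(12) with the standard fuzzy-c-means update rules, so it approximates the clustering step rather than reproducing the authors' algorithm; all names and the initialization are ours:

```python
import numpy as np

def fuzzy_fisher_vector(X, c=2, m=2.0, iters=50):
    """Simplified alternating sketch: fuzzy-c-means clustering, then the
    discriminant vector from the fuzzy scatter matrices (5), (6), (9)."""
    N, d = X.shape
    gm = X.mean(axis=0)                                  # global mean
    # Crude deterministic initialization: spread c means from min to max corner.
    t = np.linspace(0.0, 1.0, c)[:, None]
    means = (1 - t) * X.min(axis=0) + t * X.max(axis=0)
    for _ in range(iters):
        # Fuzzy-c-means membership update (NOT the paper's update (11)-(12)).
        dist = np.array([np.linalg.norm(X - means[i], axis=1)
                         for i in range(c)]) + 1e-12
        U = dist ** (-2.0 / (m - 1.0))
        U /= U.sum(axis=0)                               # columns sum to 1
        Um = U ** m
        # Fuzzy-c-means mean update (NOT the paper's update (10)).
        means = np.array([Um[i] @ X / Um[i].sum() for i in range(c)])
    # Fuzzy scatter matrices from the final memberships and means.
    Um = U ** m
    Sfw = np.zeros((d, d))
    Sfb = np.zeros((d, d))
    for i in range(c):
        diff = X - means[i]
        Sfw += (Um[i, :, None] * diff).T @ diff          # fuzzy within-class
        Sfb += Um[i].sum() * np.outer(means[i] - gm, means[i] - gm)
    # Leading eigenvector of Sfw^{-1} Sfb, as in (9); small ridge for stability.
    vals, vecs = np.linalg.eig(np.linalg.inv(Sfw + 1e-8 * np.eye(d)) @ Sfb)
    w = np.real(vecs[:, np.argmax(np.real(vals))])
    return w / np.linalg.norm(w), U
```

On two tight clusters separated along the second coordinate, the returned vector concentrates on that coordinate, matching the behavior described for the artificial dataset.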
Suppose $\omega = (\omega_1, \omega_2, \ldots, \omega_d)^T$; we define $\alpha_k$ as the single feature importance measurement for comparison:

$$\alpha_k = \frac{|\omega_k|}{\sum_{i=1}^{d} |\omega_i|}. \quad (13)$$

For the above artificial dataset, $\alpha_1 = 0.3389$ is the importance measurement of the $x$ feature and $\alpha_2 = 0.6611$ is the importance measurement of the $y$ feature.
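This measurement, the absolute value of each component of $\omega$ normalized by their sum, is a one-liner; the snippet below (our naming) reproduces the values reported for the artificial dataset:

```python
import numpy as np

def feature_importance(w):
    """Single feature importance: normalized absolute components of omega."""
    a = np.abs(np.asarray(w, dtype=float))
    return a / a.sum()

# The discriminant vector from the artificial dataset above.
alpha = feature_importance([0.4562, -0.8899])
```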
The proposed method is summarized as follows.

Step 1. Initialize the memberships $\mu_{ij}$ subject to $\sum_{i=1}^{c} \mu_{ij} = 1$, and set the fuzzy index $m$, the stopping threshold $\varepsilon$, and the maximum number of iterations $T$.

Step 2. Compute the fuzzy between-class scatter matrix $S_{fb}$ and the fuzzy within-class scatter matrix $S_{fw}$.

Step 3. Compute the largest eigenvalue $\lambda$ and the corresponding $\omega$ using (9).

Step 4. Update the class means $m_i$ using (10).

Step 5. Update the memberships $\mu_{ij}$ using (11) and (12).

Step 6. If $\Delta J_{FFC} < \varepsilon$ or the number of iterations $\ge T$, go to Step 7; otherwise go to Step 2.

Step 7. Compute the feature importance measurements, normalized as $\alpha_k$, and sort the $\alpha_k$ in descending order.

Step 8. Set the feature importance threshold $\eta$.

Step 9. Find the feature subset size $r$, the minimum number for which $\sum_{k=1}^{r} \alpha_k$ is no less than the threshold $\eta$.

Step 10. Choose the $r$ features corresponding to the sorted $\alpha_k$ in descending order as the selected features, and then terminate.
A different feature importance threshold $\eta$ leads to a different feature subset size. In Step 7 of the proposed method, the features have already been sorted in descending order. If the feature subset size $r$ is given from the start, we can simply select the first $r$ features. But if $r$ is not given, we can use $\eta$ to determine the feature subset size: the bigger $\eta$ is, the larger $r$ is. The recommended range of $\eta$ is from 0.8 to 0.95.
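The rule in Steps 9 and 10 — pick the smallest $r$ whose top-$r$ importance sum reaches $\eta$ — is a cumulative-sum search. A sketch with our own names, assuming the importances sum to 1:

```python
import numpy as np

def select_features(importances, eta=0.9):
    """Smallest r such that the top-r importances sum to at least eta.

    Returns the indices of the selected features, most important first.
    """
    order = np.argsort(importances)[::-1]        # descending importance
    csum = np.cumsum(np.asarray(importances)[order])
    r = int(np.searchsorted(csum, eta) + 1)      # first index with csum >= eta
    return order[:r]
```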

Feature Selection on UCI Dataset Wine.
In this experiment, the benchmark UCI dataset Wine [18] was chosen to test the feature selection effectiveness of SUD, Relief-F, and our method. We use the following Rand index [19] to evaluate the clustering performance of the dimension-reduced data:

$$\mathrm{Rand}(C_1, C_2) = \frac{a + b}{N(N-1)/2}, \quad (14)$$

where $C_1$ and $C_2$ denote two clustering results, $a$ is the number of pattern pairs assigned to the same cluster in both $C_1$ and $C_2$, $b$ is the number of pattern pairs assigned to different clusters in both $C_1$ and $C_2$, and $N$ is the number of patterns. Table 2 lists the importance measurement of every feature computed by the proposed method. With the threshold $\eta$, 6 features are selected from the original features. Figure 3 shows the Rand index values corresponding to the number of features using SUD, Relief-F, and our method.
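Counting agreeing pairs directly gives the Rand index; a small self-contained sketch (ours, not the paper's code), assuming integer cluster labels:

```python
from itertools import combinations

def rand_index(c1, c2):
    """Fraction of pattern pairs on which two clusterings agree."""
    # A pair agrees if it is together in both clusterings or apart in both.
    agree = sum((c1[i] == c1[j]) == (c2[i] == c2[j])
                for i, j in combinations(range(len(c1)), 2))
    n_pairs = len(c1) * (len(c1) - 1) // 2
    return agree / n_pairs
```

Identical clusterings (up to label renaming) score 1.0, and the score falls toward 0 as the two partitions disagree on more pairs.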
From Figure 3, we can easily find that the features selected by the proposed method yield the best clustering result among the three algorithms.


Feature Selection for Fault Diagnosis

In this experiment, the steel plates dataset (Table 3) is used, and the samples of the chosen fault classes serve as the testing dataset. The parameters for the proposed method are set as in the previous experiment. Table 4 lists the importance measurement of every feature computed by the proposed method. With the threshold $\eta$, 11 features are selected from the original features. Figure 4 shows the Rand index values corresponding to the number of features using SUD, Relief-F, and our method.

Figure 4 shows that the proposed method is able to find the important features. It also shows that the performance of the proposed method without using class labels is very close to, and sometimes better than, that of SUD or Relief-F, which rank the original features using the class labels.

Conclusions
An efficient unsupervised feature selection method based on the unsupervised optimal discriminant vector is developed to find the important features without using class labels. It adopts the fuzzy Fisher criterion to derive the optimal discriminant vector in the unsupervised pattern. It defines the single feature importance measurement based on the unsupervised optimal discriminant vector to determine the importance of every feature. Two experiments, on the Wine dataset and on fault diagnosis, were carried out to show that the proposed method is able to find important features and is a reliable and efficient feature selection methodology compared to SUD and Relief-F. In the future, we will research how to introduce kernel techniques into the proposed method to enhance its applicability.

Figure 3: Rand index values corresponding to the number of features.

Figure 4: Rand index values corresponding to the number of features.

Table 1: Class distribution and features of the Wine dataset.

$C_1$ and $C_2$ denote the clustering results for the original dataset without noise and for the corresponding noisy dataset, $a$ denotes the number of pairs of patterns in the original dataset that belong to the same cluster in both $C_1$ and $C_2$, $b$ denotes the number of pairs of patterns that belong to two different clusters in both $C_1$ and $C_2$, and $N$ is the number of all patterns in the original dataset. Obviously, $\mathrm{Rand}(C_1, C_2) \in [0, 1]$, and $\mathrm{Rand}(C_1, C_2) = 1$ when $C_1$ is the same as $C_2$. The smaller $\mathrm{Rand}(C_1, C_2)$ is, the bigger the difference between $C_1$ and $C_2$, and the less robust the corresponding algorithm is in this case. Table 1 illustrates the basic information of the dataset. We choose the 130 samples which belong to class 1 and class 2 as the testing dataset. The parameters for the proposed method are set as follows:

Table 2: The feature importance measurement of the Wine dataset.

Table 3: Class distribution and features of the steel plates dataset.

Table 4: The feature importance measurement of the steel plates dataset.