Robust Feature Selection from Microarray Data Based on Cooperative Game Theory and Qualitative Mutual Information

The high dimensionality of microarray data sets may lead to low efficiency and overfitting. In this paper, a multiphase cooperative game theoretic feature selection approach is proposed for microarray data classification. In the first phase, because of the high dimensionality of microarray data sets, the features are reduced using one of two filter-based feature selection methods, namely, mutual information and Fisher ratio. In the second phase, the Shapley index is used to evaluate the power of each feature. The main innovation of the proposed approach is to employ Qualitative Mutual Information (QMI) for this purpose. QMI makes the selected features more stable, and this stability helps to deal with data imbalance and scarcity. In the third phase, a forward selection scheme is applied which uses a scoring function to weight each feature. The performance of the proposed method is compared with other popular feature selection algorithms such as Fisher ratio, minimum redundancy maximum relevance, and previous works on cooperative game based feature selection. The results on eleven microarray data sets show that the proposed method improves both average classification accuracy and average stability compared to the other approaches.


Introduction
Feature selection picks the features of a data set that are effective for predicting the target class. Eliminating superfluous features considerably increases the efficiency of learning models. Genetic data sets have high dimensionality and few samples and are usually imbalanced. High dimensionality increases classification complexity and can reduce classification accuracy. The small size of the data sets is another challenge [1]. Robustness is often ignored in feature selection: in a nonrobust feature selection algorithm, adding or removing training samples leads to different results [2].
Feature selection methods are classified as filter methods, wrapper methods, and embedded methods [3]. Filter methods are independent of the learning algorithm. They are statistical tests which rely on basic properties of the training data and have much lower computational complexity than wrapper methods [4]. These methods rely on measures such as distance measures [5,6], rough set theory [7], and information theoretic measures [8]. A common distance measure is the Euclidean distance, which Kira and Rendell [5] applied in the Relief method to assign a weight to each feature. The Relief algorithm was then extended in [6] to deal with multiclass problems. Peng et al. [9] proposed the minimum redundancy maximum relevance (mRMR) approach, which selects features that have the highest relevance to the target class and are minimally redundant. Wang et al. [10] proposed a filtering algorithm based on the maximum weight minimum redundancy (MWMR) criterion, in which the weight of a feature shows its importance and redundancy represents the correlation between features. Nguyen et al. [11] proposed a hierarchical approach called Modified Analytic Hierarchy Process (MAHP), which uses five individual gene ranking methods, namely, t-test, entropy, Receiver Operating Characteristics (ROC) curve, Wilcoxon test, and Signal to Noise Ratio (SNR), for feature ranking and selection.
In wrapper methods, each feature subset is evaluated using a specific learning algorithm, and the features that most improve classification performance are selected [3]. For example, Inza et al. [12] applied classical wrapper algorithms (i.e., sequential forward and backward selection, floating selection, and best-first search) and evaluated them on three microarray data sets. Another wrapper approach for gene selection from microarray data, based on modified ant colony optimization, is proposed in [13]. A further heuristic wrapper approach, based on Genetic Bee Colony (GBC) optimization, is proposed in [14]. Further references on wrapper methods can be found in [15][16][17][18].
Like wrapper methods, embedded feature selection methods are classifier dependent, but this dependence is stronger in embedded approaches. Guyon et al. [19] proposed Support Vector Machine based Recursive Feature Elimination (SVM-RFE) for feature selection in cancer classification. Canul-Reich et al. [20] introduced the Iterative Feature Perturbation (IFP) method, an embedded gene selector, and applied it to four microarray data sets.
Robust feature selection is another line of research. Yang and Mao [2] tried to improve the robustness of feature selection algorithms with an ensemble method called Multicriterion Fusion-Based Recursive Feature Elimination (MCF-RFE). Also, Yassi and Moattar [1] presented robust and stable feature selection by integrating ranking methods and a wrapper technique for genetic data classification.
The authors of [3,21] tried to overcome the weaknesses of feature selection methods using cooperative game theory. They introduced a framework based on cooperative game theory to evaluate the power of each feature: [3] used the Banzhaf power index to evaluate the weight of each feature, while [21] used the Shapley value index. Reference [22] presented a feature selection approach called Neighborhood Entropy-Based Cooperative Game Theory (NECGT), which is based on information theoretic approaches. Evaluation on several UCI data sets showed that this approach yields better accuracy than classical methods.
Stability and robustness are especially important when the data is scarce and classes are imbalanced. In this paper, cooperative game theory is used for robust feature selection. In the proposed method, a cooperative game theory framework based on the Shapley value is introduced for evaluating the power of each feature. To score the features in the Shapley value index, we propose the Qualitative Mutual Information criterion, which achieves more stable results even in the presence of class imbalance and data scarcity. In this criterion, we use the Fisher ratio as the utility function for calculating Qualitative Mutual Information, which improves the robustness of the feature selection algorithm. The rest of this paper is organized as follows. Section 2 introduces the fundamental materials and methods. In Section 3, our proposed method is described. In Section 4, the simulation results and evaluation of the proposed method are discussed. Finally, conclusions and future work are described in Section 5.

Mutual Information, Qualitative Mutual Information, and Conditional Mutual Information
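The entropy, $H(X)$, of a discrete random variable $X$ with $p(x) = \Pr(X = x)$ as its probability mass function is defined as follows:

$$H(X) = -\sum_{x} p(x) \log p(x). \tag{1}$$

Also, the mutual information (MI) between two random variables $X$ and $Y$ with joint probability distribution $\Pr(x, y)$ is formulated as follows [22]:

$$\mathrm{MI}(X; Y) = \sum_{x} \sum_{y} \Pr(x, y) \log \frac{\Pr(x, y)}{\Pr(x)\Pr(y)}. \tag{2}$$

Qualitative Mutual Information (QMI) is defined by multiplying each term of the mutual information formula by a utility function $U(x, y)$ [23,24]:

$$\mathrm{QMI}(X; Y) = \sum_{x} \sum_{y} U(x, y) \Pr(x, y) \log \frac{\Pr(x, y)}{\Pr(x)\Pr(y)}. \tag{3}$$

Different informative functions can be applied as the utility function. In the proposed approach, we use the Fisher ratio [2]. Also, Conditional Mutual Information (CMI), written $\mathrm{CMI}(X; Y \mid Z)$, denotes the amount of information shared between $X$ and $Y$ when $Z$ is given and is formulated as follows [21]:

$$\mathrm{CMI}(X; Y \mid Z) = \sum_{x} \sum_{y} \sum_{z} \Pr(x, y, z) \log \frac{\Pr(x, y \mid z)}{\Pr(x \mid z)\Pr(y \mid z)}. \tag{4}$$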
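The four quantities above can be estimated from discrete samples by counting. The following is a minimal sketch rather than the authors' implementation: the function names are illustrative, logarithms are taken base 2, and the utility is passed as a callable on value pairs. How the Fisher ratio is mapped onto the utility is not spelled out here, so the constant default utility is an assumption.

```python
import math
from collections import Counter


def entropy(xs):
    """Shannon entropy H(X) in bits, estimated from a discrete sample."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def qualitative_mutual_information(xs, ys, utility=lambda x, y: 1.0):
    """QMI(X;Y) = sum U(x,y) Pr(x,y) log2(Pr(x,y) / (Pr(x) Pr(y))).

    With the default utility U = 1 this reduces to ordinary MI.
    """
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    total = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        ratio = p_xy / ((px[x] / n) * (py[y] / n))
        total += utility(x, y) * p_xy * math.log2(ratio)
    return total


def mutual_information(xs, ys):
    """MI(X;Y), obtained as QMI with a constant utility of 1."""
    return qualitative_mutual_information(xs, ys)


def conditional_mutual_information(xs, ys, zs):
    """CMI(X;Y|Z), using Pr(x,y|z) / (Pr(x|z) Pr(y|z))
    = Pr(z) Pr(x,y,z) / (Pr(x,z) Pr(y,z)); sample sizes cancel in the ratio."""
    n = len(xs)
    cz = Counter(zs)
    cxz, cyz = Counter(zip(xs, zs)), Counter(zip(ys, zs))
    cxyz = Counter(zip(xs, ys, zs))
    total = 0.0
    for (x, y, z), c in cxyz.items():
        total += (c / n) * math.log2((cz[z] * c) / (cxz[(x, z)] * cyz[(y, z)]))
    return total
```

For an XOR-like target, for example, MI between one input and the target is zero while CMI given the other input is one bit, which is the kind of interdependence the later phases try to capture.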

Proposed Approach
The proposed method consists of three phases, which are described as follows. Figure 1 depicts the flowchart of the proposed feature selection algorithm.

Filter Approaches for Dimension Reduction.
Because of the high dimensionality of microarray data, in the first phase of the proposed method the features are reduced using one of two filter-based feature selection methods, namely, mutual information and Fisher ratio. Fisher's ratio is an individual feature evaluation criterion. It measures the discriminative power of a feature by the ratio of interclass to intraclass variability. This relationship is as follows [2]:

$$\mathrm{FR}(j) = \frac{(\hat{\mu}_{j1} - \hat{\mu}_{j2})^2}{\sigma_{j1}^2 + \sigma_{j2}^2}, \tag{5}$$

where $\hat{\mu}_{jc}$ is the sample mean of feature $j$ in class $c$ and $\sigma_{jc}^2$ is the variance of feature $j$ in class $c$.
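As an illustration of this filtering phase, the sketch below scores each feature with a two-class Fisher ratio and keeps the top-ranked ones. It is a hedged example, not the paper's code: the function names are hypothetical, the sample variance (n - 1 denominator) is assumed, and only the two-class case is covered.

```python
from statistics import mean, variance


def fisher_ratio(column, labels):
    """Two-class Fisher ratio of one feature:
    (mean_1 - mean_2)^2 / (var_1 + var_2).

    Assumes exactly two classes with at least two samples each.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch covers the two-class case only")
    a = [v for v, c in zip(column, labels) if c == classes[0]]
    b = [v for v, c in zip(column, labels) if c == classes[1]]
    numerator = (mean(a) - mean(b)) ** 2
    denominator = variance(a) + variance(b)
    if denominator == 0:
        # Degenerate case: no within-class spread at all.
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


def top_k_features(rows, labels, k):
    """Indices of the k features (columns of `rows`) with the largest ratio."""
    columns = list(zip(*rows))
    scores = [fisher_ratio(col, labels) for col in columns]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]
```

In the setting described here, `k` would be on the order of a few hundred, so that the later coalition-based evaluation stays tractable.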

Feature Evaluation.
In large data sets, there are intrinsic correlations among features. However, most filter-based feature selection algorithms discard redundant features that are highly correlated with the selected ones [21]. Each feature is weighted using information theoretic measures such as CMI and QMI [21]. Theoretically, the more relevant a feature is, the more information it shares with the target class [21,25]. Feature $f_i$ is said to be redundant with feature $f_j$ if the following condition holds [21]:

$$\mathrm{CMI}(f_i; \text{class} \mid f_j) < \mathrm{QMI}(f_i; \text{class}). \tag{6}$$

Two features $f_i$ and $f_j$ are interdependent if the following condition holds [21,26]:

$$\mathrm{CMI}(f_i; \text{class} \mid f_j) > \mathrm{QMI}(f_i; \text{class}). \tag{7}$$

The relevance criterion introduced by Peng et al. [9] measures the relevance of a feature set $K$ to the target class; we have

$$I(K; \text{class}) = \frac{1}{|K|} \sum_{f_j \in K} \mathrm{QMI}(f_j; \text{class}). \tag{8}$$

The change in the relevance of the feature set $K$ to the target class when a feature $f_i$ ($f_i \notin K$) is taken into account, introduced by Jain et al. [27], is measured by the following formula:

$$I(K; \text{class} \mid f_i) = \frac{1}{|K|} \sum_{f_j \in K} \mathrm{CMI}(f_j; \text{class} \mid f_i). \tag{9}$$

Feature Evaluation via Shapley Value.
In the second phase of the proposed method, after filtering the features, the weight of each feature is obtained using the Shapley value [28,29]. In this work, we used QMI in all our formulations. The Shapley value is an efficient way of estimating the importance (weight) of features. The Shapley value of the $i$th feature is denoted by $\Phi_i(v)$ and is formulated as follows:

$$\Phi_i(v) = \sum_{K \subseteq N \setminus \{i\}} \frac{|K|!\,(n - |K| - 1)!}{n!}\, \Delta_i(K), \tag{10}$$

where $N$ is the set of all features, $n$ denotes the total number of features, and $\Delta_i(K)$ is defined as follows:

$$\Delta_i(K) = \begin{cases} 1, & I(K; \text{class} \mid f_i) \ge I(K; \text{class}) \ \text{and} \ \sum_{f_j \in K} \psi(i, j) \ge \dfrac{|K|}{2}, \\ 0, & \text{otherwise}. \end{cases} \tag{11}$$

Here $I(K; \text{class})$ and $I(K; \text{class} \mid f_i)$ are the relevance of $K$ to the target class and its relevance given $f_i$, as in (8) and (9), and $\psi(i, j)$ denotes the interdependence index, which is defined as follows:

$$\psi(i, j) = \begin{cases} 1, & \mathrm{CMI}(f_j; \text{class} \mid f_i) > \mathrm{QMI}(f_j; \text{class}), \\ 0, & \text{otherwise}. \end{cases} \tag{12}$$

This means that feature $f_i$ is appropriate if it increases the relevance of subset $K$ to the target class and is interdependent with at least half of the features in $K$ [21]. Figure 2 shows the flowchart of feature evaluation. The output is a weight vector whose $i$th entry is the Shapley value of feature $f_i$. In all the above relationships, we used the QMI criterion to make the selected features more robust. In step 2 of the proposed algorithm, for each feature $f_i$, we must enumerate the subsets of features that do not contain $f_i$. So at first we calculate all two member subsets of features. Then, using (9), the Conditional Mutual Information measure between set $K$ and the target class is calculated. Then, we calculate the Qualitative Mutual Information measure between each feature and the class, namely, $\mathrm{QMI}(f_j; \text{class})$, and use these two measures in the interdependence index $\psi(i, j)$. Finally, the Shapley value (weight) of each feature is computed according to (10).

Figure 2: Flowchart of feature evaluation. Input: a microarray data set. Recoverable steps: Step 2, for each feature i, create all coalitions; Step 3, a per-coalition calculation; Step 4, calculate the Shapley value; Step 5, normalize the weight vector.
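The evaluation phase described above can be sketched as follows. This is an illustrative reading, not the authors' code. It assumes the QMI and CMI values have already been estimated and are passed in as `qmi[j]` for QMI(f_j; class) and `cmi[j][i]` for CMI(f_j; class | f_i). The coalition size cap `max_size` and the normalization to unit sum are assumptions introduced here to keep the enumeration tractable.

```python
from itertools import combinations
from math import factorial


def interdependence(i, j, qmi, cmi):
    """1 if knowing f_i raises the information f_j carries about the class."""
    return 1 if cmi[j][i] > qmi[j] else 0


def marginal_contribution(i, coalition, qmi, cmi):
    """1 if f_i does not lower the relevance of the coalition and is
    interdependent with at least half of its members, else 0."""
    k = len(coalition)
    relevance = sum(qmi[j] for j in coalition) / k
    relevance_given_i = sum(cmi[j][i] for j in coalition) / k
    votes = sum(interdependence(i, j, qmi, cmi) for j in coalition)
    return 1 if relevance_given_i >= relevance and votes >= k / 2 else 0


def shapley_weights(qmi, cmi, max_size=2):
    """Shapley-style weight per feature over coalitions of size 1..max_size.

    Empty coalitions are skipped because their relevance is undefined.
    """
    n = len(qmi)
    weights = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(1, min(max_size, n - 1) + 1):
            # Shapley coefficient |K|! (n - |K| - 1)! / n!
            coeff = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                phi += coeff * marginal_contribution(i, coalition, qmi, cmi)
        weights.append(phi)
    total = sum(weights)
    # Assumed normalization: scale the weights so that they sum to 1.
    return [w / total for w in weights] if total > 0 else weights
```

Capping the coalition size trades exactness for cost: the full Shapley sum ranges over 2^(n-1) coalitions per feature, which is infeasible even after filtering down to a few hundred features.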

Final Feature Selection Phase.
In the third and final phase of the proposed method, feature selection is performed using the weight of each feature. That is, we use the weights to reevaluate the features, and the features with the highest weights are retained. A straightforward optimization can employ any information criterion, such as BIF [26], SU [30], or mRMR [9]. In this paper, we used the SU criterion, which is obtained as follows:

$$\mathrm{SU}(f, \text{class}) = \frac{2\,\mathrm{MI}(f; \text{class})}{H(f) + H(\text{class})}, \tag{13}$$

where $H$ indicates the entropy and MI is the mutual information between feature $f$ and the class. The flowchart of this phase is shown in Figure 3.
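A small sketch of the symmetrical uncertainty (SU) criterion for discrete data is given below. It is illustrative only: the helper names are not from the paper, logarithms are base 2, and MI is computed through the identity MI(X;Y) = H(X) + H(Y) - H(X,Y).

```python
import math
from collections import Counter


def entropy(xs):
    """Shannon entropy in bits of a discrete sample."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def mutual_information(xs, ys):
    """MI(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def symmetrical_uncertainty(feature, labels):
    """SU = 2 MI(f; class) / (H(f) + H(class)), a value in [0, 1]."""
    denominator = entropy(feature) + entropy(labels)
    if denominator == 0:
        return 0.0  # both variables are constant
    return 2.0 * mutual_information(feature, labels) / denominator
```

SU equals 1 when the feature determines the class and vice versa, and 0 when the two are independent, which makes scores comparable across features with different numbers of discrete levels.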
To select the optimal feature in each iteration, the victory criterion $V(f)$ is defined to evaluate the superiority of each feature over the others. The criterion function $g(f)$ (i.e., SU) is used to select features by relevance or redundancy analysis. The weight $w(f)$, which denotes the impact of feature $f$ on the whole feature space, is used to regulate the relative importance of the evaluation value $g(f)$ in feature selection. Finally, we choose the features with the largest victory measure.

Figure 3: Flowchart of the final feature selection phase. Input: a microarray data set. Step 1: initialize parameters (S = ∅, k). Step 2: calculate criterion g(f_i). Step 3: calculate victory criterion. Choose feature f_i with the largest victory criterion. Output: selected feature subset S.
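The forward selection loop can be sketched as below. The exact form of the victory criterion is not preserved in this text, so the product of weight and criterion value used here is an assumption, as are the function name and the `criterion(i, selected)` callback signature.

```python
def forward_select(weights, criterion, k):
    """Greedy forward selection of k feature indices.

    weights[i] is the Shapley-based weight of feature i.
    criterion(i, selected) returns the evaluation value g of feature i
    given the features selected so far (for example an SU score).
    Assumption: the victory criterion is the product weight * g.
    """
    selected = []
    remaining = set(range(len(weights)))
    while remaining and len(selected) < k:
        # sorted() makes ties resolve to the lowest feature index.
        best = max(
            sorted(remaining),
            key=lambda i: weights[i] * criterion(i, selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```

A usage example with precomputed, purely illustrative SU scores: `forward_select([0.5, 0.3, 0.2], lambda i, s: [0.2, 0.9, 0.4][i], 2)` picks feature 1 first (0.3 × 0.9 = 0.27) and feature 0 second (0.5 × 0.2 = 0.10).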

Data Set Descriptions.
In this paper, we used cancer microarray data sets from Plymouth University [31]. To evaluate the proposed approach thoroughly, we used 11 high-dimensional microarray data sets, which are briefly summarized in Table 1.

Evaluation Criteria.
The criteria used to evaluate the proposed method consist of accuracy, classification precision, F-measure, and the stability criterion for the feature selection algorithm.

Accuracy of Classification.
Classification accuracy is the main criterion for evaluating how well a classifier predicts the samples. The classification accuracy is as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{14}$$

where TP, TN, FN, and FP denote the numbers of true positives, true negatives, false negatives, and false positives, respectively.

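Precision, Recall, and F-Measure.
The precision in (15) and recall in (16) are two other evaluation criteria. The F-measure criterion in (17) integrates precision and recall into a single criterion for comparison:

$$\text{Precision} = \frac{TP}{TP + FP}, \tag{15}$$

$$\text{Recall} = \frac{TP}{TP + FN}, \tag{16}$$

$$F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{17}$$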
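For concreteness, the classification criteria can be computed from the four confusion-matrix counts as in the sketch below. The function name and the returned dictionary layout are illustrative choices, and empty denominators are mapped to 0.0 by convention.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```

With 8 true positives, 9 true negatives, 2 false positives, and 1 false negative, this gives an accuracy of 0.85, a precision of 0.8, a recall of 8/9, and an F-measure of 16/19.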

Robustness Measure.
The robustness (stability) of a feature selection algorithm can be evaluated by its ability to select the same features repeatedly, given various batches of data drawn from the same distribution [2]. Let $S_i$ and $S$ denote the feature subsets selected using the $i$th batch of resampled data and the full data, respectively. To measure the similarity between two feature subsets, the similarity index $\mathrm{JC} \in [0, 1]$ is used, which is defined as follows:

$$\mathrm{JC}(S_i, S) = \frac{|S_i \cap S| + SC(S_i, S)}{k}, \tag{18}$$

where $|S_i \cap S|$ is the number of common features between $S_i$ and $S$, $SC(S_i, S)$ is the sum of the absolute correlation values between different features from $S_i$ and $S$, and $k$ is the size of $S_i$ and $S$. Assume $m$ batches of data are produced by resampling and $m$ feature subsets are selected. The robustness or stability measure of the feature selection algorithm is calculated by (19) as follows:

$$\text{Robustness} = \frac{1}{m} \sum_{i=1}^{m} \mathrm{JC}(S_i, S). \tag{19}$$

Only the Prostate Tumors and DLBCL data sets have two classes; therefore, the precision and F-measure criteria are obtained only for these two data sets, and for the other data sets only the accuracy criterion is examined. The 300 top-ranked features of the 11 Tumors, 14 Tumors, Leukemia1, Leukemia2, and DLBCL data sets are selected using the mutual information criterion. For the other data sets, Fisher ratio is applied for this preliminary feature selection. For comparison, the classification accuracy is plotted against the number of features. The x-axis is the number of features, which ranges up to 50, and the y-axis is the classification accuracy. The proposed method is compared with feature selection using Fisher's ratio, mRMR, and CGFS-SU. The mRMR algorithm serves as the information criterion algorithm for comparison. Fisher's ratio method is a univariate filter method that evaluates each feature individually, and the CGFS-SU method is the cooperative game based feature selection proposed in [21]. To estimate the performance of the classification algorithms, tenfold cross-validation is used. The KNN classifier results show that the proposed method improves on the other feature selection methods in most cases.
For the 11 Tumors data, the proposed method achieves the maximum accuracy of 74.73 percent. The CGFS-SU method comes close at 74.08 percent, but over the first 30 features its accuracy is lower than that of the proposed method. For the 14 Tumors data, the results are low for all feature selection algorithms. The proposed method has the maximum accuracy of 48.34 percent, and the other methods have much lower accuracy, which is due to the weak correlation between selected features. Moreover, the proposed method achieves higher accuracy than the other methods at every feature count up to 50. Accuracy is higher for the SRBCT data, where the proposed method and the Fisher and mRMR algorithms reach the maximum accuracy of 98.75 percent, while the CGFS-SU method reaches 97.56 percent. For the Prostate Tumors data, the proposed method achieves the maximum value among all feature selection methods for all three evaluation criteria and in all cases. For the DLBCL data, the proposed method achieves the maximum value in most cases for each of the three criteria, reaching the highest accuracy of 93 percent, precision of 91 percent, and F-measure of 88 percent. This suggests that the proposed method can reduce interdependency between groups of features and is effective compared with the other algorithms.
Figures 19-33 show the classification results using the SVM classifier on the evaluation data sets. For most data sets, the results with the SVM classifier are better than those with the KNN classifier. For the SRBCT data, the proposed method and the mRMR method both reach the highest accuracy of 100 percent for 10 to 40 features. The CGFS-SU method has lower accuracy, with a maximum of 97.63 percent, and the Fisher criterion is also less accurate, with a maximum of 97.5 percent. This may be because the relationship between the features and the target classes is strongest for this data set, and the mRMR method and the proposed approach are able to retain this relationship. Also, for the DLBCL data, the proposed method is superior to the other approaches for all three criteria.

Stability Results.
In the stability diagrams, the x-axis shows the number of features and the y-axis shows the stability index. As noted above, the stability index lies between 0 and 1. For this criterion, features are normalized except for the SRBCT data. Fisher ratio is applied to preselect the 300 best features of all data sets. To estimate the stability, we used tenfold cross-validation. The stability results are shown in Figures 34-44.
The stability values of the CGFS-SU method and the proposed method are very close and differ only slightly. However, the proposed method reaches the maximum stability on the 11 Tumors, 14 Tumors, Brain Tumor2, and Lung Cancer data sets, whereas the CGFS-SU method achieves the maximum stability only for the Brain Tumor2 data. This suggests that the proposed method is more robust than the other feature selection approaches. The stability of the other feature selection methods differs significantly from that of the CGFS-SU and proposed methods; the stability of the mRMR method, for instance, is approximately equal to theirs only for the Brain Tumor2 data. Table 2 shows the maximum accuracy and average accuracy of the feature selection methods on the evaluation microarray data using the KNN, SVM, NB, and CART classifiers. According to Table 2, the proposed method achieves the highest accuracy among all the algorithms in most cases. Also, the average classification accuracy results show that the proposed method improves the average accuracy by between 1 and 5 percent compared with the CGFS-SU method, by between 2 and 14 percent compared with the Fisher ratio method, and by between 1 and 14 percent compared with the minimum redundancy maximum relevance approach. Table 3 shows the average stability ± standard deviation of the four feature selection methods.
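One way to compute such a stability index is sketched below. It is a hedged illustration, not the measure's reference implementation: the similarity is taken as (shared features + SC) / k, and SC is approximated by pairing each unshared feature of the resampled subset with its most correlated unshared feature of the full-data subset. That pairing rule, and all names used, are assumptions.

```python
import math


def abs_correlation(a, b):
    """Absolute Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column carries no correlation
    return abs(cov) / math.sqrt(var_a * var_b)


def similarity(subset_i, subset_full, columns):
    """Similarity of two equally sized feature subsets, in [0, 1].

    columns maps a feature index to its list of sample values.
    """
    k = len(subset_full)
    common = set(subset_i) & set(subset_full)
    only_i = [f for f in subset_i if f not in common]
    only_full = [f for f in subset_full if f not in common]
    # Assumed pairing rule: best absolute correlation per unshared feature.
    sc = sum(
        max((abs_correlation(columns[a], columns[b]) for b in only_full),
            default=0.0)
        for a in only_i
    )
    return (len(common) + sc) / k


def stability(resampled_subsets, subset_full, columns):
    """Average similarity between each resampled subset and the full one."""
    scores = [similarity(s, subset_full, columns) for s in resampled_subsets]
    return sum(scores) / len(scores)
```

Counting correlated substitutes as partial matches keeps the index from penalizing an algorithm that swaps one gene for a nearly equivalent one between resamples.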

Advances in Bioinformatics
In these experiments, a high average stability with a low standard deviation indicates a more robust approach. Among the eleven microarray data sets, the stability of the proposed method is higher than that of the other feature selection methods on eight data sets and lower than that of the CGFS-SU method on only three. Furthermore, the results show that the proposed method improves the average stability by between 0.001 and 0.01 compared with the CGFS-SU method, by between 0.1 and 0.5 compared with the Fisher ratio method, and by between 0.01 and 0.3 compared with the minimum redundancy maximum relevance method.

Conclusion and Future Work
This paper proposed a feature selection approach for robust gene selection from microarray data. In the proposed method, a cooperative game theoretic method evaluates the weight of each feature while considering the complex, inherent relationships among features, and the Qualitative Mutual Information (QMI) measure is used for more robust feature selection. Stability is obtained by using the Fisher ratio as the utility function in calculating QMI. The results on eleven evaluation microarray data sets show that the proposed method is an effective and stable way of reducing the dimensionality of the data and achieves a relative improvement over the other feature selection methods. As future work, we propose to calculate the weight of each feature using the fuzzy Shapley value or the fuzzy Banzhaf power index.