Bat Algorithm Based Hybrid Filter-Wrapper Approach

This paper presents a new hybrid of the Bat Algorithm (BA) based on Mutual Information (MI) and Naive Bayes, called BAMI. In BAMI, MI was used to identify promising features which could potentially accelerate the process of finding the best known solution. The promising features were then used to replace several of the randomly selected features during the search initialization. BAMI was tested over twelve datasets and compared against the standard Bat Algorithm guided by Naive Bayes (BANV). The results showed that BAMI outperformed BANV in all datasets in terms of computational time. The statistical test indicated that BAMI has significantly lower computational time than BANV in six out of twelve datasets, while maintaining effectiveness. The results also showed that BAMI performance was not affected by the number of features or samples in the dataset. Finally, BAMI was able to find the best known solutions within a limited number of iterations.


Introduction
A number of studies have illustrated hybrid approaches that combine the good characteristics of both filter and wrapper techniques. These approaches are more efficient than wrapper methods while providing comparable accuracy [1][2][3][4]. Lemma and Hashim [5] proposed a hybrid approach using a boosting technique and integrated some of the features of wrapper methods into a fast filter method. The results show that the proposed method is competitive with wrapper methods while selecting feature subsets much faster. Reference [6] developed a hybrid method based on a Markov Blanket filter for high-dimensional genomic microarray data. The experimental results stated that the proposed method led to feature subsets outperforming those of regularization methods.
Hu et al. [7] investigated filter and wrapper methods for biomarker discovery from microarray gene expression data for cancer classification. They proposed a hybrid approach in which Fisher's ratio was employed as the filtering method. The proposed method was tested extensively on real datasets, and the results demonstrate that the hybrid approach is computationally more efficient than the simple wrapper method. Furthermore, the results showed that the hybrid approach significantly outperformed the simple filter method. Huda et al. [8] presented a flexible hybrid feature selection method based on floating search methods to increase flexibility in dealing with the quality-of-result versus computational-time trade-off. The performance of the hybrid method was tested using real-world datasets. The authors stated that the proposed method significantly reduced the search time while achieving accuracies comparable with those of the wrapper methods.
Measures of relevance are of fundamental importance in a number of applications. MI is widely accepted for quantifying the linear or nonlinear degree of relevance between random variables [9][10][11]. The MI of two random variables is a quantity that measures the mutual dependence of the two variables. The earliest studies to use MI for selecting features in building models were by Lewis [12] and Lashin et al. [13]. Huang et al. [14] developed an MI technique that utilized feature similarity for redundancy reduction in unsupervised feature selection.
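For discrete variables, the mutual dependence described above can be estimated directly from empirical frequencies. The following minimal sketch (our own illustrative Python, not code from the paper) computes the MI, in bits, between two discrete sequences:

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    # c/n is the joint probability; c*n/(px*py) is the joint-to-marginal ratio.
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())
```

Two identical binary sequences yield an MI equal to their entropy (1 bit for a balanced sequence), while independent sequences yield an MI of zero.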
Tay and Shen [15] developed a method that targeted an efficient estimation of MI in high-dimensional datasets. From their observation, a feature is relevant to the classes if it embodies important information about the classes; otherwise, the feature is irrelevant or redundant. This method was based on both information theory and statistical tests, whereby a feature is selected conditionally and the information given by this feature must allow a statistically significant reduction of class overlap. Using both synthetic and real-world datasets, the authors stated that the hybrid method was able to eliminate irrelevant and redundant features, even in very large feature spaces, and performed more efficiently than the pure wrapper methods.
Tomar and Agarwal [16] proposed a hybrid Genetic Algorithm (GA) with MI for finding a subset of features that are most relevant to the classification task. They optimized the MI between the predictive labels of a trained classifier and the true class labels instead of optimizing the classification error rate. The method was validated using real-world datasets, and the results indicated that the hybrid method outperformed the accuracy of filter methods. They also concluded that their hybrid method was more efficient than the wrapper methods.
BA has shown superior performance in handling the feature selection problem [17][18][19][20][21] as well as other problems [22][23][24]. The aim of this paper is to develop an algorithm that extracts and combines the best characteristics of the filter-based and wrapper-based approaches into one algorithm. The filter model is based on Mutual Information (MI) and the wrapper model is based on a Naive Bayes classifier. The proposed algorithm, BAMI, will be tested over twelve benchmark datasets and compared against BANV, which was previously proposed by Goyal and Patterh [25]. In this section, hybrid models for the feature selection problem have been presented. The rest of this paper is organized as follows. Section 2 presents the proposed algorithm BAMI along with the relevance and redundancy of features based on MI. Section 3 compares the proposed hybrid BAMI with BANV and reports the computational efficiency and iterations. Finally, Section 4 discusses the results and Section 5 concludes the paper.

Proposed Bat Algorithm with Mutual Information
Our main motivation for building a hybrid feature selection model is to strike a good balance between the computational efficiency of a filter model and the accuracy performance of a wrapper model. In a conventional wrapper model such as BANV, all bats are initialized with randomly selected features. In this hybrid model, we propose that a small fraction of the bats be initialized with "good" features ranked by the Maximum Relevance Minimum Redundancy (mRMR) method. The injection of the good features aims to guide part of the search in an effort to speed up the swarm convergence towards the best known solution.

Maximum Relevance Minimum Redundancy.
A "good" feature is defined as one that has the best trade-off between minimum redundancy within the features and maximum relevance to the target variable. Chen and Cheng [26] proposed a method called Maximum Relevance Minimum Redundancy (mRMR) to gauge the "goodness" of a feature. The mRMR method is a sequential forward selection algorithm that evaluates the importance of different features.
This algorithm uses MI to select features that best fulfill the minimal-redundancy, maximal-relevance criterion and has been found to be very powerful for feature selection. The relevance and redundancy are measured by the MI defined in (2):

$$I(X;Y) = \iint p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \, dx \, dy, \quad (2)$$

where $X$ and $Y$ are two random variables, $p(x,y)$ is their joint probability density, and $p(x)$ and $p(y)$ are their marginal probability densities, respectively. Let $F$ represent the entire feature set, let $S_m$ denote the already-selected feature set, which contains $m$ features, and let $F_{n-m}$ denote the yet-to-be-screened feature set, which contains $n-m$ features. The relevance $V_i$ of a feature $f_i$ in $F_{n-m}$ with the target class $c$ can be calculated by (3):

$$V_i = I(c; f_i). \quad (3)$$

The redundancy $W_i$ of the feature $f_i$ in $F_{n-m}$ with all the features in $S_m$ can be calculated by (4):

$$W_i = \frac{1}{m} \sum_{f_s \in S_m} I(f_s; f_i). \quad (4)$$

To obtain the feature $f_i$ with maximum relevance and minimum redundancy, (3) and (4) are combined into the mRMR function:

$$\max_{f_i \in F_{n-m}} \left[ I(c; f_i) - \frac{1}{m} \sum_{f_s \in S_m} I(f_s; f_i) \right].$$

For a feature set with $n$ features, the feature evaluation continues for $n$ rounds. After these evaluations, the mRMR method yields an ordered feature set $S$ as illustrated in (5):

$$S = \{ f'_1, f'_2, f'_3, \ldots, f'_h, \ldots, f'_n \}. \quad (5)$$

The feature index $h$ indicates the importance of the respective feature; better features are extracted earlier, with a smaller index $h$.
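The sequential forward selection described above can be sketched in a few lines of plain Python. This is our own illustrative implementation for discrete features, not the authors' code; at each round it picks the remaining feature that maximizes relevance to the target minus mean redundancy with the already-selected set:

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def mrmr_rank(features, target, k):
    """mRMR sequential forward selection: repeatedly pick the feature
    maximizing I(f; c) minus mean MI with the already-selected set."""
    remaining = list(range(len(features)))
    selected = []
    while len(selected) < k and remaining:
        def score(i):
            relevance = mutual_information(features[i], target)
            if not selected:
                return relevance
            redundancy = sum(mutual_information(features[i], features[s])
                             for s in selected) / len(selected)
            return relevance - redundancy
        best = max(remaining, key=score)  # ties broken by lower index
        remaining.remove(best)
        selected.append(best)
    return selected
```

Given a highly relevant feature, an exact duplicate of it, and an irrelevant feature, the ranking picks the relevant feature first and then prefers the non-redundant one over the duplicate.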

Algorithm
Procedure. The main steps of the proposed algorithm are illustrated in Figure 1; the shaded region highlights the main difference from the previously proposed BANV.
In our proposed BAMI algorithm, the Maximum Relevance Minimum Redundancy (mRMR) method is used to analyze the "goodness" of each feature. A particular number of the top-ranked features will be used to initialize one bat in the swarm. As shown in (5) and (6), this number is a dynamic parameter that gives flexibility to the proposed algorithm by adapting to the swarm size and the number of features in the dataset. Next, all the bats will be evaluated by a Naive Bayes classifier. Accordingly, if the initialized bats really contain informative features, one of these bats will become the global best solution and speed up the swarm convergence to a promising area within the search space. Otherwise, the proposed BAMI proceeds with the ordinary BANV procedure; hence, the solution quality will not be affected.
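The seeded initialization described above can be sketched as follows. This is a minimal illustration under our own assumptions (binary feature masks, one seeded bat, a precomputed mRMR ranking), not the authors' implementation:

```python
import random

def init_swarm(n_bats, n_features, ranked_features, n_top):
    """Initialize bats as random binary feature masks, then seed one bat
    with the top-ranked mRMR features (the BAMI modification to BANV)."""
    swarm = [[random.randint(0, 1) for _ in range(n_features)]
             for _ in range(n_bats)]
    seeded = [0] * n_features
    for f in ranked_features[:n_top]:  # inject the "good" features
        seeded[f] = 1
    swarm[0] = seeded
    return swarm
```

Each bat would then be evaluated by the Naive Bayes classifier; if the seeded bat scores best, it becomes the initial global best and biases the search towards the promising region, otherwise the search proceeds exactly as in BANV.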

Experiments and Results
The objective of the experiments is to evaluate the performance of the proposed algorithm BAMI against the traditional Bat Algorithm with a Naive Bayes classifier (BANV). Note that the searching efficiency in this study is evaluated based on the speed of convergence to the best known solution. The speed is measured in terms of the number of iterations and the execution time; therefore, once an algorithm obtains the best known solution, we record the time and the number of iterations. Twelve benchmark datasets were used for the evaluation. The datasets have been selected from various domains; furthermore, each dataset has a different number of features and samples, as shown in Table 1. Both algorithms, BANV and BAMI, were run 30 times with a maximum of 250 iterations; the population size was set to 10 for both algorithms, and the parameter value was set to 0.5. Table 2 shows the average time and number of iterations (over 30 runs) at which the algorithms obtain the best known solution. In this table, ATB refers to the average time of the standard BA, ATM refers to the average time of BAMI, AIB refers to the average iterations of BA, and AIM refers to the average iterations of BAMI.
Next, we investigated the significance of the enhancement in terms of the number of iterations required to reach the best known solution. To achieve this, a set of statistical tests was carried out. The results were verified by the Kolmogorov-Smirnov and Levene tests, whereby the outcome showed that only some of the data met the assumptions of normal distribution and equality of variance, while the remaining data did not. Because of this, the t-test was used for normally distributed data and the Wilcoxon test was used for nonnormal data. Table 3 presents the results of the statistical tests for both BAMI and BANV. The algorithm that outperformed the other is given in brackets. Figure 2 illustrates the difference between the average numbers of iterations for both algorithms across all datasets.
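The test-selection logic above can be sketched as follows. The `students_t` helper and the boolean assumption flags are our own illustrative constructs (in practice the Kolmogorov-Smirnov, Levene, t, and Wilcoxon tests would come from a statistics package), not the authors' code:

```python
from math import sqrt

def students_t(a, b):
    """Pooled two-sample t statistic (equal-variance t-test), the
    parametric test used when normality and equal variance hold."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    s1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)  # sample variances
    s2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)
    sp2 = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)  # pooled variance
    return (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2))

def pick_test(is_normal, has_equal_variance):
    """Dispatch rule from the paper: parametric t-test when the
    assumptions hold, Wilcoxon test otherwise."""
    return "t-test" if is_normal and has_equal_variance else "Wilcoxon"
```

For example, the iteration counts of BAMI and BANV on one dataset would first be checked for normality and variance equality, and the appropriate comparison then applied to the 30-run samples.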

Discussion
The results showed that the proposed algorithm BAMI is superior to and more efficient than BANV. As shown in Figure 2, BAMI recorded lower time consumption and fewer iterations to obtain good solutions in all datasets. Statistically, it can be seen in Table 3 that BAMI performed significantly better than the standard BANV in nine out of twelve datasets. Next, we discuss the results ordered by time saving, from largest to smallest, observing the numbers of features and samples to see whether these factors are related to the time saving or the algorithm performance.
In the LED dataset, the proposed method achieved the best known solution in the first iteration within 4.41 seconds, with an average time saving of 90.45%, the largest across all datasets. With more features but fewer samples in the Derm and Derm2 datasets, the result for BAMI was very close to that of BANV, whereby the time saving is 41.40% in Derm and 43.28% in Derm2. In the Credit dataset, with a decreased number of features and an increased number of samples compared to the previous datasets, the time saving is 38.85%.
The results also showed that in the Heart dataset the time saving is 37.97%, which is very close to the Credit dataset in spite of the variation in the number of features and samples, while in the Vote dataset, with a comparable number of samples and a slightly higher number of features than the Heart dataset, the time saving decreases to 31.19%. It can also be seen that the M-of-N dataset, which has the same number of features as the Heart dataset, only has a time saving of 28%.
Although both the Exactly and Exactly2 datasets have the same number of features and samples, their time savings differ. For Exactly2, the time saving is 25.51%, while Exactly has a time saving of 14.82%. In the WQ dataset, the time saving is 22.25%, roughly half that of the Derm2 dataset, in spite of the fact that both datasets have approximately the same number of features. In the Lung and Mushroom datasets, the time saving is small, at only 2.63% and 1.35%, respectively. From these results, it can be seen that the performance of BAMI is not affected by the number of features or samples in the dataset. The subsets obtained by both algorithms are the same, which implies that the proposed algorithm BAMI is able to maintain the same effectiveness while achieving higher efficiency.

Conclusion
In this study, a hybrid filter-wrapper approach named BAMI is presented. BAMI structurally integrates the MI model within BA using the Naive Bayes classifier. BAMI aims to bring together the efficiency of the filter approach with the higher accuracy of the wrapper approach. In BAMI, MI was used to identify promising features which could potentially accelerate the process of finding the best known solution. The promising features were then used to replace several of the randomly selected features during the search initialization. BAMI was compared to BANV using twelve datasets. The results showed that BAMI outperformed BANV in most datasets in terms of computational time. The statistical test indicated that BAMI has a significantly lower computational time than BANV in six out of twelve datasets. More importantly, this research presented a new feature selection technique that provides a good starting point for further investigation and enhancement.

Figure 1: Flowchart of Bat Algorithm with Mutual Information.

Figure 2: Average numbers of iterations across all datasets.

Table 1: Characteristics of datasets.

Table 2: Average iterations and runtimes to obtain the best known solution.