A Binary Superior Tracking Artificial Bee Colony with Dynamic Cauchy Mutation for Feature Selection

This paper proposes an improved learning algorithm for feature selection, termed binary superior tracking artificial bee colony with dynamic Cauchy mutation (BSTABC-DCM). To enhance exploitation capacity, a binary learning strategy is proposed to enable each bee to learn from superior individuals in each dimension. A dynamic Cauchy mutation is introduced to diversify the population distribution. Ten datasets from the UCI repository are adopted as test problems, and the average cross-validation results of BSTABC-DCM are compared with those of seven other popular swarm intelligence metaheuristics. Experimental results demonstrate that BSTABC-DCM obtains the best classification accuracy and selects the most representative features for the UCI problems.


Introduction
Feature selection is one of the cornerstones of machine learning [1]: it selects the combination of features that best describes the target problem. Useful features are retained while redundant and irrelevant features are removed. As a result, an appropriate set of selected features can reduce computational complexity, improve knowledge discovery, and achieve satisfactory learning performance [2]. However, the number of possible combinations increases exponentially with the number of candidate features, which keeps efficient feature selection methods in constant demand.
Existing feature selection methods can be classified into filter, embedded, and wrapper methods [3]. The filter method evaluates the correlation between variables to reduce the feature size of a dataset, and the evaluation process does not involve a specific learning algorithm [4]. The embedded method embeds feature selection within classifiers; commonly embedded methods include support vector machine (SVM), ID3, C4.5, and Lasso, a least-squares regression method based on the L1 regularization term [5]. The wrapper method uses a feature search component to produce feature subsets and a specific classifier to evaluate the performance of different feature subsets until the termination conditions are met. In the wrapper method, swarm intelligence algorithms [6], such as particle swarm optimization [7], bacterial foraging optimization [8], grey wolf optimization [9], and brain storm optimization [10], have attracted great interest across various areas [5,11].
Artificial bee colony (ABC) [12], a recently proposed swarm intelligence metaheuristic [13], has been employed to address feature selection problems due to its promising efficiency and simple implementation. Keles and Kılıç [14] applied ABC to feature selection on the SCADI dataset with 70 samples and 206 attributes. Seven features were finally selected from the 206 to classify the dataset with various classification methods, and the classification accuracy was significantly improved. Kiliç and Keleş [15] applied ABC to select features on the z-Alizadeh Sani dataset with 303 samples and 56 attributes, and the classification accuracy was enhanced over the original data. These promising applications encourage researchers to continuously improve the optimization performance of the original ABC for feature selection. Shunmugapriya et al. [16] enhanced ABC by mining the global optimal solutions and previously abandoned solutions. Experimental results showed that the feature selection performance was improved without a significant increase in computational cost. Özger et al. [17] implemented 8 variants of the binary ABC algorithm to solve feature selection problems, and the variant using the bit-by-bit operator has better global search capability. Wang et al. [18] diversified the initial food sources to make the initialization evenly distributed. Numerical results showed that the method could achieve high classification accuracy and a smaller feature subset. Liu [19] proposed an ABC variant based on knee points to accelerate the convergence speed on feature selection problems. Though these strategies improve the performance of the original ABC, in each iteration of these ABC variants each food source is updated in only one dimension, and the search strategy lets bees learn randomly from other bees. This results in slow convergence speed and inferior exploitation capability of ABC during the optimization process.
In this work, a binary superior tracking artificial bee colony with dynamic Cauchy mutation (BSTABC-DCM) is proposed to further improve the convergence speed and exploitation capability of ABC for feature selection. Two efficient search strategies, namely, the superior tracking strategy [20] and dynamic Cauchy mutation, are integrated into the proposed algorithm. Compared with the original ABC, the superior tracking strategy enhances ABC's learning behavior in two aspects: (1) instead of updating only one dimension in each iteration, bees learn from others in each dimension in each iteration; (2) instead of learning randomly, bees select individuals with better fitness to follow. A dynamic Cauchy mutation is integrated to diversify the population and improve global search ability. Ten UCI datasets of different types are adopted to test BSTABC-DCM on feature selection problems. Seven popular swarm intelligence algorithms are included for comparison. Comprehensive experiments are conducted to evaluate the effectiveness of the proposed method for feature selection.
This paper is organized as follows: Section 2 introduces the background, including the principle of the original ABC algorithm and related work on feature selection. Section 3 presents the proposed BSTABC-DCM. Section 4 reports the experimental process and results. Section 5 concludes.

Artificial Bee Colony.
The ABC framework is divided into the employed bee stage, the onlooker bee stage, and the scout bee stage [21]. The employed bee first mines a food source and shares its information with the onlooker bees. The onlooker bee selects a food source to mine according to this information. If a better food source cannot be found, the onlooker bee transforms into a scout bee and searches for a new food source [22,23]. Firstly, the food sources are initialized as follows:

x_ij = lb_j + rand(0, 1) × (ub_j − lb_j), (1)

in which i = 1, 2, ..., SN, where SN refers to the number of food sources, j = 1, 2, ..., D, D refers to the problem dimension, and ub_j and lb_j refer to the maximum and minimum values of the j-th dimension of the source, respectively.
(1) Employed bee stage: The number of employed bees is half of the initial food sources, and each is attached to one of the better half of the sources. A new food source is mined near the attached food source as shown in the following equation:

v_ij = x_ij + R_ij × (x_ij − x_kj), (2)

where v_ij refers to the newly mined food source, R_ij is a random number in [−1, 1], and x_ij and x_kj refer to the j-th dimension of food sources i and k, respectively. The objective function value of the new food source is calculated and updated, and after the employed bee finds a better food source, the new fitness is calculated according to the following equation:

fit_i = 1 / (1 + f(x_i)) if f(x_i) ≥ 0, and fit_i = 1 + |f(x_i)| otherwise, (3)

where f(x_i) refers to the objective function value of food source x_i. (2) Onlooker bee stage: The selection probability is calculated by

p_i = fit_i / Σ_{n=1}^{SN} fit_n, (4)

and the better food sources are retained by the roulette method. The selected food sources are further explored by the following equation:

v_kj = x_kj + R_kj × (x_kj − x_gj), (5)

where x_kj refers to the j-th dimension of the food source selected by roulette and x_gj is a food source different from k. (3) Scout bee stage: If a food source has not been replaced by a better one, the scout bees generate new food sources randomly according to equation (1).
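As a concrete reference, the three bee stages above can be sketched as a minimal continuous ABC in Python. This is a simplified illustration, not the paper's implementation; the function name `abc_minimize` and all parameter defaults are our own assumptions.

```python
import random

def abc_minimize(f, dim, lb, ub, sn=20, limit=50, max_iter=100):
    """Minimal continuous ABC sketch: employed, onlooker, and scout stages."""
    # Eq. (1): random initialization within [lb, ub]
    foods = [[lb + random.random() * (ub - lb) for _ in range(dim)] for _ in range(sn)]
    vals = [f(x) for x in foods]
    trials = [0] * sn

    def fitness(v):
        # Eq. (3): fitness derived from the objective value
        return 1.0 / (1.0 + v) if v >= 0 else 1.0 + abs(v)

    def try_neighbor(i):
        # Eq. (2): perturb one random dimension relative to a random peer k
        j = random.randrange(dim)
        k = random.choice([m for m in range(sn) if m != i])
        v = foods[i][:]
        v[j] = min(max(v[j] + random.uniform(-1, 1) * (v[j] - foods[k][j]), lb), ub)
        fv = f(v)
        if fv < vals[i]:                 # greedy selection keeps the better source
            foods[i], vals[i], trials[i] = v, fv, 0
        else:
            trials[i] += 1

    for _ in range(max_iter):
        for i in range(sn):              # employed bee stage
            try_neighbor(i)
        fits = [fitness(v) for v in vals]
        total = sum(fits)
        for _ in range(sn):              # onlooker bee stage: roulette by Eq. (4)
            r, acc, i = random.random() * total, 0.0, 0
            for m, ft in enumerate(fits):
                acc += ft
                if acc >= r:
                    i = m
                    break
            try_neighbor(i)
        for i in range(sn):              # scout bee stage: abandon exhausted sources
            if trials[i] > limit:
                foods[i] = [lb + random.random() * (ub - lb) for _ in range(dim)]
                vals[i], trials[i] = f(foods[i]), 0

    best = min(range(sn), key=lambda i: vals[i])
    return foods[best], vals[best]
```

On a smooth test function such as the sphere, this sketch converges close to the optimum within a few hundred iterations.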

Related Work on Feature Selection.
It is a great challenge to train on high-dimensional datasets. As the dimensionality increases, the number of required training samples grows exponentially, causing the "curse of dimensionality". In addition, many models are inapplicable to high-dimensional data [24]. Therefore, feature selection is an important way to improve learning performance and plays a key role in the data preprocessing step of machine learning.
Existing feature selection methods can be divided into three categories: filter, embedded, and wrapper methods [25]. The principles of each type of method and the corresponding algorithms are reviewed in this section.
(1) Filter method: The filter method evaluates the correlation between variables according to statistical, information-theoretic, and distance measures [4]. Filter methods find the K variables with the strongest correlation based on correlation parameters including the correlation coefficient, chi-square test, Fisher score, and information gain [5]. No specific learning algorithm is involved in the evaluation process. Peng et al. [26] derived the minimal-redundancy-maximal-relevance criterion (mRMR) based on statistical principles and improved a two-stage feature selection algorithm; experimental results show that mRMR performs well in classification accuracy. Meyer and Bontempi [27] proposed a filter method based on double input symmetrical relevance (DISR), a new information-theoretic selection criterion; the proposed method is competitive with other filter methods. Almuallim and Dietterich [28] presented the filter method FOCUS, a quasipolynomial-time algorithm that achieves good performance in coverage, sample complexity, and generalization. ABB and Relief also belong to the filter methods [25].
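To make the filter principle concrete, the Fisher score mentioned above can be computed per feature and used to rank features without any classifier. The sketch below is our own minimal illustration (function names `fisher_scores` and `top_k_features` are assumptions, not from the paper):

```python
def fisher_scores(X, y):
    """Fisher score per feature: between-class spread over within-class variance."""
    n, d = len(X), len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(d):
        col = [row[j] for row in X]
        mu = sum(col) / n                      # overall mean of feature j
        num = den = 0.0
        for c in classes:
            vals = [col[i] for i in range(n) if y[i] == c]
            nc = len(vals)
            mc = sum(vals) / nc                # class mean
            var = sum((v - mc) ** 2 for v in vals) / nc
            num += nc * (mc - mu) ** 2         # between-class scatter
            den += nc * var                    # within-class scatter
        scores.append(num / den if den > 0 else 0.0)
    return scores

def top_k_features(X, y, k):
    """Filter selection: keep the k features with the highest Fisher score."""
    s = fisher_scores(X, y)
    return sorted(range(len(s)), key=lambda j: -s[j])[:k]
```

A feature that separates the classes cleanly gets a high score, while a feature whose class means coincide scores zero.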
(2) Embedded method: The embedded method integrates the process of feature selection with the process of classifier learning, mainly to avoid the high computational cost of rebuilding the classification model when processing different datasets. Commonly used embedded methods can be divided into three types [5]. The first is the pruning method, which eliminates features starting from the full feature set; for example, SVM-based approaches eliminate features recursively. The second comprises feature selection algorithms with a built-in mechanism, such as ID3 and C4.5. The third is regularization models. This type of method minimizes fitting errors through the objective function, and features with low coefficients are eliminated, as in lasso regularization and bridge regularization [5]. Mohsenzadeh et al. [29] proposed the relevance sample feature machine (RSFM), an embedded feature selection method based on a sparse Bayesian approach and Gaussian priors. The results show that RSFM performs well in both eliminating redundant features and classification accuracy. Mirzaei et al. [30] proposed an embedded feature selection method based on a fully Bayesian framework and introduced a multistep algorithm with variational approximation to maximize the posterior probability. The proposed method is successful in both regression and classification.
(3) Wrapper method: Wrapper methods feed the dataset into a search algorithm for training until the best combination of features is obtained within the iteration budget. Commonly used algorithms include greedy search and stochastic search [31]. Greedy search has two variants: forward selection and backward elimination; forward selection expands the feature subset from the empty set, and backward elimination gradually eliminates features from the complete feature set [25]. Swarm intelligence algorithms belong to stochastic search, such as the genetic algorithm (GA), ant colony optimization (ACO), and particle swarm optimization (PSO). Yang et al. [32] proposed chaotic binary particle swarm optimization (CBPSO) for feature selection and used two classifiers to test the algorithm; the classification accuracy obtained by CBPSO is higher than that of other methods from the literature. Too et al. [4] incorporated a multiple inertia weight strategy into binary particle swarm optimization and proposed CBPSO-MIWS for feature selection. The results show that CBPSO-MIWS achieves competitive performance among the five tested swarm intelligence algorithms. Xue et al. [33] studied two multiobjective feature selection algorithms based on PSO; the first applies the idea of nondominated sorting to PSO, while the second incorporates mutation and crowding strategies into PSO to search for Pareto solutions. These two algorithms can evolve a set of nondominated solutions automatically. Compared with three well-known multiobjective algorithms, the second algorithm obtains better results. Cheng and Lu [34] integrated the sampling survey method into a heuristic intelligent optimization algorithm and proposed a new feature selection method, using the sampling survey method to build the feature-scoring system and the reduced-dimension length-scoring system. Results showed that the proposed algorithm can select features quickly and effectively.
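The greedy forward selection variant described above can be sketched in a few lines: starting from the empty set, repeatedly add whichever feature most improves a wrapper score, and stop when no candidate helps. The function name `forward_selection` and the `evaluate` callback interface are our own illustrative assumptions.

```python
def forward_selection(evaluate, n_features, max_size=None):
    """Greedy forward selection: grow the subset from empty, one feature at a
    time, keeping the addition that most improves the wrapper's score."""
    selected, best_score = [], float("-inf")
    max_size = max_size or n_features
    while len(selected) < max_size:
        best_j, best_j_score = None, best_score
        for j in range(n_features):
            if j in selected:
                continue
            score = evaluate(selected + [j])   # wrapper: score candidate subset
            if score > best_j_score:
                best_j, best_j_score = j, score
        if best_j is None:                     # no feature improves the score
            break
        selected.append(best_j)
        best_score = best_j_score
    return selected, best_score
```

In practice `evaluate` would be a classifier's cross-validated accuracy; here any subset-scoring function works, which keeps the search logic independent of the learner.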
We study the wrapper feature selection method in this paper. The flowchart of the wrapper feature selection method is given in Figure 1.

Binary Superior Tracking Strategy.
In BSTABC-DCM, a binary superior tracking strategy is proposed for feature selection. Compared with ABC's search strategy, the integrated superior tracking strategy has two main differences: (1) bees learn from others in every dimension instead of updating only one dimension per iteration, and (2) bees choose better individuals to learn from instead of moving randomly. The food sources are updated as the following equation:

v_ij = x_ij + R_ij × (SN_ij − x_ij), (6)

where SN_i denotes the superior neighbor for guidance; it is a D-dimensional vector whose elements are constructed, for the i-th food source, from the food source's own position and from other superior food sources via two probabilistic selection methods (roulette selection and tournament selection). The pseudocode for generating SN_i is presented in Algorithm 1.
Pr is the initialized probability threshold, which determines whether the current individual learns from its superior neighbor or from itself. rand_i is a function that produces an integer from a uniform discrete distribution, FS refers to the position of the food source, and FV is the function value.
As feature selection is a binary optimization problem, the continuous food source is converted to a binary one after being updated via equation (6). A two-step binarization technique [35] is used to binarize the continuous solution. The first step is the transfer function: the sigmoid function [4] is adopted to transform the position of the food source into a probability value,

S(v_i) = 1 / (1 + e^(−v_i)). (7)

The second step is the binarization technique: the probability value of the food source is converted to a binary value by applying

V_i = 1 if rand < S(v_i), and V_i = 0 otherwise, (8)

where v_i refers to the continuous food source, V_i refers to the binary food source, and rand is a random number uniformly distributed between 0 and 1.
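The two-step binarization, a sigmoid transfer followed by stochastic thresholding, is compact enough to show directly. This is a minimal sketch of the standard technique; the function name `binarize` is ours.

```python
import math, random

def binarize(v, rng=random):
    """Two-step binarization of a continuous food source:
    step 1 maps each component to a probability via the sigmoid transfer,
    step 2 samples a bit by comparing a uniform random number to it."""
    bits = []
    for vj in v:
        s = 1.0 / (1.0 + math.exp(-vj))   # transfer function: value in (0, 1)
        bits.append(1 if rng.random() < s else 0)
    return bits
```

A strongly positive component is almost always mapped to 1 (the feature is selected), and a strongly negative one to 0, while values near zero stay genuinely stochastic.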

Dynamic Cauchy Mutation Method.
To enhance the global exploration of the proposed method, a dynamic Cauchy mutation is implemented to refine the global best solution in each iteration. The dynamic Cauchy mutation in this study is defined [36] as the following equation:

v'_(g,d) = v_(g,d) + α × cauchy(δ), (9)

where d is a random integer in [1, D], δ denotes the Cauchy distribution scale, v_(g,d) is the d-th dimension of the global best solution in each iteration, and cauchy(δ) refers to a random number generated from the Cauchy distribution. The value of δ is set to 1 to balance the exploitation ability of the proposed algorithm, and α is a dynamic weight that decays over the course of the search, defined in terms of rand, a random number in [0, 1], and iter, the current iteration number.
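The mutation step above can be sketched as follows. Note two explicit assumptions: the decay schedule `alpha = rand * (1 - iter/max_iter)` is our guess at the dynamic weight (the text only states that it depends on a uniform random number and the iteration count), and the Cauchy deviate is drawn via the standard inverse-CDF trick.

```python
import math, random

def dynamic_cauchy_mutation(gbest, iter_, max_iter, delta=1.0, rng=random):
    """Mutate one random dimension of the global best with a Cauchy deviate.
    ASSUMPTION: alpha = rand * (1 - iter/max_iter); the paper only says the
    weight is dynamic in rand and iter."""
    d = rng.randrange(len(gbest))
    # standard Cauchy deviate via inverse CDF: delta * tan(pi * (u - 0.5))
    cauchy = delta * math.tan(math.pi * (rng.random() - 0.5))
    alpha = rng.random() * (1.0 - iter_ / max_iter)   # assumed decay schedule
    mutant = gbest[:]
    mutant[d] = gbest[d] + alpha * cauchy
    return mutant
```

The heavy tails of the Cauchy distribution occasionally produce large jumps, which is what diversifies the population, while the decaying weight shrinks the perturbation to zero by the final iteration.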
3.3. Procedure of the Proposed Method. BSTABC-DCM includes two main search components: the binary superior tracking strategy (BST) and the dynamic Cauchy mutation method (DCM). In BSTABC-DCM, the food sources are explored using equation (6) in the employed bee stage and the onlooker bee stage, where each dimension of a food source is updated by learning from better food sources. This ensures timely information exchange between food sources. After the onlooker bee stage, the dynamic Cauchy mutation method is applied to increase population diversity, and the most fertile food sources are explored. The flowchart of BSTABC-DCM is given in Figure 2.

Experimental Study
In this section, a set of feature selection problems is employed to comprehensively verify the performance of BSTABC-DCM.
Regarding the experimental setup, KNN has been widely used in various fields with relatively large sample sizes, such as pattern recognition, text categorization, and moving object recognition [40,41]. As more than half of the test datasets are relatively large, KNN (k = 5) is selected as the classification algorithm in our study. In each experiment, we conduct 5-fold cross-validation by dividing 80% of the dataset into the training set and the remaining 20% into the testing set. Each algorithm runs 20 times independently on each dataset. Evaluation indicators include the best accuracy, worst accuracy, mean accuracy, standard deviation (STD), and the number of redundant features removed. The final accuracy is the average accuracy of the 5-fold cross-validation [42].
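The wrapper fitness used in this setup, accuracy of a 5-nearest-neighbor classifier restricted to the selected features, can be sketched in pure Python. This is a simplified illustration of the evaluation step, not the paper's code; `knn_accuracy` and its argument names are our assumptions, and a real run would average this over the 5 folds and 20 repetitions.

```python
def knn_accuracy(train_X, train_y, test_X, test_y, mask, k=5):
    """Accuracy of a k-NN classifier (k = 5, as in the paper) using only the
    features where mask[j] == 1; this serves as the wrapper fitness value."""
    idx = [j for j, m in enumerate(mask) if m]
    if not idx:                       # empty subset: nothing to classify with
        return 0.0

    def dist2(a, b):
        # squared Euclidean distance over the selected features only
        return sum((a[j] - b[j]) ** 2 for j in idx)

    correct = 0
    for x, y in zip(test_X, test_y):
        neigh = sorted(range(len(train_X)), key=lambda i: dist2(train_X[i], x))[:k]
        votes = {}
        for i in neigh:               # majority vote among the k neighbors
            votes[train_y[i]] = votes.get(train_y[i], 0) + 1
        pred = max(votes, key=votes.get)
        correct += pred == y
    return correct / len(test_y)
```

A swarm algorithm would call this with each candidate binary mask and keep the masks yielding the highest accuracy.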

Benchmark Datasets.
Considering the number of instances and dimensions of different datasets, ten datasets from the UCI repository are adopted as test problems [4]. The number of instances and dimensions of the datasets are shown in Tables 1 and 2.

Experimental Results.
For each experiment, we randomly divide the dataset into five parts, with 80% as the training set and 20% as the testing set. The average results over 20 runs of 5-fold cross-validation are taken as the final results. Without feature selection, the classification accuracy of the KNN (k = 5) algorithm is shown in Table 3.
We implement the comparison algorithms on the ten datasets for feature selection. The average results obtained by cross-validation are regarded as the final experimental results. The results of the proposed algorithm and the seven compared algorithms are shown in Tables 4 and 5. The best results among the eight algorithms are shown in bold.

From the results, it can be observed that the best classification accuracy obtained by BSTABC-DCM is higher than that of the other seven algorithms on 8 out of 10 datasets. The average classification accuracy obtained by BSTABC-DCM is higher on 9 out of 10 datasets. The classification accuracy of BSTABC-DCM increases by 16.67% on average, while that of the other seven algorithms increases by 15.27% on average. Specifically, BSTABC-DCM improves classification accuracy by more than 10% on 4 datasets; the next best is GWO, which improves by more than 10% on 3 datasets, while the other six algorithms do so on only 2 datasets. The largest accuracy improvement obtained by BSTABC-DCM is on the BreastCancerCombiadataR2 dataset, at 68.19%. Figure 3 shows the convergence processes of the comparison algorithms, where the proposed BSTABC-DCM is shown in green. It is observed that BSTABC-DCM converges faster than the other algorithms on 8 out of 10 datasets. The classification results and the convergence curves indicate that BSTABC-DCM enhances ABC's exploitation capability.
In addition to classification accuracy, another indicator for measuring algorithm performance is the number of redundant features removed, averaged over the 20 runs. Table 6 shows the number of redundant features removed by the eight algorithms. The best value among the eight methods is shown in bold.
The "optimal number" row reports the number of datasets on which each algorithm performs best / the total number of features removed. As can be seen from Table 6, the proposed BSTABC-DCM removed the most redundant features on three datasets: the diabetic retinopathy, ionosphere, and Pdspeechfeatures datasets. Although BSO and qABC also performed well in removing redundant features, judging from the percentage of redundant features removed from the original datasets, BSTABC-DCM is undoubtedly the most promising, reaching 43.37%. In other words, compared with the original datasets, BSTABC-DCM can greatly reduce the number of features while ensuring higher classification accuracy after feature selection. In conclusion, BSTABC-DCM has promising performance on the test problems.

Discussion
In summary, the above experimental results show that the proposed method improves the best classification accuracy on 8 out of 10 datasets. For the number of removed redundant features, BSTABC-DCM achieves the highest percentage of redundant features removed. This reveals that our proposed strategies significantly enhance the original ABC algorithm in terms of global exploration and exploitation capabilities. In addition, it can be seen from Figure 3 that the convergence speed of BSTABC-DCM is comparable to that of the other comparison algorithms.
Though the convergence speed is comparable to that of the other algorithms, it is promising to combine the advantages of different methods, e.g., random forest and naive Bayes, with the proposed algorithm to further refine the optimization process.

Conclusion
In this study, a binary superior tracking artificial bee colony with dynamic Cauchy mutation (BSTABC-DCM) is proposed for feature selection. Specifically, a binary superior tracking strategy is integrated to improve the learning behavior of the bees by boosting the efficiency of information sharing in the population. In each iteration, the bees can learn from superior bees in each dimension. A dynamic Cauchy mutation is introduced to diversify the population and enhance global search ability. We select ten datasets from the UCI repository to verify the performance of BSTABC-DCM, and seven state-of-the-art swarm intelligence algorithms are included for comparison. Experimental results indicate that BSTABC-DCM achieves the best classification accuracy while removing nearly half of the redundant features. The convergence speed of BSTABC-DCM is comparable to that of the comparison algorithms.
While promising, there are still margins for further investigation. Firstly, different binary conversion methods could be considered to improve optimization efficiency. Secondly, both classification accuracy and feature subset size could be combined in the fitness function to build a multiobjective model. Thirdly, adaptive search strategies could be considered to enhance the adaptability of the proposed method. Applying BSTABC-DCM to more practical problems, such as the capacitated location problem of distribution centers, vehicle routing problems, and scheduling problems, is also one of our research directions.

Data Availability
The data used to support the findings are available from the UCI Machine Learning Repository.

Disclosure
An earlier version of this work was previously presented at the International Conference on Neural Computing for Advanced Application.

Conflicts of Interest
The authors declare that they have no conflicts of interest.