Feature Selection on Elite Hybrid Binary Cuckoo Search in Binary Label Classification

To address the low optimization accuracy of the cuckoo search algorithm, a new search algorithm, the Elite Hybrid Binary Cuckoo Search (EHBCS) algorithm, is developed through feature weighting and an elite strategy. The EHBCS algorithm is designed for feature selection on a series of binary classification datasets, including low-dimensional and high-dimensional samples, using an SVM classifier. The experimental results show that the EHBCS algorithm achieves better classification performance compared with the binary genetic algorithm and binary particle swarm optimization. Besides, we explain its superiority in terms of standard deviation, sensitivity, specificity, precision, and F-measure.


Introduction
Feature selection attempts to find the most discriminative subset of features to bring reasonable recognition rates to a classifier. Given a problem with d features, there are 2^d possible solutions, making an exhaustive search impracticable for high-dimensional feature spaces. In addition, high-dimensional data contains a large number of irrelevant and noise-polluted features, and there is often informational redundancy between features. These factors degrade the effect of a learning algorithm and significantly increase its computational complexity. Therefore, feature selection has become a research hotspot.
As a key technology of pattern recognition and machine learning, feature selection is an effective method for dealing with high-dimensional data. Feature selection models can be divided into three categories [1]: filter [2], embedded [3], and wrapper [4]. Filter methods define the relevant features without prior classification of the data. Embedded methods embed the feature selection algorithm into the classification algorithm, conducting feature selection and training at the same time. Wrapper methods, on the other hand, incorporate classification algorithms to search for and select relevant features. Wrapper methods generally outperform filter methods in terms of classification accuracy [5]. Recent studies have shown that feature selection can help solve many practical problems, including classification and medical problems [6][7][8][9].
Another vital part of the feature selection process is the search strategy: selecting the feature subset that meets the optimal evaluation criterion is usually a combinatorial optimization problem. In recent years, metaheuristic algorithms based on biological behavior and physical systems in nature have been proposed to solve such optimization problems [10]. Metaheuristic optimization algorithms, also known as nature-inspired algorithms, study the evolutionary behavior of species and translate it into computer science algorithms; examples include the genetic algorithm [11], the particle swarm optimization algorithm [12], the bat algorithm [13,14], and the cuckoo algorithm [15]. Metaheuristic optimization algorithms have achieved good results in feature selection. For example, Liu et al. [16] combined the genetic algorithm with the simulated annealing algorithm to select feature subsets; the experimental results showed that the hybrid algorithm has high reliability and strong convergence. In contrast, Siedlecki and Sklansky [17] combined the genetic algorithm with feature selection to some effect, but this exposed the genetic algorithm's problem of premature convergence. Kennedy and Eberhart [18] proposed the binary particle swarm optimization algorithm (BPSO), which modifies the traditional particle swarm optimization algorithm to solve binary optimization problems, and Firpi and Goodman [19] applied BPSO to feature selection problems.
The success of metaheuristic methods lies in the efficiency of their search strategies and their ability to find solutions to combinatorial optimization problems. Metaheuristics use the information gathered during the search to guide the search process and are therefore considered problem-independent. The cuckoo search algorithm is a novel heuristic optimization approach introduced by Yang and Deb in 2009 [15]. The algorithm simulates the parasitic breeding habits of cuckoo birds and is a stochastic algorithm with strong global search ability. The cuckoo search algorithm has been employed efficiently in many fields, such as intelligent optimization and computation. Cuckoo search is superior to other algorithms on continuous optimization problems, including the spring design and welded beam problems in engineering design applications [20], and it is especially suitable for large-scale problems [21]. Valian et al. have applied it to training neural networks [22] and spiking neural models [23]. Experiments have shown that CS has better search capability than algorithms such as particle swarm optimization, the genetic algorithm, and the artificial bee colony algorithm [21,24,25]. Therefore, CS is a metaheuristic algorithm well suited to combinatorial optimization problems requiring high performance.
The CS algorithm can only solve optimization problems in a continuous solution space. To solve combinatorial optimization problems in a discrete solution space, Gherboudj et al. [26] proposed a binary version of the cuckoo search algorithm, namely the BCS algorithm. Pereira and Rodrigues [27] applied the BCS algorithm to feature selection. Bhattacharjee and Sarmah [28] improved BCS by using a balanced combination of local random walks and global exploration random walks so that the BCS algorithm can better balance local and global search. Sudha and Selvarajan [29] presented a feature selection approach based on an enhanced cuckoo algorithm and applied it to breast X-ray images, supplying valuable information for clinicopathologists. Aziz and Hassanien [30] proposed a new improved cuckoo algorithm combined with rough set theory and applied it to feature selection.
The cuckoo search algorithm uses Lévy-flight random walks to search the space in each iteration. Because Lévy flights take sharp 90-degree turns, cuckoo search cannot effectively search around a cuckoo's nest, and it therefore suffers from low optimization accuracy [31]. To improve the cuckoo search algorithm, this paper proposes an Elite Hybrid Binary Cuckoo Search algorithm. The novelty of the paper is two-fold: (1) EHBCS adopts feature weighting and an elite strategy in the binary cuckoo search algorithm. Feature weighting based on the Relief algorithm estimates each feature's weight and importance according to its ability to distinguish instances of different classes. The elite strategy and the genetic algorithm's selection and crossover operators are embedded into the cuckoo algorithm so that well-positioned nests can be inherited to the next generation. (2) EHBCS is applied to a set of binary label datasets, including low-dimensional and high-dimensional samples, such that only the best features are retained in the subset. Experimental results demonstrate that EHBCS achieves better classification performance, minimizing the number of selected features while maximizing the classification accuracy by SVM, compared with the binary genetic algorithm and binary particle swarm optimization. The main contributions of this paper are summarized as follows: (1) it combines, for the first time, feature weighting and an elite strategy with the BCS algorithm; (2) it specifically improves the low optimization accuracy of the BCS algorithm; (3) it may provide useful insights for high-dimensional data research such as text processing, medical research, and gene analysis.
The structure of this paper is as follows: Section 2 provides details of the classical Cuckoo Search and Binary Cuckoo Search algorithms; Section 3 presents the Elite Hybrid Binary Cuckoo Search (EHBCS) algorithm; Section 4 discusses the experimental methodology, in particular the datasets and evaluation measures; numerical experiments evaluating the prediction performance of our method are carried out in Section 5, and the results demonstrate that the proposed method is efficient for high-dimensional datasets; finally, the conclusions of our work are given in Section 6.

Cuckoo Search Algorithm
2.1. Cuckoo Search (CS) Algorithm. The parasitic behavior of cuckoos is extremely intriguing. These birds lay their eggs in host nests and mimic external characteristics of the host eggs, such as color and spots. If this strategy is unsuccessful, the host can throw the cuckoo's eggs away or simply abandon its nest and build a new one elsewhere. Based on this behavior, Yang and Deb [15] developed a novel evolutionary optimization algorithm named cuckoo search (CS), which they summarized with three rules: (1) each cuckoo chooses a nest randomly in which to lay its eggs; (2) the number of available host nests is fixed, and nests with high-quality eggs will be passed on to the next generations; (3) if a host bird discovers the cuckoo egg, it can throw the egg away or abandon the nest and build a completely new one. For optimization problems, each nest represents a possible solution, and a nest can contain one or more eggs depending on the size of the problem. First, the algorithm randomly initializes each nest; then it carries out an iterative process. During each iteration, each nest is updated by a Lévy-flight random walk, as shown in Equations (1) and (2):

x_i^(t+1) = x_i^t + α ⊕ Lévy(λ),    (1)

x_ij^(t+1) = x_ij^t + α × Lévy_j(λ),    (2)

where x_i^t denotes the ith nest and x_ij^t stands for the jth egg at nest i in generation t, α is the step size, and the product ⊕ denotes entrywise multiplication. In most cases, we can use α = 1. The Lévy flight Lévy(λ) employs a random step length, and Lévy_j(λ) is its jth component.
In the 1930s, Lévy proposed the Lévy distribution, holding that the relationship between the continuous jump path of a Lévy flight and time t follows this distribution. Later, many scholars studied the Lévy distribution and used it to explain random phenomena in nature, such as Brownian motion and random walks. By simplification and Fourier transform, Yang [15] obtained the probability density function of the Lévy distribution in power-law form:

Lévy(λ) ~ u = t^(−λ), 1 < λ ≤ 3,    (3)

where λ is the power coefficient. Equation (3) is a probability distribution with a heavy tail. Although it essentially describes the random walk process of cuckoo birds, it is not expressed in a concise, easy-to-program mathematical form for implementing the CS algorithm. Yang therefore adopted the Mantegna algorithm to simulate the Lévy jump path:

s = μ / |ν|^(1/β),    (4)

where s is the Lévy flight step Lévy_j(λ), and β is related to λ in Equation (3) by λ = 1 + β with 0 < β ≤ 2. The parameter is set to β = 1.5, and μ and ν are random numbers satisfying Equations (5) and (6):

μ ~ N(0, σ_μ²), ν ~ N(0, σ_ν²),    (5)

σ_μ = {Γ(1 + β) sin(πβ/2) / [Γ((1 + β)/2) β 2^((β−1)/2)]}^(1/β), σ_ν = 1.    (6)

Let step = α × Lévy_j(λ) = α × s; then step is the path that the cuckoo bird travels each time in the solution space when it randomly searches for the new nest location x_ij^(t+1) from the old nest location x_ij^t according to Equation (2). In the final step of each iteration, the nest with the worst quality is replaced with probability p_a ∈ [0, 1]. Algorithm 1 shows the pseudo-code for the classical version of CS.
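As a concrete illustration of the Mantegna procedure described above, the following Python sketch draws a Lévy step s = μ/|ν|^(1/β) and applies it as an entrywise random walk on a nest. The function names are illustrative, not from the paper.

```python
import math
import random

def levy_step(beta=1.5):
    """One Mantegna-style Lévy step: s = mu / |nu|^(1/beta)."""
    # Standard deviation of mu per Equations (5) and (6); sigma_nu = 1.
    sigma_mu = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
                / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    mu = random.gauss(0, sigma_mu)
    nu = random.gauss(0, 1)
    return mu / abs(nu) ** (1 / beta)

def update_nest(x, alpha=1.0, beta=1.5):
    """Entrywise Lévy-flight random walk: x_ij + alpha * Levy_j(lambda)."""
    return [xj + alpha * levy_step(beta) for xj in x]
```

Because the step distribution is heavy-tailed, most moves are small while occasional moves are very large, which is what gives CS its global search ability.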

2.2. Binary Cuckoo Search (BCS) Algorithm. In traditional CS, the solution position is updated in a continuous search space. Unlike the above, the BCS search space for feature selection is modeled as a binary d-bit string, where d is the number of features. BCS represents each nest as a binary vector, where each 1 corresponds to a selected feature and each 0 to a discarded one. Thus each nest represents a possible solution, and each egg (bit) represents a feature.
The original cuckoo algorithm is extended to discrete binary regions by introducing a mapping function as follows [25]:

S(x_ij^t) = 1 / (1 + e^(−x_ij^t)),    (7)

x_ij^(t+1) = 1 if rand() < S(x_ij^t), and 0 otherwise,    (8)

in which rand() ~ U(0, 1) and x_ij^t denotes the egg's value at iteration t.
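A minimal Python sketch of this binarization, assuming the common sigmoid transfer function used by the binary CS variants cited above:

```python
import math
import random

def sigmoid(x):
    """Sigmoid transfer function S(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def binarize(nest):
    """Map a continuous nest to a binary feature mask: bit j becomes 1
    (feature selected) when a uniform draw falls below S(x_ij)."""
    return [1 if random.random() < sigmoid(xj) else 0 for xj in nest]
```

Strongly positive coordinates are almost always mapped to 1 and strongly negative ones to 0, while coordinates near zero flip stochastically.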

Elite Hybrid Binary Cuckoo Search (EHBCS) Algorithm
3.1. Feature Weighting Based on Relief Algorithm. The core idea of feature weighting based on Relief is to estimate each feature's weight and importance according to its ability to distinguish instances of different classes [32]. Given a two-class dataset D containing n cases, let C be the class label set and x = (x_1, x_2, ⋯, x_d) a case in D, where x is a real-valued vector of dimension d. Relief performs the following iterative learning: randomly select a case x, find the nearest case NH(x) of the same class and the nearest case NM(x) of the different class, and then update the weights using the following rule:

w_j = w_j − |x_j − NH(x)_j| / T + |x_j − NM(x)_j| / T,    (10)

where w_j represents the weight of the jth feature and T represents the maximum number of iterations. |x_j − y_j| calculates the difference between the jth-dimensional feature values of two instances, that is, the jth component of the absolute feature difference vector. A variant that considers k neighbors has been developed from the nearest-neighbor Relief; its weight update formula is

w_j = w_j − Σ_{y ∈ KNN(x; c)} |x_j − y_j| / (T × k) + Σ_{y ∈ KNN(x; c′)} |x_j − y_j| / (T × k),    (11)

where KNN(x; c) is the set of k nearest neighbors of x in class c by Euclidean distance, c is the class of x, and c′ is the opposite class. The process is shown in Algorithm 2.
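The nearest-neighbor Relief update can be sketched in Python as follows. This is a simplified illustration of the procedure above (k = 1, Euclidean distance, hypothetical function names), assuming each class has at least two cases:

```python
import random

def relief(X, y, T=100, seed=0):
    """Relief feature weighting for a two-class dataset (illustrative sketch).
    X: list of real-valued cases, y: class labels. Returns one weight per feature."""
    rng = random.Random(seed)
    d = len(X[0])
    w = [0.0] * d

    def nearest(x, pool):
        # Nearest case in `pool` by (squared) Euclidean distance.
        return min(pool, key=lambda z: sum((a - b) ** 2 for a, b in zip(x, z)))

    for _ in range(T):
        i = rng.randrange(len(X))
        x, c = X[i], y[i]
        same = [X[j] for j in range(len(X)) if y[j] == c and j != i]
        diff = [X[j] for j in range(len(X)) if y[j] != c]
        nh, nm = nearest(x, same), nearest(x, diff)
        for j in range(d):
            # Penalize separation from the nearest hit NH(x), reward
            # separation from the nearest miss NM(x), averaged over T draws.
            w[j] += (-abs(x[j] - nh[j]) + abs(x[j] - nm[j])) / T
    return w
```

A feature that varies between classes but not within them accumulates a large positive weight; a feature identical in both classes stays near zero.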

3.2. Selection and Crossover Operator. The selection operator inherits the individuals with high fitness in the current population to the next generation according to the selection probability. Generally, individuals with higher fitness have more chances to be inherited. This paper uses the roulette-wheel model to select individuals. The calculation formulas are as follows:

p(x_i) = f(x_i) / Σ_{j=1}^{n} f(x_j),    (12)

q_i = Σ_{j=1}^{i} p(x_j),    (13)

where p(x_i) is the selection probability, q_i is the cumulative probability, f(x_i) is the fitness value of individual x_i, and n is the size of the population. The selection process is given in Algorithm 3. Crossover crosses a selected pair of individuals according to a probability, for example by single-point or multipoint crossover. In this paper, single-point crossover is adopted: a random number within the range of the individual's coding bits is generated as the crossover point, and then the codes of the two individuals from this point to the end are exchanged, completing the crossover process.
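The roulette-wheel selection and single-point crossover described above can be sketched as follows (illustrative Python, not the paper's Matlab implementation):

```python
import random

def roulette_select(pop, fitness, rng=random):
    """Select one individual with probability p(x_i) = f(x_i) / sum_j f(x_j),
    by walking the cumulative probabilities q_i."""
    total = sum(fitness)
    r = rng.random() * total
    q = 0.0
    for ind, f in zip(pop, fitness):
        q += f              # cumulative fitness (q_i scaled by the total)
        if r <= q:
            return ind
    return pop[-1]          # guard against floating-point round-off

def single_point_crossover(a, b, rng=random):
    """Exchange the tails of two individuals after a random cut point."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]
```

Note that crossover only rearranges existing bits between the two parents, so the combined bit multiset is preserved; diversity comes from recombining, not mutating.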

3.3. Weight-Based Elite Hybrid Binary Cuckoo Search (EHBCS) Algorithm. In the CS algorithm, the Lévy flight explores the search space using straight flight paths with sudden 90-degree turns, and Figure 1 simulates a Lévy flight path. In addition, the CS algorithm is highly dependent on the random walk search, which can easily move from one area to another without carefully exploring around each nest. Therefore, the CS algorithm has weak local search ability and low optimization accuracy [31]. To cover this weakness of CS, the elite strategy and genetic algorithm operators, namely the selection and crossover operators, are embedded into the cuckoo algorithm so that well-positioned nests can be inherited to the next generation. The so-called elite strategy preserves the nests in good locations so as not to lose the optimal nest during the Lévy-flight iterations of the algorithm. According to certain rules, the selection operator inherits the individuals with high fitness in the current population to the next generation; generally, individuals with high fitness have more chances to be inherited. The crossover operator usually takes two individuals as candidate solutions and, with a certain probability, generates neighborhood solutions by exchanging part of the chromosomes of the two individuals.

Algorithm 1: Classical version of CS, adapted from [15].
while (t < MaxGeneration) or (stop criterion) do
    Get a cuckoo randomly by Lévy flights and evaluate its quality/fitness F_i
    Choose a nest among n (say, j)
    if (F_i < F_j) then
        replace j by the new solution
    end
    A fraction (p_a) of worse nests is abandoned and new ones are built
    Keep the best solutions (or nests with quality solutions)
    Rank the solutions and find the current best
end
Postprocess results and visualization

Algorithm 2: Relief algorithm.
Input: binary label dataset D with n cases and d dimensions, Maxiter T
while t ≤ T do
    Randomly select a case x from the dataset D and calculate the distances to the k nearest cases of the same class NH(x) and the k nearest cases of the different class NM(x);
    foreach j (j = 1, ⋯, d) do
        update w_j by formula (11);
    end
    t = t + 1;
end
The CS algorithm is suited to continuous-domain problems, whereas feature selection is a binary discrete problem. Considering these facts, this paper proposes the Elite Hybrid Binary Cuckoo Search (EHBCS) algorithm. The EHBCS algorithm first weights the features according to the Relief algorithm described in Section 3.1, so that features with larger weights have a greater chance of being selected. Then, in each iteration of the EHBCS algorithm, the optimal nest undergoes neither Lévy flight nor crossover, to avoid damaging the optimal nest position. The nests generated by Lévy flight are processed by the selection and crossover operators.
Since the existing BCS algorithm does not account for feature importance in the Sig(step) function, the coefficient in the Sig(step) function is changed to the feature weight in this paper, so that features with large weights have a greater chance of being selected and the improved algorithm can finish the iterative process faster. The BCS mapping function is modified as

Sig(step) = 1 / (1 + e^(−w_j × step)),

x_ij^(t+1) = 1 if rand() ≤ Sig(step), and 0 otherwise,

where w_j is the Relief weight of the jth feature.

Algorithm 3: Elite selection and crossover operators.
Input: population with n nests, number of dimensions (features) d, crossover rate p_c, fitness function f(x)
Output: new population newnest after elite selection and crossover
foreach i (i = 1, ⋯, n) do
    ⋯
end
The crossed nests and Bestnest form the new population newnest as output
The function Sig(step) does not represent the probability of a change; rather, it represents the probability that a given bit becomes 1. Let γ = −5, −3, 3, 5; the corresponding function graphs are shown in Figure 2. It can be seen from the figure that, at the same abscissa, the greater the coefficient, the greater the corresponding function value; that is, the greater the feature weight, the greater the probability of the feature being selected.
It should be emphasized that the weights calculated by the Relief algorithm may be negative; a negative weight indicates that the distance to same-class neighbor samples is larger than the distance to different-class neighbor samples. Such a feature is therefore considered unfavorable to classification, and the probability of selecting it during feature selection is correspondingly low.
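The effect of the weight coefficient can be checked numerically. The sketch below, an illustration of the weighted sigmoid form described above, shows that for a fixed positive step a larger weight yields a larger selection probability, while a negative weight pushes the probability below 0.5:

```python
import math

def weighted_sig(step, w):
    """Modified BCS mapping: the Relief feature weight w scales the sigmoid
    argument, so heavily weighted features are more likely to be selected."""
    return 1.0 / (1.0 + math.exp(-w * step))
```

For step = 1, the probabilities for w = −5, −3, 3, 5 mirror the curves of Figure 2: larger coefficients lie higher at the same abscissa, and negative Relief weights keep the corresponding feature's selection probability low.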
Because the purpose of the nest-discovery and crossover operations is to diversify the population, this paper adopts the crossover operation in place of the discovery operation. In the late iterations of the algorithm, the elite strategy proposed in this paper ensures convergence. The elite selection and crossover operators and the pseudo-code of the proposed algorithm are given in Algorithm 3 and Algorithm 4.

Experimental Methodology
4.1. Datasets. Eight datasets were extracted from the UCI Machine Learning Repository [33][34][35]. To make a more comprehensive comparison between the proposed algorithm and other algorithms, four low-dimensional and four high-dimensional feature datasets were selected. Each dataset has two classes, and Table 1 provides each dataset's name, total number of features, total number of cases, and classification accuracy before feature selection.

4.2. Performance Evaluation Measures. Generalization ability is the ability of a model to make accurate predictions on new data after training on the training datasets. Cross-validation is a method for evaluating model generalization ability that is widely used in data mining and machine learning [36]. In cross-validation, the dataset is divided into two parts: the training set, used to build the prediction model, and the test set, used to test the model's generalization ability. k-fold cross-validation was performed, with k = 5 for datasets with fewer than 100 cases and k = 10 for datasets with more than 100 cases. The evaluation indicators used include Accuracy, Sensitivity, Specificity, Precision, and F-measure [37].
These measures are defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN),
Sensitivity (SE) = TP / (TP + FN),
Specificity (SP) = TN / (TN + FP),
Precision (Pre) = TP / (TP + FP),
F-measure (F1) = 2 × Pre × SE / (Pre + SE),

where TP is the total number of positive cases correctly identified as positive, TN is the total number of negative cases correctly identified as negative, FP is the total number of negative cases wrongly identified as positive, and FN is the total number of positive cases wrongly identified as negative.
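These measures translate directly into code; a small Python helper (illustrative, not from the paper):

```python
def classification_metrics(tp, tn, fp, fn):
    """Binary-classification measures computed from the confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)   # Accuracy
    se = tp / (tp + fn)                     # Sensitivity (recall)
    sp = tn / (tn + fp)                     # Specificity
    pre = tp / (tp + fp)                    # Precision
    f1 = 2 * pre * se / (pre + se)          # F-measure
    return acc, se, sp, pre, f1
```

For example, with TP = 50, TN = 30, FP = 10, FN = 10, accuracy is 0.8 while sensitivity and precision both equal 50/60.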
For the overall classification performance of each algorithm, we calculate the average value over all test folds:

M̄ = (1/k) Σ_{i=1}^{k} M_i,

where k is the total number of folds and M_i is the value of a measure on the ith fold.

4.3. Evaluating Classification Performance. The support vector machine (SVM) classifier was adopted to evaluate the classification accuracy of the feature subsets. SVM is a supervised machine learning algorithm introduced by Boser et al. [38], in which data are mapped to points in an n-dimensional feature space (n = number of features). The final output of SVM is an optimal hyperplane that classifies new cases. SVM depends strongly on the kernel function, so experiments with different kernel functions are fundamental. The kernel function is a similarity function that determines how cases are compared in the feature space; the kernel functions used in the experiments are listed in Table 2.

4.4. Fitness Function. The main objective of the feature selection task is to find a subset of features from the dataset so that the learning algorithm can use these selected features to achieve as high an accuracy as possible.
In classification problems, two feature subsets of different sizes may well have the same classification accuracy on the same dataset. Therefore, at equal classification accuracy, if the metaheuristic algorithm finds the larger subset earlier, the subset with fewer features will be ignored. In this paper, a new evaluation method is proposed as the fitness function to overcome this constraint; it considers the classification accuracy and takes the rate of feature reduction as an adjusting term. Let d be the total number of features in the dataset, s the number of features selected by the metaheuristic optimization algorithm, β the weight of the rate of feature reduction, and 1 − β the weight of the average accuracy. The fitness value is calculated as shown in Equation (28):

fitness = (1 − β) × acc + β × (d − s) / d.    (28)

We set β = 0.2.

4.5. Parameter Settings. Table 3 lists the parameter values for each algorithm. The population size of all optimization algorithms is set to 30, and each algorithm was run 5 times to perform the feature selection task. All runs were executed in Matlab 2017 on a Windows 10 operating system on a Huawei MagicBook with an Intel(R) Core(TM) i5-8250U at 1.6 GHz and 8 GB of RAM.
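Under the definitions above, the fitness computation is a one-liner. The sketch below assumes the weighted-sum form implied by the text (accuracy weighted by 1 − β, feature-reduction rate weighted by β):

```python
def fitness(avg_acc, n_selected, n_total, beta=0.2):
    """Fitness = (1 - beta) * accuracy + beta * feature-reduction rate."""
    return (1 - beta) * avg_acc + beta * (n_total - n_selected) / n_total
```

At equal accuracy, a subset with fewer selected features scores strictly higher, which resolves the tie described above in favor of the smaller subset.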
4.6. Analysis of Computational Complexity. The EHBCS algorithm uses the Relief algorithm, the binary conversion of Lévy flight, and the selection and crossover process. For the Relief algorithm, assume that the number of runs is M, the number of iterations is m, the number of cases is N, and the individual dimension is d; the complexity is O(m × N × d × M). For Lévy flight and binary conversion, assume that the number of individuals is n, the individual dimension is d, and the number of iterations is t; the computational complexity is O(n² × d × t). For selection and crossover, with n individuals, the computational complexity is O(n² × t × d). Therefore, the overall computational complexity of the EHBCS algorithm is O(m × N × d × M + n² × t × d).

Experimental Results

Figures 3 and 4 present the performance of all optimization algorithms for feature selection on the medical datasets described in Section 4.1. They contain the following information:

Accuracy: classification accuracy for each dataset
All: classification accuracy before feature selection for each dataset
Dataset: the dataset used for experimentation, as described in Table 1
Avg: average of the corresponding data obtained by the three algorithms

The experimental results show that the average feature subsets are smaller for all datasets, and the average classification accuracy is improved to different degrees. Compared with the original datasets, the number of features in the average subsets after feature selection by the optimization algorithms was reduced by about 18.395%-89.667%, and the average classification accuracy was improved by about 3.3%-34.6%.

For low-dimensional datasets, such as Cervical Cancer Behavior Risk, Breast Cancer Wisconsin (diagnostic), Breast Cancer Wisconsin (prognosis), and Sonar, the EHBCS algorithm can effectively reduce features to obtain a smaller subset of target features. It achieves the minimum standard deviation among the three algorithms, which shows that the EHBCS algorithm is the most stable of the three, although it ranks second among the three optimization algorithms in terms of classification accuracy, SE, SP, Pre, and F1. Compared with the data corresponding to Avg, the EHBCS algorithm has the minimum standard deviation and higher classification accuracy, SE, SP, Pre, and F1 overall. Compared with classification on the original datasets, the number of subset features after feature selection by the EHBCS algorithm is reduced by 58.182%-80%, and the classification accuracy is improved by 5%-33.9%. The results show that the EHBCS algorithm can efficiently diminish the number of features while ensuring accuracy, although it did not perform best on low-dimensional datasets.
For high-dimensional datasets, such as Colon Tumor, Medulloblastomas, Central Nervous System, and Leukemia, the average classification accuracy, standard deviation, SE, SP, Pre, and F1 obtained by the EHBCS algorithm were superior to those of BGA and BPSO on the whole. Compared with the data corresponding to Avg, the average classification accuracy of the EHBCS algorithm is improved by 1%-10.6%, and EHBCS attains a lower standard deviation. It should be noted, however, that the standard deviation of the EHBCS algorithm is greater than the corresponding Avg value when the accuracy fitness (Function (23)) is adopted for the Medulloblastomas and Central Nervous System datasets. Apart from these cases, SE, SP, Pre, and F1 are optimal overall. Compared with classification on the original datasets, the number of subset features after feature selection by the EHBCS algorithm is reduced by 43.772%-53.498%, and the classification accuracy is improved by 4.5%-22.8%. The results show that the feature selection method based on EHBCS attains higher classification accuracy, SE, SP, Pre, and F1 and a smaller standard deviation; the EHBCS algorithm is thus better suited to feature selection on high-dimensional datasets.
It should be emphasized that the purpose of feature selection is to reduce irrelevant or weakly correlated features as much as possible on the premise of ensuring classification accuracy. However, the number of features in a subset cannot be reduced indefinitely: too small a subset may lose important features, thus harming the classification accuracy on the dataset. Therefore, it is necessary to balance classification accuracy against the number of selected features. In practical applications, the evaluation function should be set scientifically and reasonably to ensure the classification performance of the feature subsets.

Conclusion
This paper proposes an Elite Hybrid Binary Cuckoo Search algorithm that adopts feature weighting and an elite strategy. The proposed EHBCS algorithm aims to optimize the feature selection task on binary label datasets. The experimental results show that EHBCS achieves better classification performance. Besides, all statistical metrics (standard deviation (Std), sensitivity (SE), specificity (SP), precision (Pre), and F-measure (F1)) reveal markedly that EHBCS is superior to BGA and BPSO. However, the algorithm still has shortcomings, such as increased computational complexity.
Future work requires further modification of the proposed algorithm to make it suitable for feature selection of multiclass datasets and to evaluate the results using different datasets and classification models.