A Master-Slave Binary Grey Wolf Optimizer for Optimal Feature Selection in Biomedical Data Classification

This paper introduces a new master-slave binary grey wolf optimizer (MSBGWO). A master-slave learning scheme is added to the grey wolf optimizer (GWO) to improve its ability to explore the search space and find better solutions. Five high-dimensional biomedical datasets are used to test the ability of MSBGWO in feature selection. The experimental results of MSBGWO are superior in terms of classification accuracy, precision, recall, F-measure, and number of features selected when compared to those of the binary grey wolf optimizer version 2 (BGWO2), binary genetic algorithm (BGA), binary particle swarm optimization (BPSO), differential evolution (DE) algorithm, and sine-cosine algorithm (SCA).


Introduction
Many datasets, especially biomedical ones, are high dimensional; that is, they have a large number of features per sample. Most of these features are either redundant or irrelevant and introduce noise that degrades the performance of a classifier used in medical diagnosis. It is therefore important to apply dimensionality reduction methods that select the most informative subset of features. Feature selection is one such method [1].
Depending on the search strategy, the feature selection methods can be categorized as wrapper, filter, and embedded methods. Wrapper methods have an underlying learning algorithm used to evaluate the quality of selected features. Filter methods are efficient in terms of execution time and are independent of any learning algorithm. The embedded methods include both the wrapper and filter methods [2,3]. This paper focuses on the wrapper method for feature selection.
Feature selection is an NP-hard problem, and two families of methods are traditionally used to solve it: exact methods and metaheuristics [2]. Exact methods are time consuming because they have to consider every possible subset, which becomes computationally expensive as the search space grows. Nature-inspired metaheuristic algorithms are therefore generally preferred, as they are able to find good solutions without traversing the entire search space of a given problem. Examples of metaheuristic algorithms that have been used in feature selection include the genetic algorithm (GA) [4], particle swarm optimization (PSO) [5], ant colony optimization (ACO) [6], salp swarm algorithm (SSA) [7], krill herd algorithm (KHA) [8], dragonfly algorithm [9], grasshopper optimization algorithm [10], whale optimization algorithm [11], firefly algorithm [12], ant lion optimizer [13], emperor penguins algorithm [14], and sine-cosine algorithm [15].
The grey wolf optimizer (GWO), developed by Mirjalili et al. [16] in 2014, mimics the social and hunting behavior of grey wolves in nature. The GWO is popular for its excellent search capability [17] and has the advantages of few control parameters and a fast convergence rate. It has been used in a number of fields, including unmanned combat aerial vehicle path planning [18], medical diagnosis [19], economic dispatch [20], intrusion detection [21], EMG signal classification [22], and solving engineering problems [23]. However, in a large search space, the GWO is vulnerable to getting trapped in local optima.
Researchers have suggested various methods to improve the global search capability of the GWO. In [17], variable weights are used in determining the position of a wolf and an exponential control parameter is introduced; the experimental results showed its dominance over the GWO, ALO, PSO, and bat algorithms. In [19], the GWO is hybridized with the genetic algorithm (GA), and using the kernel extreme learning machine (KELM), it outperformed the GA and GWO in the performance metrics on the Parkinson and breast cancer datasets. In [24], particle swarm optimization (PSO) and the GWO are combined, and the results are superior to those of other algorithms. In [22], the concept of competition is introduced among the population of wolves, and the resulting algorithm outperformed the binary grey wolf optimizer, binary genetic algorithm, and binary particle swarm optimization in classification. In [25], the Powell local optimization method is introduced to the GWO for clustering analysis; compared to some evolutionary algorithms, the results were better on the benchmark functions and datasets considered.
Despite these improvements, no method has been able to exhaustively find the optimal solution when it comes to feature selection. In an effort to improve the exploration ability of GWO to escape the local optimum, this paper proposes a master-slave binary grey wolf optimizer (MSBGWO) algorithm. This proposed methodology alters the position of the wolves during exploration and exploitation and ensures diversification of the solutions to be considered.
The main contributions of this paper are as follows:
(i) The MSBGWO introduces a master-slave learning mechanism in which the bottom half of the wolves, in terms of fitness, learn from the top half in a sequential manner.
(ii) The proposed MSBGWO is applied to five high-dimensional datasets. The experimental results show that it is able to select the fewest features and obtain higher classification accuracy.
The rest of the paper is arranged as follows. Section 2 gives the background information of GWO. Section 3 presents the proposed master-slave binary grey wolf optimizer. Section 4 outlines the experimental design. Section 5 details the experimental results and discussion, and finally, Section 6 draws the conclusion.

Grey Wolf Optimizer
The GWO is part of the swarm intelligence family and, as stated before, mimics the social and hunting behavior of the grey wolf [16]. Grey wolves generally move in groups of 5-12 members. The social structure shown in Figure 1 comprises alphas at the top, betas that rank just below the alphas, omegas that lie at the bottom, and finally deltas that are neither omegas nor the top two.
The hunting behavior involves finding prey, encirclement and harassment of the prey to restrict its movement, and then finally attacking the prey.
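The process of encircling the prey can be modeled mathematically as in the equations below:

$$\vec{D} = \left| \vec{C} \cdot \vec{X}_p(t) - \vec{X}_w(t) \right|, \tag{1}$$

$$\vec{X}_w(t+1) = \vec{X}_p(t) - \vec{A} \cdot \vec{D}, \tag{2}$$

where $\vec{X}_w$ and $\vec{X}_p$ are positional vectors of the wolf and prey, respectively, $t$ represents the iteration, and $\vec{A}$ and $\vec{C}$ are vector coefficients.
The vectors $\vec{A}$ and $\vec{C}$ can be determined as follows:

$$\vec{A} = 2a \cdot r_1 - a, \tag{3}$$

$$\vec{C} = 2 \cdot r_2, \tag{4}$$

where $r_1$ and $r_2$ are random numbers uniformly distributed in $[0, 1]$ and $a$ is the encircling coefficient, which is linearly decreased from 2 to 0 as the iterations increase according to the equation below:

$$a = 2 - \frac{2t}{T}, \tag{5}$$

where $t$ is the current iteration and $T$ is the maximum number of iterations.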
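As an illustration of how the encircling coefficients behave, the following minimal Python sketch draws $a$, $\vec{A}$, and $\vec{C}$ in the standard GWO form ($a = 2 - 2t/T$, $A = 2a r_1 - a$, $C = 2 r_2$). The function names and the per-dimension sampling of $r_1$ and $r_2$ are assumptions made for this example and are not taken from the MATLAB implementation.

```python
import random


def encircling_coefficient(t, T):
    """Linearly decrease a from 2 (t = 0) to 0 (t = T)."""
    return 2.0 - 2.0 * t / T


def coefficient_vectors(a, dim, rng):
    """Draw A = 2*a*r1 - a and C = 2*r2, one component per dimension.

    Sampling r1 and r2 independently for each dimension is an assumption
    of this sketch.
    """
    A = [2.0 * a * rng.random() - a for _ in range(dim)]
    C = [2.0 * rng.random() for _ in range(dim)]
    return A, C


rng = random.Random(0)                      # fixed seed for repeatability
a = encircling_coefficient(t=25, T=100)     # a quarter of the way through
A, C = coefficient_vectors(a, dim=4, rng=rng)
```

Each component of A lies in [-a, a] and each component of C lies in [0, 2], so large values of a early in the run allow wider moves (exploration), while small values late in the run keep wolves near the leaders (exploitation).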
Hunting is usually led by the alpha. Beta and delta can occasionally participate in hunting. Since we have no idea of the position of the optimum prey, we assume that alpha, beta, and delta have a better knowledge of the position and thus can lead the rest of the pack. Mathematically, this is achieved by selecting the top three fittest solutions which are then used to update the other positional vectors of the grey wolves.
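The new position of the wolf is updated as follows:

$$\vec{X}_w(t+1) = \frac{\vec{X}_1 + \vec{X}_2 + \vec{X}_3}{3}, \tag{6}$$

where $\vec{X}_1$, $\vec{X}_2$, and $\vec{X}_3$ are calculated as follows:

$$\vec{X}_1 = \vec{X}_\alpha - \vec{A}_1 \cdot \vec{D}_\alpha, \tag{7}$$

$$\vec{X}_2 = \vec{X}_\beta - \vec{A}_2 \cdot \vec{D}_\beta, \tag{8}$$

$$\vec{X}_3 = \vec{X}_\delta - \vec{A}_3 \cdot \vec{D}_\delta, \tag{9}$$

where $\vec{X}_\alpha$, $\vec{X}_\beta$, and $\vec{X}_\delta$ are the positions of alpha, beta, and delta at iteration $t$, respectively. $\vec{D}_\alpha$, $\vec{D}_\beta$, and $\vec{D}_\delta$ are defined in the equations below:

$$\vec{D}_\alpha = \left| \vec{C}_1 \cdot \vec{X}_\alpha - \vec{X}_w \right|, \tag{10}$$

$$\vec{D}_\beta = \left| \vec{C}_2 \cdot \vec{X}_\beta - \vec{X}_w \right|, \tag{11}$$

$$\vec{D}_\delta = \left| \vec{C}_3 \cdot \vec{X}_\delta - \vec{X}_w \right|. \tag{12}$$

The GWO as described is used for continuous optimization problems. However, feature selection is a binary optimization problem; thus, the GWO is modified to a binary version, which has already been developed [4].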
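The leader-guided update of equations (6)-(12) can be sketched for a single wolf as follows. This is a minimal illustration under stated assumptions: the function name `gwo_update` is hypothetical, positions are plain Python lists, and fresh random coefficients are drawn for every leader and every dimension.

```python
import random


def gwo_update(x, leaders, a, rng):
    """Return the new continuous position of wolf x.

    leaders is [X_alpha, X_beta, X_delta]; each leader proposes a candidate
    X_k = X_leader - A_k * |C_k * X_leader - x|, and the new position is the
    average of the three candidates.
    """
    dim = len(x)
    candidates = []
    for leader in leaders:
        candidate = []
        for d in range(dim):
            A = 2.0 * a * rng.random() - a      # coefficient A for this leader
            C = 2.0 * rng.random()              # coefficient C for this leader
            D = abs(C * leader[d] - x[d])       # distance term D
            candidate.append(leader[d] - A * D)
        candidates.append(candidate)
    return [sum(c[d] for c in candidates) / len(candidates) for d in range(dim)]


rng = random.Random(1)
alpha, beta, delta = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
new_position = gwo_update([0.2, 0.8], [alpha, beta, delta], a=1.0, rng=rng)
```

When a reaches 0, every A is 0 and the wolf moves exactly to the mean of the three leaders, which is why diversity can collapse late in the run.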

Master-Slave Binary Grey Wolf Optimizer (MSBGWO)
In each generation or iteration of the grey wolf optimizer, the best three solutions are used in updating the position of each wolf. The omega wolves constitute a larger percentage of the population and have lower fitness relative to the alpha, beta, and delta wolves. By repositioning the weaker wolves in a guided approach, we can improve the diversification ability of GWO in search of better solutions. A master-slave learning scheme is hereby introduced. In each generation, the wolves are sorted in ascending order of fitness. The top half are then termed master wolves, and the remaining half become slave wolves. Each slave wolf is assigned a master wolf from which it will learn.
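The partition into masters and slaves can be sketched as below. Two points are assumptions of this example rather than statements from the paper: lower fitness is treated as better (so the first half of the ascending list forms the masters), and the k-th slave is paired with the k-th master. The function name `split_masters_slaves` is hypothetical.

```python
def split_masters_slaves(fitness):
    """Split wolf indices into masters and slaves and pair them up.

    Wolves are sorted by ascending fitness; the first half become masters
    and the rest become slaves. The pairing rule used here (k-th slave with
    k-th master, wrapping around for odd population sizes) is an assumption
    of this sketch.
    """
    if len(fitness) < 2:
        raise ValueError("at least two wolves are required")
    order = sorted(range(len(fitness)), key=lambda i: fitness[i])
    half = len(order) // 2
    masters, slaves = order[:half], order[half:]
    pairs = [(slave, masters[k % half]) for k, slave in enumerate(slaves)]
    return masters, slaves, pairs


masters, slaves, pairs = split_masters_slaves([0.30, 0.10, 0.25, 0.05])
# masters -> [3, 1], slaves -> [2, 0], pairs -> [(2, 3), (0, 1)]
```

Re-sorting in every generation means a slave that improves enough can become a master in the next iteration.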
The slaves will learn from the masters using the following equations: where $D_L$ is the fraction of the distance between a master and a slave wolf, $\omega \in [0, 1]$ is the learning coefficient, $C_4$ is determined by equation (4), $X_M$ is a master wolf, $X_S$ is a slave wolf, and $S$ and $M$ are evaluated using equation (14).
For a population of $N$ wolves,
Following the principles of equation, the new continuous position of the slave wolves is calculated as follows: where $X_n$ is the continuous solution and $A_4$ is determined as in equation (3).
Since feature selection is a binary problem, the continuous solutions are forced to be binary [26]: where $X_{Sd}$ is the new binary solution for a slave wolf in dimension $d$, $\mathrm{rand} \in [0, 1]$, and $S$ is a sigmoid function. Equation (16) is also applied when the positions of all wolves are updated, but rand is now set to 0.5.

Algorithm 1: Pseudocode for MSBGWO.
Begin
    Randomly initialize the positions of the wolves
    Sort the wolves in ascending order of fitness
    While the maximum number of iterations is not exceeded
        Determine a as in equation (18)
        Sort the wolves in ascending order of fitness
        Masters = top half of the pack (by fitness)
        Slaves = remaining half of the pack
        Slaves update their positions using equations (13), (15), and (16)
        For each wolf
            Determine A and C using equations (3) and (4)
            Determine D for X_α, X_β, and X_δ using equations (10), (11), and (12)
            Determine X_1, X_2, and X_3 using equations (7), (8), and (9)
            Determine the position using equations (6) and (16)
        End
        Determine the fitness of each wolf
        Update the positions of X_α, X_β, and X_δ
    End
    Return X_α as the solution
End
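The conversion from a continuous position to a binary feature mask can be illustrated with the sketch below. The sigmoid used here, $1/(1 + e^{-10(x - 0.5)})$, is the form commonly seen in binary GWO variants and is an assumption of this example, as are the comparison direction (a bit is set when the sigmoid value is at least the threshold) and the function names.

```python
import math
import random


def sigmoid(x):
    """Assumed transfer function: 1 / (1 + exp(-10 * (x - 0.5)))."""
    z = -10.0 * (x - 0.5)
    z = max(-700.0, min(700.0, z))      # keep math.exp within float range
    return 1.0 / (1.0 + math.exp(z))


def binarize(position, rng=None):
    """Map a continuous position to a 0/1 feature mask.

    With rng given, each dimension is compared with a fresh rand in [0, 1)
    (as for the slave update). Without rng, the threshold is fixed at 0.5
    (as when all wolves are updated).
    """
    bits = []
    for x in position:
        threshold = rng.random() if rng is not None else 0.5
        bits.append(1 if sigmoid(x) >= threshold else 0)
    return bits


fixed_mask = binarize([0.9, 0.1, 0.7])                       # threshold 0.5
random_mask = binarize([0.9, 0.1, 0.7], rng=random.Random(0))
```

A bit value of 1 means the corresponding feature is kept and 0 means it is dropped before the classifier is trained.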
The slave wolves can now be integrated with the master grey wolves in the population and can now move to the next generation.
To increase the number of iterations in the exploration stage, the nonlinear control parameter $a$ adopted in [17] is used in place of the linearly decreasing $a$ of equation (5).
The pseudocode of MSBGWO is presented in Algorithm 1, and the flowchart is shown in Figure 2.

Experimental Design
4.1. Datasets. A total of five high-dimensional biomedical datasets obtained from [4] were used for validation. Each dataset has two class labels. The datasets are shown in Table 1.
Each algorithm is run k times, and the results are averaged as follows:
4.3. Fitness Function. Feature selection is a bi-objective problem concerned with minimizing misclassification errors and minimizing the number of features selected. Thus, the fitness function is determined by the following equation, which is from [27]:
where AvgAcc is the average accuracy determined by the KNN classifier, S is the number of selected features, D is the total number of features, and α is set to 0.8 in this paper.
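To make the trade-off between the two objectives concrete, the sketch below uses a commonly used weighted-sum form, $\alpha (1 - \mathrm{AvgAcc}) + (1 - \alpha) S / D$, in which lower values are better. This exact form is an assumption of the example, chosen to be consistent with the variables defined above; the function name `fitness` is hypothetical.

```python
def fitness(avg_acc, n_selected, n_total, alpha=0.8):
    """Weighted sum of error rate and feature ratio (lower is better).

    Assumed form: alpha * (1 - AvgAcc) + (1 - alpha) * S / D.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return alpha * (1.0 - avg_acc) + (1.0 - alpha) * (n_selected / n_total)


# Two hypothetical candidate subsets on a 2000-feature dataset.
small_subset = fitness(avg_acc=0.95, n_selected=20, n_total=2000)
large_subset = fitness(avg_acc=0.95, n_selected=800, n_total=2000)
```

With equal accuracy, the subset with fewer features receives the lower (better) value, which is how the second objective breaks ties between equally accurate solutions.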

Parameter Setting
The performance of the proposed MSBGWO is compared to that of the binary grey wolf optimizer version 2 (BGWO2), binary genetic algorithm (BGA), binary particle swarm optimization (BPSO), differential evolution (DE) algorithm, and sine-cosine algorithm (SCA). The parameter values for the algorithms are listed in Table 2.
The value of ω was selected as 0.1 after values ranging from 0.1 to 1 were considered.
The population size is set to 10 in each of the algorithms, and the number of iterations is set at 100. To complete the wrapper-based approach, a KNN classifier with Euclidean distance, k = 5, is also used. A KNN classifier performs optimally when dealing with normalized data, and therefore, all datasets were normalized in the preprocessing step.
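The wrapper evaluation step can be sketched as a KNN vote restricted to the features selected by a binary mask. This is a minimal illustration: min-max scaling is assumed as the normalization method (the text only states that the data were normalized), and the function names are hypothetical.

```python
import math
from collections import Counter


def min_max_normalize(rows):
    """Scale every column to [0, 1]; constant columns become 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def knn_predict(train_x, train_y, query, mask, k=5):
    """Majority vote of the k nearest neighbours (Euclidean distance),
    using only the dimensions whose mask bit is 1."""
    selected = [d for d, bit in enumerate(mask) if bit]
    distances = sorted(
        (math.sqrt(sum((row[d] - query[d]) ** 2 for d in selected)), label)
        for row, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


raw = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels = [0, 0, 0, 1, 1, 1]
scaled = min_max_normalize(raw)
prediction = knn_predict(scaled, labels, query=[0.0, 0.0], mask=[1, 1], k=3)
```

In a full wrapper, this prediction step would be repeated inside cross-validation for each candidate mask, and the resulting average accuracy would feed the fitness function.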
Each algorithm is run 10 times on an Intel® Core™ i5 CPU M 520 @ 2.40 GHz to provide a good measure of the results. The implementation is in MATLAB.
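The per-run metrics and their averages over repeated runs can be computed as in the sketch below. The metric definitions are the standard ones based on confusion-matrix counts; the use of the population standard deviation and the function names are assumptions of this example, and the confusion counts shown are hypothetical.

```python
from statistics import mean, pstdev


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, sensitivity (recall), and F-measure."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + sensitivity
    f_measure = 2.0 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f_measure": f_measure,
    }


def summarize(runs):
    """Mean and standard deviation of each metric over the runs."""
    return {
        key: (mean([run[key] for run in runs]), pstdev([run[key] for run in runs]))
        for key in runs[0]
    }


# Hypothetical confusion counts (tp, fp, tn, fn) from three runs.
runs = [classification_metrics(8, 2, 9, 1),
        classification_metrics(9, 1, 9, 1),
        classification_metrics(7, 3, 8, 2)]
summary = summarize(runs)
```

Reporting the standard deviation alongside the mean is what allows the stability comparisons between algorithms discussed in the results.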

Experimental Results and Discussion
Experimental results of the proposed MSBGWO were compared to those of BGWO2, BGA, BPSO, DE, and SCA. The classification accuracy, precision, sensitivity, and F-measure over 10 runs using 10-fold CV have been averaged using equations (23)-(27) in Section 4.2 to provide the final results. Box plots have also been used to probe the variations.
Table 3 presents the detailed results on the colon cancer dataset. In the table, we see that the proposed MSBGWO was able to achieve the highest classification accuracy of 0.957. The minimum accuracy of 0.919 when MSBGWO was used to select features was higher than that of BGWO2, BGA, BPSO, and DE. It was only lower than that of SCA. The average classification accuracy, average precision, and average F-measure were also the best for MSBGWO among the algorithms considered, while it selected the fewest features. However, BGWO2 was able to achieve the highest average sensitivity. BPSO was more stable, as it had the lowest standard deviation in the average classification accuracy (lower than that of MSBGWO). The box plot in Figure 3 also shows that the median values of MSBGWO for accuracy, precision, sensitivity, and F-measure are well above those of the other algorithms. The overall superiority of MSBGWO can be attributed to its ability to diversify its solutions and reduce the risk of being trapped in local optima.
In Table 4, we see that using MSBGWO, the values of accuracy, precision, sensitivity, and F-measure were the best compared to those of the other algorithms. It also selected the fewest features. Figure 4 shows the box plots, in which the median values of MSBGWO for accuracy, precision, sensitivity, and F-measure are above those of the other algorithms considered. This shows that MSBGWO was more explorative in the search space than the other algorithms, ensuring that it selected the most informative features.

Leukemia Dataset Results
In Table 5, it is noted that a sensitivity of 100% was obtained when BGWO2, DE, and MSBGWO were used for feature selection. We again see that MSBGWO selected the fewest features in comparison with the other algorithms. The average values of classification accuracy, precision, and F-measure using the proposed MSBGWO are also the best among the algorithms. In fact, the minimum accuracy obtained using MSBGWO betters the maximum achieved by the other algorithms. In the box plot in Figure 5, we note that the median values of MSBGWO are clearly superior. The ability of MSBGWO to avoid the local optima by [...]
In Table 6, we note that using MSBGWO, we attained the highest classification accuracy. The minimum classification accuracy for MSBGWO also matched the maximum classification accuracies for BGWO2 and SCA. The average values for accuracy, precision, sensitivity, and F-measure for MSBGWO were the highest among the algorithms. This is shown as well in the box plots in Figure 6, where the median values obtained using MSBGWO are superior.

Ovarian Cancer Results
From Table 7, the MSBGWO proves to be superior, as it selected the fewest features on average and had the highest classification accuracy, and its minimum classification accuracy was not bettered by the maximum accuracy of the remaining algorithms. Average [...] Table 8.
Figure 7: Box plots on accuracy, precision, sensitivity, and F-measure on the ovarian cancer dataset.
The null hypothesis is that the median values of two samples will be equal, and the alternative hypothesis is that the median values are unequal. h = 0 indicates that the null hypothesis is not rejected, and h = 1 indicates that it is rejected.
In summary, we see that the proposed MSBGWO selected the fewest features, which proved to be the most informative, as the accuracy, precision, sensitivity, and F-measure were better than those of BGWO2, BGA, BPSO, DE, and SCA on the datasets in Table 1. This demonstrates the superiority of the algorithm when it comes to feature selection and shows that the modification of GWO helped in diversification.

Conclusion
A master-slave binary grey wolf optimizer is proposed in this paper. A master-slave learning scheme is introduced to improve the exploration ability of the grey wolf optimizer. Five biomedical datasets are used to test the strength of the proposed MSBGWO. The experimental results show that the proposed algorithm outperforms BGWO2, BGA, BPSO, DE, and SCA in the performance metrics considered in this paper. In future work, the proposed algorithm can be used in noncontinuous optimization problems. From the results, we see that BGA was the most stable; thus, hybridizing BGA and MSBGWO should be a consideration.

Data Availability
Data is available from the corresponding author upon request.

Conflicts of Interest
There is no conflict of interest.