Hybrid Gradient Descent Grey Wolf Optimizer for Optimal Feature Selection

Feature selection is the process of decreasing the number of features in a dataset by removing redundant, irrelevant, and randomly class-corrected data features. By applying feature selection on large and highly dimensional datasets, the redundant features are removed, reducing the complexity of the data and reducing training time. The objective of this paper was to design an optimizer that combines the well-known metaheuristic population-based optimizer, the grey wolf algorithm, and the gradient descent algorithm and test it for applications in feature selection problems. The proposed algorithm was first compared against the original grey wolf algorithm in 23 continuous test functions. The proposed optimizer was altered for feature selection, and 3 binary implementations were developed with final implementation compared against the two implementations of the binary grey wolf optimizer and binary grey wolf particle swarm optimizer on 6 medical datasets from the UCI machine learning repository, on metrics such as accuracy, size of feature subsets, F-measure, accuracy, precision, and sensitivity. The proposed optimizer outperformed the three other optimizers in 3 of the 6 datasets in average metrics. The proposed optimizer showed promise in its capability to balance the two objectives in feature selection and could be further enhanced.


Introduction
With the advent of Big Data, more computation power and higher specialized methods are required to turn the collected statistics into valuable information and prediction strategies. Solutions to relate Big Data problems may be expressed as a function for which we seek to find an optimal solution, often with preset constraints. An analytic solution may end up as highly complex, requiring high computational power, or even nonexistent. In these common cases, an alternative method is required to efficiently comb through the search space of the problem to yield the optimal solution. Optimization techniques are well suited for this since they are search methods in which the objective is to find a solution for a given optimization problem.
Metaheuristic optimization algorithms, in specific, are viable owing to their simplicity, conceptuality, and analyticity. They guide the search process to explore the search space and arrive at an optimal solution [1]. Metaheuristic algo-rithms may be classified according to different characteristics, some of which include the following.
Memory usage vs. memoryless distinguishes whether the algorithm retains information of the already traversed search space and if this information is utilized in determining the algorithm's future actions. A well-known example of this is the swarm intelligence algorithm [2], which mimics the pheromone trails, produced by ants as they move to and from the colony, in encoding information of the search space.
Static and dynamic objective function metaheuristics refer to the morphing of the objective function as the search space is traversed as experienced in [3]; the algorithm dynamically modifies its objective function by analyzing the search space and adding constraints to the features that then alter the objective function through "penalty terms." Population-based vs. single point search optimizers describe the utilization of either a set of solutions referred to as search agents [4] or a single solution within the search space at each iteration [5], respectively. One vs. various neighborhood structure metaheuristic algorithm is when a single search space is described by the objective function as opposed to multiple search spaces requiring the objective function to switch between different representations, as seen in [6].
Nature-inspired vs. non-nature-inspired simply specifies the insight that led to the creation of the algorithm. Many well-known metaheuristics are nature-inspired owing to the diverse and often ingenious methods through which nature has adapted to tackle naturally occurring optimization problems. The genetic algorithm, inspired by the process through which genetic information is encoded and handled in organisms to produce slight variations in offspring carrying traits of both parents, allows mutation and crossover in the parents' (the source of the genetic data) encoded features to produce offspring that may randomly acquire traits from either or both parents. From this idea, research was undertaken to create variants that exemplify the traits of the algorithm [7]. Some other algorithms are made after extensive research on natural processes in order to mimic nature such as the whale algorithm [8], dragonfly algorithm [9], and ant lion colony [10].
The grey wolf optimizer [11], inspired by the hunting patterns of the grey wolf (Canis lupus), is known for its well-defined pack leadership structure. The 4-tiered hierarchical hunting is encoded as tiered search agents, with the leaders driving the search direction and the rest encircling the prey (local minima).
Metaheuristic algorithms attempt to achieve a balance between diversification and intensification in the function space. As described in [12], an optimization algorithm cannot be a "one catch all," having great performance values for all dataset types. This drives researchers to attempt to develop hybrid algorithms that incorporate the core of different algorithm ideologies to achieve a performance greater than the sum of its parts. This is especially critical in the medical field with numerous ailments requiring constant research to find relevant algorithms for each case.
With the simplification of data collection strategies, especially in internet-based processes, researchers have gained access to large repositories of data commonly referred to as Big Data. The data may be analyzed to understand the relations between different features or classify the set into distinct groups or predict future outcomes from it. Big Data tends to contain many features with some not affecting the classification of the data. The features are ignored, reducing computer workload and increasing accurate prediction of the algorithm in development.
The aim of feature selection is to filter out the features of a dataset that are redundant, not contributing to the prediction and labeling of the data [13], achieving data reduction. It helps in understanding a given data sample to know which of the input features significantly affect output. In industry, feature selection is vital in fault detection and diagnosis [14].
In the medical field, it may translate to a medical researcher gaining insight on specific metrics which are used to diagnose a specific ailment in a patient or create prediction algorithms for diagnosis. For diagnostic cases such as those used later in this paper, a researcher seeks to acquire as much information on a patient as possible. It is through feature selection that the researcher may input the data collected and gain insight on which of the features are relevant in the diagnosis of a patient, as in diagnosis and classification of neurodegenerative disorders [15]. Feature selection may also be used in preemptive testing of a patient and flagging a patient even before signs of illness are noticeable.
Feature selection's most common classification methods are as follows [14]: filter methods, embedded and hybrid methods, structured and streaming features, and wrapper (black box) methods, which are employed in this paper. The wrapper methods generate a subset of the features by the quality of the performance on the modeling algorithm, interacted with as a black box such that the internal mechanisms of operation of the evaluator are not manipulated during optimization; rather, the subsets are feed-in, and the output is generated as performance metric(s).
Stochastic gradient descent [16] is an optimization algorithm that utilizes partial differentials at the current solutions' spatial coordinates to traverse a search space and locate a local minimum. The algorithm effectively takes steps, at each iteration, in the direction away from the current gradient of the function. One of the main shortcomings of this algorithm is in its tendency to get trapped in a local minimum, unable to find the global minimum, especially in functions with an exceedingly high number of local minima, since the partial derivative also increases the computational complexity of the function by increasing the number of calculations required in iterating to the next point but in doing so informs the search agent on the most efficient direction to arrive at the minimum.
The aim of this paper is to propose a hybrid of the grey wolf and the gradient descent to achieve better performance in well-known medical benchmark datasets against established feature selection strategies by solving the feature selection problems.
The paper is subdivided into the following sections: introduction, literature review, review of the proposed method, results and discussion, and conclusion.

Literature Review
Binary GWO [17], proposed in 2015, modified the algorithm to operate in binary feature selection. This variant of the GWO operates on binary input variables as wolves denoting the selection of each feature. In this, a "1" in the vector includes the corresponding feature in the grey wolf's search space, while a "0" denotes an exclusion. Two approaches are proposed: (1) where the alpha, beta, and delta wolves are changed into binary vectors, after which a crossover is performed to find the value of the new grey wolf, and (2) where the GWO operates on the wolves as continuousvalued inputs to achieve a binary vector the wolves are subjected to a threshold or transfer function, specifically a sigmoid function. The optimizer's performance is compared against PSO [4] and genetic algorithms [7], with the outcome placing GWO as a viable candidate for feature selection problems.

BioMed Research International
In 2017, [18] was proposed. The novel framework is composed of the feature selection module and the KELM, evaluating the selected feature set. Compared against both the genetic algorithm and the GWO algorithm on multiple medical datasets, including Parkinson's and WDBC, the enhanced GWO achieves smaller feature subsets with higher performance metrics.
In 2017, a hybrid GWO was proposed by Singh and Singh [19]. It improved the exploitation of the base GWO by combining it with PSO. Compared against its parent optimizers, the hybrid was able to perform equivalently or markedly better in a set of benchmark equations yielding better solutions with comparatively reasonable CPU times [19].
Chaotic GWO [20], proposed in 2018, incorporates a set of chaotic maps to increase the convergence rate of GWO. Chaotic maps are used in optimization, taking advantage of their nature to generate highly variant randomness from "close" initial conditions. The modified GWO is applied to various design problems in engineering including the classic tension/compression spring design problem and the pressure vessel design problem, consistently outperforming algorithms such as the PSO and ant bee colony optimization algorithm [2].
In 2018, a binary grey wolf optimizer was used in feature selection for EMG signal classification in [21]. In the paper, the leaders are enhanced by integrating a random walk around the alpha, beta, and delta wolves to prevent trapping at the local optimum. The EMG signals, consisting of 120 features, are used to examine the effectiveness of the method with a KNN used for fitness evaluation. Compared against BGWO (both versions), we have BPSO and GA on classification accuracy, precision,F-measure, and MCC. The method reduces the features to an average of 42.46 and a precision of 0.9493, outperforming all the other algorithms on all metrics while averaging the lowest computation time.
In 2019, a method was presented to diagnose Parkinson's disease using modified grey wolf optimization [22]. The method involves removing the redundant features in the Parkinson's dataset using the modified GWO. With Parkinson's disease, early diagnosis is vital since, with no cure, treatment administered early helps with mitigation of its adverse effects. The Parkinson's datasets, split into training and testing and comprising hand, hand meander, speech, and voice, have many features on which the MGWO-driven feature selection, using a machine learning model as the source of the error rate, is run. In this paper, KNN, random forest, and decision tree performance were compared on the accuracy, detection rate, and false alarm rate on the 4 datasets with random tree performing better and thus used to compare against an optimized cuttlefish algorithm-based feature selection method. The proposed method achieved a higher accuracy in all datasets but a lower feature subset in 3 of the 4 datasets, proving the viability of the method and the applicability of GWO in feature selection.
In 2019, a modified binary grey wolf optimization method was proposed [23] to increase the accuracy of intrusion detection systems by applying feature selection to the data to select the optimal number of features. The modifications to the original grey wolf entailed having four wolves used in the position update instead of three and updating the fitness function to use the inverse number of selected features instead of the ratio of the number of selected features to the total number of features. The NSL-KDD dataset was used as the benchmark with a support vector machine for classification. The proposed algorithm was compared against variants of GWO on its average accuracy performance and average number of features with the algorithm achieving a higher average accuracy highlighting the impact of the two modifications applied to the GWO. The proposed method was then tested against BGWO, binary PSO, and binary BAT on an 8 to 2 train and test data split and 4 different attack methods, and the overall results showed no significant variance in best accuracy but a big difference in feature subset size. Convergence rate comparisons were also carried out against BGWO with the results indicating better evolution of the number of features with accuracy in the modified BGWO. The final simulation results showed an increase in the classification of 99.22% with a feature set reduction from 41 to 14. These results were compared against other state-ofthe-art algorithms, including AdaBoost and PSO-discretize-HNB, indicating exceptional performance considering the conflicting objectives of the intrusion detection system. The authors proposed adapting a velocity parameter in future studies to enhance the performance as in PSO.
In 2019, in order to enhance the diagnosis of paraquatpoisoned patients, a herbicide commonly used for weeding, chaos enhanced grey wolf optimization wrapped ELM was proposed [24]. As with many ailments, early diagnosis increases the likelihood of recovery, and the proposed method was designed to remove redundant features from the dataset and enhance diagnosis accuracy. The chaotic sequence used in the paper was generated from logistic mapping and used to inject randomness and ergodicity and reduce sensitivity of the method to initial conditions. The grey wolf optimizer was used for feature selection to provide the optimal feature subset, with a decision boundary at 0.5 and an embedded chaotic map. An extreme learning machine model was trained and used to identify the PQ patients. The dataset used was from the Medical Ethics Committee of the First Affiliated Hospital of Wenzhou Medical University which included 15 patients. The results showed the method achieving AUC, accuracy, sensitivity, and specificity of 95.14%, 93.89%, 94.44%, and 95.83%, respectively, with the number of features selected ranging from 56 to 73 out of a total of 119 features. The significant features were indicated by the frequency of inclusion in the selected subset with feature nos. 3 and 87 being the most significant. The authors suggest orthogonal learning or quadratic interpolation to further enhance the searching of GWO.
In 2019, a face recognition method based on grey wolf optimization for feature selection [25] was proposed. In this paper, the authors used the grey wolf optimizer to prune out the redundant features in the image dataset and in doing so reduce the runtime of the process while increasing the classification accuracy. The dataset used in the paper was the Yale Face dataset which underwent image processing, feature extraction using Gabor filters, feature reduction using principal component analysis, feature selection using GWO, 3 BioMed Research International and classification using KNN. The effects of the GWO variables were analyzed for different values with accuracy peaking for the boundary of ±15 and maximum iterations of 25. The proposed system was compared against an adaptive cuckoo search algorithm for intrinsic discriminant analysis and outperformed it, achieving a higher accuracy (97% > 88:9%) with lower runtime (6:49 s < 9:10 s). This indicates the method is effective in face recognition and needs viable further research on other biometric datasets.
In 2020, improvements to the binary GWO were proposed [26] which involved a new updating equation for the "a" parameter to balance exploration and exploitation as well as different transfer equations for discretizing the output of the standard GWO positions. The transfer equations used are the standard S curve and 4 V-type functions which were further improved using the value of AD. To validate their application in feature selection, twelve datasets from the UCI machine learning repository were used on BGWO and the proposed algorithms. The KNN was used as the classification algorithm, and the fitness functions used were the K-fold loss and weighted sum ratio of K-fold loss and selected feature ratio. The simulated results prove that the improved optimizer has a better classification accuracy without increasing the selected features on a wide range of data types and dataset sizes. The authors suggested the use of a neural network for classification in combination with KNN to reduce classification error.
In 2021, a method was proposed [27] to tackle anomaly detection problems by utilizing an enhanced grey wolf optimizer in feature selection of the multidimensional dataset by controlling the balancing parameter for exploration and exploitation. The parameter "a" is the main focus with its value increased or decreased depending on the current iterations' performance as compared to the previous iteration. By doing so, the linearly updating value of "a" is changed to an adaptive update, allowing scouting of better search space when a wolf is at a worse fitness value area. The dataset used is the NSL-KDD dataset with 5 methods of attack with an SVM sued as the classifier. During phase one of analysis, the number of features and the accuracy were averaged and the proposed algorithm was compared against MBGWO, BGWO, MGWO, and GWO with the algorithm achieving the highest accuracy with the lowest number of features. The second phase pitted the algorithms on the accuracy and number of features selected for each class with the proposed algorithms' performance, indicating the advantage in parameter control with a good balance of exploration and exploitation. The results show significantly better performance of the optimizer, selecting 19 features of 41 with an 87.46% classification accuracy.

Inferences
Drawn. The literature review shows the effectiveness of GWO and its variants in optimizations and feature selection in specific, with GWO simplicity in design and implementation, few controlling parameters, and ease of modifying with other optimizers. It also shows that the wrapper method is a superior feature selection method with the KNN classifier acting as a suitable error rate generator without leading to overfitting.

Grey Wolf Optimizer
Proposed by Mirjalili et al. in 2014 [11], the grey wolf optimizer is a static population-based metaheuristic optimization algorithm inspired by the grey wolf of the Canidae family. The algorithm mimics the search, hunt, and attack tactics of a pack of grey wolves as a cohesive unit.
Within a pack of wolves, there exists a strict social hierarchy divided into four tiers depicted in Figure 1.
(i) The Alpha (α ). The leader of the pack is responsible for hunting, scheduling among other decisions concerning the pack.
(ii) The Beta (β ). Subordinates to the alphas assist them in the management and decision-making of the pack; they are usually the first in line to acquire the title of alpha if the current alpha passes on or grows too old. They also act as an enforcer to the alpha on the rest of the pack, disciplining them as need be.
(iii) The Delta (δ ). Dominant only to the omegas, this group is mostly composed of the sentinels, elders, hunters, scouts, and caretakers.
(iv) The Omega (ω ). Often referred to as the scapegoat, these wolves are inferior to all the other wolves seeking guidance from all other wolves.
The pack leaders are the spearheads in a hunting formation. They send the omega wolves to encircle the prey once its general location is found, drawing closer and closer as the exact location is searched for. With the prey completely encircled, the wolves attack, securing a meal for the pack.
To express this as a mathematical model, the process may be divided into 3 distinct steps.

Encircling the Prey.
The encircling of the prey is modeled as in equation (1). D ! is calculated as the distance between the current wolf vector X ! ðtÞ and the prey X a ! decreases from 2 to 0 that models the circling of the prey, as the iteration counts up, as expressed in equation ((5)) below: 3.3. Searching and Attacking the Prey. Once the prey stops moving, the pack moves in for the kill, sending all the agents towards the prey from different angles. The mathematical model of these is dependent on vector A ! . When the absolute value of vector A ! is below 1, the search agents converge towards the prey when it is above the agents that search the space.

Iterative Stochastic Gradient Descent
Conceptualizing an objective function as an unknown terrain, the minimum of this terrain may be approached step by step, continually moving against the gradient of the terrain.
Gradient descent implements this as an iterative update process of the position θ.
For a function whose objective function is defined as Jðθ ; x ðiÞ ; y ðiÞ Þ, whose partial derivative with respect to each parameter ofx ðiÞ is∇ θ Jðθ ; x ðiÞ ; y ðiÞ Þ, the core equation of gradient descent is shown in equation (9).
The algorithm loops a set number of times (maximum iterations); at each iteration, the value of θ is updated by calculating the partial derivative of the objective function with respect to the parameters of the input and subtracting this value from θ. In doing so, the algorithm quantifies the effect each parameter has on the objective function and uses this information to control the direction and speed of transversal in the search space. The variable η controls the learning rate, avoiding both oscillations about a minimum caused by large values of the partial derivative and slow convergence rates caused by low partial derivative values.
For a simple function space, the partial derivative encodes the direction in which the closest local minimum exists. As the function increases in complexity with multiple local minima existing, the value of the partial derivative as a vector in the function space may point at the weighted average of the local minima. To avoid this outcome, the algorithm is usually run multiple times with each initial starting location randomized.
This algorithm is well known in implementation in artificial intelligence, specifically in regression and neural networks. With a mathematically derived partial derivative, the algorithm can achieve a high convergence rate, and with multiple random initializations, it avoids local minima.
One of its significant shortcomings is that when the partial derivative calculation is intensive, the computational complexity of the overall process is increased.

Feature Selection
Feature selection is the process of decreasing the number of features in a dataset by removing redundant, irrelevant, and randomly class-corrected data features. In doing so, a model is capable of increasing its accuracy as well as reducing overfitting and training time by utilizing the generated optimal subset. It has applications in many fields including text mining, image processing, medical research, and fault diagnosis.
The general procedure of feature selection involves four key steps [28], as shown in Figure 2: (i) Subset Generation. This is the start of the process as a heuristic search. It could be a forward search which involves starting with an empty feature set and successively adding new features or a backward search which involves starting with all features included and successively removing features or a bidirectional search which involves adding and removing features simultaneously.
(ii) Subset Evaluation. With a generated feature subset, this involves determining the "goodness" of the subset through a defined criterion through which the optimal subset of features is guided. This may be dependent or independent of the mining algorithm.
(iii) Stopping Criterion. This determines when the feature selection process terminates through criteria such as search completion, reaching of specified bounds (minimum no. of features or maximum number of iterations), an optimal (iv) Result Validation. The selected feature subset is selected, and its performance is evaluated.
The methods of feature selection may be classified into the following [14]: (1) Filter Methods. Features are selected based on performance metrics, and the dataset used is not taken into account.
(2) Wrapper Methods. The performance of a given feature subset is measured by a modeling algorithm whose internal workings are "not known." Owing to this, they are versatile, allowing any pairing of the modeling algorithm and search space optimization algorithm. This is done for each iterative subset as per the algorithm in use until the stopping criterion is met. These methods are slower than filters but provide better performance given that the subsets are ensured not to be biased to the modeling algorithm used.
(3) Embedded and Hybrid Methods. Feature selection occurs during the modeling algorithm execution. As a model is being generated from the test data, the redundant features are pruned out to achieve the optimal dataset and model at completion. This is a hybrid of both the wrapper and filter methods.
6. Proposed Method 6.1. Hybrid Gradient Descent Grey Wolf Optimizer. The gradient descent of a wolf position may be conceptualized as the direction from which the wolf seems to smell the prey is at currently. The leading wolves would then smell the air, and each chooses a representative among the worst-performing members of the pack and calls them to itself, then instructs it to "follow" the scent. The basic implementation is given in Pseudocode 2 below.
The partial derivative is calculated as an approximation since the objective function is taken as unknown. This is done by choosing a step variable that is the value added to the wolf position on each dimension as well as subtracted.
The fitness values of these two new positions are calculated and subtracted from each other and lastly divided by the step value. This is the center finite difference for derivative approximation.
6.2. HGDGWO (Binary Version). The binary version adapts the proposed continuous algorithm for feature selection while taking advantage of the finite search space of a dataset: each location in space either including or excluding a feature from consideration in the model.   BioMed Research International As a multiobjective feature selection problem, the fitness function adapts to this as a weighted sum of the two opposing solution requirements, fewer selected features and low error rates.
A significant alteration is in the calculation of the fitness function. Since there are two objectives of the fitness function, in order to incentivize the algorithm to seek out solutions with fewer features, a "cost" is assigned to the number of features and added to the fitness function. It is calculated as below: where ϕ modulates the error of the fitted model, Err, and θ = ð1 − ϕÞ modulates the ratio of the number of selected features S to the total number of features D.
The alterations to the partial derivative for a binary search space are shown in Pseudocode 3 below. For the partial derivative, each feature index goes through the binary 1. Begin 2. Randomly initialize all wolves within the function's limits 3. Evaluate fitness values of all wolves and sort in ascending order a. Set the alpha wolf as the highest fitness value b. Set the beta wolf as the second highest fitness value c. Set the delta wolf as the third highest fitness value 4. While the maximum number of iterations is not exceeded a. For each wolf i. Evaluate A and C using equations (3) and (4) ii. Evaluate all 3 values of D using equations ((6a), (6b), (6c)) iii. Evaluate X 1 , X 2 , and X 3 using equations ((7a), (7b), (7c)) iv. Evaluate the new positions using equation (8) b. End d. Evaluate the partial derivative of the alpha, beta, and delta wolves from equation (10) e. Update the bottom 3 fitness value wolves using equation (9) with the update locations as the alpha, beta, and delta wolves. c. Evaluate the fitness values of the wolves d. Update the alpha, beta, and delta wolves 5. End 6. Solution == alpha wolf 1. Function mutation with partial derivative 2. Pass in: wolf_pos, derivatives, probability 3. Set probability vector as scaled derivative vector multiplied by probability 4. Set selected features as probability vector compared against a generated random number vector 5. Set new_pos as Call XOR of wolf_pos and selected_features 6. Pass out: new_pos Pseudocode 4: Implementation 1.

BioMed Research International
NOT operation, and this value is subtracted from the original partial differential.
Once the partial derivative is calculated, this information is utilized in updating the wolf positions through the mutation by partial derivative functions. Three implementations were used as follows: (1) With this implementation, shown in Pseudocode 4, the partial derivative is used as a simple threshold marker with higher derivatives leading to lower feature index inversion rates. The higher the partial derivative, the higher the chance of changing the wolves' value at the partial derivative index (2) In this implementation, shown in Pseudocode 5, variable a is used to incorporate exploration and exploitation phases during the iterative search process. It does this by modifying the threshold for the wolf index mutation (3) With this, shown in Pseudocode 6, the sigmoid function is used to map the partial derivative to a threshold space with a normalized partial derivative while still incorporating variable a for exploration and exploitation. This ensures injection of inversion even when there seems to be no valid update direction for the wolf With mutation completed, the worst-performing wolves' positions are updated as the modified positions are evaluated. The three worst-performing wolves were chosen to update the positions from the alpha, beta, and delta wolves.
1. Function mutation with partial derivative 2. Pass in: wolf_pos, derivatives, probability threshold, a 3. Set weight as 0.4 * a + 0.1 4. Normalize the partial derivative 5. Calculate the sigmoid of the normalized partial derivative and set to variable sig 6. Use probability threshold and weight to choose the feature indices to change 7. Calculate the new wolf position and pass back Pseudocode 5: Implementation 2.
1. Function mutation with partial derivative 2. Pass in: wolf_pos, derivatives, probability threshold, a 3. Set weight as 0.4 * a + 0.1 4. Set the expoilation_probabilities as mapped derivative on the sigmoid curve centred at zero 5. Set the exploration_probalilities as the ratio of features selected to total features for the selected features and 1-this for negative 6. Set selected_features as the sum of weight * expoilation_probabilities and 1weight * exploration_probalilities multiplied by prob-ability_threshold compared against a vector of randomly generated numbers Pseudocode 6: Implementation 3.

Methodology
The algorithms were implemented in MATLAB and run on an Intel® Core™ i7-7700HQ CPU @ 2.80 GHz with 8 GB of RAM.
The continuous version was first tested for feasibility and gave information on its performance in different continuous function optimization problems. The binary version underwent multiple alterations to improve performance. 7.1. HGDGWO. As a continuous value optimization function, the hybrid algorithm was tested on 23 benchmark functions. Each function was run 500 times, and the convergence curve and final value were recorded.

Test Functions.
Twenty-three benchmark functions were used to compare the performance of the HGDGWO against GWO. They are divided into three types of functions: (1) Unimodal Functions (F1-F7). Exploitation analysis for checking the exploitation capability of the optimizer (Figures 3-9).
(2) Multimodal Functions (F8-F13). Exploration analysis for checking the exploration capability of the optimizer (Figures 10-15).  (iv) Learning rate η = 0:003. This is a variable for the partial derivative, defining the position update rate. A high learning rate increases the convergence rate while increasing oscillations around a local minimum (v)

Parameter Settings
Step size s = 1 E-15. For partial derivative approximation, a high value reduces the accuracy of the value calculated The parameters were tuned iteratively in order to achieve desired results.

Evaluation Metrics
(1) Standard deviation This is a measure of the similarity between different solution runs. A high standard deviation indicates significant changes in the solution as the function runs multiple times.  (2) Average of solutions (3) Minimum solution This is the lowest value of the fitness value achieved over the total number of repetitions.

(4) Timing
The MATLAB timing function "timeit" is used. The algorithms are timed on how long they take on each function, from calling to returning the solution.

HGDGWO (Binary Version)
7.2.1. Datasets. From the UCI machine learning repository, 6 medical datasets were selected and utilized to test HGDGWO in feature selection applications against BGWO implementations 1 and 2 as well as BGWOPSO. Table 1 shows the specific datasets used and the corresponding feature numbers and samples.
(1) Breast Cancer Wisconsin (Diagnostic). Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image.
(2) Breast Cancer Wisconsin (Original). Some of its features are clump thickness, uniformity of cell shape, uniformity of cell size, and single epithelial cell size, among others.
(3) SPECT Heart. The dataset describes diagnosing cardiac Single Photon Emission Computed Tomography (SPECT) images. Each of the patients is classified into two categories: normal and abnormal.
(4) Statlog (Heart). The class is grouped as either the absence (1) or the presence (2) of heart disease. The data includes patient information and symptoms as well as medical test results. It includes features such as age, sex, chest pain type, resting blood pressure, cholesterol, and fasting blood sugar.
(5) Lymphography. The dataset is grouped into 4 classes: normal find, metastases, malign lymph, and fibrosis. The features are characteristics of the nodes of the patient  The parameters were tuned iteratively in order to achieve desired results. The wrapper method was chosen for its superior performance, as expressed in the literature review. The KNN classifier was also used for fitness evaluation because of its well-documented preference in feature selection, specifically with GWO variants.

Evaluation Metrics.
The criteria used once a solution was achieved were as follows: (1) Average classification accuracy This is a measure of the validity of a model's predictions.
(2) Average number of selected features This is the ratio of correctly predicted positive cases to total positive cases.

Sensitivity = TP TP + FN
: ð16Þ This is the ratio of correctively predicted positive cases to total predicted positive cases.

Precision = TP TP + FP
: ð17Þ This is the harmonic average of precision and sensitivity.
(7) One-way ANOVA test on the number of features selected by each algorithm. The P value was obtained from the MATLAB function "anova1"

Results and Discussion
8.1. HGDGWO Results. The proposed HGDGWO is compared against GWO and manages to achieve higher average fitness values for 11 of the 23 fitness functions but achieves a lower minimum fitness value in 14 of the 23, as seen from Table 2.
The optimizer also increases the average computation time by a factor of 5.6, as seen from Table 3.
From the graphs, the proposed function achieves a higher convergence rate for functions 2, 6, 10, 14, 15, 20, 21, 22, and 23, as seen in Figures 4,8,12,16,17,22,23,24,and 25. In only functions 3, 4, and 12, in Figures 5, 6, and 14, does the hybrid optimizer perform noticeably worse than the original GWO with each solution starting out at the same rate but diverging as the function traverses the search space.
Performance of the HGDGWO among F1-F7 indicates superiority in exploiting the optimum. The performance in multimodal functions is also promising, indicating that the optimizer would also perform well in objective functions with a large number of local minima.
This indicated the virility of this hybrid function as a possible solution to the optimization function.
The optimizer also increases the average computation time by a factor of 5.6, as seen from Table 3. This was as expected since the partial derivative required in calculation increases the computation time.

HGDGWO (Binary Version) Results
8.2.1. Implementation 1. Table 4 shows the performance of the HGDGWO against the BGWO on the datasets stated in Table 1.  With this implementation, HGDGWO performs markedly better than BGWO2 in accuracy, F-measure, precision, and sensitivity, outperforming it in all datasets apart from SPECT Heart. Although the results were promising, the number of features selected was higher for the markedly low increase in performance, indicating the persistence of redundant features in the feature subsets produced by HGDGWO.

Implementation 2.
The results presented below indicate the performance of HGDGWO's second implementation against the two implementations of the binary GWO. The results are from 50 iterations of 10 test cases each and tabulated as average maximum and minimum values for the 4 datasets.
As seen from Table 5, HGDGWO performs averagely for all metrics apart from sensitivity, performing better than the BGWO2 but worse than BGWO1 on average, but has similar performance in maximum values. In both F-measure and precision, the difference is marginal (~0.01). The maximum metric values for the resultant predictor were identical, each optimizer managing to achieve unity.
As seen from Table 6, HGDGWO outperforms the other optimizers in accuracy, F-measure, and precision of the predictor. The number of maximum features selected for both HGDGWO and BGWO2 is 3 out of the possible 9 with none exceeding 3 features apart from BGWO1. From Table 7, HGDGWO outperforms the BGWO in all predictor metrics but performs medially in both the fitness values and the number of features in the average values. The function also has the best performance in the maximum values.
In this dataset, it is noted that the difference between the maximum and minimum metrics for all three optimizers is very large, indicating a need to run them multiple times to achieve a satisfactory value. This is an effect of a local minimum, trapping the optimizer in some of its iterations.             The results from Table 8 show that BGWO2 outperforms the other two functions in the average values, but in the maximum values, HGDGWO has the best performance in accuracy, precision, and F-measure; this indicates that although averagely HGDGWO performance is low, when run multiple times, it achieves a higher performing predictor from the selected feature set.

Implementation 3.
The results presented below indicate the performance of HGDGWO's third and final implementation against the two implementations of the binary GWO as well as the binary GWO and PSO hybrid. The results are from 50 iterations of 10 test cases each for the 5 datasets and 2 test cases for the Lymphography dataset.
The evaluation is done on the average, minimum, and maximum values of each criterion, using the one-way ANOVAPtest on the number of features selected with the box plot and comparison of criteria for the feature subset with the highest accuracy.
(1) Maximum, Minimum, and Average Value Comparison. As seen from Table 9, the proposed optimization function does not outperform the benchmark functions but achieves the second smallest average feature subset. With a higher F -measure than the smallest average feature subset as from BGWO2, the proposed algorithm shows a better balance in the error rate and feature subset size, especially since the higher performing algorithm also has the largest feature subset average.
The proposed algorithm also achieves the highest possible maximum values in accuracy, F-measure, precision, and sensitivity with 9 maximum features selected which BGWO2 is unable to.    In this dataset, from Table 10, the algorithm averages the lowest number of features with a difference of 3.3 from the maximum in the selected features. With a lower feature subset, it achieves an accuracy difference from BGWOPSO of 0.03. There is a low deviation in the number of features with the maximum at 3 and minimum at 2 from a possible maximum of 9, the same as in HGDGWO2.
The maximum values also indicate that with a smaller average feature subset by 5, the proposed method still manages to achieve unity in the classification parameters.
As in HGDGWO2, from Table 11, implementation 3 outperforms all other functions as well as outperforming HGDGWO2 in accuracy, F-measure, fitness values, and sensitivity, with an averagely lower number of features than HGDGWO2.
The difference in maximum and minimum values of the classification parameters of the proposed method is large, indicating that the method may be getting trapped in local minima.
As seen from Table 12, the proposed function is only outperformed by the BGWO2 function. The average accuracy difference is~0.2 for a 2.58 average feature subset reduction.
In this subset, as seen from Table 13, the proposed algorithm averages a higher value of F-measure and sensi-tivity with an average of 4.6 features while BGWOPSO averages a higher accuracy and precision with an average of 4 features.
With this dataset, as seen from Table 14, the accuracy plummets to exceedingly low values, <0.59 for all algorithms. The proposed algorithm achieves the highest average F -measure but with the largest average feature subset.
From the results (Tables 9-13), the optimizer manages to attain a balance between the performance and the fitness values as set by the value ϕ from equation (10). The variable is set not to overemphasize the significance of one over the other. This is especially critical in the stopping criterion stage of the feature selection criterion. When the limiting criterion is set as the number of selected features, the value of the variable is reset to convey this. HGDGWO2 performs better in the breast cancer dataset while HGDGWO performs better with the heart dataset.
(2) Highest Accuracy Feature Subset Performance Comparison. As seen from Table 15, the proposed method achieves unity classification metrics with a 7-feature subset which is the least with all unity. Inversely, BGWO2 has a lower feature subset but in doing so does not achieve unity classification.     Table 16, HGDGWO achieves unity on accuracy, F -measure, precision, and sensitivity with the smallest feature subset, selecting 2 features as relevant out of the possible 9.
From Table 17, the proposed function seems to have been trapped in a local minimum, having selected the lesser performing set of 4 features, but achieves an equivalent value of accuracy, F-measure, precision, and sensitivity with BGWO1 with fewer features.
In Table 18, the accuracy caps out at 0.8889. With 2 features resulting in the highest metrics, the proposed method was unable to achieve this, instead having an 8-feature subset, the second best.
In Table 19, the proposed method achieves the highest sensitivity and accuracy with 5 features and with the second F-measure accuracy, a difference of 0.02, compared to a possible 13-feature subset. It achieves a good balance of the two objectives.
In Table 20, though the HGDGWO feature subset is among the smallest, the performance of the algorithm in this subset is still subpar, indicating that the search objective was hampered, possibly by the local minima.
8.2.4. One-Way ANOVA Test on Feature Data. The stated hypotheses of the one-way ANOVA test on the number of features are as follows: (i) H0. The null hypothesis indicates that there is no significant difference between the number of features of HGDGWO and the other methods used in testing.
(ii) H1. The alternative hypothesis indicates that there is a significant difference between the number of features of HGDGWO and the other methods used in testing.

BioMed Research International
The significance of the ANOVA test results is obtained from comparison against the standard significance level of 0.05 with P values greater than the stated value, indicating acceptance of the null hypothesis and rejection of the alternative hypothesis, and with P values less than the stated value, indicating rejection of the null hypothesis and acceptance of the alternative hypothesis.
From thePvalue results in Table 21, the Breast Cancer Wisconsin (Original) and Lymphography datasets' BGWO2 vs. HGDGWO3 and the heart disease dataset's BGWOPSO vs. HGDGWO3 show that thePtest values are not lower than the significance level of 0.05 which leads to the rejection of the alternative hypothesis and acceptance of the null hypothesis.
The other comparative P values indicate a rejection of the null hypothesis of no significant difference between the number of features. The general P value comparing all data points of all the algorithms shows that there is a significant difference between at least one of the samples.
As seen from Figures 26(a)-26(f), the proposed method is consistently among the lowest two median numbers of features for all the datasets in accordance with the objective of the algorithm, reducing feature subsets.

Conclusion and Future Work
The proposed hybrid gradient descent grey wolf optimizer encourages the worst-performing wolves to seek out the prey in the direction indicated by the leaders of the pack. The directional information is sourced from the partial derivative of the leading wolves.
The effectiveness of the proposed optimizer's first implementation was determined by evaluating its performance in feature optimization problems from the UCI datasets and compared to the BGWO2, outperforming it in 3 of the 4 datasets.
The second implementation outperforms BGWO implementations 1 and 2 in two datasets.
While the third implementation only completely outperforms the other algorithms (2 out of 6), the maximum accuracy tables show that the algorithm is capable of providing viable feature subsets even if the average may not be the best for different datasets, in 3 of 6 datasets. The margin between the best performing algorithm and the proposed algorithm is low with the maximum metrics always among the highest showing promise in the algorithm.
Improving the proposed algorithm would entail adding a memory module to the algorithm to reduce the computational load incurred in traversing the same points in the search space repeatedly.
Alternatively, a method may also be implemented to detect when the optimizer is stalling on a particular point on the search space, common in the local optimum. This may be modeled as a memory function that detects when the past R number of fitness values is constant, and if true, it increases randomization in the mutation function to allow the optimizer to tunnel out of the local optimum.
To reduce the overhead due to multiple partial derivative calculations, the features may be grouped into clusters with a single feature occurring in multiple clusters. The partial derivative is then calculated for all the bits changed in a cluster, and the partial derivative of each feature is calculated as the average of each cluster's partial derivative for which the feature occurs. This would reduce the number of calculations the algorithm calculates.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.