Feature Selection and Classification of Clinical Datasets Using Bioinspired Algorithms and Super Learner

A computer-aided diagnosis (CAD) system that employs a super learner to diagnose the presence or absence of a disease has been developed. Each clinical dataset is preprocessed and split into a training set (60%) and a testing set (40%). A wrapper approach that uses three bioinspired algorithms, namely, cat swarm optimization (CSO), krill herd (KH), and bacterial foraging optimization (BFO), with the classification accuracy of a support vector machine (SVM) as the fitness function has been used for feature selection. The selected features of each bioinspired algorithm are stored in three separate databases. The features selected by each bioinspired algorithm are used to train three backpropagation neural networks (BPNNs) independently using the conjugate gradient algorithm (CGA). Classifier testing is performed by using the testing set on each trained classifier, and the diagnostic results obtained are used to evaluate the performance of each classifier. The classification results obtained from the three classifiers for each instance of the testing set, together with the class label associated with that instance, form the candidate instances for training and testing the super learner. The training set of the super learner comprises 80% of these instances, and its testing set comprises the remaining 20%. Experimentation has been carried out using seven clinical datasets from the University of California Irvine (UCI) machine learning repository. The super learner has achieved a classification accuracy of 96.83% for the Wisconsin diagnostic breast cancer dataset (WDBC), 86.36% for the Statlog heart disease dataset (SHD), 94.74% for the hepatocellular carcinoma dataset (HCC), 90.48% for the hepatitis dataset (HD), 81.82% for the vertebral column dataset (VCD), 84% for the Cleveland heart disease dataset (CHD), and 70% for the Indian liver patient dataset (ILP).


Introduction
Data related to symptoms observed in a patient at a point in time are stored in electronic health records (EHRs). Interesting patterns can be extracted from the data stored in EHRs, the extracted patterns can be represented as knowledge, and this knowledge can assist physicians in diagnosing the presence or absence of a disease. Data mining tasks, namely, association rule mining, classification, and clustering, are used to mine valuable patterns from the data stored in EHRs. Clinical decision support systems (CDSSs) that assist physicians in diagnosing the presence or absence of a disease can be developed from data stored in EHRs using bioinspired algorithms and data mining techniques. Although several algorithms have been proposed by researchers for association rule mining, classification, and clustering, no algorithm can be considered the "universal best." Quality of data and data distribution are the two key factors that determine the effectiveness of a data mining task. The performance of a data mining task depends on how effectively data preprocessing has been done. Classification plays a major role in the development of CDSSs. Classification is a two-step process: first, building the classifier, and second, model usage. Building the classifier is the process of training the classifier with a supervised learning algorithm. Model usage is the process of estimating the accuracy of the classifier using testing instances, commonly referred to as the testing set. Overfitting and underfitting are two major problems associated with building the classifier.
A clinical dataset ($C_s$) used for classifier construction is split into a training set ($T_s$) and a testing set ($T_t$).
Researchers have proposed different methods to identify $T_s$ and $T_t$. One common method is to assign 80% of the dataset to $T_s$ and 20% of the dataset to $T_t$. For clinical decision-making, a balanced dataset is essential for building a prediction model. Clinical datasets are normally not balanced, and classification methods perform poorly on minority class samples when the dataset is highly imbalanced. For example, consider a $C_s$ with $n$ instances, each associated with a class label $c_1$ or $c_2$. If 75% of the $n$ instances in $C_s$ are associated with class label $c_1$ and 25% are associated with class label $c_2$, the class labels in $C_s$ are not equally represented, and therefore $C_s$ is imbalanced. In this context, $c_1$ is the majority class and $c_2$ is the minority class, and constructing a classifier with such class-imbalanced data will lead to bias in favor of the majority class. One method to handle class imbalance in a $C_s$ is to generate additional instances of the minority class. The Synthetic Minority Oversampling Technique (SMOTE) [1] is one of the prevailing methods used to generate additional training and testing instances.
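To make the imbalance example concrete, the following minimal sketch shows the interpolation idea behind SMOTE on a toy two-feature dataset with a 75%/25% class split. It is an illustration only and not the preprocessing code used in this work: the function name `smote_like_oversample`, the choice of k = 2 neighbours, the brute-force neighbour search, and the toy data are all assumptions.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    instance and one of its k nearest minority neighbours (the SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Brute-force nearest neighbours of `base` within the minority class.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, neighbour)))
    return synthetic


# Hypothetical toy data: 9 majority (c1) and 3 minority (c2) instances.
majority = [(float(i), float(i % 3)) for i in range(9)]
minority = [(10.0, 10.0), (11.0, 10.5), (10.5, 11.5)]

new_points = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Every synthetic point lies on a line segment between two existing minority instances, so the minority class grows without duplicating instances exactly.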
A training instance can be defined as a tuple $t_i(f_1, f_2, \ldots, f_m)$, where $t_i$ represents a training instance and $(f_1, f_2, \ldots, f_m)$ represents the features corresponding to that training instance. The subscript $i$ in $t_i$ ranges from 1 to $n$, where $n$ is the number of instances. The subscript $j$ in $f_j$ ranges from 1 to $m$, where $m$ is the number of features. Using irrelevant features to train a classifier will affect its performance. Selecting the optimal features from the $C_s$ and then training the classifier will enhance the accuracy of the classifier. Feature selection methods can be supervised, unsupervised, or semisupervised depending upon whether the training set is labeled or not. Commonly used supervised feature selection methods are filter and wrapper methods. The filter method considers the dependency of each feature on the class label and is independent of any classification algorithm. Measures, namely, information gain [2], gain ratio [3], Gini index [4], Laplacian score [5], and cosine similarity [6], can be used to rank the features. Other measures to rank the features can also be used in the filter method. The wrapper method considers the classification accuracy of a learning algorithm to select the relevant features. Researchers are using a confluence of disciplines to develop computer-aided diagnostic (CAD) systems to assist physicians.
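As an illustration of the filter method, the sketch below ranks features by information gain, one of the measures listed above. It assumes discrete feature values and uses a hypothetical four-instance dataset; the helper names `entropy` and `information_gain` are introduced here for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the expected entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical data: f1 separates the classes perfectly, f2 carries no information.
labels = ["c1", "c1", "c2", "c2"]
features = {
    "f1": ["low", "low", "high", "high"],
    "f2": ["a", "b", "a", "b"],
}
ranking = sorted(features, key=lambda name: information_gain(features[name], labels), reverse=True)
print(ranking)
```

The ranking depends only on the data and the class labels, which is what makes the filter method independent of any classification algorithm.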
Knowledge mining using rough sets for feature selection and a backpropagation neural network (BPNN) for classifying clinical datasets has been proposed in [7]. A CDSS to diagnose Urticaria using Bayes classification is proposed in [8]. CDSSs to diagnose lung disorders are proposed in [9][10][11][12][13][14]. A CDSS to diagnose the severity of gait disturbances using a Q-backpropagated time delay neural network on patients affected by Parkinson's disease is proposed in [15]. A statistical tolerance rough set induced decision tree classifier to classify multivariate time series clinical data is proposed in [16].
A CDSS to diagnose gestational diabetes mellitus using fuzzy logic and a radial basis function neural network is proposed in [17]. The use of fuzzy sets and an extreme learning machine to classify clinical datasets is proposed in [18]. Wind-driven swarm optimization, a metaheuristic method to classify clinical datasets, is proposed in [19]. A computer-aided diagnostic system that uses a neural network classifier trained using differential evolution, particle swarm optimization, and gradient descent backpropagation algorithms is proposed in [20]. A radial basis function neural network to classify clinical datasets using the k-means clustering algorithm and quantum-behaved particle swarm optimization is proposed in [21]. Classifying unevenly spaced clinical time series data by imputing missing values has been proposed in [22]. A framework to classify unevenly spaced time series clinical data using improved double exponential smoothing, rough sets, neural network, and fuzzy logic is proposed in [23].
An outline of nature-inspired algorithms for optimization is presented in [24]. The cooperative, intelligent behavior of insect or animal groups in nature, for example, colonies of ants, schools of fish, flocks of birds, swarms of bees, and termites, has attracted the attention of researchers.
Entomologists have studied the collective actions of insects or animals to model biological swarms, and engineers have applied these models as a framework to solve complex real-world problems.
In this work, a CAD system that employs a super learner to diagnose the presence or absence of a disease has been proposed. The bioinspired algorithms used in this work are cat swarm optimization (CSO), krill herd (KH), and bacterial foraging optimization (BFO). The classifiers used in this work are support vector machine (SVM) and BPNN trained using the conjugate gradient algorithm.
The rest of the paper is organized as follows: the abbreviations used in the manuscript are presented in Section 2. An outline of the related work is presented in Section 3. An outline of the datasets used is presented in Section 4. The framework of the proposed classifier is presented in Section 5. The results and discussions are presented in Section 6. Finally, the conclusion and scope for future work are presented in Section 7. Table 1 presents the abbreviations used in the rest of the manuscript in alphabetical order.

Literature Survey
Leema et al. [25] in their work have examined the significance of fixing appropriate parameter values to train artificial neural networks using the backpropagation algorithm. The parameters are initial weight selection, bias, activation function used, number of hidden layers, number of neurons per hidden layer, number of training epochs, minimum error, and momentum term. Twelve backpropagation learning algorithms have been used in this study. Experimentation has been carried out using three clinical datasets from the UCI ML repository, namely, PID, hepatitis, and WBC datasets.
Computational and Mathematical Methods in Medicine
Elgin et al. [26] in their work have proposed a clinical decision-making system to diagnose allergic rhinitis. A wrapper approach that uses GA and the accuracy of the ELM classifier as the fitness function has been used for feature selection. The selected features have been used to train an ELM classifier. An intradermal skin test dataset of 872 patients collected from Good Samaritan Lab Services and Allergy Testing Centre, Chennai, has been used in this work, and an accuracy of 97.7% has been achieved.
Sreejith et al. [27] in their work have proposed a framework for classifying clinical datasets which uses an embedded approach for feature selection and a DISON for classification. The feature selection is performed by computing the feature importance of every attribute using an extremely randomized tree classifier. Classification is performed using DISON, which is a feed forward neural network whose weights and bias are optimized in two stages: first, by using a strawberry optimization algorithm, and then by using a gradient descent BP algorithm. Vertebral column, PID, CHD, and SHD datasets from the UCI ML repository have been used for experimentation. The framework has achieved an accuracy of 87.17% for vertebral column, 90.92% for PID, 93.67% for CHD, and 94.5% for SHD.
Sreejith et al. [28] in their work have proposed a framework for CDSS which addresses the data imbalance problems associated with clinical datasets. The datasets are rebalanced using SMOTE enhanced using Orchard's algorithm. The feature selection is performed using a wrapper approach where CMVO is used to select the feature subsets, and an RF classifier is used to evaluate the goodness of the features. The arithmetic mean of MCC and F-score computed using the RF …
The authors of [29] have proposed a CAD system to diagnose pulmonary emphysema from chest CT slices. A spatial intuitionistic fuzzy C-means clustering algorithm has been used to segment the lung parenchyma and extract the RoIs. From the RoIs, shape, texture, and run-length features have been extracted, and feature selection has been performed using a wrapper approach using four bioinspired algorithms with the classification accuracy of SVM as the fitness function. The bioinspired algorithms used are MFO, FFO, ABCO, and ACO. The tenfold cross-validation technique has been used, and each feature set has been used to train an ELM classifier. Two independent datasets, one dataset consisting of CT slices collected from hospitals and the second dataset consisting of CT slices from a benchmark repository, have been used for classification. Maximum classification accuracies of 89.19% for MFO, 91.89% for FFO, 83.78% for ABCO, 86.49% for ACO, and 75.68% without feature selection have been achieved.
Elgin et al. [30] in their work have performed feature selection and instance selection using a wrapper approach that employs cooperative coevolution with the classification accuracy of the random forest classifier as the fitness function. The optimal feature set is used to train a random forest classifier. Seven datasets, namely, WDBC, HD, PID, CHD, SHD, VCD, and HCC from the UCI ML repository have been used for experimentation. Accuracies of 97.1%, 82.3%, 81.01%, 93.4%, 96.8%, 91.4%, and 72.2% have been achieved for the WDBC, HD, PID, CHD, SHD, VCD, and HCC datasets, respectively.
Anter et al. [31] in their work have developed CFCSA by integrating chaos theory and the FCM method to find the optimal feature subset. Ten clinical datasets from the UCI ML repository have been used for experimentation. The features of each clinical dataset have been normalized, and then random chaotic motion has been incorporated into CFCSA in the form of chaotic maps. The objective function of the FCM has been used as the fitness function, in which the crow with the best fitness has been considered the best solution. Comparison has been done with chaotic ant lion optimization, binary ant lion optimization, and the binary crow search algorithm, and it has been inferred that CFCSA outperforms these algorithms in all the datasets used for experimentation.
Elgin et al. [32] in their work have proposed a correlation-based ensemble feature selection using a wrapper approach that employs three bioinspired algorithms, namely, differential evolution, lion optimization, and glowworm swarm optimization, with the accuracy of the AdaboostSVM classifier as the fitness function. The tenfold cross-validation technique has been used, and the optimal features selected have been used to train a gradient descent BP neural network with variable learning rates. Two clinical datasets from the UCI …
Sweetlin et al. [33] in their work have proposed a CAD system to diagnose pulmonary tuberculosis from chest CT slices. The region growing algorithm has been used for segmenting the lung fields followed by edge reconstruction. The manifestations of pulmonary tuberculosis, namely, cavities, consolidations, and nodules have been considered to be RoIs. After the RoIs are extracted, texture features, run-length features, and shape features have been extracted from them, and feature selection has been performed using a wrapper approach that employs the BCS algorithm with the accuracy of a one-against-all multiclass SVM classifier as the fitness function. The Cuckoo search algorithm has been implemented in two ways: first, by using an entropy measure, and second, without using an entropy measure. Using the selected features, training is performed using a one-against-all multiclass SVM classifier. An accuracy of 85.54% for the BCS algorithm with the entropy measure and 84.65% for the BCS algorithm without the entropy measure has been achieved.
Sweetlin et al. [34] in their work have proposed a CAD system to diagnose pulmonary hamartoma nodules from chest CT slices. Otsu's thresholding method has been used …
The authors of [35] have proposed a CAD system to diagnose pulmonary bronchitis from CT slices of the lung. Optimal thresholding has been used to segment the left and right lung fields from the lung CT slices. The RoIs are identified, and from the RoIs, texture and shape features have been extracted. Feature selection has been performed using a hybrid ACO algorithm combined with tandem run recruitment based on cosine similarity, and the accuracy of the SVM classifier has been used as the fitness function. The selected features have been used to train an SVM classifier. An accuracy of 81.66% for ACO with tandem run strategy, 78.10% for ACO without tandem run strategy, and 75.14% without feature selection has been achieved.
Algorithm 1
Input: training set
Process:
Step 1: Initialize the population of $N$ cats (solutions) at random. Each solution is of length $n$, where $n$ represents the number of features. If the corresponding feature is selected, it is represented as "1"; else, it is represented as "0". Initialize the parameters, namely, SMP, SRD, CDC, SPC, MR, C, and R.
Step 2: Calculate the fitness value of each cat (solution) using the SVM classifier, where the accuracy of the SVM classifier is considered as the fitness function. The solution that has the maximum fitness value obtained so far is considered as the best solution.
Step 3: Assign the cats to perform seeking mode. Seeking mode refers to a cat at rest that looks around itself before moving to its next position.
Step 3a: Create $j$ (SMP) copies of the current cat. All the copies are considered to be candidate solutions.
Step 3a.i: If the value of SPC is true, one of the candidates retains its position, while the rest change their positions with respect to a randomly selected SRD.
Step 3a.ii: If the value of SPC is false, then all the candidates change their positions by a randomly selected CDC.
Step 3b: Calculate the probability of each solution being selected using Equation (2) to find the best solution that has the maximum chance to survive. If all the solutions produce the same fitness value, then the probability value is considered as "1". In Equation (2), $P_i$ is the probability of the current cat $i$, $FS_{max}$ is the maximum fitness value, and $FS_{min}$ is the minimum fitness value. $FS_b$ is assigned $FS_{max}$ ($FS_b = FS_{max}$) if maximum fitness has to be calculated and $FS_{min}$ ($FS_b = FS_{min}$) if minimum fitness has to be calculated. In our work, $FS_b$ is assigned $FS_{max}$.
Step 4: Perform tracing mode. In this mode, the cats update their positions based on their velocities. Calculate the velocity ($V_{k,d}^{t+1}$) and update the position ($x_{k,d}^{t+1}$) of each cat using Equations (3) and (4). In these equations, $x_{k,d}^{t}$ and $V_{k,d}^{t}$ are the position and velocity of the current cat $k$ at iteration $t$; the best solution set from the cats in the population is denoted by $x_{best,d}^{t}$; $d$ denotes the dimension to be changed; $C$ is a constant, and $R$ is a random number between 0 and 1.
Step 5: Update the best solution that has the maximum fitness value. If the best solution of the previous iteration has a lower fitness value, then replace it with the current best solution; otherwise, retain the previous best solution.
Step 6: Repeat steps 2 to 5 for a maximum number of iterations or until the solution converges. The solution with the maximum fitness value obtained by the classifier is considered as the optimal feature subset.
Output: optimal feature subset.
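The seeking-mode probability and the tracing-mode update referred to in Algorithm 1 as Equations (2) to (4) can be sketched as follows. This is a minimal sketch that assumes the standard CSO forms, $P_i = |FS_i - FS_b| / (FS_{max} - FS_{min})$, $V_{k,d}^{t+1} = V_{k,d}^{t} + R \times C \times (x_{best,d}^{t} - x_{k,d}^{t})$, and $x_{k,d}^{t+1} = x_{k,d}^{t} + V_{k,d}^{t+1}$; the function names and the numeric values are hypothetical. The sketch works on continuous positions and does not show how positions are mapped back to 0/1 feature masks, which is not specified here.

```python
def selection_probabilities(fitness, fs_b):
    """Seeking mode: P_i = |FS_i - FS_b| / (FS_max - FS_min) (assumed standard form)."""
    fs_max, fs_min = max(fitness), min(fitness)
    if fs_max == fs_min:
        # All candidates are equally fit, so every probability is set to 1.
        return [1.0] * len(fitness)
    return [abs(fs - fs_b) / (fs_max - fs_min) for fs in fitness]


def tracing_update(position, velocity, best, c, r):
    """Tracing mode, per dimension d (assumed standard form):
    V_new = V + R * C * (x_best - x), then x_new = x + V_new."""
    new_velocity = [v + r * c * (b - x) for x, v, b in zip(position, velocity, best)]
    new_position = [x + v for x, v in zip(position, new_velocity)]
    return new_position, new_velocity


# Hypothetical fitness values (classifier accuracies) of three candidate copies.
fitness = [0.5, 0.75, 1.0]
probs = selection_probabilities(fitness, fs_b=max(fitness))  # FS_b = FS_max, as in Step 3b
print(probs)

position, velocity = tracing_update(
    position=[0.0, 1.0, 0.0], velocity=[0.0, 0.0, 0.0], best=[1.0, 1.0, 0.0], c=2.0, r=0.5
)
print(position, velocity)
```

Passing `r` explicitly keeps the example deterministic; in a full run, R would be drawn uniformly from [0, 1) for each update.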
Raj et al. [36] in their work have proposed DGA for feature selection to develop a CAD system to diagnose lung disorders from chest CT slices. The entire dataset has been split into two sets: one set containing 90% of the entire dataset and the other set containing 10% of the entire dataset. Out of the 90%, 50% has been used as the training set and the other 50% as the validation set for evaluating the objective function. The set containing 10% of the entire dataset has been used as the testing set. The objective function has been defined as the sum of the squared deviation of each data in the training set of each class from each data in the validation set of the corresponding class. GA has been used for feature selection by minimizing the proposed objective function, resulting in the proposed DGA. The GA has been iterated over several generations to obtain individuals that are best fit with respect to the objective function. Classification has been performed using a k-NN classifier to classify the RoIs into one of four classes, namely, bronchiectasis, …
Zawbaa et al. [37] in their work have performed feature selection using a wrapper approach that uses the MFO algorithm with the accuracy of the k-NN classifier as the fitness function. Eighteen datasets from the UCI ML repository have been used for experimentation, among which four are clinical datasets. Comparison has been done with PSO and GA, and it has been inferred that MFO outperforms them on fourteen datasets, among which three are clinical datasets.
Input: training set
Process:
Step 1: Initialize the population of $N$ krill individuals (solutions) at random. Each solution is of length $n$, where $n$ represents the number of features. If the corresponding feature is selected, it is represented as "1"; else, it is represented as "0". Initialize the parameters maximum induced motion $N_{max}$, foraging speed $V_f$, maximum random diffusion speed $RD_{max}$, $w_n$, $w_f$, $C_t$, and $\delta$.
Step 2: Calculate the fitness value of each krill individual (solution) using the SVM classifier, where the accuracy of the SVM classifier is considered as the fitness function. The solution with the highest fitness value is considered as the global best solution.
Step 3: Update the position of each krill ($x_i(t + \Delta t)$) using Equations (6) and (7) based on movement induced by other krill individuals, foraging activity, and random diffusion. In Equation (7), $j = 1, 2, \ldots, d$. In these equations, $x_i(t)$ is the current position of the krill; $\Delta t$ is the scaling factor of the velocity vector; $N_i$ is the induced motion; $F_i$ is the foraging motion; $RD_i$ is the random diffusion of the $i$th krill individual; $C_t$ is the step-length scaling factor; $d$ is the total number of variables; $UB_j$ is the upper bound of variable $j$, and $LB_j$ is the lower bound of variable $j$.
Step 4: Each krill individual maintains a high density and changes its position due to mutual effects. The direction of an individual krill is maintained by the target effect, local effect, and repulsive effect. The induced movement $N_i$ caused by other krill is calculated using Equations (8) and (9). In these equations, $N_i$ is the induced motion; $N_{max}$ is the maximum induced speed; $\alpha_i$ is the induced direction; $w_n$ is the inertia weight of the induced motion; $N_i^{t}$ is the last induced motion; $\alpha_i^{local}$ is the local effect, and $\alpha_i^{target}$ is the target effect.
Step 5: Calculate the foraging motion $F_i$ using Equations (10) and (11). It is mainly based on the current location of the food and the previous experience about the food location. In these equations, $V_f$ is the maximum foraging speed; $\beta_i$ is the foraging direction; $w_f$ is the inertia weight of the foraging motion; $F_i^{t}$ is the last foraging motion; $\beta_i^{food}$ is the food attraction, and $\beta_i^{best}$ is the effect of the best fitness of the $i$th krill.
Step 6: Calculate the random diffusion $RD_i$ using Equation (12), which is characterized by a high diffusion speed and a random directional vector.
$RD_i = RD_{max} \times \left(1 - \frac{I}{I_{max}}\right) \times \delta$ (12)
In Equation (12), $RD_{max}$ is the maximum random diffusion speed; $\delta$ is the random directional vector; $I$ is the current iteration number, and $I_{max}$ is the maximum number of iterations.
Step 7: Repeat steps 2 to 6 for a maximum number of iterations or until the solution converges. The solution with the maximum fitness value obtained by the classifier is considered as the optimal feature subset.
Output: optimal feature subset.
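One position update of the krill herd steps above can be sketched as follows. Only the random diffusion follows Equation (12) as stated. The forms used for the induced motion, the foraging motion, the time interval, and the position update are assumptions based on the standard KH formulation of [39]: new motion = maximum speed × direction + inertia weight × last motion, $\Delta t = C_t \sum_j (UB_j - LB_j)$, and $x_i(t + \Delta t) = x_i(t) + \Delta t (N_i + F_i + RD_i)$. All function names and numeric values are hypothetical.

```python
def random_diffusion(rd_max, iteration, max_iterations, delta):
    """Equation (12): RD_i = RD_max * (1 - I / I_max) * delta."""
    scale = rd_max * (1.0 - iteration / max_iterations)
    return [scale * d for d in delta]


def motion(max_speed, direction, inertia, previous):
    """Assumed standard KH form shared by induced and foraging motion:
    new = max_speed * direction + inertia * previous."""
    return [max_speed * a + inertia * p for a, p in zip(direction, previous)]


def time_interval(c_t, bounds):
    """Assumed standard KH form: delta_t = C_t * sum_j (UB_j - LB_j)."""
    return c_t * sum(ub - lb for lb, ub in bounds)


def update_position(x, delta_t, n_i, f_i, rd_i):
    """Assumed standard KH form: x(t + delta_t) = x(t) + delta_t * (N_i + F_i + RD_i)."""
    return [xj + delta_t * (n + f + rd) for xj, n, f, rd in zip(x, n_i, f_i, rd_i)]


# Hypothetical two-variable example.
rd_i = random_diffusion(rd_max=0.5, iteration=25, max_iterations=100, delta=[1.0, -1.0])
n_i = motion(max_speed=0.5, direction=[1.0, 0.0], inertia=0.5, previous=[0.5, 0.5])   # induced
f_i = motion(max_speed=0.25, direction=[0.0, 1.0], inertia=0.5, previous=[0.0, 0.0])  # foraging
delta_t = time_interval(c_t=0.5, bounds=[(0.0, 1.0), (0.0, 1.0)])
x_new = update_position([0.0, 0.0], delta_t, n_i, f_i, rd_i)
print(rd_i, n_i, f_i, delta_t, x_new)
```

Because the factor $(1 - I/I_{max})$ shrinks as the iteration count grows, the random diffusion term decays linearly to zero over a run.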
Shu-Chuan et al. [38] in their work have presented an algorithm called CSO by modeling the natural behavior of cats. The CSO algorithm considers two biological characteristics of cats, namely, seeking mode and tracing mode. Cats spend most of their waking time resting. Nevertheless, while resting, their alertness is very high, and they are well aware of what is happening around them. Cats continuously observe their environment, and when they perceive prey, they advance towards it rapidly. While resting, they change their position cautiously and slowly, and occasionally even stay in their original position. Seeking mode represents this behavior in CSO, and tracing mode represents the behavior of cats advancing towards prey. The performance of CSO has been evaluated by applying CSO, standard PSO, and PSO with weighting factor to six benchmark functions. The results obtained reveal that the proposed CSO performs better than PSO and PSO with weighting factor.
Gandomi et al. [39] in their work have proposed a swarm intelligence algorithm named the KH algorithm to solve optimization tasks; it is based on imitating the herding behavior of krill swarms in response to specific biological and environmental processes. The fitness function of each krill individual has been defined as the least distance of each individual krill from food and from the highest density of the herd. The three vital actions considered to define the time-dependent position of an individual krill are, one, movement induced by other krill individuals, two, foraging activity, and three, random diffusion. The KH algorithm has been tested using twenty benchmark functions and compared with eight algorithms. Experimental results indicate that the KH algorithm can outperform these well-known algorithms.
Chen et al. [40] have proposed a cooperative bacterial foraging optimization algorithm (CBFO). Two cooperative methods are applied to the original BFO [41] to solve complex optimization problems, and they achieve significant improvement. The two methods used to improve the original BFO are serial heterogeneous cooperation on the implicit space decomposition level and on the hybrid space decomposition level. The authors have compared the performance of two CBFO variants with the original BFO, PSO, and GA on four commonly used benchmark functions. The experimental results indicate that CBFO achieves better performance than the original BFO, PSO, and GA.
Chen et al. [42] have proposed an adaptive bacterial foraging optimization (ABFO) for optimizing functions. The adaptive foraging approaches are used to improve the performance of the original BFO. This is achieved by enabling the original BFO to adjust the run-length unit parameter dynamically during execution. The experimental results are compared with those of the original BFO, PSO, and GA using four benchmark functions. The proposed ABFO performs better than the original BFO and is competitive with PSO and GA.
From the literature, it is evident that classifier training using relevant features enhances the accuracy of the classifier. It can also be inferred that wrapper-based feature selection that employs bioinspired algorithms performs better in numerous cases compared to traditional feature selection methods.

Outline of the Datasets Used
Seven clinical datasets from the UCI ML repository, namely, WDBC, SHD, HCC, HD, VCD, CHD, and ILP have been used for binary classification. An outline of each dataset used is presented in Table 2.

System Framework
The framework for feature selection and classification of clinical datasets using bioinspired algorithms and super learner is presented in Figure 1. The major building blocks of the framework are data preprocessing, feature selection, classifier training, classifier testing, and dataset construction for super learner, super learner training, and testing. Each building block is outlined below. Normalization has been used to scale the value of a feature so that the value will fall in a specified range and is predominantly useful for constructing a classifier involving a neural network. Training a classifier using normalized data will speedup learning. In this work, the range is 0 to 1, and minmax normalization is being used. When an attribute "A" in a Input: training set Process: Step 1: initialize the population of S bacteria (solutions) at random. Each solution is of length p, where p represents the number of features. If the corresponding feature is selected, it is represented as "1," else as "0." Initialize the parameters p, N c , N s , N re , N ed , P ed , and L ðiÞ (where subscript i in L ðiÞ can range from 1,2,…S), θ i ,x = 0, y = 0, and z = 0.
Step 2: calculate the fitness value of each bacterium (solution) using the SVM classifier, where the accuracy of the SVM classifier is considered as the fitness function.
Step 3: in the elimination-dispersal process, due to environmental changes, the bacteria are eliminated or dispersed from current location. This process is used to strengthen the ability of global optimization. Initiate the elimination-dispersal process and increase the value of z from 0 to N ed .
Step 4: in the reproduction process, the low healthy bacteria die and rest of the other healthiest bacteria are divided into two bacteria. The new bacteria are placed on the same position of their parent. this process is used to maintain the population rate of bacteria. Initiate the reproduction process and increase the value of y from 0 to N re .
Step 5: in the chemotactic process, the E. coli bacterium performs two actions during the entire life time, namely, tumble and swim. Initiate the chemotactic process and increase the value of x from 0 to N c .
Step 6b: perform tumbling action for each bacterium using Equation (14). This action will enable the bacteria to change the present direction for a period of time. : (14) In the above formula, T is the maximum number iterations, and ΔðiÞ/ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ΔðiÞ T ΔðiÞ q is a random forward direction of movement.
Step 6c: based on the tumbled direction obtained by the bacteria, each bacteria move to a random position using Equation (15). θ i ðx + 1, y, zÞ = θ i ðx, y, zÞ + LðiÞ × ∅ðiÞ: (15) In the above formula, θ i ðx, y, zÞ is the i th bacterium position in the x th chemotaxis, y th reproduction, and z th elimination-dispersal procedure, and θ i ðx + 1, y, zÞ means the i th bacterium position in the x + 1 th chemotaxis, y th reproduction, and z th elimination-dispersal procedure.
Step 6f: when the number of steps Ns in the swim process is greater than the swim length (m), then increase the m value to 1 ðm = m + 1Þ: If the value of Jði, x + 1, y, zÞ > J last replace the J last value using the current best objective value Jði, x + 1, y, zÞ, then assign the swim length m = Ns.
Step 7: if x < N c , then go to step 5. Else, go to step 8.
Step 8: execute the reproduction process. In this reproduction process, the accumulated cost of bacterium ð J i health Þ is calculated using equation (17).
Jði, x, y, zÞ: (17) The accumulated cost of bacterium ð J i health Þ represents the health of the bacterium. Bacteria will be sorted in descending order based on the J i health value. If the accumulated cost of bacterium is high, it means that the bacterium did not get enough nutrition or food during its entire lifetime. They are considered to have low health and set to die. The remaining healthy bacteria are divided into two. The reproduced bacteria are positioned at the same place as their parents.
Step 8a: if the number of defined reproduction steps N re is not achieved ð y < N re Þ, then go to step 4.
Step 9: execute the elimination-dispersal process. Based on the elimination probability (P ed ), this process is used to keep the number of bacteria in the population unchanged. If a bacterium is eliminated, a random search is initialized to move to a new position to avoid local optimum, after a certain number of reproduction movements.
Step 10: repeat steps 3 to 9 while the elimination-dispersal counter $z$ is less than the number of elimination-dispersal steps ($N_{ed}$); otherwise, terminate the process.
Output: optimal feature subset.

When a clinical dataset $C_s$ is subject to min-max normalization, the minimum value ($\mathrm{min}_A$) and maximum value ($\mathrm{max}_A$) in the value set of an attribute "A" are first identified, and normalization is performed using the formula presented in Equation (1).
$$a' = \frac{a - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A \tag{1}$$
In the formula, $a'$ is the normalized value of "a," where "a" is drawn from the value set of "A." Since min-max normalization is being used to normalize the values in the range 0 to 1, the value of $\mathrm{new\_max}_A$ is 1 and $\mathrm{new\_min}_A$ is 0. Table 3 presents, for each $C_s$, the number of instances used for constructing and testing the classifier before and after generating additional samples using SMOTE, the number of instances in the training set ($T_s$), and the number of instances in the testing set ($T_t$). After preprocessing, each $C_s$ is split into a training set (60%) and a testing set (40%).
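The normalization and the 60/40 split described above can be illustrated with a minimal Python sketch. The function names, the fixed random seed, the shuffling before the split, and the handling of constant attributes are assumptions made for illustration; the text does not state how the split was drawn.

```python
import random

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale one attribute's values to [new_min, new_max] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: map everything to new_min (an illustrative choice).
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def split_60_40(instances, seed=0):
    """Shuffle the instances and return (60% training set, 40% testing set)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.6 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Made-up attribute values, used only to exercise the functions.
ages = [20, 35, 50, 80]
print(min_max_normalize(ages))            # [0.0, 0.25, 0.5, 1.0]

train, test = split_60_40(range(10))
print(len(train), len(test))              # 6 4
```

With the default range of 0 to 1, the scaling factor reduces to 1, so each value is simply its offset from the minimum divided by the attribute's range.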

Feature Selection.
Feature selection is performed on each $T_s$ used for experimentation to select the optimal features for training the classifier. Selecting the optimal features from the $T_s$ is intended to improve the classification accuracy. A wrapper approach that uses three bioinspired algorithms, namely, CSO, KH, and BFO, with the classification accuracy of the SVM classifier as the fitness function is used to perform feature selection. An outline of CSO, KH, and BFO used for feature selection is presented below.
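To make the wrapper idea concrete, the following Python sketch scores a candidate feature subset, encoded as a 0/1 mask, by the accuracy of a classifier trained on the selected columns only. It is a sketch under stated assumptions: a nearest-centroid classifier stands in for the SVM so that the example needs no external library, the hold-out validation scheme is assumed, and all function names and data are hypothetical.

```python
def select_columns(rows, mask):
    """Keep only the feature columns whose mask bit is 1."""
    return [[v for v, keep in zip(row, mask) if keep] for row in rows]

def nearest_centroid_accuracy(X_train, y_train, X_val, y_val):
    """Stand-in scorer (NOT an SVM): accuracy of a nearest-centroid classifier."""
    centroids = {}
    for label in set(y_train):
        members = [x for x, y in zip(X_train, y_train) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]

    def predict(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

    hits = sum(predict(x) == y for x, y in zip(X_val, y_val))
    return hits / len(y_val)

def wrapper_fitness(mask, X_train, y_train, X_val, y_val,
                    scorer=nearest_centroid_accuracy):
    """Fitness of a feature subset = accuracy of a classifier trained on it."""
    if not any(mask):
        return 0.0  # an empty subset cannot be scored
    return scorer(select_columns(X_train, mask), y_train,
                  select_columns(X_val, mask), y_val)

# Made-up data: feature 0 separates the classes, feature 1 is noise.
X_train = [[0.1, 0.9], [0.2, 0.1], [0.9, 0.8], [0.8, 0.2]]
y_train = [0, 0, 1, 1]
X_val = [[0.15, 0.5], [0.85, 0.5]]
y_val = [0, 1]

print(wrapper_fitness([1, 0], X_train, y_train, X_val, y_val))  # 1.0
```

A bioinspired optimizer would call a function of this kind once per candidate mask and keep the mask with the highest fitness; passing a different `scorer` is where an SVM-based accuracy estimate would plug in.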

Outline of the CSO Algorithm for Feature Selection.
CSO is inspired by and modeled on two main postures of cats, namely, resting and tracing. Mimicking the resting behavior of a cat is named seeking mode, and mimicking the tracing behavior of a cat is named tracing mode. The seeking mode relates to a local search process, whereas the tracing mode relates to a global search process. The vital parameters that play an important role in CSO are outlined in Table 4. Tracing mode relates to a cat's movement while chasing prey, for example, chasing a rat.

CSO   15   9  20  16   3   6   5
KH    17  10  39  10   3  10   8
BFO   18   9  35  19   2  11   5
Number of hidden nodes
CSO   30  18  40  32   6  12  10
KH    34  20  78  20   6  20  16
BFO   36  18  70  38   4  22  10

The steps to select the optimal feature subset using CSO are outlined below (Algorithm 1):

Outline of the KH Algorithm for Feature Selection.
The KH algorithm is centered on the imitation of the herding behavior of krill swarms with respect to precise biological and environmental processes. Krill density is reduced by predators, namely, seals, penguins, or seabirds. The herding of the krill individuals has two goals: increasing the krill density and reaching the food. The fitness function of each krill individual has been defined as the least distance of each individual krill from food and from the highest density of the herd.
Three vital actions are considered to define the time-dependent position of an individual krill: movement induced by other krill individuals, foraging activity, and random diffusion.
Input: training set (FCSO, FKH, FBFO).
Step 1: initialize the parameters, namely, weights and bias, number of hidden layers, and learning rate of the BPNN.
Step 2: the number of hidden nodes is calculated using Equation (18).
$$H = 2n \tag{18}$$
In the above formula, $H$ is the number of hidden nodes, and $n$ is the number of input nodes.
Step 3: the input of the hidden layer is calculated using Equation (19). In Equation (19), $I_h$ is the input of the hidden layer, $w_{ij}$ is the weight of each input node, and $\phi_j$ is the bias.
Step 4: the output of the hidden layer is calculated using Equation (20).
$$O_j = \frac{1}{1 + e^{-I_j}} \tag{20}$$
where $O_j$ is the output of the $j$th hidden node, and $I_j$ is the input to the neuron from the previous layer.
Step 5: calculate the error rate in the predicted output using Equation (21). In Equation (21), $Z^i_k$ is the expected output, and $C^i_k$ is the obtained output.
Step 6: update the new weights and bias based on the learning rate and error rate using CGA.
Step 7: repeat the steps from 2 to 5 until the error rate converges.
Output: three BPNN classifiers trained using FCSO, FKH, and FBFO.

Krill individuals attempt to maintain a high density and hence move due to their mutual effect. Local swarm density, target swarm density, and repulsive swarm density are used to estimate the direction of motion. Food location and prior experience about the food location are the two parameters used to estimate the foraging motion. Random diffusion is used for the exploration of the search space. In the KH algorithm, the population diversity is improved by means of the diffusion function, which is integrated into the krill individuals. Random diffusion is the net movement of each krill individual from high-density to low-density regions. The motion velocity of a krill particle follows the Lagrangian model [43] as shown in Equation (5).
$$\frac{dx_i}{dt} = N_i + F_i + RD_i \tag{5}$$
In the above formula, $dx_i/dt$ is the motion velocity of krill particle $i$, $N_i$ is the induced motion, $F_i$ is the foraging motion, and $RD_i$ is the random diffusion of the $i$th krill individual. The vital parameters that play an important role in the KH algorithm are outlined in Table 5.
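As a small illustration of the Lagrangian model, the Python sketch below adds the three motion components per dimension and advances a krill position by one explicit Euler step. The time step `dt`, the Euler update itself, and the numeric values are assumptions for illustration; how the induced, foraging, and diffusion terms are computed is not covered here.

```python
def krill_velocity(induced, foraging, diffusion):
    """Motion velocity per dimension: induced motion + foraging motion + random diffusion."""
    return [n + f + d for n, f, d in zip(induced, foraging, diffusion)]

def euler_step(position, velocity, dt):
    """Advance the position by one explicit Euler step of size dt (dt is assumed)."""
    return [x + dt * v for x, v in zip(position, velocity)]

# Made-up component values for a krill individual in a 2-dimensional search space.
position = [0.0, 1.0]
velocity = krill_velocity(induced=[0.5, 0.0],
                          foraging=[0.25, -0.5],
                          diffusion=[0.25, 0.0])
print(velocity)                                  # [1.0, -0.5]
print(euler_step(position, velocity, dt=0.5))    # [0.5, 0.75]
```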
The steps to select the optimal feature subset using KH are outlined below (Algorithm 2):

5.2.3. Outline of the BFO Algorithm for Feature Selection.
The bacterial foraging optimization (BFO) algorithm imitates the pattern exhibited during the foraging process of Escherichia coli bacteria, which includes chemotaxis, swarming, reproduction, and elimination-dispersal operations [41]. The basic idea behind the foraging strategy of E. coli bacteria is to obtain the maximum nutrition in a unit time. The chemotaxis strategy involves searching for nutrition through small movements such as tumbling, moving, and swimming, using the locomotory organs called flagella. The swarming strategy deals with the communication between bacteria. When the bacteria discover a high amount of nutrients, they release chemical substances to attract other bacteria. If they are in danger, they tend to repel other bacteria. The reproduction process involves splitting each healthier bacterium into two bacteria, and the less healthy bacteria are set to die. Finally, the elimination-dispersal strategy involves replacing the low-health bacteria with randomly generated new ones. The vital parameters that play an important role in the BFO algorithm are outlined in Table 6.
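Two of these operations, the chemotaxis move of Equation (15) and the health-based reproduction of step 8, can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: drawing the tumble direction as a random unit vector, assuming an even population size, and all function names are assumptions. Sorting by ascending accumulated cost and keeping the first half selects the same survivors as sorting in descending order and discarding the first half.

```python
import math
import random

def tumble_direction(dim, rng):
    """Random unit vector, used here as the tumble direction (an assumption)."""
    delta = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(d * d for d in delta)) or 1.0
    return [d / norm for d in delta]

def chemotaxis_move(theta, step_size, direction):
    """Position update: new position = old position + step size * direction."""
    return [t + step_size * d for t, d in zip(theta, direction)]

def reproduce(positions, health):
    """Drop the half with the highest accumulated cost; split each survivor in two."""
    order = sorted(range(len(positions)), key=lambda i: health[i])  # healthiest first
    survivors = [positions[i] for i in order[: len(positions) // 2]]
    # Offspring are placed at the same position as their parents.
    return [list(p) for p in survivors] + [list(p) for p in survivors]

rng = random.Random(1)
direction = tumble_direction(3, rng)
print(chemotaxis_move([0.0, 0.0, 0.0], step_size=0.1, direction=direction))

# Made-up accumulated costs for four bacteria in a 1-dimensional space.
print(reproduce([[0], [1], [2], [3]], health=[4.0, 1.0, 3.0, 2.0]))
# [[1], [3], [1], [3]]
```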
The steps involved in finding the optimal feature subset using the BFO algorithm are outlined below:

Classifier Training.
Each $C_s$ is preprocessed and split into a training set ($T_s$, 60%) and a testing set ($T_t$, 40%). A wrapper approach that uses three bioinspired algorithms, CSO, KH, and BFO, with the classification accuracy of SVM as the fitness function has been used for feature selection. The features selected by each bioinspired algorithm are used to train three BPNNs independently using CGA. Each BPNN has one hidden layer, and the activation function used in the hidden layer is sigmoid. The learning rate is 1e-07, and the maximum number of iterations is 100. Since the classification is binary, each BPNN has only one output node, and the activation function used in the output layer is sigmoid. Figure 2 elaborates the process of training the BPNN classifiers.
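The network shape described above can be sketched in Python as a forward pass through one sigmoid hidden layer with $H = 2n$ nodes and a single sigmoid output node. Training with CGA is deliberately not shown. The weight initialization range, the dictionary layout, the input values, and the 0.5 decision threshold are assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_network(n_inputs, rng):
    """One hidden layer with H = 2n nodes and a single output node."""
    n_hidden = 2 * n_inputs

    def rand():
        return rng.uniform(-0.5, 0.5)  # initialization range is an assumption

    return {
        "w_hidden": [[rand() for _ in range(n_inputs)] for _ in range(n_hidden)],
        "b_hidden": [rand() for _ in range(n_hidden)],
        "w_out": [rand() for _ in range(n_hidden)],
        "b_out": rand(),
    }

def forward(net, x):
    """Weighted sum plus bias, then sigmoid, for the hidden layer and the output node."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["w_hidden"], net["b_hidden"])]
    return sigmoid(sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"])

net = init_network(n_inputs=3, rng=random.Random(0))
p = forward(net, [0.2, 0.7, 0.1])      # made-up, already normalized feature values
label = 1 if p >= 0.5 else 0           # 0.5 threshold is an assumption
print(len(net["w_hidden"]), label)
```

With three selected features the sketch builds six hidden nodes, matching the rule that the number of hidden nodes is twice the number of input nodes.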
The number of training instances for the FCSO, FKH, and FBFO classifiers is presented in Table 3. Though the majority of the features selected by each bioinspired algorithm overlap, it has been observed that the number of features selected by each algorithm is not the same. The parameter settings for each classifier are presented in Table 7.
The steps to train the three BPNN classifiers using the features selected by the CSO, KH, and BFO algorithms are outlined below:

Classifier Testing and Dataset Construction for Super Learner.
After training the classifier with 60% of the preprocessed $C_s$ ($T_s$), classifier testing is performed using the remaining 40% of the preprocessed $C_s$ ($T_t$). Figure 3 elaborates the process of testing the three classifiers and also throws light on the process of training the super learner. Feature selection is performed on the testing set by querying the FCSO, FKH, and FBFO databases. The instances of the testing set containing the features selected by CSO are used to test the FCSO classifier; similarly, the instances of the testing set containing the features selected by KH and BFO are used to test the FKH and FBFO classifiers, respectively. The performance of the FCSO, FKH, and FBFO classifiers is evaluated using the results obtained from the testing set.
The classification results of the FCSO, FKH, and FBFO classifiers for each instance of the testing set, together with the class label corresponding to that instance, form the candidate instances for training and testing the super learner.
5.5. Super Learner Training and Testing.
As outlined in Section 5.4, the classification results of the FCSO, FKH, and FBFO classifiers for each instance of the testing set, together with the class label corresponding to that instance, form the candidate instances for training and testing the super learner. Figure 4 elaborates the process of training and testing the super learner. The training set comprises 80% of the instances, and the testing set comprises 20% of the instances. The number of training and testing instances for the super learner is presented in Table 3.
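The construction of the super learner's dataset can be sketched in Python as follows. Each meta-instance pairs the three base-classifier outputs for one test instance with that instance's class label, and the meta-instances are then split 80/20. Whether the base outputs are hard labels or scores, the shuffling, the seed, and all names are assumptions; the prediction values are made up.

```python
import random

def build_meta_dataset(pred_cso, pred_kh, pred_bfo, labels):
    """One meta-instance per test instance: three base predictions plus the true label."""
    return [((a, b, c), y) for a, b, c, y in zip(pred_cso, pred_kh, pred_bfo, labels)]

def split_80_20(instances, seed=0):
    """Shuffle the meta-instances and return (80% training set, 20% testing set)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Made-up outputs of the three base classifiers on 10 test instances.
pred_cso = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
pred_kh  = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
pred_bfo = [0, 0, 1, 1, 0, 0, 1, 1, 1, 0]
labels   = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]

meta = build_meta_dataset(pred_cso, pred_kh, pred_bfo, labels)
meta_train, meta_test = split_80_20(meta)
print(meta[0])                          # ((1, 1, 0), 1)
print(len(meta_train), len(meta_test))  # 8 2
```

Because the meta-instances come only from the base classifiers' testing set, the super learner never sees predictions made on data the base classifiers were trained on.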
Super learner is a type of ensemble classifier [44]. In this work, a BPNN classifier trained using CGA is used as the super learner. The parameter settings for the super learner are presented in Table 8.
The super learner is trained using the steps presented in Section 5.3 for training the BPNN classifier using CGA, and the performance of the super learner is evaluated using the testing set.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{22}$$
In the above formula, TP is the number of positive instances predicted as positive by the classifier, TN is the number of negative instances predicted as negative by the classifier, FP is the number of negative instances predicted as positive by the classifier, and FN is the number of positive instances predicted as negative by the classifier.
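A short Python sketch of this computation is given below, counting the four outcomes from paired true and predicted labels and then applying the accuracy formula. The function names, the choice of 1 as the positive class, and the label values are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of instances classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Made-up labels for eight test instances.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(counts)              # (3, 3, 1, 1)
print(accuracy(*counts))   # 0.75
```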
The super learner has achieved a classification accuracy of 96.83% for WDBC, 86.36% for SHD, 94.74% for HCC, 90.48% for HD, 81.82% for VCD, 84.0% for CHD, and 70.0% for ILP. The classification accuracy of the proposed work has been compared with the performance of existing work on clinical datasets, and the comparison results are summarized in Table 16.

Conclusion and Scope for Future Work
A CAD system that employs a super learner to diagnose the presence or absence of a disease has been implemented in this work. Seven clinical datasets ($C_s$) from the UCI ML repository, namely, WDBC, SHD, HCC, HD, VCD, CHD, and ILP, have been used for experimentation. Each $C_s$ is preprocessed, and the preprocessed $C_s$ is split into training and testing sets. A wrapper-based feature selection approach using three bioinspired algorithms, namely, CSO, KH, and BFO, with the accuracy of the SVM classifier as the fitness function has been used to select the optimal feature subsets. The selected feature subsets are used to train three BPNN classifiers using CGA, and the performance of the trained classifiers is evaluated. The classification results obtained from the three classifiers for each instance of the testing set, together with the class label associated with that instance, form the candidate instances for training and testing the super learner. The super learner achieved a classification accuracy of 96.83% for WDBC, 86.36% for SHD, 94.74% for HCC, 90.48% for HD, 81.82% for VCD, 84.0% for CHD, and 70.0% for ILP.
CAD systems to diagnose disorders in the human body from different imaging modalities such as X-ray, computed tomography, magnetic resonance imaging, and positron emission tomography are gaining importance. This work can be extended by developing CAD systems to diagnose disorders from the medical images acquired through different imaging modalities. Features based on shape, texture, and run length can be extracted from the images, and the feature selection algorithms used in this work can be used to select the relevant features. The relevant features can be used to build classifier models to predict the presence or absence of disorders from the images.

Data Availability
The data supporting this study are from previously reported studies and datasets, which have been cited. The datasets used in this research work are available at UCI Machine Learning Repository.