Exploiting Feature Selection and Neural Network Techniques for Identification of Focal and Nonfocal EEG Signals in TQWT Domain

For drug-resistant patients, surgical removal of the portion of the brain that causes epileptic seizures is a treatment option. However, before surgery, a detailed analysis of the epileptogenic area is an essential step. The Electroencephalogram (EEG) signals from these areas are distinct and are referred to as focal, while the EEG signals from other, normal areas are known as nonfocal. Visual inspection of multiple channels for detecting focal EEG signals is time-consuming and prone to human error. To address this challenge, we propose a novel method based on the differential operator and the tunable Q-factor wavelet transform (TQWT) to distinguish focal from nonfocal signals. For this purpose, first, the EEG signal was differenced and then decomposed by TQWT. Second, several entropy-based features were derived from the TQWT subbands. Third, the efficacy of six binary feature selection algorithms, the binary bat algorithm (BBA), the binary differential evolution (BDE) algorithm, the firefly algorithm (FA), the genetic algorithm (GA), grey wolf optimization (GWO), and particle swarm optimization (PSO), was evaluated. Finally, the selected features were fed to several machine learning and neural network classifiers. We observed that PSO combined with neural networks provides an effective solution for focal EEG signal detection. The proposed framework achieved an average classification accuracy of 97.68%, a sensitivity of 97.26%, and a specificity of 98.11% under a tenfold cross-validation strategy, which is higher than state-of-the-art results reported on the public Bern-Barcelona EEG database.


Background.
Epilepsy is a disease of the central nervous system in which the brain is abnormally active [1][2][3]. In other words, epilepsy is a neurological disorder that causes abnormal electrical discharges in the brain. The biggest challenge in neuroscience is understanding the behavior of epilepsy and its effect on the brain. Determining the type of seizure, the location of the attack, how it spreads, the amount of brain tissue that is affected, and how long it lasts all play an important role.
Around 50 million people worldwide, most of whom live in developing countries, suffer from epilepsy [5]. Neurologists classify epilepsy into two categories: partial or focal epilepsy and generalized epilepsy. If epileptic attacks occur in limited areas of the brain, the condition is called focal epilepsy; if the whole brain is involved, it is called a generalized seizure (see Figure 1).
Epilepsy can afflict anyone at any age, but it can often be controlled with antiepilepsy medications. The results of antiepilepsy drugs are promising, but 25% of patients do not respond well to them [6]. 20% of patients with generalized epilepsy and 60% of patients with focal epilepsy are not treated successfully by antiepilepsy drugs. Surgery is a treatment option for patients with focal epilepsy who do not respond well to antiepilepsy drugs. In such cases, a physician removes the brain areas that are the source of the epileptic attack. However, a significant step before surgery is to localize these areas of the brain. Although positron emission tomography (PET) [7], single-photon emission computed tomography (SPECT) [8], and magnetic resonance imaging (MRI) [9] can localize focal areas of the brain, the main problem is the inaccessibility of this equipment in most developing countries.
Electroencephalogram (EEG) signals measure the electrical activity of brain synapses. EEG signals are commonly used by physicians to diagnose brain disorders or activities. The physician may localize the focal areas of the brain by visual inspection of the EEG signal. In focal epilepsy patients, EEG signals recorded from the focal areas of the brain are distinct and are known as focal EEG signals. In contrast, nonfocal signals are recorded from other parts of the brain (see Figure 1). Visual inspection of multiple channels for detecting the focal EEG signal is time-consuming and vulnerable to human error. Therefore, a computer diagnostic system is required for the accurate detection of focal signals.

Previous Works.
Several machine learning methods based on linear and nonlinear features have been developed in the literature for the classification of focal and nonfocal EEG signals. A typical machine learning method has three main steps: (1) feature extraction, (2) feature selection, and (3) classification. We can categorize previous works according to these three steps. Linear and nonlinear features have been extracted from the EEG signal in the time domain, frequency domain, and time-frequency domain.
In [10] delay permutation entropy and in [11][12][13] a combination of linear and nonlinear features were extracted from EEG signals in the time domain; in other words, these features were extracted directly from the EEG signals. Features extracted from the EEG signal spectrum are frequency-domain features, as in [14], in which the EEG signal spectrum was computed by the Fourier transform (FT) and the mean frequency and root mean square were extracted as discrimination features.
Sample entropy [15,16], Shannon entropy [6,16,19], Renyi's entropy [6,16,19], approximate entropy [16], phase entropies [16,19], log energy (LE) entropy [6,17,[24][25][26]32], Stein's unbiased risk estimate (SURE) entropy [17,24,25], Tsallis wavelet entropy [19], fuzzy entropy [19,26,31], and permutation entropy [19] have previously been extracted from EEG signal coefficients as discrimination features for the classification of focal and nonfocal EEG signals. After extracting the features, the significant features must be selected and fed to the classifiers. Feature selection is an important step in designing machine learning applications: if the number of features fed to the classifier is very high, the complexity of the system increases; on the contrary, if the number of features is very low, the accuracy (ACC) of the system decreases and the machine is unable to decide correctly. Almost all of the previously proposed machine learning methods for focal and nonfocal signal classification use the p value for selecting significant features [6,11,13,14,16,17,19,22,23,25,26,31,[33][34][35], such that features with p values less than 0.05 were considered significant and could be used as input to the classifiers. Although this traditional method can select significant features, it cannot serve as a feature selection tool when all of the extracted features have a p value less than 0.05. In this case, heuristic search approaches can be used for feature selection. In other words, these approaches optimize the number of feature vector arrays to yield the best classification performance. In this work, we used six binary optimization methods for selecting significant features, namely, the binary bat algorithm (BBA), the binary differential evolution (BDE) algorithm, the firefly algorithm (FA), the genetic algorithm (GA), the grey wolf optimizer (GWO), and particle swarm optimization (PSO).
The support vector machine (SVM) and K-nearest neighbor (KNN) algorithms are two well-known classifiers used previously for the classification of focal and nonfocal EEG signals. Although the performance of these two algorithms for focal detection has been acceptable, it is worth examining the performance of the feed-forward neural network (FFNN), cascade-forward neural network (CFNN), generalized regression neural network (GRNN), and recurrent neural network (RNN) classifiers for this application [36][37][38][39][40][41]. The tunable Q-factor wavelet transform (TQWT) has been proposed for analyzing nonstationary, nonlinear, and oscillatory signals [42].
We can tune the TQWT parameters to find the settings that yield the best classification performance. Recently, authors have shown the significant effect of the differential operation on the classification of focal and nonfocal signals [23,32]. Moreover, in previous studies, the results of entropy-based features were promising for focal and nonfocal EEG classification [6,17,33,34]. For these reasons, the present study focuses on the influence of entropy-based features extracted from differenced EEG signals in the TQWT domain. Therefore, the differenced EEG signal is decomposed into several subbands using TQWT, and entropy-based features including LE entropy, log L2-norm (LL2) entropy, SURE entropy, and threshold (TH) entropy are studied. In most previous works, the Kruskal-Wallis statistical (KWS) test was used for feature selection; in this work, various algorithms including BBA, BDE, FA, GA, GWO, and PSO are evaluated as feature selection techniques to reduce the number of input feature vector arrays, which makes the classifier's computation much simpler. Furthermore, the selected features are tested with six classifiers, namely, KNN, SVM, FFNN, CFNN, GRNN, and RNN, in a tenfold cross-validation strategy.
To the best of the authors' knowledge, the entropies of TQWT subbands of differenced EEG signals and BBA, BDE, FA, GA, GWO, and PSO algorithms as feature selection, as well as FFNN, CFNN, GRNN, and RNN classifiers, have not been previously employed for the focal and nonfocal EEG signals classification.

Organization.
The paper is organized as follows: Section 2 explains the proposed method, which consists of the description of the used database, the differential operator, TQWT, entropy-based features, feature selection, and classification algorithms.
The results and discussion are presented in Sections 3 and 4, respectively. Finally, the conclusion of the article is presented in Section 5.

Proposed Method
In this study, TQWT extracts subbands of EEG signals after differencing, and four entropy-based features are computed from the subbands for discrimination between focal and nonfocal EEG signals. The feature vector arrays are reduced by feature selection techniques and fed to classifiers. The block diagram of the proposed method is shown in Figure 2.

Used Database.
In this research, the Bern-Barcelona EEG dataset has been used for the evaluation of the proposed method [62]. The EEG signals of this dataset were recorded from five focal epilepsy patients who were candidates for brain surgery. This dataset consists of 3750 focal and 3750 nonfocal EEG signals. The sampling frequency was 512 Hz and the duration of each signal is 20 seconds, so each signal has 10240 samples. Each signal has two columns, namely, "X" and "Y," which were recorded from adjacent channels. In this work, "X-Y" is used to reduce noise and interference effects [6,17]. Figure 3 shows a sample of the "X," "Y," and "X-Y" signals for the focal and nonfocal groups. In this study, all signals, comprising more than 41.6 hours of EEG data, in the Bern-Barcelona database are used.
Journal of Healthcare Engineering 3

Differential Operator. By assuming that A[n] = [a_1, a_2, . . . , a_n] is a sequence of length n, the differential operator is defined as follows:

A_diff[i] = a_(i+1) - a_i, i = 1, 2, . . . , n - 1,

where A_diff denotes the differenced version of A, which has n - 1 samples.
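The operator above amounts to a first-order difference. A minimal sketch in Python (illustrative only; the study itself was implemented in MATLAB):

```python
import numpy as np

def difference(a):
    """First-order differential operator: A_diff[i] = a[i+1] - a[i].

    Returns a sequence one sample shorter than the input, matching the
    definition above (n samples in, n - 1 samples out).
    """
    a = np.asarray(a, dtype=float)
    return a[1:] - a[:-1]
```

For example, `difference([1, 3, 6, 10])` yields `[2.0, 3.0, 4.0]`, and a 10240-sample EEG signal is reduced to 10239 samples.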

Tunable Q-Factor Wavelet Transform. The tunable Q-factor wavelet transform has been proposed in [42] for analyzing complex and oscillatory signals like EEG. The traditional DWT is one of the most commonly used tools in signal analysis applications, but it has several defects, including the fixed number of oscillations in the mother wavelet, the fixed oversampling rate, and the fixed bandwidths of the filter bank [61]. The TQWT is a proficient transform that overcomes the mentioned limitations for analyzing oscillatory signals by providing tunability of the Q-factor [42]. The main input variables of this transform, which can be easily adjusted, are the number of decomposition levels (J), the Q-factor (Q), and the total oversampling rate (r). Q represents the number of wavelet oscillations and r denotes the overlap between frequency responses. An increase in Q makes all frequency responses narrower, allowing more decomposition levels to span the same frequency range. An increase in r with a constant Q increases the degree of overlap between adjacent frequency responses, increasing the number of decomposition levels required to cover the same frequency range [59]. The low-pass and high-pass filters, with αf_s and βf_s as the scaling parameters associated with the input signal s[n] having a sampling rate of f_s, are given for each decomposition level as follows [42]:

H_0(ω) = 1 for |ω| ≤ (1 - β)π; θ((ω + (β - 1)π)/(α + β - 1)) for (1 - β)π < |ω| < απ; 0 for απ ≤ |ω| ≤ π,

H_1(ω) = 0 for |ω| ≤ (1 - β)π; θ((απ - ω)/(α + β - 1)) for (1 - β)π < |ω| < απ; 1 for απ ≤ |ω| ≤ π,

Figure 3: The first and second rows, from left to right, show the "X," "Y," "X-Y," and differenced EEG signals for the focal and nonfocal groups, respectively. (The preprocessing applied is x(n) = "X-Y" followed by A(n) = x(n) - x(n-1).)
where θ(ω) represents the 2π-periodic power-complementary function selected as the frequency response of the Daubechies filter with two vanishing moments. Meanwhile, θ(ω) is defined by the following equation:

θ(ω) = (1/2)(1 + cos ω) sqrt(2 - cos ω), |ω| ≤ π.

The filter bank used to perform the decomposition can also be used to reconstruct the original input signal from the selected subbands. The oversampling rate r and the Q-factor can be formulated in terms of the filter-bank parameters α and β [42] as follows:

r = β / (1 - α), Q = f_c / BW = (2 - β) / β,

where f_c and BW denote the center frequency and bandwidth of subband J, respectively.
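The relations between (Q, r) and the filter-bank scalings (α, β) can be inverted, which is how the transform is parameterized in practice. A small Python sketch of these standard TQWT relations [42] (illustrative, not the authors' implementation; the level-count formula assumes Selesnick's bound J_max = floor(log(βN/8)/log(1/α))):

```python
import math

def tqwt_params(Q, r):
    """Map the Q-factor and oversampling rate r to the TQWT filter-bank
    scaling parameters (alpha, beta)."""
    beta = 2.0 / (Q + 1)       # inverts Q = (2 - beta) / beta
    alpha = 1.0 - beta / r     # inverts r = beta / (1 - alpha)
    return alpha, beta

def tqwt_max_levels(Q, r, n):
    """Maximum number of decomposition levels for an n-sample signal
    (assumed bound: J_max = floor(log(beta*n/8) / log(1/alpha)))."""
    alpha, beta = tqwt_params(Q, r)
    return int(math.log(beta * n / 8.0) / math.log(1.0 / alpha))
```

With Q = r = 3 and the 10240-sample Bern-Barcelona signals, `tqwt_max_levels(3, 3, 10240)` evaluates to 35, consistent with the maximum decomposition level of 35 reported in Section 3.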

Entropy-Based Features.
Nowadays, entropy is one of the most widely used tools in signal processing applications.
In telecommunications, entropy measures the amount of missing information, whereas in physics, entropy measures the uncertainty or degree of disorder in a chaotic system. Generally, entropy can measure the complexity of a nonlinear system like the brain. In this study, LE entropy, LL2 entropy, SURE entropy, and TH entropy were extracted from the EEG signal subbands as discrimination features between focal and nonfocal signals:

E_LE = Σ_i log(s_i²),
E_LL2 = log(Σ_i s_i²),
E_SURE = N - #{i : |s_i| ≤ ε} + Σ_i min(s_i², ε²),
E_TH = #{i : |s_i| > ε},

where s_i is the ith sample of the signal and ε is a positive threshold. In this work, the value of ε is set to 0.2 in both the SURE and TH entropies. Also, the wentropy MATLAB function is used for the calculation of the LE, LL2, SURE, and TH entropies. Thus, the entropy-based features can be extracted from the TQWT subbands.
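As an illustration, the four entropies can be computed as follows in Python (a sketch: the LE, SURE, and TH formulas follow MATLAB's wentropy definitions, while treating LL2 as the logarithm of the signal energy is an assumption; the study itself used MATLAB):

```python
import numpy as np

EPS = 0.2  # positive threshold used for the SURE and TH entropies

def log_energy_entropy(s):
    # E_LE = sum_i log(s_i^2), with the convention log(0) = 0
    s2 = np.asarray(s, dtype=float) ** 2
    return float(np.sum(np.log(s2, where=s2 > 0, out=np.zeros_like(s2))))

def threshold_entropy(s, eps=EPS):
    # E_TH = number of samples with |s_i| > eps
    return int(np.sum(np.abs(np.asarray(s, dtype=float)) > eps))

def sure_entropy(s, eps=EPS):
    # E_SURE = N - #{i : |s_i| <= eps} + sum_i min(s_i^2, eps^2)
    s = np.asarray(s, dtype=float)
    return float(len(s) - np.sum(np.abs(s) <= eps)
                 + np.sum(np.minimum(s ** 2, eps ** 2)))

def ll2_entropy(s):
    # Assumed definition: log of the signal energy (squared L2 norm)
    s = np.asarray(s, dtype=float)
    return float(np.log(np.sum(s ** 2)))
```

In the proposed framework, each of these scalars is computed once per TQWT subband, so a 27-subband decomposition yields 4 × 27 = 108 candidate features per signal.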

Binary Bat Algorithm.
A bat emits a sound and follows the echo reflected from objects in the environment to avoid obstacles, discover prey, and locate its nest in the darkness [63]. The bat algorithm was inspired by this behavior. This algorithm uses artificial bats to find an optimal solution of an objective function. The binary bat algorithm (BBA) is a discrete version of the bat algorithm presented by Nakamura et al. [64]. In the BBA, the search space is limited to an n-dimensional Boolean hypercube, in which any bat may move along the nodes and corners of the lattice. For feature selection, each feature is described by one entry of a binary bat position vector [63]. Therefore, a value of 0 indicates the absence of a feature, and a value of 1 indicates the presence of a feature.
For a target function f(x), x = (x_1, x_2, . . . , x_n), the bat population starts with positions x_i, velocities v_i, and pulse frequencies f_i. Suppose that x̂_j is the current global best solution in dimension j and β ∈ [0, 1] is a random number. The pulse frequency, velocity, and position of the ith bat are updated as follows:

f_i = f_min + (f_max - f_min)β,
v_i^j(t) = v_i^j(t - 1) + (x_i^j(t - 1) - x̂_j) f_i,
x_i^j(t) = x_i^j(t - 1) + v_i^j(t).

At first, the pulse rate r_i, the loudness A_i, and the maximum number of iterations T are initialized. Then, at each iteration, these parameters are updated using the three following equations:

x_new = x_old + ε A(t),
A_i(t + 1) = α A_i(t),
r_i(t + 1) = r_i(0)[1 - exp(-βt)],

where ε ∈ [-1, 1] is a random variable and α, β are two constants. In each iteration, for each bat, the sigmoid function

S(v_i^j(t)) = 1 / (1 + e^(-v_i^j(t)))

is applied, and the position is updated as follows:

x_i^j(t) = 1 if ρ < S(v_i^j(t)), and 0 otherwise,

where x_i^j(t) and v_i^j(t) denote the position and velocity of bat i at iteration t in dimension j and ρ ∼ U(0, 1). In our study, the BBA is iterated 40 times by using four bats and setting α = β = 1.
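One velocity/position update of the BBA described above can be sketched in Python (illustrative only; the parameter names f_min and f_max for the pulse-frequency range are assumptions, and the loudness/pulse-rate bookkeeping is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def bba_step(x, v, x_best, f_min=0.0, f_max=2.0):
    """One velocity/position update of the binary bat algorithm (sketch).

    x      : (n_bats, n_features) current binary positions
    v      : (n_bats, n_features) velocities
    x_best : (n_features,) current global best position
    """
    beta = rng.random((x.shape[0], 1))          # beta ~ U(0, 1), one per bat
    f = f_min + (f_max - f_min) * beta          # pulse frequencies
    v = v + (x - x_best) * f                    # velocity update
    s = 1.0 / (1.0 + np.exp(-v))                # sigmoid transfer
    x = (rng.random(x.shape) < s).astype(int)   # binary position update
    return x, v
```

Feature j is selected by bat i whenever `x[i, j] == 1`, matching the 0/1 encoding described above.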

Differential Evolution Algorithm.
The differential evolution algorithm is a heuristic population-based random search scheme for global optimization. Many objective functions are nondifferentiable, discrete, nonlinear, noisy, flat, multidimensional, constrained, or stochastic; the differential evolution algorithm can be exploited to solve such problems. Basic operations such as selection, crossover, and mutation are the basis of the differential evolution algorithm [65]. The binary differential evolution (BDE) algorithm was introduced by Pampara et al. [66], in which a trigonometric function is utilized to produce a 0-1 string, mapping floating-point variables into binary numbers. For feature selection problems, the primary vector is treated as a candidate feature subset, and the feature subset is changed with the mutation and crossover actions. The distance between interclass and intraclass samples is computed as the target function to rate the quality of the feature subset.
Suppose that the initial population P_0 = {X_i^0 for i = 1, . . . , N} consists of N randomly selected individual solutions. In the mutation process, for each vector X_i, three vectors X_r1, X_r2, and X_r3 are randomly selected from the population such that r1 ≠ r2 ≠ r3 ≠ i. X_r1 is called the base vector. The X_r2 and X_r3 individuals determine whether the jth bit of the base vector is flipped (V_i,j = 1 - X_r1,j) or not. A crossover between the mutant V_i and its parent X_i determines the trial vector U_i. Therefore, the difference vector is calculated as [65,67]

Δ_i,j = X_r2,j XOR X_r3,j.

The mutant vector is then computed as follows:

V_i,j = 1 - X_r1,j if Δ_i,j = 1, and X_r1,j otherwise,

for i = 1, . . . , N and j = 1, . . . , D, where i and D are the vector index and the search space dimension, respectively. Moreover, the trial vector U_i is produced as follows:

U_i,j = V_i,j if rand_j ≤ CR or j = j_rand, and X_i,j otherwise,

where j_rand is a random integer in [1, D] and CR is a constant crossover rate in [0, 1]. In the selection phase, if the fitness of the trial vector is better than that of the current vector, it is substituted into the next generation.
In the current study, the maximum number of iterations is fixed to 100, the number of vectors is set to ten, and CR = 0.9. Besides, the total number of features in each vector is set to the number of all features in the dataset.
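The mutation and crossover steps described above can be sketched as follows (a Python illustration of one way to build the trial vector, assuming the bit-flip mutation just stated; the exact operator of [66] may differ):

```python
import numpy as np

rng = np.random.default_rng(1)

def bde_trial(X, i, CR=0.9):
    """Build the trial vector U_i of a binary DE sketch: bits of the base
    vector X_r1 are flipped where X_r2 and X_r3 differ (mutation), then a
    binomial crossover with rate CR mixes the mutant with the parent X_i."""
    N, D = X.shape
    r1, r2, r3 = rng.choice([k for k in range(N) if k != i],
                            size=3, replace=False)
    diff = X[r2] ^ X[r3]                        # difference vector (XOR)
    V = np.where(diff == 1, 1 - X[r1], X[r1])   # mutant vector
    j_rand = rng.integers(D)                    # guarantees one mutant bit
    cross = rng.random(D) <= CR
    cross[j_rand] = True
    return np.where(cross, V, X[i])             # trial vector U_i
```

In the selection phase the trial vector would replace `X[i]` only if its fitness (here, the classification-oriented target function) improves.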

Firefly Algorithm. The firefly algorithm (FA) is inspired by the flashing behavior of fireflies [68]. Fireflies attract opposite-sex counterparts by exploiting this flashing behavior. However, in the mathematical model of the FA, the fireflies are unisex, and each firefly may attract other fireflies. The charm of a firefly is equal to its brightness, and, of two fireflies, the brighter one attracts the other. Therefore, the firefly with less brightness flies towards the brighter one. The brightness intensity is inversely proportional to the distance. The distance between any two fireflies (i and j) at x_i and x_j is calculated by the Cartesian distance as follows [68]:

r_ij = ||x_i - x_j|| = sqrt(Σ_(d=1)^D (x_i,d - x_j,d)²),

where D is the dimension. The attractiveness of a firefly decreases exponentially as the distance increases. In the FA, the primary form of the attractiveness function β(r) is given by

β(r) = β_0 e^(-γr²).

Here, r and β(r) indicate the distance and the attractiveness at r between two fireflies, respectively. The original brightness (β_0) is the attractiveness at r = 0 and γ is a fixed light absorption coefficient. The movement of a firefly i is specified as follows:

x_i = x_i + β(r_ij)(X_best - x_i) + α(rand - 1/2).

The second term is owing to the attraction; X_best is the location of the most attractive firefly. Besides, the third term is randomization with scaling factor α, and rand is U[0, 1]. When firefly i moves towards firefly j, the position of firefly i changes from a binary number to a floating-point number.
Then the position of firefly i is binarized as follows:

x_i,d = 1 if ρ_i,d < S(x_i,d), and 0 otherwise, with S(x) = 1 / (1 + e^(-x)),

where x_i,d represents each bit of the dimension vector X_i and ρ_i,d is U[0, 1]. This guarantees that each bit will be either 0 or 1 [69]. In our study, the FA is iterated 100 times by using six fireflies and setting α = 0.5 and γ = 0.
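A single firefly move followed by the sigmoid binarization can be sketched in Python (illustrative; note that with γ = 0, as in this study, the attractiveness reduces to the constant β_0):

```python
import numpy as np

rng = np.random.default_rng(2)

def fa_move(x_i, x_j, beta0=1.0, gamma=0.0, alpha=0.5):
    """Move firefly i towards the brighter firefly j (sketch), then
    binarize each coordinate with a sigmoid."""
    r = np.sqrt(np.sum((x_i - x_j) ** 2))        # Cartesian distance
    beta = beta0 * np.exp(-gamma * r ** 2)       # attractiveness beta(r)
    x = x_i + beta * (x_j - x_i) + alpha * (rng.random(x_i.size) - 0.5)
    s = 1.0 / (1.0 + np.exp(-x))                 # sigmoid transfer
    return (rng.random(x_i.size) < s).astype(int)  # 0/1 position
```

Here `x_j` plays the role of the most attractive firefly (X_best in the update above).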

Genetic Algorithm. The genetic algorithm (GA) is a heuristic search algorithm inspired by Darwin's theory of evolution [70]. GA starts with a set of solutions (chromosomes) called a population. A new population is generated from these chromosomes, motivated by the hope that the new population will be superior to the former one. The chromosomes are chosen based on their fitness to create new chromosomes (offspring). Crossover and mutation are the two leading operators of GA.
There are different selection approaches, such as roulette wheel, tournament, and rank selection, which pick genes from parent chromosomes to create new offspring through crossover. After crossover, a mutation occurs to prevent falling into a local optimum. Mutations change the new offspring randomly.
This process is iterated until the maximum number of generations is reached or the best solution stops improving [71]. Figure 4 summarizes the GA steps.
GA has four basic parameters: the number of chromosomes (N), the maximum number of generations (T), the crossover rate (CR), and the mutation rate (MR). CR determines how often crossover is performed. If there is no crossover, the offspring is just a copy of the parents; if there is a crossover, parts of the parents' chromosomes form the offspring [70,71]. The mutation probability indicates how often parts of the chromosome mutate. If there is no mutation, the offspring is reproduced without any change; if a mutation is applied, a part of the chromosome is modified. We use mutations to prevent the GA from getting trapped in local extrema, but mutation should not happen very often, because otherwise the GA turns into a random search. Here, we used GA to select the best features using ten chromosomes and 100 generations. Besides, the CR and MR probabilities are initialized to 0.8 and 0.01, respectively, and the number of genes equals the total number of features in the dataset.
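One crossover-plus-mutation step with the stated CR and MR can be sketched as follows (a Python illustration using single-point crossover; the authors' exact crossover operator is not specified):

```python
import numpy as np

rng = np.random.default_rng(3)

def ga_offspring(p1, p2, cr=0.8, mr=0.01):
    """Produce one offspring chromosome (sketch): single-point crossover
    with probability cr, then bit-flip mutation with probability mr
    per gene."""
    child = p1.copy()
    if rng.random() < cr:                      # crossover
        point = rng.integers(1, p1.size)       # cut point in [1, D - 1]
        child[point:] = p2[point:]
    flip = rng.random(p1.size) < mr            # mutation mask
    child[flip] = 1 - child[flip]
    return child
```

Each gene is one feature, so a child chromosome is again a 0/1 feature-selection mask.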

Grey Wolf Optimization.
Grey wolf optimization (GWO) is a recent population-based optimization approach, which simulates the hunting process of grey wolves in nature [72]. Grey wolves most often prefer to live and hunt in a pack of, on average, 5-12 wolves and follow strict rules in a hierarchy. The most influential wolf in decision-making is named alpha (α), which leads the whole pack. The betas (β), probably the best candidates for alpha, are subordinates of the alpha which support it in decision-making and other activities and reinforce its decisions among the lower-level wolves. The omega is the next level below the beta in the hierarchy. The omega (ω) wolves play the role of scapegoat.
They must submit to all dominant wolves and are the last wolves permitted to feed. If a wolf in the pack does not belong to any group, it is called a delta (δ) wolf. Deltas must submit to the alphas and betas, but they dominate the omegas [67,72].
In GWO, each wolf updates its position according to its distance from the updated prey position, represented by the best three solutions alpha, beta, and omega. The distance between a wolf and the prey, D⃗, is defined as

D⃗ = |C⃗ · X⃗_P(t) - X⃗(t)|, C⃗ = 2 r⃗_2,

where C⃗ is a coefficient vector and X⃗_P(t) and X⃗(t) indicate the position vectors of the prey and a wolf, respectively. Also, t denotes the current iteration and r⃗_2 is a random vector in the [0, 1] interval.
The new position of a wolf with respect to the prey is determined by

X⃗(t + 1) = X⃗_P(t) - A⃗ · D⃗, A⃗ = 2 a⃗ · r⃗_1 - a⃗,

where A⃗ is a coefficient vector, r⃗_1 is a random vector in the interval [0, 1], and the members of a⃗ are linearly decreased from 2 to 0 during the optimization phase. In addition, the position of the wolves is updated with respect to the three best solutions as follows:

X⃗(t + 1) = (X⃗_1 + X⃗_2 + X⃗_3) / 3,

where X⃗_1, X⃗_2, and X⃗_3 are the positions computed with respect to the three leading wolves.

The wolves update their positions to real values in the continuous search space limited by the problem constraints. Nevertheless, for some problems, the variables and search space are restricted to binary values {0, 1}. Feature selection is such a binary problem: a feature subset is represented as a binary vector in which each element corresponds to a single feature. A value of 1 means that the feature is selected, and vice versa.
Therefore, in binary GWO, the variables and search space are mapped from real values to binary values using the sigmoid function [72]. For a solution x and dimension d, we have [72]

x_d^new = 1 if S(x_d) ≥ rand, and 0 otherwise, with S(x_d) = 1 / (1 + e^(-10(x_d - 0.5))),

where rand is U(0, 1). In our study, the binary GWO algorithm is iterated 100 times by using ten wolves.
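A single binary GWO position update can be sketched in Python (illustrative; the three-leader averaging and the shifted sigmoid follow the common binary GWO formulation, not necessarily the authors' exact code):

```python
import numpy as np

rng = np.random.default_rng(4)

def bgwo_step(x, leaders, a):
    """One binary GWO position update (sketch).

    x       : (D,) current binary position of a wolf
    leaders : (3, D) positions of the three best wolves
    a       : scalar decreased linearly from 2 to 0 over the iterations
    """
    X = np.zeros(x.size)
    for leader in leaders:
        A = 2 * a * rng.random(x.size) - a       # coefficient vector A
        C = 2 * rng.random(x.size)               # coefficient vector C
        D = np.abs(C * leader - x)               # distance to the leader
        X += (leader - A * D) / 3.0              # average of X1, X2, X3
    s = 1.0 / (1.0 + np.exp(-10.0 * (X - 0.5)))  # shifted sigmoid mapping
    return (rng.random(x.size) < s).astype(int)  # binary position
```

The continuous intermediate position X is never kept; only its binarized image enters the next iteration, which keeps the search on the Boolean hypercube.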
2.5.6. Particle Swarm Optimization. Binary particle swarm optimization (PSO) [73] is the discrete version of the PSO algorithm [74], which solves optimization problems based on the social behavior of animals such as the mass movement of birds and fish. Every single solution in PSO is treated as a particle. Every particle tries to find the best position over time, adapting its position with regard to its own experience and the experiences of its neighbors, consisting of the current velocity and position and the best prior position experienced by it and its neighbors. This process is performed repeatedly until a predetermined minimum error is reached or up to a certain number of iterations [67]. The last position of particle i that previously had the best fitness is stored as the personal best (p_best_i), and the particle position with the best fitness among the whole population is stored as the global best (g_best). In the velocity update, r_1 and r_2 are random values in the range (0, 1) and c_1 and c_2 are the cognitive and social parameters, respectively. In binary PSO, particle positions are modeled as bit strings, and the velocity is interpreted as a probability limited to the range [0, 1]; that is, the velocity of a particle defines the probability that the particle changes its state to one. Traditional binary PSO and most of its variants use different probability functions to cope with discrete optimization problems. The input parameters of binary PSO are the number of iterations (T), the number of particles (N), the cognitive learning factor (c_1), the social learning factor (c_2), the maximum bound on the inertia weight (ω_Max), the minimum bound on the inertia weight (ω_Min), the maximum velocity (V_Max), and the total number of features in a particle. ω is called the inertia weight, and it tunes the balance between global and local search.
If d is the dimension, the position and velocity vectors of the ith particle are defined as

X_i = (x_i^1, x_i^2, . . . , x_i^d), V_i = (v_i^1, v_i^2, . . . , v_i^d).

The equation to update the velocity of each particle is as follows:

v_i^j(t + 1) = ω v_i^j(t) + c_1 r_1 (p_best_i^j - x_i^j(t)) + c_2 r_2 (g_best^j - x_i^j(t)).

By using the sigmoid function, the position is updated according to the following equations:

x_i^j(t + 1) = 1 if rand < S(v_i^j(t + 1)), and 0 otherwise,

where

S(v_i^j(t + 1)) = 1 / (1 + e^(-v_i^j(t + 1))).

In the current study, the maximum number of iterations is kept at 100, the total number of particles in the population is set to ten, and the total number of features in a particle is set to the number of all features in the dataset. The cognitive and social factors (c_1 and c_2) are assigned the value of 2. Also, ω_Min, ω_Max, and V_Max are initialized to 0.4, 0.9, and 6, respectively.
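One binary PSO iteration with the stated parameter settings can be sketched in Python (an illustrative sketch, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(5)

def bpso_step(x, v, p_best, g_best, w=0.9, c1=2.0, c2=2.0, v_max=6.0):
    """One binary PSO update (sketch): inertia-weighted velocity update,
    velocity clipping at v_max, and sigmoid-based position binarization."""
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
    v = np.clip(v, -v_max, v_max)               # bound the velocity
    s = 1.0 / (1.0 + np.exp(-v))                # probability of a 1-bit
    x = (rng.random(x.shape) < s).astype(int)
    return x, v
```

In practice ω would be decreased from ω_Max to ω_Min over the iterations, shifting the swarm from global exploration to local refinement.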

Classification.
After selecting the significant features, they are fed to the classifiers for evaluating the performance of the proposed framework. In the present study, the results of the classifiers are reported by deploying a tenfold cross-validation strategy, which divides the dataset into ten equal subsets. In each round, one subset is used as test data and the remaining nine subsets are used as training data for the classifiers. In other words, every EEG signal of the dataset is used nine times as training data and once as test data.
The classification performance is measured by the accuracy (ACC), sensitivity (SEN), and specificity (SPE):

ACC = (T_TP + T_TN) / (T_TP + T_TN + T_FP + T_FN) × 100,
SEN = T_TP / (T_TP + T_FN) × 100,
SPE = T_TN / (T_TN + T_FP) × 100,

where T_TP, T_TN, T_FP, and T_FN are the total numbers of true positives, true negatives, false positives, and false negatives after ten rounds of training and testing the classifier. ACC shows the ability of the classifier to discriminate between the focal and nonfocal classes. In this work, six classifiers, namely, KNN, SVM, FFNN, CFNN, GRNN, and RNN, are used for classifying the EEG signals into focal and nonfocal groups. In the following subsections, we describe these classifiers.
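The three measures can be computed directly from the confusion totals accumulated over the ten folds (a straightforward Python sketch):

```python
def metrics(t_tp, t_tn, t_fp, t_fn):
    """Accuracy, sensitivity, and specificity (in %) from the confusion
    totals T_TP, T_TN, T_FP, and T_FN accumulated over the ten folds."""
    acc = 100.0 * (t_tp + t_tn) / (t_tp + t_tn + t_fp + t_fn)
    sen = 100.0 * t_tp / (t_tp + t_fn)
    spe = 100.0 * t_tn / (t_tn + t_fp)
    return acc, sen, spe
```

For example, `metrics(90, 95, 5, 10)` yields an accuracy of 92.5%, a sensitivity of 90.0%, and a specificity of 95.0%.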

K-Nearest Neighbor (KNN).
The well-known K-nearest neighbor (KNN) algorithm is a supervised classifier with a very easy implementation [6,13,23]. In the KNN algorithm, each test sample is classified using its K closest neighbors: the test sample is assigned to the class with the most members among those K neighbors. The value of K and the distance metric are the two main parameters of the KNN classifier. In this work, the city-block distance with values of K ranging from 1 to 9 in steps of 1 was used to attain the best results.
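A minimal KNN with city-block (L1) distance can be sketched in Python (illustrative only; the study used much larger feature sets and tuned K from 1 to 9):

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=3):
    """Minimal KNN with city-block (L1) distance and majority voting."""
    preds = []
    for x in test_X:
        d = np.sum(np.abs(train_X - x), axis=1)   # city-block distances
        nearest = train_y[np.argsort(d)[:k]]      # labels of k neighbors
        preds.append(np.bincount(nearest).argmax())  # majority vote
    return np.array(preds)
```

For instance, with training points [[0, 0], [0, 1]] labeled 0 and [[5, 5], [5, 6]] labeled 1, the test points [0, 0.5] and [5, 5.5] are assigned labels 0 and 1, respectively.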

Support Vector Machine (SVM).
Nowadays, the support vector machine (SVM) algorithm is one of the most widely used classifiers in biomedical machine learning applications. The SVM algorithm maps the features into a high-dimensional space by a kernel function and constructs an optimum hyperplane for separating the classes [6,12,14,22]. In the new higher-dimensional space, features of the same class lie close together and far away from the other class. In this work, the radial basis function (RBF) is used as the kernel [13], and the sigma value of the RBF is varied from 0.1 to 1.5 with a step size of 0.1 to attain the best results.

Feed-Forward Neural Network (FFNN).
In feed-forward neural networks (FFNN), neurons are arranged in multiple layers and signals are forwarded from input to output. During training, the output error is propagated back through the layers and the weights are adjusted to reduce the error. In this study, we use the tan-sigmoid transfer function, a single hidden layer with ten empirically chosen neurons, and the Levenberg-Marquardt algorithm for fast training [37,[39][40][41].

Cascade-Forward Neural Network (CFNN).
In cascade-forward neural networks (CFNN), neurons are interlinked with the previous and subsequent layers of neurons [76]. For example, a three-layer CFNN has direct connections between layer one and layer two, between layer two and layer three, and between layer one and layer three; that is, the input and output layers are connected both directly and indirectly. These additional connections help to achieve a better learning speed for the required relationship. As with the FFNN, in the CFNN we have utilized the tan-sigmoid transfer function, one hidden layer with ten neurons selected by trial and error, and the Levenberg-Marquardt method for quick learning.

Generalized Regression Neural Network (GRNN).
The general regression neural network (GRNN) is a single-pass neural network that uses a Gaussian activation function in the hidden layer. GRNN consists of input, hidden, summation, and division layers. The classification accuracy of GRNN largely depends on an accurate value of the spread factor. In this study, the spread factor is fixed to 1 after several experiments on the classification of focal and nonfocal EEG signals [1,77,78].
2.6.6. Recurrent Neural Network (RNN). In recurrent neural networks (RNN), signals can flow in cycles because the network has one or more feedback links. This characteristic allows the network to process temporal information and recognize trends. In this study, we implement the Elman recurrent neural network, which is the most prevalent form of RNN. For quick training of the model, the Levenberg-Marquardt method and a single hidden layer with ten empirically selected neurons are utilized [77].

Preprocessing.
Each signal of the Bern-Barcelona dataset has two different time series, "X" and "Y." For noise reduction, "X-Y" has been recommended as the input signal in previous studies [6,11,17,34]. For this reason, we have also used the "X-Y" time series as the input signal. In [23,32], the effect of the differential operator on focal and nonfocal EEG signal detection was discussed, and features of the differenced signal were suggested. Accordingly, we have applied the differential operator to the "X-Y" time series before the TQWT decomposition.

Selection of TQWT Parameters.
The accuracy of a classifier and the computational load involved depend directly on the selection of optimal values of the Q, r, and J parameters of the TQWT. In other words, selecting the optimal values of these parameters is an important step before signal decomposition. We used three steps for setting the parameters of TQWT. In the first step, Q, r, and J are fixed in order to choose the best classifier. In the second step, J is kept fixed while the optimum values of the Q and r parameters are determined. In the third step, with the optimum values of Q and r, the optimum value of J is found.

First Step. For choosing the optimal values of the TQWT parameters, the values of Q and r were set to 2 and the value of J was set to 5. Then, entropy-based features were extracted from the subbands and fed to the KNN, SVM, FFNN, CFNN, RNN, and GRNN classifiers. A ten-fold cross-validation strategy was employed during the training and testing of the classifiers. Figure 5 shows the resulting classification ACC of these features for the various classifiers.
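The ten-fold cross-validation strategy partitions the data so that every signal is tested exactly once. A minimal sketch of the fold construction (without shuffling or stratification, which an actual experiment may add):

```python
def kfold_indices(n, k=10):
    """Split indices 0..n-1 into k contiguous folds of (train, test) index lists."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds
```

The reported ACC, SEN, and SPE values are then averages over the k held-out folds.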

Second Step. In this step, Q is varied from 2 to 10, r is varied from 2 to 5, and the value of J is kept fixed at 5. Then, the entropy-based features are extracted from the subbands and fed to the CFNN classifier. Figure 6 illustrates the resulting ACC for the various values of Q and r. It is evident from Figure 6 that the deployment of TQWT along with the CFNN classifier resulted in the highest classification ACC when the values of both Q and r were 3.

Third Step. By fixing Q and r to 3, the maximum decomposition level of TQWT is 35; so, we checked the resulting classification ACC for J from 5 to 35 and illustrated the results in Figure 7. Figure 7 depicts that the ACC of the CFNN classifier improves as the decomposition level increases, and the maximum ACC is achieved when the value of J is 26. Finally, the optimum values of Q, r, and J are found to be 3, 3, and 26, respectively, which lead to the best classification performance. It should be noted that, by selecting J to be 26, the input EEG signals are decomposed into 27 subbands (1 approximation and 26 details), such that subband 1 to subband 26 are detail 1 to detail 26 and subband 27 is approximation 26.
Furthermore, for the selected optimum values of the Q, r, and J parameters, Figure 8 shows the designed TQWT filter bank in the frequency domain and Figure 9 shows the decomposed TQWT subbands.

Feature Extraction.
The mean and standard deviation of the extracted entropy-based features for focal and nonfocal EEG signals are given in Table 1. It is clear that the values of the entropy-based features for the focal group in all subbands except details 1, 2, and 3 are lower than those of the nonfocal group. Also, the standard deviation of the extracted entropy-based features in the focal group was less than that of the nonfocal group. Since entropy quantifies the chaotic behavior of a signal, the lower mean and standard deviation values of the entropies in the focal group indicate less random (more rhythmic) behavior of the focal EEG signals in comparison to the nonfocal EEG signals.
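For illustration, three of the four entropies can be written down compactly if one assumes the usual MATLAB `wentropy`-style definitions (the LL2 entropy is not defined in this excerpt, so it is omitted here); the threshold values `t` are arbitrary placeholders, not the paper's settings.

```python
import math

def log_energy_entropy(x, eps=1e-12):
    """LE entropy: sum of log(x_i^2), with a small guard against log(0)."""
    return sum(math.log(max(v * v, eps)) for v in x)

def threshold_entropy(x, t=0.2):
    """TH entropy: number of coefficients whose magnitude exceeds t."""
    return sum(1 for v in x if abs(v) > t)

def sure_entropy(x, t=0.2):
    """SURE entropy: n - #{|x_i| <= t} + sum(min(x_i^2, t^2))."""
    n = len(x)
    below = sum(1 for v in x if abs(v) <= t)
    return n - below + sum(min(v * v, t * t) for v in x)
```

Each entropy is computed once per TQWT subband, so with 27 subbands the four entropies yield the 108-element feature vector discussed below.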

Feature Selection and Classification.
The statistically significant features for focal signal detection have been selected using p values in previous studies [6, 11, 13-17, 19, 22, 23, 32-34]. In the mentioned studies, the features with a p value of less than 0.05 were selected as significant. Generally, a smaller p value indicates a better discriminative ability of a feature in binary classification. We deployed the KWS test to compute the p values of the entropy-based features extracted from the TQWT subbands; the "kruskalwallis" MATLAB function was used for this computation.
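For the two-group (focal vs. nonfocal) case used here, the KWS test reduces to a rank-based H statistic that follows a chi-square distribution with one degree of freedom, so the p value can be obtained from the complementary error function. The sketch below assumes untied samples; tie correction would be needed for real coefficients with repeats.

```python
import math

def kruskal_wallis_p(a, b):
    """Two-group Kruskal-Wallis test; H ~ chi-square(1) under the null."""
    data = sorted((v, g) for g, grp in enumerate((a, b)) for v in grp)
    n = len(data)
    rank_sum = [0.0, 0.0]
    for rank, (_, g) in enumerate(data, start=1):  # assumes no tied values
        rank_sum[g] += rank
    sizes = (len(a), len(b))
    h = 12.0 / (n * (n + 1)) * sum(rank_sum[g] ** 2 / sizes[g] for g in (0, 1)) - 3 * (n + 1)
    # For 1 degree of freedom, P(chi2 > h) = erfc(sqrt(h / 2))
    return math.erfc(math.sqrt(h / 2.0))
```

A feature would be retained when the p value of its focal vs. nonfocal comparison falls below 0.05.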
It is obvious from Table 1 that all of the extracted entropy-based features show good discrimination between focal and nonfocal signals, with p values less than 0.05 for all features. In other words, all of the entropy-based features can be used in the classification of focal and nonfocal EEG signals. So, these features were fed to the KNN, SVM, FFNN, CFNN, GRNN, and RNN classifiers. The resulting ACC, SEN, and SPE and the comparison of classifier performances for all of the extracted entropy-based features are given in Table 2. Although a good average classification ACC of 94.77% was achieved with the entropy-based features using the CFNN classifier, the feature vector has 108 arrays (i.e., four entropy features extracted from 27 subbands), which makes the proposed method more complex. For this reason, we tested various optimization algorithms as feature selection methods for decreasing the size of the input feature vector and thereby reducing the complexity of the proposed method. The resulting classification ACC, SEN, and SPE for classifiers with the features selected by the BBA algorithm are written in Table 3. It is clear from Table 3 that the classifier performance for the features selected by the BBA algorithm decreased in comparison to feeding all entropy-based features to the classifiers, although the feature vector has fewer arrays. So, the BBA algorithm cannot improve the performance of the proposed method. It should be noted that the features selected by the BBA algorithm were the LE entropy of details 7, 9, 11, 12, 14, 16, and 18; the LL2 entropy of details 5, 7, 11, 17, 18, 19, 20, and 23 and approximation 26; the SURE entropy of details 6, 7, 9, 14, 15, 16, 19, 20, 22, and 26; and the TH entropy of details 3, 4, 13, 14, 15, 16, 24, and 26. In the same way, the entropy-based features selected by the BDE algorithm were fed to the classifiers; the performance of the classifiers with these selected features is given in Table 4. Furthermore, the performances of the classifiers for the features selected by the FA algorithm are given in Table 5.
It can be understood that the results of the classifiers with the features selected by the FA algorithm do not differ much, in classification ACC, SEN, and SPE, from those obtained with all entropy-based features. In Table 5, the FFNN classifier resulted in 93.02% classification ACC, which is slightly higher than those of the other classifiers. On the other hand, the FA algorithm selected 52 features with the FFNN fitness function. The FA algorithm discards 56 features (i.e., the FA algorithm made the feature vector 60.48% smaller), which makes the proposed method simpler. The features selected by the FA algorithm with the FFNN fitness function, which constituted the input feature vector, were the LE entropy of details 6, 9, 11, 14, 15, 17, 18, 19, 22, 23, and 24 and approximation 27; the LL2 entropy of details 4, 6, 8, 9, 10, 11, 12, 16, 17, 24, and 26 and approximation 27; the SURE entropy of details 1, 2, 3, 4, 7, 8, 9, 11, 17, 22, 24, and 25 and approximation 27; and the TH entropy of details 1, 2, 3, 4, 5, 6, 8, 9, 11, 15, 18, 20, 22, and 24 and approximation 26. The performance of the classifiers with the features selected by GA is summarized in Table 6. The highest classification ACC for the entropy-based features selected by GA, obtained with the CFNN classifier, is 93.53%. GA with the CFNN fitness function selected 57 features, which means GA discards 51 features (i.e., GA made the feature vector 55.08% smaller). The features selected by GA with the CFNN fitness function, which constituted the input feature vector, were the LE entropy of details 6, 9, 10, 12, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24, 25, and 26 and approximation 27; the LL2 entropy of details 1, 2, 4, 8, 9, 10, 11, 17, 18, 19, 20, 23, and 25; the SURE entropy of details 1, 2, 3, 4, 5, 8, 10, 11, 15, 18, 19, and 20; and the TH entropy of details 1, 2, 4, 7, 9, 10, 12, 13, 16, 17, 21, 23, 25, and 26 and approximation 26. The classifiers' performance for the entropy-based features selected by the GWO algorithm is given in Table 7.
It is clear from Table 7 that the entropy-based features selected by the GWO algorithm with the CFNN classifier resulted in 93.45% classification ACC, which is higher than those of the other classifiers. The GWO algorithm with the CFNN fitness function selected 88 entropy-based features. In other words, the feature vector has 88 arrays, and the GWO algorithm discards 20 features (i.e., the GWO algorithm made the feature vector 21.60% smaller). The features discarded by the GWO algorithm with CFNN were the LE entropy of details 1, 2, 3, 5, 7, and 8; the LL2 entropy of details 7, 11, 12, 18, 22, and 26 and approximation 26; the SURE entropy of details 12 and 22; and the TH entropy of details 15, 20, 22, 25, and 26. The resulting classification ACC, SEN, and SPE for the features selected by the PSO algorithm are shown in Table 8. It is evident from Table 8 that the entropy-based features selected by the PSO algorithm improved the performance of the classifiers. The best performance resulted from the CFNN classifier, which achieved an average classification ACC of 97.68% for the 63 entropy-based features selected by the PSO algorithm with the CFNN fitness function. The entropy-based features selected by the PSO algorithm with the CFNN fitness function included the LE entropy of details 4, 7, 9, 10, 11, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, and 24 and approximation 26, along with LL2 entropies of further subbands. A comparison between the performances of the classifiers and feature selection methods is illustrated in Figure 10.
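A binary PSO wrapper of the kind used above can be sketched as follows: velocities are squashed through a sigmoid to give bit-activation probabilities, and `fitness(mask)` is a hypothetical callback assumed to return classifier accuracy for the selected feature subset. Parameter values are illustrative, not the paper's settings.

```python
import math
import random

def binary_pso(n_feats, fitness, n_particles=10, n_iters=30,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Binary PSO feature selection: each particle is a boolean feature mask."""
    rng = random.Random(seed)
    pos = [[rng.random() < 0.5 for _ in range(n_feats)] for _ in range(n_particles)]
    vel = [[0.0] * n_feats for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_feats):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # sigmoid transfer: velocity -> probability of selecting feature d
                pos[i][d] = rng.random() < 1.0 / (1.0 + math.exp(-vel[i][d]))
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit
```

In the wrapper setting of the paper, `fitness` would train the CFNN on the masked feature vector and return its cross-validated accuracy, so the search directly optimizes the classifier's performance.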
From Tables 2-8 and Figure 10, we can say that PSO is the most effective feature selection method, and the FFNN, CFNN, and RNN classifiers are more appropriate than the other classifiers, for the focal EEG signal detection application. We should note that, although the KWS test resulted in acceptable classification performance in comparison to most of the feature selection methods, it did not reduce the feature vector arrays and used all extracted entropy-based features. The receiver operating characteristic (ROC) curves of the classifiers for each feature selection algorithm are shown in Figure 11. It is clear from Figure 11 that the area under the curve of the PSO algorithm is significantly higher than those of the other feature selection algorithms. The computational time of the algorithm for all signals of the Bern-Barcelona EEG dataset, including preprocessing and differencing, generation of the TQWT filter bank, subband separation, and extraction of the LE, LL2, SURE, and TH entropies as features, using an i5-M480 CPU (2.67 GHz), 6 GB RAM, and MATLAB 2014a, is 215.6 seconds (i.e., 0.0287 seconds per input signal), which indicates the time-efficiency of the proposed method. The Bern-Barcelona dataset has more than 41.6 hours of EEG signals, and the CFNN classifier required only 0.17 s for the classification of an input test signal. The run time can be further reduced by using a more powerful machine or a more computationally efficient software package.

Discussion
The correct classification of focal and nonfocal EEG signals is directly linked to minimizing surgical complications for patients who are resistant to antiepileptic drugs. In the present study, we proposed a computer-based method for the correct classification of focal and nonfocal EEG signals.
The Bern-Barcelona dataset was used for the evaluation of the proposed method. Each file of the Bern-Barcelona dataset has two signals, namely, "X" and "Y," recorded from adjacent channels. In the proposed framework, the "X-Y" signal is passed through a differencing operator and decomposed using TQWT under the optimal condition of setting Q, r, and J to 3, 3, and 26, respectively. The LE, LL2, SURE, and TH entropies are extracted from the TQWT subbands as features.
The performance of several feature selection and classification methods was examined, among which the PSO algorithm and the CFNN classifier were chosen as the feature selection and classification methods. The proposed framework achieved an average classification ACC of 97.68%, SEN of 97.26%, and SPE of 98.11% in a ten-fold cross-validation strategy. Boxplots of the features selected by the PSO algorithm with the CFNN fitness function are shown in Figure 12.
We found that the proposed entropy-based features are significantly good parameters for the classification of focal and nonfocal EEG signals since their corresponding p values were less than 0.05 for all subbands, as depicted in Table 1. Furthermore, the lower mean and standard deviation values of the entropies in the focal group indicate less randomness of the focal EEG signals in comparison to the nonfocal EEG signals, as reported in previous studies [6,10,11,22,23,25,33,34]. We showed that heuristic algorithms reduce the feature vector arrays better than the KWS test, although the features selected by some heuristic algorithms resulted in lower classification ACC.
We compared the performance of the proposed method with those of the state-of-the-art methods that used the same dataset as ours; the details are presented in Table 9. In [10], delay permutation entropy with different delay lags was extracted as a discriminative feature from the EEG signals and applied to the SVM classifier, which resulted in classification ACC of 84% and 75% for 50 and 750 EEG signals, respectively. They found that the value of the delay permutation entropy for focal EEG signals is significantly higher than that of the nonfocal EEG signals. In [15], the sample entropy and the variance of the instantaneous frequencies of the intrinsic mode functions (IMFs) were computed as features and fed to the least-squares SVM (LS-SVM). The authors therein obtained 85% classification ACC for 50 EEG signals.
In [16], the average Shannon entropy, average Renyi's entropy, average approximate entropy, average sample entropy, and average phase entropy of the IMFs were computed as features, which resulted in 87% classification ACC for 50 EEG signals.
They computed the p values of the extracted features and found that the entropies of the IMFs can be used as useful parameters for focal EEG signal detection. In similar research [17], the nonlinearity of the EEG signals was measured by the centered correntropy, information potential, and LE and SURE entropies of the IMFs as features and applied to the SVM classifier, which resulted in 89% classification ACC for 50 EEG signals. Also, the calculations of their method were very intense. In [18], the spectrum of the EEG signals was obtained by the S-transform and then the time-frequency entropy of the spectrum was used as a feature, which resulted in 86% classification ACC for 50 EEG signals. Although the reported classification ACC was not significantly higher, only one feature was used. Besides that, the authors therein found that the spectrogram of EEG signals can be used as a significant parameter for focal EEG signal detection. In [19], a Daubechies wavelet of order 4 decomposed the EEG signals to four levels, and entropy-based features were extracted from the DWT coefficients of the EEG signals for detecting focal EEG signals, which resulted in 84% classification ACC for 50 EEG signals. Although the reported classification ACC of their method was not significantly higher and the calculations for extracting the features were heavy, the authors therein proposed a novel integrated discrimination index for focal and nonfocal EEG signal classification based on these features. In [20], the EEG signals were decomposed into coefficients, and nine linear features were extracted from the coefficients and fed to the SVM classifier, which achieved 83.7% classification ACC for 3750 EEG signals.
The authors therein tested the performance of the extracted features for fifty-four mother wavelets from seven families, namely, Haar, Daubechies, Meyer, Coiflets, Biorthogonal, Reverse biorthogonal, and Symlets, and found that the performance of the classifiers does not depend on the choice of wavelet family but rather on the decomposition levels.
In [6], EMD was used for the extraction of the first two IMFs of the EEG signals; then, DWT decomposed the IMFs into four levels. In other words, the EEG signals were decomposed by EMD-DWT into ten subbands (i.e., one approximation and four details for each of IMF1 and IMF2). Then, the Shannon, Renyi, and LE entropies of the subbands were calculated as features and fed to the KNN classifier, which obtained classification ACC of 90.5% and 89.40% for 50 and 3750 EEG signals, respectively. They reported these results by considering 20% of the EEG signals as training data and the other 80% as testing data in the classification process. They found that the first IMFs (i.e., the highest frequencies of the EEG signals) and the LE entropy are more significant in focal and nonfocal EEG signal classification. In [31], the same authors expanded the features into the VMD-DWT domain and reported 95.2% classification ACC using the SVM classifier for 3750 EEG signals. In [14, 22-24], the EEG signals were separated into rhythms by the Fourier transform [14] and the empirical wavelet transform [22-24]. In [14], the mean frequency and root mean square of the EEG signals were derived as features and fed to the LS-SVM classifier, which obtained 89.7% and 89.5% classification ACC for 50 and 750 EEG signals, respectively. The authors therein found that frequency-based features extracted from rhythms are usable in focal detection.
In [24], the nonlinearity of the EEG signal rhythms in the EWT domain was computed as a feature and fed to a classifier, for which classification ACC of 93% and 82.6% were reported for 50 and 3750 EEG signals, respectively. In [27], a method based on the combination of the EEG rhythms in the FBSE-EWT domain and a sparse autoencoder-support vector machine was proposed for the detection of focal EEG signals, which resulted in a perfect classification ACC of 100%. In another method based on deep learning [21], the time-frequency matrix of the EEG signal was computed in the Fourier synchrosqueezing transform (FSST) and wavelet synchrosqueezing transform (WSST) domains and fed to a deep convolutional neural network (CNN) for the classification.
This method achieved a classification ACC of 99%. Recently, a method based on a Taylor-Fourier filter bank implemented with O-Splines has been proposed to extract the EEG rhythms [5]. In [29], a method based on the decomposition of the EEG signal by sliding mode-singular spectrum analysis (SM-SSA) with a sparse autoencoder hidden layer and radial basis function neural network (SAE-RBFN) classifier was proposed, which resulted in a maximum classification ACC of 99.11%. In [28], the mixture correntropy and exponential energy of the subbands in the FBSE-fTF-cwt domain were extracted as features and fed to an LS-SVM classifier; the authors therein reported an average classification ACC of 95.85%.
In [22,23], two-dimensional (2D) rhythms were drawn using phase space reconstruction [22] and the second-order difference plot [23], respectively, in which the 2D illustration of the EEG signals of the focal group in both methods had a more regular shape than that of the nonfocal group, which can be attributed to the synchronous response of the neighboring neurons of the epileptogenic area that gives rise to focal EEG signals. This concept is reported in most of the previous research with other methods [6,10,11,22,23,25,33,34] as well. In the present study, the mean and standard deviation of the entropies for the focal group are less than those for the nonfocal group, which indicates that the behavior of focal EEG signals is less random and more rhythmic in comparison to nonfocal EEG signals, as described in Section 3.4.
In [33], the multiscale entropy of the TQWT coefficients was extracted as features from the EEG signals and fed to the LS-SVM classifier, which resulted in 84.67% classification ACC for 3750 EEG signals. Furthermore, in [34], the K-nearest neighbor entropy estimator, centered correntropy, and fuzzy entropy of the TQWT coefficients were extracted as features from the EEG signals and fed to the LS-SVM classifier, which achieved 95% classification ACC. In [33,34], TQWT decomposed the focal and nonfocal EEG signals to 16 levels (i.e., J = 16), which resulted in 17 subbands. In our study, the EEG signals are decomposed to 26 levels, which results in 27 subbands, more than in the works in [33,34]; however, in our study, the time required for feature extraction from all subbands is significantly less than in [33,34], because they used the multiscale entropy [33], the K-nearest neighbor entropy estimator, centered correntropy [3,24], and fuzzy entropy [34], which involve heavy calculations, while we deployed the LE, LL2, SURE, and TH entropies, which have very simple calculations, making our system time-efficient.
It is clear that the proposed method achieves better classification ACC than most of the studies [6, 10-20, 22-26, 28, 30-34] in the literature, but its classification ACC is lower than those of [21,27,29]. The proposed method in [21] needs two time-frequency analyses and a deep learning technique, whereby its computational cost is higher than that of our method. On the other hand, the proposed frameworks in [27,29] achieve better classification ACC with either fewer calculations [29] or fewer features, which shows the superiority of those methods over the proposed framework. The advantages of the proposed framework over previous studies are listed as follows: (1) This is the first study to compare the performances of several feature selection algorithms and classifiers for focal and nonfocal EEG signal classification. (2) We evaluated the performance of the proposed framework using the entire Bern-Barcelona dataset (i.e., 3750 focal and 3750 nonfocal signals), while in [13, 15-19, 23] just 100 EEG signals (i.e., 50 focal and 50 nonfocal) were used for evaluation. (3) We used only four entropies for the classification of focal and nonfocal EEG signals, but in [16] six, in [19] seven, in [20] nine, in [11] fifty-two, and in [12] twenty-one various types of features were computed for classification, which makes those studies computationally expensive. (4) Our results are reported in a ten-fold cross-validation strategy to ensure reliable classification performance. (5) The proposed framework requires less than 0.2 s for classifying any input EEG signal into the focal or nonfocal group, which makes our method time-efficient. (6) In [19], results were reported with a 10% standard deviation, while in our study, the results have around a 3% standard deviation, which indicates the robustness of our proposed method.
(7) The proposed method can be deployed widely before surgery in hospitals as it is cost-effective and easily implementable with a computer and an EEG acquisition system. (8) The proposed method can detect focal EEG signals accurately without human intervention and errors.
Although the proposed method has the above-mentioned merits over the existing methods, the main limitation of the present study is that the Bern-Barcelona dataset contains EEG signals from only five patients. In the future, the study will be extended using datasets involving a larger number of patients.

Conclusion
Physicians can localize the epileptogenic brain area before surgery by visual inspection of EEG signals. However, the correct classification of focal and nonfocal EEG signals by physicians over long recordings is very hectic and time-consuming and may be prone to human error.
Thus, a computer-based system for distinguishing focal and nonfocal EEG signals with significant ACC is desirable. In the present study, we proposed a method based on entropy-based features extracted from TQWT subbands. Several feature selection methods and machine learning and neural network classifiers were evaluated for the discrimination of focal and nonfocal EEG signals on the Bern-Barcelona dataset, which contains more than 41.6 hours of EEG data. The proposed TQWT-based method, with the entropy-based features selected by the PSO method and classified by the CFNN, resulted in a classification ACC of 97.68%, which is higher than those of the previous methods. In the future, we recommend evaluating the proposed method on other biomedical signals to detect abnormal behaviors.

Data Availability
No data were used to support this study.

Disclosure
Muhammad Tariq Sadiq and Hesam Akbari are co-first authors.
Journal of Healthcare Engineering 21