Binary Political Optimizer for Feature Selection Using Gene Expression Data

DNA Microarray technology is an emerging field that offers the possibility of simultaneously estimating the expression levels of several thousand genes in an organism in a single experiment. One of the most significant challenges in this research field is to select highly relevant genes from gene expression data. To address this problem, feature selection is a well-known technique for eliminating unnecessary genes in order to ensure accurate classification results. This paper proposes a binary version of the Political Optimizer (PO) to solve the feature selection problem using gene expression data. Two transfer functions are used to design the binary PO: the first is based on the Sigmoid function and is denoted BPO-S, while the second is based on a V-shaped function and is denoted BPO-V. The proposed methods are evaluated on 9 biological datasets and compared with 8 well-known binary metaheuristics. The comparative results show the superior performance of the BPO methods, especially BPO-V, in comparison with the other techniques.


Introduction
Molecular biology research evolves through the development of the technologies used to carry it out. It is not possible to investigate a countless number of genes using conventional strategies. DNA Microarray is a technology that allows researchers to investigate and treat problems that were once considered intractable. The expression of many genes can be examined in a single reaction, rapidly and efficiently. DNA Microarray technology is enabling the scientific community to understand the fundamental aspects underlying the growth and development of life, as well as to investigate the hereditary causes of irregularities in the working of the human body.
Therefore, microarray technology remains to this day a useful asset for measuring gene expression. Beyond the technology itself, the analysis of microarray data is a complex statistical problem, owing to the large number of genes and the complexity of biological networks, which increase the challenge of understanding and interpreting the resulting mass of data, often consisting of millions of measurements. Hence, extracting relevant biological knowledge from microarray data becomes a hard task due to the curse of dimensionality [1].
Generally, gene expression data are redundant and noisy, with a large number of genes. In order to reduce the dimensionality of such datasets by selecting the most informative features, Feature Selection (FS) is an essential preprocessing phase before the application of machine learning classifiers, minimizing training times and memory requirements [2].
Feature selection methods are classified into three categories based on the evaluation criteria used: filter, wrapper, and embedded [3]. This categorization depends on the involvement of a learning algorithm in the approach. Filter methods (Chi-Square [4], Information Gain [5], Gain Ratio [6], and ReliefF [7]) select a subset of variables by preprocessing the data, independently of any model. The selection process is thus independent of the classification process. One of their advantages is that they are completely independent of the model we are trying to build: they propose a satisfactory subset of variables to explain the structure of the hidden data, and that subset is independent of the chosen learning algorithm. In contrast, wrapper methods aim to generate representative subsets and evaluate them using a classification algorithm.
This evaluation is carried out by computing a score; for example, the score of a subset can be a compromise between the number of variables eliminated and the classification success rate on a test set. Therefore, wrapper methods are more exact than filter approaches since they consider the relations among the features. Another advantage is their conceptual simplicity: we do not need to understand how induction is affected by the selection of variables; we just generate and test. Nevertheless, the computational cost is significantly higher and depends on the learning algorithm used [8]. Finally, embedded methods integrate selection directly into the learning process; decision trees are the most emblematic illustration. More generally, this group contains all techniques that evaluate the importance of a variable in coherence with the criterion used to evaluate the overall relevance of the model. They are generally known for their reasonable trade-off between efficiency and computing cost [9,10].
FS is regarded as an NP-complete combinatorial optimization problem [11]. The search space size grows strongly with the number of features in the studied dataset: an exhaustive search for the optimal feature subset is impractical, and greedy searches often stagnate in local optima [12]. Therefore, metaheuristic methods are potentially more suitable for this problem because of their ability to find acceptable solutions in reasonable periods of time [13]. The objective function may be the classification accuracy or another criterion that considers the best compromise between the computational burden of attribute extraction and efficiency [14]. Metaheuristics are stochastic approaches and fall into two categories: population-based approaches and single-solution approaches [14,15]. Generally, they are inspired by nature, social behavior, the biological behavior of animals, birds, or insects, physical or chemical phenomena, etc.
Generally, these traditional methods suffer from slow convergence rates and have a large number of parameters to be tuned. Hence, a simple and efficient global search technique is needed. For that purpose, in this work we use the Political Optimizer (PO) [31] as the main resolution technique, since it is a newly introduced metaheuristic based on human behavior. Moreover, as mentioned in [31], PO produces better solutions for optimization problems than other well-known metaheuristics in the literature. In this paper, a novel binary version is proposed to find the most representative subset of a given dataset. The binary version introduced here is realized using two different transfer functions. The structure of this paper is as follows: the standard (continuous) version of the Political Optimizer (PO) is presented in Section 2. In Section 3, we introduce the binary version of the latter algorithm, called BPO. The obtained results and the conducted comparisons are reported in Section 4. Finally, the conclusion and several directions for future work are stated in Section 5.

Overview of the Political Optimizer (PO)
Political Optimizer is a newly proposed metaheuristic based on human behavior and inspired by the multiphased political process. Although it is not the first algorithm of this kind, PO maps the concept of politics from a different perspective than recent politics-inspired algorithms, for four reasons. First, PO tries to model all the important steps in politics, such as party formation, party-ticket/constituency allocation, election campaign and party switching, interparty election, and parliamentary affairs after government formation. Second, PO introduces a novel position updating strategy called the recent past-based position updating strategy (RPPUS).
The latter represents the learning behavior of politicians from the previous election. Third, each individual solution assumes a double role: party member and election candidate. Using this concept, each solution can be updated according to two better solutions: the party leader and the constituency winner. Finally, to improve the results, intermediary solutions need to cooperate and communicate via a phase named parliamentary affairs.
In PO, each party member is viewed as a candidate solution whose goodwill corresponds to its position in the search space. Moreover, the evaluation function is computed during the election phase, where the number of votes obtained by each party member represents the fitness of the candidate solution.
Political Optimizer (PO) comprises five main phases: party formation and constituency allocation, election campaign, party switching, interparty election, and parliamentary affairs. It should be mentioned that the first phase (party formation and constituency allocation) is executed only once, to initialize the different variables. The remaining phases are run in a loop, as detailed in Algorithm 1. The variables used in PO are summarized in Table 1.

Party Formation and Constituency Allocation.
In the beginning, the population P is partitioned into n parties, where each party P_i includes n members (potential solutions). Each jth member is denoted P_i^j and represented by a d-dimensional vector, where d is the number of input variables of the treated problem and P_{i,k}^j is the kth dimension of P_i^j. As mentioned before, each member is also considered an election candidate besides its role as a party member. Hence, n constituencies are formed, each containing the jth member of every contesting party.
This division is illustrated in Figure 1. Furthermore, after computing the fitness of all members, the leader of the ith party is denoted P*_i, and the set of all party leaders is represented by P*. Similarly, after the election, C* regroups the winners of all constituencies, named the parliamentarians, where C*_j denotes the winner of the jth constituency.
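The party/constituency bookkeeping described above can be sketched as follows (a minimal Python illustration; the function names and initialization bounds are ours, not from the paper):

```python
import random

def form_parties(n, dim, lo=-1.0, hi=1.0):
    """Initialize an n x n population: n parties, each with n members.

    Returns a nested list where parties[i][j] is the jth member of
    party P_i, a d-dimensional position vector (here d = dim)."""
    return [[[random.uniform(lo, hi) for _ in range(dim)]
             for _ in range(n)]
            for _ in range(n)]

def constituency(parties, j):
    """The jth constituency groups the jth member of every party."""
    return [party[j] for party in parties]

parties = form_parties(n=4, dim=3)
assert len(constituency(parties, 0)) == 4  # one candidate per party
```

Each member thus appears exactly once in one party and once in one constituency, which is the dual role exploited by the later phases.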

Election Campaign.
During this phase, party members try to enhance their chances of being elected by changing their positions according to three aspects. First, they try to learn from previous experience using a novel position updating strategy called the recent past-based position updating strategy (RPPUS), as formulated in equations (1) and (2). Second, each party member tries to update its current position according to the party leader. Finally, candidate positions are updated with reference to the constituency winner.

Algorithm 1: Pseudocode of the Political Optimizer (PO).
Input: n (number of constituencies, political parties, and party members), λ_max (upper limit of the party switching rate), T_max (total number of iterations)
Output: final population P(T_max)
/* Initialization */
initialize (n × n) candidate members P
compute the fitness of each member P_i^j
compute the set of party leaders P* and the set of constituency winners C*, using equation (3)
t ← 1
while t ≤ T_max do
    run the election campaign phase (equations (1) and (2))
    PartySwitching(P, λ)
    /* Election phase */
    compute the fitness of each member P_i^j
    compute the set of party leaders P* and the set of constituency winners C*, using equation (3)
    ParliamentaryAffairs(C*, P)
    t ← t + 1
end while

Figure 1: Illustration of the logical division of the population P into political parties and constituencies [31].
According to Algorithm 2, which describes the whole process of the election campaign, the relationship between the current fitness and the previous fitness is the main factor in choosing between equation (1) and equation (2).
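Since equations (1) and (2) are not reproduced here, the following Python sketch only illustrates the general shape of the update: a member moves with reference to its party leader and constituency winner, and the branch taken depends on whether its fitness improved since the previous election. The coefficients below are simplified placeholders, not the exact RPPUS of [31]:

```python
import random

def campaign_step(member, leader, winner, improved):
    """Illustrative election-campaign update (not the exact RPPUS).

    'improved' flags whether the member's fitness got better since the
    previous election, the criterion Algorithm 2 uses to choose between
    equations (1) and (2)."""
    new = []
    for x, l, w in zip(member, leader, winner):
        r = random.random()
        if improved:
            # keep moving toward both guiding solutions (placeholder for eq. (1))
            new.append(x + r * (l - x) + r * (w - x))
        else:
            # fall back toward the midpoint of the guides (placeholder for eq. (2))
            mid = (l + w) / 2
            new.append(mid + r * (mid - x))
    return new
```

The two guides (party leader and constituency winner) are exactly the two "better solutions" available to each member thanks to its dual role.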

Party Switching.
In order to balance exploration and exploitation, a phase called party switching is performed after the election campaign phase. Using an adaptive parameter λ, named the party switching rate, each party member P_i^j can be selected and switched to a randomly chosen party P_r, where it is swapped with the least fit member of P_r, as presented in Algorithm 3.
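The switching rule above can be sketched as follows (a minimal Python illustration assuming a minimization problem; in [31] λ is adapted over the iterations, while here it is simply passed in):

```python
import random

def party_switching(parties, fitness, lam):
    """Party switching: with probability lam, each member is swapped
    with the least fit member of another, randomly chosen party.

    'fitness' maps a member to its cost (lower is better); the names
    are illustrative."""
    n = len(parties)
    for i in range(n):
        for j in range(n):
            if random.random() < lam:
                r = random.choice([p for p in range(n) if p != i])
                # index of the least fit (highest-cost) member of party r
                q = max(range(n), key=lambda k: fitness(parties[r][k]))
                parties[i][j], parties[r][q] = parties[r][q], parties[i][j]
```

Because members are swapped rather than copied, the population size and the n-members-per-party structure are preserved.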

Election.
This phase aims to evaluate the fitness of all candidates contesting in each constituency. After that, the party leaders and constituency winners are updated as follows (assuming a minimization problem with fitness function f):

P*_i = P_i^q, where q = argmin_{1 ≤ j ≤ n} f(P_i^j),
C*_j = P_q^j, where q = argmin_{1 ≤ i ≤ n} f(P_i^j).   (3)
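Equation (3) amounts to taking, per party and per constituency, the member with the best fitness. A minimal sketch, assuming a minimization problem:

```python
def election(parties, fitness):
    """Compute the party leaders P* and constituency winners C* of
    equation (3); 'fitness' is the cost function (lower is better)."""
    n = len(parties)
    leaders = [min(parties[i], key=fitness) for i in range(n)]       # P*_i
    winners = [min((parties[i][j] for i in range(n)), key=fitness)   # C*_j
               for j in range(n)]
    return leaders, winners
```

Note that the same member can simultaneously be a party leader and a constituency winner.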

Parliamentary Affairs.
After determining the party leaders and constituency winners (the parliamentarians), each parliamentarian aims to improve its performance, mimicking the interaction and cooperation of the winning candidates to run the government in the postelection phase. This process is presented in Algorithm 4, where each parliamentarian C*_j updates its position in relation to a randomly chosen parliamentarian C*_r. It should be noted that the movement is applied only if the performance of C*_j is enhanced.
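This greedy, improvement-only move can be sketched as follows (a minimal Python illustration assuming minimization; the exact step size rule in Algorithm 4 of [31] may differ):

```python
import random

def parliamentary_affairs(winners, fitness):
    """Each parliamentarian moves toward a randomly chosen peer and
    keeps the new position only if its fitness improves (lower cost)."""
    n = len(winners)
    for j in range(n):
        r = random.choice([k for k in range(n) if k != j])
        a = random.random()  # random step toward the chosen peer
        trial = [w + a * (p - w) for w, p in zip(winners[j], winners[r])]
        if fitness(trial) < fitness(winners[j]):
            winners[j] = trial  # accept only improving moves
```

Rejecting non-improving moves guarantees that the set of parliamentarians never degrades during this phase.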

Binary Political Optimizer (BPO)
As mentioned before, a political member's goodwill is considered as a candidate position that moves through continuous-valued positions in the search space. However, in binary optimization problems such as feature selection, the search space is modelled as an n-dimensional Boolean lattice, and the political member's goodwill needs to be represented by a binary vector.
In order to convert a continuous algorithm into a binary version, transfer functions (TFs) are considered the most efficient and convenient tool [32]. Transfer functions are classified into two categories according to their shapes: S-shaped and V-shaped, as illustrated in Figure 2.
In this work, two versions are proposed, depending on the transfer function used. In the first one, called BPO-S, the political member's goodwill is updated using the Sigmoid (S-shaped) function, while the second one, called BPO-V, uses the Hyperbolic Tangent (V-shaped) transfer function.
Without any modification to the previously detailed phases, only two steps are added after the continuous computation. The first step is to calculate the probability of changing a position's element to 0 or 1 according to the following equation:

p_i^d(t) = TF(x_i^d(t))   (4)

where TF is the transfer function used, which can be the Sigmoid function

S(x) = 1 / (1 + e^(−x))   (5)

or the Hyperbolic Tangent function

V(x) = |tanh(x)|   (6)

and x_i^d(t) is the dth dimension of the ith political member at iteration t. In the second step, the probability computed by equation (4) is inserted into equation (7) in order to convert the continuous value of each member position to 0 or 1:

x_i^d(t + 1) = 1 if rand < p_i^d(t), and 0 otherwise   (7)

where rand is a uniform random number between 0 and 1. The flowchart of the proposed binary algorithm is presented in Figure 3.
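The two binarization steps can be sketched as follows (Python; the function names are ours):

```python
import math
import random

def sigmoid_tf(x):
    """S-shaped transfer function (equation (5))."""
    return 1.0 / (1.0 + math.exp(-x))

def vshaped_tf(x):
    """V-shaped transfer function based on |tanh| (equation (6))."""
    return abs(math.tanh(x))

def binarize(position, tf):
    """Equations (4) and (7): bit d becomes 1 when a uniform random
    draw falls below TF(x_d), and 0 otherwise."""
    return [1 if random.random() < tf(x) else 0 for x in position]

# BPO-S uses sigmoid_tf, BPO-V uses vshaped_tf:
bits = binarize([2.5, -2.5, 0.0], sigmoid_tf)
```

In the feature selection context, a 1 in dimension d simply means "feature d is selected".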

Binary Political Optimizer Applied to Feature Selection.
In this section, we exploit the proposed BPO for feature selection in classification problems. As mentioned before, feature selection is an NP-hard combinatorial binary optimization problem. For a feature vector of size N, there are 2^N possible feature combinations, so the number of candidate solutions grows exponentially and an exhaustive search is not practical. Therefore, we use the proposed BPO in order to find an acceptable solution within a reasonable execution time. The main objective is to maximize the classification accuracy while minimizing the number of selected features. The fitness function used is presented in the following equation [33]:

Fitness = ω · (1 − Acc) + (1 − ω) · (sf / nf)   (8)

where Acc is the classification accuracy given by a chosen classifier, ω is a weight factor between 0 and 1, sf is the length of the selected feature subset, and nf is the total number of features. In this study, we set ω to 0.5 for all the experiments in the next section. For the classifier, we chose k-Nearest Neighbor (k-NN) to compute the accuracy of the selected subset. Moreover, to ensure the robustness of the obtained results, every dataset is divided randomly into training and testing sets according to the 10-fold cross-validation method.
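A minimal sketch of this fitness computation, assuming the minimization form above with the classification accuracy supplied externally (e.g., by a k-NN classifier under cross-validation):

```python
def fitness(acc, sf, nf, w=0.5):
    """Wrapper fitness of equation (8), to be minimized: balances the
    classification error (1 - acc) against the fraction of selected
    features sf / nf, weighted by w."""
    return w * (1.0 - acc) + (1.0 - w) * (sf / nf)

# with equal accuracy, the smaller feature subset wins (lower fitness)
assert fitness(0.95, 10, 100) < fitness(0.95, 50, 100)
```

With ω = 0.5, a one-percentage-point gain in accuracy is worth the same as dropping one percent of the features, which matches the dual objective stated above.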

Experimental Results
In this section, all experiments were repeated 100 independent times to obtain statistically meaningful results. Each algorithm was implemented in MATLAB R2020a and run on an Intel Core i7 machine with a 2.6 GHz CPU and 16 GB of RAM.

Results and Discussion.
In this section, we first evaluate statistically the performance of the two proposed versions of BPO compared to the other algorithms. Four statistical measures are used in this first step of the evaluation: the worst fitness value, the best fitness value, the mean fitness value (avg), and the standard deviation (std). Table 4 outlines the results obtained using these measures, where the best ones are highlighted in bold. From the table, we observe the superiority of the proposed algorithms, especially BPO-V, compared to the other binary versions of well-known algorithms. However, BPO-V and BPO-S can be described as unstable methods in most cases. This can be explained by the complexity of the position update strategy adopted by PO. Furthermore, it can be observed that BASO is the most competitive algorithm with respect to the two versions of BPO. From these findings, it can be concluded that BPO-V is better than BPO-S, BGA, BGWO, BBA, BHHO, BDE, BASO, BPSO, and BTGA at extracting the most relevant features of the tested datasets, with the aim of maximizing the classification performance while minimizing the number of selected features.
This deduction was confirmed by applying a Wilcoxon Signed-Rank Test to the proposed algorithms, compared in pairs with the other algorithms. This test is performed with a statistical significance level α = 0.05. In Tables 5 and 6, the sign "+" in the winner lines indicates that the null hypothesis is rejected and that the proposed algorithm (BPO-S or BPO-V) statistically outperforms the paired one at the 95% significance level (α = 0.05). In case of inferiority, the sign "−" is used. From these tables, we can reaffirm the superiority of BPO-S and BPO-V. Moreover, as mentioned before, the BASO algorithm is the most competitive.
In the second step, to confirm this superiority, BPO-S and BPO-V are evaluated in terms of accuracy and the average number of selected features. From Table 7, it can be concluded that BPO-S and BPO-V clearly outperform the other algorithms regarding the number of selected features. Figure 4 is drawn to better visualize the obtained results. Once more, BASO shows the most competitive behavior. In addition, Table 8 outlines the comparative results in terms of accuracy, where it can be seen that BPO-V is the best algorithm. Therefore, the proposed algorithms strongly reduce the number of selected features without losing the information needed to deal with the problem treated by the dataset.
At the end of this evaluation, we compare BPO-V and BPO-S in terms of execution time and convergence. Regarding convergence speed and the best fitness score obtained, Figure 5 shows that BPO-V also excels on this point: it generally reaches its optimum solution after 20 iterations. In contrast, despite the good fitness scores of BPO-S, this algorithm reaches its best performance late, generally after 50 iterations. Concerning execution time, BPO-V and BPO-S show poor results according to Table 9. This can be explained by the complexity of the algorithm proposed in [31], with its large number of functions to execute and conditions to verify.

Conclusions
In this paper, we proposed two binary versions of the PO algorithm and applied them to the feature selection problem on gene expression data. To assess the robustness of our work, we used 9 standard datasets characterized by their huge dimensionality. The obtained results are compared to 8 binary versions of well-known metaheuristics. Experimental results prove the excellent performance of the proposed algorithms. The results are evaluated using different indicators assessing convergence, reduction size, accuracy, performance (fitness score), and runtime. In future work, BPO could be hybridized with other metaheuristic algorithms, and another classifier, such as SVM, could be used instead of k-NN.

Data Availability
The data used to support the findings of the study are available at http://featureselection.asu.edu/datasets.php.

Conflicts of Interest
The authors declare that they have no conflicts of interest.