Feature Selection Based on a Large-Scale Many-Objective Evolutionary Algorithm

The feature selection problem is a fundamental issue in many research fields. In this paper, feature selection is regarded as an optimization problem and addressed with a large-scale many-objective evolutionary algorithm. Considering the number of selected features, accuracy, relevance, redundancy, interclass distance, and intraclass distance, a large-scale many-objective feature selection model is constructed. Such a large-scale many-objective feature selection problem is difficult to optimize with traditional evolutionary algorithms. Therefore, this paper proposes a modified vector angle-based large-scale many-objective evolutionary algorithm (MALSMEA). The proposed algorithm uses polynomial mutation based on variable grouping instead of naive polynomial mutation to improve the efficiency of solving large-scale problems, and a novel worst-case solution replacement strategy based on shift-based density estimation replaces the worse of two individuals with similar search directions to enhance convergence. The experimental results show that MALSMEA is competitive and can effectively optimize the proposed model.


Introduction
Feature selection involves selecting a specific number of features from the existing features to optimize specific objectives [1]. It can be regarded as a multiobjective optimization problem that can be solved using evolutionary algorithms, and it has attracted wide attention from scholars and has been applied in gene expression analysis [2], face recognition [3], and drug discovery [4]. For example, the two-stage heuristic minimal-redundancy-maximal-relevance (mRMR) algorithm [5] optimizes relevance and redundancy simultaneously. A filter-based algorithm [6] considers an entropy-based correlation measure together with a combined measure of the redundancy and cardinality of the selected subset. A decomposition algorithm based on a weighting method is utilized to optimize interclass and intraclass distances [7].
Gulsah et al. [8] proposed two algorithms, W-QEISS and F-QEISS, that use nondominated sorting based on classification accuracy, feature number, relevance, and redundancy. Li et al. [9] established a model with feature number, classification performance, interclass distance, and intraclass distance as objectives and proposed a decomposition-based large-scale algorithm (DMEA-FS).
However, some unsolved problems still exist in feature selection using traditional evolutionary algorithms. The first problem is that selection from a large number of features can be regarded as a large-scale optimization problem [1] or a large-scale multiobjective optimization problem (LSMOP) [10], and traditional evolutionary algorithms cannot solve such problems effectively. The second problem is that the feature number and accuracy are two basic objectives, and other objectives are needed to exploit the potential information that guides the evolution in feature selection [1]. Correspondingly, more objectives result in many-objective optimization problems (MaOPs) [11, 12].
There are three main types of current algorithms, which are mainly used to solve LSMOPs or MaOPs, but they perform poorly on large-scale many-objective problems (LSMaOPs) [13], which involve more than 3 objectives and over 100 decision variables [14, 15]. The first kind of algorithm is based on Pareto dominance and improves the convergence pressure by modifying the Pareto dominance relation. The new dominance relations include ε-dominance [16], θ-dominance [17], L-optimality [18], simplex dominance [19], and grid dominance [20, 21]. The algorithm using shift-based density estimation (SDE) was proposed in [22]; SDE allows individuals with poor convergence to obtain higher density.
The algorithm based on the nondominated sorting approach (NSGA-III) [31] uses evenly distributed reference points to assist the environmental selection. Based on NSGA-III, Gu and Wang [10] introduced an information feedback model to solve LSMaOPs. The reference vector-guided evolutionary algorithm (RVEA) [32] uses reference vectors to guide the optimization.
To describe the large-scale feature selection problem more comprehensively and solve it more effectively, this paper studies the existing multiobjective models based on evolutionary algorithms, combines the existing objectives, constructs the feature selection problem as an LSMaOP, and uses an improved large-scale many-objective evolutionary algorithm (LSMaOEA) for optimization. The main contributions of this paper are summarized as follows: (1) A novel worst-case solution replacement strategy based on SDE is proposed. This strategy allows conditional replacement of solutions that are poor in terms of convergence and diversity compared with other solutions, thereby maintaining a balance between convergence and diversity.
(2) A modified vector angle-based large-scale many-objective evolutionary algorithm (MALSMEA) is proposed, which uses variable grouping-based polynomial mutation instead of naive polynomial mutation to improve the efficiency of solving large-scale problems. In the environmental selection process, the proposed worst-case solution replacement strategy is used to improve diversity. (3) A large-scale many-objective feature selection optimization model is constructed, and MALSMEA is used to optimize it. The optimization objectives of this model are the number of selected features, accuracy, relevance, redundancy, interclass distance, and intraclass distance. The remainder of this paper is arranged as follows. Section 2 introduces the related works. Section 3 describes the proposed model and MALSMEA in detail. In Section 4, we compare and analyze the experimental results of MALSMEA and four advanced algorithms in solving benchmark LSMaOPs, as well as the performance of MALSMEA and three feature selection algorithms in optimizing the proposed feature selection model. Section 5 provides a summary of the full paper and prospects for future research.

Large-Scale Many-Objective Optimization
Problem. An LSMaOP can be described as

min F(x) = (f_1(x), f_2(x), …, f_m(x)), subject to x ∈ Ω = [l_1, u_1] × [l_2, u_2] × … × [l_D, u_D],

where D is the number of decision variables (D ≥ 100), and l_i and u_i are the lower and upper bounds of the decision variable in the ith dimension, respectively. x is the D-dimensional decision vector in Ω, m is the number of objectives (m > 3), and F(x) ∈ R^m is the objective vector of x. If no other solution dominates x, then x is a Pareto optimal solution [33]. The objective vectors corresponding to all Pareto optimal solutions constitute the Pareto optimal front (PF) [34, 35].
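Since Pareto dominance underpins the rest of the section, a minimal sketch of the dominance test for minimization may help (the helper names `dominates` and `pareto_front` are illustrative, not from the paper):

```python
import numpy as np

def dominates(f_a, f_b):
    """True if objective vector f_a Pareto-dominates f_b (all objectives minimized):
    f_a is no worse on every objective and strictly better on at least one."""
    f_a, f_b = np.asarray(f_a), np.asarray(f_b)
    return bool(np.all(f_a <= f_b) and np.any(f_a < f_b))

def pareto_front(F):
    """Indices of the non-dominated rows of an (N, m) objective matrix."""
    return [i for i in range(len(F))
            if not any(dominates(F[j], F[i]) for j in range(len(F)) if j != i)]
```

The objective vectors of the solutions returned by `pareto_front` form the approximation of the PF.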

Shift-Based Density Estimation.
We use SDE [22] with the kth nearest neighbor [36] to estimate the density of all individuals. For an individual x_i, the following method is used to calculate the density value SDE(x_i).
(i) First, every other individual x_j (j ≠ i) is shifted so that it is no better than x_i on any objective: f'_k(x_j) = max(f_k(x_j), f_k(x_i)), k = 1, …, m.
(ii) Then, the shifted distance between x_i and x_j is computed as dist(x_i, x_j) = ||F(x_i) − F'(x_j)||.
(iii) Next, the distances from x_i to all shifted individuals are sorted in ascending order, and the distance dist_k to the kth nearest neighbor is selected, where k = √N and N is the size of the population.
(iv) Finally, SDE(x_i) is calculated as follows: SDE(x_i) = 1/(dist_k + 2).
Through the above process of estimating the individual density, we can observe that the smaller the density of an individual is, the better its performance. Therefore, this paper uses this strategy, which considers both diversity and convergence, to judge a pair of individuals with similar search directions and delete the one with poor performance.
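The SDE computation above can be sketched as a minimal NumPy implementation (the function name `sde_density` and the SPEA2-style default k = √N are assumptions for illustration):

```python
import numpy as np

def sde_density(F, k=None):
    """Shift-based density estimation with the k-th nearest neighbour.
    F: (N, m) objective matrix (minimization). Smaller SDE = better individual."""
    N = len(F)
    k = k or int(np.sqrt(N))                  # SPEA2-style choice of k
    dens = np.empty(N)
    for i in range(N):
        # shift every other individual so it is no better than x_i on any objective;
        # the shifted distance reduces to sqrt(sum(max(0, f(x_j) - f(x_i))^2))
        diff = np.maximum(0.0, F - F[i])      # (N, m); row i becomes all zeros
        d = np.sqrt((diff ** 2).sum(axis=1))  # shifted Euclidean distances
        d = np.delete(d, i)                   # exclude x_i itself
        dens[i] = 1.0 / (np.sort(d)[min(k - 1, N - 2)] + 2.0)
    return dens
```

A dominated individual keeps its full distance to better individuals shrunk to zero, so it receives a large density value and is judged worse, exactly the behaviour described above.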

Information Theory Criterion Based on Entropy.
The feature selection model uses an entropy-based information theory criterion [8] to measure relevance and redundancy. For a given discrete random variable A, its entropy E(A) is determined as follows:

E(A) = −Σ_{a∈A} p(a) log p(a),

where p(a) = Pr(A = a) and A is the set of all possible values of A. Then, the joint entropy of A and B is determined as follows:

E(A, B) = −Σ_{a∈A} Σ_{b∈B} p(a, b) log p(a, b),

where p(a, b) = Pr(A = a, B = b). Then, the mutual information between A and B is determined as follows:

I(A; B) = E(A) + E(B) − E(A, B).

Symmetric uncertainty is used to scale the value range of mutual information to [0, 1] [37], and it is defined as follows:

SU(A, B) = 2 I(A; B) / (E(A) + E(B)).
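These entropy-based criteria follow directly from their definitions; a short sketch over discrete samples (function names are illustrative):

```python
import math
from collections import Counter

def entropy(xs):
    """Empirical entropy E(A) = -sum p(a) log2 p(a) of a discrete sample."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def joint_entropy(xs, ys):
    """E(A, B): entropy of the paired sample."""
    return entropy(list(zip(xs, ys)))

def mutual_information(xs, ys):
    """I(A; B) = E(A) + E(B) - E(A, B)."""
    return entropy(xs) + entropy(ys) - joint_entropy(xs, ys)

def symmetric_uncertainty(xs, ys):
    """SU(A, B) = 2 I(A; B) / (E(A) + E(B)), scaled to [0, 1]."""
    denom = entropy(xs) + entropy(ys)
    return 2.0 * mutual_information(xs, ys) / denom if denom > 0 else 0.0
```

SU reaches 1 when one variable fully determines the other and 0 when they are independent, which is what makes it usable as both a relevance and a redundancy measure below.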

Model Design.
The optimization objectives of the feature selection model include the number of selected features, accuracy, relevance, redundancy, interclass distance, and intraclass distance, which are described as follows:
(1) The Number of Selected Features. It is minimized to keep the selected subset simple: F_1(S) = |S|, where |S| represents the cardinality of the feature set S.
(2) Accuracy. The accuracy of the learning algorithm is measured by the classification performance; the higher the classification performance is, the greater the accuracy. In this paper, the extreme learning machine (ELM) classifier [8] is used to calculate the accuracy: F_2(S) = (tp + tn)/(tp + tn + fp + fn), where tn, tp, fn, and fp represent the numbers of true negatives, true positives, false negatives, and false positives, respectively.
(3) Relevance. The relevance between the features and the categorical variable reflects the recognition ability of the selected features; the greater the relevance is, the stronger the recognition ability: F_3(S) = Σ_{x_i∈S} SU(x_i, y), where x_i represents the ith feature and y represents the target categorical variable. This objective is normalized according to F_3(S) = F_3(S)/max F_3(S).
(4) Redundancy. The redundancy quantifies the level of similarity between the selected features; the smaller the redundancy is, the smaller the similarity: F_4(S) = Σ_{x_i, x_j∈S, i≠j} SU(x_i, x_j), where x_j represents the jth feature. This objective is normalized according to F_4(S) = F_4(S)/max F_4(S).
(5) Interclass Distance. The interclass distance is the distance between the mean sample of each class and the average of the mean samples of all classes, which reflects the separability of samples from different classes. In the evolutionary process, a better sample distribution is obtained by maximizing the interclass distance: F_5(S) = Σ_{i=1}^{L} ||m_i − m||, where L is the total number of classes, m_i is the mean of all samples with feature subset S in class i, and m is the average of the class means. This objective is normalized according to F_5(S) = F_5(S)/max F_5(S).
(6) Intraclass Distance. By calculating the distances between the samples with the selected features and the mean of all samples of the same class, this objective reflects the cohesion of samples within a class and can improve the accuracy to a certain extent: F_6(S) = Σ_{i=1}^{L} Σ_j ||a_ij − m_i||, where a_ij is the jth sample in class i. This objective is normalized according to F_6(S) = F_6(S)/max F_6(S).
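As an illustration, the interclass distance F_5 and intraclass distance F_6 for a candidate feature mask might be computed as follows (a sketch; the function name and the use of Euclidean norms are assumptions):

```python
import numpy as np

def class_distances(X, y, mask):
    """Interclass (F5, maximized) and intraclass (F6, minimized) distances
    for the feature subset given by the boolean mask over the columns of X."""
    Xs = X[:, mask]                  # keep only the selected features
    classes = np.unique(y)
    means = np.array([Xs[y == c].mean(axis=0) for c in classes])  # m_i
    grand = means.mean(axis=0)       # average of the class means
    inter = np.linalg.norm(means - grand, axis=1).sum()           # F5
    intra = sum(np.linalg.norm(Xs[y == c] - means[i], axis=1).sum()
                for i, c in enumerate(classes))                   # F6
    return inter, intra
```

In the wrapper described later, such per-mask objective values would then be normalized by their maxima over the current population.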
Therefore, the feature selection optimization model in this paper simultaneously minimizes the number of selected features F_1, the redundancy F_4, and the intraclass distance F_6 while maximizing the accuracy F_2, the relevance F_3, and the interclass distance F_5.
3.2. The Proposed Algorithm: MALSMEA. In this paper, a modified vector angle-based large-scale many-objective evolutionary algorithm is proposed, termed MALSMEA. MALSMEA mainly uses a mutation operator based on variable grouping and the environmental selection method of VaEA [38]. Figure 1 shows the program flowchart of MALSMEA. The main process of MALSMEA is as follows:
(i) Step 1. Initialize a population P(t) with N individuals randomly in the whole decision space Ω, and set the parameters.
(ii) Step 2. The mutation operator based on variable grouping, with the ordered grouping method, is used to mutate the population P(t) to generate the offspring population, which is combined with the parent population P(t) to obtain the joint population U(t). Then, the environmental selection in Steps 4–9 is adopted to select N promising individuals from U(t).
(vii) Step 7. If |P(t + 1)| < N, select the individual with the largest vector angle in F(l) to join the new population P(t + 1) by calculating the vector angles between the individuals in F(l) and the individuals in P(t + 1); otherwise, go to Step 9.
(viii) Step 8. To maintain the balance between convergence and diversity, the worst-individual replacement strategy is used to replace poor individuals with other individuals. Repeat from Step 7.
(x) Step 10. Repeat from Step 2, and stop when the maximum number of generations t_max is reached.
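The variable grouping-based polynomial mutation of Step 2 can be sketched as follows (a GLMO-style sketch under stated assumptions: the variables are ordered by value and split into K groups, one randomly chosen group is mutated per call, and the names, K, and the distribution index eta are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility of the sketch

def grouped_poly_mutation(x, low, up, K=4, eta=20.0):
    """Mutate one ordered group of variables with polynomial mutation instead of
    mutating each of the D variables independently with probability 1/D."""
    x = x.copy()
    order = np.argsort(x)              # ordered grouping: sort variables by value
    groups = np.array_split(order, K)  # split the ordering into K groups
    idx = groups[rng.integers(K)]      # pick one group at random
    for i in idx:
        u = rng.random()
        # standard polynomial-mutation perturbation with distribution index eta
        delta = ((2 * u) ** (1 / (eta + 1)) - 1 if u < 0.5
                 else 1 - (2 * (1 - u)) ** (1 / (eta + 1)))
        x[i] = np.clip(x[i] + delta * (up[i] - low[i]), low[i], up[i])
    return x
```

Mutating a whole group at once moves many correlated variables together, which is what makes the operator efficient on problems with hundreds of decision variables.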

The Worst-Case Solution Replacement Strategy Based on SDE.
As the extreme individuals have already been selected according to the vector angle and the fitness value, for the worst-individual replacement step in the process of environmental selection, we use the SDE strategy to calculate the density of individuals. The SDE strategy considers the convergence and diversity of individuals simultaneously.
Using this method, we can replace poor individuals that share search directions with others. The specific process is as follows: if the angle between an individual a in F(l) and an individual b in P(t + 1) is less than the angle between two adjacent solutions of N ideally distributed solutions, that is, θ = (π/2)/(N + 1), where N is the population size, then the two individuals have similar search directions. In this case, if SDE(a) < SDE(b), then individual b is replaced by a. After the replacement, the angle between each individual a ∈ F(l) and the new population P(t + 1) is updated.
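The replacement rule can be sketched as follows (a simplified, single-candidate version; the names and the injected `sde` callable are illustrative, and the full algorithm additionally updates the angles of all remaining candidates after each replacement):

```python
import numpy as np

def vector_angle(f_a, f_b):
    """Angle between two objective vectors (assumed already normalized)."""
    cos = np.dot(f_a, f_b) / (np.linalg.norm(f_a) * np.linalg.norm(f_b) + 1e-12)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def replace_worst(P, F_P, a, f_a, sde):
    """If candidate a shares a search direction with some member b of P(t+1)
    (angle below theta) and has the smaller SDE value, b is replaced by a.
    `sde` maps a stacked objective matrix to per-row density values."""
    N = len(P)
    theta = (np.pi / 2) / (N + 1)            # angle between ideal solutions
    for i in range(N):
        if vector_angle(f_a, F_P[i]) < theta:
            dens = sde(np.vstack([F_P, f_a]))  # joint density of P(t+1) and a
            if dens[-1] < dens[i]:             # SDE(a) < SDE(b): a is better
                P[i], F_P[i] = a, f_a
            return P, F_P
    return P, F_P
```

Because SDE folds convergence into the density value, the comparison rejects a only when b is genuinely better placed, not merely less crowded.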

The Wrapper Structure of MALSMEA.
MALSMEA is applied to the feature selection model, and the pseudocode of the wrapper structure of MALSMEA is shown in Algorithm 1. The main steps are as follows: (i) First, the input dataset DS is divided into training and test datasets. (ii) Then, in the initialization process, MALSMEA allocates the random feature vector W_S selected from the data feature matrix W. The selected feature vector W_S is encoded as solutions by using the coding technique of [9] to reduce the amount of computation in the evolutionary process, the mask of W_S is regarded as the decision variables, and the population P is formed. (iii) Then, in the wrapper structure, the population P is evaluated via the six objective functions to obtain the objective vectors and the evaluated population P(t). The feature number is calculated from the decision variables of the solutions, the accuracy is obtained from the decoded feature subset and the corresponding ELM classifier [8], and the other objectives are calculated according to the corresponding equations.

(iv) Then, the population is optimized by MALSMEA.
(v) Finally, the optimal set P_S is obtained.

Time Complexity Analysis.
The time complexity of MALSMEA is composed mainly of the variable grouping-based mutation operation and the environmental selection. The environmental selection follows that of VaEA, whose worst-case time complexity is O(mN^2) [38], and the time complexity of RVEA is O(mN^2) [32]. Thus, the time complexity of MALSMEA is similar to that of GLMO but greater than that of the other three algorithms.

Experimental Studies
In this section, DTLZ1-DTLZ6 in the Deb, Thiele, Laumanns, and Zitzler (DTLZ) test suite [41] and LSMOP1-LSMOP9 in the Large-Scale Multi- and Many-Objective Problems (LSMOP) test suite [42] are selected to evaluate the performance of MALSMEA, and four datasets in the University of California at Irvine (UCI) machine learning repository [43] are selected to evaluate the ability of MALSMEA to optimize the proposed feature selection model; among them, Heart is a two-class dataset, Zoo and Iris are multiclass datasets, and Musk1 is a high-dimensional dataset. For LSMaOPs, MALSMEA is compared with GLMO [39], LCSA [40], VaEA [38], and RVEA [32]. GLMO and LCSA are large-scale multiobjective evolutionary algorithms: GLMO uses mutation operators based on variable grouping, and LCSA uses a linear combination to reduce the dimensionality. VaEA and RVEA are many-objective evolutionary algorithms that use vector angles and reference vectors, respectively. For the proposed six-objective feature selection model, MALSMEA is compared with W-MOSS [44], W-QEISS, and F-QEISS [8].
In the next sections, we introduce the performance indicators and set the parameters of the experiments. For all algorithms, when the numbers of objectives m are 5 and 10, the population sizes N are 126 and 275, and the numbers of decision variables D are 500 and 1000, respectively. Each algorithm runs 20 times independently and stops when the number of function evaluations (FEs) reaches 90,000. The performance of MALSMEA is verified by comparing the average IGD values obtained by the five algorithms; in each test instance, the best average IGD value is highlighted in bold. Finally, on the four datasets, MALSMEA and the three feature selection algorithms are utilized to deal with the proposed six-objective feature selection optimization model, for which N = 100, the maximum number of FEs is 100, and each algorithm runs independently 10 times. The optimization ability of MALSMEA is verified by comparing the HV indicator values and the optimization results.

Experimental Settings
(1) Performance Indicators. In the experiments, IGD [45] and HV [46] are used as evaluation indicators. The smaller (larger) the IGD (HV) indicator value is, the better the performance of the algorithm. The IGD indicator evaluates an algorithm by calculating the average of the minimum distances between the points sampled on the actual PF and the obtained solution set. The HV indicator quantifies the algorithm performance by calculating the volume enclosed by the obtained nondominated solution set and a reference point.
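The IGD computation described above can be sketched in a few lines (the function name is illustrative; the HV computation is omitted, as exact hypervolume is considerably more involved in high dimensions):

```python
import numpy as np

def igd(ref_front, approx):
    """Inverted Generational Distance: the mean, over reference-front points,
    of the minimum Euclidean distance to the obtained set (smaller = better)."""
    ref, app = np.asarray(ref_front), np.asarray(approx)
    # pairwise distance matrix of shape (|ref|, |approx|) via broadcasting
    d = np.linalg.norm(ref[:, None, :] - app[None, :, :], axis=2)
    return d.min(axis=1).mean()
```

Because IGD averages over the reference front rather than the obtained set, it penalizes both poor convergence and uncovered regions of the PF.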

(2) Parameter Settings for the Crossover and Mutation Operators. In the performance verification experiments, MALSMEA and GLMO use the mutation operator based on variable grouping to generate offspring. The other algorithms use simulated binary crossover (SBX) [32] and polynomial mutation [47]. The crossover probability is p_c = 1.0, the mutation probability is p_m = 1/D, and the distribution index is η_m = 20, where D is the number of decision variables. In the experiment verifying the superiority of MALSMEA on the proposed model, p_c = 0.8 and p_m = 0.2, according to [9].
(3) Other Parameter Settings for Algorithms. In MALSMEA and GLMO [39], the number of groups K is set to 4, and the ordered grouping method is adopted. For RVEA [32], the index α and the frequency f_r are set to 2 and 0.1, respectively. The parameters of W-QEISS and F-QEISS are set according to [8], and the searching method is based on r-NSGA-II [48]. The parameters of W-MOSS are set according to [44].
(4) Datasets. The details of the 4 UCI datasets utilized are shown in Table 1.
(5) ELM Classifier. For the proposed model, the ELM classifier [8] is utilized to evaluate the accuracy of the current solution, following the criterion given in [46]: the activation function of the hidden layer is g(x) = 1/(1 + e^(−x)), and the number of neurons is set to n_h = 10. The target classification variable and the (input) features are normalized into the ranges [0, 1] and [−1, 1] in each dataset, respectively. To minimize the accuracy deviation, k-fold cross-validation is utilized with k = 10, and the average accuracy is used for comparison [9].
Table 2 describes the IGD indicator values obtained by the five algorithms on the 5- and 10-objective DTLZ1-DTLZ6 with 500 and 1000 decision variables.

Performance Comparison of Algorithms on DTLZ.
As shown in Table 2, MALSMEA is competitive with the other four algorithms. Specifically, MALSMEA produces the best results on 18 out of 24 test instances, and its performance on the 10-objective DTLZ problems is significantly better than that of the other algorithms. The experimental results are analyzed in detail below. DTLZ1 reflects the convergence of the algorithm; MALSMEA outperforms the other algorithms on the 5- and 10-objective DTLZ1.
These results demonstrate that MALSMEA has better convergence on the large-scale high-dimensional DTLZ1. DTLZ2 is generally used to test the scalability of an algorithm with respect to the number of objectives.
The performance of MALSMEA on the 5-objective DTLZ2 is better than that of LCSA but slightly inferior to that of GLMO, VaEA, and RVEA. The performance of MALSMEA on the 10-objective DTLZ2 is better than that of the other four algorithms. Thus, MALSMEA scales well with the number of objectives.
DTLZ3 is a highly multimodal problem similar to DTLZ1. MALSMEA obtains the smallest IGD indicator value on DTLZ3 with 500 and 1000 decision variables. DTLZ4 is used to test the ability of the algorithm to ensure the diversity of the population. MALSMEA obtains the smallest IGD indicator value on the 10-objective DTLZ4 with 500 and 1000 decision variables. For the 5-objective DTLZ4, VaEA outperforms other algorithms on DTLZ4 with 500 and 1000 decision variables. MALSMEA exhibits greater diversity on the large-scale 10-objective DTLZ4.
For the 5-objective DTLZ5, MALSMEA outperforms LCSA on the instances with 500 and 1000 decision variables but is inferior to GLMO, VaEA, and RVEA. For the 10-objective DTLZ5, MALSMEA outperforms its counterparts. For DTLZ6, the overall performance of MALSMEA is optimal on instances with up to 1000 decision variables.
To further test the performance of MALSMEA, the nonparametric Friedman test [49] is employed. According to the average IGD indicator values of the five algorithms on DTLZ, Table 3 indicates the average ranking of the five algorithms.
The average ranking of MALSMEA is the smallest, which indicates that MALSMEA performs best. The average ranking of LCSA is the largest, so its performance is the worst.
To verify the efficiency of MALSMEA, Table 4 presents the running times of MALSMEA and the four other algorithms on the 10-objective DTLZ1 with 1000 decision variables. The running times of MALSMEA and GLMO are quite similar but greater than those of the other algorithms.

Performance Comparison of Algorithms on LSMOP.
LSMOP is designed to test the performance of algorithms on LSMaOPs. Table 5 lists the IGD indicator values obtained by the five algorithms on the 5- and 10-objective LSMOP1-LSMOP9 with 500 and 1000 decision variables. MALSMEA produces the best results on 26 out of 36 test instances. Therefore, compared with the other four algorithms, MALSMEA performs better in solving LSMaOPs.
Specifically, for the LSMOP test suite with 500 decision variables, MALSMEA outperforms the other algorithms on the 5-and 10-objective LSMOP2, LSMOP4, LSMOP5, LSMOP8, and LSMOP9. MALSMEA is inferior to LCSA on LSMOP3. MALSMEA outperforms the other algorithms on the 10-objective LSMOP1 and LSMOP7, but LCSA obtains the smallest IGD indicator value on the 5-objective LSMOP1 and LSMOP7. MALSMEA obtains the smallest IGD indicator value on the 5-objective LSMOP6, while RVEA performs better on the 10-objective LSMOP6.
For the LSMOP test suite with 1000 decision variables, MALSMEA outperforms the other algorithms on the 5- and 10-objective LSMOP2, LSMOP4, LSMOP5, LSMOP8, and LSMOP9. MALSMEA is inferior to LCSA on LSMOP3. LCSA obtains the best performance on the 5-objective LSMOP1 and LSMOP7, and MALSMEA outperforms the other algorithms on the 10-objective LSMOP1 and LSMOP7. The performance of MALSMEA on the 5-objective LSMOP6 is better than that of the other algorithms, but it is slightly inferior to that of LCSA and RVEA on the 10-objective LSMOP6.

Conclusion
In this paper, a modified vector angle-based large-scale many-objective evolutionary algorithm called MALSMEA is proposed. In MALSMEA, polynomial mutation based on variable grouping is used to replace naive polynomial mutation to improve the efficiency of solving large-scale optimization problems. A novel worst-case solution replacement strategy based on SDE is proposed to replace the worse of two individuals with similar search directions to increase diversity. In addition, MALSMEA is compared with four typical algorithms on optimization problems with up to 10 objectives and 1000 decision variables. The experimental results indicate that MALSMEA outperforms the four algorithms on the DTLZ and LSMOP test suites. By studying the existing feature selection models and taking the number of selected features, accuracy, relevance, redundancy, interclass distance, and intraclass distance as the optimization objectives, a six-objective optimization model is constructed and solved using MALSMEA. Compared with the other three feature selection algorithms, MALSMEA has some advantages in solving this model. Future studies will proceed in two directions. The first is to add a parallel strategy to MALSMEA to improve its efficiency or to further modify its environmental selection method. The second is to apply MALSMEA to LSMaOPs in other fields.
Data Availability
The details of the four UCI datasets utilized are shown in Table 1.