A Clustering-Guided Integer Brain Storm Optimizer for Feature Selection in High-Dimensional Data

For high-dimensional data with a large number of redundant features, existing feature selection algorithms still suffer from the "curse of dimensionality." In view of this, the paper studies a new two-phase evolutionary feature selection algorithm, called the clustering-guided integer brain storm optimization algorithm (IBSO-C). In the first phase, an importance-guided feature clustering method is proposed to group similar features, so that the search space in the second phase can be reduced considerably. The second phase focuses on finding an optimal feature subset by using an improved integer brain storm optimization. Moreover, a new encoding strategy and a time-varying integer update method for individuals are proposed to improve the search performance of brain storm optimization in the second phase. Since the number of feature clusters is far smaller than the number of original features, IBSO-C can find an optimal feature subset quickly. Compared with several existing algorithms on real-world datasets, experimental results show that IBSO-C can find feature subsets with high classification accuracy at lower computational cost.


Introduction
Feature selection (FS), as an important dimensionality reduction method, has been applied to various real-world problems, such as image processing and text classification [1,2]. In general, a large number of irrelevant/redundant features slows down learning algorithms and can even reduce their learning accuracy. The purpose of FS is to eliminate those irrelevant and/or redundant features, thus shortening the learning time while improving the learning accuracy [3][4][5][6].
Brain storm optimization (BSO) is a new swarm intelligence algorithm simulating the collective behavior of human beings [23,24]. It has been applied to many real-world problems, including system identification and electromagnetic antenna design [25][26][27][28][29][30][31][32][33]. Recently, BSO-based FS algorithms have received much attention. Zhang et al. applied BSO to FS problems for the first time and proposed a continuous BSO-based FS algorithm (CBSO) [34]. Furthermore, they developed an improved discrete BSO [35], in which new idea clustering and idea updating mechanisms were proposed to improve the performance of BSO. Liang et al. proposed a hybrid FS algorithm by combining ant colony optimization and BSO [36]. Combining the Fuzzy Min-Max neural network with BSO to undertake feature selection and classification, Pourpanan et al. also developed a hybrid BSO-based FS method [37]. They then presented a novel improved hybrid BSO-based FS method [38] by combining a fuzzy ARTMAP model with BSO. Papa et al. introduced an improved binary BSO-based FS algorithm, in which a real-valued solution is mapped onto a Boolean hypercube by using different transfer functions [39]. These methods all enhance the capability of BSO for solving FS problems.
However, as the feature space increases exponentially, the search capability of existing BSO-based methods is inevitably reduced because of the lack of effective space reduction strategies.
For high-dimensional data, this paper develops a new evolutionary feature selection algorithm, called the clustering-guided integer BSO algorithm (IBSO-C). IBSO-C works in two phases. In the first phase, an importance-guided feature clustering method is developed to group all features into multiple clusters according to their redundancy. Following that, the second phase selects the most representative feature from each feature cluster by employing an improved integer BSO, and all representative features form the final feature subset. Applying IBSO-C to several high-dimensional FS problems, experimental results show its superiority and effectiveness over some state-of-the-art methods, including one filter method and three evolutionary wrapper feature selection methods.
The main contributions of this paper are as follows: (1) The paper proposes a new two-phase hybrid evolutionary FS framework, which effectively combines the fast dimensionality reduction of a clustering-based method with the global search ability of an evolutionary algorithm. Since the number of feature clusters is far smaller than the number of original features, the second phase can find an optimal feature subset quickly. (2) The paper proposes a new feature clustering method, called the importance-guided feature clustering method. By effectively fusing feature importance and feature correlation, the proposed method can group all features into multiple clusters according to their redundancy at a relatively small computational cost. (3) The paper proposes an improved integer BSO (IBSO) for feature selection problems. Several new strategies, including the integer encoding strategy and the time-varying integer update strategy, improve the search performance of IBSO.
The remaining contents are organized as follows: Section 2 presents basic concepts. Next, the proposed BSO-based FS algorithm is introduced in Section 3. Following that, Section 4 provides experimental analyses. Section 5 concludes this paper.

Feature Selection Problem.
Consider a dataset with D features and H instances. The objective of feature selection is to select d features (d < D) from the original feature set, Fset, so that the classification accuracy, AC, becomes better. Using a binary string, X, to represent a feature subset, we have

X = (x_1, x_2, ..., x_D), x_j ∈ {0, 1}, j = 1, 2, ..., D,

where x_j = 1 indicates that the j-th feature is selected into the feature subset X; otherwise, it is not selected. A feature selection problem can then be stated as

max AC(X), s.t. X = (x_1, x_2, ..., x_D), x_j ∈ {0, 1}.

To deal with FS problems, existing methods can be divided into three categories [8]: filter, wrapper, and embedded. The filter first calculates the importance degree of each feature by a specified measure, such as information gain, a distance measure, or a dependency measure; all features are then ranked by their importance degrees. This kind of approach has low computational cost, but its classification accuracy is often worse than that of the other two kinds of methods. The wrapper utilizes a learning algorithm to evaluate feature subsets and uses a search method to find good subsets. Because new feature subsets (solutions) must be repeatedly evaluated by a classifier, this kind of method has high computational cost, but its classification accuracy is often better than that of the filter. The embedded method carries out feature selection automatically while training the classifier; since the selected features are closely tied to the particular learning algorithm used, the embedded method is not robust to a change of algorithm. Since the proposed FS algorithm also uses a classifier to evaluate new solutions (i.e., feature subsets) and utilizes BSO to search for feature subsets, it belongs to the wrapper category. The purpose of this paper is to study a new BSO-based feature selection algorithm for high-dimensional data.

Brain Storm Optimization Algorithm.
In BSO, an idea (i.e., an individual) represents a potential solution of the optimized problem. BSO continually generates new ideas by repeatedly implementing the following three phases: individual clustering, individual update, and elite selection. Note that an idea corresponds to an individual in other evolutionary optimization algorithms.
In the individual clustering phase, individuals are first grouped into multiple clusters; the most commonly used clustering method here is K-means. Next, the individual update phase uses cluster centers or ideas from one/two clusters to generate new solutions. Two probability values, P_gen and P_cluster, determine which cluster center or ideas are used. After selecting the two solutions, a new individual is generated by a crossover-like strategy:

X_new = Rand · X_i + (1 − Rand) · X_j + ξ,

where X_i and X_j are two ideas/individuals or cluster centers from the two clusters, and Rand is a random number within [0, 1]. The disturbance factor, ξ, is utilized to enhance the diversity of new ideas; in the standard BSO it takes the form ξ = logsig((0.5T − t)/k) · rand(0, 1), where t is the current iteration, T is the maximal iteration count, and k controls the slope of the logsig function.
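One update step of this crossover-like strategy can be sketched as follows. The logsig form of the disturbance factor ξ and the slope constant k follow the common BSO formulation and are assumptions here, not details stated in this paper:

```python
import math
import random

def generate_idea(x_i, x_j, t, T, k=20.0):
    """One crossover-like BSO update (a sketch).

    x_i, x_j: two selected ideas or cluster centers (lists of floats).
    t, T: current and maximal iteration counts.
    k: slope constant of the logsig function (assumed value).
    """
    r = random.random()                      # Rand in [0, 1]
    base = [r * a + (1 - r) * b for a, b in zip(x_i, x_j)]
    # xi shrinks as t approaches T, so exploration fades over time
    xi = (1.0 / (1.0 + math.exp(-(0.5 * T - t) / k))) * random.random()
    return [v + xi * random.gauss(0, 1) for v in base]
```

Early in the run (small t) ξ is close to rand(0, 1), encouraging exploration; late in the run it decays toward zero, so new ideas stay near the weighted combination of the two parents.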
Finally, the elite selection phase compares the new ideas with the corresponding old ones, and the better ones are kept as the new ideas.

Symmetric Uncertainty.
Compared with mutual information, symmetric uncertainty (SU) corrects the FS bias by normalizing mutual information values, so that it can fairly analyze the correlation between features. It has been successfully used in FS problems [7,40].
Taking two random variables, X and Y, as an example, their SU value is calculated by

SU(X, Y) = 2 · IG(X|Y) / (H(X) + H(Y)),

where H(X) = −Σ_{x∈X} p(x) log_2 p(x) is the entropy of X,

H(X|Y) = −Σ_{y∈Y} p(y) Σ_{x∈X} p(x|y) log_2 p(x|y)

is the conditional entropy, which evaluates the uncertainty of X when Y is given, and

IG(X|Y) = H(X) − H(X|Y)

is the information gain, which evaluates the decrease in the uncertainty of X when Y is known.
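The SU computation above can be sketched for discrete samples as below. The helper names and the use of the joint-entropy identity IG(X|Y) = H(X) + H(Y) − H(X, Y) are implementation choices, not taken from the paper:

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def symmetric_uncertainty(xs, ys):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), estimated from two
    equally long sequences of discrete values."""
    h_x, h_y = entropy(xs), entropy(ys)
    h_xy = entropy(list(zip(xs, ys)))        # joint entropy H(X, Y)
    ig = h_x + h_y - h_xy                    # IG(X|Y) = H(X) - H(X|Y)
    denom = h_x + h_y
    return 2.0 * ig / denom if denom > 0 else 0.0
```

SU is 1 for perfectly dependent variables and 0 for independent ones, which is what makes it a convenient normalized redundancy measure.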

Framework.
To deal with the problem of the "curse of dimensionality," the paper proposes a two-phase evolutionary FS algorithm, called the clustering-guided integer BSO algorithm (IBSO-C). Figure 1 shows its framework, which includes two phases: clustering features and selecting representative features. First, all features are grouped into multiple clusters according to their similarity by using the proposed importance-guided feature clustering method. After that, the second phase selects representative features from these feature clusters by employing an improved integer BSO, thereby generating the final feature subset. The main contributions of this paper are marked with red dotted lines in Figure 1.

Importance-Guided Feature Clustering Strategy.
A good clustering method should be able to group similar features into the same cluster at a low computational cost. The frequently used K-means can divide data accurately, but it usually has a high computational cost. For this reason, we propose a new feature clustering method, the importance-guided feature clustering (IFC).
Algorithm 1 shows the steps of IFC. First, the SU measure is used to evaluate the importance of each feature in Fset (step 1); the greater the SU value between a feature and the class labels, the more important the feature. Second, all features in Fset are sorted in decreasing order of SU (step 2), and the sorted result is denoted Fsorted. After that, the following steps (steps 4-12) are executed repeatedly until all features are assigned to clusters: (1) The first feature in Fsorted is set to be a new cluster center, Center_i, and the i-th feature cluster is initialized as Fcluster_i = {Center_i}. (2) All features in Fsorted are checked against the new cluster center; if the correlation between a feature and the cluster center exceeds a threshold η, the feature is put into the feature cluster Fcluster_i. Repeating this check yields the new feature cluster Fcluster_i. (3) We then reset Fsorted = Fsorted \ Fcluster_i. If Fsorted is not empty, return to step 5; otherwise, stop the clustering method and output all feature clusters. Note that SU is also used to calculate the correlation between a feature and a cluster center.
Compared with existing clustering algorithms, IFC has the following advantages: (1) IFC does not need the number of clusters to be set in advance; compared with the cluster number, the correlation threshold η is easier to set. (2) IFC has lower computational complexity. In the worst case, each execution of line 11 of Algorithm 1 removes two features from Fsorted, so the SU measure is run D(D − 1)/4 times to calculate the correlations between features and cluster centers, where D is the number of features. Moreover, evaluating the importance degrees of all features requires running the SU measure D times. Therefore, IFC runs the SU measure (D^2 + 3D)/4 times. In most existing clustering methods, determining the redundancy between all pairs of features requires running the SU measure D(D − 1)/2 times.
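The IFC loop described above can be sketched as follows. This is a hedged sketch: `su_to_class` and `su_between` stand in for the SU computations and are assumed interfaces, not names from the paper:

```python
def ifc(features, su_to_class, su_between, eta):
    """Importance-guided feature clustering (a sketch of Algorithm 1).

    features: list of feature indices.
    su_to_class: dict mapping a feature to its SU with the class labels.
    su_between: function (f, g) -> SU between two features.
    eta: correlation threshold for joining a cluster.
    """
    # Steps 1-2: sort features by decreasing importance
    fsorted = sorted(features, key=lambda f: su_to_class[f], reverse=True)
    clusters = []
    while fsorted:
        center = fsorted[0]                  # most important remaining feature
        cluster = [center]
        # put every remaining feature correlated above eta into this cluster
        for f in fsorted[1:]:
            if su_between(center, f) > eta:
                cluster.append(f)
        fsorted = [f for f in fsorted if f not in cluster]
        clusters.append(cluster)
    return clusters
```

Each pass removes at least the center itself, so the loop always terminates, and the number of clusters emerges from the data rather than being fixed in advance.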

Selecting Representative Features by an Improved Integer BSO.
In this section, an improved integer BSO is proposed to select representative features from the feature clusters, thus generating the final feature subset. First, we give the encoding method for ideas/individuals and the fitness evaluation strategy.

Figure 1: Framework of the proposed IBSO-C.

Encoding and Fitness Evaluation
The goal of this section is to produce a good feature subset, X, by selecting a representative feature from each feature cluster, so that the classification performance is maximized. Taking an integer vector to represent a solution, the optimization model is as follows:

max AC(X), s.t. X = (x_1, x_2, ..., x_K), x_i ∈ {0, 1, ..., |Fcluster_i|}, i = 1, 2, ..., K,

where K is the number of feature clusters and x_i = s (0 ≤ s ≤ |Fcluster_i|) indicates that the s-th feature in the i-th feature cluster is selected into the feature subset, X. Following this model, we directly use the integer vector X to represent an individual in the population.
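Decoding such an integer vector into a concrete feature subset might look like this. It is a sketch: reading x_i = 0 as "no feature taken from cluster i" is an interpretation assumed for illustration:

```python
def decode(x, clusters):
    """Decode an integer idea into a feature subset.

    x: integer vector, one entry per feature cluster.
    clusters: list of feature clusters (lists of feature indices).
    Assumes x_i = 0 means no feature is selected from cluster i.
    """
    subset = []
    for x_i, cluster in zip(x, clusters):
        if x_i > 0:                          # pick the x_i-th feature (1-based)
            subset.append(cluster[x_i - 1])
    return subset
```

Because each position ranges only over its own cluster, the search space is the product of the cluster sizes rather than 2^D, which is the point of the two-phase design.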
This paper adopts leave-one-out cross-validation (LOOCV) with k-NN to calculate the fitness of an idea in BSO. Owing to its easy implementation, the one-nearest-neighbor (1-NN) method is used as the classifier in the following experiments; k-NN has been used by many FS methods [7,8,10]. In LOOCV with 1-NN, a single instance from the original dataset is selected as the testing sample, and the remaining ones are used as training samples; the 1-NN then predicts which class this instance belongs to. This process is repeated so that each instance in the original dataset is used exactly once as the testing sample. Based on this, the classification accuracy of an idea X_i is

AC(X_i) = FP_size / All_size,

where All_size is the number of instances in the original dataset and FP_size is the number of samples correctly predicted by k-NN. In the proposed method, the AC value of an idea is its fitness.
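The LOOCV-with-1-NN fitness described above can be sketched as follows; squared Euclidean distance is assumed as the 1-NN metric:

```python
def loocv_1nn_accuracy(samples, labels):
    """LOOCV with 1-NN: each instance is classified by its nearest
    neighbour among all other instances; the returned accuracy is the
    fraction classified correctly (used as the fitness of an idea)."""
    def dist(a, b):
        # squared Euclidean distance (monotone in the true distance)
        return sum((u - v) ** 2 for u, v in zip(a, b))
    correct = 0
    for i, s in enumerate(samples):
        # nearest neighbour among the remaining instances
        j = min((k for k in range(len(samples)) if k != i),
                key=lambda k: dist(s, samples[k]))
        if labels[j] == labels[i]:
            correct += 1
    return correct / len(samples)
```

In the wrapper setting, `samples` would contain only the columns selected by the decoded idea, so this evaluation runs once per new idea, which is why fewer features directly translate into less running time.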

Time-Varying Integer Update of Idea.
Analyzing (3), since random weights are assigned to the two selected normal ideas or cluster centers, X_i and X_j, the newly generated idea may swing back and forth between the two. If X_i and X_j remain far apart during the iterations of BSO, the ideas corresponding to the two clusters will be difficult to converge, reducing the convergence performance of BSO. Moreover, since traditional BSO was proposed for continuous optimization problems, we must develop a new integer update rule for solving the optimization model described in (5).
To overcome the above problems, this section proposes a time-varying integer update strategy of idea (TVIU). In this strategy, the best one among the two normal ideas or cluster centers will get a large learning weight. And this learning weight will increase as the number of iterations increases.
The new update rule is as follows:

X_new = ⌈w_1 · rand_2 · X_i + w_2 · rand_3 · X_j⌉,

where rand_2 and rand_3 are two random numbers within [0, 1], ⌈·⌉ is the round-up function, w_1 and w_2 are two weights that control how much the new solution learns from each of the two ideas or cluster centers, T is the maximum number of iterations, t is the current iteration, and AC(X_i) and AC(X_j) are the AC values of the two normal ideas or cluster centers, respectively. From (7) we can see that (1) the bigger the AC value of a normal idea or cluster center, the higher the weight (w) of this idea or center, so that the new solution X_new can learn more knowledge from it; this improves the quality of the new solution to a certain extent. (2) As the iterations proceed, the influence of the AC value on the weight grows, meaning that the degree of learning from the better idea increases with the iteration count; this speeds up the convergence of the population in the later stage of the algorithm.
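The time-varying update can be sketched as below. Since the exact weight formula is not fully recoverable from the text, the sketch assumes one simple instantiation with the two stated properties: the idea with the larger AC value receives the larger weight, and that advantage grows with t/T:

```python
import math
import random

def tviu_update(x_i, x_j, ac_i, ac_j, t, T):
    """Time-varying integer update (a sketch; the weight formula below
    is an assumed instantiation, not the paper's exact equation (7)).

    x_i, x_j: two integer ideas or cluster centers.
    ac_i, ac_j: their classification accuracies.
    """
    share = ac_i / (ac_i + ac_j)             # fitness share of idea i
    w1 = (1 - t / T) * 0.5 + (t / T) * share # drifts from 0.5 toward share
    w2 = 1.0 - w1
    r2, r3 = random.random(), random.random()
    # round-up function keeps the new idea in the integer search space
    return [math.ceil(w1 * r2 * a + w2 * r3 * b) for a, b in zip(x_i, x_j)]
```

At t = 0 both parents are weighted equally regardless of fitness; as t approaches T the fitter parent dominates, matching the two properties listed in the text.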

Disturbance Operator.
In the proposed IBSO, a new disturbance operator is utilized to improve the diversity of new ideas. Each element of the idea X_i is checked in turn: if a random number rand′ ∈ [0, 1] is smaller than the probability p_m, the element is reinitialized within its search space. In this paper, we set p_m = 1/D, where D is the number of features.
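A minimal sketch of this disturbance operator, assuming each element's search space is the integer range [0, |Fcluster_i|]:

```python
import random

def disturb(x, cluster_sizes, p_m):
    """Per-element re-initialisation: with probability p_m, each element
    is redrawn uniformly from its own integer range [0, cluster size]."""
    return [random.randint(0, size) if random.random() < p_m else v
            for v, size in zip(x, cluster_sizes)]
```

With p_m = 1/D, on average one element per idea is perturbed, a mutation rate commonly used in evolutionary algorithms.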

Implement Steps of IBSO-C.
Like traditional BSO algorithms [41], the proposed IBSO-C still includes three main steps: clustering ideas, updating ideas, and selecting elite ideas. In IBSO-C, an idea represents a solution of the optimized problem. The feature clustering strategy proposed in Section 3.2 is used to cluster features, and the improved integer BSO proposed in Section 3.3 is used to update the positions of ideas.
In the first step (clustering ideas), all ideas are grouped into several clusters. In traditional BSO, K-means clustering is commonly used, but because all instances must be repeatedly clustered, K-means still has the disadvantage of high computational cost. To address this, Cao et al. [42] introduced random grouping to minimize the clustering cost: the population is randomly grouped into M clusters, and the fittest idea in each cluster is selected as its center. Compared with K-means, this method significantly reduces the computational cost of population clustering.
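The random grouping step might be sketched as follows; the even split by stride is an implementation choice, not specified in [42]:

```python
import random

def random_grouping(population, fitness, m):
    """Random grouping in the style of Cao et al.: shuffle the idea
    indices, split them into m clusters, and take the fittest idea in
    each cluster as its center."""
    idx = list(range(len(population)))
    random.shuffle(idx)
    clusters = [idx[i::m] for i in range(m)]  # m roughly equal groups
    centers = [max(c, key=lambda i: fitness[i]) for c in clusters]
    return clusters, centers
```

Unlike K-means, this costs only a shuffle plus one pass over the fitness values per generation, with no distance computations at all.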
In the second step (updating ideas), new ideas are generated based on two cluster centers or normal ideas from two clusters. The time-varying integer update strategy of Section 3.3 is used to generate new solutions, and the proposed disturbance operator is utilized to improve the diversity of new ideas.
In the third step, the elite selection is implemented. For the i-th idea in the population, if the classification accuracy of the new idea X_new_i is better than that of the old one, X_i, then X_i is replaced by X_new_i, i.e., X_i ← X_new_i. If X_new_i and X_i have the same classification accuracy but X_new_i contains fewer features than X_i, then X_i ← X_new_i as well. Algorithm 2 shows the pseudocode of the proposed IBSO-C. Note that there are two clustering operations in Algorithm 2, in lines 2 and 6. Line 2 uses the method proposed in Section 3.2 to cluster features; its input is the data to be processed and its output is the set of feature clusters. Line 6 clusters the N ideas (individuals); its input is the population of BSO.
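The elite-selection rule above can be sketched as follows, counting nonzero elements as the number of selected features, per the integer encoding:

```python
def elite_select(x_old, x_new, ac_old, ac_new):
    """Elite selection: keep the new idea if it is more accurate, or
    equally accurate with fewer selected features (nonzero elements)."""
    n_old = sum(1 for v in x_old if v != 0)
    n_new = sum(1 for v in x_new if v != 0)
    if ac_new > ac_old or (ac_new == ac_old and n_new < n_old):
        return x_new, ac_new
    return x_old, ac_old
```

The accuracy-first, size-second tie-break is what drives IBSO-C toward small feature subsets without sacrificing classification performance.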

Experiments and Analyses
This section verifies the effectiveness of IBSO-C. First, we discuss the effect of the two key proposed operators, the clustering strategy and the time-varying integer update strategy, on the performance of IBSO-C. Second, IBSO-C is compared with four existing FS algorithms.

Experimental Preparation.
Eight real datasets are used to verify the performance of IBSO-C; Table 1 shows their basic information. These datasets have been used in many studies [5,7,10] and can be downloaded from http://www.ics.uci.edu/mlearn/MLRepository.html and http://gems-system.org/.
Four representative feature selection algorithms were used for comparison: the ReliefF algorithm (ReliefF) [43], the binary PSO algorithm (BPSO) [44], the binary BSO-based algorithm (BBSO) [35], and the self-adaptive PSO algorithm (SaPSO) [12]. For fair comparison, all population-based algorithms use the same swarm/population size (50) and the same maximal number of iterations (1000). The other parameters are set as in the original literature, as shown in Table 2.
Three performance indices are used to evaluate the quality of an algorithm: the classification accuracy (AC), the number of selected features (FN), and the running time (Time).
This paper employs 10-fold cross-validation to demonstrate the effectiveness of an algorithm: nine parts are taken as training data in turn, the remaining part is taken as test data, and the average of the 10 runs is used as the final result. All experiments are carried out on an Intel(R) Core(TM) i7-8700 CPU, 3.2 GHz, with 16.00 GB RAM.

Analysis on the Proposed Clustering Strategy.
The proposed clustering strategy plays a key role in improving the performance of IBSO-C; this section analyzes its effectiveness. The conventional BSO without feature clustering [34] (CBSO) is selected as the comparison method. IBSO-C and CBSO use the same parameters, shown in Table 2. Table 3 shows the AC, FN, and running-time values obtained by IBSO-C and CBSO. We can see that (1) with the help of the feature clustering strategy proposed in Section 3.2, IBSO-C obtained the best average AC values on all eight datasets, significantly higher than those of CBSO. (2) More importantly, compared with CBSO, IBSO-C needs only very few features to reach such good AC values, as shown by the average FN values; on the CNS dataset, for example, IBSO-C used fewer than 90 features to reach a classification accuracy of 86.04%. (3) Because it uses so few features, the running time of IBSO-C is also significantly less than that of CBSO; on CNS, IBSO-C costs only 1.5936 minutes to find a good solution, while CBSO runs for more than two hours.

Analysis on the Time-Varying Integer Update Strategy.
The time-varying integer update strategy (TVIU) proposed in Section 3.3 also plays a key role in improving the performance of IBSO-C.
This subsection analyzes its effectiveness. The integer version of update rule (3) is selected as the comparison method:

X_new = ⌈Rand · X_i + (1 − Rand) · X_j⌉. (8)

For convenience, IBSO-C with (8) is called IBSO-S. Both IBSO-C and IBSO-S use the same parameters, shown in Table 2.
Algorithm 2: Pseudocode of the proposed IBSO-C.
Input: the dataset to be solved;
Output: the optimal feature subset;
(1) Set the related parameters, including the population size N, the maximal iteration count T, P_gen, P_cluster, and so on; t = 0;
(2) Cluster all the features into K clusters by using the method in Section 3.2;
(3) Randomly generate N integer ideas (individuals);
(4) Evaluate the fitness of each idea by equation (6);
(5) While t < T
(6)   Group all N ideas into M clusters by the method in [42];
(7)   Select the best idea from each cluster as the cluster center;
      % the phase of updating ideas %
(8)   For i = 1 : N   % from the first idea to the last one
(9)     If a random number rand() < P_gen, then
(10)      Randomly select a cluster and determine its cluster center;
(11)      If a random number rand() < P_cluster, then
(12)        Select the cluster center;
(13)        Generate a new idea by equation (7);
(14)        Implement the proposed disturbance operator;
(15)      Else
(16)        Randomly select a normal idea from this cluster;
(17)        Generate a new idea by equation (7);
(18)        Implement the proposed disturbance operator;
(19)      End if
(20)    Else
(21)      Randomly select two clusters;
(22)      If a random number rand() < P_cluster, then
(23)        Select the two cluster centers;
(24)        Generate a new idea by equation (7);
(25)        Implement the proposed disturbance operator;
(26)      Else
(27)        Randomly select two normal ideas from the two clusters, respectively;
(28)        Generate a new idea by equation (7);
(29)        Implement the proposed disturbance operator;
(30)      End if
(31)    End if
        % the phase of selecting elite ideas %
(32)    Evaluate the new idea and update the corresponding old idea as in Section 3.4;
(33)  End for
(34)  t = t + 1;
(35) End while
(36) Output the best idea as the optimal feature subset.

Comparison Analyses.
The proposed IBSO-C algorithm is compared with the four existing algorithms in terms of AC, FN, and running time. Table 4 lists the average AC values obtained by the five FS algorithms with k-NN, and Table 5 shows their average FN values. In addition, we employ the Mann-Whitney U test to investigate whether there is a significant difference between IBSO-C and each comparison algorithm: "+" indicates that IBSO-C is significantly superior to the comparison algorithm, "=" indicates no significant difference, and "−" indicates that IBSO-C is significantly inferior. The two tables show the following: (1) On 6 of the 8 datasets, IBSO-C obtained the largest average AC values; on 4 of the 8 datasets, namely Colon, WrapAR10P, DBWorld, and CNS, the AC values of IBSO-C are significantly superior to those of all four comparison algorithms. (2) On the GFE01 dataset, IBSO-C obtained the third-best AC value, while SaPSO had the best; however, the number of features selected by IBSO-C is significantly smaller than that of SaPSO, with FN values of 14.0 and 73.2, respectively. (3) As on GFE01, SaPSO obtained the best AC value on SRBCT, but its FN value is significantly larger than that of IBSO-C. (4) On all 8 datasets, IBSO-C obtained the smallest FN values among the compared algorithms. Table 6 shows the average running time of IBSO-C and the three evolutionary FS algorithms. It reports that the running time of IBSO-C is significantly less than that of BPSO, SaPSO, and BBSO; over all 8 datasets, the average running time of IBSO-C is 5.33 minutes, while those of BPSO, SaPSO, and BBSO are considerably longer.

Conclusions
This paper studied a new two-phase evolutionary feature selection algorithm, called the clustering-guided integer BSO algorithm (IBSO-C), for high-dimensional data. In IBSO-C, the feature clustering strategy in the first phase markedly reduces the search space of the integer BSO in the second phase; since the number of feature clusters is far smaller than the number of original features, IBSO-C can find an optimal feature subset quickly. Moreover, the proposed importance-guided feature clustering method can effectively group features at a relatively small computational cost, and the proposed encoding strategy and time-varying integer update strategy improve the search performance of IBSO-C. IBSO-C was compared with four existing FS algorithms, i.e., ReliefF, BPSO, SaPSO, and BBSO, on several datasets. The experimental results showed that IBSO-C is a highly competitive FS algorithm that obtains relatively good classification accuracy at lower computational cost.
A more sophisticated feature clustering method that does not require any threshold or parameter to be set manually will be one of our future research directions. In addition, applying multi- or many-objective evolutionary algorithms to cost-sensitive feature selection problems will be another future research direction.

Data Availability
Some or all data, models, or code generated or used during the study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.