An Evolutionary Computation Based Feature Selection Method for Intrusion Detection

1School of Computer and Software, Nanjing University of Information Science & Technology, Nanjing 210044, China
2Jiangsu Engineering Research Center of Communication and Network Technology, Nanjing University of Posts and Telecommunications, China
3School of Modern Posts, Nanjing University of Posts and Telecommunications, Nanjing 210044, China
4School of Natural and Computing Sciences, University of Aberdeen, AB24 3UE, UK


Introduction
Wireless sensor networks (WSNs) are typical distributed sensor networks that acquire, process, and transmit data. They monitor, perceive, and collect data from various sources or monitored objects in the areas covered by the network and transmit the data to users after processing. As an emerging infrastructure for Internet of Things applications, WSNs are widely used in environmental monitoring, defense, urban management, medical applications, and other areas [1,2]. At present, most deployed WSNs collect scalar data such as humidity and location. In practical applications such as smart homes, traffic monitoring, and medical monitoring, wireless multimedia sensor networks can also process multimedia data such as video, audio, and images [3]. WSNs are therefore increasingly tied to people's everyday economic and social activities. However, owing to the weakness of wireless links, the lack of physical protection of nodes, and the dynamic nature of the topology, WSNs face a variety of data security risks. The openness of WSNs allows attackers to easily eavesdrop on, intercept, and tamper with packets. The most common attacks include denial-of-service attacks, Hello flooding attacks, and replay routing attacks [4,5]. These attacks may leak data and cause security problems in WSNs, and users are less likely to adopt large-scale WSNs that lack security protection and have privacy issues. Therefore, to promote the wider use and development of WSNs, it is very important to address their security issues.
Intrusion detection systems (IDSs) play a pivotal part in the data security protection of WSNs [6] by identifying malicious activities that attempt to violate network security goals. IDSs identify malicious activities by monitoring the system in real time and issue a warning once they find an abnormal situation. Denning first proposed an abstract model of the IDS in 1987 [7], which is a real-time IDS framework. At present, various IDSs have been deployed to detect anomalies [8]. In addition, neural networks [9], particle swarm intelligence [10], the differential evolution algorithm [11], and other technologies [12] have been used in IDSs to improve their performance. Among them, metaheuristic algorithms have been applied successfully to IDS problems [13][14][15]. There are also many studies on IDSs applied to WSNs [16,17]. In [17], the meaning and function of the external signals used in WSNs are defined, and distributed deployment and real-time intrusion detection are achieved by improving the DCA-RT dendritic cell algorithm. A distributed network IDS applicable to wireless networks is put forward in [18]. It is based on the principle of classification rule induction and swarm intelligence theory, and it enables effective model training for the IDS without the need to exchange sensitive data. Nowadays, large-scale distributed intrusion detection and intrusion detection data fusion are the main development directions for IDSs [19]. However, IDSs need to examine huge amounts of data, and most of the features in the datasets are redundant or irrelevant, which increases training time and lowers detection speed. In network information security research, an important goal is therefore to find methods that can quickly and effectively extract evidence of security violations from intrusion detection data.
Feature selection is significant for intrusion detection, since it can reduce the time complexity of the classifier and improve the efficiency by using optimized feature subsets.
In the current big data environment, mining the knowledge contained in big data is very important for guiding practical life and applications. Feature selection has therefore become more significant [20][21][22][23][24][25], and it is currently a hot topic in machine learning [26]. In this paper, feature selection is applied to intrusion detection. As a way of achieving dimension reduction, it selects the best combination of features rather than using the whole feature set. Depending on whether feature selection is independent of the classifier, methods are usually divided into two groups: filters and wrappers [27].
A filter method is independent of any classifier. It considers only the relevance between features and class labels and ranks the features using techniques from statistics, information theory, and other disciplines. Student's t-test [28] and the Fisher Discriminant Ratio [29] are typical statistical techniques. Features can also be ranked from an information-theoretic perspective using measures such as entropy or information gain [30]. Amiri et al. [31] put forward an improved feature selection algorithm based on mutual information, which effectively identifies the characteristics of attacks by calculating the mutual information. Correlation can also be used to evaluate the quality of a feature subset [32]: a subset is good if its features are highly correlated with the class but weakly correlated with one another. In addition, distance measures are used for feature selection [33]. Commonly used distance measures include the Euclidean distance, the standardized Euclidean distance, and the Mahalanobis distance.
In wrapper methods, the subsequent learning algorithm is embedded into the feature selection process, and the quality of a feature subset is determined by testing its prediction performance. The impact of a single feature on the final result is also taken into account. Typical methods include sequential forward selection (SFS) [34] and sequential backward selection (SBS) [35]. The disadvantage of SFS is that it can only add features and cannot remove them; SBS is the opposite, in that it can only remove features. Both SFS and SBS use greedy strategies, which can easily become trapped in local optima. The plus-L minus-R selection algorithm (LRS) [36] was proposed to deal with this problem. There are two forms of this algorithm. The first starts from the empty set; in each round it first adds L features and then removes R features so that the evaluation function value is optimal. The second starts from the complete set; in each round it first removes R features and then adds L features so that the evaluation function value is optimal. Sequential floating selection was developed from LRS. The difference is that L and R in sequential floating selection are not fixed but change during the search. It includes Sequential Floating Forward Selection (SFFS) and Sequential Floating Backward Selection (SFBS) [37]. SFFS starts from the empty set. In each round, it selects a subset of the unselected features so that the evaluation function is optimal after the subset is added, and then selects a subset of the selected features so that the evaluation function is optimal after that subset is removed. SFBS is similar to SFFS, except that it begins with the complete set and, in each round, removes features first and then adds features.
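The greedy behaviour of SFS described above can be illustrated with a minimal Python sketch. The function name, the `score` callback, and the stopping rule (stop when no single added feature improves the score) are assumptions made for illustration; in a wrapper, `score` would be the classifier's accuracy on the candidate subset.

```python
def sequential_forward_selection(num_features, score, max_features=None):
    """Greedy SFS: repeatedly add the single feature that improves `score` most.

    `score` is a hypothetical callback mapping a list of feature indices
    to a number where higher is better.
    """
    selected = []
    best_score = score(selected)
    limit = max_features or num_features
    while len(selected) < limit:
        candidates = [j for j in range(num_features) if j not in selected]
        # Evaluate every one-feature extension of the current subset
        top_score, top_j = max((score(selected + [j]), j) for j in candidates)
        if top_score <= best_score:
            break  # no single addition helps: a (possibly local) optimum
        selected.append(top_j)
        best_score = top_score
    return selected, best_score
```

Because a feature is never removed once added, the search cannot undo an early choice, which is the weakness that LRS and the floating variants are meant to address.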
Traditional filter and wrapper methods largely evaluate and select features individually. However, some features are not independent and perform well only when they work together, a situation that traditional methods handle poorly. Evolutionary computation (EC) methods have already been used for feature selection and classification by virtue of their global optimization capabilities [38,39], for instance, Particle Swarm Optimization (PSO) [40][41][42][43], Genetic Algorithm [44][45][46], ant colony optimization [47,48], and some of the algorithms mentioned in [49]. However, the solution space of the feature selection problem grows exponentially with the dimension of the dataset, so more features lead to a huge solution space. In addition, a large number of uncorrelated or redundant features produce many local optima in a large solution space. As a result, most EC methods still suffer from stagnation in local optima [50]. Another reason for this problem may be that many of these methods lack the ability to explore and exploit the search space appropriately [51]. Therefore, suitable search methods should be chosen automatically according to the specific feature selection problem. However, many existing evolutionary algorithms have only one search strategy and cannot effectively deal with the complex situations that arise in real-world problems. In other words, many existing feature selection algorithms use only one Candidate Solution Generation Strategy (CSGS) to generate new solutions. In addition, IDSs need to address large-scale problems. Recently, EC methods using adaptive mechanisms have been exploited to deal with continuous optimization problems, and their performance is promising [52][53][54][55][56]. However, the adaptive mechanism is rarely used for feature selection in IDSs.
Therefore, a self-adaptive differential evolution (SaDE) [57] method with several CSGSs is introduced to cope with feature selection for IDSs. In SaDE, an adaptive mechanism is introduced into the DE algorithm, and its control parameter is improved. DE is an effective method, and the adaptive mechanism can increase the diversity of solutions. Combining the two makes it possible to search dynamically for the most suitable strategy for the current problem during the search process.
The remainder of this paper is organized as follows. In Section 2, the SaDE algorithm is presented. Section 3 describes the experiments and discusses the results. Conclusions and future research work are provided in Section 4.

Initialization and DE Algorithm.
The DE method is based on evolutionary theory. It is a heuristic random search method based on population differences, and its basic idea stems from the competitive strategy of survival of the fittest in Darwin's theory of biological evolution. Using the differential vectors between parent individuals, DE performs mutation, crossover, and selection operations. The algorithm consists of the following steps.

Initialization and Updating Mechanisms in DE.
Unlike traditional initialization methods, this paper uses a mixed initialization strategy. Most particles are initialized with a few features, and the remaining particles are initialized with a large subset of features. It has been demonstrated in [50] that this initialization strategy can greatly improve the selection performance.
The personal best, $pbest$, represents the best value found by a single particle, and the global best, $gbest$, represents the best value found by all particles. Both $pbest$ and $gbest$ are updated according to their classification performance.
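A minimal Python sketch of a mixed initialization of this kind is given below. The paper does not state the proportions, so every number here is an assumption for illustration: two thirds of the particles start with about 10% of the features, the rest start with at least half of them, and 0.5 is used only as a nominal cut between "selected" and "unselected" values. The function and parameter names are hypothetical.

```python
import random

def mixed_initialization(num_particles, num_features, small_share=2 / 3,
                         small_ratio=0.1, large_ratio=0.5, rng=None):
    """Return position vectors in [0, 1]; all fractions are assumed values."""
    rng = rng or random.Random()
    num_small = round(num_particles * small_share)
    population = []
    for p in range(num_particles):
        # Most particles get a small subset, the remainder a large one
        ratio = small_ratio if p < num_small else rng.uniform(large_ratio, 1.0)
        k = max(1, round(ratio * num_features))
        chosen = set(rng.sample(range(num_features), k))
        # Chosen features receive high values, the others low values
        position = [rng.uniform(0.6, 1.0) if j in chosen else rng.uniform(0.0, 0.4)
                    for j in range(num_features)]
        population.append(position)
    return population
```

Starting most particles from small subsets biases the search toward compact solutions, while the few large-subset particles keep some coverage of the rest of the search space.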

Mutation.
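The DE algorithm implements the mutation operation by a difference strategy. A common strategy is to randomly select two distinct individuals from the population, scale their vector difference, and add the scaled difference to the individual to be mutated. Formula (1) is used to generate a new individual:

$$v_i(g+1) = x_{r_1}(g) + F \cdot \left(x_{r_2}(g) - x_{r_3}(g)\right), \quad (1)$$

where $x_{r_1}(g)$, $x_{r_2}(g)$, and $x_{r_3}(g)$ are mutually distinct individuals randomly chosen from the $g$th generation and different from the $i$th individual, $F$ is the scaling factor, and $v_i(g+1)$ is the resulting mutant vector.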
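Crossover.
Crossover mixes components at random, since DE is a stochastic algorithm. The crossover operation is performed between the target vector $x_i(g)$ and the mutant vector $v_i(g+1)$. The trial vector $u_i(g+1)$ is generated according to formula (2):

$$u_{i,j}(g+1) = \begin{cases} v_{i,j}(g+1), & \text{if } rand_j \le CR \text{ or } j = j_{rand}, \\ x_{i,j}(g), & \text{otherwise,} \end{cases} \quad (2)$$

where $CR$ is called the crossover probability and takes a value between 0 and 1, $rand_j$ is a uniform random number in $[0, 1]$, $j_{rand}$ is a random integer in $[1, 2, \ldots, D]$, and $D$ represents the number of dimensions. The condition $j = j_{rand}$ ensures that at least one component of $u_i(g+1)$ is contributed by the corresponding component of $v_i(g+1)$. Other variables have the same explanation as mentioned above.

Selection.
The better of the target vector and the trial vector enters the next generation according to formula (3):

$$x_i(g+1) = \begin{cases} u_i(g+1), & \text{if } f(u_i(g+1)) \text{ is no worse than } f(x_i(g)), \\ x_i(g), & \text{otherwise,} \end{cases} \quad (3)$$

where $f$ is the fitness function. Other variables have the same explanation as mentioned above.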
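The crossover and selection steps can be sketched in Python as follows. This is the standard binomial crossover and greedy selection of DE, written under the assumption that a higher fitness value is better (for example, classification accuracy); the function names are hypothetical.

```python
import random

def binomial_crossover(target, mutant, CR, rng=None):
    """Build a trial vector by mixing target and mutant component-wise."""
    rng = rng or random.Random()
    D = len(target)
    j_rand = rng.randrange(D)  # this component always comes from the mutant
    return [mutant[j] if (rng.random() < CR or j == j_rand) else target[j]
            for j in range(D)]

def select(target, trial, fitness):
    """Greedy selection; assumes larger fitness values are better."""
    return trial if fitness(trial) >= fitness(target) else target
```

With `CR = 0` exactly one component is inherited from the mutant, and with `CR = 1` the trial vector equals the mutant, which shows how the forced index keeps the trial vector from being a copy of the target.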

Representation of Solutions.
In this paper, feature selection is transformed into a combinatorial optimization problem over "0" and "1", with "0" meaning that the corresponding feature is not selected and "1" meaning that it is selected. A binary string is used to represent the solution. The string has D dimensions, where D equals the total number of features. Each dimension of the position vector is limited to the range between 0 and 1, and a threshold converts it to a binary value. That is, if the value of the $j$th dimension of the position is greater than the threshold, the corresponding value in the binary vector is set to 1, which means that the $j$th feature is chosen. Otherwise, it is set to 0.
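The decoding rule above can be written as a short Python sketch. The function names are hypothetical; the default threshold of 0.6 is the value the experiments section reports for the comparison runs.

```python
def decode(position, threshold=0.6):
    """Map a continuous position in [0, 1]^D to a binary feature mask."""
    return [1 if value > threshold else 0 for value in position]

def selected_features(position, threshold=0.6):
    """Indices of the features whose mask bit is 1."""
    return [j for j, bit in enumerate(decode(position, threshold)) if bit == 1]
```

Note that the comparison is strict, so a value exactly equal to the threshold leaves the feature unselected.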

The Self-Adaptive Mechanism.
The main goal of this mechanism is to generate the selection probabilities of the CSGSs based on their performance and to choose a suitable CSGS for every particle according to these probabilities. CSGSs that have been used successfully in recent generations have a higher probability of being selected in future generations. When a CSGS does not work well, it should be replaced by another CSGS that performs well. A brief introduction to the mechanism follows.
Each of the four CSGSs used in this paper is assigned an initial selection probability, which changes during the evolution process. The strategies are indexed by $q = 1, 2, \ldots, Q$, where $Q$ is the number of CSGSs used; in this research, $Q = 4$. The initial probability of each CSGS is set to 1/4. The probabilities sum to 1, and the probability of each CSGS is recalculated according to its performance in producing new solutions. The roulette wheel technique is applied to choose CSGSs because it selects targets at random in each cycle while favoring those with high probabilities [58]. The selected CSGS is then applied to the corresponding particle to generate a candidate solution. The candidate solution is evaluated, and the update mechanism described earlier is used to determine whether the personal best and the global best should be updated. Two binary matrices, a success-flag matrix and a failure-flag matrix, each with one row per particle and one column per CSGS, record the relationship between each generated solution and the corresponding personal best. If the newly generated solution is better than the old one, the corresponding entry of the success-flag matrix is set to 1; otherwise, the corresponding entry of the failure-flag matrix is set to 1. When a generation starts, both matrices are initialized to zero matrices.
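Roulette wheel selection over the CSGS probabilities can be sketched in Python as follows. The function name is hypothetical, and the probabilities are passed as a plain list indexed from 0 rather than from 1.

```python
import random

def roulette_wheel(probabilities, rng=None):
    """Return a strategy index drawn in proportion to its probability."""
    rng = rng or random.Random()
    r = rng.random() * sum(probabilities)
    cumulative = 0.0
    for q, p in enumerate(probabilities):
        cumulative += p
        if r < cumulative:
            return q
    return len(probabilities) - 1  # guard against floating-point round-off
```

With the initial setting of 1/4 for each of the four strategies, every CSGS is equally likely to be drawn; as the probabilities are updated, better-performing strategies are drawn more often.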
During evolution, suppose that the $i$th particle selects the $q$th strategy to produce a new solution. If the newly generated solution is better than the old one, the entry in row $i$ and column $q$ of the success-flag matrix is set to 1; otherwise, the entry in row $i$ and column $q$ of the failure-flag matrix is set to 1. When the evolution of the present generation is completed, the rows of each flag matrix are summed, and the results are recorded as success counts and failure counts indexed by $l = 1, 2, \ldots, LP$ and $q = 1, 2, \ldots, Q$, where $LP$ is the number of generations in a learning period and $l$ is the $l$th generation within that period. The flag matrices are then reinitialized to record the information of the following generation. In other words, the success count for generation $l$ and strategy $q$ records the number of new solutions produced by the $q$th CSGS that succeed in entering the following generation, and the failure count records the number of new solutions produced by the $q$th CSGS that fail to enter the next generation. After the evolutionary process is repeated for $LP$ generations, the success counts and the failure counts make up two $LP \times Q$ matrices, respectively. Both matrices are initialized to $LP \times Q$ zero matrices at the first generation of each period of $LP$ generations. After $LP$ generations, the success and failure information of the CSGSs is available, and the selection probabilities are recalculated from the statistics stored in the two matrices. The following steps are used to recalculate the probability of the $q$th ($q = 1, 2, \ldots, Q$) strategy.
Here, (4) computes the sum of each column of the $LP \times Q$ success matrix, that is, the number of successes of each strategy. The quantity in (6) is the proportion of the new solutions produced by the $q$th strategy that successfully replaced their corresponding personal bests within $LP$ generations. Meanwhile, the success and failure matrices are reinitialized. In (5), a small value of 0.0001 is applied to avoid division by 0: if the column sum from (4) is 0, it is replaced by this small value; otherwise, it is left unchanged. The probabilities are normalized by (7) to ensure that they always sum to 1. The above steps produce new probabilities for the CSGSs based on their performance during $LP$ generations of evolution, and the CSGSs are then chosen according to the new probabilities. Clearly, the greater the probability value, the more likely the corresponding CSGS is to be selected.
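One plausible reading of these steps is sketched below in Python. It is an assumption-laden illustration rather than the paper's exact formulas: the success proportion is taken to be successes divided by total trials, a strategy that was never tried keeps the small floor value, and the names `success`, `failure`, and `update_probabilities` are hypothetical. Both inputs are lists of LP rows with Q counts each.

```python
def update_probabilities(success, failure, eps=0.0001):
    """Recompute CSGS selection probabilities from LP x Q count matrices."""
    Q = len(success[0])
    rates = []
    for q in range(Q):
        s1 = sum(row[q] for row in success)   # column sum of successes
        f1 = sum(row[q] for row in failure)   # column sum of failures
        s2 = eps if s1 == 0 else s1           # floor to avoid a zero rate
        trials = s1 + f1
        # Assumed form: proportion of this strategy's solutions that succeeded
        s3 = s2 / trials if trials > 0 else eps
        rates.append(s3)
    total = sum(rates)
    return [rate / total for rate in rates]   # normalize so they sum to 1
```

The floor value keeps every strategy's probability strictly positive, so a CSGS that performed badly in one learning period can still be selected, and recover, in a later one.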

Candidate Solution Generation Strategy (CSGS).
In our research, we use four powerful CSGSs, inspired by the mutation strategies of DE, to generate new solutions [59]. They are used in the mutation operation. For simplicity, the notation DE/$x$/$y$ is used to represent the different mutation operators, where $x$ represents the base vector and $y$ represents the number of difference vectors used. They are described as follows:
(1) The first strategy is named DE/rand/1. It has been described in formula (1).
(2) The second strategy generates the next generation from the current individual, the current best individual, and four different random individuals. It is called DE/current-to-best/2 and is described in (8):

$$v_i(g+1) = x_i(g) + F \cdot \left(x_{best}(g) - x_i(g)\right) + F \cdot \left(x_{r_1}(g) - x_{r_2}(g)\right) + F \cdot \left(x_{r_3}(g) - x_{r_4}(g)\right), \quad (8)$$

where $x_{best}(g)$ is the best individual in the $g$th generation population and $x_i(g)$ represents the $i$th individual in the $g$th generation population. The meaning of the other variables has been introduced previously.
(3) The third strategy generates the next generation from a random individual and four other different random individuals. It is called DE/rand/2 and is described in (9):

$$v_i(g+1) = x_{r_1}(g) + F \cdot \left(x_{r_2}(g) - x_{r_3}(g)\right) + F \cdot \left(x_{r_4}(g) - x_{r_5}(g)\right). \quad (9)$$

The variables have been mentioned before.
(4) The fourth strategy is called DE/current-to-rand/1. It includes both mutation and crossover and is described in (10):

$$u_i(g+1) = x_i(g) + K \cdot \left(x_{r_1}(g) - x_i(g)\right) + F \cdot \left(x_{r_2}(g) - x_{r_3}(g)\right), \quad (10)$$

where $K$ is the combination coefficient, a random number between 0 and 1. The other variables have been mentioned previously.
The procedure of the SaDE algorithm is shown in Figure 1. The algorithm finally outputs the global best solution. The variables in the figure have been introduced in Section 2.
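The four strategies described above can be sketched in Python using their standard forms from the DE literature. The helper `pick_distinct`, the shared function signature, and the list `STRATEGIES` are hypothetical names introduced for illustration; positions are plain lists of floats, and the population must contain at least six individuals so that DE/rand/2 can draw five distinct others.

```python
import random

def pick_distinct(population, i, n, rng):
    """Return n mutually distinct individuals, none of them individual i."""
    others = [k for k in range(len(population)) if k != i]
    return [population[k] for k in rng.sample(others, n)]

def de_rand_1(population, i, best, F, rng):
    a, b, c = pick_distinct(population, i, 3, rng)
    return [a[j] + F * (b[j] - c[j]) for j in range(len(a))]

def de_current_to_best_2(population, i, best, F, rng):
    x = population[i]
    a, b, c, d = pick_distinct(population, i, 4, rng)
    return [x[j] + F * (best[j] - x[j]) + F * (a[j] - b[j]) + F * (c[j] - d[j])
            for j in range(len(x))]

def de_rand_2(population, i, best, F, rng):
    a, b, c, d, e = pick_distinct(population, i, 5, rng)
    return [a[j] + F * (b[j] - c[j]) + F * (d[j] - e[j]) for j in range(len(a))]

def de_current_to_rand_1(population, i, best, F, rng):
    x = population[i]
    a, b, c = pick_distinct(population, i, 3, rng)
    K = rng.random()  # combination coefficient drawn from [0, 1)
    return [x[j] + K * (a[j] - x[j]) + F * (b[j] - c[j]) for j in range(len(x))]

# Index q of this list plays the role of the strategy index
STRATEGIES = [de_rand_1, de_current_to_best_2, de_rand_2, de_current_to_rand_1]
```

Giving all four functions the same signature lets the self-adaptive mechanism call `STRATEGIES[q](population, i, best, F, rng)` for whichever index it draws. The resulting vectors may leave the range between 0 and 1, so in practice they would need to be clipped before decoding.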

Experiments and Results
The performance of the proposed method is assessed experimentally. The sections below briefly describe the datasets, data preprocessing, parameter settings, and experimental results.

Datasets and Data Preprocessing.
The dataset employed in this research is the KDDCUP99 dataset [60], a well-known test dataset in the domain of network IDSs. Each instance of this dataset has 41 feature attributes and one label. The features include 13 content features of Transmission Control Protocol (TCP) connections, nine time-based network traffic statistics, and ten host-based traffic features. The attacks fall into four major categories, DoS, Probing, R2L, and U2R, and twenty-two minor categories [61]. The KDDCUP99 dataset contains about 5 million records, and a 10% training subset and a test subset are offered as well. To save experimental time, the dataset is randomly reduced: 70% of the reduced data is used as a training set and 30% as a test set, giving 3,458 randomly selected training samples and 1,482 randomly selected test samples as the experimental data. The data are converted to numerical form before training, as the classifier can only handle numerical values. To better test the algorithm after tuning its parameters, we also generate four new datasets from the KDDCUP99 dataset, each obtained by randomly sampling 1% of the original dataset. We denote these datasets as DataNum1, DataNum2, DataNum3, and DataNum4. The K-Nearest Neighbour (KNN) method is applied as the classifier to evaluate the generated feature subsets, and 3-fold cross validation is employed to measure the classification accuracy.
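The wrapper evaluation described above, KNN accuracy under 3-fold cross validation on the selected features, can be sketched in pure Python. The paper does not state the value of K, the distance measure, or how folds are formed, so the defaults below (K = 3, Euclidean distance, folds assigned by index modulo 3) and the function names are assumptions for illustration only.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(range(len(train_X)),
                     key=lambda n: math.dist(train_X[n], x))[:k]
    return Counter(train_y[n] for n in nearest).most_common(1)[0][0]

def cv_accuracy(X, y, mask, folds=3, k=3):
    """Cross-validated KNN accuracy using only the features where mask is 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    Xs = [[row[j] for j in cols] for row in X]
    correct = 0
    for f in range(folds):
        test_idx = [n for n in range(len(Xs)) if n % folds == f]
        train_idx = [n for n in range(len(Xs)) if n % folds != f]
        train_X = [Xs[n] for n in train_idx]
        train_y = [y[n] for n in train_idx]
        correct += sum(knn_predict(train_X, train_y, Xs[n], k) == y[n]
                       for n in test_idx)
    return correct / len(Xs)
```

A function of this shape can serve as the fitness of a decoded binary mask. For a dataset of the size used here, a library implementation would be far faster than this quadratic-time sketch.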

Parameter Settings.
We choose SFFS, SFBS, standard PSO, and SaDE for comparison. According to past experience, each algorithm runs 26 times on the KDDCUP99 dataset. For the four CSGSs used in this paper, the initial CR is 0.5, and F is sampled from a normal distribution with mean 0.5 and standard deviation 0.3. Furthermore, =100. The learning period LP, that is, the number of generations between probability updates, was empirically set to 10.

Results and Analysis.
This part presents the results in terms of solution size and classification accuracy on the training and test sets. The solution size is the number of features chosen by the feature selection method as most beneficial for improving classification accuracy. The best results are shown in bold. We compare SaDE with the other algorithms on DataNum1 and compare the performance of SaDE under different control parameter values on DataNum1 to DataNum4. Table 1 shows the classification accuracy of SaDE and the other algorithms on the training sets. The reported statistics are the maximum (Max), minimum (Min), mean (Mean), and standard deviation (Std). Min and Max are the lowest and highest classification accuracy, respectively; Mean is the average classification accuracy over 26 runs, and Std is the standard deviation over the same runs. The t-test is a statistical test used to check a hypothesis about mean values at a given confidence level. In our experiments, the degrees of freedom (DF) equal 50, and the critical value of t is 2.009 at the 0.95 confidence level. The results are therefore statistically significant when t is less than -2.009 or greater than +2.009. We distinguish only two cases: significant (+) or not significant (-). 'T-Sig' indicates whether the algorithm introduced in this paper differs significantly from the other algorithms. Table 2 provides the solution sizes of the mentioned methods on the training sets. Table 3 presents the classification accuracies on the test sets. The comparison shows that SaDE has the highest classification accuracy on both the test sets and the training sets. At the same time, it selects the second fewest features. Although SFFS selects the fewest features, it selects too few, and its classification accuracy is poor.
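The significance check can be sketched in Python as follows. The paper does not name the exact test variant; a pooled two-sample t-test is assumed here because two sets of 26 runs give 26 + 26 - 2 = 50 degrees of freedom, which matches the reported DF. The function names are hypothetical.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance, plus degrees of freedom."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

def significance(t, critical=2.009):
    """Return '+' if |t| exceeds the critical value, otherwise '-'."""
    return "+" if abs(t) > critical else "-"
```

Here `a` and `b` would hold the per-run accuracies of SaDE and of a competing method; the critical value 2.009 applies only when the degrees of freedom equal 50.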
In addition, Tables 1-3 show that the other algorithms are inferior to SaDE in terms of detection rate and feature reduction. In summary, we can conclude that SaDE is an effective technique for IDSs. It can also select the most useful and representative subset of intrusion detection features, which reduces the computational cost of the IDS. This shows that the adaptive mechanism and multiple CSGSs can improve the performance of the DE algorithm on IDS problems.
We further improve the performance of SaDE by optimizing its parameters. We tested different parameter values on the above four datasets to examine their effect on the detection rate. Tables 4 and 5 show the effect of different thresholds on the classification accuracy of the test sets and the training sets, respectively, on DataNum1 to DataNum4. The accuracies are given in percent. The threshold plays a significant part in the initialization phase, since it determines whether a feature is selected. In the experiments comparing SaDE with other algorithms, we set the threshold to 0.6 based on experimental experience. To improve the algorithm's performance, we tested four different threshold values within the range [0, 1]. The reasons for this setting are briefly explained in Section 2. The tables show that, on both the test sets and the training sets, the classification accuracy is best when the threshold is set to 0.5. In most cases, the robustness is also the best. Regarding statistical significance on the test and training sets, the difference is significant between thresholds 0.5 and 0.8 and between 0.5 and 0.6, but not between 0.5 and 0.7. Therefore, optimizing the parameters helps improve the classification accuracy.

Conclusions and Future Work
Information technology is now moving from the Internet age into the era of the Internet of Things (IoT). With the application of IoT, WSNs face more and more data security problems. Security is a key issue in WSN design because it seriously affects the application prospects of WSNs. Intrusion detection is an important way to ensure network security, and improving it helps guarantee the data security of WSNs. In this paper, the feature selection problem has been analyzed, and the SaDE algorithm has been introduced to solve this problem for IDSs.

Security and Communication Networks
The KDDCUP99 intrusion dataset was used to assess the performance of the introduced algorithm. Our scheme applies an adaptive mechanism in the DE algorithm to find the CSGS that is most suitable for generating new solutions. At the same time, we have improved the control parameters of SaDE. The experimental results show that the improved SaDE can effectively solve the IDS problem. On the one hand, the comparison with other methods shows that the SaDE algorithm can remove about 57% of the features in the problem. In addition, the SaDE method is superior to the other algorithms in terms of classification accuracy on the training sets and test sets. On the other hand, four datasets generated from KDDCUP99 were used to test the control parameters. When the threshold is set to 0.5, the classification accuracy of SaDE is better than with the other values, and the performance of SaDE is improved.
In intrusion detection, multiobjective feature selection has also been researched for many years, but the SaDE algorithm has not yet been used in this field. In the future, we can therefore address the multiobjective feature selection problem in intrusion detection by combining the classifier with the SaDE algorithm. Moreover, the initialization step can also be improved.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Disclosure
This work is based on the conference paper that was presented at "The 4th International Conference on Cloud Computing and Security (ICCCS2018)".