Building an Effective Intrusion Detection System by Using Hybrid Data Optimization Based on Machine Learning Algorithms

1 Computer Virtual Technology and System Integration Laboratory of Hebei Province, College of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei, 066000, China
2 Hebei University of Engineering, School of Information & Electrical Engineering, Hebei Handan, 056038, China
3 Beijing Key Laboratory of Software Security Engineering Technique, Beijing Institute of Technology, 5 South Zhongguancun Street, Haidian District, Beijing, 100081, China


Introduction
With the rapid development of the Internet, network security has received more and more attention. Detecting anomalous behavior in the network is an important research topic in the field of network security. Intrusion detection systems (IDSs) are used to analyze network data and detect anomalous behaviors in the network. IDSs are generally classified into two categories: signature-based and anomaly-based detection systems [1]. Signature-based intrusion detection systems [2,3], such as Snort [3], detect intrusions by building libraries of anomalous behavior signatures and matching network data against them. These IDSs have a high detection rate, but they have difficulty identifying new attacks in the network. Anomaly-based intrusion detection systems build models of normal network behavior and detect intrusions according to whether observed behaviors deviate from the normal behavior. Such IDSs recognize unknown types of anomalous behavior well, but their overall detection rate is low and their false alarm rate is high.
To improve the detection rate of IDSs and reduce the false alarm rate, researchers have applied a variety of data mining and machine learning methods to IDSs. For example, SVM and neural network models have been applied to intrusion detection [4]. Koc et al. propose a Hidden Naïve Bayes (HNB) model to build an intrusion detection system [5] and show that the HNB model exhibits superior overall performance compared with traditional Naïve Bayes. Rajeswari et al. propose a multiple-level hybrid classifier to build an IDS that uses a combination of Enhanced C4.5 tree classifiers [6], which can be trained with unlabeled data and detects previously "unseen" attacks. In addition to improving traditional classification methods, some research focuses on the selection of data records.
However, the huge amount of network data and the unbalanced distribution of normal and anomalous behaviors lead to a low detection rate and a high false alarm rate in most IDSs. In this paper, an effective IDS that uses hybrid data optimization, consisting of data sampling and feature selection, is proposed. Data sampling deletes outliers in the dataset and reduces the negative impact of unbalanced data distribution on intrusion detection. Feature selection searches for the features that best reflect the difference between anomalous and normal behaviors and deletes useless features to enhance the detection performance of the IDS. An effective IDS based on data sampling and feature selection is then built by using the random forest (RF) algorithm.
The organization of this paper is as follows. Section 2 outlines the related work. Section 3 introduces the operational principles of iForest, GA, and RF, which are applied in DO_IDS. Section 4 introduces the building of DO_IDS in detail. Section 5 describes and analyzes the experiments. Section 6 summarizes and discusses DO_IDS.

Related Work
Data sampling can address the unbalanced distribution of network data. It includes oversampling and undersampling. When the data are insufficient for analysis, oversampling balances the data by increasing the number of rare samples, as in the SMOTE algorithm. In contrast, undersampling balances a dataset by removing some samples, as in EasyEnsemble and BalanceCascade proposed by Liu et al. [7].
By using a sampling method to extract representative training data and combining it with a machine learning method, the performance of an IDS can be improved effectively. Enamul et al. use a sampling technique to select a representative dataset and Least Squares SVM to identify anomalous network data [8], showing that data sampling can improve the accuracy and speed of intrusion detection. Alyaseen et al. combine modified K-means with machine learning methods to build intrusion detection models [9][10][11]. The modified K-means method can discover similar structures and patterns in datasets and thereby compress them into higher-quality datasets. Integrating K-means with C4.5 to construct the classifier of an intrusion detection model can greatly reduce the running time of the intrusion detection system [9]; integrating it with the SVM algorithm can effectively improve performance in detecting DoS anomalies [10]; and integrating it with a hybrid model of SVM and extreme learning machine (ELM) can improve the accuracy and efficiency of the IDS [11].
Some researchers also focus on feature selection. Feature selection methods fall into three types: filter, wrapper, and embedded. The filter method evaluates each feature according to its divergence or correlation and sets a threshold to select features, independently of the classification performance of any classifier [12]. The wrapper method selects or excludes features according to an objective function, which is usually the classification performance [13]. The embedded method first trains a machine learning model, such as a Decision Tree, to obtain the weight of each feature and then selects features according to these weights [14].
Feature selection receives more attention once it is found that some features contribute more to classification while others confuse it. Wang et al. transform the original features using the logarithms of the marginal density ratios and obtain new, better-quality transformed features [15], which improves the performance of an SVM-based detection model. Hajisalem et al. propose a hybrid classification method based on the artificial bee colony (ABC) and artificial fish swarm (AFS) algorithms [16], using fuzzy C-means clustering (FCM) and correlation-based feature selection (CFS) techniques on the training data. George et al. apply SVM and PCA to anomaly detection in network data [17] and show that PCA can effectively improve the classification performance of SVM and increase the model training speed. Raman et al. propose combining hypergraph, GA, and support vector machine to implement an IDS [18]. Hypergraph and GA are used to perform parameter estimation for the SVM and feature selection, and the support vector machine is used to detect anomalous network behaviors after feature selection. They show that combining the feature selection method with SVM can improve the accuracy of the classifier.
Genetic algorithm, which is also applied in this paper, is a heuristic search algorithm used to solve optimization problems in computer science and artificial intelligence and is widely used for tasks such as global optimization, parameter optimization, and feature selection [19]. Many scholars also apply genetic algorithm to network security. Khammassi et al. use GA and a logistic regression algorithm to select the optimal feature subset [20] and show through different decision tree algorithms that the selected feature subset is effective for intrusion detection. Hamamoto et al. combine GA and fuzzy logic to detect anomalous events in the network and show that fuzzy logic can improve accuracy [21]. In their work, GA is used to generate a digital signature of a network segment using flow analysis, and a fuzzy logic scheme is applied to decide whether an instance represents an anomaly. Faris et al. propose an intelligent detection system based on GA and Random Weight Network for email spam detection [22], and the experimental results confirm that the proposed system achieves remarkable results in terms of accuracy, precision, and recall. Vijayanand et al. propose a novel intrusion detection system with GA-based feature selection and multiple support vector machine classifiers for wireless mesh networks [23]. Their system exhibits high attack detection accuracy and is suitable for intrusion detection in wireless mesh networks.

Preliminary
This section introduces the genetic algorithm, iForest algorithm, and RF algorithm that will be used in the next section.

Isolation Forest (iForest).
The iForest algorithm was proposed by Liu, Fei et al. in 2012 [24, 25]. It is a tree-based outlier detection model with linear time complexity and high precision, and it is suitable for high-dimensional and large-volume data.
Because anomalies are "few and different," they are more susceptible to isolation. In a data-oriented random tree, records are recursively partitioned until all records are isolated. This random partitioning gives outlier records a shorter path length, because records with distinguishable attribute values are more likely to be separated in early partitions. An iForest consists of a number of iTrees (Isolation Trees), each of which is a binary tree. The implementation steps are as follows:
(1) Randomly select a fixed number of sample points from the training data as a subsample and put them in the root node of the tree.
(2) Randomly specify an attribute and randomly generate a cutting point p between the maximum and minimum values of the specified attribute in the current node data.
(3) A hyperplane is generated from this cutting point, and the data space of the current node is divided into two subspaces: the data whose specified attribute is less than p are put into the left child of the current node, and the data whose specified attribute is greater than or equal to p are put into the right child of the current node.
(4) Recursively execute steps (2) and (3) until the child node has only one record or the iTree has reached the defined height.
Once these iTrees have been built, the training of the iForest is complete, and the generated iForest can be used to evaluate the testing data. A testing record traverses each iTree, and the height at which the record eventually lands in each tree is calculated. The average height of the record over all trees is then obtained. If the average height is less than the given threshold, the record is considered an outlier.
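The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the paper: the function names, the dictionary-based tree representation, and the default parameter values are assumptions, and the path-length normalisation used in the original iForest papers is omitted so that only the plain average height described above is computed.

```python
import random

def build_itree(points, height, max_height, rng):
    """Steps (2)-(4): recursively cut the node data at a random attribute and cutting point."""
    if len(points) <= 1 or height >= max_height:
        return {"size": len(points)}                 # leaf node
    attr = rng.randrange(len(points[0]))             # randomly specified attribute
    lo = min(p[attr] for p in points)
    hi = max(p[attr] for p in points)
    if lo == hi:                                     # no cut possible on this attribute
        return {"size": len(points)}
    cut = rng.uniform(lo, hi)                        # cutting point p between min and max
    left = [p for p in points if p[attr] < cut]
    right = [p for p in points if p[attr] >= cut]
    return {"attr": attr, "cut": cut,
            "left": build_itree(left, height + 1, max_height, rng),
            "right": build_itree(right, height + 1, max_height, rng)}

def path_length(tree, record, depth=0):
    """Height at which a record lands when it traverses one iTree."""
    if "attr" not in tree:
        return depth
    child = tree["left"] if record[tree["attr"]] < tree["cut"] else tree["right"]
    return path_length(child, record, depth + 1)

def build_iforest(data, n_trees=100, subsample=32, max_height=8, seed=0):
    """Step (1) for every tree: draw a fixed-size subsample and grow an iTree from it."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = rng.sample(data, min(subsample, len(data)))
        forest.append(build_itree(sample, 0, max_height, rng))
    return forest

def average_height(forest, record):
    """Average landing height over all iTrees; small values suggest an outlier."""
    return sum(path_length(tree, record) for tree in forest) / len(forest)
```

A record would then be flagged as an outlier when `average_height(forest, record)` falls below a chosen threshold; how that threshold is set is left open here.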

Genetic Algorithm.
Genetic algorithm mainly includes four parts: chromosome encoding, initial population generating, fitness calculating, and genetic operator design.
(1) Chromosome Designing. GA expresses the solution space data as a genotype string structure before the optimization search. Different combinations of these string structures constitute different chromosomes, and each chromosome represents a possible solution.
(2) Initial Population Generating. Each population contains a certain number of chromosomes, and each chromosome represents a possible solution. The chromosomes are initially generated randomly.
(3) Fitness Calculating. The fitness function indicates the superiority or inferiority of an individual. The definition of the fitness function differs from problem to problem.
(4) Genetic Operator Design. Genetic operators include three operators: selection, crossover, and mutation. The selection operation retains individuals with high fitness, and the roulette wheel strategy is commonly used for it. Under the roulette wheel strategy, each chromosome receives a survival probability equal to its share of the total fitness of the population, and this probability decides whether the chromosome is inherited by the next generation. The survival probability is shown in Formula (1).

$$P(X_i) = \frac{f(X_i)}{\sum_{j=1}^{N} f(X_j)} \tag{1}$$

where $f(X_i)$ is the fitness of the $i$th chromosome $X_i$ and $N$ is the number of chromosomes in the population. The crossover operation is the most important genetic operation in GA. It exchanges genes between two chromosomes, generating two new chromosomes. The mutation operation randomly selects a chromosome in the population and, with a certain probability, randomly changes the value of one of its genes. The crossover and mutation operations are shown in Figure 1, and the genetic operation flow is shown in Figure 2.
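As an illustration of the three genetic operators, the following Python sketch implements fitness-proportional (roulette wheel) selection, crossover, and mutation on binary chromosomes. The function names are hypothetical, and two details are assumptions made for the example because the text does not fix them: crossover is shown as single-point crossover, and mutation flips each gene independently.

```python
import random

def survival_probabilities(fitnesses):
    """Share of each chromosome's fitness in the total fitness of the population."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

def roulette_select(population, fitnesses, rng):
    """Draw a new population; each draw picks a chromosome with probability proportional to fitness."""
    return rng.choices(population, weights=fitnesses, k=len(population))

def crossover(a, b, rng):
    """Single-point crossover (assumed variant): swap the tails of two chromosomes."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chromosome, rate, rng):
    """Flip each binary gene independently with probability `rate` (assumed variant)."""
    return [1 - gene if rng.random() < rate else gene for gene in chromosome]
```

For example, `survival_probabilities([1, 1, 2])` assigns the third chromosome half of the wheel, so it is expected to be drawn about twice as often as either of the others.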

Random Forest.
Random Forest is an ensemble supervised machine learning algorithm, which was first proposed by Leo Breiman [26]. Its classification performance is better than that of single classifier models in most cases, and it can handle both binary and multiclass classification problems. The main idea of RF is to use random sampling with replacement to construct multiple decision trees and to obtain the final result by voting. The process of constructing an RF is as follows.
(1) Use random sampling with replacement to extract samples from the dataset and obtain a training subset.
(2) For the training subset, a fixed number of features are randomly extracted from the feature set without replacement as the basis for splitting each node in the decision tree. Starting from the root node, a complete decision tree is generated from top to bottom.
(3) K decision trees are generated by executing (1) and (2) repeatedly K times. The RF classifier is obtained by combining these decision trees, and the classification result is decided by the vote of these decision trees.
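The construction process can be illustrated with a deliberately small Python sketch. To stay short it uses one-level decision stumps as stand-ins for the complete decision trees of step (2), so it shows only the bootstrap sampling, the random feature subsets, and the majority vote; all names and the stump learner are assumptions made for this example.

```python
import random
from collections import Counter

def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, features):
    """Pick the (feature, threshold) split with the fewest training errors among `features`."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] < t]
            right = [lab for row, lab in zip(X, y) if row[f] >= t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:                        # no valid split: always predict the majority class
        lab = majority(y)
        return (0, float("-inf"), lab, lab)
    return best[1:]

def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] < t else r_lab

def fit_forest(X, y, n_trees, n_features, rng):
    """Steps (1)-(3): bootstrap the records, draw a feature subset, and fit one learner per round."""
    forest, n = [], len(X)
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]           # sampling with replacement
        feats = rng.sample(range(len(X[0])), n_features)     # features without replacement
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Final result by majority vote over all learners."""
    return majority([predict_stump(stump, row) for stump in forest])
```

In practice a library implementation with full decision trees would be used; the sketch is only meant to make the roles of the two sources of randomness and of the vote concrete.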

Proposed DO_IDS
In the network, normal user behavior is far more frequent than anomalous behavior, which makes the data distribution of normal and anomalous behaviors unbalanced. To enhance the detection performance of IDS, a hybrid data optimization method based on multiple machine learning algorithms is proposed in this paper. The data optimization method consists of two parts: data sampling and feature selection.
(i) Data sampling: in this part, the iForest outlier detection method is used to sample the data, GA is used to optimize the sampling ratio globally, and the classification performance of RF on the candidate sampled data is used as the evaluation indicator. The purpose of data sampling is to search for the optimal training dataset and reduce the imbalance of the dataset.
(ii) Feature selection: in this part, GA integrated with RF is used to select features. As in data sampling, GA is used as the search strategy to specify candidate feature subsets, and the classification performance of RF is used as the evaluation indicator of each candidate feature subset. The purpose of feature selection is to find the feature subset that maximizes detection performance.
Once the optimal training dataset and the optimal feature subset are selected, they are passed to the classifier training phase, which employs the RF algorithm. The whole process is shown in Figure 3.

Data Sampling.
The purpose of data sampling is to delete outliers in the data and reduce their negative impact on detection performance. In this paper, iForest, which can detect outliers quickly and effectively [27], is used to detect and delete outliers in network data at a given ratio, and the remaining data form the sampled data. To determine the best sampling ratio of each category, GA is used to optimize the sampling ratio of each category, and the performance of RF classification is used to evaluate the candidate sampled data.
The description of data sampling in detail is as follows.
In classification problems, the fitness function is usually set to the accuracy of the classifier. In this paper, the fitness function is set to the F1 score, which is the harmonic mean of precision and recall and therefore takes both into account. The F1 score is calculated as follows.

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Here, True Positive (TP) is the number of actual anomalous records classified as anomalous, True Negative (TN) is the number of actual normal records classified as normal, False Positive (FP) is the number of actual normal records classified as anomalous, and False Negative (FN) is the number of actual anomalous records classified as normal. The confusion matrix is shown in Table 1.
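As a small worked example of the fitness measure, the following Python sketch computes precision, recall, and the F1 score from the confusion-matrix counts defined above. The function name and the convention of returning 0 when a denominator is zero are assumptions made for the example.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 8 anomalies detected, 2 false alarms, 8 anomalies missed
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

With these counts the precision is 0.8 and the recall is 0.5, so the F1 score is 8/13, closer to the lower of the two values. A classifier that labels every record as normal has TP = 0 and therefore an F1 score of 0, however high its accuracy, which is one reason to prefer the F1 score over accuracy as a fitness function on unbalanced data.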
For the genetic operators in the data sampling stage, the crossover operation means that the genes at the same position of any two chromosomes are exchanged with a certain probability. The mutation operation means changing a gene of a chromosome by adding or subtracting 0.1 with a certain probability. The roulette wheel is applied as the selection function.
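The algorithm for this stage is given in Algorithm 1, and Algorithm 2 gives the calculation of chromosome fitness in the data sampling stage. In Algorithm 1, the best chromosome is the chromosome with the highest fitness in the final population, the outlier set is the set of outliers detected by iForest, and the new training dataset is the optimal training dataset obtained in data sampling.
The first step is to randomly generate a population P composed of N chromosomes. To obtain the next generation, GA is applied to population P. First, the selection operation is performed to retain the optimal individuals, and the fitness of each chromosome is calculated. Second, two chromosomes are randomly paired to perform the crossover operation with the crossover probability and the mutation operation with the mutation probability. In this way, a new population is obtained. Finally, the above process is repeated until the termination condition is reached, which yields the best chromosome. Performing iForest on the training dataset according to the best chromosome yields the outlier dataset. The optimal training dataset is then obtained by deleting the outlier dataset from the training dataset.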
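The search loop described above can be sketched as follows. This is a hedged outline, not the authors' code: the function names are hypothetical, `fitness` is a pluggable stand-in (in the paper it would be the F1 score of an RF trained on the data left after iForest removes outliers at the ratios encoded by the chromosome), initial genes are drawn from multiples of 0.1, genes are clamped to [0, 1], and the population size is assumed to be even. The default values of N, G, and the two probabilities follow the experimental settings reported later in the paper.

```python
import random

def swap_genes(a, b, p_cross, rng):
    """Exchange the genes at the same position of two chromosomes with probability p_cross."""
    a, b = list(a), list(b)
    for i in range(len(a)):
        if rng.random() < p_cross:
            a[i], b[i] = b[i], a[i]
    return a, b

def shift_genes(chromosome, p_mut, rng, step=0.1):
    """Add or subtract `step` from a gene with probability p_mut, clamped to [0, 1] (assumption)."""
    out = []
    for gene in chromosome:
        if rng.random() < p_mut:
            gene = min(1.0, max(0.0, gene + rng.choice((-step, step))))
        out.append(round(gene, 2))
    return out

def ga_search(fitness, n_genes, pop_size=100, generations=50,
              p_cross=0.5, p_mut=0.1, seed=0):
    """Return the fittest chromosome of the final population and that population."""
    rng = random.Random(seed)
    # One gene per class: the ratio of outliers to remove from that class.
    pop = [[rng.randrange(0, 6) / 10 for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(c) for c in pop]                      # must be positive
        parents = rng.choices(pop, weights=scores, k=pop_size)  # roulette wheel selection
        nxt = []
        for a, b in zip(parents[0::2], parents[1::2]):
            c1, c2 = swap_genes(a, b, p_cross, rng)
            nxt += [shift_genes(c1, p_mut, rng), shift_genes(c2, p_mut, rng)]
        pop = nxt
    return max(pop, key=fitness), pop
```

A toy fitness such as `lambda c: 1.0 / (1.0 + sum(c))` is enough to exercise the loop. With the real fitness, each evaluation trains and tests a classifier, which is why the cost of this stage grows with both the population size and the number of generations.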

Feature Selection.
In intrusion detection research, redundant features can degrade detection performance, so more and more researchers focus on feature selection [2,16,18,20]. The process of feature selection in this paper is similar to that of data sampling. The difference lies mainly in chromosome design and mutation. In data sampling, the chromosome has one gene for each class in the dataset, and each gene is a floating-point number representing the ratio of outliers to be eliminated. In feature selection, the chromosome is a binary string $X = \{x_1, x_2, \ldots, x_n\}$, $x_i \in \{0, 1\}$, $1 \le i \le n$, where $n$ is the number of features, $x_i = 1$ indicates that the $i$th feature is selected, and $x_i = 0$ indicates that it is not. The detailed steps are shown in Algorithm 3, and Algorithm 4 gives the calculation of chromosome fitness in the feature selection stage.
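To make the binary chromosome concrete, the short Python sketch below converts a chromosome into the indices of the selected features and projects a dataset onto them, which is the operation needed before a candidate feature subset can be evaluated. The function names and the row-of-lists data layout are assumptions made for the example; indices are 0-based, so index 0 corresponds to the first feature.

```python
def selected_features(chromosome):
    """Indices (0-based) of the features whose gene equals 1."""
    return [i for i, gene in enumerate(chromosome) if gene == 1]

def project(rows, chromosome):
    """Keep only the selected feature columns of every record."""
    cols = selected_features(chromosome)
    return [[row[c] for c in cols] for row in rows]

chromosome = [1, 0, 1, 1, 0]                 # features 1, 3, and 4 are selected
reduced = project([[10, 20, 30, 40, 50]], chromosome)
```

Here `reduced` keeps the first, third, and fourth values of the record and drops the other two. Mutation on such a chromosome amounts to flipping a bit, in contrast to the addition or subtraction of 0.1 used for the sampling-ratio chromosome.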

Classifier Training.
Data sampling and feature selection yield the optimal training dataset and the optimal feature subset. Dimension reduction is performed on the optimal training set according to the optimal feature subset. Because the RF classifier can handle multiclass classification problems [28], the classes of anomalous behaviors can be identified further. Assume that normal behaviors form one class and that there are k classes of anomalous behaviors; the whole network dataset is then composed of k + 1 classes. For each class, the data sampling and feature selection methods are used to obtain the optimal training dataset and the optimal feature subset, so there are k + 1 classifiers, one for each class, each trained on its corresponding data. The final classification is decided by the vote of the k + 1 classifiers. The parameters used in the algorithm are set empirically as follows.

Experimental Results and Analysis
In the genetic algorithm, the initial population size is N = 100, the crossover probability is 0.5, the mutation probability is 0.1, and the termination condition (the number of generations) is G = 50. In data optimization, for efficiency, the numbers of component trees of iForest and RF are both set to 10. In classifier training, the number of decision trees of RF is set to 200.
The UNSW-NB15 dataset was recently created by the cyber security research group at the Australian Centre for Cyber Security (ACCS) [29]. The parameters of the dataset are shown in Table 2, and the feature description is shown in Table 3.

Experimental Results.
The optimal sampling ratio of each class obtained during data sampling is shown in Table 4. The data volumes of Analysis, Backdoor, Shellcode, and Worms are too small for sampling, so these classes are not sampled. Table 5 and Figure 4 show the optimal feature subset of each class of anomalous behaviors. The Normal class has the largest optimal feature subset, with 26 features, and Worms has the smallest, with 13. Compared with the total of 42 features, all classes achieve considerable dimensionality reduction. Figure 5 shows how many times each feature is selected. The 5th feature is selected most often; all classes except "Backdoor" regard it as an important feature. Table 6 shows the confusion matrix of all classes over the UNSW-NB15 dataset using DO_IDS. To verify the effectiveness of the data optimization proposed in this paper, the precision, recall, and F1 score obtained by testing the proposed model are shown in Table 7 and compared with those of the simple RF classifier without data sampling and feature selection. Except for a slight decrease in the precision of Worms and DoS and in the recall of Exploits and Shellcode, the precision and recall of the other classes improve significantly, especially for the anomalous behaviors with fewer records, such as Analysis, Backdoor, Shellcode, and Worms. DO_IDS therefore achieves good performance in detecting anomalous network behavior with unbalanced data distribution.

Comparison with Other Methods.
Table 8 shows the comparison of accuracy and false alarm rate (FAR) of all classes between simple RF and DO_IDS. FAR refers to the proportion of anomalous behaviors classified as normal among all anomalous behaviors. In IDS research, FAR is an important evaluation indicator, because the number of normal behaviors in network data far exceeds the number of anomalous behaviors; even if all network data are classified as normal behavior, the accuracy can reach a high level. As Table 8 shows, both simple RF and DO_IDS have high classification accuracy in each class, but the FAR of DO_IDS is clearly better than that of simple RF.
The comparison in Table 10 shows that the performances of DT, RUSBoost, and AdaBoost are close to that of RF, so we further applied data optimization to these four algorithms to see which performs best in combination with data optimization; the results are given in Table 11. Table 11 shows that DO_IDS, that is, applying RF as the final classifier, is better overall.

Conclusion
In this paper, we have proposed a data optimization method to build an IDS, named DO_IDS. The data optimization consists of two parts: data sampling and feature selection. In data sampling, iForest is used to sample the data, and the integration of GA and RF is used to optimize the sampling ratio. In feature selection, the integration of GA and RF is used again to select the optimal feature subset. Classification is performed by using RF to build the IDS. DO_IDS has been evaluated on the intrusion detection dataset UNSW-NB15.
DO_IDS is an RF-classifier-based algorithm with data optimization. The experimental comparison shows that DO_IDS performs much better than the RF classifier on all the indicators selected in the paper, which indicates the advantage of data optimization in DO_IDS. The comparison with traditional machine learning methods also shows that the RF classifier is a much stronger classifier, so the combined effect of data optimization and the RF classifier makes DO_IDS almost always the best among all methods, especially in detecting the anomalous behaviors with fewer records, such as DoS, Analysis, Backdoor, and Worms. However, there is still room for improvement, such as the high time cost of the data optimization stage and the lack of support for online processing.
As future work, since the proposed data optimization can effectively reduce the impact of unbalanced sample distribution on IDS and has shown encouraging performance, it could be applied to other anomaly detection fields, such as fraud detection. In addition, because training the classifiers takes a lot of time, the search strategy could be further optimized.

Experimental Settings and Dataset Description.
Experiments are performed on a PC with an Intel(R) Core(TM) i5-4460 CPU at 3.6 GHz and 8 GB of memory, running Windows 10. Programs are coded in Python using the PyCharm 2017 environment on Anaconda3.

Figure 4: Optimal feature subset for each class of anomalous behaviors.
Algorithm 1: Data sampling.
Input: Original training dataset, testing dataset
Output: New training dataset
    Generate initial population P
    While not reach terminating condition
        Calculate Fitness Sample (training dataset, testing dataset, P)
        Selection()
        /* the probabilities of crossover and mutation are passed to the operators below */
        Crossover (P, crossover probability)
        Mutation (P, mutation probability)
    End While
    Best chromosome = chromosome with the highest fitness in P
    Outlier set = iForest (training dataset, best chromosome)
    New training dataset = training dataset - outlier set
End

Algorithm 2: Calculation of chromosome fitness in the data sampling stage.
Input: Original dataset (training dataset, testing dataset), chromosome population P = {X_1, ..., X_N}
Output: Fitness set {f(X_1), f(X_2), ..., f(X_N)}
    for i = 1:N
        Sampled dataset = training dataset - iForest (training dataset, X_i)
        Train Random Forest classifier by the sampled dataset
        Test the classifier based on the testing dataset and get classification

Algorithm 3: GA_RF_FS().
Input: New training dataset, testing dataset
Output: Optimal feature subset
    Generate initial population P = {X_1, X_2, ..., X_N}
    While not reach terminating condition
        Calculate fitness (new training dataset, testing dataset, P)
        Selection()
        /* the probabilities of crossover and mutation are passed to the operators below */
        Crossover (P, crossover probability)
        Mutation (P, mutation probability)
    End While
    Best chromosome = chromosome with the highest fitness in P
    Optimal feature subset = convert best chromosome to feature numbers
End

Algorithm 4: Calculation of chromosome fitness in the feature selection stage.
Input: New training dataset, testing dataset, chromosome population P = {X_1, ..., X_N}
Output: Fitness set {f(X_1), f(X_2), ..., f(X_N)}
    for i = 1:N
        Extract data from the new training dataset and the testing dataset based on X_i and get the reduced training and testing datasets
The UNSW-NB15 dataset contains 2,540,044 records with 42 attributes and is divided into a training set and a testing set. The training set contains 175,341 records, while the testing set contains 82,332 records.

Table 4: Optimal sampling ratio for each class of anomalous behaviors.

Table 5: Optimal feature subset for each class of anomalous behaviors.

Table 6: Confusion matrix of all classes over the UNSW-NB15 dataset using DO_IDS.