Machine-Learning Approach to Optimize SMOTE Ratio in Class Imbalance Dataset for Intrusion Detection

The KDD CUP 1999 intrusion detection dataset was introduced at the third international knowledge discovery and data mining tools competition, and it has been widely used in many studies. The attack types of the KDD CUP 1999 dataset are divided into four categories: user to root (U2R), remote to local (R2L), denial of service (DoS), and Probe. We use five classes by adding the normal class. We define the U2R, R2L, and Probe classes, each of which accounts for less than 1% of the total dataset, as rare classes. In this study, we attempt to mitigate the class imbalance of the dataset. Using the synthetic minority oversampling technique (SMOTE), we attempted to optimize the SMOTE ratios for the rare classes (U2R, R2L, and Probe). After randomly generating a number of tuples of SMOTE ratios, these tuples were used to create a numerical model for optimizing the SMOTE ratios of the rare classes. Support vector regression was used to create the model. We assigned each instance in the test dataset to the model and chose the best SMOTE ratios. The experiments using machine-learning techniques were conducted using the best ratios. The results using the proposed method were significantly better than those of the previous approach and other related work.


Introduction
The early IDS (intrusion detection system) [1] is divided into the host-based IDS (HIDS) and the network-based IDS (NIDS). HIDS has the advantage of analyzing system logs and resource usage information by host and user. However, installing an IDS on each host increases the number of management points and consumes more resources. If network-level packet analysis is not possible and the attacker takes control of the system, the IDS may be interrupted. NIDS has the advantages that it does not require installing an IDS on each host and that it can perform analysis at the level of the entire network. However, it has the disadvantage that it can confirm only attacks that pass through the IDS, and it is difficult to confirm attack attempts at the system level. In early 2003, the IDS was losing the trust of users because it generated false positives. The causes of false positives include erroneous rules, traffic irregularities, and the limitations of pattern matching tests. Even though this problem has not been solved to date, "pattern matching" is still used as a basis for security solutions.
Intrusion detection techniques [2] are divided into misuse detection and anomaly detection. In misuse detection, observed traffic is compared with existing signatures in a database to determine whether it is an intrusion. While misuse detection detects only known attacks, anomaly detection can detect a new type of attack whose pattern differs from both normal traffic and the known attack types.
Many researchers have studied intrusion detection. In general, researchers have attempted to distinguish the normal class from attack classes using publicly available intrusion detection evaluation datasets and to identify the exact attack type. However, the classification of rare classes in a huge real-time dataset requires a long computation time, which makes it difficult to achieve good efficiency. Improving classification performance by adjusting the class ratio requires creating and testing many experimental datasets.
In this paper, we present a novel method that optimally adjusts the SMOTE [3] ratios for rare classes. The number of possible tuples of SMOTE ratios is too large to test every case. For that reason, we propose the following efficient method. We randomly generated some tuples of SMOTE ratios and used these tuples to create a model using support vector regression (SVR) [4]. We then input a large number of tuples of SMOTE ratios into the SVR model and chose the best tuple. Experimental results using the proposed method were significantly better than those of the previous approach [5]. The contributions of the proposed method are as follows. We show how to find SMOTE ratios that perform well with very few tests. Hence, we dramatically reduce the amount of computation required to find the best SMOTE ratios. We believe that the proposed method is helpful for the study of class imbalance.
The remainder of this paper is organized as follows. Section 2 explains the related work on the KDD CUP 1999 dataset [6] and class imbalance. In Section 3, we present the background of this research. In Section 4, we propose a new method that creates a numerical model using sampled SMOTE ratios. In Section 5, we explain our experimental environments, procedures, and results. The paper ends with our concluding remarks in Section 6.

KDD Dataset.
Leung and Leckie [7] studied anomaly detection using unsupervised learning algorithms on the KDD CUP 1999 intrusion detection dataset. These researchers proposed density-based clustering and grid-based clustering algorithms. In density-based clustering, a cluster includes a minimum number of data points. The approach has the advantage of filtering outliers and finding clusters with arbitrary shapes. In the grid-based approach, all clustering operations are conducted on a grid structure. The method has the advantage of a fast computing speed. With the method, a classifier can learn from unlabeled data and detect new types of attacks that were previously unseen. The experimental results showed that the accuracy of their method is similar to that of existing methods and that the method has several advantages in terms of computational complexity.
Meng [8] studied machine-learning techniques for intrusion detection on the KDD CUP 1999 dataset. There have been many studies using popular methods, such as artificial neural networks, SVM [9], and decision trees. However, these methods were rarely used in large-scale real intrusion detection systems. This researcher aimed at practical anomaly detection and conducted a comparative study of artificial neural networks, SVMs, and decision trees using the same environment as previous studies. In the analysis of the experimental results, the intrusion detection system with machine-learning techniques showed a high dependency on the test environment, and this researcher concluded that it was important to find a suitable method for applying machine-learning techniques to real environments.
Davis and Clark [10] reviewed the data preprocessing techniques used in anomaly-based network intrusion detection systems. The research focused on network traffic analysis and feature extraction/selection. Most studies on NIDS dealt with the TCP/IP packet headers of network traffic. Time-based statistics can be derived from the headers to detect network scans, network worms, and DoS attacks. More recently, full service responses have been analyzed to detect attacks targeting clients. This review focuses on which attack classes can be detected by the reviewed methods. It shows a trend toward scrutinizing packets to extract or select the most relevant features through targeted content parsing. These context-sensitive features are required to detect network attacks.
Staudemeyer and Omlin [11] used a long short-term memory recurrent neural network (LSTM-RNN) to evaluate the classification performance using the KDD CUP 1999 dataset. LSTM networks can learn "memory" and create a model with time series data. The LSTM is trained and tested on their modified KDD CUP 1999 dataset. The LSTM network structure and parameters were obtained through experiments. Several performance measures were used to analyze experimental results. Their results showed that LSTM-RNN can learn all the unknown attack classes in the training dataset. Furthermore, they found that both receiver operating characteristic (ROC) curves and area under the curve (AUC) were well suited for evaluating LSTM-RNN.
Kim et al. [12] proposed a system-call-language-modeling method based on LSTM for designing an anomaly-based host intrusion detection system. These researchers used an ensemble method to address the high false-alarm rates that were common in conventional intrusion detection systems. The method can effectively learn the semantic meaning and interactions of each system call, which existing methods cannot handle. These researchers demonstrated the validity and effectiveness of their method through several tests on publicly available benchmark datasets, and their method has the advantage that it is easy to port to other systems.
Kim et al. [13] investigated artificial intelligence intrusion detection systems that used a deep neural network (DNN) and conducted experiments on the KDD CUP 1999 dataset. Data preprocessing, such as data transformation and normalization, was conducted to input the dataset into the DNN model. When a learning model was created, the DNN was used for data refinement. The full dataset was used to verify the learning model. Performance measures, such as the accuracy, detection rate, and false-positive rate, were used to verify the detection efficiency of the DNN model, and the model showed good performance for intrusion detection.
Le et al. [14] studied deep-learning algorithms to address the high false-positive rates of machine-learning techniques (such as SVM and k-NN) in intrusion detection systems. They examined six optimizers applicable to the LSTM-RNN model to find the one best suited for intrusion detection systems. The LSTM results using the Nadam optimizer were better than previous approaches, with an accuracy of 97.54%, a detection rate of 98.95%, and a false-positive rate of 9.98%. In Table 1, the studies related to intrusion detection are summarized.
Seo [5] tried to adjust the class imbalance of the training data to detect attacks in the KDD 1999 intrusion dataset. He tested machine-learning algorithms to find efficient SMOTE ratios for rare classes such as U2R, R2L, and Probe. He aimed to improve classification performance, focusing on the detection of rare classes. The numbers of instances of the rare classes in the training data were increased by 12, 9, and 1.5 times, respectively. The recall metrics of the k-NN tests were increased to 0.11 in the U2R class and 0.02 in the R2L class. The metrics of the SVM tests were increased to 0.02 in the U2R class and 0.08 in the R2L class, and those of the decision tree tests were increased to 0.25.

Class Imbalance.
Japkowicz [15] observed that most previously designed concept-learning systems assume that a training dataset is generally well balanced. This assumption is not necessarily correct. In practice, most instances often represent one class, and only a small number of instances represent the other ones. The author tried to demonstrate experimentally that class imbalance degrades the performance of standard classifiers and compared the performance of several methods that were previously proposed by other researchers.
Japkowicz and Stephen [16] studied class imbalance, which has been reported to degrade the performance of some standard classifiers. They conducted a systematic study addressing the following three questions. First, they attempted to understand the roles of concept complexity, the size of the training set, and the class imbalance level. Second, they discussed several basic resampling or cost-modifying methods to compare the efficiency of previously proposed solutions to the class imbalance problem. Finally, they conducted studies under the assumption that the class imbalance problem also affects other classification systems, such as decision trees, neural networks, and SVMs.
Chawla et al. [17] studied the SMOTEBoost algorithm. In data mining, most datasets have the class imbalance problem, and data mining tools learn from imbalanced datasets. A classifier that learns from a minority class with very few instances tends to be biased towards high accuracy in the prediction of the majority class. SMOTE is used in the design of classifiers that are trained on imbalanced datasets. They presented a new approach to learning from imbalanced datasets by combining the SMOTE algorithm and the boosting procedure. Unlike standard boosting, in which the same weight is given to all misclassified examples, SMOTEBoost generates synthetic examples from minority classes. SMOTEBoost indirectly changes the weights by updating and compensating for the skewed distribution. In experiments with SMOTEBoost applied to several datasets with a high or moderate class imbalance, the classification performance for the minority class and the overall F-measure were improved.
Drummond and Holte [18] examined two commonly used sampling methods for applying machine learning to problems with imbalanced classes and misclassification costs. They adopted a performance analysis technique called cost curves to explore the interaction of oversampling and undersampling with the decision tree classifier C4.5. They showed that applying C4.5 with undersampling could establish a reasonable standard for comparing algorithms. However, it is recommended that the cheapest cost classifier become a part of the standard since it can be better than undersampling for relatively modest costs. Oversampling has little influence on the sensitivity, and the misclassification costs have no significant effect on performance.
Zhou and Liu [19] demonstrated the effect of sampling and threshold-moving in training cost-sensitive neural networks. Both oversampling and undersampling were considered. These techniques modified the distribution of training data so that the costs of the instances were explicitly conveyed by the appearances of the instances. Threshold-moving moves the output threshold towards inexpensive classes to improve classification performance. The hard-ensemble and soft-ensemble are used for the experiments. In hard-ensembles and soft-ensembles, all classifiers vote on each class and return the class that receives the most votes. The difference between the two ensembles is that the hard-ensemble uses binary votes and the soft-ensemble uses real-value votes. Twenty-one UCI datasets and actual datasets were used in their experiments. The experimental results showed that as the number of classes increases, the degree of class imbalance worsens and the efficiency of classification deteriorates. Threshold-moving and the soft-ensemble showed relatively good performance in training cost-sensitive neural networks.
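The difference between binary and real-valued votes can be illustrated with a small sketch. It is a generic illustration rather than the cost-sensitive ensembles of [19]: the function names `hard_vote` and `soft_vote` and the example outputs are hypothetical, and each classifier is assumed to return one real-valued score per class.

```python
def hard_vote(outputs):
    """Each classifier casts one binary vote for its top-scoring class."""
    n_classes = len(outputs[0])
    votes = [0] * n_classes
    for scores in outputs:
        votes[max(range(n_classes), key=scores.__getitem__)] += 1
    return max(range(n_classes), key=votes.__getitem__)


def soft_vote(outputs):
    """Each classifier contributes its real-valued scores; the sums decide."""
    n_classes = len(outputs[0])
    totals = [sum(scores[c] for scores in outputs) for c in range(n_classes)]
    return max(range(n_classes), key=totals.__getitem__)


# Three hypothetical classifiers scoring two classes.
outputs = [[0.55, 0.45], [0.60, 0.40], [0.10, 0.90]]
print(hard_vote(outputs))  # -> 0 (two of three classifiers prefer class 0)
print(soft_vote(outputs))  # -> 1 (summed scores are 1.25 vs 1.75)
```

In this example the two schemes disagree because one confident classifier outweighs two marginal ones under real-valued voting.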
Liu et al. [9] used undersampling to solve the class imbalance problem. Undersampling is a very effective method to mitigate class imbalance using only a subset of the majority class. The disadvantage of the method is that many instances of the majority class are ignored. They presented two algorithms to overcome this drawback. First, the EasyEnsemble algorithm samples several subsets from the majority class, trains a learner using each subset, and then combines the outputs of the learners. EasyEnsemble internally uses the AdaBoost ensemble. Second, the BalanceCascade algorithm trains learners in sequence. At each step, instances of the majority class that are correctly classified by the currently trained learners are removed from further consideration. The experimental results showed that both methods produce better solutions than conventional class imbalance methods.
Burez and Van den Poel [20] attempted to solve the class imbalance problem to predict customer churn. Customer churn occurs when a customer changes service provider. Customer churn is a highly rare event in the service industry, but it is a notably interesting and informative research area. However, the class imbalance problem in the context of data mining had not received considerable attention until recently. They studied how class imbalance can be better handled in churn prediction. They conducted studies to improve the performance of random sampling and undersampling with appropriate evaluation metrics, such as AUC and lift. They studied the performance of both random and advanced undersampling and compared the specific modeling techniques of gradient boosting and weighted random forests with some standard techniques. In their experiment, the use of undersampling improved the prediction accuracy and the AUC values.

Table 1: Studies related to intrusion detection.
Authors                       Year   Method
Leung and Leckie [7]          2005   Density-based and grid-based clustering
Meng [8]                      2011   SVM, neural networks, and decision tree
Davis and Clark [10]          2011   Data preprocessing
Staudemeyer and Omlin [11]    2013   LSTM-RNN
Kim et al. [12]               2016   LSTM and ensemble
Kim et al. [13]               2017   DNN
Le et al. [14]                2017   DNN
Seo [5]                       2017   SVM, k-NN, and decision tree

Computational Intelligence and Neuroscience

Seiffert et al. [21] stated that class imbalance was a common problem in various applications. Several techniques had been used to mitigate class imbalance problems. They used a hybrid sampling/boosting algorithm called RUSBoost to train on skewed training datasets. The algorithm was a simpler and faster alternative to SMOTEBoost. They evaluated the performance of RUSBoost, SMOTEBoost, random undersampling, SMOTE, and AdaBoost. They chose fifteen datasets from various applications and then conducted experiments with four learners (C4.5D, C4.5N, naive Bayes (NB), and repeated incremental pruning to produce error reduction (RIPPER)) over four evaluation metrics. Both RUSBoost and SMOTEBoost were better than the other methods, and RUSBoost performed equal to or better than SMOTEBoost.
Horng et al. [22] proposed an SVM-based intrusion detection system. The system combines a hierarchical clustering algorithm, a simple feature selection procedure, and an SVM technique. The clustering algorithm provided the SVM with fewer, abstracted, and higher-quality training instances. It was able to shorten the training time and improve the performance of the resultant SVM. The obtained SVM model could classify the network traffic data more accurately through the simple feature selection procedure. The KDD Cup 1999 dataset was used to evaluate the proposed system. Compared with other intrusion detection systems that are based on the same dataset, this system showed better performance in the detection of DoS and Probe attacks, and the best performance in overall accuracy. In Table 2, the studies related to class imbalance are summarized.

KDD Dataset.
The KDD CUP 1999 dataset [6] used in our experiments is a modification of data generated by the DARPA (Defense Advanced Research Projects Agency) intrusion detection evaluation program in 1998. The DARPA dataset is intercepted data that contain a wide range of attacks generated in a military network environment. The dataset has greatly contributed to the investigation and evaluation of intrusion detection. The dataset has been prepared and managed by MIT's Lincoln laboratory. In 1999, the modified DARPA dataset was used in the KDD CUP 1999 intrusion detection competition. MIT's Lincoln laboratory set up an experimental environment similar to a typical U.S. Air Force LAN (local area network). Raw TCP dump data were generated over nine weeks. As in a real Air Force environment, the LAN was operated and various attacks were executed. However, the dataset had the disadvantage that it lacked the noise of real data. Nevertheless, the KDD CUP 1999 dataset served as a testbed to overcome the vulnerabilities of signature-based IDSs in detecting new attack types and attracted the attention of many researchers. The KDD CUP 1999 dataset is the most widely used dataset for the evaluation of such systems. There are many previous approaches using the dataset, so it is possible to compare them with a new method. Table 3 lists the files in the KDD CUP 1999 dataset and their details. The files "kddcup.data_10_percent.gz" and "corrected.gz" are used as training data and test data, respectively. The training data are compressed binary TCP dump data collected over approximately seven weeks with approximately 5 million connection records. The testing data were collected over approximately two weeks and are composed of approximately 2 million connection records. A connection record is a collection of TCP packets flowing from a source IP to a destination IP, and each record is classified into the normal class or an attack class. A connection record belonging to an attack class is labeled with exactly one specific attack type. The size of each connection record is approximately 100 bytes. Attack types are categorized into four classes (DoS, R2L, U2R, and Probe), as shown in Table 4.

SMOTE: Synthetic Minority Oversampling Technique.
SMOTE [3] is a method of generating new instances using existing ones from a rare or minority class. First, we identify the k-nearest neighbors of a sample within a class that has a small number of instances and calculate the differences between the sample and these k neighbors. We multiply the differences by an arbitrary value between 0 and 1 and get a resultant value. Next, an instance that is generated using the resultant value is added to the training data. As a result, SMOTE works by adding points that are slightly shifted from existing instances toward their neighbors. In terms of increasing the number of instances in rare classes, SMOTE is similar to random oversampling. However, it does not regenerate the same instance. It creates a new instance by appropriately
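The interpolation step described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `smote_oversample`, its parameters, and the toy points are hypothetical, and Euclidean distance is assumed for the neighbor search.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating toward nearest neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbors of the base sample inside the minority class.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Multiply the difference by a random value between 0 and 1.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class with four two-dimensional instances (made-up values).
rare = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_oversample(rare, n_new=8, k=2)
print(len(new_points))  # -> 8
```

Every synthetic point lies on the segment between an existing instance and one of its neighbors, so no instance is duplicated exactly. If a SMOTE ratio r is read as growing a class to r times its size (an assumption about the convention), `n_new` would be roughly (r - 1) times the number of original instances.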

Problem Definition.
We attempt to maximize the classification performance on the KDD CUP 1999 intrusion detection dataset, which has a severe class imbalance. Therefore, data preprocessing that adjusts the class ratio is required to alleviate the imbalance. The class imbalance can be adjusted using undersampling, oversampling, and SMOTE techniques. We use the SMOTE technique. Ideally, all tuples of SMOTE ratios would be tested to optimize the ratio of each class. However, time and cost constraints make it infeasible to conduct experiments on all cases. Therefore, we try to find the tuple of SMOTE ratios that shows the best performance by experimenting with only a few tuples of SMOTE ratios. Formula (1) gives the class imbalance ratio of each class. Figure 1 shows the structure of the dataset used in the proposed method. Table 5 shows the class imbalance ratios of Train A, Train B (the first half of Train A), Validation (the second half of Train A), and Test. Train A is the original training data. Train B and Validation in Table 5 are basically the same. Train B in Table 5 shows the instances after applying the SMOTE ratios in Table 6. We define the three classes U2R, R2L, and Probe as rare classes because they have relatively fewer instances than the other classes.
Label cardinality of D is the average number of labels of the examples in D:

Proposed Method.
We attempt to optimize the SMOTE ratios of rare classes to mitigate the class imbalance. It is difficult to test all tuples of SMOTE ratios in a short period of time. Therefore, we attempt to identify an efficient method that needs only a small number of experiments and reduces computation time.
We create an SVR model from a small number of experiments and try to obtain the best tuple of SMOTE ratios by inputting a sufficiently large number of tuples into the model. We also verify the results through experiments. The sample sizes of 100 and 1,000,000 used in the experiments were chosen by considering computation time, and the 100 instances are generated randomly from a uniform distribution. We use random sampling instead of grid sampling. If more than 100 instances could be used, grid sampling would be acceptable, but it is not appropriate for sampling very few instances uniformly. We set the ranges for the rare classes through preliminary experiments, as shown in Table 7.
We randomly generate 100 tuples of SMOTE ratios within the maximum ranges of Table 7. We conduct experiments by applying each of the 100 tuples and training an SVM classifier. As a result, each of the 100 tuples yields five recall values, one per class. An SVR model is created using the 100 tuples and the root mean square of their recall values. We randomly generate 1,000,000 tuples of SMOTE ratios and input them into the SVR model to derive the optimal solution. We conduct experiments to verify the quality of the best tuple.
Formula (2) represents the procedure of the proposed method. The method shows good performance with very few tests and significantly reduces the amount of computation required to find the best SMOTE ratios. Figure 2 represents its pseudocode.
The procedure of the proposed method is as follows:
(1) Set the ranges for the rare classes through preliminary experiments, as shown in Table 7. The ranges were searched by inputting successive values of 2^t, where t is a nonnegative integer.
(2) Randomly generate a few tuples of SMOTE ratios from a uniform distribution (independent variable).
(3) Obtain the recall metrics by giving the tuples to an SVM classifier, and calculate the RMS of these metrics (dependent variable).
(4) Create an SVR model [4] with the tuples and the RMS values.
(5) Find the best tuple among a large number of tuples, generated randomly from a uniform distribution, through the SVR model.
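The five steps above can be sketched as a small surrogate-based search. The sketch below is illustrative only and rests on several assumptions: the bounds in `RANGES` are hypothetical placeholders (the values of Table 7 are not reproduced here), `toy_recalls` is a made-up stand-in for the expensive step of applying SMOTE and training the SVM, and a simple RBF kernel smoother is swapped in for the SVR model so that the example needs only the standard library. A faithful version would replace `fit_smoother` with an SVR implementation such as the one in LibSVM.

```python
import math
import random

# Hypothetical (min, max) SMOTE ratio bounds for U2R, R2L, and Probe.
RANGES = [(1.0, 1024.0), (1.0, 512.0), (1.0, 8.0)]


def sample_tuple(rng):
    """Step (2): draw one tuple of SMOTE ratios from a uniform distribution."""
    return tuple(rng.uniform(lo, hi) for lo, hi in RANGES)


def rms(values):
    """Root mean square of the per-class recall values."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def toy_recalls(ratios):
    """Made-up stand-in for step (3): SMOTE, SVM training, recall per class."""
    u, r, p = (x / hi for x, (_, hi) in zip(ratios, RANGES))
    # Order: normal, U2R, R2L, DoS, Probe.
    return [0.99, 0.2 + 0.4 * u, 0.1 + 0.2 * r, 0.97, 0.99 - 0.05 * p]


def fit_smoother(tuples, targets, bandwidth=0.2):
    """Step (4) stand-in: RBF-weighted average of observed RMS values."""
    scaled = [[x / hi for x, (_, hi) in zip(t, RANGES)] for t in tuples]

    def predict(t):
        q = [x / hi for x, (_, hi) in zip(t, RANGES)]
        w = [math.exp(-math.dist(q, s) ** 2 / (2 * bandwidth ** 2)) for s in scaled]
        return sum(wi * yi for wi, yi in zip(w, targets)) / sum(w)

    return predict


def search_best_ratios(n_train=100, n_candidates=2000, seed=0):
    rng = random.Random(seed)
    train = [sample_tuple(rng) for _ in range(n_train)]
    targets = [rms(toy_recalls(t)) for t in train]
    predict = fit_smoother(train, targets)
    # Step (5): score many random candidates with the cheap surrogate.
    candidates = [sample_tuple(rng) for _ in range(n_candidates)]
    return max(candidates, key=predict)


best = search_best_ratios()
print(best)
```

Only the `n_train` tuples require the expensive classifier runs; the candidates are scored by the surrogate alone, which is why the number of candidates can be made very large (1,000,000 in the experiments, fewer here to keep the sketch fast).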
Figure 3 shows a hierarchy of the methods in LibSVM. Table 8 represents the time complexity of SVM. Table 9 shows the time complexity of the proposed methods.

Experiments
We randomly generate 100 tuples of SMOTE ratios and use the tuples to create an SVR model. We find the best tuple by inputting 1,000,000 randomly generated tuples of SMOTE ratios into the SVR model. The experimental results with the best tuple were improved by approximately 20 percent compared with the previous approach [5]. The SVR model was generated using only 100 tuples of SMOTE ratios. With the SVR model, the computation time was dramatically reduced and the tuple of SMOTE ratios with the highest efficiency was found.
Formula (3) gives the root mean square (RMS) of the recall values, which are the results of experiments with the 100 tuples of the SMOTE ratios of the U2R, R2L, and Probe classes. The 100 tuples are randomly generated within the ranges of Table 7. The variables N, U, R, D, and P denote the recall values of the normal, U2R, R2L, DoS, and Probe classes, respectively. Table 10 shows the parameters of SVR and SVM. Table 11 shows the parameters of RNN-LSTM. Table 12 shows the measures obtained by creating an SVR model using the 100 tuples of SMOTE ratios and the RMS. The correlation coefficient was more than 0.7, which indicates a strong positive linear relationship.
The RMSE was 0.006, which means that the difference between the expected value and the actual one is very small. Since the root relative squared error is a measure that compares the standard deviation of the actual values with the differences between the predicted and actual values, it is not a significant factor in evaluating the performance of the model. Table 13 shows the recall metrics of the experiments with the best tuple. The best tuple corresponds to 1,000 times for U2R, 451 times for R2L, and 1 time for Probe, as shown in Table 6. Table 6 shows the difference in SMOTE ratios between the proposed method and the previous one. The proposed method searches for an optimal solution among a large number of SMOTE ratios, whereas the previous one uses only fixed SMOTE ratios.
$$\mathrm{RMS} = \sqrt{\frac{N^2 + U^2 + R^2 + D^2 + P^2}{\text{The number of classes}}} \tag{3}$$
Figure 4 compares the recall metrics of the proposed method with those of the previous approach [5]. RNN-LSTM is slightly superior to the other methods. In the SVM tests, the performances of the U2R, R2L, and Probe classes were improved by approximately 22.6%, 58.9%, and 2.3%, respectively. Figure 5 represents SMOTE ratios of the U2R, [...] [22] compares the proposed methods with other work by the detection rate. Figure 6 shows a graph of the RMS of the results obtained by inputting 1,000,000 tuples of SMOTE ratios into the SVR model. The reason for defining the RMS of Formula (3) as the objective value is to make the recall values of rare classes well reflected in the experimental results. The RMS of the best tuple is about 0.979. Table 15 shows the recall values of previous work [5]. Tables 16 and 17 show the confusion matrices of SVM and RNN-LSTM, respectively. We conducted experiments with SVM and decision tree on the three dataset combinations of (Train B, Validation), (Train B, Test), and (Train A, Test). The results showed that SVM was better than the decision tree. Table 16 represents recall values of the previous methods, and SVM was superior to other work. The parameters and datasets of the proposed SVM test are identical to those of the previous one.
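As a small illustration of how recall values relate to a confusion matrix, the sketch below computes the per-class recall as the diagonal entry divided by the row total. It assumes that rows are true classes and columns are predicted classes, which may differ from the layout of Tables 16 and 17; the function name `per_class_recall` and the counts are hypothetical and are not taken from the experiments.

```python
def per_class_recall(confusion):
    """confusion[i][j] = number of instances of true class i predicted as class j."""
    recalls = []
    for i, row in enumerate(confusion):
        total = sum(row)
        # Recall of class i: correctly classified instances over all true instances.
        recalls.append(row[i] / total if total else 0.0)
    return recalls


# Made-up three-class example.
matrix = [
    [90, 10, 0],
    [5, 5, 0],
    [0, 2, 8],
]
print(per_class_recall(matrix))  # -> [0.9, 0.5, 0.8]
```

Because each class contributes one recall value regardless of its size, a rare class with poor recall lowers an RMS-style objective as much as a large class would.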

Conclusions
In this study, we have attempted to mitigate the problem of class imbalance in the KDD CUP 1999 intrusion detection dataset. As a result, we obtained the best SMOTE ratios of rare classes, reduced the number of experiments by creating an SVR model, and achieved a significant performance improvement over the previous approach [5]. The best SMOTE ratios of rare classes obtained by the SVR model were 1,000 times for U2R, 451 times for R2L, and 1 time for Probe. The recall values for rare classes were 0.615 for U2R with RNN-LSTM, 0.302 for R2L with SVM, and 0.997 for Probe with the decision tree. We proposed a new method to find the best SMOTE ratios efficiently with a small number of experiments. The proposed method dramatically reduced the number of class ratio adjustments that had to be tested. Therefore, the computation time required for the experiments could be shortened.
In the future, it will be meaningful to investigate how the test results change with the number of tuples of SMOTE ratios. Better SMOTE ratios may be identified using models created by other machine-learning techniques. We will also apply evolutionary computation or other metaheuristic algorithms to identify the best tuple.

Data Availability
The KDD CUP 1999 data used to support the findings of this study are available at http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.