Evolutionary Data Preprocessing to Alleviate Class Imbalance

Intrusion detection technology for network attacks is developing rapidly alongside artificial intelligence technology. Recently, machine learning-based methods that can detect new types of attacks have been developed. To improve the classification performance on the rare classes of an intrusion detection dataset, we study an efficient data preprocessing method based on machine learning. The UNSW-NB15, a well-known network intrusion detection dataset, is used in the experiments. The dataset includes 9 attack types and suffers from severe class imbalance and overlap, so it is difficult to improve the classification performance above a certain level. Adjusting the number of instances of rare classes is needed to improve the classification performance. SMOTE techniques and genetic algorithms are used to optimize the ratio between classes in the training dataset. The computation time is reduced by creating a training dataset that samples only a few percent of the UNSW-NB15 dataset. Many new training datasets are generated from this small training dataset according to randomly generated SMOTE ratios, and classification experiments are conducted with them. A new dataset is generated by combining the results of the experiments, and a regression model is trained on it. The best tuple of SMOTE ratios is searched for by applying the model as a fitness function of the genetic algorithm. The D-S-1G combination, which consists of a decision tree classifier and the support vector regressor (SVR), exhibited the best performance among the test results. As a result, the computation time was significantly reduced, and the optimal SMOTE ratios produced better results than the experiments on the original UNSW-NB15 dataset. It was also found that the classification result of each rare class relies heavily on the type of classifier.


Introduction
An intrusion detection system based on artificial intelligence [1][2][3] is needed to detect new and malicious network attacks that traditional firewalls cannot detect. This system protects against network attacks on vulnerable services, data-driven attacks in applications, privilege escalation, intruder login and access to major files by intruders or malicious software (computer viruses, Trojan horses, and worms), and host-based attacks.
Intrusion detection can be divided into signature-based and anomaly-based detection. Signature-based detection has a low false positive rate, but it has limited ability to detect new attack types because it works by matching patterns after inputting information for anomalous behavior. Anomaly detection is a method based on statistical analysis and machine learning. Anomaly detection has a higher false positive rate than signature-based methods, but it has the advantage of being able to detect new attack types.
Intrusion detection systems can be divided into network-based IDS (N-IDS), host-based IDS (H-IDS), and multihost-based IDS based on the detection location. In the N-IDS system, the intrusion detection equipment is installed in the front stage, and the traffic going to and from the network is analyzed. Although the N-IDS system has the advantage of being independent of the operating system, considerable packet loss occurs in high-speed networks. In addition, it is difficult to detect abnormal behavior that occurs inside the host. In the H-IDS system, audit data collected from a specific host system are analyzed to detect abnormal behavior. The H-IDS system accurately detects attacks, decrypts and analyzes encrypted packets, and has the advantage of no packet loss. In contrast, the H-IDS has the disadvantage of being dependent on the operating system and consuming system resources. A multihost-based IDS is a system that comprehensively analyzes all host systems to detect abnormal behavior.
Recently, with the rapid development of artificial intelligence technology, research on anomaly-based IDSs has been actively conducted. To promote intrusion detection research, datasets such as KDD'99 and UNSW-NB15 have been released. Many scholars have conducted research on intrusion detection using these public datasets. However, since there is a limit to how much the performance of intrusion detection can be improved using only artificial intelligence technology, the trend is to combine intrusion detection systems with the existing signature-based method.
Class imbalance and class overlap make it difficult to improve detection performance using only the intrusion detection dataset and artificial intelligence technology. In general, when learning models are created using a dataset with a high proportion of a specific class, most observations tend to be classified as belonging to the majority class based on the learning model, and it is uncommon for observations to be assigned to the minority class (rare class).
In this study, SMOTE [4], classification, regression, and genetic algorithms (GAs) [5] are used to improve the detection performance for rare classes in the UNSW-NB15 dataset. The classification results are derived by varying the class ratio of the training dataset, and a regression model is created from the experimental results. This model is used as the fitness function of the GAs. After finding the optimized SMOTE ratios and creating a training dataset with them, we apply the classification algorithm to assess the classification performance. Finally, the test dataset is input into the trained model to measure the attack type detection performance. The contributions of this study are as follows: (1) since many classification experiments are required, only a small part (approximately 3%) of the UNSW-NB15 dataset is used as a training dataset to reduce the computation time; (2) an optimal solution is derived with a relatively small number of classification experiments; (3) a regression model is used as the fitness function of the GAs to dramatically reduce the computation time, so it is not necessary to perform a classification experiment every time the fitness function is called; (4) we used the RMC (root mean cube), instead of the RMS (root mean square) of previous research [6], to better reflect the characteristics of the classes and reduce the effect of outliers.

The remainder of this study is organized as follows: Section 2 describes previous studies related to this study. In Section 3, we present a description of the UNSW-NB15 dataset. In Section 4, the problem to be solved is defined, and the proposed methods are explained. In Section 5, we explain our experimental environments, plan, and results. The study ends with some concluding remarks in Section 6.

Related Work
Soltanzadeh and Hashemzadeh [7] conducted a study of the class imbalance problem. To improve the SMOTE algorithm, they tried to overcome the following three issues: (1) overgeneralization due to oversampling of noisy samples, (2) oversampling of uninformative samples, and (3) the increasing overlap of different classes around class boundaries.
To address issues 1 and 2, they applied a sample categorization method to identify minority samples that are suitable for oversampling. To address the third issue, they proposed an improved sample creation process that generates synthetic samples within an accurately calculated safe range. This range is calculated based on the characteristics of the input data to provide a safe oversampling region for each dimension of the feature space. The extracted range is used to control the position of the new synthetic sample in the data space and to prevent it from drifting into the majority class domain.
Bagui and Li [8] used resampling to better balance the classes in a dataset by adjusting the ratios of different classes. In experiments using the benchmark cybersecurity datasets KDD99, UNSW-NB15, UNSW-NB17, and UNSW-NB18, evaluation using macro precision, macro recall, and macro F1-score values led to the following conclusions: first, oversampling increases the training time, and undersampling decreases the training time; second, both oversampling and undersampling significantly increase recall when the data are extremely imbalanced; third, resampling does not have a significant effect if the data imbalance is not severe; and fourth, resampling detects more minority data.
A study by Zoghi and Serpen [9] presented a visual analysis of the UNSW-NB15 intrusion detection dataset. PCA, t-SNE, and k-means clustering algorithms were used to develop graphs and plots for visualization. After visualizing the results, they identified and described two main problems in this dataset: class imbalance and class overlap. This shows that it is necessary to address class imbalance and class overlap before using this dataset for classifier model development.
Choudhary and Kesswani [10] used a deep neural network (DNN) to identify IoT attacks. An intelligent intrusion detection system requires an effective dataset. The ability of DNNs to accurately identify attacks was evaluated on the most popular datasets (e.g., KDD-Cup'99, NSL-KDD, and UNSW-NB15). Experimental results showed that the accuracy of the proposed DNN-based method was more than 90%.
Kumar et al. [11] proposed a new misuse-based intrusion detection system that detects five categories in the network, namely, exploit, DoS, probe, generic, and normal. They designed their own unified classification-based model using the UNSW-NB15 dataset. This model showed significantly higher performance than other existing decision tree-based models that detect five categories. Additionally, the NIT Patna CSE lab, which published this study, generated its own real-time dataset, RTNITP18. The RTNITP18 dataset was used as an experimental dataset to evaluate the performance of the proposed intrusion detection model. When the proposed model was analyzed using UNSW-NB15 and the real-time dataset RTNITP18, it exhibited better performance in terms of accuracy, attack detection rate, average F1-score, average accuracy, attack accuracy, and false alarm rate compared to other models.
Sun et al. [12] conducted a study on class imbalance. The class imbalance problem has been reported to seriously impair the classification performance of many standard learning algorithms and has thus received much attention from researchers in various fields. Several methods have been proposed to solve these problems, such as sampling methods, cost-sensitive learning methods, and ensemble methods based on bagging and boosting. However, the conventional methods for handling class imbalance potentially suffer from loss of useful information, unexpected mistakes, or an increased likelihood of overfitting because they can alter the original data distribution. Therefore, the imbalanced dataset was first transformed into several balanced datasets. The authors proposed a novel ensemble method to build multiple classifiers based on these datasets using a specific classification algorithm. Finally, the classification results of these classifiers on the new dataset were combined by specific ensemble rules. In an empirical study, the authors compared their method with various existing class imbalance processing methods. Experiments were conducted on 46 imbalanced datasets, and the results showed that their method was generally superior to the existing methods when solving problems with severe imbalances.
Nekooeimehr and Lai-Yuen [13] proposed a new oversampling method called adaptive semiunsupervised weighted oversampling (A-SUWO) for classifying imbalanced binary datasets. The proposed method clusters the minority instances using a semiunsupervised hierarchical clustering approach. It uses classification complexity and cross-validation to adaptively determine the sample size for oversampling each subcluster. The minority instances are then oversampled according to their Euclidean distance to the majority class. A-SUWO aims to identify instances that are difficult to learn by considering the minority instances in each subcluster that lie close to the borderline. It also avoids creating synthetic minority instances that overlap the majority classes during the clustering and oversampling steps. The results showed that this method achieved much better results on most datasets than other sampling methods.
Ali et al. [14] examined the problems arising from various issues of class imbalance classification along with training on the imbalanced class dataset. A survey of traditional approaches to handling classification with imbalanced datasets was provided. Additionally, the authors discussed current trends and advances that could potentially shape the future direction of class imbalance learning and classification.
They also found that advances in machine learning techniques will mostly benefit big data computing, especially in solving the class imbalance problem that inevitably emerges in many real-world applications, such as medicine and social media.
Salunkhe and Mali [15] attempted to overcome the issue of class imbalance in classification problems.
There is an imbalanced distribution problem in the training dataset that causes performance degradation of the classifier, and many studies have attempted to address it using resampling. Resampling is used to handle imbalanced distributions but can sometimes remove data required for certain classes or cause overfitting. Recently, classifier ensembles have received more attention as an effective technique for handling skewed data. Their method reduces the imbalance between classes by preprocessing the data to improve classification performance and then inputting the dataset into a classifier ensemble. Experiments performed on eight imbalanced datasets from the KEEL repository helped highlight the importance of the method. Comparative analysis showed performance improvement in terms of the area under the ROC curve (AUC).
Haixiang et al. [16] provided an in-depth review of rare event detection from the perspective of imbalance learning. For this analysis, 517 related papers published over the past 10 years were collected. The authors reviewed all the collected papers from both a technical and a practical point of view. The modeling methods discussed included techniques such as data preprocessing, classification algorithms, and model evaluation. The authors provided a comprehensive classification of existing application domains of imbalance learning and then detailed the applications in each category. Integrating some of the suggestions from the reviewed papers with their experience and judgment, they provided directions for further research in the fields of imbalance learning and rare event detection.
Zheng et al. [17] tried to alleviate the class imbalance problem using SMOTE. SMOTE is the most widely used data-level method, and many derivatives of the original model have been developed to alleviate the class imbalance problem. The authors found that SMOTE has serious flaws and proposed a new oversampling method, SNOCC, that can compensate for the shortcomings of SMOTE. In SNOCC, increasing the number of seed samples prevents new samples from lying only on the line segment between two seed samples, as in SMOTE. The authors also used a new algorithm, different from the previous one, to find the nearest neighbors of a sample. With these two improvements, new samples generated by SNOCC can naturally reproduce the distribution of the original seed samples. Experimental results showed that SNOCC outperformed SMOTE and CBSO (SMOTE-based methods).
Suleiman and Issac [18] attempted to improve the detection performance of intrusion detection systems (IDS) using machine learning. IDS suffers from setbacks such as false positives (FP), low detection accuracy, and false negatives (FN). To improve the performance of IDS, machine learning classifiers are used to support detection accuracy and significantly reduce false positive and false negative rates. In their study, they used six machine learning-based classifiers. On three datasets (NSL-KDD, UNSW-NB15, and a phishing dataset), their results show that k-NN and decision tree were the best classifiers in terms of detection accuracy, test time, and false positive rate.
Nawir et al. [19] attempted to build a network anomaly detection system using an efficient, effective, and fast machine learning algorithm. They performed binary classification experiments using the UNSW-NB15 dataset. The experimental results showed that the AODE algorithm was superior in terms of accuracy and computation time for binary classification on the UNSW-NB15 dataset.
Douzas and Bacao [20] approximated the actual data distribution using the conditional version of generative adversarial networks (cGAN) and generated data for minority classes in various imbalanced datasets. They compared the performance of cGAN with several standard oversampling algorithms. They presented empirical results showing that the quality of the generated data was significantly improved when cGAN was used as the oversampling algorithm.
Douzas and Bacao [21] proposed a new oversampling method called self-organizing map-based oversampling (SOMO). This method enables the effective generation of artificial data by producing a two-dimensional representation of the input space through the application of a self-organizing map. SOMO consists of three main phases. Initially, the self-organizing map creates an original two-dimensional space. Next, it generates intracluster synthetic samples and, finally, intercluster synthetic samples. Additionally, the authors presented empirical results showing that the performance of classification algorithms improved when using artificial data generated by SOMO and that their method outperformed various other oversampling methods.
Gong and Kim [22] proposed an effective ensemble classification method called RHSBoost to solve the imbalanced classification problem. Their classification rule uses random undersampling and ROSE sampling in a boosting scheme. The experimental results suggested that RHSBoost is an attractive classification model for imbalanced data.

UNSW-NB15 Dataset and Data Preprocessing.
The UNSW-NB15 intrusion detection dataset [23] was generated through the IXIA traffic generation testbed shown in Figure 1. The IXIA traffic generator consisted of three virtual servers. Servers 1 and 3 generate normal traffic, and Server 2 generates abnormal or malicious network traffic. After establishing internal communication between the servers and collecting public and private network traffic, there are two virtual interfaces. The goal of this testbed is to collect normal and abnormal traffic that originates from the IXIA tool and spreads out to network nodes (e.g., servers and clients). The IXIA tool generates attack traffic in addition to normal traffic. To generate attack traffic similar to an actual attack environment, the attack behavior is drawn from the common vulnerabilities and exposures (CVE) site [24]. Using the IXIA tool, the first simulation was configured to include 1 attack per second, and the second simulation was configured to include 10 attacks per second. The data captured during each simulation amounted to 50 GB.
There were 47 features provided by the UNSW-NB15 dataset. In the proposed method, additional features were extracted from the srcIP and dstIP features, and 53 features were configured, as shown in Table 1. The attack_cat feature takes values 1-10 and was used as the class label; when network traffic was normal, the class value was 1. The srcIP feature, an IP address, was divided into the srcIP1, srcIP2, srcIP3, and srcIP4 features, which correspond to the class A-D parts of the IP address. The dstIP feature was divided into dstIP1, dstIP2, dstIP3, and dstIP4 in the same way.
A rare class has a relatively small number of instances or shows low classification performance compared to other classes. In this study, five classes were considered rare: reconnaissance (3), DoS (4), worms (8), backdoor (9), and analysis (10).
The classification experiment using the original training dataset requires too much computation time. The proposed method reduces the computation time by adjusting the class ratio of the training and validation datasets, as shown in the experimental results in Tables 2-5 [25]. Tables 2 and 3 show the changes in the distribution of the normal and generic classes. Table 2 shows the difference in the number of instances before and after undersampling, and Table 3 shows the change in the class imbalance ratio. The class imbalance ratio is calculated by (1) [26].
While reducing the number of instances of the normal and generic classes, undersampling was applied only to the extent that the classification performance of the two classes was not significantly degraded. Of the 229,110 instances, 1/3 was used as the training dataset, and another 1/3 was used as the validation dataset. The test dataset was formed by adding the removed normal and generic instances back to the remaining 1/3; that is, the test dataset was restored as if there had been no undersampling. Table 4 shows how the recall values of all classes changed while the number of instances of the normal class was repeatedly halved, that is, reduced to 1/2^n. As the number of instances of the normal class decreased, the weighted average decreased, but the change in classification performance for each class was relatively small. When n was 5, the class imbalance ratio of the normal class was reduced to 2.81%, and its classification performance was not significantly affected.
After adjusting the number of instances of the normal class, Table 5 shows the change in recall values according to the change in the number of instances of the generic class. In the case of the generic class, the class imbalance was mitigated to an appropriate level when n was 2. If the number of instances of normal and generic classes was reduced, the recall value of those classes was somewhat lowered, but the computation time was significantly reduced.
Since the proposed method requires a large number of classification experiments, a very large amount of computation time is needed. By reducing the number of instances of the normal and generic classes, the computation time can be significantly reduced. StratifiedRemoveFolds [27] was used to reduce the number of instances. Table 6 shows the class imbalance ratios of the training, validation, and test datasets. In the test dataset, the class imbalance ratio of the normal and generic classes is large because we did not reduce the number of instances of those classes there.
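The two preprocessing steps above can be sketched in code. This is a minimal illustration, not the paper's implementation: the percentage-based definition of the class imbalance ratio is an assumption matching the percentages reported, and `stratified_subsample` plays the role that WEKA's StratifiedRemoveFolds plays in the paper.

```python
from collections import Counter
import random

def class_imbalance_ratio(labels):
    """Per-class share of the dataset, as a percentage (assumed
    definition of the class imbalance ratio)."""
    total = len(labels)
    return {c: 100.0 * n / total for c, n in Counter(labels).items()}

def stratified_subsample(instances, labels, fraction, seed=0):
    """Keep `fraction` of each class while preserving class proportions,
    analogous to StratifiedRemoveFolds."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(instances, labels):
        by_class.setdefault(y, []).append(x)
    kept_x, kept_y = [], []
    for y, xs in by_class.items():
        rng.shuffle(xs)
        k = max(1, int(len(xs) * fraction))
        kept_x.extend(xs[:k])
        kept_y.extend([y] * k)
    return kept_x, kept_y

# Toy example with a 90/8/2 class split
labels = ["normal"] * 90 + ["dos"] * 8 + ["worms"] * 2
ratios = class_imbalance_ratio(labels)
```

After subsampling with `fraction=0.5`, each class keeps half of its instances, so the imbalance ratios are unchanged; the point of the step is only to shrink the training set.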

SMOTE: Synthetic Minority Oversampling Technique.
The synthetic minority oversampling technique (SMOTE) is an oversampling method that generates new samples by interpolating between existing minority-class samples. SMOTE uses the k-NN (k nearest neighbor) algorithm to add points at slightly shifted positions from existing instances. SMOTE is similar to random oversampling in that it increases the number of instances of rare classes; however, it does not duplicate existing instances. Instead, it creates new instances by appropriately combining existing ones, avoiding the overfitting that duplication causes.
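The interpolation step can be sketched as follows. This is an illustrative re-implementation of the core idea, not the reference SMOTE code; the function name and parameters are our own.

```python
import numpy as np

def smote_oversample(X_minority, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: each synthetic point lies on the line
    segment between a minority sample and one of its k nearest
    minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    n = len(X)
    k = min(k, n - 1)
    # pairwise distances within the minority class
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # indices of the k nearest neighbours (column 0 is the point itself)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)               # pick a seed sample
        j = nn[i, rng.integers(k)]        # pick one of its neighbours
        gap = rng.random()                # position along the segment
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the minority region rather than duplicating instances.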

Problem definition.
We attempt to solve the classification problem of UNSW-NB15, a network intrusion detection dataset. The dataset has severe class imbalance and overlap, so it is not easy to improve classification performance. Figure 2 [9] is a two-dimensional scatter plot derived by inputting the original training dataset into the t-SNE (t-distributed stochastic neighbor embedding) algorithm. This plot shows that the classes form multiple clusters of different sizes and that the boundaries between the classes are not clear; that is, there are class overlaps. Many attack classes mimic the behavior of the normal class. Figure 3 [9] shows the degree of class overlap using PCA. Figures 2 and 3 suggest that it is very difficult to increase the classification performance of all classes at the same time. However, by overcoming these difficulties using data preprocessing techniques, the classification performance for the five rare classes can be improved. When there is a class imbalance problem, most examples are classified into the majority classes rather than the minority classes. The ratios between the classes of the training dataset need to be adjusted to solve this problem, and the SMOTE algorithm is used to adjust them. Ideally, all tuples of SMOTE ratios would be tested to optimize the ratio of each class; however, time and cost constraints make testing all cases impossible. Therefore, the best tuple of SMOTE ratios must be found with only a small number of experiments. Formula (1) calculates the class imbalance ratio of each class, and Table 6 shows the class imbalance ratios and sizes of the training, validation, and test datasets. In summary, data preprocessing methods that optimize the ratio between classes should be found, and their performance should be proven through experiments.

Proposed Methods.
To find the optimal class ratio, one approach is to experiment with all combinations of SMOTE ratios. Considering time and cost, it is not efficient to evaluate all possible combinations. Therefore, we suggest a method that maximizes the classification performance while dramatically reducing the computation time. Figure 4 shows the flowchart of the data preprocessing and experimental process of the proposed method. After the data preprocessing described in Section 3, tuples of SMOTE ratios are randomly generated according to the ranges in Table 7. New training datasets are created according to the tuples. A classification experiment is performed using these training datasets and the validation dataset. A regression model is created using the recall values that result from these experiments.
There are two optimization methods using the regression model. First, a large number of tuples of SMOTE ratios are randomly generated, and the tuples are entered into the regression model to find the tuple with the best performance. Second, the regression model is used as the fitness function of the GA to find the optimal SMOTE ratios. The pseudocode and formulas of the proposed method are shown in the following steps: (1) Step 1: Randomly generate many tuples of SMOTE ratios for the five rare classes according to the ranges in Table 7. New training datasets are generated by inputting the ratios of the tuples as a parameter of the SMOTE algorithm [4], and the datasets are added to the set TS.
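Step 1 can be sketched as follows. The ratio ranges here are placeholders standing in for Table 7 (which we do not reproduce), and the class names follow Section 3; everything else is an illustrative assumption.

```python
import random

RARE_CLASSES = ["reconnaissance", "dos", "worms", "backdoor", "analysis"]
# Hypothetical per-class SMOTE percentage ranges standing in for Table 7.
RATIO_RANGE = {c: (100.0, 1000.0) for c in RARE_CLASSES}

def random_smote_tuple(rng):
    """One candidate: a SMOTE percentage for each of the 5 rare classes."""
    return tuple(rng.uniform(*RATIO_RANGE[c]) for c in RARE_CLASSES)

def generate_candidates(m, seed=0):
    """Randomly generate m tuples of SMOTE ratios (the set of candidates
    from which the training datasets in TS are built)."""
    rng = random.Random(seed)
    return [random_smote_tuple(rng) for _ in range(m)]

candidates = generate_candidates(50)
```

Each tuple would then be passed to the SMOTE routine to build one oversampled training dataset, giving m datasets for the m classification experiments.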

RModel←Regressors(SR, RMC); (2)

Step 4: The following are two methods to find the optimal tuple of SMOTE ratios.
(4.1) Randomly generate a very large number of tuples of SMOTE ratios for five rare classes. After inputting the tuples as a parameter into the regression model, the tuple with the best results is selected.

Tuples←A set of randomly generated very large number of tuples of SMOTE Ratios;
BestTuple←arg max(RModel(Tuples)); (3)

(4.2) A genetic algorithm is used to search for an optimal solution. Randomly generated tuples of SMOTE ratios are used as the population, and the regression model is used as the fitness function. Figure 5 shows the distribution of the initial population of the genetic algorithm.
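Method (4.1) above amounts to scoring every candidate tuple with the trained regression model (a surrogate for the expensive classification experiment) and taking the argmax. A minimal sketch, in which `toy_surrogate` is an assumed stand-in for the trained regressor:

```python
import random

def best_by_random_search(surrogate, candidates):
    """Score every candidate tuple with the regression model and
    return the one with the highest predicted score."""
    return max(candidates, key=surrogate)

# Toy surrogate standing in for the trained regressor (assumption):
# it prefers tuples whose ratios are all close to 500.
def toy_surrogate(t):
    return -sum((x - 500.0) ** 2 for x in t)

rng = random.Random(0)
candidates = [tuple(rng.uniform(100.0, 1000.0) for _ in range(5))
              for _ in range(10000)]
best = best_by_random_search(toy_surrogate, candidates)
```

The key saving is that evaluating the surrogate is nearly free, so tens of thousands of candidate tuples can be screened without running a single extra classification experiment.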

Experiments
Let m be the maximum number of experiments that can be run on the available computation device. We randomly generated m tuples of SMOTE ratios and generated m training datasets using the tuples. We obtained m recall values by applying classifiers such as SVM, decision tree, and random subspace to the generated training datasets and the validation dataset. The RMC was calculated using the recall values of the 10 classes. The dataset used to create the regression model consisted of the RMC values and the tuples for the 5 rare classes, and the number of instances was m. The regression algorithms used to generate the model were the MLP regressor, SVR, and random forest. In previous studies [6], RMS was used, but in this study, RMC was used to better reflect the characteristics of the classes and reduce the effect of outliers. Equation (5) calculates the RMC.
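In code, the RMC reduces to a one-liner (a direct transcription of the root-mean-cube definition; the example recall values are made up):

```python
def rmc(values):
    """Root mean cube of a list of values:
    RMC = ((x1^3 + x2^3 + ... + xn^3) / n) ** (1/3)."""
    n = len(values)
    return (sum(v ** 3 for v in values) / n) ** (1.0 / 3.0)

# Example: hypothetical recall values for the 10 classes
recalls = [0.97, 0.95, 0.90, 0.60, 0.40, 0.35, 0.30, 0.25, 0.20, 0.10]
score = rmc(recalls)
```

As with the RMS, the RMC aggregates the per-class recall values into the single scalar that the regression model learns to predict.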
We conducted two experiments with the regression model. In the first method, after randomly generating a very large number of tuples of SMOTE ratios, the tuples were input into the regression model to find the best tuple. The second method used generational GAs. The representation consisted of 5 real numbers, the ratios for the 5 rare classes. The RMC was calculated using the regression model as the fitness function of the GA. The parameters of the GA were as follows: the population size was set to m, and the roulette wheel method was used for selection. One-point crossover was applied with 100% probability, and the bitwise mutation rate was set to 5%. Replacement kept the top 1% of solutions and replaced the remaining 99% with new child solutions. The stop condition occurred when 10,000 generations had elapsed or there was no change in the optimal value for more than 50 generations. The parameters of the classifiers were as follows: in SVM, a polykernel was used, and the value of c was 1. In the decision tree, the confidence factor was set to 0.25. REPTree was used as the base classifier for random subspace, and subSpaceSize was set to 0.5. The parameters of the regressors were as follows: in the MLPRegressor, an approximate sigmoid was used as the activation function, and the squared error was used as the loss function. In SVR, a polykernel was used, c was set to 1, and RegSMOImproved was used as the optimizer. The epsilon parameter was set to 0.001, and the tolerance was set to 0.001. In random forest, bagSizePercent was set to 100%, and numIterations was set to 100.
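The GA settings described above (roulette-wheel selection, one-point crossover always applied, 5% per-gene mutation, 1% elitism, and a stagnation-based stop) can be sketched as a generic real-valued GA. This is our own minimal sketch, not the paper's implementation; in the paper, the fitness function would be the trained regression model.

```python
import random

def genetic_search(fitness, bounds, pop_size=100, mut_rate=0.05,
                   elite_frac=0.01, max_gen=10000, patience=50, seed=0):
    """GA sketch: roulette-wheel selection, one-point crossover (100%),
    per-gene mutation, elitism, and a stop after `max_gen` generations
    or `patience` generations without improvement."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(*bounds[d]) for d in range(dim)]
           for _ in range(pop_size)]
    best, best_fit, stall = None, float("-inf"), 0
    for _ in range(max_gen):
        fits = [fitness(ind) for ind in pop]
        top = max(fits)
        if top > best_fit:
            best_fit, best, stall = top, pop[fits.index(top)][:], 0
        else:
            stall += 1
            if stall >= patience:
                break
        # roulette-wheel selection (shift fitnesses to be non-negative)
        lo = min(fits)
        weights = [f - lo + 1e-9 for f in fits]
        def pick():
            return pop[rng.choices(range(pop_size), weights=weights)[0]]
        n_elite = max(1, int(elite_frac * pop_size))
        ranked = sorted(range(pop_size), key=lambda i: -fits[i])
        children = [pop[i][:] for i in ranked[:n_elite]]  # elitism
        while len(children) < pop_size:
            a, b = pick(), pick()
            cut = rng.randrange(1, dim)          # one-point crossover
            child = a[:cut] + b[cut:]
            for d in range(dim):                 # per-gene mutation
                if rng.random() < mut_rate:
                    child[d] = rng.uniform(*bounds[d])
            children.append(child)
        pop = children
    return best, best_fit
```

Because the fitness function here is a cheap regression model rather than a full classification experiment, even ten thousand generations remain inexpensive.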
For a group of n values x_1, x_2, x_3, ..., x_n, the RMC is given by

RMC = ((x_1^3 + x_2^3 + ... + x_n^3) / n)^(1/3). (5)

Table 8 shows the performance of the regression models generated from classification results using the SVM, decision tree, and random subspace algorithms. The regression algorithms used to generate the models were the MLP regressor, SVR, and random forest. In the SVM and random subspace experiments, the MLP regressor exhibited the best performance, and in the decision tree experiments, random forest performed best. Among all the experimental results, the combination of the SVM classifier and the MLP regressor was the best; the combination of the SVM classifier and the random forest regressor was also excellent. Table 9 and Figure 6 show the results of the classification experiment using the original training dataset, in which classification performance was compared between algorithms. The accuracies of the experiments are as follows: random subspace, 89.12%; decision tree, 97.4%; and SVM, 97.06%.
In the decision tree experiment, the weighted average of recall was the best at 0.974. The experimental results in Table 10 and Figure 7 show that the performance of D-S-1G was excellent. When the classification experiment was performed with a decision tree and the regression model was built with SVR, SMOTE ratios were derived that exhibited good performance. Table 11 shows the best tuple of SMOTE ratios according to the type of experiment.
Each letter in the name of the test type in Table 10 has a specific meaning. The first letter is the algorithm used in the initial classification experiment: D (decision tree), R (random subspace), and S (SVM). The second letter is the algorithm used to generate the regression model: M (MLP regressor), R (random forest), and S (SVR). The number in the third position represents the classifier used in the classification experiment on the training dataset generated according to the ratios of the best tuple: 1 (random forest), 2 (SVM), 3 (decision tree), and 4 (k-NN). G in the 4th position denotes that a genetic algorithm was used to find the best tuple of SMOTE ratios. For example, D-S-1G has the following meaning: D = perform the decision tree classification experiment; S = generate a regression model from the experimental results using SVR; 1 = perform the final classification experiment on the generated training dataset with random forest; and G = use a genetic algorithm to obtain the best tuple of SMOTE ratios. The model is used to find the best tuple of SMOTE ratios, and a training dataset is generated according to the ratios of the best tuple. In Table 10, the accuracy of S-R-1G is 96.70%, R-R-2 is 69.28%, R-R-2G is 95.88%, and D-S-1G is 96.55%. Stdev 10 is the standard deviation of the recall values of all 10 classes, and Stdev 5 is the standard deviation of the recall values of the rare classes. RMC 10 is calculated by applying the recall values of all classes to the RMC equation, and RMC 5 is calculated by applying the recall values of the rare classes. In this experiment, if the standard deviation is small and the value of RMC 10 or RMC 5 is large, the classification performance is considered excellent. Figure 8 compares the results of the decision tree experiment, which showed the best performance among experiments using the original training dataset, and the D-S-1G experiment, which showed the best performance using SMOTE.
Comparing the two experimental results, we found that the result of D-S-1G was superior to that of the decision tree experiment; in particular, there was a large difference in the classification performance for the rare classes. Figure 8 shows the distribution of the initial population of the genetic algorithm in the D-S-1G experiment. Table 12 shows the results of the D-S-1G experiment for various measures.
The UNSW-NB15 dataset includes 2,540,047 instances. The training dataset includes 76,337 instances, and the validation dataset has the same size. The test dataset has 846,683 instances. The SMOTE algorithm makes it easier to classify rare classes, as in [4]. Many tuples of SMOTE ratios are randomly generated, and the tuples are used to generate new training datasets.
Recall is the ratio of correctly classified positive samples among all positive samples, and it is mainly used to evaluate the classification performance. The reason is that when many normal samples are misclassified as attacks, the system is easily overloaded. Therefore, the classification results are mainly evaluated with the recall metric, and the precision and F1-score are used as auxiliary metrics for evaluating the robustness of the machine learning model. Next, a regression model was generated using the RMC values and the tuples of SMOTE ratios. In evaluating the performance of the regression model, the correlation coefficient is used as the default metric, and the MAE and RMSE are used as auxiliary metrics.
There are two methods to optimize the ratios between classes using the regression model. The first is to randomly create a very large number of tuples of SMOTE ratios, enter the tuples into the regression model, and choose the tuple with the best predicted score. The second is to use the regression model as the fitness function of a GA to derive the optimal SMOTE ratios.
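The first method can be sketched as follows: fit an SVR surrogate on pairs of (SMOTE-ratio tuple, experimental score), then score a large batch of random candidate tuples with the cheap surrogate instead of running a full classification experiment per tuple. The history data below is synthesized from a toy scoring function; in the paper it would come from the actual classification experiments.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Hypothetical history: each row is a tuple of SMOTE ratios for five rare
# classes; y is the score measured in a classification experiment.
# (Here y is a toy function peaking at ratio 50 in every class.)
X_hist = rng.uniform(1, 100, size=(200, 5))
y_hist = -((X_hist - 50) ** 2).sum(axis=1) / 1000.0

# Regression model used as a cheap stand-in for a full experiment.
surrogate = SVR(kernel="rbf", C=100.0).fit(X_hist, y_hist)

# Method one: generate many random tuples, keep the highest-scoring one.
candidates = rng.uniform(1, 100, size=(10_000, 5))
best_tuple = candidates[np.argmax(surrogate.predict(candidates))]
print(best_tuple)
```

The same surrogate object can then be reused unchanged as a GA fitness function (method two).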
The reason for choosing GAs among evolutionary algorithms is that GAs generate high-quality solutions to optimization and search problems by relying on biologically inspired operators such as mutation, crossover, and selection between multiple individuals [5].
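The three operators can be combined into a minimal elitist GA as sketched below. The fitness function here is a toy stand-in peaking at ratio 50; in the paper's method it would be a call to the trained regression model instead. All parameter values (population size, mutation rate, and so on) are illustrative assumptions, not the paper's settings.

```python
import random

def fitness(ratios):
    # Stand-in for the trained regression model; in the actual method this
    # would be something like surrogate.predict([ratios])[0].
    return -sum((r - 50) ** 2 for r in ratios)

def evolve(n_genes=5, pop_size=40, generations=60, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(1, 100) for _ in range(n_genes)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]            # selection (elitist)
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_genes)         # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.3:                  # mutation
                child[rng.randrange(n_genes)] = rng.uniform(1, 100)
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
```

Because the fitness evaluations hit only the regression model, the GA explores thousands of ratio tuples without rerunning any classification experiment, which is the source of the computation-time savings reported above.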
In Table 10, the D-S-1G experiment shows better performance than the other experiments in the exploit, reconnaissance, shellcode, fuzzers, and backdoor classes. In this experiment, the decision tree classifier is used for the classification experiments, the results of those experiments are used to generate a dataset, and a regression model is generated from that dataset using SVR. Then, the model is used as the GA's fitness function to find the best SMOTE ratios. A new training dataset is generated according to the best SMOTE ratios, and classification experiments are conducted using this training dataset, the test dataset, and the random forest classifier.
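Generating the new training dataset according to a chosen ratio amounts to synthesizing minority-class samples. The sketch below is an illustrative SMOTE-style oversampler, not imbalanced-learn's implementation: each synthetic sample is an interpolation between a random minority sample and one of its k nearest minority-class neighbours. The data and the 14x ratio are hypothetical.

```python
import numpy as np

def smote_oversample(X_minority, n_synthetic, k=5, seed=0):
    """Minimal SMOTE-style sketch: each synthetic point lies on the segment
    between a minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    # Pairwise distances within the minority class; ignore self-distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]       # k nearest per sample
    synthetic = np.empty((n_synthetic, X.shape[1]))
    for i in range(n_synthetic):
        base = rng.integers(len(X))
        nb = neighbours[base, rng.integers(min(k, len(X) - 1))]
        gap = rng.random()                          # position on the segment
        synthetic[i] = X[base] + gap * (X[nb] - X[base])
    return synthetic

# Oversample a hypothetical rare class 14x.
X_rare = np.random.default_rng(1).normal(size=(20, 4))
X_new = smote_oversample(X_rare, n_synthetic=14 * len(X_rare))
```

In practice a library such as imbalanced-learn would be used, with the per-class ratios from the best tuple passed as its sampling strategy; the sketch only shows the interpolation idea.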
The D-S-1G test results show smaller standard deviations than the other experiments. That is, the classification results for the individual classes are not biased and show good performance. The weighted averages of the recall and F1-score are 0.965 and 0.971, respectively, which indicates high classification performance. The S-R-1G combination includes an SVM classifier and a random forest regressor. Unlike D-S-1G's decision-tree-based method, it shows excellent performance in the DoS, generic, and worms classes. R-R-2G is an experiment with a random subspace classifier and a random forest regressor, and it shows excellent results in the analysis class.
A comprehensive analysis of the results shows that the type of classifier has a significant impact on the classification performance of each rare class. Therefore, it is expected that the classification performance can be improved by analyzing which classifiers respond well to each rare class and then applying ensemble techniques such as boosting.

Concluding remarks
We studied machine learning-based data preprocessing methods for rare class classification using the UNSW-NB15 dataset, which has severe class imbalance. Only a small percentage of the total dataset was used as the training dataset, without a significant impact on the classification performance. We proposed a method to optimize the ratios between classes in the training dataset with SMOTE and genetic algorithms. In the experiments, the optimal SMOTE ratios were found to maximize the classification performance. In the D-S-1G experiment, which showed the best overall classification performance, the SMOTE ratios were 80 times, 14 times, 14,106 times, 8,900 times, and 127 times for reconnaissance, DoS, worms, backdoor, and analysis, respectively. In the S-R-1G experiment, the SMOTE ratios were as follows: 373 times for reconnaissance, 433 times for DoS, 29,241 times for worms, 3,853 times for backdoor, and 3,091 times for analysis.
As a result, the best SMOTE ratios were obtained for the rare classes, and the computation time was significantly reduced by using a regression model. The superiority of the proposed method was verified through experiments, and the classification performance was enhanced by alleviating the class imbalance. Each rare class showed very different classification results depending on the type of classifier.
In the future, we expect that better results can be derived by applying ensemble methods such as boosting. We would like to present a new data preprocessing method that mitigates class imbalance with various data augmentation algorithms. We will also experiment with other network anomaly datasets.

Data Availability
The UNSW-NB15 data used to support the findings of this study have been deposited in the UNSW-NB15 network dataset repository (DOI: 10.1109/MilCIS.2015.7348942).

Conflicts of Interest
The author declares that he has no conflicts of interest.