Improving the Performance of Machine Learning-Based Network Intrusion Detection Systems on the UNSW-NB15 Dataset

Networks are exposed to an increasing number of cyberattacks due to their vulnerabilities. So, cybersecurity strives to make networks as safe as possible, by introducing defense systems to detect any suspicious activities. However, firewalls and classical intrusion detection systems (IDSs) suffer from continuous updating of their defined databases to detect threats. The new directions of the IDSs aim to leverage the machine learning models to design more robust systems with higher detection rates and lower false alarm rates. This research presents a novel network IDS, which plays an important role in network security and faces the current cyberattacks on networks using the UNSW-NB15 dataset benchmark. Our proposed system is a dynamically scalable multiclass machine learning-based network IDS. It consists of several stages based on supervised machine learning. It starts with the Synthetic Minority Oversampling Technique (SMOTE) method to solve the imbalanced classes problem in the dataset and then selects the important features for each class existing in the dataset by the Gini Impurity criterion using the Extremely Randomized Trees Classifier (Extra Trees Classifier). After that, a pretrained extreme learning machine (ELM) model is responsible for detecting the attacks separately, “One-Versus-All” as a binary classifier for each of them. Finally, the ELM classifier outputs become the inputs to a fully connected layer in order to learn from all their combinations, followed by a logistic regression layer to make soft decisions for all classes. Results show that our proposed system performs better than related works in terms of accuracy, false alarm rate, Receiver Operating Characteristic (ROC), and Precision-Recall Curves (PRCs).


Introduction
The rapid evolution of the IoT, cloud, and big data domains has reached an unprecedented level, and the need to use them has become unavoidable. The data prevailing through these emerging technologies pass through many steps in their life cycle, including creation, transfer, storage, and deletion. The information carried in the data is of great importance at every stage of this cycle, especially when it relates to financial transactions, governments, or the military. Consequently, data privacy and information security are fundamental to reducing the losses that occur when they are overlooked [1].
Due to system vulnerabilities, intruders try to steal, destroy, or alter information and often damage the systems themselves. Thus, information security in terms of confidentiality, integrity, and availability (the CIA triad) must be taken into consideration when developing systems.
IDS is one of the most common issues in the field of cybersecurity to meet the challenges of any malicious cyberattacks.
IDS is used to detect suspicious activities on the network, network-based IDS, or on the host, host-based IDS, or on both of them, hybrid IDS. It may be either software or hardware or a combination of both.
IDSs are divided into three groups based on the methodology: signature-based IDS matching the traffic flow with stored signatures of known attacks, specification-based IDS applying a set of rules in the incoming packets to monitor any skewness from the normal behavior, and anomaly-based IDS sniffing the suspicious threats [2].
With the proliferation of attacks, the signature-based types suffer from continuously updating their databases, and the specification-based types need more expert knowledge to capture the new undesired traffic.
Because detection of anomalies is considered a classification problem in the world of machine learning, the use of machine learning methods as classifiers for IDS has increased [3], which is known as machine learning-based IDS, a branch of anomaly-based IDS.
Many labs have created datasets to help in designing machine learning-based IDSs. The UNSW-NB15 dataset, which includes the latest cyberattacks, draws much attention from cybersecurity researchers.
In order to reduce misclassification, SMOTE was proposed as a very popular method of resampling especially when some classes dominate others [4].
In the machine learning community, choosing the optimal features is an important step that removes irrelevant or less important features using wrapper, filter, embedded, or learning-based methods.
Recently, the use of ensemble learning methods in the feature selection stage has increased. Extra Trees Classifiers outperform their peers in class categorization by selecting the optimal attributes while remaining computationally efficient [5]. Classifier speed is a design requirement in many applications, especially those running in real time. For this reason, the extreme learning machine method was introduced as one of the fastest learning algorithms, surpassing dozens of learning techniques based on back-propagation [6].
The key metric for evaluating a classifier is accuracy, which is the proportion of correct predictions among all predictions. In addition, the false alarm rate is important for judging how powerful classifiers are, i.e., how far they reduce the proportion of wrongly classified instances.
However, classification accuracy alone is not sufficient to make a proper decision. Therefore, in addition to accuracy, ROC and PRC plots should be examined to avoid illogical results. This research focuses on a software machine learning-based network IDS using the abovementioned techniques from a classification viewpoint; it also sheds light on the accuracy, false alarm rate, ROC, and PRC. The rest of this paper is organized as follows. Section 2 summarizes related studies. Section 3 details the proposed system and the methodologies used. Section 4 gives the results and discussion. Finally, Section 5 offers the conclusion and suggestions for future work.

Literature Review
Studies varied over the selected dataset, i.e., UNSW-NB 15, depending on the type of attack or the protocol used, or the threat detection approach. So, some preferred to minimize the detection circuit to catch just one specified attack or perhaps two attacks at most. Others went toward discussing the problem relying on the transport layer protocol, i.e., TCP or UDP. Others did not do the multiclass classification, but they were satisfied with the binary classification.
This section focuses on state-of-the-art works connected to multiclass classification on the selected dataset.
In [7], for each attack in the UNSW-NB15 dataset, they introduced a hybrid model for IDS based on a Genetic Algorithm (GA) and Support Vector Machine (SVM). They converted the features into chromosomes and selected the ones with the highest accuracy. Then, as a detection method, they proposed the Least Squares Support Vector Machine (LSSVM). The results were tested for accuracy, true positive rate, and false positive rate.
In [8], a random forest (RF) was presented as a feature reduction method; they were interested in eight UNSW-NB15 dataset attacks, excluding "Fuzzers" attacks. They designed a stepwise architecture to detect attacks based on the random forest at each stage. The performance metrics for their study were the false alarm rate (FAR) and the undetection rate (UND).
Deep learning methods have also been presented for the multiclass detection approach in anomaly-based detection. For example, the well-known Convolutional Neural Network (CNN) was used in [9], after converting features to 8 × 8 images to be fed into the CNN layers. The classification accuracies were high for "Normal" and "Generic" traffic.
In [10], a combination of Artificial Bee Colony (ABC) and Artificial Fish Swarm (AFS) was proposed for categorizing attacks. They split the dataset into subsets and used the Correlation-based Feature Selection (CFS) method to select the optimal attributes. After that, the CART technique was added to generate "If-Then" rules for the hybrid ABC-AFS. The performance was tested for various numbers of subsets.
Due to its lower complexity than other mixture models, a Beta Mixture Model (BMM) was used as an anomaly-based detection technique in [11]. BMM uses a lower-upper interquartile threshold to distinguish between the normal and the abnormal profiles. They demonstrated their results in terms of detection rates for all attack classes and ROC curves.
Mixing multiple machine learning methods in studies is strongly recommended to exploit their strengths to improve the overall performance of IDS. For example, the study in [12] demonstrated that IDS can be achieved through a set of layers.
The feature selection layer, based on Extra Trees Classifiers for each threat, was followed by the extreme learning machine ensemble layer for detection. Then, the outputs of the previous layer were collected by the softmax layer to make a soft decision for each attack. Results were limited to accuracy.
To ensure that the design of IDS models will make a good impression in production, multiple model experiments will be applied to many relevant datasets.
Thus, the study in [13] proposed distributed deep neural network (DNN) models with many hidden layers to monitor threats at the host level and the network level. The models were tested on benchmark datasets. They released their framework "Scale-Hybrid-IDS-AlertNet" to detect cyberattacks in real time.

Computational Intelligence and Neuroscience

IDS architecture can be organized in levels according to the detection approaches, as in [14], which explained this idea through a two-level design. The former was a binary classification model based on a decision tree to separate benign from malignant flows. If malignant flows were predicted, the latter would start with a multiclass classification model based on a hybrid of Recursive Feature Elimination (RFE) and SMOTE to make precise decisions when categorizing the abnormal flow.
In search of the high detection rates, the study in [15] illustrated their IDS by a combination of the Genetic Algorithm (GA) to delete irrelevant features and the Self-Organizing Map (SOM) classifier, optimized by GA's selected features.
In [16], they also used GA with random forest (RF) to select the optimum attributes, preceded by the Isolation Forest (iForest) for data sampling. A random forest (RF) classifier was reused for a different goal, namely to recognize the class type of attacks. This suggestion produced high accuracies, high detection rates, and lower false alarm rates.
IDS performance with reduction features outperforms others using all features. In [17], the optimal features were selected by applying Mutual Information with Linear Correlation Coefficient (MI-LCC), followed by the Support Vector Machine (SVM) classifier as a multiclass detection method.
Examining the class statistics of a dataset helps scholars to design a robust IDS, particularly during the preprocessing stage, because no machine learning model can be trained well under certain class ratios. Thus, in [18], the data were resampled using one-side selection (OSS) to decrease majority samples and SMOTE to increase minority samples. Then, the spatial and temporal features were extracted by a CNN and a bidirectional long short-term memory (BiLSTM), respectively; their combination forms the core of the classification stage.
In [19], they introduced their IDS for the cloud environment, using Chi-square as a feature selection method and deep reinforcement learning as a classification method. ROC curves showed accuracies, FPR, and TPR for each class.
Ensemble learning has been presented to enhance the detection rate in [20]. A long short-term memory (LSTM) algorithm, a homogeneous ensemble method, and a heterogeneous ensemble method based on multiple classifiers were implemented. The proposed models were tested on the selected dataset in two forms: as a two-class dataset and as a multiclass dataset.

Our Proposed System
After reviewing the works related to the research topics, we noticed that resampling techniques have improved the performance of multiclass classification. Likewise, ensemble learning methods have done well at selecting the optimum features. Furthermore, classification has been implemented by machine learning rather than deep learning to obtain more effective models with less complexity. Consequently, we introduce our suggested IDS, shown in Figure 1, which consists of multiple stages.

Resampling.
The unequal number of samples per class in a dataset badly affects the performance of machine learning-based classifiers, especially when the ratio of the majority classes to the minority ones exceeds 100 to 1, as many data scientists have stated. Because of the difficulty of creating a standard balanced dataset, preprocessing of the existing dataset should begin with decreasing the majority, increasing the minority, or both. One of the simplest statistical techniques for dealing with uneven categories in a dataset is SMOTE, which is applied to certain minority classes in the selected dataset.

Synthetic Minority Oversampling Technique (SMOTE).
The basis for this method resides in the idea of oversampling the minority class by generating synthetic instances from its elements while keeping the number of majority samples as is. The new samples are not mere carbon copies of minority examples but are created by combining features from the minority instances and their closest neighbors in the feature space [4]. Figure 2 shows a simple way to oversample the minority cases (the orange squares) in the 2D feature space by drawing lines between them; the synthetic minority instances (the green squares) lie on these lines. The majority cases (the circles) remain unchanged. As a result, only the minority share rises, and the classes become balanced.
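The interpolation idea described above can be sketched in a few lines of plain Python. This is a simplified illustration rather than the implementation used in this work: the function name `smote`, its parameters, and the toy points are assumptions, and the base sample is drawn at random instead of iterating over every minority instance as in [4].

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical 2D minority samples (the "orange squares" of Figure 2).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment joining two minority samples, it always falls inside the region already occupied by the minority class, while the majority samples are left untouched.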

Preprocessing.
Dealing with the raw data set examples requires some analysis and visualization of the values included. Some rows can be duplicated which causes overfitting problems. Some columns have dirty values such as spaces or nulls or various types of data.
To handle the above problems, the selected dataset should be preprocessed to make it free from any errors that would affect the subsequent processing.

Data Cleaning.
Fixing the dataset flaws is an essential part of this stage.

Z-score is one of the well-known normalization methods. Relation (1) expresses this strategy by subtracting the mean μ of a feature from every value x of that feature and dividing the difference by its standard deviation σ:

$$z = \frac{x - \mu}{\sigma} \tag{1}$$

Feature Selection.
This step is very important for designing efficient machine learning models and for reducing the computational cost of a high-dimensional feature space by selecting the most relevant features. Many techniques have been introduced for finding the optimal subset [21]:
(i) Wrapper methods: sequential selection algorithms and heuristic search algorithms
(ii) Filter methods: correlation criteria and mutual information between features
(iii) Embedded methods: MRMR (max-relevancy, min-redundancy) and L1 regularization
(iv) Learning-based methods: some unsupervised/semisupervised/supervised/ensemble learning algorithms
No single method is preferable for every machine learning model; experiments should be done to find out which one achieves the best results for the desired dataset or study problem.
Many strong recommendations claim that feature selection techniques based on ensemble learning, especially Extra Trees Classifiers, outperform other procedures [5].

Extremely Randomized Trees Classifier (Extra Trees Classifier).
This is one of the most common tree-based ensemble machine learning methods. As claimed by [6], it gathers many randomized decision trees without using bootstrapped samples; each decision tree is fitted on the entire training dataset. It splits tree nodes at randomly selected split points, chosen according to a mathematical criterion.
In the context of the suggested system, this algorithm has been exploited to capture the optimal features for each class in the dataset using the Gini Impurity criterion.
Gini Impurity measures the probability of incorrectly classifying a randomly selected sample when splitting on a particular feature [22]. Its values range from 0 to 1; the lower the value is, the more important the relevant feature is.
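As a small worked illustration of the criterion, the sketch below computes the standard Gini Impurity, one minus the sum of squared class proportions, and the impurity drop produced by a candidate split. The function names and the toy labels are assumptions made for this example; they are not taken from the proposed system.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(labels, left, right):
    """Drop in impurity when `labels` is split into `left` and `right`."""
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

# Toy node holding four attack and four normal samples.
node = ["attack"] * 4 + ["normal"] * 4
# A split that separates the classes perfectly leaves no impurity behind.
pure_split = impurity_decrease(node, ["attack"] * 4, ["normal"] * 4)
# A split that leaves both children as mixed as the parent gains nothing.
poor_split = impurity_decrease(node, ["attack", "attack", "normal", "normal"],
                               ["attack", "attack", "normal", "normal"])
```

A feature whose splits leave children with low impurity separates the classes well, which is the sense in which a lower value indicates a more useful feature.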

Classification.
It is a supervised learning approach that categorizes the examples of the dataset into groups by designing supervised learning models for this task. So, a labeled or categorized dataset is required to create models that map a subset of features to each class.
Based on the dataset, it can be binary classification when there are only two classes, or multiclass classification when the number of classes is greater than two, or multilabel classification when each instance is defined with multiple labels.
Hundreds of machine learning models can be used as classifiers, but the system goal, stability, complexity, scalability, and performance make researchers favor some algorithms over others. Therefore, the scope of application of the system must be determined before going into design.
For real-time applications, extreme learning machine methods are strongly recommended for their low training time without iterative tuning, good generalization, and ease of implementation.
In particular, these methods were introduced as candidates to apply them to the UNSW-NB15 dataset as mentioned in the survey [23].

Extreme Learning Machines (ELMs).
According to [6], this algorithm is a single-hidden-layer feedforward neural network (SLFN). It has been updated in many forms to improve its generalization ability and performance. However, the proposed system adopts the basic form, applied to each class as a "One-Versus-All" binary classifier, to make the processing easy and quick.
In general, it has only one hidden layer, with multiple neurons fully connected on one side to the input layer and on the other side to the output layer, as shown in Figure 3.
From the previous figure, the ELM output applies the following mathematical relation:

$$o_j = \sum_{i=1}^{\tilde{N}} \beta_i \, g(w_i \cdot x_j + b_i), \quad j = 1, \ldots, N \tag{2}$$

where we have the following:
(i) $\tilde{N}$ is the number of hidden neurons
(ii) $N$ is the number of training instances
(iii) $\beta_i$ is the ith weight vector between the ith hidden neuron and the output layer
(iv) $w_i$ is the ith weight vector between the ith hidden neuron and the input layer
(v) $b_i$ is the bias of the ith hidden neuron
(vi) $g$ is an activation function
(vii) $x_j$ is the jth input vector with m features
(viii) $o_j$ is the jth output sample

The error between the ELM output $o_j$ and the actual target $t_j$ should ideally be zero, as referred to in

$$\sum_{j=1}^{N} \lVert o_j - t_j \rVert = 0 \tag{3}$$

As a result, formula (2) can be rewritten to become

$$\sum_{i=1}^{\tilde{N}} \beta_i \, g(w_i \cdot x_j + b_i) = t_j, \quad j = 1, \ldots, N \tag{4}$$

The matrix form of the N equations in (4) is

$$H\beta = T$$

where $H$ is the hidden-layer output matrix with entries $H_{ji} = g(w_i \cdot x_j + b_i)$. For every i, $w_i$ and $b_i$ are randomly assigned without explicit intervention to calculate $H$; $T$ is given in the dataset.
The only thing to calculate is β as follows:

$$\hat{\beta} = H^{\dagger} T$$

where $H^{\dagger} = (H^{T}H)^{-1}H^{T}$ is the Moore-Penrose generalized inverse of matrix $H$. $\hat{\beta}$ is proven to be the optimal solution for the least-squares error: $\lVert H\hat{\beta} - T \rVert = \min_{\beta} \lVert H\beta - T \rVert$.
From the complexity standpoint, thanks to the simplicity of its structure, ELM significantly reduces computational burdens.
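The training procedure described above, random hidden weights followed by a single pseudo-inverse, can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the helper names `train_elm` and `predict_elm`, the ReLU activation, the 20 hidden neurons, and the two synthetic clusters standing in for one "One-Versus-All" problem are all chosen for the example.

```python
import numpy as np

def train_elm(X, t, n_hidden, rng):
    """Fit a basic ELM: random hidden layer, least-squares output weights."""
    W = rng.normal(size=(X.shape[1], n_hidden))  # input weights w_i, random and never trained
    b = rng.normal(size=n_hidden)                # hidden biases b_i, random and never trained
    H = np.maximum(X @ W + b, 0.0)               # hidden-layer output matrix H with g = ReLU
    # Output weights from the Moore-Penrose pseudo-inverse of H; the explicit
    # rcond discards numerically zero singular values.
    beta = np.linalg.pinv(H, rcond=1e-10) @ t
    return W, b, beta

def predict_elm(X, W, b, beta):
    """Apply the fixed hidden layer and the learned output weights."""
    return np.maximum(X @ W + b, 0.0) @ beta

# Toy "One-Versus-All" problem: class 1 against everything else.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(100, 2)),
               rng.normal(2.0, 0.5, size=(100, 2))])
t = np.concatenate([np.zeros(100), np.ones(100)])

W, b, beta = train_elm(X, t, n_hidden=20, rng=rng)
scores = predict_elm(X, W, b, beta)
accuracy = np.mean((scores > 0.5) == (t == 1))
```

No gradient descent or iterative tuning appears anywhere in the sketch; the only learned quantity is the output weight vector, which is why training is fast.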

Aggregation.
After creating classification models for each class, their outputs are collected simultaneously in order to design the aggregated model. This architecture makes the proposed system scalable to any new classes. The aggregated model is a fully connected layer followed by a logistic regression layer. The fully connected layer is very important for capturing all combinations of the ELM classifier outputs to improve the classification.
In order to distinguish between all classes, we were interested in the multinomial form of the logistic regression layer. Multinomial logistic regression uses maximum likelihood estimation with Newton's method [24]. The softmax function was used as the activation function of the logistic regression layer, which has ten neurons, matching the length of the input vector, to make soft decisions at the output. The softmax function is defined as follows:

$$\operatorname{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

where $z_i$ is the ith element of the input vector of the layer with n neurons. This stage lets the system treat intrusion detection as a multiclass classification problem. In order to improve overall performance, the Adam optimization method was chosen for its simplicity and computational efficiency [25].
Working with information content (entropy) is very intuitive when handling probabilities; sparse categorical cross-entropy has been used as a cost function for multiclass classification tasks with the softmax layer [26]. In this way, the cross-entropy H between two probability distributions p, q is

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

Along these lines, the proposed system offers multiple stages defined by algorithms chosen to be as flexible, fast, and simple as possible.
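The softmax activation and the sparse categorical cross-entropy loss described above can be illustrated in a few lines of plain Python. The function names and the three example scores are assumptions made for this sketch; in the sparse form the target is a class index, so the sum over p reduces to the negative log-probability of the true class.

```python
import math

def softmax(z):
    """Map raw scores to probabilities that sum to one."""
    shift = max(z)                       # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_cross_entropy(probs, true_class):
    """Cross-entropy when the target distribution p is one-hot at true_class."""
    return -math.log(probs[true_class])

scores = [2.0, 1.0, 0.1]                 # hypothetical layer inputs for three classes
probs = softmax(scores)
loss = sparse_categorical_cross_entropy(probs, true_class=0)
```

The loss approaches zero as the probability assigned to the true class approaches one, and it grows without bound as that probability approaches zero.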

Experimental Setup and Results
Our proposed system was developed in Python. It was run on an 8th-generation Intel Core i7 processor with 8 GB of RAM.
Some details about the UNSW-NB15 dataset should be provided before diving into the results. This dataset was chosen for its advantages over older standard datasets. The KDD98, KDDCUP99, and NSLKDD datasets suffer from a lack of modern cyberattack types, inadequate normal traffic, and an unequal distribution of classes between the training and testing sets. The UNSW-NB15 has been presented as a benchmark dataset specialized in IDS design [27] to address these problems.

UNSW-NB15 Dataset.
According to [28], the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) at UNSW in Canberra presented the new UNSW-NB15 dataset, considering the limitations of the older existing datasets. The IXIA PerfectStorm tool was used to create a combination of recent malicious and benign network traffic behaviors. The dataset consists of nine types of modern cyberattacks, labeled Analysis, Backdoors, DoS, Exploits, Fuzzers, Generic, Reconnaissance, Shellcode, and Worms, in addition to the normal packets, named Normal, which were captured using the Tcpdump tool. The packets within the dataset are described by 49 different features provided by the Argus and Bro-IDS tools and twelve additional algorithms. The most used files, UNSW_NB15_training-set.csv with 175,341 records and UNSW_NB15_testing-set.csv with 82,332 records, are partial datasets that are publicly available to help researchers train and test IDSs, respectively. Table 1 shows the samples for each class and their percentage.

Performance Metrics.
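The best ways to illustrate the classification results when applying supervised learning models are accuracy, precision, recall, F1-score, false alarm rate, ROC, and PRC.
False alarm rate (FAR) is one of the important measures that focus on misclassification ratios. It is the average of the false positive rate (FPR), the ratio of misclassified samples over all normal samples, and the false negative rate (FNR), the ratio of misclassified samples over all attack samples [28]:

$$\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{FNR} = \frac{FN}{FN + TP}, \qquad \mathrm{FAR} = \frac{\mathrm{FPR} + \mathrm{FNR}}{2}$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.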
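The definition of FAR as the average of FPR and FNR can be checked with a short calculation. The function name and the confusion-matrix counts below are hypothetical values chosen for illustration, not results from this study.

```python
def false_alarm_rate(tp, tn, fp, fn):
    """Return (FPR, FNR, FAR) from the four confusion-matrix counts."""
    fpr = fp / (fp + tn)   # misclassified normal samples over all normal samples
    fnr = fn / (fn + tp)   # misclassified attack samples over all attack samples
    return fpr, fnr, (fpr + fnr) / 2

# Hypothetical counts: 1000 normal samples and 100 attack samples.
fpr, fnr, far = false_alarm_rate(tp=90, tn=950, fp=50, fn=10)
```

With these counts, 50 of 1000 normal samples and 10 of 100 attack samples are misclassified, so FAR is the mean of 0.05 and 0.10.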

Receiver Operating Characteristics (ROC).
The ROC curve is a 2D graphical plot with the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis [30]. To show how well a classifier distinguishes between two classes, it connects the points obtained at the different decision thresholds of a binary classifier. One common measure associated with the ROC curve is the area under the curve (AUC), with values between 0 and 1. A higher AUC (more than 0.5) indicates a better-trained classifier, one that assigns higher probabilities to correct predictions and lower probabilities to incorrect ones. A badly trained classifier has a diagonal ROC curve with an AUC close to 0.5.
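The construction of the ROC curve by sweeping the decision threshold, and the trapezoidal computation of its AUC, can be sketched as follows. The function names, labels, and scores are assumptions made for this example only.

```python
def roc_curve_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical classifier scores for three attack (1) and three normal (0) samples.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
area = auc(roc_curve_points(scores, labels))
```

Here eight of the nine possible (attack, normal) pairs are ranked correctly, and the computed area equals that fraction, which is the usual ranking interpretation of the ROC AUC.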

Precision-Recall Curve (PRC).
PRC is an alternative metric for the proper evaluation of binary classifiers on an imbalanced dataset. As its name suggests, it is a visual plot showing how precision on the y-axis is linked to recall on the x-axis [31].
To construct the PRC, a (recall, precision) pair is computed for each decision threshold. AUC is also used with the PRC, with the same meaning as for ROC curves.

As shown in Figure 1, the implementation of the proposed system consists of several phases that are applied to the training set. The training set is divided into 80% for training and 20% for validation. The results are shown only for the testing set.

Resampling.
SMOTE is used to oversample the minority classes whose percentage in the training set is less than 2%, which are Analysis, Backdoors, Shellcode, and Worms. The other classes are kept without any resampling.

Preprocessing
Data Cleaning. Some instances whose values are spaces and "-" are dropped. Some numeric values which are stored as text types are converted into number types. Because some types of attacks exist for the same attack name in different syntaxes such as the upper and lower cases, they are unified to the same format. Null values are replaced by the median of the feature column.
One-Hot Encoding. One-hot encoding encodes the three nominal attributes (proto, state, and service) to get new columns filled with ones and zeros.
Z-Score Normalization. After numeric columns are obtained, a z-score is implemented to normalize the attributes scales for every single column.
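The three preprocessing steps above can be illustrated on toy columns using only the Python standard library. This is a sketch under stated assumptions: the helper names, the `proto` values, and the numeric `duration` column are made up for the example, and the population standard deviation is used for the z-score.

```python
from statistics import mean, median, pstdev

def impute_median(column):
    """Replace None entries with the median of the observed values."""
    fill = median(v for v in column if v is not None)
    return [fill if v is None else v for v in column]

def one_hot(column):
    """Encode a nominal column as one 0/1 column per category (sorted order)."""
    categories = sorted(set(column))
    return categories, [[int(v == c) for c in categories] for v in column]

def z_score(column):
    """Standardize a numeric column: subtract the mean, divide by the std."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma if sigma else 0.0 for v in column]

proto = ["tcp", "udp", "tcp", "arp"]     # toy nominal feature
duration = [0.5, None, 1.5, 4.0]         # toy numeric feature with a null value

categories, proto_encoded = one_hot(proto)
duration_scaled = z_score(impute_median(duration))
```

After these steps every column is numeric, the nominal feature is spread over one indicator column per category, and the scaled column has zero mean and unit standard deviation.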

Feature Selection.
The Gini Impurity criterion is used as the decision-maker for the Extra Trees Classifier to extract the optimum features for every class in the dataset. After testing multiple Gini Impurity values, the best value is 0.02, which reduces the number of features, as shown in Figure 4, to the minimum with no overall performance degradation at all.

Classification.
For each ELM classifier, the chosen activation function is "ReLU", and the number of neurons is iteratively tuned to achieve the best results. Figure 5 shows the final number of neurons in the single hidden layer of each ELM classifier.
In the graphs shown in Figures 6-10, the classification results of the ELM classifiers are collected by applying them to the testing set for each category. Figure 6 shows the per class FPR, FNR, and FAR. Figure 7 illustrates the accuracy, precision, recall, and F1-score.
ROC curves are drawn in Figure 8 for each binary classifier, and Precision-Recall curves are plotted in Figure 9. AUC for ROC and Precision-Recall curves are grouped as shown in Figure 10.

Aggregation.
The classification results of the ELM classifiers are gathered to feed a logistic regression layer with softmax as the activation function to make soft decisions for each class. Also, the sparse categorical cross-entropy is used as the loss function, and Adam is chosen as the optimizer. The overall accuracy is ultimately 0.9843.

Comparison with Related Works.
Now that the numeric results have been shown, our suggested system can be compared with other related studies in order to appreciate the performance improvement in multiclass classification that our system presents. The comparison metrics available are the accuracy, TPR, FPR, and F1-score, as outlined in Tables 3, 4, 5, and 6, respectively.

Discussion.
The results offered for comparison with the proposed system in [8, 12-19] are obtained using the partial datasets shown in Table 1. However, other results in [7, 9-11] are obtained using the full dataset of 2,540,044 records, including training and testing sets [28]. The suggested IDS, as indicated throughout this paper, is a combination of SMOTE, Extra Trees Classifiers, and ELM classifiers. SMOTE makes it possible to classify minority classes, rather than ignoring them as in [7,13,16]. In addition, Extra Trees Classifiers are selected to obtain the minimum number of features, as shown in Figure 4, as compared to [7,17].
Some studies are more interested in recall than in other scores, for example, [10,11,13,14,15,16]. It makes sense that recall, as the ratio of correctly classified attacks over all attack samples, is the important metric in anomaly detection problems. In these problems, we focus more on attack samples than on normal ones because misclassifying an attack sample causes more damage than misclassifying a normal one.
Some related studies focus on accuracy only, such as [9,12]. However, classification accuracy does not provide enough information about the robustness of machine learning models. Other metrics such as precision, recall, F1-score, and false alarm rate must be taken into consideration during the design of these models.
For example, assuming that a dataset contains several packets, 99% of packets are labeled as normal whereas only 1% are labeled as abnormal. Assume that a model is trained somehow to classify all packets as normal. So, the accuracy will be 99%. Although the classification accuracy is high, logically, the result is disappointing as it will not be able to detect any attacks.
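This example can be reproduced numerically. The sketch below assumes a hypothetical set of 1000 packets with the 99% to 1% split described above and a model that always predicts "normal".

```python
n_normal, n_attack = 990, 10             # 99% normal, 1% abnormal
y_true = [0] * n_normal + [1] * n_attack
y_pred = [0] * len(y_true)               # model that always predicts "normal"

# Accuracy: share of all packets that are labeled correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Recall: share of attack packets that are detected.
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / n_attack
```

The accuracy is 0.99 while the recall is 0.0, which is why accuracy has to be read together with recall and the false alarm rate.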
Another notable observation from Figures 4 and 5 is that the number of hidden-layer nodes of the ELMs increases with the number of features extracted by Extra Trees. Figure 6 shows that the FNR values for most classes are greater than the FPR values because each classifier is trained on a large number of samples that are not related to its attack relative to the small number of samples of that attack.
That makes the false decisions about samples that are not related to this attack fewer than the false decisions about the attack samples.
Our proposed system is designed to make the false alarm rate as low as possible in general without lowering the accuracy, recall, precision, and F1-score; a trade-off is needed.
In Figure 10, we observe that the ROC and PRC AUCs are nearly equal for each class. This shows the strong relationship between the two curves related to a single class. The two AUCs are generally different, especially when the classes are very imbalanced. However, thanks to SMOTE, which balances the classes, the two AUCs are close.
There are some noteworthy observations from the results; on the same dataset, the proposed system has outperformed other classification algorithms, as shown numerically in Tables 3, 4, 5, and 6, in particular with the lowest false positive rates and the highest accuracy, detection rates, and F1-scores.

Conclusion
As a result, our proposed system has presented a multiclass classifier for all existing categories in the standard dataset with soft decisions. We have shown that it implicitly introduces a binary classifier to detect normal and attack packets because one of the ELM classifiers is for the normal class which is defined by the "One-Versus-All" methodology.
It contains multiple stages defined by algorithms such as SMOTE, Extra Trees Classifiers, and ELM that were chosen to be as flexible, fast, and simple as possible. So, it can easily run on low-performance hardware. Finally, the results are displayed in terms of accuracy, precision, recall, F1-score, false alarm rate, ROC, and Precision-Recall curves to confirm the quality of classifiers.
For future work, the parallel approach makes the system smoothly scalable for any new attacks as new binary classifiers.
The proposed system can also be preceded by an unsupervised learning stage to detect normal and abnormal traffic without labels. If the abnormal behavior resembles one of the existing classes, it is categorized more specifically within the system.

Conflicts of Interest
The authors declare no conflicts of interest.