Empirical Evaluation of Noise Influence on Supervised Machine Learning Algorithms Using Intrusion Detection Datasets

Optimizing the detection of intrusions is becoming more crucial due to the continuously rising rates and ferocity of cyber threats and attacks. One of the popular methods to optimize the accuracy of intrusion detection systems (IDSs) is employing machine learning (ML) techniques. However, many factors affect the accuracy of ML-based IDSs. One of these factors is noise, which can take the form of mislabelled instances, outliers, or extreme values. Determining the extent of the effect of noise helps to design and build more robust ML-based IDSs. This paper empirically examines the extent of the effect of noise on the accuracy of ML-based IDSs by conducting a wide set of different experiments. The used ML algorithms are decision tree (DT), random forest (RF), support vector machine (SVM), artificial neural networks (ANNs), and Naïve Bayes (NB). In addition, the experiments are conducted on two widely used intrusion datasets, NSL-KDD and UNSW-NB15. Moreover, the paper also investigates the use of these ML algorithms as base classifiers with two ensemble learning methods, bagging and boosting. The detailed results and findings are illustrated and discussed in this paper.


Introduction
The number of cyber threats facing individuals, organizations, and government agencies is increasing rapidly. In addition, the ferocity of such attacks has increased to more destructive levels. High-level intruders, such as advanced persistent threats (APTs), are utilizing stealthier methods to penetrate protected systems and networks. This emphasizes the urgent need to optimize intrusion detection systems (IDSs).
IDSs are a form of technical security controls that can be used to detect different forms of intrusions, malicious patterns, probing attempts, and unauthorized activities. The accuracy of IDSs can be optimized using machine learning (ML) techniques, which have been used successfully in many fields such as image identification and pattern recognition. In addition, ML-based techniques are widely deployed in the intrusion detection field. One of the biggest advantages of ML-based IDSs is that they can be utilized to detect both misuse and anomaly attacks. However, ML-based techniques are more suitable for detecting anomalies, which is extremely important for confronting zero-day attacks.
New intrusion detection techniques, whether they are based on ML techniques or not, are usually tested on intrusion-related datasets that contain normal traffic and activities injected with different forms of attacks. These datasets are widely used to evaluate the effectiveness of newly proposed IDSs. However, the presence of noise might influence the effectiveness and impact the accuracy of ML-based IDSs, especially when implementing such systems in the real world, where noisy data are more likely to occur.
In this paper, we empirically investigate the effect of mislabelled instances on the effectiveness of ML-based IDSs. Different levels of noise are injected, and the classification accuracy of the ML-based IDSs is studied. Furthermore, the effect of noise filtering on the accuracy of ML-based IDSs is also analyzed. Noise is filtered by excluding outliers and extreme-value instances. The aim of analyzing noise with the ML algorithms is to test the robustness of different ML algorithms in noisy situations. In other words, it helps to determine the most noise-resilient ML algorithm. In addition, the paper also investigates the employment of ensemble learning techniques with the IDSs. The rest of this paper is organized as follows: Section 2 reviews IDSs, relevant ML algorithms, ensembles of classifiers, noise filtering, and several works from the literature related to the use of ML in IDSs. Section 3 presents the proposed methodology: several datasets related to intrusion detection are reviewed, and the conducted experiments are described, including noise injection, noise filtering, and the metrics used to compare the performance of the different ML models. Section 4 discusses the results and analyzes the outcomes of the conducted experiments. Finally, Section 5 concludes the paper and presents avenues for future work.

Literature Review
This section reviews intrusion detection, ML, and noise. It starts by demonstrating the types, components, and deployment of IDSs. ML categories, the ML algorithms that are used in this paper, and the ensemble learning methods are also discussed. It also discusses noise filtering by removing outliers and extreme values. Finally, this section demonstrates previous research efforts that focus on ML-based IDSs.

Intrusion Detection System
An IDS helps to detect many forms of attacks and sends alarms to the system or the security administrators. It is critical to develop an IDS that achieves high detection rates with no or minimal false alarms. IDSs can be broadly categorized into misuse and anomaly detection.
Elsayed et al. [1] gave credit to Anderson for introducing the concept of intrusion detection in his 1980 paper. That paper discussed the feasibility of detecting misuse by investigating audit trails. Elsayed et al. [1] pointed out that the original IDS was based on expert systems.
According to Rathore et al. [2], the concept of the IDS was initially introduced in 1986 [3] and 1987 [4] by Denning based on expert systems. The system was referred to as the intrusion detection expert system (IDES). However, Kumar and Venugopalan [5] gave balanced credit; according to them, Anderson introduced the concept of the IDS, while Denning formalized the IDS by introducing a rule-based expert system that can detect malicious users' activities in real time. An overview of the historical IDS milestones can be found in [5].

Types of IDSs.
The way the collected data are inspected defines how efficiently an IDS can detect different categories of probes and intrusions. Generally, the methods that an IDS analysis engine uses to inspect the collected data can be broadly categorized into either misuse detection or anomaly detection.
Misuse detection approaches operate by matching the current behavior with known and previously defined attack patterns, signatures, or rules [6]. Therefore, it is critical to keep an IDS up to date and inclusive of different known attack patterns, signatures, and rules to ensure the effectiveness of the misuse detection IDS. However, it is not feasible to use misuse-based methods to detect zero-day attacks because no patterns, signatures, or rules are defined for them.
On the other hand, anomaly detection techniques operate based on the ability to detect deviations from normal behavior or patterns [7]. They operate on the principle that malicious behavior differs from normal behavior, which makes it feasible to detect [6]. Defining normal behavior requires a period of time; consequently, an anomaly-based IDS needs some time after installation to become active. Anomaly-based IDSs are more prone to errors, producing high rates of false alarms.
ML-based IDSs can be used to detect both misuse and anomalies [5,6]. In particular, supervised ML can be used for misuse detection, whereas unsupervised ML-based techniques can be used to conduct anomaly detection. Table 1 highlights the main differences between the misuse and the anomaly intrusion detection methods [6].

Components of IDSs.
A typical network-based IDS consists of four components: a decoder, a preprocessor, a decision engine sensor, and a defense response. The decoder uses data collection tools to gather specific network traffic in raw format and passes these data to the second component. The preprocessor examines the set of protocols that are used in the transmission of the data and extracts certain features. After that, the decision engine sensor utilizes these features to differentiate normal traffic from malicious traffic. If malicious traffic is detected, it sends a signal to the defense response, which, in turn, sends an alert to the security administrators and logs the event in the database [8]. Figure 1 illustrates the typical components of an IDS.

Deployment of IDSs.
IDSs can be deployed either in the network or within the host. An IDS that is deployed across the network is typically called a network-based IDS (NIDS), and an IDS that runs at a certain host is called a host-based IDS (HIDS) [9]. A NIDS can be used to monitor the traffic within the organization's network and is typically installed at the network choke points. On the other hand, a HIDS can be installed on critical servers to detect potential intrusions or misuses. It can also be installed on decoy systems (honeypots) to detect potential intrusions.

Machine Learning
Machine learning (ML) is a branch of artificial intelligence (AI). The term "machine learning" was coined by Arthur Samuel back in 1959 when he was working at IBM [5]. Samuel defined ML as "the field of study that gives computers the ability to learn without being explicitly programmed" [5]. It is closely related to computational statistics [10]. The first ML algorithms emerged in the 1970s [11]; nowadays, ML has become a popular research area [12].
ML algorithms can be classified into four distinct categories: supervised learning, semisupervised learning, unsupervised learning, and reinforcement learning [13]. Both supervised and unsupervised learning can be divided further into two subcategories. The general classification of ML is presented in Figure 2.

Supervised Learning.
In supervised learning, an algorithm is presented with a set of data that contains the input data and the corresponding output. The algorithm attempts to discover the underlying relation between the input and the output [11]. In supervised learning, data are labelled. Supervised learning can typically be used in two categories of problems [14]:
(i) Classification problems, which aim to predict one of a discrete number of values
(ii) Regression problems, which aim to predict a continuous-valued output
This paper focuses on supervised learning. More specifically, to study the effect of noise on the classification models, each ML model will be trained on labelled instances of two different IDS datasets (described later). In the first dataset, each instance is labelled as either normal or anomaly. In the second dataset, instances are labelled as either 0 (normal) or 1 (malicious).

Semisupervised Learning.
A semisupervised learning algorithm combines supervised and unsupervised learning by leveraging a small percentage of labelled instances to generalize over a large percentage of unlabelled instances.
This is very useful in certain cases, such as searching for a particular person in an extremely large set of images. For instance, a certain person could be manually labelled in a few images, and the model will then start searching for that person in the large remaining part of the image set [13].

Unsupervised Learning.
Unsupervised learning attempts to deduce the hidden structure of the datasets. Typically, all data in unsupervised learning algorithms are unlabelled. Unsupervised learning problems can be categorized as follows [11]:
(i) Clustering problems, which attempt to divide the data into clusters that satisfy certain criteria
(ii) Dimensionality reduction problems, which attempt to reduce the dimensionality of the data while maintaining its fundamental aspects, i.e., the highest variability

Reinforcement Learning.
This form of learning is based on a trial-and-error approach, in which a learning system gathers certain data and takes an action; if the action yields a positive result, then a reward is recorded. If the action results in an unwanted outcome, then the system learns that this action in that context will not likely work well in future events. This form of learning is suitable for certain fields such as robotics. It can be utilized to teach robots how to move, carry an object, or avoid physical obstacles [13]. There are a large number of ML algorithms. The ML algorithms that are used in this paper are as follows.

Decision Trees (DTs).
A DT uses a set of rules to classify data based on the attributes' values [10]. The classification is represented in a tree-structured format, where branches represent the selection of the input feature values that lead to the classifications and leaves represent class labels. DTs are characterized by high classification accuracy and simplicity of implementation. However, decision trees are biased toward multilevel features [15].
In this paper, the J48 flavor of decision trees is used. During the tree construction, the training dataset is recursively divided into several subsets. The best split is often based on the children's impurity, such as the entropy, which is defined as follows [16,17]:

Entropy = -\sum_{i=1}^{c} p_i \log_2 p_i,

where c is the number of classes, p_i is the proportion of instances belonging to class i, and 0 \log_2 0 = 0 in entropy calculations.
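As an illustration, the entropy of a candidate node can be computed from the class labels of the instances it contains. The following sketch uses hypothetical helper names and is not tied to WEKA's J48 implementation:

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels; classes with zero count
    # contribute no terms, which realizes the 0*log2(0) = 0 convention
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# a pure node is maximally certain; a 50/50 two-class node is maximally impure
assert entropy(["normal"] * 8) == 0.0
assert entropy(["normal"] * 4 + ["anomaly"] * 4) == 1.0
```

A split is then chosen to maximize the drop in entropy between the parent node and the weighted average of its children.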

Random Forest (RF).
RF combines multiple DTs, where each DT is constructed based on the values of independent random vectors. The result of the RF can be controlled by majority or weighted voting [15]. RF can be considered one of the ensemble learning methods [2] because RF combines the outcomes of several DTs [6]. When the number of trees is sufficiently large, the upper bound of the generalization error converges according to the following formula [16]:

PE^{*} \le \frac{\bar{\rho}\,(1 - s^{2})}{s^{2}},

where \bar{\rho} is the average correlation among the trees and s is a measure of the strength of the tree classifiers. The strength refers to the average performance of the classifiers, measured probabilistically as

s = E_{X,Y}\left[ P_{\theta}(Y_{\theta} = Y) - \max_{j \ne Y} P_{\theta}(Y_{\theta} = j) \right],

where Y_{\theta} is the predicted class of X according to a classifier built from some random vector \theta, and the bracketed quantity is the margin. The higher the margin, the more likely the classifier correctly predicts an example X.
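The majority-voting aggregation that RF applies over its trees can be sketched in a few lines of Python (an illustrative stub over precomputed per-tree predictions, not an actual RF implementation):

```python
from collections import Counter

def majority_vote(tree_predictions):
    # aggregate one predicted label per tree by simple majority
    return Counter(tree_predictions).most_common(1)[0][0]

# two of three hypothetical trees vote "anomaly", so the forest does too
assert majority_vote(["anomaly", "normal", "anomaly"]) == "anomaly"
```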

Support Vector Machine (SVM).
SVM is a supervised learning algorithm that uses a hyperplane to classify the data. The hyperplane is used to separate the data into different classes based on the feature space. The SVM algorithm is among the most robust and accurate ML algorithms [10]. A linear SVM classifier differentiates between instances from two classes according to the following equation [13]:

f(x) = \operatorname{sign}(w \cdot x + b),

where x represents an instance to be classified, w represents the weight vector, and b is the bias value.
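As a minimal sketch of the linear decision rule (the weights and bias below are illustrative values, not a trained model):

```python
def linear_svm_predict(x, w, b):
    # decision function w . x + b; its sign selects the predicted class
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "anomaly" if score >= 0 else "normal"

assert linear_svm_predict([2.0, 1.0], w=[1.0, -1.0], b=-0.5) == "anomaly"  # score 0.5
assert linear_svm_predict([0.0, 1.0], w=[1.0, -1.0], b=-0.5) == "normal"   # score -1.5
```

Training an SVM amounts to choosing w and b so that this hyperplane separates the two classes with maximal margin.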

Artificial Neural Networks (ANNs).
The ANN was inspired by the human brain. The implementation of the ANN in the computer realm consists of a number of artificial neurons that are distributed across different layers. An ANN typically consists of three layers [6,18,19]: the input layer, the hidden layer, and the output layer. The dataset is passed to the input nodes, and the output nodes are used to present the classification results. This paper uses the multilayer perceptron (MLP) to implement the ANN within the experiments. The MLP is an ANN that is usually trained using the backpropagation algorithm [17,20]. The output of hidden neuron i can be calculated as follows:

h_i = f_1\left( \sum_{j=1}^{n} W_{ij} x_j + b_i \right),

where x_j, j = 1, ..., n, are the input data, W_{ij} denotes the weight connecting input neuron j to hidden neuron i, b_i is the bias for neuron i, and f_1 is the used transfer function. The output of output neuron j can be calculated as follows [17]:

y_j = f_2\left( \sum_{i=1}^{n_k} V_{ij} h_i + b_j \right),

where n_k is the number of hidden neurons and V_{ij} denotes the weight connecting hidden neuron i to output neuron j. The goal of the ANN is to minimize the total sum of squared errors between the predicted output and the true output (ground truth) [16]:

E = \frac{1}{2} \sum_{j} (t_j - y_j)^{2},

where t_j is the true output of output neuron j. The weights of the network can be updated according to the gradient descent method as follows [16]:

W_{ij} \leftarrow W_{ij} - \eta \frac{\partial E}{\partial W_{ij}},

where \eta is the learning rate.
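A single forward pass through such a network can be sketched as follows. The sigmoid is assumed here as the transfer function for both f_1 and f_2, and the weights are illustrative values, not a trained model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, W, b_hidden, V, b_out):
    # hidden activations: h_i = f1(sum_j W[i][j] * x[j] + b_hidden[i])
    h = [sigmoid(sum(w * xj for w, xj in zip(row, x)) + b)
         for row, b in zip(W, b_hidden)]
    # output activations: y_j = f2(sum_i V[j][i] * h[i] + b_out[j])
    return [sigmoid(sum(v * hi for v, hi in zip(row, h)) + b)
            for row, b in zip(V, b_out)]

y = mlp_forward([1.0, 0.5],
                W=[[0.2, -0.4], [0.7, 0.1]], b_hidden=[0.0, -0.1],
                V=[[0.5, -0.3]], b_out=[0.05])
assert len(y) == 1 and 0.0 < y[0] < 1.0
```

Backpropagation would then compute the gradient of E with respect to each weight and apply the update rule above.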

Naïve Bayes (NB).
NB is called naïve because it relies on a simplifying assumption: it assumes that the attribute values are conditionally independent of each other. One of the main advantages of Naïve Bayes is the speed of training and testing. However, the attribute values of some intrusion detection datasets, such as KDD CUP 99 and NSL-KDD, are highly dependent on each other [7]. This means that the accuracy of NB is highly influenced by the degree of dependency between the attribute values of the datasets [15]. When building an IDS classifier based on the NB algorithm, the classification is performed according to the following equation:

C = \arg\max_{1 \le k' \le k} \; p(C_{k'}) \prod_{i=1}^{m} p(f_i \mid C_{k'}),

where m represents the number of features, k represents the number of classes, f_i stands for the i-th feature, C_k stands for the k-th class, p(C_k) is the prior probability of C_k, and p(f_i | C_k) represents the conditional probability of f_i given C_k.
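The classification rule above can be sketched in Python, with made-up prior and conditional probabilities purely for illustration:

```python
from math import prod  # Python 3.8+

def nb_classify(priors, likelihoods):
    # likelihoods[c] holds p(f_i | C_c) for each observed feature value;
    # the winning class maximizes p(C_c) * product_i p(f_i | C_c)
    return max(priors, key=lambda c: priors[c] * prod(likelihoods[c]))

priors = {"normal": 0.7, "anomaly": 0.3}
likelihoods = {"normal": [0.1, 0.2], "anomaly": [0.6, 0.5]}
assert nb_classify(priors, likelihoods) == "anomaly"  # 0.3*0.30 > 0.7*0.02
```

In practice the per-class likelihoods are estimated from the training data, which is why dependent attributes distort the product.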

Comparison of the ML Algorithms.
A comparison between the above ML algorithms is illustrated in Table 2 [15,21]. More details about the differences between common ML algorithms can be found in [21].

Ensemble of Classifiers.
Ensemble learning consolidates several techniques to obtain better overall accuracy, outperforming the accuracy of each single technique [8]. One of the main advantages of ensemble learning is that it might mitigate overfitting issues and improve the classification accuracy. On the other hand, the disadvantages of ensemble learning include the increase in the time and memory required to build the model compared to individual classifiers. Additionally, the learned concepts might become more difficult to understand, i.e., less plausible. Popular methods for conducting ensemble learning are bagging, boosting, voting, and stacking [6]. This paper utilizes both bagging and boosting.
Bagging, also called bootstrap aggregation, is an ensemble method that is used to improve accuracy and reduce overfitting. This is achieved by deploying a model-averaging technique. On the other hand, boosting operates by training multiple weak learners and aggregating the weighted results [15]. In this paper, we have used the AdaBoost method to implement boosting on the intrusion datasets.
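A minimal sketch of bagging follows, assuming a toy base learner that simply predicts the majority class of its bootstrap sample; real bagging would train full classifiers such as DTs on each replicate:

```python
import random
from collections import Counter

def bootstrap(dataset, rng):
    # draw a sample of the same size as the training set, with replacement
    return [rng.choice(dataset) for _ in dataset]

def bagging(dataset, build_classifier, n_models, seed=0):
    # train one base model per bootstrap replicate; the returned ensemble
    # predicts by majority vote over the base models' predictions
    rng = random.Random(seed)
    models = [build_classifier(bootstrap(dataset, rng)) for _ in range(n_models)]
    return lambda x: Counter(m(x) for m in models).most_common(1)[0][0]

def majority_class_learner(sample):
    # toy base learner: always predicts the majority class of its sample
    label = Counter(lbl for _, lbl in sample).most_common(1)[0][0]
    return lambda x: label

data = [(i, "normal") for i in range(9)] + [(9, "anomaly")]
ensemble = bagging(data, majority_class_learner, n_models=11)
assert ensemble(0) == "normal"
```

Boosting differs in that each successive learner is trained on a reweighted dataset that emphasizes the instances the previous learners misclassified, and the final vote is weighted.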

Noise Filtering.
Noisy data may affect the accuracy of different ML algorithms.
This paper adopts an empirical approach to study the potential effect of noise. It discusses the removal of outliers and extreme values to filter noisy data. There is a wide set of different algorithms that can be used to conduct noise filtering. This paper uses the interquartile range (IQR) filter for noise filtering.
This filter detects outliers and extreme values in a given dataset. According to this filter, the outliers are defined according to the following equations [22]:

Q3 + OF \times IQR < x \le Q3 + EVF \times IQR \quad \text{or} \quad Q1 - EVF \times IQR \le x < Q1 - OF \times IQR.

On the other hand, the extreme values are located according to the following equations:

x > Q3 + EVF \times IQR \quad \text{or} \quad x < Q1 - EVF \times IQR,

where Q1 is the 25% quartile, Q3 is the 75% quartile, IQR is the interquartile range (the difference between Q3 and Q1), OF is the outlier factor, and EVF is the extreme value factor. After applying this filter, outliers and extreme values are flagged in two additional attributes. After that, all instances flagged in these two attributes can be removed from the dataset. This means that outlier and extreme-value instances are excluded from the dataset, resulting in a reduction in the number of instances. Table 3 illustrates the number of instances in the used datasets before and after applying noise filtering for both datasets.
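The IQR rule can be sketched in pure Python as follows. The quartile interpolation and the default factors OF = 3 and EVF = 6 are assumptions based on WEKA's InterquartileRange filter defaults and may differ from the exact implementation:

```python
def iqr_flags(values, OF=3.0, EVF=6.0):
    # flag each value as "normal", "outlier", or "extreme" using the IQR rule
    xs = sorted(values)

    def quartile(q):
        # simple linear-interpolation quantile (assumed; WEKA may differ)
        pos = q * (len(xs) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    q1, q3 = quartile(0.25), quartile(0.75)
    iqr = q3 - q1
    flags = []
    for x in values:
        if x > q3 + EVF * iqr or x < q1 - EVF * iqr:
            flags.append("extreme")
        elif x > q3 + OF * iqr or x < q1 - OF * iqr:
            flags.append("outlier")
        else:
            flags.append("normal")
    return flags

# 100 lies far beyond Q3 + 6*IQR for this sample, so it is flagged extreme
assert iqr_flags(list(range(10)) + [100])[-1] == "extreme"
```

Instances flagged as outlier or extreme are the ones the filtering step removes.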
After removing the noisy instances, it is critical to verify that neither benign nor malicious traffic classes were entirely eliminated. To accomplish that, the distribution of the different classes of network traffic before and after noise filtering for both the training and the testing portions of the UNSW-NB15 dataset was calculated, and the results are illustrated in Table 4, which shows that none of the network traffic classes were entirely eliminated after the noise filtering process. This was feasible to calculate because one of the UNSW-NB15 attributes, the attack_cat attribute, records the attack category of each instance. Unfortunately, it was unfeasible to calculate the distribution of the different categories of network traffic for the NSL-KDD dataset because there is no attribute that distinguishes these categories from each other.

Relevant Literature.
Based on the authors' research in the literature, there is a great lack of papers that empirically discuss the subject of using ML techniques on IDS-related datasets in the presence of noise.
An empirical comparison of several classification algorithms on two intrusion-related datasets, KDD CUP 99 and NSL-KDD, was conducted by Hussain and Lalmuanawma [23]. Noise was injected into a specific set of attributes at levels of 10% and 20%. NSL-KDD is basically an enhanced version of KDD CUP 99: the redundant instances that existed in the KDD CUP 99 dataset were removed in the NSL-KDD dataset. The ML algorithms in that study achieved more realistic results with the NSL-KDD dataset in comparison with KDD CUP 99, which was an expected outcome due to the absence of redundant records in the NSL-KDD dataset. However, in the current manuscript, we investigate a different form of noise that could take place due to mislabelled instances, which is, in turn, injected only into a single attribute. Furthermore, in this paper, we use two independent intrusion datasets.

Table 2: Comparison of the ML algorithms [15,21].

DT
Advantages: (i) Very simple and fast; (ii) not affected by the increase of the dimensionality of the data; (iii) the model is easily understood (plausible); (iv) generates good accuracy based on the quality of the data; (v) supports incremental learning.
Disadvantages: (i) Requires a long time to train the model; (ii) requires a larger amount of memory for analyzing large databases; (iii) not suitable for problems that require diagonal partitioning; (iv) can generate a complex representation for some concepts due to replication.

ANN
Advantages: (iv) Computations can be accelerated due to the ANN's parallel nature; (v) applied successfully to different real-world issues such as handwritten character recognition and laboratory medicine.

NB
Advantages: (i) Requires a short time for training, and it is easy to construct the model; (ii) can be computationally optimized; (iii) can be used with large datasets; (iv) outcomes are easily interpreted; (v) it operates in a robust manner even though it might not be the best algorithm for a certain application.
Disadvantages: (i) Theoretically, classifiers based on the NB algorithm have a low error rate; however, in practice, this is not entirely true due to the assumption that different attributes are independent of each other; (ii) yields low accuracy compared to other ML algorithms.
Other efforts, such as [24,25], investigated the effect of noise on regression tasks, which entail attempting to predict a numerical value. In this paper, however, we focus on classification, which entails assigning instances to one of two or more predefined categories.
Wang [14] used a deep learning (DL) IDS in the presence of adversaries. Different attack algorithms taken from the image processing field were applied in the intrusion detection field, mainly with the NSL-KDD dataset. DL techniques are vulnerable to imperceptible adversarial malicious examples in the image classification domain; adversarial examples are noisy data specially crafted to attack the DL model. The Jacobian-based saliency map attack (JSMA) and the fast-gradient sign method (FGSM) were used to inject noisy data. Furthermore, the paper injected noisy data into the most salient features, which are the features that have the most influence on the results. Table 5 summarizes several recent journal articles from different publishers that discuss the subject of using ML methods in intrusion detection.

Proposed Methodology
This paper empirically investigates the robustness of ML algorithms on intrusion-related datasets with and without the presence of noise. The paper also examines the effectiveness of noise filtering algorithms with different ML algorithms.
This section discusses several intrusion-related datasets, with more focus on the datasets that are used in this paper, followed by an explanation of the proposed set of experiments, noise injection, noise filtering, the evaluation metrics, and the tool that is used to conduct the experiments.

Datasets.
Intrusion-related datasets are used to train and test the proposed IDS models. KDD CUP 99 and NSL-KDD have been extensively used in intrusion detection-related research studies [27]. Table 6 compares the major datasets related to intrusion detection [6,10,27,29,35,36].

DARPA 1998.
The DARPA 1998 dataset was created within the Intrusion Detection Evaluation Project. It consists of two sets of data, one for training and the other for testing. The DARPA 1998 dataset is one of the early datasets that were publicly available [37].

KDD CUP 99.
The Knowledge Discovery in Databases (KDD) CUP 99 dataset was derived from the DARPA 1998 dataset. It was explicitly generated to develop ML, classification, and clustering algorithms with more focus on security issues [37]. KDD CUP 99 has been described as a "de facto benchmark for evaluating the performance of intrusions detection algorithm" [32].
There is an issue associated with the KDD CUP 99 dataset, which is redundancy. There are 78% redundant records in the training dataset and 75% redundant records in the testing dataset [38]. Redundant records in the KDD CUP 99 dataset negatively affect the performance of the classifiers, biasing them toward the more frequently repeated records.

NSL-KDD.
The NSL-KDD was initially generated to overcome the drawbacks of the KDD CUP 99 dataset [27]. NSL-KDD was generated based on KDD CUP 99, but NSL-KDD does not have redundant records [31]. This makes NSL-KDD more suitable for evaluating ML classifiers. NSL-KDD has the same 41 attributes as the KDD CUP 99 dataset. Table 7 lists the different data files of the NSL-KDD dataset.
It can be noticed from Table 7 that the number of attributes for both KDD and NSL-KDD is 42. In Table 6, the number of attributes is 41; the additional attribute is the class, also called the label attribute. Details about the NSL-KDD attributes, such as the type and the description, are illustrated in Table 8.

UNSW-NB15.
The University of New South Wales created the UNSW-NB15 dataset for evaluating new NIDSs. 100 GB of raw network traffic was collected to build the UNSW-NB15 dataset. The UNSW-NB15 dataset consists of ten categories of traffic: one normal category and nine categories representing different forms of attacks [8]. However, ML-based techniques perform better with both KDD CUP 99 and NSL-KDD. This is due to two reasons. Firstly, many values of normal and malicious instances are almost the same in the UNSW-NB15 dataset, whereas there is a relatively reasonable difference between the normal and the malicious values in both KDD CUP 99 and NSL-KDD. Secondly, the data distribution of the training and testing sets of the UNSW-NB15 dataset is nearly the same, while it differs in KDD CUP 99 and NSL-KDD due to the existence of new attacks in the testing set, which helps to differentiate between normal and abnormal instances when running ML algorithms [8]. Tables 9 and 10 illustrate the number of attributes, the total number of instances, the number of normal instances, and the number of malicious instances, as well as the attributes' names and types.

ADFA.
The ADFA dataset was generated for host-level intrusion detection. It was created by the Australian Defence Force Academy (ADFA). The ADFA dataset encompasses two different platforms, Windows (ADFA-WD) and Linux (ADFA-LD) [10]. There are five categories of attacks within this dataset, which can be found in Table 6.

Dataset Preprocessing.
This paper utilizes two intrusion detection datasets: the NSL-KDD dataset and the UNSW-NB15 dataset. The NSL-KDD dataset can be used directly in the experiments because its class value is nominal, whereas the class value of the UNSW-NB15 dataset is numeric. Therefore, the class values of the UNSW-NB15 dataset must be converted from numeric to nominal.
The WEKA data mining tool provides a numeric-to-nominal filter that can be used to convert numeric values to nominal values. Typically, the class is located as the last attribute in the dataset [20]. In the NSL-KDD dataset, the last attribute, attribute number 42, contains two distinct nominal values, normal and anomaly. On the other hand, the class value of the UNSW-NB15 dataset, attribute number 45, consists of numeric values.

Proposed Experiments.
Generally, the proposed methodology consists of six distinct sets of experiments, illustrated in Figure 3. Each set of experiments aims to empirically test a certain aspect of ML-based IDS performance in the presence of noise. The first set of experiments involves generating the baseline, which is needed to investigate the potential impact of noise and the effectiveness of noise filtering techniques on ML-based IDSs. Furthermore, it is used to measure the influence of employing ensemble learning algorithms.
The results of the subsequent stages will be compared with the results of this stage.
In the second set of experiments, the interquartile noise filtering algorithm is applied to the datasets. This filter helps to identify outliers and extreme values in the datasets. The results of this phase are compared with the baseline. In the third set of experiments, noise is injected into the intrusion datasets at the following percentages: 5%, 10%, 20%, and 30%. Then the noisy data are fed into the ML models. The results are compared with the baseline obtained from the first stage. This is useful for testing the robustness of the ML algorithms against noise. The fourth set of experiments entails conducting noise filtering by excluding noisy instances and then injecting different levels of noise, namely, 5%, 10%, 20%, and 30%.
This helps to study the influence of noise on the ML algorithms that run on intrusion datasets in the absence of outlier and extreme-value instances. In the fifth set of experiments, noise injection is applied before noise filtering. In this experiment, noise is injected in the form of manipulating the labels of a given percentage of the training set. Noise filtering is conducted by removing outliers and extreme values. The sixth set of experiments studies the influence of using ensemble learning techniques on the accuracy of the ML algorithms when applied to the intrusion datasets.
A generic description of the conducted experiments is presented in the form of pseudocode.

Noise Injection.
The presence of noise might negatively affect the accuracy of ML algorithms. Noise can take different forms, such as mislabelled data or incorrectly classified instances.
To obtain more accurate outcomes, noisy data should be filtered first, as working with noise-free data helps to avoid potential issues such as overfitting [36]. Some experiments in this paper require injecting different levels of noise into the datasets. In WEKA, noise can be injected into the datasets via the AddNoise filter, which "changes a percentage of a given nominal attribute's values" [20]. The default noise level of the AddNoise filter is 10%; however, it can be adjusted to any other percentage.
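The effect of the AddNoise filter on a binary class attribute can be approximated with a simple label-flipping sketch (the function name and seeding below are illustrative, not WEKA's implementation):

```python
import random

def add_label_noise(labels, percent, seed=1):
    # flip the class of `percent`% of instances, chosen uniformly at random,
    # mimicking the effect of noise on a binary nominal class attribute
    noisy = list(labels)
    n_flip = round(len(noisy) * percent / 100)
    rng = random.Random(seed)
    for i in rng.sample(range(len(noisy)), n_flip):
        noisy[i] = "anomaly" if noisy[i] == "normal" else "normal"
    return noisy

clean = ["normal"] * 80 + ["anomaly"] * 20
noisy = add_label_noise(clean, percent=10)
assert sum(a != b for a, b in zip(clean, noisy)) == 10  # exactly 10% flipped
```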

Evaluation Metrics.
It is critical to specify adequate metrics to evaluate the ML algorithms on the intrusion detection datasets; the evaluation metrics should be relevant to the subject under examination. Ahmad et al. [26] used three evaluation metrics to compare the performance of SVM, RF, and EL for IDSs: accuracy, precision, and recall.
In this paper, the authors use the same evaluation metrics. The classification of each testing instance falls within one of four cases (usually referred to collectively as the confusion matrix):
(i) True positive (TP): malicious instances that were correctly classified as malicious by the IDS
(ii) True negative (TN): normal instances that were correctly classified as normal by the IDS
(iii) False positive (FP): normal instances that were incorrectly classified by the IDS as malicious
(iv) False negative (FN): malicious instances that were incorrectly classified by the IDS as normal
The above four values are used to calculate the evaluation metrics [37]. Accuracy represents the ratio of correctly classified instances to the total number of instances in the test dataset:

Accuracy = (TP + TN) / (TP + TN + FP + FN).

Precision represents the accuracy of the positive predictions:

Precision = TP / (TP + FP).

Recall is the ratio of positive instances that the classifier manages to classify correctly:

Recall = TP / (TP + FN).

Accuracy is the most relevant metric to the objective of this paper, as it indicates the success percentage of the ML algorithm in classifying the testing instances, i.e., determining whether each instance represents an intrusion or not. This helps to define the factors that can influence, either positively or negatively, the accuracy of the ML algorithms. Therefore, only the accuracy metric is used to assess and compare the different ML algorithms from the different experiments against the baseline; the other metrics are presented to provide a more holistic perspective on the effectiveness of the ML algorithms.
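The three metrics follow directly from the four confusion-matrix counts, as the following sketch shows (the counts used in the example are made up for illustration):

```python
def evaluation_metrics(tp, tn, fp, fn):
    # accuracy: correctly classified instances over all instances
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # precision: how many predicted positives were truly malicious
    precision = tp / (tp + fp)
    # recall: how many truly malicious instances were caught
    recall = tp / (tp + fn)
    return accuracy, precision, recall

acc, prec, rec = evaluation_metrics(tp=90, tn=85, fp=15, fn=10)
assert acc == 0.875   # (90 + 85) / 200
assert rec == 0.9     # 90 / 100
```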

Tool Used in This Research (WEKA).
This paper utilizes the Waikato Environment for Knowledge Analysis (WEKA), version 3.8.3, as the main tool to conduct the empirical experiments. WEKA is an open-source tool written in Java and designed to run ML algorithms. It was developed at the University of Waikato, New Zealand [39]. WEKA provides the ability to run many ML algorithms, such as J48 (a DT variant), RF, and SVM, on any dataset.

The ML Algorithms' Parameters.
Each ML algorithm has its own set of parameters that determines exactly how it operates. These parameters can be fine-tuned to enhance the performance of the ML-based classifiers. Table 11 lists the parameters assigned to the ML algorithms in the conducted empirical experiments.

Device Specifications.
The device used to conduct all the experiments is a Dell OptiPlex 9020 with an Intel i7-4770 CPU, 12 GB of random access memory (RAM), a 500 GB hard disk drive (HDD), and Microsoft Windows 10 Pro as the operating system (OS).

Results and Discussions
This section discusses the conducted experiments, the obtained results, and the analysis of the results.

Baseline Experiment.
The main objective of this set of experiments is to determine the data that will be used in subsequent experiments and to establish a baseline against which the results of those experiments can be compared. Therefore, three baseline experiments are conducted, each using the same ML algorithms. The only difference is the method of using the dataset, as follows:
(i) Supplied test set: the model is trained on the training dataset and tested on a separate testing dataset.
(ii) Percentage split: each dataset is divided into three equal parts; two-thirds are used for training and one-third for testing. In this paper, all ML algorithms were trained on the same two-thirds of each dataset and tested on the same one-third dedicated for testing.
(iii) Cross-validation: the entire training data of each dataset is used for both training and testing in an iterative manner. The dataset is divided into a certain number of equally sized partitions (folds); in each iteration, one fold is selected for testing and the remaining folds are used for training. In this paper, the training portion was divided into 10 folds.
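The three testing modes above can be sketched as follows, using scikit-learn as a stand-in for WEKA; the data here is synthetic, not the intrusion datasets.

```python
# Sketch of the three baseline testing modes (supplied test set,
# percentage split, and 10-fold cross-validation) with synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_train = rng.random((300, 5))                 # stand-in training dataset
y_train = (X_train[:, 0] > 0.5).astype(int)    # stand-in class labels
X_test = rng.random((100, 5))                  # stand-in supplied test set
y_test = (X_test[:, 0] > 0.5).astype(int)

clf = DecisionTreeClassifier(random_state=0)

# (i) Supplied test set: train on one file, test on a separate one.
acc_supplied = clf.fit(X_train, y_train).score(X_test, y_test)

# (ii) Percentage split: two-thirds for training, one-third for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_train, y_train, test_size=1 / 3, random_state=0)
acc_split = clf.fit(X_tr, y_tr).score(X_te, y_te)

# (iii) 10-fold cross-validation on the training data only.
acc_cv = cross_val_score(clf, X_train, y_train, cv=10).mean()
```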
The results of the first set of experiments are aggregated in Table 12, which includes details about the testing mode, the ML methods used, the names of the intrusion datasets, and the evaluation metrics. Figure 4(a) compares the results across the different testing modes using the NSL-KDD dataset, while Figure 4(b) compares the results across the different testing modes using the UNSW-NB15 dataset.
In the supplied test set mode, KDDTrain+ and KDDTest+ from NSL-KDD were used for training and testing, respectively, while UNSW_NB15_training-set and UNSW_NB15_testing-set from UNSW-NB15 were used for training and testing, respectively. In the percentage split and cross-validation modes, KDDTrain+ from NSL-KDD and UNSW_NB15_training-set from UNSW-NB15 were used.
With the NSL-KDD dataset, it can be noticed that the testing mode highly affects the classification accuracy of the ML algorithms. Subsequent experiments will be conducted using the supplied test set only, rather than splitting a single dataset or cross-validation, as this provides a better understanding of the different factors that might affect the accuracy of the ML algorithms on the intrusion datasets. For the NSL-KDD dataset, KDDTrain+ and KDDTest+ will be used to train and test the models, respectively. For UNSW-NB15, UNSW_NB15_training-set and UNSW_NB15_testing-set will be used to train and test the models, respectively.

Noise Filtering Experiment.
The second set of experiments applies noise filtering by removing the outlier and extreme value instances from the training and the testing datasets. The results from this set of experiments are listed in Table 13, and the classification accuracy results are compared with the baseline. Figures 5(a) and 5(b) show the results of using the filtered NSL-KDD dataset and the filtered UNSW-NB15 dataset, respectively. The accuracy values of the UNSW-NB15 dataset after conducting the instance reduction by removing the outlier and extreme value instances are as follows:
(i) For DT (J48), SVM, and ANN, the accuracy values of the filtered dataset remained at the baseline (100%). This means that the removal of the outlier and extreme value instances did not affect the classification accuracy of these ML algorithms when used with the UNSW-NB15 intrusion dataset.
(ii) The accuracy of the RF algorithm was slightly improved, by +1.4493%, due to the removal of the outlier and extreme value instances.
(iii) On the other hand, the accuracy of the NB algorithm decreased by −11.9827% as a result of noise filtering. This means the NB algorithm performs better in the presence of the outliers and extreme values in the UNSW-NB15 dataset.
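As a rough analog of this filtering step (the paper uses WEKA's built-in filters; the interquartile-range rule and the factor value below are our assumptions, not necessarily what WEKA applies), outlier and extreme-value removal can be sketched as:

```python
# Minimal IQR-based instance filter: drop any row with an attribute value
# outside [Q1 - f*IQR, Q3 + f*IQR]. The factor f is an assumed threshold.
import numpy as np

def remove_outliers_iqr(X, factor=3.0):
    """Return (filtered rows, boolean keep-mask) for a 2-D feature array."""
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    keep = np.all((X >= low) & (X <= high), axis=1)
    return X[keep], keep

# Example: ten ordinary values plus one extreme value (1000).
X = np.array([[v] for v in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]], dtype=float)
X_filtered, keep = remove_outliers_iqr(X)
# X_filtered keeps the 10 ordinary rows; the extreme row is dropped
```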

Noise Injection Experiment.
The third set of experiments aims to investigate the extent of the effect of noise on the five considered ML algorithms when applied to the intrusion-related datasets. In this set of experiments, different levels of noise (5%, 10%, 20%, and 30%) are added to each training dataset, and no noise is injected into the testing datasets. The noise is injected using WEKA's AddNoise filter and is applied solely to the class attribute. This helps to investigate the classification accuracy of the ML algorithms on the testing dataset, given that the models were trained on a noisy dataset. The results after injecting the different levels of noise are illustrated in Table 14. Figures 6(a) and 6(b) show the results of injecting noise into the NSL-KDD dataset and the UNSW-NB15 dataset, respectively.
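The label-noise injection step can be sketched in the spirit of WEKA's AddNoise filter; the implementation below is a simplified stand-in (binary 0/1 labels are assumed for simplicity), not the filter itself.

```python
# Sketch of class-label noise injection: flip the class attribute of a
# given percentage of training instances, chosen at random.
import numpy as np

def add_label_noise(y, percent, seed=1):
    """Return a copy of y with `percent`% of the binary labels flipped."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    n_flip = int(len(y) * percent / 100)
    idx = rng.choice(len(y), size=n_flip, replace=False)
    y_noisy[idx] = 1 - y_noisy[idx]      # binary classes: 0 <-> 1
    return y_noisy

# Example: inject 20% noise into 100 clean labels.
y_clean = np.zeros(100, dtype=int)
y_noisy = add_label_noise(y_clean, 20)
# exactly 20 of the 100 labels now differ from the clean labels
```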
Noise injection affected the accuracy of the ML algorithms tested on the NSL-KDD dataset. The accuracy of some algorithms decreases as the noise increases, while the accuracy of other algorithms increases. The observations from this set of experiments are as follows:
(i) The accuracy of the DT (J48) algorithm is only slightly influenced by the different levels of noise compared to the baseline, which is 81.5339%. DT (J48) achieved better accuracy with 10% noise than with 5% noise, and it also achieved better accuracy with 30% noise than with 20% noise.
(iii) Interestingly, with the SVM algorithm, the increase of noise causes an increase in the classification accuracy when compared with the baseline (75.3948%). With the different levels of noise, the SVM algorithm scored the following accuracy results: 76.2731%, 76.2376%, 78.7615%, and 78.7216%. It can be concluded that SVM managed to correctly classify more testing instances even though it was originally trained on noisy (mislabelled) data, except that, with the 30% noise injection, the classification accuracy increased when compared with the 10% and 20% noise injections (74.9867%, 73.9354%, 71.6776%, and 74.4677%).
The effect of noise injection on the accuracy of the ML algorithms tested on the UNSW-NB15 dataset can be summarized as follows:
(i) The classification accuracy of the DT (J48) algorithm is not affected by the 5%, 10%, and 20% noise levels and remained at the baseline (100%). However, the accuracy was severely affected when 30% noise was injected, dropping to 66.9412%. This means that the DT algorithm with the UNSW-NB15 dataset can maintain high classification accuracy up to a certain limit, beyond which it degrades heavily and cannot be used reliably.
(ii) With the RF algorithm, the classification accuracy continuously decreased as the noise increased. It can be concluded that the RF algorithm is sensitive to noise and should be trained on a correctly labelled dataset in order to obtain better classification outcomes.
(iii) The classification accuracy of the SVM algorithm was not affected by any level of noise and remained at the baseline (100%) even after injecting the different levels of noise. This means that the SVM algorithm managed to classify all testing instances successfully even though it was trained on a mislabelled dataset. Therefore, it can be considered the most robust algorithm in this set of experiments.
(iv) The classification accuracy of the ANN algorithm slightly decreased with the increase in noise. The baseline accuracy of the ANN algorithm is 100%, and the accuracy results after injecting the different levels of noise are 99.9988%, 100%, 99.9988%, and 99.9818%. This means that training the ANN algorithm on a noisy dataset does not greatly affect its classification accuracy.
(v) The baseline accuracy of the NB algorithm is 87.4435%, and the accuracy results after injecting the different noise levels are 77.5992%, 77.5883%, 77.2713%, and 77.7511%, respectively. This indicates that the effect of the different levels of noise on the NB algorithm is almost identical.
Figure 7 illustrates the impact of injecting the different levels of noise into the NSL-KDD dataset and the corresponding accuracy of the ML algorithms. It can be noticed that the SVM algorithm is the most resilient to noise injection, whereas the RF algorithm is the least resilient. Figure 8 illustrates the impact of injecting the different levels of noise into the UNSW-NB15 dataset and the corresponding accuracy of the ML algorithms. The major degradation in the classification accuracy of the DT algorithm with the 30% noise injection is readily apparent, in addition to the continuous decrease in the classification accuracy of the RF algorithm as the noise ratio increases.

Noise Filtering Then Injection Experiment.
In the fourth set of experiments, noise is first filtered by removing the outlier and extreme value instances, and then noise is injected at different levels (5%, 10%, 20%, and 30%) into the intrusion datasets. The third experiment investigated the effect of noise injection alone; the current experiment inspects the combined effect of noise filtering followed by noise injection.
Figure 10 shows the impact of injecting the different levels of noise into the filtered NSL-KDD dataset and the corresponding accuracy of the ML algorithms. It can be noticed that removing the outlier and extreme value instances and then manipulating the instances' labels with the different noise percentages made the classification accuracy of the ML algorithms nearly similar, except for the RF algorithm with the 20% noise injection and the DT algorithm with the 30% noise injection. Figure 11 shows the impact of injecting the different levels of noise into the filtered UNSW-NB15 dataset and the corresponding accuracy of the ML algorithms. The resilience of the RF algorithm to the different levels of noise injection remained weak even after the removal of the outlier and extreme value instances.

Noise Injection Then Filtering Experiment.
This set of experiments aims to investigate the influence of injecting noise before filtering the dataset for outliers and extreme values. The difference between this set of experiments and the previous one is the order of dealing with noise: the previous set conducted noise filtering and then injection, whereas in this set, noise injection is conducted before filtering the noisy instances. For the NSL-KDD dataset, the observations are as follows: (i) The classification accuracy of the DT (J48) algorithm changed with the 20% noise injection when compared with the 10% noise injection.
(ii) The classification accuracy of the RF algorithm continuously decreased with the increase of the noise levels.
(iii) The classification accuracies of both the SVM and NB algorithms were identical across all noise levels. This also matches the previous set of experiments, which means that the order of noise filtering and injection does not change the impact on the classification accuracy of these ML algorithms.
(iv) The ANN algorithm achieved a slight increase in classification accuracy with the 5% noise injection (+0.1505%) compared to the baseline. In contrast, the classification accuracy degraded continuously as the noise level increased further.
For the UNSW-NB15 dataset, the accuracy of the ML algorithms is similar to the previous set of experiments. This means that the order of conducting noise filtering and injection does not remarkably affect the classification accuracy. In particular:
(i) The DT algorithm maintained the same accuracy result with the 5% and 10% noise levels. A large degradation in classification accuracy occurred with the 20% and 30% noise levels (−26.608% and −29.7731%, respectively).
(ii) With the RF algorithm, the classification accuracy decreased as the noise levels increased.
(iii) Both the SVM and the ANN maintained the same accuracy results across the 5%, 10%, and 20% noise levels. A small decrease in accuracy occurred with the 30% noise injection: SVM accuracy degraded by −0.0127% and ANN accuracy by −0.1938%.
(iv) The NB algorithm achieved similar accuracy results across the different levels of noise.
Figure 13 shows the impact of injecting and then filtering noise in the NSL-KDD dataset and the corresponding accuracy of the ML algorithms. The performance of all algorithms degraded, with DT showing the highest accuracy at the maximum noise level. Figure 14 shows the impact of injecting and then filtering noise in the UNSW-NB15 dataset and the corresponding accuracy of the ML algorithms. The DT performance shows the greatest drop among all ML algorithms.

Ensemble of Classifiers Experiment.
In this experiment, two ensemble learning methods are used: bagging and boosting. Each ML algorithm is used as a base classifier with both bagging and boosting. The detailed results of this set of experiments are illustrated in Table 17. Figures 15(a) and 15(b) show the results of ensemble learning applied to the NSL-KDD dataset and the UNSW-NB15 dataset, respectively. With the NSL-KDD dataset, it can be noticed that neither of the ensemble methods caused a remarkable impact; the accuracy results indicate that the ML algorithms are only slightly affected by the ensemble learning methods. In particular:
(i) The baseline accuracy of the DT algorithm was 81.5339%, and it achieved a small increase in accuracy with bagging (83.7252%), representing a +2.1913% increase. With boosting, the achieved classification accuracy (77.8522%) represents a decrease of −3.6817%.
(ii) The classification accuracy of the RF algorithm was negatively affected by both ensemble learning methods. With bagging, it achieved 80.1499% classification accuracy, a slight decrease of −0.3017%. With boosting, the classification accuracy was 79.4092%, a decrease of −1.0424%.
(iii) The ensembled SVM achieved lower accuracy with bagging and higher accuracy with boosting. SVM classification accuracy with bagging is 75.0311%, a decrease of −0.3637%. Boosted SVM achieved an accuracy of 75.6343%, slightly greater than the baseline by +0.2395%.
(iv) The bagged ANN algorithm achieved an accuracy of 76.0956%, slightly lower than the baseline by −1.6836%. However, boosting yielded the same accuracy as the baseline, which is 77.7147%.
(v) Similar to the DT algorithm, the NB algorithm generated a greater accuracy value with bagging and a lower accuracy value with boosting. With bagging, it achieved 76.2952%, slightly greater than the baseline by +0.1774%; with boosting, it achieved a lower classification accuracy of 73.7447%, less than the baseline by −2.3731%.
In the UNSW-NB15 dataset, the ensemble methods did not remarkably improve the classification accuracy of the ML algorithms, except for the RF and NB algorithms. The observations on the classification accuracy are as follows:
(i) DT (J48), SVM, and ANN are not affected by bagging or boosting, and the accuracy after applying them remained at the baseline (100%).
(ii) With the RF algorithm, there is a slight increase in accuracy with bagging (+1.5073%). However, with boosting, the classification accuracy decreased by −7.9326%.
(iii) With the NB algorithm, there is a slight decrease in accuracy with bagging (−0.1506%). However, there is a large increase in the classification accuracy with boosting (+12.5055%); the accuracy value increased from 87.4435% to 99.949%. The factors that caused this sudden increase in the classification accuracy of the NB algorithm with boosting need to be further investigated. This might pave the way for developing NB-based IDSs capable of near real-time detection due to the high speed of the NB algorithm [6].

Conclusion and Future Work
Optimizing IDS capabilities is becoming an essential requirement due to the continuous increase of cyber threats and attacks. ML-based IDSs represent one of the emerging paradigms that can be used to confront misuse and anomaly attacks. However, the classification accuracy of ML-based IDSs is prone to several factors, and determining these factors helps to build better ML-based IDSs. The results from the empirical experiments of this paper illustrate that the classification accuracy of ML-based IDSs can be influenced by certain factors: the method of using the dataset for training and testing, the removal of outlier and extreme value instances, the injection of mislabelled instances, and the use of ensemble learning techniques. These factors have a diverse impact on the classification accuracy. In certain cases, they cause a negative impact, such as the effect of noise on the accuracy of the RF algorithm. In other cases, they caused a massive improvement, such as the NB algorithm with boosting when applied on the UNSW-NB15 dataset; and in yet other situations, these factors did not affect the accuracy of the ML-based IDSs at all.
This work can be expanded in several ways. First, algorithms for noise filtering and instance reduction, such as the Decremental Reduction Optimization Procedures (DROP3 and DROP5), can be used rather than the built-in noise filtering functions of WEKA. Other datasets, such as ADFA, CIC-IDS 2017, or CIC-DDoS 2019, can also be tested. The work can also be expanded by employing the fine-tuned NB algorithm [40], which improved NB performance by finding a better estimation of NB's probability terms, making the algorithm less stable and more suitable for building a bag of ensembles of classifiers [41]. Investigating the efficacy of these methods for evaluating the effect of noise using intrusion detection datasets is an interesting direction for future research. Feature selection can also be included in future work: the impact of including or removing different features can be studied empirically, which might represent an important factor in optimizing the classification accuracy of ML-based IDSs.
Data Availability
The two intrusion-related datasets used in this paper are publicly available for download. The NSL-KDD dataset can be accessed via https://www.unb.ca/cic/datasets/nsl.html, and the training and testing portions of the UNSW-NB15 dataset can be downloaded from https://www.unsw.adfa.edu.au/unswcanberra-cyber/cybersecurity/ADFA-NB15-Datasets/.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.