A Stacking Ensemble for Network Intrusion Detection Using Heterogeneous Datasets

The problem of network intrusion detection poses innumerable challenges to the research community, industry, and commercial sectors. Moreover, the persistent attacks occurring on the cyber-threat landscape compel researchers to devise robust approaches in order to address the recurring problem. Given the presence of massive network traffic, conventional machine learning algorithms when applied in the field of network intrusion detection are quite ineffective. Instead, a hybrid multimodel solution improves performance and thereby produces more reliable predictions. Therefore, this article presents an ensemble model using a metaclassification approach enabled by stacked generalization. Two contemporary as well as heterogeneous datasets, namely, UNSW NB-15, a packet-based dataset, and UGR'16, a flow-based dataset, that were captured in emulated as well as real network traffic environments, respectively, were used for experimentation. Empirical results indicate that the proposed stacking ensemble is capable of generating better predictions on a real-time dataset (97% accuracy) than on an emulated one (94% accuracy).


Introduction
Network intrusion detection is a significant research area since cyber attacks are increasing at an alarming rate [1]. Numerous studies have been put forward in order to propose noteworthy approaches for combating malicious cyber activities. However, as cyber attacks become more complex, the existing approaches fail to address the problem effectively. Traditional defensive strategies like firewalls, antivirus, and authentication seem to be inefficient for many complex threats because cyber-attack vectors are highly sophisticated [2]. Network intrusion detection is a major decision-making problem that can be addressed by the application of classification algorithms [3]. Several machine learning algorithms like fuzzy logic, neural networks, support vector machine, Naïve Bayes, K nearest neighbor, and decision trees have been employed in the field of network intrusion detection [4]. Whenever a combination or an ensemble approach is introduced, the performance of individual algorithms can be enhanced. The ensemble paradigm is a notable machine learning approach wherein different algorithms are employed to improve predictions. Some studies have also demonstrated that the application of the ensemble paradigm can prove to be versatile and boost prediction accuracy and detection speed [5][6][7][8]. Going by the same assertion, the proposed approach emphasises the application of supervised machine learning algorithms to propose a classification framework using a concept called stacked generalization. As illustrated in [9][10][11], stacking or stacked generalization is advantageous since the concept is based on combining predictions from different individual classifiers, which can substantially improve generalization too. The advantage of stacking was demonstrated in [12] for protein classification, where desirable accuracy was accomplished.
As explained in [13], classifier ensembles or combiners or committees offer better solutions by handling the bias-variance trade-off more effectively than individual classifiers. A comparative analysis was conducted to analyse SVM's performance along with classifiers like AdaBoost, J48, random forest, BayesNet, and logistic regression. It was evident that all the algorithmic combinations with SVM produced better results than individual SVM [14]. The implementation of the ensemble learning algorithm called super learner resulted in improved predictions using the MAWILab dataset [15]. One such ensemble learning paradigm is stacking, which considers several machine learning algorithms, uses a metamodel to combine predictions from individual algorithms, and thereby improves overall performance. By combining the advantages of multiple algorithms, the detection effect can be enhanced [16]. The stacking method was employed to detect malware on mobile devices and showed an improvement in accuracy and F measure [17].

Related Work
Several methods have been put forth by researchers to perform network intrusion detection using a combination of algorithms.
This section presents an overview of such combinative approaches that focus on improving the overall performance. An emerging approach for intrusion detection involving an ensemble design was put forth using a neutrosophic logic classifier, an extension to fuzzy logic. A genetic algorithm was used to generate rules. The aforesaid design could decrease the false alarm rate to 3.19% as compared to other approaches [18]. Support vector machine (SVM) is a well-known classifier that can learn from a limited set of samples and still optimize predictions [19]. It was demonstrated by Chen et al. [20] that SVM was superior to artificial neural networks (ANNs) in terms of detecting intrusions while experimenting with basic security module (BSM) audit data from the Defence Advanced Research Projects Agency (DARPA) intrusion detection dataset. This is because ANN requires a lot of training data, whereas SVM can perform better with relatively less data and can execute much faster. SVM is known to excel primarily in binary classification; when combined with other classifiers, however, it can yield better results for multiclass classification too.
An ensemble design involving multilayer perceptron and radial basis function demonstrated that superior performance could be attained by consolidating two individual models. Compared to individual models, the hybrid model devised by Govindarajan and Chandrasekaran [21] seemed to be more accurate. This study used a dataset developed at the University of New Mexico which consisted of both normal and abnormal traces pertaining to mail application. An intrusion detection system was designed using a combination of SVM and K nearest neighbor (KNN). Particle swarm optimization (PSO) generated weights were used to create an ensemble design that accomplished an improvement of 0.756% in accuracy as compared to the best base expert [22].
Rangadurai Karthick et al. [23] developed an adaptive intrusion detection approach by combining hidden Markov and Naïve Bayesian models. Empirical results indicated that the aforementioned combinative approach yielded favourable results and learned the nature of traffic quite efficiently. Traces from Center of Applied Internet Data Analysis (CAIDA) and DARPA datasets were used to implement the hybrid model.
Another two-step hybrid method based on binary classification and KNN was proposed to decrease the bias, normally encountered pertaining to classwise predictions.
Step 1 involved the usage of binary classifiers, and an aggregation module was employed to recognize abnormal connections, whereas in Step 2, KNN was used to classify those instances whose classes were undetermined after Step 1 [24].
A hybrid intrusion detection technique was proposed by Malik et al. [25] using binary particle swarm optimisation (BPSO) and random forest (RF) to classify probe attack patterns. BPSO, being a good search optimizer, and RF, an efficient classifier, contributed towards achieving better performance. This method was compared with eight other classifiers, and it was interesting to note that the BPSO-RF combination yielded better results when compared to individual classifiers.
An ensemble classifier using random forest, C4.5, and forest by penalizing attributes (FPA) was proposed by Zhou and Cheng [26].
This study used the average of probability (AOP) algorithm to merge the decisions from different classifiers using a modern intrusion detection dataset, CIC-IDS2017. Results indicated a very good increase in accuracy, i.e., 96.76%.
An insightful study was conducted by Khammassi and Krichen [27] using a combination of genetic algorithm and decision trees, wherein the genetic algorithm was used as a search strategy and decision trees were used for classification. It was observed that this approach achieved 81.42% accuracy and 6.39% false alarm rate using the UNSW NB-15 dataset.

Implementation Strategy
The objective of the proposed approach is to obtain reliable predictions by using an ensemble technique called stacking. The proposed study delineates the results obtained from two datasets captured in two diverse environments:
(i) Binary and multiclass classification results with respect to UNSW NB-15 [28,29] (an emulated dataset)
(ii) Results obtained using UGR'16 [30] (a cyclostationary dataset formulated through real traffic)
The University of New South Wales Network based 2015 (UNSW NB-15) is a dataset created by a cyber security research group at the Australian Center for Cyber Security [28,29]. The IXIA Perfect Storm tool was used to capture nine attack categories. This tool incorporates all the updated information needed to include newer attacks from the Common Vulnerabilities and Exposures (CVE) site. This dataset has 47 features with two class labels. Tcpdump traces were collected for a span of 31 hours to generate the UNSW NB-15 dataset. Since the network traffic was generated synthetically to develop this dataset, it failed to capture genuine behaviors of the Internet [30]. The University of Granada (UGR'16) [30] dataset is a more pragmatic attempt made at capturing netflow traces spanning more than four months of network traffic from an Internet service provider (ISP). The creators of this dataset mentioned explicitly that the cyclostationary nature of the network was considered for the development of this dataset. An important advantage of this dataset is that the background traffic was adequately captured from sensors located in the ISP network, which normally harbors heterogeneous profiles of clients [30]. This dataset, comprising 16,900 million unidirectional flows, offers immense scope to perform extensive experimentation [31]. Figure 1 depicts the stacking framework, which comprises the base classifiers, namely, logistic regression (LR), K nearest neighbor (KNN), and random forest (RF), and the metaclassifier, support vector machine (SVM).
The publication of the article Super Learner [32] proclaimed that a combination of individual algorithms leads to optimal predictions. Stacking or stacked generalization is a concept proposed by Wolpert [33]. Different machine learning algorithms exhibit their individual biases on a learning set, and combining them ultimately filters out these biases. The implementation of a stacked ensemble involves two kinds of models: (i) base models (level 0 classifiers) and (ii) metamodels (level 1 or metaclassifier). The core logic of stacking lies in using the metaclassifier to predict the samples by learning from the outputs of the level 0 classifiers. A significant advantage of the stacking classifier was illustrated by Yan and Han [34], who mentioned that stacking can improve the prediction accuracy while considering unbalanced datasets. A study [35] was conducted to emphasize the application of artificial intelligence- (AI-) based classifiers. The authors explained that ensembles possess the ability to adapt to the dynamic behaviors of malicious and normal traffic quite effectively. Tables 1 and 2 enumerate the details of network instances considered for experimentation from the UNSW NB-15 and UGR'16 datasets, respectively.
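The level 0 / level 1 arrangement described above can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions: the three toy base learners (nearest centroid, 1-nearest neighbour, majority label) and the lookup-table metalearner are stand-ins chosen to keep the example self-contained, not the random forest, KNN, logistic regression, and SVM used in the study, and all function names and the data are hypothetical. What the sketch does preserve is the key idea that the metalearner is trained on out-of-fold predictions of the base learners.

```python
import random
from collections import Counter, defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


# --- Toy level 0 learners (assumed stand-ins, not the paper's classifiers) ---
def fit_centroid(X, y):
    """Nearest-centroid learner; returns a function mapping a row to a label."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    centroids = {c: [sum(col) / len(rows) for col in zip(*rows)]
                 for c, rows in groups.items()}
    return lambda row: min(centroids, key=lambda c: sq_dist(row, centroids[c]))


def fit_1nn(X, y):
    """1-nearest-neighbour learner."""
    return lambda row: y[min(range(len(X)), key=lambda i: sq_dist(row, X[i]))]


def fit_majority(X, y):
    """Baseline learner that always predicts the most common training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda row: label


BASE_FITTERS = [fit_centroid, fit_1nn, fit_majority]


# --- Toy level 1 learner (assumed stand-in for the SVM metaclassifier) ---
def fit_meta(Z, y):
    """Learn the majority true label for each pattern of base predictions."""
    votes = defaultdict(Counter)
    for z, label in zip(Z, y):
        votes[tuple(z)][label] += 1
    table = {z: c.most_common(1)[0][0] for z, c in votes.items()}
    # For a pattern never seen in training, fall back to a plain vote.
    return lambda z: table.get(tuple(z), Counter(z).most_common(1)[0][0])


def fit_stacking(X, y, m=3, seed=0):
    """Train base learners per fold, then the metalearner on out-of-fold output."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::m] for k in range(m)]
    Z = [None] * len(X)  # level 1 training set: one row of base predictions per sample
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        models = [fit([X[i] for i in train], [y[i] for i in train])
                  for fit in BASE_FITTERS]
        for i in held_out:
            Z[i] = [model(X[i]) for model in models]
    meta = fit_meta(Z, y)
    final_models = [fit(X, y) for fit in BASE_FITTERS]  # refit on all data
    return lambda row: meta([model(row) for model in final_models])


# Hypothetical two-feature data: label 0 = normal, label 1 = attack.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.2, 0.2], [0.0, 0.3],
     [1.0, 0.9], [0.8, 1.0], [0.9, 0.7], [0.7, 0.8], [1.0, 1.0], [0.8, 0.8]]
y = [0] * 6 + [1] * 6
predict = fit_stacking(X, y)
print(predict([0.1, 0.1]), predict([0.9, 0.9]))  # prints: 0 1
```

With scikit-learn available, the same arrangement could be expressed through `sklearn.ensemble.StackingClassifier`, passing the base classifiers as `estimators` and the metaclassifier as `final_estimator`; whether the study used that class is not stated in the text.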

Preprocessing and Selection of Features.
Preprocessing was applied to handle miscellaneous data found in the dataset. In order to remove noise and to resolve inconsistencies found in the data, a statistical transformation tool is necessary. In the proposed work, missing values and outliers were compensated for by making the distribution normal. However, missing values depend on individual features. While some features may have zero as a missing value, others have zero as part of their value range wherever binary data are considered. In order to avoid such predicaments, it is necessary to consider relevant features that promise optimal predictions. Hence, a combination of information gain (IG) and hashing was used to extract the most desirable features. Feature scaling was applied to ensure that features with a greater numeric range do not dominate those with smaller numeric ranges. UNSW NB-15 has many features, but not all seem to be significant. The essential features were assigned weights in order to prioritise them, and only the best features were extracted. The dimensionality of the features was reduced using a hashing technique. It is worthwhile to mention that only eleven features were selected from the UNSW NB-15 dataset, namely, sbytes, sttl, sload, tcprtt, smean, ct_srv_src, ct_state_ttl, ct_src_dport_ltm, ct_dst_src_ltm, ct_srv_dst, and service. Similarly, the following five features were considered from the UGR'16 dataset: source_ip, destination_port, forwarding status, packets exchanged in the flow, and number of bytes. For a detailed explanation of the abovementioned features and different attack types, [28][29][30] can be consulted.
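To make the two quantitative steps above concrete, the following minimal sketch computes information gain for a discrete feature and rescales a numeric column. Several things here are assumptions: the text does not say which scaling variant was applied, so min-max scaling is used purely as an example; the feature values and labels are invented; and the hashing step is not shown because its details are not given.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting the labels on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def min_max_scale(column):
    """Rescale a numeric column to [0, 1]; a constant column maps to zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical toy records (not taken from UNSW NB-15 or UGR'16).
labels = ["attack", "attack", "normal", "normal"]
service = ["dns", "dns", "http", "http"]  # separates the classes perfectly
noise = ["a", "b", "a", "b"]              # carries no class information

ranking = sorted(
    {"service": information_gain(service, labels),
     "noise": information_gain(noise, labels)}.items(),
    key=lambda item: item[1], reverse=True)
print(ranking)                             # features ordered by information gain
print(min_max_scale([200, 1000, 600]))     # byte counts mapped into [0, 1]
```

Ranking features by a score like this and keeping only the top few is one plausible reading of "assigned weights in order to prioritise them"; continuous features would first need to be discretised for this particular formulation.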

Classification.
The critical hyperparameters used for tuning and optimizing the performance of the classifiers are enumerated in Table 3. The strategy to implement the classification framework involved the application of multiple classifiers to resolve the underlying intricacies of data found in both packet-based and flow-based datasets.
Basically, KNN relies on a distance function that computes the similarity or difference between two network instances found in the datasets under consideration. The Euclidean distance $d(x, y)$ can be calculated by using the following equation:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},$$

where $x_i$ refers to the $i$th feature of the instance $x$, whereas $y_i$ refers to the $i$th feature of the instance $y$, and $n$ refers to the total number of features found in the dataset. Let there be $p$ class labels in the dataset.
Let $x$ be the new sample to be predicted. The objective of the KNN classifier is to determine the $k$ vectors that are closest to $x$.
If the majority of these vectors belong to class $C_m$, then $x$ will be assigned the class label $C_m$. Radial basis function (RBF) is a preferred kernel function for many classification problems in machine learning. The following equation defines the RBF:

$$K(x, y) = \exp\left(-\gamma \lVert x - y \rVert^2\right),$$

where $\lVert x - y \rVert^2$ denotes the squared Euclidean distance between two data points $x$ and $y$. An SVM with the RBF kernel has two significant hyperparameters, namely, gamma and c. Gamma governs the extent of the decision region. c denotes the penalty for wrongly classifying data points. Whenever c is large, SVM will be penalized heavily for misclassifications. The value of c is maintained as 1.0, which indicates that SVM is fairly tolerant of misclassifications, and this eventually leads to less variance. A higher value when assigned to c can lead to overfitting (Algorithm 1).
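As a small worked illustration of the two formulas in this subsection, the sketch below assumes their usual forms: the Euclidean distance as the square root of the summed squared feature differences, and the RBF kernel as the exponential of minus gamma times the squared distance. The function names, the value of `k`, and the toy instances are assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two instances with the same n features."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train_X, train_y, x, k=3):
    """Assign x the majority class among its k nearest training instances."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: euclidean(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel value; a larger gamma makes similarity decay faster."""
    return math.exp(-gamma * sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Hypothetical two-feature instances (already scaled).
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5]]
train_y = ["normal", "normal", "normal", "attack", "attack"]

print(euclidean([0, 0], [3, 4]))                   # 5.0
print(knn_predict(train_X, train_y, [0.5, 0.5]))   # normal
print(knn_predict(train_X, train_y, [5.5, 5.0]))   # attack
print(rbf_kernel([1, 2], [1, 2]))                  # 1.0 for identical points
```

The kernel value lies in (0, 1] and shrinks as two instances move apart, which is why gamma controls how local the resulting decision region is.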

Results and Discussion
$$\text{false positive rate} = \frac{FP}{FP + TN}, \qquad \text{false negative rate} = \frac{FN}{FN + TP}.$$
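The two rates can be computed directly from binary predictions, as in the minimal sketch below. It assumes that "attack" is treated as the positive class; the label lists are invented for illustration and are not results from the study.

```python
def confusion_counts(y_true, y_pred, positive="attack"):
    """Return (TP, FP, TN, FN) for a binary labelling."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def false_positive_rate(fp, tn):
    """Share of normal traffic wrongly flagged as an attack: FP / (FP + TN)."""
    return fp / (fp + tn) if (fp + tn) else 0.0


def false_negative_rate(fn, tp):
    """Share of attack traffic wrongly passed as normal: FN / (FN + TP)."""
    return fn / (fn + tp) if (fn + tp) else 0.0


# Hypothetical labels for eight network instances.
y_true = ["attack"] * 4 + ["normal"] * 4
y_pred = ["attack", "attack", "attack", "normal",
          "normal", "normal", "normal", "attack"]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(false_positive_rate(fp, tn), false_negative_rate(fn, tp))  # 0.25 0.25
```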
In order to precisely estimate the efficacy of the proposed approach and corroborate the results obtained from stacked ensemble, both binary and multiclass classification results are presented in this section. Table 4 depicts the results obtained upon classifying the network instances of the UNSW NB-15 dataset into either attack or normal.
In order to verify the predictions and also to affirm that the models do not overfit, mean training accuracy (MTA), mean training precision (MTP), and mean training recall (MTR) values are also mentioned in Table 5. Table 6 represents actual versus predicted classifications corresponding to each class, namely, normal (N), reconnaissance (R), backdoor (B), denial of service (D), exploits (E), analysis (A), fuzzers (F), worms (W), shellcode (S), and generic (G). The highest detection rate (recall) of 98.32% is obtained for the generic attack type, whereas the lowest detection rate, 10.79%, is reported for the backdoor attack type. However, it is still a challenge to improve the detection rate of attack types like analysis, denial of service (DOS), worms, and backdoor. Precision refers to the proportion of relevant results among those presented by the model. The netflow traces found in UGR'16 include real background traffic for a substantial duration of four months. The primary reason for considering this dataset to develop the intrusion detection model is the presence of controlled attack traffic that influences the cyclostationary evolution of traffic. Thus, the validation of the proposed approach will be more genuine and meaningful using this realistic dataset. 1,048,576 netflow traces of each attack type were considered to assess the performance of the stacking approach. Figure 2 depicts the performance of the stacking ensemble on the UGR'16 dataset through the accuracy, precision, and recall scores pertaining to different attack types. As per the confusion matrix illustrated in Table 7, it is evident that all the seven attack types found in the UGR'16 dataset were differentiated quite aptly by the stacking classifier. The highest attack detection rate was reported for the blacklist attack type.
It can be noted that this kind of attack detection ability when exhibited by intrusion detection models can prove to be beneficial for counteracting emerging attacks like DDOS, DOS, and scan attacks. Although network instances belonging to the aforesaid attack types are found in conventional datasets like KDD cup 99 and NSL-KDD, such attack traces are definitely obsolete because newer attacks have emerged in recent years despite similar nomenclature.

ALGORITHM 1: Strategy for implementing the stacking ensemble.
Output: Predictions from the ensemble E
Step 1. Impose cross validation in order to prepare a training set for the metaclassifier
Step 2. Randomly split T into "m" equal-size subsets, i.e., T = {T_1, T_2, T_3, ..., T_m}
Step 3. for m ⟵ 1 to M
    Learn base classifiers, namely, random forest, KNN, and logistic regression
    for n ⟵ 1 to N
        Learn a classifier P_mn from T or T_m
    End for
    Step 4. Formulate a training set for the metaclassifier (SVM)
    for each X_i ∈ T_m
        Extract a new instance (…)
    End for
End for
Step 5. Return y_i = {y_1, y_2, y_3, ..., y_n} from ensemble

Table 8 highlights the classwise performance of the seven attack types found in the UGR'16 dataset. The proposed ensemble model could detect the occurrence of the blacklist attack type in the most efficient manner. In order to present reliable results, performance metrics like precision and recall were also considered in addition to accuracy. Recall can be defined as the capability of the intrusion detection model to determine the positive cases correctly, whereas precision refers to the ability of the model to determine the percentage of positive predictions that were correct.
Normally, there is a trade-off that occurs between recall and precision. Since the F1 score takes into account both precision and recall, it is often used as a performance metric to assess the efficacy of intrusion detection systems. As observable from Table 8, the false alarm rate is considerably low with respect to all the attack categories, which indicates that the overall performance of the ensemble model is good. Both false positives and false negatives hamper the performance of network intrusion detection systems. If legitimate traffic is reported as an intrusion, then security analysts may unnecessarily invest their time and resources trying to comprehend a traffic scenario that is absolutely normal. Greater damage is caused when malicious network traffic is identified as normal because security experts may then overlook some really detrimental traffic scenarios. An intrusion detection system should therefore not generate too many false alarms, and the low false alarm rates reported in Table 8 suggest that the ensemble model meets this requirement in the current study.
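Classwise precision, recall, F1 score, and false alarm rate of the kind discussed above can all be derived from a multiclass confusion matrix by treating each class in turn as the positive class. The sketch below shows one way to do this; the three class names and the counts are invented for illustration and are not the values of Tables 6 to 8.

```python
def per_class_metrics(matrix, labels):
    """matrix[i][j] = number of instances of true class i predicted as class j."""
    total = sum(sum(row) for row in matrix)
    results = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                  # rest of the true-class row
        fp = sum(row[k] for row in matrix) - tp   # rest of the predicted column
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        false_alarm = fp / (fp + tn) if (fp + tn) else 0.0
        results[label] = {"precision": precision, "recall": recall,
                          "f1": f1, "false_alarm_rate": false_alarm}
    return results


# Hypothetical confusion matrix: rows are actual classes, columns are predictions.
labels = ["normal", "dos", "scan"]
matrix = [[8, 1, 1],
          [2, 8, 0],
          [0, 0, 10]]

metrics = per_class_metrics(matrix, labels)
for label in labels:
    print(label, {name: round(value, 3) for name, value in metrics[label].items()})
```

Because the F1 score is the harmonic mean of precision and recall, it stays low whenever either of the two is low, which is what makes it a useful single summary under the trade-off described above.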
Typically, a receiver operating characteristic (ROC) curve is a pictorial representation of sensitivity vs. 1 − specificity across all threshold values. Here, the term sensitivity represents the true positive rate (which is the same as the recall measurement). It is also written as P(Pred = positive | True = positive). Likewise, the term specificity represents P(Pred = negative | True = negative), i.e., the ratio of true negatives predicted as negatives. ROC curves are used to visualize the relationship between the detection rate and the false positive rate of a classifier. With respect to the UGR'16 dataset, different attack types have a true positive rate around 0.99, and the false positive rate ranges between 0.05 and 0.23. Hence, an average value has been obtained for plotting the ROC curve as shown in Figure 3.
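The construction of an ROC curve can be sketched as a sweep of the decision threshold over the classifier's scores, recording one (false positive rate, true positive rate) pair per threshold. The example below is only illustrative: the scores and labels are invented, and it assumes a binary setting in which label 1 marks an attack and both classes are present.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by sweeping the threshold over all scores.

    Assumes y_true contains both classes (1 = positive, 0 = negative).
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (FPR, TPR) points sorted by FPR."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical attack scores for six flows (higher = more suspicious).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
print(points)
print(round(area_under_curve(points), 3))
```

Each point corresponds to one operating threshold, so a curve that rises steeply towards the top-left corner indicates a high detection rate at a low false positive rate.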
Network intrusion detection presents numerous challenges to researchers like recurring cyber attacks, lack of publicly available datasets, and problems associated with benchmark datasets to name a few. Another important parameter for considering an intrusion detection dataset is definitely the kind of network traffic environment used to generate it. Normally, intrusion detection datasets are formulated in either real or emulated network traffic scenarios.
This work has considered two datasets for experimentation (UNSW NB-15 and UGR'16) that are modern in their approach and proposed an ensemble model using supervised machine learning algorithms. Although the nomenclature of attack types found in many intrusion detection datasets is similar, the network traffic environment used to capture the attack traces plays a vital role in deciding whether the intrusion detection framework can be closely modelled to the real world or not. For example, denial of service attack traces are found in KDD cup 99, UNSW NB-15, and UGR'16, but it cannot be generalized that all these attack signatures are similar because they were captured in emulated as well as real network traffic scenarios, respectively, with substantial differences pertaining to attack tools, traffic generators, and test beds [28][29][30][31].
Likewise, the credibility of any approach proposed for network intrusion detection can be ascertained by its potential to differentiate between modern attacks (traces of modern attack types are found in UNSW NB-15 and UGR'16). It can be noted that 20 features were used in [27] to achieve the results using the UNSW NB-15 dataset as compared to the proposed approach wherein only 11 features were used to accomplish a superior accuracy and a reasonably lower false alarm rate.
Moreover, as noted in [27], three decision tree classifiers were used to perform classification, but the proposed study employed a diverse set of classifiers to achieve the desired objective quite efficiently. Given the presence of massive network traffic in the real world, it is prudent to consider large number of instances for experimentation as in the case of the UGR'16 dataset. With the advent of Internet of things (IOT), network traffic will only become more and more complex in the coming years [36,37].
A negligible false alarm rate has been reported with respect to the UGR'16 dataset, and the lowest reported false alarm rate is 0.54%, pertaining to the blacklist attack type. Binary classification results tend to focus only on either normal or attack classification whereby the problem of misclassification between various attack types tends to dissipate, often resulting in a higher accuracy. Hence, multiclass classification becomes indispensable. However, it is still a challenge to improve the attack detection rates of some attack types. Such problems are common while experimenting with multiclass datasets that normally comprise unbalanced samples. As explained clearly in [38], an ensemble of classifiers can be considered as a feasible solution for the class imbalance problem. UGR'16 is relatively new and there are no studies pertaining to the implementation of supervised learning algorithms on this dataset. As elaborated in [30], cyclostationary characteristics of the network are well captured by this dataset.
Network traffic, in all likelihood, is cyclostationary because fluctuations can be observed that strongly depend on the time of the day and year. As discussed in [30], network traffic exhibits temporal behaviour. In essence, when cyclostationary characteristics are captured by a dataset, it is possible to comprehend the dynamics of network traffic and analyse periodic behaviour. In the real world, there is a need to design network intrusion detection systems that take into account cyclostationary features. Apart from the UGR'16 dataset, there is no publicly available dataset at present where cyclostationarity has been captured. Therefore, in order to validate the effectiveness of the proposed approach in a better manner, a dataset that depicts the characteristics of real traffic is also included in the study. The adoption of network flows in the field of network intrusion detection is extremely important; it is an element missing from most of the traditional datasets used for evaluating the performance of intrusion detection systems. An important advantage of the UGR'16 dataset, unlike the UNSW NB-15 dataset, is the presence of unidirectional flows instead of packets, which can be used for decisive anomaly detection. As described in [39], network flows present an aggregated view of the network. Therefore, the time spent to analyse such flows is considerably less. Predominantly, the usage of any flow-based intrusion dataset is advantageous over other datasets because they can be used in a novel manner to detect intrusions in high-speed networks [40].

Conclusion
This work has proposed an ensemble approach using the concept of stacking for effective network intrusion detection. Two heterogeneous datasets, UNSW NB-15 (emulated) and UGR'16 (real-time), were used for experimentation. A combination of algorithms, namely, random forest, logistic regression, K nearest neighbor, and support vector machine, resulted in superior predictions with respect to a real-time dataset than an emulated one. The implementation strategy can be further extended to conduct experimentation on different datasets that include recent attack categories. Sophisticated computing engines like Apache Spark can be used in future to increase the processing speed and facilitate scalability for large volumes of network data. From the series of experiments conducted during the course of this research work, it can be inferred that the proposed approach serves as a competitive perspective for real-time network intrusion detection. Traffic periodicity and long-term evolution of network traffic cannot be analysed using only conventional packet-based intrusion detection datasets. Thus, heterogeneous datasets when applied in the field of network intrusion detection prove to be quite instrumental for gaining better insights into building secure applications.
Security and Communication Networks