The explosive growth of network traffic and the diversity of its types on the Internet have brought new and severe challenges to DDoS attack detection. To obtain a higher True Negative Rate (TNR), accuracy, and precision, and to guarantee the robustness, stability, and universality of the detection system, we propose a DDoS attack detection method based on hybrid heterogeneous multiclassifier ensemble learning and design a heuristic detection algorithm based on Singular Value Decomposition (SVD) to construct our detection system. Experimental results show that our detection method performs well in TNR, accuracy, and precision. Therefore, our algorithm has good detection performance for DDoS attacks. Through the comparisons with Random Forest,
The explosive growth of network traffic and the diversity of its types on the Internet have brought new and severe challenges to the detection of network attack behavior. Some traditional detection methods and techniques cannot meet the need for efficient and exact detection given the diversity and complexity of attack traffic in high-speed network environments, especially for DDoS attacks.
A Distributed Denial of Service (DDoS) attack is launched by remote-controlled zombies. It is implemented by coercing a hijacked computer or by consuming its resources, such as CPU cycles, memory, and network bandwidth. Moreover, Palmieri et al. [
In 2014, Luo et al. [
However, the existing detection methods still suffer from low True Negative Rate (TNR), accuracy, and precision. Moreover, their methods or models are homogeneous, so robustness, stability, and universality are difficult to guarantee. To address these problems, we propose a DDoS attack detection method based on hybrid heterogeneous multiclassifier ensemble learning.
Ensemble learning accomplishes a learning task by constructing and combining multiple individual classifiers. An ensemble of individual classifiers of the same type is homogeneous, and such an individual classifier is known as a “base classifier” or “weak classifier.” An ensemble can also contain different types of individual classifiers, in which case it is heterogeneous. In a heterogeneous ensemble, the individual classifiers are generated by different learning algorithms and are called “component classifiers.” Research on homogeneous base classifiers relies on a key hypothesis that the errors of the base classifiers are independent of each other. However, in actual attack traffic detection, this independence is apparently impossible. In addition, the accuracy and the diversity of individual classifiers conflict by nature: when the accuracy is very high, increasing the diversity becomes extremely difficult. Therefore, to achieve robust generalization ability, the individual classifiers ought to be both accurate and diverse.
An overwhelming majority of classifier ensembles are currently constructed from homogeneous base classifiers, and this approach has been shown to obtain relatively good classification performance. However, according to some theoretical analyses, the smaller the error correlation between every two individual classifiers, the smaller the error of the ensemble system. At the same time, when the error of a classifier increases, it has a negative effect on the ensemble. Therefore, the ensemble learning model for homogeneous individual classifiers cannot satisfy the needs of the higher ensemble performance [
According to the different measured standard in [
This paper makes the following contributions. (i) To the best of our knowledge, this is the first attempt to apply the heterogeneous multiclassifier ensemble model to DDoS attack detection, and we provide the system model and its formulation. (ii) We design a heuristic detection algorithm based on Singular Value Decomposition (SVD) to construct the heterogeneous DDoS detection system, and we conduct thorough numerical comparisons between our method and several well-known machine learning algorithms, both with SVD and without SVD (un-SVD).
The rest of this paper is organized as follows. Section
The classification learning model based on Rotation Forest and SVD aims at building accurate and diverse individual component classifiers. Here, the model is used for DDoS attack detection. Rodríguez et al. [
There are two key points to construct heterogeneous multiclassifier ensemble learning model [
(i) Firstly, classifiers in an ensemble should be different from each other; otherwise, there is no gain in combining them. These differences cover diversity, orthogonality, and complementarity [
(ii) Majority voting is chosen as the combination strategy for all component classifiers. For every record in the testing data set, the label that receives more than half of the votes is taken as the final prediction. The majority voting is given by
As shown in Figure
Hybrid heterogeneous multiclassifier ensemble classification model.
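The majority voting strategy described in point (ii) above can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names `majority_vote` and `ensemble_predict` and the `default` fallback for the case where no label exceeds half of the votes are hypothetical.

```python
from collections import Counter


def majority_vote(labels, default=None):
    """Return the label that receives more than half of the votes.

    `default` (an assumption of this sketch) is returned when no label
    reaches a strict majority, for example on a tie.
    """
    if not labels:
        return default
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes > len(labels) / 2 else default


def ensemble_predict(per_classifier_predictions, default=None):
    """Combine predictions record by record.

    `per_classifier_predictions` holds one list of predicted labels per
    component classifier; all lists cover the same testing records.
    """
    return [
        majority_vote(list(votes), default)
        for votes in zip(*per_classifier_predictions)
    ]
```

With an odd number of component classifiers and two classes (normal and attack), a strict majority always exists, so the fallback is never used.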
We assume that the
Suppose that
Then, the computational formulae of the singular value and the eigenvectors are given by
The new training data subset
So far, the singular values and the corresponding eigenvectors have been obtained by SVD for every subset. Next, the rotation matrix is constructed. It is denoted by
The primitive testing data set
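As a rough illustration of the SVD-based rotation idea discussed in this section, the following Python sketch assembles a rotation matrix from per-subset SVDs in the general style of Rotation Forest. Several choices are assumptions made for illustration and are not taken from the paper: the function name `build_rotation_matrix`, the random disjoint split of features, and the use of every training record for each subset rather than a resampled subset.

```python
import numpy as np


def build_rotation_matrix(X_train, n_subsets, seed=0):
    """Assemble a rotation matrix from the SVD of disjoint feature subsets."""
    rng = np.random.default_rng(seed)
    n_features = X_train.shape[1]
    # Randomly split the feature indices into disjoint subsets (assumption).
    subsets = np.array_split(rng.permutation(n_features), n_subsets)

    rotation = np.zeros((n_features, n_features))
    for idx in subsets:
        # The rows of vt are the right singular vectors of the sub-matrix,
        # i.e. the eigenvectors of X_sub^T X_sub.
        _, _, vt = np.linalg.svd(X_train[:, idx], full_matrices=True)
        # Place the eigenvectors at the positions of the original features.
        rotation[np.ix_(idx, idx)] = vt.T
    return rotation
```

The same matrix would then be applied to both data sets, for example `X_train @ R` and `X_test @ R`, so that every component classifier is trained and evaluated in the same rotated feature space.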
One of the keys for good performance of ensembles is the diversity. There are several ways to inject diversity into an ensemble; the most common is the use of sampling [
In addition, a statistical test should be employed to eliminate the bias in the comparison of the tested algorithms. In our model, we use the statistical normalization [
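A minimal sketch of statistical normalization follows, assuming it means the usual z-score scaling in which each feature is centered on its mean and divided by its standard deviation. The function names `fit_zscore` and `apply_zscore`, the choice of the population standard deviation, and the mapping of constant features to 0.0 are assumptions of this sketch.

```python
from statistics import fmean, pstdev


def fit_zscore(rows):
    """Compute the per-feature mean and standard deviation of the rows."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means, stds


def apply_zscore(rows, means, stds):
    """Scale rows with the given statistics; constant features map to 0.0."""
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]
```

In this sketch the statistics are fitted on the training records only and then reused for the testing records, which avoids leaking information from the testing data into the model.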
In this section, we first describe DDoS attack detection process in heterogeneous multiclassifier ensemble model, and then the detection algorithm based on SVD is presented.
Firstly, all primitive training data set and all testing data set are split into
The classification detection algorithm is shown as follows.
The final label of a testing data record, label =
In our algorithm, we use the parallelization principle to select the component classifiers. Under this principle, the component classifiers have no strong dependencies on each other. In addition, we select Bagging, Random Forest, and
In this section, we discuss how to apply the heterogeneous classification ensemble model to detecting DDoS attack traffic. We first present the data set and the data preprocessing method used in our experiments. Then, we give the experimental results, analyze them, and compare them with the homogeneous models based on three selected algorithms, with SVD and without SVD (un-SVD). The computing environment used to run our experiments is listed in Table
Computer experimental condition.
CPU | Memory | Hard disk | OS | MATLAB |
---|---|---|---|---|
Intel® Xeon® CPU E5-2640 v2 @2.00 GHz | 32 GB | 2 TB | Windows Server 2008 R2 Enterprise | R2013a |
In this paper, we use the well-known Knowledge Discovery and Data Mining (KDD) Cup 1999 dataset [
All 41 features in the four types.
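Type | Number | Feature |
---|---|---|
Basic features | 1 | duration |
Basic features | 2 | protocol_type |
Basic features | 3 | service |
Basic features | 4 | flag |
Basic features | 5 | src_bytes |
Basic features | 6 | dst_bytes |
Basic features | 7 | land |
Basic features | 8 | wrong_fragment |
Basic features | 9 | urgent |
Content features | 10 | hot |
Content features | 11 | num_failed_logins |
Content features | 12 | logged_in |
Content features | 13 | num_compromised |
Content features | 14 | root_shell |
Content features | 15 | su_attempted |
Content features | 16 | num_root |
Content features | 17 | num_file_creations |
Content features | 18 | num_shells |
Content features | 19 | num_access_files |
Content features | 20 | num_outbound_cmds |
Content features | 21 | is_hot_login |
Content features | 22 | is_guest_login |
Time-based traffic features | 23 | count |
Time-based traffic features | 24 | srv_count |
Time-based traffic features | 25 | serror_rate |
Time-based traffic features | 26 | srv_serror_rate |
Time-based traffic features | 27 | rerror_rate |
Time-based traffic features | 28 | srv_rerror_rate |
Time-based traffic features | 29 | same_srv_rate |
Time-based traffic features | 30 | diff_srv_rate |
Time-based traffic features | 31 | srv_diff_host_rate |
Host-based traffic features | 32 | dst_host_count |
Host-based traffic features | 33 | dst_host_srv_count |
Host-based traffic features | 34 | dst_host_same_srv_rate |
Host-based traffic features | 35 | dst_host_diff_srv_rate |
Host-based traffic features | 36 | dst_host_same_src_port_rate |
Host-based traffic features | 37 | dst_host_srv_diff_host_rate |
Host-based traffic features | 38 | dst_host_serror_rate |
Host-based traffic features | 39 | dst_host_srv_serror_rate |
Host-based traffic features | 40 | dst_host_rerror_rate |
Host-based traffic features | 41 | dst_host_srv_rerror_rate |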
In addition, the KDD CUP 1999 data set covers four main categories of attack: DoS, R2L, U2R, and Probing. Because the “neptune” and “smurf” traffic records account for more than 99% and 96% of the DoS records in the abovementioned training subset and testing subset, respectively, we choose these two DoS types to evaluate our algorithm and to compare it with three well-known existing machine learning algorithms.
Firstly, because the 10 percent training subset and the “corrected” testing subset of the KDD CUP 1999 data set include hundreds of thousands of network records, the hardware configuration of our server cannot handle the computation needed to process these data sets. Here, we use the built-in “random
Data sets used in our experiments.
Category | Training data set | Testing data set |
---|---|---|
Normal | 9728 | 6059 |
DoS | 38800 | 22209 |
Secondly, each network traffic record consists of 41 features plus 1 class label [ TCP, UDP, and ICMP in the “protocol_type” feature are marked as 1, 2, and 3, respectively. The 70 kinds of “service” for the destination host are sorted by their percentage in the 10 percent training subset. The top three types are ecr_i, private, and http, which together account for over 90%. The ecr_i, private, http, and all other types are marked as 1, 2, 3, and 0, respectively. In the “flag” feature, “SF” is marked as 1, and the other ten false connection statuses are marked as 0.
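The symbolic-to-numeric mapping described above can be written as a small Python sketch. The numeric codes come from the text; the function name `encode_symbolic`, the dictionary names, and the lowercase spelling of the protocol values are assumptions of this sketch.

```python
# Codes taken from the text: TCP, UDP, ICMP -> 1, 2, 3.
PROTOCOL_CODES = {"tcp": 1, "udp": 2, "icmp": 3}

# Top three services -> 1, 2, 3; every other service -> 0.
SERVICE_CODES = {"ecr_i": 1, "private": 2, "http": 3}


def encode_symbolic(protocol_type, service, flag):
    """Map the three symbolic features of one record to integer codes."""
    protocol_code = PROTOCOL_CODES[protocol_type.lower()]
    service_code = SERVICE_CODES.get(service, 0)
    # "SF" is marked as 1; all other connection statuses are marked as 0.
    flag_code = 1 if flag == "SF" else 0
    return protocol_code, service_code, flag_code
```

For example, a record with protocol `icmp`, service `ecr_i`, and flag `SF` is encoded as `(3, 1, 1)`.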
Cross-validation is an effective statistical technique to ensure the robustness of a model. In this paper, to improve the reliability of the experimental data and to verify the availability of our model, a fivefold cross-validation approach is used in our experiments.
The training data set is randomly split into five parts. In turn, one part serves as the actual training data set, while the other four parts together with the testing data set form the final testing data set. The aforementioned statistical normalization method in Section
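The splitting scheme described above differs from the more common form of cross-validation, since only one part is used for training in each round. A minimal Python sketch follows; the function name `fivefold_rounds`, the fixed random seed, and the way the parts are formed by slicing a shuffled list are assumptions made for illustration.

```python
import random


def fivefold_rounds(train_records, test_records, n_folds=5, seed=0):
    """Yield (actual_train, final_test) pairs, one pair per round."""
    shuffled = list(train_records)
    random.Random(seed).shuffle(shuffled)
    # Split the shuffled training records into n_folds roughly equal parts.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]

    for k in range(n_folds):
        # One part is the actual training data set for this round.
        actual_train = folds[k]
        # The other parts plus the testing data set form the final testing set.
        final_test = [r for j, fold in enumerate(folds) if j != k for r in fold]
        final_test += list(test_records)
        yield actual_train, final_test
```

Each record of the training data set is used for training in exactly one round, and the reported indexes can then be averaged over the five rounds.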
Deciding whether network traffic is normal or an attack is a binary classification problem, so we need evaluation indexes to assess the result. In this paper, we use three typical indexes to measure our detection model: TNR, accuracy, and precision. Here, TNR denotes the proportion of normal samples that are correctly recognized as normal samples in the testing data set. It reflects how reliably the detection model discerns normal samples. Accuracy denotes the ratio of the number of correctly classified samples to the total number of samples in the testing data set. It reflects the ability to differentiate normal samples from attack samples. Precision denotes the proportion of true attack samples among all samples recognized as attacks by the detection model in the testing data set. TNR, accuracy, and precision are formulated as follows:

$$\mathrm{TNR} = \frac{TN}{TN + FP}, \qquad \mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{precision} = \frac{TP}{TP + FP}.$$

The performance of a classification detection model is evaluated by the counts of records for the normal samples and the attack samples. The matrix of these counts is called the confusion matrix [ TP (True Positive) is the number of attacks correctly classified as attacks; FP (False Positive) is the number of normal records incorrectly classified as attacks; TN (True Negative) is the number of normal records correctly classified as normal records; FN (False Negative) is the number of attacks incorrectly classified as normal records.
Confusion matrix.
 | Predicted DoS | Predicted normal | Total |
---|---|---|---|
Original DoS | TP | FN | P |
Original normal | FP | TN | N |
Total | P′ | N′ | P + N (P′ + N′) |
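The three indexes follow directly from the four cells of the confusion matrix: TNR = TN / (TN + FP), accuracy = (TP + TN) / (TP + TN + FP + FN), and precision = TP / (TP + FP). The Python sketch below computes them from lists of true and predicted labels. The function names, the encoding of attacks as label 1, and the choice to return 0.0 when a denominator is zero are assumptions of this sketch.

```python
def confusion_counts(y_true, y_pred, attack_label=1):
    """Count TP, FP, TN, FN, treating `attack_label` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == attack_label:
            if pred == attack_label:
                tp += 1  # attack correctly classified as attack
            else:
                fn += 1  # attack incorrectly classified as normal
        else:
            if pred == attack_label:
                fp += 1  # normal record incorrectly classified as attack
            else:
                tn += 1  # normal record correctly classified as normal
    return tp, fp, tn, fn


def detection_metrics(tp, fp, tn, fn):
    """Return (TNR, accuracy, precision); 0.0 when a denominator is zero."""
    total = tp + fp + tn + fn
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return tnr, accuracy, precision
```

A usage example would be `detection_metrics(*confusion_counts(y_true, y_pred))`, where both lists cover the records of the final testing data set.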
In this section, our heterogeneous detection model is compared with Random Forest,
Drawing on empirical threshold values and on many experiments, we finally select eight threshold values to evaluate the performance of our model. Experimental results demonstrate that our model achieves excellent TNR, accuracy, and precision, and that it is more stable than the three compared algorithms on all three indexes.
In Figure
[Figure: TNR for comparing our model with the other algorithms. Panels: TNR comparison based on SVD; TNR comparison based on un-SVD.]
In Figure
In Figure
[Figure: Accuracy for comparing our model with the other algorithms. Panels: Accuracy comparison based on SVD; Accuracy comparison based on un-SVD.]
In Figure
In Figure
[Figure: Precision for comparing our model with the other algorithms. Panels: Precision comparison based on SVD; Precision comparison based on un-SVD.]
In Figure
Efficient and exact DDoS attack detection is a key problem given the diversity and complexity of attack traffic in the high-speed Internet environment. In this paper, we study the problem from the perspective of hybrid heterogeneous multiclassifier ensemble learning. Furthermore, to obtain stronger generalization and more sufficient complementarity, we propose a heterogeneous detection system model, and we construct the component classifiers of the model based on Bagging, Random Forest, and
The authors declare that they have no competing interests.
The work in this paper is supported by the Joint Funds of National Natural Science Foundation of China and Xinjiang (Project U1603261).