Intrusion Detection System Based on Decision Tree over Big Data in Fog Environment

1College of Engineering, Huaqiao University, Quanzhou, Fujian 362021, China 2Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, BC, Canada V6T 1Z4 3Fujian Provincial Academic Engineering Research Centre in Industrial Intellectual Techniques and Systems, Quanzhou, Fujian 362021, China 4State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China


Introduction
Fog computing [1,2] is defined as a highly virtualized computing platform that migrates tasks from the cloud computing center to network edge devices. Fog computing provides computing, storage, and networking services between mobile users and the traditional Cloud platform, and is complementary to the Cloud. It introduces a middle layer between the cloud and the mobile users, extending the cloud-based network architecture [3][4][5][6]. A basic fog framework is shown in Figure 1: each mobile user is connected to one of the fog nodes, while fog nodes can be interconnected with each other and are linked to the Cloud [7]. Fog computing reduces unnecessary repeated communication between the cloud computing center and the mobile users. For instance, when the number of users increases dramatically, these users can obtain the service from the contents cached in the fog servers, which reduces network delay [8]. It also significantly reduces the bandwidth load on the backbone link [9,10]. Unfortunately, the nodes in a fog environment are close to the mobile users, and fog computing nodes are usually devices with weak computing ability. Traditional network attacks are widespread in the fog environment, so fog devices face network security challenges. Intrusion Detection Systems (IDS), however, can be used in the fog environment [11].
An IDS is designed to ensure network security; its main task is to detect malicious activities on the host or network and then respond in a timely manner [12]. The definition of intrusion detection was first formally described in the 1980s [13]. In addition, the concept of real-time anomaly detection was proposed by Denning [14]. Pattern matching is one of the core technologies of IDS. Misuse detection based on AC, BM, MWM, and other matching algorithms [15] gives an IDS passive detection of known attacks. However, modern attacks increasingly combine a variety of known intrusion techniques into an unknown one. Improved IDS methods therefore usually take proactive protection based on deviation detection and user behavior anomaly detection. For instance, statistical models, Bayesian reasoning, and cluster analysis [16] can make up for the shortcomings of pattern matching, so that the system has some ability to detect unknown attacks. The KNN algorithm [17] is widely used in pattern recognition, classification, and regression. Like KNN, vector automatic classification algorithms, support vector machines [18][19][20], neural network algorithms [21], Bayesian algorithms [22][23][24], and the K-means algorithm are also widely used for IDS [25,26].
Although IDS for traditional networks has been well investigated, directly using the existing methods in a fog computing environment may be inappropriate. Fog nodes produce massive amounts of data at all times, and thus enabling an IDS system over big data in the fog environment is of paramount importance. More specifically, the existing research mainly presents experiments on the 10% KDDCUP99 dataset [27]. Although these methods have achieved good results, we cannot judge their efficiency in a big data environment, or even on the full dataset. In addition, network attacks in KDDCUP99 are classified at two levels: four kinds of attacks and twenty-two kinds of attacks. However, the existing research mainly focuses on the detection precision of the four kinds of attacks and does not consider the detection of the twenty-two kinds.
In order to address the above issues, we propose an IDS system based on decision tree over Anaconda [28]. Firstly, we propose a preprocessing algorithm that digitizes the strings in the given dataset and then normalizes the whole data, to ensure the quality of the input data and thereby improve the efficiency of detection. Secondly, we use the decision tree method for the detection of network attacks in our proposed IDS system, and we compare this method with the Naïve Bayesian method as well as the KNN method. More specifically, three modes of the Naïve Bayesian method are compared. The experiment results show that our proposed IDS system is precise.
Our contributions in this study can be summarized as follows.
(1) Both the 10% dataset and the full dataset are tested in our IDS system, which shows that our IDS system is effective in a big data environment.
(2) We not only complete the detection of four kinds of attacks but also implement the detection of twenty-two kinds of attacks. The results show that our IDS system has a higher detection coverage of network attacks.
(3) The calculation time of each method is compared. Although the calculation time of the decision tree is not the best, it is acceptable given the detection accuracy achieved, and the method can be used in a big data environment.
The rest of the paper is organized as follows. In Section 2, the preliminaries are introduced. Section 3 specifies our proposed IDS system. The experimental evaluation is described in Section 4. Section 5 presents the related work. Finally, we conclude our work and describe the future work in Section 6.

Preliminaries
In this section, we firstly introduce the problem model and relevant formulas in Section 2.1 and then introduce the evaluation indicators of IDS detection in Section 2.2.

Problem Model and Relevant Formulas.
The objective of a decision tree is to construct a model from a given dataset so that it classifies new instances correctly. There are many methods to construct the decision tree, such as ID3, C4.5 [29], and CART (Classification and Regression Trees) [30,31]. In this study, we use CART over Anaconda [28] for our IDS system. The relevant formulas are as follows.
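For a given dataset

$$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}, \tag{1}$$

$x_i$ is the input instance and represents a network packet record. Each $x_i$ has $n$ features. $N$ indicates the number of packet records contained in the dataset $D$. $y_i \in \{0, 1, 2, \ldots, K-1\}$ is the class tag, which gives the result of each detection record.
Let $Q$ represent the training data at node $m$, and let $N_m$ be the number of instances in $Q$.
For each split $\theta = (j, t_m)$, which consists of a feature $j$ and a threshold $t_m$, the data is divided into two subsets $Q_1(\theta)$ and $Q_2(\theta)$:

$$Q_1(\theta) = \{(x, y) \mid x_j \le t_m\}, \qquad Q_2(\theta) = Q \setminus Q_1(\theta). \tag{2}$$

The impurity at node $m$ can be obtained by using an impurity function $H()$:

$$G(Q, \theta) = \frac{N_1}{N_m} H(Q_1(\theta)) + \frac{N_2}{N_m} H(Q_2(\theta)), \tag{3}$$

where $N_1$ and $N_2$ are the numbers of instances in $Q_1(\theta)$ and $Q_2(\theta)$. Select the parameter that minimizes the impurity:

$$\theta^* = \arg\min_{\theta} G(Q, \theta). \tag{4}$$

Recurse for both $Q_1(\theta^*)$ and $Q_2(\theta^*)$ until the maximum depth is reached, $N_m < \text{min samples}$, or $N_m = 1$. For the classification of IDS, with class tags in $\{0, 1, 2, \ldots, K-1\}$, node $m$ represents a region $R_m$ with $N_m$ instances. Assume that $p_{mk}$ is the proportion of class $k$ instances in node $m$; it can be obtained by the following formula:

$$p_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k). \tag{5}$$

The measure of impurity generally named Gini can be obtained by the following formula:

$$H(Q) = \sum_{k} p_{mk} (1 - p_{mk}). \tag{6}$$

Cross entropy can be obtained by the following formula:

$$H(Q) = -\sum_{k} p_{mk} \log p_{mk}. \tag{7}$$

Misclassification can be obtained by the following formula:

$$H(Q) = 1 - \max_{k} p_{mk}. \tag{8}$$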
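The split search described above can be illustrated with a minimal sketch in plain Python 3. It is not the CART implementation used in the paper: the function names (`class_proportions`, `best_split`) and the toy records are hypothetical, and the search is an unoptimized exhaustive scan over every feature and observed threshold.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Proportion of each class among the labels that reach a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    return sum(p * (1.0 - p) for p in class_proportions(labels))


def cross_entropy(labels):
    return -sum(p * math.log(p) for p in class_proportions(labels))


def misclassification(labels):
    return 1.0 - max(class_proportions(labels))


def best_split(rows, labels, impurity=gini):
    """Scan every (feature, threshold) pair and keep the lowest weighted impurity."""
    n = len(rows)
    best = None  # (weighted impurity, feature index, threshold)
    for j in range(len(rows[0])):
        for t in sorted({row[j] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            if not left or not right:
                continue  # a split must produce two non-empty subsets
            g = len(left) / n * impurity(left) + len(right) / n * impurity(right)
            if best is None or g < best[0]:
                best = (g, j, t)
    return best


# Hypothetical toy records: two features, two classes.
rows = [[0.1, 5.0], [0.2, 1.0], [0.8, 4.0], [0.9, 2.0]]
labels = [0, 0, 1, 1]
print(gini(labels))              # 0.5
print(best_split(rows, labels))  # (0.0, 0, 0.2)
```

In this toy case the first feature separates the two classes perfectly at threshold 0.2, so the weighted impurity of the chosen split is zero and recursion would stop at both children.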

Evaluation Indicators.
In this section, we mainly introduce the indicators of IDS.
(1) F1 Score. Assuming that we classify a sample dataset as either normal or abnormal, there are four cases of classification, as shown in Table 1: True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN). True means that the classification is correct while False means that the classification is wrong. Positive means that the classifier labels the instance as normal (positive samples) and Negative means that the classifier labels the instance as abnormal (negative samples):
(1) True Positive: a normal instance is detected correctly.
(2) False Positive: an abnormal instance is incorrectly classified as normal.
(3) False Negative: a normal instance is misclassified as abnormal.
(4) True Negative: an abnormal instance is detected correctly.
Precision $P$ represents the proportion of relevant instances among the detected instances. $P$ can be obtained by the following formula:

$$P = \frac{TP}{TP + FP}. \tag{9}$$

Recall $R$ represents the proportion of relevant instances that have been detected over the total amount of relevant instances. $R$ can be obtained by the following formula:

$$R = \frac{TP}{TP + FN}. \tag{10}$$

The indicators $P$ and $R$ are sometimes contradictory, and thus the $F_1$ score is the common evaluation indicator.
The $F_\beta$ score is the weighted harmonic mean of $P$ and $R$, which can be obtained by the following formula:

$$F_\beta = \frac{(1 + \beta^2) P R}{\beta^2 P + R}. \tag{11}$$

In particular, when $\beta = 1$, we obtain the $F_1$ score:

$$F_1 = \frac{2 P R}{P + R}. \tag{12}$$

(2) The Calculation Time. The calculation time of the IDS detection algorithm contains the model construction time and the detection time of the proposed method.
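As a small worked example of the indicators above, the following sketch counts true positives, false positives, and false negatives for one class treated as positive and derives precision, recall, and the F score from them. The function name `precision_recall_fbeta` and the five sample labels are hypothetical and only serve as an illustration.

```python
def precision_recall_fbeta(y_true, y_pred, positive, beta=1.0):
    """Precision, recall and F-beta for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_beta


# Hypothetical labels: "normal" is the positive class, as in the text.
y_true = ["normal", "normal", "normal", "attack", "attack"]
y_pred = ["normal", "normal", "attack", "normal", "attack"]

# Here TP = 2, FP = 1, FN = 1, so precision, recall and F1 all equal 2/3.
p, r, f1 = precision_recall_fbeta(y_true, y_pred, positive="normal")
print(p, r, f1)
```

For a multi-class setting such as the twenty-two attack kinds, the same computation can be repeated with each class in turn as the positive one.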

A New IDS System for Fog Computing
In this section, a new IDS system for fog computing is presented. As shown in Figure 2, our proposed IDS system mainly consists of three steps: Step 1: data preprocessing; Step 2: data normalization; Step 3: decision tree detection. The main work of each step is as follows.
Step 1 (data preprocess). The given dataset is usually composed of numbers and strings. String values cannot be compared directly, and thus we need to digitize the strings by using a string replace operation. The details are shown in Algorithm 1. We firstly traverse the given dataset, find all the strings in it, and obtain the corresponding columns by using the find () function (Line (1) to Line (3)). Secondly, we call the replace function to replace each string with a random number (Line (7) to Line (9)). Finally, the processed dataset is returned and becomes the input for Step 2.
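A minimal sketch of this digitizing step is given below. It is not Algorithm 1 itself: the helper names `find_string_columns` and `digitize` are hypothetical, and instead of arbitrary random numbers it assigns each distinct string a consecutive integer code, an assumption made here so that the same string always maps to the same number.

```python
def find_string_columns(dataset):
    """Indices of columns that contain at least one string value."""
    return [j for j in range(len(dataset[0]))
            if any(isinstance(row[j], str) for row in dataset)]


def digitize(dataset):
    """Return a copy in which every distinct string is replaced by an integer code."""
    processed = [list(row) for row in dataset]
    for j in find_string_columns(dataset):
        codes = {}
        for row in processed:
            # The first time a value is seen it receives the next unused code.
            row[j] = codes.setdefault(row[j], len(codes))
    return processed


# Hypothetical records with two symbolic columns (protocol and service).
records = [
    [0, "tcp", "http", 181],
    [0, "udp", "domain_u", 105],
    [0, "tcp", "smtp", 1684],
]
print(digitize(records))
# [[0, 0, 0, 181], [0, 1, 1, 105], [0, 0, 2, 1684]]
```

The original records are left untouched, so the raw data can still be inspected after preprocessing.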

Figure 2: A new IDS detection system for fog computing. Input: dataset; Step 1: data preprocess; Step 2: normalize; Step 3: decision tree detection; Output: detection result.
Step 2 (data normalization). Notice that the ranges of the numeric columns in the processed dataset may not be uniform. Columns with large values would then drown out columns with small values, even though some of the small-valued columns may play a very important role. We should therefore perform normalization before executing the detection algorithm; the objective of normalization is to scale the feature data into the range [0, 1]. The main content is shown in Algorithm 2. Firstly, we randomly select a given percentage of the processed dataset as the training dataset and use the remaining records as the testing dataset (Line (1)). Then we obtain the normalization results $X_{train}$ and $X_{test}$ by using the normalization function (Line (2) to Line (3)). $X_{train}$ and $X_{test}$ will be the input for Algorithm 3.
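The split and normalization can be sketched as follows in plain Python. This is an illustration rather than Algorithm 2: the helper names are hypothetical, and fitting the column minima and maxima on the training part only and then applying them to the test part is an assumption, since the text does not say how the normalization function treats the two parts. Under this assumption, test values outside the training range can fall outside [0, 1].

```python
import random


def split_dataset(dataset, train_fraction, seed=0):
    """Randomly split records into a training part and a testing part."""
    rng = random.Random(seed)
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def fit_min_max(rows):
    """Per-column minimum and maximum of the given rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def min_max_scale(rows, lows, highs):
    """Scale each column with (v - min) / (max - min); constant columns map to 0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical digitized records with very different column ranges.
features = [[0, 181.0], [1, 105.0], [0, 1684.0], [2, 239.0], [1, 0.0]]
train, test = split_dataset(features, 0.6)
lows, highs = fit_min_max(train)
X_train = min_max_scale(train, lows, highs)
X_test = min_max_scale(test, lows, highs)
print(X_train)
print(X_test)
```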
Step 3 (decision tree detection method). In this step, we construct the decision tree by using the given training dataset $X_{train}$ and then get the detection result for the test dataset $X_{test}$. As shown in Algorithm 3, firstly, the decision tree model mo is established by using the CART function according to formulas (2) to (8) illustrated in Section 2.1 (Line 1). Secondly, the labels of $X_{test}$ are predicted by using the model mo (Line 2). Finally, we obtain the $F_1$ score and calculation time by using the statistics () function according to formulas (9) to (12) illustrated in Section 2.2 (Line 3).
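The detection step can be sketched with scikit-learn, whose `DecisionTreeClassifier` is documented as an optimized version of CART. The sketch below is an illustration under assumptions, not the authors' Algorithm 3: it targets a current Python 3 release of scikit-learn rather than the Python 2.7 environment of the experiments, it uses synthetic data in place of KDDCUP99, and the inline timing and per-class F1 computation only stand in for the statistics () function.

```python
import time

import numpy as np
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Synthetic stand-in for normalized records: two well separated classes.
X_train = np.vstack([rng.uniform(0.0, 0.4, (60, 3)),
                     rng.uniform(0.6, 1.0, (60, 3))])
y_train = np.array([0] * 60 + [1] * 60)
X_test = np.vstack([rng.uniform(0.0, 0.4, (40, 3)),
                    rng.uniform(0.6, 1.0, (40, 3))])
y_test = np.array([0] * 40 + [1] * 40)

start = time.perf_counter()
# Model construction (Gini impurity) followed by detection on the test data.
mo = DecisionTreeClassifier(criterion="gini", random_state=0)
mo.fit(X_train, y_train)
y_pred = mo.predict(X_test)
elapsed = time.perf_counter() - start

# One F1 score per class, in the order of the sorted class labels.
scores = f1_score(y_test, y_pred, average=None)
print("per-class F1:", scores)
print("calculation time (s):", elapsed)
```

The measured time covers both model construction and detection, matching the definition of calculation time in Section 2.2.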

Experimental Environment.
In this section, we evaluate our proposed IDS system on the KDDCUP99 dataset. The experiment is implemented in Python on a Windows 10 operating system, where the processor is an Intel Core i7 at 2.7 GHz, the RAM is 16 GB, and the main software platform is Eclipse and Anaconda 2.7 with scikit-learn.

Introduction of the Dataset.
Research on IDS needs a large amount of valid experimental data. Data can be collected with capture tools such as TCPdump, Libdump, and Wireshark, and connection records are then generated as the data source for IDS. In this study, we use the KDDCUP99 [27] dataset for our test. The dataset consists of nine weeks of network connection data collected from a simulated US Air Force LAN. It comes in two versions: the 10% dataset, named KDDcup.Data.10.percent.corrected, and the full dataset, named kddcup.data.corrected. Each connection record in the KDDCUP99 dataset contains forty-one fixed feature attributes and a class label. Among the forty-one features, nine features are symbolic while the others are continuous. As shown in Table 2, the class identifier indicates whether the connection record is normal or a specific kind of attack. In addition, we can see that DOS, Probing, R2L, and U2R each have a more detailed division. In this study, the attack kinds are marked as numbers. The corresponding marks are shown in Tables 3 and 4. On the one hand, we complete the detection of four kinds of attacks; on the other hand, we complete the detection of twenty-two kinds of attacks. Meanwhile, both the 10% dataset and the full dataset are tested in our experiment. Thus, we perform four groups of experiments for each method, one for each combination of attack division (four kinds or twenty-two kinds of attacks) and dataset (10% or full).
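The relation between the two label divisions can be sketched as a lookup from the twenty-two attack names to the four attack kinds. The grouping below follows the standard KDDCUP99 task description; the numeric marks of Tables 3 and 4 are not reproduced, and the names `ATTACK_CATEGORY` and `to_category` are hypothetical. Labels in the raw files end with a period, which the helper strips.

```python
ATTACK_CATEGORY = {
    # DOS
    "back": "DOS", "land": "DOS", "neptune": "DOS",
    "pod": "DOS", "smurf": "DOS", "teardrop": "DOS",
    # Probing
    "ipsweep": "Probing", "nmap": "Probing",
    "portsweep": "Probing", "satan": "Probing",
    # R2L
    "ftp_write": "R2L", "guess_passwd": "R2L", "imap": "R2L",
    "multihop": "R2L", "phf": "R2L", "spy": "R2L",
    "warezclient": "R2L", "warezmaster": "R2L",
    # U2R
    "buffer_overflow": "U2R", "loadmodule": "U2R",
    "perl": "U2R", "rootkit": "U2R",
}


def to_category(label):
    """Map a raw class label such as 'smurf.' to normal or one of the four kinds."""
    name = label.strip().rstrip(".")
    return "normal" if name == "normal" else ATTACK_CATEGORY[name]


print(to_category("smurf."))   # DOS
print(to_category("normal."))  # normal
```

Training on the raw names gives the twenty-two-kind experiments, while mapping them through `to_category` first gives the four-kind experiments on the same records.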

Experiment Result and Discussion.
We compare the experiment results in terms of $F_1$ score and calculation time. In order to cover all the attack kinds and ensure the effectiveness of the test results, we randomly divide the dataset, using 60% as the training dataset and 40% as the test dataset. The Naïve Bayesian method has three models: MultinomialNB, BernoulliNB, and GaussianNB [32]. Therefore, we firstly test the Bayesian models to find the best one for IDS and then compare it with the other two methods. For each method, we conduct 10 groups of experiments and compare their averages.
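The evaluation protocol described above (a random 60/40 split, repeated 10 times and averaged) can be sketched with scikit-learn as follows. This is a hedged illustration, not the authors' experiment code: the data is synthetic, the helper `average_macro_f1` is hypothetical, the macro-averaged F1 is used as a single summary number although the paper reports per-class scores, and `binarize=0.5` for BernoulliNB is an assumption chosen to suit features scaled into [0, 1].

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB


def average_macro_f1(make_model, X, y, runs=10, test_size=0.4):
    """Average macro F1 over several random 60/40 train/test splits."""
    scores = []
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model = make_model().fit(X_tr, y_tr)
        scores.append(f1_score(y_te, model.predict(X_te), average="macro"))
    return float(np.mean(scores))


rng = np.random.RandomState(0)
# Synthetic, already normalized features (non-negative, as MultinomialNB requires).
X = np.vstack([rng.uniform(0.0, 0.5, (100, 4)),
               rng.uniform(0.5, 1.0, (100, 4))])
y = np.array([0] * 100 + [1] * 100)

models = {
    "MultinomialNB": MultinomialNB,
    "BernoulliNB": lambda: BernoulliNB(binarize=0.5),
    "GaussianNB": GaussianNB,
}
results = {name: average_macro_f1(make, X, y) for name, make in models.items()}
for name, score in results.items():
    print(name, round(score, 3))
```

The same helper accepts any scikit-learn classifier factory, so a decision tree or KNN model could be compared under identical splits.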

Experiment Result Contrast of Three Modes of Bayesian.
Firstly, we test the Bayesian method. The calculation time results are shown in Figure 3. MultinomialNB has the shortest calculation time in all the test cases, followed by BernoulliNB, and GaussianNB is the slowest. We then compare the modes according to their $F_1$ scores, using the following principle. We first look at the detection precision of the normal class, since in practice the proportion of the normal class is relatively large, and then look at the detection coverage of all attacks. As the attacks are divided into four kinds and twenty-two kinds, we first discuss the detection results of the three modes on four kinds of attacks and then on twenty-two kinds of attacks.
(1) As shown in Figure 4, for the 10% dataset, the detection precision of GaussianNB for the normal type is significantly lower than that of the other two modes. BernoulliNB is slightly lower than MultinomialNB for normal type detection. As shown in Figure 5, for the full dataset, the $F_1$ score of GaussianNB for the normal type increases but is still lower than that of the other two. In addition, GaussianNB and BernoulliNB cover all attack types, whereas the $F_1$ score of U2R with MultinomialNB is 0%. BernoulliNB is relatively stable. Although the $F_1$ score of U2R detection by GaussianNB is better than that of BernoulliNB, its $F_1$ score for R2L is much lower than BernoulliNB's, and its $F_1$ score on the normal type is also lower than BernoulliNB's. In addition, the calculation time of GaussianNB is much longer than that of BernoulliNB. Thus, BernoulliNB is the best mode for IDS.
(2) Next, we discuss the results of the three modes for detecting twenty-two attacks over both datasets. Similarly, we first discuss the results for the normal class. As shown in Figure 6, as in the above scenario, for the 10% dataset, the $F_1$ score of GaussianNB for the normal type is significantly lower than that of the other two modes. BernoulliNB obtains the same precision for normal type detection as MultinomialNB. In addition, in view of the $F_1$ scores of the twenty-two kinds of attacks, BernoulliNB is the best. As shown in Figure 7, for the full dataset, the $F_1$ score of the normal class by GaussianNB improves but is still lower than that of the other two modes. The analysis is the same as above. Considering the $F_1$ scores for the other attacks, BernoulliNB is the best.
Thus, among the three modes of Bayesian, the BernoulliNB model is the most suitable one for IDS. We compare it with the other two methods in the next experiment.

Experiment Results Contrast of Three Methods.
Next, we compare BernoulliNB with decision tree and KNN. The calculation time results are shown in Table 5. BernoulliNB has the shortest calculation time in all the test cases, followed by decision tree, and KNN is the slowest.
(1) We first discuss the detection results of the two methods on four kinds of attacks and then discuss the situation of twenty-two kinds of attacks. As shown in Figure 8, for the 10% dataset, the $F_1$ score of every attack kind with the decision tree is higher than with BernoulliNB; as shown in Figure 9, for the full dataset, the $F_1$ score of the decision tree is higher than that of BernoulliNB for every attack kind except U2R.
(2) Next, we discuss the results of the two methods on twenty-two attacks over both datasets. Similarly, we firstly discuss the results of the normal class. As shown in Figure 10, the decision tree obtains the same precision as BernoulliNB on the No. 8 attack. As shown in Figure 11, for the full dataset, the decision tree method obtains the same $F_1$ score as BernoulliNB on the No. 9 attack. In addition, the $F_1$ score of the decision tree method is slightly lower than that of BernoulliNB for the No. 4 attack. The decision tree method is better than BernoulliNB in all other cases. The calculation time of the decision tree, however, is much longer than that of BernoulliNB. Above all, the decision tree method is the most suitable one for IDS over big data in fog environment.

Discussion and
In terms of detection, the decision tree algorithm is much better. From the point of view of calculation time, although it is not the best, the calculation time of the decision tree is acceptable. The authors in [24] point out that Naïve Bayesian is generally 7 times faster than decision trees built with C4.5. However, in this study, we can conclude that the decision tree based on CART is much faster. The multiple comparison of calculation time is shown in Table 7.
(1) BernoulliNB is 2.364 times faster than decision tree in the case of four kinds of attacks over the full dataset. The time gap narrows in the case of twenty-two kinds of attacks.
(2) To make the comparison more comprehensive, we also briefly compare the decision tree with the other two Bayesian modes. The decision tree is much faster than GaussianNB in the case of twenty-two kinds of attacks. Even MultinomialNB, which is the most time-efficient mode of Naïve Bayesian, is only 4.857 times faster than decision tree in the worst situation.
However, taking into account the detection accuracy, as well as the coverage of the attacks, there is no doubt that the decision tree is the best choice for IDS over big data in fog computing.
Above all, our proposed IDS system is efficient and precise. As shown in Figure 1, our proposed system can be deployed in a common node of fog layer without extra requirement. According to the above experiment, we can conclude that the system performance is stable and performs very well in big data environment.

Related Work
Fog computing was first proposed by Cisco in 2012 and defined as a highly virtualized computing platform for migrating the tasks of the Cloud toward the mobile users at the network edge. Fog computing [4] introduces a middle layer between the cloud and the mobile users, extending the cloud-based network structure, and provides computing, storage, and network services between mobile devices and the Cloud. Fog computing reduces unnecessary repeated communication between the cloud computing center and the mobile users [8]. It not only reduces the network delay for mobile users but also significantly reduces the bandwidth load on the backbone link [9,10]. Although fog computing has many advantages, some security issues still need to be solved. More specifically, fog computing nodes are usually devices with weak computing power. Traditional network attacks become more common in the fog computing environment, such as eavesdropping on or hijacking mobile user data and even attempting to destroy the fog system. Fortunately, Intrusion Detection Systems (IDS) can also be applied in the fog environment [11].
After decades of development, IDS has become a successful security technology. IDS, as represented by Snort [33], has made an outstanding contribution to network security in recent years. ISS RealSecure is also well known, and it mainly consists of two parts, the engine part and the console part. The former is responsible for detecting information and generating alarms, and the latter receives the alarms and is a central point for configuring and generating the database report. Pattern matching is one of the core technologies of IDS products. Misuse detection based on AC, BM, MWM, and other matching algorithms [15] gives an IDS passive detection of known attacks with wide and obvious characteristics. However, modern attacks increasingly combine a variety of known intrusion techniques into an unknown one. Improved IDS methods therefore usually take proactive protection based on deviation detection and user behavior anomaly detection. Statistical models, Bayesian reasoning, cluster analysis [16], and other excellent algorithms like DB can make up for the shortcomings of pattern matching. The KNN algorithm, known as the nearest neighbor algorithm [17], is widely used in pattern recognition, classification, and regression [18]. Like KNN, vector automatic classification algorithms, support vector machines [19,20], neural network algorithms [21], Bayesian algorithms [22][23][24], and the K-means algorithm are also widely used for IDS [25,26].
Although IDS for traditional networks has been well investigated, directly using the existing methods in a fog computing environment may be inappropriate. More specifically, the existing research mainly presents experiments on the 10% KDDCUP99 dataset. Although these methods have achieved good results, we cannot judge their efficiency in a big data environment, or even on the full KDDCUP99 dataset. In addition, KDDCUP99 classifies attacks into four kinds as well as into twenty-two kinds. However, the existing research mainly focuses on the detection of the four kinds of attacks and fails to consider the detection of the twenty-two kinds. In order to address the aforementioned problem, we propose an IDS system based on Anaconda, use the decision tree for our IDS detection, and compare multiple methods. The authors in [24] also use Bayesian and decision tree methods for IDS; different from them, we conduct a more thorough experiment and compare the decision tree with three modes of the Naïve Bayesian method as well as the KNN method. More specifically, both the 10% dataset and the full dataset are tested in our IDS system. We not only complete the detection of four kinds of attacks but also accomplish the detection of twenty-two kinds of attacks. In addition, the calculation time of each method is compared. The authors in [20] also consider the calculation time of their algorithm; however, they only present experiments on the 10% dataset, and thus we cannot judge the performance of their algorithm in a big data environment. Above all, the experiment results show that our proposed system is effective and precise.

Conclusion
Traditional network attacks are widely present in the fog computing environment. Although IDS for traditional networks has been well investigated, directly using the existing methods in a fog computing environment may be inappropriate. In this study, we propose a system based on the decision tree and compare multiple methods with it. Both the 10% dataset and the full dataset are tested, and the experiment results show that our system is effective. In addition, we compare the detection time of each method. Given the accuracy it achieves, the calculation time of the decision tree is acceptable even though it is not the best. Above all, our IDS system can be used in a fog computing environment over big data. In future work, we will study IDS for other kinds of attacks.

Conflicts of Interest
The authors declare that they have no conflicts of interest.