An Effective Algorithm for Intrusion Detection Using Random Shapelet Forest

Detection of abnormal network traffic is an important issue when building intrusion detection systems. An effective way to address this issue is time series mining, in which the network traffic is naturally represented as a set of time series. In this paper, we propose a novel efficient algorithm, called RSFID (Random Shapelet Forest for Intrusion Detection), to detect abnormal traffic flow patterns in periodic network packets. Firstly, the Fast Correlation-based Filter (FCBF) algorithm is employed to remove irrelevant features, which reduces both overfitting and time complexity. Then, a random forest built upon a set of shapelet candidates is used to classify the normal and abnormal traffic flow patterns. Specifically, the Symbolic Aggregate approXimation (SAX) and a random sampling technique are adopted to mitigate the high time complexity caused by enumerating shapelet candidates. Experimental results show the effectiveness and efficiency of the proposed algorithm.


Introduction
An intrusion detection system (IDS) is an important part of modern network security infrastructure. It analyzes traffic packets online or offline to identify intrusion behaviors in networks. However, some attacks are very difficult to detect. For example, a distributed denial of service (DDoS) attack creates tens of thousands of zombie computers and orders them to attack a target server at the same time. It not only fabricates source IP addresses to avoid detection, but also increases the traffic exponentially. Therefore, an efficient technique for detecting intrusion behaviors is required.
The basic principle of intrusion detection technology is to build a normal or abnormal behavior model through the analysis of relevant data, which may be stored in a security log or audit database, and to compare the model with the user behavior to identify potentially harmful behavior [1]. The key to success is therefore the discovery of effective behavior characteristics (or patterns) from the relevant data.
As an effective technology to search and mine hidden information from massive data, data mining is very suitable for intrusion detection. So far, a variety of data mining technologies, including classification, clustering, and anomaly detection, have been successfully applied in intrusion detection.
Classification is a popular technology in intrusion detection. Given a set of labeled instances, it learns a function that can assign a label to a new unlabeled instance. Lee and Stolfo [2] first extracted rules from audit data and used the rules to detect abnormal behavior in network traffic. Gao et al. [3] employed the Apriori algorithm to extract traffic flow patterns from network data and subsequently used the K-means clustering algorithm to generate a detection model. Besides that, many popular classification techniques have been adopted in intrusion detection, such as K-nearest neighbor [4,5], decision tree [6,7], and support vector machine [8]. Recently, deep learning, which has attracted much attention from the community, has been employed in intrusion detection and achieves state-of-the-art performance [9,10]. However, the intrinsic defect of deep learning, namely, its lack of interpretability, prevents it from being a ready-made panacea.
In this paper, we employ a time series classification (TSC) technique to detect abnormal behaviors based on offline traffic flow data. Specifically, the adopted technique is called shapelet, which is a new primitive in the field of TSC. The contributions of this work include the following:
(1) We propose a novel TSC framework for intrusion detection, which is composed of a feature selection algorithm (FCBF) and a shapelet-based random forest classifier.
(2) The traffic flow data is represented by SAX, and the shapelet candidates used to train the classifier are sampled randomly. In this way, the running time is greatly reduced.
The proposed algorithm, called RSFID, is validated on several data sets of intrusion detection. The results show that RSFID is effective in detecting abnormal behavior in traffic flow. Owing to the intrinsic advantage of shapelet-based methods, i.e., good interpretability, our work provides a different solution to the problem of intrusion detection.
The rest of the paper is organized as follows. Section 2 briefly introduces the development of IDS and recalls the basic knowledge of shapelet-based TSC. Section 3 explains the details of the RSFID algorithm and gives a theoretical analysis of its complexity. Next, the experimental details are introduced and the results are analyzed in depth in Section 4. Finally, Section 5 gives conclusions.

Intrusion Detection and Time Series Classification.
Intrusion detection is aimed at extracting patterns or characteristics of users' behaviors by analyzing the security log and then identifying dangerous behavior in the system. The solutions can be divided into two types. The first builds a safe/normal behavior model as the evaluation criterion of user behavior. When the user behavior is obviously different from the safe/normal behavior model, it is considered to be an intrusion. The second builds an unsafe/abnormal model (i.e., a model of intrusion behavior) from a set of collected intrusion data. If the detected behavior is similar to the unsafe/abnormal model, it is considered to be an intrusion.
There are many ways to handle the intrusion detection problem, such as classification, clustering, and anomaly detection. Besides those, time series classification is considered to be a suitable solution because the traffic flow data is temporally ordered. Luo et al. [11] modeled brain activity as time series and used the K-nearest neighbor algorithm to detect anomalies. Chin et al. [12] evaluated a number of anomaly detection algorithms that are based on symbolic time series analysis. Recently, Wei et al. [13] proposed an assumption-free technique for anomaly detection using time series classification. Kim et al. [14] introduced a shapelet-based method to detect abnormal behavior in network traffic. However, the algorithm is based on exhaustive search; hence, it is too time-consuming.

Shapelet-Based Time Series Classification.
Shapelets are time series subsequences that are maximally representative of a class [15]. Owing to their strong interpretability, they have attracted much attention from the community. In the last decade, over a hundred papers have been published to develop this technique. Below, we recall some basic knowledge in this field.
Definition 1 (Time series). A time series is denoted by a sequence of values $T = t_1, t_2, \dots, t_{|T|}$, where $|T|$ is the length of the time series. The data points $t_1, t_2, \dots, t_{|T|}$ are typically arranged in temporal order and spaced at equal time intervals.
Definition 2 (Time series data set). A time series data set $D$ is a set of pairs of a time series $T_i$ and its corresponding label $c_i \in C$, i.e., $D = \{\langle T_1, c_1 \rangle, \langle T_2, c_2 \rangle, \dots, \langle T_n, c_n \rangle\}$, where $n$ is the number of time series in the data set and $C$ is the set of labels.
Furthermore, since most real-world time series data are multidimensional, such as the monitoring data collected from Internet of Things systems, ECG monitoring systems, and IDSs, we use $T_{i,j}$ to represent the $j$-th dimension of the $i$-th time series, and the $k$-th position of $T_{i,j}$ is written as $T_{i,j,k}$.
Definition 3 (Subsequence). A time series subsequence $S$ is a contiguous sequence of a time series. The subsequence of length $l$ of time series $T_{i,j}$ starting at position $k$ is denoted as $S^{k,l}_{i,j} = T_{i,j,k}, T_{i,j,k+1}, \dots, T_{i,j,k+l-1}$. Furthermore, the set of all subsequences of length $l$ of a time series $T$ is denoted as $\Psi(T, l)$.
For simplicity, the concepts introduced below are explained only for one-dimensional time series; all of them extend naturally to the multidimensional case.
Definition 4 ($\alpha$ distance and $\beta$ distance). The $\alpha$ distance and the $\beta$ distance define the distance between two time series $T_1$, $T_2$ of the same length and the distance between a subsequence $S$ and a time series $T$, respectively.
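In this paper, we use the Euclidean distance to measure both types of distance. The formulas are given below, where $m$ is the length of the two time series.
$$dist_\alpha(T_1, T_2) = \sqrt{\sum_{i=1}^{m} (t_{1,i} - t_{2,i})^2}$$
$$dist_\beta(S, T) = \min_{S' \in \Psi(T, |S|)} dist_\alpha(S, S')$$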
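The two distances can be illustrated with a minimal sketch. It assumes, as is usual in the shapelet literature, that the β distance is the smallest α distance between the subsequence and any window of the same length in the series; the function names `dist_alpha` and `dist_beta` are illustrative and not taken from the paper.

```python
import math
from typing import Sequence


def dist_alpha(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two series of equal length."""
    if len(a) != len(b):
        raise ValueError("alpha distance needs series of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dist_beta(s: Sequence[float], t: Sequence[float]) -> float:
    """Smallest alpha distance between s and any window of t with len(s) points.

    Assumption: the minimum over all sliding windows is used.
    """
    length = len(s)
    if length == 0 or length > len(t):
        raise ValueError("subsequence must be non-empty and no longer than the series")
    return min(dist_alpha(s, t[k:k + length]) for k in range(len(t) - length + 1))
```

For example, `dist_beta([1, 2], [0, 1, 2, 5])` is zero because the subsequence occurs verbatim inside the series.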
Shapelets, which are maximally representative of a class, are essentially a set of subsequences. Our purpose is to choose a subset of subsequences that have strong discriminatory power to build a classifier. To measure the discriminatory power of a shapelet candidate, we give the definitions of split and information gain (IG).
Wireless Communications and Mobile Computing
Definition 5 (Split). A split is a tuple $\eta = \langle S, \tau \rangle$, where $S$ is a time series subsequence and $\tau$ is a distance threshold that splits the data set $D$ into two subsets $D_L$ and $D_R$.
Given a time series subsequence $S$, we can calculate the distance between $S$ and all series in $D$, i.e., $dist_\beta(S, T_i)$. If $dist_\beta(S, T_i) \le \tau$, the time series $T_i$ is added to $D_L$; otherwise, it is added to $D_R$.
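Definition 6 (IG). The information gain of a split $\eta = \langle S, \tau \rangle$ is calculated as follows:
$$IG(\eta) = E(D) - \frac{n_L}{n} E(D_L) - \frac{n_R}{n} E(D_R)$$
The symbols $n_L$ and $n_R$ denote the numbers of time series in $D_L$ and $D_R$, respectively, and $E(D) = -\sum_{i=1}^{|C|} (n_i/n) \log (n_i/n)$ is the entropy of the data set, where $n_i$ is the number of time series with the $i$-th label.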
Given a time series subsequence $S$ and a data set of time series $D$, we can calculate the distance between $S$ and all series in $D$ and obtain a list of distances sorted in ascending order, $\langle d_1, d_2, \dots, d_n \rangle$. We say that a split $\eta = \langle S, \tau \rangle$ is a shapelet candidate if there is no $\eta' = \langle S, \tau' \rangle$ such that $IG(\eta') > IG(\eta)$. To distinguish a shapelet candidate from a split, we use the symbol $\theta = (S, \tau)$ to represent it. There are infinitely many splits for a specific subsequence. To limit the search space, we only test the mean value of every two adjacent distance values, i.e., $(d_i + d_{i+1})/2$.
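The threshold search described above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the base-2 logarithm and the names `entropy`, `information_gain`, and `best_threshold` are assumptions, and the distances are taken as already computed.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def entropy(labels: Sequence[str]) -> float:
    """Entropy of a label list (base-2 logarithm is an assumption)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(dists: Sequence[float], labels: Sequence[str], tau: float) -> float:
    """Gain of splitting at tau: distances <= tau go left, the rest go right."""
    left = [lab for d, lab in zip(dists, labels) if d <= tau]
    right = [lab for d, lab in zip(dists, labels) if d > tau]
    n = len(labels)
    return entropy(labels) - len(left) / n * entropy(left) - len(right) / n * entropy(right)


def best_threshold(dists: Sequence[float], labels: Sequence[str]) -> Tuple[float, float]:
    """Test only midpoints of adjacent sorted distances and return (tau, gain)."""
    if not dists:
        raise ValueError("at least one distance is required")
    ordered = sorted(set(dists))  # equal distances always fall on the same side
    best_tau, best_gain = ordered[0], 0.0
    for lo, hi in zip(ordered, ordered[1:]):
        tau = (lo + hi) / 2
        gain = information_gain(dists, labels, tau)
        if gain > best_gain:
            best_tau, best_gain = tau, gain
    return best_tau, best_gain
```

With distances `[0.1, 0.2, 0.9, 1.0]` and labels `["a", "a", "b", "b"]`, the midpoint between 0.2 and 0.9 separates the classes perfectly and yields the maximal gain.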
Ye and Keogh [15] first introduced the concept of shapelet and proposed a brute-force algorithm that searches for the best candidate to be the final shapelet embedded into a decision tree classifier. The algorithm suffers from two problems: the exhaustive search is too time-consuming, and the decision tree training is embedded in the search process. There are some solutions to the first problem, including [15-17]. Owing to space limits, we skip the introduction of these techniques. Next, we introduce an interesting technique, called shapelet transformation, which separates the shapelet search from the classifier building by transforming the original time series data set to a new feature space [18].
Definition 7 (Shapelet transformation). Given a time series data set $D = \{T_1, T_2, \dots, T_n\}$ and a feature space $\Sigma$ consisting of a set of selected shapelets, i.e., $\Sigma = \{S_1, S_2, \dots, S_k\}$, the shapelet transformation is a matrix $M$ with $n$ rows and $k$ columns, in which each entry is the distance between a time series and a shapelet, i.e., $M_{i,j} = dist_\beta(S_j, T_i)$.
The shapelet transformation removes the temporal characteristic of the original time series. Hence, a large number of classical data mining techniques can be applied to time series mining. However, this technique also has some problems. For example, the process of shapelet selection is also time-consuming, and the selected shapelets are often irrelevant and redundant [19-21].
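A minimal sketch of the transformation is given below. It assumes that each entry of the matrix is the β distance between a series and a shapelet, computed with the Euclidean distance and a minimum over sliding windows; the names `dist_beta` and `shapelet_transform` are illustrative.

```python
import math
from typing import List, Sequence


def dist_beta(s: Sequence[float], t: Sequence[float]) -> float:
    """Smallest Euclidean distance between s and any equal-length window of t."""
    length = len(s)
    return min(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t[k:k + length])))
        for k in range(len(t) - length + 1)
    )


def shapelet_transform(series: Sequence[Sequence[float]],
                       shapelets: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return an n-by-k matrix: row i is series i, column j is shapelet j."""
    return [[dist_beta(s, t) for s in shapelets] for t in series]
```

Each row of the result is an ordinary feature vector, so any non-temporal classifier can be trained on it.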

The Proposed Method
3.1. The RSFID Algorithm.
The idea of the RSFID algorithm (Random Shapelet Forest for Intrusion Detection) is illustrated in Figure 1. There are five steps to learn a random shapelet forest (i.e., the classifier) from the original time series. Firstly, the raw network traffic data must be represented by SAX [22]. Although there are other techniques for the representation of time series data, such as PAA, APCA, and DFT, SAX has been shown to be the most efficient technique to compress time series data [23]. The details of the SAX technique can be found in [22]. It must be noted that traffic flow data contains not only real values but also other data types. For example, KDD CUP 99, a famous data set for intrusion detection, contains both real and nominal values. Therefore, the raw data must be preprocessed and converted to normalized real values. After that, the time series data is represented by a set of symbolic words.
The second step randomly selects a set of shapelet candidates. In [19], the authors validated that random sampling is an effective technique that reduces the running time by 3~4 orders of magnitude relative to the Fast Shapelet (FS) algorithm without loss of accuracy. Unlike [19], we combine random sampling with the SAX representation, which further improves the scalability of the algorithm. The third step merges the shapelets extracted from instances of different classes in the same dimension. During this step, some self-similar shapelets are removed to reduce the redundancy of the features. Then, the time series data are transformed to the new feature space, which requires calculating the distance between the shapelets and all series in the data set. In the fifth step, we adopt a classical feature selection algorithm to reduce the dimension of the new data set, i.e., the matrix. Finally, we train a random forest classifier for each dimension, and the set of classifiers is used to judge whether a network traffic flow is an intrusion attack or not.
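The SAX step can be sketched as follows, assuming the standard formulation: z-normalization, piecewise aggregate approximation (PAA) into `w` segments, and discretization with equiprobable breakpoints of the standard normal distribution. The word length `w`, the alphabet size `a`, and the function names are illustrative choices, not values from the paper.

```python
import math
from statistics import NormalDist
from typing import List, Sequence


def znorm(x: Sequence[float]) -> List[float]:
    """Rescale a series to zero mean and unit variance (constant series map to zeros)."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    if std == 0:
        return [0.0] * len(x)
    return [(v - mean) / std for v in x]


def paa(x: Sequence[float], w: int) -> List[float]:
    """Piecewise aggregate approximation: the mean of w contiguous segments."""
    n = len(x)
    if not 0 < w <= n:
        raise ValueError("w must be between 1 and the series length")
    segments = [x[i * n // w:(i + 1) * n // w] for i in range(w)]
    return [sum(seg) / len(seg) for seg in segments]


def sax(x: Sequence[float], w: int, a: int) -> str:
    """Convert a series into a word of w letters drawn from an alphabet of size a."""
    # Breakpoints that cut the standard normal density into a equiprobable regions.
    cuts = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    letters = []
    for v in paa(znorm(x), w):
        index = sum(v > c for c in cuts)  # number of breakpoints below v
        letters.append(chr(ord("a") + index))
    return "".join(letters)
```

A steadily increasing series of eight points, reduced to four segments over a four-letter alphabet, maps to the word `abcd`.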
The pseudo-code of the RSFID algorithm is given in Algorithm 1. Steps 3 to 9 are composed of two nested loops. The outer loop generates $m$ random forest classifiers, i.e., each forest corresponds to one dimension of the time series (i.e., the network traffic data). To predict a new time series, the label is decided by the voting of all classifiers. The inner loop generates the $p$ decision trees of a forest. There are two key steps in the inner loop. The function shapelet_sampling randomly samples $r$ shapelets from the $j$-th dimension of the data set $D'$, which is represented by the SAX method. The function random_shapelet_tree generates a decision tree based on the obtained shapelets $S_{i,j}$ and $D'$. Next, we explain the two functions in detail.
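The loop structure and the voting rule can be sketched as below. This is only a structural sketch: the `sample` and `grow` callables stand in for shapelet_sampling and random_shapelet_tree, their signatures are assumptions, and a tree is modelled as any callable that maps an instance to a label.

```python
from collections import Counter
from typing import Any, Callable, List

Tree = Callable[[Any], str]


def train_forests(data: Any, m: int, p: int, r: int,
                  sample: Callable[[Any, int, int], Any],
                  grow: Callable[[Any, Any], Tree]) -> List[List[Tree]]:
    """Build one forest of p trees for each of the m dimensions."""
    forests = []
    for j in range(m):              # outer loop: one forest per dimension
        forest = []
        for _ in range(p):          # inner loop: p trees per forest
            shapelets = sample(data, j, r)
            forest.append(grow(data, shapelets))
        forests.append(forest)
    return forests


def predict(forests: List[List[Tree]], instance: Any) -> str:
    """Majority vote over every tree of every per-dimension forest."""
    votes = Counter(tree(instance) for forest in forests for tree in forest)
    return votes.most_common(1)[0][0]
```

Passing the two functions as arguments keeps the sketch independent of how sampling and tree growing are actually implemented.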

Shapelet Sampling.
Since exhaustive search leads to exponential growth of the training time, researchers have tested the random sampling technique, and the results show that it reduces the running time by 3~4 orders of magnitude relative to exhaustive search without loss of accuracy [19]. However, the existing work does not consider the redundancy and diversity of the sampled shapelets. In this section, we first introduce the definitions of self-similarity and utility, which are used to select shapelets that are not similar to each other and have strong discriminatory power. Then, we explain the code of shapelet_sampling$(D', j, r)$.
Definition 8 (Self-similarity) [23]. Given two time series subsequences $S_1$ and $S_2$, let $id_1$ and $id_2$ be the index numbers of the time series from which $S_1$ and $S_2$ are extracted, and let $pos_1$, $pos_2$ and $len_1$, $len_2$ denote the start positions and the lengths of $S_1$, $S_2$, respectively. We say that $S_1$ and $S_2$ are self-similar when $id_1 = id_2 \wedge |pos_1 - pos_2| \le \sigma \wedge |len_1 - len_2| \le \lambda$.
Here, the symbols $\sigma$ and $\lambda$ are two user-defined thresholds. The former denotes the allowed distance between the starting positions of two shapelets, and the latter represents the allowed difference between the two shapelet lengths. Next, we give the definition of utility.
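The self-similarity test of Definition 8 translates directly into code. In the sketch below, the `Candidate` record and its field names are hypothetical; only the three conditions come from the definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Where a subsequence was cut from (field names are illustrative)."""
    series_id: int  # index of the source time series
    pos: int        # start position inside that series
    length: int     # number of points in the subsequence


def self_similar(a: Candidate, b: Candidate, sigma: int, lam: int) -> bool:
    """True when both come from the same series and nearly overlap."""
    return (a.series_id == b.series_id
            and abs(a.pos - b.pos) <= sigma
            and abs(a.length - b.length) <= lam)
```

For instance, with thresholds 2 and 5, two subsequences of the same series that start at positions 10 and 12 with lengths 20 and 24 are self-similar, whereas moving the second start to position 13 makes them distinct.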
Definition 9 (Utility). Given a shapelet candidate $\theta = (S, \tau)$, let $c$ denote the label of the instance from which $\theta$ is extracted, and let $C(\cdot)$ be a function that returns the label of an instance. The precision, recall, and utility of $\theta$ are defined with respect to $c$. The utility is, in essence, the f-score that integrates the precision value and the recall value, and it is regarded as the quality score of a shapelet candidate.
Next, we show the pseudo-code in Algorithm 2. In step 2, the algorithm refines the time series data set so that only the $j$-th dimension of $D$ is kept. From steps 3 to 8, the algorithm randomly extracts a subsequence of a time series and generates a shapelet candidate $\theta$. If $\theta$ is self-similar to any candidate in $\Theta$, it is discarded and a new one is sampled; otherwise, $\theta$ is added to the shapelet set $\Theta$. The extraction is repeated $r \times \kappa$ times, where $\kappa$ is a coefficient that controls the total number of shapelet candidates for evaluation. After that, we sort the shapelet candidates in $\Theta$ by their utility and keep the top $r$ shapelets as the final choice.
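One way to compute such a utility is sketched below. The exact formulas are an assumption: an instance is counted as covered by the candidate when its distance does not exceed the threshold, precision and recall are measured for the label of the source instance, and the utility is their harmonic mean. The names `utility` and `top_r` are illustrative.

```python
from typing import List, Sequence, Tuple


def utility(dists: Sequence[float], labels: Sequence[str], tau: float, c: str) -> float:
    """F-score of the rule 'distance <= tau implies label c' (assumed form)."""
    covered = [lab for d, lab in zip(dists, labels) if d <= tau]
    hits = sum(lab == c for lab in covered)
    if hits == 0:
        return 0.0
    precision = hits / len(covered)
    recall = hits / sum(lab == c for lab in labels)
    return 2 * precision * recall / (precision + recall)


def top_r(scored: Sequence[Tuple[float, str]], r: int) -> List[Tuple[float, str]]:
    """Keep the r candidates with the highest utility; items are (utility, name)."""
    return sorted(scored, key=lambda item: item[0], reverse=True)[:r]
```

Sorting by this score and truncating to `r` entries mirrors the final selection step of the sampling procedure.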

Random Shapelet Tree Generation.
The pseudo-code of the function random_shapelet_tree is shown in Algorithm 3. It generates a decision tree from a set of shapelets with the recursive scheme that is typical of tree generation. In the third step, the function bestShapelet finds the shapelet in $\Theta$ that has the highest information gain. If two or more shapelets have the same gain, we choose the one that maximizes the separation gap [16]. After that, we remove the selected shapelet from $\Theta$ in step 4. The function distribute separates the instances in $D$ into two groups, those with a distance $dist_\beta(S, T_i) \le \tau$ and those with a distance $dist_\beta(S, T_i) > \tau$. Then, we invoke random_shapelet_tree to generate the left subtree and the right subtree based on $D_L$ and $D_R$, respectively. Finally, the function makeLeaf returns a representation of a leaf in the generated tree by assigning the class label that occurs most frequently among the instances reaching the node, resolving ties by selecting a label uniformly at random.
Because the cost of the SAX representation is far less than that of generating the random shapelet forest, we only discuss the latter part. In Algorithm 2, the function generateShapelet needs to find the best split of a subsequence, whose worst-case time complexity is $O(nl^2)$, where $n$ is the number of instances in the data set and $l$ is the length of the time series. Besides, the time complexity of the function self_similar is $O(r^2)$, which is far less than $O(nl^2)$. Therefore, the time complexity of shapelet_sampling is $O(r \kappa n l^2)$. In Algorithm 3, the function random_shapelet_tree needs to select the shapelet that has the highest information gain, whose time complexity is $O(r n^2 l^2)$. Then, it recursively builds the left subtree and the right subtree.
The worst case is that the data set is separated into two subsets of equal size in each recursion. Obviously, $O(r \kappa n l^2)$ is less than $O(r n^2 l^2)$; hence, the overall time complexity of the RSFID algorithm is $O(rpmn^2l^2)$. Recall that the symbols $r$, $p$, and $m$ represent the number of sampled shapelets, the number of trees in a forest, and the number of dimensions of the time series, respectively. It is not difficult to see that this is far less than the time complexity of the classical shapelet algorithm, i.e., $O(mn^2l^4)$.

Algorithm 1: The RSFID algorithm.
Input: D: a data set of time series; p: the number of trees in a forest; r: the number of shapelets for each tree
Output: Ω = {F_1, F_2, ⋯, F_m}: a set of random forests, one for each dimension
1 Ω ⟵ ∅;
2 D′ ⟵ SAX(D);
3 for j = 1 to m do
4   F_j ⟵ ∅;
5   for i = 1 to p do
6     Θ_{i,j} ⟵ shapelet_sampling(D′, j, r);
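The recursive tree generation described for random_shapelet_tree can be sketched as follows. The sketch makes several simplifying assumptions: the β distances are precomputed in `rows`, each shapelet already carries its threshold, recursion stops when a node is pure or no shapelet is left, and ties in information gain are broken by iteration order rather than by the separation gap. All names are illustrative.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Entropy of a label list; an empty list has zero entropy."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain(column, labels, tau):
    """Information gain of splitting the instances at distance tau."""
    left = [lab for d, lab in zip(column, labels) if d <= tau]
    right = [lab for d, lab in zip(column, labels) if d > tau]
    n = len(labels)
    return entropy(labels) - len(left) / n * entropy(left) - len(right) / n * entropy(right)


def make_leaf(labels, rng):
    """Most frequent label; ties are resolved uniformly at random."""
    counts = Counter(labels)
    top = max(counts.values())
    return rng.choice(sorted(lab for lab, c in counts.items() if c == top))


def random_shapelet_tree(rows, labels, shapelets, rng):
    """rows[i][s]: distance of instance i to shapelet s; shapelets: {s: tau}."""
    if not shapelets or len(set(labels)) <= 1:
        return make_leaf(labels, rng)
    best = max(shapelets, key=lambda s: gain([r[s] for r in rows], labels, shapelets[s]))
    tau = shapelets[best]
    rest = {s: t for s, t in shapelets.items() if s != best}  # remove the chosen shapelet
    left = [i for i, r in enumerate(rows) if r[best] <= tau]
    right = [i for i, r in enumerate(rows) if r[best] > tau]
    if not left or not right:
        return make_leaf(labels, rng)

    def subtree(idx):
        return random_shapelet_tree([rows[i] for i in idx], [labels[i] for i in idx], rest, rng)

    return (best, tau, subtree(left), subtree(right))


def classify(tree, row):
    """Follow the splits until a leaf (a plain label) is reached."""
    while isinstance(tree, tuple):
        s, tau, left, right = tree
        tree = left if row[s] <= tau else right
    return tree
```

An inner node is stored as a tuple `(shapelet, tau, left, right)` and a leaf as a plain label, which keeps the classification loop short.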

Data Sets and Parameter Setting.
The data sets in the experiments include UNIT [24] and KDD CUP 99 [25], both of which are commonly adopted in the field of network security. The UNIT data set includes 14 million records of network attack flows. The collected instances are divided into three groups, which are malicious traffic (attack), side-effect traffic (S-effect), and unknown traffic (normal). Because the UNIT data set is too large to be handled and lacks normal network traffic, Winter et al. [26] sampled part of the instances according to the distribution of the original data set and supplemented 1904 instances of normal network traffic. In this paper, we adopt Winter et al.'s data set. KDD CUP 99 is a famous data set for intrusion detection, which has 5 million instances of network traffic. There are four different types of network attack in KDD CUP 99, which are labeled as Probing, DoS, U2R, and R2L, respectively. We sampled 10 percent of the instances of the original data set and obtained a training data set with 494021 instances and a testing data set with 311029 instances. The details of the UNIT and KDD CUP 99 data sets are given in Tables 1 and 2, respectively. Additionally, both data sets were preprocessed, including the transformation of nominal values to integer values and z-normalization.
The experiments were performed on a PC with an Intel Core i7-8700 3.2 GHz CPU and 32 GB RAM. The proposed algorithm has five parameters, and we performed cross-validation to decide the parameter settings. The number of trees $p$ in a forest is set to 500, the sampling coefficient $\kappa$ is set to 1.2, and the two parameters for self-similarity detection, i.e., $\sigma$ and $\lambda$, are set to 2 and 5, respectively.

Experimental Results and Analysis.
To evaluate the effectiveness of the RSFID algorithm, we choose four algorithms for comparison in the experiments. The first is a classical algorithm, named 1NN+DTW, which employs a one-nearest-neighbor classifier and dynamic time warping. Wang et al. [27] showed that 1NN+DTW is a classic algorithm for time series classification that is hard to beat. Apart from 1NN+DTW, the other three algorithms are all based on the shapelet technique, including Naïve Shapelet (NS), the Shapelet Transform-based CART (ST-CART) algorithm, and the Shapelet Transform-based SVM (ST-SVM).
Additionally, we employed three metrics to evaluate the effectiveness, which are recall, precision, and f-score. There are four possible results when predicting a new instance, i.e., true positive (TP), true negative (TN), false positive (FP), and false negative (FN). TP and TN refer to the correct prediction of normal behavior and attack behavior. FP and FN refer to the incorrect prediction. Then, the formulas of the three metrics are given below.
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
$$F\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
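The three metrics can be computed per class label, which matches the way the results are reported. The sketch below uses the standard definitions; the function name `per_class_scores` and the convention of returning zero for an undefined ratio are assumptions.

```python
from typing import Sequence, Tuple


def per_class_scores(y_true: Sequence[str], y_pred: Sequence[str],
                     cls: str) -> Tuple[float, float, float]:
    """Return (precision, recall, f-score), treating cls as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

When a rare class receives many wrongly assigned instances, the false positive count dominates the denominator and the precision of that class collapses even if its recall stays high.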
The precision and recall values of the five algorithms on the two data sets are given in Tables 3 and 4, respectively. In each table, the experimental results are listed according to the class label. Table 3 shows that the precision values of all five algorithms on class "normal" are very low, ranging only from 2% to 5.7%. The reason is the class imbalance of the UNIT data set. The number of instances in class "normal" is only several hundred, but tens of thousands of "attack" instances are assigned the label "normal" in the prediction. This dramatically decreases the precision value. The same phenomenon appears in Table 4, e.g., in the precision and recall values on class "U2R." For a more intuitive comparison, the f-scores obtained by the five algorithms on the two data sets are shown in Figures 2 and 3. From the two tables and the two figures, it can be seen that the precision and recall values obtained by the RSFID algorithm on the two data sets are clearly better than those of the other four algorithms. Moreover, the RSFID algorithm usually performs better on the recall metric, which is very important for an IDS. Furthermore, we compare the results of the RSFID algorithm with those of the state-of-the-art algorithm (named DSSVM) reported in [28] and find that RSFID is superior to the DSSVM algorithm. This demonstrates the effectiveness of the proposed algorithm.
However, we cannot ignore that the precision and recall values obtained by the RSFID algorithm on classes "U2R" and "R2L" are not satisfactory. The reason is that the testing instances of the two classes include many "new patterns" that do not appear in the training data set. The shapelet-based technique is essentially a pattern-based method; therefore, it is not easy for RSFID to deal with this problem.

Conclusions
In this paper, we propose a novel algorithm, named RSFID, to handle the intrusion detection problem. The algorithm is based on a new primitive, the "shapelet," in the field of TSC. The advantages of this technique include not only better classification ability than traditional TSC techniques but also good interpretability, which is not provided by deep learning methods. The RSFID algorithm is evaluated on two famous data sets for intrusion detection, i.e., UNIT and KDD CUP 99, and it is compared with four classical algorithms in the field of TSC in which three are based