TR-IDS: Anomaly-Based Intrusion Detection through Text-Convolutional Neural Network and Random Forest

As we head towards the IoT (Internet of Things) era, protecting network infrastructures and information security has become increasingly crucial. In recent years, Anomaly-Based Network Intrusion Detection Systems (ANIDSs) have gained extensive attention for their capability of detecting novel attacks. However, most ANIDSs focus on packet header information and omit the valuable information in payloads, despite the fact that payload-based attacks have become ubiquitous. In this paper, we propose a novel intrusion detection system named TR-IDS, which takes advantage of both statistical features and payload features. Word embedding and a text-convolutional neural network (Text-CNN) are applied to extract effective information from payloads. After that, the random forest algorithm is applied to the combination of statistical features and payload features. Extensive experimental evaluations demonstrate the effectiveness of the proposed methods.


Introduction
Due to advances in the Internet, cyberspace security has gained increasing attention [1,2], which has encouraged many researchers to design effective defense systems called Network Intrusion Detection Systems (NIDSs). Existing intrusion detection techniques fall into two main categories: misuse-based detection (also known as signature-based or knowledge-based detection) and anomaly-based detection (also known as behavior-based detection). Misuse-based detection systems extract discriminative features and patterns from known attacks and hand-code them into the system as rules, which are then compared with the traffic to detect attacks. They are effective and efficient at detecting known types of attacks and have a very low false alarm rate. Therefore, misuse-based detection systems are currently the mainstream NIDSs, and some sophisticated ones have been deployed in real scenarios, e.g., Snort [3]. However, misuse-based detection systems require frequent updates of their rules and signatures, and they are incapable of identifying novel or unknown attacks. In recent years, anomaly-based network intrusion detection systems (ANIDSs) have attracted much attention for their capability of detecting zero-day attacks. They adopt statistical methods, machine learning algorithms, or data mining algorithms to model the pattern of normal network behavior and detect anomalies as deviations from normal behavior.
Various algorithms have been proposed to model network behavior and detect anomalous flows, including artificial neural networks [4], fuzzy association rules [5], Bayesian networks [6], clustering [7], decision trees [8], ensemble learning [9], support vector machines [10], and so on [11,12]. However, these methods mostly exploit the information in packet headers or the statistical information of entire flows and fail to detect malicious content (e.g., SQL injection, cross-site scripting, and shellcode) in packet payloads. Classic processing methods for payloads can be divided into two categories. The first category requires prior knowledge of protocol formats and therefore cannot be applied to unknown protocols. The second category does not require expert domain knowledge; instead, these methods calculate statistical features or conduct N-gram analysis, but they usually suffer from a high false positive rate. In recent years, deep learning algorithms [13,14] have achieved remarkable results in many fields, e.g., Computer Vision (CV) [15], Natural Language Processing (NLP) [16], and Automatic Speech Recognition (ASR) [17]. They have proven capable of extracting salient features from unstructured data. Since the payloads of network traffic are sequence data similar to texts, we can apply modern deep learning techniques from NLP to the feature extraction of network payloads.
In this paper, we adopt word embedding [18] and a text-convolutional neural network (Text-CNN) [19] to extract features from the payloads in network traffic. We combine the statistical features with the payload features and then run random forest [20] for the final classification. The rest of this paper is organized as follows. In Section 2, we describe the related work. In Section 3, we describe the design and implementation of our methods. In Section 4, we present extensive experimental results that show the effectiveness of our methods. Finally, in Section 5, we conclude this paper.

Payload-Based Intrusion Detection.
These days, payload-based attacks have become more prevalent, while older attacks such as network Probe, DoS, DDoS, and network worm attacks have become less common. Many attacks place exploit code inside the payload of network packets; thus, header-based approaches cannot detect them. In response, many payload-based detection techniques have been proposed. The first class of methods creates protocol parsers or decoders for different kinds of applications. Snort [3] includes a number of protocol parsers for protocol anomaly detection. For example, the http_inspect preprocessor parses and normalizes HTTP fields, making it possible to detect oversized header fields, non-RFC characters, or Unicode encoding. ALAD [21] builds models of allowed keywords in text-based application protocols such as FTP, HTTP, and SMTP. The anomaly score is increased when a rare keyword is used for a particular service. These parser-based methods have a high detection rate for known protocols. However, they must be manually specified by experts and cannot deal with unknown protocols. The second class applies NLP techniques, e.g., N-gram analysis [22], to network traffic payloads. PAYL [23] uses 1-grams and unsupervised learning to build a byte frequency distribution model of payloads. McPAD [24] creates $2_{\nu}$-grams and applies a sliding window to cover all pairs of bytes that are $\nu$ positions apart in each network traffic payload. These methods require no expert domain knowledge and can detect zero-day worms, because payloads with exploit code generally have an unusual byte frequency distribution. Their drawbacks are an unsatisfactory detection rate and relatively high computational overhead compared with parser-based methods.
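To make the 1-gram idea concrete, the following minimal sketch builds a byte frequency distribution for a payload and compares two payloads. It is only an illustration: the L1 distance used as an anomaly score is a simplification chosen here and is not the distance used by PAYL, and both example payloads are made up.

```python
from collections import Counter


def byte_frequency(payload):
    """Return the relative frequency of each of the 256 possible byte values."""
    counts = Counter(payload)          # iterating over bytes yields integers 0..255
    total = len(payload) or 1          # avoid division by zero on empty payloads
    return [counts.get(b, 0) / total for b in range(256)]


def l1_distance(p, q):
    """Simplified anomaly score: L1 distance between two byte distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical "normal" profile built from a benign-looking HTTP request.
normal = byte_frequency(b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n")
# A payload dominated by 0x90 bytes (a NOP-sled-like pattern) deviates strongly.
suspicious = byte_frequency(b"\x90" * 40 + b"\xcc\xcc")

score_same = l1_distance(normal, normal)
score_diff = l1_distance(normal, suspicious)
print(score_same, score_diff)
```

Because the two example payloads share no byte values, their distance reaches the maximum possible value for two probability distributions, which is the kind of deviation a frequency-based detector relies on.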

Deep Learning for Intrusion Detection.
Many deep learning techniques have been used for developing ANIDSs. Ma et al. [25] evaluated deep neural networks on the KDDCUP99 dataset, and Niyaz et al. [26] applied deep belief networks to intrusion detection on the NSL-KDD dataset. However, they only tested deep learning techniques on manually designed features, so the powerful ability of these techniques to learn features from raw data was not exploited. Recently, several attempts to learn effective features from raw packets have emerged. Yu et al. [27,28] and Mahmood et al. [29] used autoencoders to detect anomalous traffic. Wang et al. [30] applied a CNN to learn the spatial features of network traffic and used an image classification method to classify malware traffic, despite the fact that network payloads are more similar to documents. Torres et al. [31] transformed network traffic features into character sequences and used an RNN to learn the temporal features, while Wang et al. [32] combined CNN and LSTM to learn both spatial and temporal features. These methods are insightful yet have evident weaknesses. Firstly, some time-based traffic features such as flow duration, packet frequency, and average packet length cannot be learned automatically by either CNN or LSTM. Besides, they ignore the semantic relations between bytes, which are a critical factor in NLP. In this paper, we remedy both problems by taking advantage of both expert domain knowledge and deep neural networks. The statistical features are manually designed, and the payload features are extracted by deep learning techniques from NLP. To the best of our knowledge, no previous study has combined the advantages of both.

TR-IDS
TR-IDS aims to automatically extract features from the payloads of raw network packets to improve the accuracy of intrusion detection. Since random forest has superior performance on structured data while convolutional networks are well suited to unstructured data [33], we combine the advantages of both. TR-IDS performs classification on bidirectional network flows (Biflows), which contain more temporal information than packet-level datasets. The implementation scheme is illustrated in Figure 1, and the stages of TR-IDS are described as follows:
(i) Statistical features extraction: we extract some critical statistical features from each network flow. These features include fields in packet headers and statistical attributes of the entire flow.
(ii) Payload features extraction: we map each byte in payloads into a word vector using word embedding and then extract salient features of payloads using text-convolutional neural network.
(iii) Classification through random forest: the statistical features and payload features are concatenated together, and then, the random forest algorithm is applied to classify the generated new dataset.

Statistical Features Extraction.
In this section, we manually extract some discriminative features from the bidirectional network flows, where the first packet in each flow determines the forward (source to destination) and backward (destination to source) directions. We extract 44 statistical features from each flow, and most of them are calculated separately in the forward and backward directions. More specifically, we first extract basic features such as protocol, source port, and destination port; IP addresses are not included because they vary across networks and thus do not generalize as characteristics of attacks. Then, statistical attributes such as the number of packets, the number of bytes, and TCP flag counts are calculated. After that, time-based statistical measures are extracted, such as the transmission speed and the time interval between two packets. These features are vital signatures for detecting attacks such as Probe, DoS, DDoS, Scan, U2R, and U2L, which have distinctive traffic patterns. We list all these features in Table 1.

Payload Features Extraction.
In this section, we introduce our deep-learning-based method of extracting features from network payloads. The word embedding technique is used to transform the one-hot representation of each byte into a continuous vector representation. Then, a text-convolutional network is used to extract the most salient features from each payload.
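The flow-level statistics of the Statistical Features Extraction step can be illustrated with a small sketch. It computes only a handful of per-direction counts and timing measures, not the full set of 44 features; the packet tuple layout, the feature names, and the sample values are assumptions made for this example.

```python
from statistics import mean

# Each packet: (timestamp_s, src_ip, src_port, dst_ip, dst_port, payload_bytes).
# The values below are invented for illustration.
packets = [
    (0.00, "10.0.0.1", 50000, "10.0.0.2", 80, 0),
    (0.05, "10.0.0.2", 80, "10.0.0.1", 50000, 0),
    (0.10, "10.0.0.1", 50000, "10.0.0.2", 80, 120),
    (0.30, "10.0.0.2", 80, "10.0.0.1", 50000, 900),
]


def flow_features(pkts):
    """Compute a few per-direction statistics for one bidirectional flow."""
    pkts = sorted(pkts, key=lambda p: p[0])
    first = pkts[0]
    fwd_key = (first[1], first[2])     # the first packet defines the forward direction
    fwd = [p for p in pkts if (p[1], p[2]) == fwd_key]
    bwd = [p for p in pkts if (p[1], p[2]) != fwd_key]
    duration = pkts[-1][0] - pkts[0][0]
    gaps = [b[0] - a[0] for a, b in zip(pkts, pkts[1:])]
    total_bytes = sum(p[5] for p in pkts)
    return {
        "src_port": first[2],
        "dst_port": first[4],
        "fwd_packets": len(fwd),
        "bwd_packets": len(bwd),
        "fwd_bytes": sum(p[5] for p in fwd),
        "bwd_bytes": sum(p[5] for p in bwd),
        "duration": duration,
        "mean_interarrival": mean(gaps) if gaps else 0.0,
        "bytes_per_second": total_bytes / duration if duration > 0 else 0.0,
    }


features = flow_features(packets)
print(features)
```

Note that the IP addresses are used only to split the flow by direction and never appear in the output, in line with the decision to exclude them from the feature set.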
Byte-Level Word Embedding. An effective representation of each byte in the payloads is critical. Yu et al. [27] took the decimal value of each byte as a feature. This method is not suitable because it imposes an order relation on the bytes. Wang et al. [32] applied one-hot encoding to each byte and treated each sample as a picture; a conventional CNN was then applied to extract features. However, this method neglects the semantic and syntactic similarity of different bytes and, worse, significantly increases the computational complexity. To remedy this problem, we use word embedding to map each byte into a low-dimensional vector, which preserves the semantic information and costs much less computation. Currently, the best-known word embedding method is word2vec [34], which is convenient to implement and has superior performance. Two popular implementations of word2vec are CBoW and Skip-Gram [35]. Since Skip-Gram generally performs better [35], we apply Skip-Gram to our byte-embedding task in this paper.
The task of Skip-Gram is to predict the surrounding words given one center word. The trained model is not used for any new task; instead, we only need the projection matrix, which contains the vector representation of each word. We define two parameter matrices, $V \in \mathbb{R}^{d \times |W|}$ and $U \in \mathbb{R}^{|W| \times d}$, where $d$ is the embedding dimension, which can be set to an arbitrary size. Note that $W$ is the vocabulary set and $|W|$ is the size of $W$. Each word in $W$ is represented as a $|W| \times 1$ one-hot vector. The architecture of Skip-Gram is illustrated in Figure 2, and Skip-Gram works in the following 4 steps.
Step 1. Generate the one-hot input vector $x \in \mathbb{R}^{|W|}$ of the center word.
Step 2. Get the embedded vector of the center word $v_c = Vx \in \mathbb{R}^{d}$.
Step 3. Generate the score vector $z = U v_c \in \mathbb{R}^{|W|}$ and turn it into a probability vector $\hat{y} = \mathrm{softmax}(z)$.
Step 4. Match the generated probability vectors with the true probabilities, which are the one-hot vectors of the actual outputs $y_{c-m}, \ldots, y_{c-1}, y_{c+1}, \ldots, y_{c+m}$, where $m$ is the context window size. The divergence $H$ between the generated probabilities and the true probabilities is the loss function for optimizing the parameters.
For the byte-level word embedding in our algorithm, each byte is considered a word and is represented as a one-hot vector. We first extract the payloads of all packets in each flow and concatenate them into a flow payload. Each flow payload is analogous to a sentence, and together the flow payloads form a text corpus, i.e., a training dataset. The embedding size can be set to a relatively small value (e.g., 10). After training Skip-Gram, we obtain the embedded representation of each byte.
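The byte-level Skip-Gram setup can be sketched in plain Python as follows. The sketch generates (center, context) training pairs from a flow payload and runs one forward pass with randomly initialized matrices; the window size, the initialization range, and the helper names are assumptions for illustration, and no training loop is included. The input embeddings are stored one row per byte value, so multiplying by a one-hot vector reduces to a row lookup.

```python
import math
import random


def skipgram_pairs(payload, window=2):
    """Yield (center_byte, context_byte) training pairs from one flow payload."""
    for c, center in enumerate(payload):
        lo, hi = max(0, c - window), min(len(payload), c + window + 1)
        for j in range(lo, hi):
            if j != c:
                yield center, payload[j]


def softmax(scores):
    top = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


VOCAB, DIM = 256, 10                            # 256 byte values, embedding size 10
rng = random.Random(0)
# Input embeddings (one row per byte) and output matrix, randomly initialized.
V = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(VOCAB)]
U = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(VOCAB)]


def context_distribution(center):
    """Forward pass: embed the center byte, score every byte, apply softmax."""
    v_c = V[center]
    scores = [sum(u * v for u, v in zip(row, v_c)) for row in U]
    return softmax(scores)


pairs = list(skipgram_pairs(b"GET /", window=1))
probs = context_distribution(pairs[0][0])
loss = -math.log(probs[pairs[0][1]])            # cross-entropy for one training pair
print(len(pairs), round(loss, 3))
```

After training, only the rows of `V` would be kept as the byte embeddings; the output matrix `U` is discarded.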
Extract Payload Features through Text-CNN. We apply Text-CNN to extract features from the embedded payloads. Text-CNN is a slight variant of the CNN architecture and achieves excellent results on many benchmarks of sentence classification (or document classification) [19]. Text-CNN adopts a one-dimensional convolution operation to extract features from the embedded sentences. In Text-CNN, the filters have a fixed width equal to the embedding size but varying heights within one layer, whereas in conventional CNNs the filters in one layer usually have the same size. The architecture of Text-CNN is illustrated in Figure 3.
Let $x_i \in \mathbb{R}^{d}$ be the $d$-dimensional word vector corresponding to the embedded representation of the $i$th word in a sentence (in our task, each byte corresponds to a word; thus, each payload is considered a sentence). A sentence of length $n$ (padded if its length is smaller than $n$) is denoted as
$$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n.$$
Note that $\oplus$ is the concatenation operator. When executing a convolution operation, a convolution filter $w \in \mathbb{R}^{h \times d}$ is applied to a window of $h$ words in the sentence to generate a new feature. Specifically, a feature $c_i$ is calculated as follows:
$$c_i = f(w \cdot x_{i:i+h-1} + b),$$
where $x_{i:i+h-1}$ is a window of words, $b$ is a bias, and $f$ is a nonlinear function. This filter is applied to each possible window $[x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}]$ to generate a feature map $c = [c_1, c_2, \ldots, c_{n-h+1}]$ with $c \in \mathbb{R}^{n-h+1}$. Then, a max-pooling operation is applied to the feature map to obtain the maximum value $c_{\max} = \max(c)$, which is the most important feature of each feature map.
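A small worked example may help to check the shapes involved in the convolution and max-pooling described above. The toy sentence, the filter weights, and the choice of ReLU as the nonlinear function are assumptions for illustration; the source only states that a nonlinear function is used.

```python
def conv_feature_map(sentence, w, b=0.0):
    """Slide an h x d filter over an n x d sentence and return n - h + 1 features."""
    h = len(w)
    n = len(sentence)
    feature_map = []
    for i in range(n - h + 1):
        window = sentence[i:i + h]
        s = sum(w[r][k] * window[r][k] for r in range(h) for k in range(len(w[r])))
        feature_map.append(max(0.0, s + b))     # ReLU chosen as the nonlinearity
    return feature_map


# Toy embedded payload: n = 5 bytes, each a d = 2 dimensional vector.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 1.0]]
# One filter of height h = 2 with hypothetical weights.
w = [[1.0, -1.0], [0.5, 0.5]]

c = conv_feature_map(sentence, w, b=0.0)
c_max = max(c)                                  # max-over-time pooling: one value per filter
print(c, c_max)                                 # [1.5, 0.0, 0.0, 1.5] 1.5
```

The feature map has 5 - 2 + 1 = 4 entries, and pooling reduces it to a single value, so the number of extracted values depends on the number of filters and not on the payload length.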
The process described above extracts one feature with one filter; we use multiple filters with varying window sizes to extract multiple features. Note that, in the original Text-CNN, the features are concatenated and passed directly to a fully connected softmax layer that outputs the probabilities of the different classes. In our implementation, however, we insert a feature layer between the concatenation layer and the output layer. After the supervised training of the model, we extract the features of each payload from this layer.
Classification through Random Forest.
Random Forest (RF) [20] is an ensemble algorithm consisting of a collection of tree-structured classifiers. Each tree is constructed from a different bootstrap sample of the original data using a decision tree algorithm, and each node of a tree considers only a small subset of the features for its split. The learning samples not selected by the bootstrap are used to evaluate the tree; this is called out-of-bag (OOB) evaluation and provides an unbiased estimate of the generalization error. After the forest has been constructed, a new sample is classified by feeding it into each tree in the forest; each tree casts a unit vote for the class it decides on. The forest chooses the class with the most votes for the input sample.
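The bootstrap sampling and majority voting described above can be sketched as follows. The three threshold rules stand in for trained decision trees and are purely hypothetical, as are the feature names; a real forest would learn its trees from the bootstrap samples.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Sample n indices with replacement; the indices never drawn are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


def forest_predict(trees, sample):
    """Each tree casts one vote; the class with the most votes wins."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained trees: simple threshold rules.
trees = [
    lambda x: "attack" if x["bwd_bytes"] > 500 else "normal",
    lambda x: "attack" if x["fwd_packets"] > 100 else "normal",
    lambda x: "attack" if x["payload_feature_0"] > 0.8 else "normal",
]

sample = {"bwd_bytes": 900, "fwd_packets": 3, "payload_feature_0": 0.95}
label = forest_predict(trees, sample)           # two of three trees vote "attack"

in_bag, oob = bootstrap_indices(10, random.Random(0))
print(label, in_bag, oob)
```

The out-of-bag indices are exactly the samples a given tree never saw during construction, which is why they can serve as a held-out set for that tree.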
RF has the following advantages:
(i) It has excellent performance in accuracy on structured data.
(ii) It is robust against noise and does not over-fit in most cases.
(iii) It is computationally efficient and can run on large-scale datasets with high dimensions.
(iv) It can handle unbalanced datasets.
(v) It can output the importance weight of each feature.
These merits of RF encourage us to choose it as our final classifier. In this step, we concatenate the statistical features and the payload features to generate the final representation of the network flows. Then, this new dataset is fed into the RF algorithm for training and validation.

Performance Evaluation
Datasets and Preprocessing.
We evaluate the performance of our method on the ISCX2012 dataset [36]. It is an intrusion detection dataset generated by the Information Security Center of Excellence (ISCX) of the University of New Brunswick (UNB) in Canada in 2012. This dataset consists of 7 days of network activity, including normal traffic and four types of attack traffic, i.e., Infiltrating, HttpDoS, DDoS, and BruteForce SSH. Although the KDDCUP99 dataset [37] is widely used to evaluate IDS techniques, it is outdated and does not reflect the behavior of modern attacks. In contrast, ISCX2012 is much more recent and closer to reality.
This dataset consists of seven raw pcap files and a list of label files. The label files record the basic information of each network flow, e.g., label, IP address, port, start time, and stop time. We have to split the network flows in the pcap files and label them using the records in the label files. Note that the label files contain a few problems. For example, the packet numbers recorded in them are not identical to the actual packet numbers in the pcap files. Besides, the time records in them do not exactly correspond to the timestamps in the pcap files. Therefore, we have to remove all incorrect and ambiguous records. We chose most of the attack samples and randomly chose a small subset of the legitimate ones to generate a relatively balanced dataset. Then, we divided the preprocessed dataset into a training set and a testing set using a ratio of 70% to 30%. Our preprocessing results are shown in Table 2.
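The balancing and splitting step can be sketched as below. This is only an illustration under stated assumptions: the synthetic flow records, the number of normal flows kept, the random seed, and the decision to keep every attack flow are all invented for the example and are not the actual preprocessing settings.

```python
import random


def balance_and_split(flows, normal_keep, train_ratio=0.7, seed=0):
    """Keep the attack flows, subsample the normal flows, then split train/test."""
    rng = random.Random(seed)
    attacks = [f for f in flows if f["label"] != "Normal"]
    normals = [f for f in flows if f["label"] == "Normal"]
    kept = rng.sample(normals, min(normal_keep, len(normals)))
    data = attacks + kept
    rng.shuffle(data)
    cut = round(len(data) * train_ratio)
    return data[:cut], data[cut:]


# Synthetic flows: 1000 normal and 50 attack records (made-up proportions).
flows = [{"id": i, "label": "Normal"} for i in range(1000)]
flows += [{"id": 1000 + i, "label": "DDoS"} for i in range(50)]

train, test = balance_and_split(flows, normal_keep=50)
print(len(train), len(test))
```

Fixing the seed keeps the subsample and the split reproducible, which matters when several classifiers are compared on the same data.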

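Evaluation Metrics.
Accuracy (ACC), Detection Rate (DR), and False Alarm Rate (FAR) for a class A are formulated as follows:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{DR} = \frac{TP}{TP + FN}, \qquad \mathrm{FAR} = \frac{FP}{FP + TN},$$
where TP is the number of instances correctly classified as A, TN is the number of instances correctly classified as Not-A, FP is the number of instances incorrectly classified as A, and FN is the number of instances incorrectly classified as Not-A.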
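As an illustration of how these counts yield the three metrics, the sketch below derives TP, TN, FP, and FN for one class from a confusion matrix and computes ACC, DR, and FAR using the standard definitions (ACC as the fraction of correct decisions, DR as TP / (TP + FN), FAR as FP / (FP + TN)). The confusion matrix values are made up and are not results from this paper.

```python
def class_metrics(confusion, labels, target):
    """Compute ACC, DR, and FAR for one class from a confusion matrix.

    confusion[i][j] counts samples of true class labels[i] predicted as labels[j].
    """
    t = labels.index(target)
    total = sum(sum(row) for row in confusion)
    tp = confusion[t][t]
    fn = sum(confusion[t]) - tp                 # target samples predicted as another class
    fp = sum(row[t] for row in confusion) - tp  # other samples predicted as the target
    tn = total - tp - fn - fp
    return {
        "ACC": (tp + tn) / total,
        "DR": tp / (tp + fn) if tp + fn else 0.0,
        "FAR": fp / (fp + tn) if fp + tn else 0.0,
    }


# Made-up counts for illustration only.
labels = ["Normal", "DDoS"]
confusion = [
    [90, 10],   # true Normal
    [5, 95],    # true DDoS
]
m = class_metrics(confusion, labels, "DDoS")
print(m)
```

With these counts, 95 of the 100 DDoS samples are detected and 10 of the 100 normal samples raise a false alarm.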

Experimental Results.
In this section, we show the experimental results of our methods. We set the number of extracted payload features to 50 and the truncated length of bytes in each payload to 1000. Table 3 shows the result of the 5-class classification on ISCX2012, and Table 4 shows the confusion matrix of the classification. Our method identifies nearly all Infiltration, BFSSH, and HttpDoS attacks but confuses a few DDoS attacks with normal traffic. The reason is that some network flows in a DDoS attack are very similar to normal traffic; thus, it is unrealistic to identify every flow in a DDoS attack. Since the ISCX2012 dataset was published much later than DARPA1998, far fewer corresponding experimental results are available. Although some existing methods have been evaluated on it, they use different preprocessing procedures and even different proportions of the dataset. Thus, it would be unfair to compare our method with these methods directly. Instead, in order to demonstrate the effectiveness of our method, we implemented five other methods. The first four are support vector machine (SVM), fully connected network (NN), convolutional neural network (CNN), and random forest (RF-1), whose inputs are the statistical features combined with 1000 raw bytes. The fifth runs random forest on the statistical features alone (RF-2). Table 5 compares TR-IDS with the five methods. Note that the performance of RF-1 is inferior to that of RF-2, which means that raw-byte features may even degrade intrusion detection performance. The superior performance of TR-IDS demonstrates the effectiveness of the proposed feature extraction techniques.

Sensitivity Analysis.
In this section, we show the results of sensitivity tests on two important hyperparameters, i.e., the truncated length of payloads for feature extraction and the number of features extracted from payloads. We first fixed the truncated length of payloads at 1000 and varied the number of extracted features from 5 to 100. After that, we fixed the number of extracted features and varied the truncated length from 500 to 3000. As we can see in Table 6, our method is not sensitive to the two hyperparameters. The best number of features lies between the two extremes of 5 and 100, at 50. The reason is that too few features cannot contain enough information about the entire payload, while too many features bring noise to the final classification algorithm.
For the truncated length of payloads, we find that a larger length contributes to better performance, because a small length may lose information contained in the payloads. Nevertheless, a larger length also leads to a higher computational cost.

Conclusion
In this paper, we propose a novel intrusion detection framework, TR-IDS, which uses both manually designed features and payload features to improve performance. It adopts two modern NLP techniques, i.e., word embedding and Text-CNN, to extract salient features from payloads. The word embedding technique retains the semantic relations between bytes and reduces the feature dimension, and Text-CNN is then used to extract features from each payload. We also apply the random forest algorithm for the final classification. Finally, extensive experiments show the superior performance of our method.

Figure 1: The general architecture of TR-IDS.

Table 1: Statistical features of the network flow.
Metrics. Three metrics are used to evaluate the performance of TR-IDS: Accuracy (ACC), Detection Rate (DR), and False Alarm Rate (FAR), all of which are frequently used in the evaluation of intrusion detection. ACC evaluates the overall performance of a system, DR evaluates the attack detection rate, and FAR evaluates the misclassification of normal traffic.
The experiments were run on Xeon E5 CPUs with 10 cores and 64 GB of memory. Four Nvidia Tesla K80 GPUs are used to accelerate the training of the CNN. In all our experiments, the Text-CNN contains convolution filters of three different sizes, i.e., 3, 4, and 5, with 100 channels for each size. The stride is 1 and no padding is used. The mini-batch size is 100, and the optimizer is Adam with default parameters. The parameters of RF are left at their defaults, except the number of trees, which is set to 200.
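The shapes implied by these settings can be checked with a few lines of arithmetic. The sketch below assumes the truncated payload length of 1000 bytes used in the experiments and the usual output-length formula for a one-dimensional convolution; the constant and function names are chosen for this example.

```python
def conv_output_length(n, h, stride=1, padding=0):
    """Output length of a 1-D convolution for input length n and filter height h."""
    return (n + 2 * padding - h) // stride + 1


PAYLOAD_LENGTH = 1000          # truncated payload length used in the experiments
FILTER_HEIGHTS = (3, 4, 5)     # filter sizes from the settings above
CHANNELS_PER_SIZE = 100        # channels for each filter size

feature_map_lengths = {h: conv_output_length(PAYLOAD_LENGTH, h) for h in FILTER_HEIGHTS}
pooled_width = len(FILTER_HEIGHTS) * CHANNELS_PER_SIZE   # one maximum per channel

print(feature_map_lengths)     # {3: 998, 4: 997, 5: 996}
print(pooled_width)            # 300
```

Under these assumptions, max-pooling yields a 300-dimensional concatenated vector per payload, which the feature layer then reduces to the chosen number of payload features.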

Table 2: Preprocessing results of the ISCX2012 dataset.

Table 4: Confusion matrix of the 5-class classification task.

Table 6: Influence of the payload length and feature number (%).