When traditional machine learning methods are applied to network intrusion detection, they rely on expert knowledge to extract feature vectors in advance, which limits flexibility and versatility. Recently, deep learning methods have shown superior performance compared with traditional machine learning methods. Deep learning methods can learn from the raw data directly, but they face high computational cost. To solve this problem, a preprocessing method based on a multipacket input unit and compression is proposed, which takes

At present, the security of cyberspace has attracted wide attention from all sectors of society, and ensuring the security of network information and network equipment is a central task of network security. Anomaly-based Intrusion Detection Systems (IDS) are the main research direction in the field of intrusion detection. In recent years, deep learning has been widely used in many fields, including image processing and network traffic analysis [

Many network intrusion detection methods are based on deep learning models, which are considered an effective approach to network intrusion detection [

In addition, the detection result lags when the raw traffic data are used directly, because an input flow often contains many data packets or lasts a long time (for example, many network flows last tens of seconds or even minutes). Therefore, detection can only be performed after the whole flow has completed. In this paper, a multipacket-based data preprocessing method is proposed to solve this detection-lag problem. After fully considering the respective advantages and disadvantages of the packet and the flow as the data unit,

Some researchers believe that, for intrusion detection, it is equally important to improve the computational efficiency of the deep learning model (including training time and the number of trainable model parameters) [

Our previous work tried to directly use the raw traffic for analysis [

The main contributions of this paper are as follows:

For the first time, a network traffic data preprocessing method based on Naive Compression (NC) and multipacket units is proposed, which not only overcomes the disadvantages of manual feature extraction but also ensures computational efficiency.

A deep-learning-based network intrusion detection model, FCNN, is designed. To further improve the computational efficiency and detection accuracy of the traditional convolutional neural network, the weights of some convolution layers are assigned directly by Gabor filters. In addition, the network structure, activation function, and optimizer of the model are optimized; the influence of the Gabor filter parameters on accuracy as the sample dimension changes (the sample dimension is related to the compression ratio) is studied; and the optimal Gabor filter direction parameter

This paper compares the effects of several compression preprocessing methods, including the proposed NC method, Principal Component Analysis (PCA), Locally Linear Embedding (LLE), and the deep learning Autoencoder (AE), on the detection accuracy and training time of FCNN and other deep learning intrusion detection models. In this way, the optimal compression methods and related parameters for intrusion detection based on raw traffic under different conditions are found. The experimental results verify the advantages of the proposed NC compression method.

The rest of this paper is organized as follows. Section

Some deep learning methods have shown good performance in traffic classification tasks. Kim et al. [

In order to solve this problem, Wang et al. [

The comparison of the related works is presented in Table

Related work summary.

| Categories | DL model used in related works | Literature | Advantages | Disadvantages |
|---|---|---|---|---|
| DL methods based on feature extraction | LSTM | [ | High computational efficiency | More complex preprocessing operations; lack of generality |
| | RNN | [ | | |
| | CNN | [ | | |
| | DNN | [ | | |
| | Autoencoder | [ | | |
| DL methods based on raw traffic | CNN-LSTM | [ | Strong versatility; high accuracy | Low computational efficiency |
| | CNN | [ | | |
| | Autoencoder | [ | | |

Gabor Convolutional Network (GCN) is a deep convolutional neural network that uses Gabor Filters (GoF). A GCN is created by modulating the learned convolution filters with Gabor filter banks to generate enhanced feature maps [

Many methods combine Gabor filters with convolutional networks. Most of them target image processing problems and cannot be applied directly to intrusion detection. This paper mainly explores a Gabor network structure that performs well for intrusion detection.

The data compression method proposed in this paper is universal when dealing with raw traffic, and the improved and optimized GCN deep learning method can balance computational efficiency and detection accuracy in network intrusion detection.

The intrusion detection method proposed in this paper includes data preprocessing, model training, and model testing, and the flowchart is shown in Figure

Intrusion detection flowchart.

Marín et al. [

The multipacket-based method combines the fine-grained detection of packets as the data unit with the coarse-grained detection of flows as the data unit. In data preprocessing, a timeout T_{time_out} is set, and the data packets received during this period (less

Multipacket preprocessing when

In order to overcome the low computational efficiency caused by using raw traffic data as input, the data are compressed before training. This paper proposes a simple and effective compression method, NC, which divides the bytes in a multipacket unit evenly into several parts, each containing the same number of bytes, and replaces each part with the average of the ASCII code values of its bytes. Even for two different byte sequences, the probability that their compressed values coincide is very small, and it is even less likely that the compressed strings of two different packets or two different input units are exactly the same.
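A minimal sketch of the NC idea described above (the exact chunking and padding rules are assumptions for illustration, not the paper's precise specification):

```python
def naive_compress(data: bytes, n_parts: int) -> list:
    """Naive Compression (NC) sketch: split a multipacket byte unit into
    n_parts equal-sized chunks and replace each chunk with the mean of
    its byte (ASCII code) values."""
    if len(data) % n_parts != 0:
        # pad with zero bytes so the unit divides evenly (an assumption)
        data = data + b"\x00" * (n_parts - len(data) % n_parts)
    step = len(data) // n_parts
    return [sum(data[i:i + step]) / step for i in range(0, len(data), step)]

# compress a 16-byte payload into a 4-value vector
compressed = naive_compress(b"GET / HTTP/1.1\r\n", 4)
```

Because each output value is an average over many bytes, two different inputs rarely map to the same compressed vector, which is the property the text relies on.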

Min-max input scaling is used in this paper so that the network does not focus disproportionately on input components with larger value ranges. The min-max scaler of the normalization function is defined as x' = (x − x_min)/(x_max − x_min), which maps each input component into [0, 1].
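A minimal illustration of the generic min-max formula (not necessarily the exact implementation used in the paper):

```python
def min_max_scale(x, lo=0.0, hi=1.0):
    """Min-max scaler: map each value of x linearly into [lo, hi]."""
    x_min, x_max = min(x), max(x)
    if x_max == x_min:          # constant input: map everything to lo
        return [lo for _ in x]
    return [lo + (hi - lo) * (v - x_min) / (x_max - x_min) for v in x]

# scale raw byte values (0..255) into [0, 1]
scaled = min_max_scale([0, 127, 255])
```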

A CNN consists of one or more convolution layers and pooling layers forming a multilayer neural network. The convolution layer shares weights to detect local connections among elements of the previous layer, and the pooling layer then downsamples the convolution output to merge semantically similar elements. In this section, we first design the best CNN model for intrusion detection and then describe in detail how GoF is used to further improve the performance of the CNN.

The reasons why we choose a CNN to train on and predict the data produced by the proposed preprocessing method are as follows. First, because of the shared convolution kernel [

Use

In a convolutional neural network, the output of each neuron is y = f(w * x + b), where w denotes the shared kernel weights, * the convolution operation, b the bias, and f the activation function.

In the pooling layer,
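The convolution and pooling operations described above can be sketched with toy values (a single-channel 1D case, matching the 1D convolutions used in the model; the kernel and input values here are purely illustrative):

```python
import numpy as np

def conv1d_valid(x, w, b, f=lambda z: np.maximum(z, 0.0)):
    """One 1D convolution channel: each output neuron is f(w . window + b),
    with the same (shared) kernel w sliding over the input."""
    k = len(w)
    out = np.array([x[i:i + k] @ w + b for i in range(len(x) - k + 1)])
    return f(out)  # ReLU activation by default

def max_pool1d(x, size=2):
    """Non-overlapping max pooling: keep the largest value in each window."""
    return np.array([x[i:i + size].max() for i in range(0, len(x) - size + 1, size)])

x = np.array([1.0, -2.0, 3.0, 0.5, 2.0, -1.0])
y = conv1d_valid(x, w=np.array([1.0, 0.0, -1.0]), b=0.0)  # edge-like kernel
p = max_pool1d(y, 2)                                      # merge nearby responses
```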

The structure of the CNN model designed in this paper is shown in Table

CNN model.

| Sequence | Description |
|---|---|
| 1 | Convolution layer (48 units, kernel size: 3, activation function: ReLU) |
| 2 | Dropout layer (dropout rate: 0.1) |
| 3 | Convolution layer (48 units, kernel size: 3, activation function: ReLU) |
| 4 | Pooling layer (size: 2) |
| 5 | Convolution layer (128 units, kernel size: 3, activation function: ReLU) |
| 6 | Dropout layer (dropout rate: 0.1) |
| 7 | Convolution layer (128 units, kernel size: 3, activation function: ReLU) |
| 8 | Pooling layer (size: 2) |
| 9 | Flatten layer |
| 10 | Dense layer (128 units, activation function: ReLU) |
| 11 | Dropout layer (dropout rate: 0.1) |
| 12 | Dense layer (1 unit, activation function: sigmoid) |

GoF is a narrow-band band-pass filter with direction and frequency selectivity, which has good locality in both the spatial and the frequency domain [

The two-dimensional Gabor function has the standard form

g(x, y; λ, θ, ψ, σ, γ) = exp(−(x′² + γ²y′²)/(2σ²)) · cos(2πx′/λ + ψ), with x′ = x cos θ + y sin θ and y′ = −x sin θ + y cos θ,

where σ is the standard deviation of the Gaussian envelope. In the formula, each parameter means the following:

Wavelength (λ): the wavelength of the sinusoidal (cosine) factor.

Direction (θ): the orientation of the filter, i.e., the normal to the parallel stripes of the Gabor function.

Phase offset (ψ): the phase offset of the cosine factor.

Aspect ratio (γ): the spatial aspect ratio, which specifies the ellipticity of the support of the Gabor function.
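For illustration, the standard 2D Gabor function with the parameters above can be sampled into a small filter bank (a generic sketch, not the paper's exact custom_gabor initializer):

```python
import numpy as np

def gabor_kernel(size, lam, theta, psi, sigma, gamma):
    """Sample a 2D Gabor filter on a size x size grid.
    lam: wavelength, theta: orientation, psi: phase offset,
    sigma: Gaussian standard deviation, gamma: spatial aspect ratio."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)      # rotated coordinates
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    gauss = np.exp(-(x_t**2 + (gamma * y_t)**2) / (2.0 * sigma**2))
    return gauss * np.cos(2.0 * np.pi * x_t / lam + psi)

# a small bank of 3x3 kernels at four orientations, e.g. to seed conv weights
bank = np.stack([gabor_kernel(3, lam=2.0, theta=t, psi=0.0, sigma=1.0, gamma=0.5)
                 for t in np.linspace(0.0, np.pi, 4)])
```

A bank like this could, in principle, be flattened into the weight tensor of a frozen convolution layer, which is the role GoF plays in FCNN.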

In the FCNN model, GoF is used to assign the weights of some convolution layers directly, and these layers are no longer trained, while the remaining convolution layers are still trained normally. In this paper, the network layers in which GoF is used for weight assignment are analyzed, and the influence of each GoF parameter on the detection accuracy is discussed as the compression parameter step changes. The model and parameters of FCNN are shown in Figure

FCNN network structure diagram.

In order to make the GoF parameters suitable for network anomaly detection and to reduce the loss of detection accuracy, each parameter of the GoF is selected optimally: the wavelength (

Relationship between input sample dimension and

| Sample dimension | |
|---|---|
| <100 | 0 ∼ 0.5 |
| 100 ∼ 500 | 0 ∼ |
| 500 ∼ 1920 | 0 ∼ 2 |

In addition to the proposed simple compression method NC, this paper also evaluates other methods for compressing the raw traffic data. Two traditional dimensionality reduction (compression) methods, PCA and LLE, which are commonly used and have relatively low computational complexity, and the deep learning method autoencoder are considered. Among them, PCA is an unsupervised learning method that requires no parameters to be set or tuned by experience. Compared with traditional dimensionality-reduction methods such as PCA and LDA, which focus on sample variance, LLE focuses on preserving the local linear structure of the samples during dimensionality reduction.

PCA is a statistical procedure that uses an orthogonal transformation to project a set of possibly correlated variables onto a set of linearly uncorrelated variables, namely, the principal components. The projection is defined so that the first principal component has the largest possible variance (that is, it explains as much of the variability in the data as possible), and each subsequent component has the highest possible variance under the constraint of being orthogonal to the preceding components. The number of principal components is less than or equal to the number of raw variables. In the PCA method, one of the most important questions is how to measure the quality of the projection vectors. Mathematically, PCA can be defined as the orthogonal projection of the data onto a lower-dimensional linear space, called the principal subspace, such that the variance of the projected data is maximized (the maximum variance view). Equivalently, it can be defined as the linear projection that minimizes the average projection cost (the minimum error view).
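A minimal PCA sketch via the singular value decomposition of the centered data (illustrative only, not the implementation used in the experiments):

```python
import numpy as np

def pca_project(X, n_components):
    """PCA sketch: center the data, take the top right-singular vectors
    (the maximum-variance directions), and project onto them."""
    Xc = X - X.mean(axis=0)                 # center each variable
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T         # scores on the principal components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
Z = pca_project(X, 3)                       # compress 8-dim samples to 3 dims
```

Because the singular values are returned in descending order, the variance of the projected data decreases from the first component to the last, matching the definition above.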

The LLE algorithm assumes that each data point can be constructed as a linear weighted combination of its nearest neighbors. The algorithm has three main steps: (1) find the k nearest neighbors of each sample point; (2) compute, for each point, the local reconstruction weights that best express it as a linear combination of its neighbors; and (3) compute the low-dimensional embedding that preserves these weights.
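The core of step (2), computing the sum-to-one reconstruction weights for a single point, can be sketched as follows (the regularization constant is an assumption added for numerical stability):

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Step (2) of LLE for one point: find weights w (summing to 1) that
    best reconstruct x as a linear combination of its neighbors."""
    Z = neighbors - x                        # shift neighbors so x is the origin
    G = Z @ Z.T                              # local Gram matrix
    G += reg * np.trace(G) * np.eye(len(G))  # regularize the (often singular) G
    w = np.linalg.solve(G, np.ones(len(G)))
    return w / w.sum()                       # enforce the sum-to-one constraint

x = np.array([0.5, 0.5])
neighbors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w = lle_weights(x, neighbors)                # symmetric case: equal weights
```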

The AE learns automatically from data samples. It is a neural network that uses backpropagation to make the output value approximate the input value. It consists of two parts, the encoder and the decoder: the encoder compresses the input into a latent spatial representation, and the decoder reconstructs the output from it. The purpose is to let each encoder layer extract features automatically, so the output of the encoder can be regarded as compressed data. When evaluating this compression method, an optimal autoencoder model suitable for raw network traffic is designed to balance computational efficiency and information fidelity.

In order to compare the performance of compression algorithms more intuitively, in the next section, we keep the data dimensions of all the compressed methods the same, then evaluate the impact of compression methods on the performance of FCNN, and discuss the optimal compression methods under different conditions.

In the training phase, the training data set is divided into a training subset and a validation subset. The training set and test set are randomly selected from the complete CSE-CIC-IDS2018 data set. To ensure that the samples in the training set and the test set overlap as little as possible, their sizes are much smaller than the total number of samples (for example, the Sub_DS1 subset contains 1,044,751 samples in total, while the training set and the test set contain 3,000 and 1,800 samples, respectively). At this stage, the training set and test set are randomly redrawn many times (usually 50 times), so that they fully reflect the characteristics and distribution of the overall sample. Because the total sample is large enough to represent most normal and attack behaviors, and the training and test sets reflect the overall sample well, the model can, to a certain extent, detect future attacks. These samples are used as inputs to the training model, and training stops after the training period

In the testing phase, the test samples are fed into the model produced in the training phase for classification prediction. It is worth emphasizing that all the above sample subsets are selected randomly from the total sample set, not hand-picked. All the deep learning models mentioned in this article use an efficient adaptive-learning-rate optimizer, Adam [

CSE-CIC-IDS2018 is a mixed data set of network traffic and system logs covering 10 days; each day's data forms a subset, with a total size of more than 400 GB. The data set includes 7 attack types and 16 attack subtypes, including brute-force attacks, DoS attacks, network surveillance attacks, and penetration attacks. However, there are few deep learning intrusion detection methods for this data set [

CSE-CIC-IDS2018 data subset.

| Data subset | Collection time | Attack type | Total number of samples |
|---|---|---|---|
| Sub_DS1 | Wednesday-14-02-2018 | Benign | 663,808 |
| | | FTP-BruteForce | 193,354 |
| | | SSH-BruteForce | 187,589 |
| Sub_DS2 | Thursday-15-02-2018 | Benign | 988,050 |
| | | DoS-GoldenEye | 41,508 |
| | | DoS-slowloris | 10,990 |
| Sub_DS3 | Thursday-01-03-2018 | Benign | 235,778 |
| | | Penetration attack | 92,403 |

In order to measure the performance of the deep learning models, the global accuracy Acc, the normalized model training time, and the normalized compression time are used to compare and analyze the models involved [

The relevant indicators are defined as follows:

Acc = (True positive + True negative)/total number of samples.

Normalized training time = training time/maximum training time.

Normalized compression time = compression time/maximum compression time.
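The three indicators above can be computed directly from the counts and timings (the values below are illustrative, not results from the paper):

```python
def evaluate(tp, tn, total, train_time, max_train_time, comp_time, max_comp_time):
    """Metrics used in the evaluation: global accuracy, plus training and
    compression times normalized by the largest observed value."""
    return {
        "acc": (tp + tn) / total,                      # global accuracy Acc
        "norm_train_time": train_time / max_train_time,
        "norm_comp_time": comp_time / max_comp_time,
    }

# hypothetical counts for a 1,800-sample test set
m = evaluate(tp=950, tn=820, total=1800, train_time=42.0, max_train_time=60.0,
             comp_time=3.0, max_comp_time=12.0)
```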

Training time is usually positively correlated with other indicators of computational efficiency. Here, normalized training time is used as an easy-to-measure index of computational efficiency. He and Sun [

First of all, when the compression method is NC, the performance of the deep learning detection model is evaluated. Then, when the detection model is FCNN, the effects of the four compression methods involved in this paper on the detection accuracy and computational efficiency are compared. The Keras implementation algorithm of FCNN is shown in Appendix.

The hardware environment of the experiment has an Intel Xeon (Cascade Lake) Platinum 8269 GHz/3.2 GHz 4-core CPU and 8 GB memory. On the common network data set CSE-CIC-IDS2018 [

In order to make a fair comparison, keeping the default parameters unchanged, the detection accuracies of the FCNN, CNN-LSTM, and IDS-DNN algorithms are compared on the CSE-CIC-IDS2018 data set. As can be seen from Figure

Comparison of Acc of different models on three data subsets of CSE-CIC-IDS2018.

As can be seen from Figure

Comparison of accuracy and computational efficiency of raw traffic data before and after compression of the FCNN model.

Figures

Influence of sample compression parameters on detection accuracy.

Influence of sample compression parameters on training time.

When step changes from 5 to 50 and Acc remains almost unchanged, the training time of FCNN is always shorter than that of CNN-LSTM; compared with CNN-LSTM, FCNN reduces training time by 23.82%. The reason is that FCNN uses GoF to initialize the weights of two convolution layers of the CNN model, thus saving some training time. When step is less than or equal to 15, the training times of IDS-DNN and FCNN are almost the same, but the accuracy of the IDS-DNN model is lower than that of FCNN. When step is greater than 30, the training time of IDS-DNN is much longer than that of FCNN and CNN-LSTM, yet the Acc of the IDS-DNN model is not significantly improved.

The maximum number of packets in a single input sample is

Figure

Influence of the maximum number of packets of each sample on the detection accuracy.

Figure

Comparison of ROC curves.

This section compares the computational efficiency of the proposed compression method NC with that of the other three compression methods, PCA, LLE, and AE, and their impact on the detection accuracy of the FCNN detection model.

Figure

Comparison of detection accuracy of compression methods when the detection model is FCNN.

Figure

Comparison of compression time of the compression method.

Comparing Figures

To address the limitation that existing network intrusion detection methods rely on manually designed features, as well as the efficiency bottleneck of deep learning models when dealing with high-dimensional, massive raw traffic data, this paper proposes a data preprocessing method based on a multipacket input unit to compress the raw traffic and designs a high-performance deep network intrusion detection model using the powerful feature extraction ability of the Gabor filter. Comparative experiments show that this method can effectively balance computational efficiency and detection accuracy on the benchmark data set and that the proposed FCNN model outperforms the other two deep learning algorithms that perform well in the field of intrusion detection.

Our main future research directions include improving the proposed preprocessing methods, exploring better data compression methods, and improving the universality of FCNN, for example, making FCNN suitable for the Internet of Things, the industrial Internet, and wide area networks with many protocols and diverse attack types, or for the backbone networks of high-speed switches, routers, and other communication equipment. Future research will also apply cutting-edge deep learning techniques to improve the generalization ability of detection methods, such as using transfer learning to accelerate training and improve prediction accuracy. In image processing and natural language processing, transfer learning significantly improves prediction accuracy and speeds up training: a model trained on large data sets can be fine-tuned to perform well on small ones. There is little research on applying transfer learning to network anomaly detection. In the future, we will introduce transfer learning into this field and study it in depth, train a network anomaly detection model on large data sets, and publish it as a pretrained model.

The Keras implementation algorithm of the proposed approach is presented in Algorithm

Algorithm: FCNN training algorithm of the raw-traffic intrusion detection model based on the Gabor network

```python
from keras.models import Sequential
from keras.layers import Convolution1D, MaxPooling1D, Flatten, Dense, Dropout

# Input:  custom_gabor, custom_gabor2  # Gabor filter initializers
#         X_train, y_train             # training set and labels
#         X_val, y_val                 # validation set and labels
#         epochs = loop_num            # iterations
#         batch_size = 10              # batch size
#         path = "model/fcnn"          # training model save path
# Output: model FCNN

FCNN = Sequential()
FCNN.add(Convolution1D(48, 3, trainable=False,
                       kernel_initializer=custom_gabor,
                       border_mode="same",
                       activation="relu",
                       input_shape=(featureNum, 1)))
FCNN.add(Convolution1D(48, 3, border_mode="same", activation="relu"))
FCNN.add(MaxPooling1D(pool_length=2))
FCNN.add(Convolution1D(128, 3, trainable=False,
                       kernel_initializer=custom_gabor2,
                       border_mode="same",
                       activation="relu"))
FCNN.add(Convolution1D(128, 3, border_mode="same", activation="relu"))
FCNN.add(MaxPooling1D(pool_length=2))
FCNN.add(Flatten())
FCNN.add(Dense(128, activation="relu"))
FCNN.add(Dropout(0.1))
FCNN.add(Dense(1, activation="sigmoid"))  # construction of the FCNN model
FCNN.compile(loss="binary_crossentropy", optimizer="Adam", metrics=["accuracy"])
FCNN.fit(X_train, y_train,
         validation_data=(X_val, y_val),
         batch_size=batch_size,
         epochs=loop_num)  # train FCNN
FCNN.save(path + "FCNN_model.hdf5")  # save the model
```

The data used in this study are available at

The authors declare that they have no conflicts of interest regarding the publication of this paper.

This work was sponsored by the National Key Research and Development Program of China (Grants nos. 2018YFB0804002 and 2017YFB0803204).