DL-IDS: Extracting Features Using CNN-LSTM Hybrid Network for Intrusion Detection System

Many recent studies have used machine learning to improve network intrusion detection systems. Most of this research relies on manually extracted features, an approach that not only incurs high labor costs but also loses much of the information in the original data, resulting in low detection accuracy and systems that cannot be deployed in practice. This paper develops DL-IDS (deep learning-based intrusion detection system), which uses a hybrid network of Convolutional Neural Network (CNN) and Long Short-Term Memory Network (LSTM) to extract the spatial and temporal features of network traffic data and thereby provide a better intrusion detection system. To reduce the influence on model performance of the unbalanced number of samples across attack types in the training data, DL-IDS uses a category weight optimization method to improve robustness. Finally, DL-IDS is tested on CICIDS2017, a reliable intrusion detection dataset that covers all the common, updated intrusions and cyberattacks. In the multiclassification test, DL-IDS reached 98.67% in overall accuracy, and the accuracy of each attack type was above 99.50%.


Background.
In recent years, with the rapid development of emerging communications and information technologies such as 5G communications, mobile Internet, Internet of Things, cloud computing, and big data, network security has become increasingly important. As an important research topic in network security, intrusion detection has attracted the attention of experts and scholars. Common problems with traditional anomaly-based detection methods include inaccurate feature extraction from network traffic and difficulty in building attack detection models, which lead to a high false alarm rate when judging attack traffic. It is difficult for network security personnel to find unknown threats, which makes the defense inherently passive. In other words, traditional methods are no longer applicable to today's Internet given its massive data scale.
In recent years, many scholars have explored how to use artificial intelligence (AI) to detect and analyze network traffic for intrusion detection and defense systems. Hassan et al. [1] proposed an ensemble-learning model that combines a random subspace (RS) learning method with random tree (RT) and detects cyberattacks on SCADA systems using network traffic from a SCADA-based IoT platform. Khan and Gumaei [2] compared the most popular machine learning methods for intrusion detection in terms of accuracy, precision, recall, and training time cost. Alqahtani et al. [3] proposed the GXGBoost model, which detects intrusion attacks using a genetic algorithm and an extreme gradient boosting (XGBoost) classifier. Derhab et al. [4] proposed a security architecture that integrates Blockchain and software-defined network (SDN) technologies and focuses on securing commands in industrial IoT against forged commands and misrouting of commands. The current mainstream methods are intrusion detection systems based on machine learning (ML) or deep learning (DL). An ML-based system mainly classifies and detects network traffic by analyzing manually extracted features, while a DL-based system can not only analyze manually extracted features but also automatically extract features from the original traffic. Therefore, DL-based systems can circumvent the manual feature extraction problem and improve detection accuracy compared with general ML-based systems.
To achieve higher accuracy, DL-based intrusion detection methods require a large amount of training data, especially for the different types of attack traffic. In real environments and in the existing datasets (KDD99, NSL-KDD, and CICIDS2017), attack traffic is always scarcer than normal traffic. Moreover, because some types of attack traffic are difficult to capture and simulate, the amount of data available for training on them is particularly small. These problems greatly restrict the accuracy of DL-based methods, making it difficult to detect certain types of attacks.

Key Contributions.
This paper proposes a DL-based intrusion detection system, DL-IDS, which uses a hybrid network of Convolutional Neural Network (CNN) and Long Short-Term Memory Network (LSTM) to extract the temporal and spatial features of network traffic data and improve the accuracy of intrusion detection. In the model training phase, DL-IDS uses category weights to optimize the model. This method reduces the effect on model performance of the unbalanced number of samples across attack types in the training data and improves the robustness of training and prediction. Finally, we test DL-IDS on multiclass classification of network traffic using the CICIDS2017 dataset, chosen because it is a recent dataset of original network traffic that simulates real situations, and compare it with a CNN-only model, an LSTM-only model, and other machine learning models. The results show that DL-IDS reached 98.67% in overall accuracy, and the accuracy of each attack type was above 99.50%, the best results among all the models compared.

Paper Organization.
The remainder of this paper is organized as follows. Section 2 discusses the classification of abnormal traffic in previously published studies. Section 3 describes in detail the dataset and data preprocessing methods used in this study. The classifier structure and classification methods used for traffic classification under the proposed model are described in Section 4. Section 5 presents the results of our model evaluation with various hyperparameters. Section 6 summarizes the paper and discusses potential future development trends.

Related Work
With the continual expansion of the Internet, network security has become a problem that cannot be ignored. Malicious network behaviors such as DDoS and brute force attacks tend to be "mixed into" malicious traffic. Security researchers seek to effectively analyze the malicious traffic in a given network so as to identify potential attacks and quickly stop them [5][6][7][8].

Traditional Intrusion Detection System.
Traditional methods of intrusion detection mainly include statistical analysis methods [9], threshold analysis methods [10], and signature analysis methods [11]. These methods do reveal malicious traffic behavior; however, they depend on rules and parameters that security researchers set from personal experience, which is very inefficient. Such experience is also only a summary of the malicious traffic behavior found in the past and is typically difficult to quantify, so these methods cannot be readily adapted to the huge volume of network data and the volatile network attacks of today's Internet.

Intrusion Detection System Based on ML.
Advancements in machine learning have produced models that effectively classify and cluster traffic for the purposes of network security. Early researchers applied simple machine learning algorithms that had been used for classification and clustering problems in other fields, such as k-Nearest Neighbor (KNN) [12], support vector machine (SVM) [13], and self-organizing maps (SOM) [14], with good results on KDD99, NSL-KDD, DARPA, and other datasets. Unfortunately, these datasets are out of date, and the normal data and attack data they contain are overly simple. It is difficult to use these datasets to simulate today's highly complex network environment. It is also difficult to achieve the expected effect when using these algorithms to analyze malicious traffic in a relatively new dataset, as evidenced by our work in this study.

Intrusion Detection System Based on Deep Neural Network.
The success of machine learning algorithms generally depends on data representation [15]. Representation learning, also called feature learning, is a technique in deep neural networks that can be used to learn the explanatory factors of variation behind the data. Ma et al. combined spectral clustering and deep neural network algorithms to detect intrusion behaviors [16]. Niyaz et al. used deep belief networks to develop an efficient and flexible intrusion detection system [17]. However, these methods construct their models to learn representations from manually designed traffic features and thus do not take full advantage of the ability of deep neural networks. Eesa et al. showed that a higher detection rate and accuracy with a lower false alarm rate can be obtained by using an improved traffic feature set [18]. Learning features directly from raw traffic data should be feasible, as it is in the fields of computer vision and natural language processing [19].
The two most widely used deep neural network models are CNN and RNN. The CNN uses original data as the direct input to the network, does not necessitate feature extraction or image reconstruction, has relatively few parameters, and requires relatively little data in process. CNNs have been proven to be highly effective in the field of image recognition [20]. For the traffic of certain network protocols, CNNs can perform well through fast training. Fan and Ling-zhi [21] extracted very accurate features by using a multilayer CNN, wherein the convolution layer connects to the sampling layer below; this model outperformed classical detection algorithms (e.g., SVM) on the KDD99 dataset. However, the CNN can only analyze a single input packet; it cannot analyze timing information in a given traffic scenario. In reality, a single packet in an attack may look like normal data; only when a large number of such packets are sent at the same time or within a short period do they constitute malicious traffic. The CNN does not apply in this situation, which in practice may lead to a large number of missed alerts. The recurrent neural network (RNN) is also often used to analyze sequential information. LSTM, a branch of RNNs, performs well in sequence information analysis applications such as natural language processing. Kim et al. [22] compared the LSTM-RNN network against Generalized Regression Neural Network (GRNN), Product-based Neural Network (PNN), k-Nearest Neighbor (KNN), SVM, Bayesian, and other algorithms on the KDD99 dataset and found that it was superior in every aspect they tested. The LSTM network alone, however, centers on the direct relationship between sequences rather than the analysis of a single packet, so it cannot readily replace the CNN in this regard.
Wu and Guo [23] proposed a hierarchical CNN + RNN neural network and carried out experiments on the NSL-KDD and UNSW-NB15 datasets. Hsu et al. [24] and Ahsan and Nygard [25] used another CNN + LSTM model to perform multiclassification experiments on the NSL-KDD dataset. Hassan et al. [26] proposed a hybrid deep learning model to detect network intrusions based on a CNN and a weight-dropped long short-term memory (WDLSTM) network; their experiments were conducted mainly on the UNSW-NB15 dataset. However, all of these studies still rely on features extracted in advance.
Abdulhammed et al. [27] used Autoencoder and Principal Component Analysis to reduce the feature dimensions of the CICIDS2017 dataset. The resulting low-dimensional features from both techniques are then used to build various classifiers to detect malicious attacks. Musafer et al. [28] proposed a novel IDS architecture based on an advanced Sparse Autoencoder and Random Forest to distinguish the patterns of normal packets from those of network attacks and obtained good results.
In this study, we adopted a malicious traffic analysis method based on CNN and LSTM to extract and analyze information from raw network traffic in both the spatial and temporal dimensions. We conducted training and testing on the CICIDS2017 dataset, which simulates a real network environment well. We ran a series of experiments to show that the proposed model analyzes malicious flows very effectively.

Dataset.
The IDS is the most important defense tool against complex and large-scale network attacks, but the lack of available public datasets still hinders its further development. Many researchers have used private data within a single company or conducted manual data collection to test IDS applications, which affects the credibility of their results to some extent. Public datasets such as KDD99 and NSL-KDD [29] consist of manually selected flow features rather than original network traffic. They were also collected long ago and are outdated compared with modern attack methods.
In this study, in an effort to best reflect real traffic scenarios in real networks as well as newer means of attack, we chose the CICIDS2017 dataset from the Canadian Institute for Cybersecurity [30], which contains benign traffic and up-to-date common attack traffic representative of a real network environment. This dataset models the abstract behavior of 25 users based on the HTTP, HTTPS, FTP, SSH, and e-mail protocols to accurately simulate a real network environment. The data capture period ran from 9 a.m. on July 3, 2017, to 5 p.m. on July 7, 2017; a total of 51.1 GB of traffic data was generated over this five-day period. The attack traffic collected includes eight types of attack: FTP-Patator, SSH-Patator, DoS, Heartbleed, Web Attack, Infiltration, Botnet, and DDoS. As shown in Table 1, the attacks were carried out on Tuesday, Wednesday, Thursday, and Friday, in the morning and afternoon. Normal traffic was generated throughout the day on Monday and during the attack-free periods from Tuesday to Friday. The data are provided as pcap files.
After acquiring the dataset, we analyzed the original data and selected seven types of data for subsequent assessment according to the amount of data and its noise rate. They are Normal, FTP-Patator, SSH-Patator, DoS, Heartbleed, Infiltration, and PortScan.

Network Traffic Segmentation Method.
The CICIDS2017 dataset is provided as one pcap file per day. These pcap files contain a great deal of information, which is not conducive to model training. Therefore, the first task in machine-learning-based traffic classification is to divide the continuous pcap files into discrete units at a chosen granularity.
There are six ways to slice network traffic: by TCP, by connection, by network flow, by session, by service class, and by host. When the original traffic data is segmented by different methods, the resulting units take quite different forms, so the choice of segmentation method markedly influences the subsequent analysis.
We adopted session-based segmentation in this study. A session consists of all packets of a bidirectional flow, that is, all packets that share the same five-tuple (source IP, source port, destination IP, destination port, and transport layer protocol), where the source and destination addresses and ports are interchangeable.
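The session definition above can be illustrated with a minimal Python sketch. The packet records, field names, and helper names (`session_key`, `group_sessions`) are hypothetical and are not part of the DL-IDS pipeline, which uses pkt2flow for this step; the sketch only shows how a direction-independent key makes both directions of a flow fall into the same session.

```python
from collections import defaultdict


def session_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Return a key that is identical for both directions of a flow."""
    # Sorting the two endpoints makes source and destination interchangeable.
    first, second = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (first, second, proto)


def group_sessions(packets):
    """Group packet records (dicts with hypothetical field names) by session."""
    sessions = defaultdict(list)
    for pkt in packets:
        key = session_key(pkt["src_ip"], pkt["src_port"],
                          pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
        sessions[key].append(pkt)
    return dict(sessions)


# Illustrative records only: a request, its reply, and an unrelated packet.
packets = [
    {"src_ip": "10.0.0.1", "src_port": 50000, "dst_ip": "10.0.0.2", "dst_port": 22, "proto": "TCP"},
    {"src_ip": "10.0.0.2", "src_port": 22, "dst_ip": "10.0.0.1", "dst_port": 50000, "proto": "TCP"},
    {"src_ip": "10.0.0.3", "src_port": 50001, "dst_ip": "10.0.0.2", "dst_port": 21, "proto": "TCP"},
]
sessions = group_sessions(packets)
```

In this sketch the first two records share one key because their endpoints are merely swapped, so they form a single session of two packets.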

Data Preprocessing.
Data preprocessing begins with the original flow, namely, the data in pcap format, and formats it into the model input data. The CICIDS2017 dataset provides the original data in pcap format and a CSV file detailing some of the traffic. To transform the original data into the model input format, we conducted time division, traffic segmentation, PKL file generation, PKL file labeling, matrix generation, and one_hot encoding. A flow chart of this process is given in Figure 1.
Step 1 (time division). Time division refers to extracting, from the original pcap file, the pcap data of each period according to the attack time and type [30]. Both the input and the output are pcap files. The time periods corresponding to each type and the sizes of the resulting files are shown in Table 2.
Step 2 (traffic segmentation). Traffic segmentation refers to dividing the pcap file obtained in Step 1 into sessions according to the IP addresses of the attack host and victim host for each time period [31]. The specific process is shown in Figure 2.
This step first splits the pcap file of Step 1 into flows using pkt2flow [31], which divides the original pcap file into separate pcap files, one per flow, according to their five-tuples. Next, the pcap files of flows that differ only in having source and destination interchanged are merged. In this way, the pcap file of Step 1 is divided into sessions.
Step 3 (generate the PKL file) and Step 4 (tag the PKL file). As shown in Table 1, the pcap file is still large after extraction, which poses a serious challenge for data reading in the model. To accelerate data reading, we packaged the traffic using the pickle tool in Python. We use PortScan traffic as an example of the packaging process. In this class, many sessions are generated after Step 2, each of which is saved in a pcap file. We generated a label of the corresponding attack type for each session. Each session contains several data flows, and each data flow contains $n_1$ packets. We then saved $n_2$ sessions in a PKL file to speed up the reading of the data. $n_1$ can be changed as needed; according to the experimental results, we finally selected the best value, $n_1 = 8$. The value of $n_2$ can be calculated by formula (1). In this case, we packaged the sessions of each type into a PKL file. The structure of the entire PKL file is shown in Figure 3.
$$n_2 = \frac{\text{total number of packets of this type}}{n_1}. \quad (1)$$
Step 5 (matrix generation). The input of the model must have a fixed length, so the next step is to unify the length of each session. The attacks differ mainly in the packet header, so we processed each packet to a uniform length of PACKET_LEN bytes: if the packet was longer than PACKET_LEN bytes, it was truncated, and if it was shorter, it was padded with -1. Each session was then arranged into a matrix of size MAX_PACKET_NUM_PER_SESSION × PACKET_LEN. According to our experimental results, we chose MAX_PACKET_NUM_PER_SESSION = 8 and PACKET_LEN = 40.
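As a minimal sketch of Step 5, the following Python code truncates or pads each packet to PACKET_LEN values and arranges a session into a fixed-size matrix. The function names are hypothetical, and padding sessions that have fewer than MAX_PACKET_NUM_PER_SESSION packets with rows of -1 is an assumption; the text only specifies the padding value for short packets.

```python
PACKET_LEN = 40                    # bytes kept per packet (value chosen in the text)
MAX_PACKET_NUM_PER_SESSION = 8     # packets kept per session (value chosen in the text)
PAD_VALUE = -1                     # filler for missing bytes


def packet_to_row(packet_bytes, packet_len=PACKET_LEN):
    """Truncate a packet to packet_len byte values, or pad it with PAD_VALUE."""
    row = list(packet_bytes[:packet_len])
    row += [PAD_VALUE] * (packet_len - len(row))
    return row


def session_to_matrix(session_packets,
                      max_packets=MAX_PACKET_NUM_PER_SESSION,
                      packet_len=PACKET_LEN):
    """Arrange a session into a max_packets x packet_len matrix."""
    rows = [packet_to_row(p, packet_len) for p in session_packets[:max_packets]]
    # Assumption: sessions with too few packets are filled with all-padding rows.
    while len(rows) < max_packets:
        rows.append([PAD_VALUE] * packet_len)
    return rows


# Illustrative input: one 50-byte packet and one 2-byte packet.
matrix = session_to_matrix([bytes(range(50)), b"\x01\x02"])
```

Every session then yields an 8 × 40 matrix regardless of how many packets it contained or how long they were.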
Step 6 (one_hot encoding). To effectively learn and classify the model, the data from Step 5 are processed by one_hot encoding to convert qualitative features into quantitative features. In this encoding, B is a byte in the data packet and num is the number used for encoding in one_hot encoding; in the model implementation, num = 1. $ohe_i$ is a bit in the one_hot encoding of a byte, ⊕ is the series (concatenation) notation, and $OHE_j$ is the one_hot encoding of a byte.
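The one_hot step can be sketched in Python as follows. The encoding formula itself is not reproduced in this text, so the sketch assumes the usual construction: a byte B becomes a 256-position vector with num at position B and 0 elsewhere. Mapping the padding value -1 to an all-zero vector is also an assumption, and the function names are hypothetical.

```python
NUM = 1          # value written at the active position (num = 1 in the text)
BYTE_RANGE = 256  # a byte takes values 0..255


def one_hot_byte(value, num=NUM, size=BYTE_RANGE):
    """One-hot encode a single byte value as a list of `size` positions."""
    encoded = [0] * size
    # Assumption: padding (-1) is outside 0..255 and stays all zeros.
    if 0 <= value < size:
        encoded[value] = num
    return encoded


def one_hot_packet(row):
    """Encode every byte of a fixed-length packet row."""
    return [one_hot_byte(value) for value in row]


encoded_packet = one_hot_packet([3, 255, -1])
```

Concatenating the positions of one vector corresponds to the ⊕ notation described above.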

DL-IDS Architecture
This section introduces the traffic classifier built into DL-IDS, which uses a combination of CNN and LSTM to learn and classify traffic packets in both time and space. Because different types of traffic have different characteristics, we also used class weights to improve the stability of the model. The overall architecture of the classifier is shown in Figure 4. The classifier is composed of a CNN section and an LSTM section. The CNN section is composed of an input and embedded layer, convolution layer 1, pooling layer 2, convolution layer 3, pooling layer 4, and full connection layer 5. Upon receiving a preprocessed PKL file, the CNN section processes it and returns a high-dimensional packet vector to the LSTM section. The LSTM section is composed of LSTM layer 6, LSTM layer 7, full connection layer 8, and the output layer. It processes a series of high-dimensional packet vectors and outputs a vector that represents the probability that the session belongs to each class. The Softmax layer outputs the final classification result according to this probability vector.

CNN in DL-IDS.
We converted the data packets obtained from the preprocessed data into a traffic image. A traffic image is a two-dimensional matrix formed from all or part of the bit values of a network traffic packet. The data in a network traffic packet are composed of bytes, whose value range is 0-255, the same as the value range of the bytes in images. We took x bytes from the header of a packet and y bytes from the payload and composed them into a traffic image for subsequent processing, as discussed below. As mentioned above, the CNN section is composed of input and embedded layers, convolution layer 1, pooling layer 2, convolution layer 3, pooling layer 4, and full connection layer 5. We combined convolution layer 1 and pooling layer 2 into Combination I and convolution layer 3 and pooling layer 4 into Combination II. Each Combination analyzes the characteristics of the input from a different perspective. In Combination I, a convolution layer with a small convolution kernel is used to extract local features of traffic image details (e.g., IP and Port). Clear-cut features and stable results can be obtained in the pooling layer. In Combination II, a large convolution kernel is used to analyze the relationship between two bits that are far apart, such as information in the traffic payload.
After preprocessing and one_hot coding, the network traffic constitutes the input vector of the input layer. In the input layer, Length bytes are taken from the ith packet, $Pkg_i = (B_1, B_2, \ldots, B_{Length})$, and n packets are then combined into the information set $S = (Pkg_1, Pkg_2, \ldots, Pkg_n)$.
Formula 3 describes a convolution layer, where f is the side length of the convolution kernel; in the two convolution layers, f = 7 and f = 5. In addition, s is the stride, p is the padding, b is the bias, w is the weight, c is the channel, l is the layer, $L_l$ is the size of $Z_l$, and Z(i, j) is the pixel of the corresponding feature map.
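To make the roles of f, s, and p concrete, the following Python sketch applies the standard relation between the input and output sizes of a convolution layer, $L_{l+1} = \lfloor (L_l + 2p - f)/s \rfloor + 1$. The input size of 40 and the stride and padding values used here are assumptions chosen for illustration; the text only gives the kernel sizes f = 7 and f = 5.

```python
def conv_output_size(size, f, s=1, p=0):
    """Output side length of a convolution with kernel f, stride s, padding p."""
    return (size + 2 * p - f) // s + 1


# Assumed input side length of 40 with stride 1 and no padding.
after_first = conv_output_size(40, f=7)            # first convolution layer
after_second = conv_output_size(after_first, f=5)  # second convolution layer

# With padding p = (f - 1) / 2 and stride 1, the size is preserved.
same_size = conv_output_size(40, f=7, s=1, p=3)
```

The pooling layers between the convolutions would shrink the feature map further; they are left out of this sketch to keep the relation visible.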

Security and Communication Networks
The convolution layer contains an activation function (formula 4) that assists in the expression of complex features. K is the number of channels in the feature map, and A represents the output of the Z vector after the activation function. We used sigmoid and ReLU, respectively, after the two convolution layers.
After feature extraction in the convolution layer, the output image is transferred to the pooling layer for feature selection and information filtering. The pooling layer contains a preset pooling function that replaces the value at a single point in the feature map with a statistic of its adjacent region. The pooling layer is calculated by formula 5, where p is a prespecified parameter. We applied max pooling in this study, that is, p ⟶ ∞.
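The statement that max pooling corresponds to p ⟶ ∞ can be checked numerically. The sketch below assumes the usual Lp pooling form, in which a window is reduced to $(\sum_k |a_k|^p)^{1/p}$, and shows that for large p the result approaches the maximum of the window. The function names and the 2 × 2 window with stride 2 are illustrative assumptions, not values taken from the paper.

```python
def lp_pool(window, p):
    """Lp pooling of one window: (sum of |a|^p) ** (1/p)."""
    return sum(abs(a) ** p for a in window) ** (1.0 / p)


def max_pool_2d(feature_map, size=2, stride=2):
    """Max pooling over a 2D feature map given as a list of lists."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [max(feature_map[r + dr][c + dc]
             for dr in range(size) for dc in range(size))
         for c in range(0, cols - size + 1, stride)]
        for r in range(0, rows - size + 1, stride)
    ]


window = [0.2, 0.5, 0.9]
near_max = lp_pool(window, p=50)   # close to max(window) for large p
plain_sum = lp_pool(window, p=1)   # p = 1 gives the plain sum

pooled = max_pool_2d([[1, 2, 3, 4],
                      [5, 6, 7, 8],
                      [9, 10, 11, 12],
                      [13, 14, 15, 16]])
```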
We also used a back-propagation algorithm to adjust the model parameters. In the weight adjustment algorithm (formula 6), δ is delta error of loss function to the layer, and α is the learning rate.
We used categorical cross-entropy as the loss function. To reduce training time and improve the accuracy of gradient descent, we used the RMSProp optimizer to adjust the learning rate.
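For readers unfamiliar with RMSProp, the following Python sketch shows one common form of its update rule on a single scalar parameter: the step is scaled by a running average of squared gradients. The hyperparameter values (lr, rho, eps) and the toy objective are assumptions for illustration; the paper does not state the settings it used.

```python
import math


def rmsprop_step(w, grad, cache, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSProp update for a scalar parameter w."""
    # Running average of squared gradients.
    cache = rho * cache + (1.0 - rho) * grad * grad
    # Step scaled by the root of that average.
    w = w - lr * grad / (math.sqrt(cache) + eps)
    return w, cache


# Toy objective (w - 3)^2, whose gradient is 2 * (w - 3).
w, cache = 0.0, 0.0
for _ in range(500):
    grad = 2.0 * (w - 3.0)
    w, cache = rmsprop_step(w, grad, cache, lr=0.05)
```

Because the step is normalized by the recent gradient magnitude, the effective learning rate adapts per parameter, which is the property the text refers to.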
After two convolution and pooling operations, the entire traffic image is reduced to a smaller feature block that represents the feature information of the whole traffic packet. The block can then be fed into the RNN layer as input.

LSTM in DL-IDS.
Normal network communications and network attacks are both carried out according to a certain network protocol. This means that attack packets are embedded in traffic alongside packets that carry fixed parts of the network protocol, such as normal connection establishments, key exchanges, connections, and disconnections. In this normal portion of the attack traffic, no data can be used to determine whether a packet is intended to cause an attack. Using a CNN alone, trained on the characteristics of single packets, as the basis for judging the nature of the traffic makes the data difficult to label, leaves too much "dirty" data in the traffic, and produces altogether poor training results. In this study, we remedied this by introducing the LSTM, which takes the data of a single connection (from initiation to disconnection) as a group and uses the characteristics of all data packets in the group and the relations among them as the basis for judging the nature of the traffic. Natural language processing models perform well in traffic information processing [32] under a methodology similar to the grouping judgment method proposed here.
The LSTM section is composed of LSTM layer 6, LSTM layer 7, full connection layer 8, and Softmax and output layers. The main functions are realized by the two LSTM layers. The LSTM is a special RNN designed to resolve the vanishing gradient and exploding gradient problems in long sequence training. General RNN networks only have one tanh layer, while LSTM networks perform better at time series prediction through their forget gate and selective memory gate. Here, we call the LSTM node a cell ($C_t$), whose input and output are $x_t$ and $h_t$, respectively. The first step in the LSTM layer is to determine what information the model will discard from the cell state. This decision is made through the forget gate (formula 7). The gate reads $h_{t-1}$ and $x_t$ and outputs a value between 0 and 1 for each number in the cell state $C_{t-1}$; 1 means "completely retained" and 0 means "completely discarded."
W and b are weight and bias in the neural network, respectively.
Figure 4: Architecture of DL-IDS.
The next step is to decide how much new information to add to the cell state. First, a sigmoid layer determines which information needs to be updated (formula 8). A tanh layer generates a vector as an alternative for updating (formula 9). The two parts are then combined to make an update to the state of the cell (formula 10).
The output gate determines the output of the cell. First, a sigmoid layer determines which parts of the cell state are output (formula 11). Next, the cell state is passed through a tanh function to obtain a value between −1 and 1 and then multiplied by the output of the sigmoid gate, which gives the final output (formula 12).
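Formulas 7 to 12 are not reproduced in this text. The gate descriptions above match the standard LSTM formulation, which is given below for reference; the mapping of each line to a formula number and the subscripts on W and b are assumptions based on the order of the description, with σ denoting the sigmoid function and ∗ elementwise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) && \text{(forget gate, cf. formula 7)} \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) && \text{(update selection, cf. formula 8)} \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) && \text{(candidate vector, cf. formula 9)} \\
C_t &= f_t \ast C_{t-1} + i_t \ast \tilde{C}_t && \text{(cell state update, cf. formula 10)} \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) && \text{(output gate, cf. formula 11)} \\
h_t &= o_t \ast \tanh\left(C_t\right) && \text{(cell output, cf. formula 12)}
\end{aligned}
```

Each gate value lies between 0 and 1, so $f_t$ scales how much of $C_{t-1}$ is retained and $i_t$ scales how much of the candidate $\tilde{C}_t$ is added.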
In the proposed model, the feature maps of the n data packets in a group of traffic images in a connection serve as the input of the LSTM section. The feature relations among these n data packets are analyzed through the two LSTM layers. The first few packets may be used to establish connections; such packets exist in normal data streams, but they may occur in attack data streams too. The next few packets may contain long payloads as well as attack data. The LSTM finds the groups containing attack data and marks all packets of those groups as attack groups.
LSTM layer 6 in DL-IDS has a linear activation function, chosen to minimize the training time. LSTM layer 7 is nonlinearly activated through the ReLU function. Traffic classification is a multiclass problem, so the model is trained to minimize multiclass cross-entropy. We did not update the class weights at every step; instead, we only set initial weights according to the volume of each type of data.

Weight of Classes.
The data obtained after preprocessing are shown in Table 3; clearly, the numbers of samples of the different data types are uneven. Type 0 has the most samples, while types 2 and 4 have the fewest. This may affect the final learning outcome of the classification. For example, if the model were to judge all traffic as type 0, its accuracy would still appear relatively high. We introduced class weights to resolve this problem: classes with different sample counts are given different weights, class_weight is set according to the number of samples, and class_weight[i] is used instead of 1 to penalize errors on samples of class i. A higher class_weight means a greater emphasis on the class. Compared with the unweighted case, more samples are classified into high-weight classes. The class weight is calculated via formula 13, where $w_i$ represents the class weight of class i and $n_i$ represents the amount of traffic of class i.
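Formula 13 is not reproduced in this text, so the exact weighting used by DL-IDS is not shown here. As an illustration of the idea only, the Python sketch below uses a common inverse-frequency heuristic, $w_i = N / (K \cdot n_i)$, where N is the total number of samples and K the number of classes. This choice, the function name, and the sample counts are assumptions and not the paper's values.

```python
def inverse_frequency_weights(counts):
    """Class weights proportional to the inverse of each class's sample count."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {cls: total / (num_classes * n) for cls, n in counts.items()}


# Made-up counts for three classes, used only to show the effect.
example_counts = {0: 900, 1: 90, 2: 10}
weights = inverse_frequency_weights(example_counts)
```

Under this heuristic the rarest class receives the largest weight, so errors on it are penalized most, which matches the behavior described above.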
When training the model, the weighted loss function in formula 14 makes the model focus more on samples from underrepresented classes. K is the number of categories, y is the label (if the sample category is i, then $y_i = 1$; otherwise $y_i = 0$), and p is the output of the neural network, that is, the probability that the model predicts category i, which is calculated by Softmax in this model. J denotes the resulting loss.
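Formula 14 is not reproduced in this text. A form consistent with the symbols defined above is the weighted categorical cross-entropy $J = -\sum_{i=1}^{K} w_i \, y_i \log p_i$, which the following Python sketch implements for a single sample. This form, the small eps added for numerical safety, and the example logits and weights are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(y, p, w, eps=1e-12):
    """J = -sum_i w_i * y_i * log(p_i) for one sample with one-hot label y."""
    return -sum(wi * yi * math.log(pi + eps) for wi, yi, pi in zip(w, y, p))


probs = softmax([2.0, 1.0, 0.1])
label = [0, 0, 1]  # the sample belongs to the third class
plain_loss = weighted_cross_entropy(label, probs, [1.0, 1.0, 1.0])
weighted_loss = weighted_cross_entropy(label, probs, [1.0, 1.0, 5.0])
```

With a weight of 5 on the third class, a mistake on a sample of that class costs five times as much as in the unweighted case.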

Experimental Results and Analysis
We evaluated the performance of the proposed model on the CICIDS2017 dataset with respect to the following factors: (1) the impact of the length of data packets involved in training; (2) the influence of the number of packets in each flow; (3) the impact of the selected batch size; (4) the effect of the number of units in the LSTM; and (5) the influence of class weights. We optimized the DL-IDS parameters accordingly and then compared the resulting model against a CNN-only model and an LSTM-only model. The ratio of the training, validation, and test sets is 18 : 1 : 1.
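An 18 : 1 : 1 split can be produced as in the following Python sketch. Shuffling with a fixed seed and splitting without stratification are assumptions; the text states only the ratio, and the function name is hypothetical.

```python
import random


def split_18_1_1(samples, seed=0):
    """Shuffle samples and split them into train, validation, and test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_val = len(items) // 20     # 1 part out of 20
    n_test = len(items) // 20    # 1 part out of 20
    n_train = len(items) - n_val - n_test
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


train_set, val_set, test_set = split_18_1_1(range(200))
```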
For each type of attack, TP is the number of samples correctly classified as this type, TN is the number of samples correctly classified as not this type, FP is the number of samples incorrectly classified as this type, and FN is the number of samples incorrectly classified as not this type. The definitions of TP, TN, FP, and FN are given in Figure 5.
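The per-type counts above lead to the metrics reported later (ACC, TPR, and FPR). The Python sketch below computes them in a one-vs-rest fashion using the standard definitions ACC = (TP + TN) / (TP + TN + FP + FN), TPR = TP / (TP + FN), and FPR = FP / (FP + TN); the function names and the toy labels are illustrative assumptions.

```python
def one_vs_rest_counts(y_true, y_pred, cls):
    """Count TP, TN, FP, FN for class `cls` against all other classes."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == cls and truth == cls:
            tp += 1
        elif pred == cls:
            fp += 1
        elif truth == cls:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def acc_tpr_fpr(tp, tn, fp, fn):
    """Accuracy, true positive rate, and false positive rate from the counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return acc, tpr, fpr


# Toy labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
counts = one_vs_rest_counts(y_true, y_pred, cls=1)
acc, tpr, fpr = acc_tpr_fpr(*counts)
```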

Experimental Environment.
The experimental configuration we used to evaluate the model parameters is described in Table 4.

LSTM Unit Quantity Effects on Model Performance.
The number of units in the LSTM represents the model's output dimension. In our experiments, we found that model performance first improves and then begins to decline as the number of LSTM units increases. We ultimately selected 85 as the optimal number of LSTM units.

Training Packet Length Effects on Model Performance.
Figure 6 shows the changes in ACC, TPR, and FPR as the length of the packets extracted during training increases. According to the training results, model performance declines significantly when the packet length exceeds 70. A likely explanation is that with an excessively long extraction length, a larger proportion of packets are shorter than that length, which increases the proportion of padding units with the value −1 and thus reduces the accuracy of the model. However, the extracted packet data must exceed a certain length to ensure that effective, credible content is put into training. This also prevents overfitting and provides more accurate classification of data packets with partially similar headers. We found that a length of 40 is optimal.
Under the condition that the packet length is 40, the efficiency and performance of DL-IDS in identifying various kinds of traffic are shown in Table 5.

Per-Flow Packet Quantity Effects on Model Performance.
As the number of data packets in each flow involved in the training process increases, the features extracted by the model become more obvious and the recognition accuracy of the model is enhanced. If this number is too high, however, the proportion of padding packets increases, which impairs the model's ability to extract features. Figure 7 shows the impact of the number of packets per flow on model performance. We found that when the number of packets in each flow exceeds 8, the performance of the model declines significantly. We chose 8 as the optimal per-flow packet quantity in the network.

Batch Size Effects on Model Performance.
Batch size is an important parameter in the model training process. Within a reasonable range, increasing the batch size can improve the memory utilization and speed up the data processing. If increased to an inappropriate extent, however, it can significantly slow down the process. As shown in Figure 8, we found that a batch size of 20 is optimal.
Table 6 shows a comparison of two groups of experimental results with and without class weights. Introducing the class weight does appear to reduce the impact on model performance of the imbalance in the amount of data of the various types in the CICIDS2017 dataset.

Model Evaluation.
The LSTM unit can effectively extract the temporal relationship between packets. Table 7 compares the accuracy of the DL-IDS model with that of models using the CNN or LSTM alone.
The LSTM unit appears to improve the identification efficiency for SSH-Patator, Infiltration, PortScan, and other attack traffic, possibly because these attacks have pronounced temporal patterns. Compared to the LSTM model alone, however, adding a CNN further improves the identification efficiency for most attack traffic. As shown in Table 7, the proposed DL-IDS intrusion detection model has a very low false alarm rate and classifies network traffic more accurately than the CNN or LSTM alone. Table 8 compares the models using CNN and LSTM with traditional machine learning algorithms [33].
The DL-IDS model achieves the best performance among them, with the highest ACC value and the lowest FPR value.
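For reference, the ACC, TPR, and FPR values compared in these tables can be computed per traffic class from predictions as sketched below. The sketch assumes the standard one-vs-rest confusion-matrix definitions; whether the per-class figures in the tables were computed exactly this way is an assumption. The function name `one_vs_rest_metrics` is hypothetical.

```python
from typing import Dict, Sequence


def one_vs_rest_metrics(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str
) -> Dict[str, float]:
    """Treat `positive` as the attack class and every other label as negative."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "ACC": (tp + tn) / total if total else 0.0,   # share of correct decisions
        "TPR": tp / (tp + fn) if (tp + fn) else 0.0,  # detection rate
        "FPR": fp / (fp + tn) if (fp + tn) else 0.0,  # false alarm rate
    }


# Toy example with five flows.
truth = ["PortScan", "PortScan", "BENIGN", "BENIGN", "DoS"]
preds = ["PortScan", "BENIGN", "BENIGN", "PortScan", "DoS"]
portscan_metrics = one_vs_rest_metrics(truth, preds, positive="PortScan")
```

Under this one-vs-rest definition, a class's accuracy counts correctly rejected flows of other classes as correct decisions, so per-class accuracy is never lower than the overall multiclass accuracy on the same predictions.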
The data input to DL-IDS is raw network traffic. There is no separate feature extraction step in the model, so the training and testing times include the feature extraction time. The traditional machine learning algorithms do not account for data extraction or processing time, so we could not directly compare the time consumption of the various algorithms in Table 8. The training time and testing time of the model were under 600 s and 100 s, respectively, so we believe that DL-IDS achieves optimal detection results in the same time frame as the traditional algorithms.

Conclusions and Future Research Directions
In this study, we proposed a DL-based intrusion detection system named DL-IDS, which uses a hybrid of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) to extract features from the network data flow and analyze network traffic. In DL-IDS, the CNN and LSTM extract the spatial features of a single packet and the temporal features of the data stream, respectively, and then fuse them, which improves the performance of the intrusion detection system. Moreover, DL-IDS uses category weights for optimization in the training phase. This optimization method reduced the adverse effect of the unbalanced number of samples across attack types in the training set and improved the robustness of the model.
To evaluate the proposed system, we experimented on the CICIDS2017 dataset, which is often used by researchers as a benchmark. Normal traffic data and attack data of six typical types (FTP-Patator, SSH-Patator, DoS, Heartbleed, Infiltration, and PortScan) were selected to test the ability of DL-IDS to detect attack data. We also used the same data to test the CNN-only model, the LSTM-only model, and some commonly used machine learning models.
The results show that DL-IDS reached 98.67% in overall accuracy and 93.32% in F1-score, outperforming all of the machine learning models. Compared with the CNN-only and LSTM-only models, DL-IDS reached over 99.50% accuracy on all attack types and achieved the best performance of the three.
There are still certain drawbacks to the proposed model, including low detection accuracy on Heartbleed and SSH-Patator attacks due to a lack of data. Generative Adversarial Networks (GANs) may be considered to overcome this drawback to some degree. Further, combining the model with some traditional traffic features may enhance overall performance. We plan to address these problems in further research.

Data Availability
Data will be made available upon request.

Conflicts of Interest
The authors declare no conflicts of interest.