A SYN Flood Attack Detection Method Based on Hierarchical Multihead Self-Attention Mechanism

Existing SYN flood attack detection methods suffer from poor feature selectivity, weak generalization ability, a tendency to overfit, and low accuracy during training. In this paper, we present a SYN flood attack detection method based on the Hierarchical Multihead Self-Attention (HMHSA) mechanism. First, we use one-hot encoding and normalization to preprocess traffic data. Then the preprocessed traffic data is passed to the Feature-based Multihead Self-Attention (FBMHA) layer for feature selection. Finally, we use data slices to capture the time-series features of the preprocessed traffic data by passing it into the Slice-based Multihead Self-Attention (SBMHA) layer. We tested the proposed method on different datasets. The experimental results show that, compared with other works, our method achieves better feature selection and higher detection accuracy (up to 99.97%).


Introduction
With the development of the shared and open Internet, network security is facing unprecedented challenges. Distributed denial of service (DDoS) attacks have long been a challenge for cyberspace security, and their number, frequency, complexity, and impact are growing rapidly. The attack methods have become particularly difficult to mitigate [1]. The SYN flood attack is one of the most popular DDoS attack methods; it mainly exploits a weakness in the three-way handshake of the TCP protocol together with IP spoofing techniques. The three-way handshake mechanism establishes the TCP connection between the client and the server. To establish a TCP connection, the client must send a synchronize (SYN) packet to the server. After receiving the SYN packet from the client, the server returns a SYN-ACK packet. When the client receives the SYN-ACK packet, it sends an ACK packet to the server. At this point, the three-way handshake is complete [2]. An attacker exploits the server's half-open connection state (SYN_RECV) to perform the SYN flood attack on the server. The attacker sends a large number of SYN request packets with forged source IP addresses. The server treats these requests as legitimate. First, the server allocates memory and resources for these source IPs. Then, it sends the SYN-ACK packet to the client and finally waits for the client's ACK packet in the half-open state. Because the source addresses are forged, the ACK never arrives; the flood of illegitimate SYN requests causes the TCP backlog queue to overflow and keeps creating half-open connections until the system resources are exhausted. Many operating systems, and even firewalls and routers, are unable to defend against this attack effectively, and SYN flood attacks have a huge impact on fields such as finance, education, and media. The attack principle is depicted in Figure 1.
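The backlog exhaustion described above can be illustrated with a toy simulation. This is only a minimal sketch under simplifying assumptions: the function name `simulate_backlog`, the event format, and the fixed `timeout` are hypothetical and do not model any particular TCP stack.

```python
def simulate_backlog(events, backlog_size, timeout):
    """Toy model of a server's SYN backlog (half-open connection queue).

    events: list of (time, kind, src) tuples, where kind is "SYN" or "ACK".
    Returns the number of SYN requests refused because the backlog was full.
    """
    half_open = {}  # src -> arrival time of its SYN (state SYN_RECV)
    dropped = 0
    for now, kind, src in sorted(events):
        # Expire half-open entries whose final ACK never arrived in time.
        for stale in [s for s, t in half_open.items() if now - t >= timeout]:
            del half_open[stale]
        if kind == "SYN":
            if len(half_open) >= backlog_size:
                dropped += 1           # queue overflow: request is refused
            else:
                half_open[src] = now   # server answers SYN-ACK and waits
        elif kind == "ACK":
            half_open.pop(src, None)   # handshake completed, slot is freed
    return dropped


# Legitimate clients answer the SYN-ACK, so their slots are freed quickly.
legit = [(t, "SYN", f"client{t}") for t in range(10)]
legit += [(t + 0.5, "ACK", f"client{t}") for t in range(10)]
# Spoofed sources never send the final ACK, so every slot stays occupied.
flood = [(0, "SYN", f"spoof{i}") for i in range(100)]

print(simulate_backlog(legit, backlog_size=8, timeout=30))
print(simulate_backlog(flood, backlog_size=8, timeout=30))
```

In this toy model the legitimate trace never fills the queue, while the spoofed trace occupies all eight slots and every later request is refused until the entries time out.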
Deep learning provides a new approach to network anomaly traffic detection. However, for long sequences, the information stored by encoded vectors and the relationships between them are limited by the distance between sequence elements, which results in the loss of important features. Therefore, we propose a SYN flood attack detection method based on the HMHSA mechanism. The method uses a Bidirectional Gated Recurrent Unit (Bi-GRU) neural network to encode the input sequence and fully considers the influence of the information before and after each attribute. Then we add the Multihead Self-Attention mechanism to learn dependencies between sequences and extract salient features. In the self-attention mechanism, attention is computed between each datum and all other data, so the maximum path length between any two positions is 1 no matter how far apart they are. Therefore, it can better capture long-distance dependencies. Moreover, the multiple heads can learn relevant information from different representation subspaces to improve feature selection. The experimental results verify the effectiveness of the HMHSA mechanism. Compared with other methods, it improves the accuracy of SYN flood attack detection. The main contributions of this paper are as follows:
(1) We apply a double-layer Bi-GRU to encode the input sequence; it has fewer parameters and good feature selectivity, and it can improve the weak adaptability of the network.
(2) We add the Multihead Self-Attention mechanism to highlight important features. Compared with the plain attention mechanism, Multihead Self-Attention can learn the deep feature information of a long sequence and can improve the accuracy while preventing overfitting.
(3) We verify the generalization ability of our method with three different datasets, the CICDDoS2019 dataset [29], the Mirai dataset [3], and the TDS_SELF dataset (a self-made traffic dataset), and the accuracy reaches up to 99.97%.
The rest of this paper is organized as follows. In Section 2, we discuss relevant work in the domain of SYN flood attack detection. Section 3 details the proposed methodology. The results and analysis of the experiments are given in Section 4. Finally, Section 5 presents the conclusion of this paper and the direction of future work.

Related Work
At present, research on SYN flood attack detection can be divided into three categories: statistical methods [4,5], machine learning methods [6][7][8][9], and deep learning methods. The statistical methods require feature vector extraction based on expert knowledge, but the uncertainty of human factors may affect accuracy. Traditional machine learning methods, owing to their limitations, cannot obtain deep features from long sequences of attack traffic.
In recent years, deep learning has been effective in DDoS detection and SYN flood detection. In 2017, Yuan et al. [10] proposed a DDoS detection method based on the long short-term memory (LSTM) network using the UNB ISCX 2012 dataset. It detects abnormal traffic by extracting 20 fields from a sequence of continuous-flow packets and using a sliding time window. Brun et al. [11] proposed a deep learning method based on the dense random neural network, analyzed the SYN flood attack on the IoT network, determined the related indicators of different attacks, and explained how to calculate features from packet captures. Li and Lu [12] developed a DDoS detection mechanism based on LSTM and Bayes (LSTM-BA). Through the LSTM method, the DDoS attacks for which the output has high confidence are identified directly. For outputs with low confidence, the Bayes method is further used for a secondary judgment to improve accuracy. Shaahan et al. [13] proposed applying a convolutional neural network (CNN) to DDoS detection. The results show that CNN achieves superior performance compared to traditional machine learning methods. However, the CNN convolution kernel still needed to be optimized. Asad et al. [14] used a feedforward backpropagation architecture with seven hidden layers to classify network flows. Evmorfos et al. [15] compared the random neural network with the LSTM. The experimental results show that the random neural network provides better attack detection and significantly reduces the error rate [10]. Odumuyiwa et al. [16] compared unsupervised learning algorithms for DDoS detection and found that the autoencoder works best, but it needed to be replicated in larger systems to detect damaged endpoints. Nagaraju et al. [17] proposed a binary fruit fly algorithm for a real-time prediction model of SYN flood attacks, which used swarm intelligence to find the optimal parameters. However, the authors trained on only one dataset, which cannot verify the generalization ability of the proposed model. Rehman et al. [18] proposed and evaluated four anomaly detection algorithms for DDoS attacks: gated recurrent unit (GRU), recurrent neural network (RNN), naïve Bayes (NB), and sequential minimal optimization (SMO). SMO had the best effect on SYN flood detection. Britto and Priya [19] proposed an improved DDoS attack detection method for the cloud. It uses the deep belief network and the support vector machine (SVM) as a learning mechanism to improve detection accuracy. XGBoost has been used as a classification model in the literature [20,21]. The XGBoost algorithm has higher accuracy and a lower false positive rate than the other algorithms compared. In addition, XGBoost detects attacks extraordinarily quickly. Ravi et al. [22] proposed AEGIS (a similarity measure) to detect and mitigate SYN flood attacks on SDN controllers through regular checks. Besides, Wan et al. [23] proposed a similarity measure. According to the similarity of the data collected by the nodes, a correlation function is defined in fuzzy theory to classify nodes and select redundant nodes. The PSO method [24][25][26][27] was also used to detect SYN flood attacks. This defense strategy improves the performance of the system in both memory usage and attack request dwell time.
From the related research above, it can be seen that the LSTM method is used the most, but LSTM has too many parameters and trains slowly. The accuracy of the other methods needs to be improved. Besides, most of these studies do not consider the impact of time series on SYN flood attack detection.

Proposed Methodology
First, the HMHSA mechanism performs data preprocessing, including missing value processing, data transformation, and normalization. Then, we build and train the HMHSA mechanism. Finally, the vector with high weight is extracted, and the classification results are obtained by the Softmax function. The overall architecture of the HMHSA mechanism is shown in Figure 2.

3.1. Bi-GRU Neural Network. GRU is a lightweight version of LSTM. GRU has only two gate structures, namely, the update gate and the reset gate. The structure of the GRU neural unit is depicted in Figure 3.
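For reference, the gates discussed in the description that follows match the standard GRU formulation, which can be written as shown here. This is the common textbook form, not a copy of the paper's own equations: the weight matrices $W_z$, $W_r$, and $W_h$, the concatenation notation $[\cdot,\cdot]$, the omission of bias terms, and the choice of which term $z_t$ multiplies are all assumptions.

```latex
\begin{aligned}
z_t  &= \sigma\left(W_z \cdot [h_{t-1}, x_t]\right) \\
r_t  &= \sigma\left(W_r \cdot [h_{t-1}, x_t]\right) \\
h_t' &= \tanh\left(W_h \cdot [r_t \odot h_{t-1}, x_t]\right) \\
h_t  &= (1 - z_t) \odot h_{t-1} + z_t \odot h_t' \\
h_t^{\mathrm{Bi}} &= \left[\overrightarrow{h_t}; \overleftarrow{h_t}\right]
\end{aligned}
```

The last line expresses the usual bidirectional construction, in which the forward and backward hidden states of the same timestep are concatenated.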
The GRU uses the update gate to control how much information from the previous memory is carried over to the current timestep and uses the reset gate to determine how to combine the new input information with the historical information [28]. A GRU network is defined by the update gate $z_t$, the reset gate $r_t$, the current memory content $h_t'$, and the final state $h_t$. $z_t$ is the activation result of the update gate. The input vector $x_t$ of timestep $t$ and the information $h_{t-1}$ of the previous timestep are linearly transformed and then passed into the Sigmoid activation function, which maps the variables to values between 0 and 1. $r_t$ is the result of the reset gate, which measures how far the gate is open. $h_t'$ is the current memory content, computed with the reset gate; the Hadamard product, denoted by $\odot$, determines which previous information is retained and which is forgotten. $h_t$ is the final updated node state. The advantage of GRU is that it can discard and retain information simultaneously by using the single gate $z_t$. In this paper, we use a bidirectional GRU to process the input sequence forward and backward so that the output node of each timestep contains the complete past and future information around the current moment in the input sequence. The Bi-GRU output at each timestep combines the forward and backward hidden states.
3.2. Multihead Self-Attention Mechanism. Attention mechanisms can selectively focus on the important information of the research subject. Taking the Transformer model as a reference, we use the Multihead Self-Attention mechanism to extract the dependencies in the sequence space. This mechanism captures the information of the same sequence in different subspaces by combining multiple parallel self-attention calculations and thus obtains more comprehensive correlation features from multiple perspectives and levels. The structure of the Multihead Self-Attention mechanism is shown in Figure 4; it mainly includes two parts.
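(1) Scaled dot-product attention: The scaling factor prevents the dot products from becoming too large when the vector dimension is high. It is computed as $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^T / \sqrt{d_k}\right)V$. The inputs are composed of the query (Q), key (K), and value (V) matrices. When Q = K = V, it is a self-attention mechanism. $QK^T$ is the attention matrix, and dividing by $\sqrt{d_k}$ rescales the attention matrix toward a standard normal distribution.
(2) Attention calculation: The original Q, K, and V are linearly mapped several times, each mapping is fed into the scaled dot-product attention, and the result obtained each time is called a head, which is computed as $\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$. The heads are then concatenated and projected with the output weight matrix: $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O$.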
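The two parts above can be illustrated with a small, dependency-free sketch. It is not the implementation used in the paper: the helper names, the random initialization of the projection matrices (which would be learned in practice), and the plain-list matrix representation are assumptions made only for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V; returns (output, weights)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Stand-in for a learned projection matrix (assumption: random values)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def multihead_self_attention(x, num_heads, d_head, rng):
    """Self-attention (Q = K = V = x) with num_heads parallel heads."""
    d_model = len(x[0])
    heads = []
    for _head in range(num_heads):
        w_q = random_matrix(d_model, d_head, rng)
        w_k = random_matrix(d_model, d_head, rng)
        w_v = random_matrix(d_model, d_head, rng)
        head, _weights = scaled_dot_product_attention(
            matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        heads.append(head)
    # Concatenate the heads along the feature axis, then project with W^O.
    concat = [sum((h[i] for h in heads), []) for i in range(len(x))]
    w_o = random_matrix(num_heads * d_head, d_model, rng)
    return matmul(concat, w_o)


# Three sequence positions with four features each, two heads as in the paper.
x = [[1.0, 0.0, 0.5, 0.2], [0.3, 0.8, 0.1, 0.0], [0.0, 0.4, 0.9, 0.7]]
fused = multihead_self_attention(x, num_heads=2, d_head=2, rng=random.Random(0))
print(len(fused), len(fused[0]))
```

Each row of the attention weights sums to 1, and the output keeps the input shape, so the layer can be placed directly after a recurrent encoder.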

3.3. The HMHSA Mechanism. The HMHSA mechanism consists of two Bi-GRU layers, and each layer introduces a Multihead Self-Attention mechanism. Feature-based Multihead Self-Attention is used to enhance the expression of traffic features, and Slice-based Multihead Self-Attention is used for grouping traffic data. Figure 5 shows the structural outline of the HMHSA mechanism.
3.3.1. FBMHA Layer. Not all features are equally important, so to fully capture the significant features, we use this mechanism to determine which features should be the focus.
(1) First, given an N-dimensional sample, the word vector of the byte data is encoded by the Bi-GRU neural network, which learns the characteristics of the byte data and generates $h_i^j$.
(2) The FBMHA mechanism is introduced to calculate the weight distribution of the sequence data and to highlight the sequence information with important contributions. The input vector comes from the output vector of the Bi-GRU layer; $d$ is the dimension of the Q and K vectors; Q, K, and V are linearly transformed with different parameters $W_i^Q$, $W_i^K$, and $W_i^V$. We set the number of heads $h$ to 2. After two calculations of scaled dot-product attention, the output weight matrix $W^O$ is used to connect the two parallel heads and obtain the result $a_i^j$. After obtaining the weight matrix of the sequence data, the weighted sum of the weight matrix $a_i^j$ and the byte information feature vector $h_i^j$ is calculated, and the feature representation of each byte is updated to obtain the final output $S_i$.
3.3.2. SBMHA Layer. Intrusion detection traffic data are related to time. The traffic information of multiple adjacent matrices helps to determine the type of the current traffic. Grouping the traffic data in this way is called data slicing. When each traffic group is synthesized into one large data packet, the Bi-GRU network ignores the important influence of some key packet information on the classification results. By introducing the attention mechanism, the weight distribution of the data packets is calculated, and the traffic information with important contributions within a group is highlighted.
(1) The $S_i$ generated by the upper layer is used to generate the feature vector $h_i$ of the network flow through the Bi-GRU neural network.
(2) The SBMHA mechanism is introduced. For each timestep, the corresponding hidden state $h_i$ is fed through a single-layer perceptron to obtain $u_i$ as the hidden representation of $h_i$, where $W_w$ and $b_w$ represent the weight vector and the bias term, respectively. We evaluate the importance of each slice at different times using the similarity of $u_i$ and $u_s$, where $u_s$ is the adjacent slice traffic vector. The fusion feature $a_i$ is obtained through the Multihead Self-Attention layer. Finally, the weighted sum is calculated and expressed as the context vector $v_i$, where $v_i$ is the weighted sum of the weight matrix $a_i$ and the data flow information feature vector $h_i$.
The algorithmic phases of the HMHSA mechanism are summarized in Algorithm 1.
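Separately from Algorithm 1, the slice-level weighting described in this subsection can be illustrated with a minimal single-head sketch. Several details are assumptions because the corresponding equations are not available here: the `tanh` activation of the single-layer perceptron, the dot product as the similarity measure, and the softmax normalization follow the common attention-pooling pattern, and the function name `attention_pool` is hypothetical.

```python
import math


def attention_pool(hidden_states, w, b, u_s):
    """Collapse per-slice hidden states h_i into one context vector v.

    hidden_states: list of vectors h_i, one per slice (timestep).
    w, b: weight matrix and bias of the single-layer perceptron.
    u_s: reference vector that each hidden representation is compared with.
    Returns (v, alphas), where alphas are the normalized slice weights.
    """
    # u_i = tanh(W h_i + b): hidden representation of h_i (tanh is assumed).
    u = [[math.tanh(sum(wj * hj for wj, hj in zip(row, h)) + bi)
          for row, bi in zip(w, b)]
         for h in hidden_states]
    # Similarity of each u_i with u_s (dot product), normalized with softmax.
    scores = [sum(ui * us for ui, us in zip(u_i, u_s)) for u_i in u]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # v = sum_i alpha_i * h_i: weighted sum of the hidden states.
    dim = len(hidden_states[0])
    v = [sum(a * h[d] for a, h in zip(alphas, hidden_states)) for d in range(dim)]
    return v, alphas


hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
context, weights = attention_pool(hidden, identity, [0.0, 0.0], [1.0, 1.0])
print(weights)
```

With these toy inputs the third slice is the most similar to the reference vector and therefore receives the largest weight in the context vector.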

Dataset.
In order to verify the performance of HMHSA mechanism, experiments are carried out on three datasets, and the data statistics are shown in Table 1.

CICDDoS2019 Dataset.
The CICDDoS2019 dataset contains normal traffic and the latest common DDoS attacks. At present, many DDoS attack detection studies [30][31][32][33][34] are based on this dataset. The network traffic was analyzed with CICFlowMeter-V3, which labels flows by timestamp, source and destination IP, source and destination port, protocol, and attack type, and extracts more than 80 traffic features [29]. Table 2 lists the features selected from the CICDDoS2019 dataset for our experiment, chosen by analyzing the characteristics of SYN flood traffic.

Mirai Dataset.
The Mirai dataset was created by Meidan et al. [3]. Mirai is a specific type of botnet malware that takes over networked Linux devices and turns them into bots used for distributed attacks such as DDoS. The Mirai dataset contains a large number of SYN flood instances.

TDS_SELF Dataset.
The TDS_SELF dataset consists of our real local network traffic and SYN flood attack traffic. The SYN flood attack traffic is generated with the "hping3" tool, as shown in Figure 6. Wireshark is used to capture packets, and tcp.srcport, tcp.dstport, tcp.flags.ack, tcp.flags.syn, tcp.flags.fin, etc. are selected as features through CICFlowMeter, as shown in Figure 7.

Data Preprocessing.
Preprocessing mainly includes feature transformation and feature normalization. Feature transformation uses one-hot encoding to digitize the sequence; one-hot encoding extends the values of discrete features into Euclidean space, making the distance calculation between features more reasonable [35]. We use the Min-Max method for normalization, where min is the minimum value of the sample data and max is the maximum value of the sample data. The Min-Max formula is as follows: $x' = (x - \min) / (\max - \min)$.
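The two preprocessing steps can be sketched as follows. This is a minimal illustration rather than the paper's pipeline: the function names, the alphabetical ordering of categories, and the handling of constant columns are assumptions.

```python
def one_hot(values):
    """Map each distinct categorical value to a one-hot vector.

    Returns (vectors, categories); categories are sorted so that the
    column order is deterministic (an assumption made for this sketch).
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = [[1 if index[v] == i else 0 for i in range(len(categories))]
               for v in values]
    return vectors, categories


def min_max(column):
    """Scale a numeric column to [0, 1] with x' = (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: avoid division by zero and map everything to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


protocols, names = one_hot(["TCP", "UDP", "TCP"])
print(names, protocols)
print(min_max([10, 20, 30]))
```

Discrete fields such as the protocol become indicator columns, while numeric fields such as packet counts are rescaled so that no single feature dominates because of its range.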

Evaluation.
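In order to evaluate the performance of the HMHSA mechanism in detecting SYN flood attacks, we use four indicators: accuracy, precision, recall, and F1-score [36]. In the following, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
(1) Accuracy. Measures the proportion of samples in the dataset that the model predicts correctly: $\mathrm{Accuracy} = (TP + TN) / (TP + TN + FP + FN)$.
(2) Precision. Measures the proportion of samples predicted to be positive that are actually positive: $\mathrm{Precision} = TP / (TP + FP)$.
(3) Recall. Measures the proportion of actual positive samples that the model finds: $\mathrm{Recall} = TP / (TP + FN)$.
(4) F1-score. The balance between precision and recall: $\mathrm{F1} = 2 \cdot \mathrm{Precision} \cdot \mathrm{Recall} / (\mathrm{Precision} + \mathrm{Recall})$.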
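The four indicators can be computed from the confusion counts as in the following sketch. The function name and the convention of returning 0.0 when a denominator is zero are assumptions for illustration; the label 1 is assumed to mark attack traffic.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1-score for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators (assumed convention: report 0.0).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Three attack flows and five benign flows; one miss and one false alarm.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
preds = [1, 1, 0, 1, 0, 0, 0, 0]
print(classification_metrics(truth, preds))
```

In this toy case accuracy is 0.75 while precision, recall, and F1-score are each 2/3, which shows why accuracy alone can look better than the detector's behavior on the attack class.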

Experimental Configuration.
The method proposed in this paper is implemented and verified on a PC. The experimental environment is a Lenovo Legion R7000 with an R7-4800H CPU, 16 GB of memory, and a 512 GB hard disk; the operating system is Windows 1940; the development environment is PyCharm 2021 with Python 3.6; and the neural network framework is TensorFlow 1.8.0 with Keras 2.1.6.
For the parameter setting, we obtained the optimal parameters through several experiments, and the basic parameter settings of the method are shown in Table 3.

Experimental Analysis.
In this paper, four experiments were conducted: (1) determination of the number of heads, where a suitable number of heads was selected by comparing the F1-score; (2) analysis of the training results, which showed that the accuracy tended to be stable at epoch 4, indicating that the method works well and that a small number of training epochs can achieve high accuracy; (3) determination of the timestep, where the timestep of the data slicing layer was selected randomly to discuss its influence on the convergence performance; (4) comparison of different attention mechanisms, which further verified the superiority of the attention chosen in this study.
An appropriate number of heads can extract the key spatial characteristics of data packets more accurately. Too many or too few heads may cause effective features to be lost or interfered with. In this paper, we vary the number of heads (N_head) from 1 to 6, and the experimental results are shown in Figure 8. It can be seen that when the number of heads is 2, the classification result (F1-score) is better than with any other number of heads.

Determination of Timestep.
By comparing the loss values on the three datasets at different timesteps, we find that the minimum loss values, 0.0010, 0.0012, and 0.0026, are reached when the timestep is 3. When the timestep exceeds 3, the loss increases to some extent, so the timestep is set to 3 (see Figure 9).
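To make the role of the timestep concrete, the following sketch groups consecutive flow records into slices of a given length. The sliding-window construction with a stride of one is an assumption; the paper does not state here whether slices overlap, and `make_slices` is a hypothetical helper name.

```python
def make_slices(flows, timestep):
    """Group consecutive flow records into overlapping slices.

    flows: list of per-flow feature vectors (or any records), in time order.
    timestep: number of adjacent flows merged into one slice.
    Returns a list of slices; it is empty if there are fewer than
    `timestep` flows.
    """
    return [flows[i:i + timestep] for i in range(len(flows) - timestep + 1)]


# Five flows with a timestep of 3 give three slices of adjacent flows.
print(make_slices(["f1", "f2", "f3", "f4", "f5"], timestep=3))
```

Each slice then plays the role of one input sequence for the slice-level Bi-GRU, so a larger timestep gives more history per decision at the cost of longer sequences.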

Analysis of Training Results.
The HMHSA mechanism was trained on the CICDDoS2019 dataset and the TDS_SELF dataset. The results are shown in Figures 10(a) and 10(b), where Acc_Train and Acc_Test are the accuracies on the training set and the testing set, respectively. As can be seen from the figures, when the epoch is 4, the highest accuracies are 99.96% and 99.97%, respectively, and the accuracies tend to be stable.
In order to further verify the generalization ability of the model, we trained on the Mirai dataset and added an Early Stopping function to prevent overfitting: if training does not improve after a certain number of epochs, it is stopped. We set the upper limit of epochs to 100 and stopped training early when the loss did not decrease for seven consecutive epochs, which further improved the network fit. The experimental result is shown in Figure 10(c). The accuracy of the model on the Mirai dataset can reach 99.97%.
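The stopping rule described above (at most 100 epochs, stop after seven consecutive epochs without a lower loss) can be sketched independently of any framework. The function below is an illustration under assumptions: `loss_per_epoch` is a hypothetical stand-in for running one real training epoch, and the paper does not state which implementation was used.

```python
def train_with_early_stopping(loss_per_epoch, max_epochs=100, patience=7):
    """Return (last_epoch, best_loss) under a patience-based stopping rule.

    loss_per_epoch: callable mapping a 1-based epoch number to that epoch's
    loss; in real training this would run one epoch and return its loss.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(1, max_epochs + 1):
        loss = loss_per_epoch(epoch)
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_loss  # stop early
    return max_epochs, best_loss


# Synthetic loss curve: improves for three epochs, then plateaus.
losses = [0.9, 0.5, 0.3] + [0.3] * 20
print(train_with_early_stopping(lambda epoch: losses[epoch - 1]))
```

With this synthetic curve the best loss is reached at epoch 3 and training stops at epoch 10, after seven epochs without improvement. In Keras, a rule of this kind is commonly expressed with the `EarlyStopping` callback and its `monitor` and `patience` arguments.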

Comparison of Different Attention Mechanisms.
In order to evaluate the effectiveness of the HMHSA mechanism, we designed seven different attention mechanisms for comparison. They are No Attention (Bi-GRU), Single Attention (SA), Single Self-Attention (SSA), Single Multihead Self-Attention (SMHSA), Hierarchical Attention (HA), Hierarchical Self-Attention (HASA), and the HMHSA mechanism. The experiments are carried out on three datasets. The accuracy, precision, recall, and F1-score of the different structures are shown in Figures 11-13. The comparative experimental results show that the proposed method can achieve good results in SYN flood attack detection.

Conclusion
In this study, we have proposed a SYN flood attack detection method based on the Hierarchical Multihead Self-Attention mechanism. First, Bi-GRU is used to learn the feature information of the byte data, and the weight distribution of the byte data is then calculated through the Multihead Self-Attention mechanism to capture the internal correlation of the byte data and to highlight the byte information with important contributions.
Then, the value of the timestep is controlled to perform traffic slicing, the historical and current data flows are merged, the data flow feature information is learned through Bi-GRU, and the data flow weight distribution is further calculated through Multihead Self-Attention. Finally, the classification is performed by the Softmax function.
Results illustrate that the method can perform better feature selection and improve the accuracy. Experiments are performed on the public network dataset CICDDoS2019, Mirai dataset, and the simulated TDS_SELF dataset, and the accuracy can reach 99.96% (CICDDoS2019), 99.97% (Mirai dataset), and 99.97% (TDS_SELF).
A fast and accurate defense mechanism is needed to detect TCP-SYN flood attacks. In future work, we plan to further optimize the model, simplify its network structure, and design a more lightweight and more reliable model with a lower false positive rate. Because of the added attention mechanism, our model is computationally slower, so future work will focus on balancing efficiency and accuracy and on further studying the evolution of attention. We plan to evaluate the performance of the proposed detection method against low-rate SYN flood attacks, in which the attack traffic resembles the overall background traffic. In addition, we will evaluate the proposed method on a real testbed using a larger real network traffic dataset.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.