WEB DDoS Attack Detection Method Based on Semisupervised Learning

As the services on the Internet become increasingly abundant, all walks of life are inextricably linked with the Internet. At the same time, WEB attacks on the Internet have never stopped. Compared with other common WEB attacks, WEB DDoS (distributed denial of service) attacks can seriously damage the availability of the target network or system resources within a short period of time. At present, most research centers on machine learning-based DDoS attack detection algorithms. According to previous studies, unsupervised methods generally have a high false positive rate, while supervised methods cannot handle large amounts of network traffic data, and their performance is often limited by noise and irrelevant data. Therefore, this paper proposes a semisupervised learning detection model that combines spectral clustering and random forest to detect DDoS attacks at the WEB application layer and compares it with other existing detection schemes for verification. The results show that, while ensuring a low false positive rate, the proposed model achieves a certain improvement in the detection rate, which makes it more suitable for WEB application layer DDoS attack detection.


Introduction
With the rapid development of the Internet, the services it offers are increasing, and all walks of life are inextricably linked with it. People have become increasingly dependent on the Internet, whether for online shopping or for travel. However, while the Internet is developing comprehensively and rapidly, attacks on it persist and change constantly. Among them, WEB applications have become the focus of attacks because of their wide range of uses. Common WEB attacks [1] include WEB DDoS attacks, cross-site scripting attacks, and request forgery attacks. With the development of distributed technology and the proliferation of botnets, WEB DDoS attacks have become the most threatening kind of attack, since they can seriously damage the availability of target networks or system resources even when the attack lasts only a short time.
WEB DDoS attacks have three characteristics: they are distributed, they develop rapidly, and they are destructive [2]. Traditional attack detection methods cannot detect WEB DDoS attacks effectively and accurately, and with the development of machine learning, many researchers have applied it to WEB DDoS detection. Machine learning algorithms [3][4][5][6][7] fall into two types: unsupervised learning and supervised learning. However, the unsupervised method alone has a high false positive rate, while the supervised method alone cannot handle a large number of unknown attacks. To deal with new types of attacks in network traffic data, researchers have used K-means + C4.5 for attack detection, which has been shown experimentally to achieve a higher detection rate than supervised or unsupervised algorithms used alone. However, because K-means and C4.5 perform worse than other current machine learning algorithms, its detection accuracy and false positive rate leave considerable room for improvement. Therefore, this paper studies and proposes a detection method for WEB DDoS attacks.

Related Work
The focus of this paper is on DDoS attacks at the WEB application layer. Research in this direction has never stopped at home and abroad. Moreover, with the development of machine learning technology, machine learning methods have become mainstream in DDoS detection research. Both Kim et al. [4] use machine learning methods to identify network traffic. The former concludes that DBSCAN is more suitable for clustering. The latter shows that the support vector machine (SVM) performs better in detecting attacks. Calix and Rajesh [5] tested the SVM algorithm on the NSL-KDD data set and obtained an accuracy rate of less than 80%. Reference [8] clusters users with the K-means clustering algorithm, which can be achieved by uniform clustering. Panda [9] compared several classification algorithms, among which a random forest-based ensemble classifier can achieve 99% accuracy. Muniyandi et al. [7] proposed a hybrid algorithm using K-means + C4.5 for attack detection, whose detection rate is higher than that of a supervised algorithm or an unsupervised algorithm used alone.
The DDoS detection methods in the literature are mainly divided into two categories: unsupervised methods and supervised methods.
There are three main problems, depending on the benchmark data set used: (1) The false positive rate of unsupervised methods is often high. (2) The supervised method cannot handle large volumes of network traffic data or new types of attacks, and its performance is often limited by noise and irrelevant data. (3) Since the K-means + C4.5 method relies on the K-means and C4.5 algorithms, whose performance is insufficient compared with other current machine learning algorithms, its detection accuracy and false positive rate leave considerable room for improvement.
Based on the above three problems, this paper proposes a semisupervised learning model combining spectral clustering and random forest to detect WEB DDoS attacks. Compared with existing schemes, it improves the detection rate while keeping the false positive rate low, which makes it more suitable for current WEB DDoS attack detection.

Detection Methodology
In this paper, a semisupervised learning [10][11][12][13][14][15] model that combines unsupervised and supervised learning methods is used to detect WEB DDoS attacks, and the choice of learning methods has a great impact on the performance of this model.
First, candidate unsupervised models [16][17][18][19][20][21][22] include DBSCAN, K-means, and spectral clustering. The DBSCAN algorithm [23] takes a long time to converge when the sample data is very large and is not suitable for the big data network environment. Compared with K-means, the spectral clustering algorithm is very effective for clustering sparse data, which is difficult for K-means. In addition, because spectral clustering reduces the dimensionality of the network traffic data, its complexity on high-dimensional data is lower than that of traditional clustering methods such as K-means. Therefore, this paper chooses spectral clustering as the unsupervised learning algorithm of the semisupervised learning model.
Second, for the supervised model [24,25], the most commonly used algorithms include SVM, Naive Bayes, C4.5, and random forest. Lee et al. [26] compared these classification algorithms and showed that random forest gives the best classification results among them. Panda et al. [6] also compared several supervised algorithms on two-class classification and found that the ensemble classifier based on random forest is optimal and can achieve 99% accuracy. Based on the above research, this paper chooses random forest as the supervised learning algorithm of the semisupervised learning model.
This section applies the semisupervised learning model that combines the spectral clustering algorithm with random forest to detect WEB DDoS attacks. First, the principle and characteristics of spectral clustering in the model are introduced. Then, the principle and advantages of random forest, the classification algorithm applied in the model, are introduced. Finally, the design of the WEB DDoS detection model framework based on semisupervised learning combining spectral clustering and random forest is presented.

Spectral Clustering Algorithm Model.
The clustering algorithm used in this paper is spectral clustering, which is founded on spectral graph theory. Compared with traditional clustering algorithms, spectral clustering can better divide the sample data into clusters with high similarity regardless of the shape of the sample space. The principle of the spectral clustering algorithm [27] is as follows. First, the sample data set is transformed into a similarity matrix that reflects the similarity between the samples. Next, the eigenvalues and eigenvectors of the matrix are computed. Finally, the eigenvectors that cluster the data relatively well are selected. This algorithm can converge to the global optimal solution. In its early days, spectral clustering was studied little in computer applications, and its strong clustering ability was exploited mainly in computer vision and VLSI design. At present, it is also applied to clustering problems in machine learning and, through the efforts of scholars at home and abroad, it has become a popular clustering algorithm. Spectral clustering algorithms are divided into two types according to the partition criterion: 2-way and k-way. The 2-way methods include the PF algorithm, the SM algorithm, and the Mcut algorithm. Earlier spectral clustering algorithms generally used the 2-way method to partition and cluster data samples. However, most current research finds that partitioning and clustering with more eigenvectors by the k-way method gives better results. Ng et al. [28] proposed the NJW algorithm based on the k-way method: it solves for the k largest eigenvalues of the Laplacian matrix and their corresponding eigenvectors and orthogonalizes the k eigenvectors to obtain the sample space $R^k$, so that each original data point corresponds one-to-one to a point in $R^k$; finally, clustering is performed in $R^k$. The general process of the spectral clustering algorithm based on the NJW algorithm is shown in Figure 1.
In this process, when constructing the Laplacian matrix, memory consumption can be reduced by writing intermediate results to disk. When each row vector of the eigenvector matrix $X$ is converted into a unit vector to form the matrix $Y$, the entries are calculated by

$$Y_{ij} = \frac{X_{ij}}{\left(\sum_{j} X_{ij}^{2}\right)^{1/2}}. \tag{1}$$

When the rows of $Y$ are finally clustered by K-means, the data sample $y_i$ is assigned to cluster $j$ if and only if the $i$th row of $Y$ is assigned to cluster $j$.
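The NJW-style procedure described above can be sketched in a few lines. This is only an illustrative sketch, not the implementation used in the experiments: the Gaussian similarity with scale `sigma`, the normalized matrix $D^{-1/2} A D^{-1/2}$, the deterministic choice of initial K-means centers, and the function name `njw_spectral_clustering` are assumptions introduced here, and NumPy is assumed to be available.

```python
import numpy as np

def njw_spectral_clustering(points, k, sigma=1.0, n_iter=50):
    """Cluster the rows of `points` into k groups (NJW-style sketch)."""
    X = np.asarray(points, dtype=float)
    # Step 1: Gaussian similarity matrix with a zero diagonal (assumed kernel).
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    A = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # Step 2: normalized matrix L = D^{-1/2} A D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Step 3: eigenvectors of the k largest eigenvalues
    # (np.linalg.eigh returns eigenvalues in ascending order).
    _, vecs = np.linalg.eigh(L)
    V = vecs[:, -k:]
    # Step 4: rescale every row to unit length.
    Y = V / np.linalg.norm(V, axis=1, keepdims=True)
    # Step 5: K-means on the rows of Y. Initial centers are picked to be
    # as orthogonal to each other as possible (an assumption of this sketch).
    centers = [Y[0]]
    for _ in range(1, k):
        closeness = np.abs(Y @ np.array(centers).T).max(axis=1)
        centers.append(Y[np.argmin(closeness)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Y[labels == j].mean(axis=0)
    # Sample i belongs to cluster j exactly when row i of Y was put in cluster j.
    return labels
```

For two well-separated groups of points, the rows of the normalized eigenvector matrix collapse onto two nearly orthogonal directions, which is why the final K-means step becomes easy. The sketch builds the full similarity matrix in memory, so it is only practical for small samples.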

Random Forest Algorithm Model.
The random forest [29] builds on the basic idea of bagging: it trains a series of decision trees and improves the procedure according to the characteristics of decision trees. During training, it adopts random attribute selection to increase the relative independence of the constructed decision trees and thereby improve performance. Assuming that a node has $d$ candidate attributes, a traditional decision tree selects the best attribute from all $d$ attributes of the node, whereas each node of a decision tree in the random forest selects the best attribute from $k$ attributes that are randomly chosen beforehand. The value of $k$ determines the degree of randomness and is usually set to $\log_2 d$. In addition, $k$ can also be 1 or $d$, which corresponds to selecting an attribute at random and to the selection method of a conventional decision tree, respectively. The specific flow of the random forest algorithm is shown in Algorithm 1.
It can be seen from the training process that random forest makes only minor changes to bagging: it adds randomness in the feature attributes on top of the randomness in the samples, so the generalization performance of the final ensemble improves further. Because the random forest algorithm has low computational complexity, makes classification problems easy to solve, and often exhibits strong performance in practical applications, this paper uses random forest as the classifier in the model.
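The random attribute selection at a single tree node can be illustrated as follows. This is a minimal sketch under stated assumptions: the Gini impurity is used as the split criterion (the text does not name one), and the function names `gini` and `best_split_random_subset` are hypothetical.

```python
import math
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split_random_subset(rows, labels, k=None, rng=random):
    """Pick the best (attribute, threshold) among k randomly drawn attributes."""
    d = len(rows[0])
    if k is None:
        k = max(1, int(math.log2(d)))  # k = log2(d), the usual setting
    candidates = rng.sample(range(d), k)
    best = None  # (weighted impurity, attribute index, threshold)
    for a in candidates:
        for threshold in sorted({row[a] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[a] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[a] > threshold]
            if not left or not right:
                continue  # a split must send samples to both subnodes
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, a, threshold)
    return best, candidates
```

Setting `k` equal to the number of attributes reproduces the conventional decision tree choice, while `k=1` picks the split attribute purely at random, which matches the two extreme cases mentioned above.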

Attack Detection Model Framework Based on Semisupervised Learning.
The detection model proposed in this paper is based on the semisupervised learning model. The spectral clustering algorithm introduced in Section 3.1 is used as the unsupervised learning algorithm in the model [30][31][32][33][34][35][36][37][38][39], and the abovementioned random forest algorithm is used as the supervised learning algorithm in the model.
Through the cooperation of these two algorithms, this paper constructs a WEB DDoS attack detection framework based on semisupervised learning. The basic framework and process design are as follows. Since this semisupervised detection framework is based on machine learning algorithms, it resembles traditional machine learning algorithms [40][41][42][43][44][45] in that it includes a training process and a detection process. The general flow of these two processes is shown in Figure 2.
For the training phase, the data set $S$ is defined as $(X_i, Y_i)$, $i = 1, 2, \ldots, N$, where $X_i$ represents an N-dimensional matrix and $Y_i \in \{0, 1\}$, where 0 represents normal flow and 1 represents abnormal flow. In the training process, the training data set is first divided into $k$ disjoint clusters by spectral clustering. A separate random forest is then trained for each cluster on the data in that cluster.
For the detection phase, the spectral clustering method is used to calculate which cluster of the k clusters the test data sample belongs to, and the corresponding random forest classifier is found according to the cluster of the sample data to determine whether the data sample is normal data or abnormal data. Table 1 lists the hardware and software environments used in this experiment.
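The two phases can be sketched as a small cluster-then-classify wrapper. Several parts of this sketch are assumptions rather than the paper's design: the cluster assignments of the training data are passed in as `cluster_ids` (for example, from a spectral clustering step), a new sample is routed to the cluster with the nearest centroid, and a simple nearest-neighbour classifier stands in for the per-cluster random forest so that the example stays self-contained. The class names are hypothetical.

```python
from collections import defaultdict

class NearestNeighbour:
    """Placeholder classifier; the paper trains a random forest here."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict_one(self, x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in self.X]
        return self.y[dists.index(min(dists))]

class ClusterThenClassify:
    """Train one classifier per cluster and route test samples to it."""

    def __init__(self, make_classifier=NearestNeighbour):
        self.make_classifier = make_classifier
        self.centroids = {}
        self.classifiers = {}

    def fit(self, X, y, cluster_ids):
        # cluster_ids[i] is the cluster that the clustering step gave X[i].
        groups = defaultdict(list)
        for x, label, c in zip(X, y, cluster_ids):
            groups[c].append((x, label))
        for c, members in groups.items():
            xs = [x for x, _ in members]
            ys = [label for _, label in members]
            self.centroids[c] = [sum(col) / len(xs) for col in zip(*xs)]
            self.classifiers[c] = self.make_classifier().fit(xs, ys)
        return self

    def predict_one(self, x):
        # Assumption: a new sample goes to the cluster with the nearest centroid.
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[c]))

        cluster = min(self.centroids, key=sq_dist)
        return self.classifiers[cluster].predict_one(x)
```

Any object with `fit` and `predict_one` methods can be passed as `make_classifier`, so a random forest implementation can replace the placeholder without changing the routing logic.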

Extraction of Data Set.
This experiment uses fivefold cross-validation for testing. A total of 50,000 records are extracted from the NSL-KDD data set and divided into 5 equal parts. Each subdataset is divided into four categories according to the upper-layer service type: HTTP, SMTP, FTP, and others. Attack data accounts for 40% of the data in each category. The details of the data contained in each subdataset are shown in Table 2.
According to the k-fold cross-validation principle, each experiment selects as the test set a subdataset that has not been selected in any previous experiment. This subdataset is used to test the trained model, and the remaining subdatasets are used for model training. In this way, k experiments are performed, and the experimental results, that is, the performance of the model, are reflected by the average over the k experiments. The principle flow of the fivefold cross-validation algorithm and the data set of this experiment are shown in Figures 3 and 4, respectively.
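The k-fold selection rule can be sketched as below. The helper names `k_fold_indices` and `cross_validate`, the shuffling seed, and the `evaluate` callback are assumptions introduced for illustration.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for test_fold in range(k):
        test = folds[test_fold]
        train = [i for j, fold in enumerate(folds) if j != test_fold for i in fold]
        yield train, test

def cross_validate(n_samples, evaluate, k=5, seed=0):
    """Average the score returned by evaluate(train, test) over the k rounds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)
```

Here `evaluate` is expected to train a model on the `train` indices, test it on the `test` indices, and return one performance value, so the returned average corresponds to the reported performance of the model.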

Data Preprocessing.
The learning model's evaluation rules are learned from the labeled connections in the data set. These connections are TCP data messages sent and received by the same IP address in a unit of time, and each connection is labeled as normal or abnormal. The features of the NSL-KDD data set are divided into discrete and continuous types, and their ranges of values differ. Therefore, these features must be preprocessed. The preprocessing consists of converting discrete feature variables into continuous ones and normalizing the data. The two processes are described as follows.
First, the discrete feature variables need to be made continuous. The NSL-KDD data set contains both continuous and discrete variables, and the discrete feature variables cannot be quantified directly, so they must be converted into continuous values before the data is applied to the model. According to statistics, NSL-KDD contains 7 discrete feature variables, 5 of which can be represented by the values 0 or 1, namely, the is_guest_login, logged_in, land, flag, and is_host_login feature variables. The service and protocol_type feature variables require special conversion because they take several different values. The specific conversion methods are shown in Tables 3 and 4.
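A conversion of this kind might look like the following sketch. The numeric codes in `PROTOCOL_CODES` and `SERVICE_CODES` are hypothetical placeholders, since the actual mappings are those given in Tables 3 and 4, and the function name `encode_record` is likewise an assumption. The binary feature names follow the list in the text.

```python
# Hypothetical code tables; the actual mappings are those of Tables 3 and 4.
PROTOCOL_CODES = {"tcp": 1, "udp": 2, "icmp": 3}
SERVICE_CODES = {"http": 1, "smtp": 2, "ftp": 3}
OTHER_SERVICE_CODE = 4  # every service not listed above

# Features that the text says can be represented by 0 or 1.
BINARY_FEATURES = ("is_guest_login", "logged_in", "land", "flag", "is_host_login")

def encode_record(record):
    """Return a copy of `record` in which the discrete features are numeric."""
    encoded = dict(record)
    for name in BINARY_FEATURES:
        if name in encoded:
            encoded[name] = float(int(encoded[name]))
    encoded["protocol_type"] = float(PROTOCOL_CODES[record["protocol_type"]])
    encoded["service"] = float(SERVICE_CODES.get(record["service"], OTHER_SERVICE_CODE))
    return encoded
```

The input record is left unchanged, and continuous features such as a duration pass through untouched, so the output can be fed directly to the normalization step.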
Algorithm 1: Random forest algorithm.
(1) Input: training set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$
(2) Learning algorithm A
(3) Training argument m
(4) Output: strong classifier f(x)
(5) begin
(6)   for t = 1, 2, ..., T do
(7)     Produce a bootstrap sample set and name it $S_t$
(8)     Train a decision tree $T_j$ on $S_t$
(9)     while the number of samples corresponding to the leaf node is greater than $n_{min}$ do
(10)      Randomly select k variables from all optional d variables
(11)      Select from these k variables the variable that leads to the optimal partition
(12)      Divide the node into two subnodes according to the best variable selected above
(13)    end
(14)  end
(15) end

Second, the data must be normalized. The classification of data samples in this paper is obtained by calculating the degree of similarity between data samples. The data set contains many feature attributes, and the range and unit of each feature attribute are different. In order for the calculated similarity to better represent the differences between samples, data normalization is required. Data normalization refers to scaling feature attribute data proportionally so that the values fall into a specific interval, e.g., [−1, 1] or [0, 1]. This experiment uses the z-score method to normalize the experimental data.
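As an illustration of z-score normalization, each feature column can be shifted by its mean and divided by its standard deviation. This is a minimal sketch: the use of the population standard deviation, the handling of constant columns, and the function name `z_score_columns` are assumptions.

```python
import statistics

def z_score_columns(rows):
    """Normalize each feature column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]  # population std (assumed)
    return [
        # A constant column carries no information and is mapped to 0.0.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]
```

Note that z-score values are not confined to a fixed interval; they express how many standard deviations a value lies from the column mean, which removes the differences in range and unit between features.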

Performance Criteria.
The performance indicators used to evaluate the experimental results are calculated based on the standard confusion matrix. For the sample data of this experiment, the confusion matrix is shown in Table 5. True positive (TP) refers to a record that is correctly classified as attack traffic, while false positive (FP) refers to a record that is misclassified as attack traffic, true negative (TN) is a record that is correctly classified as normal traffic, and false negative (FN) is a record that is misclassified as normal traffic. The formulas for the performance indicators used are defined as follows:

$$\text{Detection rate} = \frac{TP + TN}{N}, \tag{2}$$

$$\text{Precision} = \frac{TP}{TP + FP}, \tag{3}$$

$$\text{True positive rate} = \frac{TP}{TP + FN}, \tag{4}$$

$$\text{False positive rate} = \frac{FP}{FP + TN}. \tag{5}$$

In the formulas, N refers to the total number of data samples. Formula (2) is the detection rate, which refers to the ratio of correctly classified normal and abnormal data to the total data. Formula (3) is the precision, which is the proportion of records classified as attacks that are actually attack data and reflects the ability of the model to identify attack data. Formula (4) is the true positive rate, which represents the proportion of correctly identified attack data instances among all attack data. The higher the value of the above three evaluation indicators, the better the model effect. Formula (5) is the false positive rate, which refers to the proportion of normal data that is misclassified as attack data among all normal data. The lower the value, the better the model effect.

Table 2: Data contained in each subdataset.
Service type    Records    Attack records    Attack proportion (%)
HTTP            4000       1600              40
FTP             2000       800               40
SMTP            2000       800               40
Others          2000       800               40
Total           10000      4000              40

Figure 3: Flow of the fivefold cross-validation (divide the data set into 5 equal parts; in each round, take a subdataset that has never been selected as test data as the test set and use the rest as training data; calculate each performance indicator; repeat until all subdatasets have been selected as test sets).
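The four indicators can be computed directly from the confusion matrix counts, as in the sketch below. The function name `detection_metrics` is hypothetical, the formulas are the standard confusion-matrix definitions that match the descriptions of formulas (2) to (5), and the counts in the usage example are made-up numbers for illustration only, not results from the experiment.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute the four indicators from confusion matrix counts."""
    n = tp + fp + tn + fn  # total number of data samples
    return {
        "detection_rate": (tp + tn) / n,       # correctly classified / all data
        "precision": tp / (tp + fp),           # true attacks / flagged as attack
        "true_positive_rate": tp / (tp + fn),  # detected attacks / all attacks
        "false_positive_rate": fp / (fp + tn), # false alarms / all normal data
    }

# Illustrative counts only (not taken from the paper's experiment).
metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
```

The sketch assumes that every denominator is nonzero, that is, that the evaluated data contains both attack and normal records and that at least one record is flagged as an attack.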

Experimental Results Analysis.
After the NSL-KDD data set is extracted and preprocessed, it is applied to the semisupervised learning model proposed in this paper, and the performance of the proposed algorithm is compared with that of the spectral clustering algorithm, the K-means algorithm, and K-means + C4.5. As shown in Figures 5-7, the spectral clustering algorithm performs better than the K-means algorithm in terms of detection rate, precision, and true positive rate. The K-means + C4.5 detection method is better than K-means or spectral clustering alone. Compared with the other methods, the semisupervised learning model based on spectral clustering and random forest proposed in this paper is optimal in detection rate, precision, and true positive rate. The false positive rate refers to the proportion of normal data that is misclassified, and a low false positive rate is an important indicator; the false positive rates of the methods are shown in Figure 8. The proposed semisupervised learning detection model has a low false positive rate, which is basically consistent with that of K-means + C4.5. Since its detection rate, precision, and true positive rate are higher than those of K-means + C4.5, the semisupervised learning detection model is more advantageous.
The experimental results show that the semisupervised learning model proposed in this paper has high accuracy, a low false positive rate, and good performance. It is more suitable for detecting WEB DDoS attacks than the other detection models. According to the experimental results, the proposed method maintains a relatively low false positive rate, which is superior to that of unsupervised methods, and it can detect new types of attacks in network traffic data effectively. Additionally, the proposed method outperforms the hybrid method, K-means + C4.5, in TPR and precision while keeping a comparable FPR.

Security and Communication Networks

Conclusion
In order to improve the detection rate of existing WEB DDoS attack detection models, this paper proposes a semisupervised learning model based on spectral clustering and random forest. First, because flow characteristics are important to the detection scheme, we focus on selecting better features to apply to the detection model proposed in this paper. Then, we analyze the spectral clustering algorithm and the random forest algorithm in detail. Based on their principles and advantages, spectral clustering and random forest are combined to form a semisupervised learning WEB DDoS attack detection model. Finally, the model proposed in this paper is compared experimentally with other existing detection schemes to verify its effectiveness.
The proposed semisupervised learning model achieves a certain improvement in the detection rate while ensuring a low false positive rate and is more suitable for the detection of WEB DDoS attacks. In future work, we will continue to improve the detection model and try other machine learning methods.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.