Cost-Sensitive Approach to Improve the HTTP Traffic Detection Performance on Imbalanced Data

Aim. The purpose of this study is to better detect attack traffic in imbalanced datasets. Deep learning has played an important role in detecting malicious network traffic in recent years. However, it suffers from a seriously imbalanced distribution of data: because only a small portion of traffic is malicious while most network traffic is benign, the traffic model skews towards modeling the benign class. That is the motivation for this manuscript. Methods. We propose a cost-sensitive approach to improve HTTP traffic detection performance on imbalanced data, and we also present a character-level abstract feature extraction approach that can provide features with clear decision boundaries. Finally, we design a Spark-based HTTP traffic detection system built on these two approaches. Results. The methods proposed in this paper work well on imbalanced datasets. Compared to other methods, the experimental results indicate that our system achieves a high F1-score and precision. Conclusion. For imbalanced HTTP traffic detection, we confirmed that our feature extraction method and cost function are very effective. In the future, we may focus on how to use the cost function to further improve detection performance.


Background.
In the past few years, cybersecurity incidents have occurred frequently. In the first half of 2018, the 360 Internet Security Center intercepted 140 million malicious programs in total, nearly 795,000 per day on average [1]. Moreover, around 8% of Hypertext Transfer Protocol (HTTP) messages in 2017 were reported to be malicious [2].
Deep learning, as one of the most remarkable current machine learning techniques, has achieved great success in many applications such as image analysis, speech recognition, and text understanding [3]. In the field of object detection, Girshick et al. [4] greatly improved accuracy through deep learning. Wu et al. [5] used weakly supervised learning to classify and annotate images. Rattani et al. [6] applied deep learning to the field of selfie biometrics and made good progress. In medical image segmentation, U-NET [7], proposed at the MICCAI conference in 2015, is undoubtedly one of the most successful methods. In the field of HTTP traffic detection, deep learning is a prominent way to detect malicious network traffic. However, it suffers from a seriously imbalanced distribution of data. For example, traffic detection tasks usually focus on reducing malicious traffic such as web attacks, whereas web-browsing data accounts for the majority of traffic. The contribution of the majority class to the cost function far exceeds that of the minority class. Therefore, it is difficult to identify the small amount of malicious traffic, which brings serious challenges to network traffic classification [8].

Related Work.
The detection technologies for imbalanced data can be classified into three types: data-level methods, feature extraction, and cost-sensitive learning. Oversampling, undersampling, and random sampling are the most commonly used data-level methods. Jin et al. [9] and Lim et al. [10] applied data-level methods to rebalance traffic data and improve detection performance on imbalanced datasets. Oversampling improves classification performance by increasing the number of minority class samples.
However, due to the large number of copies of the minority class samples, the classification algorithm has difficulty avoiding overfitting. Undersampling improves classification performance by reducing the number of majority class samples. However, in the field of HTTP traffic detection, the majority class samples far outnumber the minority class samples, and the difference may be hundreds of times, so the downsampling method may not be suitable. Random sampling randomly abandons minority class samples, which may remove potentially useful information. Park et al. [11] proposed an anomaly detection technique for imbalanced HTTP traffic utilizing convolutional autoencoders (CAE), which belongs to the feature extraction type. However, converting an HTTP message into an image via one-hot encoding loses some of the original information, and we improve on the feature extraction method of that paper. Another common method is cost-sensitive learning, which uses a cost function to train the classifier. The cost-sensitive method in [12] is based on decision trees. However, due to the complexity of the problem, algorithms based on neural networks usually perform better than those based on decision trees in the field of HTTP traffic detection. Chen et al. [13] introduced a novel imbalanced classification model, named simplex imbalanced data gravitation classification (S-IDGC). This model uses Euclidean distance to calculate gravity but fails to consider the data distribution characteristics, and its results had low precision. Recently, the focal-loss cost function proposed by Lin et al. [14] has been proved to be effective in the field of image segmentation.
Tong [15] proposed a traffic classification method based on a convolutional neural network, which consists of two main traffic classification stages and combines flow- and packet-based features to predict services over Quick UDP Internet Connections (QUIC). Aceto et al. [16] used multimodal deep learning to study mobile encrypted traffic classification, with good results on TLS traffic detection. Lotfollahi et al. [17] combined port-based methods, payload inspection, and statistical machine learning to analyze encrypted traffic classification. Bovenzi [18] proposed a hierarchical hybrid intrusion detection approach, which has been proved to be very effective in Internet of Things scenarios. Aceto [19] first investigated and experimentally evaluated the adoption of DL-based network traffic classification strategies supported by big data frameworks. These recent schemes focus on mobile or lightweight equipment and analyze encrypted traffic classification. However, in the field of HTTP traffic detection, the contribution of the minority class samples to the loss function is reduced according to the predicted value during training, which is not conducive to the detection of the minority class samples. The characteristics of recent related works are shown in Table 1.
Contributions: the main motivation of this paper is to detect attacks in seriously imbalanced network traffic. To achieve this goal, we address it from two aspects: feature extraction and the cost function. The main contributions of this paper can be summarized as follows:
(i) In terms of feature extraction, we present a character-level abstract feature extraction approach that can provide features with clear decision boundaries.
(ii) In terms of the cost function, we present the HM-loss cost function to improve HTTP traffic detection performance on imbalanced data. This cost-sensitive approach can reduce the contribution of the majority class to the cost function.
(iii) Finally, we design and implement a Spark-based HTTP traffic detection system and apply the cost-sensitive approach and the feature extraction approach in this detection system. The experimental results show that the proposed scheme achieves higher precision, recall, and F1-score.
The rest of this paper is organized as follows. Section 2 gives a detailed description of the character-level abstract feature extraction approach, whereas Section 3 describes the cost-sensitive approach. The experiment is presented in Section 4. Conclusions and future directions are given at the end of the paper.

2.1. The Character-Level Abstract Feature Extraction Approach.
We present the character-level abstract feature extraction approach in this section, which combines character-level features and abstract features. Our main work is to extract character-level features based on Spark [20] clusters, design a one-dimensional convolutional autoencoder (CAE), and then extract abstract features.
For the feature extraction of HTTP traffic, n-gram features [21] and character-level features [22] are the most popular methods for converting HTTP messages into input vectors fed into neural networks. However, n-gram features cause a large amount of information loss and have higher feature dimensions, and character-level features have no clear decision boundaries on imbalanced data because they contain a lot of noise. Considering that the abstract features generated by a CAE have clear decision boundaries, but a CAE cannot directly obtain input vectors from HTTP traffic, we present the character-level abstract feature extraction approach, which builds on character-level features. The workflow of the abstract feature extraction approach is shown in Figure 1.

Character-Level Feature Extraction Method Based on Spark.
Zhang et al. [22] have done a lot of research on character-level features. However, Zhang's experiments were implemented on a single computer, which encounters a computational bottleneck in an actual production environment. Therefore, we extend the character-level feature extraction method to Spark and combine it with abstract features to extract the character-level abstract features used in this paper.
We perform the preprocessing steps and extract character-level features on the Spark cluster. First, we install and configure the Hadoop cluster [23] on Ubuntu servers, and our Spark mode is Spark on YARN. Second, we allocate appropriate computing resources for Spark tasks based on the amount of task data and the expected completion time. For example, the gateway produces 15 GB of HTTP traffic stored on the Hadoop Distributed File System (HDFS) every 5 minutes. We set the HDFS default block size to 128 MB, so the traffic is split into 120 (15 GB/128 MB) tasks. Then, we assume that the 120 tasks need to run in 2 to 3 waves in the cluster (in our experience, this configuration maximizes resource utilization). Therefore, we can assign 50 executor instances to the cluster, each instance being assigned 2 to 3 CPUs.
Third, we write Spark programs with the Jupyter Notebook tool.
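The block-to-task arithmetic above can be checked in a few lines (a minimal sketch; the helper name `hdfs_tasks` is ours and is not part of the detection system):

```python
def hdfs_tasks(traffic_gb: int, block_mb: int = 128) -> int:
    """One Spark task is created per HDFS block of the input traffic."""
    return (traffic_gb * 1024) // block_mb

# 15 GB of HTTP traffic every 5 minutes over 128 MB HDFS blocks -> 120 tasks,
# served by 50 executor instances with 2-3 cores each (100-150 core slots).
tasks = hdfs_tasks(15)
core_slots = (50 * 2, 50 * 3)
```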
The specific steps of data preprocessing and feature extraction are the same as in paper [22]. The pseudocode of the character-level feature extraction algorithm is shown in Algorithm 1. In the pseudocode, we merge the URL and post fields into a feature string and then filter out non-ASCII characters. Finally, we generate a string of fixed length L: if the length is greater than L, we truncate it; if the length is less than L, we repeat-fill until the length reaches L. After extracting the character-level features, we transfer the feature data into the Kafka [24] system for use in subsequent steps.
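The merge-filter-pad procedure of Algorithm 1 can be sketched as follows. The function name is ours, and the default L = 300 is an assumption that the fixed length here matches the 300-character CAE input described later:

```python
def char_level_feature(url: str, body: str, L: int = 300) -> str:
    """Merge the URL and body fields, keep only ASCII characters,
    then truncate or repeat-fill the string to a fixed length L."""
    s = "".join(ch for ch in (url + body) if ord(ch) < 128)
    if not s:
        return " " * L                      # nothing survived the ASCII filter
    if len(s) >= L:
        return s[:L]                        # too long: truncate
    return (s * (L // len(s) + 1))[:L]      # too short: repeat-fill up to L

feature = char_level_feature("/login.php?id=1%27+OR+1%3D1", "user=admin")
```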
For malicious HTTP traffic, the URL and body fields are more likely to contain sensitive attack information. Therefore, this paper chooses these two fields as the main detection targets. Part of the training set, containing only the URL field and the body field, is shown in Table 2.

Abstract Feature Extraction by Autoencoder.
This section mainly presents the workflow of extracting the abstract features generated by the one-dimensional CAE. The main motivation of the abstract features generated by the CAE is to produce a clear decision boundary. Therefore, we use the one-dimensional CAE to generate abstract features by learning from the character-level features. The results show that this method can effectively reduce the impact of an imbalanced data distribution on malicious traffic detection.
In Figure 2, the classic CAE's input is a two-dimensional image. But the URL and post fields of HTTP traffic are one-dimensional, unlike images with two-dimensional spatial information, so this paper processes the input text data as one-dimensional. Figure 3 shows the structure of the one-dimensional CAE we designed. Each layer of the encoder consists of multiple nodes, and the last hidden layer of the encoder generates the abstract feature. The decoder also has a multilayered structure that is symmetrical with the corresponding layers in the encoder, and the last layer of the decoder is the output layer. The cost function calculates the error between the output layer and the input layer. To reduce overfitting, the dropout ratio between the encoder and the decoder is set to 0.1.
CAE training is unsupervised, so manual labels are not required. The CAE's input is a 300 × 1 character vector, and after the convolution steps, an abstract feature of size 25 × 8 is generated.
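As a sanity check on the sizes quoted above, three pooling stages with factors 2, 2, and 3 take the 300 × 1 input down to length 25, and 8 filters in the last encoder convolution yield the 25 × 8 abstract feature. The specific pooling factors are our assumption; the paper states only the input and output sizes:

```python
def encoder_output_shape(length=300, pool_factors=(2, 2, 3), filters=8):
    """Track the temporal length through 'same'-padded conv + pooling stages:
    the convolutions preserve length, and each pooling stage divides it."""
    for p in pool_factors:
        length //= p
    return (length, filters)

shape = encoder_output_shape()   # -> (25, 8)
```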

The Cost-Sensitive Approach
We present the HM-loss cost function in this section. Our main work consists of two parts. First, we describe the disadvantages of the CE-loss when dealing with imbalanced HTTP traffic. Second, we design the HM-loss cost function, which is cost-sensitive. In this approach, we design a weight coefficient for the loss function: when the classification algorithm predicts the majority class samples, the coefficient dynamically adjusts their contribution to the loss function, and when the minority class samples are predicted, their contribution to the loss function is kept unchanged.

The Disadvantages of the CE-Loss on Imbalanced HTTP Traffic.
The CE-loss function is very popular as the cost function for deep learning techniques in classification tasks. However, it suffers from a low F1-score when dealing with severely imbalanced HTTP traffic in an actual production environment. Because the contribution of the majority class to the cost function far exceeds that of the minority class, the model's decisions tend to support the majority class and ignore the minority exception class [14]. Figure 4 shows how the loss value of the CE-loss varies with the prediction probability. As shown in the figure, the predicted value of a single normal sample tends to be large, close to 1, but its contribution to the loss function is small. Conversely, the predicted value of a single malicious sample tends to be smaller than that of a normal sample, but it contributes a lot to the loss function. Now, assume that the predicted probability of a normal sample is 0.97; the sum of the contributions to the cost function of 500,000 normal samples is then 15,229. At the same time, assuming that the predicted probability of a malicious sample is 0.88, the sum of the contributions of 705 malicious samples is about 90, so the loss of the benign samples is roughly 169 (15,229/90) times that of the malicious samples. Therefore, in the backpropagation of the neural network, the loss of the normal samples dominates the gradient descent, and the algorithm focuses on the majority class.
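The arithmetic above can be reproduced with the natural-log cross-entropy, the base consistent with the 15,229 figure (a minimal sketch; the helper name is ours):

```python
import math

def ce_sum(n_samples: int, p_correct: float) -> float:
    """Total cross-entropy contribution of n identical samples, each
    predicted with probability p_correct for its true class."""
    return n_samples * -math.log(p_correct)

benign = ce_sum(500_000, 0.97)   # ~15,230
malicious = ce_sum(705, 0.88)    # ~90
ratio = benign / malicious       # the benign class dominates the gradient
```

Even though each benign sample contributes very little, the sheer count of benign samples makes their total loss outweigh the malicious class by two orders of magnitude.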

The Definition of the HM-Loss Cost Function.
In this part, we design the HM-loss cost function and give its definition. First, we present the idea of HM-loss, then give its definition, and finally summarize its advantages and characteristics. The main idea of the HM-loss cost function is to dynamically adjust the weight of a sample's contribution to the loss value [14]. When the true label belongs to the majority (negative) class, the weight of the contribution to the loss decreases, and the degree of decrease varies with the predicted probability value; usually, when the prediction is correct, the contribution to the loss decreases greatly. In addition, our cost function has another property: when the true label belongs to the minority (positive) class, the weight of the contribution to the loss remains unchanged. Therefore, we can make the algorithm focus on the minority class samples by giving the majority class samples less attention [25]. The definition of HM-loss is shown in formula (1). Here, ytrue represents the real label and takes only the values 0 and 1, where 0 represents the positive class and 1 represents the negative class; ypred represents the prediction probability value, which ranges from 0 to 1.
The HM-loss cost function, derived from the CE-loss, consists of two parts. The first part is ytrue · cos(α · ypred)^γ, which controls the weight as it varies with the ypred value; the two hyperparameters α and γ in this part adjust the degree of weight reduction. The second part is (1 − ytrue), which keeps the weight of the minority class's contribution to the loss function unchanged. Figure 5 shows the loss value of the HM-loss cost function under different hyperparameters.
We explained the definition of the HM-loss cost function in the previous section; its advantages are as follows. The first advantage is that the contribution of the majority class samples to the loss function can be dynamically reduced according to the predicted value. The second is that, regardless of whether the minority class samples are predicted correctly or not, their contribution to the loss function does not change. The third is that only when the majority class samples are correctly predicted with a probability value close to 1 does the weight decrease fastest. Now, we take a simple example of what the HM-loss cost function does. Figure 6 shows the loss of 500,000 normal samples under different loss functions; to show the data better, we select the data with probability values between 0.85 and 1. We assume that the probability of a majority class sample is 0.97, and we can calculate that the sum of the 500,000 samples' loss values under the CE-loss cost function is 15,229. Similarly, the loss using the HM-loss cost function under different hyperparameters is 4642, 431, and 40, respectively. It can be seen that the HM-loss cost function is very effective in this case.
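Assuming the weighting just described, cos(α · ypred)^γ applied to the majority class's natural-log cross-entropy with the minority class weight fixed at 1 (our reading of formula (1), not the authors' exact code), the quoted sums can be reproduced:

```python
import math

def hm_loss_sum(n, p, alpha, gamma, majority=True):
    """Summed HM-loss of n identical samples predicted at probability p.
    Majority-class terms are down-weighted by cos(alpha * p) ** gamma;
    minority-class terms keep the plain cross-entropy weight of 1."""
    weight = math.cos(alpha * p) ** gamma if majority else 1.0
    return n * weight * -math.log(p)

# With alpha = 1.3 and gamma in {1, 3, 5}, the 500,000 benign samples at
# p = 0.97 contribute roughly 4643, 431, and 40, matching the figures above.
sums = [hm_loss_sum(500_000, 0.97, 1.3, g) for g in (1, 3, 5)]
```

Larger γ shrinks the well-classified majority class's contribution faster, while the minority class's loss is untouched.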

Experiment.
In this section, we explain the details of the datasets and the performance metrics used in the experiments. Using these metrics, we compare the performance of the proposed scheme with related schemes, including data-level methods and Park's method [11]. The experiment consists of three parts. First, we introduce the preparation stage of the experiment, including the dataset and the evaluation metrics. Then, we show the structure of the convolutional neural network used to detect malicious samples. The third part presents the experimental results.

Experiment Setup.
We use real traffic data accumulated over time to validate our approach. We collected around 701,000 HTTP messages from the gateway of a university in 2019 for this experiment. The collected data are highly sensitive because they contain most of the network activity of teachers and students during work hours. We performed manual verification and tagging on these data. The types and quantities of malicious samples are shown in Table 3. The numbers of normal and anomalous HTTP messages are around 700,000 and 1,000, respectively. We divide the data into training and test datasets according to a certain proportion.
Existing studies have shown that AUC has certain limitations in performance evaluation, especially when the numbers of normal and anomalous messages differ significantly [26, 27]. Therefore, we use the F-score [28]. Specifically, the F-score is the harmonic mean of precision and recall, defined as F1 = 2 · Precision · Recall / (Precision + Recall). (2)
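The harmonic-mean definition in formula (2) is a one-liner (a sketch, not the authors' evaluation code):

```python
def f1(precision: float, recall: float) -> float:
    """F1-score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

score = f1(0.90, 0.84)   # ~0.87, the combination reported later in Table 4
```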

The Neural Network Model Structure.
After obtaining character-level abstract features, we can train the CNN to classify samples. Our model is based on a one-dimensional convolutional neural network, which can acquire more local features. The reason for using one-dimensional vectors is that HTTP traffic has no two-dimensional spatial attributes. The structure of the model used in this experiment mainly includes the input layer, the hidden layers, and the output layer. The input layer first converts the input data into the input tensor fed into the one-dimensional convolutional neural network. After the convolution, pooling, ReLU, and flattening steps, the softmax function produces the prediction value. The model architecture is shown in Figure 7.

The purpose of the next experiment is to find the best two hyperparameters of HM-loss on the experimental data used in this paper. Table 4 shows that when the hyperparameters alpha and gamma are set to 1.3 and 3, respectively, the HM-loss performs best on the dataset, with precision, recall, and F1-score reaching 0.90, 0.84, and 0.87, respectively.
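The forward pass just described (one-dimensional convolution, ReLU, pooling, flatten, softmax) can be sketched in miniature. The layer widths of the actual experiment are not given in this excerpt, so the sizes below are toy values for illustration only:

```python
import math

def conv1d(x, kernel):
    """Valid one-dimensional convolution (no padding)."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def relu(x):
    return [max(0.0, v) for v in x]

def maxpool(x, p=2):
    """Non-overlapping max pooling with window p."""
    return [max(x[i:i + p]) for i in range(0, len(x) - p + 1, p)]

def softmax(z):
    m = max(z)                       # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# toy forward pass: feature -> conv -> ReLU -> pool -> two "logits" -> softmax
feat = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
hidden = maxpool(relu(conv1d(feat, [0.5, -0.5, 0.5])))
probs = softmax(hidden[:2])          # probabilities over {benign, malicious}
```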

Comprehensive Experimental Results.
The following experimental results are divided into two parts. First, we verify the effectiveness of the character-level abstract feature extraction method. Second, we compare our method with other methods.
(i). We compare Park's features and our character-level abstract features. In this process, we apply different feature extraction methods but the same algorithm. As shown in Figure 8, our features achieve a higher F-score. Therefore, through this comparative experiment, we can conclude that our features work.
(ii). We compare our method with other methods. The ordinary method does not adopt any strategy; the oversampling method focuses on the data level; Park's method [11] focuses on feature extraction; and Lin et al.'s method [14] mainly focuses on the cost function, which is effective in the field of computer vision. Our method focuses on both feature extraction and the cost function.

Security and Communication Networks
As shown in Table 5, the comparative experiment between the ordinary method and our method shows that our method works. Comparing our method with the other methods, we find that it achieves higher accuracy and F-score than the oversampling technique, Park's method, and Tsung-Yi Lin's method. From the ROC curve comparison of the three methods in Figure 9, under the same FPR, our HM-loss method obtains a higher TPR than the methods proposed by Tsung-Yi Lin and Park, and is therefore better than the other two methods.
According to the experimental results, we can conclude that our method works and performs better than the above related methods when dealing with the imbalanced traffic dataset.

Discussion
In this paper, we propose a cost-sensitive approach to improve HTTP traffic detection performance on imbalanced data. In this approach, we design a weight coefficient for the loss function: when the majority class samples are predicted, the coefficient dynamically adjusts their contribution to the loss function, and when the minority class samples are predicted, their contribution to the loss function is kept unchanged. The experimental results show that this approach is more effective than others. In addition, we present a character-level abstract feature extraction approach that can provide features with clear decision boundaries. In conclusion, the methods proposed in this paper work well on imbalanced datasets. Compared to other methods, the experimental results indicate that our system achieves a high F1-score and precision. For imbalanced HTTP traffic detection, we confirmed that our feature extraction method and cost function are very effective.
In our future work, we will analyze the influence of different types of autoencoders on character-level abstract feature extraction and examine their capabilities for improving performance, in particular whether streamlined autoencoders can maintain precision while increasing computational efficiency. The theoretical causes of the results also require more rigorous investigation. In addition, we will explore the performance of stacked autoencoders when extracting HTTP traffic features, adapt the focal loss to HTTP traffic features, and verify that our method is feasible for other mobile communication protocols.

Data Availability
The authors cannot share their data because the data are confidential.

Conflicts of Interest
All authors declare that there are no conflicts of interest.