Using XGBoost to Discover Infected Hosts Based on HTTP Traffic



Introduction
With the rapid growth of the Internet and the widespread use of computers, today's computers face serious security problems, the biggest cause of which is the explosive growth of malicious code.
Malicious code refers to computer code that is intentionally written by individuals or organizations to pose a security risk to a computer or network. It usually includes malicious shareware, adware, Trojans, viruses, worms, and so on, each of which has different kinds of variants [1][2][3][4][5]. In the first half of 2018, the China Internet Security Report from the 360 Internet Security Center showed that a total of 140 million new malicious programs were intercepted, an average of 795,000 new malicious programs per day. Among them, the number of malicious programs on the PC side was 14,098,000, with an average of 779,000 new malicious programs intercepted every day [6]. In the fourth quarter of 2017, McAfee Labs detected the highest number of new malware samples in history, with a total of 63.4 million new samples. McAfee Labs recorded an average of eight new malware samples per second, a significant increase from the four new samples per second recorded in the third quarter [7]. Malware not only brings huge economic losses to users; its rapid evolution also puts great pressure on anti-malware detection technology. Current technology finds it difficult to detect malware before the host is infected.
Against this background, detecting malware-infected hosts from network traffic can make up for this shortcoming [8], because most malware communicates with externally hosted command and control (C&C) servers over the HTTP protocol after infecting a device. The C&C server is the control center that sends malware execution commands, and it is where malware delivers collected data. After an attacker compromises a host with malware, the controlled host sends connection requests to the C&C server; the traffic generated by these connections is malicious external traffic. Currently, there are two main ways to detect malicious external traffic. One is to filter malicious domain names based on blacklists, and the other is to use rules to match malicious external traffic. Both of these solutions have limitations. The blacklist-based filtering scheme can only identify malicious external traffic when a connection is made to a known malicious website and has no perception of domain name changes. The feature-based detection scheme requires security practitioners to analyze samples one by one, which consumes considerable manpower and has difficulty detecting the malicious external traffic of variants.
As a supplement to the prior art, malicious traffic can be detected through machine learning. Machine learning can discover the commonalities among malicious traffic and use them as a basis for detection, and a good algorithm can greatly reduce the workload of security practitioners. Specifically, the contributions of this work are as follows: (1) We propose an approach that combines machine learning with HTTP header templates to discover traffic involved in malware infection and develop it into the MalDetector system. (2) We use a statistical technique to aggregate similar features of HTTP header fields, which we call an HTTP header template, from large-scale network traffic.
(3) We use the GridSearchCV function to tune the eXtreme Gradient Boosting (XGBoost) algorithm and verify its effectiveness on a dataset consisting of malicious external traffic generated by malicious samples from MALWARE-TRAFFIC-ANALYSIS.NET [9] running in a sandbox and the UNSW-NB 15 dataset [10].
The structure of this paper is arranged as follows. We introduce the related work in Section 2. Section 3 presents an overview of the proposed approach. The process of automatic template generation from the HTTP header is described in Section 4. Section 5 presents the experimental evaluation metrics and illustrates the experimental results. We conclude the paper in Section 6.

The Request and Response Statistical Features.
This approach mainly analyzes behavioral characteristics such as the HTTP request/response time interval, quantity, and packet size to model malicious behavior and identify malware traffic. Perdisci et al. [11] developed a novel network-level behavioral malware clustering system. They performed coarse-grained clustering through statistical features, such as the total number of HTTP requests, the number of GET requests, the number of POST requests, the average length of the URLs, the average number of parameters in the request, the average amount of data sent by POST requests, and the average response length. Then, they performed fine-grained clustering by calculating the difference in URL structure between two malware samples. Finally, they merged fine-grained clusters of malware variants that behave similarly enough. Their work can unveil similarities among malware samples that may not be captured by current system-level behavioral clustering systems. Ogawa et al. [12] extracted new features such as the HTTP request interval, body size, and header bag-of-words from HTTP request/response pairs, calculated the cluster appearance ratio per communicating host pair, and identified malware-originated communicating host pairs. However, identification approaches based on request and response statistical features are limited to malware samples that perform some interesting actions (i.e., malicious activities) during the execution time T. Identification approaches based on the content of HTTP requests and responses can overcome this limitation.

The Content of HTTP Packets.
This approach analyzes the content of HTTP requests and responses, extracts relevant field information, and combines machine learning algorithms to identify malware traffic. Zhang et al. [17,18] used a learning-based approach to discover network dependencies with the help of HTTP request features and thus detect malicious traffic. Srivastava et al. [19] developed a system called ExecScent that is closest to this work. They used all the HTTP header fields to detect botnet traffic. They manually created templates, such as URL-Path, Query, and User-Agent, and formatted them using regular expressions. Zhang et al. [20] proposed a method that used the User-Agent field to detect malicious external traffic generated by malware. They used regular expressions to format HTTP header information and used operating system fingerprinting technology to identify whether the user agent was faked in order to infer whether there was a malware infection. Grill and Rehak [21] also used the User-Agent field to detect the presence of malicious external traffic.
They found that all User-Agent field information can be divided into five categories: legitimate user browser information, null, specific, spoofed, and inconsistent. According to their findings, some malware deliberately forges requests so that they appear to be sent from a web browser, making it difficult to detect malicious outbound traffic from the User-Agent field. Li et al. [22] proposed MalHunter based on behavior-related statistical characteristics. They detected malware communication patterns from three types of features: the character distribution of the URL, the HTTP header fields, and the HTTP header sequence. However, these approaches are either based on a single field or based on all fields, and their feature validity is low. Moreover, Zhang et al. [23] presented SMASH, a system that uses unsupervised data mining methods to detect various attack activities and malicious communication activities, focusing on detecting malicious HTTP activity from the perspective of server-side communication. Mekky et al. [24] put forward a method for identifying HTTP-redirected malicious links. They built per-user chains from passively collected traffic and extracted new statistical features from them to capture the inherent characteristics of malicious redirect cases. A supervised decision tree classifier is then applied to identify malicious links. Liu et al. [25] proposed an identification approach that analyzes HTTP connections established by clients in a monitored network and combines stream classification with graph-based score propagation methods to identify previously undetected malicious activity in Internet Service Provider (ISP) networks.

HTTP-Based Infected Host Detection Approach
The proposed HTTP-based infected host detection system includes four modules: HTTP traffic filtering, header feature extraction, automatic template generation, and infected host detection. Figure 1 gives an overview of the framework of our proposed infected host detection approach using HTTP traffic.

HTTP Traffic Filtering and Header Feature Extraction.
We save only the HTTP header to reduce the amount of stored traffic and select the important information from the HTTP header for further analysis. The number of distinct HTTP header fields can be roughly 10,000. Moreover, unrelated features may expose the machine learning model to the risk of overfitting. Rare fields are not versatile, so our selection criterion is to discard fields that appear fewer than 10 times, or never, in the training data.
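The field-selection rule above (discard fields seen fewer than 10 times in the training data) can be sketched as follows; the function name and the dict-of-headers input format are hypothetical illustrations, not from the paper:

```python
from collections import Counter

def select_header_fields(requests, min_count=10):
    """Keep only HTTP header fields that appear at least min_count
    times in the training data; rare fields are not versatile and
    risk overfitting the model."""
    counts = Counter()
    for headers in requests:  # each request is a dict: field name -> value
        counts.update(headers.keys())
    return {field for field, c in counts.items() if c >= min_count}

# toy data: "X-Rare" appears only once and is therefore discarded
reqs = [{"Host": "a.example", "User-Agent": "ua"} for _ in range(12)]
reqs.append({"Host": "b.example", "X-Rare": "1"})
fields = select_header_fields(reqs, min_count=10)
```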
In addition, we mainly focus on the detection of malware that leverages HTTP as the primary channel to communicate with the C&C server or to launch attack activities. Thus, our approach mainly focuses on HTTP requests rather than responses. If the C&C server is temporarily offline or changes its response content, there is little impact on our detection capability. Therefore, the selected fields are URI, Host, User-Agent, Request-Method, Request-Version, Accept, Accept-Encoding, Connection, Content-Type, Cache-Control, Content-Length, and some identification fields such as Frame-time, srcIP (source IP), srcPort (source port), dstIP (destination IP), and dstPort (destination port).
Table 1 lists the descriptions of the selected fields. The reason for selecting them is that they are often used in HTTP traffic and may be helpful in distinguishing legitimate traffic from malicious traffic.

Template Automatic Generation.
When malware communicates with externally hosted C&C servers, malware developers typically use custom formats to construct packets, and the network traffic generated by malware belonging to the same family usually exhibits similarity. Therefore, we use statistical techniques to aggregate similar features of the HTTP header fields, that is, to generate templates for similar malicious traffic, and then use the templates to detect new malicious traffic. A template is a series of strings: the character parts represent the parts of an HTTP header field's value that stay the same, and * represents the parts that differ. Templates are generated to display the variability of the words constituting the HTTP header fields and aim to compress their information.
The automatic template generation module consists of three steps: scoring, clustering, and generating templates [27], which are explained in detail in Section 4.

Infected Host Detection.
Many winners of Kaggle competitions like to use XGBoost [28] because of its parallelization, distributed computing, out-of-core computing, and cache optimization of data structures and algorithms. Thus, we use the XGBoost algorithm to classify malicious traffic and normal traffic in this work.
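As a rough illustration of the classification step, the sketch below trains a gradient-boosted classifier on toy feature vectors. Since the xgboost package may not be available, scikit-learn's GradientBoostingClassifier stands in for xgboost.XGBClassifier, and the feature encoding (hypothetical template-match indicators) is an assumption, not the paper's exact pipeline:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# toy features: 100 requests x 5 hypothetical template-match indicators
X = rng.integers(0, 2, size=(100, 5)).astype(float)
y = ((X[:, 0] + X[:, 1]) > 1).astype(int)  # toy labeling rule

# stand-in for xgboost.XGBClassifier; same fit/predict interface
clf = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
clf.fit(X, y)
acc = clf.score(X, y)  # training accuracy on the toy rule
```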

Template Automatic Generation
This section focuses on how the automatic template generation algorithm works.

Scoring.
We first calculate the score of each value of the selected HTTP header fields using the score calculation method, and then sort each selected HTTP header field's values according to their scores. Each field in the HTTP header is split by the following four separators: space, "/", "=", and ",". Thus, the score calculation method splits each selected HTTP header field by separator and then calculates the relative frequency of the resulting values. For a value w in the field F, its score S(w; F) can be calculated using

S(w; F) = n(w, pos(w, F), len(F)) / n(pos(w, F), len(F)),

where pos(w, F) is the position of the value w in the field F and len(F) is the number of values in the field F. For example, if F = {foo, bar, baz, quz} and w = bar, then pos(w, F) = 2 and len(F) = 4. n(w, pos(w, F), len(F)) represents the number of times that w appears at position pos(w, F) in fields of length len(F) across all HTTP headers, and n(pos(w, F), len(F)) indicates the number of times that position pos(w, F) occurs in fields of length len(F) across all HTTP headers. As shown in Figure 2, the score of "rv: 19.0" is 0.33 (S(w, F) = 1/3 = 0.33).
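Under the definitions above, the scoring can be sketched as follows. The helper name is hypothetical, fields are assumed to be already split into value lists, and values are assumed unique within a field (a simplification):

```python
from collections import Counter

def make_scorer(fields):
    """Score a value w of a split field F as
    S(w; F) = n(w, pos(w,F), len(F)) / n(pos(w,F), len(F)),
    i.e., how often w occupies that position among all fields
    of the same length."""
    num = Counter()  # occurrences of (value, position, field length)
    den = Counter()  # occurrences of (position, field length)
    for F in fields:
        L = len(F)
        for pos, w in enumerate(F, start=1):
            num[(w, pos, L)] += 1
            den[(pos, L)] += 1
    def S(w, F):
        pos, L = F.index(w) + 1, len(F)
        return num[(w, pos, L)] / den[(pos, L)]
    return S

# toy User-Agent fragments, already split on the separators
fields = [["Mozilla", "5.0", "rv:19.0"],
          ["Mozilla", "5.0", "rv:20.0"],
          ["Mozilla", "5.0", "rv:21.0"]]
S = make_scorer(fields)
```

Here S("rv:19.0", fields[0]) evaluates to 1/3, mirroring the Figure 2 example, while the stable "Mozilla" token scores 1.0.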

Clustering. We use the idea of the DBSCAN [29,30] algorithm to cluster the values of the selected HTTP header fields. Within a selected HTTP header field, when the score of the next value differs from the score of the previous value by less than δ, the next value is added to the current cluster; otherwise, the next value is added to a new cluster.
This process is repeated until all values have been added to a cluster. The DBSCAN algorithm requires two parameters: the scan radius (eps) and the minimum number of points (minPts). The working process of the DBSCAN algorithm is as follows.
Starting from an unvisited point, it finds all nearby points within distance eps (inclusive). If the number of nearby points is not smaller than minPts, the current point forms a cluster with its nearby points, and the starting point is marked as visited.
Then, all points in the cluster that are not marked as visited are recursively processed in the same way, thereby expanding the cluster. If the number of nearby points is smaller than minPts, the point is temporarily marked as a noise point. Once the cluster is fully expanded, i.e., all points within the cluster are marked as visited, the same procedure is used to process the remaining unvisited points.
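For reference, the textbook eps/minPts form of DBSCAN can be exercised directly on one-dimensional scores with scikit-learn; the values and parameters in this toy sketch are illustrative only, not from the paper:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# six 1-D scores: two dense groups and one isolated point
scores = np.array([[0.95], [0.93], [0.91], [0.40], [0.38], [0.05]])
labels = DBSCAN(eps=0.05, min_samples=2).fit_predict(scores)
# the isolated 0.05 point is labeled -1 (noise)
```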
Finally, we describe our clustering approach, which combines the scoring method and the DBSCAN algorithm, in the following.
First, we introduce two parameters: δ (δ ≥ 0) and β (0 < β < 1). δ is the minimum distance between two clusters, β × len(F) is the minimum number of points in a cluster, and len(F) refers to the number of values in a field. In this work, δ is set to 0.1 and β is set to 0.5.
Then, we sort the values in descending order of score. When the score of the next value differs from the mean score of the current cluster by less than δ, the next value is added to the current cluster; otherwise, the next value starts a new current cluster. This process is repeated until every value belongs to a cluster.
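A minimal sketch of this greedy score clustering follows; the function name is hypothetical, and scores are compared against the running cluster mean as described:

```python
def greedy_cluster(scores, delta=0.1):
    """Sort scores in descending order; a score joins the current
    cluster when it differs from that cluster's mean by less than
    delta, otherwise it starts a new cluster."""
    clusters = []
    for s in sorted(scores, reverse=True):
        if clusters and abs(sum(clusters[-1]) / len(clusters[-1]) - s) < delta:
            clusters[-1].append(s)
        else:
            clusters.append([s])
    return clusters

clusters = greedy_cluster([1.0, 0.98, 0.95, 0.33, 0.30, 0.02], delta=0.1)
# three clusters: [1.0, 0.98, 0.95], [0.33, 0.30], [0.02]
```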

Generating Templates.
The results of the clustering are filtered to preserve only the clusters whose sizes are not smaller than β × len(F), and the values in the remaining clusters are replaced with "*", where β × len(F) (0 < β < 1) is the minimum number of points in a cluster and len(F) is the number of values of the field. The overall generation process is shown in Figure 2.
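The final replacement step can be sketched as follows. The helper is hypothetical: `keep` stands for the set of values whose clusters survived the β × len(F) filter, and consecutive wildcards are merged into one:

```python
def to_template(field, keep):
    """Values in surviving clusters stay literal; everything else
    collapses to '*', with runs of '*' merged into one wildcard."""
    out = []
    for w in field:
        tok = w if w in keep else "*"
        if not (tok == "*" and out and out[-1] == "*"):
            out.append(tok)
    return out

keep = {"Mozilla", "5.0"}  # values from sufficiently large clusters
template = to_template(["Mozilla", "5.0", "rv:19.0"], keep)
```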
We also computed statistics on the templates generated from the training data.
The statistical results are shown in Figure 3.
As can be seen from Figure 3, the number of templates for malicious traffic is generally several times larger than the number of templates for normal traffic.
The largest numbers of templates are generated for the URI and User-Agent fields, from which it can be inferred that malicious traffic may be distinguished mainly by the templates of these fields. We also observed that some fields generate no malicious traffic templates at all, suggesting that the HTTP request information of malicious traffic may be short and include only a few fields. This is probably because normal HTTP request traffic usually comes from connections made through a browser, and the browser fills in many fields, whereas malicious traffic comes from connections made to the C&C server by malware, whose data format is usually constructed by the malware developer, so the HTTP request message is shorter.

Experiments and Results
This section introduces the dataset, the experimental setup, the performance metrics, and the obtained results.

Dataset.
The malware traffic used in this work is from MALWARE-TRAFFIC-ANALYSIS.NET [9]. We collected malicious external traffic by running malicious samples gathered from June 2013 to December 2017 in a sandbox and used Security Onion (a tool for network security monitoring) to inspect the traffic and obtain the results. The normal traffic samples are from the UNSW-NB 15 dataset shared by the Cyber Range Lab of the Australian Cyber Security Centre (ACCS) in 2015 [10]. They used the tcpdump tool to capture 100 GB of raw traffic (PCAP files) for evaluating network intrusion detection systems and provided a labeled dataset.
The labeled file contains the time period, the source port, the source IP address, the destination port, the destination IP address, the protocol type, and other information of the threat traffic, as shown in Table 3.
There are 373,864 HTTP request records and only 6,401 malicious traffic records in the 100 GB of raw traffic data. We extract malicious HTTP traffic based on the source IP, destination IP, source port, destination port, and time period (from the start time to the end time) in the given labeled file. When the protocol type is HTTP and the time period, source port, source IP, destination port, and destination IP address all match, the traffic is labeled as malicious.
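The matching rule can be sketched like this; the field names and record format are hypothetical stand-ins for the 5-tuple and time window in the UNSW-NB 15 label file:

```python
def label_request(req, threats):
    """Return 1 (malicious) when the request's address 4-tuple matches
    a labeled threat record and its timestamp falls inside the record's
    time window, else 0 (normal)."""
    key = (req["src_ip"], req["src_port"], req["dst_ip"], req["dst_port"])
    for t in threats:
        if key == (t["src_ip"], t["src_port"], t["dst_ip"], t["dst_port"]) \
                and t["start"] <= req["time"] <= t["end"]:
            return 1
    return 0

threats = [{"src_ip": "10.0.0.5", "src_port": 49152,
            "dst_ip": "203.0.113.7", "dst_port": 80,
            "start": 100.0, "end": 200.0}]
req = {"src_ip": "10.0.0.5", "src_port": 49152,
       "dst_ip": "203.0.113.7", "dst_port": 80, "time": 150.0}
```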
We set the ratio of the training set to the testing set to 7 : 3. Thus, the dataset in the experiment, shown in Table 4, consists of 34,239 malicious HTTP requests and 35,481 normal HTTP requests.

Experimental Setup.
The system was implemented in Python 3.5, and all experiments were performed on an off-the-shelf server with 64 GB of RAM and a 6-core processor. In order to evaluate the true positive and false positive rates of our detection approach, we tune the model parameters on the training set. The initial key parameters of the XGBoost model are shown in Table 5.
Table 5 shows that the cross-validation accuracy on the training set with the initial parameters is 99.5%, but the accuracy on the testing set is only 92.89% due to overfitting.

Security and Communication Networks
In order to further improve the prediction accuracy, we further adjust the parameters of the XGBoost algorithm. We use the GridSearchCV function in the scikit-learn [31] package, which traverses the value ranges of the parameters. We adjust three groups of key parameters, and the adjustment steps are as follows: (1) We first adjust the two parameters max_depth and min_child_weight, which play a decisive role in the model. The value range of max_depth is set to [4, 6, 8, 10, 12]. The value range of min_child_weight is very large and seriously affects the experimental results; if the model is overfitting, the value of min_child_weight should be increased. Thus, its value range is set to [1, 10, 100, 1000]. The results of the parameter adjustment are shown in Table 6. The experimental results show that the model performs best when max_depth = 10 and min_child_weight = 1. (2) Based on the adjusted max_depth and min_child_weight parameters, we adjust the parameter gamma, which participates in the pruning of the decision tree; the larger its value, the more conservative the model. Here, we set the value range of gamma to [0, 8].
The results of the parameter adjustment are shown in Table 7. The experimental results show that the model performs best when gamma = 0.
(3) Finally, we adjust the two parameters subsample and colsample_bytree, which control the proportions of samples and features used. If the sampling ratio is set too small, the accuracy may be reduced. Here, the value range of subsample is set to [0.7, 1], and the value range of colsample_bytree is also set to [0.7, 1]. The results of the parameter adjustment are shown in Table 8. The experimental results show that the model performs best when subsample = 0.8 and colsample_bytree = 0.8.
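The staged search can be sketched with scikit-learn's GridSearchCV. Since the xgboost package may be unavailable, GradientBoostingClassifier stands in for xgboost.XGBClassifier, min_samples_leaf serves as a rough proxy for min_child_weight, and the toy data and trimmed ranges are illustrative only:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy separable rule

# stage 1: the two dominant parameters (max_depth range follows the
# paper; min_child_weight is approximated by min_samples_leaf)
grid = GridSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=0),
    param_grid={"max_depth": [4, 6, 8, 10, 12],
                "min_samples_leaf": [1, 10, 100]},
    cv=3)
grid.fit(X, y)
best = grid.best_params_  # later stages would fix these values, then
                          # tune pruning and subsampling analogues
```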

Evaluation Metrics.
The evaluation metrics of our proposed infected host detection approach using malicious external HTTP traffic are defined as follows: TP refers to the number of malicious HTTP requests that are recognized as malicious HTTP requests, TN is the number of normal HTTP requests that are recognized as normal HTTP requests, FP refers to the number of normal HTTP requests that are mistaken for malicious HTTP requests, and FN is the number of malicious HTTP requests that are incorrectly identified as normal HTTP requests. The higher the values of precision, recall, and F1, the better the recognition performance of the infected host detection approach.
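From these counts, the metrics follow the standard definitions; a small sketch with toy counts (not the paper's results):

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

a, p, r, f1 = metrics(tp=90, tn=95, fp=5, fn=10)
```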

Experimental Results.
When the ratio of the number of HTTP requests in the training set to that in the testing set is 7 : 3, the experimental results are shown in Table 9.
The accuracy on the testing set is 98.72%, and the false positive rate is less than 1%. The total testing time is about 7 s. Therefore, the proposed approach can quickly inspect network traffic and determine whether a host is infected by malware so that the user can respond as soon as possible. The precision-recall curve (PRC) for varying matching thresholds is shown in Figure 4. It can be seen that the algorithm maintains high precision as the recall rate increases. Finally, 0.8 is selected as the matching threshold. At this threshold, the precision of the algorithm is 93.56%, the recall rate is 97.14%, and the F1-score is 0.9532.

To better validate our proposed approach, we also compare it with the two methods of Ogawa et al. [12] and Li et al. [22]. We reproduced these two comparison experiments using our own dataset. The experimental results are shown in Table 10.
Table 10 shows that the ACC, P, R, and F1 of our proposed approach are the highest, at 0.9827, 0.9356, 0.9714, and 0.9532, respectively. Therefore, our proposed approach using XGBoost and HTTP header statistical templates detects HTTP malware traffic better than methods that combine raw HTTP headers with machine learning. The main reason is that the approaches of Ogawa et al. and Li et al. are either based on a single field or based on all fields, so their feature validity is low. Our proposed approach uses statistical techniques to aggregate similar features of the malicious HTTP header fields. Thus, our approach can more effectively characterize malware traffic, which further improves the accuracy of malware HTTP traffic recognition.
In addition, we select 10%, 20%, 30%, . . ., 90% of the samples as the training set, set the matching threshold to 0.8, and test on the remaining sample data.
The correct rate and false positive rate for malicious traffic and normal traffic are measured separately; the results are shown in Figure 5. It can be seen that the detection rate for normal HTTP requests stays above 99%. For malicious samples, the detection accuracy depends on the diversity of the training data. Even when the training set is only 10% of the samples and the model data is insufficient, the algorithm can still detect 77.65% of malicious traffic, indicating that the algorithm generalizes well to malicious traffic variants.
We also vary the ratio of malicious traffic to normal traffic in the training and testing sets. The experimental results are shown in Table 11.
The accuracy rates under different malware traffic ratios all remain above 90%. However, the model has high precision but a low recall rate when malicious traffic accounts for 10% and 20%, respectively. The main reason is that the proportion of malicious traffic is too small, resulting in insufficient training of the model. The results show that if we want to build a machine learning model that correctly identifies malicious traffic, the proportions of malicious and normal traffic need to be kept relatively balanced. In real-world samples, malicious traffic accounts for less than 1% of the data.
Thus, it is necessary to further process the samples, for example, by subsampling the majority class or oversampling the minority class, to increase the proportion of malicious traffic and thereby improve detection accuracy.
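Random oversampling of the minority class can be sketched as follows; the helper is hypothetical, and libraries such as imbalanced-learn provide more principled variants (e.g., SMOTE):

```python
import random

def oversample_minority(samples, labels, target_ratio=0.5, seed=0):
    """Duplicate minority-class (label 1, malicious) samples at random
    until they make up target_ratio of the rebalanced data."""
    rng = random.Random(seed)
    minority = [s for s, l in zip(samples, labels) if l == 1]
    majority = [s for s, l in zip(samples, labels) if l == 0]
    need = int(target_ratio * len(majority) / (1 - target_ratio))
    extra = [rng.choice(minority) for _ in range(need - len(minority))]
    X = majority + minority + extra
    y = [0] * len(majority) + [1] * (len(minority) + len(extra))
    return X, y

# 2 malicious vs 98 normal samples, rebalanced to 50/50
X, y = oversample_minority(list(range(100)), [1, 1] + [0] * 98)
```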

MalDetector System
Testing. We also use malicious traffic samples that do not exist in the training data or testing data to verify whether the system can detect new malware and its variants. The selected malicious traffic samples are Loki-Bot and Emotet.

Loki-Bot. Loki-Bot [32] uses a malicious website to push fake "Adobe Flash Player," "APK Installer," "System Update," "Adblock," "Security Certificate," and other application updates to induce users to install them. The Loki-Bot malware is a bank hijacking Trojan, a variant of the BankBot Trojan. The traffic sample of running Loki-Bot and the testing result using MalDetector are shown in Figure 6. The experimental results show that MalDetector detects all the malicious HTTP traffic of Loki-Bot.

Emotet. Emotet [33] is a new type of banking Trojan in Germany. The sample traffic is from a new variant of Emotet that appeared in September 2017. It can evade security detection and cannot be recognized by antivirus software. The traffic sample of running Emotet and the testing result using MalDetector are shown in Figure 7. The experimental results show that MalDetector detects all the malicious HTTP traffic of Emotet.

Conclusion
The diversification of malware and the growing complexity of its technologies have brought new challenges to cybersecurity. Unfortunately, rule-based traditional malware traffic detection methods are unable to detect malware variants. Machine learning-based methods can make up for this defect, and most malware uses the HTTP protocol to send malicious external traffic to the C&C server. Thus, we propose an approach to detect infected hosts using HTTP traffic combined with a machine learning algorithm. We mainly extract common templates from the HTTP traffic header, so the approach still works for the traffic generated by obfuscated malware variants. We also use the popular XGBoost algorithm to detect infected hosts, which has the advantages of high efficiency and high accuracy. The experimental results show that the accuracy of the method reaches 98.72% and the false positive rate is less than 1%, where the experimental data is from MALWARE-TRAFFIC-ANALYSIS.NET and UNSW-NB 15. We also used two real samples, Loki-Bot and Emotet, to verify the effectiveness of the MalDetector system. In the future, we plan to combine the approach with malware dynamic analysis to further improve its detection accuracy. Furthermore, some malware utilizes HTTPS to hide its content from analysis, which further reduces the possibility of detection. Because the header information of HTTPS traffic is encrypted, our method cannot be applied to it. We will consider new fields and combine them with DNS traffic to refine the templates and detect anomaly-based malware infection in the future.

Figure 1 :
Figure 1: The framework of our proposed approach.

Figure 5 :
Figure 5: The impact of different ratios between the training set and the testing set.

Figure 6 :
Figure 6: Loki-Bot traffic and the detection result of MalDetector.

Figure 7 :
Figure 7: Emotet traffic and the detection result of MalDetector.

Table 1 :
The description of the selected fields in the HTTP request header.

Table 2 :
HTTP header field information and template comparison.

Table 4 :
Dataset in the experiment.

Table 6 :
Tuning results of max_depth and min_child_weight.

Table 7 :
Tuning results of gamma.

Table 8 :
Tuning results of subsample and colsample_bytree.

Table 9 :
The experimental results when the ratio of the number of HTTP requests in the training set to that in the testing set is 7 : 3.
Figure 4: PRC curve of the detection approach.

Table 10 :
The experimental result of comparative testing.

Table 11 :
The experimental result under different malware traffic ratios.