APT malware exploits HTTP to communicate with a C & C server and thereby hides its malicious activities. Thus, HTTP-based APT malware infection can be discovered by analyzing HTTP traffic. Recent methods have depended on extracting statistical features from HTTP traffic that are suitable for machine learning. However, the features they extract from the limited HTTP-based APT malware traffic datasets are too simple to detect APT malware with strong randomness. In this paper, we propose an innovative approach that can uncover APT malware traffic related to data exfiltration and other suspicious APT activities by analyzing the header fields of HTTP traffic. We use the Referer field in the HTTP header to construct a web request graph. Then, we optimize the web request graph by combining URL similarity and redirect reconstruction. We also use a normal uncorrelated request filter to filter the remaining unrelated legitimate requests. We have evaluated the proposed method using 1.48 GB of normal HTTP traffic from clickminer and 280 MB of APT malware HTTP traffic from Stratosphere Lab, Contagiodump, and pcapanalysis. The experimental results show that the URL-correlation-based APT malware traffic detection method correctly detects 96.08% of APT malware traffic with a recall rate of 98.87%. We have also conducted experiments comparing our approach against Jiang’s method, MalHunter, and BotDet, and the results confirm that our detection approach performs better: its accuracy reaches 96.08%, and its F1-score increased by more than 5%.
Advanced Persistent Threats (APTs) are among the most challenging attacks, as attackers use sophisticated techniques to launch persistent attacks on specific targets [
To avoid detection, most APT attackers establish connections between infected machines and command and control servers through the HTTP/HTTPS protocols. Analyzing HTTP traffic has thus contributed to discovering the portion of malware traffic that builds a command and control channel without relying on the HTTPS protocol. The rule-based detection method [
In this paper, we propose an effective approach to detect HTTP-based APT malware infection using URL correlation. Different from most existing correlation-based detection methods, we use graph analysis to discover APT malware based on the dynamic correlation of normal traffic. A user’s normal web requests have a definite correlation, whereas the requests generated by APT malware are unrelated to the user’s current browsing behaviour. Thus, we can build a web request graph and then identify HTTP-based malware traffic based on correlation analysis among URLs. Our approach includes three phases: the first phase builds a web request graph according to the Referer field in the HTTP header; the second phase refines the web request graph through redirection refactoring and URL similarity; and the third phase filters the remaining unrelated legitimate requests using a normal uncorrelated request filter, which is trained on the user’s normal traffic. The remaining outliers in the request graph may be requests initiated by malware to establish a C & C channel over the HTTP protocol. We have conducted experiments on datasets from clickminer [
In summary, our contributions are as follows:

- We have proposed an approach to detect HTTP-based APT malware infection based on graph reasoning and used Hviz [
- Because a small percentage of normal requests do not include a Referer field, we have proposed redirection refactoring and URL similarity to refine the web request graph.
- We have used Local Outlier Factor to build a filter, which can filter 83.2% of normal uncorrelated requests.
The rest of this paper is structured as follows:
Detecting C & C channels is one of the most effective ways to detect APTs. Based on detection granularity and the basic ideas behind C & C detection, detection methods are subdivided into rule-based detection, machine-learning-based detection, and correlation-based detection [
The rule-based detection method derives detection rules and rule templates from the differences between C & C traffic and normal traffic, and then matches the traffic to be detected against the generated templates to determine whether it is malicious C & C traffic. Giroire et al. [
To improve detection speed, methods have emerged that extract traffic characteristics and rules to generate templates and then use template matching for C & C traffic detection. Zarras et al. [
Due to changes in the network environment and botnet code, not all C & C traffic strictly follows rules and templates. In addition, it is difficult for rule-based detection methods to correctly distinguish normal traffic that resembles C & C traffic. In contrast, our method concentrates on the dynamic correlation within normal HTTP traffic without specific rules and uses a customized filter to filter out normal traffic similar to C & C traffic.
The machine-learning-based detection method extracts features from network traffic, combines machine learning algorithms for model training, and finally uses the trained model to detect C & C traffic. For example, Tegeler et al. [
The machine-learning-based detection method requires large amounts of labeled data for feature engineering and model training. However, our method does not need labeled malicious C & C data to train the model; the filter we use only needs to be trained on normal benign traffic. We focus on the correlation between reference URLs and requested URLs in HTTP traffic to overcome the lack of large-scale labeled APT malicious datasets.
Based on the spatial-temporal correlation and similarity of traffic, the correlation-based detection method uses correlation algorithms to perform correlation analysis on network behaviour to detect C & C traffic. BotSniffer [
The correlation in these methods mainly refers to analyzing the correlation or similarity of malicious traffic characteristics or attack behaviours, or to comprehensively combining the results of different detection methods. Unlike these correlation-based detection methods, the correlation in our proposed approach refers to the dynamic association between HTTP requests. This association can be represented by the related URLs in the web request graph without requiring massive datasets. In addition, the dynamic association of normal traffic is difficult to reproduce, making it difficult for attackers to disguise malicious traffic as normal traffic under our detection method.
When a user browses the Internet, each web page may include hundreds of embedded objects such as images, videos, or JavaScript code, which results in many HTTP requests asking the server for resources. Usually, the initial HTTP request asks for the HTML code, and the corresponding HTTP response message type is text/html. Then, based on the HTML code, the browser loads the JavaScript, CSS, font files, and images referenced by the requested page. While the browser is loading the requested page, the user may further trigger the download of other files or AJAX requests, which generates additional HTTP requests. Thus, the order of HTTP requests for a web page is related to the user’s browsing behaviour and the composition of the web page. Furthermore, the HTTP request packet records this structure: the Host field in the header and the Request-URI field in the request line are combined to represent the currently requested page, while the Referer field in the header represents the previous page. Therefore, the request relationships of the pages accessed by the browser can be represented by a web request graph based on the Referer field.
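To make this construction concrete, the following is a minimal sketch in Python; networkx and the (host, uri, referer) input tuples are our assumptions for illustration, not the paper’s implementation:

```python
# Minimal sketch: building the web request graph from (host, uri, referer)
# tuples extracted from the HTTP header fields described above.
import networkx as nx
from urllib.parse import urlparse

def normalize(url):
    # Drop the scheme so Referer URLs match the Host + Request-URI node labels.
    parsed = urlparse(url)
    return parsed.netloc + parsed.path if parsed.netloc else url

def build_request_graph(requests):
    """requests: iterable of (host, request_uri, referer) tuples."""
    graph = nx.DiGraph()
    for host, uri, referer in requests:
        url = host + uri          # Host + Request-URI identifies the requested page
        graph.add_node(url)
        if referer:               # Referer identifies the previous page
            graph.add_edge(normalize(referer), url)
    return graph
```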
For example, Figure
A web request graph based on URL correlation.
In addition to users’ legitimate HTTP requests, there are HTTP requests initiated by HTTP-based APT malware to establish communication with the C & C server. Many malicious HTTP requests hide among users’ legitimate HTTP requests to avoid being found by network inspectors. In this work, we found that HTTP requests initiated by HTTP-based APT malware do not contain Referer fields and are not related to users’ web browsing behaviour. In other words, the HTTP requests of APT malware C & C are independent of the web request graph. Therefore, in the process of constructing a normal web request graph, URLs requested by malware cannot be added to the web request graph and become isolated nodes. According to these differences between normal HTTP requests and the malicious HTTP requests of HTTP-based APT malware, we propose to use a web request graph based on URL correlation to detect HTTP-based APT malware traffic.
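Continuing the sketch above, the uncorrelated candidates are simply the isolated nodes of the graph:

```python
import networkx as nx  # continuing the graph sketch above

def uncorrelated_requests(graph):
    # Nodes that never gained an edge are unrelated to any browsing
    # behaviour; they are the APT C & C candidates passed to the later
    # refinement and filtering stages.
    return list(nx.isolates(graph))
```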
In the detection process, many HTTP requests need to be processed. In order to further improve the detection efficiency, we classify and filter HTTP requests generated in the monitored network environment. In this section, we divide HTTP requests into the following four types according to how they are generated.
In this section, we present an overview of the proposed approach for using web request graph based on URL correlation to detect HTTP-based APT malware, as shown in Figure
The overview of HTTP-based APT malware traffic detection based on URL correlation.
The preprocessing stage mainly extracts the relevant URLs from HTTP request traffic to facilitate the subsequent construction of the web request graph. We use mitmproxy [
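A hedged sketch of what this extraction step could look like as a mitmproxy addon follows; the CSV output format and field choice are our assumptions:

```python
# Sketch of URL extraction as a mitmproxy addon; run with
# `mitmdump -s extract_urls.py`. The CSV output is an assumption.
import csv

from mitmproxy import http

class URLExtractor:
    def __init__(self):
        self.outfile = open("requests.csv", "w", newline="")
        self.writer = csv.writer(self.outfile)
        self.writer.writerow(["host", "request_uri", "referer"])

    def request(self, flow: http.HTTPFlow) -> None:
        # Host + Request-URI identify the requested page; Referer (if
        # present) identifies the page that triggered this request.
        self.writer.writerow([
            flow.request.host,
            flow.request.path,
            flow.request.headers.get("referer", ""),
        ])

addons = [URLExtractor()]
```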
Referring to Hviz [
Unfortunately, a small percentage of normal requests do not include a Referer field, for example, when the user directly enters the URL of a resource in the browser’s address bar, visits a website through a bookmark maintained by the browser, or clicks a link in an external application. Therefore, the web request graph constructed from the Referer field of HTTP requests is incomplete. To refine the constructed web request graph, we propose to add normal uncorrelated requests to the graph. This section mainly introduces how to add normal uncorrelated requests to the Referer-based web request graph. According to how normal uncorrelated requests are generated, we come up with two methods to refine the web request graph, namely, redirection refactoring and URL similarity.
If a user-initiated request is redirected, the Location field of its response gives the redirection address. The browser automatically jumps to the redirection address and sends an HTTP request to it, and subsequent HTTP requests are correlated to this HTTP request. Unfortunately, the Referer field of the HTTP request to the redirection address is empty, which means that no HTTP request is connected to the user-initiated request. Thus, the user-initiated request becomes an uncorrelated request. Because such user-initiated requests are missing from the web request graph, the accuracy of HTTP-based APT malware detection is reduced. In the dataset we used, the proportion of redirects was 9.37%. To solve this problem, we introduce redirection refactoring, which refines the web request graph according to the characteristics of redirection and adds these “missing” user-initiated requests to the web request graph. We describe redirection refactoring in detail in part 1 of Section
In this work, we found that, apart from HTTP requests whose responses are redirections, only 80% of normal HTTP requests carry a valid Referer field. In other words, after redirection refactoring, there are still many normal uncorrelated requests. We also found that the URLs of normal uncorrelated requests are similar to the URLs of other normal requests to the same server. To increase the detection accuracy, URL similarity is used to add normal uncorrelated requests to the web request graph. URL similarity focuses on the word-level similarity between URLs: each URL is divided into single words based on its segmentation symbols, and the number of words is counted. Then, we compare the URL of each normal uncorrelated request with each URL in the web request graph one by one, and add normal uncorrelated requests to the web request graph accordingly. We introduce URL similarity in detail in part 2 of Section
In addition to users’ normal web browsing, many legitimate services and legitimate software on the user’s device use the HTTP protocol to communicate. The HTTP requests issued by such legitimate software and services cannot be associated with the web request graph and become isolated nodes. If we do not deal with these requests, they will be misidentified as malicious requests in subsequent detection. To handle normal requests generated by legitimate services and software, we use machine learning to construct a normal request filter. The features we use refer to the feature selection of existing machine learning methods, as shown in Table
Feature set for normal uncorrelated request filter.
Feature | Description |
---|---|
URL length | Number of characters of the URL |
URL entropy | The information entropy of the URL |
Number of URL parameters | Number of parameters of the URL |
TLD | The top-level domain of the URL |
Domain entropy | The information entropy of the domain |
Content type | Content type of the HTTP request |
Cookie | Does the HTTP request contain cookies? |
User agent | User agent of the HTTP request |
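The following is a sketch of how the features in the table could be computed in Python; the request dictionary layout and the naive TLD extraction are our assumptions:

```python
# Sketch of the feature extraction in the table above.
import math
from collections import Counter
from urllib.parse import urlparse, parse_qs

def entropy(text: str) -> float:
    """Shannon entropy over the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def extract_features(request: dict) -> dict:
    parsed = urlparse(request["url"])
    domain = parsed.netloc
    return {
        "url_length": len(request["url"]),
        "url_entropy": entropy(request["url"]),
        "num_parameters": len(parse_qs(parsed.query)),
        "tld": domain.rsplit(".", 1)[-1],        # naive TLD extraction
        "domain_entropy": entropy(domain),
        "content_type": request.get("content_type", ""),
        "has_cookie": bool(request.get("cookie")),
        "user_agent": request.get("user_agent", ""),
    }
```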
Since the filter is only used to filter normal traffic, we use labeled normal request traffic for training. The selection of the machine learning algorithm is shown in Section
Redirection refers to redirecting a network request to a new URL that differs from the originally requested URL. After receiving such a request, the web server sends a response carrying the new URL to the user, and the client automatically sends a new HTTP request for that new URL to the server. In general, we can detect whether an HTTP request was redirected according to the status code of its response (3xx) and the Location field in the response header.
For example, when one user accesses page
Redirection refactoring of a normal redirection.
Redirection refactoring of a direct user request.
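Continuing the earlier graph sketch, redirection refactoring could be implemented as follows; the (request_url, status_code, location) response records are a hypothetical input format:

```python
def refactor_redirections(graph, responses):
    """responses: iterable of (request_url, status_code, location) tuples."""
    for request_url, status_code, location in responses:
        # A 3xx status with a Location header marks a redirection. Linking
        # the original request to the redirect target reconnects the
        # "missing" user-initiated request, whose follow-up request carries
        # an empty Referer field.
        if 300 <= status_code < 400 and location:
            graph.add_edge(normalize(request_url), normalize(location))
    return graph
```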
A URL (uniform resource locator) indicates the location of a resource on a server. Different URLs that request resources in the same scope, or the same type of resource on the same server, may be similar. Some URLs of normal uncorrelated requests are similar to URLs in the web request graph and share the same parent URL. Based on these characteristics, we can calculate the similarity between the URLs of uncorrelated requests and the URLs in the web request graph, and add normal uncorrelated requests to the web request graph based on URL similarity. There are several candidate calculation methods, such as string similarity, space vector similarity, string edit distance, and clustering-based methods. Considering their principles and applicability, we choose a similarity calculation method suited to domain name characteristics. The URL similarity calculation process is as follows:
According to the characteristics of URLs, in which “.” and “/” divide the resource access hierarchy, we propose to hierarchically divide each URL into a set of elements. Each URL is divided into two parts based on the Host and Request-URI fields: the Host field is split by “.” and the Request-URI field is split by “/”, and the resulting elements make up the URL’s element collection. After splitting URLs that access the same server, we found that these URLs share common elements. Therefore, we count the elements of each URL and calculate the ratio between the number of shared elements and the larger element count of the two URLs, which yields the following similarity equation:

$$Sim(U_1, U_2) = \frac{|E_1 \cap E_2|}{\max(|E_1|, |E_2|)}$$

In this equation, $U_1$ and $U_2$ are the two URLs being compared, $E_1$ and $E_2$ are their element sets, $|E_1 \cap E_2|$ is the number of elements they share, and $\max(|E_1|, |E_2|)$ is the larger of the two element counts.
We consider the similarity calculation between an uncorrelated request URL and all URLs in the web request graph as one URL similarity calculation process. After the similarity calculation process for an uncorrelated request ends, the maximum similarity value is extracted. If the maximum similarity value is greater than the similarity threshold, the uncorrelated request is added to the web request graph and linked to the URL with the maximum similarity.
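A sketch of this similarity computation and graph attachment is shown below; the THRESHOLD value is a placeholder (the paper determines the actual threshold experimentally), and the set-based element comparison is our reading of the equation above:

```python
# Sketch of the URL similarity refinement.
from urllib.parse import urlparse

THRESHOLD = 0.5  # hypothetical value for illustration

def url_elements(url: str) -> set:
    # Split Host on "." and Request-URI on "/" to form the element set.
    parsed = urlparse(url if "://" in url else "//" + url)
    elements = set(parsed.netloc.split("."))
    elements.update(part for part in parsed.path.split("/") if part)
    return elements

def url_similarity(url_a: str, url_b: str) -> float:
    a, b = url_elements(url_a), url_elements(url_b)
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))

def attach_by_similarity(graph, uncorrelated_url: str) -> bool:
    # Compare the uncorrelated URL against every node and attach it to the
    # most similar one if the maximum similarity clears the threshold.
    best_node, best_sim = None, 0.0
    for node in graph.nodes:
        sim = url_similarity(uncorrelated_url, node)
        if sim > best_sim:
            best_node, best_sim = node, sim
    if best_node is not None and best_sim > THRESHOLD:
        graph.add_edge(best_node, uncorrelated_url)
        return True
    return False
```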
This section mainly introduces the evaluation criteria, datasets, and experimental results analysis.
This work uses accuracy, recall, and F1-score to comprehensively evaluate the detection effect.
TP represents the number of normal HTTP requests predicted as normal, that is, the number of normal HTTP requests that constitute the web request graph.
TN represents the number of malicious HTTP requests predicted as malicious, that is, the number of malicious HTTP requests that remain uncorrelated.
FP represents the number of malicious HTTP requests predicted as normal, that is, the number of malicious HTTP requests that were added to the web request graph.
FN represents the number of normal HTTP requests predicted as malicious, that is, the number of remaining normal uncorrelated HTTP requests.
Accuracy is the ratio of the number of correctly detected HTTP requests to the total number of HTTP requests to be detected. The higher the accuracy, the better the detection. Accuracy is calculated as follows:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
Recall is a measure of coverage: it is the ratio of correctly detected normal HTTP requests to the total number of actual normal HTTP requests. The higher the recall rate, the better the detection. The recall rate is calculated as follows:

$$Recall = \frac{TP}{TP + FN}$$
The F1-score is defined in terms of precision ($Precision = \frac{TP}{TP + FP}$) and recall. The higher the F1-score, the better the detection. The F1-score is calculated as follows:

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
The experimental dataset used in this work includes a normal traffic dataset and a malicious traffic dataset. We used a normal dataset from clickminer [
Before conducting the experiments, we need to choose a suitable machine learning algorithm to build the normal request filter. Since only normal request traffic is used for training, malicious traffic is an outlier for the normal request filter. Based on this characteristic, we use anomaly detection algorithms to train the normal uncorrelated request filter, including One-Class SVM, Isolation Forest, and Local Outlier Factor. The training data are mainly constructed from normal uncorrelated request traffic, that is, the request traffic of clickminer that remains after constructing the web request graph. The experimental results are shown in Table
Results of different machine learning algorithms.
Machine learning | Accuracy (%) | Recall (%) | F1-score (%) |
---|---|---|---|
One class SVM | 90.49 | 91.69 | 93.87 |
Isolation forest | 89.81 | 92.87 | 93.50 |
Local outlier factor | 92.91 | 95.31 | 95.48 |
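A minimal sketch of training such a filter with scikit-learn’s LocalOutlierFactor in novelty mode follows; the numeric feature matrix and the n_neighbors value are our assumptions:

```python
# Sketch of the normal uncorrelated request filter with scikit-learn;
# X_normal is a numeric feature matrix built from the features above
# (categorical fields such as TLD, content type, and user agent would
# need encoding, e.g. one-hot, beforehand).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def train_filter(X_normal: np.ndarray) -> LocalOutlierFactor:
    # novelty=True lets the model predict on unseen samples after being
    # fitted on normal traffic only.
    lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
    lof.fit(X_normal)
    return lof

def apply_filter(lof: LocalOutlierFactor, X_candidates: np.ndarray):
    # predict() returns +1 for inliers (filtered out as benign) and -1 for
    # outliers (retained as suspected APT C & C traffic).
    return lof.predict(X_candidates)
```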
We found that when the URL similarity threshold
We mixed the collected datasets and then performed experiments in a computing environment with a 3.6 GHz Intel Core i5 and 8 GB of RAM.
As shown in Table
Change of uncorrelated requests during the experiment.
Experiment step | Normal (%) | Malicious (%) |
---|---|---|
Construct web request graphs | 6.02 | 19.84 |
Refine web request graphs | 5.20 | 19.47 |
Normal uncorrelated request filter | 0.87 | 19.01 |
Change of the detection rate during the experiment.
We compared our proposed detection method with three types of detection methods, which are Jiang’s [
As for the reference [
The results of the comparison experiments are shown in Figure
Comparison of our method with three types of detection method.
The ROC curves of our method and three types of detection methods.
Our proposed method also has some limitations. Our main goal is to detect HTTP communications generated by malware whose Referer field is empty; because the malware mainly conducts C & C communication with the server, most of the HTTP traffic generated by APT malware carries an empty Referer field and becomes an isolated node in the request graph. If malware forges the Referer field, it will have a certain impact on our method. For example, some malicious software fakes the Referer field as a known site, which may be mistakenly judged as a benign node during the correlation detection process. In follow-up work, we will continue to address the problems caused by forged Referer fields. At the same time, the proposed method also has certain detection capabilities for botnets: when a botnet uses HTTP C & C communication, it will likewise become an isolated node in the request graph and be detected. For example, Chen et al. [
When HTTP-based APT malware performs C & C communication, it generates HTTP traffic that is unrelated to the HTTP traffic of the user’s normal Internet browsing. Based on this observation, this paper proposed a method of detecting HTTP-based APT malware-infected traffic using URL correlation. First, we referred to Hviz to construct a web request graph according to URL correlation. Then, we used redirection refactoring and URL similarity to refine the web request graph. Finally, we used a normal uncorrelated request filter to increase the detection accuracy. After these three steps were completed, the remaining uncorrelated requests were marked as malicious traffic. This paper used public datasets from clickminer, Stratosphere Lab, Contagiodump, and pcapanalysis to evaluate the proposed method. The accuracy of detecting HTTP-based APT malware traffic was 96.08%. The experimental results have shown that the proposed method can effectively detect APT malware traffic. Moreover, the proposed method does not rely on knowledge of previous attacks, which makes it more advantageous for detecting the small-scale traffic generated by HTTP-based APT malware.
The malicious traffic datasets were derived from APT malware collected by Stratosphere Lab [
The authors declare that they have no conflicts of interest.
This work was partially supported by the National Key Research and Development Program of China (Grant no. 2016QY13Z2302), the National Natural Science Foundation of China (Grant no. 61902262), the National Defense Innovation Special Zone Program of Science and Technology (Grant no. JG2019055), the Director of Computer Application Research Institute Foundation (Grant no. SJ2020A08), and China Academy of Engineering Physics Innovation and Development Fund Cultivation Project (Grant no. PY20210160).