Malicious domain name attacks have become a serious issue for Internet security. In this study, a malicious domain names detection algorithm based on
While the rapid development of the Internet has changed our lives for the better, the number and variety of malicious cyberattacks have grown alongside it. According to the 36th issue of the 2018 “Network information security and dynamic weekly report” released by CNCERT, Chinese servers were attacked 178,156 times in 2018 by 3,724 malicious domain names such as Conficker, Trojan, and Srizbis, a year-over-year increase of 55.8% [
DNS (domain name system), one of the basic services for realizing the conversion between network domain name and IP addresses in the Internet [
To achieve their malicious purposes, attackers exploit system or service vulnerabilities to implant malicious programs on a host, and the infected host is then controlled remotely by the attackers [
The remainder of this study is organized as follows. A number of related works are reviewed in Section
From the perspective of domain name structure features and lexical composition, there are two main types of malicious domain names detection methods in the literature: domain name model [
There are many differences between the normal domain names and the malicious domain names in terms of behaviors and structures. Therefore, the legitimacy of domain names can be determined by analyzing the behavior and structure of domain names. For example, Truong et al. [
The method of domain name semantic detection includes detection based on character matching and on content analysis. For example, Yadav et al. [
Each of the above malicious domain name detection methods has its own advantages and drawbacks. The domain name model detection methods have a high detection accuracy rate and a wide application range. However, they require a long data collection period, and it is difficult to obtain a large amount of resolution data from both the local domain name server and the root domain name server, which results in a high detection time overhead. The domain name semantic detection methods have a low detection time overhead, but they design their detection features from domain name blacklists and therefore cannot effectively detect newly generated domain names.
To address the problems such as low detection accuracy rate and high detection overhead, a new method of malicious domain names detection based on
In the following sections, we provide more details for the components of the proposed algorithm.
Figure
Flowchart of malicious domain names detection algorithm based on
In the process of malicious domain names detection, a test domain name is segmented also by the
Alexa ranking is a service that Amazon provides to the public to evaluate the popularity of domain names [
Through the observation of a large number of normal domain names in Alexa 2013 [
In the process of domain name resolution, when the domain name resolution request is not recorded in local network DNS servers, the resolution request is forwarded to the superior network DNS servers, until it reaches the root domain DNS servers. After reaching the root domain, the resolution requests are forwarded to the DNS servers again where the top-level, second-level, third-level, and other level domain names are located, until the domain name resolution result is found. Given that the domain name resolution request is forwarded from the superior domain to the inferior domain, if the given inferior domain name does not exist, the domain name resolution fails. Then, the domain name resolution result or the cause of the domain name resolution failure will be returned to the host that initiates the domain name resolution according to the original path of the request.
From the domain name resolution process, it is noted that the deeper level a malicious domain name is at, the greater its forwarding number is, thus the heavier burden it creates on the system. Conversely, the closer a malicious domain name is to the top-level domain, the smaller its forwarding number is, and thus the easier it can be found. In addition, because of the small quantity, short length, and high popularity of top-level domains, they are readily identified. Therefore, malicious domain names are rarely found in the top-level domain, but often exist in the second, third, or fourth domain. Hence, this study focuses on each domain name substring excluding the top-level domain.
The character string in the text is segmented by a sliding window with a size of
Principle diagram of 5-Gram segmentation.
When the
Each level substring length proportion.
Length | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|
Proportion (%) | 5.39 | 20.09 | 29.13 | 29.21 | 13.81 |
As seen from Table
For example, after removing the top-level domain “com” of wapseo.chinaz.com, the process of second-level and third-level domains is segmented by the
Process of domain name segmentation.
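The segmentation step can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names `split_levels`, `ngrams`, and `segment_domain` are hypothetical, and the top-level domain is assumed to be the single right-most label (so multi-label suffixes such as "com.cn" are not handled).

```python
def split_levels(domain):
    """Return the labels of a domain name with the top-level domain removed.

    Assumption: the top-level domain is the single right-most label.
    """
    labels = domain.lower().strip(".").split(".")
    return labels[:-1]


def ngrams(label, sizes=(3, 4, 5, 6, 7)):
    """Slide a window of each size over one label and collect the substrings."""
    substrings = []
    for n in sizes:
        # A label shorter than n yields no substrings for that window size.
        for start in range(len(label) - n + 1):
            substrings.append(label[start:start + n])
    return substrings


def segment_domain(domain, sizes=(3, 4, 5, 6, 7)):
    """Segment every level below the top-level domain separately."""
    return [(label, ngrams(label, sizes)) for label in split_levels(domain)]


for label, substrings in segment_domain("wapseo.chinaz.com"):
    print(label, substrings)
```

Each level is segmented on its own, so no substring spans the dot between "wapseo" and "chinaz".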
The top 100,000 domain names in Alexa 2013 are selected, and each domain name excluding the top-level domain is segmented into multiple domain name substrings according to its domain level with the lengths of 3, 4, 5, 6, and 7 by the
When the size of the sliding window
The number of domain name substrings generated when
Sliding window | Number |
---|---|
3 | 21,584 |
4 | 84,431 |
5 | 120,626 |
6 | 116,908 |
7 | 55,274 |
Total | 398,823 |
The extraction of the lexical features of the domain name turns into numerical calculation by calculating the weight values of 398,823 domain name substrings in the domain name whitelist substring set. The weight value of domain name substring is calculated by the following formula:
398,823 domain name substrings are extracted from the top 100,000 domain names in Alexa 2013 by the
Partial domain names substring weight values from the top 100,000 domain names in Alexa 2013.
Substring | Count | Weight value |
---|---|---|
ing | 3139 | 10.031 |
ter | 2105 | 9.454 |
line | 1270 | 8.310 |
blog | 1194 | 8.221 |
direc | 587 | 6.875 |
forum | 452 | 6.498 |
ectory | 341 | 5.828 |
ogspot | 293 | 5.609 |
rectory | 341 | 5.606 |
youtube | 220 | 4.974 |
rketing | 167 | 4.576 |
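The values listed in the table agree, to three decimal places, with the relation w = log2(c / n), where c is the number in the second column and n is the length of the substring. This relation is inferred here from the tabulated numbers and should be read as an assumption, not as a quotation of the weight formula. A short Python check, with the hypothetical names `substring_weight` and `TABLE`:

```python
import math


def substring_weight(count, substring):
    """Assumed relation: weight = log2(count / substring length)."""
    return math.log2(count / len(substring))


# (count, listed weight) pairs copied from the table above.
TABLE = {
    "ing": (3139, 10.031),
    "ter": (2105, 9.454),
    "line": (1270, 8.310),
    "blog": (1194, 8.221),
    "direc": (587, 6.875),
    "forum": (452, 6.498),
    "ectory": (341, 5.828),
    "ogspot": (293, 5.609),
    "rectory": (341, 5.606),
    "youtube": (220, 4.974),
    "rketing": (167, 4.576),
}

for substring, (count, listed) in TABLE.items():
    computed = substring_weight(count, substring)
    print(substring, round(computed, 4), listed)
```

Under this relation, a longer substring needs proportionally more occurrences to reach the same weight as a shorter one.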
In the process of malicious domain names detection, a test domain name is segmented by the
Flow of malicious domain names detection.
Each domain name excluding the top-level domain is segmented into multiple substrings with the lengths of 3, 4, 5, 6, and 7 by the
For example, replacing the letter “o” in the normal domain name “taobao.com” with the digit “0” yields the look-alike malicious domain name “ta0ba0.com.” After removing the top-level domain “com,” the RV of each of the two domain names can be calculated. The RV is calculated as follows:
By calculating the RV of the normal domain name “taobao.com” and the malicious domain name “ta0ba0.com,” it can be seen that the RV of the normal domain name “taobao.com” is 17.227. When the size of
In the process of malicious domain names recognition, the size of threshold decides the accuracy rate of the detection algorithm in this study. In order to attain the superior detection accuracy rate, the variable parameter threshold
Relationship curves of accuracy rate with D.
From Figure
In this study, the threshold for malicious domain names detection is set on the basis of the domain name whitelist substring set that is constructed from the top 100,000 domain names in Alexa 2013. If the whitelist sample used to construct the domain name whitelist substring set is replaced, the threshold for malicious domain names detection needs to be reset according to the above steps.
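The threshold selection can be pictured as a sweep over candidate values that keeps the one with the highest accuracy rate on labelled samples. The sketch below assumes the RVs of the labelled normal and malicious domain names have already been computed; the names `accuracy_at` and `best_threshold` are hypothetical, and the numbers are toy values chosen only for illustration.

```python
def accuracy_at(threshold, normal_rvs, malicious_rvs):
    """Fraction of labelled samples classified correctly at this threshold.

    A domain name is treated as normal when its RV is >= threshold.
    """
    correct = sum(rv >= threshold for rv in normal_rvs)
    correct += sum(rv < threshold for rv in malicious_rvs)
    return correct / (len(normal_rvs) + len(malicious_rvs))


def best_threshold(normal_rvs, malicious_rvs, candidates):
    """Return the candidate with the highest accuracy (first one on ties)."""
    return max(candidates,
               key=lambda d: accuracy_at(d, normal_rvs, malicious_rvs))


# Toy RVs, not experimental data.
normal_rvs = [17.2, 20.0, 9.5]
malicious_rvs = [2.0, 5.0, 11.0]
print(best_threshold(normal_rvs, malicious_rvs, [4, 8, 10, 12]))
```

Rerunning such a sweep is what resetting the threshold amounts to when the whitelist sample changes.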
Once the threshold is set, the RV of the domain name to be tested is calculated and compared with the threshold. If the RV is greater than or equal to the threshold, the domain name is judged to be a normal domain name; otherwise, it is judged to be a malicious domain name.
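The decision rule can be sketched as follows. The sketch assumes that the RV is the sum of the whitelist weights of all substrings of the test domain name and that substrings absent from the whitelist contribute zero; both points are assumptions made for illustration. The names `reputation_value` and `classify`, the toy weights, and the toy threshold are hypothetical.

```python
def reputation_value(domain, weights, sizes=(3, 4, 5, 6, 7)):
    """Sum the whitelist weights of every substring of the domain name.

    Assumptions: the top-level domain is the right-most label, and
    substrings missing from the whitelist contribute 0.
    """
    total = 0.0
    for label in domain.lower().strip(".").split(".")[:-1]:
        for n in sizes:
            for start in range(len(label) - n + 1):
                total += weights.get(label[start:start + n], 0.0)
    return total


def classify(domain, weights, threshold):
    """Normal if RV >= threshold, malicious otherwise."""
    rv = reputation_value(domain, weights)
    return "normal" if rv >= threshold else "malicious"


# Toy whitelist weights and threshold, for illustration only.
toy_weights = {"tao": 3.0, "bao": 4.0, "taobao": 2.0}
print(classify("taobao.com", toy_weights, threshold=5.0))
print(classify("ta0ba0.com", toy_weights, threshold=5.0))
```

With these toy values, none of the substrings of "ta0ba0" appears in the whitelist, so its RV stays at zero and it falls below the threshold.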
To verify the performance of the proposed algorithm based on
The experimental environment is shown in Table
Experimental environment.
Parameters | Value |
---|---|
CPU | AMD A12-9700 2.5 GHz |
GPU | AMD R8 M435DX |
Memory | 8 GB |
OS | 64-bit Windows 10 |
Platform | Jupyter Notebook |
Python | 3.5 |
The top 100,000 domain names in Alexa 2013 are selected as the domain name whitelist sample set. Each domain name excluding the top-level domain is segmented into multiple domain name substrings according to its domain level with the lengths of 3, 4, 5, 6, and 7 by the
In this study, 10,265 domain names from the Alexa 2017 and Malware domain list are collected and collated [
In order to evaluate the performance of the malicious domain names detection algorithm based on
Confusion matrix of TN, FN, FP, and TP.
Actual | Predicted negative | Predicted positive |
---|---|---|
Negative | True negative (TN) | False positive (FP) |
Positive | False negative (FN) | True positive (TP) |
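The evaluation indexes can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions of accuracy rate, false negative rate, and false positive rate, which are assumed to match the ones used in this study; the function name `detection_metrics` and the counts in the example are hypothetical.

```python
def detection_metrics(tp, tn, fp, fn):
    """Standard accuracy, false negative rate, and false positive rate."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,  # correctly classified samples
        "fnr": fn / (fn + tp),          # positives predicted as negative
        "fpr": fp / (fp + tn),          # negatives predicted as positive
    }


# Toy counts, not experimental results.
print(detection_metrics(tp=90, tn=80, fp=20, fn=10))
```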
Figure
Threshold
From the results of the two experiments presented in Figure
To verify the effectiveness of the proposed algorithm, experiments were also performed using methods from the recent literature [
The performance comparison between our approach and methods in [
Method | AR (%) | FNR (%) | FPR (%) | TO (s) |
---|---|---|---|---|
Shi et al. | 95.75 | 4.29 | 6.62 | 42.68 |
Ma et al. | 91.04 | 2.65 | 7.11 | 18.92 |
Wu et al. | 91.52 | 8.48 | 1.50 | 32.08 |
Song et al. | 93.47 | 4.42 | 7.43 | 38.27 |
Our approach | 94.04 | 7.42 | 6.14 | 31.75 |
Our proposed method yielded the best combined result in terms of accuracy rate and computational efficiency: each of the other methods either has a lower accuracy rate or is computationally more expensive. In addition, compared with the other methods, which are based on machine learning techniques, our method makes it much easier to incorporate new data as they become available. Whereas machine learning algorithms require retraining on all the data, our method only needs to modify the weights of the relevant substrings.
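The incremental update can be illustrated with a small sketch in which adding one whitelist domain name changes only the counts and weights of the substrings it contains. The class `WhitelistModel` and its method `add_domain` are hypothetical, and the weight relation log2(count / length) is an assumption inferred from the tabulated weight values, not a quotation of the weight formula.

```python
import math
from collections import Counter


class WhitelistModel:
    """Substring counts and weights that can be updated one domain at a time."""

    def __init__(self, sizes=(3, 4, 5, 6, 7)):
        self.sizes = sizes
        self.counts = Counter()
        self.weights = {}

    def add_domain(self, domain):
        """Add one whitelist domain name and refresh only the affected weights."""
        touched = set()
        # Assumption: the top-level domain is the right-most label.
        for label in domain.lower().strip(".").split(".")[:-1]:
            for n in self.sizes:
                for start in range(len(label) - n + 1):
                    substring = label[start:start + n]
                    self.counts[substring] += 1
                    touched.add(substring)
        for substring in touched:
            # Assumed weight relation: log2(count / substring length).
            self.weights[substring] = math.log2(
                self.counts[substring] / len(substring))
        return touched


model = WhitelistModel()
model.add_domain("blog.com")
print(sorted(model.add_domain("blogs.com")))
```

Substrings that do not occur in the newly added domain name keep their existing weights, so no pass over the full whitelist is needed.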
This study proposes a new method for malicious domain names detection. The main contributions are as follows: The top 100,000 domain names in Alexa 2013 are taken as the domain name whitelist sample set to construct the domain name whitelist substring set The
Compared to the malicious domain names detection methods proposed by [
The Alexa and Malware domain list datasets used to support the findings of this study are cited at relevant places within the text as references [
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This research work was supported by the National Science Foundation of China under Grant nos. 51668043 and 61262016, the CERNET Innovation Project under Grant nos. NGII20160311 and NGII20160112, and the Gansu Science Foundation of China under Grant no. 18JR3RA156.