A botnet is one of the most grievous threats to network security since it can evolve into many attacks, such as Denial-of-Service (DoS), spam, and phishing. However, current detection methods are inefficient at identifying unknown botnets, and high-speed network environments make botnet detection even more difficult. To solve these problems, we build on advances in packet processing technologies such as the New Application Programming Interface (NAPI) and zero copy and propose an efficient quasi-real-time intrusion detection system. Our work detects botnets using a supervised machine learning approach in a high-speed network environment. Our contributions are summarized as follows: (1) We build a detection framework using PF_RING for sniffing and processing network traces to extract flow features dynamically. (2) We use a random forest model to extract promising conversation features. (3) We analyze the performance of different classification algorithms. The proposed method is evaluated on the well-known CTU13 dataset and on traffic from nonmalicious applications. The experimental results show that our conversation-based detection approach identifies botnets with higher accuracy and a lower false positive rate than the flow-based approach.
Botnet [
In the past, researchers used signature-based [
Currently, the backbone network is based on 1 Gbps or 10 Gbps optical fibers, which generate massive traffic data in a short time. Moreover, fast-growing P2P applications put significant strain on data storage. Therefore, identifying botnet traffic in a high-speed network is a challenging issue [
The contributions of the paper are threefold. First, a novel botnet detection system with low latency and high accuracy is introduced. Second, our detection method identifies botnet traffic using conversation-based traffic analysis and supervised machine learning. Our approach is more accurate than flow-based detection, since it decreases the false positive rate for botnet traffic by 13.2 percent. Third, we evaluate the performance of five well-known supervised machine learning algorithms (MLAs) [
The remainder of this paper is organized as follows: Section
Botnet detection methods fall into two categories: host behavior-based detection [
Host-based detection is the earliest method. To determine whether a host is compromised, this method continuously monitors the change of process, files, network connections, and registries under a controlled environment [
Network-based detection [
The network-based detection method has a high detection rate because it extracts common flow features that are independent of the botnet category. However, in high-speed and complex networks, existing detection platforms based on flow features are ineffective because of their high packet drop rate.
In this section, we describe the components of our proposed botnet traffic detection framework.
The framework consists of the following modules: (1) a traffic process module for clustering captured packets into different flow buffers; (2) a flow-based feature extraction module for generating the statistical characteristics of flows; (3) a conversation-based feature selection module for extracting a promising conversation-based feature set; and (4) a botnet detection module for identifying botnet traffic using a machine learning algorithm.
The packet process module extracts the required fields from the packets. After the desired information has been extracted, the flow-based feature extraction module generates flow features. Based on the flow features, the conversation-based feature selection module obtains a promising conversation feature set for the botnet detection module. Botnet traffic detection is accomplished using a supervised classification algorithm [
Libpcap [
The packet process module architecture.
First, the kernel layer of the packet process module reads the configuration file to set parameter values such as the packet length and ClusterId, where ClusterId is the ID of the Ring Buffer created by PF_RING. Parameter values are stored in the configuration file so that they can be modified at any time. Second, the network devices are turned on and the Ring Buffer is created using the pfring_open (device, snaplen, flags) function of PF_RING, where device denotes the name of the network device, snaplen denotes the captured packet length, and flags denotes whether the device is in promiscuous mode. Here, we set snaplen to 60 because only the header fields of a packet are needed in this paper. Third, we save the header information, payload length, and arrival time of each packet in a flow buffer selected by the five-tuple (SrcIp, DstIp, SrcPort, DstPort, Proto), which is used to mark a flow. That is, two packets that have the same source and destination hosts, the same source and destination ports, and the same protocol belong to the same flow.
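The third step can be illustrated with a minimal Python sketch. It is not the kernel-level PF_RING code of the framework; the `Packet` record, the function names, and the sample addresses are hypothetical and only show how packets are assigned to flow buffers by the five-tuple (SrcIp, DstIp, SrcPort, DstPort, Proto).

```python
from collections import defaultdict, namedtuple

# Hypothetical record holding the fields kept for each captured packet.
Packet = namedtuple(
    "Packet", "src_ip dst_ip src_port dst_port proto payload_len arrival_time"
)

def flow_key(pkt):
    """Return the five-tuple (SrcIp, DstIp, SrcPort, DstPort, Proto) of a packet."""
    return (pkt.src_ip, pkt.dst_ip, pkt.src_port, pkt.dst_port, pkt.proto)

def bucket_packets(packets):
    """Append (payload length, arrival time) of each packet to its flow buffer."""
    flow_buffers = defaultdict(list)
    for pkt in packets:
        flow_buffers[flow_key(pkt)].append((pkt.payload_len, pkt.arrival_time))
    return flow_buffers

# Illustrative packets: the first two share a five-tuple, the third differs
# only in its source port and therefore starts a second flow.
packets = [
    Packet("10.0.0.1", "10.0.0.2", 40000, 80, "TCP", 0, 0.00),
    Packet("10.0.0.1", "10.0.0.2", 40000, 80, "TCP", 120, 0.05),
    Packet("10.0.0.1", "10.0.0.2", 40001, 80, "TCP", 0, 0.10),
]
flow_buffers = bucket_packets(packets)
```

The key is taken exactly as stated in the text, so packets travelling in the reverse direction would land in a separate buffer; pairing the up and down directions of a flow is left out of this sketch.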
Flow reorganization differs across transport layer protocols. Taking TCP as an example, the three-way handshake marks the start of a flow, and a packet whose FIN or RST flag is set to 1 marks the end of the flow. The detailed TCP flow reorganization process is shown in Figure
The packet process module architecture.
When a packet arrives, we first check whether the flow it belongs to already exists. If the flow does not exist and the flag value of the packet is 0x02 (SYN), we create a flow according to the IP addresses, protocol, and ports; if the flag takes any other value, the packet is dropped. An instance of the flow reorganization state machine can be in only one of five states: handshake_1, handshake_2, handshake_3, data transmission, and end. A packet whose flag value is 0x02 puts the flow in the handshake_1 state. Only when a packet whose flag value is 0x12 (SYN/ACK) arrives does the flow move to the handshake_2 state. Then, the arrival of a packet whose flag value is 0x10 (ACK) marks the handshake_3 state. After the three-way handshake, data transmission begins. At any point during flow reorganization, a packet whose flag value is 0x02 returns the flow to the handshake_1 state.
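The state machine can be sketched in Python as follows. This is a simplified illustration, not the implementation used in the framework: the function names are hypothetical, packets with unexpected flags during the handshake are assumed to leave the state unchanged, and a repeated SYN is assumed to restart the handshake at handshake_1. The flag constants are the standard TCP header bits.

```python
# Standard TCP flag bits.
FIN, SYN, RST, ACK = 0x01, 0x02, 0x04, 0x10
SYN_ACK = SYN | ACK  # 0x12

def next_state(state, flags):
    """Return the next state of a flow; None means no flow exists (packet dropped)."""
    if state is None:
        # A flow is only created by a pure SYN; any other packet is dropped.
        return "handshake_1" if flags == SYN else None
    if state == "end":
        return "end"  # the caller is expected to release a finished flow
    if flags & (FIN | RST):
        return "end"  # FIN or RST marks the end of the flow
    if flags == SYN:
        return "handshake_1"  # assumption: a new SYN restarts the handshake
    if state == "handshake_1" and flags == SYN_ACK:
        return "handshake_2"
    if state == "handshake_2" and flags == ACK:
        return "handshake_3"
    if state in ("handshake_3", "data_transmission"):
        return "data_transmission"
    return state  # assumption: unexpected flags leave the state unchanged

def replay(flag_sequence):
    """Feed a sequence of flag values through the state machine and record each state."""
    state, trace = None, []
    for flags in flag_sequence:
        state = next_state(state, flags)
        trace.append(state)
    return trace

# Stray ACK (dropped), SYN, SYN/ACK, ACK, PSH/ACK data, FIN/ACK.
trace = replay([0x10, 0x02, 0x12, 0x10, 0x18, 0x11])
```

In this trace the stray ACK is dropped because no flow exists yet, and the remaining packets walk the flow through handshake_1, handshake_2, handshake_3, data transmission, and end.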
After analyzing the data characteristics of botnets, we find that flows belonging to the same botnet are similar. Here, a conversation contains many flows with different source ports. That is, two flows having the same source IP, destination IP, destination port, and protocol are classified into the same conversation. Promising conversation features are generated from the flow features. Thus, the flow-based feature module extracts statistical features including flow duration, the average interval of up (down) flow, the maximal/minimum/average length of up (down) flow, the number of valid up (down) packets in a flow, the number of transmission bytes of up (down) flows, and the number of small packets in a flow.
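As a rough illustration of the flow-level statistics listed above, the following Python sketch computes them for one direction of a flow from its buffer of (payload length, arrival time) pairs. The function name, the dictionary keys, and the `small_packet_bytes` threshold are assumptions made for this example; the text does not state the size below which a packet counts as small.

```python
from statistics import mean

def flow_features(flow_buffer, small_packet_bytes=100):
    """Compute flow statistics from (payload_len, arrival_time) pairs sorted by time.

    small_packet_bytes is a hypothetical threshold chosen for illustration.
    """
    lengths = [length for length, _ in flow_buffer]
    times = [t for _, t in flow_buffer]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "duration": times[-1] - times[0],
        "avg_interval": mean(gaps) if gaps else 0.0,
        "min_len": min(lengths),
        "max_len": max(lengths),
        "avg_len": mean(lengths),
        "packets": len(flow_buffer),
        "bytes": sum(lengths),
        "small_packets": sum(1 for length in lengths if length < small_packet_bytes),
    }

# Illustrative buffer with three packets.
features = flow_features([(60, 0.0), (1400, 0.5), (40, 2.0)])
```

Applying the same function separately to the up and down packets of a flow would give the per-direction values.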
Conversation features.
Feature | Description |
---|---|
avg_duration | The average duration of flows in a conversation |
min_duration | The minimum duration of flows in a conversation |
max_duration | The maximum duration of flows in a conversation |
std_duration | The standard deviation of the duration of flows in a conversation |
avg_f(b)inter | The average interval of up (down) flows in a conversation |
avg_f(b)pkl | The average length of up (down) flows in a conversation |
min_f(b)pkl | The minimum length of up (down) flows in a conversation |
max_f(b)pkl | The maximum length of up (down) flows in a conversation |
std_avg_f(b)pkl | The standard deviation of the length of up (down) flows in a conversation |
avg_f(b)pks | The average number of valid up (down) packets of flows in a conversation |
std_avg_f(b)pks | The standard deviation of the number of valid up (down) packets of flows in a conversation |
avg_f(b)pksl | The average number of transmission bytes of up (down) flows in a conversation |
std_f(b)pksl | The standard deviation of the transmission bytes of up (down) flows in a conversation |
min_spacket | The minimum number of small packets of flows in a conversation |
max_spacket | The maximum number of small packets of flows in a conversation |
avg_spacket | The average number of small packets of flows in a conversation |
std_spacket | The standard deviation of the number of small packets of flows in a conversation |
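To make the aggregation concrete, the following Python sketch derives the four duration features of the table from per-flow durations. It is a minimal example under stated assumptions: the function names and sample values are hypothetical, and the population standard deviation is used because the text does not say which variant is meant. The other rows of the table would follow the same pattern with a different per-flow statistic.

```python
from collections import defaultdict
from statistics import mean, pstdev

def conversation_key(flow_key):
    """Drop the source port so that flows differing only in it share one conversation."""
    src_ip, dst_ip, _src_port, dst_port, proto = flow_key
    return (src_ip, dst_ip, dst_port, proto)

def duration_features(flow_durations):
    """Map each conversation to the duration statistics of the flows it contains."""
    grouped = defaultdict(list)
    for key, duration in flow_durations.items():
        grouped[conversation_key(key)].append(duration)
    return {
        conv: {
            "avg_duration": mean(durations),
            "min_duration": min(durations),
            "max_duration": max(durations),
            # Assumption: population standard deviation.
            "std_duration": pstdev(durations),
        }
        for conv, durations in grouped.items()
    }

# Illustrative flow durations in seconds, keyed by five-tuple.
flow_durations = {
    ("10.0.0.1", "10.0.0.2", 40000, 80, "TCP"): 2.0,
    ("10.0.0.1", "10.0.0.2", 40001, 80, "TCP"): 4.0,
    ("10.0.0.1", "10.0.0.3", 40002, 53, "UDP"): 1.5,
}
conversation_features = duration_features(flow_durations)
```

The first two flows differ only in their source port, so they fall into one conversation, while the third flow forms a conversation of its own.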
We use random forest algorithm [
(1) (2) initialization (3) while (4) draw a bootstrap sample (5) repeat (6) select (7) calculate the Gini coefficient of selected (8) select the feature with lower Gini coefficient among the (9) split the node into two daughter nodes (10) (11) construct decision tree (12) (13)
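Steps (7) to (9) of the procedure can be sketched in Python as follows, assuming the standard two-class Gini impurity and threshold splits on numeric features. The function names and the sample rows are hypothetical, and the random choice of the candidate feature subset in step (6) is left to the caller.

```python
def gini(labels):
    """Gini impurity of a set of binary labels (1 = botnet, 0 = normal)."""
    if not labels:
        return 0.0
    p_bot = sum(labels) / len(labels)
    return 1.0 - p_bot ** 2 - (1.0 - p_bot) ** 2

def split_gini(values, labels, threshold):
    """Size-weighted Gini impurity of the two daughter nodes of a threshold split."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(rows, labels, features):
    """Return (gini, feature, threshold) of the split with the lowest weighted Gini."""
    best = None
    for feature in features:
        values = [row[feature] for row in rows]
        # Every observed value except the largest is a candidate threshold.
        for threshold in sorted(set(values))[:-1]:
            score = split_gini(values, labels, threshold)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

# Illustrative conversations (values are made up, not taken from the dataset).
rows = [
    {"avg_duration": 1.0, "avg_spacket": 5},
    {"avg_duration": 1.2, "avg_spacket": 1},
    {"avg_duration": 9.0, "avg_spacket": 4},
    {"avg_duration": 8.5, "avg_spacket": 2},
]
labels = [1, 1, 0, 0]
best = best_split(rows, labels, ["avg_duration", "avg_spacket"])
```

In this toy sample, splitting on avg_duration at 1.2 separates the two classes completely, so it gives a weighted Gini of 0 and is selected to split the node.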
In the procedure of random forest model establishment, Gini coefficient is used to select feature. Here are 2 classes; thus, the value of
Then, we select promising features according to random forest model. The feature selection process is shown in Algorithm
(1) the current botnet traffic detection rate (2) initialization (3) (4) (5) (6) calculate RF scores of importance (7) rank the RF scores (8) delete the feature with the smallest importance from (9) (10)
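The selection loop can be sketched in Python under one reading of its stopping rule: the least important feature is removed as long as doing so improves the detection rate, and the loop stops once the rate no longer improves. The two callbacks are hypothetical stand-ins for training a random forest and measuring its detection rate, and the toy numbers are not measured values.

```python
def backward_select(features, importance_fn, detection_rate_fn):
    """Drop the least important feature until the detection rate stops improving."""
    selected = list(features)
    best_rate = detection_rate_fn(selected)
    while len(selected) > 1:
        scores = importance_fn(selected)  # maps feature -> RF importance score
        weakest = min(selected, key=lambda f: scores[f])
        candidate = [f for f in selected if f != weakest]
        rate = detection_rate_fn(candidate)
        if rate <= best_rate:
            break  # no further gain: keep the current feature set
        selected, best_rate = candidate, rate
    return selected, best_rate

# Toy stand-ins for illustration only (NOT results from the paper).
TOY_IMPORTANCE = {
    "avg_duration": 0.40,
    "std_duration": 0.30,
    "avg_spacket": 0.25,
    "max_spacket": 0.05,
}
TOY_RATE_BY_SIZE = {4: 0.90, 3: 0.92, 2: 0.92, 1: 0.80}

selected, rate = backward_select(
    list(TOY_IMPORTANCE),
    importance_fn=lambda feats: {f: TOY_IMPORTANCE[f] for f in feats},
    detection_rate_fn=lambda feats: TOY_RATE_BY_SIZE[len(feats)],
)
```

With these toy inputs, max_spacket is removed because the rate rises, and the loop then stops because removing avg_spacket leaves the rate unchanged.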
In every iteration, we first rank features according to their importance and then delete the feature with minimum value until detection rate no longer changes. The formula for calculating an RF score of features is shown in (
Depending on the following random forest model, the detection rate is generated using testing data. In this work, we use the features including
In order to achieve scalability in botnet detection module, we use API provided by Weka to implement machine learning algorithms [
Famous public datasets used to detect botnet traffic include dataset disclosed from Information Security and Object Technology (ISOT) organization [
Distribution of botnet types in the training dataset.
Botnet name | Type | Portion of dataset |
---|---|---|
Rbot | IRC, DDoS, US | 0.1% |
Virut | SPAM, PS, HTTP | 0.485% |
Menti | PS | 3.89% |
Sogou | HTTP | 0.035% |
Murlo | PS | 1.64% |
Neris | IRC, SPAM, CF, PS | 31.3% |
Distribution of botnet types in the testing dataset.
Botnet name | Type | Portion of dataset |
---|---|---|
Neris | IRC, SPAM, CF | 3.21% |
Rbot | IRC, PS, US | 2.646% |
Rbot | IRC, DDoS, US | 0.088% |
Virut | SPAM, PS, HTTP | 0.4% |
Menti | PS | 3.33% |
Sogou | HTTP | 0.036% |
Murlo | PS | 1.4% |
Neris | IRC, SPAM, CF, PS | 28.9% |
NSIS.ay | P2P | 1.71% |
Virut | SPAM, PS, HTTP | 1.07% |
In the experiments, we assess our detection method using the training set and test set from CTU13. The CTU13 dataset provides a better test environment for unknown botnets because its test set contains many types of botnet traffic that do not exist in the training set.
The effectiveness of the top five classifiers, namely, random forest, REPTree, RandomTree, BayesNet, and DecisionStump [
The experiment result is shown in Figure
Detection rate of the top five classifiers.
The overall detection rate of DecisionStump is the lowest because it builds only a one-level decision tree. The random forest algorithm selects variables automatically during model formation and establishes the optimal discriminant model; thus, its detection rate is the highest. Random forest also has lower false positive and false negative rates than the other four classifiers. Moreover, there is no obvious difference among the detection results of BayesNet, REPTree, and RandomTree. The botnet traffic detection rate of DecisionStump is 84.4%, whereas the detection accuracy of the other four algorithms is more than 90%. The false positive rate and false negative rate of all five algorithms are under 10% except for DecisionStump.
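For reference, the rates compared above can be computed from a confusion matrix as in the following Python sketch. It assumes the usual definitions, with the detection rate taken as the share of botnet samples that are correctly identified; the function name and the counts are illustrative and are not results from the experiments.

```python
def classification_rates(tp, fp, tn, fn):
    """Compute detection, false positive, and false negative rates.

    tp/fn count botnet samples that were detected/missed;
    fp/tn count normal samples that were flagged/passed.
    """
    return {
        "detection_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
    }

# Illustrative counts only (NOT measurements from the paper).
rates = classification_rates(tp=90, fp=5, tn=95, fn=10)
```

Under these definitions the detection rate and the false negative rate always sum to 1, so a classifier with a detection rate above 90% has a false negative rate below 10%.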
Kirubavathi and Anitha [
Detection effect of flow-based and conversation-based features.
As it can be seen from Figure
In theory, the more classification trees there are, the higher the classification accuracy. However, if the number and depth of the classification trees are extremely high, they adversely affect the classification speed of the classifier. To determine the number and depth of classification trees for the random forest algorithm in this paper, we analyze how adjusting these two parameters influences the classification accuracy. In the experiment, the number of classification trees is set to 10, 50, 100, and 200, and the depth of each classification tree is set to 2, 4, 10, 20, and so forth. The experimental results for different numbers and depths of classification trees are shown in Figure
Detection rate for different number and different depth of classification trees.
When the number of classification trees is 100 and the depth is 10, the detection rate of the random forest algorithm reaches its maximum. Beyond this point, increasing either the number or the depth of the classification trees does not raise the detection rate any further. Thus, the random forest works best in the experiment with 100 classification trees of depth 10.
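The parameter study can be organized as a simple grid sweep, sketched below in Python. The `evaluate` callback is a hypothetical stand-in for training a random forest with the given number and depth of trees and returning its detection rate; the placeholder used here merely saturates at 100 trees and depth 10 and is not a measurement. Among settings that tie for the best rate, the sketch prefers the smallest forest.

```python
from itertools import product

def sweep(evaluate, tree_counts=(10, 50, 100, 200), depths=(2, 4, 10, 20)):
    """Evaluate every (number of trees, depth) pair and return the cheapest best one."""
    results = {(n, d): evaluate(n, d) for n, d in product(tree_counts, depths)}
    best_rate = max(results.values())
    # Among ties, prefer fewer trees first and then a smaller depth.
    best_setting = min(setting for setting, rate in results.items() if rate == best_rate)
    return best_setting, results

def placeholder_rate(n_trees, depth):
    """Synthetic score that stops growing at 100 trees and depth 10 (NOT real data)."""
    return min(n_trees, 100) + min(depth, 10)

best_setting, results = sweep(placeholder_rate)
```

Preferring the smallest setting among ties reflects the trade-off described above, since larger and deeper forests slow classification without improving the detection rate.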
Our framework is implemented in Python and uses Microsoft Network Monitor to capture packets from a network interface or a pcap file. Because the timeout value of TCP/UDP packets is 60 s, we set the time window to 60 s in this paper to extract conversation features. Among the time window settings we experimented with, the 60-second window gave the best accuracy at considerably low computational complexity. In the high-speed network environment, we capture traffic on the 1 Gbps and 10 Gbps networks many times at intervals of 30 s and 60 s, count the number of conversations and flows contained in each interval, and compute the average values. The result is shown in Table
Experimental parameters settings.
Transmit speed (Gbps) | Number of flows (30 s interval) | Number of conversations (30 s interval) | Number of flows (60 s interval) | Number of conversations (60 s interval) |
---|---|---|---|---|
1 | 138825 | 39734 | 203741 | 61380 |
10 | 261630 | 92933 | 452930 | 158722 |
According to Table
In this paper, we propose an efficient botnet traffic detection system that can handle high network bandwidths. Our framework uses PF_RING to solve the high packet drop rate of Libpcap. PF_RING extracts the required fields of traffic with low latency and low overhead. Then, feature selection is conducted to reduce the dimensionality of the data. Conversation features combine the advantages of the existing detection methods based on flow statistical behaviors and flow similarity. We select promising features using the random forest algorithm in order to reduce the feature dimension. The framework then adopts the machine learning algorithm that achieves the best learning performance. The experiments are conducted on an offline public dataset and online real data. The experimental results show that the conversation features used in this paper perform better than flow features on the open CTU13 dataset. Among all the classification algorithms, the detection rate of random forest is the highest, reaching 93.6%. Its false alarm rate is only 0.3%, which is ten times lower than that of detection based on traffic flow characteristics.
The future work will focus on mining association rules according to our proposed conversation features. Moreover, we need to further identify specific botnet categories in order to design corresponding defense plans.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work was supported by the National Natural Science Foundation of China (Grant nos. 61572115, 61502086, and 61402080) and the Key Basic Research of Sichuan Province (Grant no. 2016JY0007).