A Protocol-Independent Botnet Detection Method Using Flow Similarity

The detection of botnets has always been a hot spot in the field of network security. However, many challenges remain. Most current botnet detection approaches, such as machine learning and blacklists, cannot discover evolving botnet variants. These methods are usually valid only for specific botnet protocols and are therefore not general. They may also have difficulty dealing with encrypted botnet traffic. In this paper, we design a protocol-independent botnet detection method to address these challenges. Our detection method takes advantage of the group characteristic of the botnet, which is an inherent characteristic of botnets. We use the sequence of packet length as the characteristic of a flow. Then, we calculate the similarity between these sequences to detect botnets. Our method is general: it is not affected by encrypted traffic or by the protocols of the botnet. Experiments on the challenging ISCX dataset show that the proposed method can effectively detect botnets with a high average detection rate and a low false alarm rate, significantly outperforming state-of-the-art methods. Therefore, the proposed detection method is robust and has a wide range of adaptability in detecting botnets.


Introduction
A botnet is a one-to-many network formed between the controller and the infected hosts. Botnet controllers (attackers) can use many methods to spread bot viruses. Once a host is infected with a bot virus, it becomes a part of the botnet. The infected host receives the attacker's instructions through a command and control (C&C) channel. The infected computers (bots) are silently driven and commanded by the botnet controller to launch cyberattacks. A botnet is equivalent to a platform for attackers to control bots to perform malicious activities. Through botnets, attackers can conduct distributed denial of service (DDoS) attacks, spread spam, perform network blackmail, and steal personal information. Botnets therefore bring great challenges to network security and personal privacy protection. Hosts infected as bots can avoid being discovered by network monitoring agencies in a variety of ways, such as constantly updating themselves, disabling antivirus applications, and preventing DNS from looking up certain domain names. These methods increase the difficulty of botnet detection.
It is well known that botnets pose a severe threat to the Internet. With the development of new technologies, botnet detection is facing increasing challenges. Mirai is a new type of botnet that has emerged in recent years. It is the driving force behind the latest large-scale DDoS attack [1]. Mirai infected more than 100,000 IoT devices to form a huge botnet, and the resulting attack may be the largest DDoS attack in history. It is estimated that Mirai's throughput reached 1.2 Tbps.
The structure of botnets can be summarized into two categories, namely, centralized and decentralized structures. For a centralized botnet, a communication channel is established between the C&C server and all bots. Many botnets are based on the centralized structure, such as AgoBot, SDBot, and RBot [2,3]. The protocols adopted by these botnets are mainly HTTP and Internet Relay Chat (IRC). The flexible and simple structure of the IRC protocol is favored by many hackers. Botnets based on the HTTP protocol are usually concealed and difficult to detect. The decentralized botnet uses P2P-based protocols. When issuing a command, the botmaster randomly selects a bot as the C&C server to communicate with other bots. Since P2P-based botnets effectively avoid the problem of a single point of failure, they greatly enhance the survivability of the botnet [4].
Botnet detection technology has always been a research hotspot in the field of network security. Researchers have proposed a large number of methods to detect botnets [5,6]. These methods can be summarized into five categories, that is, signature-based methods, anomaly-based methods, honeypot-based methods, specific protocol structure-based methods, and community-based methods.
The signature-based methods [7] cannot detect unknown botnets and their variants. Moreover, the encryption technology used in botnets negates the effects of these methods. Anomaly-based detection methods [8] are based on the assumption that the communication pattern of the botnet is different from that of the benign network. However, bots can mimic the communication pattern of normal hosts to evade anomaly detection. Detection methods based on honeypot technology can only detect existing botnets, and they have poor real-time performance. Detection methods based on specific protocols and structures [9] cannot detect botnets with different protocols or structures. Community-based anomaly detection algorithms [10] cannot accurately identify botnets when no complete communication graph is available. Nowadays, attackers are adopting new technologies to constantly update botnets in terms of creation, maintenance, and communication mechanisms. Therefore, existing detection technologies cannot cope with unknown and increasingly complex botnets.
A botnet is defined as a coordinated group of malware instances that are controlled by a botnet master via C&C channels [11]. The bots in the same botnet have the same or similar traffic characteristics. In this paper, we propose a protocol-independent botnet detection framework that identifies botnet traffic by analyzing the similarity of traffic flows. Our method discovers the bots that initiate flows with similar traffic characteristics. More specifically, if the network traffic initiated by a certain host shows a high degree of similarity, it can be concluded from the attributes of the botnet that the traffic is generated by botnet activities. The hosts involved in this traffic are bots in the monitored network. We use the sequence of packet length as the characteristic of the flow. The sequence of packet length is easy to obtain and is very effective for detecting botnets. It is a vector composed of the lengths of all the packets in a flow, with the elements arranged in the order of packet transmission. The degree of similarity between flows determines whether these flows are botnet traffic. In addition, although the length of the ciphertext output by an encryption algorithm may differ from that of the plaintext, the same encryption algorithm outputs ciphertexts of the same length for plaintexts of the same length. Therefore, for the packets in a network flow, the encryption algorithm will not change the relationship between the lengths of these packets. Hence, using the packet length as the characteristic of the flow makes the detection method very robust.
This paper makes the following major contributions:
(i) A protocol-independent botnet detection framework is proposed based on the group characteristics of botnets, which are inherent characteristics of botnets. Our botnet detection framework is not affected by the C&C protocol. It can be applied to detect bots in both centralized and P2P-based botnets. Compared with prior work, the detector proposed in this paper remains reliable and efficient no matter what C&C protocol the botnet adopts.
(ii) The sequence of packet length is proposed as the characteristic of the flow, which is easy to obtain and effective for detecting botnets. Using the packet length as the characteristic of the flow makes the detection method very robust.
(iii) A bot detection prototype system is implemented. The detection effect of the system is evaluated on the ISCX dataset [12]. The results show that the system has a high true positive rate and a low false positive rate.

Related Work
Many researchers have made continuous efforts to detect botnets. BotMiner [11] is a framework to detect groups of compromised machines that are part of a botnet. The framework is independent of the C&C protocol. It identifies bots by clustering similar malicious traffic and communication patterns. The authors implement the BotMiner prototype system and evaluate it using traces of many real-world networks. The results show that BotMiner can detect real-world botnets (such as P2P-based botnets, HTTP-based botnets, and IRC-based botnets) with high accuracy and a low false positive rate [11]. However, BotMiner needs to analyze the content of the traffic payload, so it may fail when the traffic is encrypted.
An adaptive framework for detecting botnets is presented in [13]. It is composed of three components, namely, Behavior Extractor, Behavior Identifier, and Feedback Provider. The Behavior Extractor periodically generates Behavior Instances (BIs) of hosts from network traffic. BIs are representations of host behavior in a time period. To classify malicious BIs, the Behavior Identifier employs a real-time statistical model named Behavioral Model (BM). The Feedback Provider alerts the network administrator when it receives a message that malicious BIs have been found by the Behavior Identifier. At the same time, the Feedback Provider can update the BM based on whether the administrator confirms that the host found by the Behavior Identifier is malicious. When a new bot appears, the framework requires the administrator to confirm whether the bot is genuine. Therefore, the professional level of the administrator may be the bottleneck that affects the detection of new bots.
DBod is a DGA-based botnet detection framework based on analysis of the query behavior of DNS traffic [14]. The research assumes that bots in the same DGA-based botnet query the same sets of domains in the domain list. Since only a very limited number of the domains are actually associated with an active C&C communication [14], most DNS requests sent by bots will fail and generate NXDomains. The main observation behind DBod is that DGA-based bots differ from benign hosts in the distribution of DNS query time and the count of NXDomains. DBod consists of a filtering module, a clustering module, and a group identification module [14]. DBod does not require prior knowledge for training and can detect new bots. However, it will fail when there is no DNS traffic.
Wang et al. [15] propose a two-stage approach for botnet detection. In the first stage, they perform two different kinds of anomaly detection, namely, flow-based anomaly detection and graph-based anomaly detection. In the second stage, they identify the pivotal nodes of the discovered anomalies, evaluate pivotal interaction measures, and construct correlation graphs. Community detection is then used to identify botnets. Their approach is based on two observations: (1) botmasters and victims communicate with many other nodes, which makes them easy to detect; (2) the infected hosts often communicate with each other, resulting in a strong correlation between them.
PsyBoG [16] applies signal processing technology to botnet detection. The authors analyze the time phase and similarity of DNS traffic to identify botnets. PsyBoG uses power spectral density (PSD) analysis, a signal processing technique, to detect the major frequency of periodic botnet behavior. Then, it clusters the hosts based on the similarity of traffic patterns [16]. PsyBoG detects previously unknown botnets based on suspicious DNS behavior.
Zhuang et al. [17] propose an effective system, Enhanced PeerHunter, to detect P2P-based botnets. Enhanced PeerHunter is based on network flow level community behavior analysis. It is capable of detecting P2P botnets when (a) botnets are in their waiting stage; (b) the C&C channel has been encrypted; (c) the botnet traffic overlaps with legitimate P2P traffic on the same host; and (d) no statistical traffic pattern is known in advance (unsupervised). To detect P2P botnets, Enhanced PeerHunter first detects P2P network traffic. Then, it builds a network flow level mutual contacts graph. Finally, it uses community detection to discover P2P-based botnets.
As research on machine learning deepens, it is increasingly applied to the detection of botnets. Carl et al. [18] compare the performance of network classifiers based on different machine learning techniques (such as J48, naive Bayes, and Bayesian) to find the classifier with the highest recognition rate. The result is that a naive Bayes classifier performs best. In addition, they determine experimentally the sensitivity of the classification to the training set size. Accurate labels are critical, however: once the labels of the training data are inaccurate, the performance of the classifier suffers greatly.
Mohammad et al. [19] propose an approach that exploits reinforcement learning to detect infected hosts in a peer-to-peer (P2P) botnet. Specifically, they develop a traffic reduction method to deal with a high volume of network traffic. However, botnets dynamically change their operations through updates after several life cycle stages. Hence, the proposed approach will fail if it is not updated continuously over time.
In addition, Pektas et al. [20] design a framework that combines convolutional and recurrent neural networks to identify botnets. The proposed system extracts network flow features, such as duration, size of packets, and other related flow-based features. However, this method usually has weak generalization ability and cannot effectively identify unknown types of botnet traffic.
Mousavi et al. [21] focus on scalability under high-rate network bandwidth. They propose a fully scalable big data framework based on Hadoop to deploy many different kinds of botnet detection methods, including statistics-based methods, machine learning-based methods, and graph-based methods. The experimental results show that the framework performs well. In addition, the running time of the framework is logarithmically proportional to the volume of the input. Despite these advantages, the framework has a drawback: smaller enterprises cannot afford the computational resources required to install it.
Soodeh et al. [22] propose a method based on convolutional neural networks and negative selection algorithms to detect botnets. They focus on the activity of incoming packets and detect botnet traffic from them. Alharbi and Alsubhi [23] exploit a graph-based machine learning model to detect botnet traffic. They consider the significance of graph features and develop a generalized model for detecting botnets based on features that are selected using five filter-based feature evaluation measures derived from consistency, correlation, and information theory. Biswas and Roy [24] explore a method to detect botnet traffic using deep learning approaches such as Artificial Neural Networks (ANN), Gated Recurrent Units (GRU), and Long Short-Term Memory (LSTM) models. The proposed method is evaluated against both normal attack data and botnet-specific attack data. Javier et al. [25] focus on increasing the performance of botnet traffic classification. They use Information Gain and Gini Importance to select features and evaluate the selected features with three models, that is, Decision Tree, Random Forest, and k-Nearest Neighbor. Wan et al. [26] design a multilayer framework to detect botnet traffic. The detection model consists of a filtering module and a classification module that exploits machine learning algorithms. Their detection model is based on behavior-based analysis. This research examines the features useful for creating a behavior-based analysis method for detecting botnets in network traffic.
The computational complexity of machine learning-based methods is relatively large, which makes them difficult to deploy in realistic settings. Moreover, the generalization ability of models based on machine learning is limited, so they cannot cope with the endless stream of new botnets.
In conclusion, these existing studies have some limitations. Some methods are effective only for botnets that use specific protocols. They cannot detect newly emerging botnets. Moreover, some methods are based on historical data. Once botnet variants appear, these methods will be powerless. The detection method proposed in this paper is based on the group characteristics of the botnet, which are inherent characteristics of the botnet. Our method is independent of the botnet protocol and is not affected by encrypted data.
Security and Communication Networks

Research Objectives.
Our research goal is to find the bots in the monitored network by analyzing the traffic crossing the boundary of the monitored network. We exploit two facts: all botnets have group characteristics, and the relationship between the lengths of packets in a flow is not affected by the encryption algorithm. Specifically, within a certain period of time, the flows generated by bots in the same botnet are similar. We analyze the similarity of network traffic to detect the bots of the botnet. It is commonly known that the communication of most botnets is based on the Transmission Control Protocol (TCP) [2], such as the Waledac botnet [3], the Storm botnet [4], the Conficker botnet [10], and the Zeus botnet [9]. Therefore, our method mainly focuses on TCP flows. The process of normal hosts evolving into bots can be divided into three stages. In the first stage, hosts are infected by botnet malware. In the second stage, hosts receive the command from the botmaster and join the botnet. Finally, hosts initiate a network attack at an appropriate time. The host shows malicious abnormal behavior in the second and third stages. Therefore, the detection method proposed in this paper works during the second and third stages to detect bots. It cannot recognize hosts that have just been infected by the malware. In this paper, we do not pay attention to how the host is infected or how the botnet malware is spread. Our research goal is to detect the bots that generate malicious TCP flows in the monitored network.
Our research objectives are as follows: (i) The bot detection framework is independent of the protocol and structure adopted by botnet channels. Its detection performance is not affected by the botnet protocol and structure. (ii) The bot detection framework does not need to analyze the content of the traffic payload. Hence, it is not affected by encrypted traffic and will not violate the privacy of network users. (iii) The bot detection framework can effectively detect botnet traffic and identify bots with a high detection rate and a low false positive rate. (iv) The bot detection framework must have low complexity. It must not consume excessive computing resources or time.

Bot Detection Framework.
As shown in Figure 1, the bot detection framework includes five modules, that is, network traffic acquirer, preprocessing module, attack flow recognizer, infection flow recognizer, and result integration module.
Formally, we define $F_i = (sip_i, dip_i, sport_i, dport_i, pktlenseq_i)$ to denote the TCP flow with the sequence of packet length of host $h_i$, where $sip_i$ is the source IP address, $sport_i$ is the source port number, $dip_i$ is the destination IP address, and $dport_i$ is the destination port number. $pktlenseq_i$ is the sequence of packet length, which is a vector composed of the lengths of all the packets in a flow, as defined in (1). The elements of the sequence are arranged in the order of packet transmission. The degree of similarity between flows determines whether the monitored traffic is botnet traffic. Although the length of the ciphertext output by an encryption algorithm may differ from that of the plaintext, the same encryption algorithm outputs ciphertexts of the same length for plaintexts of the same length. Therefore, for the packets in a network flow, the encryption algorithm will not change the relationship between the lengths of these packets, and our method based on the packet length for detecting botnets is robust.
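As a minimal sketch of the flow tuple above, the following Python snippet stores one flow and builds its sequence of packet length. The names `Flow`, `build_flow`, and `padded_len`, and the `(timestamp, length)` input format, are illustrative assumptions and do not come from the paper. The `padded_len` helper illustrates the length argument for one assumed case only, a block cipher with PKCS#7-style padding and a 16-byte block: equal plaintext lengths give equal ciphertext lengths, and a longer plaintext never yields a shorter ciphertext, although nearby lengths can collapse to the same padded length.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Flow:
    """One TCP flow F_i = (sip, dip, sport, dport, pktlenseq)."""
    sip: str
    dip: str
    sport: int
    dport: int
    pktlenseq: Tuple[int, ...]  # packet lengths in transmission order


def build_flow(sip: str, dip: str, sport: int, dport: int,
               packets: List[Tuple[float, int]]) -> Flow:
    """Build a Flow from (timestamp, length) pairs.

    Assumption: the timestamp order stands in for the order of transmission.
    """
    ordered = sorted(packets, key=lambda p: p[0])
    return Flow(sip, dip, sport, dport, tuple(length for _, length in ordered))


def padded_len(plain_len: int, block: int = 16) -> int:
    """Ciphertext length under PKCS#7-style padding (assumed cipher model)."""
    # PKCS#7 always adds between 1 and `block` bytes of padding.
    return (plain_len // block + 1) * block


flow = build_flow("10.0.0.5", "203.0.113.9", 40000, 80,
                  [(0.2, 60), (0.1, 40), (0.3, 1500)])
encrypted_lengths = tuple(padded_len(n) for n in flow.pktlenseq)
```

Under this padding model, the ordering of the packet lengths in a flow is unchanged after encryption, which is the property the similarity analysis relies on.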
Suppose there are two flows $F_i$ and $F_j$. $R(F_i, F_j)$ refers to the communication relationship between $F_i$ and $F_j$, as defined in (2). The communication relationship indicates whether the two flows have the same mapping of source IP address or destination IP address. If there is a commu- [...]
In (2), $f(ip)$ is the mapping function of the IP address. The simplest mapping is self-mapping; that is, $f(ip) = ip$. There are also other mappings, such as $f(ip) = \text{domain name}$, that is, the mapping relationship between IP addresses and DNS domain names. If $f(ip_i) = f(ip_j)$, then $ip_i$ and $ip_j$ are the same in the "mapping sense." Specifically, when $f(ip) = \text{domain name}$, we know that $ip_i$ and $ip_j$ have the same domain name and belong to the same host. In this paper, we use the self-mapping, namely, $f(ip) = ip$. $FC$ denotes the set of flows that have communication relationships with each other, as defined in (3).
The network traffic acquirer can be deployed not only inside the monitored network, to analyze the traffic in the internal network, but also at the boundary of the monitored network. When the network traffic acquirer is deployed at the boundary, it is responsible for capturing the traffic entering and leaving the monitored network. In this case, the captured traffic is between the internal network and the external network and does not include pure internal network traffic. The packet lengths are obtained by parsing the IP header of each packet. Then, they are assembled into the sequence of packet length in ascending order of the TCP sequence number for flow similarity analysis.
The preprocessing module is composed of three modules, namely, IP Partition, Port Partition, and Flow First Time Filter. Since the bot detection framework is based on the similarity of flows, we are only interested in flows that have communication relationships with each other. Therefore, we must first know which hosts are involved in these flows. The IP Partition module divides the traffic according to whether the collected flows have a communication relationship (as in (3)). It mainly answers the question of "which hosts communicate with each other." Moreover, the services used for communication between these hosts are also very important. We can determine the services through the TCP port number. The Port Partition module aggregates flows on the same source port number or the same destination port number. It mainly answers the question of "what communications do the hosts carry out." According to the distribution of port numbers, we divide the flows into two categories, namely, attack flows and infection flows. The attack flows refer to the traffic generated by bots when they launch a network attack. The infection flows refer to the traffic generated by bots when they are in the propagation phase.
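The communication relationship and the flow set FC can be illustrated with a short Python sketch. The exact forms of (2) and (3) are not reproduced in this text, so the predicate below is an assumption that follows only the prose description: two flows are related when their source addresses, or their destination addresses, agree under the mapping f. The names `related`, `fc_by_source`, `fc_by_destination`, and `self_mapping` are hypothetical.

```python
from typing import Callable, List, NamedTuple


class Flow(NamedTuple):
    sip: str
    dip: str
    sport: int
    dport: int


def self_mapping(ip: str) -> str:
    """The self-mapping f(ip) = ip used in the paper."""
    return ip


def related(fi: Flow, fj: Flow, f: Callable[[str], str] = self_mapping) -> bool:
    """Assumed form of R(F_i, F_j): same mapped source or destination IP."""
    return f(fi.sip) == f(fj.sip) or f(fi.dip) == f(fj.dip)


def fc_by_source(flows: List[Flow], ip: str,
                 f: Callable[[str], str] = self_mapping) -> List[Flow]:
    """Flows whose mapped source address equals f(ip); all are mutually related."""
    key = f(ip)
    return [fl for fl in flows if f(fl.sip) == key]


def fc_by_destination(flows: List[Flow], ip: str,
                      f: Callable[[str], str] = self_mapping) -> List[Flow]:
    """Flows whose mapped destination address equals f(ip)."""
    key = f(ip)
    return [fl for fl in flows if f(fl.dip) == key]


flows = [
    Flow("10.0.0.5", "203.0.113.9", 40001, 80),
    Flow("10.0.0.5", "198.51.100.7", 40002, 80),
    Flow("10.0.0.6", "203.0.113.9", 40003, 80),
]

# A domain-name mapping (hypothetical table) makes two addresses equivalent.
names = {"203.0.113.9": "cc.example", "198.51.100.7": "cc.example"}


def domain_mapping(ip: str) -> str:
    return names.get(ip, ip)
```

With the self-mapping, the second and third flows are unrelated; with the domain-name mapping they become related because both destinations map to the same name.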
We analyze the attack flows and the infection flows from two perspectives, that is, the bot and the vulnerable victim. When a botnet launches a network attack, the vulnerable victims are the attack targets, and one victim may be the target of multiple attacks at the same time. When receiving the attack instruction, the bot uses the maximum resources to launch an attack on the target, for example, by generating DDoS traffic. Therefore, when observing the attack flows from the perspective of the bot, the distribution of the port numbers presents a many-to-one situation; that is, multiple ports of the bot actively establish TCP connections with the same port of the vulnerable victims. When observing the attack flows from the perspective of the vulnerable victim, the distribution of port numbers presents a one-to-many situation: the ports of vulnerable victims are passively connected with multiple different hosts.
The infection flows are traffic generated by bots during malware propagation or vulnerability scanning. Meanwhile, some of this traffic carries the commands conveyed by the botmaster to bots. Hence, when observing the infection flows from the perspective of the bot, a unique TCP port of the bot actively establishes connections with multiple hosts. From the perspective of the vulnerable victim, a unique TCP port passively communicates with the same port of multiple hosts. For each TCP flow, the first packet time of the flow in both directions (upstream and downstream) determines which side actively established the flow and which side passively accepted it.
Based on all the above observations, the TCP flows within a certain period of time are grouped according to IP address and port number to form TCP flow blocks. Then, the sequences of packet length of the flows in the blocks are obtained. Afterward, the attack flow recognizer calculates the similarity of the packet length sequences of these flows from the perspective of the bot. Meanwhile, the infection flow recognizer calculates the similarity of the packet length sequences of these flows from the perspective of the vulnerable victim. Finally, the result integration module summarizes the recognition results of the recognizers and obtains a collection of malicious TCP flows. The following sections detail the implementation of each part of the detection framework.
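The recognizers compare packet length sequences within a flow block, but the similarity measure itself is not specified in this part of the paper. The sketch below therefore swaps in a stand-in, cosine similarity over zero-padded sequences, purely to show how a block of flows could be scored. The function names and the choice of measure are assumptions, not the authors' method.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Cosine similarity of two length sequences, zero-padded to equal size."""
    n = max(len(a), len(b))
    va = list(a) + [0] * (n - len(a))
    vb = list(b) + [0] * (n - len(b))
    norm_a = math.sqrt(sum(x * x for x in va))
    norm_b = math.sqrt(sum(y * y for y in vb))
    if norm_a == 0 and norm_b == 0:
        return 1.0  # two all-zero sequences are treated as identical
    if norm_a == 0 or norm_b == 0:
        return 0.0
    dot = sum(x * y for x, y in zip(va, vb))
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(seqs: List[Sequence[int]]) -> float:
    """Average similarity over all pairs of sequences in one flow block."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Three flows with identical length patterns, as bots in one botnet might show.
bot_block = [[74, 66, 512, 66], [74, 66, 512, 66], [74, 66, 512, 66]]
# Flows with unrelated patterns.
mixed_block = [[74, 0, 0, 0], [0, 1500, 0, 0], [0, 0, 40, 0]]
```

A block would then be flagged when its score exceeds a threshold; how that threshold is chosen is left open here.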

Network Traffic Acquirer.
We have developed an effective network traffic capture module, namely, the network traffic acquirer. In this paper, we limit our interest to TCP flows. Each flow contains the following information: source IP, destination IP, source port, destination port, timestamp, and length of packets in two directions. Our research is based on the fact that flows are submitted as a collection to subsequent modules for analysis to detect malicious flows. WinSize is the minimum number of flows needed to detect a botnet through traffic analysis.
In addition, flow truncation is performed to reduce the computational cost. We empirically use the first 16 packets of the TCP flow rather than the whole TCP flow. If the number of packets in a TCP flow is greater than 16, the TCP flow is truncated. The truncation algorithm is shown in Algorithm 1. The input parameter of Algorithm 1, packet_len_seq, is the sequence of packet length, which is composed of the TCP payload lengths. Two thresholds are set for TCP flow truncation, that is, len_limit_1 and len_limit_2 (len_limit_1 > len_limit_2). They correspond to two situations. The first situation is that the TCP payload lengths of all packets in the entire TCP flow are zero. In this case, we truncate the TCP flow according to len_limit_1 and obtain a sequence of length len_limit_1 in which all elements are 0. The second situation is that the number of packets with payload in the TCP flow exceeds len_limit_2. In this case, the flow is truncated at the position P(len_limit_2), where P(x) is a function that returns the position (index) of the x-th packet with payload in the flow. Hence, P(len_limit_2) returns the index of the len_limit_2-th packet with payload in the flow. len_limit_1 has a higher priority than len_limit_2. Therefore, if P(len_limit_2) > len_limit_1, truncation is performed according to len_limit_1. If the number of packets in a complete TCP flow does not exceed P(len_limit_2) and len_limit_1, all packets are retained.
In addition, the TCP flags are used to determine the beginning and end of the flow. The SYN flag indicates that a new TCP flow has started. If there is no SYN packet in a TCP flow, the flow is considered incomplete. In this paper, incomplete flows are discarded directly. The FIN flag and RST flag indicate the end of a TCP flow.
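The truncation rule described above can be sketched in Python as follows. This is one reading of the prose, not a transcription of Algorithm 1: positions are counted from 1, the default `len_limit_1 = 16` assumes that the "first 16 packets" rule corresponds to len_limit_1, and the default `len_limit_2 = 8` is a purely hypothetical value because the text does not state it.

```python
from typing import List, Sequence


def truncate_flow(packet_len_seq: Sequence[int],
                  len_limit_1: int = 16,
                  len_limit_2: int = 8) -> List[int]:
    """Truncate a sequence of TCP payload lengths.

    len_limit_1 caps the total number of packets kept.
    len_limit_2 caps the number of payload-carrying packets kept.
    """
    if len_limit_1 <= len_limit_2:
        raise ValueError("len_limit_1 must be greater than len_limit_2")

    # Never keep more than len_limit_1 packets (also covers the all-zero case).
    cut = min(len(packet_len_seq), len_limit_1)

    # 1-based positions of the packets that carry payload.
    payload_positions = [pos for pos, length
                         in enumerate(packet_len_seq, start=1) if length > 0]

    # If more than len_limit_2 packets carry payload, cut at P(len_limit_2),
    # unless len_limit_1 is reached first (it has the higher priority).
    if len(payload_positions) > len_limit_2:
        cut = min(cut, payload_positions[len_limit_2 - 1])

    return list(packet_len_seq[:cut])
```

For example, with `len_limit_1 = 5` and `len_limit_2 = 3`, the sequence `[0, 10, 0, 20, 30, 40, 50]` is cut after its third payload packet, while `[0, 0, 0, 0, 10, 20, 30, 40]` is cut at five packets because len_limit_1 is reached first.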

Flow Preprocessing.
The flow preprocessing module is responsible for preliminarily segmenting the collected flows in a window according to the IP addresses and TCP port numbers. In this way, it can determine which flows have communication relationships (as in (2)) and which flows use the same service. The flow preprocessing module consists of three parts, namely, IP Partition, Port Partition, and Flow First Time Filter.

IP Partition.
The IP addresses of the flows captured by the network traffic acquirer are regarded as nodes. If there is a TCP flow between two IP addresses, an edge connects the nodes corresponding to the two IP addresses. In this way, an undirected graph G is constructed to represent the connection relationship between hosts, as shown in the left subgraph of Figure 2. The undirected graph G can be represented algebraically by the adjacency matrix. Firstly, the source IP addresses and destination IP addresses of all the flows are extracted. Then, the duplicate IP addresses are removed. Finally, we construct the adjacency matrix M corresponding to the undirected graph G according to whether there are TCP flows between these IP addresses. The adjacency matrix M is a square matrix whose size is the number of unique IP addresses in the TCP flow collection. If there are TCP flows between $ip_i$ and $ip_j$, the elements at the positions (i, j) and (j, i) in the adjacency matrix M are set to 1. Otherwise, the elements are set to 0. Therefore, the adjacency matrix is symmetric about the main diagonal.
Algorithm 1: TCP flow truncation. Require: packet_len_seq. Ensure: trunc_packetlen_seq. Param: len_limit_1, len_limit_2. if len(packet_len_seq) > len_limit_1 [...]
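The adjacency-matrix construction for IP Partition can be sketched with plain Python lists. The function name `build_adjacency` and the use of sorted order to number the addresses are illustrative assumptions; the paper only requires that each unique address receive one row and column.

```python
from typing import Iterable, List, Tuple


def build_adjacency(pairs: Iterable[Tuple[str, str]]) -> Tuple[List[str], List[List[int]]]:
    """Build the symmetric adjacency matrix M from (source IP, destination IP) pairs."""
    pairs = list(pairs)
    # Extract all addresses and remove duplicates; sorting fixes the numbering.
    ips = sorted({ip for pair in pairs for ip in pair})
    index = {ip: k for k, ip in enumerate(ips)}

    m = [[0] * len(ips) for _ in ips]
    for sip, dip in pairs:
        i, j = index[sip], index[dip]
        m[i][j] = 1  # a TCP flow exists between ip_i and ip_j
        m[j][i] = 1  # undirected graph, so M is symmetric
    return ips, m


ips, m = build_adjacency([
    ("10.0.0.2", "10.0.0.1"),
    ("10.0.0.1", "10.0.0.3"),
    ("10.0.0.8", "10.0.0.9"),
])
```

In this example the first three addresses form one connected group and the last two form another, which is the situation IP Partition is meant to separate.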
In a flow collection, local connections form subgraph structures, each of which represents a block of nodes. For example, the left of Figure 2 contains two subgraphs. Each subgraph in G needs to be analyzed separately. IP Partition divides the hosts into blocks according to the connection relationship. The schematic diagram of IP Partition is shown in Figure 2. In Figure 2, there is no connection relationship between the hosts in $Block_1$ and the hosts in $Block_2$. To extract the blocks from G, two steps must be performed. Firstly, the boundary nodes of the block need to be located. The boundary nodes only have adjacent edges to nodes in the block where they are located. Then, all nodes in the block can be obtained by walking through the graph from different boundary nodes. If there is an edge connecting two vertices, the walk can move from one vertex to the other. The algorithm for finding the nodes of the same block is shown in Algorithm 2.
In Algorithm 2, M is the adjacency matrix, and v is a vertex of M. $Block_v$ denotes the block to which v belongs. $v_{all}$ represents the set of vertices in M.
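The listing of Algorithm 2 is not reproduced in this text, so the following Python sketch only illustrates the idea it is described as implementing: starting from a vertex v, walk along the edges recorded in M and collect every vertex that can be reached. Breadth-first traversal and the name `find_block` are choices made for this illustration.

```python
from collections import deque
from typing import List, Set


def find_block(m: List[List[int]], v: int) -> Set[int]:
    """Return the set of vertices in the same block as vertex v."""
    block = {v}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        # An edge (u, w) lets the walk move from u to w.
        for w, connected in enumerate(m[u]):
            if connected and w not in block:
                block.add(w)
                queue.append(w)
    return block


# Adjacency matrix with edges 0-1, 0-2 and 3-4 (two separate blocks).
m = [
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
```

Starting the walk from any vertex of a block returns the same set, so only one start vertex per block is needed.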
To find the "boundary" nodes, the undirected graph G needs to be transformed into a directed graph D through orientation. Due to the bidirectional nature of TCP flows, we use an arbitrary orientation in this paper. Firstly, the nodes in the undirected graph G are assigned consecutive numbers. Assuming that there are n nodes in graph G, the nodes are numbered from 1 to n. Then, every edge in graph G is directed from the node with the smaller number to the node with the larger number. In this way, the undirected graph G is converted into the directed graph D, that is, $G \to D$.
Let $M_G$ and $M_D$ denote the adjacency matrices of the undirected graph G and the directed graph D, respectively. According to the orientation process, it can be concluded that $M_D = \mathrm{UpTriu}(M_G)$, where $\mathrm{UpTriu}(M_G)$ is the upper triangular matrix of $M_G$. The in-degree and out-degree of each vertex can be calculated from $M_D$. Any two adjacent vertices in the directed graph D are joined by exactly one directed edge. Therefore, if the in-degree or the out-degree of a vertex is zero, the vertex is located on the "boundary" of the subgraph. Formally, $V^-$ refers to the set of vertices whose in-degrees are 0, and $V^+$ refers to the set of vertices whose out-degrees are 0. Both $V^-$ and $V^+$ can be used to determine the boundary vertices. In this paper, the vertices in $V^-$ are used to find boundary vertices. As shown in the subfigure on the right side of Figure 2, $V^-$ contains two vertices, namely, $v_1$ and $v_7$. Starting from the vertices of $V^-$ and walking through the undirected graph G, the subgraphs (blocks) are obtained. The algorithm for dividing all nodes into different blocks according to the connection relationship is shown in Algorithm 3. In Algorithm 3, Sections is the set of all blocks.
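The orientation step and the block partition can be sketched together in Python. The listing of Algorithm 3 is not reproduced in this text, so this is an illustration of the description only. The names `up_triu`, `zero_in_degree`, and `partition_blocks` are hypothetical, vertices are numbered from 0 rather than 1, and the upper triangle is taken without the main diagonal.

```python
from typing import List


def up_triu(m_g: List[List[int]]) -> List[List[int]]:
    """Orient every edge from the smaller to the larger vertex number."""
    n = len(m_g)
    return [[m_g[i][j] if j > i else 0 for j in range(n)] for i in range(n)]


def zero_in_degree(m_d: List[List[int]]) -> List[int]:
    """V^-: vertices with in-degree 0, i.e. an all-zero column in M_D."""
    n = len(m_d)
    return [j for j in range(n) if all(m_d[i][j] == 0 for i in range(n))]


def partition_blocks(m_g: List[List[int]]) -> List[List[int]]:
    """Walk the undirected graph from each vertex of V^- to collect the blocks."""
    sections: List[List[int]] = []
    assigned = set()
    for start in zero_in_degree(up_triu(m_g)):
        if start in assigned:
            continue  # this boundary vertex belongs to a block already found
        block, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for w, connected in enumerate(m_g[u]):
                if connected and w not in block:
                    block.add(w)
                    stack.append(w)
        assigned |= block
        sections.append(sorted(block))
    return sections


# Edges 0-1, 0-2 and 3-4: two blocks.
m_two_blocks = [
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
# Edges 0-2 and 1-2: one block with two zero in-degree vertices.
m_one_block = [
    [0, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
]
```

The lowest-numbered vertex of every block has in-degree 0 after orientation, so walking from the vertices of $V^-$ reaches every block; a block may contain several such vertices, which is why already assigned start vertices are skipped.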

Port Partition.
In the above sections, we have divided different nodes into different blocks according to the communication relationship between nodes. In this section, the Port Partition module aggregates TCP flows between hosts in the same block.
Given an IP address ip_i in a block, according to (3), we can get the set FC of TCP flows that have communication relationships with ip_i. As introduced in Section 3.2, the Port Partition module first divides FC into attack flows and infection flows and then analyzes them from two perspectives: that of the bot and that of the vulnerable victim. The attack flows have the following two characteristics when they are observed from the perspective of the bot: (i) the destination port numbers of all attack flows are the same, and (ii) the initiator of the TCP flows is the bot. In addition, the attack flows have different characteristics when observed from the perspective of the vulnerable victim: (i) the source port numbers are the same, and (ii) the initiator of the TCP flows is the bot. As shown in Figures 3 and 4, the direction of the arrow is from the initiator of the TCP stream to the receiver. Figure 3 shows the attack flows from the perspective of the bot. The IP shared by these TCP flows is the IP of the bot. Moreover, the attack flows from the perspective of the vulnerable victim are shown in Figure 4. The hosts at the noncentral locations of these TCP streams are bots. There is one bot shown in Figure 3, and there are three bots shown in Figure 4.

Security and Communication Networks
However, the features of infection flows are different. When observing them from the perspective of the bot, there are the following two characteristics: (i) the infection flows have the same source port number, and (ii) the initiator of the TCP flows is the bot. When observing the infection flows from the perspective of the vulnerable victim, there are the following two characteristics: (i) the infection flows have the same destination port number, and (ii) the initiator of the TCP flows is the bot. From either point of view, the initiators of the TCP flows are always the bots, as shown in Figures 5 and 6.
Based on the above analysis, we obtain the following port division scheme. First, the directions of the TCP flows in the set FC are adjusted so that ip_i is on the source side. Then, these flows are clustered according to the following four strategies: (i) flows with the same destination port number whose initiator is ip_i, (ii) flows with the same source port number whose initiator is not ip_i, (iii) flows with the same source port number whose initiator is ip_i, and (iv) flows with the same destination port number whose initiator is not ip_i. The attack flows are aggregated based on (i) and (ii). The infection flows are aggregated based on (iii) and (iv). The initiator of each flow is determined by the Flow First Time Filter.
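The four clustering strategies can be sketched as follows. This is a hedged illustration rather than the paper's implementation: the `Flow` record, its `initiator` field, the cluster keys, and the example addresses are all hypothetical, and the sketch assumes that every flow is placed into one attack cluster and one infection cluster so that both recognizers can examine it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    initiator: str  # IP that opened the connection (hypothetical field)


def orient(flow, ip_i):
    """Swap the endpoints if needed so that ip_i is on the source side."""
    if flow.src_ip == ip_i:
        return flow
    return Flow(flow.dst_ip, flow.dst_port, flow.src_ip, flow.src_port,
                flow.initiator)


def port_partition(flows, ip_i):
    """Cluster the flows of ip_i with strategies (i)-(iv)."""
    attack = defaultdict(list)     # strategies (i) and (ii)
    infection = defaultdict(list)  # strategies (iii) and (iv)
    for flow in (orient(f, ip_i) for f in flows):
        if flow.initiator == ip_i:
            attack[("i", flow.dst_port)].append(flow)
            infection[("iii", flow.src_port)].append(flow)
        else:
            attack[("ii", flow.src_port)].append(flow)
            infection[("iv", flow.dst_port)].append(flow)
    return dict(attack), dict(infection)


# Hypothetical host 10.0.0.5 opens connections to port 445 on three peers.
# The third flow was recorded in the reverse direction and gets re-oriented.
flows = [
    Flow("10.0.0.5", 40001, "10.0.0.7", 445, "10.0.0.5"),
    Flow("10.0.0.5", 40002, "10.0.0.8", 445, "10.0.0.5"),
    Flow("10.0.0.9", 445, "10.0.0.5", 40003, "10.0.0.5"),
]
attack, infection = port_partition(flows, "10.0.0.5")
print(len(attack[("i", 445)]))  # 3
```

In this example, all three flows land in the same strategy (i) cluster because they share the destination port, whereas each falls into its own strategy (iii) cluster because the source ports differ.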

Malicious Flow Recognition.
Because of the group nature of botnets, the traffic generated by different bots is highly similar once the botnet is active. In addition, some botnets use encryption algorithms to avoid detection. However, the relationship between the lengths of the packets in a flow is not affected by the encryption algorithm. In this paper, we focus on the method of calculating the similarity of TCP flows. The sequence of packet lengths is adopted to evaluate the similarity of the flows. The method we adopt to calculate the similarity of the packet length sequences of TCP flows is the Levenshtein algorithm [27]. If the two sequences are completely the same, the similarity is 1. If the two sequences are completely different, the similarity is 0.
The Levenshtein algorithm is mainly used to calculate the distance between two strings, which is the minimum number of editing operations required to convert one string into the other. The editing operations allowed during the conversion process are (i) replacing one character with another character, (ii) inserting a character, and (iii) deleting a character.
Given two strings a and b, the Levenshtein algorithm can be formally defined as (5) to calculate the similarity between strings a and b. In (5), i (i > 0) represents the i-th position of string a, and j (j > 0) represents the j-th position of string b. When i = 0 or j = 0, the distance between strings a and b is zero. Let SimRatio denote the similarity of the packet length sequences.
Then, SimRatio(a, b) equals lev_{a,b}(len(a), len(b)), where len(a) is the length of string a.
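To make the computation concrete, the following Python sketch implements the textbook edit distance over arbitrary sequences and turns it into a similarity in [0, 1]. Two points are assumptions and may differ from (5): the sketch uses the textbook boundary convention, in which the distance between a prefix and an empty prefix is the length of the nonempty prefix, and it normalizes as 1 − d / max(len(a), len(b)), chosen so that identical sequences score 1 and equal-length sequences with no common element score 0. The names `levenshtein_distance` and `sim_ratio` are hypothetical.

```python
def levenshtein_distance(a, b):
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def sim_ratio(a, b):
    """Similarity in [0, 1]; the normalization is an assumption of this sketch."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


# Packet length sequences of two hypothetical flows that differ in one packet.
flow_a = [60, 1500, 1500, 52]
flow_b = [60, 1500, 1400, 52]
print(levenshtein_distance("kitten", "sitting"))  # 3
print(sim_ratio(flow_a, flow_b))  # 0.75
```

Because the function only compares elements for equality, it works on lists of packet lengths as well as on strings.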
When calculating the similarity of packet length sequences, the sequences are regarded as strings. Hence, the Levenshtein algorithm is applicable. In practice, some optimizations are introduced to improve the performance of the algorithm. First, the elements of the two sequences are assumed to be the same, and only the lengths of the two sequences are compared. The similarity of two sequences in terms of length, Sim_len, is defined as (6). Given the similarity threshold Sim_thre, if Sim_len is less than Sim_thre, the two sequences can be directly recognized as different, and the Levenshtein algorithm is no longer required. Computing Sim_len is much cheaper than running the Levenshtein algorithm. Therefore, the computational cost of calculating the similarity of sequences is reduced.
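The length check can be sketched as below. The exact form of (6) is not reproduced here, so the ratio of the shorter length to the longer length is an assumption, as are the function names. Under a similarity normalized by the longer sequence, this ratio is an upper bound on the full similarity, because at least as many insertions or deletions as the length difference are always needed.

```python
def length_similarity(a, b):
    """Assumed form of Sim_len: shorter length divided by longer length."""
    shorter, longer = sorted((len(a), len(b)))
    return 1.0 if longer == 0 else shorter / longer


def worth_comparing(a, b, sim_thre):
    """Skip the full edit-distance computation when the lengths alone rule it out."""
    return length_similarity(a, b) >= sim_thre


nine_packets = [60] * 9
ten_packets = [60] * 10
print(length_similarity(nine_packets, ten_packets))      # 0.9
print(worth_comparing(nine_packets, ten_packets, 0.99))  # False
```

With Sim_thre = 0.99, two flows of 9 and 10 packets are rejected after a constant-time check, and the quadratic-time comparison is never run.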
The overall process of bot detection is shown in Figure 7. In Figure 7, the dots are used to represent the hosts. The gray dots represent the hosts in the internal network, and the black dots represent the hosts in the external network. The black solid lines indicate that there are TCP flows between the hosts. The dotted lines represent TCP flows. Figure 7 shows an IP Block with 10 hosts, namely, n_1, n_2, ..., n_10. We analyze the hosts in the IP Block in turn. The process of analyzing the host n_1 is shown in the dashed box. First, the TCP flows that have a communication relationship with the host n_1 are collected. These TCP flows are denoted as FC = {f_1, f_2, f_3, ..., f_12}. Then, the flows in FC are divided by the Port Partition algorithm to generate some TCP flow blocks. Finally, the Levenshtein algorithm is used to calculate the similarity of the flows in each flow block. The flows with high similarity (exceeding the threshold Sim_thre) are regarded as malicious flows, and the bots are identified according to the strategy adopted in the Port Partition algorithm. In Figure 7, f_1, f_4, f_10, and f_12 are finally detected as malicious flows. Therefore, the host n_1 can be identified as a malicious bot.
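The final step, flagging flows inside one flow block, can be sketched as a pairwise comparison. This is an illustration under stated assumptions: the function name and flow identifiers are hypothetical, the similarity measure is passed in as a parameter so that the sketch does not depend on a particular Levenshtein implementation, and a pair whose similarity equals the threshold is treated as similar. The demo uses an exact-match stand-in instead of the Levenshtein similarity.

```python
from itertools import combinations


def flag_similar_flows(flow_block, similarity, sim_thre):
    """Return the ids of flows that resemble at least one other flow in the block.

    `flow_block` maps a flow id to its packet length sequence;
    `similarity` is any function returning a value in [0, 1].
    """
    flagged = set()
    for (id_a, seq_a), (id_b, seq_b) in combinations(flow_block.items(), 2):
        if similarity(seq_a, seq_b) >= sim_thre:
            flagged.update((id_a, id_b))
    return flagged


def exact_match(a, b):
    """Stand-in similarity used only for this demo."""
    return 1.0 if a == b else 0.0


# Hypothetical flow block: two flows share a packet length sequence.
block = {
    "flow_a": (60, 52, 1500),
    "flow_b": (60, 52, 1500),
    "flow_c": (60, 40),
}
print(sorted(flag_similar_flows(block, exact_match, 0.99)))  # ['flow_a', 'flow_b']
```

A block that contains a single flow yields no pairs, so nothing is flagged, which matches the requirement of at least two related flows for a conclusion.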

Experimental Analysis
The detection performance of the proposed bot detection method is evaluated in this paper. The detection performance is mainly evaluated from three aspects: (i) the
Figures 8 and 9 show the distribution of the number of TCP flows in the ISCX dataset. Figure 8 shows the traffic distribution in the training dataset, and Figure 9 shows that of the testing dataset.

Experimental Evaluation.
Since the detection method we designed does not require a training process, the training dataset and the testing dataset are treated in the same way, and we conducted experimental evaluations on both datasets. The experimental results are compared with [13] from two aspects, the bot detection rate (TPR) and the false alarm rate (FPR), defined as

$$\mathrm{TPR} = \frac{\text{number of detected bots}}{\text{total number of bots}}, \qquad \mathrm{FPR} = \frac{\text{false alerts}}{\text{total number of benign hosts}}.$$
TPR measures whether our method can effectively detect bots. FPR measures the side effect of our method, namely, that benign hosts are incorrectly identified as bots. A higher TPR and a lower FPR indicate better performance.
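As a small worked example of the two metrics, the sketch below computes them from sets of host identifiers. The function name and the host labels are hypothetical and do not come from the ISCX dataset.

```python
def detection_rates(detected, bots, benign):
    """Return (TPR, FPR) for a set of detected hosts."""
    detected, bots, benign = set(detected), set(bots), set(benign)
    tpr = len(detected & bots) / len(bots)      # detected bots / all bots
    fpr = len(detected & benign) / len(benign)  # false alerts / all benign hosts
    return tpr, fpr


# Hypothetical network: 4 bots, 10 benign hosts, 3 bots and 1 benign host flagged.
bots = {"b1", "b2", "b3", "b4"}
benign = {f"h{k}" for k in range(10)}
detected = {"b1", "b2", "b3", "h0"}
print(detection_rates(detected, bots, benign))  # (0.75, 0.1)
```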
There are two parameters that need to be set. One is the size of the flow window, WinSize. The other is the similarity threshold Sim_thre. First, we set the similarity threshold Sim_thre = 0.99 to observe the influence of different flow window sizes on the recognition results. WinSize is set to 10, 50, 100, and 200 in turn, and the recognition results are recorded for comparison, as shown in Table 1. When WinSize increases, the proposed method can examine traffic over a larger range, which helps to improve the detection rate. However, the number of benign flows in each window also increases with WinSize, so the probability of detection errors increases slightly, and FPR rises to a certain extent. The experimental results show that a high detection rate can be achieved without setting an excessively large window size: as the window size continues to increase, TPR quickly reaches a stable optimum, while FPR increases slightly. In addition, there are 28 bot hosts (or host pairs) in the ISCX-Testing dataset. Our method successfully detects 27 of them (27/28 ≈ 0.9643); only the Osx trojan (IP: 172.29.0.109) is not detected. The reason is that there is only one Osx trojan TCP flow in the ISCX-Testing dataset, as shown in Figure 9. Our detection method requires at least two related TCP flows to reach a conclusion. Hence, the detection of the Osx trojan failed.
In addition, we set the flow window size WinSize = 10 to observe the influence of different thresholds Sim_thre on the recognition results. Sim_thre is set to 0.7, 0.8, 0.9, and 0.99 in turn, and the recognition results are recorded for comparison. The results are shown in Table 2.
The experimental results show that the optimal effect is already achieved in the training dataset when Sim_thre is set to 0.7; as Sim_thre increases further, the detection rate and the false alarm rate do not change. In the testing dataset, the detection rate does not change as Sim_thre increases, but the false alarm rate (FPR) gradually decreases. Therefore, the larger Sim_thre is, the lower the false alarm rate.
Many research works choose different datasets for method verification, and the verification results differ from dataset to dataset. Therefore, to be relatively fair, we compare the method proposed in this paper with those of others who also chose the ISCX dataset to verify their models. The methods in Table 3 have achieved remarkable results in botnet detection and are influential research works. The authors of [13] propose an adaptive botnet detection framework, which uses an SVM model to detect botnets. They train the model on the ISCX training dataset and then evaluate it on the ISCX testing dataset. Beigi et al. [8] focus on the proper selection and experimental assessment of features for accurate detection of botnets. Mohammad Alauthaman et al. [28] present a method based on an adaptive multilayer feedforward neural network in cooperation with decision trees to detect P2P-based bots. Soodeh Hosseini et al. [22] use a novel botnet detection and classification method based on convolutional neural networks and negative selection algorithms. All of them use the ISCX dataset, or a subset of its samples, to verify the performance of the proposed methods.
The comparison results are shown in Table 3. Through the comparison of the experimental results, it can be seen that our method is more effective.

Flow Window Fluctuation.
Since WinSize is the minimum size of the flow window, the actual size of the flow window fluctuates, as shown in Figure 10. The fluctuation of the flow window size affects memory usage, which is an important aspect of model performance. Figure 10 shows the window sizes for the training and testing datasets, respectively. It can be seen that most of the window sizes fluctuate around WinSize, except for a sharp increase in the size of a few individual windows. In addition, the running time of our detection method, implemented in Python, is evaluated on a personal laptop (Intel i7-6500U CPU, 2.59 GHz, 16 GB memory, Windows 10) with the ISCX-Testing dataset. When evaluating the running time, the time of the network traffic acquirer module is not considered. TCP flow windows are continuously fed to the subsequent TCP flow processing modules (preprocess module, attack flow recognizer module, infection flow recognizer module, and result integration module). The running time is shown in Figure 11: the larger the window, the longer the running time.

Table 3
Method                    TPR      FPR
Beigi et al. [8]          0.75     0.023
Alauthaman et al. [28]    0.992    0.0075
Hosseini et al. [22]      0.99     -
Our method                1.0      0.00706
"-" indicates that the corresponding result is not presented in the paper.

Conclusion
In this paper, we have proposed a protocol-independent bot detection framework based on the similarity of flows. The proposed method does not rely on the protocol or structure of botnets. Instead, it exploits the facts that all botnets have group characteristics and that the sequence of packet lengths is not affected by encryption. Therefore, the sequence of packet lengths is used as the characteristic of a TCP flow, and the similarity of TCP flows is calculated to detect botnet traffic. We evaluated the method on the ISCX dataset, and the results show that it achieves a high detection rate with a low false alarm rate.
In the future, we will consider UDP packets to better deal with new botnet technology. Meanwhile, we will make the detection system more robust and prevent botnets from using UDP to escape detection. In addition, the performance of the system will be further optimized to enable the system to process traffic in real time.

Data Availability
Complete information about the datasets is available at https://iscx.ca/botnet-dataset.

Conflicts of Interest
The authors declare that they have no conflicts of interest.