A Classification Detection Algorithm Based on Joint Entropy Vector against Application-Layer DDoS Attack

The application-layer distributed denial of service (AL-DDoS) attack poses a serious threat to cyberspace security. Attack detection is an important part of security protection: rapid and accurate identification of attacks provides effective support for the defense system. According to how the attacker selects URLs of the Web service, AL-DDoS attacks are divided into three categories: fixed URL attacks, random URL attacks, and traversal URL attacks. To identify these attacks, a mapping matrix of the joint entropy vector is constructed. By defining and computing the values of EUPI and jEIPU, a visual coordinate discrimination diagram of the entropy vector is proposed, which also reduces the data dimension from N to two. Based on boundary discrimination and the region in which the entropy vectors fall, the class of AL-DDoS attack can be distinguished. Experiments on training data sets and classification show that the novel algorithm can effectively distinguish web server DDoS attacks from normal burst traffic.


Introduction
Distributed denial of service (DDoS) attack [1][2][3] is one of the most serious threats to today's networks and has aroused great concern in countries around the world [4,5]. A DDoS attack consumes the resources of the victim server and keeps the target from providing services to legitimate users. DDoS attacks are categorized into two classes: network-layer DDoS (NL-DDoS) attacks and application-layer DDoS (AL-DDoS) attacks [6,7]. The early DDoS attacks were network-layer attacks. In NL-DDoS, attackers send a large number of bogus packets towards the victim host, exploiting vulnerabilities that exist only on the network and transport layers. For example, IP spoofing uses fake connections to quickly consume the server's bandwidth and hide the attacker's location; SYN flood attackers keep sending connection requests carrying only SYN flags to the server, which exhausts the bandwidth and resources of the server with a massive number of TCP half-open connections. In an NL-DDoS attack, the victim server or IDS can easily distinguish legitimate packets from DDoS packets [8]. In contrast, in AL-DDoS, perpetrators attack the victim server through a flood of legitimate requests. An AL-DDoS attack does not saturate the bandwidth of the victim server through inbound traffic but through outbound traffic. Because AL-DDoS attacks behave very much like flash crowd, a legitimate behavior in which a very large number of users simultaneously access a website, it is not easy to distinguish the two. Consequently, due to the universality and variety of web services in the application layer, AL-DDoS may be stealthier and more dangerous than the traditional NL-DDoS attack.
Considering that the impact of AL-DDoS attacks is growing, researchers at home and abroad have done a lot of related work in this field. Jung et al. [9] analyzed in depth the difference between AL-DDoS and flash crowd. When a flash crowd occurs, a large number of address clusters recur, while a large number of new address clusters appear under AL-DDoS attacks. The distribution of access addresses of a flash crowd is uneven, while the distribution of access addresses is more uniform under DDoS attacks. Li et al. [10] use a mixed measure to detect the shifting of the flow distribution, thus distinguishing between DDoS attacks and flash crowd. Yu et al. [11] use the Sibson distance to measure the similarity between flows and so distinguish the DDoS attack flow from the flash crowd flow. Oikonomou and Mirkovic [12] built a normal behavior model to distinguish between attackers and normal visitors. Lee et al. [13] proposed a detection algorithm for AL-DDoS attacks based on information entropy. Using the information entropy of the URL access rate, the algorithm can detect DDoS attacks but cannot distinguish between DDoS attacks and flash crowd. Rathika et al. [14] put forward a method to detect attacks based on the average number of requests per unit of time for each session (ANRS). When the DDoS attack flow is small and low-rate, however, the value of ANRS neither rises nor falls significantly. Xie and Yu [15] simulated users' access and page request behaviors with a Markov chain model. According to the jump probability of each session, the behavior model takes the degree of deviation as the detection indicator.
In this paper, based on the characteristics of user access behavior in the application layer, attacks are classified in terms of their URL access mode. The matrix from IP to URL is mapped to a joint entropy vector, which realizes data dimension reduction. By defining and computing EUPI and jEIPU, the coordinate discrimination diagram of the entropy vector is constructed. From the region in which the entropy vector falls, the type of AL-DDoS attack can be discriminated. The simulation experiment shows that the algorithm can effectively distinguish DDoS attacks from normal traffic.

Behavior of AL-DDoS Attack
On the web, a user's access to a URL involves three steps: visiting, lingering, and abandoning. Because different users are interested in different content and pages, the legitimate access behaviors from users' IP addresses to URLs are random. In contrast, AL-DDoS attacks are usually launched by a specific tool or by bot-nets, which makes those collaboration-based behaviors more regular and nonrandom.
According to the attacker's selection of the attacked URL, AL-DDoS attacks are divided into three categories. The first category is the fixed URL attack, in which the attacker chooses one or a few URLs and directs all requests at them. To achieve a better attack effect, the attacker often chooses a request that downloads a large picture or file; this form is the most common and the easiest to implement. For example, SOAP replay attackers [16] send the SOAP request message repeatedly to a fixed URL, which exhausts the resources of the victim server through outbound traffic.
The second category is the random URL attack, which scans the list of URLs on the site and randomly selects URLs in every attack. Random URL attacks masquerade as normal web access behavior in a low-density manner. Outwardly, it looks as if a very large number of users were simultaneously accessing a few popular websites. Therefore, the random URL attack is stealthier than the fixed URL attack.
The third category is the traversal URL attack, which is similar to a web crawler: it grabs URLs and selects among them for the next request. The attacker starts from the home page URL and selects a URL as the next request. The process is then repeated until no new URL is obtained.
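The three URL selection modes described above can be illustrated with a minimal Python sketch. It is not the attack tooling discussed in the cited work; the function names, the four-page `SITE_LINKS` graph, the `/big.jpg` target, and the breadth-first crawl order are all assumptions made only for illustration.

```python
import random

def fixed_url_attack(targets, n_requests, rng):
    """Fixed URL attack: every request hits one of a few chosen URLs."""
    return [rng.choice(targets) for _ in range(n_requests)]

def random_url_attack(site_urls, n_requests, rng):
    """Random URL attack: each request picks any URL of the site at random."""
    return [rng.choice(site_urls) for _ in range(n_requests)]

def traversal_url_attack(links, home):
    """Traversal URL attack: crawl from the home page until no new URL appears."""
    visited = {home}
    frontier = [home]
    order = []
    while frontier:
        url = frontier.pop(0)  # breadth-first order (an assumption)
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(nxt)
    return order

# Hypothetical four-page site used only for illustration.
SITE_LINKS = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": [], "/c": ["/"]}

rng = random.Random(0)
fixed = fixed_url_attack(["/big.jpg"], 5, rng)
rand = random_url_attack(list(SITE_LINKS), 5, rng)
crawl = traversal_url_attack(SITE_LINKS, "/")
```

In this sketch the fixed mode concentrates all requests on one URL, the random mode spreads them over the site, and the traversal mode visits every reachable URL exactly once, which is the equal-probability behavior the entropy analysis relies on.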

Definition of EUPI and EIPU.
The entropy of URL request per IP address (EUPI) is defined as

$$\mathrm{EUPI} = -\sum_{i=1}^{N} p_j(u_i) \log p_j(u_i),$$

where $p_j(u_i)$ is the probability of the $i$th URL request per source IP address. $p_j(u_i)$ indicates the percentage of different URL requests that are accessed by one IP address:

$$p_j(u_i) = \frac{x_{ji}}{\sum_{i'=1}^{N} x_{ji'}},$$

where $x_{ji}$ denotes the number of requests from the $j$th source IP address to the $i$th URL. The entropy of IP address per URL (EIPU) is defined as

$$\mathrm{EIPU} = -\sum_{j=1}^{M} p_i(s_j) \log p_i(s_j),$$

where $p_i(s_j)$ is the occurrence probability of the $j$th source IP address which accesses the $i$th URL:

$$p_i(s_j) = \frac{x_{ji}}{\sum_{j'=1}^{M} x_{j'i}}.$$

Entropy, whether EUPI or EIPU, measures the uncertainty of discrete random events. In other words, information entropy is low in an orderly system; the more disordered and random the system is, the higher the information entropy becomes. It can therefore serve as a measure of the degree of order of the system. Typically, the accesses of legitimate users to the site have a certain randomness, because users access web pages based on their interests. According to statistics, 80% of users visit 20% of the hot web pages [17]. For a fixed URL attack whose IP address is represented as $s_j$, $p_j(u_i)$ would increase significantly for the victim URLs; that is, the access requests converge on the one or few attacked URLs. Therefore, the value of EUPI would decrease. For both a random URL attack and a traversal URL attack, EUPI would increase. In particular, because of the equal-probability characteristic of traversal URL behavior, the EUPI of a traversal URL attack would be close to its maximum entropy.
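A minimal Python sketch of the two entropies may help make the definitions concrete. It assumes a request-count matrix whose rows are source IP addresses and whose columns are URLs, and it assumes the natural logarithm; the function names and the small `requests` matrix are hypothetical.

```python
import math

def entropy(counts):
    """Shannon entropy (natural log) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def eupi(matrix, j):
    """Entropy of URL requests for the IP in row j."""
    return entropy(matrix[j])

def eipu(matrix, i):
    """Entropy of source IPs for the URL in column i."""
    return entropy([row[i] for row in matrix])

# Hypothetical counts: rows are IPs, columns are URLs.
requests = [
    [30, 0, 0],  # IP 0 hammers a single URL (fixed-URL-like behavior)
    [2, 3, 1],   # IP 1 browses unevenly (normal-like behavior)
    [1, 1, 1],   # IP 2 visits every URL equally (traversal-like behavior)
]

for j in range(len(requests)):
    print(f"EUPI of IP {j}: {eupi(requests, j):.3f}")
```

Under these assumptions the row that concentrates on one URL has entropy zero, and the row with equal counts reaches the maximum $\log 3$, matching the qualitative behavior described in the text.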
The space complexity of the algorithm in the article is $O(N^2)$, which increases with the matrix dimension. Because the detection model only needs to calculate the conditional entropy of the matrix, the time complexity of the algorithm is $O(N \log N)$.
However, a situation in which many sudden hits and demands (e.g., hot events, festival online shopping, and centralized e-ticketing) lead to a sharp increase of URL traffic within a certain time must also be considered. This situation is named flash crowd, and its burst traffic and high volume are characteristics it shares with AL-DDoS attacks. Relying on EUPI alone would therefore give a higher false alarm rate. To optimize the detection method and reduce the false alarm rate, the entropy of IP address per URL (EIPU) is applied to distinguish between AL-DDoS attacks and flash crowd. When a URL is accessed from many IP addresses during a flash crowd, the traffic from each source IP address is approximately uniformly distributed. Thus, with the characteristics of EUPI and EIPU, a matrix transform is constructed that maps the IP-to-URL matrix to joint entropy vectors. The transform requires the condition $M = N$, where $M$ is the number of source IP addresses (rows) and $N$ is the number of URLs (columns). For each index $k$, $e_k$ stands for a joint entropy vector whose first component $e_k^{(1)}$ is an EUPI value and whose second component $e_k^{(2)}$ is an EIPU value. When $M \neq N$, extended processing of the matrix is used to satisfy the equality condition.
(a) $M > N$. When $M$ is greater than $N$, the matrix needs to be extended to an $M$-order square matrix. The data in columns $N + 1$ to $M$ comes from the normal access traffic of the training data set. The extended URLs from $N + 1$ to $M$ are named virtual URLs.
(b) $M < N$. When $M$ is less than $N$, the matrix needs to be extended to an $N$-order square matrix. The data in rows $M + 1$ to $N$ comes from the normal access traffic of the training data set. The extended IP addresses from $M + 1$ to $N$ are named virtual IPs.
Because the extended rows or columns are taken from the normal access traffic of the training data set, they do not affect the judgment of attack behaviors or abnormal traffic. For the joint entropy vector algorithm, the extended processing of the matrix only adds to the number of normal entropy vectors and does not change the distribution or the number of attack vectors. So the mapping from the $\max(M, N)$-dimensional space to the two-dimensional space yields $e_k = (e_k^{(1)}, e_k^{(2)})$, the two-dimensional entropy vector.
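The extension step and the mapping to two-dimensional vectors can be sketched in Python as follows. This is one possible reading, not the paper's exact transform: it assumes that the $k$th vector pairs the entropy of row $k$ with the entropy of column $k$ of the squared matrix, and that virtual cells are filled from a `normal_traffic` matrix of the target size. All names are hypothetical.

```python
import math

def entropy(counts):
    """Shannon entropy (natural log) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def extend_to_square(matrix, normal_traffic):
    """Pad an M x N count matrix to order max(M, N).

    Cells outside the observed matrix (virtual IPs or virtual URLs) are
    copied from `normal_traffic`, a square matrix of normal training counts.
    """
    m, n = len(matrix), len(matrix[0])
    size = max(m, n)
    return [
        [matrix[j][i] if (j < m and i < n) else normal_traffic[j][i]
         for i in range(size)]
        for j in range(size)
    ]

def entropy_vectors(square):
    """Map a square matrix to one (row entropy, column entropy) pair per index."""
    size = len(square)
    return [
        (entropy(square[k]), entropy([square[j][k] for j in range(size)]))
        for k in range(size)
    ]

# Hypothetical example: 3 IPs, 2 URLs, so one virtual URL column is added.
observed = [[4, 0], [1, 1], [2, 2]]
normal = [[1, 1, 1] for _ in range(3)]
square = extend_to_square(observed, normal)
vectors = entropy_vectors(square)
```

Whatever the original dimension, each index ends up as a single point in the plane, which is what makes the later region-based classification possible.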

Boundary
$I(\cdot)$ is the indicator function, which takes the value 1 when its condition holds and 0 otherwise. In summary, detecting AL-DDoS attacks is transformed into classifying the points that represent entropy vectors in the coordinate system. According to the respective characteristics of the different attack behavior types, the entropy vector detection algorithm is implemented as follows.
For the training data set, the point number of each class is calculated.
The boundary discriminant rule is defined as [equation missing in source]. According to the characteristics of the AL-DDoS entropy vector, the coordinate plane is divided into different regions by the boundaries, and the region in which a point falls decides the class it belongs to. The coordinate discrimination diagram of the entropy vector is shown in Figure 1.
In Figure 1, $c_1$, $c_2$, $c_3$, and $c_4$ stand for the classes of fixed URL attack, traversal URL attack, flash crowd, and normal access, respectively. On the basis of the training data set, the point numbers of the three classes of attacks are counted, and each of the three boundaries is calculated from them [boundary equations missing in source]. Accordingly, an AL-DDoS attack inevitably changes the entropy, so we take the entropy vector as an index.
From the region in which the entropy vector falls, it can be determined whether an AL-DDoS attack has occurred and what type of AL-DDoS attack it is.
When multiple training data sets are collected, the optimum classification boundary value can be obtained. A precision rate of class discrimination, $P_{\mathrm{pre}}$, is defined [equation missing in source]. The total training data set is taken to consist of the data sets collected in successive sampling times, and the optimum classification boundary value is defined over it [equations missing in source].
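The idea of choosing a boundary from labeled training data can be sketched in Python under strong simplifying assumptions. The paper's boundary formulas are not available here, so this sketch swaps in a plain search: it uses a single threshold on EUPI to separate fixed URL attacks (low EUPI, class `c1`) from everything else, scores each candidate by the fraction of correctly classified points (an indicator-function count), and keeps the best one. The function names, labels, and numbers are hypothetical.

```python
def precision_rate(predicted, actual):
    """Fraction of points whose predicted class equals the labeled class."""
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)  # indicator sum
    return hits / len(actual)

def classify_by_boundary(eupi_values, boundary):
    """Assign class 'c1' (fixed URL attack) below the boundary, 'other' above."""
    return ["c1" if v < boundary else "other" for v in eupi_values]

def best_boundary(eupi_values, labels, candidates):
    """Pick the candidate boundary with the highest precision on training data."""
    return max(
        candidates,
        key=lambda b: precision_rate(classify_by_boundary(eupi_values, b), labels),
    )

# Hypothetical labeled training points (EUPI values only).
train_eupi = [0.2, 0.4, 3.1, 3.5, 2.9]
train_labels = ["c1", "c1", "other", "other", "other"]
candidates = [0.1, 1.0, 3.0, 4.0]

b_opt = best_boundary(train_eupi, train_labels, candidates)
p_pre = precision_rate(classify_by_boundary(train_eupi, b_opt), train_labels)
```

A full implementation would need one such boundary per region of Figure 1 and both vector components; the one-dimensional case is shown only because it keeps the selection logic visible.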

Experimental Conditions and Processes.
Based on the open website log and MIT Lincoln Laboratory data sets [18], we use MATLAB to simulate access to the web server under normal conditions. We set up a website with 200 URLs, 10% of which are hot pages. There are about 800 visits per simulation time unit. Under normal circumstances, EUPI is as shown in Figure 2.
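The normal-traffic setup can be approximated with a short Python sketch (the experiments in the text use MATLAB; Python is used here only for illustration). The 200 URLs, the 10% hot pages, and the 800 visits per time unit come from the text. The share of visits that goes to hot pages (`HOT_SHARE = 0.8`), the uniform choice within each group, and the natural logarithm are assumptions.

```python
import math
import random

N_URLS = 200           # URLs on the simulated site (from the text)
N_HOT = 20             # 10% of the URLs are hot pages (from the text)
VISITS_PER_UNIT = 800  # about 800 visits per time unit (from the text)
HOT_SHARE = 0.8        # assumption: 80% of visits go to hot pages

def entropy(counts):
    """Shannon entropy (natural log) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simulate_normal_unit(rng):
    """Return per-URL request counts for one time unit of normal traffic."""
    counts = [0] * N_URLS
    for _ in range(VISITS_PER_UNIT):
        if rng.random() < HOT_SHARE:
            url = rng.randrange(N_HOT)          # a hot page
        else:
            url = rng.randrange(N_HOT, N_URLS)  # any other page
        counts[url] += 1
    return counts

rng = random.Random(42)
series = [entropy(simulate_normal_unit(rng)) for _ in range(60)]
print(f"URL request entropy, first time unit: {series[0]:.3f}")
```

With these assumptions the entropy fluctuates in a narrow band well below the maximum $\ln 200$, because most requests concentrate on the hot pages.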
When the fixed URL attack occurs, the change of the EUPI of the web server is as shown in Figure 3. At the 30th time unit, the fixed URL attack starts, and the entropy decreases markedly.
As shown in Figure 4, EUPI increases immediately when the random URL attack suddenly occurs. A large number of random URL requests makes the traffic of the server more disordered and chaotic, so the corresponding URL request entropy increases accordingly.
For traversal URL attacks, if the attack starts at the 30th simulation time unit, the request entropy rises suddenly, as shown in Figure 5. In these attacks the URLs are relatively random within a single time unit, so the detection results are consistent with those of the random URL attack.
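The direction of the entropy change in the three experiments above can be reproduced qualitatively with a Python sketch. It is not the MATLAB experiment from the text: the attack volume (`ATTACK_REQUESTS`), the hot-page share, the choice of URL 0 as the fixed target, and the natural logarithm are assumptions; only the 200 URLs, 800 visits, and the attack start at time unit 30 come from the text.

```python
import math
import random

N_URLS, N_HOT, VISITS, HOT_SHARE = 200, 20, 800, 0.8
ATTACK_START = 30       # attack begins at the 30th time unit (from the text)
ATTACK_REQUESTS = 800   # assumption: attack volume per time unit

def entropy(counts):
    """Shannon entropy (natural log) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def normal_counts(rng):
    """Per-URL counts for one time unit of normal traffic."""
    counts = [0] * N_URLS
    for _ in range(VISITS):
        if rng.random() < HOT_SHARE:
            counts[rng.randrange(N_HOT)] += 1
        else:
            counts[rng.randrange(N_HOT, N_URLS)] += 1
    return counts

def entropy_series(mode, rng, n_units=60):
    """URL request entropy per time unit, with an attack injected from ATTACK_START."""
    series = []
    for t in range(n_units):
        counts = normal_counts(rng)
        if t >= ATTACK_START:
            for _ in range(ATTACK_REQUESTS):
                if mode == "fixed":
                    counts[0] += 1                      # always the same URL
                elif mode == "random":
                    counts[rng.randrange(N_URLS)] += 1  # any URL, uniformly
        series.append(entropy(counts))
    return series

def mean(values):
    return sum(values) / len(values)

fixed = entropy_series("fixed", random.Random(1))
rand = entropy_series("random", random.Random(2))
```

Comparing `mean(fixed[:30])` with `mean(fixed[30:])`, and likewise for `rand`, shows the entropy dropping under the fixed URL attack and rising under the random URL attack. A traversal attack that cycles through all URLs in one time unit would behave like the random case in this sketch.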

Analysis and Optimization of Approach.
The above experiments show that when fixed, random, or traversal URL attacks occur, the URL request entropy changes immediately and markedly. The change of the value of EUPI can therefore effectively detect the abrupt changes of traffic caused by a DDoS attack. However, a flash crowd, in which many sudden hits and demands (e.g., hot events, festival online shopping, and centralized e-ticketing) lead to a sharp increase of URL traffic within a certain time, also changes the traffic abruptly. It is difficult to discriminate between attacks and flash crowd through EUPI alone, so EIPU is also applied to detect whether AL-DDoS attacks exist.
Under normal circumstances, EIPU is shown in Figure 6.
In Figure 6, EIPU stays between 5.16 and 5.28 under normal access.
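As a rough plausibility check on this range, assume that the entropy is computed with the natural logarithm and that about 200 source IP addresses access a URL (the node count of the simulation scenario). Neither assumption is stated explicitly for Figure 6. The maximum possible EIPU, reached when all addresses are equally likely, would then be:

```latex
H_{\max} = \ln 200 = \ln 2 + 2\ln 10 \approx 0.693 + 4.605 \approx 5.30 \ \text{nats}
\qquad (\log_2 200 \approx 7.64 \ \text{bits})
```

Under these assumptions, the reported range of 5.16 to 5.28 lies just below the bound, which would be consistent with nearly uniform access from the source addresses during normal operation.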
At the 30th time unit, the URL attack starts, which makes EUPI decrease markedly, as shown in Figure 7. To detect whether an AL-DDoS attack exists and to discriminate its type, a simulation experiment of the joint entropy vector algorithm is designed. The simulation parameters are shown in Table 2.
The simulation experiment scenario is constructed from the interaction process between 200 source IP addresses and 200 URL access addresses, recorded in the IP-to-URL matrix. There are 6 IP nodes performing the fixed URL attack, 20 IP nodes performing the traversal URL attack, and 174 legitimate nodes in the simulation experiment, accounting for 3%, 10%, and 87% of all nodes, respectively. There are 2 URLs under flash crowd, which is 1% of all URLs. In the experiment scenario, we use SOAP replay attacks to simulate DDoS. SOAP replay attackers on distributed nodes send the SOAP request message repeatedly to URLs, which exhausts the resources of the victim server. The number of requests is proportional to the attack strength and is taken as its measure. The experiment data comes from the lab website log and the MIT Lincoln Laboratory data sets [18]. MATLAB is used to integrate the data and construct the IP-to-URL matrix. The nodes are divided into four categories: c1, c2, c3, and c4 stand for the classes of fixed URL attack, traversal URL attack, flash crowd, and normal access, respectively. The corresponding access or attack behaviors are described in Section 2. The server records and saves the interaction process, and we follow the data format of the MIT Lincoln Laboratory data sets. We vary the distribution, attack strength, and proportion of the four kinds of nodes and run the simulation many times. The data with labeled categories is used as the training data. According to the boundary criterion (Section 3.3), the optimal boundary value satisfying formulas (22) and (23) is obtained.
The simulation results are shown in the coordinate plane with EUPI as the x-axis and jEIPU as the y-axis. In Figure 8, vector dots belonging to different attack types are distributed in different regions of the coordinate plane. According to the analysis in Section 3 and Figure 7, the position of the entropy vector reflects the characteristics of the AL-DDoS attack.
In order to verify the effectiveness of the algorithm in different cases, the AL-DDoS attack strength is defined [equation missing in source]. In terms of formula (15), the strength is a sum of indicator-function terms that counts only the matrix entries that are not equal to 0. We can also define the relative strength of an AL-DDoS attack [equation missing in source]. In Figure 9, 500 sets of data are collected, and the detection precision rate $P_{\mathrm{pre}}$ under three kinds of relative strength is compared. As can be seen from Figure 9, as the relative strength increases, the precision rate of the algorithm increases and reaches more than 90%. The joint entropy vector can be used to quickly judge what class of DDoS attack has happened. It can also effectively distinguish web server DDoS attacks from flash crowd, and its detection precision rate improves as the relative strength increases.

Conclusions and Future Work
With the popularity of the network and the rapid growth of network traffic, the burst traffic caused by hot events and centralized access often leads to service congestion and even paralysis. This burst of traffic is usually called "flash crowd." Flash crowd and DDoS attacks are essentially different. In this paper, an anomaly detection algorithm based on URL access entropy is proposed. The novel method can effectively distinguish between AL-DDoS attacks and flash crowd, which has great reference value for further analysis of DDoS attacks and their effective detection. As we have discussed, there is currently a lack of analysis of large bot-net attacks involving, for example, hundreds of thousands of zombie machines. (The experiment scenario of this article is constructed from the interaction process between 200 source IP addresses and 200 URL access addresses.) Focusing more closely on the application layer, we plan to detect attacks on a larger network scale with more nodes in the future. With improved experimental conditions and environment, subsequent studies will further analyze the complexity of the defense technique as the number of simulation nodes increases. As for future work, we also plan to extend the detection capabilities of the framework by supporting other indicators that can be used as a measure of attack strength, such as the amount of traffic and the number of packets.

Figure 1: The coordinate discrimination diagram of entropy vector.

Figure 8: The simulation distribution figure of entropy vector.

Figure 9: Comparisons of detection precision rate under different relative strength.

Table 1: The map from IP to URL. Rows are indexed by source IP address and columns by the sequence number of URL (from 1 to N).

Table 2: Simulation parameters.