A Novel Real-Time DDoS Attack Detection Mechanism Based on MDRA Algorithm in Big Data

1 Information and Network Center, Institute of Network Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 Science and Technology on Information Transmission and Dissemination in Communication Networks Laboratory, Shijiazhuang 050081, China
3 National Engineering Laboratory for Mobile Network Security (No. [2013] 2685), Beijing 100876, China
4 Network and Information Center, Institute of Network Technology and Institute of Sensing Technology and Business, Beijing University of Posts and Telecommunications, Beijing 100876, China


Introduction
The Denial of Service (DoS) attack is one of the most common attacks on the Internet. It is implemented either by forcing a compromised computer to launch the attack or by consuming the victim's resources, such as CPU cycles, memory, and network bandwidth. When a DoS attack is generated by a large number of distributed computers, it is called a Distributed Denial of Service (DDoS) attack. DDoS has become one of the main challenges to cyber security today.
A DDoS attack is launched by remote-controlled zombie hosts. It prevents legitimate users from accessing specific network services, or paralyzes the victim's own services, by partly or completely occupying computer resources or network bandwidth. The more abnormal traffic packets and compromised zombie hosts there are, the more damage is done to the network. If the number of zombie hosts is large enough, the attack can even disrupt the whole network environment and all of its servers very quickly.
In the summer of 1999, the Computer Incident Advisory Capability (CIAC) reported the first DDoS attack incident [1]. Since then, DDoS has become one of the most convenient and effective means of attack and is frequently used by hackers. In 2000, severe DDoS attacks were the reason that several Internet sites (e.g., Microsoft, Yahoo, and Amazon) could not be accessed for a long time.
DDoS attacks are mainly classified into three categories according to the attacked subject. The first is the Netflow-DDoS attack, of which typical instances include the DNS amplification attack, the SNMP amplification attack, UDP Flood, and ICMP Flood. The second is the connection-DDoS attack, of which SYN Flood and TCP Flood are the most influential cases. The third is the application-based DDoS attack, such as HTTP Get Flood and SSL Flood. In this paper, we focus on how to detect Netflow-DDoS and connection-DDoS attacks.
Despite all the effort from industry and academia, DDoS attack detection is still an open problem. In recent years, the techniques and sophistication of DDoS attacks have advanced steadily alongside improvements in attack detection. With the emergence of Big Data technology, it has become much more difficult than before to protect networks from the various DDoS attacks. The continuously growing volume of network traffic makes it impractical for previous detection methods to detect attack behavior at such a scale.
In this paper, we address the abovementioned challenges and propose a novel method for real-time DDoS attack detection based on the Multivariate Dimensionality Reduction Analysis (MDRA) algorithm, which combines Principal Component Analysis (PCA) and Multivariate Correlation Analysis (MCA). Compared with previous solutions, our proposed algorithm has the following advantages:
(i) Higher precision, with a True Negative Rate (TNR) approaching 100%.
(ii) CPU computing time that is one-eightieth of that of the previous detection method based on MCA.
(iii) Memory consumption that is one-third of that of the previous detection method based on MCA.
(iv) Constant computing complexity.
To the best of our knowledge, this paper is the first to propose this theoretical method and to apply it in the field of DDoS attack detection.
The remainder of this paper is organized as follows. Section 2 introduces the related work in DDoS attack detection and analyzes its shortcomings. Section 3 describes the theoretical approach behind our detection mechanism and presents the attack detection framework based on MDRA. Section 4 discusses the experimental details and gives the experimental results and analyses. Section 5 summarizes the paper.

Related Work
Although DDoS attack detection has been studied for almost 20 years, it is still an active field of research in industry and academia. Its methods and techniques have to keep pace with the growing complexity and diversity of DDoS attacks. Previous work mainly includes the following.
In 2004, Kim et al. [2] proposed a combined data mining approach for detecting various types of DDoS attack, which studied an automatic feature selection module and a classifier generation module. Because per-flow analysis is indispensable to DDoS attack detection, they used Netflow data as the collected data. In 2007, Scherrer et al. [3] focused on how to extract DDoS attack features and how to detect and filter DDoS attack packets by a number of known characteristics. In 2008, Lee et al. [4] designed a method for proactive detection of DDoS attacks that exploits the attack architecture and selects different variables based on attack features; they then performed cluster analysis for proactive attack detection. In 2010, Nguyen and Choi [5] introduced a method for preliminary detection of DDoS attacks by classifying the network conditions. They selected variables based on the key features and applied the k-nearest neighbor (k-NN) method to classify the network conditions into the phases of a DDoS attack. In addition, Tsai and Lin [6] proposed a new method to detect DDoS attacks called "Triangle Area Based Nearest Approach." By using this approach, the accuracy and the False Positive Rate (FPR) were improved. In 2012, Bhange et al. [7] presented the idea of the DDoS attack and its impact on network traffic. Their paper studied DDoS attacks by analyzing the distribution of network traffic in order to distinguish anomalous traffic from normal network behavior. In 2014, Tan et al. [8] put forward a more sophisticated DoS attack detection approach using MCA; their paper proposed a new MCA-based detection system to protect online services against DoS attacks. In the same year, Luo et al. [9] developed a mathematical model for estimating the combined impact of the DDoS attack pattern and the network environment on the attack effect by capturing the adjustment behaviors of the victim TCP's congestion window.
DDoS attacks can be detected by statistical analysis, data mining, and machine learning. However, some existing detection methods and techniques still suffer from low precision and TNR, and some of them cannot actively detect DDoS attacks. In particular, the previous detection methods and techniques can no longer meet the requirements of the Big Data era because of their low detection efficiency, high resource consumption, and high computing complexity. In this paper, we propose a novel detection mechanism based on MDRA to show how to detect DDoS attack traffic effectively and in real time.
Figure 1 shows the overview of our real-time DDoS detection framework. We first collect network traffic data samples from the Internet and then input them into the data acquisition system, which is composed of data cleaning, data storage, and data anonymization modules. Next, the processed traffic data are fed into the traffic feature Big Data system. The traffic features in this system serve two functions: the first is Online Attack Detection, and the second is Offline Traffic Analysis based on the Knowledge Base. Here, the results of Offline Traffic Analysis provide the feature recognition for Online Attack Detection. Finally, the current network is adjusted on the basis of the routing policy derived from the results of Online Attack Detection.

Detection Mechanism
In this section, our novel method is separated into three components, that is, traffic feature dimensionality reduction, traffic feature correlation analysis, and the attack detection framework based on MDRA and a threshold. These components are introduced in the following subsections.

Traffic Feature Dimensionality Reduction.
A network traffic record encompasses a wide variety of high-dimensional features. However, some of these features are redundant or noisy, and they may reduce the effectiveness and efficiency of attack detection. To eliminate data redundancy and data noise, we introduce a dimensionality reduction technique into our detection method. The PCA method is used to extract fewer but more representative features. The projections on the remaining dimensions are called the principal components [10]. One advantage of PCA is its data-driven design: it keeps the principal components of the feature data and eliminates the correlated feature data. PCA has been widely applied in the domain of intrusion detection [11] (for example, [12, 13]) and in other fields (for example, [14]).
In the PCA method, the original dependent random variables are transformed by an orthogonal transformation into new random variables whose components are uncorrelated. Algebraically, the covariance matrix of the original random variables is transformed into a diagonal matrix. Geometrically, the original coordinate system is transformed into a new orthogonal coordinate system whose axes point in mutually orthogonal directions.
PCA obtains the principal components one at a time. The first principal component is the linear combination of the original variables with the maximum variance. If the first principal component is not enough to represent the information of the original variables, we select a second linear combination. In order to reflect the original information effectively, the information already captured by the first principal component must not appear in the second principal component. By this analogy, all subsequent principal components can be constructed.

We assume that a network traffic record sample set $X$ includes $n$ samples and that the dimension of each sample is $m$. That is to say, $X = \{x_1, x_2, \ldots, x_n\}$ and $x_i = (x_{i1}, x_{i2}, \ldots, x_{im}) \in \mathbb{R}^{m}$, $i = 1, 2, \ldots, n$. The representation of the sample matrix is $X \in \mathbb{R}^{n \times m}$. Then, with the columns of $X$ mean-centred, the covariance matrix $C$ of the sample matrix is calculated by the following formula:

$$C = \frac{1}{n} X^{T} X. \qquad (1)$$

Next, the covariance matrix $C$ needs to be diagonalized. Here, $C$ is a symmetric matrix, and the purpose of symmetric matrix diagonalization is to find an orthogonal matrix $P$ such that

$$P^{T} C P = \Lambda. \qquad (2)$$

Assuming that we keep the dimensions corresponding to the $r$ ($r < m$) largest eigenvalues, a new diagonal matrix $\Lambda_1$ ($\Lambda_1 \in \mathbb{R}^{r \times r}$) is set up from these eigenvalues. The corresponding eigenvectors constitute a new eigenvector matrix $P_1$ ($P_1 \in \mathbb{R}^{m \times r}$). These eigenvectors in $P_1$ constitute a new coordinate system in the low-dimensional space, and they are the principal components.

Assuming that the sample matrix after PCA dimensionality reduction is $X_1$, according to the purpose of PCA, the covariance between every two dimensions in $X_1$ is essentially zero. In other words, the covariance matrix of $X_1$ is $\Lambda_1$. That is, $X_1$ satisfies the following condition:

$$\frac{1}{n} X_1^{T} X_1 = \Lambda_1. \qquad (3)$$

From (2), we get the following formula:

$$P_1^{T} C P_1 = \Lambda_1. \qquad (4)$$

Mathematical Problems in Engineering
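Combining the covariance definition (1) with the condition (3) and the formula (4), we get $\frac{1}{n} X_1^{T} X_1 = \frac{1}{n} (X P_1)^{T} (X P_1)$, which is satisfied by projecting the sample matrix $X$ onto the retained eigenvectors $P_1$:

$$X_1 = X P_1. \qquad (5)$$

Because the covariance matrix of $X_1$ is a diagonal matrix, the components of any two different dimensions are essentially uncorrelated. This completes the PCA process.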
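The dimensionality reduction described above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the experiments: the function names, the `1/n` covariance factor, the power-iteration eigen-solver, and the `target` parameter (the cumulative contribution rate at which components stop being added) are all assumptions made for the example.

```python
import math

def covariance_matrix(rows):
    """Mean-centre the rows (n samples x m features) and return
    (centred rows, m x m covariance matrix using a 1/n factor)."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centred = [[r[j] - means[j] for j in range(m)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centred) / n for b in range(m)]
           for a in range(m)]
    return centred, cov

def principal_components(cov, target=0.70, iters=200):
    """Return [(eigenvalue, eigenvector), ...] in decreasing order, stopping
    once their cumulative contribution rate reaches `target` (assumed)."""
    m = len(cov)
    total = sum(cov[j][j] for j in range(m))  # trace = sum of all eigenvalues
    work = [row[:] for row in cov]
    pairs, kept = [], 0.0
    while total > 0 and len(pairs) < m and kept / total < target:
        # Power iteration. The start vector is an arbitrary choice; it must
        # not be orthogonal to the dominant eigenvector.
        v = [float(j + 1) for j in range(m)]
        scale = math.sqrt(sum(x * x for x in v))
        v = [x / scale for x in v]
        for _ in range(iters):
            w = [sum(work[a][b] * v[b] for b in range(m)) for a in range(m)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[a] * work[a][b] * v[b] for a in range(m) for b in range(m))
        pairs.append((lam, v))
        kept += lam
        # Deflation: remove the component just found before the next pass.
        for a in range(m):
            for b in range(m):
                work[a][b] -= lam * v[a] * v[b]
    return pairs

def project(centred, pairs):
    """Project the centred samples onto the retained eigenvectors."""
    return [[sum(c[j] * v[j] for j in range(len(c))) for _, v in pairs]
            for c in centred]
```

For real traffic data a library eigen-solver would normally replace the hand-written power iteration; it is spelled out here only to keep the sketch self-contained.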

Traffic Feature Correlation Analysis.
From the view of the correlation based on statistical theory, DDoS attack traffic features reflect different statistical properties versus legitimate network traffic features. Here, we apply MCA [8,15,16]. This approach is based on a triangle area technique and Mahalanobis distance (MD). The triangle area technique is able to extract geometrical correlative information between every two features in an acquired network traffic record. And MD is capable of similarity measurement between every two traffic records. The analysis is presented as follows.
Assume that there is a captured network traffic record data set $X = \{x_1, x_2, \ldots, x_n\}$ with $x_i = [f^{i}_{1}, f^{i}_{2}, \ldots, f^{i}_{m}]^{T}$, where $x_i$ represents the $i$th traffic record and $f^{i}_{l}$ indicates the $l$th feature in the $i$th record. For example, $f^{i}_{j}$ and $f^{i}_{k}$ are a couple of features in $x_i$. The area of the triangle $T^{i}_{j,k}$ is shown as

$$T^{i}_{j,k} = \frac{|f^{i}_{j}| \times |f^{i}_{k}|}{2}, \qquad (6)$$

where $1 \le i \le n$, $1 \le j, k \le m$, and $j \ne k$. Figure 2 shows the area of the triangle $T^{i}_{j,k}$.
On the basis of (6), we get the area of the triangle for every two distinct features in $x_i$. By that analogy, the areas of the corresponding triangles between every two distinct features are acquired for each and every network traffic record in $X$, and a Triangle Area Matrix (TAM) is set up for each record. When $j$ is equal to $k$, the value of $T^{i}_{j,k}$ is zero, so the elements on the main diagonal of the matrix are zero. Because $T^{i}_{j,k}$ and $T^{i}_{k,j}$ represent the same triangle area, the two values are equal.
As a consequence, we draw the following conclusion: the TAM is a symmetric matrix, and the elements of its main diagonal are zero. Here, the lower triangle of the TAM is converted into a vector $\mathrm{TAM}^{i}_{\mathrm{lower}}$, which is shown as follows:

$$\mathrm{TAM}^{i}_{\mathrm{lower}} = [T^{i}_{2,1}, T^{i}_{3,1}, \ldots, T^{i}_{m,1}, T^{i}_{3,2}, \ldots, T^{i}_{m,2}, \ldots, T^{i}_{m,m-1}]^{T}. \qquad (7)$$

DDoS attacks are detected by applying MCA to the inherent correlations of traffic features in the Big Data network environment. The geometrical correlation between pairs of traffic features changes when the anomalous behavior of a DDoS attack appears on the Internet, and this change provides an important warning signal.
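The construction of the Triangle Area Matrix and of its lower-triangle vector can be illustrated with a short sketch. The function names are hypothetical, and the column-by-column ordering of the lower triangle is an assumption borrowed from the usual MCA formulation.

```python
def triangle_area_matrix(record):
    """Full TAM of one traffic record: entry (j, k) is |f_j| * |f_k| / 2
    for j != k, and zero on the main diagonal."""
    m = len(record)
    return [[0.0 if j == k else abs(record[j]) * abs(record[k]) / 2.0
             for k in range(m)] for j in range(m)]

def tam_lower(record):
    """Lower triangle of the TAM flattened column by column:
    T(2,1), T(3,1), ..., T(m,1), T(3,2), ..., T(m,m-1)."""
    m = len(record)
    return [abs(record[j]) * abs(record[k]) / 2.0
            for k in range(m) for j in range(k + 1, m)]

# A record with m = 3 features yields m * (m - 1) / 2 = 3 triangle areas.
example = tam_lower([2.0, -3.0, 4.0])
```

Because the TAM is symmetric with a zero diagonal, the flattened lower triangle carries all of its information in half the space.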

Attack Detection Framework.
In this section, we first establish benchmark data by the covariance matrix and MD. Secondly, the attack traffic detection based on MD and the selected threshold is implemented. Last but not least, we present the MDRA DDoS attack detection algorithm.

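Benchmark Data Formation by Covariance Matrix and MD.
(i) Computing the Covariance Matrix. For $g$ normal training traffic records, the covariance matrix of their lower-triangle vectors $\mathrm{TAM}^{\mathrm{nor}}_{\mathrm{lower}}$ is

$$\mathrm{Cov} = \begin{bmatrix} \sigma(T^{\mathrm{nor}}_{2,1}, T^{\mathrm{nor}}_{2,1}) & \sigma(T^{\mathrm{nor}}_{2,1}, T^{\mathrm{nor}}_{3,1}) & \cdots & \sigma(T^{\mathrm{nor}}_{2,1}, T^{\mathrm{nor}}_{m,m-1}) \\ \sigma(T^{\mathrm{nor}}_{3,1}, T^{\mathrm{nor}}_{2,1}) & \sigma(T^{\mathrm{nor}}_{3,1}, T^{\mathrm{nor}}_{3,1}) & \cdots & \sigma(T^{\mathrm{nor}}_{3,1}, T^{\mathrm{nor}}_{m,m-1}) \\ \vdots & \vdots & \ddots & \vdots \\ \sigma(T^{\mathrm{nor}}_{m,m-1}, T^{\mathrm{nor}}_{2,1}) & \sigma(T^{\mathrm{nor}}_{m,m-1}, T^{\mathrm{nor}}_{3,1}) & \cdots & \sigma(T^{\mathrm{nor}}_{m,m-1}, T^{\mathrm{nor}}_{m,m-1}) \end{bmatrix}. \qquad (8)$$

In this formula, the covariance between every two arbitrary elements in $\mathrm{TAM}_{\mathrm{lower}}$ is defined, following [8], as

$$\sigma(T^{\mathrm{nor}}_{j,k}, T^{\mathrm{nor}}_{l,v}) = \frac{1}{g-1} \sum_{i=1}^{g} \left(T^{\mathrm{nor},i}_{j,k} - \mu_{T^{\mathrm{nor}}_{j,k}}\right) \left(T^{\mathrm{nor},i}_{l,v} - \mu_{T^{\mathrm{nor}}_{l,v}}\right), \qquad (9)$$

where the mean of the $(j,k)$th elements and the mean of the $(l,v)$th elements of the TAMs for normal training traffic records are, respectively, defined as

$$\mu_{T^{\mathrm{nor}}_{j,k}} = \frac{1}{g} \sum_{i=1}^{g} T^{\mathrm{nor},i}_{j,k}, \qquad \mu_{T^{\mathrm{nor}}_{l,v}} = \frac{1}{g} \sum_{i=1}^{g} T^{\mathrm{nor},i}_{l,v}.$$

(ii) Computing the MD between Every Two TAMs of Traffic Records. The MD expresses the covariance distance of data and is an effective way to compute the similarity of two unknown sample sets. Unlike the Euclidean Distance (ED), the MD takes the relations between the various features into account and is independent of the measurement scale. The MD between the normal training records and their expectation and the MD between a freshly captured traffic record and the expectation of the normal training records are shown by the following formulas:

$$MD^{\mathrm{nor},i} = \sqrt{\left(\mathrm{TAM}^{\mathrm{nor},i}_{\mathrm{lower}} - \overline{\mathrm{TAM}^{\mathrm{nor}}_{\mathrm{lower}}}\right)^{T} \mathrm{Cov}^{-1} \left(\mathrm{TAM}^{\mathrm{nor},i}_{\mathrm{lower}} - \overline{\mathrm{TAM}^{\mathrm{nor}}_{\mathrm{lower}}}\right)}, \qquad (10)$$

$$MD^{\mathrm{fresh}} = \sqrt{\left(\mathrm{TAM}^{\mathrm{fresh}}_{\mathrm{lower}} - \overline{\mathrm{TAM}^{\mathrm{nor}}_{\mathrm{lower}}}\right)^{T} \mathrm{Cov}^{-1} \left(\mathrm{TAM}^{\mathrm{fresh}}_{\mathrm{lower}} - \overline{\mathrm{TAM}^{\mathrm{nor}}_{\mathrm{lower}}}\right)}. \qquad (11)$$

Moreover, the expectation of $\mathrm{TAM}^{\mathrm{nor}}_{\mathrm{lower}}$ for the normal training records is shown as follows:

$$\overline{\mathrm{TAM}^{\mathrm{nor}}_{\mathrm{lower}}} = \frac{1}{g} \sum_{i=1}^{g} \mathrm{TAM}^{\mathrm{nor},i}_{\mathrm{lower}}. \qquad (12)$$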
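The benchmark computation described above (mean vector, covariance matrix, and Mahalanobis distance) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the `1/(g-1)` covariance factor is an assumption taken from the usual sample covariance, and the linear system is solved by Gaussian elimination instead of forming the inverse explicitly.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of the lower-triangle vectors of normal records."""
    g = len(vectors)
    return [sum(v[a] for v in vectors) / g for a in range(len(vectors[0]))]

def covariance(vectors, mean):
    """Sample covariance matrix (assumed 1/(g-1) factor)."""
    g, d = len(vectors), len(mean)
    return [[sum((v[a] - mean[a]) * (v[b] - mean[b]) for v in vectors) / (g - 1)
             for b in range(d)] for a in range(d)]

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            # A singular covariance matrix has no inverse; real traffic data
            # may need regularisation, which this sketch does not attempt.
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def mahalanobis(vector, mean, cov):
    """sqrt((v - mean)^T Cov^{-1} (v - mean))."""
    diff = [a - b for a, b in zip(vector, mean)]
    quad = sum(d * s for d, s in zip(diff, solve(cov, diff)))
    return math.sqrt(max(0.0, quad))
```

The same `mahalanobis` call serves both cases in the text: it is applied to each normal training vector to build the benchmark and to a freshly captured record at detection time.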

Attack Detection Standard Based on MD and Threshold.
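For DDoS attack detection, we set a threshold value to distinguish anomalous DDoS traffic from normal traffic. Following [8], the threshold value is given by

$$\mathrm{Threshold} = \mu + \sigma \times \alpha, \qquad (13)$$

where $\mu = \frac{1}{g} \sum_{i=1}^{g} MD^{\mathrm{nor},i}$ is the mean of the MDs of the $g$ normal training records given by (10), and $\sigma$ is their standard deviation, shown as follows:

$$\sigma = \sqrt{\frac{1}{g-1} \sum_{i=1}^{g} \left(MD^{\mathrm{nor},i} - \mu\right)^{2}}. \qquad (14)$$

In order to conform to the normal distribution [8], the value of $\alpha$ ranges from 1 to 3 in increments of 0.2 in this paper. This gives the standard of DDoS attack detection: a traffic record is considered an attack when the MD between the freshly acquired record and the expectation of the normal training records, given by (11), is greater than the threshold.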
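The threshold rule above can be illustrated with a small sketch. The function names are hypothetical, and the use of the sample standard deviation and of the form "mean plus a multiple of the standard deviation", as in [8], are assumptions of this example.

```python
import statistics

def threshold(normal_distances, alpha):
    """Mean of the normal-record MDs plus alpha sample standard deviations."""
    mu = statistics.mean(normal_distances)
    sigma = statistics.stdev(normal_distances)  # sample standard deviation
    return mu + sigma * alpha

def is_attack(distance, normal_distances, alpha):
    """Flag a fresh record whose MD exceeds the threshold."""
    return distance > threshold(normal_distances, alpha)

# The multiplier is swept from 1 to 3 in increments of 0.2 (11 values).
alphas = [round(1.0 + 0.2 * step, 1) for step in range(11)]
```

A larger multiplier raises the threshold, so fewer normal records are flagged by mistake but more attacks may be missed; sweeping it is what produces the curves compared in the experiments.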

A MDRA DDoS Attack Detection Algorithm.
Tan et al. [8] proposed an algorithm for normal profile generation based on the triangle area technique and MCA and an algorithm for attack detection based on MD. Evaluation and comparison with several state-of-the-art approaches show that this earlier attack detection method and its system have advantages in detection performance, Detection Rate (DR), and accuracy. However, in Big Data cyberspace security, especially as high-volume network attacks keep growing, detection efficiency, resource consumption, and computing complexity must be taken fully into account. For these reasons, we propose the MDRA algorithm to detect anomalous network traffic efficiently. Algorithm 1 describes the procedure of DDoS attack detection based on MDRA in detail.

Experiments
In this section, we discuss how to apply our algorithm to detect DDoS attack traffic efficiently. The flowchart of attack detection is shown in Figure 3. Firstly, we present the data set used in our experiments and the data pretreatment approach. Then, we present the experimental results to evaluate the performance of the algorithm. Finally, we compare our method with the previous unoptimized approach in terms of time cost, resource consumption, and computing complexity.
The computer environment to run our experiments is shown in Table 1.
Next, we describe our experiments in detail.

Data Set and Pretreatment.
In this paper, we use the well-known Knowledge Discovery and Data Mining (KDD) Cup 1999 data set [17-21] to verify our novel algorithm. We have to admit that this data set has some shortcomings, but it is still the only publicly available and relatively credible labeled benchmark data set so far. This data set has been widely applied to researching and evaluating network intrusion detection methods [22, 23]. The KDD Cup 1999 data set comprises about five million network records and provides a training subset containing 10 percent of the network records as well as a testing subset. It covers four main categories of attack, that is, DoS, R2L, U2R, and Probing. Here, we use the records labeled as "normal" in the abovementioned training subset to construct our benchmark data and employ the testing subset "corrected" to verify the validity and efficiency of our algorithm. In this paper, we choose DoS attacks to evaluate our algorithm and compare it with the previous approaches. The data sets used in our experiments are shown in Table 2. The data pretreatment procedure is as follows.
Firstly, each network traffic record in this data set consists of 41 features plus 1 class label [24]. In our experiments, we need numeric data for all 41 features of every record. However, 3 of the features are nonnumeric: protocol_type, service, and flag. They must be transformed into numeric type. The type conversion is performed according to Table 3. Below, we analyze the pretreatment process in detail for the feature "service."
There are 70 kinds of network service types in the "service" feature; however, some of them appear rarely or never, and these service types can be ignored completely. By counting and sorting the 494021 records in the 10 percent training subset, we find that the top three network service types are ecr_i, private, and http, with ratios of 56.96%, 22.45%, and 13.01%, respectively. All the other types together account for merely 7.58%. The ratios of the top four types in the "service" feature are shown in Table 4.
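The counting and sorting of service types, and a numeric encoding built from it, can be sketched as follows. The actual code assignment is given in Table 3 and is not reproduced here, so the mapping used below (rank of the most frequent types, with 0 for all remaining types) is purely hypothetical, as are the function names.

```python
from collections import Counter

def service_shares(services):
    """Return (service, share of records) pairs, most frequent first."""
    counts = Counter(services)
    total = sum(counts.values())
    return [(name, count / total) for name, count in counts.most_common()]

def encode_service(services, top=3):
    """Hypothetical encoding: the `top` most frequent service types get the
    codes 1..top by rank, and every other type is merged into code 0."""
    ranked = [name for name, _ in Counter(services).most_common(top)]
    codes = {name: index + 1 for index, name in enumerate(ranked)}
    return [codes.get(name, 0) for name in services]
```

Merging the rare service types into a single code reflects the observation above that they account for only a small fraction of the records.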
Secondly, among the 41 features of the records labeled as "normal" in the 10 percent training subset, three features (wrong_fragment, num_outbound_cmds, and is_host_login) are invalid for PCA because all of their values are zero. Therefore, we remove these three features in our experiments.
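Removing features whose values are all zero can be sketched as below. The function names are hypothetical, and records are assumed to be lists of numeric values with one column per feature.

```python
def all_zero_columns(rows):
    """Indices of the features whose value is zero in every record."""
    width = len(rows[0])
    return [j for j in range(width) if all(r[j] == 0 for r in rows)]

def drop_columns(rows, columns):
    """Copy of the records without the listed feature columns."""
    skip = set(columns)
    return [[value for j, value in enumerate(r) if j not in skip]
            for r in rows]
```

Such features have zero variance, so they contribute nothing to the covariance matrix and can be removed before PCA without losing information.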
Last but not least, we extract the principal components according to the accumulative contribution rate of the PCA algorithm. As a general rule, the accumulative contribution rate is set to 50% or more to extract the important features from the chosen data set [6]. In order to obtain the more important principal components, the accumulative contribution rate is set to 70% in our experiments. The principal components extracted from the 41 features are listed in Table 5.
These results prove that the latter is superior to the former. In order to estimate the advantage of our method, it is indispensable to establish some evaluation indicators. Here, we present four formulae to evaluate our algorithm: Precision, TNR, FPR, and DR [11]. The formulae are defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{TNR} = \frac{TN}{TN + FP}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{DR} = \frac{TP}{TP + FN},$$

where
(i) TP (True Positive) is the number of attacks correctly classified as attacks;
(ii) FP (False Positive) is the number of normal records incorrectly classified as attacks;
(iii) TN (True Negative) is the number of normal records correctly classified as normal records;
(iv) FN (False Negative) is the number of attacks incorrectly classified as normal records.
In Figure 4, it is not hard to see that as the value of $\alpha$ gradually increases from 1 to 3 in increments of 0.2, the precision of the attack detection method based on MDRA is superior to that of its counterpart based on MCA; the former is about 0.4 to 0.6 percent higher than the latter.
In Figure 5, similarly, we find that the TNR of our detection method is consistently superior to that of the MCA-based method as $\alpha$ increases, and the former is about 1.2 to 2.4 percent higher than the latter.
In addition, the relationship between DR and FPR, plotted as the Receiver Operating Characteristic (ROC) curve, is frequently used to evaluate detection performance. The ROC curve is obtained by setting different thresholds, and there is a tradeoff between the DR and the FPR [25]. The ROC curves of the two detection methods are compared in Figure 6. In Figures 6(a) and 6(b), the ROC curves for our method and for the MCA-based method both show a growing tendency. In Figure 6(a), the ROC curve of our method climbs gradually from 72.34% to 72.35% for DR, which shows that the change of DR across the different values is fairly small. In contrast, in Figure 6(b), the change is relatively large, and the ROC curve jumps from 83.18% to 89.84%. However, in Big Data, we pay more attention to the timeliness, time cost, resource consumption, and computational complexity of attack detection. Therefore, we consider this discrepancy in DR acceptable. In these respects, our method has clear advantages over the other methods, as discussed in the next section.
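The four evaluation measures used in this section follow directly from the confusion counts, as the sketch below shows. The function name and the sample counts in the comment are illustrative assumptions only and are not taken from the experiments.

```python
def detection_metrics(tp, fp, tn, fn):
    """Precision, TNR, FPR, and DR from the four confusion counts."""
    return {
        "precision": tp / (tp + fp),  # flagged records that are attacks
        "tnr": tn / (tn + fp),        # normal records correctly accepted
        "fpr": fp / (fp + tn),        # normal records wrongly flagged
        "dr": tp / (tp + fn),         # attacks that are detected
    }

# Illustrative counts only: 90 detected attacks, 10 false alarms,
# 190 correctly accepted normal records, 30 missed attacks.
scores = detection_metrics(tp=90, fp=10, tn=190, fn=30)
```

Since TNR and FPR share the same denominator, they always sum to one, so an improvement in TNR is equivalent to a reduction in FPR.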

Results Comparisons in terms of Time Cost and Resource Consumption.
Here, we analyze in detail the time cost and memory consumption of the methods based on MDRA and MCA.
On the one hand, our detection mechanism is superior in time cost to the one based on triangle area and MCA proposed by Tan et al. Our experimental server has two CPUs, each with 16 cores. When we ran the abovementioned experimental data, one of the two CPUs was used, and its 16 cores were gradually loaded to full capacity. The comparison of the CPU time taken to run the experimental data with our detection method and with the other one is shown in Figure 7. In the same experimental environment, the CPU computing time of our method is about one-eightieth of that of the method of Tan et al.
On the other hand, in terms of memory consumption, our detection mechanism also outperforms the method proposed by Tan et al. The memory occupied by our detection method in the experiments is less than 1 GB, whereas the other method needs more than 3 GB. In the same experimental environment, the memory occupied by the detection method of Tan et al. is therefore more than 3 times as large as ours. The comparison of memory consumption when running the experimental data is shown in Figure 8.
To sum up, our detection method is well suited to real-time DDoS attack detection in environments with vast amounts of network traffic in Big Data.

Computing Complexity Analysis.
In this section, we analyze the computing complexity of our detection method.
Because the previous method based on MCA has a computing complexity of $O(m^2)$, where the number of features $m$ is a fixed number, its overall computing complexity is equal to $O(1)$ [8]. Our detection mechanism based on MDRA uses a similar computational principle. What is more, the fixed feature dimensionality after dimensionality reduction in our method is one-third of that of the previous method based on MCA. Hence, the computing complexity of our method is also equal to $O(1)$. In this respect, our detection mechanism is equal to or better than the other methods in [6, 8, 16].
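The quadratic dependence on the feature dimensionality comes from the number of distinct feature pairs, each of which yields one triangle area. The sketch below counts these pairs; the function name is hypothetical, 38 is the number of features left after removing the three all-zero ones from the original 41, and 12 is only an illustrative reduced dimensionality, not the value used in the experiments.

```python
def triangle_count(m):
    """Number of distinct feature pairs, i.e. entries in the lower TAM."""
    return m * (m - 1) // 2

full = triangle_count(38)     # 41 features minus the 3 all-zero ones
reduced = triangle_count(12)  # illustrative reduced dimensionality only
```

Both counts are constants once the feature set is fixed, which is why the per-record cost is treated as constant, but reducing the dimensionality to roughly one-third shrinks the number of triangle areas by about an order of magnitude.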

Conclusion
In this paper, we present a real-time DDoS attack detection mechanism based on the MDRA algorithm in Big Data. Compared with previous methods, the experimental results demonstrate that our solution distinguishes attack traffic from a vast amount of normal network traffic more effectively and efficiently in terms of precision, TNR, time cost, memory consumption, and computing complexity.