Network-Wide Traffic Anomaly Detection and Localization Based on Robust Multivariate Probabilistic Calibration Model

Networkanomalydetectionandlocalizationareofgreatsignificancetonetworksecurity.Comparedwiththetraditionalmethodsof hostcomputer,singlelinkandsinglepath,thenetwork-wideanomalydetectionapproacheshavedistinctiveadvantageswithrespect todetectionprecisionandrange.However,whenfacingtheactualproblemsofnoiseinterferenceordataloss,thenetwork-wide anomalydetectionapproachesalsosuffersignificantperformancereductionormayevenbecomeunavailable.Besides,researches onanomalylocalizationarerare.Inordertosolvethementionedproblems,thispaperpresentsarobustmultivariateprobabilistic calibrationmodelfornetwork-wideanomalydetectionandlocalization.Itappliesthelatentvariableprobabilitytheorywith multivariate t -distributiontoestablishthenormaltrafficmodel.Notonlydoesthealgorithmimplementnetworkanomalydetection by judging whether the sample’s Mahalanobis distance exceeds the threshold, but also it locates anomalies by contribution analysis. Both theoretical analysis and experimental results demonstrate its robustness and wider use. The algorithm is applicable when dealing with both data integrity and loss. It also has a stronger resistance over noise interference and lower sensitivity to the change of parameters, all of which indicate its performance stability.


Introduction
Network traffic anomalies are unusual and significant changes at network's traffic level.Intrusions such as DDos attacks and zombie networks significantly jeopardize the Internet security, and network jams and malfunctions have unpleasant impact on service quality; therefore it is critical to detect and locate network anomalies for both network operators and end users.It is a challenging task to detect and locate them because one must extract and interpret anomalous patterns from large amounts of high-dimensional, intricate, and noisy background traffic data.
There are a great number of researches on anomaly detection.Host-based anomaly detection system monitors and analyzes the internals of a computing system by applying data mining of the system logs and audit records [1,2]; another detection method based on performance measurement data such as end to end round-trip time and packet loss probability in a single path can be implemented by single variable time series analysis [3,4]; network anomaly detection based on traffic measurements from single link can be implemented by applying machine learning and signal analysis [5,6].All of these methods have their limits, because they only concentrate on a part of the information and their detection area is limited.When the scale of network enlarges and data transfer rate speeds up, many network anomalies often exhibit strong network-wide characteristics [7,8], and their impact may always spread to multiple links, while their local characteristics may not be that obvious.It is difficult to conduct network-wide analysis with above methods and their accuracy cannot be guaranteed.
In order to solve the problems mentioned above, Lakhina et al. come up with network-wide anomaly detection based on subspace construction via PCA [9].This method employs network traffic of many Origin-Destination (OD) flows to establish a model of normal behavior and detects anomalies by measuring deviations from that model.With the concept of network-wide detection, researches are conducted in space-time expansibility [10][11][12][13], robustness [14][15][16], realtime processing [17,18], and anomaly measure [19,20], which enrich network-wide anomaly detection.This kind of methods uses whole-network traffic data, which has huge performance advantages over single point, single path, and single link.At the same time, compared with other methods, it sets up normal behavior model which enables us to avoid building anomalous feature library.Thus it can be implemented to detect known anomalies as well as unknown anomalies, and it can be used widely.Network-wide anomaly detection improves detection performance by introducing wider and multidimensional network information, but it also faces some real problems when implemented in large scale and high speed backbone network.Firstly, because of its wider collection area, more collection equipment, and faster network speed, it might not be applicable if collected data were lost during collection or transfer process [21]; secondly, traffic flows in backbone network continue to grow in volume and complexity, and hidden noise like anomalous traffic could degrade performance of the anomaly detection algorithms [22], while some of the attacks might even pollute the detectors [14][15][16]; thirdly, the above anomaly detection methods can only find when anomalies happen, but they still have some defects on locating those anomalies [22].
Therefore, we propose an approach named RMPCM based on robust multivariate probabilistic calibration model to overcome these problems discussed above.This anomaly detection and locating algorithm introduces a latent variable probabilistic model based on -distribution instead of a Gaussian distribution to establish a normal traffic model.By judging if the sample's Mahalanobis distance from the normal model exceeds the threshold, traffic anomaly detection is achieved.Locating anomalies is attained with contribution analysis.RMPCM approach is more robust; not only can it be widely used for processing complete data as well as missing data, but also it acquires stronger robustness under the noise interference and lower sensibility of model parameters.The contributions of this paper consist of following 4 aspects; the RMPCM approach can (1) solve anomaly detection problem when data loss occurs by establishing a latent variable probabilistic model, (2) increase detection accuracy by introducing multivariate -distribution to relieve noise interference in modeling the normal traffic behavior, (3) correctly locate the underlying Origin-Destination (OD) flows being the source of the anomaly, (4) reduce the amount of work involved in implementing complicated parameter testing, because RMPCM has a better stability and lower sensibility for model parameters.
This paper is organized as follows.We begin in Section 2 with a discussion of the related work.We describe data source model and problems that need to be solved in Section 3. In Section 4, we describe the RMPCM approach in detail and solve three problems raised in Section 3. We validate our approach in three different ways of experiments and contrast our RMPCM with existing approach on traffic anomaly detection in Section 5. A discussion of several details is presented in Section 6. Concluding remarks and our ongoing work are presented in Section 7.

Related Work
Back in 1987 Denning had demonstrated statistic model for detecting network anomalies [23].And it is becoming more and more important with the development of the Internet.There are many traditional anomaly detection approaches based on host computer, single path, and single link.
Researches in [7] indicated the generation and development of network anomalies have exhibited a tendency of networkwide characteristic.They found that the performance of the anomaly detectors increases with enlarging the range of detection beyond linear growth, which sets up the foundation for network-wide anomaly detection.The authors of [9] proposed network-wide anomaly detection algorithm based on network traffic for the first time, which illustrated low dimensionality of OD flows.They also integrated traffic statistics of multiple OD flows to build up a model of normal behavior and detected anomalies by measuring deviations from that model.The authors of [11] came along to improve the anomaly detection approach based on PCA by applying stochastic matrix perturbation theory and proposed a PCAbased distributed approach for network-wide anomaly detection.Reference [13] expanded the classical PCA and proposed the Karhunen-Loeve expansion for network-wide anomaly detection.Reference [17] proposed an online anomaly detection approach using kernel recursive least squares algorithm to solve the problem of online detection.All of them did not address the problems of how to detect anomalies in the condition of noise interference and data loss and how to locate the anomalies in the actual network environment.
The authors of [22] took in-depth study on the influence of anomalous traffic on the performance of detector and indicated that large anomalies may cause the offset of normal model based on PCA, which increased the false positive rate (FPR) of anomaly detection.References [14,15] took further steps to study poisoning attacks on anomaly detectors and evaluated poisoning techniques and developed defense.The authors of [16] listed 3 mechanisms of poisoning attacks and proposed defense based on robust PCA with projection pursuit.All of the above only focus on the poisoning and defense techniques based on PCA detector, but there is lack of researches on common modeling approaches in the intricate and noisy environment.
Data loss is very common in many fields, and the question of how to get enough information from missing data needs to be answered.The authors of [24,25] proposed an algorithm to solve the problem of goodness-of-fit test for varying coefficient models with missing data.The authors of [21] gave their opinion on data loss problem when network traffic flows were measured in large scale and high speed backbone network; they proposed an approach of sparsity regularized matrix factorization (SRMF) to make the data complete.This is applicable in traffic engineering, capacity planning, forecasting, and so forth, but it did not research deeply in network anomaly detection when facing data loss.
Network anomaly detection can determine when anomalies take place, but locating anomaly is an extremely challenging task.The authors of [22] point out the deficiency of network-wide detection algorithm based on PCA in locating anomaly.Reference [26] then proposed an approach of Basis-Detect for network-wide anomaly detection and locating, but it could only locate anomalies to border router; it was unable to pinpoint the position.
In this paper we propose a network-wide anomaly detection algorithm based on RMPCM, which will later be proved to have a better performance in solving problems of noise interference, data loss, and locating anomalies.

Data Model and Problems Description
3.1.Data Model.Conventionally the researches of the Internet traffic flow mainly focused on temporal characteristics of data package on a single link, which help in developing concepts of self-similar stochastic processes, longrange dependence, and so forth.One ISP (Internet service provider), however, consists of hundreds of those links which are connected all over, and the Internet contains several thousand ISPs.In such a vast background the spatial characteristics of network traffic come to people's attention inevitably.However, it is difficult to analyze traffic flow data of all links in the network simultaneously, because it amplifies the complexity of modeling traffic on a single link which is itself a complicated task.As compact and elegant descriptions of traffic flows between nodes in a certain network structure, traffic matrix is a constantly employed model to conduct explorations on the spatiotemporal component of networkwide traffic.Traffic matrix is an overview of network-wide traffic.Instead of studying traffic on all links, applying traffic matrix provides more straightforward and fundamental insights into network-wide traffic study [8].We employ traffic matrix at PoP (point of presence) level as data source in our research.
Traffic matrix at PoP level: assume that an autonomous system (AS) has  PoPs.Continuous measuring of the traffic between each PoP pair at a certain period can obtain the traffic of Origin-Destination (OD) flows.An OD flow denotes the collection of all traffic that enters the network from an ingress node and departs from an egress node.Arranging these point-to-point measured values in  ×  matrix can obtain this AS's traffic matrix X.As shown in Figure 1,  denotes the number of measurement periods, and  denotes the number of measured value of OD flows at each measurement ( =  × ).The element   in X denotes the volume of a certain traffic measure at the th period and the th OD.Traffic volume (the number of bytes, packets, or flows) is adopted as a traffic measure in this paper.

Data Loss.
In the process of collecting data, data loss may occur.This is because massive data in high speed backbone network may increase burden of collecting equipment and reduce its stability.Another reason is due to network congestion, equipment or link malfunctions when transferring data.Traffic data loss is not all completely random, and many of the cases are highly structural.In order to describe the scenario of data loss in the process of data collection and transfer, four kinds of loss mechanisms are adopted.
(1) PureRandLoss.The elements in  are missing independently at random with probability .This may be due to unexpected congestion by chance in measuring equipment or unreliable transfer mechanism.
(2) PeriodRandLoss.The rows of traffic matrix X are corresponding to measurement periods, and PeriodRandLoss means the rows in  are missing at random with probability .The reason of this type of structured loss may be that storage devices are overloaded or program breaks down when centralized processing large number of measured data during this period.
(3) ODRandLoss.The columns of traffic matrix X are corresponding to OD flows, and ODRandLoss means the columns in X are missing at random with probability .This type of structured loss could stem from OD identification error caused by either flow filtration or links/routers' malfunction.
(4) PieceRandLoss.PieceRandLoss means the submatrixes in  are missing at random with probability .This type of structured loss may be caused when storage devices are full but they are still keeping collecting data for several periods or devices' breakdown for some time, which is corresponding to data loss of traffic matrix multiple adjacent columns.Note that the case single row PieceRandLoss corresponds to Peri-odRandLoss, and single column PieceRandLoss corresponds to ODRandLoss.

Noise Interference.
Model parameter estimation often uses maximum likelihood estimation (MLE) when samples are known, and the probability distribution of the samples is needed.However, it is very troublesome to accurately describe the distribution of data; therefore it is always assumed that the data generally follows the Gaussian distribution because its nice analytical property always yields tractable algorithms.The MLE is equivalent to the least square estimation in the linear Gaussian regression model which is noted for its unduly sensitive to atypical samples such as outliers and it would affect the accuracy of the model [27].In real network measurement collected traffic Locating anomalies in this paper is corresponding to pointing out the intersections of the rows and columns of traffic matrix X when anomalies occur.The row of X corresponds to the time of anomaly occurrence, and the column of X corresponds to the position where anomaly occurs (i.e., on one OD or a few OD).

RMPCM
The relationship between anomalous event generation and its detection and localization is shown in Figure 2. Anomaly events will affect part of traffic flows in a network and thus cause changes of corresponding statistics of overall network traffic.Traditional network anomaly detection algorithms will not be applicable any longer in the condition of incomplete data.It is considered to adopt Bayes method, but because of the complexity of network traffic data posterior mean estimation and asymptotic variance cannot be directly derived from this method.Therefore a latent variable probabilistic model is to be introduced, meaning some "latent data" are to be added in the known data in order to simplify the parameter estimation.In this process missing data along with unknown parameters treated as "latent data" will be solved by applying expectation-maximization (EM) algorithm to achieve the maximum likelihood estimation (MLE) of model parameters.
When computing MLE, probabilistic distribution of known data is required.Normally it is assumed that they are normally distributed, but because of some anomalous traffic, this assumption will cause parameter estimation to have a large deviation; therefore, multivariate Gaussian distribution is replaced by -distribution in our paper.Compared with Gaussian distribution, -distribution has heavier tails, which is a desirable property to handle data sets in the presence of anomalies.The explanation is as follows.
Specifically, suppose sample data x  ( = 1, . . ., ) are recorded ( represents the number of the samples), and one assumes that they are independent identically distributed Gaussian random vectors: in which  is the dimension of the sample,  is the mean vector function with parameter , and Σ is the covariance matrix with parameter .-distribution model replaces the above model, which is is also the dimension of the sample,  is the location vector, Λ is the scale matrix, and V is degree of freedom for -distribution; probabilistic density function (p.d.f.) is calculated as where is the weight assigned to sample .  3.
In order to solve the problem that Gaussian noise models are too sensitive to atypical observations such as anomalous traffic observations, we suppose the noise is drawn from -distribution model instead of a Gaussian noise model.Assume each -dimensional latent vector t  comes from a linear probabilistic projection of -dimensional  ( ≥ ) eigenvector x  , and we build a latent variable probabilistic model where we select a unit variance -distribution as the prior distribution on the latent vectors.The probabilistic model is where W is the projection matrix,  is the location vector, and I is the unit matrix.
It is inapplicable to be analyzed and resolved using MLE directly.As noted in [28] -distribution model can be extended to an infinite Gaussian mixture model with the same mean where the prior distribution on  is a Gamma distribution with parameters depending only on distribution's degrees of freedom V: where Ga( | , ) =    −1  − /Γ() is the probability density function of Gamma distribution.Based on (6), latent variable  is introduced, and  is a scalar,  ∼ Ga(V/2, V/2).
As can be known in [29] The latent variable model can be derived as follows.Figure 4 indicates the relationship of variables in the model: In order to calculate model parameters,  | x and t | x,  are also needed to be calculated, and from ( 8) and ( 9), it can be solved that Set Ψ = WW  + I  , which is  ×  matrix, and the marginal distribution of x follows -distribution, which is (x  ) = (, Ψ, V).
From ( 7) and ( 10) it can be calculated that where From ( 8) and ( 9) it can be computed that where The above procedure establishes the latent variable probabilistic model by replacing Gaussian distribution model with -distribution model.
The model parameters {, , W, V, } need to be estimated.Parameter  is the intrinsic dimension of traffic matrix X.As shown in Figure 3 the value of  is determined by scree plot based on PCA, and the details are in Section 6.1.Parameters {, , W, V} are estimated by MLE, where corresponding log-likelihood is MLE can be calculated by applying EM algorithm.In order to simplify the calculation, REM (rapid expectationmaximization) is employed in this paper.REM remarkably increases the algorithm's rate of convergence.This algorithm consists of 2 stages, estimating different parameters by applying EM algorithm in every stage, and does 2 stages' iteration until satisfying the condition of convergence.
The First Stage.It is only estimating , ignoring t  in the first stage [30].
The log-likelihood is

E-
Step.As can be known from Jensen inequality, when (  ) = (  | x  ), the inequality becomes equality, and the lower bound of  1 is set.Its expectation can be calculated from ( 10), (11), and ( 14) and get (15), where ⟨⋅⟩ is the expectation operator: where

M-
Step.Maximization of ⟨ 1 ⟩ with respect to  can give the estimated value of .This process can be implemented by setting the partial derivative of ⟨ 1 ⟩ with respect to  at 0: The Second Stage.In the second stage, REM algorithm introduces the latent variable t  and estimates parameters {, W, V}.The estimated value μ from the first stage is used in the second stage.

M-
Step.Maximizing ⟨ 2 ⟩ with respect to W and  gives the following updating formulas of {W, }: The MLE of Ṽ can be computed by solving the following equation by line search:

Normal Traffic Modeling with Data Loss.
If data loss occurs on some dimensions, -dimensional data x  can still be used to conduct estimation for model parameters.The sample data x  can be divided into observed data and missing data; that is, Its mean and covariance can be also divided into blocks: As noted in [31], ] ; The procedures for normal traffic model establishment with data loss are shown in Figure 5.When calculating model parameters, the method of determining  is the same as Section 4.2.1, and REM is also used to implement estimations of model parameters {, , W, V} under the condition of data loss.

The First Stage
where

M-
Step.Maximization of ⟨  1 ⟩ with respect to  results in the updating formula of : The Second Stage.In the second stage, the latent variable t  is introduced, and then the algorithm estimates parameters {, W, V}.The estimated value μ from the first stage is used in the second stage. where

M-
Step.Maximizing ⟨ 2 ⟩ with respect to W and  gives the following updating formulas of {W, }: The updating formula of V is the same as (25).The algorithm in Section 4.2.2 is applicable when data loss occurs as well as noise interference.

Anomaly Detection.
Anomaly traffic flow samples of complex traffic flow data need to be determined by choosing measurement standards.There are mainly 2 strategies for determining anomaly samples: one is to determine whether samples are leverage outliers by judging if Hotelling's  2 exceeds the threshold; the other is to determine whether samples are orthogonal outliers by judging squared prediction error (SPE) which exceeds the threshold [32].Because probabilistic model is established in this paper and Mahalanobis distance is employed to conduct anomaly measurement, there is no need for the two strategies [33].
For intact data samples, the squared Mahalanobis distance  2 is For samples containing some of dimension loss, the squared Mahalanobis distance  2  is to be computed using its expectation: Normal distribution "3" control chart is adopted to determine anomaly, and the choosing reason is discussed in Section 6.2.When anomaly occurs in time series, corresponding distribution goes beyond the control boundary.
Establishing time series of the squared Mahalanobis distance as  2 (), its mean is   , and the variance is  2  , and then the configuration of "3" control chart is as follows:
According to the Gaussian distribution, where Φ(⋅) is the probability distribution function of standard normal distribution.Adopting normal distribution "3" control chart, it is certain that anomaly occurs when the values deviate from   by more than 3 times standard deviation   , and its confidence coefficient is 99.74%.

Anomaly Localization.
After confirming anomalous samples, it is needed to be determined which dimension (corresponding to OD) of the selected anomalous sample should be responsible for the anomaly, which is anomaly localization.The anomalous sample x  is a -dimensional vector, and   ( = 1, . . ., ) is the th-dimensional variable of x  , adopting following contribution analysis (as shown in Figure 6) to locate OD of anomaly occurrence, which is divided into 2 stages.

Stage One.
(1) Eliminate any   in x  , and get x   and x   .
(2) Compute the conditional distribution of the selected anomalous sample with the th-dimensional variable missing from formula (28) to derive corresponding z and Q, and compute the squared Mahalanobis distance  2  with formula (37).
(3) Calculate the contribution of   using contri It represents the degree of change in anomaly measurement after a certain dimension of anomaly sample missing.
(4) Take each different value of  ( = 1, . . ., ), and repeat (1) ∼ (3).If  2   is smaller than the threshold of anomaly discriminant, the corresponding dimension (OD  ) is mainly responsible for the anomaly, since its elimination would bring the sample back to normal.
Stage Two.If there is not any  2  in Step (4) of Stage 1 smaller than the threshold of anomaly discriminant, it is indicated that the anomaly is not caused by anomalous traffic of only one OD flow but multiple OD, and then Stage 2 is to be carried out.
(1) Arrange contri (  ) of every dimension of x  in descending order.
(2) Remove the first two dimensions of x  after the arrangement in (1), and if  2  is smaller than the threshold, the corresponding two dimensions are responsible for the anomaly.
(3) Otherwise, sequentially increase the number of missing dimensions in the order arranged in (1) until corresponding  2   is smaller than the threshold, and then several missing dimensions are jointly responsible for the anomaly.

Algorithm Complexity Analysis.
In RMPCM the major overheads are the inverse of Ψ and iterations of REM.Ψ is ×  matrix;  is dimensions of X corresponding to the number of columns of traffic matrix, which is the number of OD ( × ).In the calculation process, directly computing Ψ −1 affects the algorithm complexity significantly, and Woodbury matrix identity is adopted in the paper: is determined by applying PCA dimensionality reduction, and  ≪  (refer to Section 6.1).Calculating the inverse of  ×  matrix Ψ is changed into inverting  ×  matrix M, which simplifies the algorithm complexity noticeably, and the time complexity is ( 2 ).The time complexity is also related with the iterations of EM algorithm.Iterative times are smaller than 15 in the paper.Applying Matlab to run RMPCM anomaly detection algorithm on selected data, execution time is shown in Table 1, and computer specifications are Windows 7, i7 3.5 GHz CPU, and 4 GB RAM.In the process of implementation the calculation will be faster when converting Matlab code into binary executable program.

Evaluation
Normally there are 3 methods that can be applied to assess the performance of network anomaly detection algorithm: network traffic simulation experiment [10,[14][15][16], testbed experiment [19], and real network data analysis [9,10,20,32,34].There are pros and cons regarding each method.For example, network traffic simulation experiment produces synthetic data which can be entirely controlled by researchers, but it is not so close to reality; real network data analysis is closest to reality but it is difficult to establish a benchmark; testbed experiment gives attention to both reality and controllability, but the real output data might also deviate from the experiment settings.In order to assess the performance of RMPCM more objectively, all the three of them were combined to conduct analysis.Experimental content and method are shown in Table 2.
RMPCM will be compared with the anomaly detection method based on subspace construction via PCA and its improved method ANTIDOTE [15] to assess its real performance.As it is known, the anomaly detection method based on PCA is accepted and applied commercially (a commercial anomaly detection system (NetReflex by Guavus) is based on the well-known anomaly detector using PCA [34]).Therefore, the results should be persuasive.

Network Traffic Simulation Experiment.
Network anomalous traffic, especially some poisoning attack traffic, may cause the skewing of detection model, which would significantly decrease the performance of anomaly detectors Steps for synthetic generation of anomalies.[14,22].In this section the performance of RMPCM under poisoning will be assessed.

Anomaly Generation.
The structure of network-wide traffic was revealed for the first time by Lakhina et al., which is that OD flows consist of large number of periodic and deterministic trends, some of the noises, and few of spikes [8].In order to be closer to the reality, the real network data as shown in Table 1 is chosen as the blueprint of simulation, and Discrete Wavelet Transform is adopted to obtain periodic and deterministic trends of the real data which are the normal daily mode by eliminating noises and spikes [10].After that, artificial noises and anomalies are synthetically injected into the periodic daily mode components.The process is as follows.
(1) Daubechies 5 orthogonal wavelet is adopted for multiresolution analysis, and high-frequency components containing noise and anomalies are removed by filtration from 121 OD flows' data in the traffic matrix, which computes the periodic smoothing components as the basis, as shown in Figure 7(a).
121 OD flows are produced in this way.Data collection occurs every 5 minutes, which is marked as one collection cycle.Traffic matrix with 2016 rows and 121 columns is created with one week's collections in 121 OD flows.As we mainly focused on traffic volume anomalies, five kinds of typical anomalies were simulated: DoS, DDoS, ALPHA, ingress/egress shift, and flash crowd.Their brief description is shown in Table 3. Table 4 indicates how we have them injected.

Poisoning Generation.
The poisoning method Add-More-If-Bigger discussed in [15] was employed in this section.It suggests injecting poisoning traffic flow into targeted OD flow to increase its variance, which skews the detector's normal traffic model and thereby increases the escape probability of anomalies.This poisoning method requires attackers to grasp the real-time traffic volume of background flows.Add-More-If-Bigger means attackers inject poisoning traffic when the background traffic volume exceeds its mean value.The volume of poisoning traffic is   = (max{0,   () − mean})  , where mean = (1/) ∑  =1   (),   () is the volume of targeted OD traffic at the moment of .  is a function with parameter  which determines the degree of poisoning.In this experiment  = 0.8 means medium poisoning, and  = 1.2 means high poisoning.

Anomaly Detection under Poisoning.
Anomalies were injected according to the method in Section 5.1.1.Different anomalies usually have different characteristics; thus injection should also be used in a different way.Usually DoS or DDoS attacks are the gradually increasing traffic volume of single or multiple OD flows, which normally lasts 5 to 30 minutes; ALPHA is always the sharp rise of single OD flow with uncertain duration; flash crowd occurs when traffic volume of multiple OD flows uprises fast and goes back to normal gradually; ingress/egress shift is likely to show a step change of 2 OD flows' volume, which normally lasts a long time until the routing policy change.According to their own properties, 5 ALPHA attacks were injected in 100∼500 with its volume accounting for 50% of the mean volume of OD flows; each of them lasted 20 minutes; at 600, an hour's flash crowd anomaly was injected into OD flows from 3 origin nodes to one destination node, the volume of which was 10% ∼40% of the mean volume of OD flows; at the time of 700, 200 minutes' ingress/egress shift were injected, and the volume of transferred traffic was 50% of the mean of OD flows; 5 Dos attacks were injected in 1000∼1400 with its volume accounting for 30% ∼50% of the mean volume of OD flows, and every one of them lasted 30 minutes; at the time of 1700, 30 minutes' DDoS attacks were injected into OD flows from 4 origin nodes to 1 destination node, and the volume of the anomalies is from 5% to 30% of the mean volume of OD flows.
Subspace construction via PCA and ANTIDOTE was selected in order to be compared with RMPCM under exactly the same circumstance to evaluate its detection performance under poisoning, and we plot the squared norm of the residual vector as a function of time to show PCA and ANTIDOTE method results.Receiver Operations Characteristic (ROC) curve is also applied to the overall performance estimation of three above methods.The  axis means false positive rate (FPR), and  axis represents true positive rate (TPR).ROC curve can display the relationship between TPR and FPR with the change of thresholds.The area under the curve is the performance indicator.The larger the area is, the better the performance is.
The experiment was conducted with the data generated in one week to compare those three methods.Three ROC curves are drawn corresponding to no poisoning, medium poisoning, and high poisoning (Figure 10), and part of the experiment results is shown in Figures 8 and 9. Before poisoning (Figure 8) these three methods have a similar performance.After medium poisoning (Figure 9) the volume of the residual vector of background traffic increases significantly in PCA method, and the performance of PCA reduces dramatically because lots of anomalies submerge in the background traffic (Figure 9(b)); ANTIDOTE performs well against poisoning, but low volume anomalies could easily escape its detection (Figure 9(c)); RMPCM has a stable performance, which does not decline too much after injection (Figure 9(a)).Figure 10 reveals that poisoning has a large impact on PCA anomaly detector, and FPR increases quickly after TPR exceeded 60%; RMPCM and ANTODOTE have a better antipoisoning ability; the detection performance of RMPCM is better than ANTIDOTE because its ROC curve rises up more quickly.5 illustrates the setting of anomaly locating test, and a single OD or multiple OD flows were injected with anomalies generated in Section 5.1.3.The result of anomaly localization test in RMPCM method is shown in Figure 11, the histograms represent anomaly contribution degree of each OD at the time of anomaly  occurrence, and the result agrees with the setting from Table 5.As shown in Figure 11(a), anomaly contribution degree of OD50 noticeably exceeds the contribution degree threshold; therefore it is certain that anomaly at the time of 300 occurs on OD50; from Figures 11(b), 11(c), and 11(d) it is found that there is no single OD contribution degree exceeding the threshold, and this requires locating multiple OD when anomalies occur.

Testbed Experiment.
Cyber-defense technology experimental research laboratory testbed (DETERLab) [35] proposed by the University of Southern California is chosen to conduct the testbed experiment.Not only can DETERLab interconnect nodes in the prototype system with any topological structure, but also its configuration for experiment condition is flexible.It provides researchers with needed background traffic and attack traffic injection for conducting network attack-defense experiments and then deploys and evaluates feasible solutions.It can fully integrate local hardware resources and has a better simulation advantage over software such as NS2.
In our experiment attacking tools based on Metasploit frame was integrated into DETERLab's security experimentation environment (SEER) toolset to generate multiple anomalies on the DETERLab platform.Our experiment sets up 10 PoP nodes and chooses the adjacent node of each PoP node as collection device, and the topological configuration is shown in Figure 12.The duration of the experiment is one week, data are collected every 5 minutes which is a collection cycle, and there are total 2016 cycles for a week.The collecting data is byte count of traffic flows.

Anomaly Detection in Noisy
Traffic.In order to verify RMPCM's performance in noisy traffic, the experiment set up three situations to compare with the detection method based on subspace construction via PCA, which test the detection accuracy, factors impacting performance, and poisoning of large anomalies, respectively, in the two methods.
At the time of 500 and 1000 DoS attacks using TCP SYN Flood were initiated from PoP1 to PoP2, at the time of 1800 DoS attacks were carried out from PoP3 to PoP4, and the duration of DoS attacks was all 4 cycles; at the time of 800 port scan was initiated from PoP1 to PoP2, PoP5, and PoP6 applying by Nmat, and the duration was 5 cycles; at the time of 1200 ingress/egress shift was initiated by transferring 50% of the traffic volume of the OD path (PoP1 to PoP2) to another (PoP7 to PoP8), and the duration was 40 cycles; at the time of 1500 DDoS attacks using UDP Flood on PoP10 were initiated from PoP2, PoP4, PoP5, and PoP8 simultaneously, and it lasted 6 cycles.Figure 13(a) and Table 6(a) illustrate the test results applying RMPCM and subspace construction via PCA.Both methods succeed in detecting 6 anomalies, but the latter does not detect anomalies every time in the corresponding cycles of anomaly occurrence, especially for the 1200-1239 ingress/egress shifts.The anomaly cycles detected by PCA are shorter than the settings, while RMPCM made a notable improvement over PCA.
In order to reveal the impacting factors and performance differences of two methods further, anomalies were adjusted.The volume of DoS attacks beginning at 500 and 1800 was reduced by 50%; the range of port scan beginning at 800 decreased to the range only from PoP1 to PoP2, and scanning frequency was deducted by 50% as well; the duration of ingress/egress shift beginning at 1200 was cut down by 20 cycles; DDoS attacks beginning at 1500 were narrowed down to the range from PoP2 and PoP4 to PoP10 and kept the same attack volume; DoS attacks beginning at 1000 remained unchanged.Figure 13(b) and Table 6(b) indicate that the change of volume and range of anomalies would influence the performance of both of the methods, and the impact of duration deduction is hardly noticeable.PCA gives us false positive alarms at total of 10 time points including 1272, 1407, and 1451, which is higher than RMPCM obviously.
The attack volume of DoS increased by 220% to produce a large anomaly, and other factors remained the same as the first test.A large anomaly could raise variance level of the path that it was on, and other anomalies causing small variance would be mistaken for a normal event in this situation.Figure 13(c) and Table 6(c) show that the large anomaly leads to the dramatic decrease in the detection accuracy of PCA method.OD paths of anomaly occurrence contain the OD path of the large anomaly (PoP1 to PoP2), and then the corresponding anomaly detection rate is affected: Dos beginning at 1000 and the large anomaly are on the same path, and the Dos attack is completely missing from the figure; ingress/egress shift involves 2 OD paths which contain the OD path of the large anomaly, and the residual vector also seriously decays and does not reach the detection threshold; port scan and DDoS also contain OD path where the large anomaly occurs, and its residual vector is affected in varying degrees; DoS beginning at 1800 does not involve the OD path of the large anomaly, thus the detection to the above DoS attack is not impacted.Comparably RMPCM has a strong robustness, and detection accuracy is not affected.The above experiment demonstrates RMPCM has a better detection performance than the method based on subspace construction via PCA.Add-More-If-Bigger poisoning experiment in Section 5.1.3and large anomaly poisoning experiment verify that RMPCM has a strong robustness under the poisoning of variance injection.

Anomaly Detection with Data
Loss.Data loss may be caused by network malfunction, device breakdown, and so on; thus algorithms based on conventional nonstatistical model become inapplicable because of the incomplete input data.RMPCM based on the latent variable probability model of multivariate -distribution has its distinct advantage when solving the problem of data loss.The detection performance of RMPCM with data loss is tested on DETERLab.The scenarios are set to the occurrence of data loss when facing link fault, collection device breakdown, and PoP node malfunction.Link fault will incur data loss when OD flows go through the link; collection device breakdown will cause the data loss of OD flows whose source node is the node connected with the broken-down device; data loss caused by PoP node malfunction is related with malfunction types, network topology, and router strategies.Figure 12 displays experimental topology configuration, and 100 anomalies were injected in the way that was used in Section 5.2.1.In order to eliminate occasionality 10 experiments were conducted for each type of malfunctions, 403 cycles and 1008 cycles were randomly selected (which accounted for 20% and 50% of all 2016 cycles in one week resp.), in which a randomly selected link (collection device, PoP node) broke down, and the malfunction duration lasted 20 cycles.The mean of 10 experiment results was used to draw the ROC curves.The experiments results indicate that link fault, collection device breakdown, and PoP node malfunction regularly and increasingly deteriorate the accuracy of anomaly detection, but overall RMPCM has better robustness in this situation.Its TPR is close to 70% when FPR is 20% even in the worst scenario (PoP node malfunction occurred in 50% cycles) (Figure 14(c)).is shown in Figure 15, the histograms represent anomaly contribution degree of each OD at the time of anomaly occurrence, and the results agree with the setting from Table 7.As shown in Figure 15(a), anomaly contribution degree of OD2 noticeably exceeds the contribution degree threshold; therefore it is certain that anomaly at the time of 502 occurs on OD2; from Figures 15(b) and 15(c) it is found that there is not a single OD contribution degree exceeding the threshold, and this requires locating multiple OD when anomalies occur.Some problems of anomaly localization will be discussed in Section 6.5.users are American universities and scientific institutions.We choose the data from Abilene data sets dating from 15 December 2003 to 21 December 2003 for its completeness and convenience for reference.The raw data is NetFlow data from 11 PoP nodes, the access point and exit point of each flow are obtained on the basis of BGP and ISIS routing table, and then the volume of OD flows and traffic matrix can be calculated, as shown in Table 8.Detection performance evaluation with data loss and sensitivity analysis were conducted with the data set.

Anomaly Detection with Data Loss.
When conducting anomaly detection with data loss, data set  was chosen to implement four kinds of data loss mechanisms discussed in Section 3.2.1.In order to clearly identify the different performance of RMPCM in different data loss mechanisms and evaluate the factors influencing the performance, nearly the same loss rate was assigned to PureRandLoss, Peri-odRandLoss, and ODRandLoss; the influence of different volume of missing piece on the detection performance was analyzed under PieceRandLoss.In order to eliminate accidents, 10 experiments were conducted, and mean values were calculated.
The true anomalies in the real network traffic data could be hardly obtained exactly, and RMPCM and PCA generated very similar alerts with the complete data set  (as shown in Figure 16), so the benchmark was set to be the detecting result of RMPCM with complete data in order to test the performance of RMPCM under the situations of data loss.
In the experiment of PureRandLoss, three kinds of loss rates were selected which accounted for 10%, 20%, and 50% of total data; in PeriodRandLoss the missing periods were set to be 200, 400, and 1000, as the total periods were 2010; data loss rates in here were close to 10%, 20%, and 50%; in ODRandLoss we chose half adjacent data of a certain column to be empty because of the algorithm limitation that column of traffic matrix could not be empty entirely, lost OD numbers were set to be 24, 48, and 121, and the loss rate per OD was 50%; therefore, the total loss rates of ODRandLoss were still 10%, 20%, and 50%.Parts of the results and ROC curves under three kinds of loss mechanisms, respectively, are displayed in Figures 17(a), 17(b), and 17(c).Three tests above indicate the relatively high accuracy of RMPCM: when the loss rate is 10%, the effect on anomaly detectors is slight, and the detection performance is very similar to that of RMPCM and PCA with complete data; as the loss rate increasing the performance begins to decline, even when the loss rate achieves 50%, test results are still applicable.
The tests also demonstrate that ODRandLoss has the biggest impact on the performance of anomaly detectors, followed by PeriodRandLoss and PureRandLoss.The total data loss remains the same when applying these three methods, but the performance of anomaly detectors deteriorates regularly, and it is highly probable that this is due to the increase Mathematical Problems in Engineering    of structured loss of data.The volume of missing piece in PureRandLoss is smallest, while volume of missing piece in ODRandLoss is largest.In order to verify the impact of structured loss on the detection performance, we conducted experiment with PieceRandLoss, and the total volume of loss is set to be the same while the volume of each missing piece is set at different sizes, which are 5 * 5, 16 * 16, and 40 * 40, and the numbers of missing pieces were 2000, 200, and 30, respectively; the volume of total loss accounts for 20% of whole data.Parts of the results and ROC curves in Figure 17(d) indicate that the performance of anomaly detector declines significantly when the volume of structured loss increases while the total loss rate remains the same.However, RMPCM achieves about 70% TPR with less than 20% FPR when applying PieceRandLoss overall, which means the performance could satisfy the requirement.[22] point out that the detection algorithm based on subspace construction via PCA is too sensitive to the change of intrinsic dimensions and traffic measures; therefore, the same issues should also be considered with RMPCM.The test will be conducted in two scenarios: one is to analyze the sensitivity to the change of intrinsic dimension ; the other is to analyze the sensitivity to the change of traffic measures.The results of RMPCM are to be compared with the algorithm based on subspace construction via PCA.

Sensitivity Analysis. The authors of
Real network data set  in Table 8 was selected to conduct the sensitivity to the change of intrinsic dimension analysis.Intrinsic dimension  varies depending on the accumulative variance contribution of principal components which is defined beforehand.The experiment results are displayed in Figure 18.Figures 18(a) and 18(b), respectively, represent detection curves when PCA model's intrinsic dimension  was set to be 4 and 5, as it can be seen that the detection results vary largely; Figures 18(c) and 18(d) represent corresponding curves of RMPCM, and the detection results agree with each other.The further procedures are taken to test  varying from 2 to 10 when using RMPCM, and test results are almost identical, which suggest that RMPCM is highly robust.
Real network data sets , , and  in Table 8 were employed to analyze the sensitivity to the change of traffic measures.The accumulative variance contribution of principal components was set to be the same value of 0.85 when using different traffic measures.The experiment results are Mathematical Problems in Engineering shown in Figure 19.The detection results in three kinds of measures (, , ) are completely different by applying PCA.While using RMPCM the baselines of three curves are similar to each other, and three test results are highly correlative.Although benchmarks of anomalies in real network data sets could not be acquired, the statistical data in three kinds of measures (flow, packet, and byte) has interrelationship; therefore corresponding anomaly detection results should somewhat coincide.PCA produces too much conflict, which indicates it is much sensitive to the change of traffic measures, while RMPCM generates more satisfying results and shows a strong robustness.
To summarize, RMPCM is convenient for practical implementation for its lower sensibility and higher robustness to the change of parameters such as intrinsic dimensions and traffic measures.

Discussion
Issues like parameter selection, distribution characters of data source, and so forth will be discussed further in this section.

Intrinsic Dimension.
The Internet traffic matrix has this characteristic of low dimensionality.The intrinsic dimension  of the matrix needs to be determined at the beginning in the process of modelling normal traffic by applying RMPCM.The scree plot of data sets of , , and  in Table 8 is drawn using PCA, as is shown in Figure 20.The accumulative variance contribution threshold is set to 0.85, and then intrinsic dimension of these three data sets are all close to 5, which is far from their dimension  ( = 121) of traffic matrix.We then go further to Abilene traffic data in the following three weeks and find that their intrinsic dimension  is considerably stable.What is more, the experiment in Section 5.3.3 also shows that RMPCM performance keeps high stability when  varies from 2 to 10, so we chose  = 5 in this paper.In fact, when implementing RMPCM in real situation, if network does not change a lot, the intrinsic dimension of traffic matrix doesn't need to be changed, and there is no need to determine the intrinsic dimension every time before detecting.This significantly reduces the complexity of computation.

Anomaly Detection Threshold.
When network traffic is in normal status, the distribution of the squared Mahalanobis distance of traffic samples is close to Gaussian distribution.In order to verify this, the normal probability plot of  in Table 8 is drawn by normplot of Matlab, as is shown in Figure 21.Samples are mainly distributed on the normal line, which are concentrated in the middle and sparse on both ends.A Jarque-Bera test is employed in order to further verify its normality.Since the samples' squared Mahalanobis distance almost follows Gaussian distribution, "3" control chart is taken for anomaly detection in Section 4.3.a Gaussian or Gaussian-like distribution [13].We conduct statistical analysis for real network data sets in Table 8 and draw the normal probability plot.We find that the data (traffic volume) exhibit a highly skewed distribution, many of which are diverged from the normal line, and the density of samples scattered on two ends is higher than Gaussian distribution (Figure 22 illustrates some of drawn results).Frequency histograms with superimposed normal density curves are then plotted (as partially shown in Figure 23), which indicates that real traffic volume's heavy-tailed feature is more noticeable compared with Gaussian distribution.This is naturally one of the reasons why Gaussian distribution is replaced by multivariate -distribution when modeling the traffic matrix.

False Positives and False Negatives.
Compared with other methods listed in the paper, RMPCM achieves a better performance on reducing false positives and false negatives in the experimental scenarios.One reason is that it is more accurate for RMPCM to describe the traffic data.Conventional traffic anomaly detection methods usually assume that the traffic data are drawn from a Gaussian or Gaussian-like distribution.We conduct statistical analysis for real network data sets, and it indicates that real traffic volume's heavy-tailed feature is more noticeable compared with Gaussian distribution.So we use the multivariate -distribution as prior distribution when modeling the traffic data.Another reason is that introducing a latent variable probabilistic model based on -distribution can relieve the noise interference and achieve a better performance.In addition, RMPCM method is within a probabilistic framework, making it attain advantages over traditional method of nonprobabilistic model (such as subspace construction via PCA (Lakhina et al.)) in handling missing data.
However, RMPCM also suffers from false positives and false negatives to some degree.This is due to the dynamic nature of the network traffic flows, which inevitably leads to some deviation when describing the change of traffic by applying this algorithm based on baseline.Moreover, false positives and false negatives will occur under the circumstances in which anomalous traffic generated in  the experiments is excessively small or large along with the interference of background traffic.But on the whole the false positives and false negatives of RMPCM can meet the needs of engineering practices.

Problems on Anomaly
Localization.The anomaly localization in scenarios with more anomalous OD is discussed here.The experimental environment is DETERLab with 10 PoP nodes the same as in Section 5.2.The benchmark is set up with the scenario where DDoS attacks took place from 1∼9 nodes to one node.When the number of anomalous OD is from 1 to 4, the results agree with the setting.When the number of anomalous OD is 5, the preset anomalous OD are OD20, OD40, OD50, OD60, and OD80, but the localization results are OD20, OD39, OD40, OD60, and OD80, as shown in Figure 24.During the experiment, it is found that the accuracy of anomaly localization decreases with the increasing number of anomalous OD, as shown in Figure 25.This problem will be researched further in depth in the future.There are several preliminary considerations for this taking place.(a) With the increasing number of anomalous OD, the volume of anomalous traffic also increases significantly, if the volume of anomalous traffic is excessively large which exceeds the anomalous observations tolerance of the proposed approach, and then it is highly likely to impact the detection performance and location results of the proposed approach.
(b) In addition, with the increasing number of anomalous OD and the volume of anomalous traffic, the impact on every single OD becomes relatively smaller, which brings negative impact on the accuracy of anomalous OD localization.

Conclusion
In conclusion, traditional network-wide anomaly detection methods have actual problems of performance reduction or being unavailable when noise interference or data loss takes place, and in order to solve these problems and advance anomaly detection and localization, we propose a networkwide approach based on robust multivariate probabilistic calibration model in this paper.The analysis conducted with simulations, DETERLab experiments, and real data from the Internet indicates that the performance of RMPCM is better than PCA and ANTIDOTE and has a better robustness.Regardless of data loss or noise interference, RMPCM demonstrates a stable performance and less sensitivity to the change of parameters.RMPCM can be also applied to locating anomalies.In the future, we will take our researches on more accurate and fine-grained anomaly localization and online RMPCM algorithm.

Figure 4 :
Figure 4: Graphical model of RMPCM.The shaded node is the observed vector, and arrows denote conditional dependencies between these random variables.

Figure 5 :
Figure 5: Steps of modeling normal traffic with data loss.

Figure 6 :
Figure 6: Steps of locating OD of anomaly occurrence.
Figure 7: Steps for synthetic generation of anomalies.

Figure 8 :Figure 9 :Figure 10 :
Figure 8: Comparison of three methods' test results with no poisoning.

Figure 11 :Figure 12 :
Figure 11: Results of locating OD of anomaly occurrence in simulation experiment.

Figure 13 :
Figure 13: Comparisons of RMPCM and PCA test results on DETERLab.

Figure 15 :
Figure 15: Results of locating OD of anomaly occurrence on DETERLab.

Figure 16 :
Figure 16: Test results of real network data with complete data, PCA (a) and RMPCM (b).

Figure 17 :
Figure 17: Test results of real network data under four loss mechanisms.

Figure 18 :
Figure 18: Comparison of sensitivity to the change of intrinsic dimensions of PCA (a, b) and RMPCM (c, d).

Figure 19 :
Figure 19: Comparison of sensitivity to the change of traffic measures of PCA (left) and RMPCM (right).

Figure 22 :
Figure 22: Comparison of normal probability plot of real traffic volume and Gaussian random data.

Figure 23 :
Figure 23: Frequency histograms with superimposed normal density curves of real traffic volume.
The weight   is a function of Mahalanobis distance   which indicates that   decreases while   increases.The anomaly sample has a relatively larger ; therefore   is smaller, meaning adopting -distribution in model parameter estimation lowers the sensitivity to anomaly samples.Compared with Gaussian distribution, distribution model has a stronger robustness.4.2.Normal Traffic Modeling4.2.1.Normal Traffic Modeling in NoisyTraffic.RMPCM models normal traffic in noisy traffic by establishing a latent variable probabilistic model based on multivariate distribution, and the procedure is illustrated in Figure

Table 1 :
Execution time of RMPCM anomaly detection.

Table 2 :
Evaluation content and method.

Table 3 :
Typical anomalies in the Internet.

Table 5 :
Setting of anomaly localization test.

Table 6 :
Comparisons of RMPCM and PCA test results on DETERLab.
Table 7 illustrates the setting of anomaly locating test, and a single OD or multiple OD flows were injected with anomalies generated in Section 5.2.1.The result of anomaly localization test in RMPCM method

Table 7 :
Setting of anomaly localization test.

Table 8 :
Abilene traffic matrix data set.