Network anomaly detection and localization are of great significance to network security. Compared with traditional host-based, single-link, and single-path methods, network-wide anomaly detection approaches have distinct advantages in detection precision and range. However, when faced with practical problems such as noise interference or data loss, network-wide approaches also suffer significant performance degradation or may even become unusable. Moreover, research on anomaly localization is scarce. To address these problems, this paper presents a robust multivariate probabilistic calibration model for network-wide anomaly detection and localization. It applies the latent variable probability theory with multivariate
Network traffic anomalies are unusual and significant changes in a network’s traffic levels. Intrusions such as DDoS attacks and zombie networks seriously jeopardize Internet security, and network congestion and malfunctions degrade service quality; it is therefore critical for both network operators and end users to detect and locate network anomalies. This is a challenging task because one must extract and interpret anomalous patterns from large amounts of high-dimensional, intricate, and noisy background traffic data.
There is a large body of research on anomaly detection. Host-based anomaly detection systems monitor and analyze the internals of a computing system by applying data mining to system logs and audit records [
To solve the problems mentioned above, Lakhina et al. proposed network-wide anomaly detection based on subspace construction via PCA [
We therefore propose an approach named RMPCM, based on a robust multivariate probabilistic calibration model, to overcome the problems discussed above. This anomaly detection and localization algorithm introduces a latent variable probabilistic model based on
This paper is organized as follows. We begin in Section
As early as 1987, Denning demonstrated a statistical model for detecting network anomalies [
The authors of [
Data loss is very common in many fields, and the question of how to get enough information from missing data needs to be answered. The authors of [
Network anomaly detection can determine when anomalies take place, but locating anomalies is an extremely challenging task. The authors of [
In this paper we propose a network-wide anomaly detection algorithm based on RMPCM, which is shown later to perform better under noise interference and data loss and in locating anomalies.
Conventionally, research on Internet traffic flows focused mainly on the temporal characteristics of data packets on a single link, which helped develop concepts such as self-similar stochastic processes and long-range dependence. A single ISP (Internet service provider), however, consists of hundreds of interconnected links, and the Internet contains several thousand ISPs. Against such a vast background, the spatial characteristics of network traffic inevitably attract attention. However, it is difficult to analyze the traffic flow data of all links in a network simultaneously, because doing so compounds the complexity of modeling traffic on a single link, which is itself a complicated task. As a compact and elegant description of traffic flows between nodes in a given network structure, the traffic matrix is a commonly used model for exploring the spatiotemporal components of network-wide traffic. The traffic matrix gives an overview of network-wide traffic: instead of studying traffic on all links, applying the traffic matrix provides more straightforward and fundamental insights into network-wide traffic study [
Traffic matrix at PoP level: assume that an autonomous system (AS) has
Schematic diagram of traffic matrix.
Data loss may occur during data collection. One reason is that the massive data volume in a high-speed backbone network increases the burden on collection equipment and reduces its stability. Another is network congestion or equipment and link malfunctions during data transfer.
Traffic data loss is not always completely random; in many cases it is highly structured. To describe the data loss scenarios that arise during collection and transfer, four kinds of loss mechanisms are adopted.
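As an illustration of how such loss mechanisms might be simulated on a traffic matrix, the following Python sketch builds Boolean masks for the four mechanisms named in the experiments (PureRandLoss, PeriodRandLoss, ODRandLoss, and PieceRandLoss). The function names, parameters, and exact mask shapes are assumptions made for illustration; in particular, PieceRandLoss is assumed here to remove contiguous runs within a single column. This is not the authors' implementation.

```python
import numpy as np

def pure_rand_loss(shape, rate, rng):
    """Each matrix element is lost independently with probability `rate`."""
    return rng.random(shape) < rate

def period_rand_loss(shape, n_periods, rng):
    """Whole collection periods (rows) are lost."""
    mask = np.zeros(shape, dtype=bool)
    rows = rng.choice(shape[0], size=n_periods, replace=False)
    mask[rows, :] = True
    return mask

def od_rand_loss(shape, n_flows, rng):
    """A contiguous half of each selected OD flow (column) is lost,
    so that no column ever becomes entirely empty."""
    n_rows, n_cols = shape
    mask = np.zeros(shape, dtype=bool)
    half = n_rows // 2
    for col in rng.choice(n_cols, size=n_flows, replace=False):
        start = rng.integers(0, n_rows - half + 1)
        mask[start:start + half, col] = True
    return mask

def piece_rand_loss(shape, n_pieces, piece_len, rng):
    """Assumed form: contiguous runs of `piece_len` samples in random columns."""
    n_rows, n_cols = shape
    mask = np.zeros(shape, dtype=bool)
    for _ in range(n_pieces):
        col = rng.integers(0, n_cols)
        start = rng.integers(0, n_rows - piece_len + 1)
        mask[start:start + piece_len, col] = True
    return mask
```

A mask would be applied with something like `tm[mask] = np.nan`, after which the fraction `mask.mean()` gives the overall loss rate.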
Model parameter estimation often uses maximum likelihood estimation (MLE) when samples are known, which requires the probability distribution of the samples. However, accurately describing the distribution of the data is difficult, so the data are commonly assumed to follow a Gaussian distribution, whose convenient analytical properties yield tractable algorithms. In the linear Gaussian regression model, MLE is equivalent to least squares estimation, which is known to be unduly sensitive to atypical samples such as outliers, and this degrades the accuracy of the model [
Network anomaly detection can determine the time at which anomalies take place, but locating anomalies is crucial for pinpointing and solving security problems more precisely. In this paper, locating anomalies corresponds to identifying the intersections of the rows and columns of traffic matrix
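The sensitivity of Gaussian MLE to outliers can be illustrated numerically. The sketch below contrasts the Gaussian MLE of the mean (the sample mean) with an EM fit under a heavy-tailed alternative. The multivariate Student-t distribution with a fixed number of degrees of freedom `nu` is used here as an assumed example of such an alternative, and the function name `fit_student_t` is hypothetical. It estimates only a location and a scatter matrix and is not the full RMPCM model.

```python
import numpy as np

def fit_student_t(X, nu=4.0, n_iter=100):
    """EM estimate of location and scatter under a multivariate Student-t
    model with fixed degrees of freedom `nu` (illustrative assumption)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    mu = X.mean(axis=0)              # start from the Gaussian MLE
    sigma = np.cov(X, rowvar=False)
    for _ in range(n_iter):
        diff = X - mu
        # Squared Mahalanobis distance of every sample to the current fit.
        d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(sigma, diff.T).T)
        # E-step: samples far from the bulk receive small weights.
        u = (nu + d) / (nu + d2)
        # M-step: weighted mean and weighted scatter.
        mu = (u[:, None] * X).sum(axis=0) / u.sum()
        diff = X - mu
        sigma = (u[:, None] * diff).T @ diff / n
    return mu, sigma, u
```

With a few percent of gross outliers, the sample mean is pulled toward them, whereas the weights `u` shrink for atypical samples and the weighted estimate stays near the bulk of the data.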
The relationship between anomalous event generation and its detection and localization is shown in Figure
Relation schema of RMPCM.
Applying a network anomaly detection algorithm in the real world may run into difficulties such as data loss during transfer and collection and modeling deviation caused by noise interference.
Traditional network anomaly detection algorithms are no longer applicable when the data are incomplete. A Bayesian method could be adopted, but because of the complexity of network traffic data, the posterior mean estimate and asymptotic variance cannot be derived directly. Therefore a latent variable probabilistic model is introduced: some “latent data” are added to the known data in order to simplify parameter estimation. The missing data, together with the unknown parameters, are treated as “latent data”, and the expectation-maximization (EM) algorithm is applied to obtain the maximum likelihood estimate (MLE) of the model parameters.
Computing the MLE requires the probability distribution of the known data. The data are usually assumed to be normally distributed, but in the presence of anomalous traffic this assumption causes large deviations in parameter estimation; therefore, the multivariate Gaussian distribution is replaced by
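The idea of treating missing entries as latent data inside EM can be sketched in its simplest setting. The code below estimates the mean and covariance of a plain multivariate Gaussian from samples whose missing entries are marked as NaN. It deliberately omits the latent-variable structure and the robust noise model of RMPCM, and the function name is hypothetical.

```python
import numpy as np

def em_gaussian_missing(X, n_iter=30):
    """EM estimate of a Gaussian mean/covariance from rows with NaN gaps."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    missing = np.isnan(X)
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0) + 1e-6)
    for _ in range(n_iter):
        X_hat = np.where(missing, 0.0, X)
        C = np.zeros((d, d))  # accumulated conditional covariance of gaps
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():
                # Nothing observed: fall back to the current marginal.
                X_hat[i] = mu
                C += sigma
                continue
            S_oo = sigma[np.ix_(o, o)]
            S_mo = sigma[np.ix_(m, o)]
            K = np.linalg.solve(S_oo, S_mo.T).T  # S_mo @ inv(S_oo)
            # E-step: conditional mean and covariance of the missing part.
            X_hat[i, m] = mu[m] + K @ (X[i, o] - mu[o])
            C[np.ix_(m, m)] += sigma[np.ix_(m, m)] - K @ S_mo.T
        # M-step: update parameters from the completed sufficient statistics.
        mu = X_hat.mean(axis=0)
        diff = X_hat - mu
        sigma = (diff.T @ diff + C) / n
    return mu, sigma
```

The correction term `C` matters: filling gaps with conditional means alone would underestimate the covariance, because imputed values carry no variability of their own.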
Specifically, suppose sample data
For Gaussian distribution model (
RMPCM models normal traffic in noisy traffic by establishing a latent variable probabilistic model based on multivariate
Steps of modeling normal traffic in noisy traffic.
In order to solve the problem that Gaussian noise models are too sensitive to atypical observations such as anomalous traffic observations, we suppose the noise is drawn from
This model cannot be solved analytically by applying MLE directly. As noted in [
Graphical model of RMPCM. The shaded node is the observed vector, and arrows denote conditional dependencies between these random variables.
In order to calculate model parameters,
Set
From (
From (
The above procedure establishes the latent variable probabilistic model by replacing Gaussian distribution model with
The model parameters
The loglikelihood is
The loglikelihood is
The MLE of
If data loss occurs on some dimensions,
As noted in [
Thus
The procedures for normal traffic model establishment with data loss are shown in Figure
Steps of modeling normal traffic with data loss.
The updating formula of
The algorithm in Section
Identifying anomalous samples in complex traffic flow data requires choosing a measurement criterion. There are mainly two strategies for identifying anomalous samples: one is to determine whether samples are leverage outliers by judging whether Hotelling’s
For intact data samples, the squared Mahalanobis distance
For samples with some dimensions missing, the squared Mahalanobis distance
Normal distribution “
Establishing time series of the squared Mahalanobis distance as
Centre line:
Upper control limit:
Lower control limit:
According to the Gaussian distribution,
Adopting normal distribution “
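The detection step described above can be sketched as follows: compute the squared Mahalanobis distance of each sample and compare the resulting time series against control limits. The sketch assumes the limits are placed at the centre line plus or minus k standard deviations with k = 3 (the usual 3-sigma convention, which is an assumption here, as are the function names), and it covers only complete samples.

```python
import numpy as np

def squared_mahalanobis(X, mu, sigma):
    """Squared Mahalanobis distance of each row of X from (mu, sigma)."""
    diff = np.asarray(X, dtype=float) - mu
    solved = np.linalg.solve(sigma, diff.T).T
    return np.einsum("ij,ij->i", diff, solved)

def control_limits(d2_baseline, k=3.0):
    """Lower limit, centre line, and upper limit from baseline distances."""
    centre = d2_baseline.mean()
    spread = d2_baseline.std(ddof=1)
    return centre - k * spread, centre, centre + k * spread

def flag_anomalies(d2, upper_limit):
    """Indices of samples whose distance exceeds the upper control limit."""
    return np.flatnonzero(d2 > upper_limit)
```

In use, `mu` and `sigma` would come from the fitted normal-traffic model, the limits from a baseline window, and `flag_anomalies` would then be applied to the full distance series.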
After anomalous samples are confirmed, it must be determined which dimension (corresponding to an OD flow) of the selected anomalous sample is responsible for the anomaly; this is anomaly localization. The anomalous sample
Steps of locating OD of anomaly occurrence.
In RMPCM the major overheads are the inverse of
Execution time of RMPCM anomaly detection.
Simulation experiment (complete data)  Testbed experiment (complete data)  Testbed experiment (missing data, mean)  Real network data analysis (complete data)  Real network data analysis (missing data, mean)

2.15 s  1.94 s  3.56 s  2.53 s  4.86 s
Normally there are 3 methods that can be applied to assess the performance of network anomaly detection algorithm: network traffic simulation experiment [
Evaluation content and method.
Method  Content  

Robustness under noise interference  Robustness under data loss  Anomaly localization  Sensitivity  
Noisy traffic  Poisoning  Intrinsic dimension  Traffic measure  
Simulation experiment  ✓  ✓  
Testbed experiment  ✓  ✓  ✓  
Real network data analysis  ✓  ✓ 
RMPCM will be compared with the anomaly detection method based on subspace construction via PCA and its improved method ANTIDOTE [
Network anomalous traffic, especially some poisoning attack traffic, may cause the skewing of detection model, which would significantly decrease the performance of anomaly detectors [
Lakhina et al. were the first to reveal the structure of network-wide traffic: OD flows consist of a large number of periodic, deterministic trends, some noise, and a few spikes [
Steps for synthetic generation of anomalies.
Original OD flow and denoised OD flow
Normal OD flow with Gaussian white noise injected
OD flow with anomalies injected
A total of 121 OD flows are produced in this way. Data are collected every 5 minutes, which defines one collection cycle. One week of collection across the 121 OD flows yields a traffic matrix with 2016 rows and 121 columns.
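The dimensions of this traffic matrix follow from simple arithmetic, as the short sketch below shows. It assumes that the 121 OD flows correspond to all ordered pairs among 11 PoP nodes (11 × 11 = 121); the number of PoPs is an assumption, as it is not stated here.

```python
import numpy as np

MINUTES_PER_CYCLE = 5
DAYS = 7
N_POPS = 11  # assumption: 121 OD flows = 11 x 11 origin-destination pairs

cycles_per_day = 24 * 60 // MINUTES_PER_CYCLE   # 288 cycles per day
n_rows = DAYS * cycles_per_day                  # 2016 cycles in one week
od_pairs = [(o, d) for o in range(N_POPS) for d in range(N_POPS)]

# Rows index collection cycles, columns index OD flows.
traffic_matrix = np.zeros((n_rows, len(od_pairs)))
```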
As we mainly focused on traffic volume anomalies, five kinds of typical anomalies were simulated: DoS, DDoS, ALPHA, ingress/egress shift, and flash crowd. Their brief description is shown in Table
Typical anomalies in the Internet.
Type  Description 

DoS  Single source node sending large amount of data to single destination node 
DDoS  Multiple source nodes sending large amount of data to single destination node 
ALPHA  Abnormally high-rate transfer between two nodes 
Ingress/egress shift  Change of routing causing traffic ingress/egress shift 
Flash crowd  Abnormally large volume of requests for a certain service 
Anomaly injection.
Type  Injection method 

DoS/DDoS  Increasing the volume of single/multiple OD flows gradually 
ALPHA  Promptly increasing the volume of a single OD flow 
Ingress/egress shift  Moving a portion of the volume of one OD flow to another OD flow 
Flash crowd  Increasing the volume of multiple OD flows rapidly and then returning them to normal gradually 
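The injection methods in the table above can be sketched as simple operations on a traffic matrix whose rows are collection cycles and whose columns are OD flows. The function names, the linear ramp shapes, and all parameters below are illustrative assumptions, not the authors' generator.

```python
import numpy as np

def inject_ramp(tm, flows, start, duration, peak):
    """DoS/DDoS: gradually raise the volume of one or more OD flows."""
    out = np.array(tm, dtype=float)
    ramp = np.linspace(peak / duration, peak, duration)
    out[start:start + duration, flows] += ramp[:, None]
    return out

def inject_step(tm, flow, start, duration, amount):
    """ALPHA: abruptly raise the volume of a single OD flow."""
    out = np.array(tm, dtype=float)
    out[start:start + duration, flow] += amount
    return out

def inject_shift(tm, src, dst, start, duration, fraction):
    """Ingress/egress shift: move part of one OD flow onto another."""
    out = np.array(tm, dtype=float)
    moved = fraction * out[start:start + duration, src]
    out[start:start + duration, src] -= moved
    out[start:start + duration, dst] += moved
    return out

def inject_flash_crowd(tm, flows, start, rise, decay, peak):
    """Flash crowd: fast rise on several OD flows, then gradual return."""
    out = np.array(tm, dtype=float)
    profile = np.concatenate([
        np.linspace(peak / rise, peak, rise),
        np.linspace(peak, 0.0, decay + 1)[1:],
    ])
    out[start:start + rise + decay, flows] += profile[:, None]
    return out
```

Each function returns a modified copy, so several anomalies can be layered by chaining calls; the shift conserves the total volume in every cycle, which distinguishes it from the volume-adding anomalies.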
The poisoning method AddMoreIfBigger discussed in [
Anomalies were injected according to the method in Section
Subspace construction via PCA and ANTIDOTE were selected for comparison with RMPCM under exactly the same conditions to evaluate its detection performance under poisoning, and we plot the squared norm of the residual vector as a function of time to show the results of the PCA and ANTIDOTE methods. The Receiver Operating Characteristic (ROC) curve is also used to assess the overall performance of the three methods. The
The experiment was conducted with the data generated in one week to compare those three methods. Three ROC curves are drawn corresponding to no poisoning, medium poisoning, and high poisoning (Figure
Comparison of three methods’ test results with no poisoning.
RMPCM
PCA
ANTIDOTE
Comparison of three methods’ test results with medium poisoning.
RMPCM
PCA
ANTIDOTE
ROC curve of three anomaly detectors under poisoning.
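For reference, an ROC curve of this kind can be computed from per-cycle detector scores and ground-truth labels by sweeping the alert threshold. The sketch below is a generic construction with hypothetical function names; the scores could be, for example, squared Mahalanobis distances or squared residual norms. For simplicity, tied scores are treated as separate thresholds.

```python
import numpy as np

def roc_points(scores, labels):
    """False and true positive rates as the alert threshold is lowered."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = labels[order]
    tp = np.concatenate(([0], np.cumsum(hits)))
    fp = np.concatenate(([0], np.cumsum(~hits)))
    return fp / fp[-1], tp / tp[-1]

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

A detector whose scores separate anomalous from normal cycles perfectly gives an area of 1, while random scoring gives an area near 0.5.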
Table
Setting of anomaly localization test.
Time of anomaly occurrence  Injection position  Type 

300  OD50  ALPHA 
703  OD50, OD100  Ingress/egress shift 
602  OD7, OD40, OD60  Flash crowd 
1704  OD10, OD20, OD60, OD80  DDoS 
Results of locating OD of anomaly occurrence in simulation experiment.
Cyberdefense technology experimental research laboratory testbed (DETERLab) [
In our experiment, attack tools based on the Metasploit framework were integrated into DETERLab’s security experimentation environment (SEER) toolset to generate multiple anomalies on the DETERLab platform. The experiment sets up 10 PoP nodes and uses the adjacent node of each PoP node as a collection device; the topological configuration is shown in Figure
Topological configuration on DETERLab.
To verify RMPCM’s performance in noisy traffic, three scenarios were set up for comparison with the detection method based on subspace construction via PCA; they examine, respectively, detection accuracy, factors impacting performance, and poisoning by large anomalies for the two methods.
DoS attacks using TCP SYN Flood were initiated from PoP1 to PoP2 at cycles 500 and 1000 and from PoP3 to PoP4 at cycle 1800, each lasting 4 cycles. At cycle 800 a port scan was initiated with Nmap from PoP1 to PoP2, PoP5, and PoP6, lasting 5 cycles. At cycle 1200 an ingress/egress shift was initiated by transferring 50% of the traffic volume of one OD path (PoP1 to PoP2) to another (PoP7 to PoP8), lasting 40 cycles. At cycle 1500 DDoS attacks using UDP Flood were launched against PoP10 from PoP2, PoP4, PoP5, and PoP8 simultaneously, lasting 6 cycles. Figure
Comparisons of RMPCM and PCA test results on DETERLab.
Initial settings
Preset anomaly cycles  RMPCM alert cycles  PCA alert cycles  Type

500~503  501~503  502, 503  DoS 
800~804  801  801  Port scan 
1000~1003  1000~1003  1000~1003  DoS 
1200~1239  1200~1239  1217, 1231~1239  Ingress/egress shift 
1500~1505  1501~1505  1501~1505  DDoS 
1800~1803  1802, 1803  1803  DoS 
After adjusting settings
Preset anomaly cycles  RMPCM alert cycles  PCA alert cycles  Type

500~503  503  502, 503  DoS 
800~804  Port scan  
1000~1003  1001, 1002  1001, 1002  DoS 
1200~1219  1200~1219  1200~1219, 1272  Ingress/egress shift 
1500~1505  1503, 1504  1407, 1416, 1451, 1503, 1504, 1637, 1665  DDoS 
1800~1803  1802  1701, 1733, 1814, 1849  DoS 
Injecting the large anomaly
Preset anomaly cycles  RMPCM alert cycles  PCA alert cycles  Type

500~503  500~503  500~503  DoS 
800~804  801  801  Port scan 
1000~1003  1001, 1002  DoS  
1200~1239  1200~1239  Ingress/egress shift  
1500~1505  1501~1504  1502~1504  DDoS 
1800~1803  1801, 1803  1801~1803  DoS 
Comparisons of RMPCM and PCA test results on DETERLab.
Initial settings, RMPCM on the left and PCA on the right
After adjusting settings, RMPCM on the left and PCA on the right
Injecting the large anomaly, RMPCM on the left and PCA on the right
To further reveal the factors affecting performance and the differences between the two methods, the anomalies were adjusted. The volume of the DoS attacks beginning at 500 and 1800 was reduced by 50%; the port scan beginning at 800 was restricted to the path from PoP1 to PoP2, and its scanning frequency was also reduced by 50%; the duration of the ingress/egress shift beginning at 1200 was cut by 20 cycles; the DDoS attacks beginning at 1500 were narrowed to the paths from PoP2 and PoP4 to PoP10 with the same attack volume; and the DoS attacks beginning at 1000 remained unchanged. Figure
The DoS attack volume was increased by 220% to produce a large anomaly, while the other factors remained the same as in the first test. A large anomaly can raise the variance level of the path it is on, so that other anomalies causing small variance are mistaken for normal events. Figure
The above experiment demonstrates RMPCM has a better detection performance than the method based on subspace construction via PCA. AddMoreIfBigger poisoning experiment in Section
Data loss may be caused by network malfunction, device breakdown, and so on; thus algorithms based on conventional nonstatistical model become inapplicable because of the incomplete input data. RMPCM based on the latent variable probability model of multivariate
RMPCM detection results (left) and ROC curves (right) with data loss on DETERLab.
Link malfunction
Collecting device malfunction
PoP node malfunction
Table
Setting of anomaly localization test.
Time of anomaly occurrence  Injection position  Type 

502  OD2  DoS 
1203  OD2, OD78  Ingress/egress shift 
802  OD2, OD5, OD6  Port scan 
Results of locating OD of anomaly occurrence on DETERLab.
Real network data sets are obtained from backbone network Abilene [
Abilene traffic matrix data set.
Duration  Time bin  Measure  Matrix form  Data set 

15 December–21 December 2003  5 min  Byte  2010 × 121 

15 December–21 December 2003  5 min  Packet  2010 × 121 

15 December–21 December 2003  5 min  Flow  2010 × 121 

When conducting anomaly detection with data loss, data set
The true anomalies in real network traffic data can hardly be determined exactly, and RMPCM and PCA generated very similar alerts with the complete data set
Test results of real network data with complete data, PCA (a) and RMPCM (b).
In the PureRandLoss experiment, three loss rates were selected: 10%, 20%, and 50% of the total data. In PeriodRandLoss, the numbers of missing periods were set to 200, 400, and 1000 out of 2010 total periods, giving loss rates close to 10%, 20%, and 50%. In ODRandLoss, because the algorithm does not allow a column of the traffic matrix to be entirely empty, a contiguous half of the data in each affected column was removed; the numbers of affected OD flows were set to 24, 48, and 121 with a loss rate of 50% per OD flow, so the total loss rates were again 10%, 20%, and 50%. Parts of the results and the ROC curves under the three loss mechanisms are displayed in Figures
Test results of real network data under four loss mechanisms.
PureRandLoss
PeriodRandLoss
ODRandLoss
PieceRandLoss
In order to verify the impact of structured loss on the detection performance, we conducted experiment with PieceRandLoss, and the total volume of loss is set to be the same while the volume of each missing piece is set at different sizes, which are 5
The authors of [
Real network data set
Comparison of sensitivity to the change of intrinsic dimensions of PCA (a, b) and RMPCM (c, d).
Real network data sets
Comparison of sensitivity to the change of traffic measures of PCA (left) and RMPCM (right).
Flow
Packet
Byte
To summarize, RMPCM is convenient for practical implementation because of its lower sensitivity and higher robustness to changes in parameters such as intrinsic dimension and traffic measure.
Issues like parameter selection, distribution characters of data source, and so forth will be discussed further in this section.
The Internet traffic matrix is characterized by low dimensionality. The intrinsic dimension
Scree plot of data sets in Table
When network traffic is in normal status, the distribution of the squared Mahalanobis distance of traffic samples is close to Gaussian distribution. In order to verify this, the normal probability plot of
Normal probability plot of the squared Mahalanobis distance of normal traffic (normplot test).
Conventional networkwide traffic anomaly detection algorithms usually assume that the traffic matrix elements are drawn from a Gaussian or Gaussianlike distribution [
Comparison of normal probability plot of real traffic volume and Gaussian random data.
Frequency histograms with superimposed normal density curves of real traffic volume.
Compared with the other methods discussed in the paper, RMPCM achieves better performance in reducing false positives and false negatives in the experimental scenarios. One reason is that RMPCM describes the traffic data more accurately. Conventional traffic anomaly detection methods usually assume that the traffic data are drawn from a Gaussian or Gaussian-like distribution. Our statistical analysis of real network data sets indicates that real traffic volume is noticeably more heavy-tailed than a Gaussian distribution. So we use the multivariate
However, RMPCM also suffers from false positives and false negatives to some degree. This is due to the dynamic nature of network traffic flows, which inevitably leads to some deviation when a baseline-based algorithm is used to describe changes in traffic. Moreover, false positives and false negatives occur when the anomalous traffic generated in the experiments is excessively small or large relative to the interfering background traffic. On the whole, however, the false positive and false negative rates of RMPCM can meet the needs of engineering practice.
The anomaly localization in scenarios with more anomalous OD is discussed here. The experimental environment is DETERLab with 10 PoP nodes the same as in Section
Anomaly location histogram with 5 anomalous OD settings.
Anomaly location results with 1~9 anomalous OD settings.
(a) As the number of anomalous OD flows increases, the volume of anomalous traffic also increases significantly. If this volume becomes so large that it exceeds the proposed approach’s tolerance for anomalous observations, it is highly likely to degrade both the detection performance and the localization results.
(b) In addition, as the number of anomalous OD flows and the volume of anomalous traffic increase, the impact on each individual OD flow becomes relatively smaller, which reduces the accuracy of anomalous OD localization.
In conclusion, traditional network-wide anomaly detection methods suffer performance degradation or become unusable when noise interference or data loss occurs. To solve these problems and advance anomaly detection and localization, this paper proposes a network-wide approach based on a robust multivariate probabilistic calibration model. Analysis with simulations, DETERLab experiments, and real Internet data indicates that RMPCM performs better and is more robust than PCA and ANTIDOTE. Under both data loss and noise interference, RMPCM demonstrates stable performance and low sensitivity to parameter changes. RMPCM can also be applied to locating anomalies. In future work, we will study more accurate and fine-grained anomaly localization and an online RMPCM algorithm.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This research was supported by the National Grand Fundamental Research 973 Program of China under Grant no. 2012CB315901 and no. 2013CB329104; Science and Technology Commission of Shanghai Municipality Research Program under Grant no. 13DZ1108800; Science and Technology on Information Transmission and Dissemination in Communication Networks Laboratory Research Fund.