In this paper, we develop a framework to estimate network flow length distributions in terms of the number of packets. We model the network flow length data as a three-way array with day-of-week, hour-of-day, and flow length as the entities over which we observe a count. In a high-speed network, only a sampled version of such an array can be observed, and reconstructing the true flow statistics from fewer observations becomes a computational problem. We formulate the sampling process as a matrix multiplication, so that any sampling method can be used in our framework as long as its sampling probabilities can be written in matrix form. We demonstrate our framework on a high-volume real-world data set collected from a mobile network provider, using a uniform random packet sampling method and a flow-based packet sampling method. We show that modeling the network data as a tensor improves the estimates of the true flow length histogram for both sampling methods.
Monitoring network statistics is crucial for maintenance and infrastructure planning at network service providers. Statistical information about traffic patterns helps a service provider characterize its network resource usage and user behavior, infer future traffic demands, detect traffic and usage anomalies, and gain insights to improve the performance of the network [
A network flow is defined as a set of Internet protocol (IP) packets with the same signature observed within a limited time period. The flow signature is composed of the IP address and port pairs of both the source and destination nodes, together with level-3 protocol types such as transmission control protocol (TCP) or user datagram protocol (UDP). A flow starts with the arrival of the first packet and is terminated when the inter-packet timeout is exceeded. The total number of packets in a flow is referred to as the flow length, and the length distribution of a set of flows terminated in a time window is called the flow length distribution.
In this work, we use one of the most popular methods for collecting per-flow information, namely passive measurement. In this method, network packets are processed as they pass through a passive measurement beacon connected to the network, e.g., a router. The beacon keeps a lookup table for flow identification. The beacon processes a packet by searching for its corresponding flow in the lookup table using its signature. If such a flow is found, its statistics are updated. Otherwise, the packet is treated as the first packet of a new flow, and the new flow is inserted into the table. Once a flow is terminated, its statistics are transferred to storage.
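As a minimal sketch of this per-flow bookkeeping (class and field names, and the timeout value, are illustrative and not taken from the paper):

```python
class FlowTable:
    """Minimal sketch of the passive-measurement lookup table:
    flow signature -> per-flow statistics."""

    def __init__(self, timeout=30.0, export=None):
        self.table = {}
        self.timeout = timeout        # inter-packet timeout (seconds)
        self.export = export or (lambda flow: None)

    def process(self, sig, now):
        flow = self.table.get(sig)
        if flow is None or now - flow["last"] > self.timeout:
            if flow is not None:
                # previous flow timed out: ship its statistics to storage
                self.export(flow)
            flow = {"length": 0, "last": now}
            self.table[sig] = flow
        # update statistics for the matched (or newly created) flow
        flow["length"] += 1
        flow["last"] = now
```

A real beacon would key the table by the full flow signature and keep more statistics; the sketch only tracks the flow length and last-seen time needed for the timeout rule described above.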
The flow length histogram can be calculated exactly by processing every packet that passes through the measurement beacon. To implement such a direct method, the monitoring beacon needs to maintain a table holding information on all active flows in the network. However, the substantial number of concurrent flows and the very short packet inter-arrival times of current high-speed networks (on the order of 10 Gbps to 100 Gbps inside a carrier's network today) make this brute-force counting method very costly to implement. First, this method would require a large amount of memory to hold the flow table. Second, on a high-speed link, the inter-arrival times between packets, which may be as small as 8 nanoseconds on an OC-768 link, may be smaller than the time required for the flow hash operations, such as identifying the packet and updating the flow statistics.
The characteristics of network traffic data inevitably lead to the development of alternative measurement methods such as random sampling, where a fraction of the network traffic is randomly selected and processed. The simplest sampling method is uniform packet sampling [
Flow-based adaptive sampling methods [
Both packet-based and flow-based adaptive sampling methods rely on numerical methods to recover the true FLD. In this work, we propose a framework that can be used to recover the true FLD from the sampled observations obtained by any sampling method. This framework uses a variant of the nonnegative tensor factorization (NTF) model, which we call the thin nonnegative tensor factorization (ThinNTF), where the "thin" prefix emphasizes that the factorization is applied directly to the samples, i.e., the "thinned" data.
In our framework, the network traffic data is modeled as a 3-way array containing the number of flow length observations, with dimensions interpreted as (1) flow length, (2) hour-of-day, and (3) day-of-week, to capture the hourly and daily periodicity in the data. The nonnegative factorization of this tensor gives us estimates in the form of a nonparametric mixture model. Therefore, our model is an improvement over the nonparametric flow length models in [
While the ordinary NTF model [
We model one week of flow length observations as a 3-dimensional tensor and observe the periodic behavior.
We propose a novel tensor factorization scheme, ThinNTF, which can find the factors of a latent tensor from its sampled counterpart. In doing so, we also solve the reconstruction problem.
We apply ThinNTF to real-world data sampled with two different sampling methods: uniform random packet sampling and flow-based adaptive sampling.
The structure of the paper is as follows. In Section
Sampling methods have long been applied to network traffic monitoring. A survey on fundamental network sampling strategies is given in [
Flow-based sampling methods have been proposed as alternatives to uniform packet sampling, since packet sampling has theoretical limitations when recovering true flow statistics [
Nonnegative tensor factorization is the generalization of the nonnegative matrix factorization (NMF) [
Modeling the flow length distribution as a mixture of distributions was first proposed by [
We describe our problem as a tensor thinning problem, where the counts of the original flow lengths are stored in a tensor. We formulate the sampling process as a matrix multiplication applied to this data tensor. To do so, each sampling model must be represented as a matrix that transforms the original data tensor into a sampled one. We provide matrices for two sampling models: uniform packet sampling and ANLS flow-based packet sampling.
For a clear notation, the scalar values are denoted by lightface letters, such as the index variable
The index parameters are also fixed for clarity. The list of indexes, with their ranges and semantic descriptions, is given in Table
Indexes in the model.
Index  Range  Description
Original flow lengths
Sampled flow lengths
Hours of day
Days
Components
The original flow length data is represented in an
Working with a large maximum flow size is not feasible for two reasons. First, learning a mixture model in which each flow component has 2 million parameters is not a good formulation of this problem. Second, 99.9% of the flows in our data have fewer than 100 packets, which means the tensor
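For illustration, the clamped three-way count tensor can be assembled as below (the function name, record format, and dimension sizes are our own choices; flow lengths above the clamping value fall into the last bin):

```python
import numpy as np

def build_tensor(records, max_len=100, hours=24, days=7):
    # records: iterable of (day, hour, flow_length) tuples, with
    # flow_length >= 1. Lengths above max_len are clamped into the
    # last bin, so the tensor stays small.
    X = np.zeros((max_len, hours, days))
    for day, hour, length in records:
        X[min(length, max_len) - 1, hour, day] += 1
    return X
```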
Figure
Slices of the original flow length tensor
Independent of the sampling method, we can define an
For any given sampling method, we can calculate the
An important practical issue is that, if the original tensor
In uniform sampling, each packet is processed with a fixed probability of
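Under independent per-packet sampling with probability p, a flow of i packets yields j sampled packets with Binomial(i, p) probability, so the corresponding sampling matrix can be sketched as follows (the function name and size parameters are our own):

```python
import numpy as np
from math import comb

def uniform_sampling_matrix(N, M, p):
    # B[j, i] = P(j of i packets are sampled) = Binomial(i, p) pmf,
    # for original lengths i = 0..N and sampled lengths j = 0..M.
    B = np.zeros((M + 1, N + 1))
    for i in range(N + 1):
        for j in range(min(i, M) + 1):
            B[j, i] = comb(i, j) * p**j * (1 - p)**(i - j)
    return B
```

When M >= N, each column is a complete probability mass function; truncating at M < N keeps the matrix small at the cost of discarding mass above M.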
Algorithm
flow = flow_table.lookup(packet)
if flow is NULL: flow = new_flow(packet)
flow.length += 1
flow_table.insert_or_update(flow)
Figure
Sampling matrices for two different sampling schemes.
Figure
The ANLS [
Here,
The ANLS method is described in detail in Algorithm
flow = flow_table.lookup(packet)
if flow is NULL: flow = new_flow(packet)
flow.length += 1
flow_table.insert_or_update(flow)
We calculate the sampling matrix
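For a flow-based scheme in which the probability of sampling the next packet depends on how many packets of the flow have already been sampled, the sampling matrix can be filled in by a simple recursion. The rule q below is an assumed placeholder for illustration, not the actual ANLS rule:

```python
import numpy as np

def adaptive_sampling_matrix(N, M, q):
    # q(c): probability of sampling the next packet when c packets of
    # the flow have already been sampled (placeholder adaptive rule).
    # P[j, i]: probability that a flow of i packets yields j samples.
    P = np.zeros((M + 1, N + 1))
    P[0, 0] = 1.0
    for i in range(1, N + 1):
        for j in range(min(i, M) + 1):
            keep = P[j - 1, i - 1] * q(j - 1) if j > 0 else 0.0
            drop = P[j, i - 1] * (1.0 - q(j))
            P[j, i] = keep + drop
    return P
```

With a constant rule q(c) = p, the recursion reduces to the binomial matrix of uniform packet sampling, which is a useful sanity check.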
Figure
Visualization of the Monday slice with uniform and ANLS sampling methods.
Our methodology is based on the nonnegative factorization of the data tensor. Our model, which we call ThinNTF, introduces the sampling matrix as a constant factor into the original NTF with the Poisson–Gamma observation model. The rationale for using factorization to recover true flow sizes is that the flow size distributions exhibit daily periodic behavior, as we show in Section
NTF is the generalization of the 2-dimensional NMF model to multiple dimensions. In NTF, an N-dimensional tensor is approximated by the product of lower-dimensional factors. Unlike NMF, tensor factorization can be done in multiple ways. In this work, we use the PARAFAC [
PARAFAC factorization.
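In the PARAFAC scheme, a rank-K approximation of the 3-way tensor is a sum of K outer products of the factor columns; a numpy sketch (factor names are our own):

```python
import numpy as np

def parafac_reconstruct(S, T, U):
    # S: (I, K) flow-length factor, T: (H, K) hour-of-day factor,
    # U: (D, K) day-of-week factor. Returns the rank-K tensor
    # X[i, h, d] = sum_k S[i, k] * T[h, k] * U[d, k].
    return np.einsum('ik,hk,dk->ihd', S, T, U)
```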
Bro [
ThinNTF is basically an NTF with an additional constant factor, which in our case is the sampling matrix
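In the same notation, ThinNTF's estimate of the sampled tensor applies the fixed sampling matrix along the flow-length mode of the PARAFAC product; a sketch with assumed names:

```python
import numpy as np

def thin_ntf_reconstruct(B, S, T, U):
    # B: (M, I) sampling matrix (constant factor); S: (I, K),
    # T: (H, K), U: (D, K) PARAFAC factors to be learned.
    # Returns the model mean of the sampled tensor.
    return np.einsum('ji,ik,hk,dk->jhd', B, S, T, U)
```

Only the three PARAFAC factors are estimated; B stays fixed, which is what distinguishes ThinNTF from ordinary NTF.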
Graphical models representing the dependency structure of NTF and ThinNTF models in PARAFAC scheme.
In Section
ThinNTF model.
In this scheme, one can immediately suspect that
In ThinNTF, we observe the
Taking the Bayesian approach, we first provide a generative model for the ThinNTF and then describe how we can estimate the posterior probabilities of model parameters (in this case, the factor matrices) conditioned on the sampled flow length observations
Tensors in the model and their corresponding index sets.
Tensor  Index set  Description
Original flow length tensor
Sampled flow length tensor
Mask tensor
Latent variable tensor
Flow length factor
Hour of day factor
Day of week factor
Sampling matrix
Gamma priors for
Gamma priors for
Gamma priors for
The original and latent data tensor
We choose the prior distributions of the factor entries to be Gamma distributions, since the Gamma distribution is the conjugate prior of the Poisson distribution [
In order to avoid repetition, we are going to omit the equations regarding the factors
Finally, we generate
//Sample factor
//Sample factor
//Sample factor
//Sample latent tensor
//Randomly initialize factors and latent tensor
//Generate original tensor
//Generate sampled tensor
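A numpy sketch of the generative process above (all sizes and Gamma hyperparameters are illustrative; we use a binomial sampling matrix and the fact that independently thinning Poisson counts yields Poisson counts again):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
N, H, D, K = 50, 24, 7, 3      # max flow length, hours, days, components
a, b = 1.0, 1.0                # Gamma shape and rate (assumed values)

# Sample the three factor matrices from their Gamma priors
S = rng.gamma(a, 1.0 / b, size=(N + 1, K))
T = rng.gamma(a, 1.0 / b, size=(H, K))
U = rng.gamma(a, 1.0 / b, size=(D, K))

# Original tensor: Poisson counts with a PARAFAC-structured mean
Lam = np.einsum('ik,hk,dk->ihd', S, T, U)
X = rng.poisson(Lam)

# Uniform packet sampling matrix (binomial thinning, probability p assumed)
p = 0.1
B = np.array([[comb(i, j) * p**j * (1 - p)**(i - j) if j <= i else 0.0
               for i in range(N + 1)] for j in range(N + 1)])

# Sampled tensor: thinned Poisson counts stay Poisson, with mean
# given by B applied along the flow-length mode
Y = rng.poisson(np.einsum('ji,ihd->jhd', B, Lam))
```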
After defining the generative model, we can infer the factors
We start our Bayesian inference by calculating the posterior distributions over the factors
This log-likelihood is intractable due to the integration over the latent factors, but it is lower bounded as
Here, we provide the update equations for
Considering the terms in the log-likelihood expression in equation (
Similarly, considering the terms in the log-likelihood equation (
We calculate the expectation of
The variational Bayes algorithm that uses the above equations is given in Algorithm
//Randomly initialize factors and latent tensor
Calculate
Calculate
Calculate
Calculate
Calculate
Calculate
Calculate
Calculate lower bound
The nonnegative tensor factorization is an NP-hard problem [
Applying a tensor model to the flow length estimation problem requires high-volume data collected over a long period, to capture the temporal behavior of the network. Publicly available data sets do not fulfill this requirement. Therefore, we collected our own real-world data from a mobile network service provider in Turkey [
The system architecture of a mobile operator’s general packet radio service (GPRS) network infrastructure, including radio access and core network elements, is illustrated in Figure
Placement of our monitoring server inside the premises of the mobile operator running a commercial cellular network.
The Gn interface (Gn is the interface between the SGSN and GGSN, where GTP is the main protocol for the network packets flowing through) carries the user packets transferred between mobile users and the Internet, together with the control packets necessary for the universal mobile telecommunications service (UMTS) core network [
The Gn interface mainly carries two types of GTP message structures, or packets: GTP-C and GTP-U. GTP-C is used for signaling between the SGSN and GGSN in the core network; it carries packet data protocol (PDP) context messages such as activating and deactivating a user session, configuring service parameters, or updating the session. GTP-U is used for transmitting user data between the radio access network and the core network. In our experiments, we filtered out GTP-C packets (since, by the flow definition, they are not considered part of a flow), which make up 10% of the total Gn traffic. Therefore, sampling is applied to GTP-U packets only. GTP is carried mainly over UDP.
The mobile operator network consists of several districts with more than 10 regional core areas throughout Turkey. The average total traffic in all regional areas consists of over 15 billion packets in the uplink direction and over 20 billion packets in the downlink direction daily. This corresponds to approximately 80 terabytes of total data flowing in the uplink and downlink daily inside the mobile operator's core network as a whole. In this work, the Gn interface connecting the SGSN and GGSN nodes is mirrored, and the network traffic is transferred to an FLD server located at the mobile operator's technology center in the core network. A speed of 200 Mbit/s at peak hours can be observed on one of the mirrored interfaces in the core network.
We monitored the network traffic in one of the servers of a mobile operator continuously for 10 days. We developed a packet extraction tool inside the monitoring server shown in Figure
After the data collection and flow extraction, the total number of packets collected is found to be
Cumulative flow lengths in the realworld data.
We designed two sets of experiments in order to validate our model: synthetic and real-world experiments. In each set, we sampled the original data with both the uniform and ANLS models using different sampling parameters. Then, we tried to recover the original tensor with ThinNTF models. The ThinNTF model takes a single parameter
Both ThinNMF and ThinNTF models explain the data as a linear combination of
During the experiments, we always ran the stochastic algorithms, i.e., ThinNMF, ThinNTF, and MLE, 10 times and kept the parameters of the model with the highest lower bound value. We then reported the success of our algorithm with the weighted mean relative difference (WMRD) metric, as this was used in all previous flow size estimation works. The WMRD is a metric that gives more weight to the relative differences occurring at larger frequencies. It is formulated as
Additionally, we report the Kullback–Leibler (KL) divergence between the original and estimated tensors, since this is the metric minimized by the variational Bayes algorithm. The KL divergence between two distributions
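For concreteness, the two metrics can be sketched as follows; the WMRD form is the standard one from the flow-size estimation literature, and both implementations (names included) are ours:

```python
import numpy as np

def wmrd(n, n_hat):
    # Weighted mean relative difference: total absolute difference
    # normalized by the total of the averaged histograms.
    n, n_hat = np.asarray(n, float), np.asarray(n_hat, float)
    return np.abs(n - n_hat).sum() / ((n + n_hat) / 2.0).sum()

def kl_div(p, q, eps=1e-12):
    # KL divergence between two normalized count distributions;
    # eps guards against empty bins.
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float((p * np.log(p / q)).sum())
```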
We prepared our synthetic experiments to test the validity of our models. In this experiment set, we used the generative model of the ThinNTF model as described in Algorithm
We sampled the synthetic data with uniform and ANLS sampling methods with different sampling parameters. The sampling was done simply by randomly drawing a sampled size for each flow according to the sampling probabilities in the
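Drawing a sampled size for each flow from the columns of a sampling matrix can be sketched as below (names are assumed; the renormalization only matters when the matrix was truncated at a maximum sampled length):

```python
import numpy as np

def sample_flows(lengths, B, rng):
    # lengths: original flow lengths; B[:, i] is the distribution of
    # sampled lengths for a flow of original length i.
    M = B.shape[0] - 1
    out = []
    for i in lengths:
        col = B[:, i]
        out.append(rng.choice(M + 1, p=col / col.sum()))
    return np.array(out)
```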
The ThinNTF model always performed best with the uniform sampling model, as shown in Table
Uniform sampling results on synthetic data.
Period  ThinNMF-R3  ThinNTF-R3  MLE
2  0.53  0.49  0.88 
4  0.63  0.59  1.20 
8  0.65  0.61  1.29 
16  0.68  0.61  1.41 
32  0.74  0.61  1.37 
64  0.85  0.61  1.50 
ANLS sampling results on synthetic data.
U  ThinNMF-R3  ThinNTF-R3  MLE  ANLS
0.01  0.29  0.27  0.09  0.15 
0.02  0.31  0.29  0.17  0.27 
0.05  0.33  0.32  0.36  0.28 
0.1  0.35  0.34  0.51  0.38 
0.2  0.38  0.36  0.59  0.67 
0.5  0.48  0.47  0.72  0.70 
Synthetic experiment results.
The original data collected from a mobile network provider as we described in Section
Since the number of components in the original flow distribution is unknown, we ran our experiments with
Both factorization models, ThinNMF and ThinNTF, helped lower the WMRD score under both the uniform and ANLS sampling methods. The ThinNTF-R4 model consistently gave a lower error than the MLE baseline for the uniform model, as shown in Table
Uniform sampling results on real data.
Period  ThinNMF  ThinNTF  MLE
2  0.23  0.24  0.23  0.21  0.25  0.22  0.41 
4  0.55  0.52  0.53  0.50  0.48  0.49  0.69 
8  0.94  0.93  0.94  0.91  0.90  0.87  0.97 
16  1.15  1.11  1.11  1.09  1.05  1.04  1.05 
32  1.25  1.24  1.24  1.16  1.13  1.10  1.22 
64  1.31  1.29  1.30  1.09  1.06  1.04  1.22 
Real-world data results with the uniform sampler.
Figure
Another important observation is that, for uniform sampling, the 3-way factorization is more successful than the 2-way factorization. The periodicity information captured by the ThinNTF model helps improve the estimates and makes it the more successful model for this sampling method.
Under ANLS sampling, all our factorization models gave lower error values than the MLE and the unbiased ANLS estimator, as shown in Table
ANLS sampling results on real data.
U  ThinNMF  ThinNTF  MLE  ANLS
0.01  0.03  0.02  0.01  0.05  0.04  0.03  0.05  0.12 
0.02  0.04  0.03  0.02  0.06  0.04  0.03  0.08  0.21 
0.05  0.04  0.03  0.02  0.07  0.05  0.04  0.13  0.39 
0.1  0.06  0.05  0.04  0.08  0.07  0.05  0.17  0.61 
0.2  0.08  0.08  0.08  0.10  0.09  0.07  0.21  0.70 
0.5  0.13  0.13  0.11  0.16  0.15  0.13  0.33  0.94 
Real-world data results with the ANLS sampler.
The choice of where to clamp the data depends on multiple factors. First of all, one can set the clamping value
We ran the best algorithms found in the previous section for the uniform and ANLS sampling methods with
Clamping experiments.
In this work, we introduced a novel nonnegative tensor factorization model called ThinNTF, which extends the classic nonnegative tensor factorization with an additional constant factor that can represent a network packet sampling method. We showed that this model can be employed to improve the current reconstruction algorithms in recovering the original flow length distributions.
We tested our model with two different types of sampling methods: the uniform packet sampling method and a flowbased packet sampling method, called ANLS. We described how to use these methods by showing how to build their sampling matrices.
To test our model, we collected high-volume data from a mobile network provider over a long period in order to observe the periodic behavior of the flow length distribution. In experiments on synthetic and real-world data, our models gave promising results, lowering the estimation errors compared to the baselines of each sampling method. We conclude that our model can be used to decrease estimation errors, or to decrease the sampling probabilities without increasing the estimation error.
An important issue left as future work is the online execution of the ThinNTF model. Theoretically, the ThinNTF model can be used online once sufficient data from the target network is collected, and the flow length distribution components, i.e., the
The calculation of the lower bound includes a few arithmetic tricks. We provide a Bayesian nonnegative matrix factorization [
The authors declare that there are no conflicts of interest regarding the publication of this paper.