Network Traffic Prediction Based on Deep Belief Network and Spatiotemporal Compressive Sensing in Wireless Mesh Backbone Networks

1 School of Electronics and Information, Northwestern Polytechnical University, Xi’an 710072, China
2 School of Software, Dalian University of Technology, Dalian 116620, China
3 School of Information Technology, Deakin University, Burwood, VIC 3125, Australia
4 Department of Electrical, Computer, Software, and Systems Engineering, Embry-Riddle Aeronautical University, Daytona Beach, FL 32114, USA
5 School of Astronautics & Aeronautics, University of Electronic Science and Technology of China, Chengdu 611731, China


Introduction
The Wireless Mesh Network (WMN) provides ubiquitous, last-few-miles connectivity for future wireless services, for example, IoT, 5G mobile networks, and cognitive radio. It is also a promising solution for IoT crowdsensing applications, since it connects a large group of individuals with computing and sensing capabilities. Compared with other wireless architectures (e.g., ad hoc networks), the WMN offers high capacity, robustness, and low-cost deployment [1]. Hence, it is increasingly popular in practice as an emerging access paradigm. With the rapid development of mobile communications, mobile cloud systems, and IoT, the applications provided by wireless networks have multiplied.
Big data also plays a crucial role in both industry and daily life. The substantial growth of WMNs, both in scale and in services, brings a series of new capacity challenges; for example, huge traffic demands may cause congestion in a wireless mesh backbone network.
Mesh routers constitute the principal infrastructure of the WMN, known as the wireless mesh backbone network. The self-organized WMN architecture reinforces the resilience of the network to failures, but it also gives rise to some limitations, typically the resource allocation problem. Network management operations are therefore essential as a cost-effective way to improve the performance of a WMN. These operations rely on the relevant network traffic information. For instance, to improve the quality of service for users, ISPs need predictive network planning, which is carried out according to the future traces of network traffic flows between all possible origin-destination (OD) node pairs [2,3].
A great number of methods have been proposed for network traffic prediction in traditional IP backbone networks [4][5][6][7][8], and statistical methods have been widely adopted. Originally, simple models such as Autoregressive (AR) and Autoregressive Integrated Moving Average (ARIMA) were used to capture the short-range dependence (SRD) of network traffic [4]. However, current network traffic exhibits long-range dependence (LRD) and multifractal features owing to the behaviors of terminals [4]. The Fractional Autoregressive Integrated Moving Average (FARIMA) model and the Multifractal Wavelet Model (MWM) were therefore introduced for network traffic prediction [5]. As network services and applications diversify, the characteristics of network traffic become much more complex; for instance, the traffic exhibits nonlinear features [4,6]. Hence, some methods use the GARCH model to model network traffic for prediction, and many methods based on hybrid models have also been proposed [7]. However, these methods are not suitable for network traffic prediction in a wireless mesh backbone network [2]. Users generally join a Wireless Mesh Network randomly, and they often have complicated individual associations, which differ significantly from those of users in a traditional IP backbone network. Although the applications used in the two networks may coincide, the dominant applications are usually different.
Motivated by this issue, we propose a network traffic prediction method based on the deep belief network (DBN) and the Spatiotemporal Compressive Sensing (STCS) method, named the Deep Belief Network and Spatiotemporal Compressive Sensing (DBNSTCS) model. To the best of our knowledge, this is the first paper that focuses on traffic prediction for wireless mesh backbone networks and takes the spatiotemporal characteristic into account for prediction. We treat the long-range dependence and the irregular fluctuations of network traffic independently; see Figure 1 [9]. By the discrete wavelet transform (DWT), the network traffic is divided into two components, represented by the scaling coefficients and the discrete wavelet transform coefficients. That is, the DWT acts like a filter that decomposes the network traffic into a low-pass component and a high-pass component. The low-pass component expresses the long-range dependence of network traffic, and the high-pass one expresses the gusty and irregular fluctuations. LRD means that the network traffic at any time depends on multiple previous traffic values; it derives from a series of interacting factors (e.g., the behaviors of users). To describe these multifarious relationships, the low-pass component is predicted by a deep architecture based on the DBN, which can learn the LRD of network traffic. The short-range and irregular fluctuations are predicted by the STCS method, which is known as an excellent interpolation algorithm.
The contributions of this paper are as follows:
(i) We use the DWT to extract the low-pass and high-pass components of network traffic. The DWT of a time series can be viewed as passing the series through a low-pass filter and a high-pass filter, respectively, which yields the low-pass approximation and the details of the network traffic. In our method, we predict the two types of coefficients independently.
(iii) We adopt Spatiotemporal Compressive Sensing to fit the gusty and irregular fluctuations of network traffic. We first assume that the high-pass component obeys a spatiotemporal dependence. Then, we obtain a predictor of network traffic by the Sparsity Regularized Matrix Factorization (SRMF) method.
The remainder of this paper is organized as follows. Section 2 reviews related work on network traffic prediction. Section 3 introduces the definitions of network traffic, the DWT, the DBN theory, and the Spatiotemporal Compressive Sensing method. Section 4 presents our prediction method, and Section 5 verifies its performance. Section 6 concludes the paper.

Related Work
Many researchers have investigated network traffic prediction, which is instructive for congestion control, predictive network planning, and intelligent routing [10][11][12][13][14]. The existing network traffic prediction techniques fall into four categories: linear time series methods, nonlinear time series methods, hybrid model methods, and decomposition model methods.
The linear time series methods (e.g., AR, MA, and ARMA) are frequently used to model end-to-end traffic flows for prediction. According to recent research findings, however, traffic flows exhibit markedly nonlinear features under complex user behaviors and various applications. Typically, the GARCH model in [10] is used to model the burst characteristics. Neural networks are also a valid way to track traffic flows with nonlinear characteristics [12].
With the rapid development of network services, the ISP network has become heterogeneous and complex. Traffic flows show manifold statistical characteristics such as LRD, SRD, heavy-tailed distributions, and multifractal features. Researchers therefore adopt hybrid models, which combine two or more models to capture the traces of traffic flows with complicated distributions. In [4], the authors combine the ARIMA model with the GARCH model to fit several characteristics of traffic flows (i.e., the LRD and SRD characteristics). The proposed hybrid model can also model the self-similarity and multifractal features of traffic flows. The authors determine the parameters of the hybrid model with the autocorrelation function and the partial autocorrelation function.
The fourth category, the decomposition model methods, divides the traffic flows into several components that are then modeled and predicted separately. These methods can be viewed as an evolution of the hybrid model methods. In [13], the authors jointly use the Stationary Wavelet Transform (SWT), the Quantum Genetic Algorithm (QGA), and the Backpropagation Neural Network (BPNN) to predict wireless network traffic. They first decompose the traffic flows by the SWT into several stationary components. After that, all these components are predicted by a BPNN trained using the QGA. The authors in [14] decompose the traffic of a large-scale cellular network into regular and random components by a classic time series decomposition method.

Traffic Matrix.
A traffic matrix is a representation of network traffic. After collecting network traffic information, the operators carry out appropriate network management functions based on it; for this purpose, the information is expressed as a traffic matrix. If we denote an OD flow by $x_{i,j}(t)$, which describes the mean volume of the traffic flow from the origin node $i$ to the destination node $j$ in the $t$th time slot, then the traffic matrix is defined as
$$X = \begin{bmatrix} x_{1,1}(1) & x_{1,1}(2) & \cdots & x_{1,1}(T) \\ x_{1,2}(1) & x_{1,2}(2) & \cdots & x_{1,2}(T) \\ \vdots & \vdots & \ddots & \vdots \\ x_{N,N}(1) & x_{N,N}(2) & \cdots & x_{N,N}(T) \end{bmatrix}, \tag{1}$$
where $i, j \in \{1, 2, \ldots, N\}$ and $t \in \{1, 2, \ldots, T\}$. This traffic matrix reports the network traffic data over $T$ time slots. Generally, the length of a time slot is 5 or 15 minutes.
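As a minimal sketch of this definition, the Python snippet below arranges per-slot OD measurements into a traffic matrix with one row per OD pair and one column per time slot. The function name `build_traffic_matrix`, the row ordering, the zero default for missing measurements, and the toy numbers are illustrative assumptions, not part of the original method.

```python
def build_traffic_matrix(flows, n_nodes, n_slots):
    """Arrange OD flows into an (n_nodes**2) x n_slots traffic matrix.

    `flows` maps (origin, destination, slot) -> mean traffic volume, with
    1-based indices as in the text. Missing entries default to 0.0, which
    is an assumption made only for this sketch.
    """
    matrix = []
    for i in range(1, n_nodes + 1):        # origin node
        for j in range(1, n_nodes + 1):    # destination node
            row = [flows.get((i, j, t), 0.0) for t in range(1, n_slots + 1)]
            matrix.append(row)
    return matrix


# Toy example: 2 nodes, 3 time slots (values are made up).
toy_flows = {(1, 2, 1): 5.0, (1, 2, 2): 7.0, (1, 2, 3): 6.0,
             (2, 1, 1): 1.0, (2, 1, 2): 2.0, (2, 1, 3): 1.5}
X = build_traffic_matrix(toy_flows, n_nodes=2, n_slots=3)
for row in X:
    print(row)
```

Each row is the time series of one OD pair, so row-wise operations act on single flows and column-wise operations act on single time slots.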

Discrete Wavelet Transform.
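A time series $x(t)$ can be expressed by
$$x(t) = \sum_{k} c_{J,k}\,\phi_{J,k}(t) + \sum_{l=1}^{J} \sum_{k} d_{l,k}\,\psi_{l,k}(t), \tag{2}$$
where $c_{J,k}$ and $d_{l,k}$ are the scaling and discrete wavelet transform coefficients, $\phi_{J,k}(t)$ and $\psi_{l,k}(t)$ are the corresponding scaling and wavelet functions, and $J$ is the scale. The scaling coefficients give the low-pass approximation of $x(t)$, and the discrete wavelet transform coefficients represent the details of $x(t)$. Hence, (2) can be viewed as passing the time series through a filter and then representing $x(t)$ by the combination of its low-pass and high-pass approximations.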
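To make the two kinds of coefficients concrete, the following worked example computes a single-level decomposition of a four-sample series into scaling coefficients $c_{1,k}$ and wavelet coefficients $d_{1,k}$. It assumes the Haar wavelet, which is chosen here only for illustration; the text does not state which wavelet is used.

```latex
% Single-level Haar DWT (assumed wavelet), k = 1, 2, ..., T/2
c_{1,k} = \frac{x(2k-1) + x(2k)}{\sqrt{2}}, \qquad
d_{1,k} = \frac{x(2k-1) - x(2k)}{\sqrt{2}}

% Example: x = (4, 6, 10, 12)
c_{1,1} = \frac{4 + 6}{\sqrt{2}} \approx 7.07, \qquad
c_{1,2} = \frac{10 + 12}{\sqrt{2}} \approx 15.56

d_{1,1} = \frac{4 - 6}{\sqrt{2}} \approx -1.41, \qquad
d_{1,2} = \frac{10 - 12}{\sqrt{2}} \approx -1.41

% The inverse transform recovers the samples
x(2k-1) = \frac{c_{1,k} + d_{1,k}}{\sqrt{2}}, \qquad
x(2k) = \frac{c_{1,k} - d_{1,k}}{\sqrt{2}}
```

In this example the scaling coefficients follow the local level of the series, while the wavelet coefficients capture the sample-to-sample change.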

Deep Belief Network.
The DBN is a common deep learning primitive. It is a stack of Restricted Boltzmann Machines (RBMs) [15][16][17]. An RBM is a two-layer undirected graphical model that consists of a visible layer and a hidden layer, denoted by $v$ and $h$ (shown in Figure 2) [9,15]. Each unit in a layer is connected to all units of the other layer by undirected edges, and units in the same layer are not connected to each other. Figure 3 shows an example of a DBN architecture with two RBMs [9]. The values of all units are stochastic variables [16]; generally, they obey a Bernoulli distribution or a Gaussian distribution. When the visible units are Gaussian and the hidden units are Bernoulli, we have the following conditional distributions:
$$p(v_i \mid h) = \mathcal{N}\Big(a_i + \sum_{j=1}^{n} w_{i,j} h_j,\ 1\Big), \qquad p(h_j = 1 \mid v) = \mathrm{sigm}\Big(b_j + \sum_{i=1}^{m} w_{i,j} v_i\Big), \tag{3}$$
where $\mathcal{N}(a_i + \sum_{j=1}^{n} w_{i,j} h_j, 1)$ denotes the Gaussian distribution whose mean and variance are $a_i + \sum_{j=1}^{n} w_{i,j} h_j$ and 1, and $\mathrm{sigm}(x) = \exp(x)/(1 + \exp(x))$ is the sigmoid function. $m$ and $n$ are the numbers of visible and hidden units, respectively [18]. $a_i$ and $b_j$ are the biases of the visible and hidden units, and $w_{i,j}$ expresses the symmetric interaction term between the visible unit $v_i$ and the hidden unit $h_j$. For an RBM, the joint probability distribution over the visible and hidden units can be denoted by
$$p(v, h) = \frac{\exp(-E(v, h))}{Z}, \tag{4}$$
where $Z$ is the normalizing constant and $E(v, h)$ is termed the energy function, defined as
$$E(v, h) = \sum_{i=1}^{m} \frac{(v_i - a_i)^2}{2} - \sum_{j=1}^{n} b_j h_j - \sum_{i=1}^{m} \sum_{j=1}^{n} v_i w_{i,j} h_j. \tag{5}$$
To train the DBN, a layer-wise greedy strategy is employed, and the parameters are updated by maximizing the log probability $\log p(v)$.
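The conditional distributions of an RBM with Gaussian visible units and Bernoulli hidden units can be sketched in plain Python as below. This is an illustrative sketch, not the authors' implementation: the symbols `a`, `b`, and `w` (visible biases, hidden biases, and interaction weights), the unit-variance energy function, and the toy parameter values are assumptions made here, and training (for example, by contrastive divergence) is omitted.

```python
import math
import random


def sigm(x):
    """Sigmoid function, equal to exp(x) / (1 + exp(x))."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, w, b):
    """p(h_j = 1 | v) for Bernoulli hidden units; w[i][j] links v_i and h_j."""
    return [sigm(b[j] + sum(w[i][j] * v[i] for i in range(len(v))))
            for j in range(len(b))]


def visible_means(h, w, a):
    """Mean of the unit-variance Gaussian p(v_i | h)."""
    return [a[i] + sum(w[i][j] * h[j] for j in range(len(h)))
            for i in range(len(a))]


def energy(v, h, w, a, b):
    """Energy of a Gaussian-Bernoulli RBM with unit variance (assumed form)."""
    quad = sum((v[i] - a[i]) ** 2 / 2.0 for i in range(len(v)))
    bias = sum(b[j] * h[j] for j in range(len(h)))
    inter = sum(v[i] * w[i][j] * h[j]
                for i in range(len(v)) for j in range(len(h)))
    return quad - bias - inter


def gibbs_step(v, w, a, b, rng):
    """One Gibbs sweep: sample h from p(h | v), then v from p(v | h)."""
    h = [1.0 if rng.random() < p else 0.0 for p in hidden_probs(v, w, b)]
    v_new = [rng.gauss(mu, 1.0) for mu in visible_means(h, w, a)]
    return v_new, h


# Toy RBM with 3 visible and 2 hidden units (made-up parameters).
rng = random.Random(0)
w = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
a = [0.0, 0.1, -0.1]
b = [0.2, -0.3]
v0 = [1.0, 0.5, -0.5]
print(hidden_probs(v0, w, b))
print(gibbs_step(v0, w, a, b, rng))
```

The sigmoid form of `hidden_probs` follows from the energy function: flipping one hidden unit from 0 to 1 changes the energy by minus its total input, so the ratio of the two Boltzmann weights reduces to a sigmoid.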

Spatiotemporal Compressive Sensing and Sparsity Regularized Matrix Factorization.
Compressive sensing is a recent sampling technique for signal processing that makes good use of the structure or redundancy of real-world signals. It takes advantage of an adaptive sampling scheme to sense these structured signals. Here, structure means that the signal can be represented by a vector with only a few nonzero elements (i.e., the vector is sparse). In the sampling scheme, a random matrix called the measurement matrix compresses and encodes the signal at the same time. During the decoding phase, the compressive sensing reconstruction algorithm solves the resulting ill-posed inverse problem. As a derivative of compressive sensing, the spatiotemporal compressive sensing technique is commonly used as an interpolation algorithm to recover the missing elements of a data set. The spatiotemporal feature means that neighboring elements in the data set have similar values. Based on this feature, the SRMF method is proposed in [19]; it can recover the missing elements precisely even when the data loss probability is very high. The SRMF method is also an accurate tool for prediction: the elements to be predicted are simply viewed as consecutive missing elements.

Decomposition of Network Traffic.
We assume that the known network traffic is denoted by $X$, each OD flow of which is a time series $x_{i,j}(t)$, where $t = 1, 2, \ldots, T$. According to (2), it can be denoted by
$$x_{i,j}(t) = \sum_{k} c_{J,k}\,\phi_{J,k}(t) + \sum_{l=1}^{J} \sum_{k} d_{l,k}\,\psi_{l,k}(t). \tag{6}$$
If we set the scale to be 1, then we have
$$x_{i,j}(t) = \sum_{k} c_{1,k}\,\phi_{1,k}(t) + \sum_{k} d_{1,k}\,\psi_{1,k}(t), \tag{7}$$
where $c_{1,k}$ and $d_{1,k}$ are the scaling and discrete wavelet transform coefficients of $x_{i,j}(t)$, and $\phi_{1,k}(t)$ and $\psi_{1,k}(t)$ are the scaling and wavelet functions. The above equation divides the network traffic into two components. One is the low-pass approximation (given by the scaling coefficients), which exhibits the LRD of the network traffic $x_{i,j}(t)$, and the other is the high-pass approximation (given by the discrete wavelet transform coefficients), which expresses the gusty and irregular fluctuations of $x_{i,j}(t)$. For a traffic matrix that describes the volume of traffic between all OD node pairs, the low-pass and high-pass components can accordingly be denoted by two matrices. Figures 4 and 5 give two examples of the traffic flow decomposition by the DWT. We randomly select two OD flows from the real network traffic data set and plot their low-pass and high-pass components. The low-pass components are periodic, which makes them much easier to predict than the high-pass components shown in Figures 4(b) and 5(b). For this reason, the two components are predicted independently in this paper.
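The row-by-row decomposition of a traffic matrix into a low-pass matrix and a high-pass matrix can be sketched as follows. The sketch assumes a single-level Haar wavelet, since the wavelet family is not stated in the text, and the function names `haar_dwt`, `haar_idwt`, and `decompose_traffic_matrix` are hypothetical helpers introduced for illustration.

```python
import math


def haar_dwt(series):
    """Single-level Haar DWT: returns (scaling, wavelet) coefficient lists."""
    if len(series) % 2 != 0:
        raise ValueError("series length must be even")
    s = math.sqrt(2.0)
    low = [(series[k] + series[k + 1]) / s for k in range(0, len(series), 2)]
    high = [(series[k] - series[k + 1]) / s for k in range(0, len(series), 2)]
    return low, high


def haar_idwt(low, high):
    """Inverse of haar_dwt: rebuilds the series from both components."""
    s = math.sqrt(2.0)
    series = []
    for c, d in zip(low, high):
        series.extend([(c + d) / s, (c - d) / s])
    return series


def decompose_traffic_matrix(matrix):
    """Split each OD row into a low-pass row and a high-pass row."""
    pairs = [haar_dwt(row) for row in matrix]
    low_matrix = [low for low, _ in pairs]
    high_matrix = [high for _, high in pairs]
    return low_matrix, high_matrix


# Toy traffic matrix: 2 OD flows, 4 time slots (made-up values).
X = [[4.0, 6.0, 10.0, 12.0],
     [1.0, 1.0, 3.0, 5.0]]
C, D = decompose_traffic_matrix(X)
print(C)                       # low-pass (scaling) coefficients per flow
print(D)                       # high-pass (wavelet) coefficients per flow
print(haar_idwt(C[0], D[0]))   # reconstruction of the first flow
```

Each component has half as many columns as the original matrix, and the two matrices together are enough to reconstruct the traffic through the inverse transform.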

Deep Architecture for Low-Pass Component Prediction.
For an OD flow $x_{i,j}(t)$, we assume that the length $T$ of the series is an even number, so that the number of scaling coefficients is $T/2$. The deep architecture for prediction is plotted in Figure 6. There are $H$ hidden layers in this architecture, and both the hidden and the input layers have $T/2$ units. At the top of the deep architecture, a single neuron defined as the logistic regression is employed for prediction; the logistic regression model is made up of a hidden layer with $T/2$ units and an output layer with one unit. The deep architecture is trained by the backpropagation algorithm in our method, and its parameters are determined layer by layer [18].
In our method, we first collect $M$ training sets of scaling coefficients, denoted by $(c^{1}_{i,j}, \ldots, c^{M}_{i,j})$, whose predictors are $(\hat{c}^{1}_{i,j}, \ldots, \hat{c}^{M}_{i,j})$. By training the proposed deep architecture on this scaling coefficient training set, we obtain a relationship between the input and output scaling coefficients. We then use the scaling coefficients of $x_{i,j}(t)$ as the input to obtain a predictor of the scaling coefficients.
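The shape of the deep architecture described above can be sketched as a forward pass in plain Python. This is only a structural illustration under stated assumptions: the weights are random rather than learned, the layer-wise pretraining and the backpropagation fine-tuning are omitted, the inputs are assumed to be scaled to [0, 1] because a logistic output lies in (0, 1), and the names `init_layer`, `build_network`, and `forward` are hypothetical.

```python
import math
import random


def sigm(x):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def build_network(width, n_hidden, rng):
    """n_hidden layers of `width` units plus a single logistic output unit."""
    layers = [init_layer(width, width, rng) for _ in range(n_hidden)]
    layers.append(init_layer(width, 1, rng))
    return layers


def forward(layers, inputs):
    """Propagate scaled scaling coefficients through every sigmoid layer."""
    activation = list(inputs)
    for weights, biases in layers:
        activation = [sigm(b + sum(w * x for w, x in zip(row, activation)))
                      for row, b in zip(weights, biases)]
    return activation[0]


# Toy setting: a series of length 8 gives 4 scaling coefficients as input,
# followed by 3 hidden layers of 4 units and one output unit.
rng = random.Random(1)
network = build_network(width=4, n_hidden=3, rng=rng)
print(forward(network, [0.2, 0.4, 0.6, 0.8]))
```

In a trained model the output would be mapped back from (0, 1) to the original range of the scaling coefficients, which requires storing the scaling used on the inputs.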

Spatiotemporal Compressive Sensing for High-Pass Component Prediction.
For the discrete wavelet transform coefficients, we use SRMF, which is a matrix-oriented interpolation algorithm. Therefore, unlike the prediction of the scaling coefficients, where each OD flow is predicted independently, we predict the discrete wavelet transform coefficients of all OD flows at the same time. According to (7), the discrete wavelet transform coefficients of all OD flows constitute a matrix, denoted by $D$ in this paper. The matrix $D$ consists of two portions: one comes from the training data obtained from the measured network traffic, and the other needs to be predicted. We denote the final prediction result by $\hat{D}$, which is obtained from a regularized optimization model in which $\|\cdot\|_F$ and $(\cdot)^{T}$ denote the Frobenius norm and transposition, respectively. The temporal constraint matrix of this model (shown by (9)) describes the temporal neighbors.
The factor matrices $L$ and $R$ of the model come from the singular value decomposition of the matrix $\hat{D}$; that is,
$$\hat{D} = U \Sigma V^{T}, \tag{10}$$
where $U$ and $V$ are two unitary matrices and $\Sigma$ is a diagonal matrix whose diagonal elements are the singular values of $\hat{D}$. The matrices $L$ and $R$ are equal to $U\Sigma^{1/2}$ and $V\Sigma^{1/2}$, so that $\hat{D} = LR^{T}$. Finally, from the predictors of the scaling and discrete wavelet transform coefficients, we predict the network traffic by the inverse discrete wavelet transform. Algorithm 1 gives the details of our method.
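The prediction step for the high-pass component can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the objective (a masked fit term, Frobenius regularization of the two factors, and a temporal smoothness term) follows the general SRMF idea of [19], while the exact form of the optimization model, the first-difference temporal matrix, the weight `lam`, the rank, and the plain gradient descent solver with backtracking are all choices made here for illustration. Entries to be predicted are marked by `mask == 0`, matching the view of predicted elements as consecutive missing elements.

```python
import random


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def sq_norm(A):
    """Squared Frobenius norm."""
    return sum(v * v for row in A for v in row)


def temporal_matrix(n_slots):
    """Assumed temporal constraint: row r computes X[:, r] - X[:, r + 1]."""
    return [[1.0 if c == r else -1.0 if c == r + 1 else 0.0
             for c in range(n_slots)] for r in range(n_slots - 1)]


def objective(L, R, D, mask, Tm, lam):
    """Masked fit + Frobenius regularization + temporal smoothness."""
    X = matmul(L, transpose(R))
    fit = sum(m * (x - d) ** 2
              for xr, dr, mr in zip(X, D, mask)
              for x, d, m in zip(xr, dr, mr))
    smooth = sq_norm(matmul(X, transpose(Tm)))
    return fit + lam * (sq_norm(L) + sq_norm(R)) + smooth


def gradients(L, R, D, mask, Tm, lam):
    """Gradients of the objective with respect to L and R."""
    X = matmul(L, transpose(R))
    smooth = matmul(matmul(X, transpose(Tm)), Tm)
    G = [[2.0 * m * (x - d) + 2.0 * s
          for x, d, m, s in zip(xr, dr, mr, sr)]
         for xr, dr, mr, sr in zip(X, D, mask, smooth)]
    gL = [[g + 2.0 * lam * v for g, v in zip(gr, lr)]
          for gr, lr in zip(matmul(G, R), L)]
    gR = [[g + 2.0 * lam * v for g, v in zip(gr, rr)]
          for gr, rr in zip(matmul(transpose(G), L), R)]
    return gL, gR


def shifted(M, g, lr):
    """Return M - lr * g."""
    return [[m - lr * gv for m, gv in zip(mr, gr)] for mr, gr in zip(M, g)]


def predict(D, mask, rank=2, lam=0.01, iters=200, seed=0):
    """Fill masked entries (mask == 0) of D with entries of L R^T."""
    rng = random.Random(seed)
    L = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in D]
    R = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in D[0]]
    Tm = temporal_matrix(len(D[0]))
    best = objective(L, R, D, mask, Tm, lam)
    for _ in range(iters):
        gL, gR = gradients(L, R, D, mask, Tm, lam)
        lr = 1.0
        while lr > 1e-12:  # backtracking: halve the step until the objective drops
            new_L, new_R = shifted(L, gL, lr), shifted(R, gR, lr)
            value = objective(new_L, new_R, D, mask, Tm, lam)
            if value < best:
                L, R, best = new_L, new_R, value
                break
            lr /= 2.0
    return matmul(L, transpose(R)), best


# Toy wavelet-coefficient matrix: 3 OD flows, 6 slots; the last column is unknown.
D = [[0.5, 0.4, 0.6, 0.5, 0.4, 0.0],
     [-0.2, -0.1, -0.3, -0.2, -0.2, 0.0],
     [1.0, 0.9, 1.1, 1.0, 1.0, 0.0]]
mask = [[1.0] * 5 + [0.0] for _ in D]
D_hat, value = predict(D, mask)
print([round(row[-1], 3) for row in D_hat])  # predicted last column
```

Because the fit term ignores the masked column, its values are determined by the low-rank structure and the temporal smoothness term, which pulls each predicted entry toward its neighbor in time.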

Simulation Results and Analysis
This section verifies the performance of our prediction method. In our simulations, a real network traffic data set with 2016 time slots is sampled on a time scale of 5 minutes. For accurate prediction, the first 2000 time slots are used as the prior information to train the deep architecture and to form the matrix $D$. We compare our method with three state-of-the-art methods in the network traffic prediction field, that is, the principal component analysis (PCA) method [20], the Tomogravity method [21], and the SRMF method [19]. The proposed method is implemented in MATLAB on a single machine with a Core i5 central processing unit, 4 GB memory, and 1664 MB graphics processing unit (GPU) memory. We use $H = 8$ hidden layers and $M = 800$ training sets.
We first plot the real network traffic versus the predictors obtained from the four methods. Figure 7 displays the prediction results of our method; the x-axis and y-axis denote the predictors and the real network traffic, respectively. From Figure 7, we see that our method has low prediction bias for small network traffic flows, whereas it shows positive prediction bias for large network traffic. The same conclusion can be drawn from Figure 8 for PCA: for large network traffic, it also has positive prediction bias, and for small network traffic its prediction bias is much larger. Tomogravity has consistently positive prediction bias for large network traffic, as shown in Figure 9, while for small network traffic its prediction error is satisfactory. For large network traffic, SRMF in Figure 10 shows both positive and negative prediction bias.
Next, we use the spatial and temporal relative errors as metrics to compare the four methods. The spatial relative error (SRE) and the temporal relative error (TRE) are defined as
$$\mathrm{SRE}(n) = \frac{\sqrt{\sum_{t=1}^{T} \big(\hat{x}_n(t) - x_n(t)\big)^2}}{\sqrt{\sum_{t=1}^{T} x_n(t)^2}}, \qquad \mathrm{TRE}(t) = \frac{\sqrt{\sum_{n=1}^{N^2} \big(\hat{x}_n(t) - x_n(t)\big)^2}}{\sqrt{\sum_{n=1}^{N^2} x_n(t)^2}}, \tag{11}$$
where $x_n(t)$ and $\hat{x}_n(t)$ are the $n$th end-to-end network traffic flow and its predictor. As mentioned above, $i, j \in \{1, 2, \ldots, N\}$; thus the number of OD flows is $N^2$. Figure 11(... In addition, in contrast to the high-pass component, this component has a significant effect on prediction. Hence, the weak improvement in SRE is caused by predicting the high-pass component using the Spatiotemporal Compressive Sensing method.
A small error or bias does not by itself make a prediction method useful: a method with low error still fails to provide precise predictors when its variance is high. Hence, the standard deviation is included in our simulation as a metric for variance, which is defined as
$$\mathrm{sd}(n) = \sqrt{\frac{1}{T-1} \sum_{t=1}^{T} \big(\hat{x}_n(t) - x_n(t) - \mathrm{bias}(n)\big)^2}, \tag{12}$$
where $\mathrm{bias}(n) = (1/T) \sum_{t=1}^{T} (\hat{x}_n(t) - x_n(t))$. In (12), $T$ is the length of the predicted traffic data set. Figure 13 shows the bias versus the standard deviation of the four methods. We find that the four methods perform very differently with respect to variance. Tomogravity has larger variance than the other methods, and the PCA method also exhibits relatively high variance. The DBNSTCS and SRMF methods show relatively low variance.
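The error metrics used in this section can be sketched in Python as follows. The sketch assumes the common definitions of these metrics: the spatial relative error is the relative Euclidean error of one OD flow over time, the temporal relative error is the relative Euclidean error of one time slot over all flows, and the standard deviation is the sample standard deviation of the per-flow prediction error around its bias. The function names and the toy numbers are illustrative, and flows or slots with all-zero traffic are assumed not to occur.

```python
import math


def sre(actual, predicted):
    """Spatial relative error: one value per OD flow (row)."""
    out = []
    for x_row, p_row in zip(actual, predicted):
        num = math.sqrt(sum((p - x) ** 2 for x, p in zip(x_row, p_row)))
        den = math.sqrt(sum(x ** 2 for x in x_row))
        out.append(num / den)
    return out


def tre(actual, predicted):
    """Temporal relative error: one value per time slot (column)."""
    return sre([list(c) for c in zip(*actual)],
               [list(c) for c in zip(*predicted)])


def bias(actual, predicted):
    """Mean prediction error of each OD flow."""
    return [sum(p - x for x, p in zip(x_row, p_row)) / len(x_row)
            for x_row, p_row in zip(actual, predicted)]


def sd(actual, predicted):
    """Sample standard deviation of the per-flow prediction error."""
    out = []
    for x_row, p_row, b in zip(actual, predicted, bias(actual, predicted)):
        errs = [p - x for x, p in zip(x_row, p_row)]
        out.append(math.sqrt(sum((e - b) ** 2 for e in errs) / (len(errs) - 1)))
    return out


# Toy data: 2 OD flows, 2 predicted time slots (made-up values).
actual = [[3.0, 4.0], [6.0, 8.0]]
predicted = [[3.0, 4.0], [6.0, 3.0]]
print(sre(actual, predicted))
print(tre(actual, predicted))
print(bias(actual, predicted))
print(sd(actual, predicted))
```

Reporting bias and standard deviation together separates a systematic offset, which could be corrected, from the spread of the errors, which could not.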
Finally, the performance improvement ratio is shown in Figure 14 as an overall evaluation. The performance improvement ratio is defined in terms of $\hat{x}^{a}_{i,j}(t)$ and $\hat{x}^{b}_{i,j}(t)$, which denote the predictors obtained via the algorithms $a$ and $b$, respectively. The performance improvement ratios of DBNSTCS over PCA, Tomogravity, and SRMF are 68.74%, 5.24%, and 14.70%, respectively.

Conclusions and Future Work
This paper focuses on the problem of network traffic prediction in wireless mesh backbone networks and proposes a hierarchical prediction method, which divides the network traffic into two components and predicts each component with a different model. Specifically, the proposed method takes advantage of the DBN and Spatiotemporal Compressive Sensing. The DWT is applied to divide the network traffic into the long-range dependence component and the fluctuation component, represented by the low-pass and high-pass components, respectively. A deep architecture consisting of a DBN layer and a logistic regression architecture is proposed to predict the low-pass component, while the high-pass component is predicted by the SRMF method, which can capture its spatiotemporal characteristic. We assess the performance of the proposed prediction method and compare it with three methods that are widely used for network traffic prediction. According to the simulation, our method performs well in terms of prediction error, especially in TRE. The main bottleneck of wireless mesh backbone network traffic prediction is the prediction accuracy for the irregular fluctuations of network traffic. Thereby, a prediction algorithm aimed at low-pass components is necessary in the future.

(ii) We propose a deep architecture based on DBN to capture the low-pass component of network traffic. By training the deep architecture on a training set built from known network traffic, it can describe the LRD characteristic of traffic flows and carry out a prediction for network traffic.
Figure panel: High-pass component of OD 105.

Figure 7: Real traffic data versus their predictors via DBNSTCS.

Figure 8: Real traffic data versus their predictors via PCA.