Spatiotemporal Self-Attention-Based Network Traffic Prediction in IIoT

Sixth-generation (6G) mobile communication is regarded as the network of the future and is closely tied to the Industrial Internet of Things (IIoT) because of its low latency and high throughput. The massive number of nodes supported by 6G adds to the complexity of the network, and heterogeneous traffic makes network management more difficult. Long-term network traffic matrix (TM) prediction is a crucial technology for realizing network edge intelligence and dealing with these issues. However, predicting long-term network traffic in heterogeneous IIoT is challenging. Because of its powerful feature extraction capability over long sequences, self-attention is widely applied in language inference tasks. Motivated by these observations, we propose a self-attention traffic matrix prediction (SATMP) model for long-term network TM prediction in IIoT scenarios. SATMP consists of three components: (a) a spatial-temporal encoding for obtaining the spatial-temporal features of the network TM; (b) a learnable positional encoding for providing positional correlation to the traffic sequence; and (c) a self-attention module for capturing long-term dependence. These components work together to enhance long-term prediction performance in complex networks. Extensive experiments on three publicly available datasets demonstrate that SATMP is feasible and accurate in IIoT long-term network TM prediction.


Introduction
The sixth generation (6G) of mobile communications has superior features to previous generations, such as low-latency communication, high throughput, and massive connectivity [1]. These advantages will bring new developments to the Industrial Internet of Things (IIoT), especially the wide application of edge intelligence [2,3]. Edge intelligence combines edge computing and artificial intelligence: it deploys machine learning algorithms on edge devices, closer to the data sources. Edge intelligence relies on edge devices cooperating with other devices to complete tasks [4], which shortens the data transmission delay. However, IIoT is characterized by large-scale heterogeneity and requires high fault tolerance. The application of edge intelligence will make computing, storage, and communication in IIoT more complex. These changes not only raise the requirements for network quality of service (QoS), but also bring more challenges to network management:
(i) Data transfer security and stability: IIoT calculates, generates, and stores large amounts of production and user data. A secure and reliable network is needed to protect industrial data and user privacy [5,6]. In addition, IIoT networks are densely connected and rely on expensive network equipment, which must be protected against industrial accidents caused by network latency and network attacks.
(ii) Efficient resource allocation: IIoT has a large number of sensor devices, and the amount of data that the nodes need to collect and transmit increases significantly. The computing, storage, and communication capabilities of individual nodes are limited in IIoT. In this situation, deploying a complex algorithm framework may lead to higher network latency and energy consumption [7].
(iii) High heterogeneity of the network: 6G enables IIoT to have a large number of mobile devices and sensors.
Figure 1 shows the edge network traffic prediction in IIoT.
We list three scenarios for IIoT: smart cars, smart grids, and smart factories. Wireless base stations are connected through backbone links. Base stations on the edge provide wireless access to various nodes, such as smart vehicles, industrial sensors, industrial equipment, and electricity pylons. At the same time, UAV-assisted communication is available in some complex areas [14]. Wireless networks are also used to communicate between nodes. The heterogeneous IIoT scenario makes it difficult to predict network traffic. The proposed self-attention traffic matrix prediction (SATMP) can be deployed on edge computing nodes, such as wireless base stations, vehicles, UAVs, and so forth. Edge intelligence can quickly process computing tasks on edge without being limited to the network conditions of cloud computing [15]. IIoT can be better managed by referring to the predicted results. For example, primary routing nodes can reasonably allocate bandwidth and channels to improve resource utilization. Factories and grids can detect attacks based on changes in the network traffic pattern.
In essence, the prediction of network TM is a time-series prediction problem. However, the network TM series in IIoT differs significantly from other time series. First, network traffic in IIoT is stochastic and dynamic, which leads to more complex statistical characteristics. Existing studies have shown that network traffic exhibits self-similarity, long-term dependence, a heavy-tailed distribution [16], and other characteristics. Second, network traffic is affected by factors other than time. For example, backbone networks in IIoT are affected by network topology and routing settings, while wireless communication is influenced by the movement patterns and specific functions of mobile nodes. All of these factors make IIoT traffic prediction difficult. In addition, long-term prediction is a further challenge, which causes some models to lose their predictive ability.
The attention mechanism originated from human vision. It assigns different weights to different parts of the input. The self-attention mechanism is a variant of the attention mechanism. It was first proposed by Lin et al. [17] for extracting an interpretable sentence embedding. The self-attention mechanism reduces the dependence on external information and better captures the internal correlation of data. In recent years, the self-attention mechanism has made remarkable achievements in the natural language processing (NLP) field and achieved leading results in tasks such as text translation and language information [18,19]. With a reasonable position encoding, the self-attention module can process the data of different time steps in parallel without feeding them through a recurrent neural network. It significantly shortens the distance between any two inputs, which enhances the ability to learn long-term dependence.
Motivated by the above analysis, we utilize the self-attention mechanism for IIoT traffic prediction and propose a network TM prediction model based on spatiotemporal features. We first add spatial and temporal encoding to the data to effectively utilize the spatiotemporal information of the network TM. Then, the self-attention mechanism is used to learn the spatiotemporal characteristics. In the encoder, we design a learnable positional encoding so that the model can perceive the order of the input data. Meanwhile, positional encoding is used to ensure the parallelization of training. The main contributions of this paper are as follows:
(i) We propose SATMP, a novel model based on the self-attention mechanism for long-term network TM prediction in IIoT. The self-attention mechanism with spatiotemporal encoding is used to learn the complex associations of network traffic.
(ii) We design a learnable positional encoding scheme to provide position correlation for the long input sequence, enhancing the model's awareness of input order.
(iii) We evaluate the performance of SATMP on three publicly available datasets. The results show that SATMP is feasible and outperforms other methods in prediction accuracy.
The rest of this paper is organized as follows. Section 2 overviews the studies related to traffic prediction and describes their characteristics. In Section 3, we first provide a detailed definition of long-term IIoT traffic matrix prediction and then elaborate on the related challenges. The framework and details of the proposed algorithm are discussed in Section 4. In Section 5, we introduce the datasets we use, describe the experiment's specific details, and analyze the experimental results. Finally, we summarize the research of this paper and discuss our future work.

Related Work
Network traffic prediction is one of the hot topics in network characterization and measurements. It has attracted the wide attention of researchers, and the related literature is abundant in some specific contexts.
Many studies use statistical models to predict network traffic. For example, Moayedi and Masnadi-Shirazi [20] decompose network traffic into normal and abnormal parts and use the autoregressive integrated moving average model (ARIMA) to predict network traffic and detect anomalies. Wang et al. [21] propose a traffic flow modeling and prediction method based on an autoregressive moving average model (ARMA), which is easy to calculate. Autoregression-based models use a linear combination of historical sequence data to generate prediction results. They cannot model the nonlinear features of the flow and require the sequence to be reasonably stable. Kim [22] uses integer-valued generalized autoregressive conditional heteroscedasticity (INGARCH) to capture the nonlinear characteristics of network traffic. Bayati et al. [23] use Gaussian process regression (GPR) to predict the flow and use self-similar covariance functions to improve the prediction accuracy. However, the above models usually do not take into account the spatial connection or interdependence of network nodes. The 6G-enabled IIoT tends to contain many more nodes and links, which limits the use of these statistics-based methods in more general problems.
At present, the mainstream methods are based on machine learning and deep learning. Nikravesh et al. [24] compare the performance of support vector machines (SVM), multilayer perceptron with weight decay (MLPWD), and multilayer perceptron (MLP) in traffic prediction tasks. Jain and Prasad [25] use the XGBoost algorithm to predict the traffic of a telecom network in peak hours. Machine learning algorithms can mine some complex network traffic patterns, but compared with deep learning, their feature extraction ability is relatively limited. Many researchers have used deep learning algorithms with stronger feature extraction ability, such as the convolutional neural network (CNN) [26], deep belief network (DBN) [27], recurrent neural network (RNN) [28], long short-term memory network (LSTM) [29], and meta-learning methods [30]. Some scholars specifically study traffic prediction methods in IIoT. For instance, Nie et al. [31,32] design two prediction methods for complex and heterogeneous IIoT, based on multitask learning and reinforcement learning, respectively. Compared with linear and machine learning methods, deep learning models are more complex and can better model the time dependence of traffic flow series. RNN, which shares parameters along the time dimension, is well suited to processing time series. However, the recurrent structure of RNN is prone to vanishing and exploding gradients when too many units are used. LSTM adopts a cell state and three gates (an input gate, an output gate, and a forget gate) to alleviate these problems. It can effectively learn the long-term correlation characteristics of the traffic. The gated recurrent unit (GRU) is a variant of LSTM that reduces the number of gates to two. LSTM and GRU have obtained better prediction results and have become widely used traffic prediction models in recent years [33,34].
Some researchers combine different models to improve performance. Compared with using these models alone, the hybrid model improves the prediction results. For example, Li et al. [35] propose a method combining wavelet transform and an artificial neural network (ANN), which uses the wavelet transform to decompose time-domain traffic and relies on the ANN to add nonlinear prediction ability and reduce the prediction error. Similarly, Tian et al. [36] use the Mallat wavelet transform algorithm to decompose the network traffic, then use ARIMA and a least-squares support vector machine (LSSVM) to predict the different components. Zhang et al. [37] combine wavelet transform with LSTM and reduce the prediction error. Zhao et al. [38] and Tian et al. [39] decompose complex network data into low-frequency smooth sequences through empirical mode decomposition. They then use LSTM and ARMA, respectively, for prediction, effectively improving the prediction performance.
The literature analysis shows that more and more researchers have preferred neural networks and their hybrid models in recent years. RNN, LSTM, and their variant networks can better model the network traffic sequence because of their solid time-dependent learning ability. Compared with linear and machine learning models, those models have a significant advantage in prediction accuracy.
In addition, the new research trend is the application of attention mechanisms in the network structure to enhance the model's attention to spatiotemporal correlation. For example, Feng et al. [40] propose a network traffic prediction model with attention mechanisms to capture long and complex dependencies. Zhao et al. [41] propose a spatial-temporal attention-CNN to effectively obtain the dynamic spatiotemporal correlations of cellular networks. In order to fully learn the characteristics of the IIoT long-term network TM sequence, we apply the self-attention mechanism to network TM prediction. We aim to improve the long-term prediction performance by using the powerful long-term feature extraction capability of the self-attention mechanism.

Problem Formulation
This section elaborates on the IIoT long-term TM problem and related challenges. We first define the long-term prediction problem of network TM in IIoT. Then, we analyze the challenges of the research problem from three aspects: multivariate, spatial-temporal correlation, and long-term dependence.

Long-Term IIoT TM Prediction.
To facilitate the subsequent consideration of topological and spatial factors, we use a weighted directed graph to model the IIoT network. Considering an IIoT with $N$ nodes connected by $H$ links, we establish a directed graph $G = (V, I)$, where $V$ is the set of nodes and $I$ is the set of links. For the backbone network, if there is a link between node $i$ and node $j$, then $I$ contains two edges, $(v_i, v_j)$ and $(v_j, v_i)$, with different directions. The weight of each link represents the route weight of the link. For wireless networks, we assume a direct connection between any two study nodes regardless of their link weight. The IIoT network TM contains the traffic flow that each origin node sends to every destination node within a certain interval. The definitions covered in this paper are as follows:

$$X_t = \left[ m^t_{1,1}, m^t_{1,2}, \dots, m^t_{i,j}, \dots, m^t_{N,N} \right],$$

where $N$ is the number of nodes in the IIoT network and $m^t_{i,j}$ represents the traffic flow sent from node $i$ to node $j$ in the $t$-th sampling interval.
Assuming that $T$ traffic matrices are obtained, the whole series can be denoted as $TM = \{X_1, X_2, \dots, X_T\} \in \mathbb{R}^{N^2 \times T}$. The prediction task in this paper is then to use the historical sequence $X = \{X_{t-\omega+1}, \dots, X_t\}$ to predict the future sequence $Y = \{X_{t+1}, \dots, X_{t+l}\}$, where $\omega$ is the historical length and $l$ is the traffic length to be predicted. The formula is as follows:

$$\hat{Y} = f(X, R, P),$$

where $R$ is the spatial correlation of the network and $P$ is the timestamp. We predict longer lengths than previous work [33,42]. The goal is to build a model that achieves a low error between the predicted value $\hat{Y}$ and the ground truth $Y$.
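As a small illustration of this data layout, the sketch below flattens each $N \times N$ traffic matrix into a vector of length $N^2$ and stacks $T$ of them into an $N^2 \times T$ series. The function names and the two-node example values are hypothetical, and row-by-row flattening is an assumption of this sketch; the paper only states that the series lives in $\mathbb{R}^{N^2 \times T}$.

```python
def flatten_tm(matrix):
    """Flatten an N x N traffic matrix row by row into a vector of length N*N."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("traffic matrix must be square")
    return [value for row in matrix for value in row]


def stack_series(matrices):
    """Stack T flattened matrices as columns, giving an (N*N) x T series."""
    columns = [flatten_tm(m) for m in matrices]
    return [list(row) for row in zip(*columns)]


# Two sampling intervals of a hypothetical 2-node network.
m1 = [[0.0, 1.5], [2.0, 0.0]]
m2 = [[0.0, 1.0], [2.5, 0.0]]
tm = stack_series([m1, m2])  # 4 rows (origin-destination pairs), 2 columns (intervals)
```

Each row of `tm` is then the time series of one origin-destination pair, which is the unit the model has to predict jointly.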

Challenges.
In addition to the challenges of short-term time-series prediction, such as the influence of multiple factors, prediction lag, and high randomness, the long-term prediction of IIoT TM faces further challenges due to the complexity of network traffic itself and the influence of latent factors:
(i) Multiple dimensions: the network TM represents all the origin-destination traffic in the IIoT network. Assuming an IIoT has $N$ nodes, there are $N^2$ traffic flows in the network TM. Compared with single-dimensional time series, multidimensional traffic prediction demands higher prediction performance and more hardware resources. Models must have strong predictive power to capture deeper features and predict all dimensions simultaneously. In addition, nodes in IIoT have limited power and computing capacity [43], so the time and space complexity of the prediction model should not be high.
(ii) Complex temporal and spatial associations: some IIoT applications, such as smart vehicles and smart logistics, are designed to provide services for human beings. Their traffic is closely related to human activities, so there is a temporal correlation between future traffic and historical traffic. However, the temporal correlation pattern appears at different granularities, such as quarter, month, and week. In other cases, IIoT traffic is highly bursty at fine granularity, which makes the association pattern difficult to learn. In addition, the change in network traffic is affected by spatial associations, such as other network nodes, network topology, routing algorithm, and link bandwidth. For example, many backbone networks use the Open Shortest Path First (OSPF) gateway protocol, which generally uses a shortest path algorithm, such as Dijkstra, to construct a routing table. Network packets choose the route to the destination node according to the routing table, which affects the traffic changes of different nodes and links.
(iii) Long-term dependencies: long-term dependencies are difficult to capture. The prediction target of many previous studies is the network traffic at the next interval [32,44], while the purpose of our study is to predict the change of network traffic over a long period (such as the next 48 hr). Long-term prediction requires a stronger ability to model long-term dependencies, and many existing models are limited in this respect. For instance, RNN and LSTM have the advantage of processing time-series data, but their performance decreases in long-term prediction tasks. For example, Zhou et al. [45] show that as the prediction length increases, the inference time of LSTM becomes longer and the prediction error increases.
In view of the above challenges, we design a more efficient prediction scheme for IIoT based on the self-attention mechanism. We use the parallelism of the self-attention mechanism to improve the processing of multidimensional data. In addition, we design spatiotemporal encoding to provide the characteristics of IIoT network TM sequences. Last but not least, we design learnable position encoding to provide time correlations for long input sequences. The framework and details of the model are explained in Section 4.

System Model
In this section, we first introduce the main framework of the proposed model. Then we discuss the various modules in our proposed work, including spatiotemporal encoding, learnable positional encoding, self-attention module, feedforward layer, and output layer. The main notations covered in this article and their descriptions are summarized in Table 1. Figure 2 shows the main framework of the prediction model with the spatiotemporal self-attention mechanism proposed in this paper. The model's input consists of three parts: network TM, traffic flow timestamp, and network topology and route weight. We first add spatial and temporal encodings to the preprocessed flow matrix sequence. Then the sequence is input into the stacked encoders to learn the temporal and spatial characteristics and dependent correlations. Every encoder consists of learnable positional encoding, a multihead self-attention module, and a feedforward neural network. The positional encoding provides position relations for input sequences. The temporal and spatial dependence of different sequences is learned by the self-attention modules. The feedforward neural networks provide nonlinear transformations for each encoder. Residual connection is used between the three parts to prevent network degradation and enhance network stability. Finally, the fully connected layer is used to output the prediction results.

Spatiotemporal Encoding.
In urban traffic flow prediction, the spatial dependencies between road segments are of great importance and have a significant influence on the change of traffic flow [46]. Unlike in the transportation system, the relative position of nodes in the network is not essential, while the interactive relationship between nodes or regions has a more significant impact on network traffic. In the backbone network scenario, we utilize the network topology and routing protocol to build the spatial encoding. The network topology and route weights are used to calculate the shortest path matrix of the network. The transformed routing matrix is added into the model as spatial routing encoding to enhance the model's perception of spatial dependence. For graph $G$, we first use the shortest path algorithm corresponding to the network, such as Dijkstra, to calculate the routing distance of each pair of origin-destination nodes according to the weights of $I$. The routing matrix is denoted by $R$. Then we flatten $R$ in the same way as we map the TM. Since the path selection strategy prefers shorter paths, we invert every element in $R$. Then we normalize $R$ to prevent interference with the original data.
In the wireless network scenario, we build the spatial encoding of the network TM by utilizing the functional area type of each telecommunication region. First, we use Google Maps to find the points of interest (PoIs) of each study region. Next, we determine the functional area types of all study regions by considering the PoIs and the CORINE Land Cover (CLC) map [47]. CLC provides information on land cover and land cover change across Europe. Based on satellite images, the land is divided into urban fabric, industrial or commercial fabric, green urban areas, and so forth. The dataset of wireless cellular network used in this paper was collected from 2013 to 2014, so the  Table 2. Then, we list the functional area pairs formed by source and destination region and number them.
Finally, we construct the spatial encoding of functional areas $R = \{F_{11}, F_{12}, \dots, F_{NN}\}$, where $F_{ij}$ represents the number of the functional area pair from region $i$ to region $j$. We also normalize $R$ as we do for the backbone network dataset.
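The backbone routing encoding described above (shortest-path distances, flattening, inversion, normalization) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names and the three-node topology are hypothetical, normalization by the maximum value is an assumption (the text does not specify the scheme), and mapping self-pairs and unreachable pairs to 0 is also a choice made only for this sketch.

```python
import heapq


def shortest_distances(n, links, source):
    """Dijkstra over a weighted directed graph given as (u, v, weight) links."""
    adj = {u: [] for u in range(n)}
    for u, v, w in links:
        adj[u].append((v, w))
    dist = [float("inf")] * n
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def routing_encoding(n, links):
    """Flattened, inverted, max-normalized routing distances (length n*n)."""
    flat = []
    for src in range(n):
        flat.extend(shortest_distances(n, links, src))
    # Invert so that shorter routes get larger values. Self-pairs (distance 0)
    # and unreachable pairs are set to 0.0: an assumption of this sketch.
    inv = [1.0 / d if 0 < d < float("inf") else 0.0 for d in flat]
    peak = max(inv)
    return [x / peak for x in inv] if peak > 0 else inv


# Hypothetical 3-node backbone; each physical link appears in both directions.
links = [(0, 1, 1.0), (1, 0, 1.0), (1, 2, 2.0), (2, 1, 2.0)]
encoding = routing_encoding(3, links)
```

The resulting vector has one entry per origin-destination pair, in the same order as the flattened TM, so it can be added to the input element-wise.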
Network traffic series are highly correlated with time. Statistical models, such as ARIMA, can learn some regression features of previous time series. However, the features they learn do not correspond to specific times. For example, there is usually a peak in network traffic between 9 and 10 am, which corresponds to people's working hours. We add readily available timestamp encoding as features to the network TM so that the model can learn the influence of different times on traffic changes. In other words, we enhance the model's ability to perceive different timestamps rather than just learning the periodic changes of traffic as time steps move backward. One-hot is a commonly used encoding method for discrete temporal features. However, one-hot encoding is sparse. When considering multidimensional time (month, day, and week), concatenating one-hot encodings yields large dimensions. Using word embedding encoding ensures that the time encoding dimension of each level is the same as that of the input data. For example, one component $X_t$ of the input sequence represents the network TM sampled at time $t$. It has $d$ dimensions, which means there are $d$ origin-destination traffic flows in the network. Assume that its timestamp is $P$, for example, "2021-10-22 15:25". The encoding of the timestamp is shown in Figure 3. We first decompose $P$ at different granularities, for example, year, month, day, hour, minute, and day of the week. Then the components are fed into different fully connected layers with different embedding dimensions and mapped into $d$-dimensional embeddings. Finally, the time embeddings are added together as the final timestamp encoding. We design timestamp encodings of different granularities to provide time markers of different spans for the subsequent self-attention modules. We expect the self-attention module to learn different traffic patterns through these markers, such as weekly and daily patterns.
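The timestamp encoding can be illustrated with the sketch below, which decomposes a timestamp into several granularities, looks up one $d$-dimensional vector per granularity, and sums them. Several things here are assumptions made for illustration only: lookup tables with random values stand in for the trained fully connected layers (a table lookup is equivalent to a bias-free linear layer applied to a one-hot input), the year component is omitted because its range is open-ended, and all names (`decompose`, `make_tables`, `encode_timestamp`) are hypothetical.

```python
import random
from datetime import datetime

# Number of distinct values per granularity (year omitted in this sketch).
SIZES = {"month": 12, "day": 31, "hour": 24, "minute": 60, "weekday": 7}


def decompose(stamp):
    """Split a 'YYYY-MM-DD HH:MM' timestamp into zero-based integer features."""
    t = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
    return {"month": t.month - 1, "day": t.day - 1, "hour": t.hour,
            "minute": t.minute, "weekday": t.weekday()}


def make_tables(d, seed=0):
    """One (size x d) table per granularity; random values stand in for trained weights."""
    rng = random.Random(seed)
    return {name: [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(size)]
            for name, size in SIZES.items()}


def encode_timestamp(stamp, tables):
    """Look up one d-dimensional row per granularity and add them together."""
    d = len(next(iter(tables.values()))[0])
    out = [0.0] * d
    for name, index in decompose(stamp).items():
        out = [a + b for a, b in zip(out, tables[name][index])]
    return out


tables = make_tables(d=4)
p = encode_timestamp("2021-10-22 15:25", tables)  # one 4-dimensional encoding
```

Because every granularity maps to the same dimension $d$, the encoding can be added directly to a TM vector of the same length, which is the property the text attributes to embedding-style encoding over concatenated one-hot vectors.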
Figure 4 shows the training process of the RNN and the self-attention model, where $X_1, X_2, \dots, X_t$ is the input sequence, $H_i^j$ is the $j$-th hidden layer of time step $i$, and $Y_1, Y_2, \dots, Y_t$ is the output. In the training process of RNN, LSTM, and other recurrent neural networks, computing the current state requires the result of the previous time step, so the input sequence is fed into the network in time order. This kind of model can therefore distinguish the temporal order of the input sequence. However, in the self-attention mechanism used in this paper, the encoder receives the data of different time steps simultaneously and calculates their similarity. As a result, the model by itself fails to capture the temporal order of the input sequence.
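The order blindness of plain self-attention can be checked numerically. The sketch below uses a simplified attention with identity projections (an assumption made to keep the example short) and shows that shuffling the input time steps merely shuffles the outputs in the same way, so no information about the original order is retained.

```python
import math


def attention(x):
    """Self-attention with identity projections: softmax(x x^T / sqrt(d)) x."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[c] for w, v in zip(weights, x)) for c in range(d)])
    return out


seq = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]   # three time steps, d = 2
shuffled = [seq[2], seq[0], seq[1]]          # same steps in a different order
out = attention(seq)
out_shuffled = attention(shuffled)

# The shuffled output is just the original output in shuffled order.
reordered = [out[2], out[0], out[1]]
same = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
           for r, s in zip(out_shuffled, reordered) for a, b in zip(r, s))
```

Here `same` evaluates to `True`, which is why an explicit positional signal has to be added to the inputs.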

Learnable Positional Encoding.
Adding positional encoding (shown as $P_i$ in Figure 4) to the encoder is an effective solution to the above problem: it provides the encoder with the position relations of the sequence and ensures that the model trains normally. In addition, positional encoding preserves the parallelization of training and speeds it up. Vaswani et al. [19] use a fixed positional encoding consisting of sine and cosine functions. In time-series prediction, the common data processing method is the sliding window. As long as a data point has not moved out of the window, it remains in the following input, only shifted forward by one unit. Fixed position encoding is weak at learning such relative position relationships, so we use another position encoding: learnable position encoding. We add different positional encodings for the input sequence in different encoders. First, we construct the tensor $L = \{L_1, L_2, \dots, L_d\}$, where $d$ is the dimension of the input sequence. We compare several initialization methods, such as normal distribution initialization, Xavier initialization, and uniform distribution initialization. The choice of initialization method has little effect on the results. We choose the uniform distribution, and the positional encodings are initialized within $[-1, 1]$. The learnable positional encoding is trained along with the whole structure. As shown in Figure 4, the added positional encoding provides the position relations. Multilayer encoders contain multiple levels of position encoding, and we expect this to improve the model's ability to learn position relationships at different levels.
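A learnable positional encoding can be sketched as a trainable table that is initialized uniformly in $[-1, 1]$, added to the input, and updated by gradient descent together with the rest of the model. The class below is a minimal, framework-free illustration; its name, the table shape (one $d$-dimensional vector per position), and the plain SGD update are assumptions, not details taken from the paper.

```python
import random


class LearnablePositionalEncoding:
    """A trainable (length x d) table added element-wise to the input sequence."""

    def __init__(self, length, d, seed=0):
        rng = random.Random(seed)
        # Uniform initialization within [-1, 1], as described in the text.
        self.weight = [[rng.uniform(-1.0, 1.0) for _ in range(d)]
                       for _ in range(length)]

    def forward(self, x):
        """Add the encoding of position i to the i-th input vector."""
        return [[a + b for a, b in zip(row, pos)]
                for row, pos in zip(x, self.weight)]

    def sgd_step(self, grad, lr=0.1):
        """Since the table is simply added to the input, its gradient equals
        the gradient of the loss with respect to this layer's output."""
        self.weight = [[w - lr * g for w, g in zip(w_row, g_row)]
                       for w_row, g_row in zip(self.weight, grad)]


encoding = LearnablePositionalEncoding(length=3, d=2)
shifted = encoding.forward([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
```

Identical inputs at different positions now map to different vectors in `shifted`, which gives the attention module a handle on order; in a deep learning framework the same idea is usually expressed as a trainable parameter tensor.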

Self-Attention Module.
According to the previous analysis, the temporal and spatial correlation between the input and target sequences is significant in network TM prediction. When the input sequence and the prediction sequence are long, it becomes difficult for the model to learn the dependencies of the sequence. The self-attention mechanism is widely used in NLP tasks. In translation tasks, the self-attention mechanism calculates the similarity between the current word and the other words in an input sentence, assigning different attention weights to different words. In this way, self-attention learns the dependency relationships in the sentence. Furthermore, the self-attention mechanism does not use recurrent structures to capture features. It uses matrix multiplication to calculate the similarity of two words. This approach shortens the distance between two words and makes it possible to capture longer dependencies. Inspired by this idea, we transfer the self-attention mechanism to the network TM prediction problem. We treat the network TM at each moment as a word and use the self-attention mechanism to calculate the dependencies of the long-term input sequence.
The self-attention mechanism essentially reflects how much each token attends to the other tokens. Figure 5 shows the calculation process of the self-attention mechanism. Suppose $X_1, X_2, \dots, X_t$ is the input sequence with spatiotemporal and positional encoding. $E$ is formulated as the sum of all the encodings:

$$E = R + P + L,$$

where $R$ is the spatial encoding, $P$ is the timestamp encoding, and $L$ is the positional encoding. First, query, key, and value matrices are generated for each token through different fully connected layers, which are denoted by $Q$, $K$, and $V$, respectively. In the case of $Q$, the equation is as follows:

$$Q = (X + E)\, W^Q, \tag{5}$$

where $W^Q$ is the weight of the fully connected layer that calculates $Q$. $K$ and $V$ are calculated by the same formula, using weights $W^K$ and $W^V$, respectively. Taking $X_2$ as an example, we multiply $Q_2$ by the $K_i$ of each time step to calculate the attention scores, which gives the attention of $X_2$ to the inputs of the other time steps. The output of this layer can be expressed as:

$$A_i = \sum_{j} a_{i,j} V_j,$$

where $i$ and $j$ are the serial numbers of tokens, and $a_{i,j}$ is the attention weight of $X_i$ to $X_j$. That is, the output is the weighted sum of the value matrices. This method can effectively capture the dependencies in a long sequence. The specific calculation process of the attention score is shown in Equation (7), where $d$ is the dimension of the network TM:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V. \tag{7}$$

The dot product of the matrices is used to calculate the attention score of the two matrices. After scaling, the attention score is multiplied by $V$ as a weight to obtain the weighted matching result.
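The computation described in this subsection (project tokens to queries, keys, and values, score by scaled dot product, normalize, and take the weighted sum of values) can be sketched in plain Python as follows. This is the standard scaled dot-product form and is offered as an illustration under stated assumptions: biases are omitted, softmax normalization is assumed, and the identity weight matrices and the three example tokens are hypothetical values chosen to make the result easy to inspect.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Return (output, weights) for softmax(Q K^T / sqrt(d)) V."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]   # a[i][j]: attention of step i to step j
    return matmul(weights, v), weights           # each output row is a weighted sum of values


# Three encoded time steps (t = 3, d = 2); identity projections for readability.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(tokens, identity, identity, identity)
```

In this example the first token is equally similar to itself and to the third token, so `weights[0][0]` equals `weights[0][2]` and both exceed `weights[0][1]`.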
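We can understand how the self-attention mechanism captures the spatiotemporal dependence by analyzing the computational process. Assume that $W^Q$ and $W^K$ are the weights of the fully connected layers forming the query matrix and the key matrix, respectively, and that $X$ is the input sequence. According to Equations (5) and (7), the attention score between time steps $i$ and $j$ is calculated as follows:

$$Q_i K_j^{\top} = (X_i + E_i)\, W^Q (W^K)^{\top} (X_j + E_j)^{\top} = \underbrace{X_i W^Q (W^K)^{\top} X_j^{\top}}_{(a)} + \underbrace{X_i W^Q (W^K)^{\top} E_j^{\top}}_{(b)} + \underbrace{E_i W^Q (W^K)^{\top} X_j^{\top}}_{(c)} + \underbrace{E_i W^Q (W^K)^{\top} E_j^{\top}}_{(d)},$$

where term (d) contains the encodings of the two sequences, $E_i$ and $E_j$, which reflects the model's learning of the spatiotemporal dependence of the two sequences.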
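The split of the attention score into the four terms (a) to (d) follows from bilinearity and can be verified numerically. The sketch below assumes that each token is the sum of the raw traffic vector $X_i$ and its encoding $E_i$ and that the projections have no bias; all matrices are small hypothetical values.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


x = [[1.0, 2.0], [0.5, -1.0]]     # raw traffic rows X_i (hypothetical)
e = [[0.1, 0.0], [-0.2, 0.3]]     # encodings E_i (hypothetical)
w_q = [[1.0, 0.5], [0.0, 2.0]]
w_k = [[0.5, -1.0], [1.0, 1.0]]
m = matmul(w_q, transpose(w_k))   # W^Q (W^K)^T


def bilinear(u, v):
    """Scores u W^Q (W^K)^T v^T for every pair of rows."""
    return matmul(matmul(u, m), transpose(v))


full = bilinear(add(x, e), add(x, e))   # scores from the encoded tokens
terms = [bilinear(x, x), bilinear(x, e), bilinear(e, x), bilinear(e, e)]  # (a)..(d)
total = [[sum(t[i][j] for t in terms) for j in range(2)] for i in range(2)]
```

`full` and `total` agree up to floating-point rounding, and the last entry of `terms` is the part of the score that depends only on the encodings.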
A multihead attention module combines several self-attention modules. Assume that the attention module uses $h$ heads, where $h$ must divide $d$, the dimension of the network TM embedding. The multihead attention module first divides the $d$ dimensions of $Q$, $K$, and $V$ into $h$ parts and uses Equation (7) to calculate the attention score of each part separately. Finally, the module concatenates the results as its final output. Multihead attention acts like ensemble learning: it helps prevent overfitting and extracts multiple kinds of features.
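The split-attend-concatenate structure of multihead attention can be sketched as follows. The sketch assumes softmax-normalized scaled dot-product attention per head, scaled by the per-head dimension, and omits the final output projection used in many implementations; the function names are hypothetical.

```python
import math


def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * vj[c] for e, vj in zip(exps, v))
                    for c in range(len(v[0]))])
    return out


def split_heads(m, h):
    """Cut the d columns of m into h contiguous blocks of d // h columns."""
    d = len(m[0])
    if d % h:
        raise ValueError("the number of heads must divide the embedding dimension")
    size = d // h
    return [[row[i * size:(i + 1) * size] for row in m] for i in range(h)]


def multihead(q, k, v, h):
    """Run attention per head and concatenate the head outputs per time step."""
    heads = [attention(qh, kh, vh) for qh, kh, vh in
             zip(split_heads(q, h), split_heads(k, h), split_heads(v, h))]
    return [[x for head in heads for x in head[t]] for t in range(len(q))]


q = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]   # two time steps, d = 4
out = multihead(q, q, q, h=2)                       # two heads of width 2
```

Each head sees only its own slice of the embedding, so different heads can attend to different time steps for the same query.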
To reduce the time and space complexity of the model, we use the self-attention mechanism with linear complexity proposed in [48]. The following equation shows its calculation process:

$$E(Q, K, V) = \rho_{\mathrm{row}}(Q) \left( \rho_{\mathrm{col}}(K)^{\top} V \right),$$

where $\rho_{\mathrm{row}}$ and $\rho_{\mathrm{col}}$ denote applying the softmax function along each row or each column of the matrix, respectively. This mechanism has $O(dn + d^2)$ memory complexity and $O(d^2 n)$ computational complexity, where $n$ is the input length and $d$ is the dimension of $Q$, $K$, and $V$. In long-term prediction, $d$ is usually much smaller than $n$, so the memory and computational complexity can be approximated as $O(n)$. Compared with the original attention, whose memory and computational complexity is $O(n^2)$, the complexity is significantly reduced.
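The linear-complexity variant changes only the order of the matrix products: the keys and values are first summarized into a $d \times d$ context matrix, which is then applied to every query. The sketch below illustrates this with row-wise softmax on $Q$ and column-wise softmax on $K$; the helper names and example values are hypothetical, and the sketch is an illustration of the idea rather than the implementation of [48].

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def row_softmax(m):
    """Softmax along each row (over the d features of one time step)."""
    return [softmax(row) for row in m]


def col_softmax(m):
    """Softmax along each column (over the n time steps of one feature)."""
    return transpose([softmax(col) for col in transpose(m)])


def efficient_attention(q, k, v):
    """rho_row(Q) (rho_col(K)^T V): never materializes the n x n attention map."""
    context = matmul(transpose(col_softmax(k)), v)   # d x d, independent of n
    return matmul(row_softmax(q), context)           # n x d


q = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]   # n = 3 time steps, d = 2
k = [[0.5, 1.0], [1.5, 0.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out = efficient_attention(q, k, v)
```

By associativity the result equals `(row_softmax(q) @ col_softmax(k).T) @ v`, whose implicit $n \times n$ weight rows each sum to 1, but the intermediate products here only ever have shapes $d \times d$ and $n \times d$.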

Feedforward and Output Layer.
The subsequent structure in the encoder is the feedforward layer, which contains two fully connected layers. The activation function of the first layer is the rectified linear unit (ReLU), which provides a nonlinear transformation. The feedforward layer can be described by the following equation:

$$\mathrm{FFN}(X) = \mathrm{ReLU}(X W_1 + b_1)\, W_2 + b_2,$$

where $X$ is the output of the multihead attention module, and $W$ and $b$ are the weights and biases of the fully connected layers, respectively. In the language translation task, word2vec or other methods are needed to encode each word first. Since the traffic TM requires no such complex embedding, we only use the encoder structure. In the output module, we use a linear layer with ReLU as its activation function to produce the prediction results. Training minimizes a loss function defined over the predicted value $\hat{Y}$ and the ground truth $Y$, where $N$ is the number of network nodes.
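The feedforward and output layers can be sketched as below. The two-layer feedforward block with ReLU on the first layer and the ReLU-activated linear output follow the description in the text; the mean squared error is only an assumed stand-in for the loss, whose exact form is not recoverable from this section. All function names and numeric values are hypothetical.

```python
def linear(x, w, b):
    """Fully connected layer: x (rows x in) times w (in x out) plus bias b."""
    return [[sum(a * c for a, c in zip(row, col)) + bias
             for col, bias in zip(zip(*w), b)] for row in x]


def relu(x):
    return [[max(0.0, v) for v in row] for row in x]


def feedforward(x, w1, b1, w2, b2):
    """Two fully connected layers with ReLU after the first one."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


def output_layer(x, w, b):
    """Linear layer followed by ReLU, so predicted traffic is never negative."""
    return relu(linear(x, w, b))


def mse(pred, target):
    """Mean squared error; an assumed loss, used here only for illustration."""
    diffs = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


x = [[1.0, -1.0]]
hidden = feedforward(x, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [[2.0], [3.0]], [1.0])
loss = mse(hidden, [[1.0]])
```

The ReLU on the output is a natural fit for this task because normalized traffic volumes cannot be negative.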
The training process of SATMP is described in Algorithm 1.

Datasets and Preprocessing.
We use three publicly available and well-known traffic datasets to evaluate the performance of the proposed algorithm. These three datasets are summarized in Table 3. The Abilene dataset [49] is sampled from the Abilene network and consists of 12 nodes and 15 links. Due to discontinuities in the dataset sampling, we select continuous data for the experiment instead of using all the data. Specifically, we use Abilene data from May 1st 00:00 to May 28th 23:55, which contains 8,064 TM. For Geant, we select data from June 1st 00:00 to June 28th 16:30, which contains 2,658 TM. The MItoMI telecommunications dataset provides the directional interaction strengths between different areas of Milan from November 1st 2013 to January 1st 2014 [51]. The dataset divides Milan into a 100 × 100 grid and records the interactions at a sampling interval of 10 min. Since the communication intensity of most regions is 0 most of the time, we select 20 active regions for study, which gives a total of 8,640 TM with 400 dimensions (20 × 20 region pairs).
We preprocess the data as follows. First, we perform a unit conversion: on the backbone network datasets, the original values are converted into Mbit/s. Then we normalize the TM by dividing by its maximum value. On the MItoMI dataset, we also convert the timestamps to the "Europe/Rome" timezone. Finally, we use a sliding window to build the training dataset, as shown in Figure 6. Assume that the dataset has a total of T TM, the sliding window size is ω, and the prediction window size is l. The values of ω and l are determined by actual needs; in our study, ω = l. We then roll the sliding window with stride 1, so the dataset generated by our preprocessing contains T − ω − l + 1 samples. We divide it into the training set, validation set, and testing set in the ratio 6:1:3.
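The normalization, sliding-window, and splitting steps above can be sketched in plain Python as follows. The sketch makes two assumptions that the text does not confirm: the maximum used for normalization is taken over the whole dataset, and the 6:1:3 split is chronological. The function names and the toy data are hypothetical.

```python
def normalize(tms):
    """Scale every entry by the global maximum (assumes it is positive)."""
    peak = max(max(tm) for tm in tms)
    return [[value / peak for value in tm] for tm in tms]


def build_samples(tms, omega, l):
    """Roll a window with stride 1: omega TMs of history, l TMs to predict."""
    count = len(tms) - omega - l + 1
    if omega < 1 or l < 1 or count < 1:
        raise ValueError("need omega, l >= 1 and at least omega + l TMs")
    return [
        (tms[s:s + omega], tms[s + omega:s + omega + l])
        for s in range(count)
    ]


def split_6_1_3(samples):
    """Chronological 6:1:3 split (the ordering is an assumption)."""
    n_train = len(samples) * 6 // 10
    n_val = len(samples) // 10
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])


# Toy data: T = 20 flattened "TMs" with 4 entries each.
tms = normalize([[float(t + k) for k in range(4)] for t in range(20)])
samples = build_samples(tms, omega=4, l=4)
train_set, val_set, test_set = split_6_1_3(samples)
print(len(samples), len(train_set), len(val_set), len(test_set))  # 13 7 1 5
```

The sample count follows T − ω − l + 1. For instance, with T = 8,064 and ω = l = 288 (24 hr at a 5 min interval), the formula gives 7,489 samples.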

Experimental Details.
We use PyTorch to build our model. The loss function we choose is MSELoss. The AdamW algorithm with a decaying learning rate is used to optimize the model. In addition, we use dropout to reduce overfitting and gradient clipping to stabilize training. The hyperparameter settings of the proposed model are shown in Table 4.
Prophet [25], LSTM [52], and GRU [53] are selected as baselines for comparison with the proposed framework. Prophet is an additive model that can effectively capture the periodicity of time series. LSTM alleviates the vanishing and exploding gradient problems of RNNs by adding gating structures and a cell state. GRU simplifies the structure of LSTM and achieves better results on some tasks.
We train the above models and test their performance in predicting traffic sequences of different lengths. The prediction horizons we choose are 24, 48, 72, and 96 hr. For the comparison methods, we tune their hyperparameters on the validation set for better results.
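We evaluate the performance of network TM prediction using two metrics: MSE (Equation (11)) and mean absolute error (MAE), which are defined as:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{Y}_i - Y_i\right)^2, \qquad \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left|\hat{Y}_i - Y_i\right|,$$

where N is the number of network nodes, and $\hat{Y}$ and $Y$ represent the predicted value and the ground truth, respectively.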
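For reference, the two metrics can be computed as in the short sketch below. It treats the prediction and the ground truth as flat lists of equal length and averages over all entries, which is an assumption about how the TM entries are indexed. The function names are illustrative.

```python
def mse(pred, truth):
    """Mean squared error over paired entries."""
    if len(pred) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


def mae(pred, truth):
    """Mean absolute error over paired entries."""
    if len(pred) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


pred = [1.0, 2.0, 4.0, 1.0]    # toy values, not experimental data
truth = [1.0, 0.0, 0.0, 1.0]
print(mse(pred, truth), mae(pred, truth))  # 5.0 1.5
```

MSE penalizes large deviations more heavily than MAE, which is why the two metrics are usually reported together.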

Results and Analysis.
The prediction performance of all methods is summarized in Table 5, where the best result for each prediction task is highlighted in bold. Note that the dataset changes with the predicted length, so we only make horizontal comparisons across methods and do not explore how the performance of each model changes as the sequence length increases. Our proposed SATMP significantly reduces the prediction error. Taking the 72 hr traffic prediction as an example, the MSE of SATMP on Abilene is 68%, 48%, and 50% lower than that of Prophet, LSTM, and GRU, respectively. On MItoMI, the MSE of SATMP is 61%, 27%, and 24% lower than that of the comparison methods.
According to the experimental results, the MSE and MAE of Prophet are very high. Prophet can model the periodic features of historical data. However, it cannot use additional information to capture the spatiotemporal association of network TM, which leads to a high error. LSTM and GRU have similar prediction performance, which indicates the effectiveness of their time-series modeling. SATMP significantly reduces both MSE and MAE. By utilizing spatiotemporal encodings and the self-attention mechanism, SATMP can better capture the latent features of the network TM. Figure 7 shows the 48 hr prediction results of SATMP, Prophet, LSTM, and GRU on the Abilene dataset. The blue curves represent the real traffic, while the curves in other colors represent the predictions of the different methods. It is evident that Prophet fails in long-term prediction: it can only predict a general trend, so its predictions are all basically the same and deviate greatly from the real values. SATMP, LSTM, and GRU have similar prediction performance before the second occurrence of 18:00. After that point, however, the results of LSTM and GRU begin to show large errors, which indicates that the accuracy of these two models declines when the prediction horizon is very long. The results demonstrate that SATMP has a stronger ability to extract long-term correlation.

Algorithm 1: Training process of SATMP.
Input: Historical network traffic X with a time span of T; spatial encoding R; temporal encoding P.
Output: SATMP model.
… $(X_{t+1}, \cdots, X_{e})$
5: obtain the target sequence $Y = (X_{e+1}, X_{e+2}, \cdots, X_{e+w})$.
6: build the training instance $(\{L, R, P\}, Y)$ by Equation (4).
7: end for
8: Initialize the trainable parameters in SATMP.
9: Update the parameters in SATMP using the backpropagation algorithm with the loss function Loss as in Equation (11).

Wireless Communications and Mobile Computing

[Figure: sliding windows along the Time axis, with labels $X_{\omega+l}$, $X_{\omega+l+1}$, $X_{\omega+l+2}$]
Spatiotemporal encoding can effectively provide spatiotemporal information to the self-attention module and enhance the prediction performance. Figure 8 shows the prediction of SATMP on Abilene (48 hr with 576 time steps). The traffic of the backbone network follows certain regular patterns but also fluctuates considerably. The prediction result confirms that SATMP can better predict the fluctuation of irregular traffic and can cope with high-burst traffic scenarios, as shown in Figures 8(b) and 8(c). The proposed algorithm makes full use of the spatiotemporal characteristics and uses the self-attention mechanism to capture long-term spatiotemporal dependence, so as to learn the properties of network traffic more comprehensively. These results show that SATMP offers high accuracy in long-term prediction for complex networks.
In addition, we compare the hardware requirements of LSTM with those of the proposed algorithm. The sampling interval is 5 min on the Abilene dataset, so 24 hr of traffic data contains 288 network TM, and 96 hr of traffic data contains 1,152 network TM. Such a long sequence imposes massive GPU memory consumption on LSTM. Taking the 96 hr prediction as an example, LSTM occupies 14 GB of video memory on average and sometimes more than 24 GB. The average memory occupied by SATMP is 9 GB, which saves considerable hardware resources. Besides, SATMP eliminates the recurrent structure and can compute all inputs in parallel, so with the same number of parameters it trains faster. As discussed above, the model proposed in this paper has a clear advantage in computational resource consumption.
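The sequence lengths quoted above follow directly from the sampling interval, and the same arithmetic shows why a linear-complexity attention matters at these lengths. The sketch below counts the entries of a quadratic n × n attention map against the dn + d² entries of the linear variant. The embedding size `D_MODEL = 64` is a hypothetical value, not one reported in the paper, and these counts are unrelated to the measured GPU memory figures.

```python
def steps_for_horizon(hours, interval_min):
    """Number of TMs in `hours` of data sampled every `interval_min` minutes."""
    return hours * 60 // interval_min


def attention_entries(n, d):
    """Entries stored by quadratic attention vs. the linear variant."""
    return n * n, d * n + d * d


D_MODEL = 64  # hypothetical embedding size, not taken from the paper
for hours in (24, 48, 72, 96):
    n = steps_for_horizon(hours, interval_min=5)
    quadratic, linear = attention_entries(n, D_MODEL)
    print(f"{hours} hr: n={n}, n^2={quadratic}, dn+d^2={linear}")
```

At 96 hr (n = 1,152) and the assumed d = 64, the quadratic map holds 1,327,104 entries per head against 77,824 for the linear form.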
In the highly dynamic environment of IIoT, the network changes over time. For example, nodes joining and leaving will affect the traffic of the entire network. Therefore, our algorithm needs to be retrained after the network changes significantly. Fortunately, the training process of our algorithm is parallelizable, which gives higher training efficiency and enables faster model iteration.

Ablation Study.
To test the effectiveness of each module in SATMP, we design an ablation experiment with two ablation models and test them on Abilene. Model 1 removes the spatiotemporal feature extraction module of the proposed framework. Model 2 replaces the learnable positional encoding with a fixed positional encoding. If a removed module involves neural networks, we substitute a fully connected layer to keep the number of parameters roughly unchanged. Figure 9 shows the results of the ablation study. The results confirm that the spatiotemporal feature extraction module provides additional features to the model, which significantly reduces the prediction error. Compared with fixed positional encoding, learnable positional encoding better describes the temporal correlation of the traffic sequence, which helps to predict the network traffic matrix more accurately.
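To make the contrast in Model 2 concrete, the sketch below builds one common kind of fixed positional encoding, the sinusoidal table used in the original Transformer. The text does not say which fixed encoding was used in the ablation, so this choice is an assumption. The function name and the sizes are illustrative.

```python
import math


def sinusoidal_encoding(n_positions, d_model):
    """Fixed sinusoidal table of shape (n_positions, d_model)."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(n_positions):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even index i
            row.append(math.cos(angle))  # odd index i + 1
        table.append(row)
    return table


pe = sinusoidal_encoding(n_positions=288, d_model=8)
print(len(pe), len(pe[0]))  # 288 8
print(pe[0])                # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

A learnable positional encoding would instead be a table of the same shape whose entries are initialized randomly and updated by backpropagation together with the other model parameters.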

Conclusion
This paper investigates the problem of long-term prediction of network TM in large-scale IIoT networks. The 6G-enabled IIoT will contain many heterogeneous networks, making traffic prediction difficult. We provide a novel method that applies the self-attention mechanism to resolve this issue. The self-attention mechanism can shorten the distance between dependent positions in a time series. Inspired by that, we apply the mechanism to the long-term prediction of IIoT network TM and propose SATMP, a self-attention prediction model combined with spatiotemporal encoding. We show the effectiveness of SATMP for long-term TM prediction with a detailed analysis and evaluation on three backbone and wireless network datasets. SATMP's accurate long-term prediction results enable IIoT networks to implement effective resource allocation, congestion control, and attack detection. In addition, SATMP supports parallel computing and can be deployed on edge IIoT nodes for edge intelligence. The data in 6G-enabled IIoT need to be computed securely and quickly. Several learning frameworks have been proposed to address this requirement. For instance, federated learning, which supports the collaborative training of nodes while protecting data privacy, is considered a key technology for the future of IIoT. We plan to combine self-attention mechanisms with federated learning for further IIoT network research in future work.

Conflicts of Interest
The authors declare that they have no conflicts of interest.